Bug 99737 - [PATCH] Fix_stuck_on_bsd_ring for Clarkdale/Ironlake and kernel 4.4.x
Summary: [PATCH] Fix_stuck_on_bsd_ring for Clarkdale/Ironlake and kernel 4.4.x
Status: CLOSED WONTFIX
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/Intel (show other bugs)
Version: XOrg git
Hardware: x86-64 (AMD64) Linux (All)
: medium normal
Assignee: Intel GFX Bugs mailing list
QA Contact: Intel GFX Bugs mailing list
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2017-02-09 18:44 UTC by Rainer Fiebig
Modified: 2017-04-04 08:58 UTC (History)
1 user (show)

See Also:
i915 platform: ILK
i915 features:


Attachments
Fixes "stuck on bsd ring" for Intel-Clarkdale/Ironlake and kernel 4.4.x (15.13 KB, patch)
2017-02-09 18:44 UTC, Rainer Fiebig
no flags Details | Splinter Review

Note You need to log in before you can comment on or make changes to this bug.
Description Rainer Fiebig 2017-02-09 18:44:00 UTC
Created attachment 129445 [details] [review]
Fixes "stuck on bsd ring" for Intel-Clarkdale/Ironlake and kernel 4.4.x

A while ago I noticed (and reported) that longterm-kernel 4.4 doesn't work properly with a/m CPU (and perhaps other older Intel-CPU/GPUs). The system produces certain messages after resume from s2disk and becomes unstable after repeated s2disk. The problems are still there in 4.4.47 (see 1).

Based on two committs by Chris Wilson (see 2)) - and with considerable effort ;) - I have created a patch that fixes the problem. I have used it with 4.4.22/38 for months now without any problems. 

Today I have applied it to 4.4.47 (see 3)) and it works there, too. So I'm confident enough now to think that others may benefit as well and a review may be justified. 

Accordingly, I have attached the patch for review/testing or whatever seems appropriate to you.

Regards!

Rainer Fiebig


-----

1)
...
Restarting tasks ... done.
r8169 0000:03:00.0 eth0: link up
[drm] stuck on bsd ring
[drm] GPU HANG: ecode 5:1:0x01000000, reason: Ring hung, action: reset
[drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
[drm] GPU crash dump saved to /sys/class/drm/card0/error
drm/i915: Resetting chip after gpu hang
~> uname -r
4.4.47-1.g8038aea-desktop+

2)
[f9326be5f1d3ff2c689de8a1754bdafd03879b58]
[62e6300768f6550ba24fa3ba2d4c66d725e3c890]

3)
~/Downloads/Kernel/linux-4.4.47> patch -p1 < ../patch-fix_stuck_on_bsd_ring_for_4.4.x --dry-run
checking file drivers/gpu/drm/i915/i915_drv.h
Hunk #1 succeeded at 3142 (offset -2 lines).
checking file drivers/gpu/drm/i915/i915_gem.c
checking file drivers/gpu/drm/i915/i915_gem_context.c
checking file drivers/gpu/drm/i915/i915_gem_gtt.c
checking file drivers/gpu/drm/i915/i915_gem_gtt.h
Comment 1 Carlos Jimenez 2017-04-02 22:10:05 UTC
dont work with latest kernel 

it hibernates
but dont resume at all 
just black screen and reboot .

have you tried with newer kernels of the 4.4 branch ?
Comment 2 Rainer Fiebig 2017-04-03 07:15:22 UTC
(In reply to Carlos Jimenez from comment #1)
> dont work with latest kernel 
> 
> it hibernates
> but dont resume at all 
> just black screen and reboot .
> 
> have you tried with newer kernels of the 4.4 branch ?

Sorry to hear that.

The last kernel I had applied it to was 4.4.49 and it worked/works well. The system is an Intel Core i3 Clarkdale/Ironlake.

If I find time I'll check out the latest 4.4.x and give feedback here.

If possible, switch to kernel 4.9 - the problem is gone since 4.8 and 4.9 is a longterm-kernel that runs well here.
Comment 3 Rainer Fiebig 2017-04-03 10:35:13 UTC
(In reply to Carlos Jimenez from comment #1)
> dont work with latest kernel 
> 
> it hibernates
> but dont resume at all 
> just black screen and reboot .
> 
> have you tried with newer kernels of the 4.4 branch ?

Cursorily tested it with 4.4.59: the patch applied well like before, kernel compiled without error-messages, s2disk/s2both/resume all well, no problems so far.

So it works nicely here but probably needs improvement. I also have not tested it with newer Intel-CPUs.

The sad thing is that obviously non of the Intel-devs took a look. It shouldn't take a pro too long (5 minutes?) to see whether the patch is worth improving on or simply total crap (not here, though). Or provide a better solution at long last! I had even sent it directly to the guy who had written the a/m committs and the kernel-maintainer but got no response. So be it, I've done enough in this matter.

The problems for older Intel CPUs/GPUs started with 4.2 and lasted up to and including 4.7. That's 6 kernels, including a long-term-kernel that is used by quite a few distributions, stands at .59 and is still not fixed.
Nothing to be proud of, imo.
Comment 4 Jani Nikula 2017-04-03 16:17:49 UTC
This bugzilla is about bugs that are unresolved in our current upstream kernel tree. I'll take your word for how long it took to fix the bug, it certainly seems it took too long, but as I understand it, it's fixed now.

The stable kernels (with the three level version numbers x.y.z) are based on released upstream kernels (x.y) with backports of fixes from later kernels. How big the .z is merely the number of times a batch of fixes has been backported to the stable kernel. It's not a measure of quality or how much development effort has been put to it.

The fixes to be backported to stable kernels must adhere to a pretty strict set of rules [1]. For example, the fix must exist in a later upstream kernel, and it cannot be bigger than 100 lines, with context. Your proposed fix is 500+ lines, and as such won't be considered. (And we don't maintain the stable kernels or backports.)

The kernel is a fast moving target, and from upstream perspective v4.4 is old. Just in drm/i915 the diffstat since then is 211 files changed, 103334 insertions(+), 48260 deletions(-). It's not trivial to assess the validity of a backport across such a diff.

I can only recommend to consider upgrading to a more recent kernel release (and that does not mean a more recent stable .z release of an old kernel).

[1] https://www.kernel.org/doc/html/latest/process/stable-kernel-rules.html
Comment 5 Jani Nikula 2017-04-03 16:26:01 UTC
(In reply to Rainer Fiebig from comment #3)
> I had even sent it directly to the guy who had
> written the a/m committs and the kernel-maintainer but got no response.

Also, please always include the proper mailing list(s) rather than just send mails to people directly. The developers and maintainers are seriously burdened by the constant firehose of email pointed at their general direction, and private mails don't scale. The information about mailing lists etc. is in the MAINTAINERS file at the kernel source tree root.
Comment 6 Rainer Fiebig 2017-04-03 19:41:05 UTC
(In reply to Jani Nikula from comment #4)
> This bugzilla is about bugs that are unresolved in our current upstream
> kernel tree. I'll take your word for how long it took to fix the bug, it
> certainly seems it took too long, but as I understand it, it's fixed now.
> 

Lapsus memoriae on my side: "stuck on bsd ring" started with 4.3 not 4.2 (I kept a log). So, only 5 instead of 6 kernels until fix. But given the amount of time and quality of life I lost due to "stuck...", I won't apologize. ;)

> The stable kernels (with the three level version numbers x.y.z) are based on
> released upstream kernels (x.y) with backports of fixes from later kernels.
> How big the .z is merely the number of times a batch of fixes has been
> backported to the stable kernel. It's not a measure of quality or how much
> development effort has been put to it.
> 
> The fixes to be backported to stable kernels must adhere to a pretty strict
> set of rules [1]. For example, the fix must exist in a later upstream
> kernel, and it cannot be bigger than 100 lines, with context. Your proposed
> fix is 500+ lines, and as such won't be considered. (And we don't maintain
> the stable kernels or backports.)
> 

Alright, then it won't get fixed. The a/m 2 committs cannot be backported easily, the differences between 4.7/4.8 and 4.4 are too big. The patch is 500 lines for a reason. That doesn't mean that it cannot be compressed down to 100 lines. You i915-devs surely can but I can't.

> The kernel is a fast moving target, and from upstream perspective v4.4 is
> old. Just in drm/i915 the diffstat since then is 211 files changed, 103334
> insertions(+), 48260 deletions(-). It's not trivial to assess the validity
> of a backport across such a diff.
> 

Not trivial but not impossible either. If *I* as average user could analyze the problem, find the 2 committs, solve the puzzle between vastly apart kernels and come up with a patch despite huge differences in code, any of you experienced i915-devs should be able to check and improve that patch with significantly less effort and time than it took me to produce it.

> I can only recommend to consider upgrading to a more recent kernel release
> (and that does not mean a more recent stable .z release of an old kernel).
> 

The "stuck on bsd ring"-story recommends exactly the contrary.

For me it's no longer a problem: I have a patch - and I can easily switch to post-4.7 or pre-4.3. 
But not all users of affected distros can.

I wouldn't mind if you closed this well-intentioned but obviously ill-placed report.

> [1] https://www.kernel.org/doc/html/latest/process/stable-kernel-rules.html
Comment 7 Jani Nikula 2017-04-04 08:57:14 UTC
(In reply to Rainer Fiebig from comment #6)
> Alright, then it won't get fixed. The a/m 2 committs cannot be backported
> easily, the differences between 4.7/4.8 and 4.4 are too big. The patch is
> 500 lines for a reason. That doesn't mean that it cannot be compressed down
> to 100 lines. You i915-devs surely can but I can't.

The rule is about picking upstream commits that conform to the rules, not about picking any commits and condensing them to fit the rules.

> Not trivial but not impossible either. If *I* as average user could analyze
> the problem, find the 2 committs, solve the puzzle between vastly apart
> kernels and come up with a patch despite huge differences in code, any of
> you experienced i915-devs should be able to check and improve that patch
> with significantly less effort and time than it took me to produce it.

The reality is that we'd have to consider several *other* generations of hardware with varying configurations, not just your hardware. I'm sorry, but the risk of non-trivial backports breaking stuff in stable is not to be taken lightly. I understand you are not happy with the situation, I would not be happy in your situation, but a stable backport breaking a system is a much worse scenario. And then factor in that we're all looking at v4.12, and moving to v4.13 within the week. I'll be honest with you, backporting a fix to v4.4 is not interesting or a priority.

> I wouldn't mind if you closed this well-intentioned but obviously ill-placed
> report.

Thanks, will do, and apologies we couldn't help you.


Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.