Created attachment 78473 [details]
X.org log for SNA crash

Even though the error mentioned above is SNA-only, a very similar one occurs with UXA. Unfortunately I trashed my old Xorg log by rebooting twice but if I test again with UXA I'll paste it too. I inspected the errno (5) and that stands for input/output error, which is the same thing that was mentioned by the UXA log, I recall.

I've been trying to play videos with vaapi-mplayer because I want to use it as a media station that can run for at least a day. Usually the quality is top notch and the CPU usage is very low, which motivates me more to try to get this to work stably. Unfortunately, due to the multitude of the packages and the fact that the crashes happen seemingly at random (sometimes it takes an hour, sometimes 2 days), I can't test all permutations. I think I could use some help in diagnosing it better.

I started upgrading packages because I thought it would alleviate the issue. I started out with all stock debian wheezy packages and ended up with the current install with SNA after lots of compiling. In my limited testing it appears that SNA crashes much sooner than UXA. The last SNA run took about an hour to crash.

I've seen many crashes over different kernel versions and whatnot, and the general idea that I get is that the card somehow runs out of memory or some such, which is possible I guess since the movies are large and maybe something in the chain does not release the VA surfaces... If this is true it would seem that both UXA and SNA are leaking, but SNA at a faster rate. Just some speculation here but maybe there could be (switchable) support for VA surface expiration somewhere... Should I perhaps cross-post to the libva bug-tracker as well?
Graphics card: Intel HD3000 (Sandy Bridge)
Distribution: debian wheezy
Kernel: 3.9-rc8 (latest as of this writing)
libdrm: 2.4.43 (compiled from debian git)
mesa: 8.0.5-4 (stock)
intel-vaapi-driver: 1.0.21.pre1 (I tried stock too, of course)
libva: 1.1.2.pre1 (VA-API version 0.33.0)
intel-xorg-driver: 2.21.6

Anything else I could try? Btw, I always check i915_error_state and I've never seen anything but this:

# cat /sys/kernel/debug/dri/0/i915_error_state
cat: /sys/kernel/debug/dri/0/i915_error_state: Cannot allocate memory
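For anyone who wants to double-check the errno reading from the report (5 meaning input/output error), here is a minimal Python sketch. It assumes a Linux system, and the helper name describe_errno is made up for illustration; it is not part of any tool mentioned in this report.

```python
import errno
import os


def describe_errno(code):
    """Return 'NAME: message' for a numeric errno value."""
    # errno.errorcode maps numbers to symbolic names such as 'EIO'.
    name = errno.errorcode.get(code, "UNKNOWN")
    # os.strerror returns the C library's message for that number.
    return f"{name}: {os.strerror(code)}"


print(describe_errno(5))
```

The symbolic name matters more than the message text here, because EIO is also the value the driver patches later in this report check for.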
If you can grab the i915_error_state (kill X, kill everything, and try again) file a bug against libva for causing the GPU hang. However, I blame Daniel for never believing me when I sent patches to prevent this... :-p
(In reply to comment #1)
> If you can grab the i915_error_state (kill X, kill everything, and try
> again) file a bug against libva for causing the GPU hang. However, I blame
> Daniel for never believing me when I sent patches to prevent this... :-p

At the bottom of my original bug report I had already mentioned that I never got anything substantial out of i915_error_state. I'll try again on the next crash, but last time I went so far as to kill apache, mongodb, fluentd, sshd, rpcbind, ... basically anything I could get my hands on (trust me, the process list was really small after that, memory usage below 100 MB according to htop), and still it told me it had insufficient memory. Is there anything special I could try? Btw X kills itself after that error, no need to kill it twice.

The unit only reports having about 1.8GB of memory (should be 2GB but hey...). Does the unit just have too little memory for a good crash report? Is there maybe something I can use to force linux to flush everything? I have to admit I don't really know how gem/dri/"the driver" tries to allocate that buffer.

Should I report to the libva list regardless of getting a good i915_error_state trace?

It's a bit heartening to know that you at least seem to be aware that this could be an issue! Were my suspicions correct about it not releasing GPU memory or is it something else entirely? No need for a big explanation, I guess I'm just curious :)
Created attachment 78479 [details] cat of /proc/meminfo, lots of vmalloc
Created attachment 78480 [details] Second time SNA crashed, once again took about an hour
Created attachment 78481 [details] Second crash, dmesg output
(In reply to comment #1)
> If you can grab the i915_error_state (kill X, kill everything, and try
> again) file a bug against libva for causing the GPU hang. However, I blame
> Daniel for never believing me when I sent patches to prevent this... :-p

Ok, so it happened again and this time I tried literally disabling everything I could find. At the end htop was reporting 56 MB out of 1804 MB in use, with 25 tasks left. I used the following to try and make linux flush some stuff:

sync && echo 3 > /proc/sys/vm/drop_caches

After that I piped /proc/meminfo to a file, which you can see in the attachments. I don't know, but that vmalloc number seems ridiculously high. Is that normal?

Should I take this to the libva bug tracker?
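On the vmalloc question: if the number that looked too high was VmallocTotal, that field reports the size of the kernel's vmalloc address range rather than memory in use, so a huge value is expected on 64-bit kernels; VmallocUsed is the one to watch. Which field was meant is an assumption on my part. A minimal Python sketch for pulling those fields out of a saved /proc/meminfo dump follows; the sample numbers are made up and are not taken from the attachment.

```python
def parse_meminfo(text):
    """Map each /proc/meminfo key to its integer value (kB for most keys)."""
    info = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        fields = rest.split()
        # Skip anything that is not of the form "Key:   12345 kB".
        if not sep or not fields or not fields[0].isdigit():
            continue
        info[key.strip()] = int(fields[0])
    return info


# Made-up numbers, shaped like a 64-bit machine's output; NOT the values
# from the attachment in this report.
SAMPLE = """\
MemTotal:        1847296 kB
MemFree:         1700000 kB
VmallocTotal:   34359738367 kB
VmallocUsed:      365000 kB
VmallocChunk:   34359300000 kB
"""

info = parse_meminfo(SAMPLE)
for key in ("MemTotal", "MemFree", "VmallocTotal", "VmallocUsed"):
    print(f"{key:>13}: {info[key] / 1024:.0f} MiB")
```

To use it on a real capture, read the saved file and pass its contents to parse_meminfo instead of SAMPLE.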
More interesting fail:

[ 4366.361686] [drm:i915_hangcheck_hung] *ERROR* Hangcheck timer elapsed... GPU hung
[ 4366.361694] [drm] capturing error event; look for more information in/sys/kernel/debug/dri/0/i915_error_state
[ 4366.871407] [drm:i915_reset] *ERROR* Failed to reset chip.
[ 4376.364660] [drm:i915_gem_wait_for_error] *ERROR* Timed out waiting for the gpu reset to complete

Pity we can't get the i915_error_state back, but if you can isolate the cause of the hang that will be enough for libva (or whoever) to work on.
(In reply to comment #7)
> More interesting fail:
>
> [ 4366.361686] [drm:i915_hangcheck_hung] *ERROR* Hangcheck timer elapsed...
> GPU hung
> [ 4366.361694] [drm] capturing error event; look for more information
> in/sys/kernel/debug/dri/0/i915_error_state
> [ 4366.871407] [drm:i915_reset] *ERROR* Failed to reset chip.
> [ 4376.364660] [drm:i915_gem_wait_for_error] *ERROR* Timed out waiting for
> the gpu reset to complete
>
> Pity we can't get the i915_error_state back, but if you can isolate the
> cause of the hang that will be enough for the libva (or whoever to work on).

Please try this specifically for the reset hang:

diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index 8539177..df9dfa5 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -4036,6 +4036,10 @@ i915_gem_init_hw(struct drm_device *dev)
                I915_WRITE(GEN7_MSG_CTL, temp);
        }
 
+       DRM_ERROR("Forcing no wait on PCH (SNB ONLY)\n");
+       I915_WRITE(ILK_DISPLAY_CHICKEN2,
+                  I915_READ(ILK_DISPLAY_CHICKEN2) & ~(0x3 << 14));
+
        i915_gem_l3_remap(dev);
 
        i915_gem_init_swizzling(dev);
(In reply to comment #8)
> Please try this specifically for the reset hang:
[snip]

@Chris: I'm not sure I'll deduce the cause of the hang by just repeating my testing (i.e. playing a lot of movies, often multiple at a time). For SNA it seems to happen about every hour; with UXA it takes quite a bit longer. I'm trying UXA again and it's been going for 4 hours now. I hope I get some genius insight in the near future :).

@Ben: I'll see what I can do! That's some crazy magic right there. I'll report tomorrow with the results. Is this codepath supposed to be called with UXA, SNA or both? Because I will start testing with SNA, since that seems to crash earlier and more predictably.

Thanks!
Nicolas
> Please try this specifically for the reset hang:
[snip]

Btw I applied this patch manually, since I don't know the right magic incantation for it. It seems that we have slightly different versions of the kernel (mine is 3.9-rc8). Here is the complete, adjusted i915_gem_init_hw function I'm currently compiling:

int
i915_gem_init_hw(struct drm_device *dev)
{
        drm_i915_private_t *dev_priv = dev->dev_private;
        int ret;

        if (INTEL_INFO(dev)->gen < 6 && !intel_enable_gtt())
                return -EIO;

        if (IS_HASWELL(dev) && (I915_READ(0x120010) == 1))
                I915_WRITE(0x9008, I915_READ(0x9008) | 0xf0000);

        DRM_ERROR("Forcing no wait on PCH (SNB ONLY)\n");
        I915_WRITE(ILK_DISPLAY_CHICKEN2,
                   I915_READ(ILK_DISPLAY_CHICKEN2) & ~(0x3 << 14));

        i915_gem_l3_remap(dev);

        i915_gem_init_swizzling(dev);

        ret = i915_gem_init_rings(dev);
        if (ret)
                return ret;

        /*
         * XXX: There was some w/a described somewhere suggesting loading
         * contexts before PPGTT.
         */
        i915_gem_context_init(dev);
        i915_gem_init_ppgtt(dev);

        return 0;
}

Is it still ok?
(In reply to comment #10) [snip] > Is it still ok? yes
Created attachment 78498 [details] X.org crash with vaapi playback and UXA: [DRI2] Dri2SwapComplete: bad drawable
Created attachment 78499 [details] dmesg log with UXA crash, never seen this type before, appears to be something with chromium as well
Created attachment 78500 [details] MPlayer error when crashing with UXA: dri2GetRenderingBuffer: assertion 'buffers' failed
(In reply to comment #11)
> (In reply to comment #10)
> [snip]
> > Is it still ok?
> yes

Ok, perfect. I compiled the kernel again with make-kpkg, the debian kernel package builder. I did not do make-kpkg clean because I wanted the compile to be over quicker. When I look at the dates of the last modified files I see that i915_gem.o, i915.o, modules.order and i915.ko have been regenerated, so I think that's ok.

Btw, I just tried another run with UXA. Sometimes, with UXA, it does not crash the X server, but it does disable acceleration and makes vaapi-mplayer unable to get another surface. The result can be seen in the last 3 files I posted. I had already almost forgotten about the DRI2SwapComplete: Bad Drawable messages, but they nearly always happen "near the end" when UXA is used.

Correct me if I'm wrong here but I thought DRI had something to do with Mesa? Is that also involved here? I could try upgrading to the latest (9.1.1), should I do that?

I'm going to cross-post on the libva list and see if those guys can chime in as well.
DRI is just the direct rendering protocol between X server and clients and used by both Mesa for OpenGL and libva for video decoding. Once the gpu is hung (and reset didn't work) the X server tells the client that by refusing to pass on new buffers.
Except that "SwapBuffers: Bad Drawable" is indicative of a stupid client bug requesting swaps on random windows (i.e. windows it has not created DRI2 surfaces for or subsequently closed). What have libva done this time?
(In reply to comment #11) > (In reply to comment #10) > [snip] > > Is it still ok? > yes Ben, I just tried your patch with SNA acceleration and something new happened: the system froze, I can still see the videos on the screen but it does nothing anymore. I can't connect to the system with ssh anymore either, it seems totally non-responsive. I'll have to reboot. @Daniel: ah, nice, thanks for the explanation! @Chris: hmmm, though it is strange that this can take a very long amount of time (by the time UXA crashes, I think 1000's of vaapi-mplayer instances have been created). Maybe you can get a similar error once a certain resource is exhausted?
Created attachment 78502 [details] [review] Suppress spurious EIO when moving away from the gpu This should keep the kernel functioning in this extreme case.
(In reply to comment #19)
> Created attachment 78502 [details] [review]
> Suppress spurious EIO when moving away from the gpu
>
> This should keep the kernel functioning in this extreme case.

Should I try this with Ben's patch or standalone?
Since Ben's patch seems to have caused the machine to freeze upon GPU reset, I'd highly recommend to drop it.
(In reply to comment #21) > Since Ben's patch seems to have caused the machine to freeze upon GPU reset, > I'd highly recommend to drop it. Ok, I'll do that. It appears we're working with different kernel versions here, as the patch does not apply cleanly. It seems it's mostly line number changes so I'll manually adjust for now. Which kernel version are you developing on?
I work on drm-intel-next[-queued] which is now post-3.10: http://cgit.freedesktop.org/~danvet/drm-intel
Created attachment 78511 [details] [review] Suppress spurious EIO when moving away from the gpu Against v3.9-rc8, and added one more EIO check.
(In reply to comment #23) > I work on drm-intel-next[-queued] which is now post-3.10: > http://cgit.freedesktop.org/~danvet/drm-intel Alright. I guess I'll keep working on 3.9 then (unless you think it's important I change) and just backport your changes. So that if (when) this is fixed I have a reasonably stable kernel on which I can run my video player :). I'm currently testing with SNA, so hopefully if it crashes it will do so in about 30 minutes. I already manually backported the changes, compiled and am running. If it crashes, I'll report, add the extra fix, recompile and test again.
(In reply to comment #25) > (In reply to comment #23) > > I work on drm-intel-next[-queued] which is now post-3.10: > > http://cgit.freedesktop.org/~danvet/drm-intel > > Alright. I guess I'll keep working on 3.9 then (unless you think it's > important I change) and just backport your changes. So that if (when) this > is fixed I have a reasonably stable kernel on which I can run my video > player :). I'm currently testing with SNA, so hopefully if it crashes it > will do so in about 30 minutes. > > I already manually backported the changes, compiled and am running. If it > crashes, I'll report, add the extra fix, recompile and test again. Just to make sure, there are crashes with and without: '[ 4366.361686] [drm:i915_hangcheck_hung] *ERROR* Hangcheck timer elapsed... GPU hung' in dmesg? Please always check the dmesg after crash and report if above line is present and if it is try to get error state (if we would be so lucky).
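Checking dmesg for that line after every crash can be scripted. A minimal Python sketch, assuming the dmesg output has been saved as text: the marker strings are copied from the log excerpts quoted in this report, while the short labels and the function name are made up for illustration.

```python
import re

# Kernel log lines look like "[ 4366.361686] message"; capture both parts.
LINE_RE = re.compile(r"^\[\s*(\d+\.\d+)\]\s+(.*)$")

# Substrings copied from the dmesg excerpts in this report; the labels on
# the left are hypothetical names used only by this sketch.
MARKERS = {
    "gpu_hung": "Hangcheck timer elapsed... GPU hung",
    "reset_failed": "Failed to reset chip",
    "reset_timeout": "Timed out waiting for the gpu reset to complete",
}


def find_gpu_hang_events(dmesg_text):
    """Return (timestamp, label) pairs for every known marker, in log order."""
    events = []
    for raw in dmesg_text.splitlines():
        match = LINE_RE.match(raw.strip())
        if match is None:
            continue
        stamp, message = float(match.group(1)), match.group(2)
        for label, needle in MARKERS.items():
            if needle in message:
                events.append((stamp, label))
    return events


SAMPLE = """\
[ 4366.361686] [drm:i915_hangcheck_hung] *ERROR* Hangcheck timer elapsed... GPU hung
[ 4366.361694] [drm] capturing error event; look for more information in/sys/kernel/debug/dri/0/i915_error_state
[ 4366.871407] [drm:i915_reset] *ERROR* Failed to reset chip.
[ 4376.364660] [drm:i915_gem_wait_for_error] *ERROR* Timed out waiting for the gpu reset to complete
"""

for stamp, label in find_gpu_hang_events(SAMPLE):
    print(f"{stamp:.6f} {label}")
```

An empty result for a crash would mean the hangcheck line was absent, which is the distinction being asked about here.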
(In reply to comment #26)
> Just to make sure, there are crashes with and without:
>
> '[ 4366.361686] [drm:i915_hangcheck_hung] *ERROR* Hangcheck timer elapsed...
> GPU hung'
>
> in dmesg?
>
> Please always check the dmesg after crash and report if above line is
> present and if it is try to get error state (if we would be so lucky).

I mostly check dmesg and I can confirm that the line is always there. Right now it just crashed again, I will upload the lines with 'drm' in them from dmesg so you can see. I will also upload the newest Xorg.0.log.

@Chris: so it crashes as well with the first patch you gave me. I'll see if the fifth check helps now.
Created attachment 78515 [details] Xorg.0.log after first patch Chris
Created attachment 78516 [details] dmesg after first patch Chris (SNA)
Try adding

diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index 87c62cc..2bd8d7a 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -1334,6 +1334,8 @@ int i915_gem_fault(struct vm_area_struct *vma, struct vm_f
        bool write = !!(vmf->flags & FAULT_FLAG_WRITE);
 
        ret = i915_mutex_lock_interruptible(dev);
+       if (ret == -EIO)
+               ret = mutex_lock_interruptible(dev);
        if (ret)
                goto out;
 

as well as the other EIO suppressions.
(In reply to comment #30)
> Try adding
[snip]
> as well as the other EIO suppressions.

I'm testing now with the 6 EIO suppressions you gave me.

An interesting note: I just checked on my Ivy Bridge (HD 4000, gen7) system, which I was running with the same test: it's still running! It's been 4 days now I think. Pushing 16 (!) 1080p videos of between 100 and 200MB and constantly getting its surfaces destroyed, same as the Sandy Bridges. This is with all stock debian wheezy packages, except for kernel 3.9-rc8. I have seen the Ivy Bridge lock up, but that was with wheezy's 3.2. Maybe the Ivy Bridge's stability in this round is just a fluke though...
Sorry for the spam, but I forgot to add something: The Ivy Bridge system I was talking about has a lot of dri2SwapComplete: bad drawable notices in its Xorg.0.log. But it still keeps on playing all the videos just fine. There's nothing in dmesg for that system
(In reply to comment #30) > Try adding > > diff --git a/drivers/gpu/drm/i915/i915_gem.c > b/drivers/gpu/drm/i915/i915_gem.c > index 87c62cc..2bd8d7a 100644 > --- a/drivers/gpu/drm/i915/i915_gem.c > +++ b/drivers/gpu/drm/i915/i915_gem.c > @@ -1334,6 +1334,8 @@ int i915_gem_fault(struct vm_area_struct *vma, struct > vm_f > bool write = !!(vmf->flags & FAULT_FLAG_WRITE); > > ret = i915_mutex_lock_interruptible(dev); > + if (ret == -EIO) > + ret = mutex_lock_interruptible(dev); > if (ret) > goto out; > > > as well as the other EIO suppressions. I used all the EIO suppressions you gave me (6 in total) and the result was the same as with all the other patches (your first patch and Ben's patch): the system completely locks up and I can't even ssh in. Unfortunately I don't find anything in the logs after reboot (dmesg.0, Xorg.0.old and my app log).
That patch shouldn't itself cause a lockup - so I wonder if the system would have eventually locked up anyway if X didn't die first. But that the patch eliminates the false SIGBUS and failed mmaps is a good sign! :)
(In reply to comment #34) > That patch shouldn't itself cause a lockup - so I wonder if the system would > have eventually locked up anyway if X didn't die first. But that the patch > eliminates the false SIGBUS and failed mmaps is a good sign! :) Good question, my intuition also tells me that it would've locked up eventually. But that's because I'm still under the assumption that it's a resource exhaustion problem, which might be completely false, since I've never dealt with video card code in my life. In this way X serves as a full-crash shield, how nice ;). The system locking up completely does make it very hard to diagnose anything at all though. Any idea on something I could try?
You can try either:

i915.i915_enable_rc6=0

or

i915.reset=0

and see if they stop the hard hangs.
(In reply to comment #36) > You can try either: > > i915.i915_enable_rc6=0 > > or > > i915.reset=0 > > and see if they stop the hard hangs. Unfortunately that doesn't seem to be the case, drat.
(In reply to comment #21) > Since Ben's patch seems to have caused the machine to freeze upon GPU reset, > I'd highly recommend to drop it. Can we please also try setting both bits to 1, instead of clearing them? I'm pretty concerned about the reset failure.
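To make the two variants concrete: the earlier patch computes value & ~(0x3 << 14), which clears bits 14 and 15 of the register, and the request here is to set those same two bits instead. A minimal Python sketch of just the arithmetic, assuming a 32-bit register value; the constant and function names are made up for illustration and say nothing about what the hardware does with these bits.

```python
MASK32 = 0xFFFFFFFF          # keep results within a 32-bit register width
CHICKEN2_BITS = 0x3 << 14    # bits 14 and 15 (0xC000); hypothetical name


def clear_bits(value):
    """What the first patch does: force both bits to 0."""
    return value & ~CHICKEN2_BITS & MASK32


def set_bits(value):
    """The variant requested here: force both bits to 1."""
    return (value | CHICKEN2_BITS) & MASK32


for start in (0x00000000, 0x0000C000, 0xFFFFFFFF):
    print(f"{start:#010x} -> clear {clear_bits(start):#010x}, "
          f"set {set_bits(start):#010x}")
```

All other bits of the register pass through unchanged in both variants, which is why the patch reads the register before writing it back.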
Created attachment 78538 [details] [review]
Potentially make reset work

Based on 3.9-rc8.

Please run, reproduce hang, and post the result of dmesg | grep PCH
(In reply to comment #39)
> Created attachment 78538 [details] [review]
> Potentially make reset work
>
> Based on 3.9-rc8.
>
> Please run, reproduce hang, and post the result of dmesg | grep PCH

Ok, I will start compiling the kernel with your patch. With or without Chris' EIO suppressions? I'm going to leave them out for now, since you want dmesg output and both your first and Chris' later patches made the system lock up in a way that I couldn't get to dmesg anymore. I even tried some magic sysrq tricks to no avail. I'll report back after I reproduce the crash.

Also, I don't know if you saw me mentioning it, but the Ivy Bridge system that I've now been running for 1.5 days with 32 (!) simultaneous videos (and before that 4 days with 16 simultaneous videos without reboot) is still going strong without even showing slowdown, which invariably does happen to the Sandy Bridge systems. I'm not sure how differently the DRM (or any other subsystem really) treats gen6 and gen7, or how different the hardware really is, but maybe the solution is in there. Just an uneducated guess though.
SandyBridge and IvyBridge are really two different beasts when it comes to the GPU. There are a lot of similarities, but the silicon is more than just an evolution of the SandyBridge design. Just to put things in perspective. So that IVB is stable unlike SNB, doesn't rule out any part of the stack. Hopefully we can find a way to prevent the lockup, and to find a way to reset the GPU after the hang, and fix the hangs in the first place!
(In reply to comment #39)
> Created attachment 78538 [details] [review]
> Potentially make reset work
>
> Based on 3.9-rc8.
>
> Please run, reproduce hang, and post the result of dmesg | grep PCH

System came to a complete halt again, could not extract dmesg. But I noticed I was running with i915.i915_enable_rc6=0. So I'll turn that off and redo it.
(In reply to comment #42)
> System came to a complete halt again, could not extract dmesg. But I noticed
> I was running with i915.i915_enable_rc6=0. So I'll turn that off and redo
> it.

I tried again with no boot parameters and this time (thankfully) the X.org server crashed before completely locking up the system, allowing me to extract a dmesg! I'll upload both dmesg and Xorg.0.log right now.
Created attachment 78556 [details] dmesg after Ben's #2 patch (SNA), X.org crashed
Created attachment 78557 [details] Xorg.0.log after patch #2 by Ben
(In reply to comment #44) > Created attachment 78556 [details] > dmesg after Ben's #2 patch (SNA), X.org crashed Thank you very much for collecting the data. Unfortunately I have no other ideas why the reset could fail.
(In reply to comment #46) > (In reply to comment #44) > > Created attachment 78556 [details] > > dmesg after Ben's #2 patch (SNA), X.org crashed > > Thank you very much for collecting the data. Unfortunately I have no other > ideas why the reset could fail. No problem at all :). Sometimes these things take time to mull over. If anyone has any ideas they wanna try I'm always happy to patch, recompile, run and collect data. Thanks already for the fantastic feedback guys, I hope someday we find this nasty little bug, wherever it be hiding.
Created attachment 79420 [details] [review] Patch to get error state out on low/fragmented memory situations
(In reply to comment #48)
> Created attachment 79420 [details] [review]
> Patch to get error state out on low/fragmented memory situations

Patch doesn't apply cleanly to 3.9.2 (error at the end of message). Do you prefer if I try to backport or start using another kernel?

Checking patch drivers/gpu/drm/i915/i915_debugfs.c...
error: while searching for:
        .release = i915_error_state_release,
};

static int
i915_next_seqno_get(void *data, u64 *val)
{

error: patch failed: drivers/gpu/drm/i915/i915_debugfs.c:866
error: drivers/gpu/drm/i915/i915_debugfs.c: patch does not apply
Checking patch drivers/gpu/drm/i915/i915_drv.h...
Hunk #1 succeeded at 803 (offset -49 lines).
Created attachment 79755 [details] [review] for v3.9.2
(In reply to comment #49)
> Patch doesn't apply cleanly to 3.9.2 (error at the end of message). Do you
> prefer if I try to backport or start using another kernel?

I have attached a proper patch to get error state out even if memory is fragmented, backported to 3.9.2.

Patch is also included in:
http://cgit.freedesktop.org/~danvet/drm-intel/log/?h=drm-intel-next-queued

Nicolas, could you please try the patch so that we could get an error state for this bug. Thanks.
(In reply to comment #51)
> I have attached a proper patch to get error state out even if
> memory is fragmented, backported to 3.9.2.
>
> Patch is also included in:
> http://cgit.freedesktop.org/~danvet/drm-intel/log/?h=drm-intel-next-queued
>
> Nicolas, could you please try the patch so that we could get
> an error state for this bug. Thanks.

Mika, I had already more or less backported the patch to 3.9.3 (see the other bug report: https://bugs.freedesktop.org/show_bug.cgi?id=63946). Now I'm having difficulties getting it to crash without locking hard. Often it doesn't crash but lowers its output frequency, as I explain in that bug report (which is for the same bug). I'm trying it out on a lot of units today so hopefully I strike gold! I'll keep you guys posted. Thanks for the correct backport.
(In reply to comment #52)
> Mika, I had already more or less backported the patch to 3.9.3 (see the
> other bug report: https://bugs.freedesktop.org/show_bug.cgi?id=63946). Now
> I'm having difficulties getting it to crash without locking hard. Often it
> doesn't crash but lowers its output frequency as I explain in that bug
> report (which is for the same bug). I'm trying out on a lot of units today
> so hopefully I strike gold! I'll keep you guys posted. Thanks for the
> correct backport.

Nicolas, it is not a backport of the previous patch in this bug (it was vmalloc hack). The newly attached patch avoids seq_file completely and should work in very low memory and/or fragmented situations also.
(In reply to comment #53)
> Nicolas, it is not a backport of the previous patch in this bug (it was
> vmalloc hack). The newly attached patch avoids seq_file completely and
> should work in very low memory and/or fragmented situations also.

Acknowledged, that sounds good. I'll compile a fresh kernel and run with your newest patch.
Created attachment 79765 [details] i915_error_state2 after 32 concurrent video playback soft crash (could still access dmesg and debugfs)
(In reply to comment #55)
> Created attachment 79765 [details]
> i915_error_state2 after 32 concurrent video playback soft crash (could still
> access dmesg and debugfs)

So finally I have an i915_error_state. I was lucky to have it soft crash this time. This was obtained through

$ cat /sys/kernel/debug/dri/0/i915_error_state2 > /home/user/i915_error_state && gzip /home/user/i915_error_state

kernel: 3.9.3 with Mika's first patch
libva: 1.2.0.pre1
intel-driver (vaapi): 1.1.0.pre1
vaapi version: 0.34.0
intel-xorg-driver: 2.21.7

P.S.: dmesg output:

[ 6.232455] [drm] Initialized drm 1.1.0 20060810
[ 6.558826] [drm] Memory usable by graphics device = 2048M
[ 6.604230] [drm] Supports vblank timestamp caching Rev 1 (10.10.2010).
[ 6.604234] [drm] Driver supports precise vblank timestamp query.
[ 6.627854] [drm] Wrong MCH_SSKPD value: 0x16040307
[ 6.627859] [drm] This can cause pipe underruns and display issues.
[ 6.627861] [drm] Please upgrade your BIOS to fix this.
[ 6.645302] fbcon: inteldrmfb (fb0) is primary device
[ 6.874096] i915 0000:00:02.0: fb0: inteldrmfb frame buffer device
[ 6.879106] [drm] Initialized i915 1.6.0 20080730 for 0000:00:02.0 on minor 0
[ 7.823019] [drm] Enabling RC6 states: RC6 on, RC6p off, RC6pp off
[ 4342.301602] [drm:i915_hangcheck_hung] *ERROR* Hangcheck timer elapsed... GPU hung
[ 4342.301617] [drm] capturing error event; look for more information in/sys/kernel/debug/dri/0/i915_error_state
[ 4454.300480] [drm:i915_hangcheck_hung] *ERROR* Hangcheck timer elapsed... GPU hung
[ 4454.804436] [drm:i915_reset] *ERROR* Failed to reset chip.
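The capture step above (cat into a file, then gzip) can also be done in one pass, which avoids leaving a large uncompressed copy around on a box that is already short on memory. A minimal Python sketch, assuming the script runs as root so that debugfs is readable; the function name and the destination path in the usage note are made up for illustration.

```python
import gzip
import shutil
from pathlib import Path


def save_compressed(src, dst):
    """Stream src into a gzip file at dst and return dst as a Path."""
    src, dst = Path(src), Path(dst)
    # copyfileobj reads in chunks, so the whole dump is never held in memory.
    with src.open("rb") as fin, gzip.open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout)
    return dst
```

Usage would look like save_compressed("/sys/kernel/debug/dri/0/i915_error_state2", "/home/user/i915_error_state.gz"), where the source path is the one from the command above and only exists on a kernel carrying that patch.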
(In reply to comment #55) > Created attachment 79765 [details] > i915_error_state2 after 32 concurrent video playback soft crash (could still > access dmesg and debugfs) Shrug. vaapi-intel remains garbage. Here the bsd batchbuffer is itself executing a random address and ends up overwriting other batches.
Created attachment 79770 [details] i915_error_state2 after 32 concurrent video playback soft crash #2 A second run, don't know if it will give anything useful but when I try to diagnose bugs in my stuff I always like the contrast. So I'll provide at least two i915_error_state's. Can provide more on request. The text file is gzipped (just like the last one btw).
(In reply to comment #57)
> Shrug. vaapi-intel remains garbage. Here the bsd batchbuffer is itself
> executing a random address and ends up overwriting other batches.

Do you mean the following bsd batchbuffer is invalid?

13608000 524288 3f 00 5a5200 0 purgeable bsd snooped (LLC)
bsd ring --- gtt_offset = 0x13608000
00000000 : 13000082
It's the render ring that is corrupt; my postulate was that the corruption came from commands on the bsd ring.
I am not sure whether you are still experiencing this issue. Recently we found that some cases work well with GTT but hang with PPGTT on SNB with an old kernel. Is it possible to disable PPGTT or try a new kernel?
*** Bug 63946 has been marked as a duplicate of this bug. ***
Hi, Nicolas

Is the issue still reproducible with the latest intel-driver? If it is, please attach the error log from /sys/class/drm/card0/error.

BTW: if the issue is reproducible, can you try disabling PPGTT and see whether that helps? PPGTT can be disabled by adding the kernel option "i915.enable_ppgtt=0".

Thanks
I closed this bug as wontfix because there has been no response for over a year. Please feel free to reopen the bug if you still have this issue on SNB.