Created attachment 56342 [details] dmesg with drm.debug=0x06

I recently upgraded the Intel Linux driver (xf86-video-intel) from 2.13 to 2.15 and I now get this "GPU hang" bug.

Repro steps:
- headless boot
- plug in a monitor (VGA or DVI) once Xorg is started

I saw that v2.15 was tested with xserver v1.10.0, but I am using xserver v1.9.3.

Setup info:
Chipset: i945
xf86-video-intel: 2.15
xserver: 1.9.3
libdrm: 2.4.25
Xorg release is 7.6 for all packages, except libdrm, which I updated for this bug.
Linux distribution: Linux From Scratch, kernel from kernel.org
uname -a: Linux 2.6.39.4 #95540 SMP PREEMPT Fri Sep 30 14:34:26 EDT 2011 i686 GNU/Linux
Display connector: none to start with; connecting DVI or VGA triggers the bug.

100% reproducible.

Debugfs is not enabled currently, so I don't have access to the latest batch buffer. I will recompile the kernel with debugfs if necessary.

Attachment: dmesg log with boot option "drm.debug=0x06"

Thanks in advance!
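(Side note on reproducing: the drm debug mask can also be raised at runtime, assuming the drm module parameter is writable on this kernel:

echo 0x06 > /sys/module/drm/parameters/debug

Otherwise, pass drm.debug=0x06 on the kernel command line, as was done for the attached log.)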
Created attachment 56343 [details] /var/log/messages
Created attachment 56344 [details] Xorg.0.log
Created attachment 56345 [details] xorg.conf
Hmm, not much point in enabling debugfs just yet; you need to pull drm-intel-next from http://cgit.freedesktop.org/~danvet/drm/ first in order for the error state to be written in this case. I suspect this will be a userspace driver bug, and so the i915_error_state is the first port of call.
I followed your link, but I haven't found drm-intel-next. I searched around and found the drm-intel-next branch at http://cgit.freedesktop.org/~danvet/drm-intel/ I expected to get just the drm kernel driver, but instead it is the whole kernel. I assume I can't copy drivers/gpu/drm from drm-intel-next into kernel 2.6.39.4, so what would be the best way to troubleshoot this bug?

Thanks
My apologies, Daniel wisely chose to separate his experimental branches from his official branches for upstream code, and ~danvet/drm-intel is indeed the home of the drm-intel-next branch. If you already have a kernel tree, all you need to do is merge with drm-intel-next, but since we don't have the drm driver backported to 2.6.39.4, you will end up with 3.2.0-rc6+... And it's not even as simple as that...

Daniel, is there a branch that simply contains all the known bugfixes that we can use as a basis for testing?
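(For illustration, assuming you already have a kernel git tree, the merge would look roughly like the following; substitute the actual git URL corresponding to the cgit page above:

git remote add drm-intel <git URL for ~danvet/drm-intel>
git fetch drm-intel
git merge drm-intel/drm-intel-next

or check the branch out directly with "git checkout -b test drm-intel/drm-intel-next".)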
Thanks Chris for the follow up. I am compiling drm-intel-next right now and I'll see if the problem is still there. In the end though, I'd prefer to keep 2.6.39.4. Do you know if it's feasible to keep the current packages from Xorg 7.6 and only update the kernel to drm-intel-next?
Yes, we aim for kernel updates to not break existing userspace (and for new userspace to keep working on old kernels). Going back and forth between 2.6.39 and 3.2.0, you and your userspace should encounter no problems.
> --- Comment #8 from Chris Wilson <chris@chris-wilson.co.uk> 2012-01-31 10:22:48 PST ---
> Yes, we aim for kernel updates to not break existing userspace (and for new
> userspace to keep working on old kernels). Going back and forth between 2.6.39
> and 3.2.0, you and your userspace should encounter no problems.

Even more so: if something blows up when you upgrade your kernel, we want to know about it.

For branches to test I recommend drm-intel-testing from the same git repo - that one also contains the latest stuff from -fixes. It does not yet contain all the bugfixes and patches for known issues, but it's much closer than 2.6.39 ;-)
Ok, I successfully built the branch drm-intel-testing from the git repo, and I get the same GPU hang problem with the same userspace packages on this 3.2 kernel.

I forgot to mention that when the crash occurs, I have a video being rendered in fullscreen using XV. The hotplug works otherwise, i.e. when Xorg is idle running twm.

Other useful info: this bug was not happening with the following configuration:
kernel 2.6.29.6
Xorg 7.4
xserver 1.6.0
libdrm 2.4.7

Would the dmesg logs help with this new drm-intel-testing setup? Thanks!
I have done another test with my original setup using 2.6.39.4. This time, I get the GPU hang even with Xorg idle. So kernel 3.2.0 is a small improvement, as it only crashes when using XV.
Correction: I get the same behavior under 3.2.0 and 2.6.39.4: the GPU hang bug only appears with an XV video playing.
Hi Chris and Daniel,

I would like some help troubleshooting this bug. From what I understand, something in the hotplug event makes the GPU hang if it's processing an XV batch buffer; the regression appeared somewhere between 2.6.29 and 2.6.39.4 and is still present in kernel 3.2.

Would debugfs/the last batch buffer help in this case? Or do you have other ideas as to where the bug might come from?

Are there any kernel options I can add to make the drm/intel driver more cautious about the hotplug event? Like a sync option?

Thanks again!
Francis
Yes, the /sys/kernel/debug/dri/0/i915_error_state would have confirmed whether or not you are suffering from bug 36515.
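(Once the hang has been reproduced, the error state can be captured for attaching with something like:

cat /sys/kernel/debug/dri/0/i915_error_state > /tmp/error_state.txt

assuming debugfs is mounted at /sys/kernel/debug.)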
I got the drm-intel-testing kernel built with debugfs, and here's the error_state:

root@localhost:~# cat /sys/kernel/debug/dri/0/i915_error_state
no error state collected

Now that I have debugfs, is there anything else I can provide so we can debug this? There is a lot of debug info in there, but is anything useful in this case?
Is there anything I can do to debug this GPU hang? Any help would be appreciated!!

Thanks again.
Francis
As Chris said, please attach the error_state. But you first need to rehang your machine otherwise there'll be nothing interesting in it.
Like I said previously, the error_state only contains "no error state collected", even though I see "GPU hung" in /var/log/messages, as stated in Comment 15. Are there other files that I can attach to this bug?
Can you please attach dmesg so I can check why the kernel might not dump the error_state?
Created attachment 56889 [details] dmesg with drm.debug=0x06 and debugfs

Also, it's my first time using debugfs. As per the kernel docs, I activated it using the following command:

mount -t debugfs none /sys/kernel/debug

Is there anything else to do? The other files are populated and used by the driver, i.e. /sys/kernel/debug/dri/0/i915_capabilities lists the caps...

Thanks.
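(For my own reference, I also made the mount persistent across reboots with an fstab entry along these lines, assuming this is the standard way to do it:

debugfs  /sys/kernel/debug  debugfs  defaults  0  0

)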
Ah, your gpu is stuck on an MI_WAIT for which we don't capture the error_state yet but just kick the waiting gpu. You need the latest drm-intel-next branch from http://cgit.freedesktop.org/~danvet/drm-intel/
Ok! I was using drm-intel-testing so I'll recompile drm-intel-next and post the error_state file. Thanks :)
Created attachment 57043 [details] dmesg using drm-intel-next branch

Here's the dmesg output when I reproduce this bug with the git branch drm-intel-next. I'm still getting "no error state collected", though... :(

I also noticed that with drm-intel-next, I can reproduce this bug quite easily by unplugging and replugging my VGA monitor, which is far worse than the original bug. I understand this branch is 3.2-rc6 :)

Any help would be appreciated!

Thanks in advance,
Francis
Ok, to clear things up, I re-tested the three different kernels and made sure to reproduce the hangs multiple times: 2.6.39.4, git drm-intel-testing (3.2) and git drm-intel-next (3.2-rc6).

I had some strange behavior previously, probably because I was using a KVM switch on the VGA port. Now that it's removed, the GPU hang is 100% reproducible with the first hotplug event, just as stated in the Description. I can't reproduce the "unplug/plug" hang events from Comment 23.

Sorry for the mess!! (I wish comments could be removed in Bugzilla)
Please double-check that your drm-intel-next contains:

commit 653d7bed26a0c298dee7d60f6ab4bb442acf8b82
Author: Daniel Vetter <daniel.vetter@ffwll.ch>
Date:   Wed Dec 14 13:57:21 2011 +0100

    drm/i915: capture error_state also for stuck rings

With that you should be able to grab the i915_error_state from sysfs.
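(A quick way to check from inside the kernel tree, just as a sketch:

git branch --contains 653d7bed26a0c298dee7d60f6ab4bb442acf8b82

If your branch isn't listed, or git reports an unknown commit, you need to pull again.)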
Argh, I meant debugfs instead of sysfs.
Indeed, my git version didn't have that commit... which is quite recent btw! I am also new to git, and I've learned from my errors! Compiling the kernel again...
Created attachment 57172 [details] error_state with drm-intel-next

The error_state from my GPU hang bug... at last!
Created attachment 57173 [details] dmesg associated with i915_error_state
Confirms the WAIT_FOR_EVENT hang. Can you please try the patch at https://bugs.freedesktop.org/attachment.cgi?id=48043 (though I guess that needs rebasing...)?
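(Roughly, from the kernel tree; the patch may need manual fix-ups if it no longer applies cleanly:

wget -O wait-fix.patch 'https://bugs.freedesktop.org/attachment.cgi?id=48043'
git apply --check wait-fix.patch
git apply wait-fix.patch

or use "patch -p1 < wait-fix.patch". The file name is arbitrary.)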
Applying the patch was easy; only one function changed in the code: i915_gem_object_flush_gpu() became i915_gem_object_finish_gpu(). I assumed it's the same function and tried the patch, but I still get the same GPU hang. So the patch is ineffective :(

Is there anything else I can try? Would you like the new i915_error_state output?

Thanks again!
That does depend upon having a fixed finish_gpu()... :|
So where can I get a fixed finish_gpu()? I saw that you recently updated the drm-intel-next branch to the 3.3.0-rc2 kernel. I'll try it and see...
With the latest git from drm-intel-next and the patch you provided, things are a bit better: I still get the GPU hung error in dmesg, but now Xorg comes back alive. In this state there is some corruption, though: if I do "ls -lR /" in an xterm window, the characters overwrite each other, as if the white-rectangle command weren't being processed. Previously, I was stuck in console mode and a reboot was needed.
I git-pulled drm-intel-next again and retested this bug today: it's still hanging, but with an unstable, unusable Xorg (more or less like comment 34). Is there anything else I could try to fix this? And what about that fixed finish_gpu()? Any way I can help with that?

Also, I don't think updating the Xorg packages would fix this bug, but what do you think? I am out of options otherwise. Are you guys still supporting the i945 chipset? ... :P
Keep attaching the fresh error states. Which kernel did you test exactly? There's another known bug for gen3 (fence untiled blits) that you need a recent patch for as well.
Created attachment 59279 [details] See comment 35

Error state using drm-intel-next (121d527a323f3fde313a8f522060ba859ee405b3) with the "WAIT_FOR_EVENT hang" patch from comment 30.
Oh wait, this is one of the WAIT_EVENT-on-disabled-pipe bugs. So you need all of the above plus the linked patch. Let me refresh that now that I have a tester :)
Created attachment 59281 [details] [review] Finish gpu before disabling pipe
Created attachment 59288 [details] test using attachment 59281 [details] [review]

I tested attachment 59281 [details] [review], which is 'almost' the same as the patch from comment 30, and I still get the hang. Here's the error_state.

The only difference between the two patches is that dev_priv->mm.interruptible is set to false before finish_gpu()...

Anything else I can try?
Back to square 0; I need a new theory to start testing. Thanks for your testing.
So it is definitely a pipe misconfiguration that is at the heart of the problem.

The pipe[0] size is:
SRC: 027f01df, 640x384

And the wait is programmed for:
0x0081b008: 0x09000000: MI_LOAD_SCAN_LINES_INCL (pipe = 0)
0x0081b00c: 0x000002fd: dword 1 (range y1=0, y2=637)
0x0081b010: 0x01800002: MI_WAIT_FOR_EVENT

That scanline can never be reached in the current configuration, ergo the indefinite wait.

Can you attach /sys/kernel/debug/dri/0/i915_gem_framebuffer from a hang? I want to try and see which buffer it thinks is actually attached to the pipe.
Created attachment 59291 [details] Here's the gem_framebuffer!

Thank you for your perseverance.
Right, so by the time we submit the batch with the WAIT, our framebuffer is already unpinned and the pipe is reused for something else (probably load-detection). Oh boy.
That doesn't sound good :S

Please take a look back at the setup that I mentioned in the Description. Would upgrading some packages help? Are these versions too old?

Thanks
I've pushed a patch to xf86-video-intel.git for SNA to close the hotplug window:

commit cc20c45aa0ca15720510668d6918bf3c99104626
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Fri Mar 30 22:51:21 2012 +0100

    sna: Minimise the risk of hotplug hangs by checking fb before vsync

    Every time we issue an MI_WAIT_FOR_EVENT on a scan-line from userspace
    we run the risk of that pipe being disabled before we submit a batch.
    As the pipe is then disabled or configured differently, we encounter
    an indefinite wait and trigger a GPU hang.

    To minimise the risk of a hotplug event being detected and submitting
    a vsynced batch prior to noticing the removal of the pipe, perform an
    explicit query of the current CRTC and delete the wait if we spot that
    our framebuffer is no longer attached. This is about as good as we can
    achieve without extra help from the kernel.

    Reported-by: Francis Leblanc <Francis.Leblanc-Lebeau@verint.com>
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=45413 (and others)
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>

It is not a perfect fix, but it should make the problem "unreproducible". You need to clone git://git.freedesktop.org/git/xorg/driver/xf86-video-intel and configure with --enable-sna.

Hope this helps.
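To illustrate the idea (this is a rough sketch, not the actual SNA code; only the libdrm calls drmModeGetCrtc/drmModeFreeCrtc are real, the surrounding names are hypothetical):

/* Before emitting a scanline wait, re-check that our framebuffer is
 * still what the CRTC is scanning out; if not, skip the vsync wait
 * instead of risking an indefinite MI_WAIT_FOR_EVENT. */
#include <stdbool.h>
#include <stdint.h>
#include <xf86drmMode.h>

static bool crtc_still_scanning_our_fb(int drm_fd, uint32_t crtc_id,
                                       uint32_t our_fb_id)
{
        drmModeCrtcPtr crtc = drmModeGetCrtc(drm_fd, crtc_id);
        bool ok;

        if (!crtc)
                return false;   /* CRTC is gone: definitely skip the wait */

        ok = crtc->mode_valid && crtc->buffer_id == our_fb_id;
        drmModeFreeCrtc(crtc);
        return ok;
}

/* hypothetical call site:
 *      if (crtc_still_scanning_our_fb(fd, crtc_id, fb_id))
 *              emit_scanline_wait(batch, pipe, y1, y2);
 *      else
 *              ; // render unsynced rather than risk hanging the GPU
 */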
And for completeness we still need the most recent patched kernel (i.e. finish-gpu fixes).
Hi Chris,

Do you think it would be possible to get this patch against a non-SNA driver, like v2.15? The git link you sent me needs xserver 1.10, and the dependency chain is a headache (I am using Xorg 7.6 from LFS, xserver 1.9.3). I've tried updating most packages that were flagged by the autoconf/configure process, but now I've hit a wall where the xserver doesn't compile and I don't know which package upgrade is needed.

Thanks
A patch referencing this bug report has been merged in Linux v3.4-rc3:

commit 14667a4bde4361b7ac420d68a2e9e9b9b2df5231
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Tue Apr 3 17:58:35 2012 +0100

    drm/i915: Finish any pending operations on the framebuffer before disabling
A second patch, generalising upon the first kernel fix:

commit 0f91128d88bbb8b0a8e7bb93df2c40680871d45a
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Tue Apr 17 10:05:38 2012 +0100

    drm/i915: Wait for all pending operations to the fb before disabling the pipe

    During modeset we have to disable the pipe to reconfigure its timings
    and maybe its size. Userspace may have queued up command buffers that
    depend upon the pipe running in a certain configuration and so the
    commands may become confused across the modeset. At the moment, we use
    a less than satisfactory kick-scanline-waits should the GPU hang
    during the modeset. It should be more reliable to wait for the pending
    operations to complete first, even though we still have a window for
    userspace to submit a broken command buffer during the modeset.

    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>

And you may want to backport (or at least the uxa chunks):

commit b817200371bfe16f44b879a793cf4a75ad17bc5c
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Tue Apr 17 17:54:58 2012 +0100

    Don't issue a scanline wait while VT switched

    Be paranoid and check that we own the VT before emitting a scanline
    wait. If we attempt to wait on a fb/pipe that we do not own, we may
    issue an illegal command and cause a lockup.

    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>

Those, I think, are sufficient to close the race entirely.
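(If you end up backporting the userspace commit into an older xf86-video-intel checkout, a rough starting point is:

git cherry-pick -x b817200371bfe16f44b879a793cf4a75ad17bc5c

Expect conflicts against a 2.15-era tree that will need resolving by hand.)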
I'm compiling the kernel and the xorg driver with these updates right now and will come back with the results...