Bug 45413

Summary: [i945 monitor hotplug] GPU hang drm:i915_hangcheck_elapsed
Product: xorg Reporter: Francis Leblanc <Francis.Leblanc-Lebeau>
Component: Driver/intel Assignee: Chris Wilson <chris>
Status: RESOLVED FIXED QA Contact: Xorg Project Team <xorg-team>
Severity: normal    
Priority: medium CC: daniel, florian
Version: 7.6 (2010.12)   
Hardware: x86 (IA32)   
OS: Linux (All)   
Whiteboard:
i915 platform: i915 features:
Attachments:
Description | Flags
dmesg with drm.debug=0x06 | none
/var/log/messages | none
Xorg.0.log | none
xorg.conf | none
dmesg with drm.debug=0x06 and debugfs | none
dmesg using drm-intel-next branch | none
error_state with drm-intel-next | none
dmesg associated with i915_error_state | none
See comment 35 | none
Finish gpu before disabling pipe | none
test using attachment 59281 | none
Here's the gem_framebuffer ! | none

Description Francis Leblanc 2012-01-30 15:00:30 UTC
Created attachment 56342 [details]
dmesg with drm.debug=0x06

I recently upgraded the intellinux driver from 2.13 to 2.15 and I got this “GPU hang” bug.
Repro steps:
- headless boot
- plug a monitor (vga or dvi) once Xorg is started

I saw that v2.15 was tested with xserver v1.10.0, but I am using xserver v1.9.3.

Setup info
Chipset: i945
xf86-video-intel: 2.15
xserver: 1.9.3
libdrm: 2.4.25
Xorg release is 7.6 for all packages, except libdrm that I updated for this bug.
Linux distribution: Linux From Scratch, kernel from kernel.org
uname -a: Linux 2.6.39.4 #95540 SMP PREEMPT Fri Sep 30 14:34:26 EDT 2011 i686 GNU/Linux
Display connector: none to start with, connecting DVI or VGA triggers the bug.
100% reproducible

Debugfs is not enabled currently so I don't have access to the latest batch buffer. I will recompile the kernel with debugfs if necessary.

Attachment: dmesg log with boot option "drm.debug=0x06"

Thanks in advance !
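
For readers unfamiliar with the drm.debug boot option used for the attached log: the value is a bitmask of message categories. The following is a minimal sketch of decoding it, assuming the bit assignments used by kernels of this era (0x01 core, 0x02 driver, 0x04 KMS). The function name is made up for illustration and is not part of any tool.

```shell
#!/bin/sh
# Hypothetical helper: print which DRM debug categories a drm.debug mask enables.
# Assumed bit layout: 0x01 = core, 0x02 = driver, 0x04 = kms.
drm_debug_categories() {
    mask=$(( $1 ))          # accepts decimal or 0x-prefixed hex
    out=""
    [ $(( mask & 0x01 )) -ne 0 ] && out="$out core"
    [ $(( mask & 0x02 )) -ne 0 ] && out="$out driver"
    [ $(( mask & 0x04 )) -ne 0 ] && out="$out kms"
    echo "${out# }"         # strip the leading space
}

drm_debug_categories 0x06   # prints: driver kms
```

Under that assumption, drm.debug=0x06 asks for driver and KMS messages but not the noisier core messages.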
Comment 1 Francis Leblanc 2012-01-30 15:01:24 UTC
Created attachment 56343 [details]
/var/log/messages
Comment 2 Francis Leblanc 2012-01-30 15:03:18 UTC
Created attachment 56344 [details]
Xorg.0.log
Comment 3 Francis Leblanc 2012-01-30 15:04:24 UTC
Created attachment 56345 [details]
xorg.conf
Comment 4 Chris Wilson 2012-01-30 15:13:39 UTC
Hmm, not much point enabling debugfs just yet: you need to pull drm-intel-next from http://cgit.freedesktop.org/~danvet/drm/ first in order for the error state to be written in this case. I suspect this will be a userspace driver bug, and so the i915_error_state is the first port of call.
Comment 5 Francis Leblanc 2012-01-31 07:56:07 UTC
I followed your link, but I haven't found drm-intel-next. I searched around and found the drm-intel-next package at http://cgit.freedesktop.org/~danvet/drm-intel/
	
I expected to get the drm kernel driver but instead it is the whole kernel.
I assume I can't copy drivers/gpu/drm from drm-intel-next into kernel 2.6.39.4, so what would be the best way to troubleshoot this bug ?

Thanks
Comment 6 Chris Wilson 2012-01-31 08:17:34 UTC
My apologies, Daniel wisely chose to separate his experimental branches from his official branches for upstream code, and ~danvet/drm-intel is indeed the home for the drm-intel-next branch. If you already have a kernel tree, all you need to do is merge with drm-intel-next, but since we don't have the drm driver backported to 2.6.39.4, you will end up with 3.2.0-rc6+...

And it's not even as simple as that...

Daniel, is there a branch that simply contains all the known bugfixes that we can use as a basis for testing?
Comment 7 Francis Leblanc 2012-01-31 10:18:22 UTC
Thanks Chris for the follow up. I am compiling drm-intel-next right now and I'll see if the problem is still there. In the end though, I'd prefer to keep 2.6.39.4.
Do you know if it's feasible to keep the current packages from Xorg 7.6 and only update the kernel to drm-intel-next?
Comment 8 Chris Wilson 2012-01-31 10:22:48 UTC
Yes, we aim for kernel updates to not break existing userspace (and for new userspace to keep working on old kernels). Going back and forth between 2.6.39 and 3.2.0, you and your userspace should encounter no problems.
Comment 9 Daniel Vetter 2012-01-31 10:38:57 UTC
> --- Comment #8 from Chris Wilson <chris@chris-wilson.co.uk> 2012-01-31 10:22:48 PST ---
> Yes, we aim for kernel updates to not break existing userspace (and for new
> userspace to keep working on old kernels). Going back and forth between 2.6.39
> and 3.2.0, you and your userspace should encounter no problems.

Even more so, if something blows up when you upgrade your kernel, we want
to know about it. For branches to test I recommend drm-intel-testing from
the same git repo - that one also contains the latest stuff from -fixes.
It does not yet contain all the bugfixes and patches for known issues, but
it's much closer than 2.6.39 ;-)
Comment 10 Francis Leblanc 2012-02-02 06:29:33 UTC
Ok, I successfully built the branch drm-intel-testing from the git repo. I got the same GPU hang problem: I am using the same packages but with this 3.2 kernel.

I forgot to mention that when the crash occurs, I have a video being rendered in fullscreen, using XV. The hotplug works otherwise, i.e. when Xorg is idle using twm.

Other useful info: this bug was not happening with the following configuration
kernel 2.6.29.6
Xorg 7.4
xserver 1.6.0
libdrm 2.4.7

Would the dmesg logs help with this new drm-intel-testing setup ?
Thanks!
Comment 11 Francis Leblanc 2012-02-02 07:50:25 UTC
I have done another test: with my original setup using 2.6.39.4.
This time, I get the GPU hang even with Xorg being idle.

So there is a small upgrade by using kernel 3.2.0, as it only crashes when using XV.
Comment 12 Francis Leblanc 2012-02-02 10:30:14 UTC
Correction: I got the same behavior under 3.2.0 and 2.6.39.4: the GPU hang bug only appears with an XV video playing.
Comment 13 Francis Leblanc 2012-02-03 10:46:30 UTC
Hi Chris and Daniel,

I would like some help to troubleshoot this bug. From what I understand, there is something in the hotplug event that makes the GPU hang if it's processing an XV batch buffer. It was introduced somewhere between 2.6.29 and 2.6.39.4 and it's still in kernel 3.2.

Would debugfs/last batch buffer help in this case ?
Or do you guys have other ideas as to where the bug might come from ?
Are there any kernel options I can add to make the drm/intel driver more cautious about the hotplug event ? like a sync option ?

Thanks again!
Francis
Comment 14 Chris Wilson 2012-02-03 11:07:40 UTC
Yes, the /sys/kernel/debug/dri/0/i915_error_state would have confirmed whether or not you are suffering from bug 36515.
Comment 15 Francis Leblanc 2012-02-07 08:11:23 UTC
I got the drm-intel-testing kernel built with debugfs and here's the error_state:
root@localhost:~# cat /sys/kernel/debug/dri/0/i915_error_state
no error state collected

Now that I got debugfs, is there anything else that I can provide you so I can debug this ? There is a lot of debug info in there, but is any of it useful in this case ?
Comment 16 Francis Leblanc 2012-02-09 10:59:55 UTC
Is there anything I can do to debug this GPU hang ?
Any help would be appreciated !!
Thanks again.
Francis
Comment 17 Daniel Vetter 2012-02-10 12:13:34 UTC
As Chris said, please attach the error_state. But you first need to rehang your machine otherwise there'll be nothing interesting in it.
Comment 18 Francis Leblanc 2012-02-10 12:23:14 UTC
Like I said previously, the error_state only contains "no error state collected", even if I see "GPU hung" in /var/log/messages, as stated in Comment 15. Are there other files that I can attach to this bug ?
Comment 19 Daniel Vetter 2012-02-10 12:24:57 UTC
Can you please attach dmesg so I can check why the kernel might not dump the error_state?
Comment 20 Francis Leblanc 2012-02-10 13:04:34 UTC
Created attachment 56889 [details]
dmesg with drm.debug=0x06 and debugfs

Also, it's my first time using debugfs. As per the kernel doc, I activated it using the following command: mount -t debugfs none /sys/kernel/debug
Is there anything else ? The other files are populated and used by the driver, i.e. /sys/kernel/debug/dri/0/i915_capabilities lists the caps...
Thanks.
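
The debugfs steps discussed in this thread can be sketched as a small script. This is only an illustration: the helper names and the default output file name are invented here, and it assumes the card is DRM minor 0, as in the paths quoted in this report.

```shell
#!/bin/sh
# Sketch: mount debugfs if needed, then save the i915 error state after a hang.
DEBUGFS=/sys/kernel/debug
ERROR_STATE="$DEBUGFS/dri/0/i915_error_state"

# Returns 0 if the given file holds a real error state, 1 if it is unreadable
# or only contains the "no error state collected" placeholder.
has_error_state() {
    [ -r "$1" ] || return 1
    ! grep -q "no error state collected" "$1"
}

# Hypothetical wrapper: needs root, and is only useful after reproducing the hang.
# Usage: capture_error_state [output-file]
capture_error_state() {
    grep -q " $DEBUGFS debugfs " /proc/mounts || mount -t debugfs none "$DEBUGFS"
    if has_error_state "$ERROR_STATE"; then
        cat "$ERROR_STATE" > "${1:-i915_error_state.txt}"
    else
        echo "no error state collected yet; reproduce the hang first" >&2
        return 1
    fi
}
```

Calling capture_error_state right after the "GPU hung" message shows up in dmesg would leave a file that can be attached to the bug, provided the running kernel actually records an error state for this kind of hang.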
Comment 21 Daniel Vetter 2012-02-10 13:21:35 UTC
Ah, your gpu is stuck on an MI_WAIT for which we don't capture the error_state yet but just kick the waiting gpu. You need the latest drm-intel-next branch from

http://cgit.freedesktop.org/~danvet/drm-intel/
Comment 22 Francis Leblanc 2012-02-10 13:46:20 UTC
Ok! I was using drm-intel-testing so I'll recompile drm-intel-next and post the error_state file. Thanks :)
Comment 23 Francis Leblanc 2012-02-14 10:48:21 UTC
Created attachment 57043 [details]
dmesg using drm-intel-next branch

Here's the dmesg output when I reproduce this bug with the git branch drm-intel-next.
I'm still getting "no error state collected" though... :(

I also noticed that with drm-intel-next, I can reproduce this bug quite easily by unplugging and replugging my VGA monitor, which is far worse than the original bug. I understand this branch is 3.2-rc6 :)

Any help would be appreciated!
Thanks in advance,
Francis
Comment 24 Francis Leblanc 2012-02-15 12:50:20 UTC
Ok, to clear things up, I re-tested the three different kernels and made sure to reproduce the hangs multiple times: they are 2.6.39.4, git drm-intel-testing (3.2) and git drm-intel-next (3.2-rc6).

I had some strange behavior previously and it was probably because I was using a KVM switch on the VGA port. Now that it's removed, the GPU hang is 100% reproducible with the first hotplug event, just as stated in the Description. I can't reproduce the "unplug/plug" hang events described in Comment 23.

Sorry for the mess !! (I wish comments could be removed in bugzilla)
Comment 25 Daniel Vetter 2012-02-16 01:52:58 UTC
Please double-check that your drm-intel-next contains

commit 653d7bed26a0c298dee7d60f6ab4bb442acf8b82
Author: Daniel Vetter <daniel.vetter@ffwll.ch>
Date:   Wed Dec 14 13:57:21 2011 +0100

    drm/i915: capture error_state also for stuck rings

with that you should be able to grab the i915_error_state from sysfs
Comment 26 Daniel Vetter 2012-02-16 01:53:16 UTC
Argh, I meant debugfs instead of sysfs.
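
One way to double-check whether a checkout contains a given commit is sketched below. It assumes a reasonably recent git (one that supports `git -C` and `merge-base --is-ancestor`) and is run from inside the kernel tree. The function name is illustrative only.

```shell
#!/bin/sh
# Sketch: report whether a commit is contained in the currently checked-out tree.
# Usage: has_commit <commit-id> [repo-dir]
has_commit() {
    git -C "${2:-.}" merge-base --is-ancestor "$1" HEAD 2>/dev/null
}

# Commit quoted in this thread: "drm/i915: capture error_state also for stuck rings"
WANTED=653d7bed26a0c298dee7d60f6ab4bb442acf8b82

if has_commit "$WANTED"; then
    echo "tree contains $WANTED"
else
    echo "tree does NOT contain $WANTED (or this is not a git checkout)"
fi
```

If the commit is reported as missing, the local branch is probably older than the remote one and needs to be updated before rebuilding.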
Comment 27 Francis Leblanc 2012-02-16 07:55:55 UTC
Indeed, my git version didn't have that commit... which is quite recent btw!
I am also new to git, and I've learned from my errors! Compiling the kernel again...
Comment 28 Francis Leblanc 2012-02-16 10:27:40 UTC
Created attachment 57172 [details]
error_state with drm-intel-next

the error_state from my GPU hung bug... at last !
Comment 29 Francis Leblanc 2012-02-16 11:03:32 UTC
Created attachment 57173 [details]
dmesg associated with i915_error_state
Comment 30 Chris Wilson 2012-02-16 13:38:11 UTC
Confirms the WAIT_FOR_EVENT hang. Can you please try the patch https://bugs.freedesktop.org/attachment.cgi?id=48043

Though I guess that needs rebasing...
Comment 31 Francis Leblanc 2012-02-17 07:01:27 UTC
Applying the patch was easy, only one function changed in the code:
i915_gem_object_flush_gpu() became i915_gem_object_finish_gpu()

I assumed it's the same function and tried the patch, but I still get the same GPU hang. So the patch is ineffective :(

Is there anything else I can try ?
Would you like the new i915_error_state output ?

Thanks again !
Comment 32 Chris Wilson 2012-02-17 13:24:56 UTC
That does depend upon having a fixed finish_gpu()... :|
Comment 33 Francis Leblanc 2012-02-28 07:49:45 UTC
So where can I get a fixed finish_gpu() ?
I saw that you recently updated the branch drm-intel-next with the 3.3.0-rc2 kernel. I'll try it and see ...
Comment 34 Francis Leblanc 2012-03-01 07:10:41 UTC
With the latest git from drm-intel-next and the patch you provided, things are a bit better: I still get the GPU hung error in dmesg, but now Xorg comes back alive. In this state there is corruption though: for example, if I do "ls -lR /" in an xterm window, the characters overwrite each other as if the white rectangle command were not being processed.

Previously, I was stuck in "console mode" and a reboot was needed.
Comment 35 Francis Leblanc 2012-03-30 08:01:01 UTC
I git-pulled drm-intel-next again, retested this bug today and it's still hanging, but with an unstable unusable Xorg (more or less like comment 34). Is there anything else I could try to fix this ?

And what about that fixed finish_gpu() ? Is there any way I can help with that ?

Also, I don't think updating Xorg packages would fix this bug, but what do you think ? I am out of options otherwise.

Are you guys still supporting the i945 chipset ? ... :P
Comment 36 Chris Wilson 2012-03-30 08:05:49 UTC
Keep attaching the fresh error states. Which kernel did you test exactly? There's another known bug for gen3 (fence untiled blits) that you need a recent patch for as well.
Comment 37 Francis Leblanc 2012-03-30 08:10:53 UTC
Created attachment 59279 [details]
See comment 35

Error State using drm-intel-next: 121d527a323f3fde313a8f522060ba859ee405b3
with "WAIT_FOR_EVENT hang" patch from comment 30.
Comment 38 Chris Wilson 2012-03-30 08:18:43 UTC
Oh wait, this is one of the WAIT_EVENT on disabled pipe bugs. So you need all of the above plus the linked patch. Let me refresh that now I have a tester :)
Comment 39 Chris Wilson 2012-03-30 08:27:19 UTC
Created attachment 59281 [details] [review]
Finish gpu before disabling pipe
Comment 40 Francis Leblanc 2012-03-30 10:48:05 UTC
Created attachment 59288 [details]
test using attachment 59281

I tested attachment 59281, which is 'almost' the same as the patch from comment 30, and I still get the hang. Here's the error_state.
The only difference between the two patches is that dev_priv->mm.interruptible is set to false before finish_gpu() ...

Anything else I can try ?
Comment 41 Chris Wilson 2012-03-30 13:18:47 UTC
Back to square 0; I need a new theory to start testing. Thanks for your testing.
Comment 42 Chris Wilson 2012-03-30 13:55:29 UTC
So it is definitely a pipe misconfiguration that is at the heart of the problem.

The pipe[0] size is SRC: 027f01df, 640x384

And the wait is programmed for:

0x0081b008:      0x09000000: MI_LOAD_SCAN_LINES_INCL (pipe = 0)
0x0081b00c:      0x000002fd:    dword 1 (range y1=0, y2=637)
0x0081b010:      0x01800002: MI_WAIT_FOR_EVENT

That scanline can never be reached in the current configuration, ergo the indefinite wait.

Can you attach /sys/kernel/debug/dri/0/i915_gem_framebuffer from a hang? I want to try and see which buffer it thinks is actually attached to the pipe.
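
The mismatch described in this comment can be illustrated with a small check. This is a simplified model and not the driver's actual logic: it assumes valid scanlines run from 0 to height-1, and it takes the decoded values quoted above (a 384-line pipe and a wait range ending at y2=637) at face value. The function name is hypothetical.

```shell
#!/bin/sh
# Sketch: following the explanation above, treat a scanline wait as satisfiable
# only if its end line actually exists on the pipe.
# Usage: scanline_wait_reachable <y2> <pipe-height>
scanline_wait_reachable() {
    [ "$1" -ge 0 ] && [ "$1" -lt "$2" ]
}

if scanline_wait_reachable 637 384; then
    echo "wait can complete"
else
    echo "wait can never complete: line 637 is beyond a 384-line pipe"
fi
```

A wait programmed for a larger mode and then executed after the pipe has been reconfigured to a smaller one falls into the second branch, which corresponds to the indefinite wait seen in the error state.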
Comment 43 Francis Leblanc 2012-03-30 14:02:43 UTC
Created attachment 59291 [details]
Here's the gem_framebuffer !

Thank you for your perseverance.
Comment 44 Chris Wilson 2012-03-30 14:10:47 UTC
Right, so by the time we submit the batch with the WAIT our framebuffer is already unpinned and the pipe is reused for something else (probably load-detection.)

Oh boy.
Comment 45 Francis Leblanc 2012-03-30 14:20:12 UTC
That doesn't sound good :S

Please take a look back at the setup that I mentioned in Description. Would upgrading some packages help ? Are these versions too old ?

Thanks
Comment 46 Chris Wilson 2012-03-30 15:10:53 UTC
I've pushed a patch to xf86-video-intel.git for SNA to close the hotplug window:

commit cc20c45aa0ca15720510668d6918bf3c99104626
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Fri Mar 30 22:51:21 2012 +0100

    sna: Minimise the risk of hotplug hangs by checking fb before vsync
    
    Everytime we issue a MI_WAIT_FOR_EVENT on a scan-line from userspace we
    run the risk of that pipe being disable before we submit a batch. As the
    pipe is then disabled or configured differently, we encounter an
    indefinite wait and trigger a GPU hang.
    
    To minimise the risk of a hotplug event being detected and submitting a
    vsynced batch prior to noticing the removal of the pipe, perform an
    explicit query of the current CRTC and delete the wait if we spot that
    our framebuffer is no longer attached. This is about as good as we can
    achieve without extra help from the kernel.
    
    Reported-by: Francis Leblanc <Francis.Leblanc-Lebeau@verint.com>
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=45413 (and others)
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>

It is not a perfect fix, but it should make the problem "unreproducible".

You need to clone git://git.freedesktop.org/git/xorg/driver/xf86-video-intel and configure with --enable-sna.

Hope this helps.
Comment 47 Chris Wilson 2012-03-30 15:21:28 UTC
And for completeness we still need the most recent patched kernel (i.e. finish-gpu fixes).
Comment 48 Francis Leblanc 2012-04-03 10:57:22 UTC
Hi Chris,
Do you think it would be possible to get this patch against a non-SNA driver, like v2.15 ? The git link you sent me needs xserver 1.10, and the dependencies it requires are a headache (I am using Xorg 7.6 from LFS, xserver 1.9.3).

I've tried updating most packages that were flagged by the autoconf/configure process, but now I've hit a wall where xserver doesn't compile and I don't know which package upgrade is needed.

Thanks
Comment 49 Florian Mickler 2012-04-16 14:28:42 UTC
A patch referencing this bug report has been merged in Linux v3.4-rc3:

commit 14667a4bde4361b7ac420d68a2e9e9b9b2df5231
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Tue Apr 3 17:58:35 2012 +0100

    drm/i915: Finish any pending operations on the framebuffer before disabling
Comment 50 Chris Wilson 2012-04-18 12:47:33 UTC
A second patch generalising upon the first kernel fix:

commit 0f91128d88bbb8b0a8e7bb93df2c40680871d45a
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Tue Apr 17 10:05:38 2012 +0100

    drm/i915: Wait for all pending operations to the fb before disabling the pip
    
    During modeset we have to disable the pipe to reconfigure its timings
    and maybe its size. Userspace may have queued up command buffers that
    depend upon the pipe running in a certain configuration and so the
    commands may become confused across the modeset. At the moment, we use a
    less than satisfactory kick-scanline-waits should the GPU hang during
    the modeset. It should be more reliable to wait for the pending
    operations to complete first, even though we still have a window for
    userspace to submit a broken command buffer during the modeset.
    
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>

And you may want to backport (or at least the uxa chunks):

commit b817200371bfe16f44b879a793cf4a75ad17bc5c
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Tue Apr 17 17:54:58 2012 +0100

    Don't issue a scanline wait while VT switched
    
    Be paranoid and check that we own the VT before emitting a scanline
    wait. If we attempt to wait on a fb/pipe that we do not own, we may
    issue an illegal command and cause a lockup.
    
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>

Those I think are sufficient to close the race entirely.
Comment 51 Francis Leblanc 2012-04-25 13:02:25 UTC
I'm compiling the kernel and the xorg driver with these updates right now and will come back with the results...
