52624 – [845G] [SNA] Spuriously detected a "hung GPU"

Status:	RESOLVED WORKSFORME

Alias:	None

Product:	xorg
Classification:	Unclassified
Component:	Driver/intel
Version:	unspecified
Hardware:	x86 (IA32) Linux (All)

Importance:	medium normal
Assignee:	Chris Wilson
QA Contact:	Xorg Project Team

URL:
Whiteboard:
Keywords:

Depends on:
Blocks:

Reported:	2012-07-28 17:08 UTC by Karol Błażewicz
Modified:	2012-08-15 11:01 UTC
CC List:	2 users

See Also:
i915 platform:
i915 features:

Attachments
Xorg.0.log (22.52 KB, text/plain) 2012-07-28 17:08 UTC, Karol Błażewicz
dmesg output (34.74 KB, text/plain) 2012-07-28 17:10 UTC, Karol Błażewicz
2nd Xorg log - gpu hangs even w/o playing with GL (22.52 KB, text/plain) 2012-07-28 18:42 UTC, Karol Błażewicz
2nd dmesg output - no GL this time (34.62 KB, text/plain) 2012-07-28 18:44 UTC, Karol Błażewicz
i915_error_state for a sna crash (673.55 KB, text/plain) 2012-08-14 12:34 UTC, tang0th
Another dmesg output on SNA hung (34.31 KB, text/plain) 2012-08-15 10:53 UTC, Balló György
Another i915_error_state on SNA hung (673.55 KB, text/plain) 2012-08-15 10:55 UTC, Balló György
Another Xorg log on SNA hung (62.56 KB, text/plain) 2012-08-15 10:57 UTC, Balló György

Description Karol Błażewicz 2012-07-28 17:08:47 UTC

Created attachment 64826
Xorg.0.log

This is my first bug report here, so please be gentle.

I've been using UXA with
  Option "Shadow" "True"
on xf86-video-intel up to and including version 2.19 and it worked OK. From version 2.20 I get a hung GPU with both UXA and SNA. This report is about SNA - should I open another one for UXA?


-- chipset: Intel Corporation 82845G/GL[Brookdale-G]/GE Chipset Integrated Graphics Device (rev 01) (prog-if 00 [VGA controller])
-- system architecture: 32-bit
-- xf86-video-intel: 2.20.2
-- xserver: 1.12.3
-- libdrm: 2.4.37
-- kernel: 3.5.0-1-ARCH
-- Linux distribution: Arch Linux
-- Machine or mobo model: IBM NetVista A30p

Comment 1 Karol Błażewicz 2012-07-28 17:10:36 UTC

Created attachment 64827
dmesg output

Comment 2 Karol Błażewicz 2012-07-28 17:16:15 UTC

I did follow http://intellinuxgraphics.org/i915_error_state.html but

$ findmnt -nt debugfs
/sys/kernel/debug debugfs debugfs rw,relatime
$ cat /sys/kernel/debug/dri/0/i915_error_state
no error state collected
$ cat /sys/kernel/debug/dri/64/i915_error_state
no error state collected
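
The manual check above can be wrapped in a small helper. This is a minimal sketch, not an official tool: the function name `collect_error_states` and the output file naming are made up for illustration, and it assumes debugfs is already mounted (for example with `mount -t debugfs none /sys/kernel/debug`) and that the caller is allowed to read it, which normally means root.

```shell
#!/bin/sh
# Save every captured i915_error_state found under a debugfs DRI directory.
# Usage: collect_error_states [DRI_DIR] [OUT_DIR]
#   DRI_DIR defaults to /sys/kernel/debug/dri
#   OUT_DIR defaults to the current directory
# Hypothetical helper for illustration only.
collect_error_states() {
    dri_dir=${1:-/sys/kernel/debug/dri}
    out_dir=${2:-.}
    found=0
    for f in "$dri_dir"/*/i915_error_state; do
        [ -r "$f" ] || continue                  # unmatched glob or unreadable node
        minor=$(basename "$(dirname "$f")")
        out="$out_dir/i915_error_state.$minor"
        cat "$f" > "$out"                        # read the node exactly once
        # The kernel prints this placeholder when nothing was captured.
        if [ ! -s "$out" ] || grep -q '^no error state collected' "$out"; then
            rm -f "$out"
            echo "dri/$minor: no error state collected"
            continue
        fi
        echo "dri/$minor: saved to $out"
        found=1
    done
    [ "$found" -eq 1 ]                           # non-zero exit if nothing was saved
}
```

Calling `collect_error_states` with no arguments checks every DRI minor (0 and 64 in the transcript above) and returns a non-zero status when none of them holds a captured state.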

Comment 3 Chris Wilson 2012-07-28 17:18:48 UTC

How did you trigger the hang? Presumably through the use of DRI, which is controlled through a separate flag, just in case you wanted to use it in spite of the contrary evidence.

Can you please attach the i915_error_state so that I can check that it is conforming to the usual failure mode for 845g?

Comment 4 Chris Wilson 2012-07-28 17:19:46 UTC

glines! So you attempted to use GL...

Comment 5 Chris Wilson 2012-07-28 17:27:20 UTC

So suggesting to use Option "DRI" "false" would have been a little more effective if I remembered to make SNA check for it...

commit 3d45f0affe263985f440e144203ed7cbb3803696
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Sat Jul 28 18:21:08 2012 +0100

    sna: Honour the Option "DRI"
    
    References: https://bugs.freedesktop.org/show_bug.cgi?id=52624
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>

However, it looks like the system functioned as intended insofar as it did not crash, and there should have been no functional difference wrt the ddx before or after the detected hang.

With regards to the missing error state, did you reboot before looking in debugfs?

Comment 6 Karol Błażewicz 2012-07-28 17:36:20 UTC

> glines! So you attempted to use GL...

I noticed the hung GPU a couple of hours ago, after (among other things) playing
with gnome-games, so I restarted the computer, mounted debugfs and proceeded to
diligently play each game from the package, grepping the X log after each one to
see if the GPU hung. It didn't.
I'm not an expert, but the numbers representing time are pretty different:

dmesg:
[ 3933.926243] glines[1348]: segfault at 17 ip b6e644fa sp bfcf7ad0 error 4 in libcairo.so.2.11200.2[b6e1b000+106000]

Xorg log:
[  7948.252] (EE) intel(0): Detected a hung GPU, disabling acceleration.


After playing through the whole collection I proceeded to browse the web and
view hundreds of pictures from my hard drive.


> However, it looks like the system functioned as intended insofar as it did not
> crash, and there should have been no functional difference wrt the ddx before
> or after the detected hang.

There is little if any difference in how stuff works here before and after the hung gpu and it's veeery different from the hung gpu I get with UXA (w/o shadow).
With SNA I found out the gpu hung only by grepping the log while with UXA
everything slowed down to a crawl and it was impossible *not* to notice.

> With regards to the missing error state,
> did you reboot before looking in debugfs?

No, I didn't reboot yet. Should I do it now or collect some other evidence
before doing so?
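
The gap between the two timestamps quoted in this comment can be computed rather than eyeballed. The sketch below assumes that the bracketed numbers in dmesg and in Xorg.0.log both count seconds from roughly the same origin (system boot); that is an assumption about how the two logs are stamped, not something established in this report. The helper names `log_seconds` and `seconds_between` are made up for illustration.

```shell
#!/bin/sh
# Print the leading bracketed timestamp (in seconds) of a dmesg or Xorg log line.
log_seconds() {
    printf '%s\n' "$1" | sed -n 's/^\[ *\([0-9][0-9]*\.[0-9]*\)].*/\1/p'
}

# Print the seconds elapsed from the first line to the second line.
seconds_between() {
    awk -v a="$(log_seconds "$1")" -v b="$(log_seconds "$2")" \
        'BEGIN { printf "%.3f\n", b - a }'
}

# The two lines quoted in the comment above.
segv='[ 3933.926243] glines[1348]: segfault at 17 ip b6e644fa sp bfcf7ad0 error 4 in libcairo.so.2.11200.2[b6e1b000+106000]'
hang='[  7948.252] (EE) intel(0): Detected a hung GPU, disabling acceleration.'

seconds_between "$segv" "$hang"    # prints 4014.326
```

Under that assumption the hang was logged about 67 minutes after the glines segfault, which supports the point that the two events are not directly related.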

Comment 7 Karol Błażewicz 2012-07-28 17:58:27 UTC

OK, I did reboot, still 'no error state collected'.
Am I doing it wrong?


> sna: Honour the Option "DRI"

Should I give the -git version of the drivers a try?
Do I have to use some options in the xorg.conf?

Comment 8 Chris Wilson 2012-07-28 18:02:54 UTC

Oh, I understand now. It's a spurious warning from SNA in the sense that it marked the device as wedged because it was an 845g but didn't suppress the warning if we ever received an EIO from the kernel.

Now that EIO should in theory be impossible because all the paths that lead up to it should be prevented by checking for wedged (i.e. we should only get an EIO after performing an operation with the GPU, and they should all be verboten as we believe the GPU is on fire).

Is it possible for you to recompile with --enable-debug and run your X server under a debugger so that I can see where we neglect the check for a wedged GPU (it should abort if we miss a check)?

Comment 9 Karol Błażewicz 2012-07-28 18:22:33 UTC

(In reply to comment #8)
> Is it possible for you to recompile with --enable-debug and run your X server
> under a debugger so that I can see where we neglect the check for a wedged GPU
> (it should abort if we miss a check)?

Do you mean recompile xorg-server like in: http://www.x.org/wiki/Development/Documentation/ServerDebugging ?
I've never done it and it says "You'll really want to have a second machine around." Unfortunately I don't have any other around. If http://www.x.org/wiki/Development/Documentation/ServerDebugging#Debugging_with_one_machine is viable, I can give it a shot.

While I was typing this, the gpu hung again:
$ grep -E '\((WW|EE)' /var/log/Xorg.0.log
	(WW) warning, (EE) error, (NI) not implemented, (??) unknown.
[    16.655] (WW) intel(0): Detected unsupported/dysfunctional hardware, disabling acceleration.
[    17.254] (WW) intel(0): Textured video not supported on this hardware
[    17.265] (WW) intel(0): loading DRI2 whilst the GPU is wedged.
[    74.649] (EE) intel(0): Detected a hung GPU, disabling acceleration.
[    74.649] (EE) intel(0): When reporting this, please include i915_error_state from debugfs and the full dmesg.

Comment 10 Chris Wilson 2012-07-28 18:29:53 UTC

You only need to compile xf86-video-intel with --enable-debug, and the trick for using gdb to automatically grab the full bt when it asserts should work.
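
One way to arrange the automatic backtrace is sketched below. It is only a sketch under assumptions: the helper name `gdb_attach_cmd` and the default log path are made up, the driver is assumed to have been rebuilt from its source tree with `--enable-debug` passed to configure, and the signal handling lines are a guess at what an X server needs in order not to stop on routine signals. The function only prints the gdb command so it can be reviewed before being run as root.

```shell
#!/bin/sh
# Print a gdb command line that attaches to a running X server, lets it
# continue, and writes a full backtrace once the server stops on a fatal
# signal (for example the SIGABRT raised by a failed assertion).
# Usage: gdb_attach_cmd PID [LOGFILE]
# Hypothetical helper for illustration only.
gdb_attach_cmd() {
    pid=$1
    log=${2:-/tmp/xorg-gdb.log}     # assumed location, pick any writable path
    printf 'gdb -p %s -batch -ex "handle SIGUSR1 nostop noprint pass" -ex "handle SIGPIPE nostop noprint pass" -ex continue -ex "bt full" > %s 2>&1\n' \
        "$pid" "$log"
}
```

From a second login (a virtual console or ssh), something like `sudo sh -c "$(gdb_attach_cmd "$(pidof Xorg)")"` would then run it; the process name may be `X` rather than `Xorg` depending on the distribution.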

Comment 11 Karol Błażewicz 2012-07-28 18:42:14 UTC

Created attachment 64831
2nd Xorg log - gpu hangs even w/o playing with GL

Comment 12 Karol Błażewicz 2012-07-28 18:44:08 UTC

Created attachment 64832
2nd dmesg output - no GL this time

Comment 13 Chris Wilson 2012-07-28 18:53:43 UTC

You convinced me the first time, which is why I'm interested in finding out where we are circumventing the defences... :)

Comment 14 Karol Błażewicz 2012-07-28 20:41:09 UTC

I can't get that debugging script to work + if I want to run a regular X with the intel drivers compiled with --enable-debug, X crashes when I launch firefox (not sure if related / normal, I'm a noob, I'm just reporting what I see).

When trying to launch X as described in http://www.x.org/wiki/Development/Documentation/ServerDebugging#Version_1 I get nothing but a timeout:

Waiting for X server to begin accepting connections
..
..
..
<a lot more dots>
..
..
xinit: giving up
xinit: unable to connect to X server: connection refused
xinit: server error


Any idea what went wrong?
I removed '-nolisten tcp' from /etc/X11/xinit/xserverrc but it didn't change anything.
Does it have something to do with .Xauthority or ...?

Comment 15 Chris Wilson 2012-07-29 08:09:17 UTC

Hmm, that script should just work afaics. As regards the firefox issue, it is either a genuine crash of the variety we are trying to diagnose, or more likely it is just that firefox opens multiple connections at startup and the default behaviour of X is to quit when the last client quits. So if you start up firefox on a bare X, X dies. A way around that is to start up an xterm and launch firefox from within the xterm.

Comment 16 Karol Błażewicz 2012-07-29 15:33:39 UTC

1. Sorry for the terrible wording wrt using regular X and the intel 2.20.2 driver with '--enable-debug': X doesn't crash when I start firefox - the whole system does. I get a black screen with a mouse cursor I can't move, I can't switch back to the console (out of X), and the only thing I can do is a magic_sysrq restart.
I start X via 'startx', which launches dwm. Once it's launched, I start urxvt and launch firefox from the terminal.



2. I've tried xf86-video-intel-git with "DRI" set to "false":

$ grep -v "#" /etc/X11/xorg.conf.d/20-intel.conf
Section "Device"
  Identifier "card0"
  Driver "intel"
  Option "AccelMethod" "sna"
  Option "DRI" "false"
EndSection

So far I got neither a hung gpu nor a wedged one, but switching windows is not as fluid as it was with UXA.

The only warnings I've seen so far in the Xorg log are:

[   410.682] (WW) intel(0): Detected unsupported/dysfunctional hardware, disabling acceleration.
[   411.216] (WW) intel(0): Textured video not supported on this hardware

Comment 17 Chris Wilson 2012-07-29 16:34:13 UTC

(In reply to comment #16)
> So far I got neither a hung gpu nor a wedged one, but switching windows is not
> as fluid as it was with UXA.

Which is to be expected, as I forgot to restore the code to kill acceleration for 845g in UXA (the code was commented out as shadow itself was broken). The problem is that sooner or later the GPU will hang, usually sooner, as the GMCH is incoherent.

Comment 18 Chris Wilson 2012-07-29 22:03:04 UTC

I've rearranged the code so that we risk the GPU hang on 845g by default and allow the user to elect to disable acceleration instead. I think the hangs are safe in that we shouldn't be killing the entire machine - though if we do get any such reports I shall have to disable acceleration on 845g by default. (Until such a day as we find a safe way to use GEM!)

Comment 19 Chris Wilson 2012-07-29 22:05:51 UTC

I've reworked the original offending code not to spuriously warn, and enabling acceleration on 845g should render the hunt futile.

I'll do some fault injection and continue to hunt for missing wedged checks on my machines.

Comment 20 Karol Błażewicz 2012-08-07 12:43:54 UTC

I've tested different configurations for xf86-video-intel 2.20.3:

With SNA I always get a warning:
[    17.845] (WW) intel(0): Textured video not supported on this hardware
and the GPU hangs unless I disable 2D acceleration:

* Option "AccelMethod" "sna" = at least hung GPU. 50% of the time the whole computer freezes and I have to bail out using magic_sysrq.

* Option "AccelMethod" "sna" + Option "DRI" "false" = hung GPU.

* Option "AccelMethod" "sna" + Option "NoAccel" "true" = some warnings but otherwise "OK":
[    18.912] (WW) intel(0): Textured video not supported on this hardware
[    18.925] (WW) intel(0): loading DRI2 whilst the GPU is wedged.

* Option "AccelMethod" "sna" + Option "NoAccel" "true" + Option "DRI" "false" = just the "standard SNA warning" but otherwise "OK":
[    18.380] (WW) intel(0): Textured video not supported on this hardware


* Option "AccelMethod" "uxa" = hung GPU.

* Option "AccelMethod" "uxa" + Option "DRI" "false" = hung GPU

* Option "AccelMethod" "uxa" + Option "NoAccel" "true" = a warning in dmesg but otherwise "OK":
[    18.284] (WW) intel(0): cannot enable DRI2 whilst forcing software fallbacks

* Option "AccelMethod" "uxa" + Option "NoAccel" "true" + Option "DRI" "false" = no errors, no warnings - and no acceleration ;P


With 2D acceleration (either UXA or SNA) things are working really smooth, but also really unstable :-(
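
The eight variants listed above differ only in three options, so the Device section can be generated instead of edited by hand between test runs. A minimal sketch follows; the function name `intel_conf` is made up, the Identifier "card0" is taken from the configuration shown in comment 16, and only options already named in this report are emitted.

```shell
#!/bin/sh
# Print an xorg.conf Device section for one of the tested driver configurations.
# Usage: intel_conf ACCEL [NOACCEL] [DRI]
#   ACCEL    "sna" or "uxa"
#   NOACCEL  "true" adds Option "NoAccel" "true" (omitted otherwise)
#   DRI      "false" adds Option "DRI" "false" (omitted otherwise)
# Hypothetical helper for illustration only.
intel_conf() {
    echo 'Section "Device"'
    echo '  Identifier "card0"'
    echo '  Driver "intel"'
    echo "  Option \"AccelMethod\" \"$1\""
    if [ "${2:-}" = true ]; then
        echo '  Option "NoAccel" "true"'
    fi
    if [ "${3:-}" = false ]; then
        echo '  Option "DRI" "false"'
    fi
    echo 'EndSection'
}

# The SNA variant reported above as "OK": no 2D acceleration and no DRI.
intel_conf sna true false
```

Redirecting the output to a file such as /etc/X11/xorg.conf.d/20-intel.conf (the path used in comment 16) and restarting X switches between variants.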

Comment 21 Chris Wilson 2012-08-07 12:59:39 UTC

Can you check the dmesg for the freeze? That you experience a freeze but the kernel still responds to sysrq suggests that it is not a hard system hang, but an oops.

Comment 22 Karol Błażewicz 2012-08-07 16:45:20 UTC

/var/log/everything.log has only

Aug  1 15:22:12 black kernel: [ 2194.043134] [drm:i915_hangcheck_hung] *ERROR* Hangcheck timer elapsed... GPU hung
Aug  1 15:22:12 black kernel: [ 2194.043147] [drm] capturing error event; look for more information in /debug/dri/0/i915_error_state
Aug  1 15:23:37 black kernel: [ 2278.959435] SysRq : Keyboard mode set to system default
Aug  1 15:23:37 black syslog-ng[243]: syslog-ng shutting down; version='3.3.5'
Aug  1 15:23:37 black kernel: [ 2279.060536] SysRq : Terminate All Tasks
Aug  1 15:23:37 black vnstatd[401]: SIGTERM received, exiting.
Aug  1 15:23:37 black acpid: exiting
Aug  1 15:23:37 black dhcpcd[452]: received SIGTERM, stopping
Aug  1 15:23:37 black dhcpcd[452]: eth0: removing interface

Aug  6 20:48:36 black kernel: [  482.286642] [drm:i915_hangcheck_hung] *ERROR* Hangcheck timer elapsed... GPU hung
Aug  6 20:48:36 black kernel: [  482.286657] [drm] capturing error event; look for more information in /debug/dri/0/i915_error_state
Aug  6 20:49:15 black kernel: [  521.324443] SysRq : Keyboard mode set to system default
Aug  6 20:49:16 black kernel: [  521.416526] SysRq : Terminate All Tasks
Aug  6 20:49:16 black syslog-ng[265]: syslog-ng shutting down; version='3.3.5'
Aug  6 20:49:16 black vnstatd[475]: SIGTERM received, exiting.
Aug  6 20:49:16 black acpid: exiting
Aug  6 20:49:16 black dhcpcd[563]: received SIGTERM, stopping
Aug  6 20:49:16 black dhcpcd[563]: eth0: removing interface

The first time it hung when I was using the git version of the drivers, the second time using the latest release - 2.20.3.
/sys/kernel/debug/dri/0/i915_error_state was empty both times - no error state collected.
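
To see how often the kernel itself reported a hang, the hangcheck lines can be pulled out of the log. A small sketch, assuming the message format shown above; the function name `hang_times` is made up for illustration.

```shell
#!/bin/sh
# Read a kernel log on stdin and print the kernel timestamp (seconds) of every
# i915 hangcheck report, one per line.
hang_times() {
    sed -n 's/.*\[ *\([0-9][0-9]*\.[0-9]*\)] \[drm:i915_hangcheck_hung].*/\1/p'
}
```

For example, `hang_times < /var/log/everything.log | wc -l` counts the reported hangs in the log file mentioned above.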

Comment 23 Chris Wilson 2012-08-07 16:57:34 UTC

That's getting bizarre. :|

Comment 24 Karol Błażewicz 2012-08-07 19:16:53 UTC

Maybe I'm doing something wrong.
When I said the whole computer froze, I meant that it stopped responding to keyboard and mouse input except for sysrq. The second time it happened, I waited for ten minutes and the situation didn't change.

I couldn't replicate this behavior today; I just got a hung GPU 8 times.
I mentioned it because I read in http://lists.x.org/archives/xorg-announce/2012-August/002051.html that you said "the GPU is (...) unlikely to hang the system".

Comment 25 Chris Wilson 2012-08-07 19:36:53 UTC

Right, and not responding to normal input just indicates another bug somewhere. Off the top of my head, that would be the page-fault-of-doom, where we fail to make forward progress because we cannot complete a pagefault, as we do not handle an EIO correctly in the kernel. The unfixable hangs are where the machine no longer even responds to pings or sysrq. (Although they should be preventable, if not outright fixable per se.)

Comment 26 Chris Wilson 2012-08-08 08:04:01 UTC

Conversely, I find the opposite to be true: SNA is more resilient to hangs than UXA on my 845g. The character of the hang is the same in either case: the command streamer reads a completely different set of bytes than was written by the CPU, and the error is much more likely under memory pressure.

Comment 27 tang0th 2012-08-14 12:33:16 UTC

Hi,

This is also my first post here, so I'm not too experienced in this stuff :)

I'd just like to say that this bug affects me too. I also have an i915_error_state copy for a crash with sna (uxa also crashes) on an 'Intel Corporation 82845G/GL[Brookdale-G]/GE Chipset'.

My xorg log and dmesg say around the same kind of thing as the reporter's.

Hopefully it helps.

As a side note using sna (uxa is fine in this respect) on my machine makes scrolling on chromium really weird (some text stays in the same place) and some drop down boxes generally in KDE don't work, but that's most likely a different issue.

Comment 28 tang0th 2012-08-14 12:34:40 UTC

Created attachment 65543
i915_error_state for a sna crash

Comment 29 Balló György 2012-08-15 10:51:05 UTC

Same here. Until version 2.19, there was a stable 2D acceleration for the 82845G/GL chipset with the "Shadow" option, but now I get random lockups with both uxa and sna acceleration.

Now I get a frozen screen, and I'm not even able to switch to a terminal with Ctrl-Alt-F1, but I'm able to log in through ssh.

Package versions:
- linux 3.4.8-1
- libdrm 2.4.37-1
- xorg-server 1.12.3.901-1
- xf86-video-intel 2.20.3-1

Xorg config:
  Option "AccelMethod"  "sna"
  Option "DRI"     "False"

My log files are attached.

Comment 30 Balló György 2012-08-15 10:53:46 UTC

Created attachment 65599
Another dmesg output on SNA hung

Comment 31 Balló György 2012-08-15 10:55:47 UTC

Created attachment 65600
Another i915_error_state on SNA hung

Comment 32 Balló György 2012-08-15 10:57:17 UTC

Created attachment 65601
Another Xorg log on SNA hung

Comment 33 Chris Wilson 2012-08-15 11:01:54 UTC

(In reply to comment #29)
> Same here. Until version 2.19, there was a stable 2D acceleration for the
> 82845G/GL chipset with the "Shadow" option, but now I get random lockups with
> both uxa and sna acceleration.
> 
> Now I get a frozen screen, and I'm not even able to switch to a terminal with
> Ctrl-Alt-F1, but I'm able to log in through ssh.

That's a kernel bug, page-fault-of-doom.
