Created attachment 64826 [details]
This is my first bug report here, so please be gentle.
I've been using UXA with
Option "Shadow" "True"
on xf86-video-intel up to and including version 2.19 and it worked OK. From version 2.20 I get a hung GPU with both UXA and SNA. This report is about SNA - should I open another one for UXA?
-- chipset: Intel Corporation 82845G/GL[Brookdale-G]/GE Chipset Integrated Graphics Device (rev 01) (prog-if 00 [VGA controller])
-- system architecture: 32-bit
-- xf86-video-intel: 2.20.2
-- xserver: 1.12.3
-- libdrm: 2.4.37
-- kernel: 3.5.0-1-ARCH
-- Linux distribution: Arch Linux
-- Machine or mobo model: IBM NetVista A30p
Created attachment 64827 [details]
I did follow http://intellinuxgraphics.org/i915_error_state.html but
$ findmnt -nt debugfs
/sys/kernel/debug debugfs debugfs rw,relatime
$ cat /sys/kernel/debug/dri/0/i915_error_state
no error state collected
$ cat /sys/kernel/debug/dri/64/i915_error_state
no error state collected
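For anyone repeating this check, here is a minimal sketch that looks at every DRI minor under debugfs in one go instead of running cat on each by hand. The function name dump_error_states is made up for illustration, and the default path is simply the debugfs mount point shown above; it needs to run as root on a real system.

```shell
#!/bin/sh
# Sketch only: report which DRI minors under debugfs hold a captured error state.
# dump_error_states is a hypothetical helper name. The optional first argument
# overrides the debugfs dri directory (handy for trying it on a copy).
dump_error_states() {
    root="${1:-/sys/kernel/debug/dri}"
    for f in "$root"/*/i915_error_state; do
        # Skip minors without a readable error state file (or an unmatched glob).
        [ -r "$f" ] || continue
        if grep -q '^no error state collected' "$f"; then
            echo "$f: empty"
        else
            echo "$f: captured"
        fi
    done
}
```

Calling dump_error_states with no argument checks the live debugfs tree; any file reported as "captured" is the one worth attaching.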
How did you trigger the hang? Presumably through the use of DRI, which is controlled through a separate flag, just in case you wanted to use it in spite of the contrary evidence.
Can you please attach the i915_error_state so that I can check that it is conforming to the usual failure mode for 845g?
glines! So you attempted to use GL...
So suggesting to use Option "DRI" "false" would have been a little more effective if I remembered to make SNA check for it...
Author: Chris Wilson <email@example.com>
Date: Sat Jul 28 18:21:08 2012 +0100
sna: Honour the Option "DRI"
Signed-off-by: Chris Wilson <firstname.lastname@example.org>
However, it looks like the system functioned as intended insofar as it did not crash, and there should have been no functional difference wrt the ddx before or after the detected hang.
With regards to the missing error state, did you reboot before looking in debugfs?
> glines! So you attempted to use GL...
I noticed the hung gpu a couple hours ago, after (among other things) playing
with gnome-games, so I restarted the computer, mounted debugfs and proceeded to
diligently play each game from the package, grepping X log after each one to
see if the GPU hung. It didn't.
I'm not an expert, but the numbers representing time are pretty different:
[ 3933.926243] glines: segfault at 17 ip b6e644fa sp bfcf7ad0 error 4 in
[ 7948.252] (EE) intel(0): Detected a hung GPU, disabling acceleration.
After playing through the whole collection I proceeded to browse the web and
view hundreds of pictures from my hard drive.
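A rough way to put a number on the gap between the two timestamps quoted above. This is only a sketch: log_stamp and stamp_gap are hypothetical helper names, and it assumes the dmesg and Xorg log timestamps are measured against the same clock, which may not hold exactly.

```shell
#!/bin/sh
# Sketch only: pull the "[ seconds.fraction]" stamp out of a log line and
# compute the distance between two stamps. Both helper names are made up.

# Print the first bracketed timestamp found on stdin.
log_stamp() {
    sed -n 's/^[^[]*\[ *\([0-9][0-9.]*\)\].*/\1/p' | head -n 1
}

# Print (later - earlier) in seconds, to millisecond precision.
stamp_gap() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.3f\n", b - a }'
}
```

Feeding the glines segfault line and the hung GPU line through log_stamp and then stamp_gap gives a difference of about 4014 seconds, i.e. roughly 67 minutes between the two events, under the same-clock assumption.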
> However, it looks like the system functioned as intended insofar as it did not
> crash, and there should have been no functional difference wrt the ddx before
> or after the detected hang.
There is little if any difference in how stuff works here before and after the hung gpu and it's veeery different from the hung gpu I get with UXA (w/o shadow).
With SNA I found out the gpu hung only by grepping the log while with UXA
everything slowed down to a crawl and it was impossible *not* to notice.
> With regards to the missing error state,
> did you reboot before looking in debugfs?
No, I didn't reboot yet. Should I do it now or collect some other evidence
before doing so?
OK, I did reboot, still 'no error state collected'.
Am I doing it wrong?
> sna: Honour the Option "DRI"
Should I give the -git version of the drivers a try?
Do I have to use some options in the xorg.conf?
Oh, I understand now. It's a spurious warning from SNA in the sense that it marked the device as wedged because it was a 845g but didn't suppress the warning if we ever received an EIO from the kernel.
Now that EIO should in theory be impossible because all the paths that lead up to it should be prevented by checking for wedged (i.e. we should only get an EIO after performing an operation with the GPU and they should all be verboten as we believe the GPU is on fire.)
Is it possible for you to recompile with --enable-debug and run your X server under a debugger so that I can see where we neglect the check for a wedged GPU (it should abort if we miss a check)?
(In reply to comment #8)
> Is it possible for you to recompile with --enable-debug and run your X server
> under a debugger so that I can see where we neglect the check for a wedged GPU
> (it should abort if we miss a check)?
Do you mean recompile xorg-server like in: http://www.x.org/wiki/Development/Documentation/ServerDebugging ?
I've never done it and it says "You'll really want to have a second machine around." Unfortunately I don't have any other around. If http://www.x.org/wiki/Development/Documentation/ServerDebugging#Debugging_with_one_machine is viable, I can give it a shot.
While I was typing this, the gpu hung again:
$ grep -E '\((WW|EE)' /var/log/Xorg.0.log
(WW) warning, (EE) error, (NI) not implemented, (??) unknown.
[ 16.655] (WW) intel(0): Detected unsupported/dysfunctional hardware, disabling acceleration.
[ 17.254] (WW) intel(0): Textured video not supported on this hardware
[ 17.265] (WW) intel(0): loading DRI2 whilst the GPU is wedged.
[ 74.649] (EE) intel(0): Detected a hung GPU, disabling acceleration.
[ 74.649] (EE) intel(0): When reporting this, please include i915_error_state from debugfs and the full dmesg.
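The grep above also matches the legend line at the top of the log. A slightly stricter sketch that keeps only timestamped warnings and errors is below; xorg_problems is a hypothetical helper name, and the default path is the log file used above.

```shell
#!/bin/sh
# Sketch only: list timestamped (WW)/(EE) entries from an Xorg log,
# skipping the "(WW) warning, (EE) error, ..." legend line.
# xorg_problems is a made-up helper name.
xorg_problems() {
    grep -E '^\[ *[0-9]+\.[0-9]+\] \((WW|EE)\)' "${1:-/var/log/Xorg.0.log}"
}
```

Run without arguments it reads /var/log/Xorg.0.log; pass another path to check an older or saved log.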
You only need to compile xf86-video-intel with --enable-debug, and the trick for using gdb to automatically grab the full bt when it asserts should work.
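For reference, one possible shape of such a wrapper, as a sketch rather than the exact script from the wiki. run_under_gdb is a made-up name; the gdb options used (-batch, -ex, --args) are standard, but the server path and display in the usage note are assumptions, and it is best started as root from a console or ssh session rather than from inside X.

```shell
#!/bin/sh
# Sketch only: run a program under gdb in batch mode. gdb starts the program,
# and when it stops (for example on the abort raised by a failed assertion),
# prints a full backtrace and then exits.
# run_under_gdb is a hypothetical helper name.
run_under_gdb() {
    gdb -batch -ex run -ex 'bt full' --args "$@"
}
```

Something like run_under_gdb /usr/bin/Xorg :0 > /tmp/xorg-gdb.log 2>&1 would then leave the backtrace in a file that can be attached here.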
Created attachment 64831 [details]
2nd Xorg log - gpu hangs even w/o playing with GL
Created attachment 64832 [details]
2nd dmesg output - no GL this time
You convinced me the first time, which is why I'm interested in finding out where we are circumventing the defences... :)
I can't get that debugging script to work + if I want to run a regular X with the intel drivers compiled with --enable-debug, X crashes when I launch firefox (not sure if related / normal, I'm a noob, I'm just reporting what I see).
When trying to launch X as described in http://www.x.org/wiki/Development/Documentation/ServerDebugging#Version_1 I get nothing but a timeout:
Waiting for X server to begin accepting connections
<a lot more dots>
xinit: giving up
xinit: unable to connect to X server: connection refused
xinit: server error
Any idea what went wrong?
I removed '-nolisten tcp' from /etc/X11/xinit/xserverrc but it didn't change anything.
Does it have something to do with .Xauthority or ...?
Hmm, that script should just work afaics. As regards the firefox issue, it is either a genuine crash of the variety we are trying to diagnose, or more likely it is just that firefox opens multiple connections at startup and the default behaviour of X is to quit when the last client quits. So if you start up firefox on a bare X, X dies. A way around that is to start up an xterm and launch firefox from within the xterm.
1. Sorry for the terrible wording wrt using regular X and the intel 2.20.2 driver with '--enable-debug': X doesn't crash when I start firefox - the whole system does crash. I get a black screen with a mouse cursor I can't move, I can't switch back to the console (out of X), and the only thing I can do is a magic_sysrq restart.
I start X via 'startx', which launches dwm. Once it's launched, I start urxvt and launch firefox from the terminal.
2. I've tried xf86-video-intel-git with "DRI" set to "false":
$ grep -v "#" /etc/X11/xorg.conf.d/20-intel.conf
Option "AccelMethod" "sna"
Option "DRI" "false"
So far I got neither a hung gpu nor a wedged one, but switching windows is not as fluid as it was with UXA.
The only warnings I've seen so far in the Xorg log are:
[ 410.682] (WW) intel(0): Detected unsupported/dysfunctional hardware, disabling acceleration.
[ 411.216] (WW) intel(0): Textured video not supported on this hardware
(In reply to comment #16)
> So far I got neither a hung gpu nor a wedged one, but switching windows is not
> as fluid as it was with UXA.
Which is to be expected as I forgot to restore the code to kill acceleration for 845g in UXA (the code was commented out as shadow itself was broken). The problem is that sooner or later the GPU will hang, usually sooner, as the GMCH is incoherent.
I've rearranged the code so that we risk the GPU hang on 845g by default and allow the user to elect to disable acceleration instead. I think the hangs are safe in that we shouldn't be killing the entire machine - though if we do get any such reports I shall have to disable acceleration on 845g by default. (Until such a day as we find a safe way to use GEM!)
I've reworked the original offending code not to spuriously warn, and enabling acceleration on 845g should render the hunt futile.
I'll do some fault injection and continue to hunt for missing wedged checks on my machines.
I've tested different configurations for xf86-video-intel 2.20.3:
With SNA I always get a warning:
[ 17.845] (WW) intel(0): Textured video not supported on this hardware
and the GPU hangs unless I disable 2D acceleration:
* Option "AccelMethod" "sna" = at least hung GPU. 50% of the time the whole computer freezes and I have to bail out using magic_sysrq.
* Option "AccelMethod" "sna" + Option "DRI" "false" = hung GPU.
* Option "AccelMethod" "sna" + Option "NoAccel" "true" = some warnings but otherwise "OK":
[ 18.912] (WW) intel(0): Textured video not supported on this hardware
[ 18.925] (WW) intel(0): loading DRI2 whilst the GPU is wedged.
* Option "AccelMethod" "sna" + Option "NoAccel" "true" + Option "DRI" "false" = just the "standard SNA warning" but otherwise "OK":
[ 18.380] (WW) intel(0): Textured video not supported on this hardware
* Option "AccelMethod" "uxa" = hung GPU.
* Option "AccelMethod" "uxa" + Option "DRI" "false" = hung GPU
* Option "AccelMethod" "uxa" + Option "NoAccel" "true" = a warning in dmesg but otherwise "OK":
[ 18.284] (WW) intel(0): cannot enable DRI2 whilst forcing software fallbacks
* Option "AccelMethod" "uxa" + Option "NoAccel" "true" + Option "DRI" "false" = no errors, no warnings - and no acceleration ;P
With 2D acceleration (either UXA or SNA) things work really smoothly, but are also really unstable :-(
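In case someone wants to repeat the combinations listed above, here is a small sketch that prints a complete Device section for one test case. intel_device_section is a hypothetical helper name and the Identifier string is an arbitrary label; the option names are the ones used in this report.

```shell
#!/bin/sh
# Sketch only: print an xorg.conf Device section for one test case.
# Usage: intel_device_section ACCEL [OPTION=VALUE ...]
# intel_device_section is a made-up helper name.
intel_device_section() {
    accel="$1"
    shift
    printf 'Section "Device"\n'
    printf '    Identifier "Intel Graphics"\n'   # arbitrary label, pick any name
    printf '    Driver     "intel"\n'
    printf '    Option     "AccelMethod" "%s"\n' "$accel"
    for pair in "$@"; do
        # Split NAME=VALUE into the option name and its value.
        printf '    Option     "%s" "%s"\n' "${pair%%=*}" "${pair#*=}"
    done
    printf 'EndSection\n'
}
```

For example, intel_device_section sna NoAccel=true DRI=false > /etc/X11/xorg.conf.d/20-intel.conf would write the fourth SNA case from the list; restart X after each change.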
Can you check the dmesg for the freeze? That you experience a freeze but the kernel still responds to sysrq suggests that it is not a hard system hang, but an oops.
/var/log/everything.log has only
Aug 1 15:22:12 black kernel: [ 2194.043134] [drm:i915_hangcheck_hung] *ERROR* Hangcheck timer elapsed... GPU hung
Aug 1 15:22:12 black kernel: [ 2194.043147] [drm] capturing error event; look for more information in /debug/dri/0/i915_error_state
Aug 1 15:23:37 black kernel: [ 2278.959435] SysRq : Keyboard mode set to system default
Aug 1 15:23:37 black syslog-ng: syslog-ng shutting down; version='3.3.5'
Aug 1 15:23:37 black kernel: [ 2279.060536] SysRq : Terminate All Tasks
Aug 1 15:23:37 black vnstatd: SIGTERM received, exiting.
Aug 1 15:23:37 black acpid: exiting
Aug 1 15:23:37 black dhcpcd: received SIGTERM, stopping
Aug 1 15:23:37 black dhcpcd: eth0: removing interface
Aug 6 20:48:36 black kernel: [ 482.286642] [drm:i915_hangcheck_hung] *ERROR* Hangcheck timer elapsed... GPU hung
Aug 6 20:48:36 black kernel: [ 482.286657] [drm] capturing error event; look for more information in /debug/dri/0/i915_error_state
Aug 6 20:49:15 black kernel: [ 521.324443] SysRq : Keyboard mode set to system default
Aug 6 20:49:16 black kernel: [ 521.416526] SysRq : Terminate All Tasks
Aug 6 20:49:16 black syslog-ng: syslog-ng shutting down; version='3.3.5'
Aug 6 20:49:16 black vnstatd: SIGTERM received, exiting.
Aug 6 20:49:16 black acpid: exiting
Aug 6 20:49:16 black dhcpcd: received SIGTERM, stopping
Aug 6 20:49:16 black dhcpcd: eth0: removing interface
The first time it hung when I was using the git version of the drivers, the second time using the latest release - 2.20.3.
/sys/kernel/debug/dri/0/i915_error_state was empty both times - no error state collected.
That's getting bizarre. :|
Maybe I'm doing something wrong.
When I said the whole computer froze, I meant that it stopped responding to keyboard and mouse input except for sysrq. The second time it happened, I've waited for ten minutes and the situation didn't change.
I couldn't replicate this behavior today, I got just a hung GPU 8 times.
I mentioned it because I've read http://lists.x.org/archives/xorg-announce/2012-August/002051.html that you said "the GPU is (...) unlikely to hang the system".
Right, and not responding to normal input just indicates another bug somewhere. Off the top of my head, that would be the page-fault-of-doom, where we fail to make forward progress as we fail to perform a pagefault as we do not handle an EIO correctly in the kernel. The unfixable hangs are where the machine no longer even responds to pings or sysrq. (Although they should be preventable, if not outright fixable per se.)
Conversely, I find the opposite to be true: SNA is more resilient to hangs than UXA on my 845g. The character of the hang is the same in either case: the command streamer reads a completely different set of bytes than was written by the CPU, and the error is much more likely under memory pressure.
This is also my first post here, so I'm not too experienced in this stuff :)
I'd just like to say that this bug affects me too. I also have an i915_error_state copy for a crash with sna (uxa also crashes) on an 'Intel Corporation 82845G/GL[Brookdale-G]/GE Chipset'.
My xorg log and dmesg say about the same kind of thing as the reporter's.
Hopefully it helps.
As a side note using sna (uxa is fine in this respect) on my machine makes scrolling on chromium really weird (some text stays in the same place) and some drop down boxes generally in KDE don't work, but that's most likely a different issue.
Created attachment 65543 [details]
i915_error_state for a sna crash
Same here. Until version 2.19, there was a stable 2D acceleration for the 82845G/GL chipset with the "Shadow" option, but now I get random lockups with both uxa and sna acceleration.
Now I get a frozen screen, and I'm not even able to switch to a terminal with Ctrl-Alt-F1, but I'm able to log in through ssh.
- linux 3.4.8-1
- libdrm 2.4.37-1
- xorg-server 126.96.36.1991-1
- xf86-video-intel 2.20.3-1
Option "AccelMethod" "sna"
Option "DRI" "False"
My log files are attached.
Created attachment 65599 [details]
Another dmesg output on SNA hung
Created attachment 65600 [details]
Another i915_error_state on SNA hung
Created attachment 65601 [details]
Another Xorg log on SNA hung
(In reply to comment #29)
> Same here. Until version 2.19, there was a stable 2D acceleration for the
> 82845G/GL chipset with the "Shadow" option, but now I get random lockups with
> both uxa and sna acceleration.
> Now I get a frozen screen, and I'm not even able to switch to a terminal with
> Ctrl-Alt-F1 but I'm able to log in through ssh.
That's a kernel bug, page-fault-of-doom.