Bugzilla – Bug 17638
[945GM] intermittent crashes (Ring of Death) if ExaNoComposite unset
Last modified: 2009-07-31 12:16:23 UTC
Created attachment 18972 [details]
Logfile with ring of Death!
I have the following problem to report. I own a Medion Akoya E1210 (aka MSI Wind U100) netbook with a GMA950 graphics chipset.
For this and some other reasons, which don't matter for the problem description, I had to compile a recent version of XOrg (the release that is supposed to become 7.4). The idea was to go with a much more recent version of XOrg to benefit from recent drivers, better support, and a better way to report issues like this one.
XOrg works so far, no real problem. Compiz and all the fine 3D effects work as expected. But the real issue that I get is this ...
... The driver crashes every now and then at random times and gives Ring of Death messages in the X.log files (please refer to my attachment) ...
I should mention that compiling XOrg was a fairly trivial task, since I have been doing things like this since 1996, so some experience exists. XOrg has no issues with older drivers and no issues with kernel-related DRI, and I used no "l33t" configure flags (just configure --prefix=/usr/X11R7 ... yeah, I am historical). The GCC version used was 4.1.2. There are also no ABI (C++) or API incompatibilities among the tools I use that might cause this, since it happens all the time regardless of what I use, be it Firefox, Evolution or VMware. I even get these crashes while using TWM with nothing more than Firefox and xterm (I tested this too, to avoid filing a false bug report).
I'd like to get some feedback here and would really like to help solve this issue, or at least get some feedback on how to work around it for the time being. I also tested this with DRI and AIGLX disabled. I should mention that I use a git version of libdrm, which according to the documentation was the right thing to do: 2.3.1 was the last release and some sort of 2.4 was required.
Further information can be found in the X.log file attached below.
Created attachment 18973 [details]
In case it helps here is my X.conf
A more recent driver is required to dump the ring contents when this is triggered.
Created attachment 21009 [details] [review]
add more debug output for 945 chips
This patch dumps the ESR and page table error registers a bit more. You'll need to reproduce the problem from a fresh boot and then post your logs again, otherwise the ESR might have stale bits. Based on your last log, it looks like plane A is trying to fetch bits for some reason... Does the problem go away if you use the pipeAforce quirk? (See the man page.)
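As a rough sketch of what trying this could look like: the fragment below assumes that the quirk mentioned above corresponds to the ForceEnablePipeA option described in the intel(4) man page. The Identifier is only a placeholder; use the one from your own xorg.conf.

```
Section "Device"
    Identifier "Configured Video Device"
    Driver     "intel"
    # Assumption: this option has the same effect as the pipe A force quirk.
    Option     "ForceEnablePipeA" "true"
EndSection
```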
Created attachment 21018 [details] [review]
dump ICS register too
Here's an updated one that dumps the ICS register too... please add Option "modedebug" "true" to your xorg.conf and/or collect a dump from intel_reg_dumper with this patch applied.
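For reference, a minimal sketch of where that option would go. The Identifier is a placeholder and should match the Device section already present in your xorg.conf; option names in xorg.conf are not case-sensitive.

```
Section "Device"
    Identifier "Configured Video Device"
    Driver     "intel"
    # Prints additional modesetting debug information to the server log.
    Option     "ModeDebug" "true"
EndSection
```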
Ok I did the following today.
a) I compiled a new libdrm (2.4.2 from SVN)
b) I compiled the intel drivers 220.127.116.11
c) did steps a) and b) on top of my existing setup
So far I am enjoying not receiving any crashes. This is of course no 100% guarantee that it won't happen anymore, so please let me test for a while and I will get back to you. I also enabled the further debug flags in my xorg.conf. The patch for the ICS register is not applied yet; I will apply it as soon as I get these crashes again.
*** Bug 13733 has been marked as a duplicate of this bug. ***
*** Bug 18640 has been marked as a duplicate of this bug. ***
*** Bug 18123 has been marked as a duplicate of this bug. ***
I wanted to report back. The update to 18.104.22.168 solved all my issues. I get no crashes anymore. The driver is stable and reliable. There is only some small font fragmentation sometimes when switching to a VT and back to X, but I think it's only a matter of time until that gets fixed, and it's not the primary issue.
Thanks for the progress and the new driver.
Great, thanks for testing.
Well for me libdrm-2.4.3 and xf86-video-intel-22.214.171.124 didn't work at all ... so I would resolve this as Fixed still.
Created attachment 21698 [details]
Output of intel_reg_dumper from Toshiba R500
I'm experiencing an issue which most probably is related to this bug, described at https://bugzilla.novell.com/show_bug.cgi?id=463127 .
Attached is the output of intel_reg_dumper in the configuration as described in comment #4 .
Created attachment 21699 [details]
Output of intel_gtt
Attached is the output of intel_gtt obtained before loading the intel-agp module, right after loading it and right after starting X.
Rafael can reproduce this easily so I'm reopening. Rafael, can you reproduce it after setting the ExaNoComposite option in your xorg.conf?
I've been running Xorg with ExaNoComposite = true for about 1.5 days without a single hang. I think a hang would have happened without it during that time.
Note, however, that ExaNoComposite = true has some undesirable effects on my desktop (the icons in KDE4's system tray are not visible, and sometimes artifacts appear in the background after closing a window).
Reassign to our render accel guru. :)
Update: I've been running with DRI disabled in the Xorg config and the problem is not reproducible, although the XRender compositing type is chosen in the KDE4 configuration.
In summary, it seems that the problem appears with DRI enabled and ExaNoComposite unset only.
*** Bug 16780 has been marked as a duplicate of this bug. ***
*** Bug 19634 has been marked as a duplicate of this bug. ***
*** Bug 19674 has been marked as a duplicate of this bug. ***
*** Bug 19319 has been marked as a duplicate of this bug. ***
*** Bug 20634 has been marked as a duplicate of this bug. ***
I posted Bug 20634, and since it was marked "duplicate," I assume that this is the same problem that I'm having, and I had some questions.
Does the following xorg.conf properly implement the workaround?
Section "Device"
    Identifier "Configured Video Device"
    # Are the following two lines "the workaround?"
    Option "NoDRI"
    Option "ExaNoComposite" "true"
EndSection
Section "Monitor"
    Identifier "Configured Monitor"
EndSection
Section "Screen"
    Identifier "Default Screen"
    Monitor "Configured Monitor"
    Device "Configured Video Device"
EndSection
Well assuming I implemented comment 17 properly, the workaround doesn't work for me. It crashed again this morning, using the config from comment 23.
I got away without rebooting by issuing the following from tty1 (sometimes, the other ttys are not available, but this time, they were):
pm-suspend #then woke it up
sudo /etc/init.d/gdm restart
Obviously, this killed my Gnome session, so it was still disruptive. I issued pm-suspend twice, so I lost my xorg.0.log.old.
In the past, I think I've gotten away with just a pm-suspend/wakeup (without the gdm restart) but I'm not 100% sure now.
(In reply to comment #23)
The suggested change appears to have solved my problems. I have been running for several days since I made the changes and have not had a crash.
> I posted Bug 20634, and since it was marked "duplicate," I assume that this is
> the same problem that I'm having, and I had some questions.
> Does the following xorg.conf properly implement the workaround?
> Section "Device"
> Identifier "Configured Video Device"
> # Are the following two lines "the workaround?"
> Option "NoDRI"
> Option "ExaNoComposite" "true"
> EndSection
> Section "Monitor"
> Identifier "Configured Monitor"
> EndSection
> Section "Screen"
> Identifier "Default Screen"
> Monitor "Configured Monitor"
> Device "Configured Video Device"
> EndSection
I was greeted with a crash first thing this morning, just after login to Ubuntu (Gnome).
This is using the config in comment 23. Other relevant info is in (duplicate) bug 20634.
If anybody's got any other workarounds in the meantime, please post them. I usually lose at least my gnome session, and this happens several times a day, so it's really affecting my work.
Sorry, corrected log link for comment 26: http://jamiejackson.pastebin.com/f529e9f08
I know a lot of people have been hit by this bug, and that it's really frustrating. I'm glad that some people have at least been able to get back to work with intermittent crashes with the ExaNoComposite workaround, but obviously that's not ideal and we'd like to come up with a real fix.
We engineers working on the intel driver haven't been able to easily replicate this bug, so I'm hoping I can get some help debugging this from someone who can easily replicate the bug, and is willing to do a little work to provide some debugging information.
I've just posted three patches to the intel-gfx mailing list which can be found here:
Those are to be applied to the "drm-intel-next" kernel source tree which is available from a git clone of Eric Anholt's tree here:
So here's a recipe for getting the kernel code and applying the patches:
1. git clone git://git.kernel.org/pub/scm/linux/kernel/git/anholt/drm-intel
2. git checkout -m drm-intel-next origin/drm-intel-next
3. Download my three email messages with patches
4. Run "git am <filename-of-saved-email-message>" for each message in turn
5. Compile the kernel and modules, install, and reboot into new kernel
6. Mount debugfs with something like the following command, "mkdir /debug; mount -t debugfs debug /debug"
7. Understand that I'm very grateful that you're willing to go through all these steps. Please accept my kind thanks!
8. Replicate the X server crash
9. At this point, you should have several files in /debug/dri/0. The most significant are likely i915_batchbuffers, i915_ringbuffer_info, and i915_ringbuffer_data. But you might as well just capture all of the files while you're at it. But post at least those to this bug report.
At that point, we'll work to interpret the results, identify the bug, and then fix it. And there will be much rejoicing!
Thanks in advance for anyone willing to help here. And please let me know if you have any questions or difficulties with any of the above.
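The capture in step 9 could be scripted roughly as follows. This is only a sketch: it assumes debugfs is mounted at /debug as in step 6, and the function name and output directory are made up for the example.

```shell
#!/bin/sh
# Sketch: save every file from the i915 debugfs directory after a crash.
# capture_dri_debug and ./dri-debug are hypothetical names for this example.
capture_dri_debug() {
    src_dir=${1:-/debug/dri/0}
    out_dir=${2:-./dri-debug}
    mkdir -p "$out_dir" || return 1
    for f in "$src_dir"/*; do
        # Skip subdirectories and anything else that is not a regular file.
        [ -f "$f" ] || continue
        # cat reads to end-of-file regardless of the size the file reports.
        cat "$f" > "$out_dir/$(basename "$f")"
    done
}
```

After a crash, running `capture_dri_debug` as root (for example over ssh) should leave copies of i915_batchbuffers, i915_ringbuffer_info, i915_ringbuffer_data and the other files in ./dri-debug, ready to attach to this bug report.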
I am currently not able to test the steps recommended, but maybe this information helps someone:
- Problems occur on both DracoGNU/Linux and FreeBSD
- Switching to XAA solves the problems on both platforms, resulting in working DRI and Composite, but broken xvideo...
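For anyone who wants to try the same thing, switching the acceleration method might look roughly like this. It assumes a driver version that still offers XAA through the AccelMethod option; the Identifier is a placeholder.

```
Section "Device"
    Identifier "Configured Video Device"
    Driver     "intel"
    # Select XAA instead of EXA as the 2D acceleration architecture.
    Option     "AccelMethod" "XAA"
EndSection
```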
(In reply to comment #28)
> 1. git clone git://git.kernel.org/pub/scm/linux/kernel/git/anholt/drm-intel
> 2. git checkout -m drm-intel-next origin/drm-intel-next
Thanks for the instructions on debugging the crashes. I'm a git noob (as well as a kernel compiling noob), and I'm hung up on the second step. Please advise?
jamie@mercury:~/xorgcrashdebug$ git clone git://git.kernel.org/pub/scm/linux/kernel/git/anholt/drm-intel
jamie@mercury:~/xorgcrashdebug$ git checkout -m drm-intel-next origin/drm-intel-next
fatal: Not a git repository
I'm also a victim of Intel's Ring of Death driver. Unfortunately it's on a production machine on which I don't want to experiment much.
At the moment it's always the same procedure.
2. After some hours (at most a few days) all virtual consoles crash, but not the XOrg server itself
3. Then sometimes I get crashes when logging into a KDE4 session
4. After around a week the whole X server crashes (RoD)
Why do I have to reboot the machine? I tried the following via ssh:
a) go to init 3
b) restart xdm/kdm
c) unload/reload the intel kernel modules
I never managed to bring the X server back up. Is there a chance to do so?
Intel driver: 2.5.0
Unfortunately openSUSE withdrew the complete Xorg repository for openSUSE 11.0 some days ago.
(In reply to comment #30)
> (In reply to comment #28)
> > 1. git clone git://git.kernel.org/pub/scm/linux/kernel/git/anholt/drm-intel
> > 2. git checkout -m drm-intel-next origin/drm-intel-next
> Thanks, for the instructions on debugging the crashes. I'm a git noob (as well
> as a kernel compiling noob), I'm hung up on the second step. Please advise?
Thanks for trying this stuff out. What you're missing is a step between those two:
1.5 cd drm-intel
And if you're new to compiling the kernel, then I suggest you find a good guide to follow. But here are some tips off the top of my head:
* You'll need to configure the kernel, (for which there are various approaches such as "make menuconfig" and others), but all methods result in creating a .config file.
* Since configuring a kernel from scratch can involve a lot of work, (and many frustrating failure-to-boot cycles), I recommend starting with a known-good configuration. You should be able to find the configuration for your currently running kernel in /boot.
* So a couple of useful commands to get started with a hopefully-good configuration file are the following:
cp /boot/config-$(uname -r) .config
make oldconfig
That will copy your current configuration and then ask you many questions for any new options that have been created since your current kernel version, (and the defaults should be fine for most of these).
* After that, a "make" should be good enough for compiling the kernel and modules
* To install things, you'll need both "make install" and "make modules_install"
* If your configuration specifies that a ramdisk will be used when booting, then you'll have to ensure that one is created. On Debian, there's an update-initramfs command that I use for this, (with a command like "update-initramfs -c -k 2.6.29-rc7" where you would change the version to match the version of the kernel you just installed). You can use that if possible, or a similar script shipped with your OS of choice, or even change the kernel configuration to not require a ramdisk.
* Finally, you'll likely need to let your bootloader know about the new kernel image. I do this (again, with Debian), by simply running update-grub. One could achieve a similar effect by manually editing /boot/grub/menu.lst and adding a new stanza for the new kernel.
Hopefully that's enough to help you get started. If not, you'll probably want to consult some guide online that goes into more detail.
Good luck, and thanks again for helping to explore this bug for us!
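Put together, the tips above might look something like the sketch below for a Debian-style system. The function names and the DRY_RUN switch are made up for this example, and the version string passed in is a placeholder for whatever kernel you actually built. The install steps need root.

```shell
#!/bin/sh
# Sketch of the build steps described above (Debian-style tools assumed).
# Set DRY_RUN=1 to print the commands instead of running them.
run() {
    if [ -n "$DRY_RUN" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Usage: build_kernel <kernel-version>, run from the top of the kernel tree.
build_kernel() {
    kver=$1
    run cp "/boot/config-$(uname -r)" .config || return 1  # start from the running kernel's config
    run make oldconfig || return 1                         # answer prompts for new options
    run make || return 1                                   # kernel and modules
    run make modules_install || return 1                   # needs root
    run make install || return 1                           # needs root
    run update-initramfs -c -k "$kver" || return 1         # only if the config needs a ramdisk
    run update-grub                                        # tell GRUB about the new kernel
}
```

Calling `DRY_RUN=1; build_kernel 2.6.29-rc7` first is a cheap way to check the sequence before running anything for real.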
*** Bug 14464 has been marked as a duplicate of this bug. ***
I'm marking this bug as resolved since we haven't been able to reproduce the bug and we are aware that many similar issues have been fixed in the latest code.
If anyone can test our latest code, (both xf86-video-intel and Linux kernel from git master), and let us know if the issue persists, then that would be very helpful. And in that case, we can reopen the bug. Otherwise, I'll just assume that things are working now for people.