A few weeks ago, X started displaying weird errors during normal use. These range from simple output resets to application crashes due to X BadAlloc responses. In detail, this is what I see happening:
- output blink. Usually the screen goes dark for about a second. When the screen returns, I see either:
  - display position getting cancelled (xrandr --pos 0x0) on my left screen. When this happens I can usually get it back to the correct position via the xrandr command line tool, but sometimes that triggers a WM crash
  - display rotation getting cancelled (xrandr --rotate normal) on the right screen. When this happens:
    - quodlibet (python-gtk app) crashing with BadAlloc (serial 255 error_code 11 request_code 53 minor_code 0). It will keep crashing like this until I restart X
    - whatever other app I had running on the right screen has disappeared as well
    - window manager (e17) crashing. It can recover successfully, though
    - when I re-enable screen rotation, the display gets completely garbled
I have started seeing this happen with 2.6.35-rc3 (was running 2.6.34-rc6 before that), and with 22.214.171.124 this happens a lot less frequently (from once a day to twice in a week).
From the latest crash, I can give the following info from xsession-errors:
E17 INIT: XINERAMA CHOSEN: , 1080x1920+1280+0
E17 INIT: XINERAMA CHOSEN: , 1280x1024+0+896
(output blinking starts)
E17 INIT: XINERAMA SCREEN: , 1280x1024+0+0
E17 INIT: XINERAMA CHOSEN: , 0x153045764+0+153046580
(e17 crash, try to recover)
E17 INIT: XINERAMA CHOSEN: , 1920x1080+1280+0
E17 INIT: XINERAMA CHOSEN: , 1280x1024+0+0
(rotation lost on , position lost on )
E17 INIT: XINERAMA CHOSEN: , 1080x1920+1280+0
E17 INIT: XINERAMA CHOSEN: , 0x141138580+0+141139396
E17 INIT: XINERAMA CHOSEN: , 1280x1024+0+896
(restoring rotation confuses E17)
###!!! ABORT: X_CreatePixmap: BadAlloc (insufficient resources for operation);
6 requests ago: file nsX11ErrorHandler.cpp, line 182
UNKNOWN [/usr/lib/xulrunner-1.9.2/libxul.so +0x001CA781]
Now, the reason for the E17 crash looks like a use-after-free bug. Question is: who is freeing the Xinerama structures, and why are they freed at all? I never had these problems before, and when using stock 2.6.32 (from Debian) I don't have these problems either, so I'm ruling out hardware failure.
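As a rough aid for spotting the corrupted entries in a longer xsession-errors file, the sketch below scans for XINERAMA geometries that cannot be real. It is only an illustration: the function name and the MAX_PLAUSIBLE threshold are assumptions made up for this example, not anything E17 or X defines.

```python
import re

# Geometry as it appears in the E17 lines above: WIDTHxHEIGHT+X+Y
GEOMETRY = re.compile(r"(\d+)x(\d+)\+(\d+)\+(\d+)")

# Assumed sanity limit, chosen only to be far larger than any real screen here.
MAX_PLAUSIBLE = 16384


def suspicious_geometries(lines):
    """Return (line, (w, h, x, y)) for entries whose geometry looks like garbage."""
    bad = []
    for line in lines:
        match = GEOMETRY.search(line)
        if not match:
            continue
        w, h, x, y = map(int, match.groups())
        # A zero-sized screen or a coordinate in the millions is not a real layout.
        if w == 0 or h == 0 or max(w, h, x, y) > MAX_PLAUSIBLE:
            bad.append((line, (w, h, x, y)))
    return bad


# Sample lines copied from the xsession-errors excerpt above.
log = [
    "E17 INIT: XINERAMA CHOSEN: , 1080x1920+1280+0",
    "E17 INIT: XINERAMA CHOSEN: , 1280x1024+0+896",
    "E17 INIT: XINERAMA CHOSEN: , 0x153045764+0+153046580",
    "E17 INIT: XINERAMA CHOSEN: , 0x141138580+0+141139396",
]

for line, geom in suspicious_geometries(log):
    print(geom, "<-", line)
```

Only the two entries with a zero width and pointer-sized numbers are reported; the normal 1080x1920 and 1280x1024 layouts pass.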
I can see the following lines logged by the kernel (KMS enabled on radeon 9550):
[12045.280155] [drm:radeon_dvi_detect] *ERROR* DVI-I-1: probed a monitor but no|invalid EDID
[13500.386132] [drm:radeon_vga_detect] *ERROR* VGA-1: probed a monitor but no|invalid EDID
[13797.568760] [drm:radeon_dvi_detect] *ERROR* DVI-I-1: probed a monitor but no|invalid EDID
And I believe these are harmless (and reported in #27708). I can see these messages being reported for as long as my kernel logs go back. But when the crashes occur, they are followed by these lines:
[35753.097508] i2c i2c-1: sendbytes: NAK bailout.
[35762.752866] i2c i2c-1: sendbytes: NAK bailout.
[35772.912023] i2c i2c-1: readbytes: ack/nak timeout
Which I believe are not so harmless. Or they could be a red herring, I'm not qualified to tell.
Since the same time, I've been seeing a lot of lines logged by g-s-d, like
gdk_pixbuf_format_get_name: assertion `format != NULL' failed
But I do not believe these are relevant.
After a short testing experience with 2.6.36-rc2+, I can state the following:
- problem is worse in .36 - I got three Xrandr resets in the few hours of testing, including one while still in gdm
- it's not X-related. I get output blinks even when working on a text console, and even with the DVI monitor switched off (soft standby)
- every time the text console blinks, there is another [drm:radeon_dvi_detect] message logged to the current console
I have reverted to 126.96.36.199, which works perfectly (and incidentally, does not log any edid messages)
Sounds like it's connector polling related. Does the patch to disable polling in bug 29389 help? Also, please attach your dmesg.
Created attachment 38265
re-spun patch of #29389
Interesting. Since 29389#c1 also mentions slow-work threads, I must say that I have noticed kslowd001 showing up high in top (with 2.6.35). So far I had attributed that to NFS. Anyway, the patch wouldn't apply cleanly, so I'm attaching a new version (against v2.6.36-rc2-237-gd4348c6). I'm now running the kernel with the patch, but without any additional command-line options.
I'm still seeing EDID messages getting logged, but so far no blink or output reset has occurred. I'll keep running this until the display fails again, then I'll boot with the option. One question though, which is the correct commandline option:
radeon.drm_kms_helper.poll=0 or drm_kms_helper.poll=0?
Created attachment 38266
The problem is still present with the patch but without the command-line option (since I can't reproduce it at will, it can take some time to verify). I'm now running -rc3 with the patch and drm.poll=0
I'm still seeing "invalid EDID" messages, though. Is that expected, or did I do something wrong? Because I would have expected these messages to disappear...
(In reply to comment #6)
> I'm still seeing "invalid EDID" messages, though. Is that expected, or did I do
> something wrong? Because I would have expected these messages to disappear...
I think newer kernels got more verbose about buggy edids.
Although the error seems to have gone down, I still experience unexpected behaviour with the patch and drm.poll=0. About an hour ago the xrandr position for the left monitor got cancelled again.
To give an indication of the error rate (still don't know if the edid errors are related):
23:51:06 up 14:58, 7 users, load average: 0.58, 0.91, 0.97
aschuring@neminis:~$ dmesg|grep EDID|wc -l
aschuring@neminis:~$ dmesg|grep -i nak|wc -l
While I still can't reproduce this at will, both the EDID errors and xrandr resets seem related to system load. When the system is idle, or I'm simply browsing or text-editing, the problems don't happen. However, when I'm stressing the system (which is not that hard on this old machine), for example creating a backup or compiling a kernel, I get multiple EDID errors per minute and occasional resets.
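To compare the error rate against periods of load, the plain `grep | wc -l` totals can be broken down by uptime minute. The following is a minimal sketch that assumes dmesg lines carry the usual `[seconds.micros]` prefix; the helper name is made up for this example, and the sample lines are the ones quoted earlier in this report.

```python
import re
from collections import Counter

# Leading kernel timestamp, e.g. "[12045.280155]"
TIMESTAMP = re.compile(r"^\[\s*(\d+)\.\d+\]")


def count_per_minute(lines, pattern):
    """Count lines matching `pattern` (case-insensitive), keyed by uptime minute."""
    needle = re.compile(pattern, re.IGNORECASE)
    buckets = Counter()
    for line in lines:
        stamp = TIMESTAMP.match(line)
        if stamp and needle.search(line):
            buckets[int(stamp.group(1)) // 60] += 1
    return buckets


# Sample input copied from the kernel log excerpts above.
sample = [
    "[12045.280155] [drm:radeon_dvi_detect] *ERROR* DVI-I-1: probed a monitor but no|invalid EDID",
    "[13500.386132] [drm:radeon_vga_detect] *ERROR* VGA-1: probed a monitor but no|invalid EDID",
    "[13797.568760] [drm:radeon_dvi_detect] *ERROR* DVI-I-1: probed a monitor but no|invalid EDID",
    "[35753.097508] i2c i2c-1: sendbytes: NAK bailout.",
    "[35762.752866] i2c i2c-1: sendbytes: NAK bailout.",
    "[35772.912023] i2c i2c-1: readbytes: ack/nak timeout",
]

edid = count_per_minute(sample, "EDID")
nak = count_per_minute(sample, "nak")
for minute in sorted(set(edid) | set(nak)):
    print(f"minute {minute}: EDID={edid[minute]} NAK={nak[minute]}")
```

On a live system the input would come from `dmesg` output rather than the hard-coded list; minutes with several EDID or NAK hits can then be lined up with when a backup or kernel build was running.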
Since I've been running with the patch, I have only seen position resets on the left monitor. The right monitor has never lost its rotation.
I can't reproduce this by pegging the CPU; I need to generate a lot of disk activity. So I think it is related to latency rather than throughput. Also, X seems to hang for 50-500ms every time an EDID message appears in syslog, as I can see the mouse cursor freezing. When the freeze lasts more than a second, it is usually followed by an XRandR reset.
I see essentially the same thing as Arno. I have an AMD 780G motherboard (radeon hd3200 / rs780). The only time I noticed the problem was when running mplayer - it pauses for a split second every minute or so. I'm using HDMI for video and audio.
The option to disable polling is (now?) part of drm_kms_helper, so adding "drm_kms_helper.poll=0" to the kernel cmdline disables polling and completely fixes the problem for me. I'm using kernel 2.6.36-rc6. Hopefully this helps anyone else who stumbles on this problem.
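For anyone unsure whether the option actually reached the kernel, a quick check of the boot command line can help. This is a small sketch that only looks for the `drm_kms_helper.poll` spelling given above; the function name is hypothetical, and it reads `/proc/cmdline`, which exists on Linux only.

```python
def poll_setting(cmdline):
    """Return the value given to drm_kms_helper.poll on a kernel command line, or None."""
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        if key == "drm_kms_helper.poll" and sep:
            return value
    return None


if __name__ == "__main__":
    try:
        with open("/proc/cmdline") as handle:
            current = handle.read()
    except OSError:
        current = ""  # not on Linux, or /proc is unavailable
    value = poll_setting(current)
    if value is None:
        print("drm_kms_helper.poll not set on the command line")
    else:
        print("drm_kms_helper.poll =", value)
```

A result of `0` means the option was passed as described; `None` means the token is absent or spelled differently.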
I just remembered (too late, doh!), that a while ago I set X and hd-audio to RT priority when running mplayer as an experiment to test performance. I'm pretty sure I saw the problem before that, but if I ever get around to rebooting I'll test with polling enabled and the priorities unchanged. Presumably Arno does not change his priorities and still sees the problem.
As an additional data point, I also experience these problems under GNOME and with just one monitor (by issuing xrandr --output DVI-0 --off). Under these circumstances, no crashes happen but the screen output gets garbled, and subsequently (when the display recovers) the DVI-0 monitor gets enabled again (as in xrandr --output DVI-0 --same-as VGA-0).
Am now running 2.6.36, still with drm.poll=0. Everything is at default priority -- which is 19 according to ps, and 20 according to top -- but a number of kernel threads (including [radeon]) are at 39(ps)/0(top). rtprio is unset for all processes.
It seems that both the EDID errors and the XRandR failures do not occur when I disable preemption (that, and I'm running 37-rc2 now). I'm currently rebuilding -rc2 with forced preemption again, will alternate between those kernels for the next few days to see what happens.
To give an idea of my current system activity: I'm compiling a kernel while reloading all my tabs in Firefox, having two redraw-happy windows open at the same time, and having three md5sum processes working on large files. System load is peaking at 6.8 according to top (on a UP system), and yet I don't see a single EDID or i2c warning in my dmesg.
This seems in line with my earlier conjecture that the problems are latency-triggered. I will report back once I have more than the current 60 minutes of experience with the new kernel, with both forced preemption and no preemption.
(In reply to comment #12)
> It seems that both the EDID errors and the XRandR failures do not occur when I
> disable preemption (that, and I'm running 37-rc2 now). I'm currently rebuilding
> -rc2 with forced preemption again, will alternate between those kernels for the
> next few days to see what happens.
Getting the EDID is a timing-sensitive bit-banging operation. If the driver gets preempted in the middle of an EDID fetch, the fetch may fail or get incomplete data.
Grumble. So, how will this be fixed? Sprinkle the code with preempt_disable()/enable() pairs? Advise all users (well, me) to disable preemption? Change the code so it no longer acts on intermittent EDID failures? Ask your colleagues to design sane hardware that doesn't rely on software interaction for timing-critical operations ;) ?
FWIW, I'm perfectly happy to disable preemption in my kernel, haven't run a distro-provided kernel since 2.6.27. But your explanation sounds like this could bite a lot more users than just me.
(In reply to comment #14)
> Grumble. So, how will this be fixed? Sprinkle the code with
> preempt_disable()/enable() pairs? Advise all users (well, me) to disable
> preemption? Change the code so it no longer acts on intermittent EDID failures?
> Ask your colleagues to design sane hardware that doesn't rely on software
> interaction for timing-critical operations ;) ?
> FWIW, I'm perfectly happy to disable preemption in my kernel, haven't run a
> distro-provided kernel since 2.6.27. But your explanation sounds like this
> could bite a lot more users than just me.
It has nothing to do with the hw, it's the nature of i2c. All drivers that use bit banging i2c (hwmon, v4l, lots of embedded stuff) would potentially be affected.
I think the priority can be lowered, since it seems to affect only me (most people don't run a 7-year old machine I guess). Also, I'm not seeing any crashes any more with these kernel settings:
I also removed Pulseaudio, which seems to have dropped my idle system load (while playing music) from 0.60 to 0.01 according to top, and am now running 2.6.38-rc4.
Created attachment 46971
I'm experiencing a similar error (with more info on dmesg).
I'm on x86_64 and running 2.6.39 kernel.
I could provide more info if you tell me where to look.
(In reply to comment #15)
> (In reply to comment #14)
> > Grumble. So, how will this be fixed? Sprinkle the code with
> > preempt_disable()/enable() pairs? Advise all users (well, me) to disable
> > preemption? Change the code so it no longer acts on intermittent EDID failures?
> > Ask your colleagues to design sane hardware that doesn't rely on software
> > interaction for timing-critical operations ;) ?
> > FWIW, I'm perfectly happy to disable preemption in my kernel, haven't run a
> > distro-provided kernel since 2.6.27. But your explanation sounds like this
> > could bite a lot more users than just me.
> It has nothing to do with the hw, it's the nature of i2c. All drivers that use
> bit banging i2c (hwmon, v4l, lots of embedded stuff) would potentially be
> affected.
If the EEPROM supports clock stretching I see no reason why bit banging would fail.
I got these timeout errors on one system at some point. I had an idea for a possible fix, but I never got around to testing it. I'll attach the patch.
Created attachment 46972
possible i2c-algo-bit preempt fix
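The contents of the attached patch are not reproduced in this thread, so the following is not that patch. It is a toy Python model of one way preemption could turn into a spurious timeout even when the slave stretches the clock correctly: the poller samples the clock line, loses the CPU, and finds its deadline expired when it resumes. All names (FakeBus, wait_for_scl, the recheck flag) are hypothetical, and time is simulated in abstract ticks.

```python
class FakeBus:
    """Toy model: a clock line that the slave releases at a fixed simulated time."""

    def __init__(self, release_at):
        self.now = 0  # simulated time in ticks
        self.release_at = release_at

    def scl_is_high(self):
        return self.now >= self.release_at


def wait_for_scl(bus, timeout, preempt_at=None, preempt_for=0, recheck=False):
    """Poll until the clock line is high or `timeout` ticks have passed.

    preempt_at / preempt_for model the poller being scheduled out between
    sampling the line and testing the deadline.
    """
    start = bus.now
    while not bus.scl_is_high():
        # The line was just sampled low; the poller may now lose the CPU.
        if preempt_at is not None and bus.now - start == preempt_at:
            bus.now += preempt_for
        if bus.now - start > timeout:
            # With recheck, sample the line once more before giving up.
            if recheck and bus.scl_is_high():
                break
            return "timeout"
        bus.now += 1
    return "ok"


# The slave releases the clock after 5 ticks; the poller allows 10 ticks.
print("no preemption:     ", wait_for_scl(FakeBus(5), timeout=10))
print("preempted:         ", wait_for_scl(FakeBus(5), timeout=10,
                                          preempt_at=3, preempt_for=50))
print("preempted, recheck:", wait_for_scl(FakeBus(5), timeout=10,
                                          preempt_at=3, preempt_for=50,
                                          recheck=True))
```

In this model the slave is never late; the preempted poller without the final recheck simply misses the moment the line went high. Whether the real failures in this report follow that pattern is not established here.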
-- GitLab Migration Automatic Message --
This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.
You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/drm/amd/issues/153.