29787 – [RADEON:KMS::EDID] i2c bit banging + preempt kernel -> i2c failure (random XRandR failures)

Bug 29787 - [RADEON:KMS::EDID] i2c bit banging + preempt kernel -> i2c failure (random XRandR failures)

Summary: [RADEON:KMS::EDID] i2c bit banging + preempt kernel -> i2c failure (random XR...

Status:	RESOLVED MOVED

Alias:	None

Product:	DRI
Classification:	Unclassified
Component:	DRM/Radeon (show other bugs)
Version:	unspecified
Hardware:	x86 (IA32) Linux (All)

Importance:	low normal
Assignee:	Default DRI bug account
QA Contact:

URL:
Whiteboard:
Keywords:

Depends on:
Blocks:

Reported:	2010-08-24 12:54 UTC by Arno Schuring
Modified:	2019-11-19 08:15 UTC (History)
CC List:	1 user (show)

See Also:
i915 platform:
i915 features:

Attachments
re-spun patch of #29389 (1.97 KB, patch) 2010-08-29 09:48 UTC, Arno Schuring	no flags	Details \| Splinter Review
dmesg (34.34 KB, text/plain) 2010-08-29 09:54 UTC, Arno Schuring	no flags	Details
dmesg output (71.89 KB, text/plain) 2011-05-21 03:35 UTC, Stratos Zolotas	no flags	Details
possible i2-algo-bit preempt fix (902 bytes, patch) 2011-05-21 04:32 UTC, Ville Syrjala	no flags	Details \| Splinter Review
View All

Description Arno Schuring 2010-08-24 12:54:26 UTC

A few weeks ago, X started displaying weird errors during normal use. These range from simple output resets to application crashes due to X BadAlloc responses. In detail, this is what I see happening:

- output blink. Usually the screen goes dark for about a second. When the screen returns, I see either:
- display position getting cancelled (xrandr --pos 0x0) on my left screen
  when this happens I can usually get it back to the correct position via the xrandr command line tool, but sometimes that triggers a WM crash
- display rotation getting cancelled (xrandr --rotate normal) on the right screen. When this happens:
  - quodlibet (python-gtk app) crashing with BadAlloc (serial 255 error_code 11 request_code 53 minor_code 0). It will keep crashing like this until I restart X
  - whatever other app I had running on the right screen has disappeared as well
  - window manager (e17) crashing. It can recover succesfully, though
  - when I re-enable screen rotation, the display gets completely garbled

I have started seeing this happen with 2.6.35-rc3 (was running 2.6.34-rc6 before that), and with 2.6.35.1 this happens a lot less frequently (from once a day to twice in a week).

From the latest crash, I can give the following info from xsession-errors:
(E17 init)
  E17 INIT: XINERAMA CHOSEN: [1], 1080x1920+1280+0
  E17 INIT: XINERAMA CHOSEN: [0], 1280x1024+0+896
(output blinking starts)
  E17 INIT: XINERAMA SCREEN: [0], 1280x1024+0+0
  E17 INIT: XINERAMA CHOSEN: [153045780], 0x153045764+0+153046580
(e17 crash, try to recover)
  E17 INIT: XINERAMA CHOSEN: [1], 1920x1080+1280+0
  E17 INIT: XINERAMA CHOSEN: [0], 1280x1024+0+0
(rotation lost on [1], position lost on [0])
  E17 INIT: XINERAMA CHOSEN: [1], 1080x1920+1280+0
  E17 INIT: XINERAMA CHOSEN: [141138596], 0x141138580+0+141139396
  E17 INIT: XINERAMA CHOSEN: [0], 1280x1024+0+896
(restoring rotation confuses E17)
  ###!!! ABORT: X_CreatePixmap: BadAlloc (insufficient resources for operation);
 6 requests ago: file nsX11ErrorHandler.cpp, line 182
  UNKNOWN [/usr/lib/xulrunner-1.9.2/libxul.so +0x001CA781]
(firefox crashing)

Now, the reason for the E17 crash looks like a use-after-free bug. Question is: who is freeing the Xinerama structures, and why are they freed at all? I never had these problems before, and when using stock 2.6.32 (from Debian) I don't have these problems either, so I'm ruling out hardware failure.


I can see the following lines logged by the kernel (KMS enabled on radeon 9550):
  [12045.280155] [drm:radeon_dvi_detect] *ERROR* DVI-I-1: probed a monitor but no|invalid EDID
  [13500.386132] [drm:radeon_vga_detect] *ERROR* VGA-1: probed a monitor but no|invalid EDID
  [13797.568760] [drm:radeon_dvi_detect] *ERROR* DVI-I-1: probed a monitor but no|invalid EDID
And I believe these are harmless (and reported in #27708). I can see these messages being reported for as long my kernel logs go back. But when the crashes occur, they are followed by these lines:
  [35753.097508] i2c i2c-1: sendbytes: NAK bailout.
  [35762.752866] i2c i2c-1: sendbytes: NAK bailout.
  [35772.912023] i2c i2c-1: readbytes: ack/nak timeout
Which I believe are not so harmless. Or they could be a red herring, I'm not qualified to tell.

Since the same time, I've been seeing a lot of lines logged by g-s-d, like
  gdk_pixbuf_format_get_name: assertion `format != NULL' failed
But I do not believe these are relevant.

Comment 1 Arno Schuring 2010-08-28 05:39:15 UTC

After a short testing experience with 2.6.36-rc2+, I can state the following:

- problem is worse in .36 - I got three Xrandr resets in the few hours of testing, including one while still in gdm
- it's not X-related. I get output blinks even when working on a text console, and even with the DVI monitor switched off (soft standby)
- every time the text console blinks, there is another [drm:radeon_dvi_detect] message logged to the current console

I have reverted to 2.6.34.6, which works perfectly (and incidentally, does not log any edid messages)

Comment 2 Alex Deucher 2010-08-28 17:14:20 UTC

Sounds like it's connector polling related.  Does the patch to disable polling in bug 29389 help?  Also, please attach your dmesg.

Comment 3 Arno Schuring 2010-08-29 09:48:48 UTC

Created attachment 38265 [details] [review]
re-spun patch of #29389

Interesting. Since 29389#c1 also mentions slow-work threads, I must say that I have noticed kslowd001 showing up high in top (with 2.6.35). So far I had attributed that to NFS. Anyway, the patch wouldn't apply cleanly, so I'm attaching a new version (against v2.6.36-rc2-237-gd4348c6). I'm now running the kernel with the patch, but without any additional command-line options.

I'm still seeing EDID messages getting logged, but so far no blink or output reset has occurred. I'll keep running this until the display fails again, then I'll boot with the option. One question though, which is the correct commandline option:
radeon.drm_kms_helper.poll=0 or drm_kms_helper.poll=0?

Comment 4 Arno Schuring 2010-08-29 09:54:01 UTC

Created attachment 38266 [details]
dmesg

Comment 5 Alex Deucher 2010-08-29 14:48:24 UTC

drm.poll=0

Comment 6 Arno Schuring 2010-08-30 07:45:12 UTC

problem is still present with the patch but without the command-line option (since I can't reproduce it at will, it can take some time to verify). I'm now running -rc3 with the patch and drm.poll=0

I'm still seeing "invalid EDID" messages, though. Is that expected, or did I do something wrong? Because I would have expected these messages to disappear...

Comment 7 Alex Deucher 2010-08-30 07:48:29 UTC

(In reply to comment #6)
> I'm still seeing "invalid EDID" messages, though. Is that expected, or did I do
> something wrong? Because I would have expected these messages to disappear...

I think newer kernels got more verbose about buggy edids.

Comment 8 Arno Schuring 2010-09-01 14:59:12 UTC

Although the error seems to have gone down, I still experience unexpected behaviour with the patch and drm.poll=0. About an hour ago the xrandr position for the left monitor got cancelled again.

To give an indication of the error rate (still don't know if the edid errors are related):

aschuring@neminis:~$ uptime
 23:51:06 up 14:58,  7 users,  load average: 0.58, 0.91, 0.97
aschuring@neminis:~$ dmesg|grep EDID|wc -l
126
aschuring@neminis:~$ dmesg|grep -i nak|wc -l
24

Comment 9 Arno Schuring 2010-09-17 02:19:58 UTC

While I still can't reproduce this at will, both the EDID errors and xrandr resets seem related to system load. When the system is idle, or I'm simply browsing or text-editing, the problems don't happen. However, when I'm stressing the system (which is not that hard on this old machine), for example creating a backup or compiling a kernel, I get multiple EDID errors per minute and occasional resets.

Since I've been running with the patch, I have only seen position resets on the left monitor. The right monitor has never lost its rotation.

I can't reproduce this by pegging the cpu, I need to generate a lot of disk activity. So I think it is related to latency rather than throughput. Also, X seems to hang for 50-500ms every time an EDID message appears in syslog, as I can see the mouse cursor freezing. When the freeze lasts more than a second, it is usually followed by an XRandR reset.

Comment 10 j.bormolini 2010-10-04 20:38:36 UTC

I see essentially the same thing as Arno. I have an AMD 780G motherboard (radeon hd3200 / rs780). The only time I noticed the problem was when running mplayer - it pauses for a split second every minute or so. I'm using HDMI for video and audio.

The option to disable polling is (now?) part of drm_kms_helper, so adding "drm_kms_helper.poll=0" to the kernel cmdline disables polling and completely fixes the problem for me. I'm using kernel 2.6.36-rc6. Hopefully this helps anyone else who stumbles on this problem.

I just remembered (too late, doh!), that a while ago I set X and hd-audio[01] to RT priority when running mplayer as an experiment to test performance. I'm pretty sure I saw the problem before that, but if I ever get around to rebooting I'll test with polling enabled and the priorities unchanged. Presumably Arno does not change his priorities and still sees the problem.

Comment 11 Arno Schuring 2010-10-23 04:49:21 UTC

As additional datapoint, I also experience these problems under GNOME and with just one monitor (by issuing xrandr --output DVI-0 --off). Under these circumstances, no crashes happen but the screen output gets garbled, and subsequently (when the display recovers) the DVI-0 monitor gets enabled again (as in xrandr --output DVI-0 --same-as VGA-0).

Am now running 2.6.36, still with drm.poll=0. Everything is at default priority -- which is 19 according to ps, and 20 according to top -- but a number of kernel threads (including [radeon]) are at 39(ps)/0(top). rtprio is unset for all processes.

Comment 12 Arno Schuring 2010-11-19 03:31:13 UTC

It seems that both the EDID errors and the XRandR failures do not occur when I disable preemption (that, and I'm running 37-rc2 now). I'm currently rebuilding -rc2 with forced preemption again, will alternate between those kernels for the next few days to see what happens.

To give an idea of my current system activity: I'm compiling a kernel while reloading all my tabs in Firefox, having two redraw-happy windows open at the same time, and having three md5sum processes working on large files. System load is peeking at 6.8 according to top (on a UP system), and yet I don't see a single EDID or i2c warning in my dmesg.

This seems in line with my earlier conjecture that the problems are latency-triggered. I will report back once I have more than the current 60 minutes of experience with the new kernel, with both forced preemption and no preemption.

Comment 13 Alex Deucher 2010-11-19 07:30:51 UTC

(In reply to comment #12)
> It seems that both the EDID errors and the XRandR failures do not occur when I
> disable preemption (that, and I'm running 37-rc2 now). I'm currently rebuilding
> -rc2 with forced preemption again, will alternate between those kernels for the
> next few days to see what happens.

Getting the EDID is a timing sensitive bit banging operation, it the driver gets preempted in the middle of an EDID fetch the fetch may fail or get incomplete data.

Comment 14 Arno Schuring 2010-11-20 05:13:34 UTC

Grumble. So, how will this be fixed? Sprinkle the code with preempt_disable()/enable() pairs? Advise all users (well, me) to disable preemption? Change the code so it no longer acts on intermittent EDID failures? Ask your colleagues to design sane hardware that doesn't rely on software interaction for timing-critical operations ;) ?

FWIW, I'm perfectly happy to disable preemption in my kernel, haven't run a distro-provided kernel since 2.6.27. But your explanation sounds like this could bite a lot more users than just me.

Comment 15 Alex Deucher 2010-11-20 09:10:56 UTC

(In reply to comment #14)
> Grumble. So, how will this be fixed? Sprinkle the code with
> preempt_disable()/enable() pairs? Advise all users (well, me) to disable
> preemption? Change the code so it no longer acts on intermittent EDID failures?
> Ask your colleagues to design sane hardware that doesn't rely on software
> interaction for timing-critical operations ;) ?
> 
> FWIW, I'm perfectly happy to disable preemption in my kernel, haven't run a
> distro-provided kernel since 2.6.27. But your explanation sounds like this
> could bite a lot more users than just me.

It has nothing to do with the hw, it's the nature of i2c.  All drivers that use bit banging i2c (hwmon, v4l, lots of embedded stuff) would potentially be affected.

Comment 16 Arno Schuring 2011-02-10 03:30:17 UTC

I think the priority can be lowered, since it seems to affect only me (most people don't run a 7-year old machine I guess). Also, I'm not seeing any crashes any more with these kernel settings:
CONFIG_HZ=100
CONFIG_NO_HZ=y
CONFIG_PREEMPT=y

I also removed Pulseaudio, which seems to have dropped my idle system load (while playing music) from 0.60 to 0.01 according to top, and am now running 2.6.38-rc4.

Comment 17 Stratos Zolotas 2011-05-21 03:35:11 UTC

Created attachment 46971 [details]
dmesg output

I'm experiencing a similar error (with more info on dmesg).

I'm on x86_64 and running 2.6.39 kernel.

I could provide more info if you tell me where to look.

Comment 18 Ville Syrjala 2011-05-21 04:31:35 UTC

(In reply to comment #15)
> (In reply to comment #14)
> > Grumble. So, how will this be fixed? Sprinkle the code with
> > preempt_disable()/enable() pairs? Advise all users (well, me) to disable
> > preemption? Change the code so it no longer acts on intermittent EDID failures?
> > Ask your colleagues to design sane hardware that doesn't rely on software
> > interaction for timing-critical operations ;) ?
> > 
> > FWIW, I'm perfectly happy to disable preemption in my kernel, haven't run a
> > distro-provided kernel since 2.6.27. But your explanation sounds like this
> > could bite a lot more users than just me.
> 
> It has nothing to do with the hw, it's the nature of i2c.  All drivers that use
> bit banging i2c (hwmon, v4l, lots of embedded stuff) would potentially be
> affected.

If the EEPROM supports clock stretching I see no reason why bit banging would fail.

I got these timeout errors on one system at some point. I had an idea for a possible fix, but I never got around to testing it. I'll attach the patch.

Comment 19 Ville Syrjala 2011-05-21 04:32:55 UTC

Created attachment 46972 [details] [review]
possible i2-algo-bit preempt fix

Comment 20 Martin Peres 2019-11-19 08:15:18 UTC

-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/drm/amd/issues/153.

Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.