68125 – Radeon HD6950: Failed to parse relocation -35 (Linux 3.11-rc regression)

Status:	RESOLVED FIXED

Alias:	None

Product:	DRI
Classification:	Unclassified
Component:	DRM/Radeon
Version:	XOrg git
Hardware:	x86-64 (AMD64) Linux (All)

Importance:	medium major
Assignee:	Default DRI bug account

Reported:	2013-08-14 23:07 UTC by Andreas Bombe
Modified:	2013-08-17 21:13 UTC
CC List:	0 users

Attachments
Linux 3.11-rc5 dmesg (131.10 KB, text/plain) 2013-08-14 23:07 UTC, Andreas Bombe	no flags
associated Xorg.log (47.59 KB, text/plain) 2013-08-14 23:10 UTC, Andreas Bombe	no flags

Description Andreas Bombe 2013-08-14 23:07:17 UTC

Created attachment 84074
Linux 3.11-rc5 dmesg

After running in X for a while, I get a monitor signal dropout for a few seconds and find the following messages in the kernel log:

[drm:radeon_cs_ioctl] *ERROR* Failed to parse relocation -35!
radeon 0000:01:00.0: GPU reset succeeded, trying to resume

Messages about the reset procedure, ring test etc. follow. The recovery plays out differently on different occurrences, with more or less success and varying messages. Usually the second occurrence kills X, and further attempts at logging in either end up in GNOME fallback mode or simply die immediately, leaving a reboot as the only way to solve the problem. In one run with 3.11-rc5, at least, it always recovered, reducing the impact of the bug to an occasional annoying black screen for a few seconds.

This never happened before and never happens in Linux 3.10. The first kernel in the 3.11 development cycle was 3.10.0-08918-g8133633, which already showed this problem.

I'm running Debian unstable, xorg is 7.7, xserver-xorg-video-radeon 6.14.4, libdrm-radeon1 2.4.46, mesa 9.1.6.

I have attached the log of the 3.11-rc5 run, which has plenty of these occurrences thanks to always recovering successfully.
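For anyone who wants to count the occurrences in a log like the attached one, a minimal Python sketch follows. It assumes the error line has exactly the form quoted above, optionally preceded by a `[seconds.micros]` dmesg timestamp; the function name `find_relocation_errors` and the sample timestamp are made up for illustration and are not taken from the attachment.

```python
import re

# Matches the error line quoted in this report. The optional leading
# "[  123.456789] " timestamp is an assumption about the dmesg format.
RELOC_ERROR = re.compile(
    r"^(?:\[\s*(?P<ts>\d+\.\d+)\]\s*)?"
    r"\[drm:radeon_cs_ioctl\] \*ERROR\* "
    r"Failed to parse relocation (?P<code>-?\d+)!"
)


def find_relocation_errors(dmesg_text):
    """Return a list of (timestamp or None, error code) per matching line."""
    hits = []
    for line in dmesg_text.splitlines():
        match = RELOC_ERROR.match(line.strip())
        if match:
            ts = float(match.group("ts")) if match.group("ts") else None
            hits.append((ts, int(match.group("code"))))
    return hits


if __name__ == "__main__":
    # Illustrative input only; the timestamp is invented.
    sample = (
        "[  100.250000] [drm:radeon_cs_ioctl] *ERROR* "
        "Failed to parse relocation -35!\n"
        "radeon 0000:01:00.0: GPU reset succeeded, trying to resume\n"
    )
    for ts, code in find_relocation_errors(sample):
        print(ts, code)
```

Feeding it the text of a saved dmesg file gives one tuple per failed relocation, which makes it easy to see how often the resets happen during a run.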

Comment 1 Alex Deucher 2013-08-14 23:09:14 UTC

Can you bisect?

Comment 2 Andreas Bombe 2013-08-14 23:10:57 UTC

Created attachment 84076
associated Xorg.log

Comment 3 Andreas Bombe 2013-08-14 23:26:15 UTC

(In reply to comment #1)
> Can you bisect?

I will have a go at it tomorrow then. That is going to be very time consuming, I expect, what with the bug only showing itself after quite some time.


What I also meant to mention is that the complexity of the GPU workload does not seem to affect the likelihood of the bug occurring. For example, I played Team Fortress 2 for quite some time on the 3.11-rc5 run, and that resulted in only one reset and successful recovery. The resets don't appear to occur when the system is really idle, though; often they happen just as I start to move the mouse or scroll in gvim.

Comment 4 Andreas Bombe 2013-08-17 21:13:48 UTC

While bisecting I noticed that CONFIG_DEBUG_WW_MUTEX_SLOWPATH was introduced and that I had it enabled. Since that option can inject additional -EDEADLK returns, I suspected that check.
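The link between the log message and this config option is the error number: on Linux x86-64, EDEADLK is errno 35, so "relocation -35" is a -EDEADLK return. A small Python sketch of that check, using only the standard `errno` and `os` modules (the helper name `is_deadlock_errno` is hypothetical; errno values are platform-specific, so the comparison uses the symbolic constant rather than a hard-coded 35):

```python
import errno
import os


def is_deadlock_errno(code):
    """True if a kernel-style negative return code is -EDEADLK here."""
    return abs(code) == errno.EDEADLK


if __name__ == "__main__":
    # The value reported in the kernel log from this bug.
    reported = -35
    print("EDEADLK on this platform:", errno.EDEADLK)
    print("message:", os.strerror(errno.EDEADLK))
    print("reported code is -EDEADLK:", is_deadlock_errno(reported))
```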

It turns out that check was buggy. The fix, 85f489612 "mutex: Fix w/w mutex deadlock injection", has been merged since I posted this bug, and the commit message even mentions explicitly that the bug caused spurious GPU lockups on radeon. The problem appears to be gone with this fix.
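To see whether a given kernel build has the option enabled, the kernel config can be inspected. The sketch below parses `.config`-style text. The helper name `config_option_enabled` is hypothetical, and the `/boot/config-<release>` path is an assumption (common on Debian, but not guaranteed to exist on every system).

```python
import platform


def config_option_enabled(config_text, option):
    """True if `option` is set to y or m in kernel .config-style text."""
    for line in config_text.splitlines():
        line = line.strip()
        # Disabled options appear as "# CONFIG_FOO is not set" comments.
        if line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        if key == option:
            return value in ("y", "m")
    return False


if __name__ == "__main__":
    # Assumed location of the running kernel's config file.
    path = "/boot/config-" + platform.release()
    try:
        with open(path) as handle:
            text = handle.read()
    except OSError as exc:
        print("could not read", path, "-", exc)
    else:
        option = "CONFIG_DEBUG_WW_MUTEX_SLOWPATH"
        print(option, "enabled:", config_option_enabled(text, option))
```

A kernel from the 3.11-rc series with this option enabled and without the fix above would be a candidate for the spurious resets described here.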
