Bug 78533 - [3.15 regression] relocation value wraparound due to bios-fb preserved hole at 0
Summary: [3.15 regression] relocation value wraparound due to bios-fb preserved hole at 0
Status: CLOSED FIXED
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/Intel
Version: XOrg git
Hardware: x86-64 (AMD64) Linux (All)
Importance: high normal
Assignee: Jani Nikula
QA Contact: Intel GFX Bugs mailing list
URL:
Whiteboard:
Keywords:
Duplicates: 78876 79013 79539
Depends on:
Blocks:
 
Reported: 2014-05-10 16:42 UTC by Kenny MacDermid
Modified: 2017-07-24 22:54 UTC

See Also:
i915 platform:
i915 features:


Attachments
/sys/class/drm/card0/error compressed (403.80 KB, text/plain)
2014-05-10 16:42 UTC, Kenny MacDermid
Please consume whiskey first. (2.67 KB, patch)
2014-05-14 18:36 UTC, Chris Wilson
Prevent negative relocation deltas from causing wraparound (6.34 KB, patch)
2014-05-15 06:31 UTC, Chris Wilson
Prevent negative relocation deltas from causing wraparound (7.02 KB, patch)
2014-05-15 07:48 UTC, Chris Wilson
Prevent negative relocation deltas from causing wraparound (8.16 KB, patch)
2014-05-15 12:32 UTC, Chris Wilson
Offset batch buffers to prevent delta wrapping (11.03 KB, patch)
2014-05-15 15:58 UTC, Chris Wilson

Description Kenny MacDermid 2014-05-10 16:42:30 UTC
Created attachment 98823
/sys/class/drm/card0/error compressed

[11359.443122] [drm] stuck on render ring
[11359.444476] [drm] GPU HANG: ecode 0:0x86dffffd, in X [813], reason: Ring hung, action: reset
[11359.444490] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[11359.444494] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[11359.444498] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[11359.444501] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
[11359.444505] [drm] GPU crash dump saved to /sys/class/drm/card0/error
[11361.444736] [drm] Enabling RC6 states: RC6 on, RC6p off, RC6pp off
[11365.447937] [drm] stuck on render ring
[11365.449339] [drm] GPU HANG: ecode 0:0x86dffffd, in X [813], reason: Ring hung, action: reset
[11365.449462] [drm:i915_context_is_banned] *ERROR* gpu hanging too fast, banning!
[11367.449564] [drm] Enabling RC6 states: RC6 on, RC6p off, RC6pp off

Using the linux-mainline kernel on Arch from the AUR:
Linux orange 3.15.0-1-mainline #1 SMP PREEMPT Tue May 6 15:54:05 CEST 2014 x86_64 GNU/Linux
Comment 1 Chris Wilson 2014-05-10 17:21:01 UTC
Hmm, could you tell if this was a recent regression?
Comment 2 Kenny MacDermid 2014-05-10 21:07:27 UTC
I hadn't seen this error before upgrading to a 3.15 kernel. I've only run it for a couple of days and it has occurred twice. The actual kernel version is 3.15-rc4.

It looks like the Arch AUR package maintainer has updated the package to rc5, so I can try that and let you know if it continues happening.

Possibly unrelated, but for completeness: I've also noticed a decreased frame rate in TagPro.

For my /etc/X11/xorg.conf.d/20-intel.conf I'm using:

Section "Device"
	Identifier "Intel Graphics"
	Option "SwapbuffersWait" "true"
	Option "AccelMethod" "sna"
	Option "TearFree" "true"
EndSection

The laptop is a Lenovo Yoga 2 Pro, so it has a high-DPI screen (3200x1800, iirc).
Comment 3 Chris Wilson 2014-05-11 07:06:44 UTC
Please do check whether the current packages work with a 3.14 kernel. That will narrow down the error to being in the kernel.
Comment 4 Kenny MacDermid 2014-05-14 13:21:43 UTC
I switched back to 3.14.2 for the last 3 days and the hangs have not happened.

The framerate seems the same though, so perhaps that was another package.
Comment 5 Chris Wilson 2014-05-14 18:01:10 UTC
Hmm, I didn't notice this first time around:

0x0000a044:      0x61010008: STATE_BASE_ADDRESS
0x0000a048:      0x00000000:    general state base not updated
0x0000a04c:      0xffffd001:    surface state base address 0xffffd000
0x0000a050:      0x044a1501:    dynamic state base address 0x044a1500
0x0000a054:      0x00000000:    indirect state base not updated
0x0000a058:      0x044a1501:    instruction state base address 0x044a1500
0x0000a05c:      0x00000000:    general state upper bound not updated
0x0000a060:      0x00000001:    dynamic state upper bound disabled
0x0000a064:      0x00000000:    indirect state upper bound not updated
0x0000a068:      0x00000001:    instruction state upper bound disabled

Oh boy. This is going to be fun.
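
The surface state base address in the dump, 0xffffd000, sits just below the 4 GiB boundary, which is what a 32-bit relocation value looks like after a low object offset plus a negative delta wraps around. A minimal sketch of that arithmetic follows. The object offset of 0 and the delta of -0x3000 are assumptions chosen only to reproduce the value shown; they are not read from the error state. Splitting off bit 0 mirrors how the decoder output above turns 0xffffd001 into 0xffffd000.

```python
from typing import Tuple

MASK32 = 0xFFFFFFFF  # relocation values in this batch are 32 bits wide


def relocate(target_offset: int, delta: int) -> int:
    """Return the value written into the batch: offset + delta, truncated to 32 bits."""
    return (target_offset + delta) & MASK32


def decode_base_address(dword: int) -> Tuple[int, bool]:
    """Split a base-address dword into (address, bit 0), as the decoder output does."""
    return dword & ~1 & MASK32, bool(dword & 1)


# Hypothetical inputs: object placed at offset 0, delta of -0x3000 with bit 0 set.
wrapped = relocate(0x0, -0x3000 + 1)
print(hex(wrapped))
print(decode_base_address(wrapped))

# The same delta applied to an object placed higher up does not wrap.
print(hex(relocate(0x100000, -0x3000)))
```

With an object at a sufficiently high offset the sum stays positive, which is consistent with the hang only appearing once something could be placed at offset 0.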
Comment 6 Chris Wilson 2014-05-14 18:36:33 UTC
Created attachment 99041
Please consume whiskey first.

Urgh. Something like this.
Comment 7 Chris Wilson 2014-05-15 06:31:42 UTC
Created attachment 99059
Prevent negative relocation deltas from causing wraparound
Comment 8 Chris Wilson 2014-05-15 07:47:59 UTC
Testcase: igt/gem_bad_reloc

commit daa9e3d80a6c25667b259e864376ac929d5a11bd
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Thu May 15 08:43:11 2014 +0100

    Add gem_bad_reloc
    
    This test feeds a batch containing self-references into the kernel and
    checks that the relocation offsets remain as valid GTT addresses. This
    is to exercise SNA passing in negative relocation deltas which can hang
    the GPU if they wrap around.
    
    References: https://bugs.freedesktop.org/show_bug.cgi?id=78533
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
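
The kind of check the commit message describes can be sketched as follows. This is not the code of igt/gem_bad_reloc, only an illustration of the property it tests: a self-referencing relocation with a negative delta must still yield an address inside the GTT. The GTT size, the delta, and the function names are hypothetical.

```python
MASK32 = 0xFFFFFFFF
GTT_SIZE = 2 << 30  # hypothetical 2 GiB GTT, for illustration only


def self_reloc_value(batch_offset, delta):
    """Value a self-referencing relocation would hold if computed with 32-bit wraparound."""
    return (batch_offset + delta) & MASK32


def is_valid_gtt_address(value, gtt_size=GTT_SIZE):
    """True if the relocated value still points inside the GTT."""
    return 0 <= value < gtt_size


# Try a negative delta against batches placed at several offsets, including 0.
for batch_offset in (0x0, 0x1000, 0x100000):
    value = self_reloc_value(batch_offset, -0x1000)
    print(hex(batch_offset), hex(value), is_valid_gtt_address(value))
```

Only the batch placed at offset 0 produces an out-of-range value here, which matches the failure mode the test is meant to catch.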
Comment 9 Chris Wilson 2014-05-15 07:48:39 UTC
Created attachment 99064
Prevent negative relocation deltas from causing wraparound
Comment 10 Chris Wilson 2014-05-15 12:32:52 UTC
Created attachment 99083
Prevent negative relocation deltas from causing wraparound

Now actually handles SNA's batchbuffers.
Comment 11 Daniel Vetter 2014-05-15 14:07:14 UTC
I'd still like to know which patch introduced this regression ...

Kenny, can you please try to do a bisect?
Comment 12 Kenny MacDermid 2014-05-15 14:25:09 UTC
I can try, but it was only occurring around once a day so it'll take a bit.

Is there a start commit I should use other than 3.14?
Comment 13 Chris Wilson 2014-05-15 14:40:52 UTC
It's BIOS fb preservation leaving a hole at 0.
Comment 14 Daniel Vetter 2014-05-15 14:42:11 UTC
Hey, at least that works. But yeah, makes tons of sense ... So no bisect result needed.
Comment 15 Chris Wilson 2014-05-15 15:58:37 UTC
Created attachment 99103
Offset batch buffers to prevent delta wrapping

An alternative, following Daniel's suggestion. The problem, imo, is that this bakes in assumptions about userspace: it pessimises every case (and is fragile) rather than fixing only the pathological ones.
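
The idea of offsetting buffers can be sketched roughly as below. This is not the attached patch; the bias value and the helper names are hypothetical, and the sketch only illustrates the trade-off: every buffer is kept away from offset 0, yet a delta larger than the bias still wraps.

```python
MASK32 = 0xFFFFFFFF
PLACEMENT_BIAS = 256 * 1024  # hypothetical minimum placement offset


def place(requested_offset, bias=PLACEMENT_BIAS):
    """Clamp a placement so the buffer never lands below the bias."""
    return max(requested_offset, bias)


def relocate(target_offset, delta):
    """32-bit relocation value, offset + delta with wraparound."""
    return (target_offset + delta) & MASK32


def wraps(target_offset, delta):
    """True if offset + delta goes negative before truncation."""
    return target_offset + delta < 0


# Without the bias, a buffer at 0 wraps; with it, the same delta is safe.
print(wraps(0, -0x3000), wraps(place(0), -0x3000))
print(hex(relocate(place(0), -0x3000)))

# A delta larger than the bias still wraps, so the bias encodes an
# assumption about how negative userspace deltas can get.
print(wraps(place(0), -(PLACEMENT_BIAS + 1)))
```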
Comment 16 Chris Wilson 2014-05-19 06:29:02 UTC
*** Bug 78876 has been marked as a duplicate of this bug. ***
Comment 17 Daniel Vetter 2014-05-19 08:44:13 UTC
We need tested-bys on Chris' latest patch ... That goes for everyone whose report has been de-duped to this one, too.
Comment 18 Chris Wilson 2014-05-21 12:47:44 UTC
*** Bug 79013 has been marked as a duplicate of this bug. ***
Comment 19 Chris Wilson 2014-05-28 18:30:31 UTC
commit d23db88c3ab233daed18709e3a24d6c95344117f
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Fri May 23 08:48:08 2014 +0200

    drm/i915: Prevent negative relocation deltas from wrapping
Comment 20 Chris Wilson 2014-06-02 11:27:21 UTC
*** Bug 79539 has been marked as a duplicate of this bug. ***

