83925 – [byt 3.14] bios trashes ringbuffer

Status:	CLOSED FIXED

Alias:	None

Product:	DRI
Classification:	Unclassified
Component:	DRM/Intel
Version:	unspecified
Hardware:	x86-64 (AMD64) Linux (All)

Importance:	medium major
Assignee:	Intel GFX Bugs mailing list
QA Contact:	Intel GFX Bugs mailing list

Reported:	2014-09-16 10:42 UTC by Manuel Bachmann
Modified:	2017-07-24 22:51 UTC
CC List:	2 users

Attachments
Code sample (5.80 KB, text/plain) 2014-09-16 10:43 UTC, Manuel Bachmann
gpu_crashdump.log (2.14 MB, text/plain) 2014-09-16 10:46 UTC, Manuel Bachmann
drm.log - extracted from dmesg (270.62 KB, text/plain) 2014-09-16 10:47 UTC, Manuel Bachmann

Description Manuel Bachmann 2014-09-16 10:42:30 UTC

System is :
- Kernel 3.14.14
- Mesa 10.1.3
- Intel HD 4400 (Intel NUC DE3815TYKE)
- OS : Tizen Common x86_64
- uname -a : Linux common_box 3.14.14-10.4-common-x86_64-default #1 SMP PREEMPT Fri Aug 29 19:26:37 UTC 2014 x86_64 GNU/Linux

Summary:
 When initializing the DRM hardware manually, then creating a GBM buffer on it, and then creating an EGL context on the GBM display, the i915 driver will hang with the following messages :
 [drm:valleyview_set_repos], GPU freq request from 375 Mhz (201) to 375 Mhz (201)
 [drm:valleyview_set_repos], GPU freq request from 375 Mhz (201) to 375 Mhz (201)
   (repeated many times)
 [drm] stuck on render ring
 [drm] GPU crash dump saved to /sys/class/drm/card0/error
 [drm] GPU hangs can indicate a bug anywhere [...] Please file a _new_ bug report [...]
 [drm:i915_error_work_func], resetting chip

The chip then resets, which freezes the client application for about 6 seconds. The application may then work, but startup is slow. This happens only once, regardless of timing or boot sequence.

This typically happens when running Weston with the DRM backend and GL renderer (weston --backend=drm-backend.so). Attached a sample program which mimics Weston startup and always reproduces the crash.

Note : issuing non-EGL commands on the GBM surface, such as software Pixman/Cairo, will not reproduce the crash.

Steps to reproduce :
 1. Disable any other display system (X.Org, Weston...)
 2. Reboot with the following kernel options : drm.debug=0x06 log_buf_len=10M initcall_debug ignore_loglevel debug.
 3. Compile and, as root, run the sample program.
 4. Run "dmesg | grep -i drm" and look for the crash messages.
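
Step 4 can be narrowed down with a small filter. This is only a sketch: the helper names has_gpu_hang and hang_lines are hypothetical, and the patterns are taken from the log messages quoted in the summary above, so they may need adjusting on other kernel versions.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if the kernel log read from stdin contains
# one of the i915 hang messages quoted in this report.
has_gpu_hang() {
    grep -q -e 'stuck on render ring' -e 'GPU crash dump saved'
}

# Hypothetical helper: keep the drm lines from stdin but drop the repeated
# "GPU freq request" noise, so the hang messages are easier to spot.
hang_lines() {
    grep -i 'drm' | grep -v 'GPU freq request'
}
```

Possible usage, as root: dmesg | hang_lines, or dmesg | has_gpu_hang && cat /sys/class/drm/card0/error > gpu_crashdump.log to keep the error state mentioned in the log.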

Comment 1 Manuel Bachmann 2014-09-16 10:43:40 UTC

Created attachment 106361
Code sample

Comment 2 Manuel Bachmann 2014-09-16 10:46:50 UTC

Created attachment 106362
gpu_crashdump.log

Comment 3 Manuel Bachmann 2014-09-16 10:47:54 UTC

Created attachment 106363
drm.log - extracted from dmesg

Comment 4 Manuel Bachmann 2014-09-17 07:53:18 UTC

Oops, forgot a very important detail, sorry for that :

This only happens if using a HDMI monitor.
The issue will not be triggered with a VGA monitor (or at least, we did not reproduce it).

Comment 5 Manuel Bachmann 2014-09-18 14:31:31 UTC

OK, I tracked it down.

It happens precisely in "src/mesa/drivers/dri/i965/intel_extensions.c" (is it i965 ? The log message specified "i915", though).

In "intelInitExtensions()", there is a test function named "can_do_pipelined_register_writes()" at line 306.
This function itself calls "drm_intel_bo_map()" twice ; the second call is at line 99 and does "drm_intel_bo_map(..., FALSE)".

It is this second call that hangs the driver, *but only the first time it is used*. Calls made after the driver restart work, and the function returns success.

The call itself lives in "libdrm", in "intel/intel_bufmgr.c" and seems to do ioctls and more, though I have difficulties logging what is happening in it (systemd-journal logging does not work).

Comment 6 Ian Romanick 2014-09-18 16:24:52 UTC

i915 is the kernel driver shared by all Intel GPUs.  i965 is the user-mode OpenGL driver used by recent (almost everything made in the last 5 years) Intel GPUs.

Are you able to reproduce this on more recent versions of Mesa?

The problem sounds familiar, and I think Ken may have already fixed the underlying issue.  My memory is a bit fuzzy.

Comment 7 Manuel Bachmann 2014-09-18 17:18:12 UTC

Hi Ian, and thanks a lot for your insights,

Ok, so that means the crash is really happening in the kernel-mode driver... was guessing that, but not sure, thanks.

I will try to rebase our repository on the latest stable Mesa (10.2.7), and let you know the results.

Comment 8 Manuel Bachmann 2014-09-19 08:00:53 UTC

So here are my findings :
I upgraded to kernel 3.16 and Mesa 10.2.7 : problem is fixed.
Downgraded to kernel 3.14.9, which is the latest version we intend to support on Tizen : problem happens again.

So it must have been fixed somewhere between 3.14.9 and 3.16. I intend to backport the fix ; would be nice if you had a pointer ! Meanwhile, I will try to find out by myself by looking at logs.

Comment 9 Manuel Bachmann 2014-09-24 11:09:25 UTC

Hi,

Any help on this ?

Found that kernel 3.15 works, but 3.14.9 does not. I tried to backport or cherry-pick half of the commits between 3.14.9 and 3.15, but that did not help. There is also a big "bugfix" branch merge (https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/drivers/gpu/drm/i915/i915_gem.c?id=889fa782bf8ebe7c0d0ed0a9429bf43197f0f64e) which I did not try because it is huge.

We cannot use the latest kernel as-is because it is explicitly disallowed in our workflow.

For reference, the precise call which leads to the hang is in "drivers/gpu/drm/i915/i915_gem.c", in the "i915_gem_set_domain_ioctl()" function, line 1227. Somewhere in or just after "obj = to_intel_bo(drm_gem_object_lookup(dev, file, args->handle));".

Comment 10 Kenneth Graunke 2014-09-24 15:59:33 UTC

Reassigning to the kernel component since comments #8-9 indicate that it works with kernel 3.15, but doesn't work with kernel 3.14.9.  It's probably not a Mesa issue then.

(Also, note the conflicting information in comment #1, which says "HD 4400" - Haswell GT2 - but then mentions Baytrail.  Based on the NUC product number and backtrace, I believe this really is Baytrail, not Haswell.)

Manuel, I suspect the kernel guys will suggest bisecting to find the fix.  It's usually pretty effective.  To find a fix, you want to treat "working" as "bad" and "hanging" as "good".  Then bisect will find the first "bad" commit, i.e. the commit that makes it start working.

$ git bisect start -- drivers/gpu/drm/i915 # (or just 'git bisect start', but this will probably be faster and ought to work)
$ git bisect good v3.14.9
$ git bisect bad v3.15
<build and test a kernel.  if working, 'git bisect bad'.  if hang, 'git bisect good'>
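
The inverted labels are easy to mix up in the middle of a long bisect. As a minimal sketch, a helper can translate what was observed on the test kernel into the label to give git. The function name reverse_bisect_term is hypothetical and not part of git.

```shell
#!/bin/sh
# Hypothetical helper for a reverse bisect (hunting the commit that FIXES
# the hang): print the label to pass to 'git bisect' for an observed result.
#   working -> bad   (the fix is already present)
#   hang    -> good  (the fix is not yet present)
reverse_bisect_term() {
    case "$1" in
        working) echo bad ;;
        hang)    echo good ;;
        *)       echo "usage: reverse_bisect_term working|hang" >&2
                 return 2 ;;
    esac
}
```

After booting a test kernel that still hangs, this would be used as git bisect "$(reverse_bisect_term hang)". Git releases from 2.7 on also accept custom labels, which avoids the inversion altogether: git bisect start --term-old broken --term-new fixed -- drivers/gpu/drm/i915, followed by git bisect broken v3.14.9 and git bisect fixed v3.15.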

Comment 11 Chris Wilson 2014-09-24 19:46:52 UTC

BIOS overwrites the driver's ringbuffer, the GPU explodes. My guess is that the bisect will lead to the plane preservation code - not really a suitable candidate for backporting, so hope it leads somewhere else.

Comment 12 Manuel Bachmann 2014-09-24 20:14:08 UTC

Hi Kenneth,

I guessed the GPU model by looking at online docs, but I am pretty sure of the NUC product number. So I think you are right, it must be Baytrail.

Thanks a lot for the great advice and command samples ! I will definitely try them. Half of the "easy" commits are already done, so it may not be so hard.


Hi Chris,

Thanks a lot for your analysis ! I hope it is not this part, then. Well, we will see. At worst, if it is really not backportable, I will report back to my QA team and give them the options.

Comment 13 Jani Nikula 2014-09-25 13:50:25 UTC

(In reply to comment #12)
> I guessed the GPU model by looking at online docs, but I am pretty sure of
> the NUC product number. So I think you are right, it must be Baytrail.

Side note: this can be confirmed from the PCI ID, which I did, and I believe Chris did too when adding [byt] to the subject.
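
A rough sketch of that PCI ID check follows. The function name is_baytrail_id is hypothetical, and the 0x0f30-0x0f33 device ID range is an assumption based on the i915 driver's PCI ID table; the table may list further Bay Trail IDs that this check does not cover.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if the given PCI device ID (in the
# lowercase "0x...." form that sysfs prints) falls in the range assumed
# here for Bay Trail graphics.
# Assumption: 0x0f30-0x0f33, per the i915 PCI ID table; not exhaustive.
is_baytrail_id() {
    case "$1" in
        0x0f3[0-3]) return 0 ;;
        *)          return 1 ;;
    esac
}
```

Possible usage on the affected machine: is_baytrail_id "$(cat /sys/class/drm/card0/device/device)" && echo "Bay Trail". The same ID also shows up in the output of lspci -nn.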

Comment 14 Jani Nikula 2014-10-08 13:18:20 UTC

Manuel, how about that reverse bisect from v3.14 to v3.15 to see which commit starts working? See Kenneth's comment #10 and e.g. https://wiki.ubuntu.com/Kernel/KernelBisection.

Comment 15 Manuel Bachmann 2014-10-08 15:21:32 UTC

Hello Jani,

Thanks a lot for following the topic closely ! I have been taken off this task, so I cannot give any schedule or hints about the bisect, but a coworker may take it over soon. I guess it is safe to leave this bug open for now, if nobody minds.

Comment 16 Jani Nikula 2015-01-29 13:32:08 UTC

Closing, since it seems the bug has been fixed as of v3.15. If the problem reappears with newer kernels, feel free to reopen.

If you eventually do the reverse bisect and find the specific commit that fixes things, please report it on intel-gfx@lists.freedesktop.org mailing list so we can do a backport request if it ends up being suitable for backporting. Thanks.
