Bug 105397

Summary: [hsw] GPU HANG in paraview
Product: Mesa
Reporter: Giuseppe Bilotta <giuseppe.bilotta>
Component: Drivers/DRI/i965
Assignee: Intel 3D Bugs Mailing List <intel-3d-bugs>
Status: RESOLVED MOVED
QA Contact: Intel 3D Bugs Mailing List <intel-3d-bugs>
Severity: normal
Priority: medium
CC: intel-gfx-bugs
Version: unspecified
Hardware: x86-64 (AMD64)
OS: Linux (All)
Whiteboard:
i915 platform: HSW
i915 features: GPU hang
Attachments: /sys/class/drm/card0/error contents

Description Giuseppe Bilotta 2018-03-08 11:36:45 UTC
Created attachment 137889
/sys/class/drm/card0/error contents

I just experienced this while using Paraview to visualize a largish dataset (some time after the hang and reset, PV crashed).

This is a Dell XPS 15 9530 featuring an i7-4712HQ CPU @ 2.30GHz, running Debian unstable.

uname -a: Linux oblomov 4.15.0-1-amd64 #1 SMP Debian 4.15.4-1 (2018-02-18) x86_64 GNU/Linux

* paraview 5.4.1+dfsg3-1+b2
* xorg 1:7.7+19
* mesa 17.3.6-1
* libdrm-intel1 2.4.90-1

The crash dump is attached.
Comment 1 Elizabeth 2018-03-08 15:56:46 UTC
Not sure if this information is relevant:

From error decode

0x7fda78a4:      0x02850000:    buffer 0: valid, type 0x0085, src offset 0x0000 bytes
0x7fda78a8:      0x11230000:    (X, Y, 0.0, 1.0), dst offset 0x00 bytes
0x7fda78ac:      0x06850000:    buffer 1: valid, type 0x0085, src offset 0x0000 bytes
0x7fda78b0:      0x11230000:    (X, Y, 0.0, 1.0), dst offset 0x00 bytes
0x7fda78b4:      0x04870000:    buffer 1: invalid, type 0x0087, src offset 0x0000 bytes
0x7fda78b8:      0x12520000:    (X, 0.0, VID, 0.0), dst offset 0x00 bytes
0x7fda78bc:      0x00000000:    buffer 0: invalid, type 0x0000, src offset 0x0000 bytes
0x7fda78c0:      0x00000000:    (nostore, nostore, nostore, nostore), dst offset 0x00 bytes
0x7fda78c4:      0x00000000:    buffer 0: invalid, type 0x0000, src offset 0x0000 bytes
0x7fda78c8:      0x00000000:    (nostore, nostore, nostore, nostore), dst offset 0x00 bytes
0x7fda78cc:      0x00000000:    buffer 0: invalid, type 0x0000, src offset 0x0000 bytes
0x7fda78d0:      0x00000000:    (nostore, nostore, nostore, nostore), dst offset 0x00 bytes
0x7fda78d4:      0x00000000:    buffer 0: invalid, type 0x0000, src offset 0x0000 bytes
0x7fda78d8:      0x00000000:    (nostore, nostore, nostore, nostore), dst offset 0x00 bytes
0x7fda78dc:      0x00000000:    buffer 0: invalid, type 0x0000, src offset 0x0000 bytes
0x7fda78e0:      0x00000000:    (nostore, nostore, nostore, nostore), dst offset 0x00 bytes
0x7fda78e4:      0x00000000:    buffer 0: invalid, type 0x0000, src offset 0x0000 bytes
0x7fda78e8:      0x00000000:    (nostore, nostore, nostore, nostore), dst offset 0x00 bytes
0x7fda78ec:      0x00000000:    buffer 0: invalid, type 0x0000, src offset 0x0000 bytes
0x7fda78f0:      0x00000000:    (nostore, nostore, nostore, nostore), dst offset 0x00 bytes
0x7fda78f4:      0x00000000:    buffer 0: invalid, type 0x0000, src offset 0x0000 bytes
0x7fda78f8:      0x00000000:    (nostore, nostore, nostore, nostore), dst offset 0x00 bytes
0x7fda78fc:      0x00000000:    buffer 0: invalid, type 0x0000, src offset 0x0000 bytes
0x7fda7900:      0x00000000:    (nostore, nostore, nostore, nostore), dst offset 0x00 bytes
0x7fda7904:      0x00000000:    buffer 0: invalid, type 0x0000, src offset 0x0000 bytes
0x7fda7908:      0x00000000:    (nostore, nostore, nostore, nostore), dst offset 0x00 bytes
0x7fda790c:      0x00000000:    buffer 0: invalid, type 0x0000, src offset 0x0000 bytes
0x7fda7910:      0x00000000:    (nostore, nostore, nostore, nostore), dst offset 0x00 bytes
0x7fda7914:      0x00000000:    buffer 0: invalid, type 0x0000, src offset 0x0000 bytes
0x7fda7918:      0x00000000:    (nostore, nostore, nostore, nostore), dst offset 0x00 bytes
0x7fda791c:      0x00000000:    buffer 0: invalid, type 0x0000, src offset 0x0000 bytes
0x7fda7920:      0x00000000:    (nostore, nostore, nostore, nostore), dst offset 0x00 bytes
0x7fda7924:      0x00000000:    buffer 0: invalid, type 0x0000, src offset 0x0000 bytes
0x7fda7928:      0x00000000:    (nostore, nostore, nostore, nostore), dst offset 0x00 bytes
0x7fda792c:      0x00000000:    buffer 0: invalid, type 0x0000, src offset 0x0000 bytes
0x7fda7930:      0x00000000:    (nostore, nostore, nostore, nostore), dst offset 0x00 bytes
0x7fda7934:      0x00000000:    buffer 0: invalid, type 0x0000, src offset 0x0000 bytes
0x7fda7938:      0x00000000:    (nostore, nostore, nostore, nostore), dst offset 0x00 bytes
0x7fda793c:      0x00000000:    buffer 0: invalid, type 0x0000, src offset 0x0000 bytes
0x7fda7940:      0x00000000:    (nostore, nostore, nostore, nostore), dst offset 0x00 bytes
0x7fda7944:      0x00000000:    buffer 0: invalid, type 0x0000, src offset 0x0000 bytes
0x7fda7948:      0x00000000:    (nostore, nostore, nostore, nostore), dst offset 0x00 bytes
0x7fda794c:      0x00000000:    buffer 0: invalid, type 0x0000, src offset 0x0000 bytes
0x7fda7950:      0x00000000:    (nostore, nostore, nostore, nostore), dst offset 0x00 bytes
0x7fda7954:      0x00000000:    buffer 0: invalid, type 0x0000, src offset 0x0000 bytes
0x7fda7958:      0x00000000:    (nostore, nostore, nostore, nostore), dst offset 0x00 bytes
0x7fda795c:      0x00000000:    buffer 0: invalid, type 0x0000, src offset 0x0000 bytes
0x7fda7960:      0x00000000:    (nostore, nostore, nostore, nostore), dst offset 0x00 bytes
0x7fda7964:      0x00000000:    buffer 0: invalid, type 0x0000, src offset 0x0000 bytes
0x7fda7968:      0x00000000:    (nostore, nostore, nostore, nostore), dst offset 0x00 bytes
0x7fda796c:      0x00000000:    buffer 0: invalid, type 0x0000, src offset 0x0000 bytes
0x7fda7970:      0x00000000:    (nostore, nostore, nostore, nostore), dst offset 0x00 bytes
0x7fda7974:      0x00000000:    buffer 0: invalid, type 0x0000, src offset 0x0000 bytes
0x7fda7978:      0x00000000:    (nostore, nostore, nostore, nostore), dst offset 0x00 bytes
0x7fda797c:      0x00000000:    buffer 0: invalid, type 0x0000, src offset 0x0000 bytes
0x7fda7980:      0x00000000:    (nostore, nostore, nostore, nostore), dst offset 0x00 bytes
0x7fda7984:      0x00000000:    buffer 0: invalid, type 0x0000, src offset 0x0000 bytes
0x7fda7988:      0x00000000:    (nostore, nostore, nostore, nostore), dst offset 0x00 bytes
0x7fda798c:      0x00000000:    buffer 0: invalid, type 0x0000, src offset 0x0000 bytes
0x7fda7990:      0x00000000:    (nostore, nostore, nostore, nostore), dst offset 0x00 bytes
0x7fda7994:      0x00000000:    buffer 0: invalid, type 0x0000, src offset 0x0000 bytes
0x7fda7998:      0x00000000:    (nostore, nostore, nostore, nostore), dst offset 0x00 bytes
0x7fda799c:      0x00000000:    buffer 0: invalid, type 0x0000, src offset 0x0000 bytes
0x7fda79a0:      0x00000000:    (nostore, nostore, nostore, nostore), dst offset 0x00 bytes
0x7fda79a4:      0x00000000:    buffer 0: invalid, type 0x0000, src offset 0x0000 bytes
0x7fda79a8:      0x00000000:    (nostore, nostore, nostore, nostore), dst offset 0x00 bytes
0x7fda79ac:      0x00000000:    buffer 0: invalid, type 0x0000, src offset 0x0000 bytes
0x7fda79b0:      0x00000000:    (nostore, nostore, nostore, nostore), dst offset 0x00 bytes
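
For readers who want to check the decoded lines against the raw dwords, here is a minimal Python sketch that reproduces the two text fields printed per vertex element. The bit positions (buffer index in bits 31:26, valid flag in bit 25, type in bits 24:16, source offset in bits 10:0, and 3-bit component selectors in the upper half of the second dword) are assumptions inferred from the output above, not taken from hardware documentation. The function name decode_element and the "reserved" label for unknown selectors are hypothetical.

```python
# Sketch only: the bit layout below is an assumption inferred from the
# decoded lines in this report, not from the hardware documentation.

# Component selector values as they appear in the decode output.
# Selector 1 ("store source component") is handled separately below.
COMPONENT_NAMES = {0: "nostore", 2: "0.0", 3: "1.0", 4: "0x1", 5: "VID"}
AXES = "XYZW"


def decode_element(dw0, dw1):
    """Return the two text fields printed for one vertex element."""
    buffer_index = dw0 >> 26           # assumed: bits 31:26
    valid = bool(dw0 & (1 << 25))      # assumed: bit 25
    elem_type = (dw0 >> 16) & 0x1FF    # assumed: bits 24:16
    src_offset = dw0 & 0x7FF           # assumed: bits 10:0

    components = []
    for i in range(4):
        # Component 0 lives in the highest nibble, component 3 in the lowest.
        sel = (dw1 >> (16 + (3 - i) * 4)) & 0x7
        if sel == 1:
            components.append(AXES[i])
        else:
            # "reserved" is a placeholder label chosen for this sketch.
            components.append(COMPONENT_NAMES.get(sel, "reserved"))
    dst_offset = (dw1 & 0xFF) * 4      # assumed: stored in dword units

    head = "buffer %d: %svalid, type 0x%04x, src offset 0x%04x bytes" % (
        buffer_index, "" if valid else "in", elem_type, src_offset)
    tail = "(%s), dst offset 0x%02x bytes" % (", ".join(components), dst_offset)
    return head, tail


if __name__ == "__main__":
    # The three distinct non-zero pairs from the dump above.
    for pair in [(0x02850000, 0x11230000),
                 (0x06850000, 0x11230000),
                 (0x04870000, 0x12520000)]:
        print(*decode_element(*pair), sep="\n")
```

Under these assumptions, the third element (0x04870000) decodes as buffer 1 with the valid bit clear, which is the only non-zero entry in the dump marked invalid.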
Comment 2 Giuseppe Bilotta 2018-03-08 17:13:01 UTC
I don't know if this is relevant, but the GPU reset repeatedly in a relatively short time. This is the dmesg output from the first reset to the paraview segfault:


[Mar 8 12:21] [drm] GPU HANG: ecode 7:1:0xf4ebffff, in Xorg [1731], reason: No progress on bcs0, action: reset
[  +0.000004] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[  +0.000002] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[  +0.000001] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[  +0.000002] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
[  +0.000002] [drm] GPU crash dump saved to /sys/class/drm/card0/error
[  +0.000038] i915 0000:00:02.0: Resetting chip after gpu hang
[  +8.086583] i915 0000:00:02.0: Resetting chip after gpu hang
[  +7.935792] i915 0000:00:02.0: Resetting chip after gpu hang
[Mar 8 12:22] i915 0000:00:02.0: Resetting chip after gpu hang
[  +7.935842] i915 0000:00:02.0: Resetting chip after gpu hang
[  +8.063832] i915 0000:00:02.0: Resetting chip after gpu hang
[Mar 8 12:24] paraview[13787]: segfault at 5572c408d160 ip 00005572c408d160 sp 00007fff7cefda78 error 15


Is it possible that the crash dump got overwritten during the reset?
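
To make the log above easier to summarize, here is a small Python sketch that extracts the fields of the first "GPU HANG" line and counts the subsequent chip resets. The regular expression is an assumption based only on the message format shown in this report; other kernel versions may word the line differently. The names HANG_RE, SAMPLE, and summarize are hypothetical.

```python
import re

# Assumed message format, based only on the dmesg lines quoted in this report.
HANG_RE = re.compile(
    r"GPU HANG: ecode (?P<ecode>\S+), in (?P<proc>\S+) \[(?P<pid>\d+)\], "
    r"reason: (?P<reason>.*?), action: (?P<action>\w+)"
)
RESET_MSG = "Resetting chip after gpu hang"

# Relevant lines copied from the log above.
SAMPLE = """\
[Mar 8 12:21] [drm] GPU HANG: ecode 7:1:0xf4ebffff, in Xorg [1731], reason: No progress on bcs0, action: reset
[  +0.000038] i915 0000:00:02.0: Resetting chip after gpu hang
[  +8.086583] i915 0000:00:02.0: Resetting chip after gpu hang
[  +7.935792] i915 0000:00:02.0: Resetting chip after gpu hang
[Mar 8 12:22] i915 0000:00:02.0: Resetting chip after gpu hang
[  +7.935842] i915 0000:00:02.0: Resetting chip after gpu hang
[  +8.063832] i915 0000:00:02.0: Resetting chip after gpu hang
"""


def summarize(log_text):
    """Return (fields of the first hang line or None, number of resets)."""
    hang = None
    resets = 0
    for line in log_text.splitlines():
        match = HANG_RE.search(line)
        if match and hang is None:
            hang = match.groupdict()
        if RESET_MSG in line:
            resets += 1
    return hang, resets


if __name__ == "__main__":
    hang, resets = summarize(SAMPLE)
    print(hang)
    print("resets:", resets)
```

On the lines quoted here, the sketch reports Xorg (PID 1731) as the process named in the hang message and counts six resets.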
Comment 3 Chris Wilson 2018-03-08 17:15:45 UTC
Oddly the kernel is blaming the semaphore-wait as the guilty party in the first hang. It's waiting for the stuck paraview batch, which fortunately is also captured.
Comment 4 Giuseppe Bilotta 2018-03-12 08:56:37 UTC
The system was under rather heavy load at the time. Could this have interfered?
Comment 5 Elizabeth 2018-03-12 15:29:43 UTC
That may be the case. Could you try to find a way to easily reproduce the hang?
Comment 6 Giuseppe Bilotta 2018-03-13 06:22:04 UTC
I will give it a try.
Comment 7 GitLab Migration User 2019-09-25 19:10:04 UTC
-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/mesa/mesa/issues/1703.