Bug 105397 - [hsw] GPU HANG in paraview
Summary: [hsw] GPU HANG in paraview
Status: NEEDINFO
Alias: None
Product: Mesa
Classification: Unclassified
Component: Drivers/DRI/i965
Version: unspecified
Hardware: x86-64 (AMD64) Linux (All)
Importance: medium normal
Assignee: Intel 3D Bugs Mailing List
QA Contact: Intel 3D Bugs Mailing List
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2018-03-08 11:36 UTC by Giuseppe Bilotta
Modified: 2018-03-13 06:22 UTC

See Also:
i915 platform: HSW
i915 features: GPU hang


Attachments
/sys/class/drm/card0/error contents (52.69 KB, text/x-log)
2018-03-08 11:36 UTC, Giuseppe Bilotta

Description Giuseppe Bilotta 2018-03-08 11:36:45 UTC
Created attachment 137889
/sys/class/drm/card0/error contents

I just experienced this hang while using ParaView to visualize a largish dataset. Some time after the hang and reset, ParaView crashed.

This is a Dell XPS 15 9530 with an i7-4712HQ CPU @ 2.30GHz, running Debian unstable.

uname -a: Linux oblomov 4.15.0-1-amd64 #1 SMP Debian 4.15.4-1 (2018-02-18) x86_64 GNU/Linux

* paraview 5.4.1+dfsg3-1+b2
* xorg 1:7.7+19
* mesa 17.3.6-1
* libdrm-intel1 2.4.90-1

The crash dump is attached.
Comment 1 Elizabeth 2018-03-08 15:56:46 UTC
Not sure if this information is relevant:

From the error decode:

0x7fda78a4:      0x02850000:    buffer 0: valid, type 0x0085, src offset 0x0000 bytes
0x7fda78a8:      0x11230000:    (X, Y, 0.0, 1.0), dst offset 0x00 bytes
0x7fda78ac:      0x06850000:    buffer 1: valid, type 0x0085, src offset 0x0000 bytes
0x7fda78b0:      0x11230000:    (X, Y, 0.0, 1.0), dst offset 0x00 bytes
0x7fda78b4:      0x04870000:    buffer 1: invalid, type 0x0087, src offset 0x0000 bytes
0x7fda78b8:      0x12520000:    (X, 0.0, VID, 0.0), dst offset 0x00 bytes
0x7fda78bc:      0x00000000:    buffer 0: invalid, type 0x0000, src offset 0x0000 bytes
0x7fda78c0:      0x00000000:    (nostore, nostore, nostore, nostore), dst offset 0x00 bytes
0x7fda78c4:      0x00000000:    buffer 0: invalid, type 0x0000, src offset 0x0000 bytes
0x7fda78c8:      0x00000000:    (nostore, nostore, nostore, nostore), dst offset 0x00 bytes
0x7fda78cc:      0x00000000:    buffer 0: invalid, type 0x0000, src offset 0x0000 bytes
0x7fda78d0:      0x00000000:    (nostore, nostore, nostore, nostore), dst offset 0x00 bytes
0x7fda78d4:      0x00000000:    buffer 0: invalid, type 0x0000, src offset 0x0000 bytes
0x7fda78d8:      0x00000000:    (nostore, nostore, nostore, nostore), dst offset 0x00 bytes
0x7fda78dc:      0x00000000:    buffer 0: invalid, type 0x0000, src offset 0x0000 bytes
0x7fda78e0:      0x00000000:    (nostore, nostore, nostore, nostore), dst offset 0x00 bytes
0x7fda78e4:      0x00000000:    buffer 0: invalid, type 0x0000, src offset 0x0000 bytes
0x7fda78e8:      0x00000000:    (nostore, nostore, nostore, nostore), dst offset 0x00 bytes
0x7fda78ec:      0x00000000:    buffer 0: invalid, type 0x0000, src offset 0x0000 bytes
0x7fda78f0:      0x00000000:    (nostore, nostore, nostore, nostore), dst offset 0x00 bytes
0x7fda78f4:      0x00000000:    buffer 0: invalid, type 0x0000, src offset 0x0000 bytes
0x7fda78f8:      0x00000000:    (nostore, nostore, nostore, nostore), dst offset 0x00 bytes
0x7fda78fc:      0x00000000:    buffer 0: invalid, type 0x0000, src offset 0x0000 bytes
0x7fda7900:      0x00000000:    (nostore, nostore, nostore, nostore), dst offset 0x00 bytes
0x7fda7904:      0x00000000:    buffer 0: invalid, type 0x0000, src offset 0x0000 bytes
0x7fda7908:      0x00000000:    (nostore, nostore, nostore, nostore), dst offset 0x00 bytes
0x7fda790c:      0x00000000:    buffer 0: invalid, type 0x0000, src offset 0x0000 bytes
0x7fda7910:      0x00000000:    (nostore, nostore, nostore, nostore), dst offset 0x00 bytes
0x7fda7914:      0x00000000:    buffer 0: invalid, type 0x0000, src offset 0x0000 bytes
0x7fda7918:      0x00000000:    (nostore, nostore, nostore, nostore), dst offset 0x00 bytes
0x7fda791c:      0x00000000:    buffer 0: invalid, type 0x0000, src offset 0x0000 bytes
0x7fda7920:      0x00000000:    (nostore, nostore, nostore, nostore), dst offset 0x00 bytes
0x7fda7924:      0x00000000:    buffer 0: invalid, type 0x0000, src offset 0x0000 bytes
0x7fda7928:      0x00000000:    (nostore, nostore, nostore, nostore), dst offset 0x00 bytes
0x7fda792c:      0x00000000:    buffer 0: invalid, type 0x0000, src offset 0x0000 bytes
0x7fda7930:      0x00000000:    (nostore, nostore, nostore, nostore), dst offset 0x00 bytes
0x7fda7934:      0x00000000:    buffer 0: invalid, type 0x0000, src offset 0x0000 bytes
0x7fda7938:      0x00000000:    (nostore, nostore, nostore, nostore), dst offset 0x00 bytes
0x7fda793c:      0x00000000:    buffer 0: invalid, type 0x0000, src offset 0x0000 bytes
0x7fda7940:      0x00000000:    (nostore, nostore, nostore, nostore), dst offset 0x00 bytes
0x7fda7944:      0x00000000:    buffer 0: invalid, type 0x0000, src offset 0x0000 bytes
0x7fda7948:      0x00000000:    (nostore, nostore, nostore, nostore), dst offset 0x00 bytes
0x7fda794c:      0x00000000:    buffer 0: invalid, type 0x0000, src offset 0x0000 bytes
0x7fda7950:      0x00000000:    (nostore, nostore, nostore, nostore), dst offset 0x00 bytes
0x7fda7954:      0x00000000:    buffer 0: invalid, type 0x0000, src offset 0x0000 bytes
0x7fda7958:      0x00000000:    (nostore, nostore, nostore, nostore), dst offset 0x00 bytes
0x7fda795c:      0x00000000:    buffer 0: invalid, type 0x0000, src offset 0x0000 bytes
0x7fda7960:      0x00000000:    (nostore, nostore, nostore, nostore), dst offset 0x00 bytes
0x7fda7964:      0x00000000:    buffer 0: invalid, type 0x0000, src offset 0x0000 bytes
0x7fda7968:      0x00000000:    (nostore, nostore, nostore, nostore), dst offset 0x00 bytes
0x7fda796c:      0x00000000:    buffer 0: invalid, type 0x0000, src offset 0x0000 bytes
0x7fda7970:      0x00000000:    (nostore, nostore, nostore, nostore), dst offset 0x00 bytes
0x7fda7974:      0x00000000:    buffer 0: invalid, type 0x0000, src offset 0x0000 bytes
0x7fda7978:      0x00000000:    (nostore, nostore, nostore, nostore), dst offset 0x00 bytes
0x7fda797c:      0x00000000:    buffer 0: invalid, type 0x0000, src offset 0x0000 bytes
0x7fda7980:      0x00000000:    (nostore, nostore, nostore, nostore), dst offset 0x00 bytes
0x7fda7984:      0x00000000:    buffer 0: invalid, type 0x0000, src offset 0x0000 bytes
0x7fda7988:      0x00000000:    (nostore, nostore, nostore, nostore), dst offset 0x00 bytes
0x7fda798c:      0x00000000:    buffer 0: invalid, type 0x0000, src offset 0x0000 bytes
0x7fda7990:      0x00000000:    (nostore, nostore, nostore, nostore), dst offset 0x00 bytes
0x7fda7994:      0x00000000:    buffer 0: invalid, type 0x0000, src offset 0x0000 bytes
0x7fda7998:      0x00000000:    (nostore, nostore, nostore, nostore), dst offset 0x00 bytes
0x7fda799c:      0x00000000:    buffer 0: invalid, type 0x0000, src offset 0x0000 bytes
0x7fda79a0:      0x00000000:    (nostore, nostore, nostore, nostore), dst offset 0x00 bytes
0x7fda79a4:      0x00000000:    buffer 0: invalid, type 0x0000, src offset 0x0000 bytes
0x7fda79a8:      0x00000000:    (nostore, nostore, nostore, nostore), dst offset 0x00 bytes
0x7fda79ac:      0x00000000:    buffer 0: invalid, type 0x0000, src offset 0x0000 bytes
0x7fda79b0:      0x00000000:    (nostore, nostore, nostore, nostore), dst offset 0x00 bytes
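For anyone reading the dump above by hand, the dword pairs can be unpacked with a few shifts. The following is only a sketch: the bit positions (buffer index in bits 31:26, valid flag in bit 25, type in bits 24:16, one 3-bit control code per component in the second dword) are inferred from the decoded lines in this excerpt, not taken from hardware documentation, and the function name is hypothetical. Control codes that do not occur in the excerpt are reported as "unknown". The offsets are left out because they are all zero here.

```python
# Sketch: unpack one vertex-element dword pair as shown in the decode above.
# Bit positions are an assumption inferred from the decoded lines in this
# excerpt; decode_vertex_element is a hypothetical helper name.

_SOURCE_NAMES = "XYZW"  # control code 1 stores the source component
_CONTROL_NAMES = {0: "nostore", 2: "0.0", 3: "1.0", 5: "VID"}


def decode_vertex_element(dw0, dw1):
    components = []
    for position, shift in enumerate((28, 24, 20, 16)):
        control = (dw1 >> shift) & 0x7
        if control == 1:
            components.append(_SOURCE_NAMES[position])
        else:
            components.append(_CONTROL_NAMES.get(control, "unknown"))
    return {
        "buffer": (dw0 >> 26) & 0x3F,
        "valid": bool(dw0 & (1 << 25)),
        "type": (dw0 >> 16) & 0x1FF,
        "components": tuple(components),
    }


if __name__ == "__main__":
    # The first three pairs from the decode above.
    for pair in ((0x02850000, 0x11230000),
                 (0x06850000, 0x11230000),
                 (0x04870000, 0x12520000)):
        print(decode_vertex_element(*pair))
```

Under these assumptions, the third pair decodes to buffer 1, not valid, type 0x87, which agrees with the "buffer 1: invalid" line in the dump.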
Comment 2 Giuseppe Bilotta 2018-03-08 17:13:01 UTC
I don't know if this is relevant, but the GPU reset repeatedly in a relatively short time. This is the dmesg output from the first reset to the paraview segfault:


[Mar 8 12:21] [drm] GPU HANG: ecode 7:1:0xf4ebffff, in Xorg [1731], reason: No progress on bcs0, action: reset
[  +0.000004] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[  +0.000002] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[  +0.000001] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[  +0.000002] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
[  +0.000002] [drm] GPU crash dump saved to /sys/class/drm/card0/error
[  +0.000038] i915 0000:00:02.0: Resetting chip after gpu hang
[  +8.086583] i915 0000:00:02.0: Resetting chip after gpu hang
[  +7.935792] i915 0000:00:02.0: Resetting chip after gpu hang
[Mar 8 12:22] i915 0000:00:02.0: Resetting chip after gpu hang
[  +7.935842] i915 0000:00:02.0: Resetting chip after gpu hang
[  +8.063832] i915 0000:00:02.0: Resetting chip after gpu hang
[Mar 8 12:24] paraview[13787]: segfault at 5572c408d160 ip 00005572c408d160 sp 00007fff7cefda78 error 15
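To summarize a log like the one above without reading it line by line, a small script can pull out the hang header fields and count the resets. This is a rough sketch: the regular expression is written against the exact message wording shown in this excerpt and may not match other kernel versions, and the function name is hypothetical.

```python
# Sketch: extract GPU hang headers and count chip resets in dmesg text.
# The patterns mirror the message wording in the log above (an assumption
# for other kernels); summarize_hangs is a hypothetical helper name.
import re

HANG_RE = re.compile(
    r"GPU HANG: ecode (?P<ecode>[^,\s]+), in (?P<process>\S+) \[(?P<pid>\d+)\], "
    r"reason: (?P<reason>.+?), action: (?P<action>\w+)"
)
RESET_MARKER = "Resetting chip after gpu hang"


def summarize_hangs(dmesg_text):
    hangs = [match.groupdict() for match in HANG_RE.finditer(dmesg_text)]
    resets = sum(1 for line in dmesg_text.splitlines() if RESET_MARKER in line)
    return hangs, resets


if __name__ == "__main__":
    sample = (
        "[Mar 8 12:21] [drm] GPU HANG: ecode 7:1:0xf4ebffff, in Xorg [1731], "
        "reason: No progress on bcs0, action: reset\n"
        "[  +0.000038] i915 0000:00:02.0: Resetting chip after gpu hang\n"
        "[  +8.086583] i915 0000:00:02.0: Resetting chip after gpu hang\n"
    )
    hangs, resets = summarize_hangs(sample)
    print(hangs, resets)
```

Run against the full log above, this would report one hang header, attributed to Xorg on bcs0, followed by six reset lines.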


Is it possible that the crash dump got overwritten during the reset?
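Whatever the answer for this report, one way to rule out the question for a future hang is to copy the dump as soon as the first hang shows up in dmesg. A minimal sketch, assuming the dump path printed in the log above. The function name, the destination directory and the file naming are hypothetical, and reading the sysfs file may require root.

```python
# Sketch: save a timestamped copy of the i915 error state.
# The source path comes from the dmesg output above; snapshot_error_state,
# the destination directory and the file name pattern are hypothetical.
from datetime import datetime, timezone
from pathlib import Path


def snapshot_error_state(src="/sys/class/drm/card0/error", dest_dir="."):
    # Read the whole file first: sysfs entries do not report a useful size.
    data = Path(src).read_bytes()
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    dest = Path(dest_dir) / f"i915-error-{stamp}.log"
    dest.write_bytes(data)
    return dest


if __name__ == "__main__":
    print(snapshot_error_state())
```

Comparing copies taken after successive resets would show directly whether the contents change.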
Comment 3 Chris Wilson 2018-03-08 17:15:45 UTC
Oddly, the kernel is blaming the semaphore-wait as the guilty party in the first hang. It's waiting for the stuck paraview batch, which fortunately is also captured.
Comment 4 Giuseppe Bilotta 2018-03-12 08:56:37 UTC
The system was under rather heavy load at the time. Could this have interfered?
Comment 5 Elizabeth 2018-03-12 15:29:43 UTC
That may be the case. Could you try to find a way to reproduce the hang easily?
Comment 6 Giuseppe Bilotta 2018-03-13 06:22:04 UTC
I will give it a try.

