90008 – [SKL]Piglit sporadically causes GPU hang

Summary: [SKL]Piglit sporadically causes GPU hang

Status:	VERIFIED FIXED

Alias:	None

Product:	Mesa
Classification:	Unclassified
Component:	Drivers/DRI/i965
Version:	git
Hardware:	All Linux (All)

Importance:	highest blocker
Assignee:	Ben Widawsky
QA Contact:	Intel 3D Bugs Mailing List

Reported:	2015-04-13 08:05 UTC by lu hua
Modified:	2015-07-03 18:14 UTC
CC List:	7 users

Attachments
piglit case list (221.17 KB, application/gzip) 2015-04-13 08:08 UTC, lu hua
result list (1.67 MB, text/plain) 2015-04-13 08:11 UTC, lu hua
piglit list 2 (1.75 KB, text/plain) 2015-04-13 08:12 UTC, lu hua
dmesg (390.43 KB, text/plain) 2015-05-20 06:16 UTC, lu hua
dmesg(without kernel.printk="7417") (148.77 KB, text/plain) 2015-05-20 06:30 UTC, lu hua
gpu hang error state (375.52 KB, application/gzip) 2015-05-20 08:13 UTC, Tapani Pälli
error state (205.34 KB, text/plain) 2015-05-29 07:28 UTC, lu hua
dmesg_piglit_skly05_gpuhang_notsystemhang_0618 (2.52 MB, text/plain) 2015-06-18 05:57 UTC, wendy.wang
dmesg_skly05_piglit_gpuhang_systemhang_0618 (465.79 KB, text/plain) 2015-06-18 05:57 UTC, wendy.wang
i915_error_state_skly05_piglit_gpuhang_notsystemhang (389.68 KB, text/plain) 2015-06-18 06:02 UTC, wendy.wang
dmesg system hang on fbo-depth-array (125.90 KB, text/plain) 2015-07-03 14:24 UTC, Olivier Berthier

Description lu hua 2015-04-13 08:05:15 UTC

System Environment:
--------------------------
Platform: SKL
Libdrm:		(master)libdrm-2.4.60-31-g6f90b77ea903756c87ae614c093e3d816ebb26fc
Mesa:		(master)50e9fa2ed69cb5f76f66231976ea789c0091a64d
Xserver:(master)xorg-server-1.17.0-72-gf1da6bf5d94911e78d2e27e6accf0c6e3aefb331
Xf86_video_intel:(master)2.99.917-256-gfbefc8f2bd4242c3f01b02e25276340237b34a88
Libva:		(master)062a63932c0f1439aa587aa986bbcfb758ff38f2
Libva_intel_driver:(master)ed03aebc6e702dab65204cc1469eef0da73e2372
Kernel:   (drm-intel-nightly)044307a99b418258ac0d775460d73b20b80277c1

Bug detailed description:
-----------------------------
Piglit sporadically causes a system hang. When the full piglit suite is run for multiple rounds, the hang happens on a different test case each time.
Run the attached piglit case list and execute the result list: the system hangs. Then run list 2: it does not cause a system hang.

Reproduce steps:
---------------------------- 
1. xinit
2. Run the attached piglit list.

Comment 1 lu hua 2015-04-13 08:08:39 UTC

Created attachment 115046 [details]
piglit case list

Comment 2 lu hua 2015-04-13 08:11:42 UTC

Created attachment 115047 [details]
result list

Comment 3 lu hua 2015-04-13 08:12:08 UTC

Created attachment 115048 [details]
piglit list 2

Comment 4 lu hua 2015-04-14 01:29:16 UTC

I am not sure whether this is a regression. Running the full piglit suite also hits the GPU hang or system hang reported in bug 89493 and bug 89037.
These three bugs occur randomly.

Comment 5 lu hua 2015-04-16 07:01:09 UTC

It also happens on BSW.

Comment 6 lu hua 2015-05-20 06:16:26 UTC

Created attachment 115907 [details]
dmesg

Tested the latest Mesa master branch and the latest drm-intel-nightly kernel on BSW. Running the full piglit suite causes a GPU hang and then a system hang; the dmesg is attached.

[  378.564175] [drm] GPU HANG: ecode 8:0:0x85dffdfb, in ext_framebuffer [7399], reason: Ring hung, action: reset
[  378.564269] [drm:i915_reset_and_wakeup] resetting chip
[  378.565766] drm/i915: Resetting chip after gpu hang
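
When comparing many dmesg attachments, GPU HANG lines like the one above can be split into fields mechanically. The following is a minimal Python sketch, not part of any i915 or piglit tooling; the regular expression and the name parse_gpu_hang are illustrative assumptions derived only from the line format quoted in this comment, and other kernels may print a different format.

```python
import re

# Pattern inferred from the dmesg line quoted in this comment (assumption:
# the i915 message layout may differ on other kernel versions).
GPU_HANG_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+\[drm\] GPU HANG: ecode (?P<ecode>\S+), "
    r"in (?P<comm>.+?) \[(?P<pid>\d+)\], reason: (?P<reason>[^,]+), "
    r"action: (?P<action>\S+)"
)


def parse_gpu_hang(line):
    """Return a dict describing a GPU HANG line, or None if it does not match."""
    match = GPU_HANG_RE.search(line)
    if match is None:
        return None
    info = match.groupdict()
    info["ts"] = float(info["ts"])   # seconds since boot
    info["pid"] = int(info["pid"])   # PID of the process blamed for the hang
    return info


sample = ("[  378.564175] [drm] GPU HANG: ecode 8:0:0x85dffdfb, "
          "in ext_framebuffer [7399], reason: Ring hung, action: reset")
hang = parse_gpu_hang(sample)
print(hang["comm"], hang["pid"], hang["ecode"], hang["reason"])
```

Grouping the parsed records by process name is one way to check whether the hang really moves between tests from run to run.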

Comment 7 lu hua 2015-05-20 06:30:36 UTC

Created attachment 115908 [details]
dmesg(without kernel.printk="7417")

The dmesg in comment 6 was captured with "sysctl -w kernel.printk="7417"" set.
Retested without "sysctl -w kernel.printk="7417""; the call trace is clear.

[  761.617901] BUG: unable to handle kernel paging request at 00007f9ee1236008
[  761.704932] IP: [<ffffffff817ade61>] error_entry+0x1/0x5b
[  761.773069] PGD 175b80067 PUD 0 
[  761.815214] Thread overran stack, or stack corrupted
[  761.878022] Oops: 0002 [#1] SMP 
[  761.920190] Modules linked in: ipv6 dm_mod snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic iTCO_wdt iTCO_vendor_support serio_raw pcspkr i2c_i801 lpc_ich mfd_core snd_hda_intel snd_hda_codec snd_hda_core snd_hwdep snd_pcm snd_timer snd soundcore battery ac acpi_cpufreq i915 button[  762.231906] gmain[4824]: segfault at a940 ip 000000000000a940 sp 00007fea30e81d80 error 14 in accounts-daemon[400000+26000]

[  762.376039]  video drm_kms_helper drm
[  762.430270] CPU: 1 PID: 5360 Comm: python Not tainted 4.1.0-rc3_drm-intel-nightly_056608_20150519+ #410
[  762.547037] task: ffff880178216240 ti: ffff88006ae40000 task.ti: ffff88006ae40000
[  762.640897] RIP: 0010:[<ffffffff817ade61>]  [<ffffffff817ade61>] error_entry+0x1/0x5b
[  762.739098] RSP: 0000:ffff88006ae43ef0  EFLAGS: 00010092
[  762.806946] RAX: 0000000000000004 RBX: fa9af535ea216240 RCX: 0000000000000000
[  762.896755] RDX: 00007f09f5cfc0c0 RSI: 00007f09f791a3b0 RDI: 00007f09fa4ee5e9
[  762.986599] RBP: ec5ca7b044216240 R08: ffff880175474a80 R09: 00007f09e8022000
[  763.076428] R10: ffff880178216620 R11: 9b96372c6f000000 R12: c490f629e7b55d88
[  763.166257] R13: 9f4a419c87103000 R14: 286a50d5df040d87 R15: df02be047c216240
[  763.256080] FS:  00007f09f5cfd700(0000) GS:ffff88017fc80000(0000) knlGS:0000000000000000
[  763.357378] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[  763.430585] CR2: 00007f9ee1236008 CR3: 00000001754bf000 CR4: 00000000001006e0
[  763.520480] Stack:
[  763.548896]  ffffffff817add1c 0000000000000000 ffff880178216240 ffffffff8103f557
[  763.642527]  ffff88017a411128 ffff88017a411128 0000000000000000 0000000001128fb0
[  763.736242]  0000000001425ac0 00007f09f7cf3a70 f4e4bcc0d163fdd0 0d124f3036000000
[  763.830004] Call Trace:
[  763.863917]  [<ffffffff817add1c>] ? page_fault+0xc/0x30
[  763.931234]  [<ffffffff8103f557>] ? task_stopped_code+0x3a/0x3a
[  764.006943]  [<ffffffff817ade61>] ? error_entry+0x1/0x5b
[  764.075405]  [<ffffffff817add1c>] ? page_fault+0xc/0x30
[  764.142824] Code: 4c 8b 44 24 48 48 [  764.181178] PANIC: double fault, error_code: 0x0
[  764.181186] CPU: 2 PID: 5373 Comm: ext_framebuffer Not tainted 4.1.0-rc3_drm-intel-nightly_056608_20150519+ #410
[  764.181192] task: ffff880178866240 ti: ffff88006ae54000 task.ti: ffff88006ae54000
[  764.181195] RIP: 0010:[<ffffffff817add17>]  [<ffffffff817add17>] page_fault+0x7/0x30
[  764.181208] RSP: 0000:ffff8800201fffd8  EFLAGS: 00010096
[  764.181209] RAX: 00000000817ace77 RBX: 0000000000000001 RCX: ffffffff817ace77
[  764.181212] RDX: 000000000000a940 RSI: 0000000000000000 RDI: ffff880020200098
[  764.181213] RBP: 0000000000000009 R08: 0000000000000000 R09: 0000000000000001
[  764.181215] R10: 0000000000000034 R11: 0000000002104d60 R12: 00007f9ee8da2e10
[  764.181218] R13: 00007f9ee8da0038 R14: 000000000000008e R15: 0000000000000000
[  764.181220] FS:  00007f9ee8dc7780(0000) GS:ffff88017fd00000(0000) knlGS:0000000000000000
[  764.181222] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  764.181224] CR2: ffff8800201fffc8 CR3: 000000006e080000 CR4: 00000000001006e0
[  764.181226] Stack:

[  765.382838] 8b 44 24 50 48 8b 4c 24 58 48 8b 54 24 60 48 8b 74 24 68 48 8b 7c 24 70 48 81 c4 80 00 00 00 e9 10 f0 ff ff fc <4c> 89 5c 24 38 4c 89 54 24 40 4c 89 4c 24 48 4c 89 44 24 50 48 
[  765.603067] RIP  [<ffffffff817ade61>] error_entry+0x1/0x5b
[  765.674383]  RSP <ffff88006ae43ef0>
[  765.721595] CR2: 00007f9ee1236008
[  765.766899] BUG: unable to handle kernel paging request at 0000000000010092
[  765.856096] IP: [<ffffffff81127340>] __d_lookup_rcu+0x65/0x123
[  765.931681] PGD 1754bd067 PUD 179d7e067 PMD 0 
[  765.990816] Oops: 0000 [#2] SMP 
[  766.035223] Modules linked in: ipv6 dm_mod snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic iTCO_wdt iTCO_vendor_support serio_raw pcspkr i2c_i801 lpc_ich mfd_core snd_hda_intel snd_hda_codec snd_hda_core snd_hwdep snd_pcm snd_timer snd soundcore battery ac acpi_cpufreq i915 button video drm_kms_helper drm
[  766.390273] CPU: 1 PID: 5380 Comm: kworker/u8:1 Not tainted 4.1.0-rc3_drm-intel-nightly_056608_20150519+ #410
[  766.515404] task: ffff8801782149b0 ti: ffff88006aec8000 task.ti: ffff88006aec8000
[  766.611410] RIP: 0010:[<ffffffff81127340>]  [<ffffffff81127340>] __d_lookup_rcu+0x65/0x123
[  766.716978] RSP: 0018:ffff88006aecbb88  EFLAGS: 00010206
[  766.786973] RAX: 0000000000000003 RBX: 0000000000010096 RCX: 000000000000000d
[  766.878871] RDX: 0000000000000000 RSI: ffff88006aecbd88 RDI: ffff880076938300
[  766.970767] RBP: 000000000001008e R08: 8080808080808080 R09: fefefefefefefeff
[  767.062698] R10: 2f2f2f2f2f2f2f2f R11: ffff88006aecbc04 R12: 000000037797fe36
[  767.154668] R13: ffff880002cc701d R14: ffff88006aecbd88 R15: ffff880076938300
[  767.246650] FS:  0000000000000000(0000) GS:ffff88017fc80000(0000) knlGS:0000000000000000
[  767.350164] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[  767.425536] CR2: 0000000000010092 CR3: 00000001754bf000 CR4: 00000000001006e0

Comment 8 lu hua 2015-05-20 06:54:26 UTC

(In reply to lu hua from comment #6)
> Created attachment 115907 [details]
> dmesg
> 
> Test the latest mesa master branch and the latest drm-intel-nightly kernel
> on BSW, Run full piglit, it causes GPU hang then system, attached the dmesg.
> 

It causes GPU hang then system hang.

Comment 9 Tapani Pälli 2015-05-20 08:13:02 UTC

Created attachment 115912 [details]
gpu hang error state

Not sure if this helps, but here's an error state from the GPU hang. It is easy to reproduce: just run tests/quick.tests and at some point it will hang. Later on (as described in the comments) the whole system also hangs.

Comment 10 Ben Widawsky 2015-05-28 05:46:36 UTC

Tapani, do you know if it's always the same test which is failing for you? The error state you posted may be fixed by 5ae6c7bfce5c9fb91ab6cef2ea74a39af091d5f6 in master. Just a hunch (EU 0 in the GS went out to lunch).

Comment 11 Tapani Pälli 2015-05-29 07:02:45 UTC

(In reply to Ben Widawsky from comment #10)
> Tapani, do you know if it's always the same test which is failing for you?
> The error state you posted may be fixed by
> 5ae6c7bfce5c9fb91ab6cef2ea74a39af091d5f6 in master. Just a hunch (EU 0 in
> the GS went out to lunch).

OK thanks, I've pulled drm-nightly and current Mesa at 065978d and will try to reproduce.

As additional info for the GPU hangs, I've found that the following tests cause hangs reliably:

arb_gpu_shader5/execution/sampler_array_indexing/gs-weird-uniforms.shader_test
arb_gpu_shader5/execution/sampler_array_indexing/fs-weird-uniforms.shader_test

Still not sure what makes the whole machine hang; will keep digging.

Comment 12 lu hua 2015-05-29 07:28:41 UTC

Created attachment 116141 [details]
error state

Tested on the latest Mesa master branch, commit 10aacf5ae8f3e90e2f0967fbdcf96df93e346e20.
Running the full piglit suite on 1910 (rev 02) produces a GPU hang but no system hang; the error state is attached.
Running the full piglit suite on 190c (rev 03), I hit a system hang twice.
Running the full piglit suite on BSW twice, I did not hit a system hang but did see a GPU hang.

Comment 13 Tapani Pälli 2015-05-29 07:47:50 UTC

To add to comment #11, it seems many of the dynamic sampler array indexing tests for fs and gs cause hangs, for example:

arb_gpu_shader5/execution/sampler_array_indexing/fs-simple.shader_test
arb_gpu_shader5/execution/sampler_array_indexing/fs-nonzero-base.shader_test

However, the vs ones seem to pass; maybe this helps.

Comment 14 Tapani Pälli 2015-05-29 08:47:03 UTC

Some other GPU hang reproducers (likely a few different issues here):

bin/tex-miplevel-selection *GradARB Cube -auto -fbo
bin/tex-miplevel-selection textureGrad CubeArray -auto -fbo

Comment 15 wendy.wang 2015-05-29 10:57:44 UTC

(In reply to lu hua from comment #12)
> Created attachment 116141 [details]
> error state
> 
> Test on the latest mesa master branch,commit
> 10aacf5ae8f3e90e2f0967fbdcf96df93e346e20.
> Run full piglit case on 1910（rev 02）, it has gpu hang but not system hang,
> attached the error state.
> Run full piglit case on 190c (rev 03), I meet twice system hang.
> Run full piglit case on BSW twice, I don't meet system but see GPU hang.

Based on this test result, I am removing the BSW platform from this bug's title.

Comment 16 Anuj Phogat 2015-05-29 18:53:05 UTC

(In reply to Tapani Pälli from comment #13 and comment #14)
I'm able to reproduce these GPU hangs on SKL. No system hang.

Comment 17 Anuj Phogat 2015-05-29 21:17:28 UTC

(In reply to Anuj Phogat from comment #16)
> I'm able to reproduce these GPU hangs on SKL. No system hang.
sampler_array_indexing tests don't hang with latest mesa master. Failures are fixed by Neil's patch on mailing list: http://patchwork.freedesktop.org/patch/50710/

Comment 18 Marta Löfstedt 2015-06-03 08:15:58 UTC

Neil's patch does not help for SKL-Y. I still get a full system hang after 2 GPU hangs when running the quick.py piglit set. So the patch only seems to solve the problem for SKL-S.

Comment 19 Ben Widawsky 2015-06-03 16:08:19 UTC

Marta, can you please add the error state.

Comment 20 Neil Roberts 2015-06-03 16:25:02 UTC

Could you please also test with this patch?

http://patchwork.freedesktop.org/patch/50676/

Without that patch the GS tests for sampler array indexing are failing. On my SKL-Y machine once one of those GS tests fails some of the other sampler array indexing tests seem to start failing too. I wonder if it puts the hardware in some broken state.

Comment 21 lu hua 2015-06-05 08:18:57 UTC

(In reply to Neil Roberts from comment #20)
> Could you please also test with this patch?
> 
> http://patchwork.freedesktop.org/patch/50676/
> 
> Without that patch the GS tests for sampler array indexing are failing. On
> my SKL-Y machine once one of those GS tests fails some of the other sampler
> array indexing tests seem to start failing too. I wonder if it puts the
> hardware in some broken state.

With this patch applied, the GPU hang still exists.

Comment 22 Gavin Hindman 2015-06-11 18:32:36 UTC

I was under the impression that this issue is resolved, at least from the Mesa side.  Is that not so, or is this issue not updated?

Comment 23 Ben Widawsky 2015-06-11 18:34:36 UTC

Can someone from QA please confirm it exists on master from today?

Comment 24 lu hua 2015-06-12 06:09:05 UTC

Tested on the latest Mesa master branch; the GPU hang still exists.
Run: bin/tex-miplevel-selection *GradARB Cube -auto -fbo

dmesg:
[13484.007161] [drm:i915_gem_open]
[13484.033555] [drm:i915_gem_context_create_ioctl] HW context 1 created
[13489.597268] [drm] stuck on render ring
[13489.597800] [drm] GPU HANG: ecode 9:0:0x85dffffb, in tex-miplevel-se [7157], reason: Ring hung, action: reset
[13489.597827] [drm:i915_reset_and_wakeup] resetting chip
[13489.600043] drm/i915: Resetting chip after gpu hang
[13489.600076] [drm:gen8_init_common_ring] Execlists enabled for render ring
[13489.600079] [drm:gen8_init_common_ring] Execlists enabled for bsd ring
[13489.600081] [drm:gen8_init_common_ring] Execlists enabled for blitter ring
[13489.600083] [drm:gen8_init_common_ring] Execlists enabled for video enhancement ring
[13491.597002] [drm] RC6 on
[13495.596368] [drm] stuck on render ring
[13495.596581] [drm] GPU HANG: ecode 9:0:0x85dffffb, in tex-miplevel-se [7157], reason: Ring hung, action: reset
[13495.596605] [drm:i915_reset_and_wakeup] resetting chip
[13495.598694] drm/i915: Resetting chip after gpu hang
[13495.598728] [drm:gen8_init_common_ring] Execlists enabled for render ring
[13495.598730] [drm:gen8_init_common_ring] Execlists enabled for bsd ring
[13495.598733] [drm:gen8_init_common_ring] Execlists enabled for blitter ring
[13495.598735] [drm:gen8_init_common_ring] Execlists enabled for video enhancement ring
[13495.601226] [drm:i915_gem_context_destroy_ioctl] HW context 1 destroyed
[13497.596361] [drm] RC6 on
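
The log above shows the same process hanging the render ring a second time shortly after the first reset. A small Python sketch for measuring the gap between consecutive hang reports follows; the function name hang_intervals and the regular expression are hypothetical and assume the timestamp and "[drm] GPU HANG:" prefix shown in this comment.

```python
import re

# Timestamp prefix and "GPU HANG" marker as they appear in the log above
# (assumed format; adjust for other kernels).
HANG_TS_RE = re.compile(r"^\[\s*(\d+\.\d+)\]\s+\[drm\] GPU HANG:")


def hang_intervals(dmesg_text):
    """Return the seconds elapsed between consecutive GPU HANG lines."""
    stamps = []
    for line in dmesg_text.splitlines():
        match = HANG_TS_RE.match(line)
        if match:
            stamps.append(float(match.group(1)))
    return [later - earlier for earlier, later in zip(stamps, stamps[1:])]


log = """\
[13489.597268] [drm] stuck on render ring
[13489.597800] [drm] GPU HANG: ecode 9:0:0x85dffffb, in tex-miplevel-se [7157], reason: Ring hung, action: reset
[13495.596368] [drm] stuck on render ring
[13495.596581] [drm] GPU HANG: ecode 9:0:0x85dffffb, in tex-miplevel-se [7157], reason: Ring hung, action: reset
"""
intervals = hang_intervals(log)
print(intervals)
```

For the excerpt quoted here the list has a single entry of roughly six seconds, since only two hang reports are present.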

Comment 25 Ben Widawsky 2015-06-12 06:11:45 UTC

Please open a new bug for that failure. If the sporadic hangs are gone, please close this bug.

Thanks.

Comment 26 lu hua 2015-06-12 06:37:45 UTC

Running the full piglit suite on 190c (rev 03) still causes a system hang.

Comment 27 Ben Widawsky 2015-06-12 17:08:23 UTC

Okay. The miplevel selection test you mention (https://bugs.freedesktop.org/show_bug.cgi?id=90008#c24) has had issues on many platforms. I'd prefer to ignore that, or again, file a new bug for the SKL hang only.

Comment 28 Ben Widawsky 2015-06-16 20:15:29 UTC

Tracking the system hang separately. (https://bugs.freedesktop.org/show_bug.cgi?id=90854)

Re-titling this

Comment 29 wendy.wang 2015-06-18 05:55:49 UTC

Command used: ./piglit-run.py -1 -x glean -x glx -x fbo-depth-array gpu /tmp

Ran 4 cycles on x-skly05. Every cycle had a GPU hang, and two of them also had a system hang.
Attached dmesg and i915_error_state.
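
The multi-cycle run described in this comment could be scripted along the following lines. This is only a sketch: it rebuilds the piglit-run.py invocation quoted above, while the helper name piglit_command and the per-cycle /tmp/cycleN results directories are hypothetical (the original command wrote to /tmp).

```python
import shlex


def piglit_command(results_dir,
                   excludes=("glean", "glx", "fbo-depth-array"),
                   profile="gpu"):
    """Build the argv for the piglit-run.py invocation quoted in this comment."""
    argv = ["./piglit-run.py", "-1"]
    for pattern in excludes:
        argv += ["-x", pattern]      # one -x option per excluded test pattern
    argv += [profile, results_dir]   # test profile, then results directory
    return argv


# Four cycles, as in the report; the directory naming is an assumption.
commands = [piglit_command(f"/tmp/cycle{n}") for n in range(1, 5)]
for argv in commands:
    print(shlex.join(argv))
```

Each argv list can be handed to subprocess.run from inside a piglit checkout; saving dmesg after every cycle makes it easier to tell which cycle ended in a GPU hang and which in a system hang.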

Comment 30 wendy.wang 2015-06-18 05:57:03 UTC

Created attachment 116568 [details]
dmesg_piglit_skly05_gpuhang_notsystemhang_0618

Comment 31 wendy.wang 2015-06-18 05:57:33 UTC

Created attachment 116569 [details]
dmesg_skly05_piglit_gpuhang_systemhang_0618

Comment 32 wendy.wang 2015-06-18 06:02:01 UTC

Created attachment 116570 [details]
i915_error_state_skly05_piglit_gpuhang_notsystemhang

Comment 33 Ben Widawsky 2015-06-23 01:26:34 UTC

Can you please test master as of today? I pushed a patch which fixes some other hangs.

Comment 34 lu hua 2015-06-25 07:06:50 UTC

Ran the full piglit suite for 3 cycles on Mesa commit 6844d6b7f8398a25eff511541b187afeb1199ce0; there was no GPU hang or system hang. Closing.

Comment 35 Olivier Berthier 2015-07-03 14:24:47 UTC

Created attachment 116920 [details]
dmesg system hang on fbo-depth-array

The system still hangs during the test ext_texture_array@fbo-depth-array.

Setup:
-------
Hardware
Platform: SKY LAKE Y A0 
CPU : Intel(R) Core(TM) m3-6Y30 CPU @ 0.8GHz 4MB (family: 6, model: 78  stepping: 3)
MCP : SKL-Y  D1  2+2 (or ULX-D1)
QDF : QYV3 
CPU : SKL D0
Chipset PCH: Sunrise Point LP C1       
CRB : SKY LAKE Y LPDDR3 RVP3 CRB FAB2
Reworks : All Mandatories + FBS02 & FBS03, O-06
Software 
Linux : Ubuntu 14.04 LTS 64 bits
BIOS : SKLSE2R1.R00.X085.B02.150601337
ME FW : 11.0.0.1149
Ksc (EC FW): 1.15
Kernel 4.1-0 (drm-intel-nightly-2015-06-27)
Mesa: mesa-10.5.8 (master) 24b043aab73ce066ded6e4bc93f589008dfc8484
Xf86_video_intel: 2.99.917 (master) baec802b21387d04aebb10ac29e719a1800c5aa0
Libdrm: libdrm-2.4.61 (master) 203983f842a889b279698fdea46e83ee4450a1db
libva: libva-1.6.0.pre1 (master) 0f88a645ab3cea69d63371189e53cd465ab95a20
intel-driver: 1.6.0.pre1 (master) f3f74ea23601750078215fad04dde6748364b88d
xorg: 1.17.99 
Xserver: xorg-server-1.17.2 (master) 2123f7682d522619f101b05fb75efa75dabbe371
Piglit: (master) 107318d835dbbf51af55c62abb2aee154822a4c7

Comment 36 Ben Widawsky 2015-07-03 18:14:07 UTC

Welcome Olivier. First, that is not a sporadic failure, and second there is already a bug for that test specifically:
https://bugs.freedesktop.org/show_bug.cgi?id=91062
