53526 – [SNB Regression] gpu hang with calltrace when running piglit case

Summary: [SNB Regression] gpu hang with calltrace when running piglit case

Status:	CLOSED FIXED

Alias:	None

Product:	DRI
Classification:	Unclassified
Component:	DRM/Intel
Version:	unspecified
Hardware:	All Linux (All)

Importance:	high major
Assignee:	Daniel Vetter
QA Contact:

URL:
Whiteboard:
Keywords:

Depends on:
Blocks:

Reported:	2012-08-15 06:05 UTC by lu hua
Modified:	2017-10-06 14:48 UTC
CC List:	5 users

See Also:
i915 platform:
i915 features:

Attachments
dmesg (201.36 KB, text/plain) 2012-08-15 06:05 UTC, lu hua	no flags
dmesg on Sugarbay (56.21 KB, text/plain) 2012-08-22 03:30 UTC, lu hua	no flags
i915_error_state (2.24 MB, text/plain) 2012-08-22 05:08 UTC, lu hua	no flags
dmesg Ivybridge(i7-3610QM) (66.47 KB, text/plain) 2012-08-24 03:33 UTC, lu hua	no flags

Description lu hua 2012-08-15 06:05:32 UTC

Created attachment 65584 [details]
dmesg

System Environment:
--------------------------
Arch:             x86_64
Platform:         Ivybridge
Libdrm:(master)libdrm-2.4.38-3-g3163cfe4db925429760407e77140e2d595338bc2
Mesa:	(master)605f964d5cc7016fc74e0563829fa794da845c20
Xserver:(master)xorg-server-1.12.99.904
Xf86_video_intel:(master)2.20.3-35-g2f4de90709264ad19a3e3f5f0f79f4bba78a760a
Libva:	(staging)f12f80371fb534e6bbf248586b3c17c298a31f4e
Libva_intel_driver:(staging)82fa52510a37ab645daaa3bb7091ff5096a20d0b
Kernel:	(drm-intel-next-queued) dec3ad8d19a4a496b2588bee2bcd7fce3a6731bc

Bug detailed description:
-------------------------
During nightly testing, running the piglit cases hangs the system with a calltrace.
It happens on Ivybridge and Sandybridge with the -queued kernel. It doesn't happen with the -fixes kernel.
It doesn't happen with -queued kernel 20d5a540e55a29daeef12706f9ee73baf5641c16.

Calltrace:
[ 2051.731626]  [<ffffffff81112075>] alloc_buffer_head+0x1c/0x44
[ 2051.734842]  [<ffffffff811121d8>] alloc_page_buffers+0x2d/0xc9
[ 2051.738022]  [<ffffffff811132ef>] __getblk+0x194/0x24f
[ 2051.741163]  [<ffffffff81113452>] __bread+0xb/0x85
[ 2051.744281]  [<ffffffffa01382ff>] ext3_get_branch+0x72/0xf0 [ext3]
[ 2051.747397]  [<ffffffffa013a664>] ext3_get_blocks_handle+0xda/0x9a9 [ext3]
[ 2051.750498]  [<ffffffff810c791f>] ? zone_statistics+0x77/0x80
[ 2051.753580]  [<ffffffffa013afe9>] ext3_get_block+0xb6/0xf6 [ext3]
[ 2051.756649]  [<ffffffff8111b20a>] do_mpage_readpage+0x16d/0x4ed
[ 2051.759730]  [<ffffffff810b18bf>] ? add_to_page_cache_locked+0x77/0xa8
[ 2051.762826]  [<ffffffffa013af33>] ? ext3_get_blocks_handle+0x9a9/0x9a9 [ext3]
[ 2051.765931]  [<ffffffff8111b69e>] mpage_readpages+0xaf/0xf5
[ 2051.769033]  [<ffffffffa013af33>] ? ext3_get_blocks_handle+0x9a9/0x9a9 [ext3]
[ 2051.772139]  [<ffffffff810c791f>] ? zone_statistics+0x77/0x80
[ 2051.775245]  [<ffffffff810e1cef>] ? alloc_pages_current+0xcd/0xee
[ 2051.778362]  [<ffffffffa0138ac2>] ext3_readpages+0x18/0x1a [ext3]
[ 2051.781469]  [<ffffffff810ba0d6>] __do_page_cache_readahead+0x12a/0x1ad
[ 2051.784580]  [<ffffffff810ba411>] ra_submit+0x1c/0x20
[ 2051.787679]  [<ffffffff810b2d67>] filemap_fault+0x159/0x32d
[ 2051.790775]  [<ffffffff810cb444>] __do_fault+0xa7/0x3bb
[ 2051.793866]  [<ffffffff810cd973>] handle_pte_fault+0x28f/0x6b9
[ 2051.796954]  [<ffffffff810cee29>] handle_mm_fault+0x196/0x1ab
[ 2051.800039]  [<ffffffff813cb9bd>] do_page_fault+0x3ad/0x3d2
[ 2051.803135]  [<ffffffff8100c258>] ? syscall_trace_leave+0x3e/0x16a
[ 2051.806242]  [<ffffffff810e6344>] ? kmem_cache_free+0x8a/0xc6
[ 2051.809227]  [<ffffffff813c8e6f>] page_fault+0x1f/0x30
[ 2051.812098] Code: 4d 8b 28 4d 85 ed 75 16 4c 89 f9 83 ca ff 44 89 f6 4c 89 e7 e8 6e a8 2d 00 49 89 c5 eb 2a 49 63 44 24 20 49 8b 3c 24 48 8d 4a 01 <49> 8b 5c 05 00 4c 89 e8 65 48 0f c7 0f 0f 94 c0 84 c0 74 b0 49
[ 2051.818176] RIP  [<ffffffff810e7d4f>] kmem_cache_alloc+0x68/0xfd
[ 2051.821267]  RSP <ffff880146587668>
[ 2051.824358] [drm:intel_prepare_page_flip], preparing flip with no unpin work?
[ 2051.824383] ---[ end trace a6e8c87689f2ad1c ]---

Comment 1 Ben Widawsky 2012-08-15 06:13:23 UTC

Ignoring the fact that it appears to not be our bug based on the dmesg, could you bisect it on the off-chance that something we changed may have affected the block layer?

Comment 2 Chris Wilson 2012-08-15 09:09:58 UTC

Note that this could be explained by a use-after-free in our code.

Try with http://cgit.freedesktop.org/~ickle/linux-2.6/commit/?h=scatterlist&id=b3f6598375bd46fb7a2309d76fbd95de880f4237 and see if that isolates the fault to ourselves?

Comment 3 Daniel Vetter 2012-08-15 09:13:45 UTC

If you have the slub allocator enabled (CONFIG_SLUB), pls don't forget to boot with slub_debug on the cmdline. Otherwise our own slab might get squashed together with another one that fits nicely.
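
A minimal sketch of how one might confirm that the option actually reached the running kernel, assuming the usual Linux /proc/cmdline interface. The helper names has_cmdline_option and read_cmdline are hypothetical and not part of any tool mentioned in this report.

```python
def has_cmdline_option(cmdline: str, name: str) -> bool:
    """Return True if `name` is on the kernel command line, either as a
    bare flag (slub_debug) or with a value (slub_debug=FZP)."""
    for token in cmdline.split():
        # Compare only the part before '=' so "slub_debug_foo" does not match.
        key = token.split("=", 1)[0]
        if key == name:
            return True
    return False


def read_cmdline(path: str = "/proc/cmdline") -> str:
    # /proc/cmdline holds the parameters the running kernel was booted with.
    with open(path) as f:
        return f.read()


if __name__ == "__main__":
    print(has_cmdline_option(read_cmdline(), "slub_debug"))
```

Checking the booted command line rather than the bootloader config avoids the case where the config was edited but the machine was not rebooted with it.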

Comment 4 lu hua 2012-08-17 08:43:24 UTC

Added slub_debug on the cmdline; the system still hangs.

Comment 5 lu hua 2012-08-17 09:03:04 UTC

This bug blocks nightly test.

Comment 6 Daniel Vetter 2012-08-17 11:37:15 UTC

Please retest with latest -queued, I've taken out a patch that blows things up.

To confirm it's the same bug, can you please test whether i-g-t/tests/gem_gtt_cpu_tlb is also broken on the affected systems?

Comment 7 lu hua 2012-08-21 01:18:57 UTC

Tested on the latest -queued kernel (commit: 83358c85866ebd2).
This issue still happens.

i-g-t/tests/gem_gtt_cpu_tlb doesn't have this issue.

Comment 8 Daniel Vetter 2012-08-21 06:45:33 UTC

Can you please attach a new dmesg with the backtraces (it is very important that you ensure that the first backtrace is included)?

The previous dmesg has a mix of filesystem and i915 issues, and I'm hoping that at least the filesystem issues are gone.

Also, to clarify: Does the i-g-t test (gem_gtt_cpu_tlb) work on both current dinq and the previous kernel, or have you tested only on the new kernel?
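
To separate filesystem frames from i915 frames in a dmesg like the one in the description, a rough sketch such as the following could tally backtrace frames by the module tag at the end of each line. The regular expression is an assumption based only on the frame format visible in this report, and frames_by_module is a hypothetical helper; frames without a [module] tag are counted as "core kernel".

```python
import re
from collections import Counter

# Matches frames such as:
#   [ 2051.744281]  [<ffffffffa01382ff>] ext3_get_branch+0x72/0xf0 [ext3]
#   [ 2051.750498]  [<ffffffff810c791f>] ? zone_statistics+0x77/0x80
# (an "RIP [<...>] sym+0x../0x.." line has the same shape and is counted too)
FRAME_RE = re.compile(
    r"\[<[0-9a-f]+>\]\s+(?:\?\s+)?(?P<sym>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
    r"(?:\s+\[(?P<mod>\w+)\])?"
)


def frames_by_module(dmesg_text: str) -> Counter:
    """Count backtrace frames per module; untagged frames go to 'core kernel'."""
    counts = Counter()
    for line in dmesg_text.splitlines():
        m = FRAME_RE.search(line)
        if m:
            counts[m.group("mod") or "core kernel"] += 1
    return counts


if __name__ == "__main__":
    import sys
    for module, n in frames_by_module(sys.stdin.read()).most_common():
        print(f"{module}: {n}")
```

Run as a filter over a saved dmesg file; a trace dominated by [ext3] frames with no [i915] frames would match the "not obviously our bug" reading from comment 1, though it says nothing about who corrupted the memory.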

Comment 9 lu hua 2012-08-21 07:21:02 UTC

Run gem_gtt_cpu_tlb on -queued dec3ad8d19a4a496b2588bee2bcd7fce3a6731bc, output:
gem_gtt_cpu_tlb: gem_gtt_cpu_tlb.c:103: main: Assertion `ptr[i] == i' failed.
Aborted (core dumped).

Run gem_gtt_cpu_tlb on -queued 83358c85866ebd2af1229fc9870b93e126690671. It passes.

Comment 10 lu hua 2012-08-22 03:30:49 UTC

Created attachment 65924 [details]
dmesg on Sugarbay

Comment 11 lu hua 2012-08-22 03:37:49 UTC

Retested on -queued bd590bef35cd6f9b015a0.
There is no call trace in dmesg. A GPU hang appears in dmesg.

This issue goes away on Ivybridge and Huronriver.
Sandybridge GT1 i7-2600 still hangs in nightly test.

Comment 12 lu hua 2012-08-22 05:08:42 UTC

Created attachment 65927 [details]
i915_error_state

Comment 13 Daniel Vetter 2012-08-22 08:25:19 UTC

Ok, updated the summary, since this is only a gpu hang and snb-only now.

Comment 14 Chris Wilson 2012-08-22 08:52:39 UTC

The signaling of the BLT semaphore into the RCS failed, leaving the RCS stuck waiting on a semaphore that has already passed.

RCS:
  HEAD: 0x7e01e4fc
  TAIL: 0x0001e5c8
  ACTHD: 0x7e01e4fc
  IPEIR: 0x00000000
  IPEHR: 0x0b160001
  INSTDONE: 0xffffffff
  INSTDONE1: 0xbfffffff
    busy: CS
  BBADDR: 0x01fb4204
  INSTPS: 0x8000010b
  INSTPM: 0x00000080
  FADDR: 0x0001f5c8
  RC PSMI: 0x00000010
  FAULT_REG: 0x00000000
  SYNC_0: 0x00000000
  SYNC_1: 0x0019d77b
  seqno: 0x0019d786

BCS:
  HEAD: 0x54c05590
  TAIL: 0x00005590
  ACTHD: 0x54c05590
  IPEIR: 0x00000000
  IPEHR: 0x00000000
  INSTDONE: 0xfffffffe
  INSTPS: 0x00000000
  INSTPM: 0x00000000
  FADDR: 0x00049590
  RC PSMI: 0x00000018
  FAULT_REG: 0x00000000
  SYNC_0: 0x0019d786
  SYNC_1: 0x00000000
  seqno: 0x0019d782
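
The kind of comparison behind "already passed" can be sketched as below. This only parses the NAME: 0x... lines of the dumps above and applies the usual wrap-safe signed-difference test for 32-bit sequence numbers. Which pair of registers matters for the diagnosis is an assumption made here for illustration (the blitter's seqno against the render ring's SYNC_1); parse_ring_dump and seqno_passed are hypothetical helper names.

```python
def parse_ring_dump(text: str) -> dict:
    """Parse 'NAME: 0x...' lines of one ring's dump into {name: int}.
    Lines without a hex value (e.g. 'RCS:' or 'busy: CS') are skipped."""
    regs = {}
    for line in text.splitlines():
        name, sep, value = line.strip().partition(":")
        if not sep:
            continue
        try:
            regs[name.strip()] = int(value.strip(), 16)
        except ValueError:
            continue
    return regs


def seqno_passed(seq1: int, seq2: int) -> bool:
    """Wrap-safe 'seq1 is at or after seq2' for 32-bit sequence numbers."""
    return ((seq1 - seq2) & 0xFFFFFFFF) < 0x80000000


RCS = parse_ring_dump("""
  SYNC_0: 0x00000000
  SYNC_1: 0x0019d77b
  seqno: 0x0019d786
""")
BCS = parse_ring_dump("""
  SYNC_0: 0x0019d786
  SYNC_1: 0x00000000
  seqno: 0x0019d782
""")

# Assumption for illustration: compare the blitter's seqno with RCS SYNC_1.
print(seqno_passed(BCS["seqno"], RCS["SYNC_1"]))
```

With the values from this error state the blitter's seqno (0x0019d782) is ahead of the render ring's SYNC_1 (0x0019d77b), so the check is true; the masked difference keeps the comparison correct across a 32-bit wrap.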

Comment 15 Chris Wilson 2012-08-22 13:32:30 UTC

Does i915.i915_enable_rc6=0 affect the hang?

Comment 16 Ben Widawsky 2012-08-22 22:36:50 UTC

Can you also please try this patch:
https://patchwork.kernel.org/patch/1363021/

Comment 17 lu hua 2012-08-24 03:32:47 UTC

On Ivybridge(i7-3610QM) nightly testing with -queued kernel b4c145c1d245c2cc19754, calltrace appears in dmesg.
Call Trace:
[  193.797253]  [<ffffffffa00ccfd6>] sandybridge_update_wm+0x61/0x414 [i915]
[  193.797364]  [<ffffffffa00ce3ea>] intel_update_watermarks+0x19/0x1b [i915]
[  193.797477]  [<ffffffffa00d7e7b>] ivb_disable_plane+0x95/0x9e [i915]
[  193.797587]  [<ffffffffa00d785a>] intel_disable_plane+0x24/0x60 [i915]
[  193.797696]  [<ffffffffa00d78a5>] intel_destroy_plane+0xf/0x24 [i915]
[  193.797805]  [<ffffffffa005188f>] drm_mode_config_cleanup+0x147/0x17c [drm]
[  193.797915]  [<ffffffffa00bf3c2>] intel_modeset_cleanup+0xf7/0x104 [i915]
[  193.798021]  [<ffffffffa009d1e2>] i915_driver_unload+0xec/0x24b [i915]
[  193.798128]  [<ffffffffa004c1a3>] drm_put_dev+0xd2/0x1af [drm]
[  193.798231]  [<ffffffffa00991f6>] i915_pci_remove+0x18/0x1a [i915]
[  193.798332]  [<ffffffff81214858>] pci_device_remove+0x28/0x4c
[  193.798433]  [<ffffffff81292843>] __device_release_driver+0x67/0xba
[  193.798534]  [<ffffffff81292f43>] driver_detach+0x7e/0xa7
[  193.798633]  [<ffffffff812926aa>] bus_remove_driver+0x89/0xab
[  193.798734]  [<ffffffff812934c3>] driver_unregister+0x64/0x6d
[  193.798835]  [<ffffffff81214adf>] pci_unregister_driver+0x3f/0x84
[  193.798942]  [<ffffffffa004e1ee>] drm_pci_exit+0x3f/0x78 [drm]
[  193.799052]  [<ffffffffa00dc39f>] i915_exit+0x17/0x19 [i915]
[  193.799152]  [<ffffffff8107736c>] sys_delete_module+0x1a2/0x200
[  193.799253]  [<ffffffff8108b2a1>] ? __audit_syscall_entry+0x191/0x1bd
[  193.799354]  [<ffffffff813da522>] system_call_fastpath+0x16/0x1b
[  193.799452] Code: 48 8b bc f0 a8 28 00 00 48 8b 47 28 48 85 c0 74 06 80 7f 30 00 75 14 49 8b 40 18 41 89 03 49 8b 42 18 89 03 31 c0 e9 ca 00 00 00 <44> 8b 68 5c be 08 00 00 00 44 8b a7 80 00 00 00 4d 8b 72 20 44
[  193.802535] RIP  [<ffffffffa00cab8f>] g4x_compute_wm0+0x4d/0x122 [i915]
[  193.802689]  RSP <ffff88021ce97bb0>
[  193.802791] ---[ end trace 52f3f9a1f73037cd ]---

Comment 18 lu hua 2012-08-24 03:33:47 UTC

Created attachment 66046 [details]
dmesg Ivybridge(i7-3610QM)

Comment 19 lu hua 2012-08-24 08:09:00 UTC

(In reply to comment #16)
> Can you also please try this patch:
> https://patchwork.kernel.org/patch/1363021/


Applied this patch to -queued kernel b4c145c1d245c2cc19754dbe4b718f5a48755993.
Calltrace appears in dmesg.
Call Trace:
[ 6914.397312]  [<c02259c6>] warn_slowpath_common+0x63/0x78
[ 6914.397314]  [<c021a259>] ? default_send_IPI_mask_logical+0x2d/0xb6
[ 6914.397316]  [<c0225a3f>] warn_slowpath_fmt+0x26/0x2a
[ 6914.397318]  [<c021a259>] default_send_IPI_mask_logical+0x2d/0xb6
[ 6914.397321]  [<c0218aec>] native_send_call_func_ipi+0x4e/0x50
[ 6914.397323]  [<c0263a71>] smp_call_function_many+0x17b/0x193
[ 6914.397325]  [<c0222125>] ? do_flush_tlb_all+0x40/0x40
[ 6914.397327]  [<c0222197>] native_flush_tlb_others+0x21/0x24
[ 6914.397329]  [<c0222361>] flush_tlb_page+0x5a/0x63
[ 6914.397332]  [<c02ba38d>] ptep_clear_flush+0xd/0x14
[ 6914.397335]  [<c02b0ddb>] do_wp_page+0x4a8/0x557
[ 6914.397339]  [<c024b668>] ? __enqueue_entity+0x63/0x69
[ 6914.397341]  [<c02b23c7>] handle_pte_fault+0x589/0x5b5
[ 6914.397343]  [<c02b24bd>] handle_mm_fault+0xca/0xd9
[ 6914.397347]  [<c0545df8>] ? spurious_fault+0xa8/0xa8
[ 6914.397349]  [<c054617e>] do_page_fault+0x386/0x3a2
[ 6914.397351]  [<c02251f0>] ? do_fork+0x16a/0x241
[ 6914.397354]  [<c0275eb1>] ? __audit_syscall_exit+0x32e/0x349
[ 6914.397358]  [<c0207b79>] ? sys_clone+0x1b/0x20
[ 6914.397360]  [<c0545df8>] ? spurious_fault+0xa8/0xa8
[ 6914.397362]  [<c0543f66>] error_code+0x5a/0x60
[ 6914.397364]  [<c0545df8>] ? spurious_fault+0xa8/0xa8
[ 6914.397365] ---[ end trace 5e024e1ca4e544d1 ]---

Comment 20 Daniel Vetter 2012-08-24 08:57:04 UTC

Ok, we seem to have randomly corrupted state in all kinds of core kernel functions (not just our own). Can you please try to bisect this one (of all the affected machines, please pick the one with the best reproducibility for bisecting)?
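
For reference, the search that a bisection performs can be sketched as a binary search between a known-good and a known-bad commit (the description names 20d5a540e55a as good and dec3ad8d19a4 as bad on -queued). This is only an illustration of the idea, assuming a linear list of commits and a reliable test; in practice git bisect does this on the real history, and first_bad and is_bad are hypothetical names.

```python
from typing import Callable, Sequence


def first_bad(commits: Sequence[str], is_bad: Callable[[str], bool]) -> str:
    """Return the first bad commit.

    `commits` is ordered oldest to newest; commits[0] is known good and
    commits[-1] is known bad. `is_bad` builds/boots/tests one commit.
    """
    lo, hi = 0, len(commits) - 1  # invariant: commits[lo] good, commits[hi] bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(commits[mid]):
            hi = mid
        else:
            lo = mid
    return commits[hi]


if __name__ == "__main__":
    # Hypothetical history of 8 commits where the regression enters at "c5".
    history = [f"c{i}" for i in range(8)]
    print(first_bad(history, lambda c: int(c[1:]) >= 5))
```

Each step halves the remaining range, which is why a machine that reproduces the hang reliably matters: one wrong good/bad verdict sends the search into the wrong half.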

Comment 21 lu hua 2012-08-27 03:19:10 UTC

When running the i-g-t case 'module_reload', the calltrace appears in dmesg (Bug 54101). Piglit cases don't cause this calltrace.

(In reply to comment #17)
> On Ivybridge(i7-3610QM) nightly testing with -queued kernel
> b4c145c1d245c2cc19754, calltrace appears in dmesg.
> Call Trace:
> [  193.797253]  [<ffffffffa00ccfd6>] sandybridge_update_wm+0x61/0x414 [i915]
> [  193.797364]  [<ffffffffa00ce3ea>] intel_update_watermarks+0x19/0x1b [i915]
> [  193.797477]  [<ffffffffa00d7e7b>] ivb_disable_plane+0x95/0x9e [i915]
> [  193.797587]  [<ffffffffa00d785a>] intel_disable_plane+0x24/0x60 [i915]
> [  193.797696]  [<ffffffffa00d78a5>] intel_destroy_plane+0xf/0x24 [i915]
> [  193.797805]  [<ffffffffa005188f>] drm_mode_config_cleanup+0x147/0x17c [drm]
> [  193.797915]  [<ffffffffa00bf3c2>] intel_modeset_cleanup+0xf7/0x104 [i915]
> [  193.798021]  [<ffffffffa009d1e2>] i915_driver_unload+0xec/0x24b [i915]
> [  193.798128]  [<ffffffffa004c1a3>] drm_put_dev+0xd2/0x1af [drm]
> [  193.798231]  [<ffffffffa00991f6>] i915_pci_remove+0x18/0x1a [i915]
> [  193.798332]  [<ffffffff81214858>] pci_device_remove+0x28/0x4c
> [  193.798433]  [<ffffffff81292843>] __device_release_driver+0x67/0xba
> [  193.798534]  [<ffffffff81292f43>] driver_detach+0x7e/0xa7
> [  193.798633]  [<ffffffff812926aa>] bus_remove_driver+0x89/0xab
> [  193.798734]  [<ffffffff812934c3>] driver_unregister+0x64/0x6d
> [  193.798835]  [<ffffffff81214adf>] pci_unregister_driver+0x3f/0x84
> [  193.798942]  [<ffffffffa004e1ee>] drm_pci_exit+0x3f/0x78 [drm]
> [  193.799052]  [<ffffffffa00dc39f>] i915_exit+0x17/0x19 [i915]
> [  193.799152]  [<ffffffff8107736c>] sys_delete_module+0x1a2/0x200
> [  193.799253]  [<ffffffff8108b2a1>] ? __audit_syscall_entry+0x191/0x1bd
> [  193.799354]  [<ffffffff813da522>] system_call_fastpath+0x16/0x1b
> [  193.799452] Code: 48 8b bc f0 a8 28 00 00 48 8b 47 28 48 85 c0 74 06 80 7f
> 30 00 75 14 49 8b 40 18 41 89 03 49 8b 42 18 89 03 31 c0 e9 ca 00 00 00 <44> 8b
> 68 5c be 08 00 00 00 44 8b a7 80 00 00 00 4d 8b 72 20 44
> [  193.802535] RIP  [<ffffffffa00cab8f>] g4x_compute_wm0+0x4d/0x122 [i915]
> [  193.802689]  RSP <ffff88021ce97bb0>
> [  193.802791] ---[ end trace 52f3f9a1f73037cd ]---

Comment 22 lu hua 2012-08-30 09:11:30 UTC

(In reply to comment #3)
> If you have the slub allocator enabled (CONFIG_SLUB), pls don't forget to boot
> with slub_debug on the cmdline. Otherwise our own slab might get squashed
> together with another one that fits nicely.

Adding slub_debug causes Bug 54101.

Comment 23 lu hua 2012-08-31 06:52:13 UTC

This issue goes away on the latest -queued kernel 0e0428baf7c156bc2ba8a3.

Comment 24 lu hua 2012-09-07 06:24:48 UTC

Verified. Fixed on -queued kernel(commit:8c3f929b6147e142efc58d5d03dc6fa703b14a5d)

Comment 25 Elizabeth 2017-10-06 14:48:44 UTC

Closing old verified.
