Bug 53526

Summary: [SNB Regression] gpu hang with calltrace when running piglit case
Product: DRI
Reporter: lu hua <huax.lu>
Component: DRM/Intel
Assignee: Daniel Vetter <daniel>
Status: CLOSED FIXED
Severity: major
Priority: high
CC: ben, chris, daniel, jbarnes, xunx.fang
Version: unspecified
Hardware: All
OS: Linux (All)
Attachments:
  dmesg
  dmesg on Sugarbay
  i915_error_state
  dmesg Ivybridge(i7-3610QM)

Description lu hua 2012-08-15 06:05:32 UTC
Created attachment 65584 [details]
dmesg

System Environment:
--------------------------
Arch:               x86_64
Platform:           Ivybridge
Libdrm:             (master) libdrm-2.4.38-3-g3163cfe4db925429760407e77140e2d595338bc2
Mesa:               (master) 605f964d5cc7016fc74e0563829fa794da845c20
Xserver:            (master) xorg-server-1.12.99.904
Xf86_video_intel:   (master) 2.20.3-35-g2f4de90709264ad19a3e3f5f0f79f4bba78a760a
Libva:              (staging) f12f80371fb534e6bbf248586b3c17c298a31f4e
Libva_intel_driver: (staging) 82fa52510a37ab645daaa3bb7091ff5096a20d0b
Kernel:             (drm-intel-next-queued) dec3ad8d19a4a496b2588bee2bcd7fce3a6731bc

Bug detailed description:
-------------------------
While nightly testing was running piglit cases, the system hung with a call trace.
It happens on Ivybridge and Sandybridge with the -queued kernel. It doesn't happen on the -fixes kernel.
It also doesn't happen on -queued kernel 20d5a540e55a29daeef12706f9ee73baf5641c16.

Calltrace:
[ 2051.731626]  [<ffffffff81112075>] alloc_buffer_head+0x1c/0x44
[ 2051.734842]  [<ffffffff811121d8>] alloc_page_buffers+0x2d/0xc9
[ 2051.738022]  [<ffffffff811132ef>] __getblk+0x194/0x24f
[ 2051.741163]  [<ffffffff81113452>] __bread+0xb/0x85
[ 2051.744281]  [<ffffffffa01382ff>] ext3_get_branch+0x72/0xf0 [ext3]
[ 2051.747397]  [<ffffffffa013a664>] ext3_get_blocks_handle+0xda/0x9a9 [ext3]
[ 2051.750498]  [<ffffffff810c791f>] ? zone_statistics+0x77/0x80
[ 2051.753580]  [<ffffffffa013afe9>] ext3_get_block+0xb6/0xf6 [ext3]
[ 2051.756649]  [<ffffffff8111b20a>] do_mpage_readpage+0x16d/0x4ed
[ 2051.759730]  [<ffffffff810b18bf>] ? add_to_page_cache_locked+0x77/0xa8
[ 2051.762826]  [<ffffffffa013af33>] ? ext3_get_blocks_handle+0x9a9/0x9a9 [ext3]
[ 2051.765931]  [<ffffffff8111b69e>] mpage_readpages+0xaf/0xf5
[ 2051.769033]  [<ffffffffa013af33>] ? ext3_get_blocks_handle+0x9a9/0x9a9 [ext3]
[ 2051.772139]  [<ffffffff810c791f>] ? zone_statistics+0x77/0x80
[ 2051.775245]  [<ffffffff810e1cef>] ? alloc_pages_current+0xcd/0xee
[ 2051.778362]  [<ffffffffa0138ac2>] ext3_readpages+0x18/0x1a [ext3]
[ 2051.781469]  [<ffffffff810ba0d6>] __do_page_cache_readahead+0x12a/0x1ad
[ 2051.784580]  [<ffffffff810ba411>] ra_submit+0x1c/0x20
[ 2051.787679]  [<ffffffff810b2d67>] filemap_fault+0x159/0x32d
[ 2051.790775]  [<ffffffff810cb444>] __do_fault+0xa7/0x3bb
[ 2051.793866]  [<ffffffff810cd973>] handle_pte_fault+0x28f/0x6b9
[ 2051.796954]  [<ffffffff810cee29>] handle_mm_fault+0x196/0x1ab
[ 2051.800039]  [<ffffffff813cb9bd>] do_page_fault+0x3ad/0x3d2
[ 2051.803135]  [<ffffffff8100c258>] ? syscall_trace_leave+0x3e/0x16a
[ 2051.806242]  [<ffffffff810e6344>] ? kmem_cache_free+0x8a/0xc6
[ 2051.809227]  [<ffffffff813c8e6f>] page_fault+0x1f/0x30
[ 2051.812098] Code: 4d 8b 28 4d 85 ed 75 16 4c 89 f9 83 ca ff 44 89 f6 4c 89 e7 e8 6e a8 2d 00 49 89 c5 eb 2a 49 63 44 24 20 49 8b 3c 24 48 8d 4a 01 <49> 8b 5c 05 00 4c 89 e8 65 48 0f c7 0f 0f 94 c0 84 c0 74 b0 49
[ 2051.818176] RIP  [<ffffffff810e7d4f>] kmem_cache_alloc+0x68/0xfd
[ 2051.821267]  RSP <ffff880146587668>
[ 2051.824358] [drm:intel_prepare_page_flip], preparing flip with no unpin work?
[ 2051.824383] ---[ end trace a6e8c87689f2ad1c ]---
Comment 1 Ben Widawsky 2012-08-15 06:13:23 UTC
Ignoring the fact that, based on the dmesg, it appears not to be our bug, could you bisect it on the off-chance that something we changed may have affected the block layer?
Comment 2 Chris Wilson 2012-08-15 09:09:58 UTC
Note that this could be explained by a use-after-free in our code.

Try with http://cgit.freedesktop.org/~ickle/linux-2.6/commit/?h=scatterlist&id=b3f6598375bd46fb7a2309d76fbd95de880f4237 and see if that isolates the fault to ourselves?
Comment 3 Daniel Vetter 2012-08-15 09:13:45 UTC
If you have the SLUB allocator enabled (CONFIG_SLUB), please don't forget to boot with slub_debug on the cmdline. Otherwise our own slab might get merged with another one of matching size.
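
For reference, one way to do this (a sketch; the exact bootloader config and regeneration command vary by distribution):

    # /etc/default/grub -- append slub_debug to the kernel command line
    GRUB_CMDLINE_LINUX="... slub_debug"
    # then regenerate the grub config and reboot, e.g.:
    update-grub    # or: grub2-mkconfig -o /boot/grub2/grub.cfg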
Comment 4 lu hua 2012-08-17 08:43:24 UTC
Added slub_debug to the cmdline; the system still hangs.
Comment 5 lu hua 2012-08-17 09:03:04 UTC
This bug blocks nightly testing.
Comment 6 Daniel Vetter 2012-08-17 11:37:15 UTC
Please retest with the latest -queued; I've taken out a patch that blows things up.

To confirm it's the same bug, can you please test whether i-g-t/tests/gem_gtt_cpu_tlb is also broken on the affected systems?
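
For reference, assuming a built intel-gpu-tools checkout, the test can be run directly (sketch; paths depend on the checkout):

    cd intel-gpu-tools/tests
    sudo ./gem_gtt_cpu_tlb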
Comment 7 lu hua 2012-08-21 01:18:57 UTC
Tested on the latest -queued kernel (commit 83358c85866ebd2).
This issue still happens.

i-g-t/tests/gem_gtt_cpu_tlb doesn't have this issue.
Comment 8 Daniel Vetter 2012-08-21 06:45:33 UTC
Can you please attach a new dmesg with the backtraces (it is very important that you ensure that the first backtrace is included)?

The previous dmesg has a mix of filesystem and i915 issues, and I'm hoping that at least the filesystem issues are gone.

Also, to clarify: Does the i-g-t test (gem_gtt_cpu_tlb) work on both current dinq and the previous kernel, or have you tested only on the new kernel?
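
One way to keep the first backtrace from being pushed out of the kernel ring buffer before capture (a sketch; the buffer size is only an example) is to enlarge the log buffer on the kernel command line and dump it after reproducing:

    log_buf_len=4M             # kernel command line
    dmesg > dmesg-full.txt     # after reproducing the hang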
Comment 9 lu hua 2012-08-21 07:21:02 UTC
Ran gem_gtt_cpu_tlb on -queued dec3ad8d19a4a496b2588bee2bcd7fce3a6731bc; output:
gem_gtt_cpu_tlb: gem_gtt_cpu_tlb.c:103: main: Assertion `ptr[i] == i' failed.
Aborted (core dumped).

Ran gem_gtt_cpu_tlb on -queued 83358c85866ebd2af1229fc9870b93e126690671. It passes.
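
For context, the failing assertion at gem_gtt_cpu_tlb.c:103 is a pattern check; a rough sketch of that kind of check (illustrative only, with hypothetical parameters -- the real test uses the i-g-t/libdrm infrastructure to set up the mappings and domains):

    /* Sketch: write an incrementing pattern through a GTT mapping and
     * read it back through the CPU; a stale GTT TLB entry shows up as
     * the assertion seen above (ptr[i] == i) failing. */
    #include <assert.h>
    #include <stddef.h>
    #include <stdint.h>

    static void check_pattern(volatile uint32_t *gtt_map,
                              const uint32_t *cpu_map, size_t count)
    {
            for (size_t i = 0; i < count; i++)
                    gtt_map[i] = i;          /* write through the GTT */
            /* (the real test moves the bo to the CPU domain here) */
            for (size_t i = 0; i < count; i++)
                    assert(cpu_map[i] == i); /* the failing check */
    }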
Comment 10 lu hua 2012-08-22 03:30:49 UTC
Created attachment 65924 [details]
dmesg on Sugarbay
Comment 11 lu hua 2012-08-22 03:37:49 UTC
Retested on -queued bd590bef35cd6f9b015a0.
There is no call trace in dmesg, but a GPU hang appears in dmesg.

This issue goes away on Ivybridge and Huronriver.
Sandybridge GT1 i7-2600 still hangs in nightly testing.
Comment 12 lu hua 2012-08-22 05:08:42 UTC
Created attachment 65927 [details]
i915_error_state
Comment 13 Daniel Vetter 2012-08-22 08:25:19 UTC
Ok, updated the summary, since this is only a gpu hang and snb-only now.
Comment 14 Chris Wilson 2012-08-22 08:52:39 UTC
The signaling of the BLT semaphore into the RCS failed, leaving the RCS stuck waiting on a semaphore that has already passed.

RCS:
  HEAD: 0x7e01e4fc
  TAIL: 0x0001e5c8
  ACTHD: 0x7e01e4fc
  IPEIR: 0x00000000
  IPEHR: 0x0b160001
  INSTDONE: 0xffffffff
  INSTDONE1: 0xbfffffff
    busy: CS
  BBADDR: 0x01fb4204
  INSTPS: 0x8000010b
  INSTPM: 0x00000080
  FADDR: 0x0001f5c8
  RC PSMI: 0x00000010
  FAULT_REG: 0x00000000
  SYNC_0: 0x00000000
  SYNC_1: 0x0019d77b
  seqno: 0x0019d786

BCS:
  HEAD: 0x54c05590
  TAIL: 0x00005590
  ACTHD: 0x54c05590
  IPEIR: 0x00000000
  IPEHR: 0x00000000
  INSTDONE: 0xfffffffe
  INSTPS: 0x00000000
  INSTPM: 0x00000000
  FADDR: 0x00049590
  RC PSMI: 0x00000018
  FAULT_REG: 0x00000000
  SYNC_0: 0x0019d786
  SYNC_1: 0x00000000
  seqno: 0x0019d782
Comment 15 Chris Wilson 2012-08-22 13:32:30 UTC
Does i915.i915_enable_rc6=0 affect the hang?
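
For reference, RC6 can be disabled either on the kernel command line or when reloading the module (a sketch, using the 2012-era i915_enable_rc6 parameter):

    i915.i915_enable_rc6=0                 # kernel command line
    # or, after unloading the module:
    modprobe i915 i915_enable_rc6=0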
Comment 16 Ben Widawsky 2012-08-22 22:36:50 UTC
Can you also please try this patch:
https://patchwork.kernel.org/patch/1363021/
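
A sketch of fetching and applying the patch from patchwork (the /mbox/ URL form is an assumption about that patchwork instance):

    wget -O fix.mbox https://patchwork.kernel.org/patch/1363021/mbox/
    git am fix.mbox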
Comment 17 lu hua 2012-08-24 03:32:47 UTC
During nightly testing on Ivybridge (i7-3610QM) with -queued kernel b4c145c1d245c2cc19754, a call trace appears in dmesg.
Call Trace:
[  193.797253]  [<ffffffffa00ccfd6>] sandybridge_update_wm+0x61/0x414 [i915]
[  193.797364]  [<ffffffffa00ce3ea>] intel_update_watermarks+0x19/0x1b [i915]
[  193.797477]  [<ffffffffa00d7e7b>] ivb_disable_plane+0x95/0x9e [i915]
[  193.797587]  [<ffffffffa00d785a>] intel_disable_plane+0x24/0x60 [i915]
[  193.797696]  [<ffffffffa00d78a5>] intel_destroy_plane+0xf/0x24 [i915]
[  193.797805]  [<ffffffffa005188f>] drm_mode_config_cleanup+0x147/0x17c [drm]
[  193.797915]  [<ffffffffa00bf3c2>] intel_modeset_cleanup+0xf7/0x104 [i915]
[  193.798021]  [<ffffffffa009d1e2>] i915_driver_unload+0xec/0x24b [i915]
[  193.798128]  [<ffffffffa004c1a3>] drm_put_dev+0xd2/0x1af [drm]
[  193.798231]  [<ffffffffa00991f6>] i915_pci_remove+0x18/0x1a [i915]
[  193.798332]  [<ffffffff81214858>] pci_device_remove+0x28/0x4c
[  193.798433]  [<ffffffff81292843>] __device_release_driver+0x67/0xba
[  193.798534]  [<ffffffff81292f43>] driver_detach+0x7e/0xa7
[  193.798633]  [<ffffffff812926aa>] bus_remove_driver+0x89/0xab
[  193.798734]  [<ffffffff812934c3>] driver_unregister+0x64/0x6d
[  193.798835]  [<ffffffff81214adf>] pci_unregister_driver+0x3f/0x84
[  193.798942]  [<ffffffffa004e1ee>] drm_pci_exit+0x3f/0x78 [drm]
[  193.799052]  [<ffffffffa00dc39f>] i915_exit+0x17/0x19 [i915]
[  193.799152]  [<ffffffff8107736c>] sys_delete_module+0x1a2/0x200
[  193.799253]  [<ffffffff8108b2a1>] ? __audit_syscall_entry+0x191/0x1bd
[  193.799354]  [<ffffffff813da522>] system_call_fastpath+0x16/0x1b
[  193.799452] Code: 48 8b bc f0 a8 28 00 00 48 8b 47 28 48 85 c0 74 06 80 7f 30 00 75 14 49 8b 40 18 41 89 03 49 8b 42 18 89 03 31 c0 e9 ca 00 00 00 <44> 8b 68 5c be 08 00 00 00 44 8b a7 80 00 00 00 4d 8b 72 20 44
[  193.802535] RIP  [<ffffffffa00cab8f>] g4x_compute_wm0+0x4d/0x122 [i915]
[  193.802689]  RSP <ffff88021ce97bb0>
[  193.802791] ---[ end trace 52f3f9a1f73037cd ]---
Comment 18 lu hua 2012-08-24 03:33:47 UTC
Created attachment 66046 [details]
dmesg Ivybridge(i7-3610QM)
Comment 19 lu hua 2012-08-24 08:09:00 UTC
(In reply to comment #16)
> Can you also please try this patch:
> https://patchwork.kernel.org/patch/1363021/


Applied this patch to -queued kernel b4c145c1d245c2cc19754dbe4b718f5a48755993.
A call trace appears in dmesg.
Call Trace:
[ 6914.397312]  [<c02259c6>] warn_slowpath_common+0x63/0x78
[ 6914.397314]  [<c021a259>] ? default_send_IPI_mask_logical+0x2d/0xb6
[ 6914.397316]  [<c0225a3f>] warn_slowpath_fmt+0x26/0x2a
[ 6914.397318]  [<c021a259>] default_send_IPI_mask_logical+0x2d/0xb6
[ 6914.397321]  [<c0218aec>] native_send_call_func_ipi+0x4e/0x50
[ 6914.397323]  [<c0263a71>] smp_call_function_many+0x17b/0x193
[ 6914.397325]  [<c0222125>] ? do_flush_tlb_all+0x40/0x40
[ 6914.397327]  [<c0222197>] native_flush_tlb_others+0x21/0x24
[ 6914.397329]  [<c0222361>] flush_tlb_page+0x5a/0x63
[ 6914.397332]  [<c02ba38d>] ptep_clear_flush+0xd/0x14
[ 6914.397335]  [<c02b0ddb>] do_wp_page+0x4a8/0x557
[ 6914.397339]  [<c024b668>] ? __enqueue_entity+0x63/0x69
[ 6914.397341]  [<c02b23c7>] handle_pte_fault+0x589/0x5b5
[ 6914.397343]  [<c02b24bd>] handle_mm_fault+0xca/0xd9
[ 6914.397347]  [<c0545df8>] ? spurious_fault+0xa8/0xa8
[ 6914.397349]  [<c054617e>] do_page_fault+0x386/0x3a2
[ 6914.397351]  [<c02251f0>] ? do_fork+0x16a/0x241
[ 6914.397354]  [<c0275eb1>] ? __audit_syscall_exit+0x32e/0x349
[ 6914.397358]  [<c0207b79>] ? sys_clone+0x1b/0x20
[ 6914.397360]  [<c0545df8>] ? spurious_fault+0xa8/0xa8
[ 6914.397362]  [<c0543f66>] error_code+0x5a/0x60
[ 6914.397364]  [<c0545df8>] ? spurious_fault+0xa8/0xa8
[ 6914.397365] ---[ end trace 5e024e1ca4e544d1 ]---
Comment 20 Daniel Vetter 2012-08-24 08:57:04 UTC
Ok, we seem to have randomly corrupted state in all kinds of core kernel functions (not just our own). Can you please try to bisect this one (of all the affected machines, please pick the one where it reproduces most reliably)?
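
A sketch of that bisect workflow, using the known-good and known-bad commits from this report:

    git bisect start
    git bisect bad dec3ad8d19a4a496b2588bee2bcd7fce3a6731bc    # hangs (see description)
    git bisect good 20d5a540e55a29daeef12706f9ee73baf5641c16   # no hang (see description)
    # rebuild and boot each step, run the piglit load, then mark it:
    git bisect good    # or: git bisect bad
    git bisect reset   # when done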
Comment 21 lu hua 2012-08-27 03:19:10 UTC
When running the i-g-t case 'module_reload', the call trace appears in dmesg (Bug 54101). Piglit cases don't cause this call trace.

(In reply to comment #17)
> On Ivybridge(i7-3610QM) nightly testing with -queued kernel
> b4c145c1d245c2cc19754, calltrace appears in dmesg.
> [...]
Comment 22 lu hua 2012-08-30 09:11:30 UTC
(In reply to comment #3)
> If you have the slub allocator enabled (CONFIG_SLUB), pls don't forget to boot
> with slub_debug on the cmdline. Otherwise our own slab might get squashed
> together with another one that fits nicely.

Adding slub_debug causes Bug 54101.
Comment 23 lu hua 2012-08-31 06:52:13 UTC
This issue goes away on the latest -queued kernel 0e0428baf7c156bc2ba8a3.
Comment 24 lu hua 2012-09-07 06:24:48 UTC
Verified. Fixed on -queued kernel (commit 8c3f929b6147e142efc58d5d03dc6fa703b14a5d).
Comment 25 Elizabeth 2017-10-06 14:48:44 UTC
Closing old verified.
