Created attachment 65584 [details]
dmesg

System Environment:
--------------------------
Arch:             x86_64
Platform:         Ivybridge
Libdrm:           (master)libdrm-2.4.38-3-g3163cfe4db925429760407e77140e2d595338bc2
Mesa:             (master)605f964d5cc7016fc74e0563829fa794da845c20
Xserver:          (master)xorg-server-1.12.99.904
Xf86_video_intel: (master)2.20.3-35-g2f4de90709264ad19a3e3f5f0f79f4bba78a760a
Libva:            (staging)f12f80371fb534e6bbf248586b3c17c298a31f4e
Libva_intel_driver:(staging)82fa52510a37ab645daaa3bb7091ff5096a20d0b
Kernel:           (drm-intel-next-queued) dec3ad8d19a4a496b2588bee2bcd7fce3a6731bc

Bug detailed description:
-------------------------
During nightly piglit testing, the system hangs with a calltrace. It happens on Ivybridge and Sandybridge with the -queued kernel. It doesn't happen on the -fixes kernel, nor on -queued kernel 20d5a540e55a29daeef12706f9ee73baf5641c16.

Calltrace:
[ 2051.731626] [<ffffffff81112075>] alloc_buffer_head+0x1c/0x44
[ 2051.734842] [<ffffffff811121d8>] alloc_page_buffers+0x2d/0xc9
[ 2051.738022] [<ffffffff811132ef>] __getblk+0x194/0x24f
[ 2051.741163] [<ffffffff81113452>] __bread+0xb/0x85
[ 2051.744281] [<ffffffffa01382ff>] ext3_get_branch+0x72/0xf0 [ext3]
[ 2051.747397] [<ffffffffa013a664>] ext3_get_blocks_handle+0xda/0x9a9 [ext3]
[ 2051.750498] [<ffffffff810c791f>] ? zone_statistics+0x77/0x80
[ 2051.753580] [<ffffffffa013afe9>] ext3_get_block+0xb6/0xf6 [ext3]
[ 2051.756649] [<ffffffff8111b20a>] do_mpage_readpage+0x16d/0x4ed
[ 2051.759730] [<ffffffff810b18bf>] ? add_to_page_cache_locked+0x77/0xa8
[ 2051.762826] [<ffffffffa013af33>] ? ext3_get_blocks_handle+0x9a9/0x9a9 [ext3]
[ 2051.765931] [<ffffffff8111b69e>] mpage_readpages+0xaf/0xf5
[ 2051.769033] [<ffffffffa013af33>] ? ext3_get_blocks_handle+0x9a9/0x9a9 [ext3]
[ 2051.772139] [<ffffffff810c791f>] ? zone_statistics+0x77/0x80
[ 2051.775245] [<ffffffff810e1cef>] ? alloc_pages_current+0xcd/0xee
[ 2051.778362] [<ffffffffa0138ac2>] ext3_readpages+0x18/0x1a [ext3]
[ 2051.781469] [<ffffffff810ba0d6>] __do_page_cache_readahead+0x12a/0x1ad
[ 2051.784580] [<ffffffff810ba411>] ra_submit+0x1c/0x20
[ 2051.787679] [<ffffffff810b2d67>] filemap_fault+0x159/0x32d
[ 2051.790775] [<ffffffff810cb444>] __do_fault+0xa7/0x3bb
[ 2051.793866] [<ffffffff810cd973>] handle_pte_fault+0x28f/0x6b9
[ 2051.796954] [<ffffffff810cee29>] handle_mm_fault+0x196/0x1ab
[ 2051.800039] [<ffffffff813cb9bd>] do_page_fault+0x3ad/0x3d2
[ 2051.803135] [<ffffffff8100c258>] ? syscall_trace_leave+0x3e/0x16a
[ 2051.806242] [<ffffffff810e6344>] ? kmem_cache_free+0x8a/0xc6
[ 2051.809227] [<ffffffff813c8e6f>] page_fault+0x1f/0x30
[ 2051.812098] Code: 4d 8b 28 4d 85 ed 75 16 4c 89 f9 83 ca ff 44 89 f6 4c 89 e7 e8 6e a8 2d 00 49 89 c5 eb 2a 49 63 44 24 20 49 8b 3c 24 48 8d 4a 01 <49> 8b 5c 05 00 4c 89 e8 65 48 0f c7 0f 0f 94 c0 84 c0 74 b0 49
[ 2051.818176] RIP [<ffffffff810e7d4f>] kmem_cache_alloc+0x68/0xfd
[ 2051.821267] RSP <ffff880146587668>
[ 2051.824358] [drm:intel_prepare_page_flip], preparing flip with no unpin work?
[ 2051.824383] ---[ end trace a6e8c87689f2ad1c ]---
Ignoring the fact that it appears not to be our bug based on the dmesg, could you bisect it on the off-chance that something we changed may have affected the block layer?
Note that this could be explained by a use-after-free in our code. Try with http://cgit.freedesktop.org/~ickle/linux-2.6/commit/?h=scatterlist&id=b3f6598375bd46fb7a2309d76fbd95de880f4237 and see if that isolates the fault to ourselves?
If you have the slub allocator enabled (CONFIG_SLUB), please don't forget to boot with slub_debug on the cmdline. Otherwise our own slab might get merged with another one that fits nicely.
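For reference, a minimal sketch of how the slub_debug option could be added on a GRUB-based system. The sed is demonstrated on a scratch copy so it is safe to run as-is; on the real machine the file would be /etc/default/grub followed by update-grub (paths and GRUB usage are assumptions, adapt to your bootloader):

```shell
#!/bin/sh
# Sketch: append slub_debug to GRUB_CMDLINE_LINUX.
# Shown on a scratch copy of a grub defaults file, not the real one.
tmp=$(mktemp)
printf 'GRUB_CMDLINE_LINUX="quiet splash"\n' > "$tmp"

# Insert slub_debug inside the quoted value.
sed -i 's/^GRUB_CMDLINE_LINUX="\(.*\)"/GRUB_CMDLINE_LINUX="\1 slub_debug"/' "$tmp"

cat "$tmp"
rm -f "$tmp"
```

After rebooting, `grep -o slub_debug /proc/cmdline` confirms the option is active.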
Added slub_debug to the cmdline; the system still hangs.
This bug blocks nightly test.
Please retest with latest -queued, I've taken out a patch that blows things up. To confirm it's the same bug, can you please test whether i-g-t/tests/gem_gtt_cpu_tlb is also broken on the affected systems?
Tested on the latest -queued kernel (commit 83358c85866ebd2). This issue still happens. i-g-t/tests/gem_gtt_cpu_tlb doesn't have this issue.
Can you please attach a new dmesg with the backtraces (it is very important that you ensure that the first backtrace is included)? The previous dmesg has a mix of filesystem and i915 issues, and I'm hoping that at least the filesystem issues are gone. Also, to clarify: Does the i-g-t test (gem_gtt_cpu_tlb) work on both current dinq and the previous kernel, or have you tested only on the new kernel?
Ran gem_gtt_cpu_tlb on -queued dec3ad8d19a4a496b2588bee2bcd7fce3a6731bc; output:
gem_gtt_cpu_tlb: gem_gtt_cpu_tlb.c:103: main: Assertion `ptr[i] == i' failed.
Aborted (core dumped)

Ran gem_gtt_cpu_tlb on -queued 83358c85866ebd2af1229fc9870b93e126690671; it passes.
Created attachment 65924 [details]
dmesg on Sugarbay
Retested on -queued bd590bef35cd6f9b015a0. There is no calltrace in dmesg, but a GPU hang appears in dmesg. This issue goes away on Ivybridge and Huronriver. Sandybridge GT1 i7-2600 still hangs in nightly test.
Created attachment 65927 [details]
i915_error_state
Ok, updated the summary, since this is only a GPU hang and snb-only now.
The signaling of the BLT semaphore into the RCS failed, leaving the RCS stuck waiting on a semaphore that has already passed.

RCS:
  HEAD: 0x7e01e4fc  TAIL: 0x0001e5c8  ACTHD: 0x7e01e4fc
  IPEIR: 0x00000000  IPEHR: 0x0b160001
  INSTDONE: 0xffffffff  INSTDONE1: 0xbfffffff
  busy: CS
  BBADDR: 0x01fb4204
  INSTPS: 0x8000010b  INSTPM: 0x00000080
  FADDR: 0x0001f5c8
  RC PSMI: 0x00000010
  FAULT_REG: 0x00000000
  SYNC_0: 0x00000000  SYNC_1: 0x0019d77b
  seqno: 0x0019d786

BCS:
  HEAD: 0x54c05590  TAIL: 0x00005590  ACTHD: 0x54c05590
  IPEIR: 0x00000000  IPEHR: 0x00000000
  INSTDONE: 0xfffffffe
  INSTPS: 0x00000000  INSTPM: 0x00000000
  FADDR: 0x00049590
  RC PSMI: 0x00000018
  FAULT_REG: 0x00000000
  SYNC_0: 0x0019d786  SYNC_1: 0x00000000
  seqno: 0x0019d782
Does i915.i915_enable_rc6=0 affect the hang?
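For anyone reproducing this, a sketch of where the parameter would be set (the `i915_enable_rc6` parameter name is taken from the question above and applies to kernels of this era; read-back path is an assumption):

```
# On the kernel command line:
i915.i915_enable_rc6=0

# Or when reloading the module by hand:
#   modprobe -r i915 && modprobe i915 i915_enable_rc6=0

# The active value can be read back from:
#   /sys/module/i915/parameters/i915_enable_rc6
```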
Can you also please try this patch: https://patchwork.kernel.org/patch/1363021/
On Ivybridge (i7-3610QM) nightly testing with -queued kernel b4c145c1d245c2cc19754, a calltrace appears in dmesg:

Call Trace:
[  193.797253] [<ffffffffa00ccfd6>] sandybridge_update_wm+0x61/0x414 [i915]
[  193.797364] [<ffffffffa00ce3ea>] intel_update_watermarks+0x19/0x1b [i915]
[  193.797477] [<ffffffffa00d7e7b>] ivb_disable_plane+0x95/0x9e [i915]
[  193.797587] [<ffffffffa00d785a>] intel_disable_plane+0x24/0x60 [i915]
[  193.797696] [<ffffffffa00d78a5>] intel_destroy_plane+0xf/0x24 [i915]
[  193.797805] [<ffffffffa005188f>] drm_mode_config_cleanup+0x147/0x17c [drm]
[  193.797915] [<ffffffffa00bf3c2>] intel_modeset_cleanup+0xf7/0x104 [i915]
[  193.798021] [<ffffffffa009d1e2>] i915_driver_unload+0xec/0x24b [i915]
[  193.798128] [<ffffffffa004c1a3>] drm_put_dev+0xd2/0x1af [drm]
[  193.798231] [<ffffffffa00991f6>] i915_pci_remove+0x18/0x1a [i915]
[  193.798332] [<ffffffff81214858>] pci_device_remove+0x28/0x4c
[  193.798433] [<ffffffff81292843>] __device_release_driver+0x67/0xba
[  193.798534] [<ffffffff81292f43>] driver_detach+0x7e/0xa7
[  193.798633] [<ffffffff812926aa>] bus_remove_driver+0x89/0xab
[  193.798734] [<ffffffff812934c3>] driver_unregister+0x64/0x6d
[  193.798835] [<ffffffff81214adf>] pci_unregister_driver+0x3f/0x84
[  193.798942] [<ffffffffa004e1ee>] drm_pci_exit+0x3f/0x78 [drm]
[  193.799052] [<ffffffffa00dc39f>] i915_exit+0x17/0x19 [i915]
[  193.799152] [<ffffffff8107736c>] sys_delete_module+0x1a2/0x200
[  193.799253] [<ffffffff8108b2a1>] ? __audit_syscall_entry+0x191/0x1bd
[  193.799354] [<ffffffff813da522>] system_call_fastpath+0x16/0x1b
[  193.799452] Code: 48 8b bc f0 a8 28 00 00 48 8b 47 28 48 85 c0 74 06 80 7f 30 00 75 14 49 8b 40 18 41 89 03 49 8b 42 18 89 03 31 c0 e9 ca 00 00 00 <44> 8b 68 5c be 08 00 00 00 44 8b a7 80 00 00 00 4d 8b 72 20 44
[  193.802535] RIP [<ffffffffa00cab8f>] g4x_compute_wm0+0x4d/0x122 [i915]
[  193.802689] RSP <ffff88021ce97bb0>
[  193.802791] ---[ end trace 52f3f9a1f73037cd ]---
Created attachment 66046 [details] dmesg Ivybridge(i7-3610QM)
(In reply to comment #16)
> Can you also please try this patch:
> https://patchwork.kernel.org/patch/1363021/

Added this patch to -queued kernel b4c145c1d245c2cc19754dbe4b718f5a48755993. A calltrace appears in dmesg:

Call Trace:
[ 6914.397312] [<c02259c6>] warn_slowpath_common+0x63/0x78
[ 6914.397314] [<c021a259>] ? default_send_IPI_mask_logical+0x2d/0xb6
[ 6914.397316] [<c0225a3f>] warn_slowpath_fmt+0x26/0x2a
[ 6914.397318] [<c021a259>] default_send_IPI_mask_logical+0x2d/0xb6
[ 6914.397321] [<c0218aec>] native_send_call_func_ipi+0x4e/0x50
[ 6914.397323] [<c0263a71>] smp_call_function_many+0x17b/0x193
[ 6914.397325] [<c0222125>] ? do_flush_tlb_all+0x40/0x40
[ 6914.397327] [<c0222197>] native_flush_tlb_others+0x21/0x24
[ 6914.397329] [<c0222361>] flush_tlb_page+0x5a/0x63
[ 6914.397332] [<c02ba38d>] ptep_clear_flush+0xd/0x14
[ 6914.397335] [<c02b0ddb>] do_wp_page+0x4a8/0x557
[ 6914.397339] [<c024b668>] ? __enqueue_entity+0x63/0x69
[ 6914.397341] [<c02b23c7>] handle_pte_fault+0x589/0x5b5
[ 6914.397343] [<c02b24bd>] handle_mm_fault+0xca/0xd9
[ 6914.397347] [<c0545df8>] ? spurious_fault+0xa8/0xa8
[ 6914.397349] [<c054617e>] do_page_fault+0x386/0x3a2
[ 6914.397351] [<c02251f0>] ? do_fork+0x16a/0x241
[ 6914.397354] [<c0275eb1>] ? __audit_syscall_exit+0x32e/0x349
[ 6914.397358] [<c0207b79>] ? sys_clone+0x1b/0x20
[ 6914.397360] [<c0545df8>] ? spurious_fault+0xa8/0xa8
[ 6914.397362] [<c0543f66>] error_code+0x5a/0x60
[ 6914.397364] [<c0545df8>] ? spurious_fault+0xa8/0xa8
[ 6914.397365] ---[ end trace 5e024e1ca4e544d1 ]---
Ok, we seem to have randomly corrupted state in all kinds of core kernel functions (not just our own). Can you please try to bisect this one (of all affected machines, please pick the one where it reproduces most reliably)?
When running the i-g-t case 'module_reload', the calltrace appears in dmesg (Bug 54101). Piglit cases don't cause this calltrace.

(In reply to comment #17)
> On Ivybridge (i7-3610QM) nightly testing with -queued kernel
> b4c145c1d245c2cc19754, a calltrace appears in dmesg:
> Call Trace:
> [  193.797253] [<ffffffffa00ccfd6>] sandybridge_update_wm+0x61/0x414 [i915]
> [  193.797364] [<ffffffffa00ce3ea>] intel_update_watermarks+0x19/0x1b [i915]
> [  193.797477] [<ffffffffa00d7e7b>] ivb_disable_plane+0x95/0x9e [i915]
> [  193.797587] [<ffffffffa00d785a>] intel_disable_plane+0x24/0x60 [i915]
> [  193.797696] [<ffffffffa00d78a5>] intel_destroy_plane+0xf/0x24 [i915]
> [  193.797805] [<ffffffffa005188f>] drm_mode_config_cleanup+0x147/0x17c [drm]
> [  193.797915] [<ffffffffa00bf3c2>] intel_modeset_cleanup+0xf7/0x104 [i915]
> [  193.798021] [<ffffffffa009d1e2>] i915_driver_unload+0xec/0x24b [i915]
> [  193.798128] [<ffffffffa004c1a3>] drm_put_dev+0xd2/0x1af [drm]
> [  193.798231] [<ffffffffa00991f6>] i915_pci_remove+0x18/0x1a [i915]
> [  193.798332] [<ffffffff81214858>] pci_device_remove+0x28/0x4c
> [  193.798433] [<ffffffff81292843>] __device_release_driver+0x67/0xba
> [  193.798534] [<ffffffff81292f43>] driver_detach+0x7e/0xa7
> [  193.798633] [<ffffffff812926aa>] bus_remove_driver+0x89/0xab
> [  193.798734] [<ffffffff812934c3>] driver_unregister+0x64/0x6d
> [  193.798835] [<ffffffff81214adf>] pci_unregister_driver+0x3f/0x84
> [  193.798942] [<ffffffffa004e1ee>] drm_pci_exit+0x3f/0x78 [drm]
> [  193.799052] [<ffffffffa00dc39f>] i915_exit+0x17/0x19 [i915]
> [  193.799152] [<ffffffff8107736c>] sys_delete_module+0x1a2/0x200
> [  193.799253] [<ffffffff8108b2a1>] ? __audit_syscall_entry+0x191/0x1bd
> [  193.799354] [<ffffffff813da522>] system_call_fastpath+0x16/0x1b
> [  193.799452] Code: 48 8b bc f0 a8 28 00 00 48 8b 47 28 48 85 c0 74 06 80 7f 30 00 75 14 49 8b 40 18 41 89 03 49 8b 42 18 89 03 31 c0 e9 ca 00 00 00 <44> 8b 68 5c be 08 00 00 00 44 8b a7 80 00 00 00 4d 8b 72 20 44
> [  193.802535] RIP [<ffffffffa00cab8f>] g4x_compute_wm0+0x4d/0x122 [i915]
> [  193.802689] RSP <ffff88021ce97bb0>
> [  193.802791] ---[ end trace 52f3f9a1f73037cd ]---
(In reply to comment #3)
> If you have the slub allocator enabled (CONFIG_SLUB), pls don't forget to boot
> with slub_debug on the cmdline. Otherwise our own slab might get squashed
> together with another one that fits nicely.

Adding slub_debug causes Bug 54101.
This issue goes away on the latest -queued kernel 0e0428baf7c156bc2ba8a3.
Verified. Fixed on -queued kernel (commit 8c3f929b6147e142efc58d5d03dc6fa703b14a5d).
Closing old verified.