Bug 48798 - [SNB dinq regression] i915_reset() triggers OOPS
Status: CLOSED FIXED
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/Intel
Version: unspecified
Hardware: All Linux (All)
Importance: high major
Assignee: Daniel Vetter
QA Contact:
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2012-04-16 20:37 UTC by lu hua
Modified: 2017-10-06 14:50 UTC (History)
CC List: 6 users

See Also:
i915 platform:
i915 features:


Attachments
queryAndRenderOnFBO dmesg (10.67 KB, text/plain)
2012-04-16 20:37 UTC, lu hua
dmesgupdate (7.18 KB, text/plain)
2012-04-17 19:37 UTC, lu hua
i915_error_state (2.24 MB, text/plain)
2012-04-17 19:38 UTC, lu hua
SNB build fail error message (102.62 KB, text/plain)
2012-04-26 00:43 UTC, lu hua
dont clobber rps work when reinstalling the irq (2.54 KB, patch)
2012-04-26 13:58 UTC, Daniel Vetter

Description lu hua 2012-04-16 20:37:00 UTC
Created attachment 60145
queryAndRenderOnFBO dmesg

System Environment:
--------------------------
Arch:            i386
Platform:        Sandybridge
Mesa:		(master)847c89870238fe5813e89831b38d5fab5356158c
Xserver:	(master)xorg-server-1.12.0-66-g80fefc42f5e67e6b4a4b440d8991bee7e5f38359
Xf86_video_intel:(master)2.18.0-211-ga16616209bb2dcb7aaa859b38e154f0a10faa82b
Kernel:	(drm-intel-next-queued) fc6826d1dcd65f3d1e9a5377678882e4e08f02be

Bug detailed description:
-----------------------------
This happens on Sandybridge with the drm-intel-next-queued kernel. It is intermittent, occurring about once in 5 runs. It does not happen on the fixes kernel.
This test case is also affected by Bug 47488; since bug 47488 appeared, the result has been unstable: either FAIL or XHANG.

Call Trace:
[ 1553.194706]  [<c02369f6>] ? wq_worker_sleeping+0xc/0x71
[ 1553.195842]  [<c053f376>] __schedule+0x13c/0x766
[ 1553.197216]  [<c02c1d0f>] ? kmem_cache_free+0x95/0xc6
[ 1553.198566]  [<c0221bf5>] ? __cleanup_sighand+0x23/0x26
[ 1553.200060]  [<c0237d53>] ? free_pid+0x8c/0x93
[ 1553.201791]  [<c027a844>] ? call_rcu_sched+0xf/0x12
[ 1553.203618]  [<c0225eb0>] ? release_task+0x368/0x378
[ 1553.205668]  [<c023dd5c>] ? switch_task_namespaces+0xf/0x3a
[ 1553.207680]  [<c053fc03>] schedule+0x51/0x53
[ 1553.209421]  [<c022745a>] do_exit+0x690/0x694
[ 1553.210698]  [<c0541305>] oops_end+0x93/0x9b
[ 1553.211957]  [<c021d1f3>] no_context+0x158/0x162
[ 1553.213206]  [<c021d2e8>] __bad_area_nosemaphore+0xeb/0xf5
[ 1553.214457]  [<c0542aef>] ? spurious_fault+0xad/0xad
[ 1553.215703]  [<c021d2ff>] bad_area_nosemaphore+0xd/0x10
[ 1553.216951]  [<c0542cae>] do_page_fault+0x1bf/0x3a7
[ 1553.218193]  [<c02473b7>] ? default_wake_function+0xb/0xd
[ 1553.219436]  [<c0240153>] ? __wake_up_common+0x34/0x5c
[ 1553.220644]  [<c0542aef>] ? spurious_fault+0xad/0xad
[ 1553.221815]  [<c0540cb2>] error_code+0x5a/0x60
[ 1553.222968]  [<c0542aef>] ? spurious_fault+0xad/0xad
[ 1553.224114]  [<c0237033>] ? process_one_work+0x2f/0x2d3
[ 1553.225269]  [<f8401edf>] ? i915_driver_irq_postinstall+0x156/0x156 [i915]
[ 1553.226413]  [<c02375eb>] worker_thread+0x17f/0x298
[ 1553.227558]  [<c023746c>] ? rescuer_thread+0x195/0x195
[ 1553.228704]  [<c023a02d>] kthread+0x67/0x6c
[ 1553.229820]  [<c0239fc6>] ? kthread_freezable_should_stop+0x4e/0x4e
[ 1553.230922]  [<c0545c76>] kernel_thread_helper+0x6/0xd
[ 1553.231995] Code: e8 ff f6 ff ff 31 c0 59 5b 5e 5f 5d c3 55 64 a1 8c 45 76 c0 8b 80 64 02 00 00 89 e5 5d 8b 40 f8 c3 55 8b 80 64 02 00 00 89 e5 5d <8b> 40 fc c3 55 31 c0 89 e5 5d c3 55 8d 50 04 89 e5 66 c7 00 00
[ 1553.234391] EIP: [<c0239cda>] kthread_data+0xa/0xe SS:ESP 0068:f5715d58
[ 1553.235482] CR2: 00000000fffffffc
[ 1553.236532] ---[ end trace 88094ceb151ece1b ]---

Reproduce steps:
----------------
1. start X
2. ./oglconform -z -suite all -v 2 -D 123 -test conditional_render advanced.fbo.queryAndRenderOnFBO
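
For reference, the CR2 value in the trace above can be checked by hand. The faulting bytes marked in the Code: line, <8b> 40 fc, decode on x86 as mov -0x4(%eax),%eax, so the faulting address is %eax minus 4. The sketch below is a userspace illustration only, with hypothetical helper names (it is not kernel code). Assuming that decoding, it shows that a fault address of 0xfffffffc on a 32-bit machine corresponds to a base register of 0, that is, a read through a NULL pointer in kthread_data().

```c
#include <stdint.h>

/* Effective address of `mov disp8(%eax), %eax` on a 32-bit CPU:
 * base register plus the sign-extended 8-bit displacement,
 * wrapping modulo 2^32. Hypothetical helper for illustration. */
uint32_t effective_address(uint32_t base, int8_t disp8)
{
    return base + (uint32_t)(int32_t)disp8;
}

/* Inverse: recover the base register from the faulting address (CR2)
 * and the displacement encoded in the instruction. */
uint32_t base_from_fault(uint32_t fault_addr, int8_t disp8)
{
    return fault_addr - (uint32_t)(int32_t)disp8;
}
```

With a base of 0 and a displacement of -4, effective_address() yields 0xfffffffc, which matches the CR2 line; base_from_fault(0xfffffffc, -4) yields 0.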
Comment 1 Chris Wilson 2012-04-17 01:24:22 UTC
What happened to the original dmesg? Can you please attach the full unmolested output?

In future, it is the *first* OOPS that is important, from the initial BUG line to the end trace: the first few lines give the reason for the oops, and the callstack shows where it happened.
Comment 2 Daniel Vetter 2012-04-17 01:35:16 UTC
2 things:
- the dmesg is cut to a width of 80 chars, i.e. a lot of the long lines are not complete.
- dmesg talks about a gpu hang, can you try to grab the i915_error_state?

And like Chris said, for issues with dmesg output, the first error/backtrace is the important one, not the last. We have an oops in i915_driver_irq_postinstall, but unfortunately that one's cut off, too.
Comment 3 Chris Wilson 2012-04-17 01:35:56 UTC
The NULL dereference is from:

static struct cpu_workqueue_struct *get_work_cwq(struct work_struct *work)
{
        unsigned long data = atomic_long_read(&work->data);

        if (data & WORK_STRUCT_CWQ)
                return (void *)(data & WORK_STRUCT_WQ_DATA_MASK);
        else
                return NULL;
}

static void process_one_work(struct worker *worker, struct work_struct *work)
__releases(&gcwq->lock)
__acquires(&gcwq->lock)
{
        struct cpu_workqueue_struct *cwq = get_work_cwq(work);
        struct global_cwq *gcwq = cwq->gcwq; <-- OOPS
...


introduced by

commit 7e11629d0efec829cbf62366143ba1081944a70e
Author: Tejun Heo <tj@kernel.org>
Date:   Tue Jun 29 10:07:13 2010 +0200

    workqueue: use shared worklist and pool all workers per cpu


So the question is: why are we hitting it, and why now?
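
To make the failure path concrete, here is a minimal userspace model of the lookup quoted above. It is a sketch, not the kernel's code: the model_* names, the flag value and the mask are illustrative assumptions, and the guarded helper exists only to make the NULL case visible without crashing. It assumes the cwq object is at least 4-byte aligned so the two low bits of the pointer are free for flags.

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative flag layout only; the kernel's real bit assignments differ. */
#define MODEL_WORK_STRUCT_CWQ  ((uintptr_t)0x2)
#define MODEL_WQ_DATA_MASK     (~(uintptr_t)0x3)

struct model_gcwq { int id; };
struct model_cwq  { struct model_gcwq *gcwq; };
struct model_work { uintptr_t data; };

/* Pack a cwq pointer plus the CWQ flag into the work item's data word. */
void model_set_work_cwq(struct model_work *work, struct model_cwq *cwq)
{
    work->data = (uintptr_t)cwq | MODEL_WORK_STRUCT_CWQ;
}

/* Same shape as the quoted get_work_cwq(): NULL when the flag is clear. */
struct model_cwq *model_get_work_cwq(const struct model_work *work)
{
    if (work->data & MODEL_WORK_STRUCT_CWQ)
        return (struct model_cwq *)(work->data & MODEL_WQ_DATA_MASK);
    return NULL;
}

/* The quoted process_one_work() dereferences the result unconditionally.
 * This guarded variant returns NULL instead of faulting. */
struct model_gcwq *model_work_gcwq(const struct model_work *work)
{
    struct model_cwq *cwq = model_get_work_cwq(work);

    return cwq ? cwq->gcwq : NULL;
}
```

In this model, a work item whose data word has been reset to 0 yields a NULL cwq, which is the condition under which the unguarded cwq->gcwq access would fault.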
Comment 4 Chris Wilson 2012-04-17 01:41:08 UTC
(That dissection was based on the assumption that our compiled objects are similar enough.)


Could you also run "gdb vmlinux; list *process_one_work+0x2f; list *worker_thread+0x17f;" and paste the output?
Comment 5 lu hua 2012-04-17 19:37:07 UTC
Created attachment 60227
dmesgupdate
Comment 6 lu hua 2012-04-17 19:38:35 UTC
Created attachment 60228
i915_error_state
Comment 7 Chris Wilson 2012-04-18 00:53:16 UTC
The error state is just another mesa fail; annoying, but the real bug here lies in the inability to recover from the error.

Does the bug trigger if you "echo 1 > /sys/kernel/debug/dri/0/i915_wedged" immediately upon booting?
Comment 8 lu hua 2012-04-18 02:23:32 UTC
I ran 'echo 1 > /sys/kernel/debug/dri/0/i915_wedged'. It gives the same result.
Comment 9 Chris Wilson 2012-04-18 02:34:46 UTC
Now you have a quick test to use for bisecting. Good luck! :)
Comment 10 Daniel Vetter 2012-04-18 04:37:43 UTC
 ... at least smells like one.
Comment 11 lu hua 2012-04-19 02:51:31 UTC
I will bisect it.
Comment 12 Daniel Vetter 2012-04-23 14:18:12 UTC
Ping on the bisect result of this regression ...
Comment 13 lu hua 2012-04-25 01:51:22 UTC
Bisect shows:
There are only 'skip'ped commits left to test.
The first bad commit could be any of:
3a744038b3709cd467b693f3e146c6d5b8120a18
ed3be9a0e3c1050fe07d69a8c600d86cac76cdc4
fa5a97bb0c65cb8d0382b72a55e2b87e15268289
a171e782a97d4ba55d7fa02f9a46904288b2c229
e841a36abb0f95ea356d52e4386b8e8f762e9c40
2f2cc27f50e3d232602d3b7c972071b4a30e5e38
cee1a799eb044657922c4d63003d7bf71f8c8b8d
15c08f664d8ca4f4d0e202cbd4034422a706ef80
9300928692f835f76f5604b3b51c3085977edf68
e032b376551a61662b20a2c8544fbbc568ab2e7f
eb4168158f79237498e4d3ddcef6e9436db15a4a
5777d9b34aec841429ddade56403b3f53a821a1d
5f12760d289fd2da685cb54eebb08c107b146872
09bf14b901f2c1908b6a72fe934457acdd1fa430
546e78452a3f81eb45ae5c671c71db05389d42c8
51579137c500362018b5341f5dca47807ed558aa
d49fe3c4cd22965de7422dd81d46110fc3d4deef
4a1e8ebc5e5918079109cc1cd1c44c2f0fd0e11b
We cannot bisect more!

------
4a1e8ebc5e5 is a bad commit. The others were skipped because of build failures.
Comment 14 Daniel Vetter 2012-04-25 02:23:48 UTC
Build fail is really bad. Can you please paste the compiler error you're getting? I can prep a quick git branch with the compiler error fixed so that you can bisect the complete range.
Comment 15 Chris Wilson 2012-04-25 02:26:45 UTC
Can you confirm that the bug is NOT in 66cfb32772495068fbb5627b2dc88649ad66c3e5, but is in 4a1e8ebc5e5918079109cc1cd1c44c2f0fd0e11b?
Comment 16 lu hua 2012-04-26 00:43:34 UTC
Created attachment 60598
SNB build fail error message
Comment 17 lu hua 2012-04-26 00:45:27 UTC
Commit 66cfb32772495068fbb5627b2dc88649ad66c3e5 is a good commit, commit 4a1e8ebc5e5918079109cc1cd1c44c2f0fd0e11b is a bad commit.
(In reply to comment #15)
> Can you confirm that the bug is NOT in
> 66cfb32772495068fbb5627b2dc88649ad66c3e5, but is in
> 4a1e8ebc5e5918079109cc1cd1c44c2f0fd0e11b?
Comment 18 Daniel Vetter 2012-04-26 01:11:52 UTC
> --- Comment #16 from lu hua <huax.lu@intel.com> 2012-04-26 00:43:34 PDT ---
> Created attachment 60598 [details]
>   --> https://bugs.freedesktop.org/attachment.cgi?id=60598
> SNB build fail error message

Looks like autofs4 fails to compile. You can just disable that in the
configuration with the CONFIG_AUTOFS4_FS option. Then you could bisect the
remaining kernel revisions.
Comment 19 Daniel Vetter 2012-04-26 13:58:03 UTC
Created attachment 60635
dont clobber rps work when reinstalling the irq

I've managed to reproduce this bug by accident, and I think this patch here should fix the problem. Note that it's also included in latest -queued. Please test.
Comment 20 Daniel Vetter 2012-04-26 14:54:27 UTC
Ok, I've tested this on my ivb with a patch series of mine. Before this patch, it crashed after 1-3 gpu hangs; now it has survived 250+. I call this fixed, please reopen if this is not the case.

commit f737bc449d93084bccf8718f2f739f86033d914e
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Tue Apr 24 22:59:41 2012 +0100

    drm/i915: Unconditionally initialise the interrupt workers
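
Going only by the patch title ("dont clobber rps work when reinstalling the irq") and the commit subject above, one plausible reading is that the irq postinstall hook re-initialised a work item that could already be queued, wiping its state. The patch body is not quoted in this report, so the following is an assumption-laden userspace sketch with hypothetical sketch_* names, not the actual i915 change. It contrasts re-initialising the work item on every irq reinstall with initialising it once at driver load.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative "queued" marker; not a real kernel flag value. */
#define SKETCH_WORK_QUEUED ((uintptr_t)0x1)

struct sketch_work { uintptr_t data; };
struct sketch_dev  { struct sketch_work rps_work; };

/* Stand-in for work item initialisation: resets all state. */
void sketch_init_work(struct sketch_work *w)
{
    w->data = 0;
}

/* Stand-in for queueing: marks the work item as pending. */
void sketch_queue_work(struct sketch_work *w)
{
    w->data |= SKETCH_WORK_QUEUED;
}

bool sketch_work_queued(const struct sketch_work *w)
{
    return (w->data & SKETCH_WORK_QUEUED) != 0;
}

/* Problematic pattern: every irq (re)install resets the work item,
 * even if it is currently queued. */
void sketch_irq_postinstall_reinit(struct sketch_dev *dev)
{
    sketch_init_work(&dev->rps_work);
}

/* Alternative pattern: initialise once at load time... */
void sketch_driver_load(struct sketch_dev *dev)
{
    sketch_init_work(&dev->rps_work);
}

/* ...and leave the work item untouched when the irq is reinstalled. */
void sketch_irq_postinstall(struct sketch_dev *dev)
{
    (void)dev;
}
```

In this model, a queued work item keeps its state across sketch_irq_postinstall() but loses it across sketch_irq_postinstall_reinit(), which is the kind of clobbering the patch title describes.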
Comment 21 lu hua 2012-04-27 22:46:03 UTC
Verified.
This issue doesn't happen on -queued kernel commit b57aa4007a558be50955f9b58f5da98fcb78aa85.
Comment 22 Elizabeth 2017-10-06 14:50:21 UTC
Closing old verified.

