Summary: | [SNB dinq regression] i915_reset() triggers OOPS
---|---
Product: | DRI
Component: | DRM/Intel
Status: | CLOSED FIXED
Severity: | major
Priority: | high
Version: | unspecified
Hardware: | All
OS: | Linux (All)
Reporter: | lu hua <huax.lu>
Assignee: | Daniel Vetter <daniel>
CC: | ben, chris, daniel, jbarnes, xunx.fang, yi.sun

Description
lu hua
2012-04-16 20:37:00 UTC
What happened to the original dmesg? Can you please attach the full unmolested output?

In future, it is the *first* OOPS that is important, from the initial BUG line to the end trace. The first few lines give the reason for the oops, and the callstack shows where it happened.

2 things:
- the dmesg is cut to a width of 80 chars, i.e. a lot of the long lines are not complete.
- dmesg talks about a gpu hang, can you try to grab the i915_error_state?

And like Chris said, for issues with dmesg output, the first error/backtrace is the important one, not the last. We have an oops in i915_driver_irq_postinstall, but unfortunately that one's cut off, too.

The NULL dereference is from:

    static struct cpu_workqueue_struct *get_work_cwq(struct work_struct *work)
    {
            unsigned long data = atomic_long_read(&work->data);

            if (data & WORK_STRUCT_CWQ)
                    return (void *)(data & WORK_STRUCT_WQ_DATA_MASK);
            else
                    return NULL;
    }

    static void process_one_work(struct worker *worker, struct work_struct *work)
    __releases(&gcwq->lock)
    __acquires(&gcwq->lock)
    {
            struct cpu_workqueue_struct *cwq = get_work_cwq(work);
            struct global_cwq *gcwq = cwq->gcwq; <-- OOPS
            ...

introduced in

    commit 7e11629d0efec829cbf62366143ba1081944a70e
    Author: Tejun Heo <tj@kernel.org>
    Date: Tue Jun 29 10:07:13 2010 +0200

        workqueue: use shared worklist and pool all workers per cpu

So the question is: why are we hitting it, and why now?

(That dissection was based on the assumption that our compiled objects are similar enough.) Please also run "gdb vmlinux; list *process_one_work+0x2f; list *worker_thread+0x17f;" and paste the output.

Created attachment 60227 [details]
dmesgupdate
Created attachment 60228 [details]
i915_error_state
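
The request above for the first oops, from the initial BUG line to the end trace, can be sketched as a small filter over a saved dmesg. This is only an illustration and was not posted in this bug: `first_oops` is a hypothetical helper name, and it assumes the oops begins on a line containing `BUG:` and finishes on a line containing `end trace`, as the comment describes.

```shell
# Hypothetical helper: print only the first oops found in a kernel log.
# Assumption: the oops starts at a line containing "BUG:" and ends at the
# first following line containing "end trace".
first_oops() {
    awk '
        /BUG:/                 { started = 1 }   # first BUG line opens the report
        started                { print }         # copy every line of the report
        started && /end trace/ { exit }          # stop after the first end marker
    ' "$@"
}
```

Saving the log with `dmesg > dmesg.txt` and running `first_oops dmesg.txt` keeps the long lines intact, which avoids the 80-character truncation mentioned above.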
The error state is just another mesa fail, annoying but the real bug here lies in the inability to recover from the error. Does the bug trigger if you "echo 1 > /sys/kernel/debug/dri/0/i915_wedged" immediately upon booting?

I ran 'echo 1 > /sys/kernel/debug/dri/0/i915_wedged'. It has the same result.

Now you have a quick test to use for bisecting. Good luck! :)

... at least smells like one.

I will bisect it.

Ping on the bisect result of this regression ...

Bisect shows:
There are only 'skip'ped commits left to test.
The first bad commit could be any of:
3a744038b3709cd467b693f3e146c6d5b8120a18
ed3be9a0e3c1050fe07d69a8c600d86cac76cdc4
fa5a97bb0c65cb8d0382b72a55e2b87e15268289
a171e782a97d4ba55d7fa02f9a46904288b2c229
e841a36abb0f95ea356d52e4386b8e8f762e9c40
2f2cc27f50e3d232602d3b7c972071b4a30e5e38
cee1a799eb044657922c4d63003d7bf71f8c8b8d
15c08f664d8ca4f4d0e202cbd4034422a706ef80
9300928692f835f76f5604b3b51c3085977edf68
e032b376551a61662b20a2c8544fbbc568ab2e7f
eb4168158f79237498e4d3ddcef6e9436db15a4a
5777d9b34aec841429ddade56403b3f53a821a1d
5f12760d289fd2da685cb54eebb08c107b146872
09bf14b901f2c1908b6a72fe934457acdd1fa430
546e78452a3f81eb45ae5c671c71db05389d42c8
51579137c500362018b5341f5dca47807ed558aa
d49fe3c4cd22965de7422dd81d46110fc3d4deef
4a1e8ebc5e5918079109cc1cd1c44c2f0fd0e11b
We cannot bisect more!
------
4a1e8ebc5e5 is a bad commit. The others were skipped because of build failures.

Build fail is really bad. Can you please paste the compiler error you're getting? I can prep a quick git branch with the compiler error fixed so that you can bisect the complete range.

Can you confirm that the bug is NOT in 66cfb32772495068fbb5627b2dc88649ad66c3e5, but is in 4a1e8ebc5e5918079109cc1cd1c44c2f0fd0e11b?

Created attachment 60598 [details]
SNB build fail error message
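
The "only 'skip'ped commits left" result above comes from marking commits that fail to build as untestable. A minimal sketch of that classification follows, assuming a hypothetical helper `bisect_step` that takes a build command and a test command as strings. It relies on the documented `git bisect run` convention that exit status 125 means "skip", 0 means "good", and other small non-zero values mean "bad". Nothing here was posted in this bug.

```shell
# Hypothetical helper: classify one bisect step.
#   $1 = build command (string), $2 = test command (string)
# Exit status follows the `git bisect run` convention:
#   0 = good, 1 = bad, 125 = cannot be tested (skip).
bisect_step() {
    build_cmd="$1"
    test_cmd="$2"

    # A commit that does not build cannot be judged, so it is skipped.
    if ! sh -c "$build_cmd"; then
        return 125
    fi

    # The commit built: the test command decides between good and bad.
    if sh -c "$test_cmd"; then
        return 0
    fi
    return 1
}
```

For a kernel bug like this one, where each step needs a reboot, the same decision is usually made by hand with `git bisect good`, `git bisect bad`, or `git bisect skip`. The more commits are skipped, the wider the final range of candidates, which is why fixing the build failure matters.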
Commit 66cfb32772495068fbb5627b2dc88649ad66c3e5 is a good commit, commit 4a1e8ebc5e5918079109cc1cd1c44c2f0fd0e11b is a bad commit.

(In reply to comment #15)
> Can you confirm that the bug is NOT in
> 66cfb32772495068fbb5627b2dc88649ad66c3e5, but is in
> 4a1e8ebc5e5918079109cc1cd1c44c2f0fd0e11b?

> --- Comment #16 from lu hua <huax.lu@intel.com> 2012-04-26 00:43:34 PDT ---
> Created attachment 60598 [details]
> --> https://bugs.freedesktop.org/attachment.cgi?id=60598
> SNB build fail error message
Looks like autofs4 fails to compile. You can just disable that in the
configuration with the CONFIG_AUTOFS4_FS option. Then you could bisect the
remaining kernel revisions.
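
Turning off the option suggested above can be sketched as a one-line edit of the kernel `.config`. This is a minimal sketch under stated assumptions: `disable_autofs4` is a hypothetical helper name, it takes the path to the config file as its only argument, and it uses GNU `sed` (for `-i` and `-E`). It was not posted in this bug.

```shell
# Hypothetical helper: mark CONFIG_AUTOFS4_FS as disabled in a kernel .config.
#   $1 = path to the .config file
# Assumption: GNU sed is available (in-place editing with -i).
disable_autofs4() {
    cfg="$1"
    # Rewrite "CONFIG_AUTOFS4_FS=y" or "=m" into the form Kconfig uses
    # for a disabled option; all other lines are left untouched.
    sed -i -E 's/^CONFIG_AUTOFS4_FS=.*$/# CONFIG_AUTOFS4_FS is not set/' "$cfg"
}
```

Inside a kernel tree that ships the `scripts/config` helper, `scripts/config --disable AUTOFS4_FS` followed by `make oldconfig` should have the same effect. The change has to be reapplied, or the edited `.config` reused, at every bisect step.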
Created attachment 60635 [details] [review]
dont clobber rps work when reinstalling the irq

I've managed to reproduce this bug by accident, and I think this patch here should fix the problem. Note that it's also included in latest -queued. Please test.

Ok, I've tested this on my ivb with a patch series of mine. Before this patch, it crashed after 1-3 gpu hangs, now it just survived 250+. I call this fixed, please reopen if this is not the case.

    commit f737bc449d93084bccf8718f2f739f86033d914e
    Author: Chris Wilson <chris@chris-wilson.co.uk>
    Date: Tue Apr 24 22:59:41 2012 +0100

        drm/i915: Unconditionally initialise the interrupt workers

Verified. This issue doesn't happen on -queued kernel commit b57aa4007a558be50955f9b58f5da98fcb78aa85.

Closing old verified.