Summary: | [HSW Bisected] igt/gem_seqno_wrap randomly fail and cause system hang with call trace |
---|---|
Product: | DRI |
Component: | DRM/Intel |
Status: | CLOSED FIXED |
Severity: | major |
Priority: | high |
Version: | unspecified |
Hardware: | All |
OS: | Linux (All) |
Reporter: | lu hua <huax.lu> |
Assignee: | Ben Widawsky <ben> |
QA Contact: | Intel GFX Bugs mailing list <intel-gfx-bugs> |
CC: | ben, przanoni, xunx.fang, yangweix.shui |

Description
lu hua
2013-07-23 02:28:08 UTC
Created attachment 82854: dmesg

Can you please decode the exact line where the BUG happened? When looking at the backtrace the important part is

    [ 59.706535] RIP: 0010:[<ffffffffa00733f5>] [<ffffffffa00733f5>] i915_gem_reset+0x11a/0x24c [i915]

The address can be decoded to line numbers with

    addr2line -e drivers/gpu/drm/i915/i915.ko i915_gem_reset+0x11a

It is of utmost importance that the hex offset you pass to addr2line from the backtrace is from running the exact i915.ko you pass it. Otherwise the line number will be unusable (and often is really misleading).

Otherwise it seems to die in

    static inline unsigned long i915_gem_obj_ggtt_offset(struct drm_i915_gem_object *o)
    {
            BUG_ON(list_empty(&o->vma_list));
            return __i915_gem_obj_to_vma(o)->node.start;
    }

so one for Ben.

Does this only fail on Haswell or have you seen similar failures (with the BUG in the same place) on other platforms, too?

The other trick is to use 'addr2line -i ...', i.e. use -i in the addr2line command to unwind inlined function calls.

1.
    addr2line -e /lib/modules/3.10.0_nightlytop_b1bc20_20130722_+/kernel/drivers/gpu/drm/i915/i915.ko i915_gem_reset+0x11a
    i915_drv.c:0
2. It only happens on haswell.

Hm, line 0 is unhelpful. Can you please retry with the -i option as Chris suggested, i.e.

    addr2line -i -e /lib/modules/3.10.0_nightlytop_b1bc20_20130722_+/kernel/drivers/gpu/drm/i915/i915.ko i915_gem_reset+0x11a

(In reply to comment #6)
> Hm, line 0 is unhelpful. Can you please retry with the -i option as Chris
> suggested, i.e.
>
> addr2line -i -e
> /lib/modules/3.10.0_nightlytop_b1bc20_20130722_+/kernel/drivers/gpu/drm/i915/
> i915.ko
> i915_gem_reset+0x11a

    addr2line -i -e /lib/modules/3.10.0_nightlytop_b1bc20_20130722_+/kernel/drivers/gpu/drm/i915/i915.ko i915_gem_reset+0x11a
    i915_drv.c:0

Is bisection possible?

(In reply to comment #8)
> Is bisection possible?

I will try to bisect the hang.

good commit: 10cd45b6e8ac1d1a99f6bdf0e0c80f2a1351f3f5
bad commit: cce723ed091ac304d48386bcc3524994c345123e

Bisect shows: 221ab43e8abe1e395d4bdd475ee3d4c2548f04ca is the first bad commit

    commit 221ab43e8abe1e395d4bdd475ee3d4c2548f04ca
    Author: Paulo Zanoni <paulo.r.zanoni@intel.com>
    AuthorDate: Fri Jul 12 19:52:36 2013 -0300
    Commit: Daniel Vetter <daniel.vetter@ffwll.ch>
    CommitDate: Fri Jul 19 18:05:14 2013 +0200

        drm/i915: don't read or write GEN6_PMIIR on Gen 5

        The register doesn't exist on Gen 5.

        v2: Simplify checks since pm_iir is always 0 on Gen 5 (Chris)

        Signed-off-by: Paulo Zanoni <paulo.r.zanoni@intel.com>
        Reviewed-by: Mika Kuoppala <mika.kuoppala@intel.com>
        Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>

(In reply to comment #10)
> Bisect shows:221ab43e8abe1e395d4bdd475ee3d4c2548f04ca is the first bad commit
> commit 221ab43e8abe1e395d4bdd475ee3d4c2548f04ca
> Author: Paulo Zanoni <paulo.r.zanoni@intel.com>
> AuthorDate: Fri Jul 12 19:52:36 2013 -0300
> Commit: Daniel Vetter <daniel.vetter@ffwll.ch>
> CommitDate: Fri Jul 19 18:05:14 2013 +0200
>
> drm/i915: don't read or write GEN6_PMIIR on Gen 5
>
> The register doesn't exist on Gen 5.
>
> v2: Simplify checks since pm_iir is always 0 on Gen 5 (Chris)
>
> Signed-off-by: Paulo Zanoni <paulo.r.zanoni@intel.com>
> Reviewed-by: Mika Kuoppala <mika.kuoppala@intel.com>
> Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>

This bisect result doesn't make much sense since it bisects to a patch that only touches ironlake_irq_handler, which was not used on Haswell at that point. There's no way a change to ironlake_irq_handler could break Haswell at that point. Since the issue happens randomly, perhaps we could run the test more times at each bisect step so we make sure good commits are not false positives?

Lu Hua, can you please double-check the bisect result like Paulo suggested?

I bisect it again.
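The suggestion above to run the test more times at each bisect step can be given a rough number. The sketch below is not from the thread: it assumes each run on a truly bad commit fails independently with a fixed probability `fail_rate`, which is a guess you would have to estimate from repeated runs on a known-bad kernel. Under that assumption, a bad commit passes `n` runs in a row with probability `(1 - fail_rate) ** n`, and `runs_needed` (a hypothetical helper name) returns the smallest `n` that keeps that chance below a chosen limit.

```python
import math


def false_good_probability(fail_rate, runs):
    """Chance that a truly bad commit passes `runs` independent runs in a row."""
    return (1.0 - fail_rate) ** runs


def runs_needed(fail_rate, max_false_good):
    """Smallest number of runs per bisect step so that a bad commit is
    mislabelled "good" with probability at most `max_false_good`.

    Assumption (not from the bug report): runs are independent and the
    per-run failure rate on a bad commit is constant.
    """
    if not 0.0 < fail_rate < 1.0:
        raise ValueError("fail_rate must be strictly between 0 and 1")
    if not 0.0 < max_false_good < 1.0:
        raise ValueError("max_false_good must be strictly between 0 and 1")
    # Solve (1 - fail_rate) ** n <= max_false_good for the smallest integer n.
    return math.ceil(math.log(max_false_good) / math.log(1.0 - fail_rate))
```

For example, if the test is believed to fail about half the time on a bad kernel, `runs_needed(0.5, 0.05)` gives the number of clean runs to require before marking a commit good with at most a 5% chance of being wrong. The independence assumption is optimistic for a hang that may depend on machine state, so treat the result as a lower bound.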
Bisect shows: The first bad commit could be any of:
2f63315692b1d3c055972ad33fc7168ae908b97b
52604b1ffabac61eb07cce711f18e18ac74fbeae
2293bb5c0383f522ac659946ccfadb0e6d2f03c5

(In reply to comment #13)
> I bisect it again.
> Bisect shows:The first bad commit could be any of:
> 2f63315692b1d3c055972ad33fc7168ae908b97b
> 52604b1ffabac61eb07cce711f18e18ac74fbeae
> 2293bb5c0383f522ac659946ccfadb0e6d2f03c5

Hm, can you please elaborate why you couldn't test these three commits further?

30bc9b53878a9921b02e3b5bc4283ac1c6de102a good

52604b1ffabac61eb07cce711f18e18ac74fbeae
output:
error 2 opening '/sys/kernel/debug/dri/16/i915_next_seqno'. too old kernel?

2f63315692b1d3c055972ad33fc7168ae908b97b
output:
error 2 opening '/sys/kernel/debug/dri/16/i915_next_seqno'. too old kernel?

2293bb5c0383f522ac659946ccfadb0e6d2f03c5 bad

Hm, what's the testing result for f7f181843e6c24644b4b71b8631a5ea87de05158 ?

If they all fail with the same "error 2 opening '/sys/kernel/debug/dri/16/i915_next_seqno'. too old kernel?" then we first need to figure out where that part broke so that we can do the bisect on the last few commits.

(In reply to comment #16)
> Hm, what's the testing result for f7f181843e6c24644b4b71b8631a5ea87de05158 ?
>
> If they all fail with the same "error 2 opening
> '/sys/kernel/debug/dri/16/i915_next_seqno'. too old kernel?" then we first
> need to figure out where that part broke so that we can do the bisect on the
> last few commits.

Commit f7f181843e6c24644b4b71b8631a5ea87de05158's result is good.

Assigning back to Ben, dunno why this was changed ...

I guess the reason this only happens on Haswell is bug #65387

(In reply to comment #15)
> 30bc9b53878a9921b02e3b5bc4283ac1c6de102a good
>
> 52604b1ffabac61eb07cce711f18e18ac74fbeae
> output:
> error 2 opening '/sys/kernel/debug/dri/16/i915_next_seqno'. too old kernel?
>
> 2f63315692b1d3c055972ad33fc7168ae908b97b
> output:
> error 2 opening '/sys/kernel/debug/dri/16/i915_next_seqno'. too old kernel?
>
> 2293bb5c0383f522ac659946ccfadb0e6d2f03c5 bad

For me, commit 2f63315692b1d3c055972ad33fc7168ae908b97b is bad.

(In reply to comment #20)
> (In reply to comment #15)
> > 30bc9b53878a9921b02e3b5bc4283ac1c6de102a good
> >
> > 52604b1ffabac61eb07cce711f18e18ac74fbeae
> > output:
> > error 2 opening '/sys/kernel/debug/dri/16/i915_next_seqno'. too old kernel?
> >
> > 2f63315692b1d3c055972ad33fc7168ae908b97b
> > output:
> > error 2 opening '/sys/kernel/debug/dri/16/i915_next_seqno'. too old kernel?
> >
> > 2293bb5c0383f522ac659946ccfadb0e6d2f03c5 bad
>
> For me, commit 2f63315692b1d3c055972ad33fc7168ae908b97b is bad.

I did some more investigation and it seems to me that commit 2f63315692b1d3c055972ad33fc7168ae908b97b is the first one that gives me the oops. But the test still fails before the commit.

Please try:
https://patchwork.kernel.org/patch/2843323/

(In reply to comment #22)
> Please try:
> https://patchwork.kernel.org/patch/2843323/

Fixed by this patch.

(In reply to comment #23)
> (In reply to comment #22)
> > Please try:
> > https://patchwork.kernel.org/patch/2843323/
>
> Fixed by this patch.

Patch merged, closing bug:
http://cgit.freedesktop.org/~danvet/drm-intel/commit/?h=drm-intel-next-queued&id=d6fc62c1699f4331c6b05a4b82a7796f8d281af3

Thanks everybody for the tests and the patch,
Paulo

Verified. Fixed.

Closing old verified.
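A note on the `i915_drv.c:0` results earlier in the thread: addr2line has traditionally expected numeric hexadecimal addresses, so one possible explanation (an assumption, not confirmed by anyone in this report) is that the `i915_gem_reset+0x11a` argument was not resolved as a symbol plus offset. A minimal sketch of computing a numeric address instead is shown below. It parses the `symbol+0xOFFSET/0xSIZE` frame from the backtrace and looks the symbol up in text captured from `nm i915.ko`; the helper names and the sample `nm` address in the usage note are hypothetical.

```python
import re

# Matches the "symbol+0xOFFSET/0xSIZE" form printed in kernel backtraces.
FRAME_RE = re.compile(
    r"(?P<sym>[A-Za-z_][\w.]*)\+0x(?P<off>[0-9a-fA-F]+)/0x(?P<size>[0-9a-fA-F]+)"
)


def parse_frame(text):
    """Return (symbol, offset, size) for the first frame found in `text`."""
    match = FRAME_RE.search(text)
    if match is None:
        raise ValueError("no symbol+offset/size frame found")
    return (
        match.group("sym"),
        int(match.group("off"), 16),
        int(match.group("size"), 16),
    )


def symbol_address(nm_output, symbol):
    """Look up `symbol` in text produced by `nm`.

    Defined symbols are printed as "<hex address> <type> <name>"; undefined
    symbols have no address column and are skipped.
    """
    for line in nm_output.splitlines():
        parts = line.split()
        if len(parts) == 3 and parts[2] == symbol:
            return int(parts[0], 16)
    raise KeyError(symbol)


def absolute_address(nm_output, frame_text):
    """Symbol address plus frame offset, suitable for printing with hex()."""
    symbol, offset, size = parse_frame(frame_text)
    if offset >= size:
        raise ValueError("offset lies outside the function")
    return symbol_address(nm_output, symbol) + offset
```

With a made-up `nm` line such as `0000000000039000 T i915_gem_reset`, the RIP line from this report yields `0x3911a`, which could then be passed to `addr2line -i -e i915.ko`. For a relocatable module the address from `nm` is section-relative, so whether addr2line maps it to the right line may depend on the toolchain; the requirement stated in the thread still applies, namely that the i915.ko inspected must be the exact one that produced the backtrace.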