67198 – [HSW Bisected]igt/gem_seqno_wrap randomly fail and cause system hang with call trace

Bug 67198 - [HSW Bisected]igt/gem_seqno_wrap randomly fail and cause system hang with call trace

Summary: [HSW Bisected]igt/gem_seqno_wrap randomly fail and cause system hang with cal...

Status:	CLOSED FIXED

Alias:	None

Product:	DRI
Classification:	Unclassified
Component:	DRM/Intel (show other bugs)
Version:	unspecified
Hardware:	All Linux (All)

Importance:	high major
Assignee:	Ben Widawsky
QA Contact:	Intel GFX Bugs mailing list

URL:
Whiteboard:
Keywords:

Depends on:
Blocks:

Reported:	2013-07-23 02:28 UTC by lu hua
Modified:	2017-10-06 14:44 UTC (History)
CC List:	4 users (show)

See Also:
i915 platform:
i915 features:

Attachments
dmesg (325.72 KB, text/plain) 2013-07-23 02:28 UTC, lu hua	no flags	Details
View All

Description lu hua 2013-07-23 02:28:08 UTC

System Environment:
--------------------------
Platform: Haswell
Kernel:   (nightly)b1bc2058d21f02caa2543f8c776fd2c986bbab5d

Bug detailed description:
-------------------------
It happens on haswell with -nightly branch, drm-intel-next-queued branch, drm-intel-fixes branch. It happens 2 in 5 runs.

output:
gem_seqno_wrap: gem_seqno_wrap.c:501: preset_run_once: Assertion `write_seqno(1) == 0' failed.

Call trace:
[   59.701938] [drm:i915_hangcheck_elapsed] *ERROR* stuck on blitter ring
[   59.702927] [drm] capturing error event; look for more information in /sys/class/drm/card0/error
[   59.706398] [drm:i915_driver_open], [   59.706401] [drm:i915_error_work_func], resetting chip
[   59.706506] ------------[ cut here ]------------
[   59.706507] kernel BUG at drivers/gpu/drm/i915/i915_drv.h:1406!
[   59.706508] invalid opcode: 0000 [#1] SMP
[   59.706509] Modules linked in: netconsole configfs ipv6 dm_mod snd_hda_codec_realtek acpi_cpufreq coretemp kvm_intel kvm microcode pcspkr i2c_i801 snd_hda_intel iTCO_wdt iTCO_vendor_support snd_hda_codec snd_hwdep snd_pcm snd_page_alloc lpc_ich mfd_core snd_timer snd soundcore i915 video button drm_kms_helper drm mperf freq_table [last unloaded: netconsole]
[   59.706524] CPU: 7 PID: 1036 Comm: kworker/u16:4 Not tainted 3.10.0_nightlytop_b1bc20_20130722_+ #5982
[   59.706525] Hardware name: Intel Corporation Shark Bay Client platform/Flathead Creek Crb, BIOS HSWLPTU1.86C.0109.R03.1301282055 01/28/2013
[   59.706533] Workqueue: i915 i915_error_work_func [i915]
[   59.706535] task: ffff880244781b40 ti: ffff880242ae2000 task.ti: ffff880242ae2000
[   59.706535] RIP: 0010:[<ffffffffa00733f5>]  [<ffffffffa00733f5>] i915_gem_reset+0x11a/0x24c [i915]
[   59.706542] RSP: 0018:ffff880242ae3d28  EFLAGS: 00010246
[   59.706543] RAX: ffff88024469d980 RBX: ffff88023fbad4d8 RCX: ffff88024469da20
[   59.706544] RDX: 00000000009d5004 RSI: ffff88024469da20 RDI: ffff88023fbac040
[   59.706545] RBP: 0000000000000002 R08: 0000000000000001 R09: ffff88023fbad470
[   59.706545] R10: ffff880244a09750 R11: ffffffffa007347b R12: ffff88023fbac000
[   59.706546] R13: 00000000009d5004 R14: ffff880244448800 R15: 0000000000000011
[   59.706547] FS:  0000000000000000(0000) GS:ffff88024e3c0000(0000) knlGS:0000000000000000
[   59.706548] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   59.706549] CR2: 0000000001770118 CR3: 0000000242dc0000 CR4: 00000000001407e0
[   59.706549] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   59.706550] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[   59.706550] Stack:
[   59.706551]  ffff88023f88b820 ffffffffa00c0a85 ffff880244448800 ffff88023fbad4d0
[   59.706552]  ffff880244a09720 ffff88023fbad5d8 ffffffff8198c5ca ffff880244448800
[   59.706554]  ffff88023fbac000 0000000000000000 ffff880244448820 0000000000000000
[   59.706556] Call Trace:
[   59.706558]  [<ffffffffa0063303>] ? i915_reset+0x36/0x1a9 [i915]
[   59.706563]  [<ffffffffa0067b2f>] ? i915_error_work_func+0xba/0x14b [i915]
[   59.706569]  [<ffffffff8104069c>] ? process_one_work+0x1c5/0x2df
[   59.706573]  [<ffffffff81040bf3>] ? worker_thread+0x1c7/0x2bc
[   59.706575]  [<ffffffff81040a2c>] ? rescuer_thread+0x251/0x251
[   59.706577]  [<ffffffff8104510a>] ? kthread+0xad/0xb5
[   59.706579]  [<ffffffff8104505d>] ? kthread_freezable_should_stop+0x40/0x40
[   59.706581]  [<ffffffff816f232c>] ? ret_from_fork+0x7c/0xb0
[   59.706584]  [<ffffffff8104505d>] ? kthread_freezable_should_stop+0x40/0x40
[   59.706586] Code: 0f 82 33 01 00 00 39 ca 0f 83 2b 01 00 00 eb 64 44 8b 49 04 48 85 c0 74 1a 48 8b 88 a0 00 00 00 48 05 a0 00 00 00 48 39 c1 75 02 <0f> 0b 4c 8b 41 d8 eb 03 45 31 c0 84 d2 44 89 2c 24 48 8b 13 48
[   59.706606] RIP  [<ffffffffa00733f5>] i915_gem_reset+0x11a/0x24c [i915]
[   59.706612]  RSP <ffff880242ae3d28>
[   59.706613] ---[ end trace 2b81663100df000f ]---


Reproduce steps:
----------------
1. ./gem_seqno_wrap

Comment 1 lu hua 2013-07-23 02:28:30 UTC

Created attachment 82854 [details]
dmesg

Comment 2 Daniel Vetter 2013-07-23 05:42:09 UTC

Can you please decode the exact line where the BUG happened?

When looking at the backtrace the important part is

[   59.706535] RIP: 0010:[<ffffffffa00733f5>]  [<ffffffffa00733f5>] i915_gem_reset+0x11a/0x24c [i915]

The address can be decoded to line numbers with

addr2line -e drivers/gpu/drm/i915/i915.ko i915_gem_reset+0x11a

It is of utmost importance that the hex offset you pass to addr2line from the backtrace is from running the exact i915.ko you pass it. Otherwise the line number will be unusable (and often is really misleading).

Otherwise it seems to die in

static inline unsigned long
i915_gem_obj_ggtt_offset(struct drm_i915_gem_object *o)
{
	BUG_ON(list_empty(&o->vma_list));
	return __i915_gem_obj_to_vma(o)->node.start;
}

so one for Ben.

Comment 3 Daniel Vetter 2013-07-23 05:55:26 UTC

Does this only fail on Haswell or have you seen similar failures (with the BUG in the same place) on other platforms, too?

Comment 4 Chris Wilson 2013-07-23 06:12:40 UTC

The other trick is to use 'addr2line -i ...', i.e. use -i in the addr2line command to unwind inlined function calls.

Comment 5 lu hua 2013-07-23 07:52:14 UTC

1. addr2line -e /lib/modules/3.10.0_nightlytop_b1bc20_20130722_+/kernel/drivers/gpu/drm/i915/i915.ko i915_gem_reset+0x11a
i915_drv.c:0

2. It only happens on haswell.

Comment 6 Daniel Vetter 2013-07-23 08:19:10 UTC

Hm, line 0 is unhelpful. Can you please retry with the -i option as Chris suggested, i.e.

addr2line -i -e
/lib/modules/3.10.0_nightlytop_b1bc20_20130722_+/kernel/drivers/gpu/drm/i915/i915.ko
i915_gem_reset+0x11a

Comment 7 lu hua 2013-07-23 08:28:24 UTC

(In reply to comment #6)
> Hm, line 0 is unhelpful. Can you please retry with the -i option as Chris
> suggested, i.e.
> 
> addr2line -i -e
> /lib/modules/3.10.0_nightlytop_b1bc20_20130722_+/kernel/drivers/gpu/drm/i915/
> i915.ko
> i915_gem_reset+0x11a

 addr2line -i -e /lib/modules/3.10.0_nightlytop_b1bc20_20130722_+/kernel/drivers/gpu/drm/i915/i915.ko i915_gem_reset+0x11a
i915_drv.c:0

Comment 8 Ben Widawsky 2013-07-26 21:43:35 UTC

Is bisection possible?

Comment 9 lu hua 2013-07-29 08:48:40 UTC

(In reply to comment #8)
> Is bisection possible?

I will try to bisect the hang.
good commit:10cd45b6e8ac1d1a99f6bdf0e0c80f2a1351f3f5.
bad commit: cce723ed091ac304d48386bcc3524994c345123e

Comment 10 lu hua 2013-07-30 06:33:54 UTC

Bisect shows:221ab43e8abe1e395d4bdd475ee3d4c2548f04ca is the first bad commit
commit 221ab43e8abe1e395d4bdd475ee3d4c2548f04ca
Author:     Paulo Zanoni <paulo.r.zanoni@intel.com>
AuthorDate: Fri Jul 12 19:52:36 2013 -0300
Commit:     Daniel Vetter <daniel.vetter@ffwll.ch>
CommitDate: Fri Jul 19 18:05:14 2013 +0200

    drm/i915: don't read or write GEN6_PMIIR on Gen 5

    The register doesn't exist on Gen 5.

    v2: Simplify checks since pm_iir is always 0 on Gen 5 (Chris)

    Signed-off-by: Paulo Zanoni <paulo.r.zanoni@intel.com>
    Reviewed-by: Mika Kuoppala <mika.kuoppala@intel.com>
    Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>

Comment 11 Paulo Zanoni 2013-07-30 14:47:08 UTC

(In reply to comment #10)
> Bisect shows:221ab43e8abe1e395d4bdd475ee3d4c2548f04ca is the first bad commit
> commit 221ab43e8abe1e395d4bdd475ee3d4c2548f04ca
> Author:     Paulo Zanoni <paulo.r.zanoni@intel.com>
> AuthorDate: Fri Jul 12 19:52:36 2013 -0300
> Commit:     Daniel Vetter <daniel.vetter@ffwll.ch>
> CommitDate: Fri Jul 19 18:05:14 2013 +0200
> 
>     drm/i915: don't read or write GEN6_PMIIR on Gen 5
> 
>     The register doesn't exist on Gen 5.
> 
>     v2: Simplify checks since pm_iir is always 0 on Gen 5 (Chris)
> 
>     Signed-off-by: Paulo Zanoni <paulo.r.zanoni@intel.com>
>     Reviewed-by: Mika Kuoppala <mika.kuoppala@intel.com>
>     Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>

This bisect result doesn't make much sense since it bisects to a patch that only touches ironlake_irq_handler, which was not used on Haswell at that point. There's no way a change to ironlake_irq_handler could break Haswell at that point.

Since the issue happens randomly, perhaps we could run the test more times at each bisect step so we make sure good commits are not false positives?

Comment 12 Daniel Vetter 2013-08-04 19:17:35 UTC

Lu Hua, can you please double-check the bisect result like Paulo suggested?

Comment 13 lu hua 2013-08-05 06:56:29 UTC

I bisect it again.
Bisect shows:The first bad commit could be any of:
2f63315692b1d3c055972ad33fc7168ae908b97b
52604b1ffabac61eb07cce711f18e18ac74fbeae
2293bb5c0383f522ac659946ccfadb0e6d2f03c5

Comment 14 Daniel Vetter 2013-08-05 07:29:27 UTC

(In reply to comment #13)
> I bisect it again.
> Bisect shows:The first bad commit could be any of:
> 2f63315692b1d3c055972ad33fc7168ae908b97b
> 52604b1ffabac61eb07cce711f18e18ac74fbeae
> 2293bb5c0383f522ac659946ccfadb0e6d2f03c5

Hm, can you please elaborate why you couldn't test these three commits further?

Comment 15 lu hua 2013-08-05 07:46:47 UTC

30bc9b53878a9921b02e3b5bc4283ac1c6de102a  good

52604b1ffabac61eb07cce711f18e18ac74fbeae
output:
error 2 opening '/sys/kernel/debug/dri/16/i915_next_seqno'. too old kernel?

2f63315692b1d3c055972ad33fc7168ae908b97b
output:
error 2 opening '/sys/kernel/debug/dri/16/i915_next_seqno'. too old kernel?

2293bb5c0383f522ac659946ccfadb0e6d2f03c5 bad

Comment 16 Daniel Vetter 2013-08-05 10:25:47 UTC

Hm, what's the testing result for f7f181843e6c24644b4b71b8631a5ea87de05158 ?

If they all fail with the same "error 2 opening '/sys/kernel/debug/dri/16/i915_next_seqno'. too old kernel?" then we first need to figure out where that part broke so that we can do the bisect on the last few commits.

Comment 17 lu hua 2013-08-06 02:39:14 UTC

(In reply to comment #16)
> Hm, what's the testing result for f7f181843e6c24644b4b71b8631a5ea87de05158 ?
> 
> If they all fail with the same "error 2 opening
> '/sys/kernel/debug/dri/16/i915_next_seqno'. too old kernel?" then we first
> need to figure out where that part broke so that we can do the bisect on the
> last few commits.

Commit f7f181843e6c24644b4b71b8631a5ea87de05158's result is good.

Comment 18 Daniel Vetter 2013-08-06 21:10:55 UTC

Assigning back to Ben, dunno why this was was changed ...

Comment 19 Daniel Vetter 2013-08-06 21:13:11 UTC

I guess the reason this only happens on Haswell is bug #65387

Comment 20 Paulo Zanoni 2013-08-07 18:02:24 UTC

(In reply to comment #15)
> 30bc9b53878a9921b02e3b5bc4283ac1c6de102a  good
> 
> 52604b1ffabac61eb07cce711f18e18ac74fbeae
> output:
> error 2 opening '/sys/kernel/debug/dri/16/i915_next_seqno'. too old kernel?
> 
> 2f63315692b1d3c055972ad33fc7168ae908b97b
> output:
> error 2 opening '/sys/kernel/debug/dri/16/i915_next_seqno'. too old kernel?
> 
> 2293bb5c0383f522ac659946ccfadb0e6d2f03c5 bad

For me, commit 2f63315692b1d3c055972ad33fc7168ae908b97b is bad.

Comment 21 Paulo Zanoni 2013-08-07 18:47:07 UTC

(In reply to comment #20)
> (In reply to comment #15)
> > 30bc9b53878a9921b02e3b5bc4283ac1c6de102a  good
> > 
> > 52604b1ffabac61eb07cce711f18e18ac74fbeae
> > output:
> > error 2 opening '/sys/kernel/debug/dri/16/i915_next_seqno'. too old kernel?
> > 
> > 2f63315692b1d3c055972ad33fc7168ae908b97b
> > output:
> > error 2 opening '/sys/kernel/debug/dri/16/i915_next_seqno'. too old kernel?
> > 
> > 2293bb5c0383f522ac659946ccfadb0e6d2f03c5 bad
> 
> For me, commit 2f63315692b1d3c055972ad33fc7168ae908b97b is bad.

I did some more investigation and it seems to me that commit 2f63315692b1d3c055972ad33fc7168ae908b97b is the first one that gives me the oops. But the test still fails before the commit.

Comment 22 Ben Widawsky 2013-08-12 23:56:21 UTC

Please try:
https://patchwork.kernel.org/patch/2843323/

Comment 23 lu hua 2013-08-13 02:48:39 UTC

(In reply to comment #22)
> Please try:
> https://patchwork.kernel.org/patch/2843323/

Fixed by this patch.

Comment 24 Paulo Zanoni 2013-08-14 13:49:48 UTC

(In reply to comment #23)
> (In reply to comment #22)
> > Please try:
> > https://patchwork.kernel.org/patch/2843323/
> 
> Fixed by this patch.

Patch merged, closing bug:

http://cgit.freedesktop.org/~danvet/drm-intel/commit/?h=drm-intel-next-queued&id=d6fc62c1699f4331c6b05a4b82a7796f8d281af3

Thanks everybody for the tests and the patch,
Paulo

Comment 25 lu hua 2013-08-19 01:34:18 UTC

Verified.Fixed.

Comment 26 Elizabeth 2017-10-06 14:44:55 UTC

Closing old verified.

Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.