Bug 101022

Summary:	[BAT][KBL] Warning at block/blk-mq.c:2667 blk_mq_update_nr_hw_queues+0x118/0x120 in CI
Product:	DRI	Reporter:	Martin Peres <martin.peres>
Component:	DRM/Intel	Assignee:	krisman
Status:	CLOSED NOTOURBUG	QA Contact:	Intel GFX Bugs mailing list <intel-gfx-bugs>
Severity:	critical
Priority:	highest	CC:	intel-gfx-bugs, krisman
Version:	DRI git
Hardware:	Other
OS:	All
Whiteboard:
i915 platform:	KBL	i915 features:

Description Martin Peres 2017-05-12 12:55:59 UTC

All the Kaby Lake machines in CI started hitting a warning with CI_DRM_2606 when running
igt@kms_pipe_crc_basic@suspend-read-crc-pipe-* and igt@gem_exec_suspend@basic-s3.

Here is the relevant dmesg message.

[  404.107454] ------------[ cut here ]------------
[  404.107459] WARNING: CPU: 0 PID: 8092 at block/blk-mq.c:2667 blk_mq_update_nr_hw_queues+0x118/0x120
[  404.107460] Modules linked in: vgem snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic i915 x86_pkg_temp_thermal intel_powerclamp coretemp crct10dif_pclmul crc32_pclmul e1000e snd_hda_intel ghash_clmulni_intel snd_hda_codec snd_hwdep ptp snd_hda_core pps_core snd_pcm mei_me mei prime_numbers pinctrl_sunrisepoint pinctrl_intel i2c_hid
[  404.107489] CPU: 0 PID: 8092 Comm: kworker/u8:13 Tainted: G     U  W       4.11.0-CI-CI_DRM_2606+ #1
[  404.107491] Hardware name: GIGABYTE GB-BKi7(H)A-7500/MFLP7AP-00, BIOS F4 02/20/2017
[  404.107493] Workqueue: nvme nvme_reset_work
[  404.107496] task: ffff8802542bcf40 task.stack: ffffc900004ac000
[  404.107498] RIP: 0010:blk_mq_update_nr_hw_queues+0x118/0x120
[  404.107499] RSP: 0018:ffffc900004afd48 EFLAGS: 00010246
[  404.107501] RAX: 0000000000000000 RBX: 0000000000000004 RCX: 0000000000000001
[  404.107503] RDX: 0000000000000000 RSI: 00000000ffffffff RDI: ffff8802619000b0
[  404.107504] RBP: ffffc900004afd68 R08: ffff8802542bd778 R09: 0000000000000000
[  404.107505] R10: 00000000ef6f2e9b R11: 0000000000000001 R12: ffff880261900368
[  404.107506] R13: ffff880261900010 R14: ffff8802619001f0 R15: 0000000000000000
[  404.107508] FS:  0000000000000000(0000) GS:ffff88026dc00000(0000) knlGS:0000000000000000
[  404.107509] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  404.107510] CR2: 00007f190801dde8 CR3: 000000025e4ca000 CR4: 00000000003406f0
[  404.107511] Call Trace:
[  404.107514]  nvme_reset_work+0x930/0xfb0
[  404.107521]  process_one_work+0x1fe/0x670
[  404.107525]  worker_thread+0x49/0x3b0
[  404.107528]  kthread+0x10f/0x150
[  404.107530]  ? process_one_work+0x670/0x670
[  404.107532]  ? kthread_create_on_node+0x40/0x40
[  404.107536]  ret_from_fork+0x2e/0x40
[  404.107540] Code: 48 8d 98 58 f6 ff ff 75 e5 5b 41 5c 41 5d 41 5e 5d c3 48 8d bf a0 00 00 00 be ff ff ff ff e8 20 47 ca ff 85 c0 0f 85 06 ff ff ff <0f> ff e9 ff fe ff ff 90 55 31 f6 48 c7 c7 60 b3 ea 81 48 89 e5 
[  404.107605] ---[ end trace e5e4f17f0ef2b96b ]---

More logs: https://intel-gfx-ci.01.org/CI/CI_DRM_2606/fi-kbl-7500u/igt@kms_pipe_crc_basic@suspend-read-crc-pipe-a.html

Comment 1 Martin Peres 2017-05-12 12:57:26 UTC

This is due to the pre-rc1 back merge and is of course not due to any of our code.

Comment 2 krisman 2017-05-12 13:27:55 UTC

Yes, this happens during the nvme rescan. I'll mark this as not our bug for now, and ping Keith Busch once I confirm this is not already fixed in linux-next.

Thanks.

Comment 3 Martin Peres 2017-05-17 13:06:32 UTC

(In reply to krisman from comment #2)
> Yes, this is during the nvme rescan.. I'll mark this as not our bug for now,
> and ping Keith Busch once I confirm this is not already fixed in linux-next.
> 
> Thanks.

Any news on this?

Comment 4 krisman 2017-05-30 17:09:21 UTC

(In reply to Martin Peres from comment #3)
> (In reply to krisman from comment #2)
> > Yes, this is during the nvme rescan.. I'll mark this as not our bug for now,
> > and ping Keith Busch once I confirm this is not already fixed in linux-next.
> > 
> > Thanks.
> 
> Any news on this?

Hmm, I forgot to get back to it once the merge window closed.

So, I started a discussion on linux-nvme reporting the problem and looking for suggestions.  I'm also looking for a box with an NVMe card so I can at least bisect the issue.

Comment 5 krisman 2017-05-30 18:04:13 UTC

(In reply to krisman from comment #4)
> (In reply to Martin Peres from comment #3)
> > (In reply to krisman from comment #2)
> > > Yes, this is during the nvme rescan.. I'll mark this as not our bug for now,
> > > and ping Keith Busch once I confirm this is not already fixed in linux-next.
> > > 
> > > Thanks.
> > 
> > Any news on this?
> 
> Hmm. forgot to get back to it once the merge window closed.
> 
> So, I started a discussion in linux-nvme reporting the problem and looking
> for suggestions.  I'm also looking for a box with NVMe card so I can at
> least bisect the issue.

Keith suggested the following patch:

 diff --git a/block/blk-mq.c b/block/blk-mq.c
 index f2224ffd..1bccced 100644
 --- a/block/blk-mq.c
 +++ b/block/blk-mq.c
 @@ -2641,7 +2641,8 @@ int blk_mq_update_nr_requests(struct request_queue *q, unsigned int nr)
         return ret;
  }

 -void blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set, int nr_hw_queues)
 +static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
 +                                                       int nr_hw_queues)
  {
         struct request_queue *q;

 @@ -2665,6 +2666,13 @@ void blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set, int nr_hw_queues)
         list_for_each_entry(q, &set->tag_list, tag_set_list)
                 blk_mq_unfreeze_queue(q);
  }
 +
 +void blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set, int nr_hw_queues)
 +{
 +       mutex_lock(&set->tag_list_lock);
 +       __blk_mq_update_nr_hw_queues(set, nr_hw_queues);
 +       mutex_unlock(&set->tag_list_lock);
 +}
  EXPORT_SYMBOL_GPL(blk_mq_update_nr_hw_queues);

  /* Enable polling stats and return whether they were already enabled. */
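
The thread does not spell out which check fires at block/blk-mq.c:2667. One plausible reading of the patch is that blk_mq_update_nr_hw_queues() expects set->tag_list_lock to be held, and the fix makes the exported function take that lock itself around an internal helper. A minimal userspace sketch of that wrapper pattern, assuming pthreads in place of kernel mutexes; struct tag_set, the lock_held flag, and all function names below are hypothetical stand-ins, not kernel APIs:

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct blk_mq_tag_set. */
struct tag_set {
    pthread_mutex_t tag_list_lock;
    bool lock_held;      /* crude substitute for the kernel's lock tracking */
    int nr_hw_queues;
};

static void tag_set_init(struct tag_set *set, int nr_hw_queues)
{
    pthread_mutex_init(&set->tag_list_lock, NULL);
    set->lock_held = false;
    set->nr_hw_queues = nr_hw_queues;
}

/*
 * Internal helper: the caller must already hold tag_list_lock.
 * Returns false when that precondition is violated; this is the point
 * where the kernel would print a WARNING instead (assumption).
 */
static bool update_nr_hw_queues_locked(struct tag_set *set, int nr_hw_queues)
{
    if (!set->lock_held)
        return false;
    set->nr_hw_queues = nr_hw_queues;
    return true;
}

/* Public entry point: takes the lock so callers cannot forget it. */
static bool update_nr_hw_queues(struct tag_set *set, int nr_hw_queues)
{
    bool ok;

    pthread_mutex_lock(&set->tag_list_lock);
    set->lock_held = true;
    ok = update_nr_hw_queues_locked(set, nr_hw_queues);
    set->lock_held = false;
    pthread_mutex_unlock(&set->tag_list_lock);
    return ok;
}
```

Calling update_nr_hw_queues_locked() directly without the lock fails the check and leaves the queue count unchanged, while update_nr_hw_queues() always satisfies it. That mirrors how the patch keeps the exported symbol's signature and moves the locking inside it.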


If I send it to intel-gfx, will it be grabbed by the CI and tested, or is there another process in place, since it is outside drivers/gpu/drm?

Comment 6 krisman 2017-05-30 19:54:36 UTC

(In reply to krisman from comment #5)
> (In reply to krisman from comment #4)
> > (In reply to Martin Peres from comment #3)
> > > (In reply to krisman from comment #2)
> > > > Yes, this is during the nvme rescan.. I'll mark this as not our bug for now,
> > > > and ping Keith Busch once I confirm this is not already fixed in linux-next.
> > > > 
> > > > Thanks.
> > > 
> > > Any news on this?
> > 
> > Hmm. forgot to get back to it once the merge window closed.
> > 
> > So, I started a discussion in linux-nvme reporting the problem and looking
> > for suggestions.  I'm also looking for a box with NVMe card so I can at
> > least bisect the issue.
> 
> Keith suggested the following patch 

After running it on the CI, I confirmed the patch fixes the issue.  It has been queued in Jens' tree for the next -rc:

https://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux-block.git/commit/?h=for-linus&id=e4dc2b32df5573b077f6723e01cf761d236d5113

The CI run:

https://intel-gfx-ci.01.org/CI/Trybot_861/fi-kbl-7500u/igt@kms_pipe_crc_basic@suspend-read-crc-pipe-a.html

Comment 7 Martin Peres 2017-05-30 19:58:53 UTC

(In reply to krisman from comment #6)
> (In reply to krisman from comment #5)
> > Keith suggested the following patch 
> 
> After running on the CI, I confirmed the patch fixes the issue.  It has been
> queued in Jen's tree for the next -rc:
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux-block.git/commit/
> ?h=for-linus&id=e4dc2b32df5573b077f6723e01cf761d236d5113
> 
> The CI run:
> 
> https://intel-gfx-ci.01.org/CI/Trybot_861/fi-kbl-7500u/
> igt@kms_pipe_crc_basic@suspend-read-crc-pipe-a.html

You rock! We can merge it into the for-CI branch, so that we don't have to wait for the backmerge to get it back :)

Comment 8 krisman 2017-05-30 20:17:42 UTC

(In reply to Martin Peres from comment #7)
> (In reply to krisman from comment #6)
> > (In reply to krisman from comment #5)
> > > Keith suggested the following patch 
> > 
> > After running on the CI, I confirmed the patch fixes the issue.  It has been
> > queued in Jen's tree for the next -rc:
> > 
> > https://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux-block.git/commit/
> > ?h=for-linus&id=e4dc2b32df5573b077f6723e01cf761d236d5113
> > 
> > The CI run:
> > 
> > https://intel-gfx-ci.01.org/CI/Trybot_861/fi-kbl-7500u/
> > igt@kms_pipe_crc_basic@suspend-read-crc-pipe-a.html
> 
> You rock! We can merge it into the for-CI branch, so as we don't have to
> wait for the backmerge to get it back :)

What is the process for that? Is there a different list for submitting it?

Comment 9 Martin Peres 2017-05-30 20:45:21 UTC

(In reply to krisman from comment #8)
> (In reply to Martin Peres from comment #7)
> > (In reply to krisman from comment #6)
> > > (In reply to krisman from comment #5)
> > > > Keith suggested the following patch 
> > > 
> > > After running on the CI, I confirmed the patch fixes the issue.  It has been
> > > queued in Jen's tree for the next -rc:
> > > 
> > > https://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux-block.git/commit/
> > > ?h=for-linus&id=e4dc2b32df5573b077f6723e01cf761d236d5113
> > > 
> > > The CI run:
> > > 
> > > https://intel-gfx-ci.01.org/CI/Trybot_861/fi-kbl-7500u/
> > > igt@kms_pipe_crc_basic@suspend-read-crc-pipe-a.html
> > 
> > You rock! We can merge it into the for-CI branch, so as we don't have to
> > wait for the backmerge to get it back :)
> 
> What is the process for that? Is there a different list for submitting it?

Danvet wanted Marta and me to have commit rights to manage this branch, but I have not accepted yet :p In the meantime, send it to intel-gfx and state that this patch is queued for inclusion and that we want it in the for-CI branch.

Comment 10 Jani Saarinen 2017-05-31 05:56:06 UTC

Patch passed try-bot: https://patchwork.freedesktop.org/series/25054/

Comment 11 Jani Saarinen 2017-05-31 06:23:33 UTC

And it passes the tests (no dmesg warning) that used to have issues.

Comment 12 Martin Peres 2017-06-05 13:58:00 UTC

(In reply to Jani Saarinen from comment #11)
> And passes those tests (no dmesg warning) that used to have issues.

The patch has been merged in 4.12-rc4. Wasn't Imre supposed to push the patch to core-for-CI?

Otherwise, we will need to wait for the backmerge.
