Bug 101246

Summary: [BAT][BRW] NULL pointer dereference in snd_hda_codec_generic
Product: DRI Reporter: Martin Peres <martin.peres>
Component: DRM/IntelAssignee: Martin Peres <martin.peres>
Status: CLOSED NOTOURBUG QA Contact: Intel GFX Bugs mailing list <intel-gfx-bugs>
Severity: critical    
Priority: high CC: intel-gfx-bugs, jeeja.kp, libin.yang, ricardo.vega
Version: DRI git   
Hardware: Other   
OS: All   
Whiteboard: ReadyForDev ; Upstream bug=https://bugzilla.kernel.org/show_bug.cgi?id=196045
i915 platform: I965G i915 features: display/audio
Attachments:
Description Flags
alsa-info --stdout output none

Description Martin Peres 2017-05-31 10:24:29 UTC
Here is the full trace:

[  292.165356] BUG: unable to handle kernel NULL pointer dereference at 0000000000000a20
[  292.165393] IP: call_hp_automute+0xb/0x30 [snd_hda_codec_generic]
[  292.165407] PGD 0 
[  292.165409] P4D 0 

[  292.165426] Oops: 0000 [#1] PREEMPT SMP
[  292.165437] Modules linked in: snd_hda_intel i915 vgem coretemp tg3 ptp pps_core lpc_ich snd_hda_codec_analog snd_hda_codec_generic snd_hda_codec snd_hwdep snd_hda_core snd_pcm prime_numbers [last unloaded: i915]
[  292.165501] CPU: 0 PID: 148 Comm: kworker/0:3 Tainted: G     U          4.12.0-rc3-CI-CI_DRM_2673+ #1
[  292.165520] Hardware name: Dell Inc.                 OptiPlex 745                 /0GW726, BIOS 2.3.1  05/21/2007
[  292.165550] Workqueue: events process_unsol_events [snd_hda_core]
[  292.165566] task: ffff88003b29a600 task.stack: ffffc900001e0000
[  292.165582] RIP: 0010:call_hp_automute+0xb/0x30 [snd_hda_codec_generic]
[  292.165598] RSP: 0018:ffffc900001e3dc8 EFLAGS: 00010286
[  292.165613] RAX: 0000000000000000 RBX: ffff88002e3f19e8 RCX: 0000000000000005
[  292.165630] RDX: ffff8800314f2550 RSI: ffff88002e3f19e8 RDI: ffff8800316c33f8
[  292.165647] RBP: ffffc900001e3dc8 R08: ffff88003b29ae38 R09: 0000000000000000
[  292.165664] R10: ffffc900001e3dc8 R11: 0000000000000001 R12: ffff8800316c33f8
[  292.165684] R13: ffff8800314f2550 R14: ffff88002e2aa5a0 R15: ffff88002e2aa5a8
[  292.165704] FS:  0000000000000000(0000) GS:ffff88003f200000(0000) knlGS:0000000000000000
[  292.165727] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  292.165744] CR2: 0000000000000a20 CR3: 000000003a091000 CR4: 00000000000006f0
[  292.165763] Call Trace:
[  292.165786]  call_jack_callback.isra.2+0x20/0xa0 [snd_hda_codec]
[  292.165809]  snd_hda_jack_unsol_event+0x62/0x70 [snd_hda_codec]
[  292.165829]  hda_codec_unsol_event+0x12/0x20 [snd_hda_codec]
[  292.165848]  process_unsol_events+0x5e/0x70 [snd_hda_core]
[  292.165867]  process_one_work+0x1fe/0x670
[  292.165883]  worker_thread+0x49/0x3b0
[  292.165898]  kthread+0x10f/0x150
[  292.165910]  ? process_one_work+0x670/0x670
[  292.165924]  ? kthread_create_on_node+0x40/0x40
[  292.165940]  ret_from_fork+0x27/0x40
[  292.165955] Code: 48 8d 93 0a 05 00 00 be 04 00 00 00 eb c8 4c 89 e7 e8 3a ff ff ff 5b 41 5c 5d c3 0f 1f 44 00 00 48 8b 87 68 05 00 00 55 48 89 e5 <48> 8b 80 20 0a 00 00 48 85 c0 74 04 ff d0 5d c3 e8 70 ff ff ff 
[  292.166091] RIP: call_hp_automute+0xb/0x30 [snd_hda_codec_generic] RSP: ffffc900001e3dc8
[  292.166112] CR2: 0000000000000a20
[  292.166125] ---[ end trace 65ddd10cfd482602 ]---
[  292.166141] BUG: sleeping function called from invalid context at ./include/linux/percpu-rwsem.h:33
[  292.166165] in_atomic(): 0, irqs_disabled(): 1, pid: 148, name: kworker/0:3
[  292.166182] INFO: lockdep is turned off.
[  292.166194] irq event stamp: 3264388
[  292.166207] hardirqs last  enabled at (3264387): [<ffffffff8187bd73>] restore_regs_and_iret+0x0/0x1d
[  292.166232] hardirqs last disabled at (3264388): [<ffffffff8187d2af>] error_entry+0x6f/0xc0
[  292.166256] softirqs last  enabled at (3263952): [<ffffffff810839f9>] __do_softirq+0x1d9/0x4a0
[  292.166279] softirqs last disabled at (3263931): [<ffffffff81083e39>] irq_exit+0xa9/0xc0
[  292.166301] CPU: 0 PID: 148 Comm: kworker/0:3 Tainted: G     UD         4.12.0-rc3-CI-CI_DRM_2673+ #1
[  292.166325] Hardware name: Dell Inc.                 OptiPlex 745                 /0GW726, BIOS 2.3.1  05/21/2007
[  292.166356] Workqueue: events process_unsol_events [snd_hda_core]
[  292.166374] Call Trace:
[  292.166388]  dump_stack+0x67/0x97
[  292.166401]  ___might_sleep+0x162/0x250
[  292.166415]  __might_sleep+0x45/0x80
[  292.166429]  exit_signals+0x1f/0x2a0
[  292.166443]  do_exit+0x90/0xcc0
[  292.166456]  ? kthread+0x10f/0x150
[  292.166469]  ? process_one_work+0x670/0x670
[  292.166483]  rewind_stack_do_exit+0x17/0x20
[  292.166545] do_IRQ: 0.35 No irq handler for vector

And the full IGT logs: https://intel-gfx-ci.01.org/CI/CI_DRM_2673/fi-bwr-2160/igt@drv_module_reload@basic-reload-inject.html
Comment 1 Jani Nikula 2017-06-02 09:25:48 UTC
Cc Libin, Jeeja.

To whom should we assign audio bugs we find in our CI? Should we create an audio component in bugzilla?
Comment 2 Libin Yang 2017-06-05 08:09:14 UTC
i965G is not our team support platform. We didn't any development on it.

Jeeja, will your team support this platform.

Hi Martin,

I notice it is in the generic audio driver. It's interesting.

How can we reproduce this bug?
Could you send out the `alsa-info --stdout`?
Comment 3 Martin Peres 2017-06-05 13:54:56 UTC
Created attachment 131709 [details]
alsa-info --stdout output

Here is the output of alsa-info.

We managed to hit this issue twice in 5 days / 20 runs by running the IGT test igt@drv_module_reload@basic-reload-inject.

This is using Linux 4.12-rc3 as a base for our driver.
Comment 4 Libin Yang 2017-06-06 02:25:44 UTC
From the info, it seems the hd audio module is loaded, and nodes are created. But aplay -l can't find the devices. This is pretty strange. Please try to reload the hda modules, and try playback again. And send the dmesg.
Comment 5 Martin Peres 2017-06-08 08:35:07 UTC
(In reply to Libin Yang from comment #4)
> From the info, it seems the hd audio module is loaded, and nodes are
> created. But aplay -l can't find the devices. This is pretty strange. Please
> try to reload the hda modules, and try playback again. And send the dmesg.

Sorry, but this is quite irrelevant to this bug, isn't it? Misconfiguration should never cause a NULL pointer dereference, so can we concentrate on that?

Here is the function:
static void call_hp_automute(struct hda_codec *codec,
			     struct hda_jack_callback *jack)
{
	struct hda_gen_spec *spec = codec->spec;
	if (spec->hp_automute_hook)
		spec->hp_automute_hook(codec, jack);
	else
		snd_hda_gen_hp_automute(codec, jack);
}

Given that there is a NULL pointer dereference at the offset 0xa20, it means we are trying to read a value at the offset 0xa20 of a structure to which we have a NULL pointer. 

In this case, there are only two de-references, one on codec and one on spec.

struct hda_codec_driver is a pretty small structure while spec is pretty gigantic and the field hp_automute_hook is down at the bottom. To I would go for that theory first.

Now, in what cases could codec->spec be NULL? Is this something you make sure about or are there rare cases where this could happen?

I see the following places where this happens:
 - snd_hda_codec_cleanup_for_unbind
 - snd_hda_gen_free
 - Too many other places that do not directly assign NULL, but a parameter (unlikely to be NULL by mistake).

So, based on this, the fact that it happened twice in 26 runs and the fact that it happened during i915's reload test, it looks quite likely to be a use-after-free in snd_hda_codec_cleanup_for_unbind. Especially since the call came from a workitem and there is no such thing as reference counting in the codec structure, as far as I can tell.

Could this be fixed? I can hide the bug for our CI by adding a NULL pointer check, but please fix this for real by making sure that codec is valid until all users are done with their work.
Comment 6 Libin Yang 2017-06-08 12:33:39 UTC
My suggestion test is to try to find why it is using generic driver. However, if you can fix it in generic driver, please make a patch and send it to community.
Comment 7 Martin Peres 2017-06-08 13:47:44 UTC
(In reply to Libin Yang from comment #6)
> My suggestion test is to try to find why it is using generic driver.

Hmm, I get your point, but whatever the reason why, it should still not make any null pointer de-reference.

However, did you see that this is for the i965G platform, which is 10 years old? Is the non-generic driver supporting this platform?

> However, if you can fix it in generic driver, please make a patch and send
> it to community.

Well, I already wrote a patch that papers around the issue and will still report it in the logs[1], but this is not a proper patch.

I do not have any background in the sound subsystem, so I do not know what are the correct assumptions to have, especially around the binding and unbinding of the driver. I am merely doing some QA/CI job here, and reporting the issue to you, who own this code.

I am however willing to help, but not by configuring the system differently in order to get rid of the problem. This would be a workaround, not a fix.

[1] https://patchwork.freedesktop.org/patch/160525/
Comment 8 Libin Yang 2017-06-08 14:31:18 UTC
(In reply to Martin Peres from comment #7)
> (In reply to Libin Yang from comment #6)
> > My suggestion test is to try to find why it is using generic driver.
> 
> Hmm, I get your point, but whatever the reason why, it should still not make
> any null pointer de-reference.

Yes, there seems 2 issues: 1) why using generic; 2) the null pointer.

For the second issue, it seems you've already find a solution. I recommend you submit a patch.

> 
> However, did you see that this is for the i965G platform, which is 10 years
> old? Is the non-generic driver supporting this platform?

We did not do any test on i965G platform. It's not our scope.

> 
> > However, if you can fix it in generic driver, please make a patch and send
> > it to community.
> 
> Well, I already wrote a patch that papers around the issue and will still
> report it in the logs[1], but this is not a proper patch.
> 
> I do not have any background in the sound subsystem, so I do not know what
> are the correct assumptions to have, especially around the binding and
> unbinding of the driver. I am merely doing some QA/CI job here, and
> reporting the issue to you, who own this code.
> 
> I am however willing to help, but not by configuring the system differently
> in order to get rid of the problem. This would be a workaround, not a fix.
> 
> [1] https://patchwork.freedesktop.org/patch/160525/

Please feel free to make the patch and submit it. Takashi is very professional and nice. I did think we need check the null pointer. But maybe it is not as your patch. Maybe it is in another place.
Comment 9 Martin Peres 2017-06-09 13:08:34 UTC
(In reply to Libin Yang from comment #8)
> (In reply to Martin Peres from comment #7)
> > (In reply to Libin Yang from comment #6)
> > > My suggestion test is to try to find why it is using generic driver.
> > 
> > Hmm, I get your point, but whatever the reason why, it should still not make
> > any null pointer de-reference.
> 
> Yes, there seems 2 issues: 1) why using generic; 2) the null pointer.
> 
> For the second issue, it seems you've already find a solution. I recommend
> you submit a patch.

I will wait to actually catch the bug red-handed. I pushed the patch in our CI,
we should have a clear culprit in the coming days.

> 
> > 
> > However, did you see that this is for the i965G platform, which is 10 years
> > old? Is the non-generic driver supporting this platform?
> 
> We did not do any test on i965G platform. It's not our scope.

That's why automated CI is good ;)

> 
> > 
> > > However, if you can fix it in generic driver, please make a patch and send
> > > it to community.
> > 
> > Well, I already wrote a patch that papers around the issue and will still
> > report it in the logs[1], but this is not a proper patch.
> > 
> > I do not have any background in the sound subsystem, so I do not know what
> > are the correct assumptions to have, especially around the binding and
> > unbinding of the driver. I am merely doing some QA/CI job here, and
> > reporting the issue to you, who own this code.
> > 
> > I am however willing to help, but not by configuring the system differently
> > in order to get rid of the problem. This would be a workaround, not a fix.
> > 
> > [1] https://patchwork.freedesktop.org/patch/160525/
> 
> Please feel free to make the patch and submit it. Takashi is very
> professional and nice. I did think we need check the null pointer. But maybe
> it is not as your patch. Maybe it is in another place.

What mailing list should I start the conversation in? I really think that the
unbind in hda-generic.c is very sketchy, but I may be missing out on something.

Thanks for your feedback!
Comment 10 Libin Yang 2017-06-12 01:58:23 UTC
The community mail list is: alsa-devel@alsa-project.org. Takashi's email is: tiwai@suse.de.

There is a document 'submitting-patches' in kenrel documentation folder telling you how to submit patches to kernel. Please read it before submitting patch.

There are Intel internal mail lists: linux-audio-intel@eclists.intel.com and linux-drivers-review@eclists.intel.com. You can submit the patch to these mail lists for internal review before submitting patches to kernel.
Comment 11 Martin Peres 2017-06-12 14:58:45 UTC
(In reply to Libin Yang from comment #10)
> The community mail list is: alsa-devel@alsa-project.org. Takashi's email is:
> tiwai@suse.de.
> 
> There is a document 'submitting-patches' in kenrel documentation folder
> telling you how to submit patches to kernel. Please read it before
> submitting patch.
> 
> There are Intel internal mail lists: linux-audio-intel@eclists.intel.com and
> linux-drivers-review@eclists.intel.com. You can submit the patch to these
> mail lists for internal review before submitting patches to kernel.

Thanks, but I really don't think the patch is useful.

I instead filed a bug on the official bugtracker of Alsa, and added all the details I could. 

Here it is: https://bugzilla.kernel.org/show_bug.cgi?id=196045
Comment 12 Ricardo 2017-06-13 13:19:08 UTC
Martin will you leave this bug open until the problem from https://bugzilla.kernel.org/show_bug.cgi?id=196045 is solved? should this bug be closed?
Comment 13 Martin Peres 2017-06-13 14:09:07 UTC
(In reply to Ricardo from comment #12)
> Martin will you leave this bug open until the problem from
> https://bugzilla.kernel.org/show_bug.cgi?id=196045 is solved? should this
> bug be closed?

I will leave it open as we can't track bugs from bugzilla.kernel.org in CI-bug-log. Maybe I should change the config file there, so as we could.

Let me see.
Comment 14 Martin Peres 2017-06-13 14:28:19 UTC
(In reply to Martin Peres from comment #13)
> (In reply to Ricardo from comment #12)
> > Martin will you leave this bug open until the problem from
> > https://bugzilla.kernel.org/show_bug.cgi?id=196045 is solved? should this
> > bug be closed?
> 
> I will leave it open as we can't track bugs from bugzilla.kernel.org in
> CI-bug-log. Maybe I should change the config file there, so as we could.
> 
> Let me see.

OK, done. Let's close this bug.

Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.