Bug 98690

Summary:

[nouveau bisected] System freeze when starting X using kernel 4.9-rc1 or later

Product:

xorg

Reporter:

Paul Donohue <freedesktop-bugs>

Component:

Driver/nouveau

Assignee:

Nouveau Project <nouveau>

Status:

RESOLVED MOVED

QA Contact:

Xorg Project Team <xorg-team>

Severity:

critical

Priority:

high

CC:

barry, bastian.beischer, fdsfgs, gleb, intel-gfx-bugs, peter.ujfalusi, UshakovVasilii, waltercool

Version:

unspecified

Hardware:

x86-64 (AMD64)

OS:

Linux (All)

Whiteboard:

i915 platform:

SKL

i915 features:

display/Other

Attachments:

Description	Flags
Xorg.0.log	none
dmesg with drm.debug=0x1e log_buf_len=1M	none
dmesg.log	none
patch to add short delay when handling the delayed event	none
dmesg log with "drm.debug=0x1e log_buf_len=1M"	none
HACK: do not enable polling in poll_init()	none
dmesg log	none
Debug patch1 for delayed work	none
Debug patch2 for delayed work	none
patch1 (debug only, won't boot)	none
patch2 (debug+delay, boots up fine)	none
Debug patch1.5 for delayed event	none
dmesg.log for 4.10-rc3 + patch1 + patch1.5	none
dmesg.log for 4.10-rc3 + patch1 + patch1.5 + patch 2	none
dmesg.log for 4.10-rc3 + patch1 + patch1.5 and "nouveau.runpm=0"	none
Patch: start the poll as the last thing on nouveau resume	none
4.10-rc3 + patch1 + patch1.5 + patch3 (w/o nouveau.runpm=0)	none
4.10-rc3 + patch1 + patch1.5 + patch3 (w/ nouveau.runpm=0)	none
4.10-rc3 + patch1 + patch1.5 + patch3 + patch2	none
patch2_1: debug prints (4.10-rc5)	none
patch2_2: hack to reorder the scheduled works (4.10-rc5)	none
patch2_3: same as 68f458eec706 on top of the debug (4.10-rc5)	none
4.10-rc5 + patch2_1	none
4.10-rc5 + patch2_1 + patch2_2	none
4.10-rc3 + patch2_1 + patch2_3	none
patch2_4: enable poll from hpd_work to avoid races (4.10-rc5)	none
4.10-rc5 + patch2_1 + nouveau.runpm=0	none
4.10-rc6 without 1s delay	none
patch2_4_fixed: enable poll from hpd_work to avoid races (4.10-rc5)	none
4.10-rc5 + patch2_1 + patch2_4_fixed	none
4.10-rc5 + patch2_1 + patch2_4_fixed + nouveau.runpm=0	none
dmesg OOB nouveau	none
poll helper fix on top of next-20170421	none
dmesg log for [PATCH] drm/nouveau: Fix drm poll_helper handling	none
dmesg log for [PATCH] drm/nouveau: Fix drm poll_helper handling + delay = 0	none

Description Paul Donohue 2016-11-11 22:10:05 UTC

Created attachment 127923 [details]
Xorg.0.log

Dell 7510 laptop with Intel Xeon E3-1545Mv5 (Iris Pro 580), running Ubuntu 16.04 (stock except for kernel and linux-firmware package - using linux-firmware 1.161 from Ubuntu 16.10).

The stock Ubuntu 16.04 kernel (4.8.0-22-generic) works fine.

Updating to kernel 4.9-rc4 or drm-next or drm-intel-next or drm-intel-nightly from http://kernel.ubuntu.com/~kernel-ppa/mainline/ reliably causes Xorg to hang at start-up, with the following message being printed to Xorg.0.log:
systemd-logind: failed to take device /dev/dri/card0: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.

If I boot to a console then run startx, Xorg prints a few standard messages, then the console freezes - the screen does not turn black or change video modes.  CTRL-C, CTRL-ALT-DELETE, and CTRL-ALT-F2 do not respond.  However, ACPI hot-keys do seem to respond - hitting the power button causes Ubuntu to attempt to shut down, and the console prints shutdown messages when this happens.  However, the shutdown process hangs again before it finishes shutting down - it typically gets stuck when shutting down either the network or rng-tools.

The attached dmesg was produced by running the following script ... the produced error.bz2 file was empty:
#!/bin/bash
startx &
sleep 10
dmesg > dmesg
cat /sys/class/drm/card0/error | bz2 > error.bz2

Comment 1 Paul Donohue 2016-11-11 22:10:49 UTC

Created attachment 127924 [details]
dmesg with drm.debug=0x1e log_buf_len=1M

Comment 2 Paul Donohue 2016-11-11 22:32:48 UTC

Commit 567aeca9fbb7f1f00fc6bbdd6861010ce7ddaf22 (4.8.7) works
Commit 1001354ca34179f3db924eb66672442a173147dc (4.9-rc1) does not work

Comment 3 Paul Donohue 2016-11-13 04:19:11 UTC

This bug seems unusually difficult to bisect.

The bug appears to be present in 4.8-rc1 through 4.8-rc3, and appears to have been fixed in 4.8-rc4.  4.8-rc4 through 4.8.7 all work fine.  I have not tested versions earlier than 4.8-rc1.

The 4.9 branch forked from the 4.8 branch after 4.8.0.  The 4.9 branch works fine through 56e520c7a0a490b63b042b047ec9659fc08762a4.  However, 6b25e21fa6f26d0f0d45f161d169029411c84286 and later are broken.

6b25e21fa6f26d0f0d45f161d169029411c84286 merges 69405d3da98b48633b78a49403e4f9cdb7c6a0f5 into the 4.9 branch.  69405d3da98b48633b78a49403e4f9cdb7c6a0f5 appears to have been developed in a branch that was forked at 4.8.0-rc1, so 69405d3da98b48633b78a49403e4f9cdb7c6a0f5 and all prior commits on that branch do not work (they all have the same bug because they are based on a version with the bug).

So, something in the 6b25e21fa6f26d0f0d45f161d169029411c84286 merge appears to have re-introduced the bug from 4.8-rc1 into the 4.9 branch.  But, of course, git bisect can't help me figure out what in the merge caused this.

My next step is to bisect 4.8-rc3 through 4.8-rc4 to see if I can figure out which commit fixed the issue.  Knowing what the fix is should make it easier to spot the problem with the merge.  I'm pretty sure the fix has to be one of the 9 commits between fa8410b355251fd30341662a40ac6b22d3e38468 and 177d91aaea4bcafb29232336bafaa521b85286aa.

Comment 4 Paul Donohue 2016-11-14 19:13:01 UTC

Installing the 4.8.0-rc3 (fa8410b355251fd30341662a40ac6b22d3e38468) amd64 -generic packages from http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.8-rc3/ does not work.

Compiling my own copy of 4.8.0-rc3 (fa8410b355251fd30341662a40ac6b22d3e38468) works fine.

Ugh...

Comment 5 Paul Donohue 2016-11-17 15:53:30 UTC

I have not yet been able to figure out why the packages on http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.8-rc3/ and my builds of those same commits do not behave the same.

However, I started bisecting again from scratch, being careful to avoid taking any shortcuts (such as assuming that I could skip a build because I had a pre-built copy for the commit), and also being careful to do extra testing at each step (it turns out there are other problems that occasionally produce similar-looking symptoms, so I had to do extra testing to ensure that I was consistently reproducing this specific issue).

I've now managed to successfully bisect the issue to commit 339fd36238dd3494bc4617d181e7a37922c29ee9.

Just to double check this, I also built v4.9-rc5, verified that it reliably hangs when starting X, then reverted 339fd36238dd3494bc4617d181e7a37922c29ee9 and re-built v4.9-rc5, and verified that it reliably works correctly.

Comment 6 Chris Wilson 2016-11-17 16:11:24 UTC

Bisect result:

commit 339fd36238dd3494bc4617d181e7a37922c29ee9
Author: Peter Ujfalusi <peter.ujfalusi@ti.com>
Date:   Wed Aug 31 14:09:05 2016 +0300

    drm: drm_probe_helper: Fix output_poll_work scheduling
    
    drm_kms_helper_poll_enable_locked() should check if we have delayed event
    pending and if we have, schedule the work to run without delay.
    
    Currently the output_poll_work is only scheduled if any of the connectors
    have DRM_CONNECTOR_POLL_CONNECT or DRM_CONNECTOR_POLL_DISCONNECT with
    DRM_OUTPUT_POLL_PERIOD delay. It does not matter if we have delayed event
    already registered to be handled. The detection will be delayd by
    DRM_OUTPUT_POLL_PERIOD in any case.
    Furthermore if none of the connectors are marked as POLL_CONNECT or
    POLL_DISCONNECT because all connectors are either POLL_HPD or they are
    always connected: the output_poll_work will not run at all even if we
    have delayed event marked.
    
    When none of the connectors require polling, their initial status change
    from unknown to connected/disconnected is not going to be handled until
    the first kms application starts or if we have fb console enabled.
    
    Note that in general the output poll work should be enabled already
    when this happens, but at driver load usually the first probe happens
    before the output polling is enabled. This patch fixes this case.

Comment 7 Jari Tahvanainen 2016-12-19 09:43:27 UTC

Highest+blocker as being regression w/o workaround.

Comment 8 Peter Ujfalusi 2016-12-20 14:45:01 UTC

I don't think that the root cause is 339fd36238dd3494bc4617d181e7a37922c29ee9. Yes, it triggers something for sure which looks to me a race or synchronization issue in the code paths which happened to be hidden by the 10sec delay.

W/o the patch we had basically two scenarios:
1. When drm_kms_helper_poll_init() was called before drm_helper_probe_single_connector_modes():

If there is a change in the connector state the output_poll_work would be scheduled with 0 delay.

2. When drm_helper_probe_single_connector_modes() was called before drm_kms_helper_poll_init()

If there is a change in the connector state, the code sets delayed_event and will _not_ schedule right away the output_poll_work from drm_helper_probe_single_connector_modes() as the work is not yet initialized.
When the drm_kms_helper_poll_init() is called, the output_poll_work will be scheduled to run with 10sec delay.

In case 1 we run the work w/o delay while in case 2 we run it with 10sec delay.

The patch in question will run the work in both cases w/o delay when it is possible to run it. iow when we have the work initialized and we have a connector change noticed, we schedule the work to be executed w/o delay. This connector change is basically the change from unknown to some known state during the stack bring-up, if I'm not mistaken.

Probably shoprted delay in drm_kms_helper_poll_enable_locked() for the work might hide the issue. It might surface later and the delay needs to be increased.

If the decision is to revert the patch, I'm fine with it, but then no one will ever going to take time and effort to figure out the why.

Comment 9 Gleb Nemshilov 2016-12-21 14:59:09 UTC

Hi everyone!

I came to the same conclusion (bisected) and have the same issue. The only difference is that I have optimus laptop and the Xorg hangs on card1 and if nouveau blacklisted, kernel can be booted.

Is there anything specific I can do to help?

https://bugs.freedesktop.org/show_bug.cgi?id=99147

Comment 10 Gleb Nemshilov 2016-12-21 15:12:48 UTC

Created attachment 128609 [details]
dmesg.log

Dmesg log attached.

I also want to add that since some of the revisions before the one that is before the bad one, nvidia card is always in DynPwr state and wont' go to DynOff. It was fine on 4.8.14/15 which I used since the release.

Comment 11 Peter Ujfalusi 2016-12-21 17:31:55 UTC

(In reply to Gleb from comment #9)
> Hi everyone!
> 
> I came to the same conclusion (bisected) and have the same issue. The only
> difference is that I have optimus laptop and the Xorg hangs on card1 and if
> nouveau blacklisted, kernel can be booted.

The original report also have optimus (Dell Precision 7510). Could this be an issue triggered in the nouveau driver? I'll try to find some time to update my home desktop's kernel to 4.9 and see if it boots or not.

Comment 12 Peter Ujfalusi 2016-12-22 08:10:02 UTC

(In reply to Peter Ujfalusi from comment #11)
> I'll try to find some time to
> update my home desktop's kernel to 4.9 and see if it boots or not.

My desktop works fine with 4.9 (which includes 339fd36238dd3494bc4617d181e7a37922c29ee9) with nouveau. The issue might be only visible on machines with optimus?

Comment 13 Paul Donohue 2017-01-02 22:55:41 UTC

Just to confirm and expand on statements above:

My Dell 7510 laptop does have optimus.

The kernel loads both the i915 and nouveau kernel modules automatically.  Blacklisting the nouveau kernel module, booting with 339fd362 applied, then starting X works fine.  Booting with 339fd362 reverted and starting X with both the i915 and nouveau kernel modules loaded works fine.  Booting with 339fd362 applied and starting X with both the i915 and nouveau kernel modules loaded hangs.

Explicitly configuring xorg.conf to load only the "intel" xorg driver and not attempt to load the "nouveou" xorg driver does not help.

Same behavior can be replicated in kernel 4.10-rc2.

Comment 14 Peter Ujfalusi 2017-01-03 07:36:23 UTC

Created attachment 128720 [details] [review]
patch to add short delay when handling the delayed event

Comment 15 Peter Ujfalusi 2017-01-03 07:43:08 UTC

Paul,

(In reply to Paul Donohue from comment #13)
> Just to confirm and expand on statements above:
> 
> My Dell 7510 laptop does have optimus.
> 
> The kernel loads both the i915 and nouveau kernel modules automatically. 
> Blacklisting the nouveau kernel module, booting with 339fd362 applied, then
> starting X works fine.  Booting with 339fd362 reverted and starting X with
> both the i915 and nouveau kernel modules loaded works fine.  Booting with
> 339fd362 applied and starting X with both the i915 and nouveau kernel
> modules loaded hangs.
> 
> Explicitly configuring xorg.conf to load only the "intel" xorg driver and
> not attempt to load the "nouveou" xorg driver does not help.
> 
> Same behavior can be replicated in kernel 4.10-rc2.

Can you please give a try to the attached patch? It will delay the handling of initial delayed events by 1s instead immediate handling (with 339fd362) or with the 10s delay (w/o 339fd362).

In nouveau_display.c:nouveau_display_create() we have:
...
	drm_kms_helper_poll_init(dev);
	drm_kms_helper_poll_disable(dev);
...

so it is possible that nouveau does not want to have the polling enabled just yet, but it need to call poll_init() to have the poll infra ready. So it calls poll_init() followed by the poll_disable(). But with 339fd362 the work to handle the poll might be scheduled before it got chance to disable it.

Without Optimus based laptop it is hard to say what is going on.

Comment 16 Gleb Nemshilov 2017-01-03 14:00:43 UTC

Created attachment 128725 [details]
dmesg log with "drm.debug=0x1e log_buf_len=1M"

(In reply to Peter Ujfalusi from comment #14)
> Created attachment 128720 [details] [review] [review]
> patch to add short delay when handling the delayed event

Applied the patch to 4.10.0-rc2 and system boots without any issues. dmesg log with "drm.debug=0x1e log_buf_len=1M" attached.

Comment 17 Paul Donohue 2017-01-03 14:54:54 UTC

Same for me.  Applied to 4.10-rc2, booted, started X with both the i915 and nouveau kernel modules loaded, and it started with no issues.

Comment 18 Peter Ujfalusi 2017-01-03 16:50:32 UTC

Paul, Gleb,

thanks for testing the patch. I still consider this as a workaround for a driver bug in nuoveau/optimus. IMHO a better fix would be to change the way how drm_kms_helper_poll_init() works: when called it would not enable the polling as it does, but drivers would need to call poll_enable() if they want the polling to be active.

Tomorrow I can create a hack patch for this on top of 4.10-rc2 if you are willing to give it a try.

It is still awkward to wait with the work for 1 sec, it is better than the 10 sec, but still it does not feel right.

In any way, I will prepare this patch as well for sending as a quick workaround.

Comment 19 Paul Donohue 2017-01-03 18:36:24 UTC

I'm happy to help test changes to drm_kms_helper_poll_init().  Let me know if there is anything else I can do to help.  :)

Comment 20 Peter Ujfalusi 2017-01-04 09:10:38 UTC

Created attachment 128742 [details] [review]
HACK: do not enable polling in poll_init()

Paul,

can you try this hack patch to see if it behaving correctly? Please revert the previous patch which added 1s delay.

The only driver needing this functionality - to have the poll disabled after init - is nouveau, all other drivers are happy (so far) with the no delay handling of the delayed events. The events are delayed because when they happened the poll_init was not called so the poll_work can not be executed.

Thanks,
Péter

Comment 21 Gleb Nemshilov 2017-01-04 09:47:23 UTC

Created attachment 128744 [details]
dmesg log

Unofrtunately with the new patch kernel won't boot, the problem is the same. dmesg log attached.

Comment 22 Peter Ujfalusi 2017-01-04 11:37:47 UTC

Gleb,

(In reply to Gleb Nemshilov from comment #21)
> Created attachment 128744 [details]
> dmesg log
> 
> Unofrtunately with the new patch kernel won't boot, the problem is the same.
> dmesg log attached.

I hoped for a different result ;)
The race looks to be in a different path and I can not find it by reading the code. My laptop (intel) and Desktop (nouveau) boots up fine, so all is pointing to Optimus related code.

I will send the first patch (1sec delay for delayed events) right away.

Thanks for taking time to test the patches!

Péter

Comment 23 Paul Donohue 2017-01-04 21:17:17 UTC

Is there anything simple we can do to find the relevant code path?  Maybe using SystemTap?

Comment 24 Peter Ujfalusi 2017-01-04 21:42:06 UTC

(In reply to Paul Donohue from comment #23)
> Is there anything simple we can do to find the relevant code path?  Maybe
> using SystemTap?

The only scenario which looks broken is when drm_kms_helper_poll_init() is called after drm_helper_probe_single_connector_modes(). I don't see any other path.

So the interesting thing is what happens during the 1s after we scheduled the work in drm_kms_helper_poll_enable_locked() because the dev->mode_config.delayed_event was true. Between we schedule the work and the work is actually run.
Then compare the events when we don't have the 1s delay when scheduling the work to handle the delayed event.

The fact the nouveau tries to disable the poll right after it initialized it means that it does not want it to execute just now. This is why I was surprised that the second patch did not worked as it did prevented the work to be scheduled when the poll_init() was called. Later on a poll enable will pick up the delayed event and it will schedule the work right away, but this looks to be not quite right for Optimus.

I'm a bit puzzled.

Péter

Comment 25 Peter Ujfalusi 2017-01-04 21:43:12 UTC

FYI: the patch adding the 1s delay is on dri-devel:
https://lists.freedesktop.org/archives/dri-devel/2017-January/128822.html

Comment 26 Gleb Nemshilov 2017-01-05 17:00:12 UTC

Could it be the runpm related issue? Because with "nouveau.runpm=0" kernel boots fine (as stated in another bug report I reported here with the same issue). I can see with dmesg the following strings:

===kernel booted===
===these lines below means runpm turned off discrete adapter===
[62233.697650] nouveau 0000:02:00.0: DRM: suspending console...
[62233.697656] nouveau 0000:02:00.0: DRM: suspending display...
[62233.697673] nouveau 0000:02:00.0: DRM: evicting buffers...
[62233.697676] nouveau 0000:02:00.0: DRM: waiting for kernel channels to go idle...
[62233.697711] nouveau 0000:02:00.0: DRM: suspending client object trees...
[62233.698025] nouveau 0000:02:00.0: DRM: suspending kernel object tree...

===these lines appear when the discrete GPU is tunred on===
[63751.307103] nouveau 0000:02:00.0: DRM: resuming kernel object tree...
[63751.539847] nouveau 0000:02:00.0: DRM: resuming client object trees...
[63751.539975] nouveau 0000:02:00.0: DRM: resuming display...
[63751.539999] nouveau 0000:02:00.0: DRM: resuming console...


It's also strange that those lines appear long after the system started, because the GPU should be turned off all the time.

As I understand, "suspending" stuff means that discrete GPU is truned off, and "resuming" means it turns on. So when the affected kernel boots, the last line is "DRM: resuming console..." and it's stuck, X server won't load. But with not affected kernel the last lines are "DRM: suspending client object trees..." and "DRM: suspending kernel object tree...".

Comment 27 Peter Ujfalusi 2017-01-10 14:11:19 UTC

Created attachment 128857 [details] [review]
Debug patch1 for delayed work

Hi,

this patch is on top of 4.10-rc3.
It will only print debug message when we schedule the work and when the work is executed in regards to delayed_events.

Comment 28 Peter Ujfalusi 2017-01-10 14:14:22 UTC

Created attachment 128858 [details] [review]
Debug patch2 for delayed work

Hi,

this patch is on top of 4.10-rc3 + debug patch1.
On top of the debug prints we will delay the work execution by 1sec which fixes optimus.

I hope that from the logs (patch1 vs patch1+patch2) we can see something which can help to narrow down the issue.

Comment 29 Gleb Nemshilov 2017-01-10 22:16:11 UTC

Created attachment 128877 [details]
patch1 (debug only, won't boot)

4.10-rc3 + patch1 (debug only) -- won't boot

Comment 30 Gleb Nemshilov 2017-01-10 22:17:12 UTC

Created attachment 128878 [details]
patch2 (debug+delay, boots up fine)

4.10-rc3 + patch1 + patch2 (debug+delay) -- boots up fine

Comment 31 Gleb Nemshilov 2017-01-13 02:18:14 UTC

It seems that 4.8 is EOL now and 4.9 will be new LTS kernel. Can we expect that the fix (anything that will solve the issue) will land on 4.9.x? At least it could be reverted because for now kernel is unusable and reverting to old 4.4 is not a good idea.

If there's something wrong with the logs, I can try to do some more stuff.

Comment 32 Peter Ujfalusi 2017-01-13 12:36:11 UTC

(In reply to Gleb Nemshilov from comment #31)
> It seems that 4.8 is EOL now and 4.9 will be new LTS kernel. Can we expect
> that the fix (anything that will solve the issue) will land on 4.9.x? At
> least it could be reverted because for now kernel is unusable and reverting
> to old 4.4 is not a good idea.

The patch should be making it's way to 4.9 stable:
https://lkml.org/lkml/2017/1/9/372

> If there's something wrong with the logs, I can try to do some more stuff.

I'm still wondering what debug would be useful, but it is somewhere nouveau_drm.c:nouveau_do_resume().

But out of curiosity if you could test the patch1.5 applied on top of patch1 and then all three, that would be great. The initial delayed work is disabled several times according to the code, it might shed some lite to the events - but certainly not going to get us closer to the race itself.

Since plain nouveau works w/o the delay it must be the Optimus/switcheroo calls that fails. It might be that we have some work (delayed work?), tasklet run to get the Optimus working and we have execution ordering issue? With delay the theoretical work was running before the poll_work, which w/o the delay the poll_work is executed first followed by the other work?

This is my best bet atm, but I need to study the code a bit more. Debugging DRM just by reading the code is pita ;)

I'll attach the patch1.5

Comment 33 Peter Ujfalusi 2017-01-13 12:38:56 UTC

Created attachment 128929 [details] [review]
Debug patch1.5 for delayed event

It will not solve any issue, but I'd like to see when the poll work is disabled.
The interesting logs are:
patch1+patch1.5
patch1+patch1.5+patch2

Thank you,
Péter

Comment 34 Gleb Nemshilov 2017-01-13 22:59:33 UTC

Created attachment 128942 [details]
dmesg.log for 4.10-rc3 + patch1 + patch1.5

4.10-rc3 + patch1 + patch1.5

Comment 35 Gleb Nemshilov 2017-01-13 23:00:18 UTC

Created attachment 128943 [details]
dmesg.log for 4.10-rc3 + patch1 + patch1.5 + patch 2

 dmesg.log for 4.10-rc3 + patch1 + patch1.5 + patch 2

Comment 36 Gleb Nemshilov 2017-01-13 23:01:54 UTC

Created attachment 128944 [details]
dmesg.log for 4.10-rc3 + patch1 + patch1.5 and "nouveau.runpm=0"

I also did the following:
dmesg.log for 4.10-rc3 + patch1 + patch1.5 and "nouveau.runpm=0" as kernel parameter

So it's the same kernel as with 4.10-rc3 + patch1 + patch1.5

Comment 37 Peter Ujfalusi 2017-01-14 19:58:15 UTC

Gleb,

(In reply to Gleb Nemshilov from comment #36)
> Created attachment 128944 [details]
> dmesg.log for 4.10-rc3 + patch1 + patch1.5 and "nouveau.runpm=0"
> 
> I also did the following:
> dmesg.log for 4.10-rc3 + patch1 + patch1.5 and "nouveau.runpm=0" as kernel
> parameter
> 
> So it's the same kernel as with 4.10-rc3 + patch1 + patch1.5

Thank you for the logs! The race looks to be related to PM indeed. There is one path I can see where work scheduling order can be different when the PM is enabled.

I'll attach a patch to see if that is the case.

Regards,
Péter

Comment 38 Peter Ujfalusi 2017-01-14 20:03:11 UTC

Created attachment 128952 [details] [review]
Patch: start the poll as the last thing on nouveau resume

Gleb,

can you test this patch on top of 4.10-rc3 + patch1+patch1.5 (w/o and w/o "nouveau.runpm=0") and on top of 4.10-rc3 + patch1+patch1.5+patch2?

In theory this should fix the race. It is not a correct fix, but the way how the kernel behaves will help to narrow down things and come up with a proper solution. I hope.

Thanks,
Péter

Comment 39 Gleb Nemshilov 2017-01-15 01:15:53 UTC

Created attachment 128962 [details]
4.10-rc3 + patch1 + patch1.5 + patch3 (w/o nouveau.runpm=0)

New logs for the following:

1) 4.10-rc3 + patch1 + patch1.5 + patch3 (w/o nouveau.runpm=0), doesn't boot
2) 4.10-rc3 + patch1 + patch1.5 + patch3 (w/ nouveau.runpm=0), does boot
3) 4.10-rc3 + patch1 + patch1.5 + patch3 + patch2, does boot

Comment 40 Gleb Nemshilov 2017-01-15 01:16:30 UTC

Created attachment 128963 [details]
4.10-rc3 + patch1 + patch1.5 + patch3 (w/ nouveau.runpm=0)

Comment 41 Gleb Nemshilov 2017-01-15 01:17:01 UTC

Created attachment 128964 [details]
4.10-rc3 + patch1 + patch1.5 + patch3 + patch2

Comment 42 Gleb Nemshilov 2017-01-22 17:15:06 UTC

Any news on this? Anything I can help with? More compiling and testing? :-)

I also want to ask what is the typical delay/time until the patch gets into the kernel? (if this patch with 1s delay will ever get as it's mostly hack and workaround rather than a normal solution).

Comment 43 brae.04+freedesktop 2017-01-28 19:45:15 UTC

Here to help as well.

Comment 44 Peter Wu 2017-01-30 14:17:35 UTC

*** Bug 99147 has been marked as a duplicate of this bug. ***

Comment 45 Peter Ujfalusi 2017-01-31 09:58:43 UTC

FWIW the workaround for nouveau/optimus is now in 4.10-rc6 (68f458eec706) I assume it will be popping up in 4.9 stable soon.

I have tried nouveau.runpm=1 on my desktop, but it did not triggered the problem there.

Can you try a new set of patches on top of 4.10-rc5?

patch2_1: debug prints
patch2_2: hack to reorder the scheduled works
patch2_3: same as 68f458eec706 on top of the debug

The interesting logs will be:
patch2_1
patch2_1 + patch2_2
patch2_1 + patch2_3

Comment 46 Peter Ujfalusi 2017-01-31 09:59:38 UTC

Created attachment 129249 [details] [review]
patch2_1: debug prints (4.10-rc5)

Comment 47 Peter Ujfalusi 2017-01-31 10:00:00 UTC

Created attachment 129250 [details] [review]
patch2_2: hack to reorder the scheduled works (4.10-rc5)

Comment 48 Peter Ujfalusi 2017-01-31 10:00:21 UTC

Created attachment 129251 [details] [review]
patch2_3: same as 68f458eec706 on top of the debug (4.10-rc5)

Comment 49 Jari Tahvanainen 2017-02-02 12:48:50 UTC

Reducing priority for Regression with workaround, and since System Freeze failure is now fixed.

Comment 50 Gleb Nemshilov 2017-02-02 14:23:35 UTC

Created attachment 129294 [details]
4.10-rc5 + patch2_1

4.10-rc5 + patch2_1 -- didn't boot

Comment 51 Gleb Nemshilov 2017-02-02 14:24:21 UTC

Created attachment 129295 [details]
4.10-rc5 + patch2_1 + patch2_2

4.10-rc5 + patch2_1 + patch2_2 -- didn't boot

Comment 52 Gleb Nemshilov 2017-02-02 14:25:05 UTC

Created attachment 129296 [details]
4.10-rc3 + patch2_1 + patch2_3

4.10-rc3 + patch2_1 + patch2_3 -- boots fine, as expected

Comment 53 Peter Ujfalusi 2017-02-03 23:13:58 UTC

Created attachment 129326 [details] [review]
patch2_4: enable poll from hpd_work to avoid races (4.10-rc5)

Gleb,

thanks for testing the patches.
Can you do another two runs on top of 4.10-rc5:
patch2_1 + "nouveau.runpm=0"
and
patch2_1 + patch2_4 (which I'm attaching)

Based on the logs I still think the main issue is the hpd_work and the poll_work execution ordering, but there might be also locking issues to prevent races.

Thank you for your time,
Péter

Comment 54 Gleb Nemshilov 2017-02-04 07:52:55 UTC

Created attachment 129331 [details]
4.10-rc5 + patch2_1 + nouveau.runpm=0

4.10-rc5 + patch2_1 + nouveau.runpm=0 attached.

Unfortunately 4.10-rc5 with patch2_4 failed to build with the following errors:

  CC [M]  drivers/gpu/drm/nouveau/nouveau_dp.o
drivers/gpu/drm/nouveau/nouveau_display.c: In function ‘nouveau_display_hpd_work’:
drivers/gpu/drm/nouveau/nouveau_display.c:367:29: error: passing argument 1 of ‘drm_kms_helper_poll_enable’ from incompatible pointer type [-Werror=incompatible-pointer-types]
  drm_kms_helper_poll_enable(drm->dev->dev);
                             ^
In file included from drivers/gpu/drm/nouveau/nouveau_display.c:31:0:
./include/drm/drm_crtc_helper.h:75:13: note: expected ‘struct drm_device *’ but argument is of type ‘struct device *’
 extern void drm_kms_helper_poll_enable(struct drm_device *dev);
             ^
  CC [M]  drivers/hwmon/acpi_power_meter.o
cc1: some warnings being treated as errors
make[4]: *** [scripts/Makefile.build:293: drivers/gpu/drm/nouveau/nouveau_display.o] Error 1
make[4]: *** Waiting for unfinished jobs....

Comment 55 Gleb Nemshilov 2017-02-04 08:10:49 UTC

Created attachment 129332 [details]
4.10-rc6 without 1s delay

Hi Peter!

I've seen some changes were introduced in 4.10-rc6 with something related to polling in nouveau by Lyude Paul, specifically this patch: https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=cae9ff036eea577856d5b12860b4c79c5e71db4a

So I decided to try 4.10-rc6 without your patch with delayed event (so I reverted it) and kernel booted.

Can you confirm this?

dmesg log for 4.10-rc6 attached.

Comment 56 Gleb Nemshilov 2017-02-04 08:24:56 UTC

Here's another patch that reverted some changes also related to polling:
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=54a07c7bb0da0343734c78212bbe9f3735394962

The only disadvantage of my previous dmesg log for 4.10-rc6 is that it doesn't print about any events. debug prints (latest patch2_1) cannot be applied to this kernel version. So if it's necessary, I can provide some other logs with appropriate patches.

Comment 57 Peter Ujfalusi 2017-02-06 13:00:40 UTC

Gleb,

(In reply to Gleb Nemshilov from comment #55)
> Created attachment 129332 [details]
> 4.10-rc6 without 1s delay
> 
> Hi Peter!
> 
> I've seen some changes were introduced in 4.10-rc6 with something related to
> polling in nouveau by Lyude Paul, specifically this patch:
> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/
> ?id=cae9ff036eea577856d5b12860b4c79c5e71db4a
> 
> So I decided to try 4.10-rc6 without your patch with delayed event (so I
> reverted it) and kernel booted.
> 
> Can you confirm this?

I think the patch you have linked is essentially going to disable the poll_work as such (imho the commit is bogus).
The mode_config.poll_enabled is set to true in the poll_init and it will never going to be false after that. The poll_enable will not schedule the poll_work if poll_enable is not true, the patch will not call the enable_poll in case mode_config.poll_enabled is true. So the poll_work will never run.

> 
> dmesg log for 4.10-rc6 attached.

Comment 58 Gleb Nemshilov 2017-02-06 13:05:04 UTC

Peter,

so is it still necessary to solve the problem? For me it's hard to understand everything about polling and etc.

The last patch you provided failed to build (2.4), so I'm not sure what to do next. If there's something I can help with, please let me know.

Comment 59 Peter Ujfalusi 2017-02-06 13:27:41 UTC

Created attachment 129364 [details] [review]
patch2_4_fixed: enable poll from hpd_work to avoid races (4.10-rc5)

Gleb,

I should have at least compiled the patches before attaching them. Sorry to waste your time.

Can you try the following:
v4.10-rc5 + patch2_1 + patch2_4_fixed

_if_ boots up, then can you run the same kernel with nouveau.runpm=0?

What this patch does is that it will call the poll_enable from the hpd_work worker function only.
But, if you boot with runpm=0 then the pm_runtime callbacks will never going to be called, so the poll_work is not going to run at all. This is why I try to start that manually when runpm is 0.

The hpd_work is inside of nouveau and it's function is to detect changes in connected displays the driver might missed due to pm_runtime. It is only scheduled when there is a need to check for changes, like power on.

The poll_work on the other hand is in the DRM core and it is a work running every 10 seconds when it is enabled and it is also checking changes in connector states, but only the connectors which does not support change notifications and are not static.

Comment 60 Peter Ujfalusi 2017-02-06 13:31:09 UTC

Gleb,

(In reply to Gleb Nemshilov from comment #58)
> Peter,
> 
> so is it still necessary to solve the problem? For me it's hard to
> understand everything about polling and etc.

I think sooner or later someone will finds out that the poll_work is not running because of the patch you have mentioned and when it will get fixed we (to be precise you) will have issues booting again.
I believe it is better to try to figure out the proper fix to avoid situations like I have caused while fixing seemingly unrelated code.

> The last patch you provided failed to build (2.4), so I'm not sure what to
> do next. If there's something I can help with, please let me know.

I have attached an updated patch. Sorry.

Kind Regards,
Péter

Comment 61 Gleb Nemshilov 2017-02-16 17:35:05 UTC

Created attachment 129681 [details]
4.10-rc5 + patch2_1 + patch2_4_fixed

Hi Peter!

Sorry for the delay, was busy last week.

So, I finally compiled:

"4.10-rc5 + patch2_1 + patch2_4_fixed"

and it also boots fine! I hope I didn't broke anything. I verified that only 2_1 and 2_4_fixed are applied. Let me know if there's anything suspicious in the logs, I can probably try it again if necessary.

Log with nouveau.runpm=0 is in the next attachement.

Comment 62 Gleb Nemshilov 2017-02-16 17:36:21 UTC

Created attachment 129682 [details]
4.10-rc5 + patch2_1 + patch2_4_fixed + nouveau.runpm=0

As requested, here's another log with the same kernel and only "nouveau.runpm=0" added to the kernel boot options.

Comment 63 bastian.beischer 2017-02-27 10:51:40 UTC

Hey,

not sure whether this is the same bug but I can't start X with nouveau from Linux 4.10.1, unless I pass nouveau.runpm=0.

The symptoms are the same as in this bug report, but I don't use Optimus for this PC. I didn't have any problems with Linux 4.9.11.

Here's a relevant piece from dmesg:

[  245.346354] INFO: task kworker/2:2:258 blocked for more than 120 seconds.
[  245.346364]       Tainted: G          I     4.10.1-1-ARCH #1
[  245.346370] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  245.346376] kworker/2:2     D    0   258      2 0x00000000
[  245.346382] Workqueue: pm pm_runtime_work
[  245.346383] Call Trace:
[  245.346388]  __schedule+0x22f/0x700
[  245.346389]  schedule+0x3d/0x90
[  245.346391]  rpm_resume+0x118/0x6f0
[  245.346393]  ? wake_atomic_t_function+0x60/0x60
[  245.346395]  ? pci_pm_runtime_resume+0xa0/0xa0
[  245.346397]  pm_runtime_forbid+0x4c/0x60
[  245.346420]  nouveau_pmops_runtime_suspend+0xcc/0xd0 [nouveau]
[  245.346421]  pci_pm_runtime_suspend+0x55/0x180
[  245.346422]  ? pci_pm_runtime_resume+0xa0/0xa0
[  245.346423]  __rpm_callback+0xb9/0x1f0
[  245.346424]  ? put_prev_entity+0x33/0xa10
[  245.346425]  rpm_callback+0x24/0x80
[  245.346426]  ? pci_pm_runtime_resume+0xa0/0xa0
[  245.346427]  rpm_suspend+0x132/0x690
[  245.346428]  ? finish_task_switch+0x78/0x200
[  245.346428]  pm_runtime_work+0x6f/0xb0
[  245.346430]  process_one_work+0x1e5/0x470
[  245.346431]  worker_thread+0x48/0x4e0
[  245.346432]  kthread+0x101/0x140
[  245.346433]  ? process_one_work+0x470/0x470
[  245.346434]  ? kthread_create_on_node+0x60/0x60
[  245.346435]  ret_from_fork+0x2c/0x40

Comment 64 Roman Stingler 2017-02-27 13:36:41 UTC

Created attachment 129954 [details]
dmesg OOB nouveau

Comment 65 Roman Stingler 2017-02-27 13:38:25 UTC

On archlinux graphics works normal, but when display goes dark, I can resume on a black screen some time later a cursor appears but screen stays black

Comment 66 bastian.beischer 2017-02-27 13:43:26 UTC

Roman, this sounds like https://bugs.freedesktop.org/show_bug.cgi?id=99922.

Comment 67 Barry G 2017-03-25 22:39:29 UTC

Hi all,

Google brought me here as I am having the same issue as Bastian since upgrading from 4.9.11 (worked great) to 4.10.1 (Xorg won't start):
[ 1242.018471] INFO: task kworker/15:1:175 blocked for more than 120 seconds.
[ 1242.019277]       Tainted: P          IO    4.10.5-1-ARCH #1
[ 1242.020065] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1242.020866] kworker/15:1    D    0   175      2 0x00000000
[ 1242.020877] Workqueue: pm pm_runtime_work
[ 1242.020879] Call Trace:
[ 1242.020886]  __schedule+0x22f/0x700
[ 1242.020890]  schedule+0x3d/0x90
[ 1242.020893]  rpm_resume+0x118/0x6f0
[ 1242.020897]  ? wake_atomic_t_function+0x60/0x60
[ 1242.020901]  ? pci_pm_runtime_resume+0xa0/0xa0
[ 1242.020904]  pm_runtime_forbid+0x4c/0x60
[ 1242.020945]  nouveau_pmops_runtime_suspend+0xcc/0xd0 [nouveau]
[ 1242.020949]  pci_pm_runtime_suspend+0x55/0x180
[ 1242.020952]  ? pci_pm_runtime_resume+0xa0/0xa0
[ 1242.020955]  __rpm_callback+0xb9/0x1f0
[ 1242.020958]  rpm_callback+0x24/0x80
[ 1242.020961]  ? pci_pm_runtime_resume+0xa0/0xa0
[ 1242.020963]  rpm_suspend+0x132/0x690
[ 1242.020966]  ? finish_task_switch+0x78/0x200
[ 1242.020969]  pm_runtime_work+0x6f/0xb0
[ 1242.020974]  process_one_work+0x1e5/0x470
[ 1242.020977]  worker_thread+0x48/0x4e0
[ 1242.020980]  kthread+0x101/0x140
[ 1242.020983]  ? process_one_work+0x470/0x470
[ 1242.020986]  ? kthread_create_on_node+0x60/0x60
[ 1242.020988]  ret_from_fork+0x2c/0x40

Not sure if this is the same issue as I have dual Nvidia GTX 980s (nouveau).  Reverting to 4.9 fixes stuff and my desktop opens on startup.  Would be happy to open another bug if you desire but the PM stuff seems to make them related?

Comment 68 Viktor Kuzmin 2017-04-10 20:54:12 UTC

Same for me.

MacBook Pro 11.3, Gentoo Linux, kernel 4.10.8.

[   49.196313] Workqueue: pm pm_runtime_work
[   49.196314] Call Trace:
[   49.196321]  ? dump_stack+0x46/0x59
[   49.196323]  ? __warn+0xb9/0xe0
[   49.196327]  ? pci_pm_runtime_resume+0xa0/0xa0
[   49.196329]  ? warn_slowpath_fmt+0x4a/0x50
[   49.196349]  ? gen6_read32+0x92/0x1e0 [i915]
[   49.196369]  ? hsw_enable_pc8+0x6b7/0x720 [i915]
[   49.196371]  ? pci_pm_runtime_resume+0xa0/0xa0
[   49.196384]  ? intel_runtime_suspend+0x142/0x250 [i915]
[   49.196386]  ? pci_pm_runtime_suspend+0x50/0x140
[   49.196387]  ? __rpm_callback+0xb1/0x1f0
[   49.196389]  ? rpm_callback+0x1a/0x70
[   49.196390]  ? pci_pm_runtime_resume+0xa0/0xa0
[   49.196392]  ? rpm_suspend+0x11d/0x670
[   49.196396]  ? _raw_write_unlock_irq+0xe/0x20
[   49.196400]  ? finish_task_switch+0xa7/0x260
[   49.196403]  ? __update_idle_core+0x1b/0xb0
[   49.196405]  ? pm_runtime_work+0x62/0xa0
[   49.196407]  ? process_one_work+0x133/0x480
[   49.196408]  ? worker_thread+0x42/0x4c0
[   49.196411]  ? kthread+0xef/0x130
[   49.196412]  ? process_one_work+0x480/0x480
[   49.196415]  ? kthread_create_on_node+0x40/0x40
[   49.196416]  ? ret_from_fork+0x23/0x30

Comment 69 Gleb Nemshilov 2017-04-11 18:31:32 UTC

The issue in this bug was resolved (not fully though) in 4.10-rc6. It highly unlikely that you have the same problem as in this bug.

Comment 70 Gleb Nemshilov 2017-04-11 18:32:51 UTC

I wonder if Peter is still active on this bug? 2 months passed since I've attached the logs with latest test patches on 4.10-rc5 and one of the patches actually helped (the one that failed to compile first time).

Comment 71 Peter Ujfalusi 2017-04-16 20:11:17 UTC

Gleb,

(In reply to Gleb Nemshilov from comment #70)
> I wonder if Peter is still active on this bug? 2 months passed since I've
> attached the logs with latest test patches on 4.10-rc5 and one of the
> patches actually helped (the one that failed to compile first time).

I can try to forward port my patches if you can test them again on top on current mainline.

I have rechecked the state and I still maintain that the poll_work is not running at all after:
cae9ff036eea drm/nouveau: Don't enabling polling twice on runtime resume

The commit checks that
+	if (!drm_dev->mode_config.poll_enabled)
+		drm_kms_helper_poll_enable(drm_dev);

which is wrong in two ways.
First it should not peek into mode_config and especially it should not check poll_enabled as it is the wrong boolean to check. poll_enabled indicates that the poll_init has been called and nothing more.

The nouveau driver does this:
poll_init() followed by poll_disable() during the driver probe. poll_init() will set the poll_enabled to true, poll_disable() will kill the poll_work. Now since the mentioned patch is checking !drm_dev->mode_config.poll_enabled to enable the poll again, it will never going to enable the work again.

This might be OK as HDMI/DP have hpd (hotplug detection) and the LCD is always connected, but VGA might not have hpd and it's state change need to be polled. Or not. I'm not sure about it.

This means that the poll_work can not cause anymore any issue on nouveau ;)

But, the patches I have attached to this bug are fixing the issue correctly (khm, I believe) and keeps the poll_work functional.

Give me couple of days to forward port and test the patches.

Comment 72 Peter Ujfalusi 2017-04-23 15:37:10 UTC

Created attachment 130990 [details] [review]
poll helper fix on top of next-20170421

Gleb,

can you check this patch on top of next-20170421 (most likely it will apply on top of mainline as well)
if it does not break the optimus, can you also check with the following change:

diff --git a/drivers/gpu/drm/drm_probe_helper.c b/drivers/gpu/drm/drm_probe_helper.c
index 1b0c14ab3fff..dbd227de1d37 100644
--- a/drivers/gpu/drm/drm_probe_helper.c
+++ b/drivers/gpu/drm/drm_probe_helper.c
@@ -160,7 +160,7 @@ void drm_kms_helper_poll_enable(struct drm_device *dev)
 		 * was enabled before.
 		 */
 		poll = true;
-		delay = HZ;
+		delay = 0;
 	}
 
 	if (poll)


Thanks,
Peter

Comment 73 Gleb Nemshilov 2017-04-29 13:05:09 UTC

Created attachment 131148 [details]
dmesg log for [PATCH] drm/nouveau: Fix drm poll_helper handling

Hi Peter!

Finally, the weekend and I've tested both patches and both times kernel compiled, booted and everything were working fine. Log files atteched.

log attached for: 130990: poll helper fix on top of next-20170421

Comment 74 Gleb Nemshilov 2017-04-29 13:07:09 UTC

Created attachment 131149 [details]
dmesg log for [PATCH] drm/nouveau: Fix drm poll_helper handling + delay = 0

Another log for kernel with both patches.

Please let me know whether more testing need to be done.

Best regards,
Gleb

Comment 75 Ricardo 2017-05-09 17:54:13 UTC

Adding tag into "Whiteboard" field - ReadyForDev
The bug still active
*Status is correct
*Platform is included
*Feature is included
*Priority and Severity correctly set
*Logs included

Comment 76 Jani Nikula 2017-06-15 07:18:42 UTC

Seems like a nouveau bug. Moving to DRM/other.

Comment 77 Gleb Nemshilov 2017-07-30 04:14:25 UTC

Hi Peter and everyone!

I've seen that the latest patch which solves the issue was merged to 4.12 and 4.9, but the initial fix with 1s delay is still in the tree. Will this also be resolved? It looks like the previous patch can be reverted as the proper fix is now available.

https://github.com/torvalds/linux/blob/master/drivers/gpu/drm/drm_probe_helper.c#L246

Comment 78 Martin Peres 2019-12-04 09:21:30 UTC

-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/xorg/driver/xf86-video-nouveau/issues/302.

Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.