Bug 91557

Summary: [NVE4] freezes: HUB_INIT timed out
Product: xorg
Component: Driver/nouveau
Reporter: wolf480
Assignee: Nouveau Project <nouveau>
QA Contact: Xorg Project Team <xorg-team>
CC: wolf480
Status: NEW
Severity: normal
Priority: medium
Version: unspecified
Hardware: x86-64 (AMD64)
OS: Linux (All)
Attachments (flags: none):
- dmesg log from freeze with runpm=0
- Xorg log from freeze with runpm=0
- netconsole log from freeze with default runpm setting
- Xorg log from freeze with default runpm settings
- journalctl output (kernel messages only) from freeze with default runpm settings
- mmiotrace of successful nouveau initialization
- dmesg log from successful nouveau initialization
- lspci output
- mmiotrace of nouveau loading with HUB_INIT timeout
- dmesg log from nouveau loading with HUB_INIT timeout
- mmiotrace of nouveau loading with grctx timeout
- dmesg log from nouveau loading with grctx timeout
- dmesg log from nvidia-smi with proprietary driver

Description wolf480 2015-08-04 17:34:38 UTC
Created attachment 117518 [details]
dmesg log from freeze with runpm=0

I have a Medion X7827 laptop with GK104 GPU in an Optimus setup, running:
Linux 4.1.2 x86_64
Mesa 10.6.2
Xorg 1.17.2

I've been experiencing some freezes:
- a total freeze (no ping, no sysrq, only hard reset) shortly after xorg start - if nouveau is loaded *without* runpm=0
- a recoverable freeze (sysrq+K worked) when exiting xorg - if nouveau is loaded *with* runpm=0

On #nouveau IRC channel I've been told to try the hack-gk106m branch of this repository: http://... , with runpm=0
At first I thought it helped, but then I noticed the freezes still happen at random.

When runpm=0 is set, the freeze happens about 60% of the time. I've tested it with both the in-tree nouveau.ko and one built from the hack-gk106m branch, and the chance looks the same with both.
When the freeze happens, there's either a "HUB_INIT timed out" message or "grctx template channel unload timeout" message in dmesg.
If the freeze is to happen, the error message shows up at nouveau module load time, and then again when Xorg starts. Full logs in attachments.
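
For sorting a batch of saved dmesg logs by outcome, something like the following might do. It is only a sketch: it assumes the two error strings quoted above appear verbatim in the log, and the name `classify_dmesg` and the labels are made up for illustration, not anything nouveau provides.

```python
# Hypothetical helper: report which of the two nouveau errors a dmesg log contains.
# The search strings are the messages quoted in this report; the function name
# and the labels are assumptions chosen for illustration.
ERRORS = {
    "hub_init": "HUB_INIT timed out",
    "grctx": "grctx template channel unload timeout",
}


def classify_dmesg(text):
    """Return the sorted list of error labels whose message occurs in `text`."""
    return sorted(label for label, needle in ERRORS.items() if needle in text)
```

Calling `classify_dmesg(open(path, errors="replace").read())` on each saved log gives an empty list for a clean load and one or both labels otherwise.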

I did mmiotraces of the nouveau.ko from hack-gk106m branch (can repeat with in-tree nouveau.ko if necessary), with runpm=0, for all of the cases:
- the driver loading successfully
- the driver loading with HUB_INIT timeout error
- the driver loading with grctx timeout error
The traces and corresponding dmesg logs are in attachments. I have more traces, but included only one per case.
I did not try to start xorg and trigger the freeze during the mmiotraces, because:
a) I believe the problem happens at nouveau load time, when it tries to initialize the GPU
b) The traces compressed with `xz -9` barely fit in the max attachment size of bugzilla, if they were longer I doubt I could make them fit.
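
A rough way to check ahead of time whether a trace would fit is sketched below. It uses Python's `lzma` module at preset 9, which should be close to `xz -9` but is not guaranteed to give byte-identical output. The attachment limit is not stated in this report, so `limit_bytes` is a parameter the caller has to supply; the function names are hypothetical.

```python
import lzma


def compressed_size(path, preset=9):
    """Return the number of bytes the file at `path` takes as an .xz stream."""
    compressor = lzma.LZMACompressor(format=lzma.FORMAT_XZ, preset=preset)
    total = 0
    with open(path, "rb") as f:
        # Read in 1 MiB chunks so a large trace is never held in memory at once.
        for chunk in iter(lambda: f.read(1 << 20), b""):
            total += len(compressor.compress(chunk))
    # flush() emits whatever the compressor still buffers plus the stream footer.
    total += len(compressor.flush())
    return total


def fits(path, limit_bytes):
    """True if the xz-compressed file is no larger than `limit_bytes` (an assumed limit)."""
    return compressed_size(path) <= limit_bytes
```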

I hope these traces will be useful and help figure out why it sometimes works and sometimes doesn't, and how to make it always work.
Let me know if there's anything more I could do to help you figure this out.
Comment 1 wolf480 2015-08-04 17:35:49 UTC
Created attachment 117519 [details]
Xorg log from freeze with runpm=0
Comment 2 wolf480 2015-08-04 17:36:45 UTC
Created attachment 117520 [details]
netconsole log from freeze with default runpm setting
Comment 3 wolf480 2015-08-04 17:38:13 UTC
Created attachment 117521 [details]
Xorg log from freeze with default runpm settings
Comment 4 wolf480 2015-08-04 17:39:53 UTC
Created attachment 117522 [details]
journalctl output (kernel messages only) from freeze with default runpm settings
Comment 5 wolf480 2015-08-04 17:41:02 UTC
(In reply to wolf480 from comment #0)
> On #nouveau IRC channel I've been told to try the hack-gk106m branch of this
> repository: http://... , with runpm=0
I mean http://cgit.freedesktop.org/~darktama/nouveau/log/?h=hack-gk106m
Comment 6 wolf480 2015-08-04 17:42:55 UTC
Created attachment 117523 [details]
mmiotrace of successful nouveau initialization
Comment 7 wolf480 2015-08-04 17:45:27 UTC
Created attachment 117524 [details]
dmesg log from successful nouveau initialization
Comment 8 wolf480 2015-08-04 17:47:22 UTC
Created attachment 117525 [details]
lspci output

from successful nouveau initialization, dunno if that matters
Comment 9 wolf480 2015-08-04 17:49:05 UTC
Created attachment 117526 [details]
mmiotrace of nouveau loading with HUB_INIT timeout
Comment 10 wolf480 2015-08-04 17:50:20 UTC
Created attachment 117527 [details]
dmesg log from nouveau loading with HUB_INIT timeout
Comment 11 wolf480 2015-08-04 17:53:37 UTC
Created attachment 117528 [details]
mmiotrace of nouveau loading with grctx timeout
Comment 12 wolf480 2015-08-04 17:54:24 UTC
Created attachment 117529 [details]
dmesg log from nouveau loading with grctx timeout
Comment 13 wolf480 2015-08-04 19:44:49 UTC
I also did an mmiotrace of the proprietary nvidia driver being loaded when running nvidia-smi. Even compressed it doesn't fit in a bugzilla attachment, so I uploaded it here: https://www.dropbox.com/s/14mchinykvfqrg9/nvidia-trace.txt.xz?dl=1
Comment 14 wolf480 2015-08-04 19:47:01 UTC
Created attachment 117531 [details]
dmesg log from nvidia-smi with proprietary driver

the corresponding mmiotrace (over 3MiB compressed) is here:
https://www.dropbox.com/s/14mchinykvfqrg9/nvidia-trace.txt.xz?dl=1
Comment 15 Ilia Mirkin 2015-10-26 05:02:20 UTC
Please try (a) kernel v4.3-rc7, and (b) kernel v4.3-rc7 booted with nouveau.config=War00C800_0=1

The former has an updated pgob protocol, the latter enables an additional workaround necessary for some laptops. If it works, we can whitelist your specific subdevice, so please provide the output from

lspci -vnn -d 10de::300
Comment 16 Ilia Mirkin 2015-10-31 18:43:49 UTC
Update from OP on IRC: War00C800_0=1 makes it work

01:00.0 VGA compatible controller [0300]: NVIDIA Corporation GK104M [GeForce GTX 870M] [10de:1199] (rev a1) (prog-if 00 [VGA controller])
	Subsystem: Micro-Star International Co., Ltd. [MSI] Device [1462:1106]
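
For reference, the vendor:device and subsystem IDs needed for a whitelist entry can be pulled out of `lspci -vnn` output with a small sketch like this one. It assumes the bracketed `[xxxx:xxxx]` layout shown above; `parse_ids` is a made-up name, not part of pciutils or nouveau.

```python
import re

# Matches a bracketed hexadecimal ID pair as printed by `lspci -nn`, e.g. [10de:1199].
# The class code ([0300]) has no colon, so it is not matched.
ID_PAIR = re.compile(r"\[([0-9a-f]{4}):([0-9a-f]{4})\]")


def parse_ids(lspci_text):
    """Return a dict with 'device' and 'subsystem' (vendor, id) tuples, when present."""
    ids = {}
    for line in lspci_text.splitlines():
        pairs = ID_PAIR.findall(line)
        if not pairs:
            continue
        key = "subsystem" if line.strip().startswith("Subsystem:") else "device"
        # Keep the first device line and the first subsystem line only.
        ids.setdefault(key, pairs[-1])
    return ids
```

Fed the two lines above, this yields `("10de", "1199")` for the device and `("1462", "1106")` for the subsystem.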
