Bug 94803 - nouveau bug crashes kernel 4.4.6 on warm boot
Summary: nouveau bug crashes kernel 4.4.6 on warm boot
Status: NEW
Alias: None
Product: xorg
Classification: Unclassified
Component: Driver/nouveau
Version: unspecified
Hardware: Other All
Importance: medium normal
Assignee: Nouveau Project
QA Contact: Xorg Project Team
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2016-04-02 23:14 UTC by Michael Daum
Modified: 2016-04-03 01:11 UTC (History)
0 users

See Also:
i915 platform:
i915 features:


Attachments
picture from stack trace taken with camera (277.30 KB, image/jpeg)
2016-04-02 23:14 UTC, Michael Daum
dump of the nv50_disp_intr_supervisor function (37.93 KB, text/plain)
2016-04-02 23:16 UTC, Michael Daum

Description Michael Daum 2016-04-02 23:14:46 UTC
Created attachment 122684 [details]
picture from stack trace taken with camera

On every warm boot (reboot), kernel 4.4.6 crashes with the stack trace shown in the attached picture. The kernel boots fine on a cold start or after pressing the reset button.

The GPU in question is a "PNY Quadro FX 580" (Tesla / G96) with two attached monitors, both connected via DisplayPort.

The crash happens quite early, when the kernel sets the framebuffer mode. The resolution is set to 1920x1200 px (240x75 columns/lines).

attached:

picture of the screen showing the stack trace
dump of "nv50_disp_intr_supervisor" function from the 4.4.6 vmlinux
Comment 1 Michael Daum 2016-04-02 23:16:41 UTC
Created attachment 122685 [details]
dump of the nv50_disp_intr_supervisor function
Comment 2 Ilia Mirkin 2016-04-02 23:36:31 UTC
0x80c (2060) looks like this:

   0xffffffff814daf32 <+2050>:	xor    %edx,%edx
   0xffffffff814daf34 <+2052>:	mov    %rax,%rdi
   0xffffffff814daf37 <+2055>:	mov    $0xc,%eax
   0xffffffff814daf3c <+2060>:	divl   -0x88(%rbp)

which has gotta come from

nv50_disp_intr_unk20_2_dp(...) {

	u32 dpctrl = nvkm_rd32(device, 0x61c10c + loff);
	link_nr = hweight32(dpctrl & 0x000f0000);

...

 	value = value - (3 * !!(dpctrl & 0x00004000)) - (12 / link_nr);

Which means that on boot link_nr is 0. Michael, if you're up for some kernel patching, can you just add a

if (!link_nr) {
  nvkm_error(subdev, "link_nr = 0; dpctrl: %08x\n", dpctrl);
  return;
}

right after the link_nr assignment in that function?
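
For illustration only, here is a minimal userspace sketch of the failing computation and the proposed guard. It is not the nouveau code: popcount32() is a portable stand-in for the kernel's hweight32(), and dp_lane_term() is a hypothetical helper that isolates the "12 / link_nr" term from the excerpt above.

```c
#include <assert.h>
#include <stdint.h>

/* Portable stand-in for the kernel's hweight32(): counts set bits. */
static unsigned popcount32(uint32_t v)
{
	unsigned n = 0;

	for (; v; v &= v - 1)	/* clear the lowest set bit each round */
		n++;
	return n;
}

/*
 * Hypothetical helper (not part of nouveau): computes the "12 / link_nr"
 * term. Returns 0 and stores the term in *out, or returns -1 when no bit
 * of the 0x000f0000 mask is set, which is the case that makes divl trap.
 */
static int dp_lane_term(uint32_t dpctrl, uint32_t *out)
{
	unsigned link_nr = popcount32(dpctrl & 0x000f0000);

	assert(link_nr <= 4);	/* the mask has four bits, so at most four */
	if (!link_nr)
		return -1;	/* guard: would otherwise divide by zero */
	*out = 12 / link_nr;
	return 0;
}
```

With all four mask bits set (dpctrl = 0x000f0000) the term is 12 / 4 = 3; with none set the helper returns -1 instead of dividing.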
Comment 3 Michael Daum 2016-04-03 01:11:49 UTC
The patch from Ilia prevents the kernel from crashing; reboot no longer fails. Additionally, _most_ of the time both monitors come up at reboot.

But on some reboots one of the monitors stays black and the kernel logs the following errors (from dmesg):

[    0.955154] nouveau 0000:01:00.0: disp: outp 04:0006:0384: link training failed
[    0.973972] nouveau 0000:01:00.0: disp: outp 04:0006:0384: link training failed
[    0.974279] nouveau 0000:01:00.0: disp: link_nr = 0; dpctrl: 00401101


In those cases the value of dpctrl is always the same (00401101).
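
As a quick cross-check of that value against the masks quoted in comment 2, here is a small sketch that splits dpctrl into the two fields the function looks at. The macro, struct and function names are made up for this example and are not nouveau identifiers; only the two mask constants come from the code excerpt.

```c
#include <stdint.h>

/* Masks taken from the excerpt in comment 2; the names are hypothetical. */
#define DPCTRL_LANE_MASK 0x000f0000u	/* bits counted to get link_nr */
#define DPCTRL_BIT_4000  0x00004000u	/* bit tested in the "3 * !!(...)" term */

struct dpctrl_fields {
	uint32_t lane_bits;	/* the four masked bits, shifted down to bit 0 */
	int bit_4000;		/* 1 if bit 0x00004000 is set, else 0 */
};

/* Hypothetical decoder: extracts the two fields used by the calculation. */
static struct dpctrl_fields dpctrl_decode(uint32_t dpctrl)
{
	struct dpctrl_fields f;

	f.lane_bits = (dpctrl & DPCTRL_LANE_MASK) >> 16;
	f.bit_4000 = !!(dpctrl & DPCTRL_BIT_4000);
	return f;
}
```

For dpctrl = 0x00401101 both fields decode to 0: none of the bits in 0x000f0000 are set, so hweight32() yields link_nr = 0, which matches the logged message.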

