Bug 99505

Summary: Attempting to reclock GeForce GT740M (GK208) causes the GPU and system to hang
Product: xorg Reporter: Boyan Ding <stu_dby>
Component: Driver/nouveau Assignee: Nouveau Project <nouveau>
Status: RESOLVED FIXED QA Contact: Xorg Project Team <xorg-team>
Severity: normal    
Priority: medium    
Version: unspecified   
Hardware: Other   
OS: All   
Attachments:
Description Flags
The dmesg output after reclocking none

Description Boyan Ding 2017-01-23 13:29:41 UTC
Created attachment 129107 [details]
The dmesg output after reclocking

When I tried to reclock the GT740M (the GK208 one) on my laptop with:
# echo "07" > /sys/kernel/debug/dri/1/pstate
The shell process in which I entered the command became unkillable (state D+ in ps) and the GPU hung (rendering stopped). Attempting to shut down the computer then froze the system completely; only a forced reset worked.

My computer is a ThinkPad E431 laptop running Arch Linux; the kernel version is 4.8.13-1-ARCH.

Output of lspci:
00:00.0 Host bridge: Intel Corporation 3rd Gen Core processor DRAM Controller (rev 09)
00:01.0 PCI bridge: Intel Corporation Xeon E3-1200 v2/3rd Gen Core processor PCI Express Root Port (rev 09)
00:02.0 VGA compatible controller: Intel Corporation 3rd Gen Core processor Graphics Controller (rev 09)
00:14.0 USB controller: Intel Corporation 7 Series/C210 Series Chipset Family USB xHCI Host Controller (rev 04)
00:16.0 Communication controller: Intel Corporation 7 Series/C216 Chipset Family MEI Controller #1 (rev 04)
00:1a.0 USB controller: Intel Corporation 7 Series/C216 Chipset Family USB Enhanced Host Controller #2 (rev 04)
00:1b.0 Audio device: Intel Corporation 7 Series/C216 Chipset Family High Definition Audio Controller (rev 04)
00:1c.0 PCI bridge: Intel Corporation 7 Series/C216 Chipset Family PCI Express Root Port 1 (rev c4)
00:1c.1 PCI bridge: Intel Corporation 7 Series/C210 Series Chipset Family PCI Express Root Port 2 (rev c4)
00:1c.3 PCI bridge: Intel Corporation 7 Series/C216 Chipset Family PCI Express Root Port 4 (rev c4)
00:1d.0 USB controller: Intel Corporation 7 Series/C216 Chipset Family USB Enhanced Host Controller #1 (rev 04)
00:1f.0 ISA bridge: Intel Corporation HM77 Express Chipset LPC Controller (rev 04)
00:1f.2 SATA controller: Intel Corporation 7 Series Chipset Family 6-port SATA Controller [AHCI mode] (rev 04)
00:1f.3 SMBus: Intel Corporation 7 Series/C216 Chipset Family SMBus Controller (rev 04)
01:00.0 3D controller: NVIDIA Corporation GK208M [GeForce GT 740M] (rev a1)
03:00.0 Unassigned class [ff00]: Realtek Semiconductor Co., Ltd. RTS5229 PCI Express Card Reader (rev 01)
04:00.0 Network controller: Broadcom Limited BCM43228 802.11a/b/g/n
05:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller (rev 07)

Output of /sys/kernel/debug/dri/1/pstate before reclocking:
07: core 405 MHz memory 810 MHz
0a: core 405-1058 MHz memory 1620 MHz
0f: core 405-1058 MHz memory 2002 MHz
AC: core 0 MHz memory 0 MHz
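
A listing like the one above can be split into the selectable levels and the status line with a few lines of shell. This is a minimal sketch and not part of the original report: the function names pstate_levels and pstate_current are hypothetical, and it assumes the line labelled AC reports the clocks currently in effect rather than a level that can be selected. Labels other than AC are not handled.

```shell
# Hypothetical helpers; both read a pstate listing on stdin.

# Print the IDs that can be written back to the pstate file (07, 0a, 0f).
# Assumption: the line labelled "AC" is a status line, not a level.
pstate_levels() {
    awk -F': ' 'NF > 1 && $1 != "AC" { print $1 }'
}

# Print the clocks shown on the "AC" status line.
pstate_current() {
    awk -F': ' '$1 == "AC" { print $2 }'
}
```

With debugfs mounted, something like `pstate_levels < /sys/kernel/debug/dri/1/pstate` (as root) would list the IDs accepted by the echo command; the card index 1 is taken from this report and may differ on other machines.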
Comment 1 Ilia Mirkin 2017-01-23 13:31:39 UTC
I think you can't reclock while the GPU is disabled. Try running DRI_PRIME=1 glxgears while doing that echo.
Comment 2 Karol Herbst 2017-01-23 13:33:10 UTC
(In reply to Ilia Mirkin from comment #1)
> I think you can't reclock while the GPU is disabled. Try running DRI_PRIME=1
> glxgears while doing that echo.

yeah, until now it hangs the thread. I have patches pending to fix it though.
Comment 3 Boyan Ding 2017-01-23 13:44:05 UTC
(In reply to Ilia Mirkin from comment #1)
> I think you can't reclock while the GPU is disabled. Try running DRI_PRIME=1
> glxgears while doing that echo.

I tried to reclock while running glxgears (the AC line shows 405 MHz instead of 0), but it also hangs.
Comment 4 Boyan Ding 2017-01-23 13:54:17 UTC
(In reply to Karol Herbst from comment #2)
> yeah, until now it hangs the thread. I have patches pending to fix it though.

Are those patches published? I can compile a kernel and test them if I can access them.
Comment 5 Ilia Mirkin 2017-01-23 14:17:03 UTC
(In reply to Boyan Ding from comment #3)
> (In reply to Ilia Mirkin from comment #1)
> > I think you can't reclock while the GPU is disabled. Try running DRI_PRIME=1
> > glxgears while doing that echo.
> 
> I tried to reclock while running glxgears (the AC line shows 405 MHz instead
> of 0), but it also hangs.

Hmmm... this is probably a different issue then. To double-check, try booting with nouveau.runpm=0 (which will disable all the runpm stuff).

Either way, please try to figure out where the echo gets stuck (you can echo ... something into sysrq-trigger, I forget what... 'w' maybe?)
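
For reference, w is indeed the SysRq command that dumps tasks in uninterruptible (blocked) state. A minimal sketch of triggering it from a root shell follows; the function name dump_blocked_tasks and its optional path argument are hypothetical, the latter added only so the target file can be overridden.

```shell
# Hypothetical helper: ask the kernel to log all blocked (state D) tasks.
# Writes the SysRq command character 'w' to the trigger file; root is
# required when the default /proc/sysrq-trigger is used.
dump_blocked_tasks() {
    trigger=${1:-/proc/sysrq-trigger}
    printf 'w' > "$trigger"
}
```

After calling it, the report shows up in the kernel log, for example in the output of dmesg.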
Comment 6 Boyan Ding 2017-01-23 14:48:25 UTC
(In reply to Ilia Mirkin from comment #5)
> (In reply to Boyan Ding from comment #3)
> > I tried to reclock while running glxgears (the AC line shows 405 MHz instead
> > of 0), but it also hangs.
> 
> Hmmm... this is probably a different issue then. To double-check, try
> booting with nouveau.runpm=0 (which will disable all the runpm stuff).
> 
> Either way, please try to figure out where the echo gets stuck (you can echo
> ... something into sysrq-trigger, I forget what... 'w' maybe?)

I added nouveau.runpm=0 and it still hangs. Writing w to sysrq-trigger shows the following:

[  169.926880] sysrq: SysRq : Show Blocked State
[  169.926883]   task                        PC stack   pid father
[  169.926954] kworker/1:123   D ffff880242063c90     0   315      2 0x00000000
[  169.926982] Workqueue: events nvkm_pstate_work [nouveau]
[  169.926984]  ffff880242063c90 00ff880239628b00 ffff880244a59c80 ffff880242048000
[  169.926987]  ffffffffa06e1cec ffff880242064000 ffff88024210c800 ffff880239628a80
[  169.926989]  ffff88024456ad00 0000000000000009 ffff880242063ca8 ffffffff815f40ec
[  169.926992] Call Trace:
[  169.927012]  [<ffffffffa06e1cec>] ? memx_out+0x3c/0x90 [nouveau]
[  169.927014]  [<ffffffff815f40ec>] schedule+0x3c/0x90
[  169.927032]  [<ffffffffa06e1b8a>] nvkm_pmu_send+0x24a/0x2c0 [nouveau]
[  169.927035]  [<ffffffff810c0450>] ? wake_atomic_t_function+0x60/0x60
[  169.927051]  [<ffffffffa06e1fa8>] nvkm_memx_fini+0xe8/0xf0 [nouveau]
[  169.927064]  [<ffffffffa068ff5f>] ? nvkm_boolopt+0x2f/0x190 [nouveau]
[  169.927080]  [<ffffffffa06c2a4b>] gk104_ram_prog+0x9b/0xc0 [nouveau]
[  169.927095]  [<ffffffffa06a4014>] nvkm_pstate_work+0x154/0x540 [nouveau]
[  169.927097]  [<ffffffff810a3277>] ? finish_task_switch+0x77/0x1e0
[  169.927100]  [<ffffffff81095ef5>] process_one_work+0x1e5/0x470
[  169.927102]  [<ffffffff810961c8>] worker_thread+0x48/0x4e0
[  169.927104]  [<ffffffff81096180>] ? process_one_work+0x470/0x470
[  169.927106]  [<ffffffff8109be38>] kthread+0xd8/0xf0
[  169.927108]  [<ffffffff8102c782>] ? __switch_to+0x2d2/0x630
[  169.927110]  [<ffffffff815f823f>] ret_from_fork+0x1f/0x40
[  169.927112]  [<ffffffff8109bd60>] ? kthread_worker_fn+0x170/0x170
[  169.927217] bash            D ffff8802202afb60     0  1863   1862 0x00000000
[  169.927220]  ffff8802202afb60 00ff88020e930000 ffff880244a59c80 ffff88020e930000
[  169.927222]  0000000000000246 ffff8802202b0000 ffff88024210b948 0000000000000008
[  169.927225]  ffff88024210b800 0000000000000028 ffff8802202afb78 ffffffff815f40ec
[  169.927227] Call Trace:
[  169.927229]  [<ffffffff815f40ec>] schedule+0x3c/0x90
[  169.927244]  [<ffffffffa06a3cdb>] nvkm_pstate_calc+0x7b/0xd0 [nouveau]
[  169.927246]  [<ffffffff810c0450>] ? wake_atomic_t_function+0x60/0x60
[  169.927260]  [<ffffffffa06a4598>] nvkm_clk_ustate+0x98/0xb0 [nouveau]
[  169.927277]  [<ffffffffa06f0235>] nvkm_control_mthd+0x315/0x440 [nouveau]
[  169.927290]  [<ffffffffa068f2e8>] nvkm_object_mthd+0x18/0x20 [nouveau]
[  169.927302]  [<ffffffffa068daa6>] nvkm_ioctl_mthd+0x66/0xb0 [nouveau]
[  169.927314]  [<ffffffffa068e3b7>] nvkm_ioctl+0x107/0x260 [nouveau]
[  169.927329]  [<ffffffffa07382d2>] nvkm_client_ioctl+0x12/0x20 [nouveau]
[  169.927341]  [<ffffffffa068b041>] nvif_object_ioctl+0x41/0x50 [nouveau]
[  169.927352]  [<ffffffffa068b3e8>] nvif_object_mthd+0x138/0x160 [nouveau]
[  169.927354]  [<ffffffff81217c7c>] ? path_openat+0x31c/0x1170
[  169.927356]  [<ffffffff81316e4b>] ? _kstrtoull+0x3b/0x90
[  169.927371]  [<ffffffffa07352fe>] nouveau_debugfs_pstate_set+0x12e/0x1c0 [nouveau]
[  169.927373]  [<ffffffff812993c0>] full_proxy_write+0x60/0xa0
[  169.927375]  [<ffffffff81208797>] __vfs_write+0x37/0x140
[  169.927378]  [<ffffffff81227bc9>] ? __alloc_fd+0xc9/0x180
[  169.927379]  [<ffffffff81206dd6>] ? filp_close+0x56/0x80
[  169.927382]  [<ffffffff810c7be7>] ? percpu_down_read+0x17/0x50
[  169.927384]  [<ffffffff81209566>] vfs_write+0xb6/0x1a0
[  169.927385]  [<ffffffff8120a9e5>] SyS_write+0x55/0xc0
[  169.927387]  [<ffffffff81227e92>] ? __close_fd+0x92/0xb0
[  169.927389]  [<ffffffff815f8032>] entry_SYSCALL_64_fastpath+0x1a/0xa4
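
In a dump like this, the lines with a D in the second column name the blocked tasks: here the kworker running nvkm_pstate_work, stuck in nvkm_pmu_send under nvkm_memx_fini, and the bash that issued the echo, waiting in nvkm_pstate_calc. The following sketch pulls those lines out of a saved log. The function name blocked_tasks is hypothetical, and the column layout is assumed to match the 4.8 output above (task name without spaces, PID in the fifth column).

```shell
# Hypothetical helper: read a "Show Blocked State" dump on stdin and
# print "<task> <pid>" for every task reported in state D.
blocked_tasks() {
    # Strip the leading "[ timestamp]" prefix, then keep only task lines.
    sed -E 's/^\[[^]]*\] *//' | awk '$2 == "D" { print $1, $5 }'
}
```

For the dump above this would report the kworker (PID 315) and bash (PID 1863).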
Comment 7 Karol Herbst 2017-01-23 15:42:30 UTC
Try it with a 4.10 kernel.
Comment 8 Boyan Ding 2017-01-24 03:38:36 UTC
I built 4.10-rc5 and now it reclocks okay. I echoed 0f to pstate and ran through a few games. All worked very well. Thanks for the great work.
