Bug 102553 - Venus PRO R9 M265X amdgpu: Kernel OOPS si_dpm_set_power_state unable to handle kernel NULL pointer dereference
Summary: Venus PRO R9 M265X amdgpu: Kernel OOPS si_dpm_set_power_state unable to handl...
Status: RESOLVED FIXED
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/AMDgpu (show other bugs)
Version: unspecified
Hardware: x86-64 (AMD64) Linux (All)
: medium normal
Assignee: Default DRI bug account
QA Contact:
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2017-09-05 20:23 UTC by mercuriete
Modified: 2018-04-16 13:22 UTC (History)
2 users (show)

See Also:
i915 platform:
i915 features:


Attachments
full dmesg (52.03 KB, text/plain)
2017-09-05 20:23 UTC, mercuriete
no flags Details
very ugly patch LOL (632 bytes, patch)
2018-03-07 22:34 UTC, mercuriete
no flags Details | Splinter Review
fix null dereference (2.32 KB, patch)
2018-03-12 21:19 UTC, mercuriete
no flags Details | Splinter Review
implement missing functions (2.11 KB, patch)
2018-03-29 23:21 UTC, Stefano Cipriani
no flags Details | Splinter Review
possible fix (2.59 KB, patch)
2018-03-30 18:35 UTC, Alex Deucher
no flags Details | Splinter Review
lspci battery (2.90 KB, text/plain)
2018-04-01 17:59 UTC, mercuriete
no flags Details
lspci balanced (2.90 KB, text/plain)
2018-04-01 17:59 UTC, mercuriete
no flags Details

Description mercuriete 2017-09-05 20:23:38 UTC
Created attachment 133980 [details]
full dmesg

Hi all

When you try to change the DPM "governor" to "battery" using this command:

echo "battery" > /sys/class/drm/card0/device/power_dpm_state

You will get a kernel OOPS

OS: Gentoo

Kernel:
 ~ $ uname -a
Linux localhost.localdomain 4.13.0-gentoo #2 SMP PREEMPT Tue Sep 5 19:43:50 CEST 2017 x86_64 Intel(R) Core(TM) i7-4720HQ CPU @ 2.60GHz GenuineIntel GNU/Linux
m

lspci of my graphic card:
01:00.0 Display controller: Advanced Micro Devices, Inc. [AMD/ATI] Venus PRO [Radeon HD 8850M / R9 M265X]

lspci -n:
01:00.0 0380: 1002:6823




partial dmesg output:

[ 1861.111097] BUG: unable to handle kernel NULL pointer dereference at           (null)
[ 1861.111929] IP:           (null)
[ 1861.112733] PGD 3cb980067 
[ 1861.112734] P4D 3cb980067 
[ 1861.113516] PUD 343b2f067 
[ 1861.114286] PMD 0 

[ 1861.116531] Oops: 0010 [#1] PREEMPT SMP
[ 1861.117296] Modules linked in:
[ 1861.118053] CPU: 4 PID: 4384 Comm: bash Not tainted 4.13.0-gentoo #2
[ 1861.118803] Hardware name: TOSHIBA SATELLITE P50-B-11M/VG20SQ, BIOS 1.50 12/09/2014
[ 1861.119547] task: ffff880307a33000 task.stack: ffffc90000894000
[ 1861.120295] RIP: 0010:          (null)
[ 1861.121365] RSP: 0018:ffffc90000897cc8 EFLAGS: 00010202
[ 1861.122300] RAX: ffffffff81ea12e0 RBX: ffff88041b8f0000 RCX: 0000000000000078
[ 1861.123101] RDX: 0000000000000078 RSI: 0000000000000007 RDI: ffff88041b8f0000
[ 1861.123938] RBP: ffffc90000897d50 R08: 0000000000020000 R09: 0000000000018c04
[ 1861.124720] R10: 0000000000000000 R11: ffff88041b940044 R12: 0000000000000000
[ 1861.125507] R13: ffff88041b940000 R14: ffff88041b941988 R15: 0000000000000003
[ 1861.126293] FS:  00007fefa79d6700(0000) GS:ffff88042fb00000(0000) knlGS:0000000000000000
[ 1861.127096] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1861.127903] CR2: 0000000000000000 CR3: 00000002e9266000 CR4: 00000000001406e0
[ 1861.128718] Call Trace:
[ 1861.129530]  ? si_dpm_set_power_state+0xc49/0x11e0
[ 1861.130354]  amdgpu_pm_compute_clocks+0x289/0x600
[ 1861.131177]  amdgpu_set_dpm_state+0x79/0x110
[ 1861.131990]  dev_attr_store+0x13/0x20
[ 1861.132804]  sysfs_kf_write+0x32/0x40
[ 1861.133624]  kernfs_fop_write+0x112/0x190
[ 1861.134466]  __vfs_write+0x23/0x130
[ 1861.135324]  ? preempt_count_add+0x99/0xb0
[ 1861.136163]  ? _raw_write_unlock+0x11/0x30
[ 1861.136992]  ? __this_cpu_preempt_check+0x13/0x20
[ 1861.137804]  ? __sb_start_write+0x50/0xe0
[ 1861.138613]  vfs_write+0xaf/0x180
[ 1861.139405]  SyS_write+0x41/0xb0
[ 1861.140189]  ? __context_tracking_exit.part.2+0x2e/0xf0
[ 1861.140980]  do_syscall_64+0x6f/0x160
[ 1861.141764]  entry_SYSCALL64_slow_path+0x25/0x25
[ 1861.142557] RIP: 0033:0x7fefa7092b40
[ 1861.143346] RSP: 002b:00007ffcd158b488 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[ 1861.144158] RAX: ffffffffffffffda RBX: 0000000000000008 RCX: 00007fefa7092b40
[ 1861.144975] RDX: 0000000000000008 RSI: 00000000006c5b00 RDI: 0000000000000001
[ 1861.145794] RBP: 00000000006c5b00 R08: 00007fefa734d760 R09: 00007fefa79d6700
[ 1861.146617] R10: 0000000000000097 R11: 0000000000000246 R12: 0000000000000008
[ 1861.147442] R13: 0000000000000001 R14: 00007fefa734c600 R15: 0000000000000000
[ 1861.148267] Code:  Bad RIP value.
[ 1861.149090] RIP:           (null) RSP: ffffc90000897cc8
[ 1861.149925] CR2: 0000000000000000
[ 1861.157732] ---[ end trace 4f6879609b5f474d ]---
Comment 1 mercuriete 2017-11-19 02:14:09 UTC
Problem still happens in kernel 4.11.0

Why Instruction pointer is null?
Do you need more feedback from me?

Somebody with knowledge about kernel debugging can explain me the way to how create a good bug report?



partial dmesg:

[  523.890285] BUG: unable to handle kernel NULL pointer dereference at           (null)
[  523.890289] IP:           (null)
[  523.890290] PGD 243e2f067 P4D 243e2f067 PUD 1e5127067 PMD 0 
[  523.890294] Oops: 0010 [#1] PREEMPT SMP
[  523.890295] Modules linked in:
[  523.890297] CPU: 0 PID: 6677 Comm: laptop_mode Not tainted 4.14.0-gentoo #2
[  523.890298] Hardware name: TOSHIBA SATELLITE P50-B-11M/VG20SQ, BIOS 1.50 12/09/2014
[  523.890299] task: ffff8802437764c0 task.stack: ffffc90003e38000
[  523.890300] RIP: 0010:          (null)
[  523.890301] RSP: 0018:ffffc90003e3bca0 EFLAGS: 00010202
[  523.890302] RAX: ffffffff81eaa400 RBX: ffff88041c7c0000 RCX: 0000000000000078
[  523.890303] RDX: 0000000000000078 RSI: 0000000000000007 RDI: ffff88041c7c0000
[  523.890304] RBP: ffffc90003e3bd30 R08: 0000000000020000 R09: 0000000000018c04
[  523.890304] R10: 0000000000000000 R11: ffff88041c7d8044 R12: 0000000000000000
[  523.890305] R13: ffff88041c7d8000 R14: ffff88041c7d9988 R15: 0000000000000003
[  523.890306] FS:  00007f013077af80(0000) GS:ffff88042fa00000(0000) knlGS:0000000000000000
[  523.890307] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  523.890308] CR2: 0000000000000000 CR3: 00000002437c8006 CR4: 00000000001606f0
[  523.890309] Call Trace:
[  523.890315]  ? si_dpm_set_power_state+0xc96/0x1260
[  523.890318]  amdgpu_pm_compute_clocks+0x2ab/0x620
[  523.890320]  amdgpu_set_dpm_state+0x88/0x130
[  523.890323]  dev_attr_store+0x13/0x20
[  523.890326]  sysfs_kf_write+0x32/0x40
[  523.890327]  kernfs_fop_write+0x112/0x190
[  523.890330]  __vfs_write+0x32/0x150
[  523.890333]  ? __this_cpu_preempt_check+0x13/0x20
[  523.890335]  vfs_write+0xaf/0x180
[  523.890337]  SyS_write+0x50/0xc0
[  523.890339]  do_syscall_64+0x6f/0x1c0
[  523.890342]  entry_SYSCALL64_slow_path+0x25/0x25
[  523.890344] RIP: 0033:0x7f012fe331e0
[  523.890345] RSP: 002b:00007ffccf672a88 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[  523.890346] RAX: ffffffffffffffda RBX: 0000000000000008 RCX: 00007f012fe331e0
[  523.890347] RDX: 0000000000000008 RSI: 0000000000793130 RDI: 0000000000000001
[  523.890347] RBP: 0000000000793130 R08: 00007f01300f7740 R09: 00007f013077af80
[  523.890348] R10: 0000000000000073 R11: 0000000000000246 R12: 0000000000000008
[  523.890349] R13: 0000000000000001 R14: 00007f01300f65e0 R15: 0000000000000000
[  523.890350] Code:  Bad RIP value.
[  523.890354] RIP:           (null) RSP: ffffc90003e3bca0
[  523.890354] CR2: 0000000000000000
[  523.890356] ---[ end trace a653bce7b1707265 ]---
[
Comment 2 mercuriete 2018-03-03 09:49:01 UTC
Still happening on


uname -a
Linux localhost.localdomain 4.14.23-gentoo #2 SMP PREEMPT Sat Mar 3 10:31:54 CET 2018 x86_64 Intel(R) Core(TM) i7-4720HQ CPU @ 2.60GHz GenuineIntel GNU/Linux


*******************

[   58.536983] BUG: unable to handle kernel NULL pointer dereference at           (null)
[   58.536992] IP:           (null)
[   58.536994] PGD 0 P4D 0 
[   58.536999] Oops: 0010 [#1] PREEMPT SMP PTI
[   58.537001] Modules linked in:
[   58.537006] CPU: 7 PID: 4177 Comm: laptop_mode Not tainted 4.14.23-gentoo #2
[   58.537008] Hardware name: TOSHIBA SATELLITE P50-B-11M/VG20SQ, BIOS 1.50 12/09/2014
[   58.537010] task: ffff880417fab280 task.stack: ffffc90002d48000
[   58.537012] RIP: 0010:          (null)
[   58.537014] RSP: 0018:ffffc90002d4bd08 EFLAGS: 00010293
[   58.537017] RAX: ffffffff822b2f60 RBX: ffff88041c00b984 RCX: 0000000000000001
[   58.537019] RDX: 0000000080000001 RSI: 0000000000000007 RDI: ffff88041c7b0000
[   58.537022] RBP: ffff88041c008000 R08: 0000000000020000 R09: 0000000000018c04
[   58.537023] R10: 0000000000000000 R11: 0000000000000000 R12: ffff88041c7b0000
[   58.537025] R13: ffff88041c008000 R14: 0000000000000003 R15: ffff88041c009988
[   58.537028] FS:  00007f08348f1040(0000) GS:ffff88042fbc0000(0000) knlGS:0000000000000000
[   58.537030] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   58.537032] CR2: 0000000000000000 CR3: 00000003d4d92005 CR4: 00000000001606e0
[   58.537034] Call Trace:
[   58.537043]  ? si_dpm_set_power_state+0xca1/0x12c0
[   58.537048]  ? amdgpu_pm_compute_clocks+0x2a4/0x560
[   58.537051]  ? amdgpu_set_dpm_state+0x8c/0x130
[   58.537056]  ? kernfs_fop_write+0xf6/0x170
[   58.537062]  ? __vfs_write+0x2e/0x150
[   58.537065]  ? handle_mm_fault+0xce/0x1c0
[   58.537069]  ? vfs_write+0xa7/0x160
[   58.537073]  ? SyS_write+0x4d/0xc0
[   58.537077]  ? do_syscall_64+0x76/0x1c0
[   58.537082]  ? entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[   58.537085] Code:  Bad RIP value.
[   58.537092] RIP:           (null) RSP: ffffc90002d4bd08
[   58.537094] CR2: 0000000000000000
[   58.537108] ---[ end trace 8a42a5b7f8846196 ]---
Comment 3 mercuriete 2018-03-03 11:11:07 UTC
Only for give more information:
When using radeon DRM driver

this is output in dmesg

[   82.490643] [drm:si_dpm_set_power_state] *ERROR* invalid pcie lane request: 7


But there are no kernel oops. So maybe there are a condition that is not being checked in amdgpu.
Comment 4 mercuriete 2018-03-03 11:17:14 UTC
Searching in internet I found somebody hitting my same problem.

https://patchwork.kernel.org/patch/8731311/

maybe lane with calculations leads on a kernel oops in amdgpu and a warning in radeonsi.

I will try that patch.
Comment 5 mercuriete 2018-03-04 12:14:27 UTC
The width lanes calculation on radeon is a different bug of amdgpu null dereference.

But at least that patch silence that warning.
Comment 6 Alex Deucher 2018-03-05 21:51:00 UTC
See if you can find out where it's crashing.  See:
https://stackoverflow.com/questions/13468286/how-to-read-understand-analyze-and-debug-a-linux-kernel-panicand
Comment 7 mercuriete 2018-03-07 22:34:03 UTC
Created attachment 137877 [details] [review]
very ugly patch LOL
Comment 8 mercuriete 2018-03-07 22:36:42 UTC
Thanks you for the response.

I created a very ugly workaround that WORKS_FOR_ME

I need to read the link that you provided me to learn how to debug that stack trace.
I think in a few days I'll have more time to spend.

Thank you very much.
Comment 9 mercuriete 2018-03-11 18:51:19 UTC
my investigations ends in this macro:
drivers/gpu/drm/amd/amdgpu/amdgpu.h

#define amdgpu_set_pcie_lanes(adev, l) (adev)->asic_funcs->set_pcie_lanes((adev), (l))


then if you see this file:
drivers/gpu/drm/amd/amdgpu/si.c

static const struct amdgpu_asic_funcs si_asic_funcs =
{
	.read_disabled_bios = &si_read_disabled_bios,
	.read_bios_from_rom = &si_read_bios_from_rom,
	.read_register = &si_read_register,
	.reset = &si_asic_reset,
	.set_vga_state = &si_vga_set_state,
	.get_xclk = &si_get_xclk,
	.set_uvd_clocks = &si_set_uvd_clocks,
	.set_vce_clocks = NULL,
	.get_config_memsize = &si_get_config_memsize,
};

There are no set_pcie_lanes in that struct


so in this file:
drivers/gpu/drm/amd/amdgpu/si_dpm.c

	if (new_lane_width != current_lane_width) {
		amdgpu_set_pcie_lanes(adev, new_lane_width);
		lane_width = amdgpu_get_pcie_lanes(adev);
		si_write_smc_soft_register(adev, SI_SMC_SOFT_REGISTER_non_ulv_pcie_link_width, lane_width);
	}



You are jumping to null.

Please answer me soon to know if i am wrong or if i am right.



PS: in radeon the implementation of that functions is in this file:

drivers/gpu/drm/radeon/radeon_asic.c

static struct radeon_asic si_asic = { 
...
...
	.pm = {
		.misc = &evergreen_pm_misc,
		.prepare = &evergreen_pm_prepare,
		.finish = &evergreen_pm_finish,
		.init_profile = &sumo_pm_init_profile,
		.get_dynpm_state = &r600_pm_get_dynpm_state,
		.get_engine_clock = &radeon_atom_get_engine_clock,
		.set_engine_clock = &radeon_atom_set_engine_clock,
		.get_memory_clock = &radeon_atom_get_memory_clock,
		.set_memory_clock = &radeon_atom_set_memory_clock,
		.get_pcie_lanes = &r600_get_pcie_lanes,
		.set_pcie_lanes = &r600_set_pcie_lanes,
		.set_clock_gating = NULL,
		.set_uvd_clocks = &si_set_uvd_clocks,
		.set_vce_clocks = &si_set_vce_clocks,
		.get_temperature = &si_get_temp,
	},


r600_set_pcie_lanes is in this file:
drivers/gpu/drm/radeon/r600.c
void r600_set_pcie_lanes(struct radeon_device *rdev, int lanes)
{




PS2: So sumarizing the problem is in the macro (adev)->asic_funcs->set_pcie_lanes
that doesn't exists.

Thanks you very much
Comment 10 mercuriete 2018-03-11 20:00:16 UTC
I am planning to send a patch commenting out 

amdgpu_set_pcie_lanes(adev, new_lane_width);
amdgpu_get_pcie_lanes(adev);


not sure if this is a good approach

What about copying r600.c and r600.h and implementing this feature properly?


Right now on Southern Islands "set pci lanes width" and "get pci lanes width" is a feature completely broken.

It would be nice if an amd developer can give me some advice to how to tackle this issue.

Thanks in advance.
Comment 11 mercuriete 2018-03-12 21:19:38 UTC
Created attachment 138045 [details] [review]
fix null dereference
Comment 12 Stefano Cipriani 2018-03-29 23:21:49 UTC
Created attachment 138437 [details] [review]
implement missing functions

I tried to "port" the missing functions from radeon, using constants from amdgpu when I could find them, and removing those I deemed no more relevant. 
Don't know if I've done everything right, though now I can change dpm state freely
Comment 13 Alex Deucher 2018-03-30 18:35:28 UTC
Created attachment 138450 [details] [review]
possible fix

Thanks for looking into this.  Good start!  The attached patch should fix it.
Comment 14 Stefano Cipriani 2018-03-30 22:02:43 UTC
The patch works as expected! However when I set the state to battery I get in dmesg:

[drm:si_dpm_set_power_state [amdgpu]] *ERROR* invalid pcie lane request: 7

and the same message with 15 instead of 7 when I change back to balanced. Trying with radeon I get the same messages. 

I think the patch mentioned in comment 4 could solve the problem for both drivers, but I can't test it right now.

Should I open another bug for this issue?
Comment 15 Stefano Cipriani 2018-03-31 16:38:02 UTC
So I had time to try said patch on amdgpu and I don't get any more errors. However lspci always says:

LnkSta: Speed 8GT/s, Width x8

I think it's a limitation of my laptop, even if I have "Width x16" on "LnkCap". However, forcing a lane width of 4 when it was set to 8, like so:

if (new_lane_width != current_lane_width) {
	if(new_lane_width == 8)
		new_lane_width = 4;

I could finally verify that the width was indeed changing (x8 on balanced, x4 on battery) and that the patch was working.
Comment 16 mercuriete 2018-04-01 17:24:18 UTC
(In reply to Alex Deucher from comment #13)
> Created attachment 138450 [details] [review] [review]
> possible fix
> 
> Thanks for looking into this.  Good start!  The attached patch should fix it.

Hi Alex,
I tested your patch from comment 13

It indeed fix the null dereference but....
you ported a bug from radeonsi....

partial output from dmesg

[   40.620248] [drm:si_dpm_set_power_state] *ERROR* invalid pcie lane request: 7
[   45.921237] [drm:si_dpm_set_power_state] *ERROR* invalid pcie lane request: 15


Apart from that this patch is:

Tested-by: Abel Garcia Dorta <mercuriete@yahoo.es>
Comment 17 mercuriete 2018-04-01 17:57:45 UTC
With the patch from comment 4 ported to amdgpu...

https://patchwork.kernel.org/patch/8731311/

the warning is silenced but i have the same behaviour as Stefano Cipriani

I will attach lspci of battery and balanced but i have the same output:


partial output of lspci:

           LnkCap: Port #0, Speed 8GT/s, Width x16, ASPM L0s L1, Exit Latency L0s <64ns, L1 <1us
                        ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
                LnkCtl: ASPM L0s L1 Enabled; RCB 64 bytes Disabled- CommClk+
                        ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
                LnkSta: Speed 8GT/s, Width x8, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-



Anyway, thank you very much Alex and Stefano. You make my day.

PS: From my point of view the issue of dereference is fixed and this issue can be closed when pushed to git.

PS2: Alex, this patch can be merged with linux LTS 4.14?
Your patch was tested on:
uname -a
Linux localhost.localdomain 4.14.27-gentoo #2 SMP PREEMPT Sun Apr 1 19:43:12 CEST 2018 x86_64 Intel(R) Core(TM) i7-4720HQ CPU @ 2.60GHz GenuineIntel GNU/Linux


Thanks you very much guys!!!
Comment 18 mercuriete 2018-04-01 17:59:21 UTC
Created attachment 138479 [details]
lspci battery
Comment 19 mercuriete 2018-04-01 17:59:46 UTC
Created attachment 138480 [details]
lspci balanced
Comment 20 mercuriete 2018-04-10 18:57:24 UTC
Status of this patches:

Alex already created a pull request to Dave drm maintainer, so probably this doesnt take too much time to be merged in linux 4.17:

https://lists.freedesktop.org/archives/dri-devel/2018-April/172022.html

Sorry for the noise, this comment is only for keep track of the status.

Thanks for your amazing work Alex! :)
Comment 21 mercuriete 2018-04-14 01:15:20 UTC
Patches already merged to linux master

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=16e205cf42da1f497b10a4a24f563e6c0d574eec

Please feel free to close this bug.

Thanks for fixing this bug. :)


Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.