Bug 110142

Summary:

"Oops: Kernel access of bad area sig 7" on Kernel 5.0.0 PPC64LE when loading amdgpu, xorg hangs after being unable to load after OS boots.

Product:

DRI

Reporter:

Peter Easton <JollyRoger>

Component:

DRM/AMDgpu

Assignee:

Default DRI bug account <dri-devel>

Status:

RESOLVED FIXED

QA Contact:

Severity:

critical

Priority:

medium

Version:

unspecified

Hardware:

PowerPC

OS:

Linux (All)

Whiteboard:

Attachments:

Description	Flags
dmesg output, kernel configuration file, lspci, and Xorg.0.log respectively.	none

Description Peter Easton 2019-03-17 02:52:22 UTC

Created attachment 143700
dmesg output, kernel configuration file, lspci, and Xorg.0.log respectively.

Ahoy!

It looks like the kernel hits an "Oops" right as it tries to load amdgpu, on kernels 5.0.0 and later on Linux PPC64 (Little Endian) platforms.

In the attached dmesg it looks like it starts around here: 

[   34.247578] Oops: Kernel access of bad area, sig: 7 [#1]
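
For anyone triaging a similar report, the Oops and the lines that follow it can be pulled out of a saved dmesg file with something like the sketch below. The helper name extract_oops and the default of 40 context lines are arbitrary choices for illustration, not part of any kernel tooling.

```shell
# Hypothetical helper: print the first "Oops:" line found in a saved dmesg
# file, followed by the next LINES lines of context (default 40, arbitrary).
# Usage: extract_oops FILE [LINES]
extract_oops() {
    file=$1
    lines=${2:-40}
    # -m 1 stops at the first match; -A prints trailing context lines.
    grep -m 1 -A "$lines" 'Oops:' "$file"
}
```

Typical use would be to save the log with "dmesg > dmesg.txt" and then run "extract_oops dmesg.txt".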

I noticed this bug after upgrading from kernel 4.20.11 on Gentoo and 4.20.1 on Debian and rebooting: bringing up xfce4 now hangs. On Gentoo this even hangs the shutdown, because / cannot be unmounted, so a hard poweroff is required, even if I try to kill the process that is starting xfce4.

I tried it with both 5.0.0 and 5.0.2 on Gentoo, after upgrading from 4.20.11, and got similar results: when I enter "startxfce4" at the prompt, xorg hangs. 

I'm attaching my dmesg, kernel configuration file, the output of lspci, and the Xorg.0.log, in that order. The Xorg.0.log file is only 48 lines long and hasn't been truncated; that's all that's in it. Currently the only workaround is for me to use an older kernel. I can post the dmesg from 4.20.11 (the last kernel I had that worked) if it's required.

Comment 1 Alex Deucher 2019-03-18 02:33:07 UTC

Can you bisect?

Comment 2 Peter Easton 2019-03-18 02:59:29 UTC

Sure! 

I haven't bisected a kernel with git before, so I'll teach myself how and report back as soon as I figure it out (I'm currently testing 4.20.16, which also seems to work). I apologize for the delay; I will keep you updated!

Comment 3 Michel Dänzer 2019-03-18 10:03:26 UTC

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c713a461459202504050305242cd854bad57837c seems the most likely candidate, it's the only significant change to gmc_v9_0_late_init between 4.20 and 5.0.

I guess the problem is actually in gmc_v9_0_allocate_vm_inv_eng though. Peter, what does

 scripts/faddr2line drivers/gpu/drm/amd/amdgpu/amdgpu.ko gmc_v9_0_late_init+0x114/0x500

say in the kernel build tree (in the state after building the binaries which generated the attached dmesg output)?

Comment 4 Christian König 2019-03-18 10:05:05 UTC

Yeah, that is a known problem.

Give me a moment to submit a fix to the mailing list.

Comment 5 Peter Easton 2019-03-19 03:21:14 UTC

I'm going to try to see if I can narrow things down a bit by finding the last commit that worked before the bug happened and then bisect the kernel there. I'm going to try to compile the kernels, install them, and then try rebooting and starting xfce4 one by one until I find the one that won't start.

I apologize in advance for the sluggishness, there are a lot of commits here and the computer boots very slowly, so this puts a bit of a bottleneck on how many kernels I can test in a given timeframe. I'll try to hurry as best I can and I'll try to keep this thread updated as I make progress on it. Right now I think I'll start at commit af0df68432f65915b2a316aa99eeeb588d4c65a2 since that one works, and I'll start working my way towards 5.0.0 from there to see if I can narrow down the right commit that borked the driver. 
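
A bisect along these lines might be started roughly as sketched below. This assumes a git clone of the mainline kernel tree; the helper name start_bisect is made up, and limiting the bisect to a path is an optional assumption that the regression lives in the driver code.

```shell
# Hypothetical helper: start a bisect between a known-good and a known-bad
# revision, optionally limited to some paths to cut down the number of builds.
# Usage: start_bisect GOOD BAD [PATH...]
start_bisect() {
    good=$1
    bad=$2
    shift 2
    # git bisect start takes the bad revision first, then the good one.
    git bisect start "$bad" "$good" -- "$@"
}
```

With the revisions mentioned here, that would be "start_bisect af0df68432f65915b2a316aa99eeeb588d4c65a2 v5.0 drivers/gpu/drm/amd" run inside the kernel tree. After each build, boot, and xfce4 test, mark the result with "git bisect good" or "git bisect bad", and finish with "git bisect reset" once git names the first bad commit.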

> I guess the problem is actually in gmc_v9_0_allocate_vm_inv_eng though. Peter, what does...[truncated]

Sure. On 5.0.2, the kernel that produced that dmesg output, I entered the command as follows and got this message. This is what the screen looks like:

> captain@morgans-revenge /usr/src/linux-5.0.2-gentoo/scripts $ ./faddr2line ../drivers/gpu/drm/amd/amdgpu/amdgpu.ko
> gmc_v9_0_late_init+0x114/0x500
> gmc_v9_0_late_init+0x114/0x500:
> gmc_v9_0_late_init at gmc_v9_0.c:?
> captain@morgans-revenge /usr/src/linux-5.0.2-gentoo/scripts $

I hope this might be what you are looking for? I wasn't sure what to enter so I looked for the drivers folder and then typed the rest in as it was.
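
The "?" in place of a line number most likely means the module was built without debug info, so faddr2line can name the source file but not the line. A rough way to check, assuming the config used for the build is available as a file; has_debug_info is a hypothetical helper name:

```shell
# Hypothetical helper: succeed if the given kernel config file enables
# CONFIG_DEBUG_INFO, which addr2line-based tools need to report line numbers.
# Usage: has_debug_info CONFIG_FILE
has_debug_info() {
    grep -q '^CONFIG_DEBUG_INFO=y$' "$1"
}
```

For example, "has_debug_info /usr/src/linux-5.0.2-gentoo/.config && echo yes" (the .config location is an assumption). If it is not enabled, rebuilding with it and rerunning the faddr2line command from comment 3 should give a usable line number.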

Comment 6 Peter Easton 2019-03-19 03:24:37 UTC

Whoops, hit return too quickly. 

(In reply to Michel Dänzer from comment #3)
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c713a461459202504050305242cd854bad57837c
> seems the most likely candidate, it's the only significant change to
> gmc_v9_0_late_init between 4.20 and 5.0.

Sure. I'll go and give this one a try first, actually. I'm out of time for the night but I can hop back on it first thing tomorrow.

Comment 7 Michel Dänzer 2019-03-19 08:50:14 UTC

Please try https://patchwork.freedesktop.org/patch/292720/ , it should fix the problem.
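
For reference, applying a patch like this to a kernel source tree can be sketched as below. It assumes the patch has already been saved to a local file (downloading it from Patchwork is left out), that the patch program is installed, and apply_fix is a made-up helper name.

```shell
# Hypothetical helper: apply a patch to a source tree, doing a dry run first
# so a patch that does not apply cleanly leaves the tree untouched.
# Usage: apply_fix SRCDIR PATCHFILE   (PATCHFILE must be an absolute path)
apply_fix() {
    srcdir=$1
    patchfile=$2
    (
        cd "$srcdir" &&
        patch -p1 --dry-run < "$patchfile" >/dev/null &&
        patch -p1 < "$patchfile"
    )
}
```

For instance, "apply_fix /usr/src/linux-5.0.2-gentoo /tmp/fix.patch" (both paths are examples), followed by rebuilding and reinstalling the kernel and modules as usual.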

Comment 8 Peter Easton 2019-03-20 00:42:53 UTC

Great, I'll go try it on 5.0.2 and report back, thanks!

Comment 9 Peter Easton 2019-03-20 02:01:11 UTC

(In reply to Michel Dänzer from comment #7)
> Please try https://patchwork.freedesktop.org/patch/292720/ , it should fix
> the problem.

Splice the mainbrace! It worked like a charm! 

xfce4 started without a hitch this time with the patched kernel. Thank you so much!

What shall we do now? Is there a way to get that patch merged upstream?

Comment 10 Michel Dänzer 2019-03-20 09:47:20 UTC

It's on its way already, but it might take a while for it to land in a 5.0.y release.

Comment 11 Peter Easton 2019-03-22 01:40:46 UTC

Yay, glad to hear! 

I'm going to change it to fixed then, if that's okay with you guys? 

Thank you so much for the help.
