Bug 96836

Summary: Kernel unaligned access at TPC[105d9fb4] nvkm_instobj_wr32+0x14/0x20 [nouveau]
Product: xorg
Component: Driver/nouveau
Status: RESOLVED MOVED
Severity: normal
Priority: medium
Version: unspecified
Hardware: SPARC
OS: Linux (All)
Reporter: Kieron Gillespie <ciaran.gillespie>
Assignee: Nouveau Project <nouveau>
QA Contact: Xorg Project Team <xorg-team>
Attachments:
Description                                   Flags
Message log while nouveau.debug=trace         none
messages forcing BusID in Xorg config         none
Xorg.0.log using xorg.conf.broke              none
xorg.conf.broke                               none
lspci output from the system in question      none
panic_console.out                             none

Description Kieron Gillespie 2016-07-07 02:23:25 UTC
When attempting to use the Nouveau driver with Xorg on my SunBlade 2500 running the latest build of Debian 9.0 Sid (SPARC64), CPU 0 runs up to 100% and eventually the Xorg session shuts down. It appears to be timing out on a GPU lock. A bunch of messages appear in Xorg.0.log indicating fifo: CACHE_ERROR, and the messages log contains several repeating errors:

log_unaligned: 598 callbacks suppressed
Kernel unaligned access at TPC[105d9fb4] nvkm_instobj_wr32+0x14/0x20 [nouveau]

Eventually the Xorg session gives up and times out.

I'll try and attach some more logs but it appears to have something to do with the iowrite32_native macro on SPARC V9 systems.
Comment 1 Ilia Mirkin 2016-07-07 02:51:54 UTC
What GPU is this? Is Sparc64 a BE system? Are you using 4K pages? (if not, use 4K pages)
Comment 2 Kieron Gillespie 2016-07-08 23:18:30 UTC
So Sparc64 is a big-endian architecture. Using 4K pages is not possible on Sparc64, as the smallest page size it supports is 8K (which is what my system is currently using).
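
For anyone who wants to confirm these two properties on their own machine, here is a minimal userspace sketch. The function names `system_page_size` and `is_big_endian` are made up for illustration; only `sysconf(_SC_PAGESIZE)` is a real POSIX call.

```c
#include <stdint.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: returns the page size in bytes as reported by the
 * running kernel (for example 8192 on a system using 8K pages). */
static long system_page_size(void)
{
    return sysconf(_SC_PAGESIZE);
}

/* Hypothetical helper: returns 1 on a big-endian CPU, 0 on a little-endian
 * one, by looking at which byte of a known 32-bit value is stored first. */
static int is_big_endian(void)
{
    const uint32_t probe = 0x01020304u;
    uint8_t first;

    memcpy(&first, &probe, 1);
    return first == 0x01;
}
```

Printing both values from a small `main` should be enough to answer the questions from comment 1 without guessing.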

The card is a GeForce FX5200 128MB DDR PCI

The GPU I am currently testing is a bit of a fossil, but I had great success with it on this SPARC system in the past: I managed to get full hardware acceleration working sometime in early 2015.
Comment 3 Ilia Mirkin 2016-07-08 23:34:16 UTC
Hmmm... maybe with nv3x the 4K pages aren't such a hard requirement. Definitely people on PPC64 with 64K pages had trouble with nv4x though.

But, if it worked before, it can work again. Since this isn't exactly *the* most common setup, you're going to have to do a bit more of the work. Try more kernels. Nouveau got a huge rewrite in kernel 4.3, try 4.2 maybe? That rewrite ended up breaking BE briefly, but I fixed it up again and it was working semi-recently on my FX5200 in a G5 (PPC64, also BE).

iowrite32_native is used all over the place to write to the card's MMIO space in one of the BARs (can never remember which).

The specific error seems to indicate that we did a wr32 on an instobj to a non-32-bit-aligned address. This would be very surprising. Please boot with

nouveau.debug=trace

and attach a full log of the result. (It should be large.)
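
To make the "non-32-bit-aligned" part concrete: an offset is 32-bit aligned when its low two bits are zero, and SPARC traps on a 32-bit load or store that does not meet this. A minimal sketch of that check, where `is_aligned32` is a hypothetical helper and not something that exists in nouveau:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper, not part of nouveau: returns true when the offset is
 * a multiple of 4, i.e. safe for a naturally aligned 32-bit access. */
static inline bool is_aligned32(uint64_t offset)
{
    return (offset & 0x3u) == 0;
}
```

A check like this could be dropped in front of a suspect write while debugging to see which caller passes the odd offset.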

Also, please try several kernels, including both pre- and post-4.3 ones.
Comment 4 Kieron Gillespie 2016-07-10 02:14:08 UTC
Created attachment 124978 [details]
Message log while nouveau.debug=trace

So I enabled the debug trace and let it run for some time, roughly 15 minutes, though it looks like it didn't get terribly far. I am going to try to collect more information by having it run for several hours, but I figured I'd upload this in case it's at all useful.
Comment 5 Ilia Mirkin 2016-07-10 02:23:07 UTC
(In reply to Kieron Gillespie from comment #4)
> Created attachment 124978 [details]
> Message log while nouveau.debug=trace
> 
> So I enabled the debug trace and I let it run for sometime, roughly 15
> minutes, though it looks like it didn't get terribly far. I am going to try
> and collected more information by having it run for several hours, but I
> figured I'd upload this in case it's at all useful.

Hm, something bad is going on. It's supposed to work much more gracefully.

First off ... where are all the init messages from nouveau loading?

Do you have a digital screen you can connect? It looks like something keeps trying to get the scanout position but can't (see the error returned by nv04_disp_scanoutpos), which in turn floods the logs.
Comment 6 Ilia Mirkin 2016-07-10 02:26:49 UTC
(In reply to Kieron Gillespie from comment #4)
> Created attachment 124978 [details]
> Message log while nouveau.debug=trace
> 
> So I enabled the debug trace and I let it run for sometime, roughly 15
> minutes, though it looks like it didn't get terribly far. I am going to try
> and collected more information by having it run for several hours, but I
> figured I'd upload this in case it's at all useful.

Also, looks like your Xorg is in a restart loop, perhaps logs from that could be interesting.
Comment 7 Kieron Gillespie 2016-07-10 02:36:10 UTC
So it is actually connected to a display, and I can get a console with nouveau. I'll try to get better Xorg output; I think the driver is still having trouble auto-detecting the device.

Also, I am going to connect one of my serial cables so I can get cleaner output. I think the messages log is missing some of the very early boot messages. Not sure, but I would like to rule it out.
Comment 8 Kieron Gillespie 2016-07-10 02:38:13 UTC
The constant restarting of Xorg is coming from the lightdm service. It's constantly trying over and over again.
Comment 9 Kieron Gillespie 2016-07-10 02:48:22 UTC
Created attachment 124979 [details]
messages forcing BusID in Xorg config

So this time I logged into the box remotely and stopped the lightdm service, then ran "Xorg -config xorg.conf.broke -verbose 6". I'll also attach the Xorg.0.log and the config file.

The Xorg log almost makes it look like it is working, though all I am left with is a blank screen with a single non-blinking cursor in the top left corner of the monitor, almost like it switched to the virtual terminal but didn't actually clear the screen and didn't start to draw anything.
Comment 10 Kieron Gillespie 2016-07-10 02:49:14 UTC
Created attachment 124980 [details]
Xorg.0.log using xorg.conf.broke
Comment 11 Kieron Gillespie 2016-07-10 02:49:30 UTC
Created attachment 124981 [details]
xorg.conf.broke
Comment 12 Kieron Gillespie 2016-07-10 02:51:18 UTC
Created attachment 124982 [details]
lspci output from the system in question
Comment 13 Kieron Gillespie 2016-07-10 02:57:25 UTC
So I tried unplugging the monitor just to see what would happen, welp...

Message from syslogd@celestia at Jul  9 22:55:21 ...
 kernel:[ 2211.347894] Kernel panic - not syncing: Irrecoverable deferred error trap.

Message from syslogd@celestia at Jul  9 22:55:21 ...
 kernel:[ 2211.347894] 

Message from syslogd@celestia at Jul  9 22:55:21 ...
 kernel:[ 2213.461991] Press Stop-A (L1-A) to return to the boot prom

Message from syslogd@celestia at Jul  9 22:55:21 ...
 kernel:[ 2213.534081] ---[ end Kernel panic - not syncing: Irrecoverable deferred error trap.

Message from syslogd@celestia at Jul  9 22:55:21 ...
 kernel:[ 2213.534081] 

Message from syslogd@celestia at Jul  9 22:55:54 ...
 kernel:[ 2246.560543] NMI watchdog: BUG: soft lockup - CPU#1 stuck for 23s! [Xorg:3774]

Well now I know what happens! :P
Comment 14 Ilia Mirkin 2016-07-10 02:58:12 UTC
OK, well let's start small. One source of problems is that we have

drivers/gpu/drm/nouveau/nouveau_bios.h:#define ROM16(x) le16_to_cpu(*(u16 *)&(x))

Which can only work on aligned pointers x, but it gets called with unaligned offsets in nouveau_bios.c

Can you try changing that to

#define ROM16(x) get_unaligned_le16(&(x))

I'm guessing that will help with the first group of unaligned traps.
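
For illustration, here is roughly what an unaligned little-endian 16-bit read amounts to, written as a portable userspace sketch. `rom16_unaligned` is a hypothetical stand-in for the kernel's `get_unaligned_le16()`, not the kernel implementation itself.

```c
#include <stdint.h>

/* Hypothetical userspace stand-in for get_unaligned_le16(): the value is
 * assembled one byte at a time, so no 16-bit load is ever issued and an odd
 * address cannot cause an alignment trap. Byte 0 is the low byte because the
 * VBIOS image is little-endian regardless of the host CPU. */
static inline uint16_t rom16_unaligned(const void *p)
{
    const uint8_t *b = (const uint8_t *)p;

    return (uint16_t)(b[0] | (b[1] << 8));
}
```

By contrast, the original `*(u16 *)&(x)` form asks the CPU for a single 16-bit load, which is only safe when the address is even.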
Comment 15 Ilia Mirkin 2016-07-10 03:02:18 UTC
(In reply to Ilia Mirkin from comment #14)
> OK, well let's start small. One source of problems is that we have
> 
> drivers/gpu/drm/nouveau/nouveau_bios.h:#define ROM16(x) le16_to_cpu(*(u16
> *)&(x))
> 
> Which can only work on aligned pointers x, but it gets called with unaligned
> offsets in nouveau_bios.c
> 
> Can you try changing that to
> 
> #define ROM16(x) get_unaligned_le16(&(x))
> 
> I'm guessing that will help with the first group of unaligned traps.

Oh, and same treatment for ROM32 of course (and ROM64 while you're at it, but that never gets called from what I can tell).
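
In the kernel that would presumably mean switching to `get_unaligned_le32()` and `get_unaligned_le64()`, which are existing helpers. As a userspace sketch of what those reads do (the function names `rom32_unaligned` and `rom64_unaligned` are hypothetical):

```c
#include <stdint.h>

/* Hypothetical stand-in for get_unaligned_le32(): byte-wise little-endian
 * read that is safe at any address. */
static inline uint32_t rom32_unaligned(const void *p)
{
    const uint8_t *b = (const uint8_t *)p;

    return (uint32_t)b[0] |
           ((uint32_t)b[1] << 8) |
           ((uint32_t)b[2] << 16) |
           ((uint32_t)b[3] << 24);
}

/* Hypothetical stand-in for get_unaligned_le64(): built from two 32-bit
 * halves, low half first. */
static inline uint64_t rom64_unaligned(const void *p)
{
    const uint8_t *b = (const uint8_t *)p;

    return (uint64_t)rom32_unaligned(b) |
           ((uint64_t)rom32_unaligned(b + 4) << 32);
}
```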
Comment 16 Kieron Gillespie 2016-07-10 03:08:09 UTC
Created attachment 124983 [details]
panic_console.out

So now that I am logging directly from the serial terminal, I appear to be getting more information.

There are times when the system boots and the nouveau driver itself crashes. I was able to catch it this time.
Comment 17 Kieron Gillespie 2016-07-10 04:07:11 UTC
(In reply to Ilia Mirkin from comment #15)
> (In reply to Ilia Mirkin from comment #14)
> > OK, well let's start small. One source of problems is that we have
> > 
> > drivers/gpu/drm/nouveau/nouveau_bios.h:#define ROM16(x) le16_to_cpu(*(u16
> > *)&(x))
> > 
> > Which can only work on aligned pointers x, but it gets called with unaligned
> > offsets in nouveau_bios.c
> > 
> > Can you try changing that to
> > 
> > #define ROM16(x) get_unaligned_le16(&(x))
> > 
> > I'm guessing that will help with the first group of unaligned traps.
> 
> Oh, and same treatment for ROM32 of course (and ROM64 while you're at it,
> but that never gets called from what I can tell).

Alright I'll give that a shot and let you know, thanks for the help!
Comment 18 Ilia Mirkin 2016-07-27 23:20:43 UTC
I believe these two patches should be relevant to your situation:

https://lists.freedesktop.org/archives/nouveau/2016-July/025683.html
https://lists.freedesktop.org/archives/nouveau/2016-July/025688.html

Whether it resolves anything ... who knows. Should at least get rid of all the unaligned access errors.
Comment 19 Martin Peres 2019-12-04 09:15:04 UTC
-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further in the new bug on our GitLab instance: https://gitlab.freedesktop.org/xorg/driver/xf86-video-nouveau/issues/273
