Bug 107545 - radeon - ring 0 stalled - GPU lockup - SI
Status: RESOLVED MOVED
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/Radeon
Version: XOrg git
Hardware: All Linux (All)
Importance: high normal
Assignee: Default DRI bug account
Reported: 2018-08-10 22:59 UTC by Julien Isorce
Modified: 2019-11-19 09:34 UTC


Attachments
simple.c (1.99 KB, text/plain)
2018-08-14 16:52 UTC, Julien Isorce
simple.c (1.46 KB, text/plain)
2018-08-20 17:24 UTC, Julien Isorce
cs_dump_user_space.txt (11.87 KB, text/plain)
2018-08-20 17:26 UTC, Julien Isorce
cs_dum_kernel_space.txt (22.34 KB, text/plain)
2018-08-20 17:27 UTC, Julien Isorce
dmsg output running on wayland (61.49 KB, text/plain)
2018-08-23 23:03 UTC, Christopher
dmsg output running on xorg (67.18 KB, text/plain)
2018-08-23 23:06 UTC, Christopher

Description Julien Isorce 2018-08-10 22:59:59 UTC
* Steps to reproduce:
for i in {0..300}; do (glxgears &); done  (note that 100 might be enough instead of 300)

* Actual result:
Ring 0 stalls, the GPU locks up and resets, and X11 stops and cannot be restarted. The only way to recover is to reboot.

* Expected result:
The frame rate drops as more glxgears instances are started, but there is no GPU lockup, as is the case with an Intel GPU.

* Info:
card W600
mesa 18.2
kernel 4.15.0-15-generic
LLVM 7.0.0
xorg 1.20.99.1
xf86-video-ati 18.0.1
(same result with kernel 4.4.0-130, mesa 12.0.6, llvm 3.8.0, DRM 2.43.0)

I was playing with the apitrace from
https://bugs.freedesktop.org/show_bug.cgi?id=87278#c31 and decided to throw dozens of glxgears instances at the GPU to see what would happen.
Comment 1 Julien Isorce 2018-08-14 16:52:26 UTC
Created attachment 141084 [details]
simple.c

Minimal test to reproduce the issue by just drawing 2 lines. Run: for i in {0..300}; do (./simple &); done

gcc -Wall simple.c -o simple $(pkg-config --cflags --libs gl x11)

Also happens with R600_DEBUG=nodma,nowc,nodcc
Comment 2 Julien Isorce 2018-08-20 17:24:22 UTC
Created attachment 141202 [details]
simple.c

Minimized the repro test even further: it now uses just a pixmap (no window) and a single glVertex call (GL_POINTS).
Comment 3 Julien Isorce 2018-08-20 17:26:37 UTC
Created attachment 141203 [details]
cs_dump_user_space.txt
Comment 4 Julien Isorce 2018-08-20 17:27:16 UTC
Created attachment 141204 [details]
cs_dum_kernel_space.txt

The kernel reports: "Packet0 not allowed!"
Comment 5 Julien Isorce 2018-08-20 17:42:01 UTC
Extract of the 2 attached cs dumps:

User space, i.e. before the radeon_cs_ioctl ioctl:
0x00000290
0x00000000
0xC0016900
0x000002A1

Kernel space, i.e. inside radeon_cs_ioctl:
0x00000290
0x0000000b
0x00000000
0x000002a1

So for some reason the two dwords 0x00000000 0xC0016900 get overwritten by 0x0000000b 0x00000000.

Note that it always gets overwritten with the value above, and this value also appears in the other packet0 bug report: https://bugs.freedesktop.org/show_bug.cgi?id=84500#c7
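
The dwords above can be decoded by hand. A minimal sketch, assuming the PM4 header layout used by the radeon driver's CP_PACKET_GET_TYPE, CP_PACKET_GET_COUNT and CP_PACKET3_GET_OPCODE macros (type in bits 31:30, count in bits 29:16, type-3 opcode in bits 15:8). The function names pm4_type, pm4_count and pm4_opcode are hypothetical and are not part of the driver:

```c
#include <stdint.h>

/* Assumed PM4 header layout, mirroring the radeon CP_PACKET_* macros. */

/* Packet type: bits 31:30 of the header dword. */
unsigned pm4_type(uint32_t header)
{
    return (header >> 30) & 0x3;
}

/* Count field: bits 29:16 of the header dword. */
unsigned pm4_count(uint32_t header)
{
    return (header >> 16) & 0x3FFF;
}

/* Opcode of a type-3 packet: bits 15:8 of the header dword. */
unsigned pm4_opcode(uint32_t header)
{
    return (header >> 8) & 0xFF;
}
```

With these helpers, 0xC0016900 decodes as a type-3 packet with count 1 and opcode 0x69, while 0x00000000 decodes as type 0. If the parser reaches the overwritten 0x00000000 where it expects a packet header, that would be consistent with the "Packet0 not allowed!" message from comment 4.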

I have started to narrow down the issue, and it looks like it happens in "radeon_cs_parser_init" in kernel/drivers/gpu/drm/radeon, as the overwriting is already present just after this function returns. It is not easy to debug further because this function is quite difficult to understand, so any input would be appreciated, thx!

Does kernel space make a copy of the cs chunks, or does it just keep a pointer to them, as I see "user_ptr"?

Also note that the issue does not happen with amdgpu, so one possibility is that "amdgpu_cs_parser_init" is more robust.
Comment 6 Christopher 2018-08-23 23:03:00 UTC
Created attachment 141263 [details]
dmsg output running on wayland
Comment 7 Christopher 2018-08-23 23:04:11 UTC
Hello,

I am getting similar issues with regard to fence wait timeouts.  However, I have narrowed it down further: it ONLY happens when gnome is running on xorg.

I have over the past month or so rebuilt my system from the ground up.  I am NOT using a distro that holds people's hands with package managers and bloated useless kernel modules.  I use instructions from linuxfromscratch.org to build the entire system from the latest stable source code.

After I first boot into gnome, with it running on xorg, as soon as I have logged in and click on activities on the gnome menu and select terminal, the little circle starts twirling, and after a few seconds the screen flashes, and it momentarily goes to the grey login background, then flashes to what can only be described as a mini pixel dump, then after a while it flashes back to the login screen again and you need to log in again.  At this point, if you click on the drop-down list to see the types of login session available, gnome on xorg is missing from the list.  At this stage I log in, and going back and activating gnome terminal is successful; however, the dmesg log shows ring stalled errors and the dreaded parser error that has been mentioned here.

If I start gnome on wayland and then click on activities and then on terminal to bring up gnome terminal, the terminal window opens almost immediately, even though the circle keeps twirling for a long time afterwards, and the output of dmesg is free of the ring timeouts.

Running xorg by itself using twm with clock and xterm also produces a clean dmesg log.

Please find the results attached for both boot tests.  By the way, this is on one of the latest versions of the 4.18 kernel series available on kernel.org.

The version of Mesa used is: mesa-18.1.5
Comment 8 Christopher 2018-08-23 23:06:49 UTC
Created attachment 141264 [details]
dmsg output running on xorg
Comment 9 Michel Dänzer 2018-08-24 07:30:28 UTC
(In reply to Christopher from comment #7)
> After I first boot into gnome, with it running on xorg, as soon as I have
> logged in and click on activities on the gnome menu and select terminal,
> then the little circle starts twirling, and after a few seconds the screen
> flashes, and it momentarily goes to the grey login background, then flashes
> to what can only be described as a mini pixel dump, then after a while it
> flashes back to the login screen again and you need to login again.

You're running into bug 105381, which is unrelated to this report and is fixed in xf86-video-ati Git master.
Comment 10 Christopher 2018-08-24 09:06:58 UTC
(In reply to Michel Dänzer from comment #9)
> (In reply to Christopher from comment #7)
> > After I first boot into gnome, with it running on xorg, as soon as I have
> > logged in and click on activities on the gnome menu and select terminal,
> > then the little circle starts twirling, and after a few seconds the screen
> > flashes, and it momentarily goes to the grey login background, then flashes
> > to what can only be described as a mini pixel dump, then after a while it
> > flashes back to the login screen again and you need to login again.
> 
> You're running into bug 105381 , unrelated to this report, fixed in
> xf86-video-ati Git master.

Hello Michel,

Thank you for taking the time to respond.  After doing a git pull of xf86-video-ati, compiling, installing and rebooting, this has indeed solved the issue for me.

Given the range of drivers that could have been the source of the error, it really is next to impossible to know where to post a bug report.  I am not a programmer, just an IT professional with decades of system administration experience, so once again, many thanks for pointing out how to solve my issue, even though I thought it looked similar to this one.

Christopher.
Comment 11 Julien Isorce 2018-08-24 23:34:53 UTC
I found time to go a bit further. Now I understand this radeon_cs_parser_init function a bit more.

If I comment out the AGP condition here https://cgit.freedesktop.org/~agd5f/linux/tree/drivers/gpu/drm/radeon/radeon_cs.c?h=amd-staging-drm-next#n340 so that kdata is used, then I can verify that kdata contains the same data as user space.

But after the write to parser->ib.ptr here https://cgit.freedesktop.org/~agd5f/linux/tree/drivers/gpu/drm/radeon/radeon_cs.c?h=amd-staging-drm-next#n648, comparing the data in parser->ib.ptr with kdata shows the same difference as pointed out in comment #5.

Could it be an issue with PCIe (though it works with amdgpu, which in fact uses kdata)? Is there any way I can force a commit/flush just after it writes to parser->ib.ptr as a test, even if it is slower? thx!
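
The comparison described above can be sketched as a small helper. This is a user-space illustration only: the function name first_mismatch is hypothetical, and wiring it up to kdata and parser->ib.ptr inside the kernel is an assumption that is not shown here.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Returns the index of the first dword that differs between the two
 * buffers, or -1 if the first ndw dwords are identical.
 */
long first_mismatch(const uint32_t *expected, const uint32_t *actual,
                    size_t ndw)
{
    for (size_t i = 0; i < ndw; i++) {
        if (expected[i] != actual[i])
            return (long)i;
    }
    return -1;
}
```

Applied to the two four-dword extracts from comment #5, the helper reports index 1, the first of the two overwritten dwords.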
Comment 12 Christian König 2018-08-27 10:38:40 UTC
(In reply to Julien Isorce from comment #11)
> Could it be an issue with PCIe (though it works with amdgpu, which in fact
> uses kdata)? Is there any way I can force a commit/flush just
> after it writes to parser->ib.ptr as a test, even if it is slower? thx!

Really unlikely. If we had a hardware problem with PCIe, we would see random bit flips, not a constant pattern like the one here.
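
One way to make this argument concrete is to XOR each submitted dword against the dword that was read back and count the differing bits. A minimal sketch; the function name flipped_bits is hypothetical:

```c
#include <stdint.h>

/* Number of bit positions in which the two dwords differ. */
unsigned flipped_bits(uint32_t sent, uint32_t seen)
{
    uint32_t diff = sent ^ seen;   /* set bits mark differing positions */
    unsigned n = 0;

    while (diff) {
        diff &= diff - 1;          /* clear the lowest set bit */
        n++;
    }
    return n;
}
```

For the two dwords from comment #5, the XOR masks are 0x0000000b and 0xC0016900, i.e. 3 and 7 differing bits, and according to that comment the overwritten value is the same every time. Random corruption on the link would be expected to produce masks that change from run to run.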
Comment 13 Martin Peres 2019-11-19 09:34:14 UTC
-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further in the new bug via this link to our GitLab instance: https://gitlab.freedesktop.org/drm/amd/issues/856.