* Steps to reproduce:
    for i in {0..300}; do (glxgears &); done
  (Note that 100 instances might be enough instead of 300.)
* Actual result: ring 0 stalls, the GPU locks up and resets, and X11 stops and cannot be restarted. The only way to recover is to reboot.
* Expected result: the fps drops lower and lower as more glxgears instances are started, but there is no GPU lockup, as is the case with an Intel GPU.
* Info: card W600, mesa 18.2, kernel 4.15.0-15-generic, LLVM 7.0.0, xorg 1.20.99.1, xf86-video-ati 18.0.1. (Same result with kernel 4.4.0-130, mesa 12.0.6, llvm 3.8.0, DRM 2.43.0.)

I was playing with the apitrace from https://bugs.freedesktop.org/show_bug.cgi?id=87278#c31 and decided to throw dozens of glxgears instances at it to see what would happen.
Created attachment 141084 [details]
simple.c

Minimal test to reproduce the issue by just drawing 2 lines.

Build: gcc -Wall simple.c -o simple $(pkg-config --cflags --libs gl x11)
Run: for i in {0..300}; do (./simple &); done

Also happens with R600_DEBUG=nodma,nowc,nodcc
Created attachment 141202 [details] simple.c Minimized the repro test even more using just a pixmap (no window) and 1 glVertex (GL_POINTS).
Created attachment 141203 [details] cs_dump_user_space.txt
Created attachment 141204 [details] cs_dum_kernel_space.txt Packet0 not allowed!.
Extract of the 2 attached cs dumps:

User space, so before the ioctl radeon_cs_ioctl:
0x00000290 0x00000000 0xC0016900 0x000002A1

Kernel space, so in radeon_cs_ioctl:
0x00000290 0x0000000b 0x00000000 0x000002a1

So for some reason 0x00000000C0016900 gets overwritten by 0x0000000b00000000. Note that it always gets overwritten with this same value, and this value also appears in the other packet0 bug report: https://bugs.freedesktop.org/show_bug.cgi?id=84500#c7

I have started to narrow down the issue, and it looks like it happens in "radeon_cs_parser_init" in kernel/drivers/gpu/drm/radeon, as the overwriting is already present just after this function returns. But it is not easy to debug further because this function is quite difficult to understand, so any input would be appreciated, thx!

Does kernel space make a copy of the cs chunks, or does it just keep a pointer to them, as I see "user_ptr"?

Also note that the issue does not happen with amdgpu, so one possibility is that "amdgpu_cs_parser_init" is more robust.
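One way to read the difference between the two dumps is to decode the affected dwords as PM4 packet headers. The sketch below assumes the header layout used by the radeon kernel driver (packet type in bits 31:30, count in bits 29:16, type-3 opcode in bits 15:8); the helper names are hypothetical and not taken from the driver.

```c
#include <stdint.h>

/* PM4 header fields (assumed layout, see the note above):
 * bits 31:30 = packet type, bits 29:16 = count,
 * bits 15:8  = opcode (meaningful for type-3 packets only). */
static inline unsigned pm4_type(uint32_t header)   { return (header >> 30) & 0x3; }
static inline unsigned pm4_count(uint32_t header)  { return (header >> 16) & 0x3FFF; }
static inline unsigned pm4_opcode(uint32_t header) { return (header >> 8) & 0xFF; }
```

Under that assumption, 0xC0016900 from the user space dump decodes as a type-3 header with count 1 and opcode 0x69, whereas the 0x00000000 that occupies the same position in the kernel space dump decodes as a type-0 header. That would be consistent with the "Packet0 not allowed!" message attached to the kernel space dump.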
Created attachment 141263 [details] dmsg output running on wayland
Hello,

I am getting similar issues with regard to fence wait timeouts. However, I have narrowed it down further: it ONLY happens when gnome is running on xorg.

Over the past month or so I have rebuilt my system from the ground up. I am NOT using a distro that holds people's hands with package managers and bloated, useless kernel modules. I use the instructions from linuxfromscratch.org to build the entire system from the latest stable source code.

After I first boot into gnome, with it running on xorg, as soon as I have logged in and click on activities on the gnome menu and select terminal, the little circle starts twirling. After a few seconds the screen flashes and momentarily goes to the grey login background, then flashes to what can only be described as a mini pixel dump, then after a while it flashes back to the login screen and you need to log in again. At this point, if you click on the drop-down list to see the types of login session available, gnome on xorg is missing from the list. If I log in at this stage, going back and activating gnome terminal is successful; however, the dmesg log shows ring stalled errors and the dreaded parser error that has been mentioned here.

If I start gnome on wayland and then click on activities and then on terminal to bring up gnome terminal, the terminal window opens almost immediately (even though the circle keeps twirling for a long time afterwards) and the output of dmesg is free of the ring timeouts.

Running xorg by itself using twm with clock and xterm also produces a clean dmesg log.

Please find the results attached for both boot tests.

By the way, this is on one of the latest versions of the 4.18 kernel series available on kernel.org. The version of Mesa used is mesa-18.1.5.
Created attachment 141264 [details] dmsg output running on xorg
(In reply to Christopher from comment #7)
> After I first boot into gnome, with it running on xorg, as soon as I have
> logged in and click on activities on the gnome menu and select terminal,
> then the little circle starts twirling, and after a few seconds the screen
> flashes, and it momentarily goes to the grey login background, then flashes
> to what can only be described as a mini pixel dump, then after a while it
> flashes back to the login screen again and you need to login again.

You're running into bug 105381, which is unrelated to this report and is fixed in xf86-video-ati Git master.
(In reply to Michel Dänzer from comment #9)
> (In reply to Christopher from comment #7)
> > After I first boot into gnome, with it running on xorg, as soon as I have
> > logged in and click on activities on the gnome menu and select terminal,
> > then the little circle starts twirling, and after a few seconds the screen
> > flashes, and it momentarily goes to the grey login background, then flashes
> > to what can only be described as a mini pixal dump, then after a while it
> > flashes back to the login screen again and you need to login again.
>
> You're running into bug 105381 , unrelated to this report, fixed in
> xf86-video-ati Git master.

Hello Michel,

Thank you for taking the time to respond. After doing a git pull of xf86-video-ati, compiling, installing and re-booting, this has indeed solved the issue for me.

It really is next to impossible with the range of drivers that could have been the source of the error to know where to actually post a bug report. I am not a programmer, just an IT professional with decades of system administration experience, so once again, many thanks for pointing out how to solve my issue, even though I thought it looked similar to this.

Christopher.
I found time to go a bit further. Now I understand this radeon_cs_parser_init function a bit more.

If I comment out the AGP condition here https://cgit.freedesktop.org/~agd5f/linux/tree/drivers/gpu/drm/radeon/radeon_cs.c?h=amd-staging-drm-next#n340 so that kdata is used, then I can verify that kdata contains the same data as user space.

But after the write to parser->ib.ptr here https://cgit.freedesktop.org/~agd5f/linux/tree/drivers/gpu/drm/radeon/radeon_cs.c?h=amd-staging-drm-next#n648 comparing the data in parser->ib.ptr with kdata shows the same difference as pointed out in comment #5.

Could it be an issue with PCIe (though it works with amdgpu; well, in fact amdgpu uses kdata)? Is there any way I can force a commit/flush just after it writes to parser->ib.ptr, as a test, even if it is slower? thx!
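The comparison described above (the copied chunk in kdata against what ends up in parser->ib.ptr) amounts to finding the first dword where two buffers disagree. A minimal user-space sketch of that check follows; it is not kernel code, the function name is hypothetical, and it ignores how parser->ib.ptr is actually mapped in the driver.

```c
#include <stddef.h>
#include <stdint.h>

/* Return the index of the first dword at which the two buffers differ,
 * or ndw if all ndw dwords are identical. */
static size_t first_mismatch(const uint32_t *a, const uint32_t *b, size_t ndw)
{
    for (size_t i = 0; i < ndw; i++) {
        if (a[i] != b[i])
            return i;
    }
    return ndw;
}
```

Applied to the two four-word extracts from comment #5, this reports index 1, the position where 0x00000000 became 0x0000000b.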
(In reply to Julien Isorce from comment #11)
> Could it be an issue with pcie (though is works with admgpu, well in fact it
> uses kdata on amdgpu) ? Is there anyway I can force a commit/flush just
> after it writes to parser->ib.ptr as a test even if it is slower ? thx!

Really unlikely. If we had a hardware problem with PCIe, we would see random bit values flip and not a constant pattern like we do.
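The point about bit flips can be made concrete by counting how many bit positions differ between the value written and the value read back. This is only an illustrative sketch with a hypothetical helper name; it says nothing about where the corruption actually comes from.

```c
#include <stdint.h>

/* Number of bit positions in which the two 64-bit values differ. */
static unsigned bits_changed(uint64_t before, uint64_t after)
{
    uint64_t diff = before ^ after;
    unsigned n = 0;

    while (diff) {
        diff &= diff - 1;   /* clear the lowest set bit */
        n++;
    }
    return n;
}
```

For the pair reported in this thread, 0x00000000C0016900 and 0x0000000b00000000, ten bit positions differ, and the replacement value is the same on every run. That looks more like a whole value being written than like isolated random flips.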
-- GitLab Migration Automatic Message -- This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity. You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/drm/amd/issues/856.