Bug 106696

Summary:	repeatable drm:amdgpu_job_timedout with vulkan toy
Product:	Mesa	Reporter:	Dave Gilbert <freedesktop>
Component:	Drivers/Vulkan/radeon	Assignee:	mesa-dev
Status:	RESOLVED NOTOURBUG	QA Contact:	mesa-dev
Severity:	normal
Priority:	medium
Version:	unspecified
Hardware:	Other
OS:	All

Description Dave Gilbert 2018-05-28 16:11:12 UTC

I've been playing with Vulkan and seem to have produced a reliable GPU hang:

symptom:
[  188.168870] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, last signaled seq=1328, last emitted seq=1330
[  188.168880] [drm] IP block:gfx_v8_0 is hung!
[  188.168929] [drm] GPU recovery disabled.

and the display locks.  Can't cleanly reboot.

to reproduce:
1) Get code from: https://github.com/penguin42/vulkanmand/tree/gpu-hang-1
2) build with cargo build  (rust 1.26.0 on fedora 28)
   - I've included the spir-v files I built
3) Run with ./target/debug/vulkanmand
4) Poke the -> (rotate Y axis) button a few times until it hangs.  It normally hangs after 4 or 5 presses for me.

Hardware:
07:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Lexa PRO [Radeon RX 550/550X] (rev c7) (prog-if 00 [VGA controller])
	Subsystem: Sapphire Technology Limited Lexa PRO [Radeon RX 550]

OS: Fedora 28
kernel: 4.16.11-300.fc28.x86_64 (Fedora packaged)
other stuff:
mesa-libGLU-9.0.0-14.fc28.x86_64
mesa-libGL-devel-18.0.2-1.fc28.x86_64
mesa-filesystem-18.0.2-1.fc28.x86_64
mesa-debugsource-18.0.2-1.fc28.x86_64
mesa-dri-drivers-18.0.2-1.fc28.x86_64
mesa-vulkan-drivers-debuginfo-18.0.2-1.fc28.x86_64
mesa-libGL-18.0.2-1.fc28.x86_64
mesa-libglapi-18.0.2-1.fc28.x86_64
mesa-vulkan-drivers-18.0.2-1.fc28.x86_64
mesa-libEGL-devel-18.0.2-1.fc28.x86_64
mesa-libGLU-devel-9.0.0-14.fc28.x86_64
mesa-vdpau-drivers-18.0.2-1.fc28.x86_64
mesa-libGLES-18.0.2-1.fc28.x86_64
mesa-libgbm-18.0.2-1.fc28.x86_64
mesa-libGLES-devel-18.0.2-1.fc28.x86_64
mesa-libOpenCL-18.0.2-1.fc28.x86_64
mesa-debuginfo-18.0.2-1.fc28.x86_64
mesa-libEGL-18.0.2-1.fc28.x86_64
mesa-libxatracker-18.0.2-1.fc28.x86_64
llvm-devel-6.0.0-11.fc28.x86_64
llvm-libs-6.0.0-11.fc28.x86_64
llvm5.0-libs-5.0.1-7.fc28.x86_64
llvm-6.0.0-11.fc28.x86_64

Comment 1 Dave Gilbert 2018-05-28 16:38:37 UTC

I think there's a fair chance that it's actually getting stuck in the while loop in my ray.comp (which may well be a screwup on my part). Even so, taking down the whole machine in a way that can't be cleanly rebooted is a bit of a mess!

adding:
diff --git a/ray.comp b/ray.comp
index e75039f..0611d56 100644
--- a/ray.comp
+++ b/ray.comp
@@ -52,13 +52,14 @@ void main() {
   // sure that none of rx/ry/rz are greater than a pixel
   ray = ray / length(ray);
 
+  int limit = 0;
   float result = 0.0;
   bool hitx = false;
   bool hity = false;
   bool hitz = false;
   bool hitedge = false;
   float lighting = 0.0;
-  while (result <= 255.4 && !hitedge &&
+  while (result <= 255.4 && !hitedge && limit < 256 &&
          !(hitx=hitend(pvp.x, ray.x, vsize.x)) &&
          !(hity=hitend(pvp.y, ray.y, vsize.y)) &&
          !(hitz=hitend(pvp.z, ray.y, vsize.z))) {
@@ -76,6 +77,7 @@ void main() {
       result+= float(value/8.0);
     }
     pvp += ray;
+    limit++;
   }
 
   if (result > 255.0) result=255.0;


seems to stop it triggering.

Comment 2 Dave Gilbert 2018-05-28 16:43:22 UTC

or actually the correct fix to my ray.comp is:
--- a/ray.comp
+++ b/ray.comp
@@ -61,7 +61,7 @@ void main() {
   while (result <= 255.4 && !hitedge &&
          !(hitx=hitend(pvp.x, ray.x, vsize.x)) &&
          !(hity=hitend(pvp.y, ray.y, vsize.y)) &&
-         !(hitz=hitend(pvp.z, ray.y, vsize.z))) {
+         !(hitz=hitend(pvp.z, ray.z, vsize.z))) {

that will stop the ray tracer running off into the distance for ever.
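
A rough CPU-side sketch in Rust of why the ray.y/ray.z mix-up can loop forever, and how an iteration cap like the `limit` counter from comment 1 bounds it. Everything here is an assumption for illustration: `hit_end` is a hypothetical stand-in for the shader's `hitend` (its real definition is not shown in this report), and `march`, the start position and the volume size are invented. The sampling and `result` accumulation from the shader are left out.

```rust
/// Hypothetical stand-in for the shader's `hitend`: true once `pos` has left
/// the range [0, size) in the direction of travel `dir`.
fn hit_end(pos: f32, dir: f32, size: f32) -> bool {
    (dir > 0.0 && pos >= size) || (dir < 0.0 && pos < 0.0)
}

/// Steps `pos` along `ray` until it leaves the volume.
/// Returns Some(steps) on exit, or None if `max_steps` is reached first.
/// With `buggy` set, the z test is given ray[1] (the y component), as in the
/// original shader line.
fn march(
    mut pos: [f32; 3],
    ray: [f32; 3],
    size: [f32; 3],
    buggy: bool,
    max_steps: u32,
) -> Option<u32> {
    let mut steps = 0;
    loop {
        let z_dir = if buggy { ray[1] } else { ray[2] };
        if hit_end(pos[0], ray[0], size[0])
            || hit_end(pos[1], ray[1], size[1])
            || hit_end(pos[2], z_dir, size[2])
        {
            return Some(steps);
        }
        // Iteration cap: without it the buggy variant would never return.
        if steps >= max_steps {
            return None;
        }
        for i in 0..3 {
            pos[i] += ray[i];
        }
        steps += 1;
    }
}

fn main() {
    // A ray travelling straight along z, so x and y can never trigger an exit.
    let start = [4.0, 4.0, 0.0];
    let ray = [0.0, 0.0, 1.0];
    let size = [8.0, 8.0, 8.0];
    println!("fixed: {:?}", march(start, ray, size, false, 256));
    println!("buggy: {:?}", march(start, ray, size, true, 256));
}
```

In this model the fixed variant leaves the volume after 8 steps, while the buggy variant only stops because of the cap. On the GPU there is no such cap unless the shader adds one, which would be consistent with the timeout seen in dmesg.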

Comment 3 Christian König 2018-05-29 06:47:33 UTC

GPU reset and recovery can be enabled using amdgpu.gpu_recovery=1.

Otherwise a shader with an endless loop will just keep running forever.
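
A small sketch in Rust for checking which setting is currently in effect. It assumes the parameter is exposed read-only at /sys/module/amdgpu/parameters/gpu_recovery and that the values mean 1 = enabled, 0 = disabled, -1 = auto. Both are assumptions to check against the running kernel; `describe_gpu_recovery` is a hypothetical helper name.

```rust
use std::fs;

/// Maps the raw sysfs text to a description.
/// The value meanings (1/0/-1) are an assumption; check them for your kernel.
fn describe_gpu_recovery(raw: &str) -> &'static str {
    match raw.trim().parse::<i32>() {
        Ok(1) => "enabled",
        Ok(0) => "disabled",
        Ok(-1) => "auto (driver decides)",
        _ => "unrecognised value",
    }
}

fn main() {
    // Assumed sysfs location of the module parameter.
    let path = "/sys/module/amdgpu/parameters/gpu_recovery";
    match fs::read_to_string(path) {
        Ok(raw) => println!("amdgpu.gpu_recovery: {}", describe_gpu_recovery(&raw)),
        Err(err) => println!("could not read {}: {}", path, err),
    }
}
```

Setting the value itself is done at boot time, by adding amdgpu.gpu_recovery=1 to the kernel command line.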

Comment 4 Dave Gilbert 2018-05-29 09:16:09 UTC

(In reply to Christian König from comment #3)
> GPU reset and recovery can be enabled using amdgpu.gpu_recovery=1.
> 
> Otherwise a shader with an endless loop will just keep running forever.

Hmm; that's a dangerous situation.

1) Recovery doesn't work - I tried it, and it did at least make the machine rebootable, but the GPU state was very broken after the reset attempt.

2) With the default settings, an unprivileged user can put the system into a state where it needs a reboot and then can't reboot cleanly. That is a security issue; couldn't this be triggered by something like WebGL?

Comment 5 Nicolai Hähnle 2018-05-30 14:56:46 UTC

Yes, that is a known issue, but this particular bug report is not the way to cover it, so closing it again.

Comment 6 almos 2018-05-30 22:19:53 UTC

(In reply to Nicolai Hähnle from comment #5)
> Yes, that is a known issue, but this particular bug report is not the way to
> cover it, so closing it again.

What is the way to cover it then? There have been several bug reports about unrecoverable GPU hangs leading to full system hangs, and all of them were swept off the table just like this one.

Comment 7 Dave Gilbert 2018-06-02 20:25:48 UTC

(In reply to Nicolai Hähnle from comment #5)
> Yes, that is a known issue, but this particular bug report is not the way to
> cover it, so closing it again.

OK, fair enough - which one should I be following?

(For reference, I just tried it on an Intel box. According to dmesg it times out in the same way, but it does successfully manage a reset, which I guess is better; it's still fairly grim that the GUI is locked for the timeout period.)

Comment 8 Samuel Pitoiset 2018-06-13 19:12:00 UTC

As Nicolai said, this is a known issue. Definitely unrelated to RADV. Please don't re-open, thanks!
