| Summary: | [hawaii, radeonsi, clover] Running Piglit cl/program/execute/{,tail-}calls{,-struct,-workitem-id}.cl cause GPU VM error and ring stalled GPU lockup | | |
|---|---|---|---|
| Product: | Mesa | Reporter: | Vedran Miletić <vedran> |
| Component: | Drivers/Gallium/radeonsi | Assignee: | Default DRI bug account <dri-devel> |
| Status: | RESOLVED FIXED | QA Contact: | Default DRI bug account <dri-devel> |
| Severity: | normal | | |
| Priority: | high | CC: | arsenm2, mail, me |
| Version: | git | | |
| Hardware: | All | | |
| OS: | Linux (All) | | |
| Whiteboard: | | | |
| i915 platform: | | i915 features: | |
| Bug Depends on: | | | |
| Bug Blocks: | 99553 | | |
Description
Vedran Miletić
2018-02-15 14:58:13 UTC
Same story with tests/cl/program/execute/calls-struct.cl, tests/cl/program/execute/calls-workitem-id.cl, tests/cl/program/execute/calls.cl and tests/cl/program/execute/tail-calls.cl, while tests/cl/program/execute/call-clobbers-amdgcn.cl gets skipped. All of those tests were added in e408ce1f2bff23121670a8206258c80bb3d9befd.

I've also hit this issue on "Oland PRO [Radeon R7 240/340] (rev 87)" with mesa-18.1.0_rc2, llvm-6.0.0 and kernel 4.16.5. The crash happens at "cl/program/execute/calls-struct.cl" from piglit as well. It happens both from an X session and from a KMS console. The exact crash looks like this:

[ 171.969488] radeon 0000:20:00.0: GPU fault detected: 147 0x06106001
[ 171.969489] radeon 0000:20:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00500030
[ 171.969490] radeon 0000:20:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x10060001
[ 171.969491] VM fault (0x01, vmid 8) at page 5242928, read from CB (96)

Then the radeon driver tries to reset the GPU endlessly. I've tried pcie_gen2=0, msi=0, dpm=0, hard_reset=1, vm_size=16 in various combinations, and nothing seems to help (msi=0 gives a ton of IOMMU errors, BTW). I have also tried amdgpu, which gives a similar crash (it looks like this driver didn't attempt to reset the GPU afterwards):

[ 435.596230] amdgpu 0000:20:00.0: GPU fault detected: 147 0x0c086002
[ 435.596233] amdgpu 0000:20:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00500060
[ 435.596235] amdgpu 0000:20:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x08060002
[ 435.596239] amdgpu 0000:20:00.0: VM fault (0x02, vmid 4) at page 5242976, read from '' (0x00000000) (96)
[ 435.596245] amdgpu 0000:20:00.0: GPU fault detected: 147 0x0c086002
[ 435.596247] amdgpu 0000:20:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00500060
[ 435.596248] amdgpu 0000:20:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x08050002
[ 435.596252] amdgpu 0000:20:00.0: VM fault (0x02, vmid 4) at page 5242976, read from '' (0x00000000) (80)
[ 435.596256] amdgpu 0000:20:00.0: GPU fault detected: 147 0x0c086002
[ 435.596258] amdgpu 0000:20:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00500060
[ 435.596260] amdgpu 0000:20:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x08010002
[ 435.596263] amdgpu 0000:20:00.0: VM fault (0x02, vmid 4) at page 5242976, read from '' (0x00000000) (16)
[ 435.596267] amdgpu 0000:20:00.0: GPU fault detected: 147 0x0c085002
[ 435.596269] amdgpu 0000:20:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00500060
[ 435.596271] amdgpu 0000:20:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x08050002
[ 435.596274] amdgpu 0000:20:00.0: VM fault (0x02, vmid 4) at page 5242976, read from '' (0x00000000) (80)
[ 435.596278] amdgpu 0000:20:00.0: GPU fault detected: 147 0x0c085002

This might (also?) be a kernel bug, since a userspace program should not be able to crash a GPU, regardless of how incorrect a command stream it sends to it.

Still happens on Mesa 18.1.2, LLVM 6.0.1 and kernel 4.17.2. Note that piglit tests aren't the only thing affected by this bug - ImageMagick's OpenCL support also causes a similar GPU fault.
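For context, the affected tests exercise real (non-inlined) function calls inside OpenCL kernels. The following kernel is illustrative only, not taken from the piglit sources: it shows the kind of code shape involved. As discussed later in this bug, LLVM emits the call target as a separate function that needs a code relocation, and if that relocation is left unresolved the GPU jumps to an invalid address, which shows up as the VM faults above.

```c
/* Illustrative only (not the piglit test source): a kernel that makes a
 * real, non-inlined function call. The noinline attribute forces the
 * compiler to emit an actual call to a separate function, which requires
 * a code relocation for the call target. */
__attribute__((noinline))
int add_one(int x)
{
    return x + 1;
}

__kernel void call_test(__global int *out)
{
    int gid = get_global_id(0);
    out[gid] = add_one(gid);
}
```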
I have tested this hardware setup again with Mesa 18.2.3, LLVM 7.0.0, kernel 4.19.0 and piglit from yesterday's git. These tests no longer crash the GPU, but they still fail with various errors:

program@execute@calls-struct:
Error: 6 unsupported relocations
Expecting 1021 (0x3fd) with tolerance 0, but got 1 (0x1)
Error at int[0]
Argument 0: FAIL
Expecting 14 (0xe) with tolerance 0, but got 1 (0x1)
... and so on for the other arguments in this test.

program@execute@calls-workitem-id:
Error: 8 unsupported relocations
Could not wait for kernel to finish: CL_EXEC_STATUS_ERROR_FOR_EVENTS_IN_WAIT_LIST
Unexpected CL error: CL_EXEC_STATUS_ERROR_FOR_EVENTS_IN_WAIT_LIST -14

program@execute@calls:
<inline asm>:1:2: error: instruction not supported on this GPU
v_lshlrev_b64 v[0:1], 44, 1

program@execute@tail-calls:
Expecting 4 (0x4) with tolerance 0, but got 0 (0x0)
Error at int[0]
Argument 0: FAIL
Running kernel test: Tail call with more arguments than caller
Using kernel kernel_call_tailcall_extra_arg
Setting kernel arguments...
Running the kernel...
Validating results...
Expecting 2 (0x2) with tolerance 0, but got 1 (0x1)
Error at int[0]
Argument 0: FAIL
Running kernel test: Tail call with fewer arguments than caller
Using kernel kernel_call_tailcall_fewer_args
Setting kernel arguments...
Running the kernel...
Validating results...
Expecting 4 (0x4) with tolerance 0, but got 0 (0x0)
Error at int[0]
Argument 0: FAIL
... and so on for the other arguments and calls in this test.

This behaviour is expected. Until Mesa properly supports code relocations (or LLVM stops using relocations for internal symbols), it will refuse to run kernels that need relocating. Jumping to an invalid address causes both the page fault and the GPU hang.

There are really two issues at play here:

1) If the LLVM-generated code cannot be run properly, then it should simply be rejected by whatever is actually in charge of submitting it to the GPU (I guess this would be Mesa?). This way an application will know it cannot use OpenCL for computation, at least not with this compute kernel. Instead, it currently looks like many of these tests run but give incorrect results, which is obviously rather bad.

2) Some (previous) Mesa + LLVM versions generate a command stream that crashes the GPU and, as far as I can remember, sometimes even locks up the whole machine. It should not be possible to crash the GPU, regardless of how incorrect a command stream userspace sends to it is - because otherwise it is possible for an unprivileged user with GPU access to DoS the machine.

(In reply to Maciej S. Szmigiero from comment #6)
> There are really two issues at play here:
> 1) If the LLVM-generated code cannot be run properly, then it should simply be
> rejected by whatever is actually in charge of submitting it to the GPU (I guess
> this would be Mesa?).
> This way an application will know it cannot use OpenCL for computation, at least
> not with this compute kernel.
>
> Instead, it currently looks like many of these tests run but give incorrect
> results, which is obviously rather bad.

Do you have an example of this? clover should return an OUT_OF_RESOURCES error when the compute state creation fails (as in the presence of code relocations). It does not change the content of the buffer, so it will return whatever was stored in the buffer on creation.

> 2) Some (previous) Mesa + LLVM versions generate a command stream that
> crashes the GPU and, as far as I can remember, sometimes even locks up the
> whole machine.
>
> It should not be possible to crash the GPU, regardless of how incorrect a
> command stream userspace sends to it is - because otherwise it is possible for
> an unprivileged user with GPU access to DoS the machine.

This is a separate issue. GPU hangs are generally addressed via GPU reset, which should be enabled for gfx8/9 GPUs in recent amdgpu.ko [0]

[0] https://patchwork.freedesktop.org/patch/257994/
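The point about the buffer contents can be illustrated with a minimal host-side sketch (not from the bug report or piglit; the kernel, buffer and helper names are made up, and the surrounding context/queue setup is assumed): when the compute state is rejected and the kernel never runs, reading the destination buffer back simply returns whatever it was initialized with, which matches the "got 1 (0x1)" results in comment 4.

```c
#include <stdio.h>
#include <CL/cl.h>

/* Hedged sketch: ctx, q and krn are assumed to be created elsewhere. */
static void run_and_check(cl_context ctx, cl_command_queue q, cl_kernel krn)
{
    cl_int err;
    cl_int init[4] = {1, 1, 1, 1};   /* whatever the buffer is created with */
    cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR,
                                sizeof(init), init, &err);

    clSetKernelArg(krn, 0, sizeof(buf), &buf);

    size_t gws = 4;
    err = clEnqueueNDRangeKernel(q, krn, 1, NULL, &gws, NULL, 0, NULL, NULL);
    if (err != CL_SUCCESS)
        fprintf(stderr, "enqueue failed: %d\n", err);

    /* The launch failure may only surface here, e.g. as an
     * out-of-resources style error from the queue. */
    err = clFinish(q);
    if (err != CL_SUCCESS)
        fprintf(stderr, "clFinish failed: %d\n", err);

    /* If the kernel never ran, the buffer still holds the init pattern. */
    cl_int out[4];
    clEnqueueReadBuffer(q, buf, CL_TRUE, 0, sizeof(out), out, 0, NULL, NULL);
    printf("%d %d %d %d\n", out[0], out[1], out[2], out[3]);

    clReleaseMemObject(buf);
}
```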
(In reply to Jan Vesely from comment #7)
> (In reply to Maciej S. Szmigiero from comment #6)
> > There are really two issues at play here:
> > 1) If the LLVM-generated code cannot be run properly, then it should simply be
> > rejected by whatever is actually in charge of submitting it to the GPU (I guess
> > this would be Mesa?).
> > This way an application will know it cannot use OpenCL for computation, at least
> > not with this compute kernel.
> >
> > Instead, it currently looks like many of these tests run but give incorrect
> > results, which is obviously rather bad.
>
> Do you have an example of this? clover should return an OUT_OF_RESOURCES error
> when the compute state creation fails (as in the presence of code relocations).
> It does not change the content of the buffer, so it will return whatever was
> stored in the buffer on creation.

Aren't program@execute@calls-struct and program@execute@tail-calls tests from comment 4 examples of this behavior? These seem to run but return wrong results, or am I not parsing the piglit test results correctly?

> > 2) Some (previous) Mesa + LLVM versions generate a command stream that
> > crashes the GPU and, as far as I can remember, sometimes even locks up the
> > whole machine.
> >
> > It should not be possible to crash the GPU, regardless of how incorrect a
> > command stream userspace sends to it is - because otherwise it is possible for
> > an unprivileged user with GPU access to DoS the machine.
>
> This is a separate issue. GPU hangs are generally addressed via GPU reset,
> which should be enabled for gfx8/9 GPUs in recent amdgpu.ko [0]
>
> [0] https://patchwork.freedesktop.org/patch/257994/

This would explain why "amdgpu" seemed to not even attempt to reset the GPU after a crash. However, I think I got at least one lockup when testing this issue half a year ago on the "radeon" driver ("amdgpu" is still marked as experimental for SI parts). If I am able to reproduce it in the future I will report it then.

(In reply to Maciej S. Szmigiero from comment #8)
> Aren't program@execute@calls-struct and program@execute@tail-calls tests
> from comment 4 examples of this behavior?
> These seem to run but return wrong results, or am I not parsing the piglit
> test results correctly?

This is more of a piglit problem. piglit uses a combination of enqueue and clFinish. However, the error happens on kernel launch. Thus:

1.) clEnqueueNDRangeKernel -- success
2.) The driver tries to launch the kernel and fails on relocations
3.) The application (piglit) calls clFinish

Depending on the order of 2.) and 3.), clFinish can either see an empty queue and succeed, or try to wait for kernel execution and fail (see the sketch below). The following series should address that:
https://patchwork.freedesktop.org/series/52857/

> This would explain why "amdgpu" seemed to not even attempt to reset the GPU
> after a crash.
>
> However, I think I got at least one lockup when testing this issue half a
> year ago on the "radeon" driver ("amdgpu" is still marked as experimental for
> SI parts).
> If I am able to reproduce it in the future I will report it then.

Comment #1 shows an example of a successful restart using radeon.ko, so I guess it worked for at least some ASICs. At any rate, restarting the GPU is a separate, kernel problem. Feel free to remove the relocation guard if you want to investigate GPU reset.
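A minimal sketch of the launch-then-wait ordering described above (not the actual piglit code; the helper name is made up and the queue/kernel setup is assumed): the launch failure is reported asynchronously, so it shows up on the event or on clFinish rather than on the enqueue call itself.

```c
#include <stdio.h>
#include <CL/cl.h>

static void launch_and_wait(cl_command_queue q, cl_kernel krn)
{
    size_t gws = 1;
    cl_event ev;

    /* Step 1: the enqueue usually succeeds even if the kernel can never run. */
    cl_int err = clEnqueueNDRangeKernel(q, krn, 1, NULL, &gws, NULL,
                                        0, NULL, &ev);
    if (err != CL_SUCCESS) {
        fprintf(stderr, "enqueue failed immediately: %d\n", err);
        return;
    }

    /* Step 2 happens inside the driver: the launch fails on relocations. */

    /* Step 3: waiting surfaces the failure. If the event's execution status
     * is a negative error code, clWaitForEvents returns
     * CL_EXEC_STATUS_ERROR_FOR_EVENTS_IN_WAIT_LIST (-14), matching the
     * calls-workitem-id output in comment 4. */
    err = clWaitForEvents(1, &ev);
    if (err != CL_SUCCESS) {
        cl_int status = 0;
        clGetEventInfo(ev, CL_EVENT_COMMAND_EXECUTION_STATUS,
                       sizeof(status), &status, NULL);
        fprintf(stderr, "wait failed: %d (event status %d)\n", err, status);
    }

    clReleaseEvent(ev);
}
```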
(In reply to Jan Vesely from comment #9)
> (In reply to Maciej S. Szmigiero from comment #8)
> > Aren't program@execute@calls-struct and program@execute@tail-calls tests
> > from comment 4 examples of this behavior?
> > These seem to run but return wrong results, or am I not parsing the piglit
> > test results correctly?
>
> This is more of a piglit problem. piglit uses a combination of enqueue and
> clFinish. However, the error happens on kernel launch. Thus:
> 1.) clEnqueueNDRangeKernel -- success
> 2.) The driver tries to launch the kernel and fails on relocations
> 3.) The application (piglit) calls clFinish
>
> Depending on the order of 2.) and 3.), clFinish can either see an empty queue
> and succeed, or try to wait for kernel execution and fail.
>
> The following series should address that:
> https://patchwork.freedesktop.org/series/52857/

Thanks for the detailed explanation and the patches.

I can confirm that with them applied, program@execute@calls-struct and program@execute@tail-calls exit with CL_EXEC_STATUS_ERROR_FOR_EVENTS_IN_WAIT_LIST, so I guess they work (or rather, fail) as expected.

Feel free to add a "Tested-by: Maciej S. Szmigiero <mail@maciej.szmigiero.name>" tag if you would like.

(In reply to Maciej S. Szmigiero from comment #10)
> (In reply to Jan Vesely from comment #9)
> > (In reply to Maciej S. Szmigiero from comment #8)
> > > Aren't program@execute@calls-struct and program@execute@tail-calls tests
> > > from comment 4 examples of this behavior?
> > > These seem to run but return wrong results, or am I not parsing the piglit
> > > test results correctly?
> >
> > This is more of a piglit problem. piglit uses a combination of enqueue and
> > clFinish. However, the error happens on kernel launch. Thus:
> > 1.) clEnqueueNDRangeKernel -- success
> > 2.) The driver tries to launch the kernel and fails on relocations
> > 3.) The application (piglit) calls clFinish
> >
> > Depending on the order of 2.) and 3.), clFinish can either see an empty queue
> > and succeed, or try to wait for kernel execution and fail.
> >
> > The following series should address that:
> > https://patchwork.freedesktop.org/series/52857/
>
> Thanks for the detailed explanation and the patches.
>
> I can confirm that with them applied, program@execute@calls-struct and
> program@execute@tail-calls exit with
> CL_EXEC_STATUS_ERROR_FOR_EVENTS_IN_WAIT_LIST, so I guess they work
> (or rather, fail) as expected.
>
> Feel free to add a
> "Tested-by: Maciej S. Szmigiero <mail@maciej.szmigiero.name>" tag if you
> would like.

Thanks. I pushed the piglit patches. I'll keep this bug open until Mesa properly supports relocations.

Good to hear about the tested patch. What is the status on the Mesa side?

Relocations are now handled in the new radeonsi linker (merged in 77b05cc42df29472a7852b90575a19e8991815cd and co.)