| Summary: | [hawaii, radeonsi, clover] Running Piglit cl/program/execute/{,tail-}calls{,-struct,-workitem-id}.cl cause GPU VM error and ring stalled GPU lockup | | |
|---|---|---|---|
| Product: | Mesa | Reporter: | Vedran Miletić <vedran> |
| Component: | Drivers/Gallium/radeonsi | Assignee: | Default DRI bug account <dri-devel> |
| Status: | RESOLVED FIXED | QA Contact: | Default DRI bug account <dri-devel> |
| Severity: | normal | | |
| Priority: | high | CC: | arsenm2, mail, me |
| Version: | git | | |
| Hardware: | All | | |
| OS: | Linux (All) | | |
| Whiteboard: | | | |
| i915 platform: | | i915 features: | |
| Bug Depends on: | | | |
| Bug Blocks: | 99553 | | |
Description
Vedran Miletić
2018-02-15 14:58:13 UTC
Same story with tests/cl/program/execute/calls-struct.cl, tests/cl/program/execute/calls-workitem-id.cl, tests/cl/program/execute/calls.cl and tests/cl/program/execute/tail-calls.cl, while tests/cl/program/execute/call-clobbers-amdgcn.cl gets skipped. All of those tests were added in e408ce1f2bff23121670a8206258c80bb3d9befd.

I've also hit this issue on "Oland PRO [Radeon R7 240/340] (rev 87)" with mesa-18.1.0_rc2, llvm-6.0.0 and kernel 4.16.5. The crash happens at "cl/program/execute/calls-struct.cl" from piglit as well. It happens both from an X session and from a KMS console. The exact crash looks like this:

[ 171.969488] radeon 0000:20:00.0: GPU fault detected: 147 0x06106001
[ 171.969489] radeon 0000:20:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00500030
[ 171.969490] radeon 0000:20:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x10060001
[ 171.969491] VM fault (0x01, vmid 8) at page 5242928, read from CB (96)

Then the radeon driver tries to reset the GPU endlessly. I've tried pcie_gen2=0, msi=0, dpm=0, hard_reset=1, vm_size=16 in various combinations, and nothing seems to help (msi=0 gives a ton of IOMMU errors, BTW). I have also tried amdgpu, which gives a similar crash (it looks like this driver didn't attempt to reset the GPU afterwards):

[ 435.596230] amdgpu 0000:20:00.0: GPU fault detected: 147 0x0c086002
[ 435.596233] amdgpu 0000:20:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00500060
[ 435.596235] amdgpu 0000:20:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x08060002
[ 435.596239] amdgpu 0000:20:00.0: VM fault (0x02, vmid 4) at page 5242976, read from '' (0x00000000) (96)
[ 435.596245] amdgpu 0000:20:00.0: GPU fault detected: 147 0x0c086002
[ 435.596247] amdgpu 0000:20:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00500060
[ 435.596248] amdgpu 0000:20:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x08050002
[ 435.596252] amdgpu 0000:20:00.0: VM fault (0x02, vmid 4) at page 5242976, read from '' (0x00000000) (80)
[ 435.596256] amdgpu 0000:20:00.0: GPU fault detected: 147 0x0c086002
[ 435.596258] amdgpu 0000:20:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00500060
[ 435.596260] amdgpu 0000:20:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x08010002
[ 435.596263] amdgpu 0000:20:00.0: VM fault (0x02, vmid 4) at page 5242976, read from '' (0x00000000) (16)
[ 435.596267] amdgpu 0000:20:00.0: GPU fault detected: 147 0x0c085002
[ 435.596269] amdgpu 0000:20:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00500060
[ 435.596271] amdgpu 0000:20:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x08050002
[ 435.596274] amdgpu 0000:20:00.0: VM fault (0x02, vmid 4) at page 5242976, read from '' (0x00000000) (80)
[ 435.596278] amdgpu 0000:20:00.0: GPU fault detected: 147 0x0c085002

This might (also?) be a kernel bug, since a userspace program should not be able to crash a GPU, regardless of how incorrect a command stream it sends to it.

Still happens on Mesa 18.1.2, LLVM 6.0.1 and kernel 4.17.2. Note that piglit tests aren't the only thing affected by this bug - ImageMagick's OpenCL support also causes a similar GPU fault.
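For context, the affected tests exercise real (non-inlined) function calls inside OpenCL kernels. The following kernel is illustrative only, not taken from the piglit sources: it shows the kind of code shape involved. As discussed later in this bug, LLVM emits the call target as a separate function that needs a code relocation, and if that relocation is left unresolved the GPU jumps to an invalid address, which shows up as the VM faults above.

```c
/* Illustrative only (not the piglit test source): a kernel that makes a
 * real, non-inlined function call. The noinline attribute forces the
 * compiler to emit an actual call to a separate function, which requires
 * a code relocation for the call target. */
__attribute__((noinline))
int add_one(int x)
{
    return x + 1;
}

__kernel void call_test(__global int *out)
{
    int gid = get_global_id(0);
    out[gid] = add_one(gid);
}
```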
I have tested this hardware setup again with Mesa 18.2.3, LLVM 7.0.0, kernel 4.19.0 and piglit from yesterday's git. These tests no longer crash the GPU, but they still fail with various errors:

program@execute@calls-struct:
Error: 6 unsupported relocations
Expecting 1021 (0x3fd) with tolerance 0, but got 1 (0x1)
Error at int[0]
Argument 0: FAIL
Expecting 14 (0xe) with tolerance 0, but got 1 (0x1)
... and so on for the other arguments in this test.

program@execute@calls-workitem-id:
Error: 8 unsupported relocations
Could not wait for kernel to finish: CL_EXEC_STATUS_ERROR_FOR_EVENTS_IN_WAIT_LIST
Unexpected CL error: CL_EXEC_STATUS_ERROR_FOR_EVENTS_IN_WAIT_LIST -14

program@execute@calls:
<inline asm>:1:2: error: instruction not supported on this GPU
v_lshlrev_b64 v[0:1], 44, 1

program@execute@tail-calls:
Expecting 4 (0x4) with tolerance 0, but got 0 (0x0)
Error at int[0]
Argument 0: FAIL
Running kernel test: Tail call with more arguments than caller
Using kernel kernel_call_tailcall_extra_arg
Setting kernel arguments...
Running the kernel...
Validating results...
Expecting 2 (0x2) with tolerance 0, but got 1 (0x1)
Error at int[0]
Argument 0: FAIL
Running kernel test: Tail call with fewer arguments than caller
Using kernel kernel_call_tailcall_fewer_args
Setting kernel arguments...
Running the kernel...
Validating results...
Expecting 4 (0x4) with tolerance 0, but got 0 (0x0)
Error at int[0]
Argument 0: FAIL
... and so on for the other arguments and calls in this test.

This behaviour is expected. Until Mesa properly supports code relocations (or LLVM stops using relocations for internal symbols), it will refuse to run kernels that need relocating. Jumping to an invalid address causes both the page fault and the GPU hang.

There are really two issues at play here:

1) If the LLVM-generated code cannot be run properly, then it should simply be rejected by whatever is actually in charge of submitting it to the GPU (I guess this would be Mesa?). This way an application will know it cannot use OpenCL for computation, at least not with this compute kernel. Instead, it currently looks like many of these tests run but give incorrect results, which is obviously rather bad.

2) Some (previous) Mesa + LLVM versions generate a command stream that crashes the GPU and, as far as I can remember, sometimes even locks up the whole machine. It should not be possible to crash the GPU, regardless of how incorrect a command stream userspace sends to it is - because otherwise it is possible for an unprivileged user with GPU access to DoS the machine.

(In reply to Maciej S. Szmigiero from comment #6)
> There are really two issues at play here:
> 1) If the LLVM-generated code cannot be run properly, then it should simply be
> rejected by whatever is actually in charge of submitting it to the GPU (I guess
> this would be Mesa?).
> This way an application will know it cannot use OpenCL for computation, at least
> not with this compute kernel.
>
> Instead, it currently looks like many of these tests run but give incorrect
> results, which is obviously rather bad.

Do you have an example of this? clover should return an OUT_OF_RESOURCES error when the compute state creation fails (as in the presence of code relocations). It does not change the content of the buffer, so it will return whatever was stored in the buffer on creation.

> 2) Some (previous) Mesa + LLVM versions generate a command stream that
> crashes the GPU and, as far as I can remember, sometimes even locks up the
> whole machine.
>
> It should not be possible to crash the GPU, regardless of how incorrect a
> command stream userspace sends to it is - because otherwise it is possible for
> an unprivileged user with GPU access to DoS the machine.

This is a separate issue. GPU hangs are generally addressed via GPU reset, which should be enabled for gfx8/9 GPUs in recent amdgpu.ko [0]

[0] https://patchwork.freedesktop.org/patch/257994/
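The point about the buffer contents can be illustrated with a minimal host-side sketch (not from the bug report or piglit; the kernel, buffer and helper names are made up, and the surrounding context/queue setup is assumed): when the compute state is rejected and the kernel never runs, reading the destination buffer back simply returns whatever it was initialized with, which matches the "got 1 (0x1)" results in comment 4.

```c
#include <stdio.h>
#include <CL/cl.h>

/* Hedged sketch: ctx, q and krn are assumed to be created elsewhere. */
static void run_and_check(cl_context ctx, cl_command_queue q, cl_kernel krn)
{
    cl_int err;
    cl_int init[4] = {1, 1, 1, 1};   /* whatever the buffer is created with */
    cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR,
                                sizeof(init), init, &err);

    clSetKernelArg(krn, 0, sizeof(buf), &buf);

    size_t gws = 4;
    err = clEnqueueNDRangeKernel(q, krn, 1, NULL, &gws, NULL, 0, NULL, NULL);
    if (err != CL_SUCCESS)
        fprintf(stderr, "enqueue failed: %d\n", err);

    /* The launch failure may only surface here, e.g. as an
     * out-of-resources style error from the queue. */
    err = clFinish(q);
    if (err != CL_SUCCESS)
        fprintf(stderr, "clFinish failed: %d\n", err);

    /* If the kernel never ran, the buffer still holds the init pattern. */
    cl_int out[4];
    clEnqueueReadBuffer(q, buf, CL_TRUE, 0, sizeof(out), out, 0, NULL, NULL);
    printf("%d %d %d %d\n", out[0], out[1], out[2], out[3]);

    clReleaseMemObject(buf);
}
```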
(In reply to Jan Vesely from comment #7)
> (In reply to Maciej S. Szmigiero from comment #6)
> > There are really two issues at play here:
> > 1) If the LLVM-generated code cannot be run properly, then it should simply be
> > rejected by whatever is actually in charge of submitting it to the GPU (I guess
> > this would be Mesa?).
> > This way an application will know it cannot use OpenCL for computation, at least
> > not with this compute kernel.
> >
> > Instead, it currently looks like many of these tests run but give incorrect
> > results, which is obviously rather bad.
>
> Do you have an example of this? clover should return an OUT_OF_RESOURCES error
> when the compute state creation fails (as in the presence of code relocations).
> It does not change the content of the buffer, so it will return whatever was
> stored in the buffer on creation.

Aren't program@execute@calls-struct and program@execute@tail-calls tests from comment 4 examples of this behavior? These seem to run but return wrong results, or am I not parsing the piglit test results correctly?

> > 2) Some (previous) Mesa + LLVM versions generate a command stream that
> > crashes the GPU and, as far as I can remember, sometimes even locks up the
> > whole machine.
> >
> > It should not be possible to crash the GPU, regardless of how incorrect a
> > command stream userspace sends to it is - because otherwise it is possible for
> > an unprivileged user with GPU access to DoS the machine.
>
> This is a separate issue. GPU hangs are generally addressed via GPU reset,
> which should be enabled for gfx8/9 GPUs in recent amdgpu.ko [0]
>
> [0] https://patchwork.freedesktop.org/patch/257994/

This would explain why "amdgpu" seemed to not even attempt to reset the GPU after a crash. However, I think I got at least one lockup when testing this issue half a year ago on the "radeon" driver ("amdgpu" is still marked as experimental for SI parts). If I am able to reproduce it in the future I will report it then.

(In reply to Maciej S. Szmigiero from comment #8)
> Aren't program@execute@calls-struct and program@execute@tail-calls tests
> from comment 4 examples of this behavior?
> These seem to run but return wrong results, or am I not parsing the piglit
> test results correctly?

This is more of a piglit problem. piglit uses a combination of enqueue and clFinish. However, the error happens on kernel launch. Thus:

1.) clEnqueueNDRangeKernel -- success
2.) The driver tries to launch the kernel and fails on relocations
3.) The application (piglit) calls clFinish

Depending on the order of 2.) and 3.), clFinish can either see an empty queue and succeed, or try to wait for kernel execution and fail (see the sketch below). The following series should address that:
https://patchwork.freedesktop.org/series/52857/

> This would explain why "amdgpu" seemed to not even attempt to reset the GPU
> after a crash.
>
> However, I think I got at least one lockup when testing this issue half a
> year ago on the "radeon" driver ("amdgpu" is still marked as experimental for
> SI parts).
> If I am able to reproduce it in the future I will report it then.

Comment #1 shows an example of a successful restart using radeon.ko, so I guess it worked for at least some ASICs. At any rate, restarting the GPU is a separate, kernel problem. Feel free to remove the relocation guard if you want to investigate GPU reset.
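A minimal sketch of the launch-then-wait ordering described above (not the actual piglit code; the helper name is made up and the queue/kernel setup is assumed): the launch failure is reported asynchronously, so it shows up on the event or on clFinish rather than on the enqueue call itself.

```c
#include <stdio.h>
#include <CL/cl.h>

static void launch_and_wait(cl_command_queue q, cl_kernel krn)
{
    size_t gws = 1;
    cl_event ev;

    /* Step 1: the enqueue usually succeeds even if the kernel can never run. */
    cl_int err = clEnqueueNDRangeKernel(q, krn, 1, NULL, &gws, NULL,
                                        0, NULL, &ev);
    if (err != CL_SUCCESS) {
        fprintf(stderr, "enqueue failed immediately: %d\n", err);
        return;
    }

    /* Step 2 happens inside the driver: the launch fails on relocations. */

    /* Step 3: waiting surfaces the failure. If the event's execution status
     * is a negative error code, clWaitForEvents returns
     * CL_EXEC_STATUS_ERROR_FOR_EVENTS_IN_WAIT_LIST (-14), matching the
     * calls-workitem-id output in comment 4. */
    err = clWaitForEvents(1, &ev);
    if (err != CL_SUCCESS) {
        cl_int status = 0;
        clGetEventInfo(ev, CL_EVENT_COMMAND_EXECUTION_STATUS,
                       sizeof(status), &status, NULL);
        fprintf(stderr, "wait failed: %d (event status %d)\n", err, status);
    }

    clReleaseEvent(ev);
}
```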
(In reply to Jan Vesely from comment #9)
> (In reply to Maciej S. Szmigiero from comment #8)
> > Aren't program@execute@calls-struct and program@execute@tail-calls tests
> > from comment 4 examples of this behavior?
> > These seem to run but return wrong results, or am I not parsing the piglit
> > test results correctly?
>
> This is more of a piglit problem. piglit uses a combination of enqueue and
> clFinish. However, the error happens on kernel launch. Thus:
> 1.) clEnqueueNDRangeKernel -- success
> 2.) The driver tries to launch the kernel and fails on relocations
> 3.) The application (piglit) calls clFinish
>
> Depending on the order of 2.) and 3.), clFinish can either see an empty queue
> and succeed, or try to wait for kernel execution and fail.
>
> The following series should address that:
> https://patchwork.freedesktop.org/series/52857/

Thanks for the detailed explanation and the patches.

I can confirm that with them applied, program@execute@calls-struct and program@execute@tail-calls exit with CL_EXEC_STATUS_ERROR_FOR_EVENTS_IN_WAIT_LIST, so I guess they work (or rather, fail) as expected.

Feel free to add a "Tested-by: Maciej S. Szmigiero <mail@maciej.szmigiero.name>" tag if you would like.

(In reply to Maciej S. Szmigiero from comment #10)
> (In reply to Jan Vesely from comment #9)
> > (In reply to Maciej S. Szmigiero from comment #8)
> > > Aren't program@execute@calls-struct and program@execute@tail-calls tests
> > > from comment 4 examples of this behavior?
> > > These seem to run but return wrong results, or am I not parsing the piglit
> > > test results correctly?
> >
> > This is more of a piglit problem. piglit uses a combination of enqueue and
> > clFinish. However, the error happens on kernel launch. Thus:
> > 1.) clEnqueueNDRangeKernel -- success
> > 2.) The driver tries to launch the kernel and fails on relocations
> > 3.) The application (piglit) calls clFinish
> >
> > Depending on the order of 2.) and 3.), clFinish can either see an empty queue
> > and succeed, or try to wait for kernel execution and fail.
> >
> > The following series should address that:
> > https://patchwork.freedesktop.org/series/52857/
>
> Thanks for the detailed explanation and the patches.
>
> I can confirm that with them applied, program@execute@calls-struct and
> program@execute@tail-calls exit with
> CL_EXEC_STATUS_ERROR_FOR_EVENTS_IN_WAIT_LIST, so I guess they work
> (or rather, fail) as expected.
>
> Feel free to add a
> "Tested-by: Maciej S. Szmigiero <mail@maciej.szmigiero.name>" tag if you
> would like.

Thanks. I pushed the piglit patches. I'll keep this bug open until Mesa properly supports relocations.

Good to hear about the tested patch. What is the status on the Mesa side?

Relocations are now handled in the new radeonsi linker (merged in 77b05cc42df29472a7852b90575a19e8991815cd and co.)