Bug 108814 - [radeonsi] page fault, umr dump

Summary: [radeonsi] page fault, umr dump

Status:	RESOLVED FIXED

Alias:	None

Product:	Mesa
Classification:	Unclassified
Component:	Drivers/Gallium/radeonsi
Version:	18.3
Hardware:	x86-64 (AMD64) Linux (All)

Importance:	medium normal
Assignee:	Default DRI bug account
QA Contact:	Default DRI bug account

URL:
Whiteboard:
Keywords:

Depends on:
Blocks:

Reported:	2018-11-20 21:39 UTC by Domen
Modified:	2018-12-01 07:24 UTC
CC List:	0 users

See Also:	108261

Attachments
umr dump (277.87 KB, text/plain) 2018-11-21 11:12 UTC, Domen
gallium dump t1 (248.00 KB, text/plain) 2018-11-21 11:13 UTC, Domen
gallium dump t0 (16.00 KB, text/plain) 2018-11-21 11:13 UTC, Domen
trace events amdgpu (520.50 KB, application/vnd.rar) 2018-11-21 11:14 UTC, Domen
another gallium dump (184.00 KB, text/plain) 2018-11-23 17:42 UTC, Domen

Description Domen 2018-11-20 21:39:10 UTC

I tried it on two computers.


Linux (none) 4.20.0-rc1+ #8 SMP PREEMPT Tue Nov 20 00:24:49 CET 2018 x86_64 AMD Athlon PRO 200GE w/ Radeon Vega Graphics AuthenticAMD GNU/Linux

Extended renderer info (GLX_MESA_query_renderer):
    Vendor: X.Org (0x1002)
    Device: AMD RAVEN (DRM 3.27.0, 4.20.0-rc1+, LLVM 7.0.0) (0x15dd)
    Version: 18.2.5

[   80.221112] amdgpu 0000:38:00.0: [gfxhub] VMC page fault (src_id:0 ring:32 vmid:2 pasid:32768, for process roles pid 358 thread roles:cs0 pid 359)
[   80.221116] amdgpu 0000:38:00.0:   in page starting at address 0x0000800000a94000 from 27
[   80.221118] amdgpu 0000:38:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00240C40

The other computer:
Linux amd1.blue.org 4.19.2 #1 SMP PREEMPT Tue Nov 20 21:41:52 CET 2018 x86_64 AMD Ryzen 7 1700X Eight-Core Processor AuthenticAMD GNU/Linux

Extended renderer info (GLX_MESA_query_renderer):
    Vendor: X.Org (0x1002)
    Device: AMD Radeon (TM) RX 460 Graphics (POLARIS11, DRM 3.27.0, 4.19.2, LLVM 7.0.0) (0x67ef)
    Version: 18.2.5

[ 1253.329906] amdgpu 0000:0e:00.0: GPU fault detected: 147 0x09004802 for process roles pid 1119 thread roles:cs0 pid 1120
[ 1253.329910] amdgpu 0000:0e:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x0000EB20
[ 1253.329911] amdgpu 0000:0e:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0C048002
[ 1253.329914] amdgpu 0000:0e:00.0: VM fault (0x02, vmid 6, pasid 32769) at page 60192, read from 'TC0' (0x54433000) (72)

Is this an LLVM issue or a Mesa issue?
I also tried an older kernel (4.16) and got the same result.

Which reports do you need?
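
For reference, the Polaris fault lines above can be decoded by hand. The sketch below is only an illustration, not driver code. It assumes 4 KiB GPU pages, so that the reported page number (60192, which equals the 0x0000EB20 in VM_CONTEXT1_PROTECTION_FAULT_ADDR) becomes a byte address when shifted left by 12, and it assumes the memory client ID is four packed ASCII bytes, which matches 0x54433000 being printed as 'TC0'. The function names are hypothetical.

```c
#include <stdint.h>

/* Assumption: GPU pages are 4 KiB, so page number -> byte address is << 12. */
#define GPU_PAGE_SHIFT 12

/* Convert the page number printed in the "VM fault ... at page N" line
 * into a GPU virtual byte address. */
static uint64_t fault_page_to_address(uint64_t page)
{
	return page << GPU_PAGE_SHIFT;
}

/* Unpack the 32-bit memory client ID into its four bytes, most
 * significant byte first. out must have room for 5 chars. */
static void client_id_to_string(uint32_t id, char out[5])
{
	out[0] = (char)(id >> 24);
	out[1] = (char)((id >> 16) & 0xff);
	out[2] = (char)((id >> 8) & 0xff);
	out[3] = (char)(id & 0xff);
	out[4] = '\0';
}
```

With the values from the log, fault_page_to_address(60192) gives 0x0EB20000, and client_id_to_string(0x54433000, buf) leaves "TC0" in buf.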

Comment 1 Domen 2018-11-21 11:12:28 UTC

Created attachment 142535
umr dump

Comment 2 Domen 2018-11-21 11:13:07 UTC

Created attachment 142536
gallium dump t1

Comment 3 Domen 2018-11-21 11:13:29 UTC

Created attachment 142537
gallium dump t0

Comment 4 Domen 2018-11-21 11:14:16 UTC

Created attachment 142538
trace events amdgpu

Comment 5 Domen 2018-11-21 11:15:29 UTC

Logs attached. Another fault on the RX 460 machine (0000:0e:00.0):
[  332.004841] amdgpu 0000:0e:00.0: GPU fault detected: 147 0x0f800802 for process roles pid 1043 thread roles:cs0 pid 1044
[  332.004844] amdgpu 0000:0e:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x000EA1F0
[  332.004845] amdgpu 0000:0e:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x04008002
[  332.004848] amdgpu 0000:0e:00.0: VM fault (0x02, vmid 2, pasid 32769) at page 958960, read from 'TC2' (0x54433200) (8)

Comment 6 Domen 2018-11-23 17:42:00 UTC

Created attachment 142598
another gallium dump

Another dump. I also tried the proprietary NVIDIA driver, and it works fine there.

Comment 7 Domen 2018-11-25 14:13:55 UTC

It looks like sctx->bindless_descriptors->gpu_address is not accessible by the GPU.
0x02e00000 is not in the buffer list.

c0017600 SET_SH_REG:
0000014d
02e00000         SPI_SHADER_USER_DATA_COMMON_1 <- 0x02e00000

[  174.469016] amdgpu 0000:38:00.0: [gfxhub] VMC page fault (src_id:0 ring:32 vmid:2 pasid:32769, for process roles pid 398 thread roles:cs0 pid 399)
[  174.469021] amdgpu 0000:38:00.0:   in page starting at address 0x0000800002e04000 from 27
[  174.469023] amdgpu 0000:38:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00240C40
[  184.763074] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=583, emitted seq=585
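
The three dwords quoted from the dump can be checked by hand. The sketch below assumes the usual PM4 type-3 header layout (type in bits 31:30, count in bits 29:16, opcode in bits 15:8) and assumes that SET_SH_REG register offsets are dword offsets relative to byte offset 0xB000. Both are recalled from Mesa's sid.h, not taken from this report. It also compares only the low 32 bits of the fault address with the register value, since only the low half appears in the dump. All names are hypothetical.

```c
#include <stdint.h>

/* Fields of a PM4 type-3 packet header (layout is an assumption,
 * see the text above). */
struct pkt3_header {
	unsigned type;   /* bits 31:30, 3 for a type-3 packet */
	unsigned count;  /* bits 29:16, payload dwords minus one */
	unsigned opcode; /* bits 15:8 */
};

static struct pkt3_header decode_pkt3(uint32_t header)
{
	struct pkt3_header h;

	h.type = (header >> 30) & 0x3;
	h.count = (header >> 16) & 0x3fff;
	h.opcode = (header >> 8) & 0xff;
	return h;
}

/* Assumption: SH registers start at byte offset 0xB000 and the packet
 * carries a dword offset relative to that base. */
static uint32_t sh_reg_byte_offset(uint32_t dword_offset)
{
	return 0xB000u + dword_offset * 4u;
}

/* Distance between the low 32 bits of the faulting address and the
 * 32-bit value written to the user data register. */
static uint32_t fault_offset_from_base(uint64_t fault_addr, uint32_t base_lo)
{
	return (uint32_t)fault_addr - base_lo;
}
```

Under these assumptions, 0xc0017600 decodes to type 3, opcode 0x76, count 1 (two payload dwords), and register offset 0x14d maps to byte offset 0xB534. The faulting page at 0x0000800002e04000 lies 0x4000 bytes past the 0x02e00000 written to SPI_SHADER_USER_DATA_COMMON_1.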

Comment 8 Domen 2018-11-27 07:58:37 UTC

This bug occurs when using bindless textures together with a framebuffer that is also resident as a bindless texture.
The fault no longer occurs if I comment out the si_upload_bindless_descriptor function.

	radeon_emit(cs, PKT3(PKT3_WRITE_DATA, 2 + num_dwords, 0));
	radeon_emit(cs, S_370_DST_SEL(V_370_TC_L2) |
		    S_370_WR_CONFIRM(1) |
		    S_370_ENGINE_SEL(V_370_ME));
	radeon_emit(cs, va);
	radeon_emit(cs, va >> 32);
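
To make the packet layout concrete, here is a standalone sketch that packs the same WRITE_DATA packet into a plain array instead of a command stream. It is not the Mesa implementation. The opcode (0x37) and the bit positions of DST_SEL, WR_CONFIRM and ENGINE_SEL are assumptions recalled from Mesa's sid.h, and the trailing payload copy is inferred from the 2 + num_dwords count in the header. build_write_data is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed values, recalled from Mesa's sid.h (not confirmed by this report). */
#define SKETCH_PKT3_WRITE_DATA 0x37u
#define SKETCH_PKT3(op, count) \
	((3u << 30) | (((uint32_t)(count) & 0x3fff) << 16) | (((uint32_t)(op) & 0xff) << 8))
#define SKETCH_DST_SEL_TC_L2 (2u << 8)   /* write through the L2 cache */
#define SKETCH_WR_CONFIRM    (1u << 20)  /* wait for write confirmation */
#define SKETCH_ENGINE_SEL_ME (0u << 30)  /* executed by the ME engine */

/* Pack a WRITE_DATA packet that writes num_dwords dwords from data to
 * GPU virtual address va. Returns the number of dwords written to out. */
static size_t build_write_data(uint32_t *out, size_t cap, uint64_t va,
			       const uint32_t *data, unsigned num_dwords)
{
	size_t n = 0;

	/* header + control + address low/high + payload */
	assert(cap >= 4 + (size_t)num_dwords);

	out[n++] = SKETCH_PKT3(SKETCH_PKT3_WRITE_DATA, 2 + num_dwords);
	out[n++] = SKETCH_DST_SEL_TC_L2 | SKETCH_WR_CONFIRM | SKETCH_ENGINE_SEL_ME;
	out[n++] = (uint32_t)va;         /* low 32 bits of the destination */
	out[n++] = (uint32_t)(va >> 32); /* high 32 bits of the destination */
	for (unsigned i = 0; i < num_dwords; i++)
		out[n++] = data[i];
	return n;
}
```

The point of the sketch is that the destination address goes into the packet verbatim, so if va is not backed by a buffer in the submission's buffer list, the write itself can fault.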
