101325 – UE4Editor crash after pressing "play" with radeon southern island card (7850 HD)

Bug 101325 - UE4Editor crash after pressing "play" with radeon southern island card (7850 HD)

Summary: UE4Editor crash after pressing "play" with radeon southern island card (7850 HD)

Status:	RESOLVED MOVED

Alias:	None

Product:	Mesa
Classification:	Unclassified
Component:	Drivers/Gallium/radeonsi (show other bugs)
Version:	git
Hardware:	x86-64 (AMD64) Linux (All)

Importance:	medium normal
Assignee:	Default DRI bug account
QA Contact:	Default DRI bug account

URL:
Whiteboard:
Keywords:

Depends on:
Blocks:

Reported:	2017-06-07 01:19 UTC by Luke
Modified:	2019-09-25 17:59 UTC (History)
CC List:	1 user (show)

See Also:	100306
i915 platform:
i915 features:

Attachments
~/ddebug_dumps/ file with environment variables R600_DEBUG=check_vm GALLIUM_DDEBUG="pipelined 10000" (17.93 KB, text/plain) 2017-06-07 01:19 UTC, Luke	Details
dmesg from crash using radeon module (78.14 KB, text/plain) 2017-06-07 01:20 UTC, Luke	Details
dmesg from crash using amdgpu module (56.54 KB, text/plain) 2017-06-07 01:20 UTC, Luke	Details
dmesg from crash using env var R600_DEBUG=nowc (91.05 KB, text/plain) 2017-06-10 02:18 UTC, Luke	Details
ddebug_mono_500ms_hang.log (29.62 KB, text/plain) 2017-07-10 11:46 UTC, Julien Isorce	Details
cs.log (44.24 KB, text/plain) 2017-07-10 11:48 UTC, Julien Isorce	Details
Output of "dmesg -w" (full) (463.67 KB, text/plain) 2017-08-07 21:37 UTC, MirceaKitsune	Details
apitrace trace --api gl UE4Editor (975.21 KB, application/octet-stream) 2017-08-26 23:26 UTC, Chrizz_-Oo>	Details
View All

Description Luke 2017-06-07 01:19:18 UTC

Created attachment 131758 [details]
~/ddebug_dumps/  file with environment variables R600_DEBUG=check_vm GALLIUM_DDEBUG="pipelined 10000"

Hi, thanks for your great work on bringing hardware accelerated graphics to linux.

I have a recurring problem with one 3D program (UE4editor) crashing my computer during a particular operation (pressing "play" on any project including blank).

I believe the problem is at the mesa layer.

Brief details:
Radeon 7850HD
It crashes using radeon and amdgpu (on ubuntu 17.04, linux 4.11, padoka ppa).
It does not crash using the fglrx driver (on ubuntu 14.04).
The entire machine hangs and requires a reset.

Following a suggestion from Michael I set the following environment variables for running ue4editor:

R600_DEBUG=check_vm GALLIUM_DDEBUG="pipelined 10000"

Now the program dumps without taking out the whole system. It created a file in ~/ddebug_dumps/ with more information about the GPU hang (attached).


I have also attached the two dmesg logs for two separate trials involving the radeon and amdgpu modules. I blacklisted them both in grub2 and rebooted.

I first manually loaded amdgpu and then after triggering the crash in ue4editor, rebooted and tried again with the manually loaded radeon module.

The radeon one seems to give more information and the screen flashed a few times before freezing. When running amdgpu the x session just hangs and does nothing.

I made sure I was running the latest (git 17-06-05) padoka builds.

I was able to ssh into the machine for a while after the x session had completely frozen (including frozen mouse) until it also disconnected after a few minutes.

This is the part of dmesg running radeon where it appears to go off the rails:

radeon 0000:07:00.0: ring 4 stalled for more than 10024msec
[  +0.000004] radeon 0000:07:00.0: GPU lockup (current fence id 0x0000000000000fd0 last fence id 0x0000000000000fd2 on ring 4)
[  +0.485614] radeon 0000:07:00.0: Saved 724 dwords of commands on ring 0.
[  +0.000126] radeon 0000:07:00.0: GPU softreset: 0x0000004D
[  +0.000001] radeon 0000:07:00.0:   GRBM_STATUS               = 0xA0403028
[  +0.000001] radeon 0000:07:00.0:   GRBM_STATUS_SE0           = 0x08000006
[  +0.000001] radeon 0000:07:00.0:   GRBM_STATUS_SE1           = 0x08000006
[  +0.000001] radeon 0000:07:00.0:   SRBM_STATUS               = 0x200000C0
[  +0.000118] radeon 0000:07:00.0:   SRBM_STATUS2              = 0x00000000
[  +0.000002] radeon 0000:07:00.0:   R_008674_CP_STALLED_STAT1 = 0x00000000
[  +0.000001] radeon 0000:07:00.0:   R_008678_CP_STALLED_STAT2 = 0x00018000
[  +0.000001] radeon 0000:07:00.0:   R_00867C_CP_BUSY_STAT     = 0x00400006
[  +0.000001] radeon 0000:07:00.0:   R_008680_CP_STAT          = 0x84038647
[  +0.000001] radeon 0000:07:00.0:   R_00D034_DMA_STATUS_REG   = 0x44C83106
[  +0.000001] radeon 0000:07:00.0:   R_00D834_DMA_STATUS_REG   = 0x44C83D57
[  +0.000002] radeon 0000:07:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00000000
[  +0.000001] radeon 0000:07:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x00000000
[  +0.465637] radeon 0000:07:00.0: GRBM_SOFT_RESET=0x0000DDFF
[  +0.000052] radeon 0000:07:00.0: SRBM_SOFT_RESET=0x00100100
[  +0.001146] radeon 0000:07:00.0:   GRBM_STATUS               = 0x00003028
[  +0.000002] radeon 0000:07:00.0:   GRBM_STATUS_SE0           = 0x00000006
[  +0.000001] radeon 0000:07:00.0:   GRBM_STATUS_SE1           = 0x00000006
[  +0.000000] radeon 0000:07:00.0:   SRBM_STATUS               = 0x200000C0
[  +0.000111] radeon 0000:07:00.0:   SRBM_STATUS2              = 0x00000000
[  +0.000001] radeon 0000:07:00.0:   R_008674_CP_STALLED_STAT1 = 0x00000000
[  +0.000001] radeon 0000:07:00.0:   R_008678_CP_STALLED_STAT2 = 0x00000000
[  +0.000001] radeon 0000:07:00.0:   R_00867C_CP_BUSY_STAT     = 0x00000000
[  +0.000001] radeon 0000:07:00.0:   R_008680_CP_STAT          = 0x00000000
[  +0.000001] radeon 0000:07:00.0:   R_00D034_DMA_STATUS_REG   = 0x44C83D57
[  +0.000001] radeon 0000:07:00.0:   R_00D834_DMA_STATUS_REG   = 0x44C83D57
[  +0.000247] radeon 0000:07:00.0: GPU reset succeeded, trying to resume
[  +0.025448] [drm] probing gen 2 caps for device 8086:151 = 261a103/e
[  +0.000003] [drm] PCIE gen 3 link speeds already enabled
[  +0.002586] [drm] PCIE GART of 2048M enabled (table at 0x00000000001D6000).
[  +0.000120] radeon 0000:07:00.0: WB enabled
[  +0.000002] radeon 0000:07:00.0: fence driver on ring 0 use gpu addr 0x0000000080000c00 and cpu addr 0xffff94260bdd8c00
[  +0.000001] radeon 0000:07:00.0: fence driver on ring 1 use gpu addr 0x0000000080000c04 and cpu addr 0xffff94260bdd8c04
[  +0.000000] radeon 0000:07:00.0: fence driver on ring 2 use gpu addr 0x0000000080000c08 and cpu addr 0xffff94260bdd8c08
[  +0.000001] radeon 0000:07:00.0: fence driver on ring 3 use gpu addr 0x0000000080000c0c and cpu addr 0xffff94260bdd8c0c
[  +0.000001] radeon 0000:07:00.0: fence driver on ring 4 use gpu addr 0x0000000080000c10 and cpu addr 0xffff94260bdd8c10
[  +0.000314] radeon 0000:07:00.0: fence driver on ring 5 use gpu addr 0x0000000000075a18 and cpu addr 0xffffaa3d89635a18
[  +0.010136] radeon 0000:07:00.0: failed VCE resume (-22).
[  +0.159454] [drm] ring test on 0 succeeded in 4 usecs
[  +0.000004] [drm] ring test on 1 succeeded in 1 usecs
[  +0.000003] [drm] ring test on 2 succeeded in 1 usecs
[  +0.000009] [drm] ring test on 3 succeeded in 6 usecs
[  +0.000007] [drm] ring test on 4 succeeded in 5 usecs
[  +0.175707] [drm] ring test on 5 succeeded in 2 usecs
[  +0.000004] [drm] UVD initialized successfully.
[  +1.041140] [drm:r600_ib_test [radeon]] *ERROR* radeon: fence wait timed out.
[  +0.000018] [drm:radeon_ib_ring_tests [radeon]] *ERROR* radeon: failed testing IB on GFX ring (-110).
[  +0.474934] radeon 0000:07:00.0: GPU softreset: 0x00000048
[  +0.000002] radeon 0000:07:00.0:   GRBM_STATUS               = 0xA0003028
[  +0.000001] radeon 0000:07:00.0:   GRBM_STATUS_SE0           = 0x00000006
[  +0.000001] radeon 0000:07:00.0:   GRBM_STATUS_SE1           = 0x00000006
[  +0.000001] radeon 0000:07:00.0:   SRBM_STATUS               = 0x200000C0
[  +0.000118] radeon 0000:07:00.0:   SRBM_STATUS2              = 0x00000000
[  +0.000001] radeon 0000:07:00.0:   R_008674_CP_STALLED_STAT1 = 0x00000000
[  +0.000001] radeon 0000:07:00.0:   R_008678_CP_STALLED_STAT2 = 0x00010000
[  +0.000001] radeon 0000:07:00.0:   R_00867C_CP_BUSY_STAT     = 0x00400002
[  +0.000002] radeon 0000:07:00.0:   R_008680_CP_STAT          = 0x84010243
[  +0.000001] radeon 0000:07:00.0:   R_00D034_DMA_STATUS_REG   = 0x44C83D57
[  +0.000001] radeon 0000:07:00.0:   R_00D834_DMA_STATUS_REG   = 0x44C83D57
[  +0.000002] radeon 0000:07:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00000000
[  +0.000001] radeon 0000:07:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x00000000
[  +0.465304] radeon 0000:07:00.0: GRBM_SOFT_RESET=0x0000DDFF
[  +0.000056] radeon 0000:07:00.0: SRBM_SOFT_RESET=0x00000100
[  +0.001147] radeon 0000:07:00.0:   GRBM_STATUS               = 0x00003028
[  +0.000001] radeon 0000:07:00.0:   GRBM_STATUS_SE0           = 0x00000006
[  +0.000001] radeon 0000:07:00.0:   GRBM_STATUS_SE1           = 0x00000006
[  +0.000001] radeon 0000:07:00.0:   SRBM_STATUS               = 0x200000C0
[  +0.000110] radeon 0000:07:00.0:   SRBM_STATUS2              = 0x00000000
[  +0.000001] radeon 0000:07:00.0:   R_008674_CP_STALLED_STAT1 = 0x00000000
[  +0.000001] radeon 0000:07:00.0:   R_008678_CP_STALLED_STAT2 = 0x00000000
[  +0.000001] radeon 0000:07:00.0:   R_00867C_CP_BUSY_STAT     = 0x00000000
[  +0.000001] radeon 0000:07:00.0:   R_008680_CP_STAT          = 0x00000000
[  +0.000001] radeon 0000:07:00.0:   R_00D034_DMA_STATUS_REG   = 0x44C83D57
[  +0.000001] radeon 0000:07:00.0:   R_00D834_DMA_STATUS_REG   = 0x44C83D57
[  +0.000238] radeon 0000:07:00.0: GPU reset succeeded, trying to resume

Eventually it starts to do this:
[  +1.153721] [drm:uvd_v1_0_ib_test [radeon]] *ERROR* radeon: fence wait timed out.
[  +0.000018] [drm:radeon_ib_ring_tests [radeon]] *ERROR* radeon: failed testing IB on ring 5 (-110).
[  +0.000008] radeon 0000:07:00.0: scheduling IB failed (-12).
[  +0.000011] [drm:radeon_vce_get_create_msg [radeon]] *ERROR* radeon: failed to schedule ib (-12).
[  +0.000018] [drm:radeon_vce_ib_test [radeon]] *ERROR* radeon: failed to get create msg (-12).
[  +0.000010] [drm:radeon_ib_ring_tests [radeon]] *ERROR* radeon: failed testing IB on ring 6 (-12).
[  +0.000002] radeon 0000:07:00.0: scheduling IB failed (-12).
[  +0.000010] [drm:radeon_vce_get_create_msg [radeon]] *ERROR* radeon: failed to schedule ib (-12).
[  +0.000009] [drm:radeon_vce_ib_test [radeon]] *ERROR* radeon: failed to get create msg (-12).
[  +0.000009] [drm:radeon_ib_ring_tests [radeon]] *ERROR* radeon: failed testing IB on ring 7 (-12).
[  +0.001058] radeon 0000:07:00.0: GPU fault detected: 147 0x00044802
[  +0.000003] radeon 0000:07:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x000FF000
[  +0.000001] radeon 0000:07:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x04048002
[  +0.000001] VM fault (0x02, vmid 2) at page 1044480, read from TC (72)
[ +10.114896] radeon 0000:07:00.0: ring 0 stalled for more than 10116msec
[  +0.000004] radeon 0000:07:00.0: GPU lockup (current fence id 0x0000000000001506 last fence id 0x000000000000151a on ring 0)
[  +0.000022] radeon 0000:07:00.0: ring 4 stalled for more than 10112msec
[  +0.000003] radeon 0000:07:00.0: GPU lockup (current fence id 0x0000000000000fd3 last fence id 0x0000000000000fd7 on ring 4)
[  +0.000029] radeon 0000:07:00.0: ring 3 stalled for more than 10116msec
[  +0.000002] radeon 0000:07:00.0: GPU lockup (current fence id 0x00000000000018b7 last fence id 0x000000000000190c on ring 3)
[  +0.507937] radeon 0000:07:00.0: ring 3 stalled for more than 10624msec
[  +0.000001] radeon 0000:07:00.0: ring 0 stalled for more than 10624msec

Then the log repeats a few hundred times the non-utf-8 character "\00" before it cuts off.



This is the part of dmesg running amdgpu where it appears to go off the rails:

[Jun 6 11:16] INFO: task RenderThread 3:6190 blocked for more than 120 seconds.
[  +0.000006]       Tainted: G           OE   4.11.0-mytest #2
[  +0.000002] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  +0.000002] RenderThread 3  D    0  6190   1612 0x00000000
[  +0.000003] Call Trace:
[  +0.000007]  __schedule+0x3c6/0x8c0
[  +0.000004]  schedule+0x36/0x80
[  +0.000043]  amd_sched_entity_push_job+0xc4/0x110 [amdgpu]
[  +0.000003]  ? wake_atomic_t_function+0x60/0x60
[  +0.000031]  amdgpu_job_submit+0x72/0x90 [amdgpu]
[  +0.000027]  amdgpu_vm_bo_split_mapping+0x51f/0x7c0 [amdgpu]
[  +0.000024]  ? amdgpu_vm_do_copy_ptes+0x90/0x90 [amdgpu]
[  +0.000024]  amdgpu_vm_clear_freed+0x70/0xb0 [amdgpu]
[  +0.000024]  amdgpu_gem_va_ioctl+0x39a/0x3f0 [amdgpu]
[  +0.000014]  drm_ioctl+0x218/0x4b0 [drm]
[  +0.000010]  ? drm_ioctl+0x218/0x4b0 [drm]
[  +0.000023]  ? amdgpu_gem_metadata_ioctl+0x1d0/0x1d0 [amdgpu]
[  +0.000003]  ? kmem_cache_free+0x1b6/0x1e0
[  +0.000020]  amdgpu_drm_ioctl+0x4f/0x90 [amdgpu]
[  +0.000003]  do_vfs_ioctl+0xa3/0x600
[  +0.000002]  ? ____fput+0xe/0x10
[  +0.000003]  ? task_work_run+0x85/0xa0
[  +0.000002]  SyS_ioctl+0x79/0x90
[  +0.000002]  entry_SYSCALL_64_fastpath+0x1e/0xad
[  +0.000002] RIP: 0033:0x7fb15e3cb987
[  +0.000001] RSP: 002b:00007faefab6a778 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[  +0.000002] RAX: ffffffffffffffda RBX: 00007faf340060e0 RCX: 00007fb15e3cb987
[  +0.000001] RDX: 00007faefab6a7c0 RSI: 00000000c0286448 RDI: 0000000000000009
[  +0.000001] RBP: 00007faefab6a810 R08: 000000011c430000 R09: 000000000000000e
[  +0.000001] R10: 0000000000000002 R11: 0000000000000246 R12: 0000000040086409
[  +0.000001] R13: 0000000000000009 R14: 00007faf34ef5700 R15: 00007faf34d61780
[Jun 6 11:17] wlp4s0: disconnect from AP 80:2a:a8:11:23:5e for new auth to 80:2a:a8:11:24:29
[  +0.011326] wlp4s0: authenticate with 80:2a:a8:11:24:29
[  +0.020720] wlp4s0: send auth to 80:2a:a8:11:24:29 (try 1/3)
[  +0.002594] wlp4s0: authenticated
[  +0.002803] wlp4s0: associate with 80:2a:a8:11:24:29 (try 1/3)
[  +0.005088] wlp4s0: RX AssocResp from 80:2a:a8:11:24:29 (capab=0x431 status=0 aid=2)
[  +0.000223] wlp4s0: associated



End of xorg.log.0 (for an earlier run)

[    12.056] (II) systemd-logind: got pause for 13:79
[    32.415] (II) config/udev: removing GPU device /sys/devices/pci0000:00/0000:00:01.0/0000:07:00.0/drm/card0 /dev/dri/card0
[    32.415] (II) config/udev: Adding drm device (/dev/dri/card0)
[    32.415] (II) xfree86: Adding drm device (/dev/dri/card0)
[    35.984] (II) systemd-logind: got resume for 13:81
[    35.984] (EE) FBDEV(0): FBIOPUT_VSCREENINFO: No such device
[    35.984] (EE)
Fatal server error:
[    35.984] (EE) EnterVT failed for screen 0
[    35.984] (EE)
[    35.984] (EE)
Please consult the The X.Org Foundation support
     at http://wiki.x.org
 for help.
[    35.984] (EE) Please also check the log file at "/var/log/Xorg.0.log" for additional information.
[    35.984] (EE)
[    35.984] (EE) FBDEV(0): FBIOPUT_VSCREENINFO: No such device
[    36.074] (EE) Server terminated with error (1). Closing log file.

Comment 1 Luke 2017-06-07 01:20:12 UTC

Created attachment 131759 [details]
dmesg from crash using radeon module

Comment 2 Luke 2017-06-07 01:20:31 UTC

Created attachment 131760 [details]
dmesg from crash using amdgpu module

Comment 3 Julien Isorce 2017-06-07 09:47:38 UTC

This (In reply to Luke from comment #0)
> 
> Then the log repeats a few hundred times the non-utf-8 character "\00"
> before it cuts off.
> 

Hello, just some naive comments:

I also get some "^@" character after reboot sometimes, this happen because the kernel is not functional enough to properly write all messages on disk. One solution is to read these messages just before it freezes (does it freeze btw ?). For example ssh from another machine and type: "dmesg -w" before to start the test. This will read from the ring buffer directly.

Though the real problem here is "radeon 0000:07:00.0: ring 4 stalled for more than 10024msec" so it is not really worth to get more messages in this case.
Do you require to hard reboot after these gpu reset failures ? Alternatively you can try radeon.hard_reset=1 as a kernel param to do a pci reset instead of a soft reset.

Also maybe you could generate an apitrace (apt-get install apitrace, apitrace trace /path/to/application ).  Then if you reproduce the issue with apitrace replay application.trace you can attach it to this bug.

About the amdgpu issue maybe you could try to build this kernel https://cgit.freedesktop.org/~agd5f/linux/log/?h=amd-staging-4.9 (set CONFIG_DRM_AMDGPU_SI=y in the .config) which contains some up to date fixes for amdgpu (and radeon).

Comment 4 Luke 2017-06-08 01:47:16 UTC

Hi Julien, 

I ran the apitrace as suggested (what a cool tool!). The replay file appears to recreate the bug. It is 250mb... is that OK to attach?

I am running the latest 4.11 kernel from kernel.org, is there really much of a difference with https://cgit.freedesktop.org/~agd5f/linux/log/?h=amd-staging-4.9 ?

Luke

Comment 5 Michel Dänzer 2017-06-08 01:52:25 UTC

250 MB is too big to attach here directly, some kind of file sharing service is usually used instead in that case.

Comment 6 Luke 2017-06-08 02:26:14 UTC

Here is a dropbox link to the gzipped output from apitrace trace.

https://www.dropbox.com/s/u224vlazdql774j/UE4Editor.5.trace.tar.gz?dl=0

I feel like perhaps the ddebug_dumps file might be the most useful thing as those environment variables had a material effect on the issue?

Comment 7 Julien Isorce 2017-06-08 21:28:29 UTC

Hi Luke, any chance you can try the apitrace https://bugs.freedesktop.org/show_bug.cgi?id=100712#c7 ?
What worked for me on that bug was to disable write-combining by setting the env var R600_DEBUG=nowc for both Xorg and for the gl application. It affects performance a lot on some situations but this gives some hints were to look at for debugging.

Comment 8 Luke 2017-06-10 02:16:16 UTC

Hi Julien,

DISPLAY=:0 apitrace replay thrash.trace 
with export R600_DEBUG=nowc
runs to completion but doesn't generate any extra entries in dmesg or a file in ddumps

STDOUT/STDERR output is:
9090: message: shader compiler issue 1: LLVM diagnostic (remark): <unknown>:0:0: 3 instructions in function
9090: message: shader compiler issue 2: Shader Stats: SGPRS: 16 VGPRS: 8 Code Size: 64 LDS: 0 Scratch: 0 Max Waves: 10 Spilled SGPRs: 0 Spilled VGPRs: 0 PrivMem VGPRs: 0
9090: message: shader compiler issue 1: LLVM diagnostic (remark): <unknown>:0:0: 4 instructions in function
9090: message: shader compiler issue 2: Shader Stats: SGPRS: 10 VGPRS: 16 Code Size: 40 LDS: 0 Scratch: 0 Max Waves: 10 Spilled SGPRs: 0 Spilled VGPRs: 0 PrivMem VGPRs: 0
9109: message: shader compiler issue 2: Shader Stats: SGPRS: 24 VGPRS: 16 Code Size: 80 LDS: 0 Scratch: 0 Max Waves: 10 Spilled SGPRs: 0 Spilled VGPRs: 0 PrivMem VGPRs: 0
42823: message: shader compiler issue 2: Shader Stats: SGPRS: 24 VGPRS: 8 Code Size: 196 LDS: 0 Scratch: 0 Max Waves: 10 Spilled SGPRs: 0 Spilled VGPRs: 0 PrivMem VGPRs: 0
42823: message: shader compiler issue 2: Shader Stats: SGPRS: 24 VGPRS: 16 Code Size: 180 LDS: 0 Scratch: 0 Max Waves: 10 Spilled SGPRs: 0 Spilled VGPRs: 0 PrivMem VGPRs: 0
Rendered 541 frames in 29.2442 secs, average of 18.4994 fps


However, running UE4Editor with export R600_DEBUG=nowc gives extra trace information in dmesg. The computer still crashes but the screen also flashes garbled noise for a few seconds too (my poor card). I've attached the full dmesg file but here is the main new snippet:

[  +0.002390] ------------[ cut here ]------------
[  +0.000015] WARNING: CPU: 0 PID: 1642 at drivers/gpu/drm/radeon/radeon_object.c:84 radeon_ttm_bo_destroy+0xea/0xf0 [radeon]
[  +0.000001] Modules linked in: pci_stub vboxpci(OE) vboxnetadp(OE) vboxnetflt(OE) vboxdrv(OE) ccm binfmt_misc nls_iso8859_1 joydev input_leds intel_rapl x86_pkg_temp_thermal intel_powerclamp snd_hda_codec_via snd_hda_codec_generic kvm_intel snd_hda_codec_hdmi kvm snd_hda_intel snd_hda_codec irqbypass snd_hda_core snd_hwdep snd_pcm crct10dif_pclmul arc4 crc32_pclmul snd_seq_midi snd_seq_midi_event ghash_clmulni_intel pcbc aesni_intel aes_x86_64 snd_rawmidi crypto_simd glue_helper cryptd ath9k intel_cstate ath9k_common intel_rapl_perf ath9k_hw snd_seq ath mac80211 serio_raw snd_seq_device snd_timer cfg80211 snd lpc_ich soundcore tpm_infineon shpchp mei_me ie31200_edac mei mac_hid edac_core it87 hwmon_vid coretemp parport_pc ppdev lp parport ip_tables x_tables autofs4 amdgpu hid_generic wacom usbhid
[  +0.000024]  hid uas usb_storage mxm_wmi amdkfd amd_iommu_v2 radeon i2c_algo_bit ttm drm_kms_helper psmouse syscopyarea sysfillrect sysimgblt ahci fb_sys_fops libahci atl1c drm wmi video
[  +0.000008] CPU: 0 PID: 1642 Comm: Xorg Tainted: G           OE   4.11.0-mytest #2
[  +0.000000] Hardware name: Gigabyte Technology Co., Ltd. To be filled by O.E.M./Z77X-D3H, BIOS F9 05/31/2012
[  +0.000001] Call Trace:
[  +0.000004]  dump_stack+0x63/0x81
[  +0.000002]  __warn+0xcb/0xf0
[  +0.000002]  warn_slowpath_null+0x1d/0x20
[  +0.000009]  radeon_ttm_bo_destroy+0xea/0xf0 [radeon]
[  +0.000003]  ttm_bo_release_list+0xcf/0x210 [ttm]
[  +0.000002]  ttm_bo_release+0x19b/0x200 [ttm]
[  +0.000002]  ttm_bo_unref+0x2c/0x30 [ttm]
[  +0.000009]  radeon_bo_unref+0x39/0x70 [radeon]
[  +0.000010]  radeon_gem_object_free+0x57/0x70 [radeon]
[  +0.000007]  drm_gem_object_free+0x29/0x70 [drm]
[  +0.000004]  drm_gem_object_unreference_unlocked+0x40/0x70 [drm]
[  +0.000005]  drm_gem_object_handle_unreference_unlocked+0x66/0xb0 [drm]
[  +0.000004]  drm_gem_object_release_handle+0x53/0x90 [drm]
[  +0.000005]  drm_gem_handle_delete+0x5e/0x90 [drm]
[  +0.000004]  drm_gem_close_ioctl+0x20/0x30 [drm]
[  +0.000005]  drm_ioctl+0x218/0x4b0 [drm]
[  +0.000005]  ? drm_gem_handle_create+0x40/0x40 [drm]
[  +0.000002]  ? sock_write_iter+0x85/0xf0
[  +0.000007]  radeon_drm_ioctl+0x4f/0x90 [radeon]
[  +0.000001]  do_vfs_ioctl+0xa3/0x600
[  +0.000002]  ? vfs_write+0x14c/0x1a0
[  +0.000001]  SyS_ioctl+0x79/0x90
[  +0.000002]  entry_SYSCALL_64_fastpath+0x1e/0xad
[  +0.000001] RIP: 0033:0x7f0fd198b987
[  +0.000000] RSP: 002b:00007ffd79addc88 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[  +0.000001] RAX: ffffffffffffffda RBX: 000055bae3b084d0 RCX: 00007f0fd198b987
[  +0.000001] RDX: 00007ffd79addcd8 RSI: 0000000040086409 RDI: 000000000000000e
[  +0.000000] RBP: 00007ffd79addc58 R08: 0000000000000000 R09: 0000000000000000
[  +0.000001] R10: 0000000000000078 R11: 0000000000000246 R12: 0000000040086409
[  +0.000000] R13: 000000000000000e R14: 0000000011b15000 R15: 000055bae3c22d70
[  +0.000001] ---[ end trace d2fc370376efb1a1 ]---

Comment 9 Luke 2017-06-10 02:18:10 UTC

Created attachment 131837 [details]
dmesg from crash using env var R600_DEBUG=nowc

Comment 10 Julien Isorce 2017-06-13 16:30:38 UTC

Hi Luke,

I can reproduce the ring stalled issue using your apitrace 100% of the time.

In apitrace output I could notice:
warning: extension `GL_EXT_gpu_shader4' unsupported in fragment shader
message: api issue 4: FBO incomplete: no attachments and default width or height is 0 [-1]
message: api performance issue 5: using glBufferSubData(buffer 1599, offset 0, size 288) to update a GL_STATIC_DRAW buffer
message: shader compiler issue 2: LLVM diagnostic (remark): <unknown>:0:0: 895 instructions in function
message: shader compiler issue 3: too many identical messages; ignoring

Using GALLIUM_DDEBUG and check_vm I could get some msg like:


[1;32mPS - Constant buffer slot 4 (GPU list):ESC[0m
        ESC[1;33mSQ_BUF_RSRC_WORD0ESC[0m <- 0x000de0e0
        ESC[1;33mSQ_BUF_RSRC_WORD1ESC[0m <- BASE_ADDRESS_HI = 0
                             STRIDE = 7200 (0x1c20)
                             CACHE_SWIZZLE = 0
                             SWIZZLE_ENABLE = 0
        ESC[1;33mSQ_BUF_RSRC_WORD2ESC[0m <- 0x4000c007
        ESC[1;33mSQ_BUF_RSRC_WORD3ESC[0m <- DST_SEL_X = SQ_SEL_X
                             DST_SEL_Y = SQ_SEL_0
                             DST_SEL_Z = SQ_SEL_0
                             DST_SEL_W = SQ_SEL_1
                             NUM_FORMAT = BUF_NUM_FORMAT_UNORM
                             DATA_FORMAT = BUF_DATA_FORMAT_INVALID
                             ELEMENT_SIZE = 2
                             INDEX_STRIDE = 2
                             ADD_TID_ENABLE = 1
                             ATC = 0
                             HASH_ENABLE = 0
                             HEAP = 0
                             MTYPE = 2
                             TYPE = SQ_RSRC_BUF_RSVD_2
                             USER_VM_ENABLE = 0
                             USER_VM_MODE = 1
                             NV = 0
!!!!! This slot was corrupted in GPU memory !!!!!

I am using lastest git mesa, xserver, ddx, llvm and amd-staging-4.9 kernel.

Similar result with amdgpu though it does not detect the gpu hang, all I can see is with radeontop, the graphic pipe is stuck at 100% for ever, but no error in dmesg.

And indeed R600_DEBUG=nowc does not help in that case.

I have the feeling that this is a llvm issue due to exotic shaders but only maintainers will be able to help you at this point. 

Providing this is 100% reproductible on SI chip class (at least) on both radeon and amdgpu drivers, it would be intersting to know if it happens on a more recent graphic card too.

Comment 11 Luke 2017-06-15 00:24:10 UTC

Hi Julien,

Thanks for reproducing the problem! If the issue is related to GL_EXT_gpu_shader4, is it more likely the card not supporting it properly or a bug in the mesa driver?

Comment 12 Michel Dänzer 2017-06-15 01:08:27 UTC

Mesa doesn't support GL_EXT_gpu_shader4, so it looks like there's a bug in UE4(Editor). Not sure the GPU hang is directly related to that though.

Comment 13 Julien Isorce 2017-07-10 11:31:35 UTC

apitrace dump:

94 glShaderSource(shader = 2, count = 2, string = {"#version 150

#define PLATFORM_USES_ES2 0

#define PLATFORM_LINUX 1
", "
#if PLATFORM_USES_ES2
precision highp float;
#else
// #version 120 at the beginning is added in FSlateOpenGLShader::CompileShader()
#extension GL_EXT_gpu_shader4 : enable
#endif

varying vec2 textureCoordinate;

uniform sampler2D SplashTexture;

void main() 
{
        // OpenGL has 0,0 the \"math\" way
        vec2 tc = vec2(textureCoordinate.s, 1.0-textureCoordinate.t);

        gl_FragColor = texture2D(SplashTexture, tc);

}"}, length = NULL)

So it does not look that GL_EXT_gpu_shader4 is really used.
Also I wonder if you can configure UE4Editor to use egl/gles2 and see if it still hangs.

Comment 14 Julien Isorce 2017-07-10 11:46:07 UTC

Created attachment 132579 [details]
ddebug_mono_500ms_hang.log

The gpu hang can be reproduced very quickly when using R600_DEBUG=mono.

If it helps here is output using GALLIUM_DDEBUG="pipelined 500" R600_DEBUG=mono apitrace replay -v:
 
...
20265 glCompileShader(shader = 2041)
20266 glCreateProgram() = 2049
20267 glCreateProgram() = 2050
20268 glCreateProgram() = 2051
...
21287 glCreateProgram() = 3070
21288 glCreateProgram() = 3071
21289 glCreateProgram() = 3072
21290 glAttachShader(program = 3072, shader = 1023)
21291 glAttachShader(program = 3072, shader = 2041)
...
21455 glCreateShader(type = GL_COMPUTE_SHADER) = 3073
21456 glCreateShader(type = GL_COMPUTE_SHADER) = 3074
...
22476 glCreateShader(type = GL_COMPUTE_SHADER) = 4094
22477 glCreateShader(type = GL_COMPUTE_SHADER) = 4095
22478 glCreateShader(type = GL_COMPUTE_SHADER) = 4096

22479 @4 glShaderSource(shader = 4096,
 some shader code
22507 @4 glShaderSource(shader = 4095,
 some shader code (really a lot)
22508 @4 glCompileShader(shader = 4095)
22509 @4 glAttachShader(program = 3070, shader = 4095)
22510 @4 glLinkProgram(program = 3070)
GPU hang detected.
dd: Aborting the process...

Comment 15 Julien Isorce 2017-07-10 11:48:15 UTC

Created attachment 132581 [details]
cs.log

Output when adding 'cs' to R600_DEBUG.

Comment 16 MirceaKitsune 2017-08-07 21:21:35 UTC

I did not expect to find another report on this. I've been struggling with what appears to be the exact same crash for over 6 months. The difference for me is that, this is not caused by Unreal Editor 4; It's triggered exclusively by the KDE desktop, usually when alt-tab switching between windows. My machine is rendered unresponsive approximately every 1 to 4 days.

I found this after googling parts of my own dmesg output, which seem to be identical to what you've pasted in the first comment. I too have a RadeonSI card, in my case a Radeon R7 370 from Gigabyte. My OS is openSUSE Tumbleweed x64.

More information is available on my report, where I've posted my fair share of logs and observations on the problem. We should first of all determine if this is the same issue beyond doubt, then hopefully combining the information we've both obtained can lead to a solution.

https://bugs.freedesktop.org/show_bug.cgi?id=100306

Comment 17 MirceaKitsune 2017-08-07 21:37:05 UTC

Created attachment 133369 [details]
Output of "dmesg -w" (full)

Full output of "dmesg -w", recorded by running "dmesg -w > filename.txt". It includes the dmesg log of the entire session, up until the crash occurs and the monitor turns off. My issue looks very related to yours, so I felt it would make sense to also publish this here.

Comment 18 Michel Dänzer 2017-08-08 01:07:00 UTC

(In reply to MirceaKitsune from comment #16)
> I did not expect to find another report on this. I've been struggling with
> what appears to be the exact same crash for over 6 months. The difference
> for me is that, this is not caused by Unreal Editor 4; [...]

Which means it's most likely not the same problem. Please don't hijack other bug reports.

Comment 19 MirceaKitsune 2017-08-08 01:36:51 UTC

(In reply to Michel Dänzer from comment #18)

Please do a comparison of the dmesg logs, which show the exact same errors and functions being invoked... both this and my issue seem to lead to identical system crashes, potentially the same problem just different triggers. If this is the case my other logs might be able to help; This is a very obscure and specific problem, I'm trying to bring together all info I can to find a solution.

Comment 20 Michel Dänzer 2017-08-08 01:48:08 UTC

(In reply to MirceaKitsune from comment #19)
> Please do a comparison of the dmesg logs, which show the exact same errors
> and functions being invoked...

Those are generic symptoms of a GPU hang, which can be caused by any number of different problems.

> This is a very obscure and specific problem, I'm trying to bring together all
> info I can to find a solution.

You're actually adding noise to this report.

Comment 21 Matt 2017-08-08 23:35:12 UTC

Can confirm this bug on multiple cards / drivers / kernels / Mesa versions starting with Unreal 4.15.

Radeon 7870
Radeon R7 360
Radeon R9 380 (Using amdgpu)

with anything greater than Mesa 17. Mesa 13 works fine. 

Here's a recent dmesg from a box running Fedora 26 / 4.11 / Mesa 17.07 and a Radeon 7870.

[  215.746258] WARNING: CPU: 1 PID: 2239 at drivers/gpu/drm/radeon/radeon_object.c:84 radeon_ttm_bo_destroy+0xf6/0x100 [radeon]
[  215.746258] Modules linked in: ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 xt_conntrack ip_set nfnetlink ebtable_nat ebtable_broute bridge stp llc ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_raw ip6table_security iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack libcrc32c iptable_mangle iptable_raw iptable_security ebtable_filter ebtables ip6table_filter ip6_tables sunrpc intel_rapl snd_hda_codec_realtek snd_hda_codec_generic x86_pkg_temp_thermal snd_hda_codec_hdmi intel_powerclamp snd_hda_intel snd_hda_codec coretemp mei_wdt iTCO_wdt iTCO_vendor_support ppdev kvm snd_hda_core snd_hwdep irqbypass snd_seq intel_cstate snd_seq_device snd_pcm mei_me intel_uncore snd_timer xpad intel_rapl_perf joydev snd ff_memless mei lpc_ich i2c_i801 soundcore
[  215.746278]  parport_pc tpm_infineon ie31200_edac shpchp parport edac_core tpm_tis tpm_tis_core video tpm dm_crypt amdkfd amd_iommu_v2 radeon crct10dif_pclmul i2c_algo_bit crc32_pclmul drm_kms_helper crc32c_intel ttm ghash_clmulni_intel drm r8169 ata_generic pata_acpi uas mii usb_storage
[  215.746286] CPU: 1 PID: 2239 Comm: UdpMess-eceiver Tainted: G        W       4.11.11-300.fc26.x86_64 #1
[  215.746286] Hardware name: Gigabyte Technology Co., Ltd. To be filled by O.E.M./B75M-D3H, BIOS F14 04/17/2013
[  215.746287] Call Trace:
[  215.746288]  dump_stack+0x63/0x84
[  215.746289]  __warn+0xcb/0xf0
[  215.746290]  warn_slowpath_null+0x1d/0x20
[  215.746302]  radeon_ttm_bo_destroy+0xf6/0x100 [radeon]
[  215.746304]  ttm_bo_release_list+0xdb/0x200 [ttm]
[  215.746306]  ? ttm_bo_man_put_node+0x3f/0x50 [ttm]
[  215.746308]  ttm_bo_release+0x190/0x200 [ttm]
[  215.746310]  ttm_bo_unref+0x2c/0x30 [ttm]
[  215.746317]  radeon_bo_unref+0x39/0x70 [radeon]
[  215.746326]  radeon_gem_object_free+0x57/0x70 [radeon]
[  215.746330]  drm_gem_object_free+0x29/0x70 [drm]
[  215.746336]  drm_gem_object_unreference_unlocked+0x3a/0x70 [drm]
[  215.746341]  drm_gem_dmabuf_release+0x17/0x30 [drm]
[  215.746342]  dma_buf_release+0x58/0x1a0
[  215.746343]  __fput+0xdf/0x1e0
[  215.746345]  ____fput+0xe/0x10
[  215.746346]  task_work_run+0x76/0x90
[  215.746347]  do_exit+0x2e9/0xb90
[  215.746348]  do_group_exit+0x47/0xb0
[  215.746349]  get_signal+0x29f/0x640
[  215.746351]  do_signal+0x37/0x6a0
[  215.746352]  ? __check_object_size+0xbb/0x1b3
[  215.746354]  ? _copy_to_user+0x51/0x60
[  215.746355]  ? poll_select_copy_remaining+0xd6/0x140
[  215.746357]  exit_to_usermode_loop+0x76/0xb0
[  215.746358]  syscall_return_slowpath+0xa6/0xb0
[  215.746359]  entry_SYSCALL_64_fastpath+0xa7/0xa9
[  215.746360] RIP: 0033:0x7f0253686cd3
[  215.746360] RSP: 002b:00007f014c084970 EFLAGS: 00000293 ORIG_RAX: 0000000000000017
[  215.746361] RAX: fffffffffffffdfe RBX: 00007f01942e2d80 RCX: 00007f0253686cd3
[  215.746361] RDX: 0000000000000000 RSI: 00007f014c084998 RDI: 00000000000000dd
[  215.746362] RBP: 0000000000000000 R08: 00007f014c084988 R09: 0000000000000000
[  215.746362] R10: 0000000000000000 R11: 0000000000000293 R12: 00007f0193d8ec00
[  215.746363] R13: 00007ffdc4b94320 R14: 00000000000f4240 R15: 0000000000000000
[  215.746364] ---[ end trace 4a4848a1b1423a31 ]---

I'm unfamiliar with using apitrace, but if someone can give me proper reporting instructions I can reproduce this bug on a wide range of hardware / software if it would be helpful.

Comment 22 MirceaKitsune 2017-08-09 13:35:40 UTC

(In reply to Matt from comment #21)

I sincerely don't mean to complicate this report with anything potentially unrelated, and apologize if I'm adding more noise. However I have to make an observation on something you said here;

The issue I referenced also started happening when my distribution introduced a major Mesa update. From the notes I've taken, that was indeed upgrading from Mesa 13.0 to 17.0.1. Several articles further confirm that Mesa 17 was introduced to openSUSE Tumbleweed at the beginning of March 2017, which is precisely when I began noticing my issue and reported it soon after.

I'll check to see if I can download and install a free version of Unreal Editor 4 for Linux, so I can confirm that it's the same problem then better replicate it (Plasma desktop does it very unpredictably). Unfortunately I'll be busy during the next weeks, and will have to leave this for later.

Comment 23 Chrizz_-Oo> 2017-08-26 23:26:19 UTC

Created attachment 133809 [details]
apitrace trace --api gl UE4Editor

skip attachments and read on first. see link.

Comment 24 Chrizz_-Oo> 2017-08-27 00:03:10 UTC

https://pastebin.com/cyFJNbYu file: sleep 25; dmesg sync > bResponse
https://pastebin.com/HKywSKU4 file: text from the terminal

Those 3 files are my results. If needed some day.

As far as I can tell, this issue is also occuring here https://bugs.freedesktop.org/show_bug.cgi?id=101977#c18 where you will also get redirected to Git-pull:"linux : fix radeonsi/radev crashes during startup #390"

Good Luck!

Additional case info:

Graphics used: Radeon R9 280X was announced on September 25, 2013 (..) based on the Tahiti XTL chip, being a slightly upgraded, rebranded Radeon HD 7970 GHz Edition.

1# OS used: Mint 18.1. Problem: After launching in opengl3 or with no param, graphic glitch appears. Striped/Colored.
Graphic glitches only affect areas of the loading screen. System did not freeze.
After loading, the slightly bigger Editor window pops up, while the content cannot be read, it still reacts to mouse hover events, causing even more small gpu and opengl errors. (additional: without params, I can recall a message opening saying "minimum of OpenGl 4.1 or 4.3(?) is required, try -opengl3". Closing that message may have unblocked the Terminal.

No logs were made from OS #1.

2# OS used: Ubuntu 17.04. Problem: After launching in whatever mode, the result is the same. Its worse. Whole system hangs. All displays start flickering trying to find a signal and freeze in that state / eventually give up retrying.
1. case: System freezes, with glitch from 1# OS in fullscreen. Then black. Responsive for less than 55 seconds after init. I assume shell does not work after time > 25 sec.
2. case: After several seconds the Monitor reappears, mouse is responding for few seconds, then first case.
3. case: Either 1 or 2, then another reappear happens, colored in the boot-scheme, red. Striped top down (thin lines). Looks nice, like a neat webdesign. Followed up by hanging in blackness.

Both systems are not modified. Like, the use future kernels or amdgpu - fglrx related. Transparent Terminals.

Much greeting, Chrizz_-Oo>

Comment 25 Timothy Arceri 2019-06-25 01:30:09 UTC

The apitrace runs fine on my Vega. Can you confirm if this is still an issue for you with recent Mesa or if we can close the bug report?

Comment 26 GitLab Migration User 2019-09-25 17:59:17 UTC

-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/mesa/mesa/issues/1268.

Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.