Bug 100465

Summary:

Hard lockup with radeonsi driver on FirePro W600, W9000 and W9100

Product:

DRI

Reporter:

Julien Isorce <julien.isorce>

Component:

DRM/Radeon

Assignee:

Default DRI bug account <dri-devel>

Status:

RESOLVED MOVED

QA Contact:

Severity:

normal

Priority:

medium

Version:

DRI git

Hardware:

x86-64 (AMD64)

OS:

All

Whiteboard:

Attachments:

Description	Flags
dmesg	none
xorg.log	none
Same result with amdgpu using a 4.10 kernel	none
dmesg_HD7770_kernel_amd-staging-4.9_ring_stalled	none
ddebug_dumps_HD7770_kernel_amd-staging-4.9_ring_stalled	none
dmesg_HD7770_kernel_agd5f-drm-next-4.12_ring_stalled	none
ddebug_dumps_HD7770_kernel_agd5f-drm-next-4.12_ring_stalled	none

Description Julien Isorce 2017-03-30 10:59:40 UTC

Created attachment 130563 [details]
dmesg

The machine freezes completely when using the radeonsi driver with a FirePro W600, W9000 or W9100.

* Steps to reproduce:

wget http://www.phoronix-test-suite.com/benchmark-files/GpuTest_Linux_x64_0.7.0.zip
DISPLAY=:0 ./GpuTest /test=fur /fullscreen

* Actual result:

System and screen are frozen after a few minutes (sometimes a few seconds, sometimes 20 min). No mouse/keyboard. Does not respond to ping. No kernel panic. Requires hard reboot.

After reboot, no error in /var/log/kern.log. Empty dir /var/crash, empty dir /sys/fs/pstore. Sometimes there are some NUL characters (^@) just before the next "Linux version" line.
Using a serial console does not show additional debug messages.
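
Runs of NUL bytes like the ones mentioned above often indicate log data that never reached the disk before the hard reboot. A minimal Python sketch for locating such runs in a log file; the function names and the default path are assumptions for illustration, not something used in this report:

```python
import re
from pathlib import Path


def find_nul_runs(data: bytes):
    """Return (offset, length) for every run of NUL bytes in data."""
    return [(m.start(), m.end() - m.start()) for m in re.finditer(rb"\x00+", data)]


def report(path="/var/log/kern.log"):
    """Print each NUL run found in the given log file (default path is an assumption)."""
    data = Path(path).read_bytes()
    for offset, length in find_nul_runs(data):
        print(f"{length} NUL byte(s) at offset {offset}")
```

Comparing the offsets against the position of the next "Linux version" line gives a rough idea of how much log data was lost at each freeze.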

* Expected result:

No system freeze.

* List of things that have been tried but lead to the same result:

- kernel 4.4.X, 4.8.x packaged by ubuntu.
- amd-staging-4.9 from https://cgit.freedesktop.org/~agd5f/linux.
- a few 4.10 kernels from http://kernel.ubuntu.com/~kernel-ppa/mainline/ .
- radeon.dpm=1 (all values for power_dpm_state / power_dpm_force_performance_level)
- radeon.dpm=0 (power_mode=profile and all values for power_profile)
- radeon.msi=1 / 0.
- DRI2 / DRI3
- glamor / no accel, TearFree on / off
- single monitor, multi monitor, resolutions 1600x1200, 1920x1080.
- Latest libdrm / mesa. llvm 3.8, 4 and 5.

* List of things that avoid the system freeze:

- radeon.gartsize=512 radeon.vramlimit=1024
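
To confirm that the workaround values were actually picked up after boot, the module parameters can be read back. A minimal sketch, assuming the radeon module exports them under /sys/module/radeon/parameters/ (reading them may require root); the helper name `read_module_params` is hypothetical:

```python
from pathlib import Path


def read_module_params(names, module="radeon", sysfs_root="/sys/module"):
    """Return {name: value or None} for the given module parameters."""
    params = {}
    for name in names:
        path = Path(sysfs_root) / module / "parameters" / name
        try:
            params[name] = path.read_text().strip()
        except OSError:
            # Parameter not exported, module not loaded, or insufficient permissions.
            params[name] = None
    return params
```

Calling read_module_params(["gartsize", "vramlimit"]) on the test machine should then report 512 and 1024 if the kernel command line was applied.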

* Others:

- apitrace trace then replay does not lead to the freeze.
- No errors with R600_DEBUG=* or MESA_DEBUG.
- strace sometimes shows that the last call is ioctl(RADEON_CS), but I am not sure how reliable this is given that the last print might not have been flushed.
- Happens with motherboards from 2 different brands.
- Takes a bit longer for the mentioned GpuTest to freeze the machine on W9000 and W9100.

* TODOs:

- Try kgdb again.
- Try amdgpu instead of radeonsi.

Comment 1 Julien Isorce 2017-03-30 11:00:13 UTC

Created attachment 130564 [details]
xorg.log

Comment 2 Julien Isorce 2017-03-30 12:31:26 UTC

Created attachment 130569 [details]
Same result with amdgpu using a 4.10 kernel

Comment 3 Michel Dänzer 2017-03-31 02:21:54 UTC

Doesn't seem to happen with my Tonga after 30 minutes.

One thing to keep in mind is that FurMark is designed to stress the GPU. Do the systems you're testing on have appropriate power supply and cooling?

GALLIUM_HUD=.dfps,.drequested-VRAM+mapped-VRAM+VRAM-usage+VRAM-vis-usage,.drequested-GTT+mapped-GTT+GTT-usage,cpu+temperature+GPU-load shows that it only uses little VRAM and GTT, so it's weird that limiting those to much larger sizes has any effect.

Does it also happen with older versions of Mesa?

Comment 4 joro-2013 2017-03-31 09:04:04 UTC

I've noticed the same difference on my ancient Mac mini G4 with a humble RV280 GPU. I try to boot it in AGP mode from time to time after applying some patches, just for fun. Before, I would see some GPU lockup-resetting stuff (it always locks up after some time and never recovers) in kern.log; lately it just locks up and there's nothing in the logs.

The patches I applied lately were

[PATCH] radeon: allow write_reloc with unaccounted buffers to cope
 with Mesa bug

https://patchwork.kernel.org/patch/4663071/

and 

drm-radeon-fix-TOPDOWN-handling-for-bo_create-v3.patch


https://www.mail-archive.com/dri-devel@lists.freedesktop.org/msg132931.html

My guess would be patch number two.

Comment 5 Alex Deucher 2017-03-31 14:49:32 UTC

(In reply to joro-2013 from comment #4)
> I've noticed the same difference on my ancient mac mini G4 with a humble
> RV280 GPU. I try to boot it in AGP mode from time to time after applying
> some patches, just for fun. Before i would see some GPU lockup-resetting
> stuff (it always locks up after some time and never recovers) in kern.log,
> lately it just locks up and there's nothing in the logs.

I think the problem in your case is AGP and Mac.  AGP is notoriously unreliable and the Apple northbridge had additional coherency problems.

Comment 6 joro-2013 2017-04-02 13:18:35 UTC

Yeah, I know. But the behaviour of not writing to kern.log but just completely locking up appeared after said patch.

Comment 7 Julien Isorce 2017-04-03 09:19:21 UTC

(In reply to joro-2013 from comment #4)
> I've noticed the same difference on my ancient mac mini G4 with a humble
> RV280 GPU. I try to boot it in AGP mode from time to time after applying
> some patches, just for fun. Before i would see some GPU lockup-resetting
> stuff (it always locks up after some time and never recovers) in kern.log,
> lately it just locks up and there's nothing in the logs.
> 
> The patches i applied lately were
> 
> [PATCH] radeon: allow write_reloc with unaccounted buffers to cope
>  with Mesa bug
> 
> https://patchwork.kernel.org/patch/4663071/
> 
> and 
> 
> drm-radeon-fix-TOPDOWN-handling-for-bo_create-v3.patch
> 
> 
> https://www.mail-archive.com/dri-devel@lists.freedesktop.org/msg132931.html
> 
> My guess would be patch number two.

Thx for the note, but it looks like the 2 patches you pointed to (the first one for libdrm, and the second one, https://cgit.freedesktop.org/~agd5f/linux/commit/?h=topdown-fixes&id=db0be21d83078b2fe4cc6e9115d0b63a72a7e505, for the kernel) were never merged, so they cannot affect my use case.

Comment 8 Julien Isorce 2017-04-04 17:03:45 UTC

(In reply to Michel Dänzer from comment #3)
> Doesn't seem to happen with my Tonga after 30 minutes.
> 
> One thing to keep in mind is that FurMark is designed to stress the GPU. Do
> the systems you're testing on have appropriate power supply and cooling?
> 

Thx for the suggestion. I will have another look at the temperature, but when I checked some time ago it was around 55 C when the machine froze.

> GALLIUM_HUD=.dfps,.drequested-VRAM+mapped-VRAM+VRAM-usage+VRAM-vis-usage,.
> drequested-GTT+mapped-GTT+GTT-usage,cpu+temperature+GPU-load shows that it
> only uses little VRAM and GTT, so it's weird that limiting those to much
> larger sizes has any effect.

Thx for the try.

Today I could reproduce it with a HD7770, so the problem does not seem to be specific to FirePro.

Also, just before it freezes, I sometimes get this:

radeon:    size      : 1048576 bytes
radeon:    va        : 0x8520d000
radeon: Failed to deallocate virtual address for buffer:
radeon:    size      : 65536 bytes
radeon:    va        : 0x86d8f000

> 
> Does it also happen with older versions of Mesa?

I can reproduce it with Mesa 12.0.6. Are you thinking of something older?

Comment 9 Julien Isorce 2017-04-06 16:27:39 UTC

When using R600_DEBUG=check_vm on both Xorg and the GL app, I can get some output in kern.log. It looks like a "ring 0 stalled" is detected, followed by a GPU soft reset which succeeds ("GPU reset succeeded, trying to resume") but then fails to resume because of:

[drm:atom_execute_table_locked [radeon]] [kworker/0:1H, 434] *ERROR* atombios stuck executing C483 (len 254, WS 0, PS 4) @ 0xC4AD
[drm:atom_execute_table_locked [radeon]] [kworker/0:1H, 434] *ERROR* atombios stuck executing BC59 (len 74, WS 0, PS 8) @ 0xBC8E

Then there are two radeon_mc_wait_for_idle failures ("Wait for MC idle timedout") from si_mc_program.

Finally si_startup fails because si_cp_resume fails because r600_ring_test fails with: "radeon: ring 0 test failed (scratch(0x850C)=0xCAFEDEAD)"

But it seems to keep looping, trying to do a GPU soft reset, and at some point it freezes. I still need to confirm this ending scenario, but these atombios failures are worrying in the first place.

At the same time I get some "radeon_ttm_bo_destroy" warnings triggered by "WARN_ON(!list_empty(&bo->va));" in the kernel radeon driver. So it seems to leak some buffers.

I will attach the full log tomorrow; it is mixed up with my own traces atm, but the essentials are above, I hope.
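
To follow the failure sequence described above in a long kern.log, the quoted messages can be turned into simple patterns. A rough Python sketch; the stage names are made up for illustration, and the patterns only cover the messages quoted in this comment:

```python
import re

# Patterns taken from the messages quoted in this comment, in the order observed.
STAGES = [
    ("ring_stalled", re.compile(r"ring 0 stalled")),
    ("reset_succeeded", re.compile(r"GPU reset succeeded")),
    ("atombios_stuck", re.compile(r"atombios stuck executing")),
    ("mc_idle_timeout", re.compile(r"Wait for MC idle timedout")),
    ("ring_test_failed", re.compile(r"ring 0 test failed")),
]


def classify(line):
    """Return the stage name matching a log line, or None if nothing matches."""
    for name, pattern in STAGES:
        if pattern.search(line):
            return name
    return None


def summarize(lines):
    """Count how often each stage appears in an iterable of log lines."""
    counts = {name: 0 for name, _ in STAGES}
    for line in lines:
        stage = classify(line)
        if stage is not None:
            counts[stage] += 1
    return counts
```

A "ring_stalled" count much larger than one would support the suspicion that the driver keeps looping through soft resets before the freeze.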

So I have 4 questions:
 1: Can an application cause a "ring 0 stalled", or is it a driver bug (kernel side, mesa/drm or xserver)?
 2: About these atombios failures, does it mean that it fails to load the GPU microcode/firmware?
 3: Does it try to do a GPU soft reset because I added R600_DEBUG=check_vm? Or does this one just help to flush the traces on a VM fault (as mentioned in a commit message related to that env var in mesa)?
 4: For the deallocation failure / leak above (radeon_ttm_bo_destroy warning), does it mean the memory is lost until the next reboot, or does a GPU soft reset allow recovering these leaks?

Thx !

Comment 10 Alex Deucher 2017-04-06 16:54:05 UTC

(In reply to Julien Isorce from comment #9)
> 
> So I have 4 questions:
>  1: Can an application causes a "ring 0 stalled" ? or is it a driver bug
> (kernel side or mesa/drm or xserver) ?

driver bug.  Probably mesa or kernel.

>  2: About these atombios failures, does it mean that it fails to load the
> gpu microcode/firmware ?

Most likely the GPU reset was not actually successful and the atombios errors are a symptom of that.

>  3: Does it try to do a gpu softreset because I added R600_DEBUG=check_vm ?
> Or this one just help to flush the traces on vm fault (like mentioned in a
> commit msg related to that env var in mesa) ?

check_vm does not change anything with respect to GPU reset.

>  4: For the deallocation failure / leak above (radeon_ttm_bo_destroy
> warning), does it mean the memory is lost until next reboot or does a gpu
> soft reset allow to recover these leaks ? 

I'm not quite sure what you are referring to, but if the GPU reset is successful, all fences should be signalled so any memory that is pinned due to a command buffer being in flight could be freed.

Comment 11 Michel Dänzer 2017-04-10 07:44:19 UTC

When R600_DEBUG=check_vm catches a VM fault, it generates a report in ~/ddebug_dumps/ , please attach that here.

A similar report from something like GALLIUM_DDEBUG="pipelined 10000" might give more information.

Comment 12 Julien Isorce 2017-04-10 10:03:59 UTC

Thx for the answers and suggestions. In order to make sure I am still tracking the same hard lockup: will any failure to load the GPU microcode lead to a total freeze of the machine?

Currently in the setup where I can get some logs it can fail in 2 ways:

Both start with a "ring 0 stalled" being detected.

1: kworker triggers the gpu soft reset.

[drm:atom_op_jump [radeon]] *ERROR* atombios stuck in loop for more than 5secs aborting.
[drm:atom_execute_table_locked [radeon]] *ERROR* atombios stuck executing C483 (len 254, WS 0, PS 4) @ 0xC4AD
[drm:atom_execute_table_locked [radeon]] *ERROR* atombios stuck executing BC59 (len 74, WS 0, PS 8) @ 0xBC8E
si_mc_program::radeon_mc_wait_for_idle, the one after WREG32(vram.start) and before evergreen_mc_resume. Can it freeze on a call to RREG32(SRBM_STATUS) & 0x1F00?

2: the gl app triggers the gpu soft reset.

The first failure is then evergreen_mc_stop::radeon_mc_wait_for_idle, which reaches the timeout, followed by the same errors as in 1.
But the freeze happens a bit later in the radeon_gpu_reset sequence, in atombios_crtc_dpms in one of its atombios_X(crtc, ATOM_Y) calls.
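
For readers unfamiliar with the RREG32(SRBM_STATUS) & 0x1F00 check mentioned in case 1, the wait is a timeout-bounded polling loop on masked status bits. The following is only a Python simulation of that idea, not the kernel code: `read_status` is a hypothetical callable standing in for the register read, and the timeout and delay values are assumptions:

```python
import time

MC_BUSY_MASK = 0x1F00  # mask quoted above, applied to the SRBM_STATUS value


def wait_for_mc_idle(read_status, timeout_us=100000, delay_s=1e-6):
    """Poll read_status() until the masked bits clear.

    Returns 0 once idle, or -1 if the bits never clear within timeout_us polls.
    """
    for _ in range(timeout_us):
        if (read_status() & MC_BUSY_MASK) == 0:
            return 0
        time.sleep(delay_s)
    return -1
```

The loop itself always terminates; the open question in this comment is whether the register read inside it can block, which the sketch cannot answer.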

So, my question above: will any single problem during the GPU soft reset lead to a machine freeze? If yes, then I am probably now tracking a different freeze than the one I reported initially.

Also, in the kernel's drm_drv.c::drm_err I tried to add a call to sys_sync(); (#include <linux/syscalls.h>) to make sure all errors are written to disk so that I can read them after a reboot (instead of having null characters ^@). But I got an undefined reference. How could I add a dependency on fs/sync.c? I have not searched for long, but at first glance the tty driver calls it and there is nothing special in the Makefile.
(As an alternative I am running a "while true; do sleep 0.5; sync; done" loop, but it does not work all the time.)
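
The periodic-sync workaround above can also be written as a small Python helper, which makes the interval easy to tune. A minimal sketch; the function name and the injectable `sync`/`sleep` parameters are assumptions for illustration:

```python
import os
import time


def periodic_sync(interval_s=0.5, iterations=None, sync=os.sync, sleep=time.sleep):
    """Call sync() every interval_s seconds; loop forever when iterations is None.

    Returns the number of sync calls performed.
    """
    count = 0
    while iterations is None or count < iterations:
        sleep(interval_s)
        sync()
        count += 1
    return count
```

Like the shell loop, this only narrows the window: anything logged between the last sync and the freeze can still be lost.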

Thx!

Comment 13 Julien Isorce 2017-04-10 22:25:05 UTC

Created attachment 130787 [details]
dmesg_HD7770_kernel_amd-staging-4.9_ring_stalled

Comment 14 Julien Isorce 2017-04-10 22:27:00 UTC

Created attachment 130788 [details]
ddebug_dumps_HD7770_kernel_amd-staging-4.9_ring_stalled

Comment 15 Julien Isorce 2017-04-10 22:28:36 UTC

Created attachment 130789 [details]
dmesg_HD7770_kernel_agd5f-drm-next-4.12_ring_stalled

Comment 16 Julien Isorce 2017-04-10 22:29:15 UTC

Created attachment 130790 [details]
ddebug_dumps_HD7770_kernel_agd5f-drm-next-4.12_ring_stalled

Comment 17 Julien Isorce 2017-04-10 22:56:06 UTC

These last 4 attachments are the logs for comment #12, generated with GALLIUM_DDEBUG="pipelined 10000" R600_DEBUG=check_vm. Again, it is potentially a different freeze than the one reported initially, simplify because I still have no logs for the former, which is with a W600.
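
For anyone wanting to reproduce these logs, the two variables have to be set for the process being traced. A small Python sketch that builds such an environment and lists the resulting dump files; the helper names are hypothetical, and the ~/ddebug_dumps location is the one mentioned in comment #11:

```python
import os
from pathlib import Path


def debug_env(base=None):
    """Return a copy of the environment with the two debug variables set."""
    env = dict(os.environ if base is None else base)
    env["GALLIUM_DDEBUG"] = "pipelined 10000"
    env["R600_DEBUG"] = "check_vm"
    return env


def list_dumps(dump_dir=None):
    """Return the sorted file names in the dump directory, or [] if it is missing."""
    directory = Path.home() / "ddebug_dumps" if dump_dir is None else Path(dump_dir)
    if not directory.is_dir():
        return []
    return sorted(p.name for p in directory.iterdir())
```

The environment can then be passed to subprocess.run(..., env=debug_env()) when starting the GL application.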

Comment 18 Julien Isorce 2017-04-10 22:56:50 UTC

-simplify +simply

Comment 19 Julien Isorce 2017-04-18 14:23:07 UTC

Comment on attachment 130787 [details]
dmesg_HD7770_kernel_amd-staging-4.9_ring_stalled

Marking as obsolete because this is a different problem than the one reported.

Comment 20 Julien Isorce 2017-04-18 14:23:23 UTC

Comment on attachment 130788 [details]
ddebug_dumps_HD7770_kernel_amd-staging-4.9_ring_stalled

Marking as obsolete because this is a different problem than the one reported.

Comment 21 Julien Isorce 2017-04-18 14:23:38 UTC

Comment on attachment 130789 [details]
dmesg_HD7770_kernel_agd5f-drm-next-4.12_ring_stalled

Marking as obsolete because this is a different problem than the one reported.

Comment 22 Julien Isorce 2017-04-18 14:23:52 UTC

Comment on attachment 130790 [details]
ddebug_dumps_HD7770_kernel_agd5f-drm-next-4.12_ring_stalled

Marking as obsolete because this is a different problem than the one reported.

Comment 23 Julien Isorce 2017-04-18 15:19:21 UTC

I confirm that comments #9 and #12 are about a different issue (or at least different symptoms). I reported it here: https://bugs.freedesktop.org/show_bug.cgi?id=100712

Comment 24 Zachary Michaels 2017-04-19 23:01:48 UTC

Hi! There is a stress test here that we have been using to reproduce this issue: https://github.com/Oblong/thrasher

Enabling debug tracing works around the issue. Specifically, when si_draw_vbo calls si_trace_emit, the problem goes away.

Comment 25 Zachary Michaels 2017-04-19 23:04:15 UTC

These settings reproduce consistently on my W600 test machine:
./thrash -w 1920 -h 1080 -c 3 -t 1000 -m 1000000000

Comment 26 Zachary Michaels 2017-04-19 23:36:07 UTC

Also note that forcing VGT_STREAMOUT_SYNC, VGT_STREAMOUT_RESET, or VGT_FLUSH to be emitted on each call to si_draw_vbo (via si_emit_cache_flush) also appears to work around the issue, though this has not been as thoroughly tested.

Comment 27 Martin Peres 2019-11-19 09:27:35 UTC

-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/drm/amd/issues/790.
