| Summary: | Hard lockup with radeonsi driver on FirePro W600, W9000 and W9100 | | |
|---|---|---|---|
| Product: | DRI | Reporter: | Julien Isorce <julien.isorce> |
| Component: | DRM/Radeon | Assignee: | Default DRI bug account <dri-devel> |
| Status: | RESOLVED MOVED | QA Contact: | |
| Severity: | normal | Priority: | medium |
| Version: | DRI git | Hardware: | x86-64 (AMD64) |
| OS: | All | | |
Description
Julien Isorce
2017-03-30 10:59:40 UTC
Created attachment 130564 [details]
xorg.log
Created attachment 130569 [details]
Same result with amdgpu using a 4.10 kernel
Michel Dänzer (comment #3):

Doesn't seem to happen with my Tonga after 30 minutes.

One thing to keep in mind is that FurMark is designed to stress the GPU. Do the systems you're testing on have appropriate power supply and cooling?

GALLIUM_HUD=.dfps,.drequested-VRAM+mapped-VRAM+VRAM-usage+VRAM-vis-usage,.drequested-GTT+mapped-GTT+GTT-usage,cpu+temperature+GPU-load shows that it only uses little VRAM and GTT, so it's weird that limiting those to much larger sizes has any effect.

Does it also happen with older versions of Mesa?

joro-2013 (comment #4):

I've noticed the same difference on my ancient mac mini G4 with a humble RV280 GPU. I try to boot it in AGP mode from time to time after applying some patches, just for fun. Before, I would see some GPU lockup-resetting stuff in kern.log (it always locks up after some time and never recovers); lately it just locks up and there's nothing in the logs.

The patches I applied lately were

[PATCH] radeon: allow write_reloc with unaccounted buffers to cope with Mesa bug
https://patchwork.kernel.org/patch/4663071/

and

drm-radeon-fix-TOPDOWN-handling-for-bo_create-v3.patch
https://www.mail-archive.com/dri-devel@lists.freedesktop.org/msg132931.html

My guess would be patch number two.

(In reply to joro-2013 from comment #4)
> I've noticed the same difference on my ancient mac mini G4 with a humble
> RV280 GPU. [...] lately it just locks up and there's nothing in the logs.

I think the problem in your case is AGP and Mac. AGP is notoriously unreliable and the Apple northbridge had additional coherency problems.

joro-2013:

Yeah, I know. But the behaviour of not writing to kern.log but just completely locking up appeared after said patch.

Julien Isorce:

(In reply to joro-2013 from comment #4)
> The patches I applied lately were [...] My guess would be patch number two.

Thx for the note, but it looks like the 2 patches you pointed to, the first one for libdrm and the second one (https://cgit.freedesktop.org/~agd5f/linux/commit/?h=topdown-fixes&id=db0be21d83078b2fe4cc6e9115d0b63a72a7e505) for the kernel, were never merged, so they cannot affect my use case.

(In reply to Michel Dänzer from comment #3)
> One thing to keep in mind is that FurMark is designed to stress the GPU. Do
> the systems you're testing on have appropriate power supply and cooling?

Thx for the suggestion. I will have another look at the temperature, but when I checked some time ago it was around 55 °C when it froze.

> GALLIUM_HUD=... shows that it only uses little VRAM and GTT, so it's weird
> that limiting those to much larger sizes has any effect.
Thx for the try. Today I could reproduce it with a HD7770, so the problem does not seem to be specific to FirePro. Also, just before it freezes, I sometimes have this:

radeon:    size      : 1048576 bytes
radeon:    va        : 0x8520d000
radeon: Failed to deallocate virtual address for buffer:
radeon:    size      : 65536 bytes
radeon:    va        : 0x86d8f000

> Does it also happen with older versions of Mesa?

I can have it with mesa 12.0.6. Are you thinking of something older?

When using R600_DEBUG=check_vm on both Xorg and the gl app, I can get some output in kern.log. It looks like a "ring 0 stalled" is detected, followed by a gpu soft reset which succeeds ("GPU reset succeeded, trying to resume") but then fails to resume because:

[drm:atom_execute_table_locked [radeon]] [kworker/0:1H, 434] *ERROR* atombios stuck executing C483 (len 254, WS 0, PS 4) @ 0xC4AD
[drm:atom_execute_table_locked [radeon]] [kworker/0:1H, 434] *ERROR* atombios stuck executing BC59 (len 74, WS 0, PS 8) @ 0xBC8E

Then there are two radeon_mc_wait_for_idle failures ("Wait for MC idle timedout") from si_mc_program. Finally si_startup fails because si_cp_resume fails, because r600_ring_test fails with:

"radeon: ring 0 test failed (scratch(0x850C)=0xCAFEDEAD)"

(see the ring-test sketch at the end of this exchange). It then seems to keep looping, trying to do a gpu soft reset, and at some point it freezes. I still need to confirm this ending scenario, but these atombios failures are worrying in the first place.

At the same time I get some "radeon_ttm_bo_destroy" warnings triggered by "WARN_ON(!list_empty(&bo->va));" in the kernel radeon driver. So it seems to leak some buffers. I will attach the full log tomorrow; it is messed up with my traces atm, but the essentials are above, I hope.

So I have 4 questions:

1. Can an application cause a "ring 0 stalled", or is it a driver bug (kernel side, mesa/drm, or xserver)?
2. About these atombios failures, do they mean that it fails to load the gpu microcode/firmware?
3. Does it try to do a gpu soft reset because I added R600_DEBUG=check_vm? Or does that option just help to flush the traces on a vm fault (as mentioned in a commit message related to that env var in mesa)?
4. For the deallocation failure / leak above (the radeon_ttm_bo_destroy warning), does it mean the memory is lost until the next reboot, or does a gpu soft reset allow these leaks to be recovered?

Thx!

(In reply to Julien Isorce from comment #9)
> 1: Can an application cause a "ring 0 stalled", or is it a driver bug?

Driver bug. Probably mesa or kernel.

> 2: About these atombios failures, do they mean that it fails to load the
> gpu microcode/firmware?

Most likely the GPU reset was not actually successful and the atombios errors are a symptom of that.

> 3: Does it try to do a gpu soft reset because I added R600_DEBUG=check_vm?

check_vm doesn't change anything with respect to gpu reset.

> 4: For the deallocation failure / leak above (radeon_ttm_bo_destroy
> warning), does it mean the memory is lost until the next reboot?

I'm not quite sure what you are referring to, but if the GPU reset is successful, all fences should be signalled, so any memory that is pinned due to a command buffer being in flight can be freed.

When R600_DEBUG=check_vm catches a VM fault, it generates a report in ~/ddebug_dumps/; please attach that here. A similar report from something like GALLIUM_DDEBUG="pipelined 10000" might give more information.
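For context on the "scratch(0x850C)=0xCAFEDEAD" failure quoted in this exchange: the ring test pre-loads a scratch register with a sentinel and then asks the CP to overwrite it. Below is a minimal sketch of that pattern, modelled on the radeon kernel driver's r600_ring_test; helper names follow drivers/gpu/drm/radeon conventions, but treat this as an illustration under those assumptions, not verbatim kernel code.

    /* Sketch of the scratch-register ring test, modelled on the kernel's
     * r600_ring_test(); illustrative, not verbatim. */
    int ring_test_sketch(struct radeon_device *rdev, struct radeon_ring *ring)
    {
        uint32_t scratch;
        unsigned i;
        int r;

        r = radeon_scratch_get(rdev, &scratch);
        if (r)
            return r;

        /* Pre-fill the scratch register with a sentinel value. */
        WREG32(scratch, 0xCAFEDEAD);

        /* Ask the CP to overwrite the sentinel via a SET_CONFIG_REG packet. */
        r = radeon_ring_lock(rdev, ring, 3);
        if (r) {
            radeon_scratch_free(rdev, scratch);
            return r;
        }
        radeon_ring_write(ring, PACKET3(PACKET3_SET_CONFIG_REG, 1));
        radeon_ring_write(ring, (scratch - PACKET3_SET_CONFIG_REG_OFFSET) >> 2);
        radeon_ring_write(ring, 0xDEADBEEF);
        radeon_ring_unlock_commit(rdev, ring, false);

        /* If the CP is alive, the register flips to 0xDEADBEEF quickly; if
         * it still reads 0xCAFEDEAD after the timeout, the CP never executed
         * the packet -- exactly the failure quoted in the log above. */
        for (i = 0; i < rdev->usec_timeout; i++) {
            if (RREG32(scratch) == 0xDEADBEEF)
                break;
            udelay(1);
        }
        r = (i < rdev->usec_timeout) ? 0 : -EINVAL;
        radeon_scratch_free(rdev, scratch);
        return r;
    }

The 0x850C in the quoted message is just the offset of the scratch register the test happened to allocate; a sentinel that never changes means the ring is truly dead rather than merely slow.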
Thx for the answers and suggestions. In order to make sure I am still tracking the same hard lockup: will any failure to load the gpu microcode lead to a total freeze of the machine?

Currently, in the setup where I can get some logs, it can fail in 2 ways. Both start with a "ring 0 stalled" being detected.

1: kworker triggers the gpu soft reset:

[drm:atom_op_jump [radeon]] *ERROR* atombios stuck in loop for more than 5secs aborting.
[drm:atom_execute_table_locked [radeon]] *ERROR* atombios stuck executing C483 (len 254, WS 0, PS 4) @ 0xC4AD
[drm:atom_execute_table_locked [radeon]] *ERROR* atombios stuck executing BC59 (len 74, WS 0, PS 8) @ 0xBC8E

Then si_mc_program::radeon_mc_wait_for_idle fails, the one after WREG32(vram.start) and before evergreen_mc_resume. Can it freeze on a call to RREG32(SRBM_STATUS) & 0x1F00? (A sketch of that polling loop follows the attachment list below.)

2: the gl app triggers the gpu soft reset. The first failure is then evergreen_mc_stop::radeon_mc_wait_for_idle, which reaches the timeout, followed by the same errors as in 1. But the freeze happens a bit later in the radeon_gpu_reset sequence, in atombios_crtc_dpms, in one of its atombios_X(crtc, ATOM_Y) calls.

So, to my question above: will any single problem during the gpu soft reset lead to a machine freeze? If yes, then I am probably now tracking a different freeze than the one I reported initially.

Also, in the kernel's drm_drv.c::drm_err I tried to add a call to sys_sync(); (#include <linux/syscalls.h>) to make sure all errors are written to disk so that I can read them after a reboot (instead of getting null characters ^@). But I got an undefined reference. How could I add a dependency on fs/sync.c? I have not searched long, but at first glance the tty driver calls it and there is nothing special in its Makefile. (As an alternative I am running `while true; do sleep 0.5; sync; done`, but it does not work all the time.)

Thx!

Created attachment 130787 [details]
dmesg_HD7770_kernel_amd-staging-4.9_ring_stalled
Created attachment 130788 [details]
ddebug_dumps_HD7770_kernel_amd-staging-4.9_ring_stalled
Created attachment 130789 [details]
dmesg_HD7770_kernel_agd5f-drm-next-4.12_ring_stalled
Created attachment 130790 [details]
ddebug_dumps_HD7770_kernel_agd5f-drm-next-4.12_ring_stalled
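Regarding the RREG32(SRBM_STATUS) & 0x1F00 question in comment #12: the MC idle wait is a bounded polling loop, so the loop itself cannot spin forever. Here is a minimal sketch of the pattern, modelled on the evergreen-family mc_wait_for_idle in the radeon kernel driver; the names are assumptions based on the functions quoted above, not verbatim code.

    /* Sketch of the MC idle wait, modelled on evergreen_mc_wait_for_idle()
     * in drivers/gpu/drm/radeon; illustrative, not verbatim. The loop is
     * bounded by rdev->usec_timeout, so a machine freeze here would have to
     * come from the RREG32() MMIO read itself never completing (e.g. a
     * wedged bus), not from the software loop. */
    static int mc_wait_for_idle_sketch(struct radeon_device *rdev)
    {
        unsigned i;

        for (i = 0; i < rdev->usec_timeout; i++) {
            /* Bits 8..12 of SRBM_STATUS report memory-controller busy
             * state; all clear means the MC is idle. */
            if (!(RREG32(SRBM_STATUS) & 0x1F00))
                return 0;
            udelay(1);
        }
        return -1; /* leads to the "Wait for MC idle timedout !" error */
    }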
These last 4 attachment logs are for comment #12, and were generated with GALLIUM_DDEBUG="pipelined 10000" R600_DEBUG=check_vm. Again, it is potentially a different freeze than the one reported initially, simply because I still have no logs for the former, which is with a W600.

Comment on attachment 130787 [details]
dmesg_HD7770_kernel_amd-staging-4.9_ring_stalled
Marking as obsolete because this is a different problem than the one reported.
Comment on attachment 130788 [details]
ddebug_dumps_HD7770_kernel_amd-staging-4.9_ring_stalled
Marking as obsolete because this is a different problem than the one reported.
Comment on attachment 130789 [details]
dmesg_HD7770_kernel_agd5f-drm-next-4.12_ring_stalled
Marking as obsolete because this is a different problem than the one reported.
Comment on attachment 130790 [details]
ddebug_dumps_HD7770_kernel_agd5f-drm-next-4.12_ring_stalled
Marking as obsolete because this is a different problem than the one reported.
I confirm that comments #9 and #12 are about a different issue (or at least different symptoms). I reported it here: https://bugs.freedesktop.org/show_bug.cgi?id=100712

Hi! There is a stress test here that we have been using to reproduce this issue: https://github.com/Oblong/thrasher

Enabling debug tracing works around the issue. Specifically, when si_draw_vbo calls si_trace_emit, the problem goes away.

These settings reproduce consistently on my W600 test machine:

./thrash -w 1920 -h 1080 -c 3 -t 1000 -m 1000000000

Also note that forcing VGT_STREAMOUT_SYNC, VGT_STREAMOUT_RESET, or VGT_FLUSH to be emitted on each call to si_draw_vbo (via si_emit_cache_flush) also appears to work around the issue, though this has not been as thoroughly tested. (A sketch of this idea follows at the end of this report.)

-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity. You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/drm/amd/issues/790.
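As referenced in the last technical comment above, here is a minimal sketch of the forced-VGT_FLUSH workaround, in the style of radeonsi's cache-flush code. The packet encoding and constant names follow Mesa's sid.h conventions from the radeonsi code of that era; treat them as assumptions for illustration, not the project's verbatim patch.

    /* Sketch of forcing a VGT_FLUSH event on every draw, in the style of
     * radeonsi's si_emit_cache_flush(). Constant and helper names follow
     * Mesa's sid.h conventions but are assumptions here, not a verbatim
     * upstream patch. */
    static void emit_vgt_flush_sketch(struct radeon_winsys_cs *cs)
    {
        /* EVENT_WRITE with the VGT_FLUSH event type drains the VGT front
         * end before subsequent draws -- which appears to paper over the
         * lockup, much like the si_trace_emit debug path does. */
        radeon_emit(cs, PKT3(PKT3_EVENT_WRITE, 0, 0));
        radeon_emit(cs, EVENT_TYPE(V_028A90_VGT_FLUSH) | EVENT_INDEX(0));
    }

Calling something like this from the draw path on every si_draw_vbo serializes the VGT enough to hide the hang; as the comment above stresses, it is a workaround for narrowing down the bug, not a fix.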