Summary: | [drm:evergreen_resume] *ERROR* evergreen startup failed on resume | ||
---|---|---|---|
Product: | DRI | Reporter: | jbart <jasa.bartelj> |
Component: | DRM/Radeon | Assignee: | Default DRI bug account <dri-devel> |
Status: | RESOLVED FIXED | QA Contact: | |
Severity: | major | ||
Priority: | medium | CC: | isma.casti, stefanscheffler |
Version: | unspecified | ||
Hardware: | x86-64 (AMD64) | ||
OS: | Linux (All) | ||
Attachments: |
Is this a regression, i.e. did it work with older kernels? If so, can you bisect?

Hi. I'm unable to reproduce this issue on a more recent kernel build (3.18.0-0.rc4.git0.2). I'll just mark it WORKSFORME.

I rescind my previous comment; the same issue has happened again. The bug is sporadic and I can't reliably reproduce it yet. As for bisection I can definitely give it a shot, but it's pointless until I can reproduce the issue at will...

(In reply to jbart from comment #3)
> As for bisection I can definitely give it a shot, but it's pointless until I
> can reproduce the issue at will...

Not really, if you can get a feeling for how many times you need to test before you can be confident that the problem is not present.

Anyway, were you using older kernels before where the problem never occurred? What was the newest kernel where that was the case?

I've seen a similar issue at least in the last cycle, i.e. 3.17-rc. Because it seemed to stop occurring in later rcs and stable releases, I didn't investigate further. Since I don't recall such issues before, yes, I would say that this behaviour is a regression.

I've also experienced this issue with the latest rc on rawhide, 3.18.0-0.rc5.git0.2.fc22.x86_64. After the GPU fails to resume I need to kill X. A new session says it is being rendered by llvmpipe, a state that only a full reboot gets out of. Since the issue is sporadic (it happens about 2 or 3 times a week) I haven't started bisecting, since "getting a good feeling" would really take too long. I haven't noticed any open apps triggering the bug.

I get the error every time I suspend the machine, and 3D acceleration is disabled until I reboot. I am using an HD5770 running 3.17.4-1-ck. Both Mesa and X are at the latest stable version (I use Arch).

Nov 30 14:49:32 boreal kernel: [drm] PCIE GART of 1024M enabled (table at 0x000000000025D000).
Nov 30 14:49:32 boreal kernel: radeon 0000:01:00.0: WB enabled
Nov 30 14:49:32 boreal kernel: radeon 0000:01:00.0: fence driver on ring 0 use gpu addr 0x0000000040000c00 and cpu addr
Nov 30 14:49:32 boreal kernel: radeon 0000:01:00.0: fence driver on ring 3 use gpu addr 0x0000000040000c0c and cpu addr
Nov 30 14:49:32 boreal kernel: radeon 0000:01:00.0: fence driver on ring 5 use gpu addr 0x000000000005c418 and cpu addr
Nov 30 14:49:32 boreal kernel: [drm:r600_ring_test] *ERROR* radeon: ring 0 test failed (scratch(0x8504)=0xCAFEDEAD)
Nov 30 14:49:32 boreal kernel: [drm:evergreen_resume] *ERROR* evergreen startup failed on resume

Same here with a Radeon HD 5670 512MB GDDR5 using Arch Linux. It also happens without starting X. The following error only shows up when the card resumes successfully, not when it fails:

[drm:rv770_stop_dpm] *ERROR* Could not force DPM to low.

Disabling DPM seems to prevent the startup failures.

Another observation: according to /proc/interrupts the card generates thousands of interrupts per second after a failed resume, which might explain why the system feels extremely sluggish afterwards.

Bisecting brought me to this commit. It's from early in the 3.17 cycle. I thought this was happening with 3.16 already, but maybe I remember wrong.

commit a3eb06dbca08e3fdad7039021ae03b46b215f22a
Author: Michel Dänzer <michel.daenzer@amd.com>
Date: Wed Jul 9 20:15:42 2014 +0200

    drm/radeon: Remove radeon_gart_restore()

    Doesn't seem necessary, the GART table memory should be persistent.

    Signed-off-by: Michel Dänzer <michel.daenzer@amd.com>
    Reviewed-by: Christian König <christian.koenig@amd.com>
    Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

Created attachment 110335: dmesg after failed resume

Dmesg after a few suspend/resumes. Card failed to start up after the last one.
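On the earlier question of how many suspend/resume cycles to test before calling a kernel good during a bisect: a rough number can be estimated if you assume each cycle fails independently with some probability p on a bad kernel. The sketch below is only an illustration of that arithmetic. The function name `clean_runs_needed` and the example rates are assumptions for demonstration, not figures measured in this report.

```python
import math


def clean_runs_needed(failure_rate, confidence=0.95):
    """Return how many consecutive clean suspend/resume cycles are needed
    before a kernel can be called "good" with the given confidence.

    Assumption (hypothetical model): on a bad kernel each cycle fails
    independently with probability `failure_rate`.
    """
    if not 0.0 < failure_rate < 1.0:
        raise ValueError("failure_rate must be strictly between 0 and 1")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    # On a bad kernel, P(n clean cycles in a row) = (1 - p)^n.
    # We want that chance to be at most (1 - confidence), so
    # n >= log(1 - confidence) / log(1 - p).
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - failure_rate))


# Example rates are made up: one failure in two resumes vs. one in ten.
for rate in (0.5, 0.1):
    print(rate, clean_runs_needed(rate))
```

The lower the failure rate, the more clean cycles each bisect step needs, which is why a bug that shows up only a few times a week is expensive to bisect but not impossible.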
Created attachment 110348: Add mb() call to radeon_gart_table_vram_pin

Does this patch help?

No, it doesn't make a difference. Tested with 3.17.4.

The "interrupt storm" I mentioned didn't happen this time. I checked with a patched and an unpatched kernel. Either it was just a one-time thing or it only happened with older kernels.

Created attachment 110555: Add mb() and radeon_gart_tlb_flush() calls to radeon_gart_table_vram_pin

How about this patch?

Doesn't help. Tested with 3.17.4 and 3.18.0.

Created attachment 111362: Keep GART table mapped across suspend/resume

Does this patch help? Note that it's just an incomplete proof of concept, but it should at least help narrow down the problem.

Created attachment 111395: dmesg with radeon-suspend-gart-table.diff

On 3.18.1 the patch applied with some offsets and didn't help. I'm getting those NULL pointer warnings now though.
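The interrupt storm reported earlier in this thread could be checked by taking two snapshots of /proc/interrupts a few seconds apart and comparing the counts. This is a minimal sketch, assuming the usual layout: a header row of CPU names, then one row per IRQ with one count per CPU. The helper names `parse_interrupts` and `interrupt_rates` are hypothetical.

```python
def parse_interrupts(text):
    """Sum the per-CPU counters of each row of /proc/interrupts-style text.

    Returns a dict mapping the IRQ label (e.g. "45" or "ERR") to its total.
    """
    lines = text.strip().splitlines()
    ncpus = len(lines[0].split())  # header row: CPU0 CPU1 ...
    totals = {}
    for line in lines[1:]:
        label, sep, rest = line.partition(":")
        if not sep:
            continue  # not an "IRQ: counts..." row
        total = 0
        # Only the first ncpus fields can be counters; rows such as
        # "ERR:" carry fewer, so stop at the first non-numeric field.
        for field in rest.split()[:ncpus]:
            if not field.isdigit():
                break
            total += int(field)
        totals[label.strip()] = total
    return totals


def interrupt_rates(before, after, seconds):
    """Interrupts per second for every IRQ present in both snapshots."""
    first = parse_interrupts(before)
    second = parse_interrupts(after)
    return {irq: (second[irq] - first[irq]) / seconds
            for irq in second if irq in first}
```

To use it, read /proc/interrupts, sleep for a known interval, read it again, and pass both strings plus the interval to `interrupt_rates`. A radeon row showing thousands of interrupts per second would match the observation above.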
Created attachment 112527: Split off separate radeon_gart_get_page_entry ASIC hook

Created attachment 112528: Reinstate radeon_gart_restore for when the GART table is in VRAM

Do these two patches help?

(In reply to Michel Dänzer from comment #16)
> Created attachment 112528
> Reinstate radeon_gart_restore for when the GART table is in VRAM
>
> Do these two patches help?

Yes, it helps. Suspend and resume work perfectly now.

Xan 22 21:20:20 boreal kernel: [drm] enabling PCIE gen 2 link speeds, disable with radeon.pcie_gen2=0
Xan 22 21:20:20 boreal kernel: [drm] PCIE GART of 1024M enabled (table at 0x000000000025E000).
Xan 22 21:20:20 boreal kernel: radeon 0000:01:00.0: WB enabled
Xan 22 21:20:20 boreal kernel: radeon 0000:01:00.0: fence driver on ring 0 use gpu addr 0x0000000040000c00 and cpu addr 0xffff88041693cc00
Xan 22 21:20:20 boreal kernel: radeon 0000:01:00.0: fence driver on ring 3 use gpu addr 0x0000000040000c0c and cpu addr 0xffff88041693cc0c
Xan 22 21:20:20 boreal kernel: radeon 0000:01:00.0: fence driver on ring 5 use gpu addr 0x000000000005c418 and cpu addr 0xffffc9000471c418
Xan 22 21:20:20 boreal kernel: [drm] ring test on 0 succeeded in 1 usecs
Xan 22 21:20:20 boreal kernel: [drm] ring test on 3 succeeded in 2 usecs
Xan 22 21:20:20 boreal kernel: usb 6-1.6.2: reset high-speed USB device number 5 using ehci-pci
Xan 22 21:20:20 boreal kernel: [drm] ring test on 5 succeeded in 1 usecs
Xan 22 21:20:20 boreal kernel: [drm] UVD initialized successfully.
Xan 22 21:20:20 boreal kernel: [drm] ib test on ring 0 succeeded in 0 usecs
Xan 22 21:20:20 boreal kernel: [drm] ib test on ring 3 succeeded in 0 usecs

The fix is queued up in Alex Deucher's -fixes tree. |
Created attachment 109443: corrupt display

After resuming from suspend on kernel 3.18-rc4, the graphics card locks up:

[55325.256071] [drm] enabling PCIE gen 2 link speeds, disable with radeon.pcie_gen2=0
[55325.261268] [drm] PCIE GART of 1024M enabled (table at 0x000000000025E000).
[55325.261400] radeon 0000:03:00.0: WB enabled
[55325.261402] radeon 0000:03:00.0: fence driver on ring 0 use gpu addr 0x0000000040000c00 and cpu addr 0xffff8800d7caac00
[55325.261404] radeon 0000:03:00.0: fence driver on ring 3 use gpu addr 0x0000000040000c0c and cpu addr 0xffff8800d7caac0c
[55325.262271] radeon 0000:03:00.0: fence driver on ring 5 use gpu addr 0x000000000005c418 and cpu addr 0xffffc9000581c418
[55325.437301] [drm:r600_ring_test] *ERROR* radeon: ring 0 test failed (scratch(0x8504)=0xCAFEDEAD)
[55325.437302] [drm:evergreen_resume] *ERROR* evergreen startup failed on resume
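For anyone counting failed resumes over many cycles, the two error lines in the log above make a usable signature. The sketch below scans dmesg or journal text for them. The regular expressions are derived only from the messages quoted in this report, and the name `find_resume_failures` is hypothetical; other kernel versions may word these messages differently.

```python
import re

# Patterns modelled on the two *ERROR* lines quoted in this report.
RING_FAIL = re.compile(
    r"\[drm:\w+\] \*ERROR\* radeon: ring (\d+) test failed "
    r"\(scratch\(0x[0-9A-Fa-f]+\)=(0x[0-9A-Fa-f]+)\)"
)
RESUME_FAIL = re.compile(
    r"\[drm:\w+_resume\] \*ERROR\* (\w+) startup failed on resume"
)


def find_resume_failures(log_text):
    """Return a list of (kind, detail, value) tuples, one per matching line.

    kind is "ring_test_failed" (detail = ring number, value = scratch value)
    or "startup_failed" (detail = ASIC family name, value = None).
    """
    events = []
    for line in log_text.splitlines():
        match = RING_FAIL.search(line)
        if match:
            events.append(("ring_test_failed", int(match.group(1)), match.group(2)))
            continue
        match = RESUME_FAIL.search(line)
        if match:
            events.append(("startup_failed", match.group(1), None))
    return events
```

An empty result after a resume means neither signature appeared. A successful resume logs lines such as "ring test on 0 succeeded", which this function deliberately ignores.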