Bug 86267

Summary:

[drm:evergreen_resume] *ERROR* evergreen startup failed on resume

Product:

DRI

Reporter:

jbart <jasa.bartelj>

Component:

DRM/Radeon

Assignee:

Default DRI bug account <dri-devel>

Status:

RESOLVED FIXED

QA Contact:

Severity:

major

Priority:

medium

CC:

isma.casti, stefanscheffler

Version:

unspecified

Hardware:

x86-64 (AMD64)

OS:

Linux (All)

Attachments:

Description	Flags
corrupt display	none
dmesg after failed resume	none
Add mb() call to radeon_gart_table_vram_pin	none
Add mb() and radeon_gart_tlb_flush() calls to radeon_gart_table_vram_pin	none
Keep GART table mapped across suspend/resume	none
dmesg with radeon-suspend-gart-table.diff	none
Split off separate radeon_gart_get_page_entry ASIC hook	none
Reinstate radeon_gart_restore for when the GART table is in VRAM	none

Description jbart 2014-11-14 01:51:57 UTC

Created attachment 109443 [details]
corrupt display

After resuming from suspend on kernel 3.18-rc4 the graphics card locks up:

[55325.256071] [drm] enabling PCIE gen 2 link speeds, disable with radeon.pcie_gen2=0
[55325.261268] [drm] PCIE GART of 1024M enabled (table at 0x000000000025E000).
[55325.261400] radeon 0000:03:00.0: WB enabled
[55325.261402] radeon 0000:03:00.0: fence driver on ring 0 use gpu addr 0x0000000040000c00 and cpu addr 0xffff8800d7caac00
[55325.261404] radeon 0000:03:00.0: fence driver on ring 3 use gpu addr 0x0000000040000c0c and cpu addr 0xffff8800d7caac0c
[55325.262271] radeon 0000:03:00.0: fence driver on ring 5 use gpu addr 0x000000000005c418 and cpu addr 0xffffc9000581c418
[55325.437301] [drm:r600_ring_test] *ERROR* radeon: ring 0 test failed (scratch(0x8504)=0xCAFEDEAD)
[55325.437302] [drm:evergreen_resume] *ERROR* evergreen startup failed on resume
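
For anyone collecting logs for this bug, a minimal sketch of a scanner that picks the two failure lines out of dmesg output may be useful. The function name `scan_dmesg` and the result layout are made up for illustration; only the message formats come from the log above. To the best of my knowledge the radeon ring test seeds the scratch register with 0xCAFEDEAD and expects the GPU to overwrite it, so an unchanged 0xCAFEDEAD suggests the GPU never executed the test packet.

```python
import re

# Patterns derived from the log lines quoted in this report.
RING_FAIL = re.compile(
    r"\[drm:(?P<func>\w+)\] \*ERROR\* radeon: ring (?P<ring>\d+) test failed "
    r"\(scratch\((?P<reg>0x[0-9A-Fa-f]+)\)=(?P<value>0x[0-9A-Fa-f]+)\)"
)
RESUME_FAIL = re.compile(
    r"\[drm:(?P<func>\w+_resume)\] \*ERROR\* .*startup failed on resume"
)


def scan_dmesg(text):
    """Return a list of (kind, details) tuples, one per failure line found."""
    findings = []
    for line in text.splitlines():
        match = RING_FAIL.search(line)
        if match:
            findings.append((
                "ring_test_failed",
                {
                    "ring": int(match.group("ring")),
                    "register": match.group("reg"),
                    "value": match.group("value"),
                },
            ))
            continue
        match = RESUME_FAIL.search(line)
        if match:
            findings.append(("resume_failed", {"function": match.group("func")}))
    return findings
```

Feeding it the saved output of `dmesg` after a resume gives an empty list when the ring tests passed and one entry per failure line otherwise.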

Comment 1 Michel Dänzer 2014-11-14 02:17:22 UTC

Is this a regression, i.e. did it work with older kernels? If so, can you bisect?

Comment 2 jbart 2014-11-17 17:40:21 UTC

Hi. I'm unable to reproduce this issue again on a more recent kernel build (3.18.0-0.rc4.git0.2). I'll just mark it worksforme.

Comment 3 jbart 2014-11-18 20:46:54 UTC

I rescind my previous comment; the same issue has happened again. The bug is sporadic, and I can't reliably reproduce it yet.

As for bisection I can definitely give it a shot, but it's pointless until I can reproduce the issue at will...

Comment 4 Michel Dänzer 2014-11-19 06:40:27 UTC

(In reply to jbart from comment #3)
> As for bisection I can definitely give it a shot, but it's pointless until I
> can reproduce the issue at will...

Not really, if you can get a feeling for how many times you need to test before you can be confident that the problem is not present.

Anyway, were you using older kernels before where the problem never occurred? What was the newest kernel where that was the case?
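
The "how many times do I need to test" question above can be put into numbers. This is only a rough sketch: it assumes each suspend/resume cycle fails independently with the same probability, which may not hold for this bug, and the per-cycle failure probability is a value you have to estimate yourself. The function name is made up for illustration.

```python
import math


def clean_runs_needed(p_fail, confidence):
    """Smallest n for which n clean cycles in a row would occur with
    probability at most (1 - confidence) if the bug were present.

    p_fail:     assumed probability that one cycle fails on a bad kernel
    confidence: desired confidence that a kernel is good, e.g. 0.95
    """
    if not 0.0 < p_fail < 1.0:
        raise ValueError("p_fail must be strictly between 0 and 1")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    # Solve (1 - p_fail) ** n <= 1 - confidence for n.
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p_fail))
```

With an assumed failure rate of 1 in 10 cycles, `clean_runs_needed(0.1, 0.95)` gives 29 clean cycles before calling a bisection step "good" with 95% confidence; at 1 in 2 it drops to 5.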

Comment 5 jbart 2014-11-23 13:39:54 UTC

I've seen a similar issue at least in the last, i.e. 3.17-rc cycle. Because it seemed to stop occurring in later rcs and stable releases I didn't investigate further. Because I don't recall such issues before, yes, I would say that this behaviour is a regression.

I've also experienced this issue with the latest rc on rawhide, 3.18.0-0.rc5.git0.2.fc22.x86_64.

After the GPU fails to resume I need to kill X. A new session says it is being rendered by llvmpipe, a state that only a full reboot clears.

Since the issue is sporadic (it happens about 2 or 3 times a week) I haven't started bisecting since "getting a good feeling" would really take too long. I haven't noticed that any open apps trigger the bug.

Comment 6 Ismael 2014-12-01 16:57:48 UTC

I get the error every time I suspend the machine, and 3D acceleration stays disabled until I reboot.
I am using an HD5770 running 3.17.4-1-ck. Both Mesa and X are at their latest stable versions (I use Arch).

Nov 30 14:49:32 boreal kernel: [drm] PCIE GART of 1024M enabled (table at 0x000000000025D000).
Nov 30 14:49:32 boreal kernel: radeon 0000:01:00.0: WB enabled
Nov 30 14:49:32 boreal kernel: radeon 0000:01:00.0: fence driver on ring 0 use gpu addr 0x0000000040000c00 and cpu addr 
Nov 30 14:49:32 boreal kernel: radeon 0000:01:00.0: fence driver on ring 3 use gpu addr 0x0000000040000c0c and cpu addr 
Nov 30 14:49:32 boreal kernel: radeon 0000:01:00.0: fence driver on ring 5 use gpu addr 0x000000000005c418 and cpu addr 
Nov 30 14:49:32 boreal kernel: [drm:r600_ring_test] *ERROR* radeon: ring 0 test failed (scratch(0x8504)=0xCAFEDEAD)
Nov 30 14:49:32 boreal kernel: [drm:evergreen_resume] *ERROR* evergreen startup failed on resume

Comment 7 stefanscheffler 2014-12-02 01:48:39 UTC

Same here with a Radeon HD 5670 512MB GDDR5 using Arch Linux. It also happens without starting X.

This error only shows up when the card successfully resumes, not when it fails:

  [drm:rv770_stop_dpm] *ERROR* Could not force DPM to low.

Disabling DPM seems to prevent the startup failures.

Another observation: according to /proc/interrupts the card generates thousands of interrupts per second after a failed resume, which might explain why the system feels extremely sluggish afterwards.
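
To check for such an interrupt storm, the rate can be estimated from two snapshots of /proc/interrupts. The following is a minimal sketch; the function names are made up for illustration, and it assumes the usual layout (a header row of CPU columns, then one row per interrupt with per-CPU counts followed by a description that contains the driver name).

```python
import time


def irq_counts(proc_interrupts_text, name):
    """Sum the per-CPU counts of every row whose description mentions `name`."""
    lines = proc_interrupts_text.splitlines()
    if not lines:
        return 0
    num_cpus = len(lines[0].split())  # header row: CPU0 CPU1 ...
    total = 0
    for line in lines[1:]:
        _label, sep, rest = line.partition(":")
        if not sep or name not in rest:
            continue
        fields = rest.split()
        # The first num_cpus fields are the per-CPU counters.
        total += sum(int(f) for f in fields[:num_cpus] if f.isdigit())
    return total


def irq_rate(before_text, after_text, seconds, name="radeon"):
    """Interrupts per second for `name` between two snapshots."""
    delta = irq_counts(after_text, name) - irq_counts(before_text, name)
    return delta / seconds


def sample_rate(seconds=2.0, name="radeon", path="/proc/interrupts"):
    """Take two live snapshots `seconds` apart (Linux only)."""
    with open(path) as handle:
        before = handle.read()
    time.sleep(seconds)
    with open(path) as handle:
        after = handle.read()
    return irq_rate(before, after, seconds, name)
```

Calling `sample_rate()` once after a good resume and once after a failed one should make the difference obvious if the storm is present.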


Bisecting brought me to this commit. It's from early in the 3.17 cycle. I thought this was happening with 3.16 already, but maybe I remember wrong.

commit a3eb06dbca08e3fdad7039021ae03b46b215f22a
Author: Michel Dänzer <michel.daenzer@amd.com>
Date:   Wed Jul 9 20:15:42 2014 +0200

    drm/radeon: Remove radeon_gart_restore()
    
    Doesn't seem necessary, the GART table memory should be persistent.
    
    Signed-off-by: Michel Dänzer <michel.daenzer@amd.com>
    Reviewed-by: Christian König <christian.koenig@amd.com>
    Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
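
Why removing a restore step could matter can be illustrated with a toy model. This is not the radeon code: the class and attribute names are hypothetical, and it simply assumes that a page table kept in VRAM is not guaranteed to survive a suspend, so it has to be rewritten from a CPU-side record on resume.

```python
class ToyGart:
    """Toy model of a GART page table stored in volatile VRAM."""

    def __init__(self, num_entries):
        self.vram_table = [None] * num_entries  # what the GPU would read
        self.cpu_shadow = [None] * num_entries  # CPU-side record of bindings

    def bind(self, index, dma_addr):
        """Map GART page `index` to `dma_addr` in both copies."""
        self.cpu_shadow[index] = dma_addr
        self.vram_table[index] = dma_addr

    def suspend(self):
        """Assumption of this model: VRAM contents are lost on suspend."""
        self.vram_table = [None] * len(self.vram_table)

    def restore(self):
        """Rewrite every bound entry from the CPU-side record."""
        for index, dma_addr in enumerate(self.cpu_shadow):
            if dma_addr is not None:
                self.vram_table[index] = dma_addr

    def translate(self, index):
        """Look up a page the way the GPU would, via the VRAM table."""
        dma_addr = self.vram_table[index]
        if dma_addr is None:
            raise LookupError("GART entry %d is not mapped" % index)
        return dma_addr
```

In this model a lookup after `suspend()` fails until `restore()` has run, which loosely mirrors a ring test that cannot reach its buffers after resume.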

Comment 8 stefanscheffler 2014-12-02 01:52:41 UTC

Created attachment 110335 [details]
dmesg after failed resume

Dmesg after a few suspend/resumes. Card failed to start up after the last one.

Comment 9 Michel Dänzer 2014-12-02 06:57:42 UTC

Created attachment 110348 [details] [review]
Add mb() call to radeon_gart_table_vram_pin

Does this patch help?

Comment 10 stefanscheffler 2014-12-02 11:18:17 UTC

No, doesn't make a difference. Tested with 3.17.4.

The "interrupt storm" I mentioned didn't happen this time. I checked with a patched and an unpatched kernel. Either it was just a one-time thing or it only happened with older kernels.

Comment 11 Michel Dänzer 2014-12-08 06:32:27 UTC

Created attachment 110555 [details] [review]
Add mb() and radeon_gart_tlb_flush() calls to radeon_gart_table_vram_pin

How about this patch?

Comment 12 stefanscheffler 2014-12-08 14:55:26 UTC

Doesn't help. Tested with 3.17.4 and 3.18.0.

Comment 13 Michel Dänzer 2014-12-26 10:01:37 UTC

Created attachment 111362 [details] [review]
Keep GART table mapped across suspend/resume

Does this patch help?

Note that it's just an incomplete proof of concept, but it should at least help narrow down the problem.

Comment 14 stefanscheffler 2014-12-27 11:43:57 UTC

Created attachment 111395 [details]
dmesg with radeon-suspend-gart-table.diff

On 3.18.1 the patch applied with some offsets and didn't help. I'm getting those NULL pointer warnings now though.

Comment 15 Michel Dänzer 2015-01-20 10:11:14 UTC

Created attachment 112527 [details] [review]
Split off separate radeon_gart_get_page_entry ASIC hook

Comment 16 Michel Dänzer 2015-01-20 10:12:05 UTC

Created attachment 112528 [details] [review]
Reinstate radeon_gart_restore for when the GART table is in VRAM

Do these two patches help?

Comment 17 Ismael 2015-01-22 19:23:16 UTC

(In reply to Michel Dänzer from comment #16)
> Created attachment 112528 [details] [review]
> Reinstate radeon_gart_restore for when the GART table is in VRAM
> 
> Do these two patches help?

Yes it helps. Suspend and resume work perfectly now.

Xan 22 21:20:20 boreal kernel: [drm] enabling PCIE gen 2 link speeds, disable with radeon.pcie_gen2=0
Xan 22 21:20:20 boreal kernel: [drm] PCIE GART of 1024M enabled (table at 0x000000000025E000).
Xan 22 21:20:20 boreal kernel: radeon 0000:01:00.0: WB enabled
Xan 22 21:20:20 boreal kernel: radeon 0000:01:00.0: fence driver on ring 0 use gpu addr 0x0000000040000c00 and cpu addr 0xffff88041693cc00
Xan 22 21:20:20 boreal kernel: radeon 0000:01:00.0: fence driver on ring 3 use gpu addr 0x0000000040000c0c and cpu addr 0xffff88041693cc0c
Xan 22 21:20:20 boreal kernel: radeon 0000:01:00.0: fence driver on ring 5 use gpu addr 0x000000000005c418 and cpu addr 0xffffc9000471c418
Xan 22 21:20:20 boreal kernel: [drm] ring test on 0 succeeded in 1 usecs
Xan 22 21:20:20 boreal kernel: [drm] ring test on 3 succeeded in 2 usecs
Xan 22 21:20:20 boreal kernel: usb 6-1.6.2: reset high-speed USB device number 5 using ehci-pci
Xan 22 21:20:20 boreal kernel: [drm] ring test on 5 succeeded in 1 usecs
Xan 22 21:20:20 boreal kernel: [drm] UVD initialized successfully.
Xan 22 21:20:20 boreal kernel: [drm] ib test on ring 0 succeeded in 0 usecs
Xan 22 21:20:20 boreal kernel: [drm] ib test on ring 3 succeeded in 0 usecs

Comment 18 Michel Dänzer 2015-01-23 03:31:17 UTC

The fix is queued up in Alex Deucher's -fixes tree.
