91268 – R6xx freezes with kernel 3.17 and up

Status:	RESOLVED FIXED

Product:	DRI
Classification:	Unclassified
Component:	DRM/Radeon
Version:	unspecified
Hardware:	x86-64 (AMD64) Linux (All)

Importance:	medium normal
Assignee:	Default DRI bug account
Duplicates (1):	93911

Reported:	2015-07-08 12:15 UTC by Kajzer
Modified:	2016-03-22 15:54 UTC
CC List:	6 users

Attachments
disable uc/wc (609 bytes, patch) 2015-07-13 14:19 UTC, Alex Deucher
dmesg output (7.98 KB, text/plain) 2015-07-15 12:27 UTC, Kajzer
Disable uc/wc on anything older than R7xx (1.25 KB, patch) 2015-07-16 13:56 UTC, Christian König

Description Kajzer 2015-07-08 12:15:00 UTC

Something introduced in kernel 3.17 makes my GPU freeze while playing games. When it happens, the screen freezes for a few seconds, goes blank for a few seconds, then comes back with strange artifacts. The system is basically unresponsive and the only thing you can do is a hard reset.
With kernels below 3.17 this doesn't happen.
The Mesa version doesn't matter either.
Everything else is the same; just booting a different kernel makes the difference.
3.16 is good and 3.17 is bad (as is every kernel after 3.17).
With kernel 3.16 I can play games for days or weeks and the bug does not happen.
With 3.17 it can happen anywhere between 15 minutes and a few hours in.
I did a bisect and it produced this:

git bisect start '--' 'drivers/gpu/drm/radeon'
# good: [19583ca584d6f574384e17fe7613dfaeadcdc4a6] Linux 3.16
git bisect good 19583ca584d6f574384e17fe7613dfaeadcdc4a6
# bad: [bfe01a5ba2490f299e1d2d5508cbbbadd897bbe9] Linux 3.17
git bisect bad bfe01a5ba2490f299e1d2d5508cbbbadd897bbe9
# bad: [03f62abd112d5150b6ce8957fa85d4f6e85e357f] drm/radeon: split PT setup in more functions
git bisect bad 03f62abd112d5150b6ce8957fa85d4f6e85e357f
# bad: [391bfec33cd4e103274f197924d41ef648b849de] drm/radeon: remove visible vram size limit on bo allocation (v4)
git bisect bad 391bfec33cd4e103274f197924d41ef648b849de
# good: [da9976206c15178eeae1b4445c9266125bf35b0a] drm/radeon: enable display scaling on all connectors (v2)
git bisect good da9976206c15178eeae1b4445c9266125bf35b0a
# good: [380670aebfca998bb67b9cf05fc7f28ebeac4b18] drm/radeon: Demote 'BO allocation size too large' message to debug only
git bisect good 380670aebfca998bb67b9cf05fc7f28ebeac4b18
# bad: [02376d8282b88f07d0716da6155094c8760b1a13] drm/radeon: Allow write-combined CPU mappings of BOs in GTT (v2)
git bisect bad 02376d8282b88f07d0716da6155094c8760b1a13
# good: [77497f2735ad6e29c55475e15e9790dbfa2c2ef8] drm/radeon: Pass GART page flags to radeon_gart_set_page() explicitly
git bisect good 77497f2735ad6e29c55475e15e9790dbfa2c2ef8
# first bad commit: [02376d8282b88f07d0716da6155094c8760b1a13] drm/radeon: Allow write-combined CPU mappings of BOs in GTT (v2)

commit 02376d8282b88f07d0716da6155094c8760b1a13
Author: Michel Dänzer <michel.daenzer@amd.com>
Date:   Thu Jul 17 19:01:08 2014 +0900

    drm/radeon: Allow write-combined CPU mappings of BOs in GTT (v2)
    
    v2: fix rebase onto drm-fixes
    
    Signed-off-by: Michel Dänzer <michel.daenzer@amd.com>
    Reviewed-by: Christian König <christian.koenig@amd.com>
    Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

Currently I'm running a kernel built from the commit before the first bad one:
$ git reset --hard 77497f2735ad6e29c55475e15e9790dbfa2c2ef8
HEAD is now at 77497f2 drm/radeon: Pass GART page flags to radeon_gart_set_page() explicitly

to test it more thoroughly and see whether a hang occurs.

Comment 1 Kajzer 2015-07-11 19:20:01 UTC

Quote from the other thread where this bug was first discussed:

(In reply to Michel Dänzer from comment #273)
> Please run a kernel built from commit
> 77497f2735ad6e29c55475e15e9790dbfa2c2ef8 (the commit before
> 02376d8282b88f07d0716da6155094c8760b1a13) for at least a few days to make
> sure it doesn't happen with that.

After a few days I can safely say that this kernel runs great; I had no hangs.

Comment 2 Kajzer 2015-07-12 02:20:32 UTC

I made a patch using git show and applied it to the last known good kernel, 3.16.7.
I guess that's one way to find out whether this commit is the real culprit.

Comment 3 Kajzer 2015-07-12 12:37:41 UTC

The trouble is that the kernel won't compile now.

CC [M]  drivers/gpu/drm/radeon/radeon_object.o
drivers/gpu/drm/radeon/radeon_object.c: In function ‘radeon_ttm_placement_from_domain’:
drivers/gpu/drm/radeon/radeon_object.c:117:20: error: ‘RADEON_GEM_GTT_UC’ undeclared (first use in this function)
   if (rbo->flags & RADEON_GEM_GTT_UC) {
                    ^
drivers/gpu/drm/radeon/radeon_object.c:117:20: note: each undeclared identifier is reported only once for each function it appears in
drivers/gpu/drm/radeon/radeon_object.c:119:28: error: ‘RADEON_GEM_GTT_WC’ undeclared (first use in this function)
   } else if ((rbo->flags & RADEON_GEM_GTT_WC) ||
                            ^
drivers/gpu/drm/radeon/radeon_object.c: In function ‘radeon_bo_create’:
drivers/gpu/drm/radeon/radeon_object.c:198:18: error: ‘RADEON_GEM_GTT_WC’ undeclared (first use in this function)
   bo->flags &= ~(RADEON_GEM_GTT_WC | RADEON_GEM_GTT_UC);
                  ^
drivers/gpu/drm/radeon/radeon_object.c:198:38: error: ‘RADEON_GEM_GTT_UC’ undeclared (first use in this function)
   bo->flags &= ~(RADEON_GEM_GTT_WC | RADEON_GEM_GTT_UC);
                                      ^
make[5]: *** [drivers/gpu/drm/radeon/radeon_object.o] Error 1


I made the patch with git show 02376d8282b88f07d0716da6155094c8760b1a13 > badcommit.patch
It applied fine with no errors.

I'm out of moves now. Is there any other way to either add this commit to 3.16 or take it out of 3.17?
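
The errors above say that RADEON_GEM_GTT_UC and RADEON_GEM_GTT_WC are not declared anywhere the compiler can see, which suggests the commit relies on definitions that the 3.16.7 tree does not have. A minimal sketch for checking that before starting a build; the function name missing_defines and the plain "#define" search are assumptions made for illustration, not part of any kernel tooling:

```shell
#!/bin/sh
# Print every symbol (arguments 2..n) that has no "#define" anywhere
# under the directory given as argument 1.
# Hypothetical helper for illustration; not part of the kernel tree.
missing_defines() {
    tree=$1
    shift
    for sym in "$@"; do
        # -r: recurse, -q: quiet, -E: extended regular expression
        if ! grep -rqE "^[[:space:]]*#[[:space:]]*define[[:space:]]+${sym}([^A-Za-z0-9_]|\$)" -- "$tree"; then
            printf '%s\n' "$sym"
        fi
    done
}

# Example, run from the top of a kernel source tree:
# missing_defines include RADEON_GEM_GTT_UC RADEON_GEM_GTT_WC
```

Any symbol it prints would have to come from another commit that also needs to be carried over.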

Comment 4 Alex Deucher 2015-07-13 14:19:51 UTC

Created attachment 117089
disable uc/wc

The attached patch will disable uncached mappings.

Comment 5 Kajzer 2015-07-13 15:30:08 UTC

(In reply to Alex Deucher from comment #4)
> Created attachment 117089
> disable uc/wc
> 
> The attached patch will disable uncached mappings.

Thanks, Alex!
I've patched kernel 3.18.8 and I'm running it right now.
I'll see what happens; hopefully it won't hang! :)

Comment 6 Michel Dänzer 2015-07-15 03:30:52 UTC

Please attach the output of dmesg, including all the drm/radeon initialization messages.
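
One way to check that a saved dmesg capture still contains the driver's initialization lines before attaching it. This is a rough sketch: the function name radeon_lines and the match pattern are assumptions chosen for illustration, and the full, unfiltered dmesg is still what should be attached.

```shell
#!/bin/sh
# Print only the lines of a kernel log (read from stdin) that carry a
# "[drm..." tag or mention radeon. Hypothetical helper for illustration.
radeon_lines() {
    grep -E '\[drm|radeon'
}

# Example: dmesg | radeon_lines | head
```

If this prints nothing, the boot messages have probably already rotated out of the kernel ring buffer.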

Comment 7 Kajzer 2015-07-15 12:27:54 UTC

Created attachment 117136
dmesg output

Comment 8 Kajzer 2015-07-15 12:29:53 UTC

(In reply to Michel Dänzer from comment #6)
> Please attach the output of dmesg, including all the drm/radeon
> initialization messages.

I suspect you need one from when a hang happens. I'm trying really hard to make it hang with the patch from Alex, but it seems the patch did the trick: there are no more hangs. I'll keep trying, just to be sure, although it should have happened by now.

Anyway, if you need dmesg from when the bug happens I'll do that one later. For now here's the current one with no hangs: https://bugs.freedesktop.org/attachment.cgi?id=117136

Comment 9 Michel Dänzer 2015-07-16 03:52:06 UTC

(In reply to Kajzer from comment #8)
> I suspect you need one from when a hang happens,

No, as I said I'm mostly interested in the initialization messages.


> I'm trying really hard to make it hang with the patch from Alex, but it seems
> the patch did the trick: there are no more hangs.

That's expected. Alex's patch isn't a fix but just to confirm the problem is really directly related to write-combined CPU mappings.

Comment 10 Kajzer 2015-07-16 13:12:03 UTC

(In reply to Michel Dänzer from comment #9)
> That's expected. Alex's patch isn't a fix but just to confirm the problem is
> really directly related to write-combined CPU mappings.

Yeah, I know; that's what I was really asking for, a way to disable that commit.
I can now confirm that there is indeed some bug in that commit (with R6xx chips).
I had no hangs with the mappings disabled.
I'm willing to test potential fixes.

Comment 11 Christian König 2015-07-16 13:56:42 UTC

Created attachment 117172
Disable uc/wc on anything older than R7xx

Considering how old the hardware is I suggest that we just disable that feature for anything older than R7XX.

A patch doing exactly this is attached.

Comment 12 Alex Deucher 2015-07-16 13:58:29 UTC

Just to be clear, does this bug only happen when you force dpm on or all the time?

Comment 13 Kajzer 2015-07-16 14:37:26 UTC

(In reply to Alex Deucher from comment #12)
> Just to be clear, does this bug only happen when you force dpm on or all the
> time?

If I don't set performance to high then it hangs all the time (not just in gaming) and I can provoke it within minutes, regardless of kernel version.
This bug (CPU mappings) happens only while playing games and only with kernels above 3.16.
So, will this bug happen if I don't force performance to high?
To be honest I don't know. It's been a while since I ran anything other than high, because the other bug would certainly happen, and the two behave the same when the hang occurs.
So I guess it would hang if I don't force it.
Unless maybe there were some kind of mappings in the kernel before 3.17 and the two bugs are somehow related.
That I don't know.
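
For reference, forcing the performance level mentioned in this comment is done through sysfs with the radeon driver. A small sketch, assuming the card is card0 and that the file is power_dpm_force_performance_level under its device directory (check the path on your system; the function names are made up for illustration):

```shell
#!/bin/sh
# Path is an assumption: first card, radeon DPM sysfs interface.
DPM_LEVEL_FILE=${DPM_LEVEL_FILE:-/sys/class/drm/card0/device/power_dpm_force_performance_level}

# Print the currently forced performance level (auto, low or high).
dpm_level() {
    cat "$DPM_LEVEL_FILE"
}

# Set the forced performance level; writing the real file needs root.
set_dpm_level() {
    case $1 in
        auto|low|high) printf '%s\n' "$1" > "$DPM_LEVEL_FILE" ;;
        *) echo "usage: set_dpm_level auto|low|high" >&2; return 1 ;;
    esac
}

# Example (as root): set_dpm_level high && dpm_level
```

Recording the output of dpm_level next to each test run makes it easier to tell which of the two hangs is being reproduced.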

Comment 14 Alex Deucher 2015-07-16 14:43:12 UTC

(In reply to Kajzer from comment #13)
> If I don't set performance to high then it hangs all the time (not just in
> gaming) and I can provoke it within minutes, regardless of kernel version.
> This bug (CPU mappings) happens only while playing games and only with
> kernels above 3.16.
> So, will this bug happen if I don't force performance to high?
> To be honest I don't know. It's been a while since I ran anything other
> than high, because the other bug would certainly happen, and the two behave
> the same when the hang occurs.
> So I guess it would hang if I don't force it.
> Unless maybe there were some kind of mappings in the kernel before 3.17
> and the two bugs are somehow related.
> That I don't know.

Do you see this bug if you don't enable dpm at all (which is the default)?

Comment 15 Kajzer 2015-07-16 14:49:15 UTC

(In reply to Alex Deucher from comment #14)
> Do you see this bug if you don't enable dpm at all (which is the default)?

Ah I get you now... I don't know, there's no point for me to even be on Linux without dpm, but if you think that testing that would solve some things then I guess I can try that. I'll let you know.

Comment 16 Alex Deucher 2015-07-16 15:03:52 UTC

(In reply to Kajzer from comment #15)
> (In reply to Alex Deucher from comment #14)
> > Do you see this bug if you don't enable dpm at all (which is the default)?
> 
> Ah I get you now... I don't know, there's no point for me to even be on
> Linux without dpm, but if you think that testing that would solve some
> things then I guess I can try that. I'll let you know.

Yes, please test.

Comment 17 Kajzer 2015-07-16 16:15:00 UTC

(In reply to Alex Deucher from comment #16)
> (In reply to Kajzer from comment #15)
> > (In reply to Alex Deucher from comment #14)
> > > Do you see this bug if you don't enable dpm at all (which is the default)?
> > 
> > Ah I get you now... I don't know, there's no point for me to even be on
> > Linux without dpm, but if you think that testing that would solve some
> > things then I guess I can try that. I'll let you know.
> 
> Yes, please test.

I just did and it happened fast, 20 minutes after the game started.
So the answer is yes, I see this bug when dpm is disabled.

Comment 18 Michel Dänzer 2015-07-17 03:06:53 UTC

(In reply to Christian König from comment #11)
> Considering how old the hardware is I suggest that we just disable that
> feature for anything older than R7XX.

fglrx was already using write-combined CPU mappings with the very first PCIe GPUs (RV3xx), so I don't think it's that simple.

I was hoping that we'd find something to key off a quirk in the dmesg output, but since we can't seem to get that, maybe this is the best we can do for now. :(

Comment 19 Michel Dänzer 2015-07-17 03:30:37 UTC

(In reply to Michel Dänzer from comment #18)
> I was hoping that we'd find something to key off a quirk in the dmesg
> output, but since we can't seem to get that, maybe this is the best we can
> do for now. :(

Oops, sorry, I totally missed that the dmesg output is here already. :) Nothing in particular jumps out at me though.

Comment 20 Fedja Beader 2015-07-24 19:38:21 UTC

This patch seems (for 1h now) to work on 4.0.8 + Gentoo + grsecurity

For me, the screen froze with the graphics still visible. Additionally, the game was still running in the background (I heard sounds and it spewed errors to the console) and I had full ssh access. In another game the screen turned black and white, plus something that looked like missing textures, but I could still interact with it.

It happened on both 3.18.9 + Gentoo + grsecurity and the above-mentioned 4.0.8.
Mesa is at 10.3.

lspci:
VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] RV620/M82 [Mobility Radeon HD 3450/3470]


[ 3936.443037] radeon 0000:01:00.0: ring 0 stalled for more than 10273msec
[ 3936.443046] radeon 0000:01:00.0: GPU lockup (current fence id 0x0000000000050ded last fence id 0x0000000000050df3 on ring 0)
[ 3936.450174] radeon 0000:01:00.0: Saved 185 dwords of commands on ring 0.
[ 3936.450191] radeon 0000:01:00.0: GPU softreset: 0x00000008
[ 3936.450197] radeon 0000:01:00.0:   R_008010_GRBM_STATUS      = 0xA0003030
[ 3936.450202] radeon 0000:01:00.0:   R_008014_GRBM_STATUS2     = 0x00000003
[ 3936.450207] radeon 0000:01:00.0:   R_000E50_SRBM_STATUS      = 0x200000C0
[ 3936.450212] radeon 0000:01:00.0:   R_008674_CP_STALLED_STAT1 = 0x00000000
[ 3936.450216] radeon 0000:01:00.0:   R_008678_CP_STALLED_STAT2 = 0x00000000
[ 3936.450221] radeon 0000:01:00.0:   R_00867C_CP_BUSY_STAT     = 0x00020186
[ 3936.450226] radeon 0000:01:00.0:   R_008680_CP_STAT          = 0x80028645
[ 3936.450231] radeon 0000:01:00.0:   R_00D034_DMA_STATUS_REG   = 0x44C83D57
[ 3936.501715] radeon 0000:01:00.0: R_008020_GRBM_SOFT_RESET=0x00004001
[ 3936.501773] radeon 0000:01:00.0: SRBM_SOFT_RESET=0x00000100
[ 3936.503883] radeon 0000:01:00.0:   R_008010_GRBM_STATUS      = 0xA0003030
[ 3936.503888] radeon 0000:01:00.0:   R_008014_GRBM_STATUS2     = 0x00000003
[ 3936.503893] radeon 0000:01:00.0:   R_000E50_SRBM_STATUS      = 0x200080C0
[ 3936.503898] radeon 0000:01:00.0:   R_008674_CP_STALLED_STAT1 = 0x00000000
[ 3936.503903] radeon 0000:01:00.0:   R_008678_CP_STALLED_STAT2 = 0x00000000
[ 3936.503907] radeon 0000:01:00.0:   R_00867C_CP_BUSY_STAT     = 0x00000000
[ 3936.503912] radeon 0000:01:00.0:   R_008680_CP_STAT          = 0x80100000
[ 3936.503917] radeon 0000:01:00.0:   R_00D034_DMA_STATUS_REG   = 0x44C83D57
[ 3936.503929] radeon 0000:01:00.0: GPU reset succeeded, trying to resume
[ 3936.523106] [drm] PCIE GART of 512M enabled (table at 0x0000000000254000).
[ 3936.523152] radeon 0000:01:00.0: WB enabled
[ 3936.523160] radeon 0000:01:00.0: fence driver on ring 0 use gpu addr 0x0000000010000c00 and cpu addr 0xffff880074d72c00
[ 3936.524373] radeon 0000:01:00.0: fence driver on ring 5 use gpu addr 0x00000000000521d0 and cpu addr 0xffffc900045921d0
[ 3936.556287] [drm] ring test on 0 succeeded in 0 usecs
[ 3936.732365] [drm] ring test on 5 succeeded in 1 usecs
[ 3936.732375] [drm] UVD initialized successfully.
[ 3946.943038] radeon 0000:01:00.0: ring 0 stalled for more than 10213msec
[ 3946.943047] radeon 0000:01:00.0: GPU lockup (current fence id 0x0000000000050dee last fence id 0x0000000000050df3 on ring 0)
[ 3946.956388] [drm:r600_ib_test] *ERROR* radeon: fence wait failed (-35).
[ 3946.956396] [drm:radeon_ib_ring_tests] *ERROR* radeon: failed testing IB on GFX ring (-35).
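
The log above shows the pattern to look for: a ring stall, then a "GPU lockup" line, then a soft reset attempt. A quick sketch for counting lockups in a saved log; the function name count_gpu_lockups is hypothetical and the pattern is taken from the lines above:

```shell
#!/bin/sh
# Count radeon "GPU lockup" lines in a kernel log read from stdin.
# "|| true" keeps the exit status at 0 when the count is zero.
count_gpu_lockups() {
    grep -c 'radeon .*GPU lockup' || true
}

# Example: dmesg | count_gpu_lockups
```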

Comment 21 Fedja Beader 2015-07-24 19:42:00 UTC

(In reply to Kajzer from comment #13)

> If I don't set performance to high then it hangs all the time

It gave me that impression, yes

Comment 22 Michel Dänzer 2015-08-04 10:00:48 UTC

Seeing as both Kajzer and Fedja Beader are using RV6xx GPUs, maybe we could just disable WC for those for now?
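
To see whether a machine falls into the RV6xx group discussed here, the chip name in the lspci output can be checked. This is a rough heuristic sketch: the function name is_rv6xx and the name pattern are assumptions for illustration, and this is not how the kernel driver identifies the chip family.

```shell
#!/bin/sh
# Succeed if any lspci line on stdin names a VGA controller whose chip
# name matches RV61x through RV67x. Heuristic, for illustration only.
is_rv6xx() {
    grep -qE 'VGA compatible controller.*RV6[1-7][0-9]'
}

# Example: lspci | is_rv6xx && echo "RV6xx GPU found"
```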

Comment 23 Kajzer 2015-08-04 14:33:28 UTC

Still working fine with WC disabled; not a single crash since.
Also, I haven't noticed any difference with WC disabled, in performance or otherwise.
Disabling WC on RV6xx is definitely a good thing.

Comment 24 Laurento Frittella 2015-10-29 20:34:01 UTC

I'm trying the attached patch to disable WC on my r6xx and it seems to help here as well.

01:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] RV620/M82 [Mobility Radeon HD 3450/3470]

Linux mybox 4.2.1-custom #3 SMP PREEMPT Mon Oct 26 22:05:24 CET 2015 x86_64 GNU/Linux

Debian stretch/sid

Comment 25 Michel Dänzer 2015-11-26 08:17:15 UTC

Fixed in https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=96ea47c0ec8c012509116bee8c57414281428fc4 , will get backported to stable kernel trees.

Comment 26 Michel Dänzer 2016-01-29 02:42:35 UTC

*** Bug 93911 has been marked as a duplicate of this bug. ***

Comment 27 David Breese 2016-03-22 15:54:42 UTC

*** Bug 93911 has been marked as a duplicate of this bug. ***
