Summary: | R6xx freezes with kernel 3.17 and up | ||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|
Product: | DRI | Reporter: | Kajzer <kap3tan> | ||||||||
Component: | DRM/Radeon | Assignee: | Default DRI bug account <dri-devel> | ||||||||
Status: | RESOLVED FIXED | QA Contact: | |||||||||
Severity: | normal | ||||||||||
Priority: | medium | CC: | ckoenig.leichtzumerken, dabreese00, fedja.beader, f.pinamartins, laurento.frittella, nicolamori | ||||||||
Version: | unspecified | ||||||||||
Hardware: | x86-64 (AMD64) | ||||||||||
OS: | Linux (All) | ||||||||||
Whiteboard: | |||||||||||
Description
Kajzer
2015-07-08 12:15:00 UTC
Quote from another thread, where this bug was first reported:

(In reply to Michel Dänzer from comment #273)
> Please run a kernel built from commit
> 77497f2735ad6e29c55475e15e9790dbfa2c2ef8 (the commit before
> 02376d8282b88f07d0716da6155094c8760b1a13) for at least a few days to make
> sure it doesn't happen with that.

After a few days I can safely say that this kernel runs great; I had no hangs.

I made a patch using git show and applied it to the last known good kernel, 3.16.7. I guess that's one way to find out whether this commit is the real culprit. The trouble is that the kernel won't compile now:

    CC [M] drivers/gpu/drm/radeon/radeon_object.o
    drivers/gpu/drm/radeon/radeon_object.c: In function ‘radeon_ttm_placement_from_domain’:
    drivers/gpu/drm/radeon/radeon_object.c:117:20: error: ‘RADEON_GEM_GTT_UC’ undeclared (first use in this function)
      if (rbo->flags & RADEON_GEM_GTT_UC) {
    drivers/gpu/drm/radeon/radeon_object.c:117:20: note: each undeclared identifier is reported only once for each function it appears in
    drivers/gpu/drm/radeon/radeon_object.c:119:28: error: ‘RADEON_GEM_GTT_WC’ undeclared (first use in this function)
      } else if ((rbo->flags & RADEON_GEM_GTT_WC) ||
    drivers/gpu/drm/radeon/radeon_object.c: In function ‘radeon_bo_create’:
    drivers/gpu/drm/radeon/radeon_object.c:198:18: error: ‘RADEON_GEM_GTT_WC’ undeclared (first use in this function)
      bo->flags &= ~(RADEON_GEM_GTT_WC | RADEON_GEM_GTT_UC);
    drivers/gpu/drm/radeon/radeon_object.c:198:38: error: ‘RADEON_GEM_GTT_UC’ undeclared (first use in this function)
      bo->flags &= ~(RADEON_GEM_GTT_WC | RADEON_GEM_GTT_UC);
    make[5]: *** [drivers/gpu/drm/radeon/radeon_object.o] Error 1

I made the patch with

    git show 02376d8282b88f07d0716da6155094c8760b1a13 > badcommit.patch

and it applied cleanly with no errors. I'm out of moves now. Is there any other way to either add this commit to 3.16 or take it out of 3.17?

Created attachment 117089: disable uc/wc

The attached patch will disable uncached mappings.
(In reply to Alex Deucher from comment #4)
> Created attachment 117089
> disable uc/wc
>
> The attached patch will disable uncached mappings.

Thanks Alex! I've patched kernel 3.18.8 and I'm running it right now. I'll see what happens; hopefully it won't hang! :)

Please attach the output of dmesg, including all the drm/radeon initialization messages.

Created attachment 117136: dmesg output
(In reply to Michel Dänzer from comment #6)
> Please attach the output of dmesg, including all the drm/radeon
> initialization messages.

I suspect you need one from when the hang happens. I'm trying really hard to make it hang with the patch from Alex, but it seems that patch did the trick: there are no more hangs. I'll keep trying, just to be sure, although it should have happened by now. Anyway, if you need dmesg from when the bug happens, I'll get that later. For now, here's the current one, with no hangs: https://bugs.freedesktop.org/attachment.cgi?id=117136

(In reply to Kajzer from comment #8)
> I suspect you need one when hang happens,

No, as I said, I'm mostly interested in the initialization messages.

> I'm trying really hard to make it hang with the patch from Alex but it seems
> that patch did the trick, there are no more hangs.

That's expected. Alex's patch isn't a fix; it's just meant to confirm that the problem is really directly related to write-combined CPU mappings.

(In reply to Michel Dänzer from comment #9)
> That's expected. Alex's patch isn't a fix but just to confirm the problem is
> really directly related to write-combined CPU mappings.

Yeah, I know, that's what I really asked for: a way to disable that commit. I can now confirm that there is indeed a bug in that commit (with R6xx chips); I had no hangs with the mappings disabled. I'm willing to test potential fixes.

Created attachment 117172: Disable uc/wc on anything older than R7xx

Considering how old the hardware is, I suggest that we just disable that feature for anything older than R7XX. A patch doing exactly this is attached.

Just to be clear, does this bug only happen when you force dpm on, or all the time?

(In reply to Alex Deucher from comment #12)
> Just to be clear, does this bug only happen when you force dpm on or all the
> time?

If I don't set performance to high, then it hangs all the time (not just in gaming), and I can provoke it within minutes, regardless of kernel version. This bug (CPU mappings) happens only while playing games and with kernels above 3.16.

So, will this bug happen if I don't force performance to high? To be honest, I don't know. It's been a while since I was on anything other than high, because the other bug would certainly happen, and they behave the same when the hang happens. So I guess it would hang if I don't force it, unless there were some kind of mappings in the kernel before 3.17 and both bugs are somehow related. That I don't know.

(In reply to Kajzer from comment #13)
> If I don't set performance to high then it hangs all the time (not just in
> gaming) and I can provoke it within minutes, regardless of kernel version.
> This bug (CPU mappings) happens only while playing games and with kernels
> above 3.16
> So, will this bug happen if I don't force performance to high ?
> To be honest I don't know, been a while since I was on anything else other
> than high, because for sure the other bug would happen, and they behave the
> same when the hang happens.
> So I guess it would hang if I don't force it.
> Except maybe if there were some kind of mappings in the kernel before 3.17
> and that somehow both bugs are related.
> That I don't know.

Do you see this bug if you don't enable dpm at all (which is the default)?

(In reply to Alex Deucher from comment #14)
> Do you see this bug if you don't enable dpm at all (which is the default)?

Ah, I get you now... I don't know. There's no point in my even being on Linux without dpm, but if you think testing that would settle some things, then I guess I can try it. I'll let you know.

(In reply to Kajzer from comment #15)
> (In reply to Alex Deucher from comment #14)
> > Do you see this bug if you don't enable dpm at all (which is the default)?
>
> Ah I get you now... I don't know, there's no point for me to even be on
> Linux without dpm, but if you think that testing that would solve some
> things then I guess I can try that. I'll let you know.

Yes, please test.

(In reply to Alex Deucher from comment #16)
> (In reply to Kajzer from comment #15)
> > (In reply to Alex Deucher from comment #14)
> > > Do you see this bug if you don't enable dpm at all (which is the default)?
> >
> > Ah I get you now... I don't know, there's no point for me to even be on
> > Linux without dpm, but if you think that testing that would solve some
> > things then I guess I can try that. I'll let you know.
>
> Yes, please test.

I just did, and it happened fast, 20 minutes after the game started. So the answer is yes: I see this bug when dpm is disabled.

(In reply to Christian König from comment #11)
> Considering how old the hardware is I suggest that we just disable that
> feature for anything older than R7XX.

fglrx was already using write-combined CPU mappings with the very first PCIe GPUs (RV3xx), so I don't think it's that simple.

I was hoping that we'd find something in the dmesg output to key a quirk off, but since we can't seem to get that, maybe this is the best we can do for now. :(

(In reply to Michel Dänzer from comment #18)
> I was hoping that we'd find something to key off a quirk in the dmesg
> output, but since we can't seem to get that, maybe this is the best we can
> do for now. :(

Oops, sorry, I totally missed that the dmesg output is here already. :) Nothing in particular jumps out at me, though.

This patch seems to work (for an hour so far) on 4.0.8 + Gentoo + grsecurity.

For me, the screen froze with the graphics still visible. The game was still running in the background (I heard sounds, and it spewed errors in the console), and I had full ssh access. In another game, the screen turned black and white, with something that looked like missing textures, but I could still interact with it. This happened on both 3.18.9 + Gentoo + grsecurity and the above-mentioned 4.0.8. Mesa is at 10.3.

lspci:

    VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] RV620/M82 [Mobility Radeon HD 3450/3470]

    [ 3936.443037] radeon 0000:01:00.0: ring 0 stalled for more than 10273msec
    [ 3936.443046] radeon 0000:01:00.0: GPU lockup (current fence id 0x0000000000050ded last fence id 0x0000000000050df3 on ring 0)
    [ 3936.450174] radeon 0000:01:00.0: Saved 185 dwords of commands on ring 0.
    [ 3936.450191] radeon 0000:01:00.0: GPU softreset: 0x00000008
    [ 3936.450197] radeon 0000:01:00.0: R_008010_GRBM_STATUS = 0xA0003030
    [ 3936.450202] radeon 0000:01:00.0: R_008014_GRBM_STATUS2 = 0x00000003
    [ 3936.450207] radeon 0000:01:00.0: R_000E50_SRBM_STATUS = 0x200000C0
    [ 3936.450212] radeon 0000:01:00.0: R_008674_CP_STALLED_STAT1 = 0x00000000
    [ 3936.450216] radeon 0000:01:00.0: R_008678_CP_STALLED_STAT2 = 0x00000000
    [ 3936.450221] radeon 0000:01:00.0: R_00867C_CP_BUSY_STAT = 0x00020186
    [ 3936.450226] radeon 0000:01:00.0: R_008680_CP_STAT = 0x80028645
    [ 3936.450231] radeon 0000:01:00.0: R_00D034_DMA_STATUS_REG = 0x44C83D57
    [ 3936.501715] radeon 0000:01:00.0: R_008020_GRBM_SOFT_RESET=0x00004001
    [ 3936.501773] radeon 0000:01:00.0: SRBM_SOFT_RESET=0x00000100
    [ 3936.503883] radeon 0000:01:00.0: R_008010_GRBM_STATUS = 0xA0003030
    [ 3936.503888] radeon 0000:01:00.0: R_008014_GRBM_STATUS2 = 0x00000003
    [ 3936.503893] radeon 0000:01:00.0: R_000E50_SRBM_STATUS = 0x200080C0
    [ 3936.503898] radeon 0000:01:00.0: R_008674_CP_STALLED_STAT1 = 0x00000000
    [ 3936.503903] radeon 0000:01:00.0: R_008678_CP_STALLED_STAT2 = 0x00000000
    [ 3936.503907] radeon 0000:01:00.0: R_00867C_CP_BUSY_STAT = 0x00000000
    [ 3936.503912] radeon 0000:01:00.0: R_008680_CP_STAT = 0x80100000
    [ 3936.503917] radeon 0000:01:00.0: R_00D034_DMA_STATUS_REG = 0x44C83D57
    [ 3936.503929] radeon 0000:01:00.0: GPU reset succeeded, trying to resume
    [ 3936.523106] [drm] PCIE GART of 512M enabled (table at 0x0000000000254000).
    [ 3936.523152] radeon 0000:01:00.0: WB enabled
    [ 3936.523160] radeon 0000:01:00.0: fence driver on ring 0 use gpu addr 0x0000000010000c00 and cpu addr 0xffff880074d72c00
    [ 3936.524373] radeon 0000:01:00.0: fence driver on ring 5 use gpu addr 0x00000000000521d0 and cpu addr 0xffffc900045921d0
    [ 3936.556287] [drm] ring test on 0 succeeded in 0 usecs
    [ 3936.732365] [drm] ring test on 5 succeeded in 1 usecs
    [ 3936.732375] [drm] UVD initialized successfully.
    [ 3946.943038] radeon 0000:01:00.0: ring 0 stalled for more than 10213msec
    [ 3946.943047] radeon 0000:01:00.0: GPU lockup (current fence id 0x0000000000050dee last fence id 0x0000000000050df3 on ring 0)
    [ 3946.956388] [drm:r600_ib_test] *ERROR* radeon: fence wait failed (-35).
    [ 3946.956396] [drm:radeon_ib_ring_tests] *ERROR* radeon: failed testing IB on GFX ring (-35).

(In reply to Kajzer from comment #13)
> If I don't set performance to high then it hangs all the time

It gave me that impression, yes.

Seeing as both Kajzer and Fedja Beader are using RV6xx GPUs, maybe we could just disable WC for those for now?

Still working fine with WC disabled; not a single crash since. Also, I wasn't able to notice any difference with WC disabled, in performance or anything else. Disabling WC on RV6xx is definitely a good thing.

I'm trying the attached patch to disable WC on my R6xx, and it seems to help here as well.

    01:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] RV620/M82 [Mobility Radeon HD 3450/3470]
    Linux mybox 4.2.1-custom #3 SMP PREEMPT Mon Oct 26 22:05:24 CET 2015 x86_64 GNU/Linux
    Debian stretch/sid

Fixed in https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=96ea47c0ec8c012509116bee8c57414281428fc4 and will be backported to stable kernel trees.

*** Bug 93911 has been marked as a duplicate of this bug. ***