Summary: | [r600g][RV670 HD3870] Ioquake games causes GPU lockup (waiting for 0x00003039 last fence id 0x00003030) | ||
---|---|---|---|
Product: | Mesa | Reporter: | Bryan Quigley <gquigs+bugs> |
Component: | Drivers/Gallium/r600 | Assignee: | Default DRI bug account <dri-devel> |
Status: | RESOLVED FIXED | QA Contact: | |
Severity: | major | ||
Priority: | medium | CC: | archon-123, dinolib, maraeo, myckel, rminkler |
Version: | git | ||
Hardware: | x86-64 (AMD64) | ||
OS: | Linux (All) | ||
Whiteboard: | |||
Attachments: | kern.log; syslog; Xorg log; weird screen; good+bad git bisects; possible fix; 3 outputs of syslog: before the patch, after, and after really bad; possible fix; possible fix; Possible fix for R600 hw deadlock; flush fix 1/4; flush fix 2/4; flush fix 3/4; flush fix 4/4; new attempt 1/5; new attempt 2/5; new attempt 3/5; new attempt 4/5; new attempt 5/5; simple fix; alternate simple fix; better alternative fix |
Description
Bryan Quigley
2012-06-03 14:23:01 UTC
Created attachment 62475 [details]
syslog
Created attachment 62476 [details]
Xorg log
Created attachment 62478 [details]
weird screen
Would it be possible to test the same thing, but with kernel 3.2? I'd like to know if we are experiencing the same problem that I reported some time ago.

Would it be possible to narrow down which component (kernel, ddx, or mesa) is causing the problem and bisect?

I'd guess it's a mesa issue.

I did test with a 3.2 kernel and the same 3.4 kernel, using the stable mesa/X/drivers that came with Precise. This did not cause a crash. I think I tested with 3.2 and the git mesa/X/drivers and that will cause the crash; I'll confirm tonight.

Just upgrading Mesa (which does pull in libdrm upgrades) causes the bug, even on the 3.2 kernel without Xorg/Drivers upgraded. I think this confirms it is a mesa bug.

(In reply to comment #7)
> I think this confirms it is a mesa bug.
Would be great if you could bisect mesa Git then.

I think I did everything right in this bisect (I didn't the first attempt).
fbebd431ec4e2e461a0cbcd5f3a04a000b8f6bbf is the first bad commit
commit fbebd431ec4e2e461a0cbcd5f3a04a000b8f6bbf
Author: Marek Olšák <maraeo@gmail.com>
Date: Fri Feb 3 05:05:31 2012 +0100

    r600g: move invariant register updates into start_cs for r6xx-r7xx

:040000 040000 dd9232a0c49e54e0cd536fa858dc131982dc2fbe 379e1d61c53d98a8706f32da5020dc22c0c0ee33 M src

Created attachment 62689 [details]
good+bad git bisects
Both the good and bad git bisect logs, the good one had me run warsow, padman, and urbanterror looking for the bug.
The bad one missed some occurrences it seems.
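Because the lockup does not show up on every run, a bisect step can wrongly mark a bad commit as good, which is one way a bisect like the one above can go astray. The sketch below is only a rough illustration: it assumes each run reproduces the lockup independently with some probability `p`, and `p` is a made-up parameter, not a value measured in this report. Under that assumption, a bad commit survives `n` clean runs with probability (1 - p)^n.

```python
import math


def miss_probability(p: float, runs: int) -> float:
    """Chance that a bad commit shows no lockup in `runs` independent runs.

    `p` is the assumed per-run probability of hitting the lockup.
    """
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    if runs < 0:
        raise ValueError("runs must be non-negative")
    return (1.0 - p) ** runs


def runs_needed(p: float, max_miss: float) -> int:
    """Smallest number of clean runs that keeps the miss chance <= max_miss."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    if not 0.0 < max_miss < 1.0:
        raise ValueError("max_miss must be in (0, 1)")
    if p == 1.0:
        return 1  # a single clean run is already conclusive
    return math.ceil(math.log(max_miss) / math.log(1.0 - p))


if __name__ == "__main__":
    # Hypothetical reproduction rates; replace with your own estimate.
    for p in (0.3, 0.5, 0.8):
        print(f"p={p}: miss chance after 9 runs = {miss_probability(p, 9):.4f}, "
              f"runs for <5% miss = {runs_needed(p, 0.05)}")
```

The point of the arithmetic is only that a commit should be marked good after several clean runs rather than one, with the number of runs depending on how often the lockup actually reproduces.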
Marek, any ideas? (bug 47116 might be related)

I think I know what's going on here. There's a hw bug on r6xx where you need to re-emit a CB register if some state further up the pipeline changes even if the CB state has not changed. I remember fixing it in r600c, but I can't find the commit... IIRC, the fix is to always re-emit a CB reg between draw calls if some other state changed.

(In reply to comment #11)
> Marek, any ideas? (bug 47116 might be related)
Sorry I've got none. All the regs were really invariant at the time I wrote the commit. A hardware bug like Alex suggested is one possible explanation...

Bug still occurs in git from yesterday. I'm willing to test patches or even do some basic programming (no graphics experience). I wasn't able to just revert the problem patch and am not sure which parts I should be trying to keep.

Created attachment 66040 [details] [review]
possible fix
Could you please try this patch?

The patch doesn't seem to work. It may have made the crash more likely to bring the system down, but I'd have to do more testing to confirm that. Attaching 3 syslog results in 1 file containing:
Before the patch
After the patch
After the patch - broke so much it needed a restart

Created attachment 66047 [details]
3 outputs of syslog: before the patch, after, and after really bad
Would any other output help debug this? Register dumps using avivotool?

Created attachment 71271 [details] [review]
possible fix
Does this patch help?

Nope, but the patch didn't work as is, so I changed it to:
rctx->framebuffer.atom.dirty = true;
Which may not be what the patch was actually trying to do...

(In reply to comment #21)
> Nope, but the patch didn't work as is, so I changed it to:
> rctx->framebuffer.atom.dirty = true;
>
> Which may not be what the patch was actually trying to do...
Your modification is correct. So did it work or not?

No, the new patch doesn't fix it either.

Some more info is in Bug 58058, because I think this is the same problem.

Created attachment 71346 [details] [review]
possible fix
Try this patch. It re-emits most of the invariant state at draw time. If it helps, please try commenting out (change the #if 1 to #if 0) each new section until you are able to trigger the lock ups again so we can narrow down which state needs to be re-emitted at draw time.

(In reply to comment #25)
> Created attachment 71346 [details] [review]
> possible fix
>
> Try this patch. It re-emits most of the invariant state at draw time. If
> it helps, please try commenting out (change the #if 1 to #if 0) each new
> section until you are able to trigger the lock ups again so we can narrow
> down which state needs to be re-emitted at draw time.
In my case it didn't help, although I had the impression it took longer before it hung (could also be random?). The 2nd time it took less time to lock up. Some observations: the screen gets distorted after one or more resets (wrong rendering order?). The screen keeps resetting, even after switching to the console (tty interface), until X is killed or shut down.

I confirm that patch 71346 didn't help either.

I get a similar lockup when starting Team Fortress 2 (native). It happens at startup so it's much easier to reproduce.
I've just put my rv670 (HD3850) card back in my AGP box and can reliably get etqw to lock after a few seconds with waiting for fence. My setup may be too different from the OP's for this to be relevant to this bug. Differences: AGP, 32 bit, running drm-fixes kernel, no writebacks, and my bisect came up with a commit postdating the original report. But for me:
1eedebc65b02130ef7a27062a1ed67972a317a08 is first bad commit
commit 1eedebc65b02130ef7a27062a1ed67972a317a08
Author: Marek Olšák <maraeo@gmail.com>
Date: Thu Nov 1 02:00:37 2012 +0100

    r600g: re-enable handling of DISCARD_RANGE, improving performance

    It seems to work for me now. Even the graphics corruption is gone.

    This also boosts performance in Reaction Quake.

Gives a reliable rv670 lock up with etqw. This is testing with mesa built with --disable-llvm (as R600_LLVM doesn't work at all on this card). It may (or may not) be worth anyone testing with mesa master to try resetting it to the commit before the one above, like:
make distclean
git clean -dfx
git reset --hard fa58644855e44830e0b91dc627703c236fa6712a

Andy,
How many times did you try it at that commit? I ask because I originally bisected it wrong because it didn't always reproduce consistently for me. (Would take >1 run)
I'll test it out though.

Looks like this was a separate issue - I've just managed to get openarena to lock GPU with mesa set to before r600g: re-enable handling of DISCARD_RANGE

(In reply to comment #31)
> Andy,
>
> How many times did you try it at that commit? I ask because I originally
> bisected it wrong because it didn't always reproduce consistently for me.
> (Would take >1 run)
>
> I'll test it out though.
I am still testing - for etqw it looks good, but as I just posted I can after some time get openarena to lock.

(In reply to comment #9)
> I think I did everything right in this bisect (I didn't the first attempt).
>
> fbebd431ec4e2e461a0cbcd5f3a04a000b8f6bbf is the first bad commit
> commit fbebd431ec4e2e461a0cbcd5f3a04a000b8f6bbf
> Author: Marek Olšák <maraeo@gmail.com>
> Date: Fri Feb 3 05:05:31 2012 +0100
>
> r600g: move invariant register updates into start_cs for r6xx-r7xx
>
> :040000 040000 dd9232a0c49e54e0cd536fa858dc131982dc2fbe
> 379e1d61c53d98a8706f32da5020dc22c0c0ee33 M src
This seems correct, I can get a lock after a few minutes on this commit, but have so far failed to lock on the one before it.

(In reply to comment #30)
> make distclean
> git clean -dfx
> git reset --hard fa58644855e44830e0b91dc627703c236fa6712a
Ok, did this and rebuilt everything, but the problem stays in my case.

I believe this bug is now triggered much faster (within 10 seconds of starting one of these games). But on the plus side it seems to usually just crash the game in question. (Running xorg-edgers (git) on Ubuntu Raring, kernel 3.8)
Excerpt from kern.log
346488] radeon 0000:01:00.0: GPU lockup CP stall for more than 10000msec
346502] radeon 0000:01:00.0: GPU lockup (waiting for 0x0000000000000ced last fence id 0x0000000000000cea)
347653] radeon 0000:01:00.0: Saved 121 dwords of commands on ring 0.
347664] radeon 0000:01:00.0: GPU softreset: 0x00000003
348246] radeon 0000:01:00.0: R_008010_GRBM_STATUS = 0xE7730130
348253] radeon 0000:01:00.0: R_008014_GRBM_STATUS2 = 0x00FF0103
348259] radeon 0000:01:00.0: R_000E50_SRBM_STATUS = 0x200000C0
348265] radeon 0000:01:00.0: R_008674_CP_STALLED_STAT1 = 0x02000000
348271] radeon 0000:01:00.0: R_008678_CP_STALLED_STAT2 = 0x00040804
348277] radeon 0000:01:00.0: R_00867C_CP_BUSY_STAT = 0x00028284
348283] radeon 0000:01:00.0: R_008680_CP_STAT = 0x80878645
348289] radeon 0000:01:00.0: R_008020_GRBM_SOFT_RESET=0x00007FEE
363165] radeon 0000:01:00.0: R_008020_GRBM_SOFT_RESET=0x00000001
378050] radeon 0000:01:00.0: R_008010_GRBM_STATUS = 0xA0003030
378056] radeon 0000:01:00.0: R_008014_GRBM_STATUS2 = 0x00000003
378062] radeon 0000:01:00.0: R_000E50_SRBM_STATUS = 0x200080C0
378068] radeon 0000:01:00.0: R_008674_CP_STALLED_STAT1 = 0x00000000
378074] radeon 0000:01:00.0: R_008678_CP_STALLED_STAT2 = 0x00000000
378079] radeon 0000:01:00.0: R_00867C_CP_BUSY_STAT = 0x00000000
378085] radeon 0000:01:00.0: R_008680_CP_STAT = 0x80100000
382501] radeon 0000:01:00.0: GPU reset succeeded, trying to resume
400412] [drm] probing gen 2 caps for device 1022:9603 = 2/0
400422] [drm] PCIE gen 2 link speeds already enabled
405272] [drm] PCIE GART of 512M enabled (table at 0x0000000000040000).
405369] radeon 0000:01:00.0: WB enabled
405379] radeon 0000:01:00.0: fence driver on ring 0 use gpu addr 0x0000000020000c00 and cpu addr 0xffdccc00
405388] radeon 0000:01:00.0: fence driver on ring 3 use gpu addr 0x0000000020000c0c and cpu addr 0xffdccc0c
436612] [drm] ring test on 0 succeeded in 0 usecs
436678] [drm] ring test on 3 succeeded in 1 usecs
438392] [drm] ib test on ring 0 succeeded in 0 usecs
438423] [drm] ib test on ring 3 succeeded in 1 usecs

End of an apitrace:
4700817 glClientActiveTextureARB(texture = GL_TEXTURE1)
4700818 glBindTexture(target = GL_TEXTURE_2D, texture = 0)
4700819 glActiveTextureARB(texture = GL_TEXTURE0)
4700820 glClientActiveTextureARB(texture = GL_TEXTURE0)
4700821 glBindTexture(target = GL_TEXTURE_2D, texture = 0)
4700822 glXMakeCurrent(dpy = 0xb7f7c80, drawable = 0, ctx = NULL) = True
4700823 glXDestroyContext(dpy = 0xb7f7c80, ctx = 0xb830e08)
4700339 glDrawElements(mode = GL_TRIANGLES, count = 18, type = GL_UNSIGNED_INT, indices = blob(72)) // incomplete

Some attempts from my side: I've been going back in the tree to see if I could find a point where it doesn't show this bug. I've gone back as far as the end of 2011, but it still locks up (although it seems that it takes more time). With my last check (early 2011) I was unable to build the code; it seems to be incompatible going back that far. I'll see if I can find the spot where I can build it again and test it.

@Myckel Habets in comment #39
What do you mean by a lot more time? I would test with 3 games, running 3 times each, automatically via the Phoronix Test Suite. With the latest git mesa does it crash very quickly for you?

Created attachment 75272 [details] [review]
Possible fix for R600 hw deadlock
Patch has been tested on a system with AMD K8 CPU and Radeon AGP card (AMD RV670 / Radeon HD 3850) with both 3.6.11-030611-generic kernel (from Ubuntu kernel PPA mainline) and kernel built from recent drm-fixes git in the testing. This patch may also be relevant to reported Bug 47116.
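For anyone comparing several kern.log excerpts like the one earlier in this thread, a small script can pull out the fence IDs from the lockup line and show which status registers changed across the soft reset. This is a minimal sketch, not a tool used by anyone in this report: the regular expressions are written against the excerpt quoted here and may not match other kernel versions, and the function names are made up for illustration. It deliberately does not interpret individual register bits.

```python
import re

# Patterns written against the log excerpt in this report (assumption:
# other kernel versions may format these messages differently).
LOCKUP_RE = re.compile(
    r"GPU lockup \(waiting for (0x[0-9a-fA-F]+) last fence id (0x[0-9a-fA-F]+)\)"
)
REGISTER_RE = re.compile(r"(R_[0-9A-F]{6}_[A-Z0-9_]+)\s*=\s*(0x[0-9a-fA-F]+)")

# A few lines copied from the excerpt, with the truncated timestamps dropped.
SAMPLE_LOG = [
    "radeon 0000:01:00.0: GPU lockup (waiting for 0x0000000000000ced "
    "last fence id 0x0000000000000cea)",
    "radeon 0000:01:00.0: R_008010_GRBM_STATUS = 0xE7730130",
    "radeon 0000:01:00.0: R_008680_CP_STAT = 0x80878645",
    "radeon 0000:01:00.0: R_008020_GRBM_SOFT_RESET=0x00007FEE",
    "radeon 0000:01:00.0: R_008020_GRBM_SOFT_RESET=0x00000001",
    "radeon 0000:01:00.0: R_008010_GRBM_STATUS = 0xA0003030",
    "radeon 0000:01:00.0: R_008680_CP_STAT = 0x80100000",
]


def parse_lockup(line):
    """Return (waiting_for, last_fence) as integers, or None if no match."""
    match = LOCKUP_RE.search(line)
    if match is None:
        return None
    return int(match.group(1), 16), int(match.group(2), 16)


def status_before_after(lines):
    """Map each status register to its (first, last) logged value.

    GRBM_SOFT_RESET lines are skipped because they appear twice between
    the two dumps with different values and are not status readings.
    """
    seen = {}
    for line in lines:
        for name, value in REGISTER_RE.findall(line):
            if "SOFT_RESET" in name:
                continue
            seen.setdefault(name, []).append(int(value, 16))
    return {n: (v[0], v[-1]) for n, v in seen.items() if len(v) >= 2}


if __name__ == "__main__":
    for line in SAMPLE_LOG:
        fences = parse_lockup(line)
        if fences:
            waiting_for, last_fence = fences
            print(f"waiting for {waiting_for:#x}, last fence {last_fence:#x}, "
                  f"gap {waiting_for - last_fence}")
    for name, (before, after) in status_before_after(SAMPLE_LOG).items():
        print(f"{name}: {before:#010x} -> {after:#010x} "
              f"(changed bits {before ^ after:#010x})")
```

Feeding it the full excerpt instead of `SAMPLE_LOG` works the same way, since lines that match neither pattern are ignored.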
(In reply to comment #39)
> Created attachment 75272 [details] [review]
> Possible fix for R600 hw deadlock
>
> Patch has been tested on a system with AMD K8 CPU and Radeon AGP card (AMD
> RV670 / Radeon HD 3850) with both 3.6.11-030611-generic kernel (from Ubuntu
> kernel PPA mainline) and kernel built from recent drm-fixes git in the
> testing. This patch may also be relevant to reported Bug 47116.
There is a lot of unrelated stuff going on in that patch. Can you narrow down what part fixes the issue?

Created attachment 75274 [details] [review]
flush fix 1/4
Please try this patch series. The 4th patch is optional. It just enables CP DMA assuming that the previous flushing fixes fix the CP DMA issues.

Created attachment 75275 [details] [review]
flush fix 2/4
patch 2 of 4.

Created attachment 75276 [details] [review]
flush fix 3/4
patch 3 of 4.

Created attachment 75277 [details] [review]
flush fix 4/4
Optional patch to enable CP DMA on 6xx.

*** Bug 47116 has been marked as a duplicate of this bug. ***

(In reply to comment #39)
> Created attachment 75272 [details] [review]
> Possible fix for R600 hw deadlock
>
> Patch has been tested on a system with AMD K8 CPU and Radeon AGP card (AMD
> RV670 / Radeon HD 3850) with both 3.6.11-030611-generic kernel (from Ubuntu
> kernel PPA mainline) and kernel built from recent drm-fixes git in the
> testing. This patch may also be relevant to reported Bug 47116.
Testing AGP HD3850 - this patch regresses etqw which since my previous post in this bug had become stable. GPU lock within seconds with or without llvm. Testing on 3.7.6 (purely because I have a separate issue with gpu locks provoking oops with current kernels). It does however seem to fix openarena and nexuiz which without this patch would gpu lock, or really hard lock respectively after a couple of minutes. Haven't had time to test really long runs yet though.

The series of 4 patches by Alex (41-44) doesn't fix the issue for me.
The patch in Comment #39 does fix it for me! I tested it repeatedly with 6 runs of padman, urbanterror and openarena each. (using 3.8 kernel)

(In reply to comment #42)
> Created attachment 75275 [details] [review]
> flush fix 2/4
>
> patch 2 of 4.
This patch (patch 1 also applied) regresses etqw.

Created attachment 75317 [details] [review]
new attempt 1/5
Another attempt to fix the issue. Patch 5 is optional and not related to bug per se.

Created attachment 75318 [details] [review]
new attempt 2/5
2/5

Created attachment 75319 [details] [review]
new attempt 3/5
3/5

Created attachment 75320 [details] [review]
new attempt 4/5
4/5

Created attachment 75321 [details] [review]
new attempt 5/5
optional 5/5.

Latest patches 1 and 4 alone are enough to fix the hangs for me on an rs780.

actually just patch 4 alone seems to fix it.

Created attachment 75331 [details] [review]
simple fix
Just this patch alone seems to fix the issue here.

(In reply to comment #51)
> Created attachment 75319 [details] [review]
> new attempt 3/5
>
> 3/5
FWIW now it's obsolete this still regressed etqw.
Also tried 1+2+4 and 1+2+3+4 with openarena/nexuiz and still had lockups.
Will try 4 alone next.

The simple patch appears to have fixed it for me. (comment 56). Just did 9 total runs, will test more later today.

(In reply to comment #56)
> Created attachment 75331 [details] [review]
> simple fix
>
> Just this patch alone seems to fix the issue here.
I just had a lock up after ~30min in the game (openarena).

(In reply to comment #59)
> (In reply to comment #56)
> > Created attachment 75331 [details] [review]
> > simple fix
> >
> > Just this patch alone seems to fix the issue here.
>
> I just had a lock up after ~30min in the game (openarena).
Can you try just the patch "new attempt 4/5" (attachment 75320 [details] [review]) by itself?
(In reply to comment #57)
> (In reply to comment #51)
> > Created attachment 75319 [details] [review]
> > new attempt 3/5
> >
> > 3/5
>
> FWIW now it's obsolete this still regressed etqw.
>
> Also tried 1+2+4 and 1+2+3+4 with openarena/nexuiz and still had lockups.
>
> Will try 4 alone next.
Can you also try the simple fix (attachment 75331 [details] [review])?

Created attachment 75363 [details] [review]
alternate simple fix
Another patch to try.

I tried the simple fix together with Erik's patch, haven't been able to get it locked up yet after ~30 minutes. I'll also try the alternate simple fix later.

(In reply to comment #56)
> Created attachment 75331 [details] [review]
> simple fix
>
> Just this patch alone seems to fix the issue here.
I can still lockup with this and 0004. It took longer with 0004 and generally llvm seems to take longer to lock than R600_LLVM=0.
Will try the new patch.

(In reply to comment #64)
> (In reply to comment #56)
> > Created attachment 75331 [details] [review]
> > simple fix
> >
> > Just this patch alone seems to fix the issue here.
>
> I can still lockup with this
Ignore this - I messed up when testing simple fix and was testing unpatched - it's running now and hasn't locked yet.

(In reply to comment #65)
> (In reply to comment #64)
> > I can still lockup with this
>
> Ignore this - I messed up when testing simple fix and was testing unpatched
> - it's running now and hasn't locked yet.
It eventually hard locked with nexuiz.

Created attachment 75373 [details] [review]
better alternative fix
Please try this one instead of the previous one.

The better alternative fix just worked fine for me running: openarena, nexuiz, padman, tremulous, and urbanterror. I'm going to run it again to be sure.
Will report back if it breaks. (http://openbenchmarking.org/result/1302221-RA-BUGTESTIN78)

I went ahead and pushed a split up version of attachment 75373 [details] [review] to mesa:
http://cgit.freedesktop.org/mesa/mesa/commit/?id=7ebf83f109db9dde89830d5844107c936cf42e4d
http://cgit.freedesktop.org/mesa/mesa/commit/?id=8442b67f5f3aedbfdb4446164dd09d4eaeda4888
9.1 is supposed to be released today and even if the patch isn't perfect for everyone yet, it's a lot better than it was before. I'll keep this bug open and we can continue to work on this until we get it nailed.

(In reply to comment #67)
> Created attachment 75373 [details] [review]
> better alternative fix
>
> Please try this one instead of the previous one.
I can still hard lock with this and previous - nexuiz is easiest and it normally hard locks. openarena with vanilla is nicer and gpu locks and recovery is possible, but with these patches I did get a hard lock from it.
There is a difference to vanilla in that I am getting the locks after a level/timedemo has run rather than during. With the patch before this I played 40 minutes of openarena, got bored and typed disconnect, then after it had exited the level it locked. With nexuiz I just run the demos in order and again the locks are coming after a demo has finished and the game has been showing a text screen for several seconds.

(In reply to comment #69)
> I went ahead and pushed a split up version of attachment 75373 [details] [review]
> to mesa:
> http://cgit.freedesktop.org/mesa/mesa/commit/?id=7ebf83f109db9dde89830d5844107c936cf42e4d
> http://cgit.freedesktop.org/mesa/mesa/commit/?id=8442b67f5f3aedbfdb4446164dd09d4eaeda4888
> 9.1 is supposed to be released today and even if the patch isn't perfect for
> everyone yet, it's a lot better than it was before. I'll keep this bug open
> and we can continue to work on this until we get it nailed.
That was quick - I've only just got to try with etqw and with v5 it quickly causes a GPU reset.

(In reply to comment #71)
> That was quick - I've only just got to try with etqw and with v5 it quickly
> causes a GPU reset.
On vanilla master now. Can still get etqw to provoke a gpu reset but it seems like it's the initial use of the text console when on the main screen that provokes it. If I avoid using it then I can run without locks.

Does disabling hyperZ help? Set env var R600_HYPERZ=0

(In reply to comment #73)
> Does disabling hyperZ help? Set env var R600_HYPERZ=0
No, that doesn't help.
I have just found another way to avoid it though: running with my card on "low" I can't get it to lock. Turning it up to high as I normally do and it will lock on first (but not subsequent) use of text console every time.

(In reply to comment #72)
> On vanilla master now. Can still get etqw to provoke a gpu reset but it
> seems like it's the initial use of the text console when on the main screen
> that provokes it. If I avoid using it then I can run without locks.
I'm also on vanilla master now, just got a lock up on open arena (after ~40 min). I'm trying Erik's patch again, because I have yet to get it to lock up with that one (after ~2h of playing).

I haven't seen this bug since my last comment. (and for the last month been on a different video card).
Does anyone else still see this issue or shall I close it Fix Released?

(In reply to comment #76)
> I haven't seen this bug since my last comment. (and for the last month been
> on a different video card).
>
> Does anyone else still see this issue or shall I close it Fix Released?
I tested RV670 with piglit and DOTA 2 in April this year and it worked fine.

(In reply to comment #76)
> I haven't seen this bug since my last comment. (and for the last month been
> on a different video card).
>
> Does anyone else still see this issue or shall I close it Fix Released?
Give me a few days to test (not so much spare time now) and see if I can still trigger the bug.

Per my comment on 2014-07-30 and no other updates since that year I'm going to go ahead and mark this Fixed. Thanks all! |