Summary: | Random radeonsi crashes with mesa 10.3.x | ||
---|---|---|---|
Product: | Mesa | Reporter: | Hannu <hannu.tmp> |
Component: | Drivers/Gallium/radeonsi | Assignee: | Default DRI bug account <dri-devel> |
Status: | RESOLVED FIXED | QA Contact: | |
Severity: | normal | ||
Priority: | medium | CC: | aaronbottegal, albandil, alexander, darkbasic, jimmy, mabo |
Version: | git | ||
Hardware: | Other | ||
OS: | All | ||
Attachments: |
dmesg
Xorg.0.log
Xorg.0.log after crash
Crash with mesa 10.2.8, Xorg.0.log
journalctl -b after crash
radeonsi: Disable asynchronous DMA
radeonsi: Disable asynchronous DMA except for PIPE_BUFFER
diff when patched with "radeonsi: Disable asynchronous DMA except for PIPE_BUFFER"
"goto fallback" added before si_dma_copy_tile()
check space after r600_need_dma_space()
mesa 10.3.2-1 and linux 3.18.0 crash
journalctl -b after crash, kernel 3.19-rc3 with patches
journalctl -b after crash, linux 4.0.0-rc2 with patches
Xorg.0.log after crash, linux 4.0.0-rc2 with patches |
Description
Hannu
2014-10-30 12:45:53 UTC
Some additional info: grepped aptitude log, shows which versions I have tried:
[UPGRADE] libgl1-mesa-dri:amd64 10.2.6-1 -> 10.2.8-1
[UPGRADE] libgl1-mesa-dri:amd64 10.2.8-1 -> 10.3.0~rc3-3
[DOWNGRADE] libgl1-mesa-dri:amd64 10.3.0~rc3-3 -> 10.2.8-1
[UPGRADE] libgl1-mesa-dri:amd64 10.2.8-1 -> 10.3.0-2
[DOWNGRADE] libgl1-mesa-dri:amd64 10.3.0-2 -> 10.2.8-1
[UPGRADE] libgl1-mesa-dri:amd64 10.2.8-1 -> 10.3.1-1
[DOWNGRADE] libgl1-mesa-dri:amd64 10.3.1-1 -> 10.2.8-1
[UPGRADE] libgl1-mesa-dri:amd64 10.2.8-1 -> 10.3.1-1
[UPGRADE] libgl1-mesa-dri:amd64 10.2.8-1 -> 10.3.2-1

20 october I checked with diff that I have latest ucode from http://people.freedesktop.org/~agd5f/radeon_ucode/

(In reply to Hannu from comment #0)
> Xorg has randomly timed (hours or days between) crashes with mesa 10.3.x.
> First it stops responding for some seconds, then screen goes black.
> Sometimes it recovers for a moment after the black screen but you had better
> reboot the computer while you can, it will crash again soon after the first
> crash.

Are you saying this is a regression between mesa 10.3.x and 10.2.x? If so, can you bisect? Also, please attach your dmesg output and xorg log.

Created attachment 108693 [details]
dmesg
Created attachment 108694 [details]
Xorg.0.log
(In reply to Alex Deucher from comment #2)
> (In reply to Hannu from comment #0)
> > Xorg has randomly timed (hours or days between) crashes with mesa 10.3.x.
> > First it stops responding for some seconds, then screen goes black.
> > Sometimes it recovers for a moment after the black screen but you had better
> > reboot the computer while you can, it will crash again soon after the first
> > crash.
>
> Are you saying this is a regression between mesa 10.3.x and 10.2.x? If so,
> can you bisect? Also, please attach your dmesg output and xorg log.

Well... yes, it seems to be a regression between mesa 10.3.x and 10.2.x.
As I explained above, bisecting this is impossible due to random long time between occurrences of the bug.
Attached dmesg and Xorg.0.log. I don't have Xorg.0.log with the error message currently, I'll attach if/when it happens again.

As I wrote above "After the crash Xorg.0.log has "(EE) [mi] EQ overflowing. Additional events will be discarded until existing events are processed." and so on at the end of the log."

I don't know much about xorg stacks inner workings but it might be that mesa isn't buggy, but has some added functionality between 10.3.x and 10.2.x that exercises buggy code path lower in the driver stack.

(In reply to Hannu from comment #5)
> As I explained above, bisecting this is impossible due to random long time
> between occurrences of the bug.

That doesn't make it impossible, it just means you need patience. Take your time, and make sure you test each commit for at least as long (preferably longer, to account for variation) as it's ever taken for the problem to occur before declaring it good. On the bright side, when the problem occurs quickly, you know that commit is bad and can move on to the next one.
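
Editor's note: the advice above about how long to test each commit can be made concrete with a small calculation. This is only a sketch, assuming crashes occur independently with a fixed probability per hour of use; nothing in this report measures such a rate, and the function names are hypothetical. Under that assumption, a bad commit survives n hours with probability (1 - p)^n, which is the risk of wrongly declaring it good.

```c
#include <assert.h>

/* Probability that a bad commit survives `hours` of testing without a crash,
 * assuming an independent crash probability `p_per_hour` in every hour.
 * The independence model is an assumption made for illustration only. */
double false_good_probability(double p_per_hour, unsigned hours)
{
	assert(p_per_hour >= 0.0 && p_per_hour <= 1.0);
	double survive = 1.0;
	for (unsigned i = 0; i < hours; i++)
		survive *= 1.0 - p_per_hour;
	return survive;
}

/* Smallest number of test hours after which the risk of a wrong "good"
 * verdict is `max_risk` or less. Returns 0 when p_per_hour is 0, because
 * no amount of testing can then expose the bug. */
unsigned hours_needed(double p_per_hour, double max_risk)
{
	assert(p_per_hour >= 0.0 && p_per_hour <= 1.0);
	assert(max_risk > 0.0 && max_risk < 1.0);
	if (p_per_hour == 0.0)
		return 0;
	unsigned hours = 0;
	double survive = 1.0;
	while (survive > max_risk) {
		survive *= 1.0 - p_per_hour;
		hours++;
	}
	return hours;
}
```

With a made-up crash chance of 5% per hour, `hours_needed(0.05, 0.05)` comes out at 59 hours per commit, which illustrates why a rarely triggered bug makes each bisection step slow but not impossible.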
(In reply to Hannu from comment #6)
> I don't know much about xorg stacks inner workings but it might be that mesa
> isn't buggy, but has some added functionality between 10.3.x and 10.2.x that
> exercises buggy code path lower in the driver stack.

It doesn't matter. Since the problem happens with Mesa 10.3.y but not with 10.2.y, once we know which change between them makes the difference, we'll be at least a big step closer to solving the problem.

(In reply to Michel Dänzer from comment #7)
> That doesn't make it impossible, it just means you need patience. Take your
> time, and make sure you test each commit for at least as long (preferably
> longer, to account for variation) as it's ever taken for the problem to
> occur before declaring it good. On the bright side, when the problem occurs
> quickly, you know that commit is bad and can move on to the next one.

On another bright side, when the problem doesn't occur for a long time, you can enjoy the lack of complaints about it from your son. :)

However, it's very important that you don't declare a commit good which would have exhibited the problem after more testing, otherwise the bisection will give the wrong result.

*** Bug 81644 has been marked as a duplicate of this bug. ***

Created attachment 108767 [details]
Xorg.0.log after crash
Xorg.0.log after crash. Mesa 10.3.2
Now it crashed with mesa 10.2.8-1 for the first time so we can probably forget the "regression between mesa 10.3.x and 10.2.x" part of this report. I will attach Xorg.0.log but it has the same old error. Crash happened while watching flash video full screen.

Created attachment 108864 [details]
Crash with mesa 10.2.8, Xorg.0.log
Created attachment 108940 [details]
journalctl -b after crash
Got something new, I updated xserver to 1.16.1.901-1 and linux to 3.18.0-rc3. Crashed same way as before. After crash no error message in Xorg.0.log but found this with journalctl -b:
Nov 05 10:24:10 kernel: radeon 0000:01:00.0: ring 3 stalled for more than 10020msec
Nov 05 10:24:10 kernel: radeon 0000:01:00.0: GPU lockup (current fence id 0x0000000000002757 last fence id 0x0000000000002788 on ring 3)
Nov 05 10:24:10 kernel: radeon 0000:01:00.0: Saved 2028 dwords of commands on ring 0.
Nov 05 10:24:10 kernel: radeon 0000:01:00.0: GPU softreset: 0x0000006C
Nov 05 10:24:10 kernel: radeon 0000:01:00.0: GRBM_STATUS = 0xA0003028
Nov 05 10:24:10 kernel: radeon 0000:01:00.0: GRBM_STATUS_SE0 = 0x00000006
and so on.
Relevant parts attached, error message at end.
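
Editor's note: the "GPU lockup" line above carries three values that are useful when comparing crash logs: the last fence that completed, the last fence that was emitted, and the ring that stalled. The sketch below pulls them out of such a line. It relies only on the message layout visible in this log; the struct and function names are hypothetical. As far as the radeon kernel headers of that period go, ring index 3 is the first asynchronous DMA ring, which would fit the DMA-related patches discussed later in this thread, but treat that mapping as an assumption.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Fields extracted from a radeon "GPU lockup" kernel log line. */
struct lockup_info {
	unsigned long long current_fence; /* last fence seen as completed */
	unsigned long long last_fence;    /* last fence that was emitted */
	int ring;                         /* index of the stalled ring */
};

/* Parse a line containing
 *   "(current fence id 0x... last fence id 0x... on ring N)".
 * Returns 1 on success and 0 if the line does not match. */
int parse_lockup_line(const char *line, struct lockup_info *out)
{
	assert(line != NULL && out != NULL);
	const char *p = strstr(line, "current fence id");
	if (p == NULL)
		return 0;
	/* %llx accepts the "0x" prefix used in the kernel message. */
	return sscanf(p, "current fence id %llx last fence id %llx on ring %d",
		      &out->current_fence, &out->last_fence, &out->ring) == 3;
}

/* Number of fences emitted but not yet completed when the lockup was detected. */
unsigned long long pending_fences(const struct lockup_info *info)
{
	return info->last_fence - info->current_fence;
}
```

For the line logged at 10:24:10 this gives ring 3 with 0x2788 - 0x2757 = 49 fences still pending.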
Created attachment 108943 [details] [review]
radeonsi: Disable asynchronous DMA

Does this patch avoid the lockups?

(In reply to Michel Dänzer from comment #14)
> Created attachment 108943 [details] [review]
> radeonsi: Disable asynchronous DMA
>
> Does this patch avoid the lockups?

I have now mesa 10.3.2 package patched, built and installed for i386 and amd64 from debian source package. We will have to wait couple of days or maybe a week to say with some probability if the crash has gone.

(In reply to Michel Dänzer from comment #14)
> Created attachment 108943 [details] [review]
> radeonsi: Disable asynchronous DMA
>
> Does this patch avoid the lockups?

No crashes/lockups with your patch after 2 days, testing continues. I will do some full screen flash video testing on monday.
Today Xorg.0.log has a few of these:
[ 12208.821] (WW) RADEON(0): radeon_dri2_flip_event_handler: Pageflip completion event has impossible msc 731959 < target_msc 731960
[ 13141.437] (WW) RADEON(0): radeon_dri2_flip_event_handler: Pageflip completion event has impossible msc 787906 < target_msc 787907

Hi,
running kernel 3.18-rc3 mesa 10.3.2 + the "Disable asynchronous DMA" patch my system has been stable for three days now. Done lots of video watching and using suspend to ram. No misbehaviour up to now. Without the patch this combination would fail pretty quickly.

(In reply to Michel Dänzer from comment #14)
> Created attachment 108943 [details] [review]
> radeonsi: Disable asynchronous DMA
>
> Does this patch avoid the lockups?

No lockups/crashes with your "radeonsi: Disable asynchronous DMA"-patch after 4 days.
I did some testing today with a 23 minute full screen flash video:
Mesa 10.3.2 with the patch: 10 playbacks of the video, no crashes/lockups.
I built and installed mesa 10.3.2 from debian same source package without the patch:
Run 1: fifth playback of the video crashed.
Run 2: fourth playback of the video crashed.
I think no further testing of this patch is needed, indicates something wrong with the DMA code?

Created attachment 109261 [details] [review]
radeonsi: Disable asynchronous DMA except for PIPE_BUFFER

Can you test this patch as well?

Created attachment 109263 [details]
diff when patched with "radeonsi: Disable asynchronous DMA except for PIPE_BUFFER"
Applied the patch, it was not for mesa 10.3.2 version. I attached the diff.
I will test with the same video as yesterday as soon as I get the packages built.
(In reply to Michel Dänzer from comment #19)
> Created attachment 109261 [details] [review]
> radeonsi: Disable asynchronous DMA except for PIPE_BUFFER
>
> Can you test this patch as well?

Tested with a 23 minute full screen flash video same as before.
Mesa 10.3.2 with the "radeonsi: Disable asynchronous DMA except for PIPE_BUFFER"-patch: 7 playbacks of the video, no crashes/lockups.
I will continue tomorrow, now the computer is needed for gaming.

(In reply to Michel Dänzer from comment #19)
> Created attachment 109261 [details] [review]
> radeonsi: Disable asynchronous DMA except for PIPE_BUFFER
>
> Can you test this patch as well?

Hello Michel,
I suffered from the same crashes Hannu and Malte Schröder reported. After applying this patch, the crashes and lockups ceased. I tested attachment 108943 [details] [review] (coming from https://bugs.freedesktop.org/show_bug.cgi?id=85866) which solved the issues for me as well. Radeon HD 7970.

(In reply to Michel Dänzer from comment #19)
> Created attachment 109261 [details] [review]
> radeonsi: Disable asynchronous DMA except for PIPE_BUFFER
>
> Can you test this patch as well?

Tested with a 23 minute full screen flash video same as before.
Mesa 10.3.2 with the "radeonsi: Disable asynchronous DMA except for PIPE_BUFFER"-patch: 10 playbacks of the video, no crashes/lockups.
I think the patch makes mesa stable at least for flash videos, no further testing needed for this patch.

*** Bug 85866 has been marked as a duplicate of this bug. ***

Created attachment 109402 [details]
"goto fallback" added before si_dma_copy_tile()
I tested today with this:
si_dma_copy_buffer(sctx, dst, src, dst_offset, src_offset,
src_box->height * src_pitch);
} else {
+ goto fallback;
si_dma_copy_tile(sctx, dst, dst_level, dst_x, dst_y, dst_z,
(diff attached)
Tested with a 23 minute full screen flash video same as before.
Mesa 10.3.2 with the "goto fallback" added before si_dma_copy_tile(): 10 playbacks of the video, no crashes/lockups.
I am going to test the original code without changes again tomorrow to be sure that the crash is still there. (I updated firefox yesterday).
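
Editor's note: the effect of the added "goto fallback" line can be modelled in isolation. The sketch below is not the Mesa function; the enum, the function name and both flags are hypothetical, and the condition that selects the si_dma_copy_buffer() branch is not shown in the excerpt above, so it is reduced to a single flag here. It only shows that the jump makes the tiled DMA copy unreachable while leaving the buffer-style DMA copy in place.

```c
#include <assert.h>

/* Which copy path handled a request in this simplified model. */
enum copy_path { PATH_DMA_BUFFER, PATH_DMA_TILE, PATH_FALLBACK };

/* Hypothetical stand-in for the dispatch at the end of the DMA copy function.
 * `same_layout` models the branch that leads to si_dma_copy_buffer();
 * `bypass_tile` models the "+ goto fallback;" line from the diff.
 * Both flags are expected to be 0 or 1. */
enum copy_path choose_copy_path(int same_layout, int bypass_tile)
{
	assert(same_layout == 0 || same_layout == 1);
	assert(bypass_tile == 0 || bypass_tile == 1);

	if (same_layout)
		return PATH_DMA_BUFFER;  /* this DMA path stays enabled */

	if (bypass_tile)
		goto fallback;           /* the tiled DMA copy is skipped */

	return PATH_DMA_TILE;

fallback:
	return PATH_FALLBACK;            /* generic, non-DMA copy */
}
```

In this model the test configuration from the comment above corresponds to `bypass_tile = 1`: every request that would have used the tiled DMA copy ends up on the fallback path instead.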
I built mesa 10.4.0-devel (master) from mesa git. +linux 3.18.0-rc4
Tested with a 23 minute full screen flash video same as before, sixth playback of the video crashed.
Mesa 10.3.2 with the bypassed si_dma_copy_tile() as in comment 26 has not crashed yet
si_dma_copy_buffer(sctx, dst, src, dst_offset, src_offset,
src_box->height * src_pitch);
} else {
+ goto fallback;
si_dma_copy_tile(sctx, dst, dst_level, dst_x, dst_y, dst_z,

Module: Mesa
Branch: master
Commit: ae4536b4f71cbe76230ea7edc7eb4d6041e651b4
URL: http://cgit.freedesktop.org/mesa/mesa/commit/?id=ae4536b4f71cbe76230ea7edc7eb4d6041e651b4

Author: Michel Dänzer <michel.daenzer@amd.com>
Date: Tue Nov 11 16:10:20 2014 +0900

radeonsi: Disable asynchronous DMA except for PIPE_BUFFER

(In reply to Michel Dänzer from comment #28)
> Module: Mesa
> Branch: master
> Commit: ae4536b4f71cbe76230ea7edc7eb4d6041e651b4
> URL:
> http://cgit.freedesktop.org/mesa/mesa/commit/
> ?id=ae4536b4f71cbe76230ea7edc7eb4d6041e651b4
>
> Author: Michel Dänzer <michel.daenzer@amd.com>
> Date: Tue Nov 11 16:10:20 2014 +0900
>
> radeonsi: Disable asynchronous DMA except for PIPE_BUFFER

There might be something that fixes this bug in linux 3.18.0-rc5.
I just tested linux 3.18.0-rc5 with the same mesa 10.4.0-devel package as in comment 27.
Tested with a 23 minute full screen flash video same as before.
Mesa 10.4.0-devel and linux 3.18.0-rc5: 10 playbacks of the video, no crashes/lockups.
I'll test mesa 10.3.2 original debian package without changes and linux 3.18.0-rc5 tomorrow.

I would only attempt to say it's fixed after about 2 days of stability playing multimedia content/dynamic content in an accelerated web browser. 23 Minutes is not long enough to test this bug, it sometimes took days with flash for me. I got a replacement card, so I can't test anything anymore, so I can't help say if rc5 has any fixes. Sorry.

3.18-rc5 does not fix the problem. I just tried and the system crashed after a few minutes.
Now running again with my patched mesa 10.3.2.

mesa 10.3.2 original debian package without changes and linux 3.18.0-rc5 crashed after a few minutes when playing the video. That is worse than before, it has usually crashed after four to six replays of the video.
I'll start testing mesa 10.4.0-devel with rc5 again, probably it was a random fluke that it lasted ten replays, longest this far has been seven replays without the bypassed si_dma_copy_tile().

Tested with a 23 minute full screen flash video same as before.
Mesa 10.4.0-devel and linux 3.18.0-rc5 again: second replay of the video crashed.
So no luck with that, I'll try "fix that only disables DMA if 1D tiling is involved" from bug 83500, if that doesn't work I give up and use the fix I did in comment 26.

(In reply to Hannu from comment #33)
> I'll try "fix that only disables DMA if 1D tiling is involved" from bug 83500,

You can save the time for that, given that it didn't fully fix the problem for the reporter of that bug.

> if that doesn't work I give up and use the fix I did in comment 26.

Why don't you just use the fix I pushed to resolve this report? :) I deliberately didn't keep the second si_dma_copy_buffer path enabled because I know (from experience trying to port the DMA code to CIK) that it doesn't properly check for all cases it can't handle. I don't think that path is very relevant in practice anyway.

(In reply to Michel Dänzer from comment #34)
> (In reply to Hannu from comment #33)
> > I'll try "fix that only disables DMA if 1D tiling is involved" from bug 83500,
>
> You can save the time for that, given that it didn't fully fix the problem
> for the reporter of that bug.
>
>
> > if that doesn't work I give up and use the fix I did in comment 26.
>
> Why don't you just use the fix I pushed to resolve this report? :) I
> deliberately didn't keep the second si_dma_copy_buffer path enabled because
> I know (from experience trying to port the DMA code to CIK) that it doesn't
> properly check for all cases it can't handle. I don't think that path is
> very relevant in practice anyway.

OK, I'll stop testing for now.

Thanks for all your testing, BTW!

Created attachment 109668 [details]
check space after r600_need_dma_space()
By the way, at some point while testing I noticed that si_dma_copy_tile() does not check that it actually got the space it wanted after this call:
	r600_need_dma_space(&ctx->b, ncopy * 9);

void r600_need_dma_space(struct r600_common_context *ctx, unsigned num_dw)
{
	/* The number of dwords we already used in the DMA so far. */
	num_dw += ctx->rings.dma.cs->cdw;
	/* Flush if there's not enough space. */
	if (num_dw > RADEON_MAX_CMDBUF_DWORDS) {
		ctx->rings.dma.flush(ctx, RADEON_FLUSH_ASYNC, NULL);
	}
}

So I added a check after r600_need_dma_space():

	r600_need_dma_space(&ctx->b, ncopy * 9);
	if (((ncopy * 9) + cs->cdw) > RADEON_MAX_CMDBUF_DWORDS) {
		return 0;
	}

and then goto fallback if it returns 0, but it crashed anyway so I skipped that as irrelevant to this bug. (diff attached)
Created attachment 110558 [details]
mesa 10.3.2-1 and linux 3.18.0 crash

attached crash report from journalctl. Linux 3.18.0 and original debian mesa 10.3.2-1. I don't know if this adds anything new to report in attachment 108940 [details], there are some differences though.

(In reply to Hannu from comment #38)
> Created attachment 110558 [details]
> mesa 10.3.2-1 and linux 3.18.0 crash
>
> attached crash report from journalctl. Linux 3.18.0 and original debian mesa
> 10.3.2-1. I don't know if this adds anything new to report in attachment
> 108940 [details], there are some differences though.

Try with mesa 10.3.4.

http://www.mesa3d.org/relnotes/10.3.4.html

(In reply to agapito from comment #39)
> (In reply to Hannu from comment #38)
> > Created attachment 110558 [details]
> > mesa 10.3.2-1 and linux 3.18.0 crash
> >
> > attached crash report from journalctl. Linux 3.18.0 and original debian mesa
> > 10.3.2-1. I don't know if this adds anything new to report in attachment
> > 108940 [details], there are some differences though.
>
> Try with mesa 10.3.4.
>
> http://www.mesa3d.org/relnotes/10.3.4.html

Yes I know that 'radeonsi: Disable asynchronous DMA except for PIPE_BUFFER' fixes the crash, but since the patch disables some of the functionality I checked again with latest kernel.

(In reply to Hannu from comment #40)
> Yes I know that 'radeonsi: Disable asynchronous DMA except for PIPE_BUFFER'
> fixes the crash, but since the patch disables some of the functionality I
> checked again with latest kernel.

Note that I don't think the disabled functionality made a big difference anyway.

Created attachment 111954 [details]
journalctl -b after crash, kernel 3.19-rc3 with patches

--- Comment #223 from Michel Dänzer <michel at daenzer.net> ---
(In reply to fdb4c415 from comment #222)
There's a good chance that a newer upstream version of Mesa would help for your problem, if not fix it completely.

For those still having problems, the kernel patches http://lists.freedesktop.org/archives/dri-devel/2015-January/074968.html and http://lists.freedesktop.org/archives/dri-devel/2015-January/074969.html might be worth a try.
---------------------

Attached crash report from journalctl. Linux 3.19-rc3 with those two patches and original debian mesa 10.3.2-1. Xorg locked up screen same way as before. No error reports in Xorg.0.log.

Created attachment 113979 [details]
journalctl -b after crash, linux 4.0.0-rc2 with patches

Tested mesa 10.3.2-1 with the dma problem again with linux kernel 4.0.0-rc2 with attachment 166571 from bug 90741 and "drm/radeon: do a posting read in si_set_irq" from posting-read branch applied (I'm testing just for fun, 10.4.5 works fine). This time X restarted after the lockup (asks username and password). Journalctl -b attached, there is something in Xorg.0.log this time, I'll attach it next.

Created attachment 113981 [details]
Xorg.0.log after crash, linux 4.0.0-rc2 with patches
Here is Xorg.0.log, backtrace at the end.