Bug 73528

Summary: Deferred lighting in Second Life causes system hiccups and screen flickering
Product: Mesa Reporter: MirceaKitsune <sonichedgehog_hyperblast00>
Component: Drivers/Gallium/r600 Assignee: Default DRI bug account <dri-devel>
Status: RESOLVED FIXED QA Contact:
Severity: major    
Priority: high CC: dawide2211, EoD, greg, sonichedgehog_hyperblast00
Version: git   
Hardware: x86-64 (AMD64)   
OS: Linux (All)   
Whiteboard:
Attachments: Xorg.0.log dating 26-05-2014
glxinfo output, dating 26-05-2014
dmesg output, dating 26-05-2014
verify the bisected commit
patch
debug info
fast color clear hang code

Description MirceaKitsune 2014-01-12 23:04:26 UTC
I compiled Mesa with libdrm from their GIT repositories (10.1, commit 2dc35a619c50139d07ad96fc4dfe456e5811c84e). I started up the Second Life viewer through it (Kokua x64, 3.6.12.30743, Dec 12 2013 05:51:00) and logged in.

Upon enabling Advanced Lighting Model from Preferences - Graphics, the system froze and the screen started flickering, turning black then recovering every few seconds. The scene also got covered in pink pixels, meaning it also failed to render correctly. I was barely able to move the mouse pointer and shut the viewer down. Once I did, everything returned to normal.

This does not happen with Mesa 9.2.3, and other Graphics options in the viewer work fine (like "Basic shaders" or "Atmospheric shaders"). Shadows were disabled and likely not a cause.

If someone has Second Life, please try the deferred lighting system under the 10.1 development version of Mesa. Currently, I don't have any more information on the problem apart from having experienced the described symptoms.
Comment 1 Alex Deucher 2014-01-12 23:27:02 UTC
Can you bisect?
Comment 2 MirceaKitsune 2014-01-12 23:53:36 UTC
(In reply to comment #1)
> Can you bisect?

Probably not. I just set up the GIT today for the first time, and compiling takes many minutes as it is. I can't estimate where it might have broken either, nor risk experiencing the problem too many times (given it almost brings the system down). I just know it works well in 9.2.3 but not 10.1.
Comment 3 Michel Dänzer 2014-01-15 04:10:19 UTC
(In reply to comment #2)
> [...] compiling takes many minutes as it is.

git bisect minimizes the number of times you need to compile Mesa to isolate the change introducing the problem.
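
As a rough illustration of why bisecting needs few builds, here is a toy sketch that does not touch the Mesa tree. It creates a throwaway repository of 16 commits in which commit 11 introduces a "regression", then lets git bisect find it. The repository layout, file names and the test command are all made up for the demonstration.

```shell
# Toy demonstration only: the repository, files and "regression" are hypothetical.
repo=$(mktemp -d)
git -C "$repo" init -q
git -C "$repo" config user.name "Bisect Demo"
git -C "$repo" config user.email "demo@example.invalid"

# Create 16 commits; from commit 11 onwards the tree is "bad".
i=1
while [ "$i" -le 16 ]; do
  if [ "$i" -ge 11 ]; then echo bad; else echo good; fi > "$repo/state"
  echo "$i" > "$repo/counter"
  git -C "$repo" add state counter
  git -C "$repo" commit -q -m "commit $i"
  i=$((i + 1))
done

# Known bad = HEAD (commit 16), known good = the root commit (commit 1).
good=$(git -C "$repo" rev-list --max-parents=0 HEAD)
git -C "$repo" bisect start HEAD "$good" >/dev/null

# The test command exits 0 on a good tree and non-zero on a bad one.
git -C "$repo" bisect run grep -q good state >/dev/null

# refs/bisect/bad now points at the first bad commit.
first_bad=$(git -C "$repo" log -1 --format=%s refs/bisect/bad)
git -C "$repo" bisect reset >/dev/null 2>&1
echo "first bad: $first_bad"
rm -rf "$repo"
```

With this layout the script should report commit 11 after testing only a handful of the 16 commits. For a real driver bug the automated test command would be replaced by a manual build, a test run, and `git bisect good` or `git bisect bad`.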

> I can't estimate where it might have broken either,

Your initial report says it's working with Mesa 9.2.3 but broken with current Git; that's good enough for bisecting.

> nor risk experiencing the problem too many times (given it almost
> brings the system down).

I don't think it's that dramatic. It sounds like the usual symptoms of the GPU repeatedly hanging and being reset by the driver.
Comment 4 MirceaKitsune 2014-02-16 12:05:11 UTC
I updated Mesa to latest GIT and tried this again last night (Kokua viewer 3.6.12 x64). The result was even more catastrophic: As soon as the world began rendering, the screen turned black and the monitor went into standby mode. The numlock / capslock LEDs on the keyboard also stopped toggling. I had to power off the computer to restart. This likely means a Kernel panic happened, so the problem is probably really bad.
Comment 5 MirceaKitsune 2014-05-03 00:32:47 UTC
Any news on this? With a new distribution upgrade coming slowly, I'm worried that Second Life might no longer work properly and even crash my system. Can anyone with a 6xxx series card try the latest Mesa GIT and attempt to reproduce the problem?
Comment 6 MirceaKitsune 2014-05-23 00:36:42 UTC
I compiled latest MESA GIT today and tried the SL viewer out once more. The issue still exists! I have however at least found a way to test it that doesn't appear to crash my computer (keep no prims in view so that none are rendered, and have a command ready to instantly kill the viewer process). I shall see if I can test with different settings, in case it's a combination of factors.

One thing I noticed additionally: I have Radeon DPM enabled, even though my Kernel requires a boot parameter for that, as it's not yet the default. When the issue takes place, the fan on my video card also reaches maximum RPM. Something clearly floods the video card and GPU beyond what they can handle.
Comment 7 MirceaKitsune 2014-05-23 13:20:06 UTC
I brought this up on IRC last night, and was advised to do a GIT bisect or an apitrace. Since I have no idea when the issue started happening and I can't risk crashing my machine dozens of times, I can't try GIT bisect myself. I was however able to do an apitrace, which also produces the crash when played back. As requested, I packed it as an xz archive and uploaded it. This should allow anyone who can't try the Second Life viewer themselves to reproduce the crash.

http://www.sendspace.com/file/oixqur

Unpack that somewhere, then run the command "glretrace ./do-not-directly-run-kokua-bin.trace" from the same directory. Be prepared for a system freeze, and only do this on a machine where it's safe! Also, post a comment here if that link stops working, so I can re-post the trace file if needed.

I crashed my machine 5 times in order to record this. So I'd appreciate it if someone could make the effort worth it and please test this more in depth, to see where the issue really takes place in the MESA code.
Comment 8 Michel Dänzer 2014-05-26 03:10:29 UTC
First of all, please attach the Xorg.0.log file and the output of glxinfo and dmesg.

(In reply to comment #6)
> One thing I noticed additionally: I have Radeon DPM enabled, despite my
> Kernel requiring a boot parameter for that as that's not yet default.

Does the problem also occur without DPM?

BTW, in case your Mesa is built with --enable-r600-llvm-compiler, does the problem also occur if you disable the LLVM based shader compiler with the environment variable R600_LLVM=0 for the SL process?


(In reply to comment #7)
> Since I have no idea when the issue started happening and I can't risk
> crashing my machine dozens of times, I can't try GIT bisect myself.

That's just not true, see comment #3. Also (to preempt another concern you mentioned on IRC), bisecting normally doesn't require any knowledge of the code.


> I was however able to do an apitrace, which also produces the crash
> when played back.

Unfortunately, I'm not able to reproduce anything like what you describe with that apitrace on my machines.


> I crashed my machine 5 times in order to record this.

Five iterations of git bisect would go a long way towards isolating the problem. If you had started the bisection when you reported this bug, you would have isolated the change introducing the problem for you a long time ago, even if you only tested it once a day.
Comment 9 MirceaKitsune 2014-05-26 11:49:55 UTC
(In reply to comment #8)
I haven't tried without DPM. I might sometime later.

I shall also test R600_LLVM=0. Do I just set that as a variable in the same console before running the SL process, or is it a compile parameter?

The problem likely existed months before I first tried running the SL viewer on a GIT version of MESA. The oldest version of Mesa where I don't get this occurring is the one my Linux distribution comes with, 9.2.3. I wouldn't even know where to begin bisecting in such an ocean of commits, especially considering the crashes.

One more thing to mention if it's relevant: I also compile latest GIT Mesa with latest libdrm from GIT, since by the time I first tried a development version of Mesa, it wouldn't compile against the libdrm-devel package in my system. Still, I use the free Radeon driver and X11 version provided by openSUSE 13.1. Could there be an incompatibility there? No, I can't risk testing a newer driver or X11 than my distribution offers, as this is my desktop and I could break it.

I'll attach the logs you mentioned shortly after I post this comment.
Comment 10 MirceaKitsune 2014-05-26 11:51:33 UTC
Created attachment 99862 [details]
Xorg.0.log dating 26-05-2014
Comment 11 MirceaKitsune 2014-05-26 11:53:28 UTC
Created attachment 99863 [details]
glxinfo output, dating 26-05-2014
Comment 12 MirceaKitsune 2014-05-26 11:55:26 UTC
Created attachment 99864 [details]
dmesg output, dating 26-05-2014
Comment 13 Michel Dänzer 2014-05-27 03:04:51 UTC
(In reply to comment #9)
> I shall also test R600_LLVM=0. Do I just set that as a variable in the same
> console before running the SL process, or is it a compile parameter?

The former, but make sure it's actually visible in the SL process, either by exporting it or just prepending it on the SL command line.
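
A minimal sketch of the difference, using a child `sh` as a stand-in for the SL process (the real viewer launcher is not shown here): a plain assignment stays in the current shell, while prepending the variable or exporting it makes it visible to the child.

```shell
# Start from a clean state so the demonstration is reproducible.
unset R600_LLVM

# Stand-in for the SL process: a child shell that reports what it sees.
child='printf "%s\n" "${R600_LLVM:-unset}"'

# 1) Plain assignment: a shell variable only, NOT passed to child processes.
R600_LLVM=0
not_exported=$(sh -c "$child")

# 2) Prepended on the command line: visible to that one process only.
prefixed=$(R600_LLVM=0 sh -c "$child")

# 3) Exported: visible to every process started from this shell afterwards.
export R600_LLVM
exported=$(sh -c "$child")

echo "plain assignment: $not_exported"
echo "prepended:        $prefixed"
echo "exported:         $exported"
```

Only the second and third forms reach the child process, which is why setting the variable alone in the console is not enough.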


> I wouldn't even know where to begin bisecting in such an ocean of commits,
> [...]

As I said before, the information in your original report is enough to get the bisection started:

git bisect bad 2dc35a619c50139d07ad96fc4dfe456e5811c84e
git bisect good mesa-9.2.3

After that, git bisect will isolate the change introducing the problem after the minimum number of tests required, which is about a dozen (roughly the logarithm to the base of 2 of the number of commits between the commits above).
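
To make the estimate concrete, here is a small sketch that counts how often a range of commits can be halved before one commit remains. The function name `bisect_steps` and the commit count of 4000 are assumptions for illustration; the real count for this range is not stated in this report, and git's actual number of steps can differ slightly from this approximation.

```shell
# Hypothetical helper: approximate number of bisection tests for N commits,
# i.e. the base-2 logarithm of N rounded up.
bisect_steps() {
  remaining=$1
  steps=0
  while [ "$remaining" -gt 1 ]; do
    remaining=$(( (remaining + 1) / 2 ))   # each test discards about half
    steps=$((steps + 1))
  done
  echo "$steps"
}

# 4000 is an assumed figure; inside a Mesa clone the real one could be read with
#   git rev-list --count mesa-9.2.3..2dc35a619c50139d07ad96fc4dfe456e5811c84e
bisect_steps 4000
```

For a few thousand commits this gives about a dozen builds, which matches the figure above.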


At this point, there's no way to make progress on this bug without someone who can reproduce the problem and investigate or at least bisect it. It's up to you if you want to look for someone else who fits the bill, or just bisect yourself.


> I also compile latest GIT Mesa with latest libdrm from GIT. [...] Still, I
> use the free Radeon driver and X11 version provided by openSUSE 13.1. Could
> there be an incompatibility there?

That's unlikely, especially since Mesa 9.2.3 still works fine?


> No, I can't risk testing a newer driver or X11 than my distribution offers,
> as this is my desktop and I could break it.

FYI, it's possible to test newer versions of all components without affecting the system installed ones. But so far everything points to this being a Mesa bug.
Comment 14 Michel Dänzer 2014-05-27 03:08:07 UTC
BTW, if you capture the output of dmesg after the problem occurs, it should contain some information about the GPU lockups. But it'll be much harder to deduce the cause from that than from a bisection.
Comment 15 MirceaKitsune 2014-11-03 15:16:48 UTC
Today with its release, I upgraded to openSUSE 13.2, which has MESA 10.3.0. Despite being an official package, unlike my previous tests on repository versions of MESA, the issue is still there. Enabling Advanced Lighting in the SL viewer will freeze and crash the entire system immediately.

I still have a trace from a while ago, which can be used to reproduce the issue without running SL yourself. As an xz archive, it seems to be reduced to 54 MB... but if I cannot post it here, ask me for another place to upload it. Anyway, I really hope someone looks into this :(
Comment 16 MirceaKitsune 2014-11-15 21:43:46 UTC
Okay, I just found an important clue: The issue is not limited to Second Life alone! I just tried to enable shaders in the open-source racing game Stuntrally (Vdrift Ogre) and got the exact same problem. The machine would freeze, the screen would flicker (monitor going into standby), but in this case it would unfreeze after a few seconds so I managed to turn shaders off.

Additionally, I did more tests in Second Life several days ago, and found something interesting: If I look at the ground so barely anything is rendered, the shaders work perfectly well. But if the sky or too much detail come into view, it only takes a few seconds until the GPU suddenly hiccups and I can barely re-stabilize the system.

I'm unclear whether this is a coincidence. But when the problem takes place, the fan on my video card is running at maximum speed, indicating the Radeon DPM module is stressing the card out. Note that the freezes existed before DPM was supported by the free driver, and came after a MESA update alone.

My humble impression is that some shaders either create an infinite loop on the GPU due to a compilation bug, or they're so unoptimized that they flood the GPU with data it can't handle. It does seem to be triggered by specific objects coming into view... but I can't tell if that's because of the model complexity (high polygon count) or because they contain a "sick" shader which the video card attempts to render.

Since this has been around for a year, and I've gathered a few clues already, please allow me to set it to "high" priority to bring more awareness. I'm around and watching this, so I can do more tests... as long as it's nothing that would risk breaking my system (like changing system drivers).
Comment 17 Michel Dänzer 2014-11-17 04:36:56 UTC
The situation hasn't changed since comment #13:

At this point, there's no way to make progress on this bug without someone who can reproduce the problem and investigate or at least bisect it. It's up to you if you want to look for someone else who fits the bill, or just bisect yourself.
Comment 18 MirceaKitsune 2014-11-17 13:13:24 UTC
(In reply to Michel Dänzer from comment #17)

This has been happening for a year. Commit 2dc35a619c50139d07ad96fc4dfe456e5811c84e is when I first noticed the problem. I'm uncertain whether I can even still compile such an old version of MESA today. I shall attempt to, however.

In case I can't compile: Is there a place where I can download pre-compiled nightly / weekly builds of MESA at least, dating since last year?
Comment 19 MirceaKitsune 2014-11-17 18:48:18 UTC
Alright. I dusted up my local GIT clone of MESA, and started checking out and compiling various versions. So far I only looked at the tags, and found the releases between which the issue was added: mesa-9.2.5 (works fine) to mesa-10.0-rc1 (contains the problem).

Now comes the painful process of narrowing down which commit between these two tags implements the problem. It might take some time, but I'll post here the moment I have a result.
Comment 20 MirceaKitsune 2014-11-17 22:22:44 UTC
It is done. After nearly two hours of bisecting, I managed to track the exact commit which adds the freezes... at least in the case of Second Life.

edbbfac6cfc634e697d7f981155a5072c52d77ac is the first bad commit
commit edbbfac6cfc634e697d7f981155a5072c52d77ac
Author: Grigori Goronzy <greg@chown.ath.cx>
Date:   Wed Sep 11 01:41:40 2013 +0200

    r600g: fast color clears for single-sample buffers
    
    Allocate a CMASK on demand and use it to fast clear single-sample
    colorbuffers. Both FBOs and window system colorbuffers are fast
    cleared. Expand as needed when colorbuffers are mapped or displayed
    on screen.
    
    v2: cosmetics, move transfer expansion into dma_blit
    
    Signed-off-by: Marek Olšák <marek.olsak@amd.com>

:040000 040000 dbb6df4e7ed243043e46afc9b91c389b61549ef9 62ae46bd2049b7cb6e35e2dce2ba4725b413bdbe M      src

56d9a397aa2dbee6b12e1bbe56be39f426e1e34d is the last good commit

These are the changes in the bad commit: http://pastebin.com/raw.php?i=Z1RAGRcT
Comment 21 Michel Dänzer 2014-11-18 01:49:00 UTC
(In reply to MirceaKitsune from comment #20)
> edbbfac6cfc634e697d7f981155a5072c52d77ac is the first bad commit
> commit edbbfac6cfc634e697d7f981155a5072c52d77ac
> Author: Grigori Goronzy <greg@chown.ath.cx>
> Date:   Wed Sep 11 01:41:40 2013 +0200
> 
>     r600g: fast color clears for single-sample buffers


Thank you for bisecting. So you can reliably reproduce the problem with that commit, but not with its parent commit 56d9a397aa2dbee6b12e1bbe56be39f426e1e34d?

Grigori, any ideas?
Comment 22 MirceaKitsune 2014-11-18 02:14:01 UTC
(In reply to Michel Dänzer from comment #21)

I did several tests multiple times to be sure, and it seems like it. No matter how much I stress the viewer (a lot of avatars / geometry on the screen) nothing bad happens with commit 56d9a397aa2dbee6b12e1bbe56be39f426e1e34d. But the moment I switch to commit edbbfac6cfc634e697d7f981155a5072c52d77ac, it only takes a bit of content for the GPU to instantly hiccup and only recover if I move the view down really quick.

Note however that with both of these commits, the scene becomes black when I enable Advanced Lighting, and I can only see floating text. But that seems to be a completely unrelated issue, likely added temporarily during this time then fixed relatively soon... probably something to do with lighting.
Comment 23 Marek Olšák 2014-11-18 12:39:21 UTC
Created attachment 109669 [details] [review]
verify the bisected commit

Can you test the attached patch with Mesa master? It will help to verify that the bisected commit is the culprit.
Comment 24 MirceaKitsune 2014-11-18 16:05:14 UTC
(In reply to Marek Olšák from comment #23)

I can't test latest Mesa with Second Life, due to another bug in current GIT master: http://bugs.freedesktop.org/show_bug.cgi?id=86089 So I manually integrated your patch in the latest commit which doesn't have that issue (6212d2402df4ad0658cbb98ce889e35ef5f32fa3) and tested with it.

The problem is indeed solved! I can now enable Advanced Lighting and all related shaders without any issues in Second Life. Turning shaders on in Stuntrally also doesn't produce the crash any longer. Thank you for the patch! I will wait for it to be included and use an updated release afterward.
Comment 25 Marek Olšák 2014-11-18 16:10:38 UTC
My explanations for this bug are:

1) CMASK allocation is wrong. We should try and overallocate it.

2) CMASK contains garbage, which can hang the hardware. Either the clear_buffer function isn't reliable (it uses CP DMA) or CMASK was corrupted by eviction. Or an out-of-bounds access from some other source corrupted it.
Comment 26 MirceaKitsune 2014-11-26 01:17:34 UTC
Any news regarding a solution? I understand the patch I tested with might not be the actual answer, but a way to test that the problem lies somewhere else. Since I spend time on Second Life daily, and this is the only reason why I can't use shaders, a fix before the next MESA release would be highly appreciated.
Comment 27 MirceaKitsune 2015-02-03 01:17:14 UTC
I still check this every few days for news. I understand the developers have many issues to deal with, but could someone please commit a simple fix for this problem? It's the only reason why I can't safely use shaders in Second Life, Stuntrally, and potentially other engines... for about two years since the issue was introduced here. I confirmed a line in the code which fixes the problem, as well as the commit which introduced it, thanks to the suggestions in the comments above... so hopefully a solution isn't that difficult now. Any thoughts please?
Comment 28 Marek Olšák 2015-02-07 13:28:12 UTC
Created attachment 113251 [details] [review]
patch

Could you please test this patch?
Comment 29 MirceaKitsune 2015-02-12 01:19:01 UTC
The issue still happens with that patch. I applied it against latest GIT master, and enabling shaders in Second Life froze the system immediately again.
Comment 30 MirceaKitsune 2015-02-26 21:24:04 UTC
Still waiting for news please. Like I said, the patch in comment 28 does not fix the issue, but the patch in comment 23 did. I'm hoping there is enough data to make a solution possible before the next release, if not I can still test other patches.
Comment 31 MirceaKitsune 2015-04-08 17:39:27 UTC
I poked the developers about this on IRC, since I stopped getting replies here. I was reminded about posting an API trace, so other developers could replicate the problem without having to install Second Life. I already generated one a while ago, which I re-uploaded on my Google Drive at the following link:

https://drive.google.com/file/d/0B5lE6Cy2gg_rZXV6aW1RQV80ODg/view?usp=sharing

Playing back this trace reproduces the GPU freeze for me, so it should contain the trigger for the issue. I understand this is specific to Radeon 6xxx cards and related to the "fast clear" feature, so it needs to be tested on a similar model of card. I will wait for feedback on the issue... thank you.
Comment 32 MirceaKitsune 2015-07-01 13:57:47 UTC
Today I downloaded a new Linux native game. One of its shaders also triggers this issue, causing the GPU to freeze and the monitor to keep resetting while the shader is active. Unfortunately, the game didn't include any settings to disable this specific shader. I had to abandon playing it for the time being, as doing so eventually brought down my X server and forced me to restart.

I understand that Mesa is open-source software, which the developers are probably not paid to maintain and do so voluntarily. Even so, it's becoming frustrating that for two years there is barely anyone even taking notice of this report. Especially after I found the guilty commit in a GIT bisect, tested two patches, and even posted a replicable trace!

I'm setting the task's priority to Urgent, since I don't know what else I can do to get the developer's attention. I'm also changing the version to GIT, because latest master itself still has this problem.

Please, help fix this major issue! This is a major problem and has no workarounds!
Comment 33 MirceaKitsune 2015-07-01 18:26:24 UTC
At last, I found a partial workaround! More precisely, the solution was posted in #82186 by MWATTT. I must simply export the following environment variable:

R600_DEBUG=notiling

I can confirm that with this, I can freely enable all shaders in Second Life and get no crash for the first time. Stunt Rally also seems to no longer cause the freezes. However, the new game that I posted about in the above comment (One Late Night) still produces the freeze even with this enabled.

Until the problem is fixed in the code, I will add this to my ~/.profile file. Does it disable anything important? I'm also setting the bug back from Urgent to Major, since at least there is now an acceptable solution to prevent it in most cases.
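
As an alternative to setting the variable for the whole login session, a small wrapper can limit it to the programs that need it. This is only a sketch: `with_notiling` is a hypothetical helper name, and the launcher path in the usage comment is a placeholder rather than the real Kokua start script.

```shell
# Hypothetical helper: run one command with R600_DEBUG=notiling set for that
# process only, leaving the rest of the session untouched.
with_notiling() {
  env R600_DEBUG=notiling "$@"
}

# Usage (placeholder path, adjust to the actual viewer launcher):
#   with_notiling ./kokua

# Quick check with a child shell standing in for the application.
seen=$(with_notiling sh -c 'printf "%s\n" "$R600_DEBUG"')
echo "child saw R600_DEBUG=$seen"
```

Scoping the workaround this way keeps other applications on the default settings, which may matter if disabling tiling turns out to cost performance.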
Comment 34 Michel Dänzer 2015-07-02 01:45:19 UTC
(In reply to MirceaKitsune from comment #33)
> However, the new game that I posted about in the above comment (One Late
> Night) still produces the freeze even with this enabled.

That's probably a different issue then and needs to be tracked separately.
Comment 35 Marek Olšák 2015-07-02 10:30:12 UTC
Created attachment 116872 [details] [review]
debug info

Please apply the attached patch and attach the output. Thanks.
Comment 36 Marek Olšák 2015-07-02 10:36:16 UTC
*** Bug 82186 has been marked as a duplicate of this bug. ***
Comment 37 Marek Olšák 2015-07-03 14:32:31 UTC
Workaround committed as 97ec2c694fe568e375ec7a2b85c1acb1e4666b54. The single-sample fast color clear is unlikely to be re-enabled. The real cause of hangs is unknown.
Comment 38 MirceaKitsune 2015-07-04 12:30:35 UTC
Thank you very much for fixing this, that's great news! I will gladly compile MESA from GIT after some things are fixed. I'm sorry if anything that should have worked had to be disabled... that is however much better than this issue taking place.
Comment 39 Ernst Sjöstrand 2015-07-07 23:46:10 UTC
*** Bug 90042 has been marked as a duplicate of this bug. ***
Comment 40 MWATTT 2015-07-30 16:34:27 UTC
Created attachment 117466 [details]
fast color clear hang code
Comment 41 MWATTT 2015-07-30 16:47:43 UTC
I'm able to reproduce the bug (with commit 97ec2c694fe568e375ec7a2b85c1acb1e4666b54 reverted) with a small program written after viewing a Minecraft apitrace.

Apparently, the hang occurs when two textures are attached to an FBO (as GL_COLOR_ATTACHMENT0 and GL_COLOR_ATTACHMENT1), GL_COLOR_ATTACHMENT1 is cleared, glDrawBuffers(2, {GL_COLOR_ATTACHMENT0, GL_COLOR_ATTACHMENT1}) is then called, and the shader of the following draw command doesn't write anything to the second buffer (GL_COLOR_ATTACHMENT1).

I've attached a small program which reproduces the hang.
Comment 42 MirceaKitsune 2015-08-23 12:54:21 UTC
(In reply to MWATTT from comment #41)

Interesting discovery, thank you for looking into that. Perhaps in this case the problem lies somewhere else, and it's not directly the fault of single sample fast color clear? Would it be worth one of the developers looking into it, if there were important advantages to the disabled fast-clear method?

Please however don't make any untested changes in the mainstream branch if so. Marek's fix just got shaders working again, I wouldn't want to go back to the old situation. If his is the best solution, I'm certainly happy with that!
Comment 43 EoD 2015-11-19 20:13:26 UTC
FTR: the above workaround has been reverted in http://cgit.freedesktop.org/mesa/mesa/commit/?id=de59a40f6898e20a61ac4ea0e5995334f6ed2932 and there are no more problems on my r600g at least.

Thanks!
Comment 44 MirceaKitsune 2015-11-24 11:39:21 UTC
Uh... that news makes me a bit uncomfortable. I can't currently compile latest GIT master of MESA on openSUSE Tumbleweed, so I can't test if the problem doesn't return because of it. I'll wait for the relevant system packages to update and test the problematic shaders again, then post another update here.

Have you run Second Life with Deferred Lighting enabled, EoD? It does require an account to log in, so if you don't use it regularly it's understandable if you can't. I would have liked to know your results until I can test the new commit myself, though.
Comment 45 MirceaKitsune 2015-11-24 11:46:20 UTC
I apologize for my stupidity: I just noticed the commit mentioned above is dated 02-08-2015, meaning more than 3 months ago. I have MESA 11.0.5 installed, which is very recent so it must include that commit. There are indeed no problems with the shaders and the problem stays fixed, so all is good :)
