Bug 59089

Summary: [bisected, regression] flood of GPU fault detected in logs caused by 9af20... drm/radeon: fix fence locking in the pageflip callback
Product: Mesa Reporter: Alexandre Demers <alexandre.f.demers>
Component: Drivers/Gallium/r600 Assignee: Default DRI bug account <dri-devel>
Status: RESOLVED FIXED QA Contact:
Severity: normal    
Priority: medium CC: p00hzone
Version: git   
Hardware: All   
OS: All   
See Also: https://bugs.freedesktop.org/show_bug.cgi?id=58667
Whiteboard:

Description Alexandre Demers 2013-01-06 20:39:04 UTC
The logs (dmesg, kernel, errors and everything) are flooded with "GPU fault detected" messages of the following form:
[  533.928472] radeon 0000:03:00.0: GPU fault detected: 146 0x00335514
[  533.928477] radeon 0000:03:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00000000
[  533.928483] radeon 0000:03:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x00000000

From time to time, the address value is different from 0x00000000. The messages are produced at an awful rate, generating GB of logs in no time.
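
To tell a flood from an occasional fault, the messages can be counted and turned into an approximate rate. The following is only a minimal sketch in Python, assuming the log lines look exactly like the three quoted above; the function name summarize_faults and the regular expressions are illustrative and are not part of the kernel, mesa, or any existing tool.

```python
import re
from collections import Counter

# Header line of each fault report, e.g.
# [  533.928472] radeon 0000:03:00.0: GPU fault detected: 146 0x00335514
FAULT_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+radeon\s+(?P<dev>\S+):\s+GPU fault detected:"
)
# Follow-up line carrying the faulting address.
ADDR_RE = re.compile(
    r"VM_CONTEXT1_PROTECTION_FAULT_ADDR\s+(?P<addr>0x[0-9A-Fa-f]+)"
)


def summarize_faults(log_text):
    """Return (fault_count, approx_faults_per_second, Counter of addresses).

    Hypothetical helper: the rate is simply the number of fault headers
    divided by the time between the first and last one, and is 0.0 when
    fewer than two faults were seen.
    """
    timestamps = []
    addresses = Counter()
    for line in log_text.splitlines():
        match = FAULT_RE.search(line)
        if match:
            timestamps.append(float(match.group("ts")))
            continue
        match = ADDR_RE.search(line)
        if match:
            addresses[int(match.group("addr"), 16)] += 1
    count = len(timestamps)
    span = timestamps[-1] - timestamps[0] if count > 1 else 0.0
    rate = count / span if span > 0 else 0.0
    return count, rate, addresses
```

Feeding it the saved output of dmesg (for example, the contents of a file you captured yourself) gives the number of faults, a rough faults-per-second figure, and how often each address appeared, which also shows how many reports carried an address other than 0x00000000.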

Appeared in kernel 3.8.0-rcX using a cayman GPU (HD 6950).

Bisecting to identify when the flood first appeared points at:
Commit: 4ac0533abaec2b83a7f2c675010eedd55664bc26

Author: Jerome Glisse <jglisse@redhat.com>  2012-12-13 12:08:11
Committer: Alex Deucher <alexander.deucher@amd.com>  2012-12-14 10:45:24
Parent: 9af20792124850369e764965690b99b20623dfc4 (drm/radeon: fix fence locking in the pageflip callback)
Branch: remotes/origin/master
Follows: v3.7-rc7
Precedes: v3.8-rc1

    drm/radeon: fix htile buffer size computation for command stream checker
    
    Fix the size computation of the htile buffer.
    
    Signed-off-by: Jerome Glisse <jglisse@redhat.com>
    Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

May be related to some of the crashes seen in bug 58667.
Comment 1 Alex Deucher 2013-01-07 20:33:57 UTC
Should be fixed with this mesa commit:
http://cgit.freedesktop.org/mesa/mesa/commit/?id=4332f6fc185f968e7563e748b8c949021937c935
Comment 2 Alexandre Demers 2013-01-08 01:20:59 UTC
Seems good now.
Comment 3 Thomas Rohloff 2013-01-08 04:57:08 UTC
I don't want to play the bad guy but for me this is not fixed, just reduced.
Comment 4 Alexandre Demers 2013-01-08 05:53:46 UTC
(In reply to comment #3)
> I don't want to play the bad guy but for me this is not fixed, just reduced.

Well, I closed it because I don't have the continuous GPU fault flood happening anymore. However, I was unable to determine if there was still a GPU fault happening. This bug is really about the flood.

So, I don't have any problem in reopening it if you do experience a flood of GPU faults. I was getting GB of logs in no time.

Are you still seeing GPU faults only in some circumstances (games or specific applications), or is just opening a session (for me it's with Gnome Shell) enough? Also, keep in mind this bug is pinpointing a specific commit.
Comment 5 Anthony Waters 2013-01-14 03:56:58 UTC
I get the GPU faults starting with mesa commit

3e163a137be7f9a80ec720903c4bda028de5681f is the first bad commit
commit 3e163a137be7f9a80ec720903c4bda028de5681f
Author: Marek Olšák <maraeo@gmail.com>
Date:   Thu Nov 29 02:55:01 2012 +0100

    gallium/postprocess: share pipe_context and cso_context with the state tracker
    
    Using one context instead of two is more efficient and
    we can skip another context flush.
    
    Reviewed-by: Brian Paul <brianp@vmware.com>
Comment 6 Alexandre Demers 2013-01-14 06:01:26 UTC
(In reply to comment #5)
> I get the GPU faults starting with mesa commit
> 
> 3e163a137be7f9a80ec720903c4bda028de5681f is the first bad commit
> commit 3e163a137be7f9a80ec720903c4bda028de5681f
> Author: Marek Olšák <maraeo@gmail.com>
> Date:   Thu Nov 29 02:55:01 2012 +0100
> 
>     gallium/postprocess: share pipe_context and cso_context with the state
> tracker
>     
>     Using one context instead of two is more efficient and
>     we can skip another context flush.
>     
>     Reviewed-by: Brian Paul <brianp@vmware.com>

Is it a flood? Other commits may create GPU faults, but it shouldn't flood your logs. I think it would be better to track different sources of GPU faults in different bugs.
Comment 7 Anthony Waters 2013-01-14 14:13:31 UTC
I would consider it a flood, the message continually appears until glxgears is exited.  Can you confirm whether 3e163a137be7f9a80ec720903c4bda028de5681f in mesa stops all of the GPU faults? If it does I will open another bug report seeing as it may be different.
Comment 8 Alexandre Demers 2013-01-17 02:58:26 UTC
(In reply to comment #7)
> I would consider it a flood, the message continually appears until glxgears
> is exited.  Can you confirm whether 3e163a137be7f9a80ec720903c4bda028de5681f
> in mesa stops all of the GPU faults? If it does I will open another bug
> report seeing as it may be different.

Neither glxgears nor Heroes of Newerth has produced a GPU fault flood since the fix was applied in mesa. If you are still experiencing a flood when glxgears is running, I'm pretty sure it is not the same thing under the hood. Which makes me think: what kernel version are you using? Did you test with the latest mesa version since the fix was pushed to git? Which GPU do you have?

I'm running the latest kernel from Linus' git and the latest mesa git, and I'm using an HD 6950.
Comment 9 Anthony Waters 2013-01-17 03:24:18 UTC
Mesa is at latest git, however, my kernel wasn't at Linus' git, so that may be the issue, using HD 6950. I will try the newest kernel and if it doesn't work I'll create a new bug report.
Comment 10 Alexandre Demers 2013-01-17 04:03:36 UTC
(In reply to comment #9)
> Mesa is at latest git, however, my kernel wasn't at Linus' git, so that may
> be the issue, using HD 6950. I will try the newest kernel and if it doesn't
> work I'll create a new bug report.

Ok, let me know. Meanwhile, I'll test something on my side to see if your bad commit could be related to another bug I'm experiencing with Tropics.
Comment 11 Alexandre Demers 2013-01-17 05:05:48 UTC
(In reply to comment #9)
> Mesa is at latest git, however, my kernel wasn't at Linus' git, so that may
> be the issue, using HD 6950. I will try the newest kernel and if it doesn't
> work I'll create a new bug report.

Also, you could see if updating libdrm and/or the ddx driver helps; there were some commits for both of them in the last couple of weeks.
Comment 12 Anthony Waters 2013-01-18 15:57:04 UTC
I have been able to get rid of the flood, but the message still appears sometimes in dmesg, so there is some other bug that exists.
Comment 13 Alexandre Demers 2013-01-18 16:49:47 UTC
(In reply to comment #12)
> I have been able to get rid of the flood, the message appears some times in
> dmesg so there is some other bug that exists.

I agree with you on that point. Thomas and I are also experiencing GPU faults in some cases.
Comment 14 Alexandre Demers 2013-02-20 05:39:51 UTC
Closing since the original bug/commit was fixed. The remaining GPU faults must have a different root cause.
