67276 – kernel-3.11 [drm:r600_uvd_ring_test] *ERROR* radeon: ring 5 test failed (0xCAFEDEAD)

Status:	CLOSED FIXED

Alias:	None

Product:	DRI
Classification:	Unclassified
Component:	DRM/Radeon
Version:	DRI git
Hardware:	x86-64 (AMD64) Linux (All)

Importance:	medium normal
Assignee:	Default DRI bug account

Reported:	2013-07-24 21:02 UTC by Joshua Cov.
Modified:	2013-08-15 14:50 UTC
CC List:	2 users

Attachments
relevant dmesg (9.78 KB, text/plain) 2013-07-24 21:02 UTC, Joshua Cov.
radeon_vram_mm WITHOUT the patch (23.20 KB, text/plain) 2013-07-28 14:56 UTC, Joshua Cov.
radeon_vram_mm WITH the patch (14.37 KB, text/plain) 2013-07-28 15:27 UTC, Joshua Cov.

Description Joshua Cov. 2013-07-24 21:02:05 UTC

Created attachment 82960
relevant dmesg

I have a strange problem that happens sporadically (usually on cold boot, but I saw it once on a restart). The monitor goes to sleep for about 5 seconds, then the X server starts loading and the screen turns into small colorful squares. The problem started to appear in the last week or so, and I'm sure it started after I began applying the latest drm-fixes-3.11. My card is:

01:00.0 VGA compatible controller: ATI Technologies Inc Turks XT [AMD Radeon HD 6600 Series] (prog-if 00 [VGA controller])
        Subsystem: PC Partner Limited Device e194
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0, Cache Line Size: 64 bytes
        Interrupt: pin A routed to IRQ 57
        Region 0: Memory at c0000000 (64-bit, prefetchable) [size=256M]
        Region 2: Memory at fe620000 (64-bit, non-prefetchable) [size=128K]
        Region 4: I/O ports at e000 [size=256]
        Expansion ROM at fe600000 [disabled] [size=128K]
        Capabilities: [50] Power Management version 3
                Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
                Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: [58] Express (v2) Legacy Endpoint, MSI 00
                DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s <4us, L1 unlimited
                        ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
                DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
                        RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop+
                        MaxPayload 128 bytes, MaxReadReq 512 bytes
                DevSta: CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr- TransPend-
                LnkCap: Port #0, Speed 5GT/s, Width x16, ASPM L0s L1, Latency L0 <64ns, L1 <1us
                        ClockPM- Surprise- LLActRep- BwNot-
                LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- Retrain- CommClk+
                        ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
                LnkSta: Speed 5GT/s, Width x16, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
                DevCap2: Completion Timeout: Not Supported, TimeoutDis-, LTR-, OBFF Not Supported
                DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
                LnkCtl2: Target Link Speed: 5GT/s, EnterCompliance- SpeedDis-
                         Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
                         Compliance De-emphasis: -6dB
                LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete-, EqualizationPhase1-
                         EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
        Capabilities: [a0] MSI: Enable+ Count=1/1 Maskable- 64bit+
                Address: 00000000fee0f00c  Data: 41e2
        Capabilities: [100 v1] Vendor Specific Information: ID=0001 Rev=1 Len=010 <?>
        Capabilities: [150 v1] Advanced Error Reporting
                UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
                CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
                CEMsk:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
                AERCap: First Error Pointer: 00, GenCap+ CGenEn- ChkCap+ ChkEn-
        Kernel driver in use: radeon

I still haven't found a way to reproduce it on demand.

Comment 1 Alex Deucher 2013-07-24 21:42:09 UTC

Does it only happen when you enable dpm?

Comment 2 Joshua Cov. 2013-07-24 22:05:44 UTC

(In reply to comment #1)
> Does it only happen when you enable dpm?

I'm not sure. I haven't tried with dpm disabled.

Comment 3 Joshua Cov. 2013-07-25 12:02:18 UTC

I have to say that I'm using kernel 3.10.3 with all radeon patches (drm-next-3.11 and drm-fixes-3.11) applied to it. The problem started after applying the fixes...

However, I needed to adapt commit 1b6e5fd5f4fc152064f4f71cea0bcfeb49e29b8b "drm/radeon: add missing ttm_eu_backoff_reservation to radeon_bo_list_validate" for 3.10 by deleting the ticket parameter from the ttm_eu_backoff_reservation() call. For kernel 3.10 the commit looks like this:

diff --git a/drivers/gpu/drm/radeon/radeon_object.c b/drivers/gpu/drm/radeon/radeon_object.c
index 0219d26..2020bf4 100644
--- a/drivers/gpu/drm/radeon/radeon_object.c
+++ b/drivers/gpu/drm/radeon/radeon_object.c
@@ -377,6 +377,7 @@ int radeon_bo_list_validate(struct ww_acquire_ctx *ticket,
       domain = lobj->alt_domain;
       goto retry;
     }
+    ttm_eu_backoff_reservation(head);
     return r;
   }
 }

I suspect this commit is the culprit. However, I cannot verify it because I still cannot trigger the error on demand.
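
For readers unfamiliar with what a "backoff" on the error path is for, here is a rough user-space sketch. It is a toy model only: `fake_bo`, `reserve_all`, `backoff_all`, `validate_list` and `count_reserved` are hypothetical names invented for illustration and are not the TTM or radeon API. It only shows the general pattern the patch title describes, namely that an error return without a backoff leaves every buffer in the list reserved.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical buffer object; it only models "reserved or not". */
struct fake_bo {
    int reserved;
    int valid;      /* 0 makes validation fail, to exercise the error path */
};

/* Reserve every buffer in the list before validating any of them. */
static void reserve_all(struct fake_bo *list, size_t n)
{
    size_t i;

    for (i = 0; i < n; i++)
        list[i].reserved = 1;
}

/* Plays the role of a backoff call: release every reservation. */
static void backoff_all(struct fake_bo *list, size_t n)
{
    size_t i;

    for (i = 0; i < n; i++)
        list[i].reserved = 0;
}

/*
 * Returns 0 on success (buffers stay reserved for the caller) or -1 on
 * failure. do_backoff selects whether the error path releases the
 * reservations, i.e. whether the "missing" call is present.
 */
static int validate_list(struct fake_bo *list, size_t n, int do_backoff)
{
    size_t i;

    assert(list != NULL);
    reserve_all(list, n);
    for (i = 0; i < n; i++) {
        if (!list[i].valid) {
            if (do_backoff)
                backoff_all(list, n);
            return -1;
        }
    }
    return 0;
}

/* Counts how many buffers are still reserved. */
static size_t count_reserved(const struct fake_bo *list, size_t n)
{
    size_t i, count = 0;

    for (i = 0; i < n; i++)
        count += list[i].reserved ? 1 : 0;
    return count;
}
```

Calling `validate_list()` on a list containing one invalid buffer with `do_backoff` set to 0 leaves all entries reserved after the failure, while the same call with `do_backoff` set to 1 leaves none reserved.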

Comment 4 Alex Deucher 2013-07-25 12:52:53 UTC

If you didn't pull in the drm reservation changes you don't need the fix.  Do you have this problem straight 3.11?

Comment 5 Joshua Cov. 2013-07-25 13:06:49 UTC

(In reply to comment #4)
> If you didn't pull in the drm reservation changes you don't need the fix. 
> Do you have this problem straight 3.11?

I pulled everything from your git that you sent to Dave for mainline. The commit in question was in the patch series and got pulled in, so I had to adjust it for 3.10. I suppose I haven't picked up the "drm reservation changes" you're talking about.

Can this commit be the culprit of the problem?

Comment 6 Joshua Cov. 2013-07-25 13:10:21 UTC

(In reply to comment #4)
> If you didn't pull in the drm reservation changes you don't need the fix. 
> Do you have this problem straight 3.11?

I don't remember having this problem with straight 3.11. As I said, it started to appear after I pulled the fixes in drm-fixes-3.11.

Comment 7 Alex Deucher 2013-07-25 13:26:58 UTC

(In reply to comment #5)
> 
> Can this commit be the culprit of the problem?

Yes, it could be.  As I said, it was a fix for the new reservation code which it sounds like you didn't pull in.

Comment 8 Joshua Cov. 2013-07-25 13:54:30 UTC

(In reply to comment #7)
> (In reply to comment #5)
> > 
> > Can this commit be the culprit of the problem?
> 
> Yes, it could be.  As I said, it was a fix for the new reservation code
> which it sounds like you didn't pull in.

I'll take a look later to see if I can backport the reservation code to 3.10 (if it's worth it). I see some commits (280cf2118675..8f262540e61c7, or especially ecff665f5e in git://people.freedesktop.org/~airlied/linux drm-next) but I think it's easier to just remove commit 1b6e5fd5f4fc152064f4f71cea0bcfeb49e29b8b from your patchset and see if the problem still occurs in the next 3-4 days.

Comment 9 Alex Deucher 2013-07-25 14:17:41 UTC

(In reply to comment #8)
> I'll take a look later if I can backport the reservation code back to 3.10
> (if it's worth). I see some commits (280cf2118675..8f262540e61c7, or
> especially ecff665f5e in git://people.freedesktop.org/~airlied/linux
> drm-next) but I think it's easier to just remove commit
> 1b6e5fd5f4fc152064f4f71cea0bcfeb49e29b8b from your patchset and see if the
> problem still occurs in the next 3-4 days.

Probably easiest to just drop the patch.  The new reservation code doesn't provide any advantages at this point.

Comment 10 Joshua Cov. 2013-07-25 14:22:48 UTC

(In reply to comment #9)
> (In reply to comment #8)
> > I'll take a look later if I can backport the reservation code back to 3.10
> > (if it's worth). I see some commits (280cf2118675..8f262540e61c7, or
> > especially ecff665f5e in git://people.freedesktop.org/~airlied/linux
> > drm-next) but I think it's easier to just remove commit
> > 1b6e5fd5f4fc152064f4f71cea0bcfeb49e29b8b from your patchset and see if the
> > problem still occurs in the next 3-4 days.
> 
> Probably easiest to just drop the patch.  The new reservation code doesn't
> provide any advantages at this point.

I think you're right. I'll test it in the coming days and report back if the problem occurs.

Comment 11 Joshua Cov. 2013-07-26 11:10:50 UTC

The bug still appears, so this is not the faulty patch. Obviously it happens during cold boot, otherwise I haven't seen it.

Can you help me to debug this?

I have to say I haven't seen this with kernel-3.9 and all UVD and DPM stuff applied. It worked there flawlessly.

Comment 12 Alex Deucher 2013-07-26 14:06:00 UTC

(In reply to comment #11)
> The bug still appears, so this is not the faulty patch. Obviously it happens
> during cold boot, otherwise I haven't seen it.
> 
> Can you help me to debug this?
> 
> I have to say I haven't seen this with kernel-3.9 and all UVD and DPM stuff
> applied. It worked there flawlessly.

Bisect?

Comment 13 Joshua Cov. 2013-07-26 14:44:17 UTC

(In reply to comment #12)
> (In reply to comment #11)
> > The bug still appears, so this is not the faulty patch. Obviously it happens
> > during cold boot, otherwise I haven't seen it.
> > 
> > Can you help me to debug this?
> > 
> > I have to say I haven't seen this with kernel-3.9 and all UVD and DPM stuff
> > applied. It worked there flawlessly.
> 
> Bisect?

I cannot forcibly trigger the problem. That's why I'm hoping for more detailed debug messages. Maybe something connected with ring-testing or maybe data feeding for those rings or anything that has to do with the ring.

I'm also wondering why the problem doesn't appear when I restart the pc? Is it some kind of a race condition? How can I get the info that's in the ring at the time the error occurs? How is the ring testing done?

Comment 14 Alex Deucher 2013-07-26 15:00:14 UTC

(In reply to comment #13)
> 
> I cannot forcibly trigger the problem. That's why I'm hoping for more
> detailed debug messages. Maybe something connected with ring-testing or
> maybe data feeding for those rings or anything that has to do with the ring.
> 
> I'm also wondering why the problem doesn't appear when I restart the pc? Is it
> some kind of a race condition? How can I get the info that's in the ring
> at the time the error occurs? How is the ring testing done?

I'm not sure.  I've never seen the problem.  It could be a problem with another one of the patches.  I've never tested backporting the patches to 3.10.  If you don't have problems with 3.9 or 3.11, it's hard to say.

The ring testing happens when we initially set up the ring.  The basic idea is to set a scratch register or memory buffer to a known value, then write a new value to that register or buffer using the ring.  If the new value isn't there once the ring is done, the ring isn't working properly.
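
The idea described above can be sketched in user space roughly as follows. This is a simulation under stated assumptions, not the driver code: `fake_ring`, `ring_emit_write`, `ring_process` and `ring_test` are hypothetical names, the sentinel 0xCAFEDEAD is taken from the error message in this bug's title, and the expected value 0xDEADBEEF and the poll limit are assumptions chosen for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

#define SCRATCH_SENTINEL  0xCAFEDEADu /* value reported in the error message */
#define SCRATCH_EXPECTED  0xDEADBEEFu /* assumed value the ring should write */
#define RING_TEST_TIMEOUT 1000        /* hypothetical poll limit */

/* Hypothetical stand-in for a hardware ring plus its scratch register. */
struct fake_ring {
    uint32_t scratch;  /* models the scratch register */
    int      working;  /* 1 if the engine actually processes commands */
    uint32_t pending;  /* value queued by the last emitted command */
    int      has_cmd;  /* 1 while a command is waiting in the ring */
};

/* Queue a "write this value to the scratch register" command. */
static void ring_emit_write(struct fake_ring *ring, uint32_t value)
{
    ring->pending = value;
    ring->has_cmd = 1;
}

/* Models the engine consuming one queued command, if it is alive. */
static void ring_process(struct fake_ring *ring)
{
    if (ring->working && ring->has_cmd) {
        ring->scratch = ring->pending;
        ring->has_cmd = 0;
    }
}

/* Returns 0 if the ring wrote the expected value, -1 otherwise. */
static int ring_test(struct fake_ring *ring)
{
    int i;

    assert(ring != NULL);

    /* Step 1: put a known sentinel into the scratch location. */
    ring->scratch = SCRATCH_SENTINEL;

    /* Step 2: ask the ring itself to overwrite it. */
    ring_emit_write(ring, SCRATCH_EXPECTED);

    /* Step 3: poll until the new value shows up or we give up. */
    for (i = 0; i < RING_TEST_TIMEOUT; i++) {
        ring_process(ring);
        if (ring->scratch == SCRATCH_EXPECTED)
            return 0;
    }

    /* The sentinel is still there, so the ring never executed the write. */
    fprintf(stderr, "ring test failed (0x%08X)\n", (unsigned)ring->scratch);
    return -1;
}
```

In this model, a failure message that prints the sentinel value means the write queued on the ring was never executed, which matches the shape of the error in the title.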

Comment 15 Joshua Cov. 2013-07-26 15:13:52 UTC

(In reply to comment #14)
> (In reply to comment #13)
> > 
> > I cannot forcibly trigger the problem. That's why I'm hoping for more
> > detailed debug messages. Maybe something connected with ring-testing or
> > maybe data feeding for those rings or anything that has to do with the ring.
> > 
> > I'm also wondering why the problem doesn't appear when I restart the pc? Is it
> > some kind of a race condition? How can I get the info that's in the ring
> > at the time the error occurs? How is the ring testing done?
> 
> I'm not sure.  I've never seen the problem.  It could be a problem with
> another one of the patches.  I've never tested backporting the patches to
> 3.10.  If you don't have problems with 3.9 or 3.11, it's hard to say.
> 
> The ring testing happens when we initially set up the ring.  The basic idea
> is to clear a scratch register or memory buffer to a known value, then write
> a new value to that register or buffer using the ring.  When the ring is
> done if the new value isn't there, the ring isn't working properly.

Do you have any idea why this happens randomly on cold boots but not when restarting the pc? I think the whole initialization process should be the same every time the system is booted, so that the ring initialization should fail every time.

On a side note: earlier I had an interesting problem. When rebooting the pc after a drm-lockup I could see the last screen before the restart. This means that some memory buffers were not completely cleared during the reboots.

Maybe I have a similar problem here: The faulty register isn't rewritten on every reboot???

Comment 16 Alex Deucher 2013-07-26 16:47:29 UTC

(In reply to comment #15)
> Do you have any idea why this happens randomly on cold boots but not when
> restarting the pc? I think the whole initialization process should be the
> same every time the system is booted, so that the ring initialization should
> fail every time.
> 
> On a side note: earlier I had an interesting problem. When rebooting the pc
> after a drm-lockup I could see the last screen before the restart. This
> means that some memory buffers were not completely cleared during the
> reboots.

Memory is never explicitly cleared.

> 
> Maybe I have a similar problem here: The faulty register isn't rewritten on
> every reboot???

The only way to figure out what is going on is to bisect.  There are a lot of things that could cause ring tests to fail.

Comment 17 Alex Deucher 2013-07-26 16:48:27 UTC

(In reply to comment #16)
> (In reply to comment #15)
> > Do you have any idea why this happens randomly on cold boots but not when
> > restarting the pc? I think the whole initialization process should be the
> > same every time the system is booted, so that the ring initialization should
> > fail every time.
> > 
> > On a side note: earlier I had an interesting problem. When rebooting the pc
> > after a drm-lockup I could see the last screen before the restart. This
> > means that some memory buffers were not completely cleared during the
> > reboots.
> 
> Memory is never explicitly cleared.

Unless there is a specific reason to.  Otherwise it just adds extra overhead.

Comment 18 Joshua Cov. 2013-07-28 06:40:25 UTC

I'm really puzzled. Building the radeon module into the kernel apparently doesn't trigger the bug. Otherwise it occurs in about 50% of all cold boots.

Comment 19 Toni Ballesta 2013-07-28 09:05:22 UTC

Kernel 3.11 is still in beta (RC). I'd recommend waiting for the stable version, and for 1 or 2 minor versions after that if possible.

Comment 20 Toni Ballesta 2013-07-28 09:07:09 UTC

(In reply to comment #19)
> Kernel 3.11 is still in beta (RC). I'd recommend waiting for the stable
> version, and for 1 or 2 minor versions after that if possible.

Oops, sorry, I was looking at LibreOffice! :p

Comment 21 Joshua Cov. 2013-07-28 12:15:45 UTC

(In reply to comment #12)
> (In reply to comment #11)
> > The bug still appears, so this is not the faulty patch. Obviously it happens
> > during cold boot, otherwise I haven't seen it.
> > 
> > Can you help me to debug this?
> > 
> > I have to say I haven't seen this with kernel-3.9 and all UVD and DPM stuff
> > applied. It worked there flawlessly.
> 
> Bisect?

I tracked this down to commit 9cc2e0e9f13315559c85c9f99f141e420967c955 "drm/radeon: never unpin UVD bo v3". After reverting this commit I couldn't reproduce the bug, despite numerous attempts to cold boot the pc.

CC: Christian König

Comment 22 Christian König 2013-07-28 12:28:21 UTC

Can you give us the output of /sys/kernel/debug/dri/0/radeon_vram_mm with and without the patch in question?

Comment 23 Joshua Cov. 2013-07-28 14:56:12 UTC

Created attachment 83123
radeon_vram_mm WITHOUT the patch

(In reply to comment #22)
> Can you give us the output of /sys/kernel/debug/dri/0/radeon_vram_mm with
> and without the patch in question?

radeon_vram_mm WITHOUT the patch

Comment 24 Joshua Cov. 2013-07-28 15:27:07 UTC

Created attachment 83125
radeon_vram_mm WITH the patch

(In reply to comment #22)
> Can you give us the output of /sys/kernel/debug/dri/0/radeon_vram_mm with
> and without the patch in question?

radeon_vram_mm from a working system WITH the patch

Comment 25 Joshua Cov. 2013-08-09 11:27:40 UTC

Ping on this.

Comment 26 Alex Deucher 2013-08-09 14:01:32 UTC

Is this still an issue with Dave's latest drm-fixes branch:
http://cgit.freedesktop.org/~airlied/linux/log/?h=drm-fixes

Comment 27 Joshua Cov. 2013-08-15 14:49:07 UTC

I think I can close this. After applying the latest drm-fixes-3.11 as well as drm-next-3.12 I haven't seen the issue. I'll reopen this bug if the problem occurs again.
