Summary: | kernel-3.11 [drm:r600_uvd_ring_test] *ERROR* radeon: ring 5 test failed (0xCAFEDEAD) | ||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|
Product: | DRI | Reporter: | Joshua Cov. <joshuacov> | ||||||||
Component: | DRM/Radeon | Assignee: | Default DRI bug account <dri-devel> | ||||||||
Status: | CLOSED FIXED | QA Contact: | |||||||||
Severity: | normal | ||||||||||
Priority: | medium | CC: | ckoenig.leichtzumerken, octoploid | ||||||||
Version: | DRI git | ||||||||||
Hardware: | x86-64 (AMD64) | ||||||||||
OS: | Linux (All) | ||||||||||
Whiteboard: | |||||||||||
i915 platform: | i915 features: | ||||||||||
Attachments: |
|
Does it only happen when you enable dpm?

(In reply to comment #1)
> Does it only happen when you enable dpm?

I'm not sure. I haven't tried with dpm disabled.

I have to say that I'm using kernel 3.10.3 with all radeon patches (drm-next-3.11 and drm-fixes-3.11) applied to it. The problem started after applying the fixes. However, I needed to adapt commit 1b6e5fd5f4fc152064f4f71cea0bcfeb49e29b8b "drm/radeon: add missing ttm_eu_backoff_reservation to radeon_bo_list_validate" for 3.10 by deleting the ticket parameter from the call to ttm_eu_backoff_reservation(). For kernel 3.10 the commit looks like this:

diff --git a/drivers/gpu/drm/radeon/radeon_object.c b/drivers/gpu/drm/radeon/radeon_object.c
index 0219d26..2020bf4 100644
--- a/drivers/gpu/drm/radeon/radeon_object.c
+++ b/drivers/gpu/drm/radeon/radeon_object.c
@@ -377,6 +377,7 @@ int radeon_bo_list_validate(struct ww_acquire_ctx *ticket,
                     domain = lobj->alt_domain;
                     goto retry;
                 }
+                ttm_eu_backoff_reservation(head);
                 return r;
             }
         }

I suspect this commit is the culprit. However, I cannot verify it because I still cannot trigger the error on demand.

If you didn't pull in the drm reservation changes you don't need the fix. Do you have this problem with straight 3.11?

(In reply to comment #4)
> If you didn't pull in the drm reservation changes you don't need the fix.
> Do you have this problem with straight 3.11?

I pull everything from your git tree that you sent to Dave for mainline. The commit in question was in the patch series and got pulled in, so I had to adjust it for 3.10. I suppose I haven't picked up the "drm reservation changes" you're talking about.

Can this commit be the culprit of the problem?

(In reply to comment #4)
> If you didn't pull in the drm reservation changes you don't need the fix.
> Do you have this problem with straight 3.11?

I don't remember having this problem with straight 3.11.
As I said, it started to appear after I pulled the fixes in drm-fixes-3.11.

(In reply to comment #5)
>
> Can this commit be the culprit of the problem?

Yes, it could be. As I said, it was a fix for the new reservation code which it sounds like you didn't pull in.

(In reply to comment #7)
> (In reply to comment #5)
> >
> > Can this commit be the culprit of the problem?
>
> Yes, it could be. As I said, it was a fix for the new reservation code
> which it sounds like you didn't pull in.

I'll take a look later if I can backport the reservation code back to 3.10 (if it's worth it). I see some commits (280cf2118675..8f262540e61c7, or especially ecff665f5e in git://people.freedesktop.org/~airlied/linux drm-next) but I think it's easier to just remove commit 1b6e5fd5f4fc152064f4f71cea0bcfeb49e29b8b from your patchset and see if the problem still occurs in the next 3-4 days.

(In reply to comment #8)
> I'll take a look later if I can backport the reservation code back to 3.10
> (if it's worth it). I see some commits (280cf2118675..8f262540e61c7, or
> especially ecff665f5e in git://people.freedesktop.org/~airlied/linux
> drm-next) but I think it's easier to just remove commit
> 1b6e5fd5f4fc152064f4f71cea0bcfeb49e29b8b from your patchset and see if the
> problem still occurs in the next 3-4 days.

Probably easiest to just drop the patch. The new reservation code doesn't provide any advantages at this point.

(In reply to comment #9)
> (In reply to comment #8)
> > I'll take a look later if I can backport the reservation code back to 3.10
> > (if it's worth it). I see some commits (280cf2118675..8f262540e61c7, or
> > especially ecff665f5e in git://people.freedesktop.org/~airlied/linux
> > drm-next) but I think it's easier to just remove commit
> > 1b6e5fd5f4fc152064f4f71cea0bcfeb49e29b8b from your patchset and see if the
> > problem still occurs in the next 3-4 days.
>
> Probably easiest to just drop the patch. The new reservation code doesn't
> provide any advantages at this point.
I think you're right. I'll test it in the coming days and report back if the problem occurs.

The bug still appears, so this is not the faulty patch. Apparently it happens during cold boot; otherwise I haven't seen it.

Can you help me to debug this?

I have to say I haven't seen this with kernel-3.9 and all UVD and DPM stuff applied. It worked there flawlessly.

(In reply to comment #11)
> The bug still appears, so this is not the faulty patch. Apparently it happens
> during cold boot; otherwise I haven't seen it.
>
> Can you help me to debug this?
>
> I have to say I haven't seen this with kernel-3.9 and all UVD and DPM stuff
> applied. It worked there flawlessly.

Bisect?

(In reply to comment #12)
> (In reply to comment #11)
> > The bug still appears, so this is not the faulty patch. Apparently it happens
> > during cold boot; otherwise I haven't seen it.
> >
> > Can you help me to debug this?
> >
> > I have to say I haven't seen this with kernel-3.9 and all UVD and DPM stuff
> > applied. It worked there flawlessly.
>
> Bisect?

I cannot trigger the problem on demand. That's why I'm hoping for more detailed debug messages. Maybe something connected with ring-testing or maybe data feeding for those rings or anything that has to do with the ring.

I'm also wondering why the problem doesn't appear when I restart the pc? Is it some kind of a race condition? How can I get the info that's in the ring at the time the error occurs? How is the ring testing done?

(In reply to comment #13)
>
> I cannot trigger the problem on demand. That's why I'm hoping for more
> detailed debug messages. Maybe something connected with ring-testing or
> maybe data feeding for those rings or anything that has to do with the ring.
>
> I'm also wondering why the problem doesn't appear when I restart the pc? Is it
> some kind of a race condition? How can I get the info that's in the ring
> at the time the error occurs? How is the ring testing done?

I'm not sure. I've never seen the problem.
It could be a problem with another one of the patches. I've never tested backporting the patches to 3.10. If you don't have problems with 3.9 or 3.11, it's hard to say.

The ring testing happens when we initially set up the ring. The basic idea is to clear a scratch register or memory buffer to a known value, then write a new value to that register or buffer using the ring. When the ring is done, if the new value isn't there, the ring isn't working properly.

(In reply to comment #14)
> (In reply to comment #13)
> >
> > I cannot trigger the problem on demand. That's why I'm hoping for more
> > detailed debug messages. Maybe something connected with ring-testing or
> > maybe data feeding for those rings or anything that has to do with the ring.
> >
> > I'm also wondering why the problem doesn't appear when I restart the pc? Is it
> > some kind of a race condition? How can I get the info that's in the ring
> > at the time the error occurs? How is the ring testing done?
>
> I'm not sure. I've never seen the problem. It could be a problem with
> another one of the patches. I've never tested backporting the patches to
> 3.10. If you don't have problems with 3.9 or 3.11, it's hard to say.
>
> The ring testing happens when we initially set up the ring. The basic idea
> is to clear a scratch register or memory buffer to a known value, then write
> a new value to that register or buffer using the ring. When the ring is
> done, if the new value isn't there, the ring isn't working properly.

Do you have any idea why this happens randomly on cold boots but not when restarting the pc? I think the whole initialization process should be the same every time the system is booted, so that the ring initialization should fail every time.

On a side note: Earlier I had an interesting problem. When rebooting the pc after a drm-lockup I could see the last screen before the restart. This means that some memory buffers were not completely cleared during the reboots.
Maybe I have a similar problem here: The faulty register isn't rewritten on every reboot???

(In reply to comment #15)
> Do you have any idea why this happens randomly on cold boots but not when
> restarting the pc? I think the whole initialization process should be the
> same every time the system is booted, so that the ring initialization should
> fail every time.
>
> On a side note: Earlier I had an interesting problem. When rebooting the pc
> after a drm-lockup I could see the last screen before the restart. This
> means that some memory buffers were not completely cleared during the
> reboots.

Memory is never explicitly cleared.

>
> Maybe I have a similar problem here: The faulty register isn't rewritten on
> every reboot???

The only way to figure out what is going on is to bisect. There are a lot of things that could cause ring tests to fail.

(In reply to comment #16)
> (In reply to comment #15)
> > Do you have any idea why this happens randomly on cold boots but not when
> > restarting the pc? I think the whole initialization process should be the
> > same every time the system is booted, so that the ring initialization should
> > fail every time.
> >
> > On a side note: Earlier I had an interesting problem. When rebooting the pc
> > after a drm-lockup I could see the last screen before the restart. This
> > means that some memory buffers were not completely cleared during the
> > reboots.
>
> Memory is never explicitly cleared.

Unless there is a specific reason to. Otherwise it just adds extra overhead.

I'm really puzzled. Building the radeon module into the kernel apparently doesn't trigger the bug. Otherwise it occurs in about 50% of all cold boots.

The kernel 3.11 is still beta (RC). I recommend waiting for the stable version, and for 1 or 2 minor versions after that if possible.

(In reply to comment #19)
> The kernel 3.11 is still beta (RC). I recommend waiting for the stable version, and
> for 1 or 2 minor versions after that if possible.
Oops, sorry, I was looking at LibreOffice! :p

(In reply to comment #12)
> (In reply to comment #11)
> > The bug still appears, so this is not the faulty patch. Apparently it happens
> > during cold boot; otherwise I haven't seen it.
> >
> > Can you help me to debug this?
> >
> > I have to say I haven't seen this with kernel-3.9 and all UVD and DPM stuff
> > applied. It worked there flawlessly.
>
> Bisect?

I tracked this down to commit 9cc2e0e9f13315559c85c9f99f141e420967c955 "drm/radeon: never unpin UVD bo v3". After reverting this commit I couldn't reproduce the bug, despite numerous attempts to cold boot the pc.

CC: Christian König

Can you give us the output of /sys/kernel/debug/dri/0/radeon_vram_mm with and without the patch in question?

Created attachment 83123 [details]
radeon_vram_mm WITHOUT the patch

(In reply to comment #22)
> Can you give us the output of /sys/kernel/debug/dri/0/radeon_vram_mm with
> and without the patch in question?

radeon_vram_mm WITHOUT the patch

Created attachment 83125 [details]
radeon_vram_mm WITH the patch

(In reply to comment #22)
> Can you give us the output of /sys/kernel/debug/dri/0/radeon_vram_mm with
> and without the patch in question?

radeon_vram_mm from a working system WITH the patch

Ping on this. Is this still an issue with Dave's latest drm-fixes branch: http://cgit.freedesktop.org/~airlied/linux/log/?h=drm-fixes

I think I can close this. After applying the latest drm-fixes-3.11 as well as drm-next-3.12 I haven't seen the issue. I'll reopen this bug if the problem occurs again. |
Created attachment 82960 [details]
relevant dmesg

I have a strange problem that happens sporadically (usually on cold boot, but I saw it once on a restart). The monitor goes to sleep for about 5 seconds, then the X server starts loading and the monitor turns into small colorful squares. The problem started to appear in the last week or so, and I'm sure it started after I began applying the latest drm-fixes-3.11.

My card is:

01:00.0 VGA compatible controller: ATI Technologies Inc Turks XT [AMD Radeon HD 6600 Series] (prog-if 00 [VGA controller])
    Subsystem: PC Partner Limited Device e194
    Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
    Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
    Latency: 0, Cache Line Size: 64 bytes
    Interrupt: pin A routed to IRQ 57
    Region 0: Memory at c0000000 (64-bit, prefetchable) [size=256M]
    Region 2: Memory at fe620000 (64-bit, non-prefetchable) [size=128K]
    Region 4: I/O ports at e000 [size=256]
    Expansion ROM at fe600000 [disabled] [size=128K]
    Capabilities: [50] Power Management version 3
        Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
        Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
    Capabilities: [58] Express (v2) Legacy Endpoint, MSI 00
        DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s <4us, L1 unlimited
            ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
        DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
            RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop+
            MaxPayload 128 bytes, MaxReadReq 512 bytes
        DevSta: CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr- TransPend-
        LnkCap: Port #0, Speed 5GT/s, Width x16, ASPM L0s L1, Latency L0 <64ns, L1 <1us
            ClockPM- Surprise- LLActRep- BwNot-
        LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- Retrain- CommClk+
            ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
        LnkSta: Speed 5GT/s, Width x16, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
        DevCap2: Completion Timeout: Not Supported, TimeoutDis-, LTR-, OBFF Not Supported
        DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
        LnkCtl2: Target Link Speed: 5GT/s, EnterCompliance- SpeedDis-
            Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
            Compliance De-emphasis: -6dB
        LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete-, EqualizationPhase1-
            EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
    Capabilities: [a0] MSI: Enable+ Count=1/1 Maskable- 64bit+
        Address: 00000000fee0f00c Data: 41e2
    Capabilities: [100 v1] Vendor Specific Information: ID=0001 Rev=1 Len=010 <?>
    Capabilities: [150 v1] Advanced Error Reporting
        UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
        UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
        UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
        CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
        CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
        AERCap: First Error Pointer: 00, GenCap+ CGenEn- ChkCap+ ChkEn-
    Kernel driver in use: radeon

I still haven't found a way to reproduce it on demand.