Bug 25718 - gpu death at radeon_fence_wait
Summary: gpu death at radeon_fence_wait
Status: RESOLVED FIXED
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/Radeon
Version: XOrg git
Hardware: Other All
Importance: medium major
Assignee: Default DRI bug account
QA Contact:
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2009-12-18 18:16 UTC by Aidan Marks
Modified: 2010-10-22 08:32 UTC (History)
0 users

See Also:
i915 platform:
i915 features:


Attachments
crash_with_drm_debug_15_netconsole (155.67 KB, text/plain), 2009-12-21 22:48 UTC, Aidan Marks
fence_crash_fuller_log_netconsole.txt.bz2 (42.87 KB, application/octet-stream), 2009-12-21 22:59 UTC, Aidan Marks

Description Aidan Marks 2009-12-18 18:16:18 UTC
Randomly during normal KDE desktop usage, the display goes all gray (or similar) momentarily, then switches off.

rv790, kms, 2.6.31.1 + drm-linus patch (http://files.iniza.org/drm-linus/vanilla-2.6.32/) + V8 pm patch enabled.

Captured the following similar crashes within minutes of startup:

 320.709748] [drm:radeon_fence_wait] *ERROR* fence(ffff88030c20e5c0:0x0001B7CE) 510ms timeout going to reset GPU
 320.709758] radeon 0000:02:00.0: GPU softreset
 320.709762] radeon 0000:02:00.0:   R_008010_GRBM_STATUS=0xE57044A4
 320.709766] radeon 0000:02:00.0:   R_008014_GRBM_STATUS2=0x00FF3F02
 320.709770] radeon 0000:02:00.0:   R_000E50_SRBM_STATUS=0x200022C0
 320.867374] radeon 0000:02:00.0: Wait for MC idle timedout !
 320.867380] radeon 0000:02:00.0:   R_008020_GRBM_SOFT_RESET=0x00007FEE
 320.867435] radeon 0000:02:00.0: R_008020_GRBM_SOFT_RESET=0x00000001
 320.867496] radeon 0000:02:00.0:   R_000E60_SRBM_SOFT_RESET=0x00000C02
 320.891743] radeon 0000:02:00.0:   R_008010_GRBM_STATUS=0xFFFFFFFF
 320.891746] radeon 0000:02:00.0:   R_008014_GRBM_STATUS2=0xFFFFFFFF
 320.891748] radeon 0000:02:00.0:   R_000E50_SRBM_STATUS=0xFFFFFFFF

<<reboot>>

1219.885702] [drm:radeon_fence_wait] *ERROR* fence(ffff88030cd285c0:0x00064506) 510ms timeout going to reset GPU
1219.885713] radeon 0000:02:00.0: GPU softreset
1219.885717] radeon 0000:02:00.0:   R_008010_GRBM_STATUS=0xE57004A4
1219.885721] radeon 0000:02:00.0:   R_008014_GRBM_STATUS2=0x00FF0F02
1219.885725] radeon 0000:02:00.0:   R_000E50_SRBM_STATUS=0x20001AC0
1220.044073] radeon 0000:02:00.0: Wait for MC idle timedout !
1220.044081] radeon 0000:02:00.0:   R_008020_GRBM_SOFT_RESET=0x00007FEE
1220.044136] radeon 0000:02:00.0: R_008020_GRBM_SOFT_RESET=0x00000001
1220.044197] radeon 0000:02:00.0:   R_000E60_SRBM_SOFT_RESET=0x00000C02
1220.068448] radeon 0000:02:00.0:   R_008010_GRBM_STATUS=0xFFFFFFFF
1220.068454] radeon 0000:02:00.0:   R_008014_GRBM_STATUS2=0xFFFFFFFF
1220.068460] radeon 0000:02:00.0:   R_000E50_SRBM_STATUS=0xFFFFFFFF
Comment 1 Aidan Marks 2009-12-18 18:33:51 UTC
Typo: that should say 2.6.32.1 above.
Comment 2 Tobias Jakobi 2009-12-19 07:09:58 UTC
Confirming here, but with different setup.

Using drm-radeon-next git tip + V8 pm patch + ATOM lock patch (the one recommended in the V8 pm description).

dmesg output:
[drm:radeon_fence_wait] *ERROR* fence(ffff880126f3aa00:0x0000F256) 510ms timeout going to reset GPU
radeon 0000:01:00.0: GPU softreset 
radeon 0000:01:00.0:   R_008010_GRBM_STATUS=0xE57204A4
radeon 0000:01:00.0:   R_008014_GRBM_STATUS2=0x00FF0F02
radeon 0000:01:00.0:   R_000E50_SRBM_STATUS=0x200038C0
radeon 0000:01:00.0: Wait for MC idle timedout !
radeon 0000:01:00.0:   R_008020_GRBM_SOFT_RESET=0x00007FEE
radeon 0000:01:00.0: R_008020_GRBM_SOFT_RESET=0x00000001
radeon 0000:01:00.0:   R_000E60_SRBM_SOFT_RESET=0x00020C02
radeon 0000:01:00.0:   R_008010_GRBM_STATUS=0x00003028
radeon 0000:01:00.0:   R_008014_GRBM_STATUS2=0x00000002
radeon 0000:01:00.0:   R_000E50_SRBM_STATUS=0x200000C0
[drm:radeon_fence_wait] *ERROR* fence(ffff880126f3aa00:0x0000F256) 35815ms timeout
[drm:radeon_fence_wait] *ERROR* last signaled fence(0x0000F256)
[drm:radeon_fence_wait] *ERROR* fence(ffff8801262650c0:0x0000F266) 502ms timeout going to reset GPU
radeon 0000:01:00.0: GPU softreset 
radeon 0000:01:00.0:   R_008010_GRBM_STATUS=0xA0003028
radeon 0000:01:00.0:   R_008014_GRBM_STATUS2=0x00000002
radeon 0000:01:00.0:   R_000E50_SRBM_STATUS=0x200000C0
radeon 0000:01:00.0:   R_008020_GRBM_SOFT_RESET=0x00007FEE
radeon 0000:01:00.0: R_008020_GRBM_SOFT_RESET=0x00000001
radeon 0000:01:00.0:   R_000E60_SRBM_SOFT_RESET=0x00000402
radeon 0000:01:00.0:   R_008010_GRBM_STATUS=0x00003028
radeon 0000:01:00.0:   R_008014_GRBM_STATUS2=0x00000002
radeon 0000:01:00.0:   R_000E50_SRBM_STATUS=0x200000C0

GFX-Hardware according to Xorg log:
Chipset: "ATI Radeon HD 4770" (ChipID = 0x94b3)

video-ati, libdrm and mesa are all git tip.
Comment 3 Rafał Miłecki 2009-12-19 07:36:19 UTC
Err, one moment. Could you say something more about this PM setup? Is it my PM patch that is causing this?
Comment 4 Tobias Jakobi 2009-12-19 11:37:49 UTC
I'm not sure - going to test again with the patchset removed.
Comment 5 Aidan Marks 2009-12-19 12:40:06 UTC
I switched to 2.6.32.y git stable and pulled in drm-radeon-testing.  Patched in V8 pm and the ATOM lock patch.

I've seen three fence crashes with radeon.dynpm=1 and none so far when dynpm is disabled.  More uptime required to confirm, but seems likely the PM patch is triggering it.
Comment 6 Tobias Jakobi 2009-12-19 13:06:29 UTC
Yeah, confirming. I removed the patchset completely and did some tests with (somewhat working) 3D apps like quake3 and co.

No crashes/hangs so far.
Comment 7 Rafał Miłecki 2009-12-20 02:48:29 UTC
I don't know... The only place where I mess with fences is:

read_lock_irqsave(&rdev->fence_drv.lock, irq_flags);
if (!list_empty(&rdev->fence_drv.emited)) {
        struct list_head *ptr;
        list_for_each(ptr, &rdev->fence_drv.emited) {
                /* count up to 3, that's enough info */
                if (++not_processed >= 3)
                        break;
        }
}
read_unlock_irqrestore(&rdev->fence_drv.lock, irq_flags);

Is there anything I could be doing wrong in this code?! I admit I don't really understand the irqsave/irqrestore variants of read_[un]lock. Could someone check this, please?
Comment 8 Trevour 2009-12-20 08:50:47 UTC
With 2.6.31.1 and drm-radeon-testing of four days ago with V7 and ATOM lock applied, I have no problems here. Has anyone tested V7 with the newer changes?
Comment 9 Rafał Miłecki 2009-12-20 12:00:24 UTC
I got info from Aidan that he experienced a similar (the same?) problem with dynpm=0 while playing Nexuiz for a longer time (about 15 minutes?).

I'm not sure what the real impact of my patch is on this.
Comment 10 Trevour 2009-12-20 12:02:26 UTC
(In reply to comment #8)
> With 2.6.31.1 and drm-radeon-testing of four days ago with V7 and ATOM lock
> applied, I have no problems here. Has anyone tested V7 with the newer changes?
> 

Correction: 2.6.32.1
Comment 11 Rafał Miłecki 2009-12-21 04:37:51 UTC
I talked with Jerome for a moment and he confirmed my suspicion. This is most probably a general bug; using dynpm=1 just makes it happen earlier but is not the real source of the problem.

Maybe using drm with the debug=15 parameter could give us some hints; I don't know...

Unfortunately, Jerome claims a backtrace won't be useful.
Comment 12 Aidan Marks 2009-12-21 22:48:53 UTC
Created attachment 32240 [details]
crash_with_drm_debug_15_netconsole

Attaching the last ~3000 lines of the netconsole log with drm.debug=15 and radeon.dynpm=1 for 2.6.32.y + drm-radeon-testing + V8 pm patch + ATOM lock patch.

I had already shoved a WARN() one-liner into the fence code just before we reset the GPU, which you can see in the log, whether or not it ends up being of any use.
Comment 13 Aidan Marks 2009-12-21 22:59:04 UTC
Created attachment 32241 [details]
fence_crash_fuller_log_netconsole.txt.bz2

Attaching a fuller log (bzip2-compressed), ~200,000 lines.
Comment 14 Rafał Miłecki 2009-12-21 23:04:54 UTC
Aidan: did you run some more tests? Can you say that dynpm=1 really makes this issue happen earlier than dynpm=0?

If so, can you try removing the radeon_set_engine_clock calls from radeon_pm.c and using dynpm=1? Does this make KMS just as stable as with dynpm=0?

Did you actually experience the issue again with dynpm=0? Or did it happen just once (15 min of Nexuiz)?

diff --git a/drivers/gpu/drm/radeon/radeon_pm.c b/drivers/gpu/drm/radeon/radeon_pm.c
index a9c61f4..4aa1061 100644
--- a/drivers/gpu/drm/radeon/radeon_pm.c
+++ b/drivers/gpu/drm/radeon/radeon_pm.c
@@ -156,17 +156,12 @@ static void radeon_pm_set_clocks_locked(struct radeon_device *rdev)
        /*radeon_fence_wait_last(rdev);*/
        switch (rdev->pm.planned_action) {
        case PM_ACTION_UPCLOCK:
-               radeon_set_engine_clock(rdev, rdev->clock.default_sclk);
                rdev->pm.downclocked = false;
                break;
        case PM_ACTION_DOWNCLOCK:
-               radeon_set_engine_clock(rdev,
-                       rdev->pm.min_mode_engine_clock);
                rdev->pm.downclocked = true;
                break;
        case PM_ACTION_MINIMUM:
-               radeon_set_engine_clock(rdev,
-                       rdev->pm.min_gpu_engine_clock);
                break;
        case PM_ACTION_NONE:
                DRM_ERROR("%s: PM_ACTION_NONE\n", __func__);
Comment 15 Tobias Jakobi 2010-10-22 03:34:14 UTC
I can no longer reproduce any of these issues. Should this bug be closed?
