| Summary: | gpu death at radeon_fence_wait |
|---|---|
| Product: | DRI |
| Component: | DRM/Radeon |
| Status: | RESOLVED FIXED |
| Severity: | major |
| Priority: | medium |
| Version: | XOrg git |
| Hardware: | Other |
| OS: | All |
| Reporter: | Aidan Marks <aidan> |
| Assignee: | Default DRI bug account <dri-devel> |
| Attachments: | crash_with_drm_debug_15_netconsole (attachment 32240); fence_crash_fuller_log_netconsole.txt.bz2 (attachment 32241) |

Description by Aidan Marks, 2009-12-18 18:16:18 UTC

typo, should say 2.6.32.1 above. Confirming here, but with a different setup: drm-radeon-next git tip + V8 pm patch + ATOM lock patch (the one recommended in the V8 pm description). dmesg output:

```
[drm:radeon_fence_wait] *ERROR* fence(ffff880126f3aa00:0x0000F256) 510ms timeout going to reset GPU
radeon 0000:01:00.0: GPU softreset
radeon 0000:01:00.0: R_008010_GRBM_STATUS=0xE57204A4
radeon 0000:01:00.0: R_008014_GRBM_STATUS2=0x00FF0F02
radeon 0000:01:00.0: R_000E50_SRBM_STATUS=0x200038C0
radeon 0000:01:00.0: Wait for MC idle timedout !
radeon 0000:01:00.0: R_008020_GRBM_SOFT_RESET=0x00007FEE
radeon 0000:01:00.0: R_008020_GRBM_SOFT_RESET=0x00000001
radeon 0000:01:00.0: R_000E60_SRBM_SOFT_RESET=0x00020C02
radeon 0000:01:00.0: R_008010_GRBM_STATUS=0x00003028
radeon 0000:01:00.0: R_008014_GRBM_STATUS2=0x00000002
radeon 0000:01:00.0: R_000E50_SRBM_STATUS=0x200000C0
[drm:radeon_fence_wait] *ERROR* fence(ffff880126f3aa00:0x0000F256) 35815ms timeout
[drm:radeon_fence_wait] *ERROR* last signaled fence(0x0000F256)
[drm:radeon_fence_wait] *ERROR* fence(ffff8801262650c0:0x0000F266) 502ms timeout going to reset GPU
radeon 0000:01:00.0: GPU softreset
radeon 0000:01:00.0: R_008010_GRBM_STATUS=0xA0003028
radeon 0000:01:00.0: R_008014_GRBM_STATUS2=0x00000002
radeon 0000:01:00.0: R_000E50_SRBM_STATUS=0x200000C0
radeon 0000:01:00.0: R_008020_GRBM_SOFT_RESET=0x00007FEE
radeon 0000:01:00.0: R_008020_GRBM_SOFT_RESET=0x00000001
radeon 0000:01:00.0: R_000E60_SRBM_SOFT_RESET=0x00000402
radeon 0000:01:00.0: R_008010_GRBM_STATUS=0x00003028
radeon 0000:01:00.0: R_008014_GRBM_STATUS2=0x00000002
radeon 0000:01:00.0: R_000E50_SRBM_STATUS=0x200000C0
```

GFX hardware according to the Xorg log: Chipset: "ATI Radeon HD 4770" (ChipID = 0x94b3). video-ati, libdrm and mesa are all at git tip.

Err, one moment. Could you say something more about this PM? Is it my PM patch that is causing this?

I'm not sure - going to test again with the patchset removed.

I switched to 2.6.32.y git stable and pulled in drm-radeon-testing, then patched in the V8 pm and ATOM lock patches. I've seen three fence crashes with radeon.dynpm=1 and none so far with dynpm disabled. More uptime is needed to confirm, but it seems likely the PM patch is triggering it.

Yeah, confirming. I removed the patchset completely and did some tests with (somewhat working) 3D apps like quake3 and co. No crashes/hangs so far.

I don't know... The only place where I mess with fences is:

```c
read_lock_irqsave(&rdev->fence_drv.lock, irq_flags);
if (!list_empty(&rdev->fence_drv.emited)) {
	struct list_head *ptr;
	list_for_each(ptr, &rdev->fence_drv.emited) {
		/* count up to 3, that's enough info */
		if (++not_processed >= 3)
			break;
	}
}
read_unlock_irqrestore(&rdev->fence_drv.lock, irq_flags);
```

Is there anything I can do wrong in this code? I admit I don't really understand the irqsave/irqrestore variants of read_[un]lock. Could someone check this, please?
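
For reference, here is a minimal, self-contained sketch of the read_lock_irqsave()/read_unlock_irqrestore() pattern the snippet above relies on. The lock, list and function names below are invented for the illustration and are not taken from the radeon driver; the point is only that the _irqsave variant also disables local interrupts, stashing the previous interrupt state in the flags argument, and that the matching _irqrestore must be given that same flags variable.

```c
#include <linux/spinlock.h>
#include <linux/list.h>

/* Illustration only: my_lock, my_list and count_pending() are made-up
 * names, not radeon code. */
static DEFINE_RWLOCK(my_lock);
static LIST_HEAD(my_list);

static int count_pending(void)
{
	struct list_head *ptr;
	unsigned long flags;
	int not_processed = 0;

	/* Take the rwlock for reading and disable local interrupts,
	 * saving the previous interrupt state in 'flags'. */
	read_lock_irqsave(&my_lock, flags);
	list_for_each(ptr, &my_list) {
		/* count up to 3, that's enough info */
		if (++not_processed >= 3)
			break;
	}
	/* Must be paired with the same 'flags'; interrupts are re-enabled
	 * only if they were enabled before the lock was taken. */
	read_unlock_irqrestore(&my_lock, flags);

	return not_processed;
}
```

As far as the lock/unlock pairing itself goes, the snippet above follows this standard pattern; whether the fence list handling interacts badly with something else in the driver is a separate question.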

With 2.6.31.1 and drm-radeon-testing of four days ago with V7 and ATOM lock applied, I have no problems here. Has anyone tested V7 with the newer changes?

I got info from Aidan that he experienced a similar (same?) problem with dynpm=0 while playing Nexuiz for a longer time (about 15 minutes?). I'm not sure what the real impact of my patch on this is.

(In reply to comment #8)
> With 2.6.31.1 and drm-radeon-testing of four days ago with V7 and ATOM lock
> applied, I have no problems here. Has anyone tested V7 with the newer changes?

Correction: 2.6.32.1

I talked with Jerome for a moment and he confirmed my suspicion. This is most probably a general bug; using dynpm=1 just makes it happen earlier, but it is not the real source of the problem. Maybe using drm with the debug=15 parameter could give us some hints, I don't know... Jerome claims a backtrace won't be useful, unfortunately.

Created attachment 32240 [details]
crash_with_drm_debug_15_netconsole
Attaching the last ~3000 lines of the netconsole log, captured with drm.debug=15 and radeon.dynpm=1 on 2.6.32.y + drm-radeon-testing + V8 pm patch + ATOM lock patch.
I had already shoved a WARN() one-liner into the fence code just before we reset the GPU, which you can see in the log, whether or not it ends up being of any use.
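
For anyone scanning the attached log: the WARN() mentioned above is a local debugging addition, not part of the stock driver, and its exact placement is not attached to this report. As a rough sketch only, a one-liner of that kind is just WARN(1, ...) dropped in front of the reset path; WARN() prints the message plus a full stack trace to the kernel log (and so to netconsole) and then lets execution continue. The function below is a stand-in, not the actual radeon_fence_wait() source.

```c
#include <linux/kernel.h>	/* pr_err() */
#include <linux/bug.h>		/* WARN() */

/* Hypothetical placement of the debugging one-liner: somewhere in the
 * fence-timeout path, right before the driver decides to reset the GPU. */
static void note_fence_timeout_before_reset(void)
{
	pr_err("radeon: fence timeout, going to reset GPU\n");
	WARN(1, "radeon: dumping backtrace before GPU reset\n");
	/* ... the driver's GPU soft reset path would follow here ... */
}
```
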
Created attachment 32241 [details]
fence_crash_fuller_log_netconsole.txt.bz2
Attaching a fuller bz2 log with ~200,000 lines.

Aidan: did you run some more tests? Can you say that dynpm=1 really makes this issue happen earlier than dynpm=0? If so, can you try removing the radeon_set_engine_clock calls from radeon_pm.c and using dynpm=1? Does this make KMS as stable as with dynpm=0?

Did you actually experience the issue again with dynpm=0, or did it happen just once (the 15 minutes of Nexuiz)?

```diff
diff --git a/drivers/gpu/drm/radeon/radeon_pm.c b/drivers/gpu/drm/radeon/radeon_pm.c
index a9c61f4..4aa1061 100644
--- a/drivers/gpu/drm/radeon/radeon_pm.c
+++ b/drivers/gpu/drm/radeon/radeon_pm.c
@@ -156,17 +156,12 @@ static void radeon_pm_set_clocks_locked(struct radeon_device *rdev)
 	/*radeon_fence_wait_last(rdev);*/
 	switch (rdev->pm.planned_action) {
 	case PM_ACTION_UPCLOCK:
-		radeon_set_engine_clock(rdev, rdev->clock.default_sclk);
 		rdev->pm.downclocked = false;
 		break;
 	case PM_ACTION_DOWNCLOCK:
-		radeon_set_engine_clock(rdev,
-					rdev->pm.min_mode_engine_clock);
 		rdev->pm.downclocked = true;
 		break;
 	case PM_ACTION_MINIMUM:
-		radeon_set_engine_clock(rdev,
-					rdev->pm.min_gpu_engine_clock);
 		break;
 	case PM_ACTION_NONE:
 		DRM_ERROR("%s: PM_ACTION_NONE\n", __func__);
```

I can no longer reproduce any of these issues. Should the bug be closed?
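
For reference, with the test hunk above applied, radeon_pm_set_clocks_locked() stops touching the engine clock entirely and only updates the bookkeeping flag. The switch below is reconstructed purely from the diff context; anything outside the hunk (such as the break after the DRM_ERROR) is an assumption.

```c
/* Reconstruction of the switch with the hunk above applied, assembled
 * only from the diff context.  The radeon_fence_wait_last() call above
 * the switch stays commented out, exactly as in the hunk. */
switch (rdev->pm.planned_action) {
case PM_ACTION_UPCLOCK:
	rdev->pm.downclocked = false;
	break;
case PM_ACTION_DOWNCLOCK:
	rdev->pm.downclocked = true;
	break;
case PM_ACTION_MINIMUM:
	break;
case PM_ACTION_NONE:
	DRM_ERROR("%s: PM_ACTION_NONE\n", __func__);
	break;	/* assumed: outside the hunk shown above */
}
```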