Bug 20954 - mesa/drm(git): kernel panic with radeon driver (Radeon 9500 Pro)
mesa/drm(git): kernel panic with radeon driver (Radeon 9500 Pro)
Status: RESOLVED FIXED
Product: DRI
Classification: Unclassified
Component: DRM/Radeon
DRI git
x86 (IA32) Linux (All)
: medium major
Assigned To: Default DRI bug account
:
Depends on:
Blocks:
  Show dependency treegraph
 
Reported: 2009-03-30 12:04 UTC by SpOeK@DistroBit.Net
Modified: 2014-01-23 11:09 UTC (History)
2 users (show)

See Also:
i915 platform:
i915 features:


Attachments
Kernel panic message (192.05 KB, image/jpeg)
2009-03-30 12:04 UTC, SpOeK@DistroBit.Net
no flags Details
Kernel panic complete message (151.58 KB, image/jpeg)
2009-04-10 08:17 UTC, SpOeK@DistroBit.Net
no flags Details
Proposed patch to avoid null pointer use. (439 bytes, patch)
2009-04-21 11:59 UTC, SpOeK@DistroBit.Net
no flags Details | Splinter Review

Note You need to log in before you can comment on or make changes to this bug.
Description SpOeK@DistroBit.Net 2009-03-30 12:04:29 UTC
Created attachment 24374 [details]
Kernel panic message

When I reboot or power off my Linux box, the radeon driver panics. I am unable to dump the message so I've taken a photo and attached it to the bug.

With git bisect I've been able to determine the patches that caused the problem:
http://cgit.freedesktop.org/mesa/drm/commit/?id=957b10695b619d6ed2f1098b00502395d9a3c149
http://cgit.freedesktop.org/mesa/drm/commit/?id=9e8591dbdbd99ec1cc4922d61ec4cc54ef42f0ac

If I close the gnome session, switch to a terminal and try to poweroff or reboot, nothing bad happens.

I'm on Gentoo Linux 2.6.28 with x11 overlay. Please, feel free to ask for more information.
Comment 1 Robert Noland 2009-03-31 11:56:16 UTC
I just moved the vblank_cleanup after lastclose, give that a try and see if it is resolved.
Comment 2 SpOeK@DistroBit.Net 2009-04-01 11:36:22 UTC
(In reply to comment #1)
> I just moved the vblank_cleanup after lastclose, give that a try and see if it
> is resolved.
> 

No, the same kernel panic still appears.
Comment 3 SpOeK@DistroBit.Net 2009-04-10 08:17:18 UTC
Created attachment 24695 [details]
Kernel panic complete message

Here is the complete panic message. I've used objdump with the radeon.ko module to locate the specific line that causes the error. It seems to be this:

radeon_get_vblank_counter(radeon_irq.c:307)

return RADEON_READ(RADEON_CRTC2_CRNT_FRAME);
    de10:       8b 80 e0 00 00 00       mov    0xe0(%eax),%eax
    de16:       8b 40 10                mov    0x10(%eax),%eax
    de19:       05 14 02 00 00          add    $0x214,%eax
    de1e:       8b 00                   mov    (%eax),%eax
    de20:       83 c4 0c                add    $0xc,%esp
    de23:       c3                      ret    

The problem is 0xde16(radeon_get_vblank_counter+0x76) operating with eax = NULL as you can see in the photo. I guess that dev_priv->mmio is NULL and that triggers the oops. I've tested this hypothesis adding a check:

if (dev_priv->mmio == NULL)
    return -EINVAL;

This avoids the panic.

Well, this is the information I've been able to gather but I'm not an expert so I'm sorry if I've erroneously brought more darkness than light :)
Comment 4 Michel Dänzer 2009-04-15 09:44:59 UTC
(In reply to comment #3)
> if (dev_priv->mmio == NULL)
>     return -EINVAL;

Looks like a good solution.
Comment 5 SpOeK@DistroBit.Net 2009-04-21 11:59:48 UTC
Created attachment 25010 [details] [review]
Proposed patch to avoid null pointer use.
Comment 6 Daniel Drake 2009-05-11 15:52:12 UTC
Many more reports:
https://bugs.gentoo.org/show_bug.cgi?id=264280

This still exists in latest kernel src and drm git. Rafael, are you going to submit for inclusion?


Dug around in the source to try and confirm the fix, but I can't find the code that would ever make this mmio pointer be NULL. I wonder if we are operating on this structure after it has been freed? Or am I missing something?
Comment 7 SpOeK@DistroBit.Net 2009-05-13 05:49:17 UTC
(In reply to comment #6)
> Many more reports:
> https://bugs.gentoo.org/show_bug.cgi?id=264280
> 
> This still exists in latest kernel src and drm git. Rafael, are you going to
> submit for inclusion?
How? I thought that filing a bug report was the way to fix this issue.
 
 
> Dug around in the source to try and confirm the fix, but I can't find the code
> that would ever make this mmio pointer be NULL. I wonder if we are operating on
> this structure after it has been freed? Or am I missing something?
I reviewed the code and also I was unable to find the sentence that turns NULL the pointer. To be honest, I don't see my patch as a fix because it attacks the symptom and not the illness. I'll try again to find the point where "dev_priv->mmio" is "nullified" but I'm afraid that I won't get anything useful.

Comment 8 SpOeK@DistroBit.Net 2009-08-09 03:05:35 UTC
I've been using the KMS branch [1] for a while and the panic is gone.

[1] http://airlied.livejournal.com/66958.html
Comment 9 Kevin Locke 2009-09-05 09:32:01 UTC
Appologies for reopening this bug if it truly is fixed in the KMS branch.  I have been unable to get KMS setup on my machine, but have tried my best to analyze the changes in linux-next to check if this is truly fixed but I can not see how it would be.  Also note, I am not an advanced kernel or DRI hacker, so this analysis may be completely bogus.  My analysis of the problem is as follows:

During X shutdown, RADEONDRICloseScreen (radeon_dri.c:1915) in the X driver calls the radeon_cp_init (radeon_cp.c:1313) ioctl with RADEON_CLEANUP_CP.  radeon_cp_init then calls radeon_do_cleanup_cp (radeon_cp.c:1221) which sets mmio to NULL when it memset's most of dev_priv to 0.

Later at module unload time (or pci unplug time in the KMS branch) drm_unload (drm_drv.c:358) (drm_put_dev (drm_stub.c:490) in KMS) calls drm_vblank_cleanup (drm_irq.c:97) which calls vblank_disable_fn (drm_irq.c:74) which calls radeon_get_vblank_counter (radeon_irq.c:284) and causes an Oops trying to read from mmio, which is NULL.

I'm not sure if the solution is to not zero mmio during cleanup or protect against it being NULL in get_vblank_count or not call get_vblank_count during vblank cleanup (presumably by saving the last vblank during driver cleanup).  I don't know enough about the code to make that determination.

As best I can tell, this sequence happens in both the current git master as well as the linux-next KMS branch, but perhaps I am overlooking something...  my apologies if this really is properly fixed.

Thanks for reconsidering it.
Comment 10 Marek Olšák 2014-01-23 11:09:53 UTC
I'm pretty sure this has been fixed. If not, please reopen.