Bug 41668

Summary:	Screen locks up at random points when using a 3D compositing wm (gnome-shell) on an rv515 (radeon mobility x1300)
Product:	Mesa	Reporter:	dmotd <inaudible>
Component:	Drivers/Gallium/r300	Assignee:	Default DRI bug account <dri-devel>
Status:	RESOLVED FIXED	QA Contact:
Severity:	critical
Priority:	high	CC:	ac, ccrisan, jstpierre, przemyslaw
Version:	7.11
Hardware:	x86-64 (AMD64)
OS:	Linux (All)
Whiteboard:
Attachments:	dmesg output using drm.debug=14; glxinfo; startx output; Xorg.0.log; /proc/interrupts after the freeze; /proc/interrupts on a fresh (working) boot..; output of glxinfo; gdb gnome-shell backtrace; possible fix

Description dmotd 2011-10-10 22:08:19 UTC

Created attachment 52202 [details]
dmesg output using drm.debug=14

i have an intermittent issue with an "ATI Mobility Radeon X1300" on an up-to-date archlinux laptop running gnome-shell.. all's well until rendering breaks, sometimes after many hours/days of use.. there's no real sign of a repeatable trigger and nothing is reported in the logs.. X continues to run and the mouse cursor is active, and even changes shape to match the screen content under it..

the closest i can get to a repeatable action is an almost certain freeze using gimp with strenuous brush activity using the clone tool on an average size image.

if i restart gdm i still have no mouse/keyboard interaction with what is on screen, although the gdm dress is displayed.. if i killall -HUP gnome-shell in the running session the window content (without decorations) briefly displays, but when it settles only the background appears

if i switch to the VT and back to X then window content with border decorations is displayed instead of background image, but still no interaction

i can't launch any X applications from VT or ssh, or use another WM to replace.

`DISPLAY=:0 openbox --replace` results in:

"Invalid MIT-MAGIC-COOKIE-1 keyOpenbox-Message: Failed to open the display from the DISPLAY environment variable."

however, if i kill gdm, and create an xinitrc for openbox i can start an Xsession with keybd/mouse interaction, so i guess there's definitely something up with 3d (offscreen?) rendering

from within the openbox session:
running glxgears just shows an empty black box.. all other glx demos are the same empty boxes.. 
when i'm logged in to this openbox xsession i can launch xapps via `DISPLAY=:0 foobar` variable.. so this bug i'm encountering must do something to interrupt the ability to draw new windows?

i can only get a working 3D X session by unloading/reloading kernel drivers..  radeon/drm/ttm/drm_kms_helper.. 

here's the very basic script i'm using to unload drivers so i can start X with 3D..
---
#!/bin/bash

# unbind kms fb before unloading modules.. 
echo 0 > /sys/class/vtconsole/vtcon1/bind 
sleep 1

# unload kernel modules (radeon first, since it holds references to the others)
rmmod radeon drm_kms_helper ttm drm
sleep 1

# reload kernel modules
modprobe radeon
---

some basic information:
GL_VERSION:  2.1 Mesa 7.11
GL_VENDOR:   X.Org R300 Project
GL_RENDERER: Gallium 0.4 on ATI RV515
uname -a: Linux neondada 3.0-ARCH #1 SMP PREEMPT Tue Aug 30 08:53:25 CEST 2011 x86_64 Intel(R) Core(TM)2 CPU T5600 @ 1.83GHz GenuineIntel GNU/Linux

i have included output from startx, dmesg (with drm.debug=14), Xorg.0.log & glxinfo.

Comment 1 dmotd 2011-10-10 22:09:06 UTC

Created attachment 52203 [details]
glxinfo

Comment 2 dmotd 2011-10-10 22:09:52 UTC

Created attachment 52204 [details]
startx output

Comment 3 dmotd 2011-10-10 22:10:41 UTC

Created attachment 52205 [details]
Xorg.0.log

Comment 4 Michel Dänzer 2011-10-11 02:34:52 UTC

> `DISPLAY=:0 openbox --replace` results in:
> 
> "Invalid MIT-MAGIC-COOKIE-1 keyOpenbox-Message: Failed to open the display from
> the DISPLAY environment variable."

This is because gdm3 uses a non-default X11 authentication cookie. I use something like

XAUTHORITY=/run/gdm3/$(sudo ls /run/gdm3|grep $(whoami))/database DISPLAY=:0 [...]

to work around it.


> running glxgears just shows an empty black box.. all other glx demos are the
> same empty boxes.. 

Do they work with the environment variable vblank_mode=0? If yes, does the number for radeon increase in /proc/interrupts once the problem occurs?

Comment 5 Jeremy Huddleston Sequoia 2011-10-11 09:28:49 UTC

Do you have records of when you updated the kernel last?  What kernel version 
did you have before you started experiencing this issue?

Comment 6 dmotd 2011-10-11 19:25:07 UTC

(In reply to comment #4)
> > `DISPLAY=:0 openbox --replace` results in:
> > 
> > "Invalid MIT-MAGIC-COOKIE-1 keyOpenbox-Message: Failed to open the display from
> > the DISPLAY environment variable."
> 
> This is because gdm3 uses a non-default X11 authentication cookie. I use
> something like
> 
> XAUTHORITY=/run/gdm3/$(sudo ls /run/gdm3|grep $(whoami))/database DISPLAY=:0
> [...]
> 
> to work around it.

i've since ditched the graphical login; i've had better success restarting X with startx.. the last freeze i had, i managed to start an openbox session (--replace) on top of the failed gnome-shell (which i killed, and switched to 'fallback mode' from the cmdline). the result was that many windows were inactive and frozen (ie. evolution, gnome-terminal, chromium), while a few others (gvim was one, and empathy was another) remained active and usable. i could however start new instances without issue.

> 
> > running glxgears just shows an empty black box.. all other glx demos are the
> > same empty boxes.. 
> 
> Do they work with the environment variable vblank_mode=0? If yes, does the
> number for radeon increase in /proc/interrupts once the problem occurs?

setting vblank_mode=0 works and displays an output.. but not much change in /proc/interrupts (irq 46 for radeon) i'll attach the output from after the freeze and one from a fresh happy boot..

Comment 7 dmotd 2011-10-11 19:27:14 UTC

Created attachment 52248 [details]
/proc/interrupts after the freeze

Comment 8 dmotd 2011-10-11 19:28:12 UTC

Created attachment 52249 [details]
/proc/interrupts on a fresh (working) boot..

Comment 9 dmotd 2011-10-11 19:42:06 UTC

(In reply to comment #5)
> Do you have records of when you updated the kernel last?  What kernel version 
> did you have before you started experiencing this issue?

it has been occurring since i made the transition to gnome-shell about two months ago, but i believe it was occurring before in another compositing environment (enlightenment e17 with the ecomorph/ecomp extension - a compiz fork) which i wrongly attributed to the unstable nature of the e17/ecomorph codebase.. i didn't really test that for very long, although i can tell you that i tested it in march of this year, and some quick googling suggests that it would have been running with a 2.6.37 kernel on archlinux then. sorry i don't have package logs going back that far.

Comment 10 Michel Dänzer 2011-10-12 03:07:04 UTC

(In reply to comment #6)
> > > running glxgears just shows an empty black box.. all other glx demos are the
> > > same empty boxes.. 
> > 
> > Do they work with the environment variable vblank_mode=0? If yes, does the
> > number for radeon increase in /proc/interrupts once the problem occurs?
> 
> setting vblank_mode=0 works and displays an output.. but not much change in
> /proc/interrupts (irq 46 for radeon)

Not much change for the radeon number, or none at all? If the latter, apparently the IRQ for the radeon card stops working for some reason, which would explain the core symptoms of the freeze.
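
A minimal sketch of how that comparison could be made, assuming the driver's line in /proc/interrupts is labelled "radeon" (the helper name irq_count is hypothetical, not an existing tool). It sums the per-CPU counters on the first matching line, so two readings taken a few seconds apart can be compared directly:

```shell
# Hypothetical helper: print the total interrupt count (summed over all CPUs)
# for the first line of an interrupts table that matches PATTERN.
# FILE defaults to /proc/interrupts; prints nothing if no line matches.
irq_count() {
    local pattern="$1" file="${2:-/proc/interrupts}"
    awk -v pat="$pattern" '
        $0 ~ pat {
            total = 0
            # Field 1 is the IRQ number; the per-CPU counters follow and
            # end at the first non-numeric field (the interrupt type).
            for (i = 2; i <= NF && $i ~ /^[0-9]+$/; i++) total += $i
            print total
            exit
        }' "$file"
}
```

Running `irq_count radeon`, waiting a few seconds with a GL client active, and running it again should print two different numbers on a working system; identical numbers would match the "none at all" case described above.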

Comment 11 dmotd 2011-10-12 03:24:07 UTC

(In reply to comment #10)
> (In reply to comment #6)
> > > > running glxgears just shows an empty black box.. all other glx demos are the
> > > > same empty boxes.. 
> > > 
> > > Do they work with the environment variable vblank_mode=0? If yes, does the
> > > number for radeon increase in /proc/interrupts once the problem occurs?
> > 
> > setting vblank_mode=0 works and displays an output.. but not much change in
> > /proc/interrupts (irq 46 for radeon)
> 
> Not much change for the radeon number, or none at all? If the latter,
> apparently the IRQ for the radeon card stops working for some reason, which
> would explain the core symptoms of the freeze.

no change to the radeon irq number.

Comment 12 dmotd 2011-10-12 23:27:01 UTC

(In reply to comment #11)
> (In reply to comment #10)
> > (In reply to comment #6)
> > > > > running glxgears just shows an empty black box.. all other glx demos are the
> > > > > same empty boxes.. 
> > > > 
> > > > Do they work with the environment variable vblank_mode=0? If yes, does the
> > > > number for radeon increase in /proc/interrupts once the problem occurs?
> > > 
> > > setting vblank_mode=0 works and displays an output.. but not much change in
> > > /proc/interrupts (irq 46 for radeon)
> > 
> > Not much change for the radeon number, or none at all? If the latter,
> > apparently the IRQ for the radeon card stops working for some reason, which
> > would explain the core symptoms of the freeze.
> 
> no change to the radeon irq number.

is there a way i can debug this further?

Comment 13 Alex Deucher 2011-10-13 06:01:08 UTC

Try the following options in the kernel command line in grub:
pci=nomsi
noapic
irqpoll
and see if any of them help.

Comment 14 dmotd 2011-10-24 16:46:17 UTC

(In reply to comment #13)
> Try the following options in the kernel command line in grub:
> pci=nomsi
> noapic
> irqpoll
> and see if any of them help.

I have been running my machine with all the above kernel flags appended and i haven't yet experienced an issue. I haven't had a chance to exhaustively test these settings, but my machine has been active for a few days now without a screen lock, so I thought I would report back that this seems to help.

Comment 15 Alex Deucher 2011-10-24 16:47:39 UTC

(In reply to comment #14)
> (In reply to comment #13)
> > Try the following options in the kernel command line in grub:
> > pci=nomsi
> > noapic
> > irqpoll
> > and see if any of them help.
> 
> I have been running my machine with all the above kernel flags appended and i
> haven't yet experienced an issue. I haven't had a chance to exhaustively test
> these settings, but my machine has been active for a few days now without a
> screen lock, so I thought I would report back that this seems to help.

Can you narrow down which specific one helps?

Comment 16 Przemyslaw Kochanski 2011-12-14 17:38:59 UTC

Created attachment 54445 [details]
output of glxinfo

I'm experiencing the same issue on Ubuntu 11.10. I'm ready to provide any backtraces needed. I'm attaching a gnome-shell backtrace and glxinfo output. The same issue is reported on the gnome-shell bugtracker: https://bugzilla.gnome.org/show_bug.cgi?id=650857 but they claim it's either X's or the driver's fault.

Comment 17 Przemyslaw Kochanski 2011-12-14 17:39:45 UTC

Created attachment 54446 [details]
gdb gnome-shell backtrace

Comment 18 Przemyslaw Kochanski 2011-12-15 06:20:29 UTC

(In reply to comment #13)
> Try the following options in the kernel command line in grub:
> pci=nomsi
> noapic
> irqpoll
> and see if any of them help.

I've tried your suggestion and it worked! I've discovered the following:

"pci=nomsi noapic irqpoll" no freeze
"pci=nomsi irqpoll" no freeze
"irqpoll" no freeze
"pci=nomsi" no freeze
"noapic" no freeze
"" freeze

So far, without any of these options I can reproduce the crash in 100% of cases within 1 minute of moving an image in Gimp. I did it many times when backtracing. However, there is a slight probability that I just got lucky with one of the options. I'm 100% sure I've run `update-grub` after every /etc/default/grub change.

Assuming that all of these options fix the problem, which option should I use (which one disables the fewest things)? I found the following but I don't understand much:

noapic: [SMP,APIC] Tells the kernel to not make use of any IOAPICs that may be present in the system.

irqpoll: [HW] When an interrupt is not handled search all handlers for it. Also check all handlers each timer interrupt. Intended to get systems with badly broken firmware running.

pci=nomsi: [MSI] If the PCI_MSI kernel config parameter is enabled, this kernel boot option can be used to disable the use of MSI interrupts system-wide.

And finally: Is it a proper fix, or just a workaround?
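
When testing the options one at a time like this, it may help to confirm after each reboot which of them the running kernel actually received. A minimal sketch, assuming the options were appended to the kernel command line via /etc/default/grub followed by `update-grub` as described above (the helper name has_kernel_opt is hypothetical):

```shell
# Hypothetical helper: succeed if OPTION appears as an exact token on the
# kernel command line. FILE defaults to the running kernel's /proc/cmdline.
has_kernel_opt() {
    local opt="$1" file="${2:-/proc/cmdline}"
    # Put each space-separated token on its own line, then match whole lines.
    tr ' ' '\n' < "$file" | grep -Fxq -- "$opt"
}

# Report which of the three candidate options are active on this boot.
for opt in pci=nomsi noapic irqpoll; do
    if has_kernel_opt "$opt"; then
        echo "$opt: active"
    else
        echo "$opt: not set"
    fi
done
```

Matching whole tokens rather than substrings avoids treating, say, a stray "nomsi" as equivalent to "pci=nomsi".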

Comment 19 Alex Deucher 2011-12-15 07:19:31 UTC

(In reply to comment #18)
> And finally: Is it a proper fix, or just a workaround?

It's a workaround.  Since those options work, it's not a radeon bug.  It's most likely a platform bug for your board; probably a bad apic or msi setup for your chipset.  You might need an apic or pci quirk for your motherboard chipset.

Comment 20 Alex Deucher 2011-12-15 07:21:45 UTC

I would suggest emailing the linux-kernel mailing list and saying that you need noapic or pci=nomsi to get things working on your board.  Include the hw details of your system (lspci, etc.).

Comment 21 Dave Airlie 2012-02-12 08:59:56 UTC

Alex I think this is a driver or hw bug actually.

I seem to lose MSI rearms here, if I manually poke a rearm in from userspace over ssh the system recovers fine.

not sure if we should disable MSI on rv515, you might be able to find some info internally.

Comment 22 Alex Deucher 2012-02-13 06:20:04 UTC

(In reply to comment #21)
> Alex I think this is a driver or hw bug actually.
> 
> I seem to lose MSI rearms here, if I manually poke a rearm in from userspace over ssh the system recovers fine.
> 
> not sure if we should disable MSI on rv515, you might be able to find some info internally.

I'll see what I can find, but r5xx is pretty old.  Did reading back the rearm reg help?

Comment 23 Alex Deucher 2012-02-13 13:40:27 UTC

Created attachment 56992 [details] [review]
possible fix

Does this patch help?

Comment 24 Florian Mickler 2012-02-19 03:19:27 UTC

A patch referencing this bug report has been merged in Linux v3.3-rc4:

commit b7f5b7dec3d539a84734f2bcb7e53fbb1532a40b
Author: Alex Deucher <alexander.deucher@amd.com>
Date:   Mon Feb 13 16:36:34 2012 -0500

    drm/radeon/kms: fix MSI re-arm on rv370+

Comment 25 Dave Airlie 2012-02-19 04:25:45 UTC

still seeing the odd lockup; will play with it for a few more days though.

Comment 26 Michel Dänzer 2012-04-24 01:37:57 UTC

Current kernels disable MSI by default for RV515. Does that resolve this report?

Comment 27 dmotd 2012-04-25 19:44:29 UTC

apologies for the many months without reply - i have not been in a position to contribute after making the initial bug report.

i recently performed a system update and am currently running kernel 3.3.1 on archlinux. i can confirm that this bug is still present, and i am still getting occasional graphics freezes of the same nature as before. once again i can get openbox to replace the affected x session, so basic rendering is still okay.

i did get an opportunity to test pci=nomsi for a while and noticed that while there were no freezes of this nature, the 3d graphics rendering would sometimes receive a slight unexplained performance hit (visual lag) that would remain for the rest of the session, a minor effect but not debilitating to general use.

Comment 28 Tomasz P. 2013-01-11 21:11:02 UTC

Is this still an issue with current Mesa and kernel?
