Bug 17503

Summary: RV535 blank screen problems
Product: DRI
Reporter: Alois Hammer <aloishammer>
Component: DRM/other
Assignee: Default DRI bug account <dri-devel>
Status: RESOLVED FIXED
QA Contact:
Severity: critical
Priority: high
Version: unspecified
Hardware: x86 (IA32)
OS: Linux (All)
Whiteboard:
i915 platform:
i915 features:
Attachments:
Description                      Flags
X server log                     none
X server 1.4.2 log               none
X lack-of-stack-trace            none
dmesg output after drm failure   none
kernel oops                      none
kernel log with debugging        none

Description Alois Hammer 2008-09-09 09:51:47 UTC
Created attachment 18787 [details]
X server log

I'm experiencing a problem with output blanking on both XOrg server 1.4.2 and 1.5.0.  I'm using Gentoo Linux with vanilla kernel 2.6.26.5 (back through 2.6.25 at least) while using stable and git releases of the radeon driver.  X1650PRO (RV535).  I have kernel drm compiled as a module.  I'm not attempting to insert it at any time during boot, so I'm assuming that XOrg drm is taking precedence.

Scenario 1:

* server 1.4.2

I can enter graphics mode exactly once per boot.  Linux starts, xdm initscript starts gdm, I can log in, I can switch from the X vt to the text console vts and back.  If I shut down X and attempt to restart it, X crashes and leaves me with what I think is a working text console, but both the CRT and DVI-I outputs are blanked.  I can shut down gracefully with Ctrl+Alt+Del (/sbin/reboot) or the power button (acpid handler).  X works fine again after a warm boot, but only once, as before.  No log output whatsoever.

Scenario 2:

* server 1.5.0:

X starts fine, or appears to.  I'm attaching a log.  Logging in from another machine via ssh shows various X components running.  However, there's no output on either graphics port despite messages showing what appears to be successful output blank/unblank.

Note that, although the card has TV output, I'm not using it.  I have an LCD panel attached to the VGA port.  IIRC, this problem started occurring when I switched from an X800GTO (R423) to the present card.
Comment 1 Alois Hammer 2008-09-15 14:57:37 UTC
Update: X is living and dying by DRI.  Removing DRI support allows multiple X sessions per warm/cold boot.  Attaching 1.4.2 log.  Note that it ends immediately after DRI installs (this is where X dies immediately).
Comment 2 Alois Hammer 2008-09-15 14:58:57 UTC
Created attachment 18890 [details]
X server 1.4.2 log
Comment 3 Alois Hammer 2008-09-15 16:11:03 UTC
Created attachment 18891 [details]
X lack-of-stack-trace

While attempting to debug with gdb and help from agd5f on Freenode #radeon, X produces a SIGSEGV and leaves no stack to trace.  Attaching gdb backtrace log anyway.  I see one possibly useful error.

(I'm compiling with -O1 and -ggdb in CFLAGS.)
Comment 4 Michel Dänzer 2008-09-16 01:20:34 UTC
Whenever the log file ends abruptly like in this case, one needs to check the X server's stderr output (should be captured in the display manager log file if you're using one). Often this is due to the X server process getting terminated immediately due to an unresolved symbol.

Alternatively, there might be something interesting in dmesg.
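
A saved dmesg dump can be scanned for the kinds of messages that matter here with something like the sketch below. The helper name and the list of marker strings are illustrative assumptions, not an exhaustive set; the sample lines are taken from messages quoted later in this report.

```python
# Hypothetical helper: pick out kernel log lines that hint at a failed
# module load or a kernel oops. The marker strings are examples only.
MARKERS = ("Unknown symbol", "BUG:", "Oops:", "*ERROR*")

def suspicious_lines(log_text):
    """Return the log lines that contain any of the marker strings."""
    return [line for line in log_text.splitlines()
            if any(marker in line for marker in MARKERS)]

sample = """\
[drm] Initialized drm 1.1.0 20060810
drm: Unknown symbol init_mm
BUG: unable to handle kernel NULL pointer dereference at 00000000
[drm:drm_release] *ERROR* Device busy: 1 0
"""

for line in suspicious_lines(sample):
    print(line)
```

The same function could be fed the contents of a display manager log to look for the X server's stderr output.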
Comment 5 Alois Hammer 2008-09-16 09:51:45 UTC
(In reply to comment #4)
> Whenever the log file ends abruptly like in this case, one needs to check the X
> server's stderr output (should be captured in the display manager log file if
> you're using one). Often this is due to the X server process getting terminated
> immediately due to an unresolved symbol.
> 
> Alternatively, there might be something interesting in dmesg.
> 

Good guess.  Here's dmesg after two attempted starts-- I'm attaching the text.
Comment 6 Alois Hammer 2008-09-16 09:52:46 UTC
Created attachment 18927 [details]
dmesg output after drm failure
Comment 7 Alois Hammer 2008-09-16 10:08:19 UTC
Created attachment 18928 [details]
kernel oops
Comment 8 Michel Dänzer 2008-09-17 01:02:32 UTC
Can you try a newer DRM, preferably from upstream drm Git?
Comment 9 Alois Hammer 2008-09-17 08:55:55 UTC
(In reply to comment #8)
> Can you try a newer DRM, preferably from upstream drm Git?
> 

dmesg:

drm: Unknown symbol init_mm

...after taking drm.ko and radeon.ko and placing them in /lib/modules/2.6.26.5/kernel/drivers/char/drm/

init_mm definitely needs to go.  It's being removed from mainline.

Warnings from compile-- the first one, at least, sounds important, and that's the culprit function in the kernel OOPSes:

/var/tmp/portage/x11-base/x11-drm-99999999/work/drm/linux-core/ati_pcigart.c: In function 'drm_ati_pcigart_init':
/var/tmp/portage/x11-base/x11-drm-99999999/work/drm/linux-core/ati_pcigart.c:154: warning: format '%08X' expects type 'unsigned int', but argument 3 has type 'dma_addr_t'

[...]

In file included from ../libdrm/xf86drm.h:40,
                 from dristat.c:33:
./drm.h:493: warning: comma at end of enumerator list
./drm.h:500: warning: ISO C90 does not support 'long long'
./drm.h:905: warning: comma at end of enumerator list
./drm.h:1142: warning: ISO C forbids forward references to 'enum' types
In file included from dristat.c:36:
../libdrm/xf86drm.c: In function 'drmUpdateDrawableInfo':
../libdrm/xf86drm.c:1491: warning: ISO C90 does not support 'long long'

[...]

In file included from ../libdrm/xf86drm.h:40,
                 from drmstat.c:42:
./drm.h:493: warning: comma at end of enumerator list
./drm.h:500: warning: ISO C90 does not support 'long long'
./drm.h:905: warning: comma at end of enumerator list
./drm.h:1142: warning: ISO C forbids forward references to 'enum' types
Comment 10 Alois Hammer 2008-09-17 09:23:03 UTC
(In reply to comment #8)
> Can you try a newer DRM, preferably from upstream drm Git?
> 

Okay, I've gotten drm and radeon working, more or less.  Same results as before.  Here's the excerpted dmesg, with init_mm insertion warnings and all:

Symbol init_mm is marked as UNUSED, however this module is using it.
This symbol will go away in the future.
Please evalute if this is the right api to use and if it really is, submit a report the linux kernel mailinglist together with submitting your code for inclusion.
[drm] Initialized drm 1.1.0 20060810
ACPI: PCI Interrupt 0000:01:00.0[A] -> GSI 16 (level, low) -> IRQ 16
PCI: Setting latency timer of device 0000:01:00.0 to 64
[drm] Initialized radeon 1.29.0 20080613 on minor 0
[drm] Setting GART location based on new memory map
[drm] Loading R500 Microcode
[drm] Num pipes: 1
[drm] writeback test succeeded in 1 usecs
vbox0: no IPv6 routers present
eth0: no IPv6 routers present
br0: no IPv6 routers present
process `skype' is using obsolete setsockopt SO_BSDCOMPAT
[drm] Num pipes: 1
[drm] Setting GART location based on new memory map
BUG: unable to handle kernel NULL pointer dereference at 00000000
IP: [<f8b6da3f>] :drm:drm_ati_pcigart_init+0x7f/0x2f0
*pdpt = 00000000347c1001 *pde = 00000000347ee067 *pte = 0000000000000000 
Oops: 0002 [#1] PREEMPT SMP 
Modules linked in: radeon drm bridge llc tun eeprom snd_seq_midi snd_emu10k1_synth snd_emux_synth snd_seq_virmidi snd_seq_midi_emul snd_pcm_oss snd_mixer_oss snd_seq_oss snd_seq_midi_event snd_seq snd_emu10k1 snd_rawmidi snd_ac97_codec ac97_bus snd_hda_intel snd_pcm snd_seq_device snd_util_mem snd_timer snd_page_alloc snd_hwdep snd i2c_i801 i2c_core soundcore intel_agp agpgart e1000e

Pid: 8223, comm: X Not tainted (2.6.26.5 #3)
EIP: 0060:[<f8b6da3f>] EFLAGS: 00013246 CPU: 1
EIP is at drm_ati_pcigart_init+0x7f/0x2f0 [drm]
EAX: 00000000 EBX: 00008000 ECX: 00002000 EDX: 00000000
ESI: f744b480 EDI: 00000000 EBP: 00000000 ESP: f6cd9ca0
 DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
Process X (pid: 8223, ti=f6cd8000 task=f4208d80 task.ti=f6cd8000)
Stack: 00000000 00000000 f64b4000 dfff8000 00000000 f6e932e0 f6dc1800 dfff8000 
       f6dc1800 00000000 00002000 dfff8000 00000000 02000000 f6dc1800 00000000 
       f6e93000 f8b8c25e f8ba1dc4 00000000 00000003 f6de415c f6de4120 f79e50d8 
Call Trace:
 [<f8b8c25e>] radeon_cp_init+0x73e/0xc80 [radeon]
 [<c01d58dc>] journal_stop+0x14c/0x1d0
 [<f8b8bb20>] radeon_cp_init+0x0/0xc80 [radeon]
 [<f8b66070>] drm_unlocked_ioctl+0x100/0x2c0 [drm]
 [<c01c5c17>] ext3_ordered_write_end+0x107/0x1d0
 [<c01625f1>] __do_fault+0x161/0x3c0
 [<c015350b>] remove_suid+0x1b/0x70
 [<c01944e3>] mnt_drop_write+0x53/0x130
 [<c0155abd>] __generic_file_aio_write_nolock+0x23d/0x550
 [<c01189ac>] do_page_fault+0x33c/0x880
 [<c0155e39>] generic_file_aio_write+0x69/0xe0
 [<c01c3080>] ext3_file_write+0x30/0xc0
 [<c017ae75>] do_sync_write+0xd5/0x120
 [<c0140530>] lock_hrtimer_base+0x20/0x50
 [<c013cc10>] autoremove_wake_function+0x0/0x50
 [<c0140641>] hrtimer_cancel+0x11/0x20
 [<c03b951d>] do_nanosleep+0x8d/0xb0
 [<c0187bc0>] vfs_ioctl+0x80/0x90
 [<c0187c37>] do_vfs_ioctl+0x67/0x2e0
 [<c0187eed>] sys_ioctl+0x3d/0x70
 [<c010311d>] sysenter_past_esp+0x6a/0x91
 =======================
Code: 00 8b 56 08 8b 4c 24 14 8b 7c 24 24 8b 41 40 c1 e8 02 39 d0 8d 1c 85 00 00 00 00 0f 4e d0 89 d9 89 54 24 28 c1 e9 02 31 d2 89 d0 <f3> ab f6 c3 02 74 02 66 ab f6 c3 01 74 01 aa 8b 7c 24 28 85 ff 
EIP: [<f8b6da3f>] drm_ati_pcigart_init+0x7f/0x2f0 [drm] SS:ESP 0068:f6cd9ca0
---[ end trace c03fa40495454f94 ]---
[drm:drm_release] *ERROR* Device busy: 1 0
Comment 11 Michel Dänzer 2008-09-17 09:36:49 UTC
Can you try enabling the drm module parameter debug, and maybe add debugging output to drm_ati_pcigart_init() to narrow down where it crashes?

P.S. Please attach large pieces of information rather than cluttering up the comments.
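
As a starting point for narrowing down the crash site, the oops already reports the location in the usual symbol+offset/size form: "drm_ati_pcigart_init+0x7f/0x2f0 [drm]" means offset 0x7f into a function that is 0x2f0 bytes long, in the drm module. A minimal sketch for pulling those fields out of a log line follows; the helper name and regular expression are hypothetical.

```python
import re

# Matches the "symbol+0xOFFSET/0xSIZE [module]" form used in kernel oops
# reports. The module part is optional (built-in code has none).
LOCATION_RE = re.compile(r"(\w+)\+0x([0-9a-f]+)/0x([0-9a-f]+)(?:\s+\[(\w+)\])?")

def parse_oops_location(text):
    """Return (symbol, offset, size, module) for the first match, or None."""
    m = LOCATION_RE.search(text)
    if m is None:
        return None
    symbol, offset, size, module = m.groups()
    return symbol, int(offset, 16), int(size, 16), module

line = "EIP is at drm_ati_pcigart_init+0x7f/0x2f0 [drm]"
symbol, offset, size, module = parse_oops_location(line)
print(f"{module}:{symbol} faulted {offset} bytes into a {size}-byte function")
```

An offset this small relative to the function size suggests the fault is fairly early in the function, which may help decide where to put the first debugging output.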
Comment 12 Alois Hammer 2008-09-17 10:05:00 UTC
(In reply to comment #11)
> Can you try enabling the drm module parameter debug, and maybe add debugging
> output to drm_ati_pcigart_init() to narrow down where it crashes?
> 
> P.S. Please attach large pieces of information rather than cluttering up the
> comments.
> 

Apologies for the clutter.

Following a suggestion from MrCooper, and after working with airlied some more, I tried the nopat kernel option.  Presto, drm works.  Something in the interaction between the PAT code and drm...
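
Whether a given boot actually has the option set can be confirmed from the kernel command line. A minimal sketch, assuming the option is passed as the bare word nopat; the helper name and the sample command line are made up for illustration.

```python
def has_boot_option(cmdline, option):
    """True if `option` appears as a whole word on the kernel command line."""
    return option in cmdline.split()

# Example command line; the root= and ro values are invented.
example = "root=/dev/sda3 ro nopat"
print(has_boot_option(example, "nopat"))

# On a running Linux system the real command line is in /proc/cmdline:
# with open("/proc/cmdline") as f:
#     print(has_boot_option(f.read(), "nopat"))
```

Splitting on whitespace rather than doing a substring search avoids false positives from longer options that merely contain the same letters.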
Comment 13 Alois Hammer 2008-09-24 09:02:32 UTC
Any progress on this bug?  I've got multiple, distinct traces on kerneloops.org.
Comment 14 Jana Saout 2008-10-22 04:40:47 UTC
Created attachment 19813 [details]
kernel log with debugging

For some reason the same problem started occurring yesterday on my laptop too.

I am pretty desperate, since I did not change anything (well, at least nothing related to X).  I tried different kernel versions, recompiled the X server, mesa, and libdrm, and it still does not work.

The weird thing is that my X server crashed when I was logging out from my gnome session (so I didn't reboot or anything) and I'm not able to get it working again! :-(

Ok, so I traced it down to a call to "ioremap" failing. I added a check to the return value so that at least I don't get an oops.

I also found out where in ioremap_nocache it is failing. See the line:

"ioremap error for 0xdfff80000-0xe0000000, requested 0x10, got 0x8"

To my understanding it is trying to remap some memory on the graphics adapter requesting the map to be uncached, but it's aborting because the memory is marked as write-back. (??????)
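
The two flag values in that message can be decoded with a small sketch like the one below. It assumes the x86 encoding used by PAT-enabled kernels of this era, where the cache type is carried in the PWT (0x8) and PCD (0x10) page-table bits; the table is an assumption and should be checked against the headers of the kernel actually in use.

```python
# Assumed x86 page-table cache attribute bits (PWT = bit 3, PCD = bit 4).
PAGE_PWT = 0x08
PAGE_PCD = 0x10
CACHE_MASK = PAGE_PWT | PAGE_PCD

# Assumed mapping from bit combination to cache type.
CACHE_NAMES = {
    0x00: "write-back (WB)",
    PAGE_PWT: "write-combining (WC)",
    PAGE_PCD: "uncached-minus (UC-)",
    PAGE_PCD | PAGE_PWT: "uncached (UC)",
}

def decode_cache_flags(flags):
    """Name the cache type encoded in the PWT/PCD bits of `flags`."""
    return CACHE_NAMES[flags & CACHE_MASK]

print("requested:", decode_cache_flags(0x10))
print("got:      ", decode_cache_flags(0x08))
```

Under that assumption, 0x10 would be an uncached-minus request and 0x8 an existing write-combining mapping, so the conflict may be with a write-combining mapping rather than a write-back one.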

It is an r500 (X1400, i.e. mobility on a Thinkpad T60).  This has been working just fine for months, including 3D and everything.

xorg-server 1.5.2, mesa 1.4.2

I would be willing to help as much as possible, I really need my X going again, but I am no DRM expert.
Comment 15 Jana Saout 2008-10-22 04:43:36 UTC
Additional note: Same problem with vanilla 2.6.26 and 2.6.27 with drm-rawhide git tree from David Airlie.
Comment 16 Jana Saout 2008-10-22 04:49:45 UTC
One more thing, /proc/mtrr shows

reg00: base=0x00000000 (   0MB), size=2048MB: write-back, count=1
reg01: base=0x7ff00000 (2047MB), size=   1MB: uncachable, count=1

So the last line covers 0x7ff00000 to 0x80000000, and thus the memory range requested to be mapped uncached is uncached, so I'm not sure why ioremap thinks it is write-back, unless the first line takes precedence. (??)
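
The range arithmetic can be double-checked with a short sketch that turns each /proc/mtrr line into a start and an exclusive end address. The parser below is hypothetical and only handles the MB-sized format shown above. Treating 0xdfff8000 (a value that appears in the stack dump in comment 10) as the start of the failing mapping is an assumption.

```python
import re

# Hypothetical parser for /proc/mtrr lines in the format quoted above.
MTRR_RE = re.compile(
    r"reg(\d+): base=0x([0-9a-f]+) \(\s*\d+MB\), "
    r"size=\s*(\d+)MB: ([\w-]+), count=\d+"
)
MB = 1024 * 1024

def parse_mtrr(line):
    """Return register number, start, exclusive end, and cache type."""
    m = MTRR_RE.match(line.strip())
    if m is None:
        raise ValueError(f"unrecognised /proc/mtrr line: {line!r}")
    start = int(m.group(2), 16)
    return {"reg": int(m.group(1)), "start": start,
            "end": start + int(m.group(3)) * MB, "type": m.group(4)}

def covering(ranges, addr):
    """List the register numbers whose range contains `addr`."""
    return [r["reg"] for r in ranges if r["start"] <= addr < r["end"]]

lines = [
    "reg00: base=0x00000000 (   0MB), size=2048MB: write-back, count=1",
    "reg01: base=0x7ff00000 (2047MB), size=   1MB: uncachable, count=1",
]
ranges = [parse_mtrr(entry) for entry in lines]
for r in ranges:
    print(f"reg{r['reg']:02d}: {r['start']:#010x}-{r['end']:#010x} {r['type']}")

# Assumed address of the failing mapping; see the note above.
print(covering(ranges, 0xdfff8000))
```

Both registers end at 0x80000000, so an address near 0xdfff8000 would lie above either range. If that assumption about the address holds, neither MTRR entry describes the memory being remapped.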
Comment 17 Jana Saout 2008-10-22 06:00:19 UTC
Ok, the bug seems to be in PAT as nopat helps.  But how the heck did it stop working yesterday evening?! WTF...

Could you add a null pointer check to the ioremap result?  Having an error message instead of an oops would be nice.
Comment 18 Jerome Glisse 2009-05-20 05:49:28 UTC
Do you still have this issue with a recent kernel and xf86-video-ati?
Comment 19 Alois Hammer 2009-05-20 08:04:24 UTC
(In reply to comment #18)
> Do you still have this issue with a recent kernel and xf86-video-ati?
> 

I'll check as soon as I get home.  I honestly don't recall if I'm still passing nopat.  I sure hope that either the various PAT fixes that have gone into mainline since September, or the drm or driver fixes, have taken care of this.
Comment 20 Alois Hammer 2009-05-22 08:53:34 UTC
(In reply to comment #18)
> Do you still have this issue with a recent kernel and xf86-video-ati?
> 

...yeah, I haven't been passing nopat recently, and I haven't run into any obvious bugs.  Still using the same hardware, but upgraded to 2.6.29.4, X.Org 1.6.1, and still using repo builds of xf86-video-ati.
Comment 21 Jerome Glisse 2009-05-22 15:50:34 UTC
So if I understand correctly, you don't see the bug anymore with a recent kernel? I am closing the bug; reopen it if you still see the problem with a recent kernel.
