Bug 20985 - X hangs on startup with latest kernel (2.6.29@intel-drm-next merge)
Status: RESOLVED FIXED
Product: DRI
Classification: Unclassified
Component: DRM/Intel
Version: unspecified
Hardware: Other All
Importance: medium critical
Assigned To: Eric Anholt
 
Reported: 2009-03-31 17:06 UTC by Florian Mickler
Modified: 2009-06-12 13:57 UTC (History)

Attachments
my xorg conf (1.92 KB, application/octet-stream)
2009-03-31 17:06 UTC, Florian Mickler
xorg log (with the offending commit reverted) (22.17 KB, text/plain)
2009-03-31 17:07 UTC, Florian Mickler
kernel config (15.01 KB, application/gzip)
2009-03-31 17:09 UTC, Florian Mickler
dmesg > dmesg_hanging_X shortly after /etc/init.d/xdm start (the gentoo gdm startscript) (247.39 KB, text/plain)
2009-04-01 10:41 UTC, Florian Mickler
the xorg log (don't know exactly if before or after kill -9 X, am in a hurry) (22.17 KB, text/plain)
2009-04-01 10:42 UTC, Florian Mickler
watch -n 0.2 ' cat /proc/`pidof X`/stack' showed only these variations (833 bytes, text/plain)
2009-04-01 10:44 UTC, Florian Mickler
filtered /var/log/messages, for bootup-messages (944.05 KB, application/gzip)
2009-04-03 13:19 UTC, Florian Mickler
this makes X start up and work, but logfiles get spammed nonetheless (1.82 KB, patch)
2009-04-06 04:54 UTC, Florian Mickler
this makes everything working for me again (7.01 KB, patch)
2009-04-06 13:04 UTC, Florian Mickler

Description Florian Mickler 2009-03-31 17:06:21 UTC
Created attachment 24418 [details]
my xorg conf

On startup, gdm hangs.

cat /proc/[pidofX]/stack says:
[<ffffffff8024c13e>] msleep_interruptible+0x2e/0x40
[<ffffffff80557fdd>] i915_wait_ring+0x17d/0x1d0
[<ffffffff8056280d>] i915_gem_execbuffer+0xd2d/0xf70
[<ffffffff80546555>] drm_ioctl+0x1f5/0x320
[<ffffffff802d8ce5>] vfs_ioctl+0x85/0xa0
[<ffffffff802d8f0b>] do_vfs_ioctl+0x20b/0x510
[<ffffffff802d9297>] sys_ioctl+0x87/0xa0
[<ffffffff8020ba8b>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff


I bisected it with git bisect (Linus' git tree) down to:

40a5f0decdf050785ebd62b36ad48c869ee4b384 drm/i915: Fix lock order reversal in GEM relocation entry copying.

See http://lkml.org/lkml/2009/3/30/75 .

Since then I have applied the revert to my kernel sources and upgraded to xf86-video-intel master (also applying the 2 kernel patches from bug#20803).

I just checked: without the revert, the X server still hangs.

My hardware: ThinkPad R61 with Intel 965GM.

I'm on Gentoo, using the x11 overlay for git sources of the X stack:
xorg-server is git from 01:26:02 PM 03/30/2009
xf86-video-intel is git from 02:57:58 PM 03/31/2009
mesa is git from 11:01:27 PM 03/13/2009
libdrm is git from 12:51:46 PM 03/31/2009

Kernel is 15f7176eb1cccec0a332541285ee752b935c1c85 + patches (the hang also happens without them, but you probably need the two from Jesse to make current xf86-video-intel master happy).

I will attach:
xorg.conf, Xorg.0.log (with 40a5f0de reverted) and the kernel patches.

I'll gladly give you more info if you need it.
Comment 1 Florian Mickler 2009-03-31 17:07:16 UTC
Created attachment 24419 [details]
xorg log (with the offending commit reverted)
Comment 2 Florian Mickler 2009-03-31 17:09:58 UTC
Created attachment 24420 [details]
kernel config

You can get the kernel patches from bug#20803, so there is no need to attach them here again.

Instead, here is my kernel config.
Comment 3 Eric Anholt 2009-04-01 09:23:44 UTC
How about dmesg and Xorg.0.log when the failure has happened?
Comment 4 Florian Mickler 2009-04-01 10:41:42 UTC
Created attachment 24440 [details]
dmesg > dmesg_hanging_X shortly after /etc/init.d/xdm start (the gentoo gdm startscript)

Ah yes... it is more like spinning than hanging. I totally forgot to look into dmesg :)
Comment 5 Florian Mickler 2009-04-01 10:42:48 UTC
Created attachment 24441 [details]
the xorg log (don't know exactly if before or after kill -9 X, am in a hurry)
Comment 6 Florian Mickler 2009-04-01 10:44:07 UTC
Created attachment 24442 [details]
watch -n 0.2 ' cat /proc/`pidof X`/stack' showed only these variations
Comment 7 Florian Mickler 2009-04-01 10:45:34 UTC
I am on x86_64, btw.
Comment 8 Florian Mickler 2009-04-03 13:19:40 UTC
Created attachment 24516 [details]
filtered /var/log/messages,  for bootup-messages

So I finally have a little bit of time. The dmesg I attached wasn't the big win, I think? :)

I just found 1.5 GB of /var/log/messages files (and I wondered what my laptop was doing all the time after bootup; it turns out it was kerneloops-daemon searching through my 1.5 GB log file *g*).

After filtering it through:
sed -r 's/\[\s+(.+)\]/[\1]/' messages | cut -d' ' -f7- | uniq -c

the resulting file is 43 MB and the timestamps are stripped (obviously).
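
For readers without sed and cut at hand, the pipeline above can be approximated in Python. This is a minimal sketch, not the command that produced the attachment: the function name filter_messages, the sample log lines and the hostname in them are made up for illustration, and it assumes a syslog layout in which the kernel timestamp is the sixth space-separated field. It shows why the sed step matters: the padding inside "[   12.345678]" would otherwise add empty fields and shift what cut -f7- keeps.

```python
import re
from itertools import groupby

# Same pattern as the sed expression: drop whitespace right after "[".
_PADDED_BRACKET = re.compile(r"\[\s+(.+)\]")


def filter_messages(lines, first_field=7):
    """Approximate: sed -r 's/\\[\\s+(.+)\\]/[\\1]/' | cut -d' ' -f7- | uniq -c"""
    kept = []
    for line in lines:
        line = line.rstrip("\n")
        # sed without the g flag replaces only the first match per line.
        line = _PADDED_BRACKET.sub(r"[\1]", line, count=1)
        if " " in line:
            # cut -d' ' -f7-: keep field 7 onwards (fields are 1-based).
            line = " ".join(line.split(" ")[first_field - 1:])
        # A line without any delimiter is passed through unchanged by cut.
        kept.append(line)
    # uniq -c: collapse runs of adjacent identical lines into (count, text).
    return [(sum(1 for _ in run), text) for text, run in groupby(kept)]


if __name__ == "__main__":
    # Made-up sample lines, only to exercise the function.
    sample = [
        "Mar 31 17:06:21 host kernel: [   12.345678] i915 wait",
        "Mar 31 17:06:22 host kernel: [   12.345999] i915 wait",
        "Mar 31 17:06:23 host kernel: [  100.000001] other",
    ]
    for count, text in filter_messages(sample):
        print(count, text)
```

Because the timestamps are cut away before the uniq step, repeated messages collapse into a single counted line, which is presumably where most of the size reduction comes from.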

If you are interested in the bootup messages of a failure case, here they are.

On a side note:
I think the log suggests that there is a race condition somewhere in the kernel's printk?

There are also some traces in there.
Comment 9 Florian Mickler 2009-04-06 04:54:57 UTC
Created attachment 24597 [details] [review]
this makes X start up and work, but logfiles get spammed nonetheless

This causes libdrm to abort the do {} while () loop in which it got stuck before.

Somehow it works with this (though dmesg still gets spammed).

What do you make of this?

P.S. There are probably other call sites that should assign -EFAULT to ret if copy_to_user fails.
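
The point about -EFAULT can be illustrated with a small user-space model. It is only a sketch under assumptions: copy_to_user_stub, put_relocs and call_with_retry are hypothetical names, Python stands in for the kernel C code, and the retry condition (repeat while the call reports EAGAIN) is an assumption about the libdrm loop rather than something taken from the patch. The one kernel fact it relies on is that copy_to_user returns the number of bytes it could not copy, so a non-zero result has to be translated into an error code by the caller.

```python
import errno


def copy_to_user_stub(nbytes, fault):
    """Stand-in for copy_to_user: returns the number of bytes NOT copied."""
    return nbytes if fault else 0


def put_relocs(nbytes, fault):
    """Hypothetical ioctl helper that maps a short copy to -EFAULT."""
    if copy_to_user_stub(nbytes, fault) != 0:
        return -errno.EFAULT
    return 0


def call_with_retry(handler, max_tries=1000):
    """Call handler until it returns something other than -EAGAIN.

    max_tries exists only so that this demo terminates; the loop being
    modelled is assumed to have no such cap.
    """
    tries = 0
    while True:
        tries += 1
        ret = handler()
        if ret != -errno.EAGAIN or tries >= max_tries:
            return ret, tries


if __name__ == "__main__":
    # A definite error ends the loop after the first attempt.
    print(call_with_retry(lambda: put_relocs(16, fault=True)))
    # A handler that always asks for a retry runs until the demo cap.
    print(call_with_retry(lambda: -errno.EAGAIN, max_tries=5))
```

With a definite error such as -EFAULT the caller gives up after one attempt, whereas a handler that keeps asking for a retry is called again and again, which matches the spinning behaviour described in this report.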
Comment 10 Florian Mickler 2009-04-06 13:04:17 UTC
Created attachment 24614 [details] [review]
this makes everything working for me again

Hi,

this works for me, and I think it makes the code (a little bit) clearer.

I hope you agree.

Perhaps you should double-check the error returns I used for copy_from_user and copy_to_user.

There was also one __copy* function that seemed to be performance-critical, so I left it alone and marked with a comment what we are returning.


Thx,
Florian

p.s.: yay my first patch
Comment 11 Florian Mickler 2009-06-12 13:57:43 UTC
Since this has been fixed since 2.6.30-rc2 and 2.6.30 is released, I think this can be closed as fixed.