Description
Andrew Moise
2007-09-11 06:30:50 UTC
Created attachment 11508 [details]
Xorg.0.log from driver version 2.0.0-5
Created attachment 11509 [details]
xorg.conf from driver version 2.0.0-5
*** This bug has been marked as a duplicate of bug 11773 *** Bug #11773 has been marked as resolved, but I still see this bug as of today (git commit 4a2b0f340357c4ca58dc9586fad1337b83966362). Heavy disk load seems to help reproduce it. Andrew, can you try to get a backtrace of the crash using what's described at http://www.x.org/wiki/Development/Documentation/ServerDebugging? Feel free to reopen this bug if you think it's still an issue. Also, how exactly are you triggering it? Thanks, Jesse I definitely do think that it's still an issue :-). I was waiting to see whether this or bug #11773 was the one to reopen, but I guess there's no harm in reopening this one for now. I'm not sure I'll be able to get a backtrace. The system is still usable via the network when this happens, but I don't see anything in the log (with the current git driver) to indicate that X knows anything is wrong. It may just be exiting "successfully" while putting my video hardware in a bad state. I'll investigate more when I get a chance (hopefully this weekend). The repro case I used this last time was to run 'while true; find /; done' in one console window and 'while true; cat * > /dev/null; done' (while in a directory belonging to a USB drive with some big files) in another, and then start X from a third and immediately exit. I frequently see it when doing not much of anything special before exiting X, but my old test case (repeatedly starting and immediately exiting X) seems not to reproduce it very well. I have a guess that real usage is good at reproducing it because parts of X are swapped out, and then being swapped back in as X exits, changing the timing, but that's just a guess. FWIW, this machine has not a lot of memory (256 MB with shared video memory), so a lot more gets paged to disk than might on a more powerful system. Thanks for your time; I'll get you more information as soon as I can. Hm, I had some trouble reproducing this bug with the 'find+cat' method of producing disk load that I mentioned earlier. I ran a 'dpkg -l' operation and fooled with firefox until the machine was mostly unresponsive, and then exited, and I got a different crash than I've seen before. Instead of going into DPMS mode, the display froze with a repeated pattern of garbage on the screen. Just as before, there's no X process running, and nothing in the log to indicate that the server seg faulted or anything. Do you think it's still worthwhile setting up to run it in gdb? I'll attach a picture of the corruption that's filled the screen, and the log. THanks. Created attachment 12606 [details]
Xorg.0.log from git version 4a2b0f340357c4ca58dc9586fad1337b83966362
Created attachment 12616 [details]
Picture of screen corruption
One thing I should have mentioned before is that I'm running a rotated display (using xrandr in .xinitrc). I'll try to reproduce it with a non-rotated display and let you know how it goes. Thanks! What info do you need? Just for me to try it again with a non-rotated display? Yeah, that would be a helpful data point. And yeah, running the server under gdb so we can figure out why it exited could be helpful for this bug. Thanks, Jesse Nian, is this something you can reproduce? With 2D driver version 2.2.0, we can not reproduce this bug, it does not exist. With current tip, X can not start up on 855GM(bug #13191). "we can not reproduce this bug" and "it does not exist" are two very different statements :-). I ran version 2.2.0-1 of the driver from Debian, without rotating the screen or modifying the resolution or anything, and with "while true; do find /; done", "while true; do cat * > /dev/null; done", and "while true; do dpkg -l; done" running in three different 'screen' sessions. It crashed again. This time it was the same colorful corruption as before, not the monitor-goes-power-save failure. I did manage to get a backtrace, although it was not staggeringly helpful: Program received signal SIGTERM, Terminated. [Switching to Thread -1211417824 (LWP 19244)] SmartScheduleTimer (sig=14) at ../../os/utils.c:1562 1562 ../../os/utils.c: No such file or directory. in ../../os/utils.c (gdb) bt #0 SmartScheduleTimer (sig=Cannot access memory at address 0xbfb944d0 ) at ../../os/utils.c:1562 Cannot access memory at address 0xbfb944cc (gdb) Is there anything else I can do to help you guys debug this? We used intel driver 2.2.0 and xserver 1.4. We also can rotate the screen. When gdb X, you can disable SmartSchedule by passing argument --dumbSched to Xorg. I do not know how to pass args to X when run "startx", but this works: X -dumbSched & metacity & gnome-session & Please debug again. Hm, the running with -dumbSched is slightly more informative, though I still don't get a proper backtrace: Program received signal SIGTERM, Terminated. [Switching to Thread -1211745504 (LWP 2753)] Cannot remove breakpoints because program is no longer writable. It might be running in another process. Further execution is probably impossible. XF86BigfontResetProc (extEntry=Cannot access memory at address 0xbfaa62a0 ) at ../../Xext/xf86bigfont.c:334 334 ../../Xext/xf86bigfont.c: No such file or directory. in ../../Xext/xf86bigfont.c (gdb) bt #0 XF86BigfontResetProc (extEntry=Cannot access memory at address 0xbfaa62a0 ) at ../../Xext/xf86bigfont.c:334 Cannot access memory at address 0xbfaa629c (gdb) That's with X server 1.4.1~git20071117-1 and driver 2.2.0-1, both from Debian sid. What steps are you trying when you try to reproduce it there? I should mention also that I've been avoiding upgrading my whole system for quite some time, specifically because of brokenness in this driver. I've been using the old i810 driver for general use, and I don't want to get myself into a state where downgrading to that driver will be difficult because of dependencies. If you guys suspect that this could be caused by a bug somewhere else than in the core X server and driver, though, I can do a full upgrade and retest. Even as it is, it's relatively painful to downgrade to the old driver because of the various other downgrades that need to happen. I'm going away for the holiday and won't be back until next week, so I'll not be able to do any hands-on investigation for a while. I'll probably be keeping up with my email, though. Cheers. Just a note -- I'm now back from my vacation, and I've bit the bullet and upgraded my whole system. This bug still occurs. Thanks. Hm, the backtraces aren't that helpful... Are you sure the memory in this machine is ok? Have you run something like memtest86 on it? This problem starts happening when I upgrade to the intel driver, stops when I downgrade to the i810 driver, and always happens in the same place (when exiting X). Hardware trouble is certainly possible, but I doubt that it's faulty memory. I can run memtest86 if you like. I got frustrated last night and tried to debug this myself, without much success. FWIW, the one backtrace I was able to collect (it was that kind of debugging session) was in kernel_vsyscall() instead of anything font-related (with the same SIGTERM-and-process-missing behavior). Is there any way to discover what it is that's sending that SIGTERM signal to my X process? I've debugged X before, but I've never seen that sort of exit before. Well, gdb will trap on signals, but it sounds like you've tried that. The reason why I suggested a memory problem (could be some other hw problem too) is that the crashes seem so random and afaik you're the only one seeing them (though we have our share of other 8xx related problems). So running memtest86 may be a good thing to do. We don't have much else to go on at this point... On rereading, this could also be a timing problem in the exit path of the driver as we restore the console in LeaveVT... might be worth adding some debug code there. I have not made any effort to make gdb handle signals specially -- if there's a way to tell it to break to a debugger when the target process receives SIGTERM, I'd very much appreciate a pointer to some docs, since I don't know how to do that. I didn't mean to give the impression that I had tried to, certainly. The crashes are not random on my end -- they're generally quite consistent. With the 2.2.0-1 version of the driver, they seem to happen every time I exit X, regardless of what kind of artifical load (if any) I put on the system. It makes looking into other X bugs and reconfiguring the server a big PITA, I can tell you -- I've gotten accustomed to typing "sudo init 0" blindly, since I do seem to have a working keyboard on /dev/tty1 when X exits this way :-). If I get time to look into this more on my end, I'll go the route of dropping debugging statements into the driver's shutdown sequence as you suggest instead of fooling with gdb. gdb doesn't seem to be helping. What info are you waiting for, again? Memory test results? One other thing -- I have a half-baked idea that it might be more reproducible on someone else's 855GM if they set up their system to be more like mine. Boot with mem=223M, place a lot of artifical disk load on the system, and see if it happens then. That's where I was going when I asked Wu Nian what steps he was using to try to reproduce this, but he never answered me. Oh well. Upping severity This is definitely a weird one... it may actually be related to 13196 since I think that one is sensitive to the timing of the various pipe enable/disable register writes. That's just a wild guess though since we have so little information to go on. I think we'll either need to reproduce this one in house or get some detailed debug info out of your system to track it down. Okay. Let me know what debugging information I can provide (e.g. a patch that adds logging to the driver so I can run it, repro the problem, and send you the log). I haven't gotten any time to exercise my limited ability to debug this :-(. I'm setting the status back to NEW, since I never heard what information I can provide to resolve the NEEDINFO status. Oh I thought you were going to insert some ErrorF statements in the driver and server LeaveVT paths to see if you can narrow down where the crash is occurring... I hope to, yes, but unfortunately my at-home hobby debugging time is currently taken up with curlftpfs (it's halfway work-related, and also a lot easier to work on, since (a) I don't have to be at home to do it, and (b) I can do it without needing to borrow my housemate's computer for a debugging console). I've worked out a fairly reliable workflow of just rebooting my computer whenever I exit an X session. I won't be offended if you want to mark the bug as NEEDINFO again until I get time to do some investigation of LeaveVT(). Okay, I've tried to look into this some, but I don't really know what I'm looking for. I sprinkled ErrorF() statements throughout I830LeaveVT(), and they all printed as expected but X still left me with an unusable console when it exited. See attached log, config, and patch (this being with driver version 2.2.0-1 and server version 1.4.1~git20071212-1, both from Debian sid). Can you give me a clue as to what codepath(s) might be useful to investigate? Thanks. Created attachment 13384 [details]
Xorg.0.log from version 2.2.0-1 with LeaveVT() debugging
Created attachment 13385 [details]
xorg.conf from version 2.2.0-1
Created attachment 13386 [details] [review] Patch adding the LeaveVT() debugging that I used Thanks for adding the debug output Andrew. Looks like the driver is actually completing its LeaveVT work (i.e. not crashing) but still failing to restore VGA mode for you. Sounds a bit like #13743 and may be a timing or ordering problem in the RestoreHWState code. Could also be something like 13867. And there's a small chance that upping the dpll_settle time to 30ms or higher will fix this problem. This is still occurring for me as of driver version 2.2.0.90-2 from Debian sid. It's gone back to only showing this problem when I exit under load, which is a marked improvement over 2.2.0-1, which would fail to restore the video mode quite often regardless of what load I put on the system. FWIW, the error case I got with 2.2.0.90-2 was when exiting while running a 'find / > /dev/null', a 'while true; dpkg -l > /dev/null; done', and opening a boatload of tabs in firefox until the machine was relatively unresponsive, then exiting. It hung on a white screen instead of taking me back to a console. I'll attach my config and log, since I see that there's some register dumping at the end of the log which I can see being of use (I enable ModeDebug as part of attempting to help debug a separate bug). I haven't been doing any more debugging of this problem on my own, partly because I suspect that I've reached the limit of my ability to debug X problems without some serious investment of time to learn about X drivers and how they work. Anyway, let me know if there's anything else you want me to test or anything else I can provide. Cheers. Created attachment 14193 [details]
xorg.conf from driver version 2.2.0.90-2
Created attachment 14194 [details]
Xorg.0.log from driver version 2.2.0.90-2
Andrew, just to confirm, you're not running with any framebuffer drivers loaded or builtin to the kernel, right? I don't believe so, no. It's a Debian-standard kernel, and 'lsmod | grep fb' turns up nothing. Looks like the AR values from your last log are bad, does the problem still happen w/o the "modedebug" option? I wonder if you're seeing something similar to 14434... I made a few fixes to the display restoration code recently, can you try again with the latest bits? Also, can you get a register dump using intel_reg_dumper from before X starts and after it fails to restore the console? I just got some info from our chipset group that may be relevant, but I'll need to see the dumps to be sure. Thanks, Jesse Sure thing. I'll attach the two register dumps. These are with today's master (commit 66cdccb021a4748b2af41e415c36ed58ca808df6), with ModeDebug disabled, and with intel_reg_dumper from the Debian xserver-xorg-video-intel-dbg package (version 2.2.1-1). The intel_reg_dumper from the master wouldn't compile for me with libpciaccess 0.9.1+git20080219-1: main.o: In function `main': /home/moise/src/xf86-video-intel/src/reg_dumper/main.c:45: undefined reference to `pci_system_init' /home/moise/src/xf86-video-intel/src/reg_dumper/main.c:52: undefined reference to `pci_device_find_by_slot' /home/moise/src/xf86-video-intel/src/reg_dumper/main.c:56: undefined reference to `pci_device_probe' /home/moise/src/xf86-video-intel/src/reg_dumper/main.c:72: undefined reference to `pci_device_map_range' Is that a bug, or is that just because of something I've set up wrong (or changes to libpciacess in the last week)? Created attachment 14571 [details]
Output of intel_reg_dumper on fresh boot
Created attachment 14572 [details]
Output of intel_reg_dumper after exit (when failing to restore console)
Thanks for the register dumps. Is the "fresh boot" register dump really from a power up after a full shutdown (as opposed to just a reboot)? The AR values look bogus... Did you capture them from a console before X started? I captured them from a fresh hard boot before starting X. Do you want register dumps from after running and successfully exiting X and/or while running X for comparison? I must have broken the reg dumper tool somehow; I'm checking it now. For what it's worth, with my Q965 onboard video and intel driver 2.1.1 for 1.3.0, I get similar results: if I switch consoles, sometimes the X server exits entirely, sometimes the display merely becomes nonfunctional. About 70% of the time, however, if I switch back to the X virtual terminal, the video works again. (If the X server has crashed, on the other hand, 'xinit' will usually make video work.) I get the following curious message when switch consoles: (II) intel(0): xf86UnbindGARTMemory: unbind key 0 (II) intel(0): xf86UnbindGARTMemory: unbind key 1 Please note this is with an LCD monitor and DVI; in general those two seem to be very good at evoking bugs in the X drivers. :( Sorry for the lack of action on this one Andrew, we just don't have much to go on. The register dumps (aside from the AR* stuff, but my laptop has the same values) don't have anything obvious... I wonder if this bug is related to #14890. I hope so, since that one seems reproducible... No worries. I generally just run the old "i810" driver (< 2.0) to avoid bugs in the new "intel" driver (>= 2.0), but even when I run the new driver, the more recent versions don't display this bug except under pretty punishing load (though it was still easy to reproduce when I last _tried_ to reproduce it). It's not an actively problematic issue for me. Cheers. Ok, thanks for the update Andrew. I wonder if the type of load matters when reproducing this problem, like whether running a CPU hog or a memory bandwidth hog is more likely to cause the issue... My speculation is that heavy I/O and memory use (leading to swapping out of the X process and therefore long random delays during shutdown) is what's responsible, but all I can really say for sure is the steps that lead it to happen for me. I'd certainly recommend booting with reduced memory (256MB or less) and then running the steps I describe (running a 'find / > /dev/null', a 'while true; dpkg -l > /dev/null; done', and opening a page with a boatload of big images in firefox until the machine is relatively unresponsive, then exiting) if you're interested in reproducing it on your machine. Thanks! I tried the latest source including 2D driver/xserver and kernel 2.6.22+latest drm. But I could not reproduce this bug. My steps are (refer to comment #6): 1, run 'while true; do find /; done' in one console window 2, run 'while true; do cat * > /dev/null; done' (while in a directory belonging to a usb driver with some big files) in another 3, start X from a third and immediately exit. My 855GM platform is Toshiba satellite M20. Any paticular reason why you tried two of the reproduction steps I listed (find / and cat *) and not the other two (limiting your machine to 256MB of memory and opening a memory-consuming page in firefox)? BTW, this is a good page to use for the firefox part of the test: http://ziza.es/2007/07/23/page,1,3,La_vida_de_vagabundos_americanos_42_fotos.html 5-10 seconds after that page starts loading, my blackbox menu starts coming up very slowly when I right-click (several seconds); that's generally when I exit. Thanks! Nian can you try with the memory limit & firefox session Andrew mentioned? Andrew, I'll be getting an 8xx test platform soon, hopefully I'll be able to reproduce & fix this problem. Still can not reproduce this bug with the latest source. My steps are: 1, run 'while true; do find /; done' in one console window 2, run 'while true; do cat * > /dev/null; done' (while in a directory belonging to a usb driver with some big files) in another I did not run "dpkg -l" because our OSD is FC instead of debian/ubuntu. 3, start X from a third, run firefox(long time to start firefox), open page: http://ziza.es/2007/07/23/page,1,3,La_vida_de_vagabundos_americanos_42_fotos.html 4, pkill X; we can back to console successly. My 855GM platform is Toshiba satellite M20 and the memory size is 256MB.. Okay, I've got a theory for what's causing this. Today I exited X under heavy I/O load, and I successfully exited to the console, but this message was waiting for me there: xinit: X server slow to shut down, sending KILL signal. I look at the xinit source code, and lo and behold, if the X server takes longer than 10 seconds to exit, xinit kills it, regardless of how much hardware re-set-up is still waiting to happen. That provides a good theory for what could be causing this bug (especially in the light of comment #15 and comment #17, though those both say SIGTERM instead of SIGKILL). So if that's true, then either the solution is to fix xinit not to do this (thus leaving broken X servers broken, and letting the ones that aren't broken exit even if they're proceeding slowly, which seems like a better solution to me), or to fix the intel driver not to leave the hardware in intermediate states for any length of time during shutdown (which was presumably the i810 driver's behavior). Wow, 10s is a long time to wait for server exit, but I suppose if your system were really loaded and something bad was going on that might be the problem (sending a SIGKILL to the X server would prevent it from restoring hardware state correctly unless the signal came in after the driver's LeaveVT function finished). Anyway, it's easy enough to test, try doing a kill -9 on your server. Does your machine fail in the same way (on mine I get noise rather than a blank screen)? You could also keep an eye on your logs and see if the SIGKILL always corresponds to when you see the failure (you're right that SIGTERM should gracefully shut it down). Unfortunately though, I think we have problems in RestoreHWState on 855 chipsets that cause it to fail sometimes. They're hard to fix though since we can't reproduce them... 10 seconds doesn't seem too outlandish to me for a server exit under heavy I/O and memory load on this machine. I can start counting seconds the next time I exit and make sure. Just testing kill -9 on the server might not do precisely what killing the server halfway through shutdown will do, although if kill -9 from a normal state manages to leave the console useless, then certainly doing it partway through shutdown won't be good. I don't believe xinit writes to Xorg.0.log -- and as I mentioned in comment #6, there's nothing interesting in the log. I'll try starting X from a GNU screen session; then when it fails to restore the video mode I can ssh in and check for the message from xinit. xinit sends SIGTERM exactly ten seconds before it sends SIGKILL. It could be that the behavior I saw in gdb results from xinit sending SIGTERM, the ten seconds elapsing while I'm walking back to the other machine and gdb is loading symbols, and then gdb finding a dead process by the time I try to interact meaningfully with it. This is all pretty easy to test -- I'll just start using a locally-built copy of xinit that does not send SIGKILL, but just sends SIGTERM and then waits as long as it takes for the X server to exit. If I can't reproduce the bug with that setup, then my theory is probably correct. Thanks. Okay, so here's what I've found: Sending kill -9 to the server while it's running does destroy the console -- I can't switch back to the other VTs after the server is dead (and obviously it doesn't shut down cleanly on its own). Modifying xinit to not SIGKILL the server no matter how long it takes to shut down does seem to fix the problem for me. The one time I counted, it took 27 seconds to shut down the server completely (from my mouse click on "exit" to a usable tty1 console). When I ran from GNU screen and then reproduced the bug, I did see the xinit "killing X server" message. That time, I actually was able to switch to the tty consoles after the X server was dead; maybe that aspect of the behavior has changed since my initial bug report, or maybe it just managed to get further along in shutdown on that reproduction. Anyway, cheers. Thanks a lot for tracking it down, Andrew. I'll reclassify this as an xinit bug. It's arguable whether xinit should kill -9 the server at all, but at the very least we know the time limit should be increased for cases like yours. Okay, thanks Jesse! FWIW, it sounds to me as well like the best fix for this is to make xinit not send SIGKILL at all. Cheers. -- GitLab Migration Automatic Message -- This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity. You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/xorg/app/xinit/issues/2. |
Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.