Summary: | [NVAA] System lockup with X consuming all CPU | ||
---|---|---|---|
Product: | xorg | Reporter: | Rick Stevens <ricks> |
Component: | Driver/nouveau | Assignee: | Nouveau Project <nouveau> |
Status: | RESOLVED MOVED | QA Contact: | Xorg Project Team <xorg-team> |
Severity: | normal | ||
Priority: | medium | ||
Version: | unspecified | ||
Hardware: | x86-64 (AMD64) | ||
OS: | Linux (All) | ||
Forgot to mention, this is Fedora 19, fully updated on a dual-core AMD-based system. Video hardware is:

02:00.0 VGA compatible controller: NVIDIA Corporation C77 [GeForce 8200] (rev a2) (prog-if 00 [VGA controller])
    Subsystem: NVIDIA Corporation Device cb84
    Flags: bus master, fast devsel, latency 0, IRQ 5
    Memory at fb000000 (32-bit, non-prefetchable) [size=16M]
    Memory at d8000000 (64-bit, prefetchable) [size=128M]
    Memory at e6000000 (64-bit, prefetchable) [size=32M]
    I/O ports at bc00 [size=128]
    [virtual] Expansion ROM at e0000000 [disabled] [size=128K]
    Capabilities: [60] Power Management version 2
    Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+

Hi Rick

Before jumping to conclusions, can you configure your Xserver to not load
* libglamoregl.so
* vesa_drv.so
* modesetting_drv.so

The last two should be ignored if you use an xorg.conf as shown in [1]. For the first one, you could try temporarily renaming the file (there may be a better way to handle it, which I'm not aware of).

With that said, try to narrow down what's causing the issue. A scenario such as
* as soon as my screensaver kicks in
* using program XX
* doing action A, while using program YY, followed by action C.

As the issue occurs, try to attach the complete output of dmesg (or save the output to a file and attach it at a later point)

Cheers
Emil

[1] "Configuring the X server" http://nouveau.freedesktop.org/wiki/InstallNouveau/

On 07/25/2013 05:01 PM, bugzilla-daemon@freedesktop.org issued this missive:
> *Comment # 2 <https://bugs.freedesktop.org/show_bug.cgi?id=67315#c2> on
> bug 67315 <https://bugs.freedesktop.org/show_bug.cgi?id=67315> from Emil
> Velikov <mailto:emil.l.velikov@gmail.com> *
>
> Hi Rick
>
> Before jumping to conclusions, can you configure your Xserver to not load
> * libglamoregl.so
> * vesa_drv.so
> * modesetting_drv.so
>
> The last two should be ignored if you use an xorg.conf as shown in [1].
> For the first one, you could try temporarily renaming the file (there may
> be a better way to handle it, which I'm not aware of).

There is no xorg.conf file. This is a fresh, virgin install of Fedora 19, then updated.

> With that said, try to narrow down what's causing the issue. A scenario such
> as
> * as soon as my screensaver kicks in
> * using program XX
> * doing action A, while using program YY, followed by action C.
>
> As the issue occurs, try to attach the complete output of dmesg (or save the
> output to a file and attach it at a later point)

I stated that it appears to occur when the screen saver starts and I did include the dmesg data that started to spew out when the problem occurs. That being said, I will restart everything and disable the screen saver completely.

----------------------------------------------------------------------
- Rick Stevens, Systems Engineer, AllDigital    ricks@alldigital.com -
- AIM/Skype: therps2        ICQ: 22643734            Yahoo: origrps2 -
-                                                                    -
-    When you don't know what to do, walk fast and look worried.     -
----------------------------------------------------------------------

(In reply to comment #3)
>
> There is no xorg.conf file. This is a fresh, virgin install of Fedora
> 19, then updated.
>
AFAICS I've asked you to try using an xorg.conf, rather than "Are you using one now?" :P

>
> I stated that it appears to occur when the screen saver starts and I did
> include the dmesg data that started to spew out when the problem occurs.
> That being said, I will restart everything and disable the screen saver
> completely.
>
You mentioned "possibly when the screen saver kicks in", which indicates a wild guess :P
With that confirmed, which screen saver are you talking about: bubble3d, etc.?

There is a reason I've requested dmesg output: the one you attached was taken at a rather unknown point, with absolutely no information as to how nouveau handles your card :\. That's why I've mentioned a _complete_ dmesg.
If you feel there is something confidential inside, simply strip it out.

Created attachment 83046 [details]
20-nouveau.conf, Xorg.0.log and dmesg output
Now this is interesting

[19.600630]
Initially the kernel "drm" seems to pass a NULL pointer to the card, while doing nv50_fbcon_imageblit()

ch 1 [0x0007cb0000 DRM] subc 3 class 0x502d mthd 0x0860 data 0x00000000

The way I see it, after this point we're at the mercy of the hardware

[19.627513]
X/the ddx
MP is still funny and fails to execute/set EDGEFLAG_ENABLE to 1 (which for the life of me I cannot find in the ddx code?)

[55.041877]
At this point the GPU is completely stuffed and fails to execute/set NV50_2D_BLIT_SRC_Y_INT (coming from X).

After that hell breaks loose :P

Rick, a few interesting notes
In the last attachment Xorg.log claims that it starts at ~81.456, whereas dmesg states that X was running at ~19.627541.
* Are those logs matching (i.e. captured from the same boot/system startup)?
I'm assuming that you've started your screensaver at ~55.041877. Is that correct?
* Do you recall when your nouveau started reporting errors (nouveau E)?

Note: the following messages are harmless
nouveau E[ PBUS][0000:02:00.0] MMIO read of * FAULT at 0x1002**

Cheers
Emil

On 07/26/2013 01:34 PM, bugzilla-daemon@freedesktop.org issued this missive:
> *Comment # 6 <https://bugs.freedesktop.org/show_bug.cgi?id=67315#c6> on
> bug 67315 <https://bugs.freedesktop.org/show_bug.cgi?id=67315> from Emil
> Velikov <mailto:emil.l.velikov@gmail.com> *
>
> Now this is interesting
>
> [19.600630]
> Initially the kernel "drm" seems to pass a NULL pointer to the card, while doing
> nv50_fbcon_imageblit()
>
> ch 1 [0x0007cb0000 DRM] subc 3 class 0x502d mthd 0x0860 data 0x00000000
>
> The way I see it, after this point we're at the mercy of the hardware
>
> [19.627513]
> X/the ddx
> MP is still funny and fails to execute/set EDGEFLAG_ENABLE to 1 (which for the
> life of me I cannot find in the ddx code?)
>
> [55.041877]
> At this point the GPU is completely stuffed and fails to execute/set
> NV50_2D_BLIT_SRC_Y_INT (coming from X).
>
> After that hell breaks loose :P

Emil, thanks for looking at this. It's, uhm, interesting, isn't it? For clarity's sake, this is a bit of an odd system. It's a Shuttle motherboard. From "dmidecode":

Handle 0x0002, DMI type 2, 8 bytes
Base Board Information
    Manufacturer: Shuttle Inc
    Product Name: FN78S
    Version: V10
    Serial Number:

> Rick, a few interesting notes
> In the last attachment Xorg.log claims that it starts at ~81.456, whereas
> dmesg states that X was running at ~19.627541.
> * Are those logs matching (i.e. captured from the same boot/system startup)?
> I'm assuming that you've started your screensaver at ~55.041877. Is that correct?
> * Do you recall when your nouveau started reporting errors (nouveau
> E)?

I booted the machine and as soon as the display locked up, I ssh'd to it from another machine and simply did a "dmesg >/rick/dmesg.txt" to capture as much as I could. The Xorg.0.log was already generated by the time I logged in via ssh. I combined the various logs into one file and shot it off to bugzilla.

As far as the screen saver goes, I'm running XFCE and the lockup occurred as soon as I went through the "Applications Menu->Settings->Screensaver" menu tree. It tried to render the Screensaver window, got as far as drawing the box around it and everything locked up. After about a minute, the graphical screen blanked and the "GPU lockup" and "going back to fbcon" messages started appearing on the console. The GUI screen then reappeared but it was still locked up. Again it cleared, the GPU lockup message appeared again and around and around we go.

It might do the same thing rendering other windows, but I'm trying to be consistent to help debug this. I'll be more than happy to keep tinkering, but I've had like 3 hours of sleep in the last two days and I'm a bit knackered. I can pick this back up Monday. I'm in California if that's of any help.

> Note: the following messages are harmless
> nouveau E[ PBUS][0000:02:00.0] MMIO read of * FAULT at 0x1002**

Glad to hear that! That's the first thing that appears after the plymouth "fill the bubble" screen and before I get the GDM login.

----------------------------------------------------------------------
- Rick Stevens, Systems Engineer, AllDigital    ricks@alldigital.com -
- AIM/Skype: therps2        ICQ: 22643734            Yahoo: origrps2 -
-                                                                    -
-      Squawk! Pieces of Seven! Pieces of Seven! Parity Error!       -
----------------------------------------------------------------------

I'm ready to continue sorting this out. Enjoyed adequate sleep over the weekend so I'm rarin' to go.

What else do you need from me?

(In reply to comment #8)
> I'm ready to continue sorting this out. Enjoyed adequate sleep over the
> weekend so I'm rarin' to go.
>
> What else do you need from me?

Answers to the last two questions would be nice as a start :P

(In reply to comment #6)
> In the last attachment Xorg.log claims that it starts at ~81.456, whereas
> dmesg states that X was running at ~19.627541.
> * Are those logs matching (i.e. captured from the same boot/system startup)?
> I'm assuming that you've started your screensaver at ~55.041877. Is that
> correct?
I.e. understanding the timeline would be beneficial

> * Do you recall when your nouveau started reporting errors
> (nouveau E)?
>
I'm talking about nouveau errors in general, indicated by "nouveau E", disregarding the ones I've mentioned as harmless. If unsure, attach dmesg of older kernels that you have handy :)

Created attachment 83220 [details] [review]
BUG_ON(!data)

Also you can try this simple patch; it would trigger a kernel bug, so be warned.
If/when it does, capture the output and attach it here (either in picture or text form)
Thanks

(In reply to comment #9)
> (In reply to comment #8)
> > I'm ready to continue sorting this out. Enjoyed adequate sleep over the
> > weekend so I'm rarin' to go.
> >
> > What else do you need from me?
>
> Answers to the last two questions would be nice as a start :P

In answer to your first question (I think), yes, those captures were from the same boot/crash cycle, acquired as I mentioned in my comments.

In answer to the second, I'm not sure how to correlate the timestamps. Looking at the dmesg output, the first "nouveau E" message occurred 19 seconds into the boot process. This is long before the login screen for the desktop appears. The next messages seem to occur once I log in via the desktop login screen, about 55 seconds after the boot. The screensaver shouldn't start for at least 10 minutes after that.

As I said, if I try to do something such as adjust the screen saver parameters via XFCE's "Applications Menu->Settings->Screensaver" menu, as soon as the frame containing the application's window is rendered the display, keyboard and mouse lock up.

>
> (In reply to comment #6)
> > In the last attachment Xorg.log claims that it starts at ~81.456, whereas
> > dmesg states that X was running at ~19.627541.
> > * Are those logs matching (i.e. captured from the same boot/system startup)?
> > I'm assuming that you've started your screensaver at ~55.041877. Is that
> > correct?
> I.e. understanding the timeline would be beneficial
>
> > * Do you recall when your nouveau started reporting errors
> > (nouveau E)?
> >
> I'm talking about nouveau errors in general, indicated by "nouveau E",
> disregarding the ones I've mentioned as harmless. If unsure, attach dmesg of older
> kernels that you have handy :)

dmesg is wiped each time I boot the machine. I guess I could try to set up something that would log it to a disk file instead of just in the ring buffer.

(In reply to comment #10)
> Created attachment 83220 [details] [review]
> BUG_ON(!data)
>
> Also you can try this simple patch; it would trigger a kernel bug, so be
> warned.
> If/when it does, capture the output and attach it here (either in picture
> or text form)
> Thanks

I'll need to free up some time this week to build a new kernel. I'll put it on the "got to do" list.

(In reply to comment #11)
> In answer to your first question (I think), yes, those captures were from
> the same boot/crash cycle, acquired as I mentioned in my comments.
>
Obviously I did not express myself clearly enough :)

The Xorg.0.log attached is _not_ from when the crash/lockup occurred. It is from the second X session (the one that was started by your system after X died the first time). I was seeking confirmation of the above with my first question :P

Does your system have plymouth or a similar splash manager? Please disable it, try to reproduce the issue and attach dmesg _before_ trying the patch/recompiling the kernel

[...]
> adjust the screen saver parameters via XFCE's "Applications
> Menu->Settings->Screensaver" menu, as soon as the frame containing the
> application's window is rendered the display, keyboard and mouse lock up.
>
Please do not assume that I'm running XFCE and/or know what exactly it does behind the scenes. The name and/or package of the screensaver would be great :)

[...]
> dmesg is wiped each time I boot the machine. I guess I could try to set up
> something that would log it to a disk file instead of just in the ring
> buffer.
No need for such overkill.

All I was asking was "Is this a kernel regression?", or in other words "before you installed Fedora 19 with kernel 3.9.9, you had XXX, running kernel XXX. Do you recall any messages similar to 'nouveau E' in dmesg?"

A quick
$ grep -r "nouveau E" /var/log/
may help, depending on your setup

Cheers
Emil

Created attachment 83247 [details]
More file snapshots
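For anyone trying to reconstruct the timeline discussed above, here is a minimal Python sketch, not something anyone in this thread actually ran. It lists the "nouveau E" lines in a saved dmesg capture (skipping the PBUS "MMIO read of ... FAULT" messages Emil called harmless) and pulls the first bracketed timestamp out of a log so the dmesg and Xorg.0.log start times can be compared. The function names and the sample text are made up for illustration; only the 19.600630 line and the 81.456 start time come from the comments above.

```python
import re

# Lines quoted in this thread look like "[19.600630] message"; real dmesg
# usually pads the number with spaces, which the pattern tolerates.
STAMP = re.compile(r"^\[\s*(\d+\.\d+)\]\s*(.*)$")

# Emil described the PBUS "MMIO read of ... FAULT" messages as harmless.
HARMLESS = re.compile(r"PBUS.*MMIO read of .* FAULT")


def nouveau_errors(dmesg_text):
    """Return (seconds, message) for each non-harmless 'nouveau E' line."""
    found = []
    for line in dmesg_text.splitlines():
        match = STAMP.match(line.strip())
        if not match:
            continue
        seconds, message = float(match.group(1)), match.group(2)
        if "nouveau E" in message and not HARMLESS.search(message):
            found.append((seconds, message))
    return found


def first_timestamp(log_text):
    """Return the first bracketed timestamp in a log, or None if there is none."""
    for line in log_text.splitlines():
        match = STAMP.match(line.strip())
        if match:
            return float(match.group(1))
    return None


# Hypothetical sample: only the 19.600630 line is taken from this bug.
sample_dmesg = """\
[   12.100000] nouveau E[    PBUS][0000:02:00.0] MMIO read of 0x00000000 FAULT at 0x100204
[   19.600630] nouveau E[  PGRAPH][0000:02:00.0] ch 1 [0x0007cb0000 DRM] subc 3 class 0x502d mthd 0x0860 data 0x00000000
[   20.000000] nouveau  [     DRM] informational line
"""
sample_xorg = "X.Org X Server\n[    81.456] (II) hypothetical first log entry\n"

if __name__ == "__main__":
    for seconds, message in nouveau_errors(sample_dmesg):
        print(seconds, message)
    print("Xorg log starts at", first_timestamp(sample_xorg))
```

If the Xorg log's first timestamp is well after the first error in dmesg, the log most likely belongs to a later X session than the one that locked up, which is the mismatch Emil pointed out.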
Sorry, forgot to comment about the attachment I just put up. The machine did use plymouth, so I removed the "rhgb" stuff from the boot command and repeated everything as before, but included both Xorg.0.log[.old] files and another capture of dmesg.

The screensaver is "xscreensaver" and the menu tree fires up "xscreensaver-demo". They come from the xscreensaver-5.22.1.fc19.x86_64 set of RPMs. The system is up-to-date running kernel 3.10.3-300.fc19.x86_64.

Oh, one additional thing...nouveau never seemed to work with this hardware. Before I upgraded this machine to F19, it had been running F18 but I ended up running the nvidia blob driver since I never could get nouveau to work. I could go back to that if needed, but I thought that since this is currently an experimental system, I might try to get nouveau to work properly and possibly help the community.

I do appreciate your help on this, Emil. I just wanted you to know that, and if I seem a bit, well, dense it's because I'm not that familiar with how the entire X mechanism works on this sort of hardware.

(In reply to comment #16)
> Oh, one additional thing...nouveau never seemed to work with this hardware.
Ouch, now that is interesting :)

> I do appreciate your help on this, Emil. I just wanted you to know that, and
> if I seem a bit, well, dense it's because I'm not that familiar with how the
> entire X mechanism works on this sort of hardware.
We all have our strengths and weaknesses :)

Looking at your latest dmesg, the pointer has changed from 0 to another (probably invalid) value, 0xfe00ccc6

nouveau E[ PGRAPH][0000:02:00.0] ch 1 [0x0007cb0000 DRM] subc 3 class 0x502d mthd 0x0860 data 0xfe00ccc6

It would be great if you could do the following
* Note whether the value varies between reboots, or differs between kernel versions

example
        kernel version
boot    3.9.9     3.10.5
1       0xbla0    0xbla1
2       0xbla2    0xbla3
3       0xbla4    0xbla5

Attached is the same data from kernel 3.9.9-302.fc19.x86_64.

Created attachment 83330 [details]
Logs from 3.9.9-302.fc19.x86_64 kernel
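The per-kernel, per-boot comparison Emil asked for can be assembled from saved dmesg captures with a short script. The following is only a sketch under assumptions: the helper names, the "kernel-A"/"kernel-B" labels and the pairing of values to boots are placeholders, and the regular expression is modelled solely on the single PGRAPH line quoted in this bug, so other nouveau message layouts may not match.

```python
import re

# Modelled on the PGRAPH error quoted in this bug; captures method and data.
PGRAPH = re.compile(
    r"nouveau E\[\s*PGRAPH\].*mthd (0x[0-9a-fA-F]+) data (0x[0-9a-fA-F]+)"
)


def data_values(dmesg_text, method="0x0860"):
    """Return every 'data' value reported for the given method, in log order."""
    return [m.group(2) for m in PGRAPH.finditer(dmesg_text) if m.group(1) == method]


def tabulate(captures):
    """captures maps (kernel, boot) -> dmesg text; return one printable row each."""
    rows = []
    for (kernel, boot), text in sorted(captures.items()):
        values = data_values(text)
        rows.append(f"{kernel:<12} boot {boot}: {', '.join(values) or 'none'}")
    return rows


# The two data values appear in this thread; the kernel labels and boot
# numbers are placeholders, not the real pairing.
LINE = (
    "nouveau E[ PGRAPH][0000:02:00.0] ch 1 [0x0007cb0000 DRM] "
    "subc 3 class 0x502d mthd 0x0860 data {}"
)
captures = {
    ("kernel-A", 1): LINE.format("0x00000000"),
    ("kernel-B", 1): LINE.format("0xfe00ccc6"),
}

if __name__ == "__main__":
    for row in tabulate(captures):
        print(row)
```

In practice each dictionary value would be the text of one dmesg capture read from disk, keyed by the kernel version and the boot number it came from.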
Rick, there is no need to try the BUG_ON(!data) patch.

I was assuming that the data printed by nouveau indicates the CPU pointer (rather than one from the GPU's VM), silly me.

That said, the only thing that I can think of is that either the VM is not set up properly or data is getting overwritten somewhere :\

In either case, an mmiotrace of the blob would be great to have [1], although I'm not too sure when I'll have some time to look into it

Cheers
Emil

[1]
http://nouveau.freedesktop.org/wiki/MmioTrace/
https://wiki.ubuntu.com/X/MMIOTracing - FOR ACCELERATION BUGS

Created attachment 83383 [details]
mmio trace along with dmesg and Xorg logs
I did as you asked...rebooted in run level 3 with plymouth disabled. Started the mmio trace, then started X by using "telinit 5", tried to bring up the screen saver stuff, waited for it to lock up, waited until it restarted, then stopped the mmio trace and grabbed dmesg and the last two Xorg log files.
I hope that gives you more insight. I await your feedback.
Did you mmiotrace nouveau or the nvidia proprietary driver? The request was for the latter. [Although I don't really see how that will help... but I also haven't looked at all the details of this bug.]

It would be worth testing 3.11-rc7 on this (or 3.11 if it's out by the time you get to it).

I wonder if there's a bug somewhere handling the NVAA as NVA3+ (which it's not). But then NVAC would also be affected... or something in the ctxprogs... (I guess for which the mmiotrace could come in handy.)

Created attachment 106849 [details] [review]
Add some writes to 0x100c14

Does this patch help (applying it to recent kernel code would be better)?

Is this still an issue using kernel 3.19-rc4?

(In reply to Pierre Moreau from comment #24)
> Is this still an issue using kernel 3.19-rc4?

Questionable kernel version is not only,
Fedora 19 End of Life
https://lists.fedoraproject.org/pipermail/announce/2015-January/003248.html

(In reply to poma from comment #25)
> (In reply to Pierre Moreau from comment #24)
> > Is this still an issue using kernel 3.19-rc4?
>
> Questionable kernel version is not only,
> Fedora 19 End of Life
> https://lists.fedoraproject.org/pipermail/announce/2015-January/003248.html

The fact that he had F19 when he opened the bug report doesn't imply that he didn't upgrade to F20 or F21 since his last post. Besides, one can compile a kernel, or test using one of the live images here: https://nouveau.pmoreau.org.

To help you, I left him a note on users@lists.fedoraproject.org to respond here.

-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/xorg/driver/xf86-video-nouveau/issues/48.
Created attachment 83002 [details]
Xorg.0.log with modeset enabled and output of dmesg while problem is active

When using nouveau with modeset, the system starts up fine but eventually (possibly when the screen saver kicks in), X consumes all CPU resources on one core, rendering the machine useless except via ssh sessions.

If I disable modeset at boot time (nouveau.modeset=0), then the system is stable but the driver does not accept the EDID from the monitor and I'm stuck with a 1024x768 display instead of the 1600x900 it has with modeset enabled.

I'm attaching the Xorg.0.log file along with an excerpt from the output of dmesg when this behavior is occurring.