Created attachment 76094 [details] dmesg output
Sorry, accidentally pressed the submit button. Detailed description will follow in a second.
Created attachment 76095 [details] Screenshot of display corruption
Created attachment 76096 [details] Different screenshot of display corruption
Created attachment 76097 [details] PDF export from libreoffice corrupts screen AND output file
Unaffected: Kernel 3.2.x and before
Affected: probably everything afterwards, at least 3.5.x to 3.8.2
Card: NVc1 (GF108)

When doing graphics-intensive work, video corruption occurs that later leads to a kernel crash. This bug is easy to reproduce for me, but it's hard to provide a minimal test case.

How to trigger: Using the "Export to PDF" button in libreoffice may cause video corruption once the export is done (see attached screenshot). Occasionally, the resulting PDF is corrupted as well (see attached PDF). Likewise, watching youtube videos or flicking through online magazines (http://issuu.com/gimpmagazine/docs/issue3) will first create 1px wide horizontal lines, later larger distorted structures, and eventually crash the kernel. Also, using ardour3 (quite some heavy GTK canvas action) will cause artefacts and then crash the kernel within 40 minutes.

Kernel backtraces range from "page table corruption" to various code paths in almost any subsystem (NFS, ext4, RCU, whatever). Sometimes, individual userspace processes crash or hang in syscalls.

I ran memtest for two days without problems. As said before, everything is fine on 3.2.x, too. Looks like random memory access from/via nouveau to me. If need be, I can grab a radeon card to support/falsify this hypothesis.
(In reply to comment #5)
> Unaffected: Kernel 3.2.x and before

Correction: 3.2.x is affected, too. On earlier 3.2.x kernels (sorry for the lack of precision), it takes weeks to trigger the bug, but with the recent update in Debian unstable (3.2.42-2), it crashes as fast as hand-rolled 3.8.x (and basically every version in between).

For the sake of reproducibility, is there a tool that mimics the behaviour of a GTK canvas? Or basically something that allocates video memory from the X server and then moves this area around?

I've noticed that especially hover effects (images that change on onMouseOver) in chromium are among the first elements to show screen corruption. My hypothesis is that these elements are already rendered off-screen and then moved to the visible area. During this process (off-screen rendering or moving), something goes wrong. And maybe this something has the potential to write to arbitrary physical addresses.
(In reply to comment #6)
> I've noticed that especially hover effects (images that change on
> onMouseOver) in chromium are among the first elements to show screen
> corruption.

I guess we can rule out the chromium hover corruption as a symptom of this bug: it's a chromium bug which is present on other video drivers (radeon, NVIDIA blob), too: https://code.google.com/p/chromium/issues/detail?id=227447

However, there is still some correlation with the video driver: if I disable the flashplugin in chromium or stop using chromium altogether, the kernel corruption occurs *much* later. So the original hypothesis still stands that some apps (ardour3, maybe also libreoffice) can write to arbitrary physical memory on my nouveau-based system.

A short test with the NVIDIA blob did not exhibit this issue, but I'd need to run for several days to be sure.

Maybe a power management issue? (memory timing?)
> A short test with the NVIDIA blob did not exhibit this issue, but I'd need
> to run for several days to be sure.

Unlike nouveau, the NVIDIA driver indeed does not crash the system.

More importantly, disabling the flash plugin in chromium leads to increased uptime before the crash happens, thus indicating a negative correlation between video memory activity and the time to crash: as said before, the more you run ardour3 (or chromium with flash), the earlier your (here: my) system will crash.
Running ardour3 in Xephyr on top of nouveau seems to prevent this bug from happening. I had Xephyr -dumb and ran ardour3 for three hours, something that has never been possible before on this machine (kernel crashes after 2-45mins). It was also possible to run it without the -dumb flag, but I only tested it for 20 minutes, so that's not long enough for a definite answer.
(In reply to comment #9)
> I had Xephyr -dumb and ran ardour3 for three hours, something that has never
> been possible before on this machine (kernel crashes after 2-45mins).
>
> It was also possible to run it without the -dumb flag, but I only tested it
> for 20 minutes, so that's not long enough for a definite answer.

Xephyr (without -dumb) has been successfully papering over the problem for days now.
Try configuring your kernel with SLUB, and boot with slub_debug=FZPU. See if you get any allocation/usage-related backtraces in dmesg. (This will often prevent crashes from happening, so check dmesg every so often.) Even if you don't, it may be instructive to see dmesg dumps after a crash occurs even if it appears to be due to "random memory corruption".

Also, if it got worse in 3.2.42, what was the last "not as bad" version? It may be the case that there are two separate issues going on here.
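
For anyone following the slub_debug suggestion above, here is a rough Python sketch of how "check dmesg every so often" could be automated. It is not part of any kernel tooling. The function names are hypothetical, and the regular expression assumes that SLUB reports begin with a header line of the form "BUG <cache name> (<taint state>): <message>", which may differ between kernel versions. The meanings given for the F, Z, P and U letters follow the kernel's SLUB documentation.

```python
import re

# Meaning of the slub_debug letters used in this thread (F, Z, P, U),
# as described in the kernel's SLUB documentation.
SLUB_FLAGS = {
    "F": "sanity checks",
    "Z": "red zoning",
    "P": "poisoning",
    "U": "user tracking (alloc/free owner)",
}

# Assumption: a SLUB report starts with a header line such as
#   BUG kmalloc-64 (Not tainted): Redzone overwritten
# Generic oops lines ("BUG: unable to handle ...") do not match.
SLUB_BUG_RE = re.compile(r"BUG (\S+) \(([^)]*)\): (.+)")


def parse_slub_debug(cmdline):
    """Return the slub_debug flag letters from a kernel command line.

    Returns None if slub_debug=... is not present. An optional
    ",<slab names>" suffix is dropped.
    """
    for token in cmdline.split():
        if token.startswith("slub_debug="):
            value = token.split("=", 1)[1]
            return value.split(",", 1)[0]
    return None


def find_slub_reports(dmesg_text):
    """Return (cache name, message) pairs for SLUB report headers in dmesg text."""
    reports = []
    for line in dmesg_text.splitlines():
        match = SLUB_BUG_RE.search(line)
        if match:
            reports.append((match.group(1), match.group(3).strip()))
    return reports
```

One way to use it: pass the contents of /proc/cmdline to parse_slub_debug() to confirm the option took effect, then feed saved dmesg output to find_slub_reports() and attach any hits, together with the surrounding lines, to this bug.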
No response to question in over a year. If things are still broken with the latest software, feel free to reopen (and provide the requested info).