Bug 61953 - arbitrary memory access corrupts kernel memory, eventually crashing the kernel
Summary: arbitrary memory access corrupts kernel memory, eventually crashing the kernel
Status: RESOLVED INVALID
Alias: None
Product: xorg
Classification: Unclassified
Component: Driver/nouveau
Version: unspecified
Hardware: x86-64 (AMD64) Linux (All)
Importance: medium major
Assignee: Nouveau Project
QA Contact: Xorg Project Team
Reported: 2013-03-07 09:39 UTC by Adrian Knoth
Modified: 2014-08-23 14:54 UTC

Attachments
dmesg output (50.65 KB, text/plain)
2013-03-07 09:39 UTC, Adrian Knoth
Screenshot of display corruption (563.78 KB, image/png)
2013-03-07 09:44 UTC, Adrian Knoth
Different screenshot of display corruption (376.61 KB, image/png)
2013-03-07 09:46 UTC, Adrian Knoth
PDF export from libreoffice corrupts screen AND output file (1.46 MB, application/octet-stream)
2013-03-07 09:47 UTC, Adrian Knoth

Description Adrian Knoth 2013-03-07 09:39:52 UTC
Created attachment 76094
dmesg output
Comment 1 Adrian Knoth 2013-03-07 09:40:42 UTC
Sorry, accidentally pressed the submit button. Detailed description will follow in a second.
Comment 2 Adrian Knoth 2013-03-07 09:44:48 UTC
Created attachment 76095
Screenshot of display corruption
Comment 3 Adrian Knoth 2013-03-07 09:46:18 UTC
Created attachment 76096
Different screenshot of display corruption
Comment 4 Adrian Knoth 2013-03-07 09:47:07 UTC
Created attachment 76097
PDF export from libreoffice corrupts screen AND output file
Comment 5 Adrian Knoth 2013-03-07 11:46:41 UTC
Unaffected: Kernel 3.2.x and before
Affected: probably everything afterwards, at least 3.5.x to 3.8.2
Card: NVc1 (GF108)

When doing graphics-intensive work, video corruption occurs that later leads to a kernel crash. This bug is easy to reproduce for me, but it's hard to provide a minimal test case.

How to trigger:
Using the "Export to PDF" button in LibreOffice may cause video corruption once the export is done (see attached screenshot). Occasionally, the resulting PDF is also corrupted (see attached PDF).

Likewise, watching YouTube videos or flicking through online magazines (http://issuu.com/gimpmagazine/docs/issue3) first creates 1px-wide horizontal lines, then larger distorted structures, and eventually crashes the kernel.

Also, using ardour3 (quite some heavy GTK canvas action) will cause artefacts and then crash the kernel within 40 minutes.

Kernel backtraces range from "page table corruption" to various code paths in almost any subsystem (NFS, ext4, RCU, whatever). Sometimes, individual userspace processes crash or hang in syscalls.
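
To compare saved dmesg dumps from several crashes, the messages could be tallied by type. The following is only a minimal sketch: the function name and the signature list are assumptions for illustration (a few kernel messages commonly associated with memory corruption), not an exhaustive or authoritative set.

```python
import re
from collections import Counter

# Assumed signatures of corruption-related kernel messages.
# This list is illustrative and should be extended to match the actual logs.
SIGNATURES = {
    "corrupted page table": re.compile(r"Corrupted page table"),
    "bad page map/state": re.compile(r"BUG: Bad page (map|state)"),
    "paging request": re.compile(r"unable to handle kernel paging request"),
    "general protection fault": re.compile(r"general protection fault"),
}


def classify_dmesg(text):
    """Count how many lines of a dmesg dump match each signature."""
    counts = Counter()
    for line in text.splitlines():
        for name, pattern in SIGNATURES.items():
            if pattern.search(line):
                counts[name] += 1
    return counts
```

Running `classify_dmesg(open("dmesg.txt").read())` on each saved dump (the file name is hypothetical) gives a rough picture of whether the failures cluster in one area or are spread across unrelated subsystems, which is what random memory corruption would look like.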

I ran memtest for two days without problems. As said before, everything is fine on 3.2.x, too.

Looks like random memory access from/via nouveau to me. If need be, I can grab a radeon card to support/falsify this hypothesis.
Comment 6 Adrian Knoth 2013-03-29 20:12:08 UTC
(In reply to comment #5)
> Unaffected: Kernel 3.2.x and before

Correction, 3.2.x is affected, too.

On earlier 3.2.x kernels (sorry for the lack of precision), it takes weeks to trigger the bug, but with the recent update in Debian unstable (3.2.42-2), it crashes as fast as hand-rolled 3.8.x (and basically every version in between).


For the sake of reproducibility, is there a tool that mimics the behaviour of a GTK canvas? Or basically something that allocates video memory from the X server and then moves this area around?

I've noticed that especially hover effects (images that change on onMouseOver) in chromium are among the first elements to show screen corruption. My hypothesis is that these elements are already rendered off-screen and then moved to the visible area. During this process (off-screen rendering or moving), something goes wrong.

And maybe this something has the potential to write to arbitrary physical addresses.
Comment 7 Adrian Knoth 2013-04-08 20:39:25 UTC
(In reply to comment #6)

> I've noticed that especially hover effects (images that change on
> onMouseOver) in chromium are among the first elements to show screen
> corruption.

I guess we can rule out the chromium hover corruption: it's a chromium bug that is also present with other video drivers (radeon, NVIDIA blob):


    https://code.google.com/p/chromium/issues/detail?id=227447


However, there is still some correlation with the video driver: if I disable the flashplugin in chromium or stop using chromium altogether, the kernel corruption occurs *much* later. So the original hypothesis still stands that some apps (ardour3, maybe also libreoffice) can write to arbitrary physical memory on my nouveau-based system.

A short test with the NVIDIA blob did not exhibit this issue, but I'd need to run for several days to be sure.

Maybe a power management issue? (memory timing?)
Comment 8 Adrian Knoth 2013-04-12 10:59:04 UTC
> A short test with the NVIDIA blob did not exhibit this issue, but I'd need
> to run for several days to be sure.

Unlike nouveau, the NVIDIA driver indeed does not crash the system.

More importantly, disabling the flash plugin in chromium leads to increased uptime before the crash happens, thus indicating a negative correlation between video memory activity and the time to crash: as said before, the more you run ardour3 (or chromium with flash), the earlier your (here: my) system will crash.
Comment 9 Adrian Knoth 2013-04-18 22:48:55 UTC
Running ardour3 in Xephyr on top of nouveau seems to prevent this bug from happening.

I started Xephyr with -dumb and ran ardour3 for three hours, something that has never been possible before on this machine (the kernel usually crashes after 2-45 minutes).

It was also possible to run it without the -dumb flag, but I only tested it for 20 minutes, so that's not long enough for a definite answer.
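
For anyone wanting to repeat this test, the nested setup could be scripted roughly as follows. This is a sketch under assumptions: the display number `:1`, the screen size and the helper names are made up for illustration, and the fixed two-second wait is a crude stand-in for checking that the nested server is ready.

```python
import os
import subprocess
import time


def nested_session(app, display=":1", dumb=True, screen="1280x800"):
    """Build the Xephyr command line and the environment for the nested app."""
    server_cmd = ["Xephyr", display, "-screen", screen]
    if dumb:
        # -dumb is the flag used in the test described in this comment.
        server_cmd.append("-dumb")
    # The application must connect to the nested display, not the real one.
    app_env = dict(os.environ, DISPLAY=display)
    return server_cmd, [app], app_env


def run_nested(app, **kwargs):
    """Start Xephyr, run the application inside it, then stop the server."""
    server_cmd, app_cmd, app_env = nested_session(app, **kwargs)
    server = subprocess.Popen(server_cmd)
    time.sleep(2)  # crude wait until the nested server accepts clients
    try:
        return subprocess.call(app_cmd, env=app_env)
    finally:
        server.terminate()
```

Calling `run_nested("ardour3")` would correspond to the -dumb test, and `run_nested("ardour3", dumb=False)` to the run without the flag.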
Comment 10 Adrian Knoth 2013-04-27 10:33:42 UTC
(In reply to comment #9)

> I had Xephyr -dumb and ran ardour3 for three hours, something that has never
> been possible before on this machine (kernel crashes after 2-45mins).
> 
> It was also possible to run it without the -dumb flag, but I only tested it
> for 20 minutes, so that's not long enough for a definite answer.

Xephyr (without -dumb) has been successfully papering over the problem for days now.
Comment 11 Ilia Mirkin 2013-05-21 03:20:05 UTC
Try configuring your kernel with SLUB, and boot with slub_debug=FZPU. See if you get any allocation/usage-related backtraces in dmesg. (This will often prevent crashes from happening, so check dmesg every so often.) Even if you don't, it may be instructive to see dmesg dumps after a crash occurs even if it appears to be due to "random memory corruption".
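
A rough way to automate the "check dmesg every so often" part is sketched below. The helper names are hypothetical, and the report pattern is an assumption based on the "BUG <cache>: <what went wrong>" header that SLUB debugging prints; adjust it if the actual output differs.

```python
import re

# Meaning of the slub_debug flag letters used above.
SLUB_FLAGS = {
    "F": "sanity checks",
    "Z": "red zoning",
    "P": "poisoning",
    "U": "user tracking",
}

# Assumed header of a SLUB debug report: "BUG <cache> ...: <what went wrong>".
SLUB_REPORT = re.compile(r"BUG (\S+)[^:]*: (.+)")


def slub_debug_flags(cmdline):
    """Return the slub_debug flag letters found on a kernel command line."""
    for token in cmdline.split():
        if token.startswith("slub_debug="):
            value = token.split("=", 1)[1]
            # Anything after a comma restricts debugging to named caches.
            return set(value.split(",", 1)[0])
    return set()


def slub_reports(dmesg_text):
    """Return (cache, problem) pairs for lines that look like SLUB reports."""
    reports = []
    for line in dmesg_text.splitlines():
        match = SLUB_REPORT.search(line)
        if match:
            reports.append((match.group(1), match.group(2).strip()))
    return reports


def running_kernel_flags():
    """Read the flags the running kernel was booted with."""
    with open("/proc/cmdline") as f:
        return slub_debug_flags(f.read())
```

`running_kernel_flags()` confirms that the boot option actually took effect, and `slub_reports()` can be run periodically over the captured dmesg text to spot allocation-related reports before a crash.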

Also, if it got worse in 3.2.42, what was the last "not as bad" version? It may be the case that there are two separate issues going on here.
Comment 12 Ilia Mirkin 2014-08-23 14:54:41 UTC
No response to question in over a year. If things are still broken with the latest software, feel free to reopen (and provide the requested info).

