Bug 67209 - Image decoding corruption and linux kernel crash with hardware reset
Status: RESOLVED WORKSFORME
Product: xorg
Classification: Unclassified
Component: Driver/intel
Version: unspecified
Hardware: x86 (IA32) Linux (All)
Importance: medium normal
Assignee: Chris Wilson
QA Contact: Intel GFX Bugs mailing list
Reported: 2013-07-23 08:15 UTC by Carsten Mattner
Modified: 2014-12-26 11:21 UTC

Attachments
Xorg.0.log (41.25 KB, text/plain)
2013-08-25 18:06 UTC, Dâniel Fraga
Firefox ao2e-index.jpg (452.57 KB, image/png)
2013-10-01 14:30 UTC, Carsten Mattner

Description Carsten Mattner 2013-07-23 08:15:16 UTC
Related bugs filed earlier:
https://bugs.archlinux.org/task/36105
https://bugzilla.mozilla.org/show_bug.cgi?id=892567

Steps to reproduce:

Open a JPEG like http://blather.michaelwlucas.com/wp-content/uploads/2013/07/ao2e-index.jpg. The image is first decoded correctly, and once decoding is done the bottom half of the image gets corrupted. In one test the corrupted part displayed small snapshots of the Firefox window. When I clicked reload twice, the kernel crashed and reset the machine. The corruption happens with Firefox 22 to 25. With 24 and 25 I was able to reliably make it crash, while at least 22 doesn't crash the kernel.

A workaround is to either set Firefox's gfx.xrender.enabled to false or disable SNA in /etc/X11/xorg.conf.d.
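For reference, a minimal sketch of the second workaround. It assumes a drop-in file such as /etc/X11/xorg.conf.d/20-intel.conf; the file name and the Identifier string are placeholders chosen for illustration. The AccelMethod option switches the intel driver from SNA back to UXA:

```
# /etc/X11/xorg.conf.d/20-intel.conf (file name is an assumption)
Section "Device"
    Identifier "Intel Graphics"      # arbitrary label
    Driver     "intel"
    Option     "AccelMethod" "uxa"   # use UXA instead of SNA
EndSection
```

The first workaround needs no file: gfx.xrender.enabled can be toggled in Firefox's about:config.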

SNA worked without corrupted images or any other issues until a couple of months ago, so the bug must have slipped into one of the kernel or xf86-video-intel releases of the last 3 months or so.

Also of interest: with SNA enabled, Gtk+ 2.x widgets sporadically draw the inner part of button or scroll bar rectangles corrupted, presumably because of the same bug, but that never caused an instant crash with hardware reset.

This looks like a critical memory corruption bug in the xorg or kernel code.

Software and hardware info:
* Core2Duo in Mac Mini 2,1 running in 32bit mode
* linux 3.10.2
* xf86-video-intel 2.21.11
* mesa 9.1.5
* Intel 945GM

Actual results:

Corrupted image and kernel crash with reset upon hitting reload button.


Expected results:

Image should be decoded and displayed correctly without resetting the machine.


From Xorg.log with SNA enabled:
[    73.948] (==) Depth 24 pixmap format is 32 bpp
[    73.948] (II) intel(0): SNA initialized with Alviso (gen3) backend
[    73.948] (==) intel(0): Backing store disabled
[    73.948] (==) intel(0): Silken mouse enabled
[    73.948] (II) intel(0): HW Cursor enabled
[    73.948] (II) intel(0): RandR 1.2 enabled, ignore the following RandR disabled message.
[    73.948] (==) intel(0): DPMS enabled
[    73.948] (II) intel(0): [XvMC] i915_xvmc driver initialized.
[    73.948] (II) intel(0): [DRI2] Setup complete
[    73.948] (II) intel(0): [DRI2]   DRI driver: i915
[    73.948] (II) intel(0): direct rendering: DRI2 Enabled
[    73.948] (==) intel(0): hotplug detection: "enabled"
[    73.949] (--) RandR disabled
[    73.968] (II) AIGLX: enabled GLX_MESA_copy_sub_buffer
[    73.968] (II) AIGLX: enabled GLX_INTEL_swap_event
[    73.968] (II) AIGLX: enabled GLX_ARB_create_context
[    73.968] (II) AIGLX: enabled GLX_ARB_create_context_profile
[    73.968] (II) AIGLX: enabled GLX_EXT_create_context_es2_profile
[    73.968] (II) AIGLX: enabled GLX_SGI_swap_control and GLX_MESA_swap_control
[    73.968] (II) AIGLX: GLX_EXT_texture_from_pixmap backed by buffer objects
[    73.969] (II) AIGLX: Loaded and initialized i915
[    73.969] (II) GLX: Initialized DRI2 GL provider for screen 0

From Xorg.log without SNA enabled:
[  4731.653] (==) Depth 24 pixmap format is 32 bpp
[  4731.653] (II) intel(0): [DRI2] Setup complete
[  4731.653] (II) intel(0): [DRI2]   DRI driver: i915
[  4731.653] (II) UXA(0): Driver registered support for the following operations:
[  4731.653] (II)         solid
[  4731.653] (II)         copy
[  4731.653] (II)         composite (RENDER acceleration)
[  4731.654] (II)         put_image
[  4731.654] (II)         get_image
[  4731.654] (==) intel(0): Backing store disabled
[  4731.654] (==) intel(0): Silken mouse enabled
[  4731.654] (II) intel(0): Initializing HW Cursor
[  4731.654] (II) intel(0): RandR 1.2 enabled, ignore the following RandR disabled message.
[  4731.654] (==) intel(0): DPMS enabled
[  4731.654] (==) intel(0): Intel XvMC decoder disabled
[  4731.654] (II) intel(0): Set up textured video
[  4731.654] (II) intel(0): Set up overlay video
[  4731.654] (II) intel(0): direct rendering: DRI2 Enabled
[  4731.654] (==) intel(0): hotplug detection: "enabled"
[  4731.683] (--) RandR disabled
[  4731.703] (II) AIGLX: enabled GLX_MESA_copy_sub_buffer
[  4731.703] (II) AIGLX: enabled GLX_INTEL_swap_event
[  4731.703] (II) AIGLX: enabled GLX_ARB_create_context
[  4731.703] (II) AIGLX: enabled GLX_ARB_create_context_profile
[  4731.703] (II) AIGLX: enabled GLX_EXT_create_context_es2_profile
[  4731.703] (II) AIGLX: enabled GLX_SGI_swap_control and GLX_MESA_swap_control
[  4731.703] (II) AIGLX: GLX_EXT_texture_from_pixmap backed by buffer objects
[  4731.703] (II) AIGLX: Loaded and initialized i915
[  4731.703] (II) GLX: Initialized DRI2 GL provider for screen 0
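
The two excerpts differ mainly in the line that names the acceleration backend. The following is a minimal sketch, not part of the original report, for checking which backend an Xorg log shows after applying the workaround. It assumes the log wording matches the excerpts above (other driver versions may phrase these lines differently), and the name detect_accel is made up for illustration.

```python
import re

# Patterns copied from the two log excerpts in this report.
# Assumption: other driver versions may word these lines differently.
SNA_LINE = re.compile(r"\(II\) intel\(\d+\): SNA initialized with (.+) backend")
UXA_LINE = re.compile(r"\(II\) UXA\(\d+\): Driver registered support")


def detect_accel(log_text):
    """Return (method, backend) for the first matching line in an Xorg log.

    method is "SNA" or "UXA"; backend is the name SNA reports, else None.
    Returns (None, None) when neither line is present.
    """
    for line in log_text.splitlines():
        match = SNA_LINE.search(line)
        if match:
            return "SNA", match.group(1)
        if UXA_LINE.search(line):
            return "UXA", None
    return None, None
```

Typical use would be to read /var/log/Xorg.0.log into a string and pass it to detect_accel.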
Comment 1 Daniel Vetter 2013-07-23 08:20:40 UTC
Since you're saying that things worked roughly 3 months ago, can you please test whether going back to an old kernel or an old ddx restores correct behaviour? Just so we know in which component the bug is.
Comment 2 Chris Wilson 2013-07-23 08:43:37 UTC
And we really do need the log files from the crashes (/var/log/messages hopefully has the kernel information, along with /var/log/Xorg.0.log[.old] containing any crash information). Note that linux-3.10.2 (and kernels post-3.7) has a known image corruption bug exposed by SNA.
Comment 3 Carsten Mattner 2013-07-23 11:14:34 UTC
(In reply to comment #1)
> Since you're saying that things worked roughly 3 months ago can you please
> test whether going back to an old kernel or an old ddx restores correct
> behaviour? Just so we know in which component the bug is.

I don't know if I can do that but I'll see what's possible.
Comment 4 Carsten Mattner 2013-07-23 11:20:35 UTC
(In reply to comment #2)
> And we really do need the log files from the crashes (/var/log/messages
> hopefully has the kernel information, along with /var/log/Xorg.0.log[.old]

Sadly there's no log at all because the crash is an instant hardware
reset of the machine when that 3.1MB jpeg is reloaded a couple times.

> containing any crash information). Note that linux-3.10.2 (and kernels
> post-3.7) has a known image corruption bug exposed by SNA.

I might be able to try a 3.6 kernel but I'm not sure I can try an older
ddx. Something must have changed in Firefox post 22 which makes the
corruption reset the machine quicker or more reliably. With Firefox 22
a short test of reloading that jpeg multiple times only corrupted the
image without a reset.
Comment 5 Chris Wilson 2013-07-23 11:22:17 UTC
What does the corruption look like? Is it capturable in a screenshot, or only on the display i.e. a photograph?
Comment 6 Carsten Mattner 2013-07-23 11:28:20 UTC
(In reply to comment #5)
> What does the corruption look like? Is it capturable in a screenshot, or
> only on the display i.e. a photograph?

The Gtk2 corruption with Raleigh theme engine manifests itself as scrollbars
or buttons having captcha like random noise inside the inner rectangle
where Raleigh would just display a single color. It looks like a random
pattern. It was sporadic and didn't stay for long when the gtk widget
was redrawn AFAIR.

The jpeg Firefox corruption looked similarly random most of the time but
in one or two cases I've seen it display multiple boxes with a snapshot
of Firefox's window in a tiled layout.

Is there no way to capture logs that indicate what led to the corruption
by setting an environment variable?
Comment 7 Chris Wilson 2013-08-11 11:51:30 UTC
No, we have no way of tracing back to find a specific instance of corruption: the tracing is all or nothing, and it is voluminous.

The latter corruption in jpeg images sounds like it should be fixed by the read-write bug fixes in 3.10.5. That might also explain the corruption elsewhere...
Comment 8 Carsten Mattner 2013-08-12 10:45:02 UTC
(In reply to comment #7)
> The latter corruption in jpeg images sounds like it should be fixed by the
> read-write bug fixes in 3.10.5. That might also explain the corruption
> elsewhere...

Same problem with linux 3.10.6, xf86-video-intel 2.21.14 and mesa 9.1.6.
I didn't try to make it reset the machine after confirming the corrupted
jpeg image.
Comment 9 Chris Wilson 2013-08-12 10:49:04 UTC
I still cannot see where you recorded the kernel crash? Do you still have the log messages for that?
Comment 10 Carsten Mattner 2013-08-12 13:57:34 UTC
(In reply to comment #9)
> I still cannot see where you recorded the kernel crash? Do you still have
> the log messages for that?

I don't have log messages from the crash because when it crashes the
machine hard resets instantly.
Comment 11 Dâniel Fraga 2013-08-24 01:38:22 UTC
This bug happens here too with Firefox 23.

I use Linux 3.10.0 (x86-64) with latest git intel driver, xorg 1.14.2, Mesa 8.0.5.

What's interesting is that it only happens if I compile Firefox 23 with -mno-avx (gcc 4.8.2). If I don't use -mno-avx, the jpeg displays correctly, but Firefox gets extremely slow.

If you need some testing, just ask.
Comment 12 Chris Wilson 2013-08-25 13:05:21 UTC
(In reply to comment #11)
> This bug happens here too with Firefox 23.
> 
> I use Linux 3.10.0 (x86-64) with latest git intel driver, xorg 1.14.2, Mesa
> 8.0.5.
> 
> What's interesting is that it only happens if I compile Firefox 23 with
> -mno-avx (gcc 4.8.2). If I don't use -mno-avx, the jpeg displays correctly,
> but Firefox gets extremely slow.
> 
> If you need some testing, just ask.

You didn't mention your hardware, so please add an Xorg.0.log. Since you have AVX, I think you have an entirely different bug? Do you see the same hard lockup? Do you have any error messages?
Comment 13 Dâniel Fraga 2013-08-25 18:05:48 UTC
(In reply to comment #12)

> You didn't mention your hardware, so please add an Xorg.0.log. Since you
> have AVX, I think you have an entirely different bug? Do you see the same
> hard lockup? Do you have any error messages?

Ok, I attached my Xorg.0.log. My hardware:

Core i7 2700k using the internal GPU

I don't have any hard lockup, but the jpeg image goes blank (white) no matter how many times I try to reload it.

Regarding AVX, I don't know if it is a different bug, but the results are the same.
Comment 14 Dâniel Fraga 2013-08-25 18:06:51 UTC
Created attachment 84611 [details]
Xorg.0.log

The requested Xorg.0.log.
Comment 15 Chris Wilson 2013-08-26 14:10:16 UTC
Dâniel, I think you have a bug in your compilation of firefox with avx. Either some handwritten assembly is incorrect or gcc's code generation is wrong. We could trace the X drawing commands, but I don't think that is where the bug lies.
Comment 16 Dâniel Fraga 2013-08-26 20:23:39 UTC
(In reply to comment #15)
> Dâniel, I think you have a bug in your compilation of firefox with avx.
> Either some handwritten assembly is incorrect or gcc's code generation is
> wrong, we could trace the X drawing commands, but I don't think that is
> where the bug lies.

Yes Chris, I reported it some time ago:

https://bugzilla.mozilla.org/show_bug.cgi?id=864610

I think they introduced this bug in Firefox 20 (it never happened up to version 19). Anyway, thank you.
Comment 17 zless 2013-09-24 05:13:16 UTC
No crash here with:

- Firefox 24
- Intel Corporation 2nd Generation Core Processor Family Integrated Graphics Controller (rev 09)
- Linux 3.11.1
- xf86-video-intel 2.21.15 with SNA
Comment 18 Carsten Mattner 2013-09-24 08:05:25 UTC
(In reply to comment #17)
> No crash here with:
> 
> - Firefox 24
> - Intel Corporation 2nd Generation Core Processor Family Integrated Graphics
> Controller (rev 09)

Is this the same chip as 945GM?

> - Linux 3.11.1
> - xf86-video-intel 2.21.15 with SNA

I still see the corruption but didn't try to make it crash for obvious reasons.

Did you or do you at least see the corruption in the image's lower half?
I do and have to explicitly enable UXA. AFAICT a stable kernel from April 2013
must have been one of the last where it was all correct.
Comment 19 Carsten Mattner 2013-09-24 08:06:10 UTC
(In reply to comment #18)
> (In reply to comment #17)
> > No crash here with:
> > 
> > - Firefox 24
> > - Intel Corporation 2nd Generation Core Processor Family Integrated Graphics
> > Controller (rev 09)
> 
> Is this the same chip as 945GM?
> 
> > - Linux 3.11.1
> > - xf86-video-intel 2.21.15 with SNA
> 
> I still see the corruption but didn't try to make it crash for obvious
> reasons.

Kernel and DDX are the same version here.

> Did you or do you at least see the corruption in the image's lower half?
> I do and have to explicitly enable UXA. AFAICT a stable kernel from April
> 2013 must have been one of the last where it was all correct.
Comment 20 Chris Wilson 2013-09-24 08:35:42 UTC
We still haven't had sufficient information to be able to even guess at what the problem might be. A screenshot or photograph of the corruption would be a good first step, if you cannot capture the error messages from the lockup.
Comment 21 Carsten Mattner 2013-09-30 21:07:54 UTC
Per https://bugs.archlinux.org/task/36105#comment114630 I built a version of
Firefox 24.0 that links against more system libraries, and the jpeg corruption
hasn't happened so far. Before that I was using the ftp.mozilla.org binaries,
and those reproduce the corruption reliably.

mozconfig:
. $topsrcdir/browser/config/mozconfig
ac_add_options --enable-official-branding
ac_add_options --with-system-jpeg
ac_add_options --with-system-zlib
ac_add_options --with-system-bz2
ac_add_options --with-system-png
ac_add_options --with-system-libevent
ac_add_options --enable-system-sqlite
ac_add_options --enable-system-cairo
ac_add_options --enable-system-pixman
ac_add_options --disable-tests
ac_add_options --disable-crashreporter
ac_add_options --disable-updater
ac_add_options --disable-installer
mk_add_options PROFILE_GEN_SCRIPT='EXTRA_TEST_ARGS=10 $(MAKE) -C $(MOZ_OBJDIR) pgo-profile-run'

This is a non PGO build due to memory limitations on this machine but I copied
that PGO line from ArchLinux's mozconfig as found.

package versions:
cairo 1.12.16-1
pixman 0.30.2-1
libjpeg-turbo 1.3.0-2
libpng 1.6.5-1
zlib 1.2.8-1
bzip2 1.0.6-4
libevent 2.0.21-2
sqlite 3.8.0.2-1

This is how ArchLinux's Firefox package is built with the exception of
--enable-system-cairo:
https://projects.archlinux.org/svntogit/packages.git/tree/trunk?h=packages/firefox
Comment 22 Carsten Mattner 2013-10-01 06:44:36 UTC
The custom Firefox build doesn't provoke the corruption but as suspected by the Intel devs there are corruption issues just hiding. I don't know if it's related but I've seen a Qt3 application draw text destined for its statusbar outside of the window and right into the X root or over other windows managed by the wm.
Comment 23 Carsten Mattner 2013-10-01 06:45:44 UTC
(In reply to comment #22)
> The custom Firefox build doesn't provoke the corruption but as suspected by
> the Intel devs there are corruption issues just hiding. I don't know if it's
> related but I've seen a Qt3 application draw text destined for its statusbar
> outside of the window and right into the X root or over other windows
> managed by the wm.

I'll keep an eye out for this after enabling UXA.
Comment 24 Carsten Mattner 2013-10-01 14:30:08 UTC
Created attachment 86908 [details]
Firefox ao2e-index.jpg
Comment 25 Carsten Mattner 2013-10-01 14:32:06 UTC
(In reply to comment #20)
> We still haven't had sufficient information to be able to even guess at what
> the problem might be. A screenshot or photograph of the corruption would be
> a good first step, if you cannot capture the error messages from the lockup.

Attached a screenshot of what it usually looks like. On reload of the page the
bottom half of the picture changes to another random pattern and as said before
sometimes to 4 little screenshots of the Firefox user interface.
Comment 26 Carsten Mattner 2013-10-27 17:19:02 UTC
News?
Comment 27 Chris Wilson 2013-11-08 12:22:43 UTC
The image corruption should be fixed in xf86-video-intel.git (I believe

commit 8f6e227ba8127a2ca034271f2a660c24abbe056f
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Mon Nov 4 12:57:01 2013 +0000

    sna: Apply the BLT source offset for individual copies

is the right fix amongst several related fixes.)

But we have no information about the kernel crash, so we can't begin a diagnosis, but must just hope that it gets randomly fixed... I suspect it may not have been a kernel crash, but a page-fault-of-doom which should also have been mitigated recently.
Comment 28 Carsten Mattner 2013-11-09 19:22:53 UTC
(In reply to comment #27)
> The image corruption should be fixed in xf86-video-intel.git (I believe

Thanks Chris

> commit 8f6e227ba8127a2ca034271f2a660c24abbe056f
> Author: Chris Wilson <chris@chris-wilson.co.uk>
> Date:   Mon Nov 4 12:57:01 2013 +0000
> 
>     sna: Apply the BLT source offset for individual copies
> 
> is the right fix amongst several related fixes.)

Will see if I can try this before the next release but do you know
when the next xf86-video-intel release will be?

> But we have no information about the kernel crash, so we can't begin a
> diagnosis, but must just hope that it gets randomly fixed... I suspect it
> may not have been a kernel crash, but a page-fault-of-doom which should also
> have been mitigated recently.

I was once able to easily make it hard reset (crash) the machine by reloading
the corrupted image a couple times. So I believe it's at least triggered
by the big jpeg image corruption quickly.

Is it the right idea to wait and test for the image corruption fix first
and try to reproduce the Qt3 out-of-window corrupted text drawing SNA bug
next or would you say that's something else and needs a new bug?
Comment 29 Carsten Mattner 2013-11-18 17:03:34 UTC
Image decoding bug looks fixed in xf86-video-intel-2.99.906 but I still have to check the Qt3 out-of-window text drawing/corruption.
Comment 30 Carsten Mattner 2013-11-18 19:59:11 UTC
With SNA enabled I once again see an SDVOB kernel message: [drm] Setting output timings on SDVOB failed. Is this ok to ignore?
Comment 31 Chris Wilson 2013-12-30 13:42:00 UTC
(In reply to comment #30)
> With SNA enabled I once again see an SDVOB kernel message: [drm] Setting
> output timings on SDVOB failed. Is this ok to ignore?

Yeah, that's just an internal detail of your ADD card (the bit that we speak SDVO to and that actually controls the display interface on your machine). It should just work fine even with that warning.
Comment 32 Chris Wilson 2014-03-05 09:14:12 UTC
I'm closing this for lack of information about the earlier kernel crash, without which we cannot begin to diagnose it. Hopefully it too has resolved itself.

Please do reopen if you can provide any information to point us in the right direction.
Comment 33 Carsten Mattner 2014-03-05 14:14:07 UTC
Both issues are fixed.

