Bug 44682 - SwingSet2 demo applet artifacts [SNA]
Summary: SwingSet2 demo applet artifacts [SNA]
Status: RESOLVED FIXED
Alias: None
Product: xorg
Classification: Unclassified
Component: Driver/intel (show other bugs)
Version: git
Hardware: Other All
: medium normal
Assignee: Chris Wilson
QA Contact: Xorg Project Team
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2012-01-11 07:54 UTC by Clemens Eisserer
Modified: 2012-01-19 02:17 UTC (History)
0 users

See Also:
i915 platform:
i915 features:


Attachments
text rendered with wrong background (read after write hazard?) (13.91 KB, image/png)
2012-01-11 07:54 UTC, Clemens Eisserer
no flags Details
artifact while scrolling text (72.38 KB, image/png)
2012-01-11 07:54 UTC, Clemens Eisserer
no flags Details
spontaneous corruption in java2demo (42.27 KB, image/png)
2012-01-11 07:55 UTC, Clemens Eisserer
no flags Details
color_chooser_demo flickering (reproduceable) (37.29 KB, image/png)
2012-01-11 07:56 UTC, Clemens Eisserer
no flags Details
reg dump of wedged gpu (10.87 KB, text/plain)
2012-01-11 11:26 UTC, Clemens Eisserer
no flags Details
intel error decode log (1.33 MB, text/plain)
2012-01-11 11:27 UTC, Clemens Eisserer
no flags Details
glyph corruption in firefox (4.90 KB, image/png)
2012-01-12 01:07 UTC, Clemens Eisserer
no flags Details
log (not involved in the crash logged below) (2.36 MB, application/octet-stream)
2012-01-14 15:26 UTC, Clemens Eisserer
no flags Details
black border arround buttons after initialization (100.55 KB, image/png)
2012-01-17 16:00 UTC, Clemens Eisserer
no flags Details
artifacts on desktop, when resizing misbehaving window (214.24 KB, image/png)
2012-01-17 16:23 UTC, Clemens Eisserer
no flags Details

Description Clemens Eisserer 2012-01-11 07:54:21 UTC
Created attachment 55439 [details]
text rendered with wrong background (read after write hazard?)

When runnig the SwingSet2 demo located at the URL mentioned below, I get two different artifacts:

- The color-chooser demo flickers
- Several hard-to-reproduce rendering errors, like shown on the screenshot. Sometimes those errors happen when scrolling on tabs containing text, sometimes I also get similar looking errors for other 


intel: 2.17.0-341-gc05e90a
URL: http://java.sun.com/products/plugin/1.5.0/demos/plugin/jfc/SwingSet2/SwingSet2.html
Comment 1 Clemens Eisserer 2012-01-11 07:54:50 UTC
Created attachment 55440 [details]
artifact while scrolling text
Comment 2 Clemens Eisserer 2012-01-11 07:55:27 UTC
Created attachment 55441 [details]
spontaneous corruption in java2demo
Comment 3 Clemens Eisserer 2012-01-11 07:56:08 UTC
Created attachment 55442 [details]
color_chooser_demo flickering (reproduceable)
Comment 4 Clemens Eisserer 2012-01-11 07:56:32 UTC
forgot to mention, this is on my i945GM.
Comment 5 Chris Wilson 2012-01-11 08:39:20 UTC
Yes, lots of little bugs all coming together in one big hit... (Increasing the use of upload buffers which I had neglected to damage along with the extra fallback avoidance code suddenly started causing fallbacks and trying to get damage that wasn't there...)

Can you pull the latest code and retest? Quick inspection shows that things look back to normal.
Comment 6 Clemens Eisserer 2012-01-11 11:18:24 UTC
With 2.17.0-378-gb82851e I get still corruptions (even more) and it reliably hangs the GPU after a short time.

[  160.080107] [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
[  160.080119] [drm] capturing error event; look for more information in /debug/dri/0/i915_error_state
[  160.081329] [drm:i915_wait_request] *ERROR* i915_wait_request returns -11 (awaiting 47204 at 47201, next 47205)
Comment 7 Clemens Eisserer 2012-01-11 11:26:04 UTC
Created attachment 55463 [details]
reg dump of wedged gpu
Comment 8 Clemens Eisserer 2012-01-11 11:27:04 UTC
Created attachment 55464 [details]
intel error decode log
Comment 9 Chris Wilson 2012-01-11 18:48:57 UTC
So I found the second bug that caused random corruption in the icons and images, and is likely the cause of the GPU hang.

commit 112b8959261712aaa82f92af0aca4b97fa7c7f03
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Thu Jan 12 01:38:18 2012 +0000

    sna: Prevent shrinking a partial buffer stolen for a read
    
    If we reuse a partial buffer for a read, we cannot shrink it during
    upload to the device as we do not track how many bytes we actually need
    for the read operation.
    
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Comment 10 Clemens Eisserer 2012-01-12 01:06:34 UTC
Unfourtunatly I still sometimes see corruptions like shown in https://bugs.freedesktop.org/attachment.cgi?id=55440.

Additionally I now get glyph corruptions in firefox sometimes.

I turned of semaphores/rc6/fbc I had enabled previously to make sure its not caused by my settings, but that didn't help.
Comment 11 Clemens Eisserer 2012-01-12 01:07:33 UTC
Created attachment 55479 [details]
glyph corruption in firefox
Comment 12 Chris Wilson 2012-01-12 02:14:57 UTC
I've seen the glyph cache corruption, but only on gen4 so far -- I have not nailed down precisely which path it is creeping in yet nor why it is so sporadic. I've not seen the damage corruption at all since the first update.
Comment 13 Chris Wilson 2012-01-12 02:21:02 UTC
Of course, just the mere act of mentioning it on a bug report, I've now found examples on gen3/4/5. ;-)

At least that means it is on a common path!
Comment 14 Chris Wilson 2012-01-14 14:17:49 UTC
Found a much nastier damage bug:

commit 983b755313df8a0d256c59c32ec4106e35f237aa
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Thu Jan 12 21:05:16 2012 +0000

    sna/damage: Fix union of extents with dirty damage but no region
    
    By failing to account for certain paths which would create a damage elt
    without fully initialisating the damage region (only the damage extents),
    we would later overwrite the damage extents with only the extents for
    this operation (rather than the union of this operation with the current
    damage). This fixes a regression from 098592ca5d,
    (sna: Remove the independent tracking of elts from boxes).
    
    Include the associated damage migration debugging code of the callers.
    
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Comment 15 Clemens Eisserer 2012-01-14 14:30:29 UTC
still get the text-corruption when scrolling, like the one shown in "artifact while scrolling text".

I found out the corruption only occurs when dragging the scrollbars, when scrolling using the scroll-wheel everything works as expected.
Comment 16 Chris Wilson 2012-01-14 14:46:19 UTC
Can you refresh me on your setup? At the moment, I have just one issue with SwingSet2 and that appears that I get a planeshift in the images (I suspect an XYPixmap issue). I haven't seen that text corruption which looks like the damage issue from so long ago... If everything is working correctly, we should not be hitting the CPU at all for SwingSet2. If you can find the time to grab an --enable-debug=full log, that will help me understand why we have pixmap migration at all. Thanks.
Comment 17 Clemens Eisserer 2012-01-14 15:24:29 UTC
My setup is a Toshiba Tecra A8 running linux-3.1.7, as well as the latest pixmap and whatever Xorg shipped currently in Fedora-16-updates.

The "traditional" Java/X11 backend doesn't use Xrender, so for a lot operations (e.g. antialiased text) it falls back to its own software fallbacks - issuing a lot of GetImage/PutImage calls.

With debug=full, Xorg unfourtunatly dies very quickly:
#0  0x00249416 in __kernel_vsyscall ()
#1  0x49d4998f in __GI_raise (sig=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
#2  0x49d4b2d5 in __GI_abort () at abort.c:91
#3  0x49d426a5 in __assert_fail_base (fmt=0x49e82c48 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", assertion=0x63c0d6 "(size & (4096-1)) == 0", file=0x63bf5c "kgem.c", line=328,
    function=0x63de58 "gem_create") at assert.c:94
#4  0x49d42757 in __GI___assert_fail (assertion=0x63c0d6 "(size & (4096-1)) == 0", file=0x63bf5c "kgem.c", line=328, function=0x63de58 "gem_create") at assert.c:103
#5  0x00567f13 in gem_create (fd=11, size=262148) at kgem.c:328
#6  0x0057082f in kgem_create_buffer (kgem=0xa0c6e54, size=140268, flags=1, ret=0xbfde43c8) at kgem.c:3145
#7  0x005adaf6 in sna_write_boxes (sna=0xa0c6d18, dst=0xa341530, dst_bo=0xa3415b8, dst_dx=0, dst_dy=0, src=0xb5b95008, stride=7680, src_dx=0, src_dy=0, box=0xa4c8310, nbox=0) at sna_io.c:475
#8  0x005774ef in sna_pixmap_move_to_gpu (pixmap=0xa341530, flags=2) at sna_accel.c:1855
#9  0x00590dc4 in sna_accel_flush (sna=0xa0c6d18) at sna_accel.c:9904
#10 0x0059180d in sna_accel_block_handler (sna=0xa0c6d18) at sna_accel.c:10157
#11 0x005a5b21 in sna_block_handler (i=0, data=0xa0c6d18, timeout=0xbfde47cc, read_mask=0x8235c20) at sna_driver.c:593
#12 0x0807a438 in BlockHandler (pTimeout=0xbfde47cc, pReadmask=0x8235c20) at dixutils.c:385
#13 0x080a5891 in WaitForSomething (pClientsReady=0xa295c98) at WaitFor.c:219
#14 0x08075eae in Dispatch () at dispatch.c:381
#15 0x0806439a in main (argc=6, argv=0xbfde4934, envp=0xbfde4950) at main.c:287
Comment 18 Clemens Eisserer 2012-01-14 15:26:25 UTC
Created attachment 55586 [details]
log (not involved in the crash logged below)
Comment 19 Chris Wilson 2012-01-15 02:01:55 UTC
As with quickly hit assertions, the oversight was rather simple. In this case, I had assumed that everybody would have a power-of-two aperture. In your case, some of that space is probably reserved through the PCI config.

commit 421ee0bb53321e9fb2f17c850c51b80cc385bf57
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Sun Jan 15 00:46:47 2012 +0000

    sna: Do not assume that the mappable aperture size is a power of two
    
    And instead derive a power-of-two alignment value for partial buffer
    sizes from the mappable aperture size and use that during
    kgem_create_buffer()
    
    Reported-by: Clemens Eisserer <linuxhippy@gmail.com>
    References: https://bugs.freedesktop.org/show_bug.cgi?id=44682
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>

As you said, the log is simply GetImage followed by PutImage, nothing too scary. It is a path for which I had hoped to get better kernel support for by now, so I had not put much effort into tuning it. But that is a digression to the issue at hand...
Comment 20 Clemens Eisserer 2012-01-17 08:26:13 UTC
Thanks to your fix, --enable-debug=full now works properly.

I switched back from a VT to X at 5169.79 and immediatly started scrolling the text, where I was able to reproduce the artifacts.

Note that sometimes the artifacts occur, and sometimes I can scroll for a few seconds with everything looking fine.
Comment 21 Clemens Eisserer 2012-01-17 08:39:35 UTC
file is too large for bugzilla, please download it from here: http://93.83.133.214/Xorg.0.log.old.bz2
Comment 22 Chris Wilson 2012-01-17 08:54:36 UTC
Thank you, I've downloaded the log file.
Comment 23 Chris Wilson 2012-01-17 15:25:13 UTC
Valgrind found something that looks suspiciously like what you describe:

commit 4b5c9affd4968fd5b3d85551cf5a0b4d600a7892
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Tue Jan 17 20:23:26 2012 +0000

    sna: Restore orginal shadow pointer before uploading CPU damage
    
    Detected by valgrind:
    ==22012== Source and destination overlap in memcpy(0xd101000, 0xd101000,
    783360)
    ==22012==    at 0x402A180: memcpy (in
    /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
    ==22012==    by 0x89BD4ED: memcpy_blt (blt.c:209)
    ==22012==    by 0x89F2921: sna_write_boxes (sna_io.c:364)
    ==22012==    by 0x89CFABF: sna_pixmap_move_to_gpu (sna_accel.c:1900)
    ==22012==    by 0x89F49B0: sna_render_pixmap_bo (sna_render.c:571)
    ==22012==    by 0x8A268CE: gen5_composite_picture (gen5_render.c:1908)
    ==22012==    by 0x8A29B8A: gen5_render_composite (gen5_render.c:2252)
    ==22012==    by 0x89E6762: sna_composite (sna_composite.c:485)
    ==22012==    by 0x21D3C3: damageComposite (damage.c:569)
    ==22012==    by 0x215963: ProcRenderComposite (render.c:728)
    ==22012==    by 0x159FB0: Dispatch (dispatch.c:437)
    ==22012==    by 0x1491D9: main (main.c:287)
    ==22012==
    
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Comment 24 Clemens Eisserer 2012-01-17 16:00:04 UTC
Unfourtunatly no change - I still can easily reproduce the artifacts when scrolling.

I also noticed some odd rendered button-borders (black), which only sometimes crop up when the applet initializes (I have the java logging console enabled, thats the window on top) - and are drawn fine when causing a repaint.
Comment 25 Clemens Eisserer 2012-01-17 16:00:52 UTC
Created attachment 55703 [details]
black border arround buttons after initialization
Comment 26 Clemens Eisserer 2012-01-17 16:23:29 UTC
Created attachment 55704 [details]
artifacts on desktop, when resizing misbehaving window

I was able to even trigger a rendering corruption on my desktop by resizing a misbehaving window.

The rendering-errors (including the black button borders) seem to only happen when resizing the window - when moving the mouse over the buttons to highlight/dehighlight them everything is painted fine.
Comment 27 Chris Wilson 2012-01-18 03:21:05 UTC
ARGH.
Comment 28 Clemens Eisserer 2012-01-18 12:05:32 UTC
Was able to reproduce the problem also with Java2Demo, where it occurs in some animations, sometimes rendering is fast and no artifacts occur, sometimes it degrades to a state where it performs very slow and rendering artifacts occur.

I digged a bit in Xorg's log:

[  2516.044] region_inplace: (32x1), inplace? 0
[  2516.044] sna_put_zpixmap_blt: marking damage
[  2516.045] _sna_damage_subtract([[(0, 0), (722, 544)]: [(0, 0), (722, 544) x 929: ((0, 0), (722, 101)), ((0, 101), (569, 128)), ((704, 101), (722, 128)), ((0, 128), (192, 130)), ((288, 128), (569, 130)), ((704, 128), (722, 130)), ((0, 130), (569, 160)), ((704, 130), (722, 160)), ((21, 160), (463, 173)), ((480, 160), (569, 173)), ((704, 160), (722, 173)), ((21, 173), (463, 201)), ((480, 173), (722, 201)), ((21, 201), (82, 202)), ((83, 201), (463, 202)), ((480, 201), (722, 202)), ((21, 202), (79, 203)), ((87, 202), (463, 203)), ...] -[ ...]] - [(294, 403), (326, 404)])...
[  2516.234]     _sna_damage_create_elt: prev=(remain 7), count=1
[  2516.234]   = [[(0, 0), (722, 544)]: [(0, 0), (722, 544) x 929: ((

Can it be _sna_damage_subtract really took 200ms?
Maybe this is damage related after all?
Comment 29 Chris Wilson 2012-01-18 12:21:16 UTC
Ok, I've hit the issue. And it reproduced first time (xfce4-session + SwingSet2), so hopefully I'll be able to start narrowing down the suspect code paths.
Comment 30 Chris Wilson 2012-01-18 12:29:15 UTC
fwiw, now I have the recipe, it also fails on SNB. So that rules out the gen3 specific paths and the tiny CPU paths.
Comment 31 Chris Wilson 2012-01-18 12:35:20 UTC
(In reply to comment #28)
> Was able to reproduce the problem also with Java2Demo, where it occurs in some
> animations, sometimes rendering is fast and no artifacts occur, sometimes it
> degrades to a state where it performs very slow and rendering artifacts occur.
> 
> I digged a bit in Xorg's log:
> 
> [  2516.044] region_inplace: (32x1), inplace? 0
> [  2516.044] sna_put_zpixmap_blt: marking damage
> [  2516.045] _sna_damage_subtract([[(0, 0), (722, 544)]: [(0, 0), (722, 544) x
> 929: ((0, 0), (722, 101)), ((0, 101), (569, 128)), ((704, 101), (722, 128)),
> ((0, 128), (192, 130)), ((288, 128), (569, 130)), ((704, 128), (722, 130)),
> ((0, 130), (569, 160)), ((704, 130), (722, 160)), ((21, 160), (463, 173)),
> ((480, 160), (569, 173)), ((704, 160), (722, 173)), ((21, 173), (463, 201)),
> ((480, 173), (722, 201)), ((21, 201), (82, 202)), ((83, 201), (463, 202)),
> ((480, 201), (722, 202)), ((21, 202), (79, 203)), ((87, 202), (463, 203)), ...]
> -[ ...]] - [(294, 403), (326, 404)])...
> [  2516.234]     _sna_damage_create_elt: prev=(remain 7), count=1
> [  2516.234]   = [[(0, 0), (722, 544)]: [(0, 0), (722, 544) x 929: ((
> 
> Can it be _sna_damage_subtract really took 200ms?
> Maybe this is damage related after all?

No, it should not have taken 0.2s, the whole point of the sna_damage layer is to try and avoid having to compute subtractions and instead stores the damage region plus list of rectangles to remove later if necessary. In this case, it should have just checked that the region to remove overlaps with the dirty region (by checking the extents against each other) and then copy the actual box.

The 0.2s could have been a side-effect of logging. I hope it was...
Comment 32 Chris Wilson 2012-01-18 17:02:05 UTC
So the actual bug turns out to have been quite trivial and embarrassing:

commit 35f81005f91d294e61bb4ced7cbddd1a76ccb324
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Thu Jan 19 00:41:08 2012 +0000

    sna/damage: Always mark the damage as dirty when recording new boxes
    
    A few of the create_elts() routines missed marking the damage as dirty
    so that if only part of the emebbed box was used (i.e. the damage
    contained less than 8 rectangles that needed to included in the damage
    region) then those were being ignored during migration and testing.
    
    Reported-by: Clemens Eisserer <linuxhippy@gmail.com>
    References: https://bugs.freedesktop.org/show_bug.cgi?id=44682
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>

I do hope that's the last of the damage tracking bugs for a while!
Comment 33 Chris Wilson 2012-01-18 17:04:06 UTC
I'm going to guess the missing border is likely the same damage problem. However, the desktop corruption looks like it might involve tiling, and I fixed a related problem earlier:

commit ab387a89cf64ace9f7d7fa52b7743e78b625f47d
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Wed Jan 18 18:32:51 2012 +0000

    sna: Update bo->tiling during search_linear_cache
    
    search_linear_cache() was updated to track the first good match whilst it
    continued to search for a better match. This resulted in the first good
    bo being modified and a record of those modifications lost, in
    particular the change in tiling.
    
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Comment 34 Clemens Eisserer 2012-01-18 23:14:30 UTC
I can confirm the artifacts are gone :)

Thanks!
Comment 35 Chris Wilson 2012-01-19 02:17:14 UTC
Moving on to the next bug...

Thanks!


Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.