Bug 73580

Summary: cairo-perf-trace failure
Product: cairo Reporter: M Welinder <terra>
Component: generalAssignee: Chris Wilson <chris>
Status: RESOLVED FIXED QA Contact: cairo-bugs mailing list <cairo-bugs>
Severity: normal    
Priority: medium CC: emanuele.aina, marco.barisione, ppaalanen
Version: 1.12.16   
Hardware: x86-64 (AMD64)   
OS: Linux (All)   
Whiteboard:
i915 platform: i915 features:
Attachments: Compressed trace
Zenity trace
testcase
[PATCH] cairo-trace: Fix duplicated surface push on similar-image
[PATCH] cairo-trace: Extract common helper for _surface_create_similar[_image]()
[PATCH 1/3] cairo-trace: Fix duplicated surface push on similar-image
[PATCH 2/3] cairo-trace: Sync _create_similar_image() with _create_similar()
[PATCH 3/3] cairo-trace: Extract common helper for _surface_create_similar[_image]()

Description M Welinder 2014-01-14 00:34:45 UTC
A freshly made trace causes problems for cairo-perf-trace.  I'll attach
the compressed trace shortly.

welinder@toshiba ~/gnome/gnumeric/src $ cairo-perf-trace lt-gnumeric.15837.trace 
[ # ]  backend                         test   min(s) median(s) stddev. count
[  0]      xcb            lt-gnumeric.15837 Error during replay, line 6201: invalid value (typically too big) for the size of the input (surface, pattern, etc.)
[  0]     xlib            lt-gnumeric.15837 Error during replay, line 6201: invalid value (typically too big) for the size of the input (surface, pattern, etc.)
[ # ]    image: pixman 0.30.2
[  0]    image            lt-gnumeric.15837 Error during replay, line 6201: invalid value (typically too big) for the size of the input (surface, pattern, etc.)
[ # ]  image16: pixman 0.30.2
[  0]  image16            lt-gnumeric.15837 Error during replay, line 6201: invalid value (typically too big) for the size of the input (surface, pattern, etc.)
Comment 1 M Welinder 2014-01-14 00:37:09 UTC
Created attachment 92010 [details]
Compressed trace
Comment 2 Pekka Paalanen 2014-04-11 13:36:09 UTC
That "invalid value" error is tormenting me too. All tracing attempts of complicated applications (Midori, netsurf-gtk, epiphany-browser) end up either in segfault during the tracing, or then the produced trace causes "invalid value" when trying to replay it.

The only thing I managed to do, is to get a backtrace of when the error happens during replay:

#0  0x00007ffff7bd0244 in _csi_error () from /home/pq/tmp/pixman-install/lib/libcairo-script-interpreter.so.2
#1  0x00007ffff7bc5744 in _csi_ostack_get_context.3734 ()
   from /home/pq/tmp/pixman-install/lib/libcairo-script-interpreter.so.2
#2  0x00007ffff7bc56de in _do_cairo_op.3739 () from /home/pq/tmp/pixman-install/lib/libcairo-script-interpreter.so.2
#3  0x00007ffff7bc0c8d in _reset_clip.3755 () from /home/pq/tmp/pixman-install/lib/libcairo-script-interpreter.so.2
#4  0x00007ffff7bd1d58 in csi_object_execute () from /home/pq/tmp/pixman-install/lib/libcairo-script-interpreter.so.2
#5  0x00007ffff7bce59c in _scan_execute.5675 () from /home/pq/tmp/pixman-install/lib/libcairo-script-interpreter.so.2
#6  0x00007ffff7bc8482 in scan_execute.5474 () from /home/pq/tmp/pixman-install/lib/libcairo-script-interpreter.so.2
#7  0x00007ffff7bcac32 in _scan_file.5693 () from /home/pq/tmp/pixman-install/lib/libcairo-script-interpreter.so.2
#8  0x00007ffff7bd1eed in _csi_scan_file.2241 ()
   from /home/pq/tmp/pixman-install/lib/libcairo-script-interpreter.so.2
#9  0x00007ffff7bd1ddd in _csi_file_execute.2233 ()
   from /home/pq/tmp/pixman-install/lib/libcairo-script-interpreter.so.2
#10 0x00007ffff7bd1d8a in csi_object_execute () from /home/pq/tmp/pixman-install/lib/libcairo-script-interpreter.so.2
#11 0x00007ffff7bc517f in cairo_script_interpreter_run ()
   from /home/pq/tmp/pixman-install/lib/libcairo-script-interpreter.so.2
#12 0x00000000004062a0 in cairo_perf_trace (perf=0x7fffffffdb50, target=0x611cc0 <builtin_targets>, 
    trace=0x7fffffffe06a "./netsurf-load-madoka.trace") at cairo-perf-trace.c:757
#13 0x0000000000406fbc in main (argc=3, argv=0x7fffffffdcf8) at cairo-perf-trace.c:1032

And for whatever reason, libcairo-script-interpreter.so seems to never contain debug information while libcairo.so does, but that's off-topic.

I do run the traces through csi-bind from the Cairo trace repositories, but that doesn't help.

What can I do to investigate this?

Is there some other tool, that would verbosely tell, what it parses from the trace, so that I might at least determine whether the trace itself is corrupted, or if the script interpreter is broken?

I found DEBUG_SCAN in cairo-script-scanner.c, but it does not tell what the operators are, and I couldn't figure out how to print what they are.

I have been recording and playing back on Raspberry Pi with Cairo 1.12.16. I have managed to create one working trace from Leafpad. All the traces work or do not work equally well on both RPi and x86_64.
Comment 3 Emanuele Aina 2014-12-12 23:08:31 UTC
> And for whatever reason, libcairo-script-interpreter.so seems to never contain debug information while libcairo.so does, but that's off-topic.

As far as can tell that's because Link Time Optimization effectively disables debug symbols on libcairo-script-interpreter.so and libcairo-trace.so. My understandin is that LTO does not work well with debug symbols yet, but I wonder why it seems to work fine on libcairo.so.

After brutally stripping away the section related to -flto in build/configure.ac.warnings I got debug symbols working everywhere.

To get back to the main point of this bug, I seem to be affected too and here's the simple way to trigger it on my Debian installation (sid, libcairo2 1.14.0-2.1) using zenity:

$ cairo-trace cairo-trace zenity --info --text foo
cairo-trace: Recording cairo trace data to /home/em/zenity.25955.trace

$ cairo-perf-trace /home/em/zenity.25955.trace
[ # ]  backend                         test   min(s) median(s) stddev. count
[  0]      xcb                 zenity.25955 Error during replay, line 61: invalid value (typically too big) for the size of the input (surface, pattern, etc.)
[  0]     xlib                 zenity.25955 Error during replay, line 61: invalid value (typically too big) for the size of the input (surface, pattern, etc.)
[ # ]    image: pixman 0.32.6
[  0]    image                 zenity.25955 Error during replay, line 61: invalid value (typically too big) for the size of the input (surface, pattern, etc.)
[ # ]  image16: pixman 0.32.6
[  0]  image16                 zenity.25955 Error during replay, line 61: invalid value (typically too big) for the size of the input (surface, pattern, etc.)
Comment 4 Emanuele Aina 2014-12-15 10:11:02 UTC
Testing latest master (4a225fca5f12) on my laptop (Debian sid amd64) I'm still able to reproduce the issue (but debug symbols work everywhere without hacks, yay!):

$ CAIRO_TRACE_SO=util/cairo-trace/.libs/libcairo-trace.so util/cairo-trace/cairo-trace zenity --info --text foo
cairo-trace: Recording cairo trace data to ./zenity.24699.trace

$ perf/cairo-perf-trace ./zenity.24699.trace 
[ # ]  backend                         test   min(s) median(s) stddev. count
[  0]      xcb                 zenity.24699 Error during replay, line 86: invalid value (typically too big) for the size of the input (surface, pattern, etc.)
[  0]     xlib                 zenity.24699 Error during replay, line 86: invalid value (typically too big) for the size of the input (surface, pattern, etc.)
[ # ]    image: pixman 0.32.6
[  0]    image                 zenity.24699 Error during replay, line 86: invalid value (typically too big) for the size of the input (surface, pattern, etc.)
[ # ]  image16: pixman 0.32.6
[  0]  image16                 zenity.24699 Error during replay, line 86: invalid value (typically too big) for the size of the input (surface, pattern, etc.)

Breaking with gdb on _csi_error() shows that _set_source() is peeking at the stack, expecting a context object and getting a surface object instead. Indeed, the current top of the stack contains:

0 CSI_OBJECT_TYPE_PATTERN
1 CSI_OBJECT_TYPE_SURFACE
2 CSI_OBJECT_TYPE_CONTEXT


This is in response to a cairo_set_source_surface() call, so I would suspect that the pattern object is out-of-place but this would mean that there are two bugs: the pattern object in the stack and _set_source() assuming a pattern even when replaying a cairo_set_source_surface() call.

That said my understanding of how the cairo script stack management works is too limited to figure out if it's a problem with the trace or with the interpreter, so some help would be greatly appreciated.
Comment 5 Emanuele Aina 2014-12-15 10:17:17 UTC
Created attachment 110850 [details]
Zenity trace
Comment 6 Massimo 2014-12-16 10:31:14 UTC
Created attachment 110900 [details]
testcase

Shortest program whose trace shows the problem,
probably in util/cairo-trace/trace.c
Comment 7 Massimo 2014-12-16 15:20:05 UTC
I think it is the same problem that led to this commit:

http://cgit.freedesktop.org/cairo/commit/util/cairo-trace/trace.c?id=c1ac8db7e60bcec8235b09729b34154e95790534

which changed the structure of the code in

cairo_surface_create_similar

but not as much in 

cairo_surface_create_similar_image

they were similar before, but now seem to follow a different 
stack logic. Using cairo_surface_create_similar in my
testcase does not produce the error.
Comment 8 Emanuele Aina 2014-12-17 12:53:31 UTC
With my limited understanding on how the interpreter stack is supposed to work, I suspect that cairo_surface_create_similar_image() is pushing the reference surface one time too many.

I've attached a very simple patch that at least allows me to get working traces for every GTK3 program I've tested (zenity, gedit, epiphany).

Indeed I was surprised that cairo_surface_create_similar() and cairo_surface_create_similar_image() weren't particularly similar (no pun intended).

Should I try to update cairo_surface_create_similar_image() to look more like cairo_surface_create_similar() or even sharing code if possible?
Comment 9 Emanuele Aina 2014-12-17 12:54:30 UTC
Created attachment 110950 [details] [review]
[PATCH] cairo-trace: Fix duplicated surface push on similar-image
Comment 10 Emanuele Aina 2014-12-17 14:06:53 UTC
Created attachment 110953 [details] [review]
[PATCH] cairo-trace: Extract common helper for _surface_create_similar[_image]()

I've reworked the patch taking in consideration what Massimo suggested, extracting a helper function from cairo_surface_create_similar() to be shared with cairo_surface_create_similar_image().
Comment 11 Andrea Canciani 2015-01-03 10:59:13 UTC
Comment on attachment 110950 [details] [review]
[PATCH] cairo-trace: Fix duplicated surface push on similar-image

Review of attachment 110950 [details] [review]:
-----------------------------------------------------------------

LGTM. I confirm that it fixes the issue (at least on the testcase).
Comment 12 Andrea Canciani 2015-01-03 11:08:23 UTC
Comment on attachment 110953 [details] [review]
[PATCH] cairo-trace: Extract common helper for _surface_create_similar[_image]()

Review of attachment 110953 [details] [review]:
-----------------------------------------------------------------

I expressed some concerns about this refactoring on IRC, but I'll also post them here to keep track of them together with the bug.

While removing some code duplication between similar and similar-image is ideally good, in practice the two operators have different signatures (they use different parameters in a different order), so I'm afraid that coalescing both in a single function makes them more complicated than needed.
I agree that the changes from c1ac8db7e60bcec8235b09729b34154e95790534 should also apply to similar-image, but I believe that merging the emission of the two operator is not a good idea.

Ideally I would merge the previous patch (which fixes the cairo_surface_create_similar_image() function so that it does not pushes more surfaces than expected on the stack) and (in a separate commit) resync the implementation of similar and similar-image, possibly by only sharing the code for _emit_other_obj().
Comment 13 Andrea Canciani 2015-01-03 11:12:33 UTC
If possible, I would also try to find out (and document in a comment) why _emit_surface_op() should not be used in similar/similar-image.
Comment 14 Emanuele Aina 2015-01-08 14:21:53 UTC
Created attachment 111956 [details] [review]
[PATCH 1/3] cairo-trace: Fix duplicated surface push on similar-image

Reinstate the minimal patch from attachment 110950 [details] [review] as suggested by Andrea to fix the bug with the less intrusive change possible and move all the enhancements to the code in separate commits.
Comment 15 Emanuele Aina 2015-01-08 14:23:28 UTC
Created attachment 111957 [details] [review]
[PATCH 2/3] cairo-trace: Sync _create_similar_image() with _create_similar()

After commit c1ac8db7e60b the cairo_surface_create_similar() and
cairo_surface_create_similar_image() overrides diverged, with the latter
not getting the bugfixes that have been applied to the first.

This commit updates the cairo_surface_create_similar_image() override
implementation to look like the one of cairo_surface_create_similar(),
ensuring that the surface operand gets printed in the trace and always
assigning an identifier for the newly created surface.
Comment 16 Emanuele Aina 2015-01-08 14:26:33 UTC
Created attachment 111958 [details] [review]
[PATCH 3/3] cairo-trace: Extract common helper for _surface_create_similar[_image]()

To avoid any divergence between cairo_surface_create_similar_image() and
cairo_surface_create_similar() in future, coalesce the code to emit the
trace line in a common helper.

Unfortunately the signature of the two trace lines are slightly
different (in 'similar' the content type precedes width and height,
while for 'similar-image' width and height come before the format
identifier), but to be able to run new traces with older cairo tools it
cannot be changed.

This is optional, I still included it in the patchset since I feel that the two implementations may drift again after this resync, leading to subtle bugs.
Comment 17 Emanuele Aina 2015-01-08 15:14:05 UTC
> If possible, I would also try to find out (and document in a comment) why _emit_surface_op() should not be used in similar/similar-image.

I honestly don't know, _emit_surface_op() calls into _emit_current() (by way of _emit_surface()) which seems to be willing to roll the stack up to the requested surface instead of using indexed addressing or even 'dup'.

I'm not sufficiently intimate with the cairo interpreter stack machine to understand the two behaviours, but they seem quite different. :/
Comment 18 Andrea Canciani 2015-01-17 14:57:54 UTC
(In reply to Emanuele Aina from comment #17)
> > If possible, I would also try to find out (and document in a comment) why _emit_surface_op() should not be used in similar/similar-image.
> 
> I honestly don't know, _emit_surface_op() calls into _emit_current() (by way
> of _emit_surface()) which seems to be willing to roll the stack up to the
> requested surface instead of using indexed addressing or even 'dup'.
> 
> I'm not sufficiently intimate with the cairo interpreter stack machine to
> understand the two behaviours, but they seem quite different. :/

TL;DR: Let's merge patches 1&2 and ask Chris if it is safe to use _emit_surface ;)

Each patch in the patchset is:
Tested-By: Andrea Canciani <ranma42@gmail.com>

I confirm that all of the patches work in my environment and in particular that the first one fixes this bug.

The first 2 patches are
Reviewed-By: Andrea Canciani <ranma42@gmail.com>

The first patch fixes this bug, it is quite straightforward, provides a complete explanation in the commit message and does not affect unrelated code

The second one tries to reduce code complexity by restoring the similarity between two functions

Before merging the third one, I would try to get a more complete understanding of the cairoscript generator...

For example, inside cairo_surface_create_for_rectangle() there is a code fragment that is suspiciously similar to the one which was in the old version of cairo_surface_create_similar[_image]().
Would this be the right fragment to abstract?
(I think the concern that Bryce expressed on IRC might be related to this)

Even better, should we just use _emit_surface[_op]()?
In some functions, a comment explains why _emit_surface should not be used (example: cairo_surface_mark_dirty() in trace.c), but the one which are involved in this bug do not say anything on this.
Comment 19 Emanuele Aina 2015-01-19 13:23:00 UTC
> For example, inside cairo_surface_create_for_rectangle() there is a code
> fragment that is suspiciously similar to the one which was in the old
> version of cairo_surface_create_similar[_image]().
> Would this be the right fragment to abstract?
> (I think the concern that Bryce expressed on IRC might be related to this)

Thanks, I'll have a look to see if I can find the right level of abstraction.

> Even better, should we just use _emit_surface[_op]()?
> In some functions, a comment explains why _emit_surface should not be used
> (example: cairo_surface_mark_dirty() in trace.c), but the one which are
> involved in this bug do not say anything on this.

A quite difficult to understand (for me) issue is that _emit_surface() is quite different from what the highlighted functions do to print the indexing information. In particular _emit_surface() seems to pop all the objects on the stack until it reaches the target surface, which seems quite destructive, while our functions all try different indexing strategies to pick the right surface without altering the stack.
Comment 20 Emanuele Aina 2015-03-13 12:32:45 UTC
Unfortunately I haven't had the chance to investigate  _emit_surface() further, so at the moment I can't do much more than what patch 3/3 does.

Given that the original problem has been fixed I guess we can close this bug for now.
Comment 21 Bryce Harrington 2017-04-17 16:25:11 UTC
Closing as per last comment, sounds like the follow-on work would be done separately.

Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.