74733 – [post 0.6.0 regression] X crash/segfault when using latest glamor

Bug 74733 - [post 0.6.0 regression] X crash/segfault when using latest glamor

Summary: [post 0.6.0 regression] X crash/segfault when using latest glamor

Status:	CLOSED FIXED

Alias:	None

Product:	xorg
Classification:	Unclassified
Component:	Server/Acceleration/glamor (show other bugs)
Version:	git
Hardware:	Other All

Importance:	medium normal
Assignee:	Zhigang Gong
QA Contact:

URL:
Whiteboard:
Keywords:

Duplicates (1):	74841 (view as bug list)
Depends on:
Blocks:

Reported:	2014-02-09 03:10 UTC by Dan Doel
Modified:	2014-02-24 11:18 UTC (History)
CC List:	5 users (show)

See Also:
i915 platform:
i915 features:

Attachments
Xorg log from crash (107.23 KB, text/plain) 2014-02-09 03:10 UTC, Dan Doel	no flags	Details
Possible fix for the crash (2.86 KB, patch) 2014-02-11 02:37 UTC, Anthony Waters	no flags	Details \| Splinter Review
View All

Description Dan Doel 2014-02-09 03:10:53 UTC

Created attachment 93689 [details]
Xorg log from crash

Hello,

I decided to try the latest glamor from git today, hearing that it had fixed some glyph rendering issues. However, I am getting reliable X crashes when using it.

The steps I've found to reproduce are:

  1) Go to gmail or google+
  2) Open a hangouts chat with someone

This causes a segfault that results in X closing. I'll attach a log file.

I do not see this using glamor 0.6.0, so it must be related to one of the three patches since then. It seems most likely that it is related to one of the two patches added on January 31st, 2014.

I'm running a relatively recent (few weeks) from-git version of mesa with a radeon 7870. Kernel version is 3.12.9.

If any additional information is desired, I'll be happy to provide it if I can.

Comment 1 Zhigang Gong 2014-02-10 03:00:12 UTC

CC to Alex, it seems that this bug is a regression caused by the patches after
0.6.0. Could you help to look into it? Thanks.

Comment 2 Anthony Waters 2014-02-10 19:40:24 UTC

Works fine for me here, although, I am running xserver from git, I'll try downgrading to 1.15.0 to see if I can reproduce.

Comment 3 Anthony Waters 2014-02-10 20:02:55 UTC

I tried reproducing with xorg-server-1.15.0 and mesa-10.0.3 with glamor from git and I couldn't reproduce with firefox-27.0, cairo-1.12.16, and gtk+-3.10.7/gtk+-2.24.22.

What browser does this happen in?

Comment 4 Dan Doel 2014-02-11 00:53:25 UTC

I'm using:

    Firefox 27.0
    xorg-server 1.15.0
    gtk3 3.10.7
    gtk2 2.24.22
    mesa 10.1 from git (git rev-list --count HEAD = 60899)
    xf86-video-ati 7.3.0

I was using a version of cairo from git as well, to enable hidpi mode in Gnome. But I just reinstalled the stock version and see the same crash. I'll try rolling back mesa next.

Comment 5 Dan Doel 2014-02-11 01:06:54 UTC

Still happens with mesa 10.0.3.

And I just verified that it happens with my desktop in single-display mode, and with that single display in 1920x1080, so it doesn't seem to be related to very high resolutions.

Are there any additional logs or traces I can send that would help narrow things down?

Comment 6 Anthony Waters 2014-02-11 01:38:57 UTC

I reproduced it with a 32 bit version of firefox-27.0, I'll see what I can do.

Comment 7 Anthony Waters 2014-02-11 02:37:07 UTC

Created attachment 93821 [details] [review]
Possible fix for the crash

See if this patch fixes it, I'm not too sure what was wrong the old code but from the looks of it, the code was doing too much work with copying data when all it needed to do was update the location of the data to put into the pixmap.

Comment 8 Zhigang Gong 2014-02-11 02:44:22 UTC

(In reply to comment #7)
> Created attachment 93821 [details] [review] [review]
> Possible fix for the crash
> 
> See if this patch fixes it, I'm not too sure what was wrong the old code but
> from the looks of it, the code was doing too much work with copying data
> when all it needed to do was update the location of the data to put into the
> pixmap.

IMO, this patch is not correct. The reason why we need to gather the subregion to a contiguous buffer is that OpenGL ES(glTexImage2D/glTexSubImage2D)only support to upload from a contiguous external buffer.

Comment 9 Dan Doel 2014-02-11 02:58:15 UTC

It does eliminate the crash, yes.

Comment 10 Anthony Waters 2014-02-11 03:36:10 UTC

(In reply to comment #8)
> (In reply to comment #7)
> > Created attachment 93821 [details] [review] [review] [review]
> > Possible fix for the crash
> > 
> > See if this patch fixes it, I'm not too sure what was wrong the old code but
> > from the looks of it, the code was doing too much work with copying data
> > when all it needed to do was update the location of the data to put into the
> > pixmap.
> 
> IMO, this patch is not correct. The reason why we need to gather the
> subregion to a contiguous buffer is that OpenGL
> ES(glTexImage2D/glTexSubImage2D)only support to upload from a contiguous
> external buffer.

I see, is https://www.opengl.org/sdk/docs/man/html/glTexImage2D.xhtml the correct documentation for OpenGL ES? I was having trouble finding the specs.  

When I set MAX_FBO_SIZE I did notice some corruption, not sure of the cause, but I will work more on this.

Comment 11 Zhigang Gong 2014-02-11 03:50:00 UTC

(In reply to comment #10)
> (In reply to comment #8)
> > (In reply to comment #7)
> > > Created attachment 93821 [details] [review] [review] [review] [review]
> > > Possible fix for the crash
> > > 
> > > See if this patch fixes it, I'm not too sure what was wrong the old code but
> > > from the looks of it, the code was doing too much work with copying data
> > > when all it needed to do was update the location of the data to put into the
> > > pixmap.
> > 
> > IMO, this patch is not correct. The reason why we need to gather the
> > subregion to a contiguous buffer is that OpenGL
> > ES(glTexImage2D/glTexSubImage2D)only support to upload from a contiguous
> > external buffer.
> 
> I see, is https://www.opengl.org/sdk/docs/man/html/glTexImage2D.xhtml the
> correct documentation for OpenGL ES? I was having trouble finding the specs.
> 
> 
> When I set MAX_FBO_SIZE I did notice some corruption, not sure of the cause,
> but I will work more on this.

I checked the commit:
commit 654e1536a6d6539dcc30d9151122536c1807f34a
Author: Anthony Waters <awaters1@gmail.com>
Date:   Wed Jan 29 11:21:13 2014 -0500

    glamor: Add in support for the stride parameter when uploading texture data


                glamor_upload_sub_pixmap_to_texture(temp_pixmap, 0, 0, w, h,
-                                                   pixmap->devKind, bits, 0);
-
+                                                   0, bits, 0);
                glamor_copy_area(&temp_pixmap->drawable, drawable, gc, 0, 0, w, h, x, y);
                glamor_destroy_pixmap(temp_pixmap);
-       } else
+       } else {
                glamor_upload_sub_pixmap_to_texture(pixmap, x + drawable->x + x_off, y + drawable->y + y_off,
-                                                   w, h, PixmapBytePad(w, depth), bits, 0);
+                                                   w, h, 0, bits, 0);
+       }

why do you set the stride to zero at the glamor_upload_sub_pixmap_to_texture()?
The original stride is pixmap->devKind which is needed to upload the sub pixmap.

I think this is the root cause of this regression. Could you look at it? Thanks.

Comment 12 Zhigang Gong 2014-02-11 13:59:37 UTC

*** Bug 74841 has been marked as a duplicate of this bug. ***

Comment 13 Anthony Waters 2014-02-11 15:45:49 UTC

I had it set to 0 originally because it would cause the xserver to crash, but turns out that crash was related to something else, so changing it back to be that PixmapBytePad is the correct path. I also noticed that I was using the variable i instead of j when getting the data into temp_bits, so I will fix that up and repost the patch.

Also, from looking at what the old code did with glamor_put_bits to copy the data from bits to temp_bits it should be equivalent to setting temp_bits to point to the correct place in bits.  I believe the only difference would be if temp_stride is different from stride, but this doesn't matter because _glamor_upload_bits_to_pixmap_texture takes the stride parameter into account when uploading the data.

Comment 14 Zhigang Gong 2014-02-12 02:07:27 UTC

(In reply to comment #13)
> I had it set to 0 originally because it would cause the xserver to crash,
> but turns out that crash was related to something else, so changing it back
> to be that PixmapBytePad is the correct path. I also noticed that I was
> using the variable i instead of j when getting the data into temp_bits, so I
> will fix that up and repost the patch.
> 
> Also, from looking at what the old code did with glamor_put_bits to copy the
> data from bits to temp_bits it should be equivalent to setting temp_bits to
> point to the correct place in bits.  I believe the only difference would be
> if temp_stride is different from stride, but this doesn't matter because
> _glamor_upload_bits_to_pixmap_texture takes the stride parameter into
> account when uploading the data.

I just mentioned in my previous comment. What we need to do in the glamor_upload_sub_pixmap_to_texture() is to gather the sub-region into a contiguous buffer. Let me explain more details here:

As the temp_bits and the temp_stride will pass to the function __glamor_upload_pixmap_to_texture() eventually. In that function,
it will use the stride as below:
dispatch->glPixelStorei(GL_UNPACK_ROW_LENGTH, stride / alignment);

But unfortunately, OpenGL ES doesn't support that storage type which means, OpenGL ES will treat the external buffer of glTexImage2D/glTexSubImage2D is
always contiguous buffer. The stride will be determined by the width and height
of that temp_bit buffer. That is obviously incorrect.

So my comments to your new patch is that:
1. change back to pass correct stride value devKind to glamor_upload_sub_pixmap_to_texture().
2. Don't change the logic in glamor_upload_sub_pixmap_to_texture().

BTW, even when we are using OpenGL which does support GL_UNPACK_ROW_LENGTH, to gather the subregion in glamor_upload_sub_pixmap_to_texture is not really a overhead as what it looks like. As when you set GL_UNPACK_ROW_LENGTH to a larger stride, the OpenGL dri driver will know that this is not a contiguous buffer, and will go to a slow path to upload that buffer.

Comment 15 Anthony Waters 2014-02-14 21:59:54 UTC

I think I see, so in OpenGL ES the stride is taken from width/height (so boxes[j].x2 - boxes[j].x1 and boxes[j].y2 - boxes[j].y1), however, because the data in temp_bits is according to w and h, it will copy incorrectly.  Assuming I am understanding it correctly, I see how it works now. 

I'll work on restoring how it used to work and see if the bug is still apparent.

Comment 16 Zhigang Gong 2014-02-17 14:40:02 UTC

(In reply to comment #0)
> Created attachment 93689 [details]
> Xorg log from crash
> 
> Hello,
> 
> I decided to try the latest glamor from git today, hearing that it had fixed
> some glyph rendering issues. However, I am getting reliable X crashes when
> using it.
> 
> The steps I've found to reproduce are:
> 
>   1) Go to gmail or google+
>   2) Open a hangouts chat with someone
> 
> This causes a segfault that results in X closing. I'll attach a log file.
> 
> I do not see this using glamor 0.6.0, so it must be related to one of the
> three patches since then. It seems most likely that it is related to one of
> the two patches added on January 31st, 2014.
> 
> I'm running a relatively recent (few weeks) from-git version of mesa with a
> radeon 7870. Kernel version is 3.12.9.
> 
> If any additional information is desired, I'll be happy to provide it if I
> can.

I just pushed Anthony's patch to the git master branch. Could you test whether it fix this bug in your system? Thanks.

Comment 17 Dan Doel 2014-02-17 17:15:09 UTC

Yes. Fixed here. Thanks.

Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.