Bug 32188 - radeon evergreen (5xxx) segfault (Komodo Edit, other apps)
Summary: radeon evergreen (5xxx) segfault (Komodo Edit, other apps)
Status: RESOLVED FIXED
Alias: None
Product: xorg
Classification: Unclassified
Component: Driver/Radeon (show other bugs)
Version: unspecified
Hardware: x86-64 (AMD64) Linux (All)
: high critical
Assignee: xf86-video-ati maintainers
QA Contact: Xorg Project Team
URL:
Whiteboard:
Keywords:
: 30754 30945 31414 32440 32706 32785 32800 (view as bug list)
Depends on:
Blocks:
 
Reported: 2010-12-07 09:52 UTC by Chris Bandy
Modified: 2011-01-27 13:45 UTC (History)
14 users (show)

See Also:
i915 platform:
i915 features:


Attachments
backtrace (10.67 KB, text/plain)
2010-12-07 09:52 UTC, Chris Bandy
no flags Details
dmesg.txt (70.73 KB, text/plain)
2010-12-07 09:53 UTC, Chris Bandy
no flags Details
Xorg.0.log.txt (25.79 KB, text/plain)
2010-12-07 09:53 UTC, Chris Bandy
no flags Details
possible fix (1011 bytes, patch)
2010-12-13 10:52 UTC, Alex Deucher
no flags Details | Splinter Review
backtrace, patched comment 5 (9.84 KB, text/plain)
2010-12-13 13:38 UTC, Chris Bandy
no flags Details
don't use vbos for constant buffers (15.40 KB, patch)
2011-01-03 17:13 UTC, Alex Deucher
no flags Details | Splinter Review
backtrace, server 1.7.7, patched comment 12 (39.97 KB, text/plain)
2011-01-04 10:39 UTC, Chris Bandy
no flags Details
Backtraces of calls which introduce dma_bo->bo with a valid pointer (not null) (13.29 KB, text/plain)
2011-01-05 18:56 UTC, Tobias Droste
no flags Details
backtrace, server 1.9.2, patched comment 24 (1.67 KB, text/plain)
2011-01-21 06:24 UTC, Chris Bandy
no flags Details
Detailed backtrace (14.60 KB, text/plain)
2011-01-21 11:38 UTC, Eduard Bloch
no flags Details
various (2.51 KB, patch)
2011-01-23 13:25 UTC, Alex Deucher
no flags Details | Splinter Review

Note You need to log in before you can comment on or make changes to this bug.
Description Chris Bandy 2010-12-07 09:52:04 UTC
Komodo Edit 6.0.2 causes X to segfault somewhat reliably. I'm not sure which component is to blame. Attaching logs.

sys-kernel/gentoo-sources-2.6.36
x11-base/xorg-server-1.7.7-r1

media-libs/mesa-9999 27609a826734594c092390c2538b74f8358c1aba
x11-libs/libdrm-9999 39e5e982242cd2b611a9dfc1e9b63f857d52da61
x11-drivers/xf86-video-ati-9999 f9bbb26dd97254b66de11bb2abd821aa293ecba5
Comment 1 Chris Bandy 2010-12-07 09:52:50 UTC
Created attachment 40875 [details]
backtrace
Comment 2 Chris Bandy 2010-12-07 09:53:09 UTC
Created attachment 40876 [details]
dmesg.txt
Comment 3 Chris Bandy 2010-12-07 09:53:29 UTC
Created attachment 40877 [details]
Xorg.0.log.txt
Comment 4 Alex Deucher 2010-12-13 08:27:04 UTC
This is a bug in the ddx.
Comment 5 Alex Deucher 2010-12-13 10:52:11 UTC
Created attachment 41073 [details] [review]
possible fix

Does this patch help?
Comment 6 Chris Bandy 2010-12-13 13:38:58 UTC
Created attachment 41088 [details]
backtrace, patched comment 5

Still crashes, new backtrace.
Comment 7 Alex Deucher 2010-12-16 06:58:44 UTC
*** Bug 30945 has been marked as a duplicate of this bug. ***
Comment 8 Alex Deucher 2010-12-16 06:58:50 UTC
*** Bug 32440 has been marked as a duplicate of this bug. ***
Comment 9 Alex Deucher 2010-12-28 19:34:28 UTC
*** Bug 32706 has been marked as a duplicate of this bug. ***
Comment 10 Alex Deucher 2011-01-02 16:59:55 UTC
*** Bug 32785 has been marked as a duplicate of this bug. ***
Comment 11 Alex Deucher 2011-01-03 16:25:59 UTC
Can you print accel_state->vbo.vb_offset when you get the crash?
Comment 12 Alex Deucher 2011-01-03 17:13:32 UTC
Created attachment 41603 [details] [review]
don't use vbos for constant buffers

Does this patch help?  It will hurt performance, but it might help narrow down where the problem is.  It allocates buffers on the fly for constant buffers rather than using the vbo pool.
Comment 13 Alex Deucher 2011-01-03 22:06:37 UTC
Some notes from Droste on IRC so they don't get lost:

<Droste> agd5f: https://bugs.freedesktop.org/show_bug.cgi?id=32188 <-- for some reason bo with an pointer (not null) get in the radeon_vbo_wait_list and radeon_vbo_reserved_list. that is the reason why radeon_vbo_space() is returning null and causing this segfault

<airlied> I remember some plan to check when adding to the list and get a backtrace if possible
<Droste> yeah that's where I got a 400mb gdb log ;-) I'm debugging with a single pc, so stopping X is not a choice ;-)
 (while debugging otherwise I can't VT switch)
<airlied> we could triggger an X crash when we see it and see if the X backtrace code finds it

Workaround:
void radeon_vbo_put(ScrnInfoPtr pScrn, struct radeon_vbo_object *vbo)
{

    if (vbo->vb_bo) {
        if (vbo->vb_bo->ptr) {
            radeon_bo_unmap(vbo->vb_bo);
        } else {
            LogMessage(X_WARNING, "radeon_vbo_put: vbo->vb_bo->ptr null! vbo->vb_bo->ptr: %p, vbo->vb_size: %d, vbo->vb_offset: %d, vbo->vb_start_op: %d, vbo->vb_total: %d, vbo->vb_op_vert_size: %d, vbo->verts_per_op: %d, vbo->vb_bo->size: %d\n", vbo->vb_bo->ptr, vbo->vb_size, vbo->vb_offset, vbo->vb_start_op, vbo->vb_total, vbo->vb_op_vert_size, vbo->verts_per_op, vbo->vb_bo->size);
        }
        radeon_bo_unref(vbo->vb_bo);
        vbo->vb_bo = NULL;
        vbo->vb_total = 0;
    }

    vbo->vb_offset = 0;
}
Comment 14 Alex Deucher 2011-01-03 22:25:36 UTC
*** Bug 32800 has been marked as a duplicate of this bug. ***
Comment 15 Chris Bandy 2011-01-04 10:39:54 UTC
Created attachment 41634 [details]
backtrace, server 1.7.7, patched comment 12

I recently updated to xorg-server-1.9.2 along with a slew of related libraries. To get this backtrace, I downgraded to xorg-server-1.7.7-r1 (not the libraries). Without patches, the segfault looks the same, so I expect this environment is still helpful.

This backtrace looks similar to the one I get when using xorg-server-1.9.2 and the patch from comment 12.

What may be interesting is that I get a different segfault (in EVERGREENPrepareSolid) when using xorg-server-1.9.2 without any patches. Would you like me to post that backtrace as well?
Comment 16 Tobias Droste 2011-01-05 18:56:43 UTC
Created attachment 41698 [details]
Backtraces of calls which introduce dma_bo->bo with a valid pointer (not null)

The attachement contains some backtraces of calls which lead to the problem I described in IRC:

bo with a valid pointer (ptr!=null) get in the bo lists. This causes the driver to wrongly unmap/map the bo removing the ptr from the bo without adding a new one. 
If one of the bos is going to be reused radeon_vbo_space() returns the bo without a valid pointer (ptr==null).

The last backtrace in the attachment show this case which leads to the problem described in the bug.

Software:
xserver, drm, ddx current git master and kernel current drm-next
Comment 17 Tobias Droste 2011-01-05 21:31:32 UTC
It looks like 

static int bo_is_busy(struct radeon_bo_int *boi, uint32_t *domain) {...}
in radeon_bo_gem.c:220 (libdrm) 

returns that a bo is not busy but the bo is still mapped and in use. 

That's why the bo with the valid pointer gets from bo_wait list to the bo_free list and is reused while still used by another part -> crash!
Comment 18 Alex Deucher 2011-01-06 19:21:11 UTC
*** Bug 30754 has been marked as a duplicate of this bug. ***
Comment 19 Michel Dänzer 2011-01-07 02:50:57 UTC
(In reply to comment #17)

bo_is_busy() only returns -EBUSY if the BO has been submitted for or is in processing by the GPU.

Does changing the test in radeon_vbo_flush_bos() to something like

	if (dma_bo->bo->ptr || !radeon_bo_is_idle(dma_bo->bo))

help?
Comment 20 Tobias Droste 2011-01-07 09:10:08 UTC
Yes, this would help. I suggested that, too (in IRC). The problem is that this is also a workaround and not a fix. At this point there shouldn't be a ptr. It seems like there is a missing map or unmap.
Comment 21 Siganderson 2011-01-20 08:57:45 UTC
I too have this bug with a radeon 5670, kubuntu 10.10, drm+mesa+xf86-video-ati git master (and with all other versions). It happens with r600c and r600g too.

I can reproduce it instantly activating kwin's desktop effects, then opening gimp and starting to paint something with the Paintbrush Tool. Xorg crashes and restarts.
Comment 22 Alex Deucher 2011-01-20 14:35:01 UTC
I still can't reproduce it on any of my evergreen cards using gmp, yakuake, kwin, etc.  At this point I'm tempted to just apply the workaround, as I don't see how we can end up with a non-NULL pointer on the list.
Comment 23 Chris Bandy 2011-01-20 14:47:23 UTC
I've been using the workaround from comment 13 for quite some time. I see "(WW) radeon_vbo_put: vbo->vb_bo->ptr null!" in my logs everday, but at least I'm not crashing any more.

Unlike Droste, I am able to debug remotely. How can I help track this down further?
Comment 24 Alex Deucher 2011-01-20 15:04:35 UTC
(In reply to comment #23)

> Unlike Droste, I am able to debug remotely. How can I help track this down
> further?

Try this patch:

diff --git a/src/radeon_vbo.c b/src/radeon_vbo.c
index c0a668f..86b57bd 100644
--- a/src/radeon_vbo.c
+++ b/src/radeon_vbo.c
@@ -132,6 +132,7 @@ void radeon_vbo_flush_bos(ScrnInfoPtr pScrn)
        remove_from_list(dma_bo);
        dma_bo->expire_counter = expire_at;
        insert_at_tail(&accel_state->bo_free, dma_bo);
+       assert(dma_bo->bo->ptr == NULL);
     }
 
     /* move reserved to wait list */

And get a backtrace with gdb when the assert happens.
Comment 25 Alex Deucher 2011-01-20 22:14:47 UTC
*** Bug 31414 has been marked as a duplicate of this bug. ***
Comment 26 Chris Bandy 2011-01-21 06:24:10 UTC
Created attachment 42268 [details]
backtrace, server 1.9.2, patched comment 24

sys-kernel/gentoo-sources-2.6.37
x11-base/xorg-server-1.9.2

x11-libs/libdrm    550fe2ca3b29ad2191eab4fdfbed9ed21e25492d
media-libs/mesa    af4e2f46653cbc7ceaf1291ba22087ec5758d07f
x11-drivers/xf86-video-ati    0a03f03a65aad925ba2d9c76b1d3356184607bf9
 + patch in comment 24

Got this crash by opening a large (6600x5100) jpg in gimp and moving the mouse into/over the window.
Comment 27 Eduard Bloch 2011-01-21 11:38:38 UTC
Created attachment 42285 [details]
Detailed backtrace

The workaround also solved the problem for me. I attach another backtrace which contains a walk around the stack and examination of various variables (CFLAGS="-g -O0").
Comment 28 Alex Deucher 2011-01-21 14:38:09 UTC
Do any of my commits to git master today help?
Comment 29 Tobias Droste 2011-01-21 16:12:15 UTC
Yes it seems to be fixed (for me). Neither yakuake nor gimp are causing a crash. Lets see what the others are saying and hope this is fixed finally.
Comment 30 Tobias Droste 2011-01-21 16:39:54 UTC
X doesn't crash anymore, but there are still bo in the reserved/wait list with a ptr and they sometimes get reused. With the new map/unmap behaviour this doesn't crash X anymore, but gives some screen damages. So it now works like it used to with my first workaround.
Comment 31 Chris Bandy 2011-01-21 22:38:47 UTC
I have similar results as comment 30. No more crashes in gimp, but still consistent corruption with certain apps.

In my case, distccmon-gui during emerge triggers this quite frequently.
Comment 32 Tobias Kaminsky 2011-01-21 22:45:22 UTC
For me it works, too.
I do not have any screen damages (yet). Just tested it 5 minutes.
Comment 33 Siganderson 2011-01-22 01:49:12 UTC
it's working on my system too; some corruption sometimes.
Comment 34 Stefan Dirsch 2011-01-22 02:33:36 UTC
Apparently the crashes have been fixed (some do still report screen corruption though). So does this still block a new radeon driver release with Evergreen support?
Comment 35 Eduard Bloch 2011-01-22 04:06:54 UTC
Alex' changes from Fri Jan 21 did also fix it for me, instead of the #13 workaround. Cannot tell about corruption, never got any (which seemed to be related to this problem).
Comment 36 boris64 2011-01-22 06:17:22 UTC
Seems to work for me as well, no more crashes in 
gimp (for 20 minutes, yay!). Looks like i finally 
can remove that nasty 'Option "RenderAccel" "off"'
from my xorg.conf.
Comment 37 Frédéric L. W. Meunier 2011-01-22 08:46:44 UTC
No more crashes with VirtualBox. I didn't see any screen corruption.
Comment 38 Alex Deucher 2011-01-23 12:51:46 UTC
(In reply to comment #30)
> X doesn't crash anymore, but there are still bo in the reserved/wait list with
> a ptr and they sometimes get reused.

I still can't understand how we get an unmapped bo with a ptr on the list.  Did anyone get the corruption issues with any of the previous workarounds?

Does this patch make any difference?

diff --git a/src/radeon_kms.c b/src/radeon_kms.c
index fbdb530..bfc9584 100644
--- a/src/radeon_kms.c
+++ b/src/radeon_kms.c
@@ -85,13 +85,13 @@ void radeon_cs_flush_indirect(ScrnInfoPtr pScrn)
        return;
 
     /* release the current VBO so we don't block on mapping it later */
-    if (info->accel_state->vbo.vb_offset && info->accel_state->vbo.vb_bo) {
+    if (info->accel_state->vbo.vb_bo) {
         radeon_vbo_put(pScrn, &info->accel_state->vbo);
         info->accel_state->vbo.vb_start_op = -1;
     }
 
     /* release the current VBO so we don't block on mapping it later */
-    if (info->accel_state->cbuf.vb_offset && info->accel_state->cbuf.vb_bo) {
+    if (info->accel_state->cbuf.vb_bo) {
         radeon_vbo_put(pScrn, &info->accel_state->cbuf);
         info->accel_state->cbuf.vb_start_op = -1;
     }
Comment 39 Alex Deucher 2011-01-23 13:25:08 UTC
Created attachment 42350 [details] [review]
various

Try this patch as well and let me know if you see corruption or the following:
"bo with pointer on wait list!"
in your xorg log.
Comment 40 Alex Deucher 2011-01-23 21:05:42 UTC
So all the issues seem to be fixed by the patch in comment 38 according to feedback on IRC.  My current hypothesis is that we end up mapping the cbuf bo in PrepareSolid() or PrepareComposite(), then never end up adding any verts to the vbo, or never end up calling DoneSolid() or DoneComposite().  Perhaps an EXA bug in some versions of the xserver?  If that cbuf is indeed the problem, this patch should also fix the issues:

diff --git a/src/radeon_kms.c b/src/radeon_kms.c
index fbdb530..4297f6b 100644
--- a/src/radeon_kms.c
+++ b/src/radeon_kms.c
@@ -91,7 +91,7 @@ void radeon_cs_flush_indirect(ScrnInfoPtr pScrn)
     }
 
     /* release the current VBO so we don't block on mapping it later */
-    if (info->accel_state->cbuf.vb_offset && info->accel_state->cbuf.vb_bo) {
+    if (info->accel_state->cbuf.vb_bo) {
         radeon_vbo_put(pScrn, &info->accel_state->cbuf);
         info->accel_state->cbuf.vb_start_op = -1;
     }
@@ -108,14 +108,6 @@ void radeon_cs_flush_indirect(ScrnInfoPtr pScrn)
     if (ret)
       ErrorF("space check failed in flush\n");
 
-    if (accel_state->cbuf.vb_bo) {
-       ret = radeon_cs_space_check_with_bo(info->cs,
-                                           accel_state->cbuf.vb_bo,
-                                           RADEON_GEM_DOMAIN_GTT, 0);
-       if (ret)
-           ErrorF("space check failed in flush\n");
-    }
-
     if (info->reemit_current2d && info->state_2d.op)
         info->reemit_current2d(pScrn, info->state_2d.op);
Comment 41 Michel Dänzer 2011-01-24 03:32:44 UTC
(In reply to comment #40)
> So all the issues seem to be fixed by the patch in comment 38 according to
> feedback on IRC.  My current hypothesis is that we end up mapping the cbuf bo
> in PrepareSolid() or PrepareComposite(), then never end up adding any verts to
> the vbo, or never end up calling DoneSolid() or DoneComposite().  Perhaps an
> EXA bug in some versions of the xserver?

There have certainly been cases where EXA called the Prepare/Done hooks but not the actual operation hook in between. Dave fixed at least one of them at some point, though it might still be a good idea to make the driver robust against this.

I've been wondering if any of these issues might be related to running out of CS space during operation hook calls. The hooks in radeon_exa_funcs.c and radeon_exa_render.c handle this by calling the Done hook followed by radeon_cs_flush_indirect(), but I can't seem to find any corresponding code for >= R6xx.


> @@ -108,14 +108,6 @@ void radeon_cs_flush_indirect(ScrnInfoPtr pScrn)
>      if (ret)
>        ErrorF("space check failed in flush\n");
> 
> -    if (accel_state->cbuf.vb_bo) {
> -       ret = radeon_cs_space_check_with_bo(info->cs,
> -                                           accel_state->cbuf.vb_bo,
> -                                           RADEON_GEM_DOMAIN_GTT, 0);
> -       if (ret)
> -           ErrorF("space check failed in flush\n");
> -    }
> -
>      if (info->reemit_current2d && info->state_2d.op)
>          info->reemit_current2d(pScrn, info->state_2d.op);

Why would this no longer be necessary? Where is it checked whether the constant buffer will fit?
Comment 42 Tobias Droste 2011-01-24 06:45:02 UTC
(In reply to comment #40)
> If that cbuf is indeed the problem,
> this patch should also fix the issues:

With this last patch applied the problem stays fixed. Your hypothesis seems to be correct and the cbuf is the problem.
Comment 43 Chris Bandy 2011-01-24 07:08:16 UTC
(In reply to comment #40)
x11-drivers/xf86-video-ati fadee0409a8e13b78bbccb83dd70f590fee23d57

The patches in comment 38, comment 39 and comment 40 each (individually) fix the corruption I was experiencing.

I did not get any messages in the log using comment 39.
Comment 44 Alex Deucher 2011-01-24 08:00:16 UTC
(In reply to comment #41)
> 
> There have certainly been cases where EXA called the Prepare/Done hooks but not
> the actual operation hook in between. Dave fixed at least one of them at some
> point, though it might still be a good idea to make the driver robust against
> this.
> 
> I've been wondering if any of these issues might be related to running out of
> CS space during operation hook calls. The hooks in radeon_exa_funcs.c and
> radeon_exa_render.c handle this by calling the Done hook followed by
> radeon_cs_flush_indirect(), but I can't seem to find any corresponding code for
> >= R6xx.

We shouldn't run out of CS space during the op hook calls since we are using vbos not inline verts.

> 
> Why would this no longer be necessary? Where is it checked whether the constant
> buffer will fit?

The cbuf bo is NULL at this point due to the radeon_vbo_put() just before so the code is no longer necessary.
Comment 45 Alex Deucher 2011-01-24 10:03:12 UTC
I've gone ahead and committed the fixes.  I think we can call this closed finally.
Comment 46 Siganderson 2011-01-25 03:32:20 UTC
With last git master everything seems to work correctly. I have not corruptions anymore.
Comment 47 Siganderson 2011-01-25 10:23:44 UTC
I don't know if this is important but I launched gimp, and after some second these lines appeared into the dmesg output:
[ 3921.976683] radeon 0000:03:00.0: f4046000 reserve failed for wait
[ 3922.992906] radeon 0000:03:00.0: f4046000 reserve failed for wait
Comment 48 Tobias Droste 2011-01-26 05:59:57 UTC
(In reply to comment #47)
> I don't know if this is important but I launched gimp, and after some second
> these lines appeared into the dmesg output:
> [ 3921.976683] radeon 0000:03:00.0: f4046000 reserve failed for wait
> [ 3922.992906] radeon 0000:03:00.0: f4046000 reserve failed for wait

I can't confirm this. This may be another bug. I also think we can close this one.


Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.