Bug 90537 - radeonsi bo/va conflict on RADEON_GEM_VA (rscreen->ws->buffer_from_handle returns NULL)
Status: RESOLVED FIXED
Alias: None
Product: Mesa
Classification: Unclassified
Component: Drivers/Gallium/radeonsi
Version: 10.5
Hardware: x86 (IA32) other
Importance: low trivial
Assignee: Default DRI bug account
QA Contact: Default DRI bug account
Reported: 2015-05-20 14:06 UTC by pstglia
Modified: 2015-06-25 02:46 UTC (History)

Attachments
dmesg output (drm.debug=7) (255.92 KB, text/plain)
2015-05-20 14:06 UTC, pstglia
winsys/radeon: Unmap VA when destroying BO (1.48 KB, patch)
2015-05-21 02:12 UTC, Michel Dänzer
logs_10_5_4_HD7950_full_debug.zip - dmesg drm.debug=7/dumpsys/logcat (441.96 KB, application/zip)
2015-05-24 14:26 UTC, pstglia
logs_10_6rc1_HD7950_full_debug.zip - dmesg debug.drm=7/logcat (1.48 MB, application/zip)
2015-05-24 14:29 UTC, pstglia
dmesg (without drm.debug=7) and logcat for HD7950 (195.23 KB, application/zip)
2015-05-26 16:16 UTC, Mauro Rossi

Description pstglia 2015-05-20 14:06:42 UTC
Created attachment 115927 [details]
dmesg output (drm.debug=7)

Hi, we're trying to make radeonsi work on Android-x86 (a port of AOSP to the x86 architecture).

We get graphical output, but in some circumstances (certain apps, like Antutu Benchmark) we receive a "radeon bo/va conflict with bo/va" error from the kernel. When this happens, graphical components crash.

As far as I could check, this error occurs in "r600_texture_from_handle" (which is called from the native_android.cpp gallium state tracker via screen->resource_from_handle; use_drm is set).

The call to "rscreen->ws->buffer_from_handle" returns NULL. In the kernel dmesg (drm.debug=7), we get this ioctl error:

...
<7>[  244.847256] [drm:drm_ioctl] pid=3469, dev=0xe200, auth=0, DRM_IOCTL_GEM_CLOSE
<7>[  244.847269] [drm:drm_ioctl] pid=3469, dev=0xe200, auth=0, DRM_IOCTL_GEM_CLOSE
<7>[  244.848972] [drm:drm_ioctl] pid=4171, dev=0xe200, auth=0, RADEON_GEM_CREATE
<7>[  244.849030] [drm:drm_ioctl] pid=4171, dev=0xe200, auth=0, RADEON_GEM_VA
<7>[  244.849084] [drm:drm_ioctl] pid=4171, dev=0xe200, auth=0, RADEON_GEM_MMAP
<7>[  244.849150] [drm:drm_ioctl] pid=4171, dev=0xe200, auth=0, RADEON_GEM_CREATE
<7>[  244.849177] [drm:drm_ioctl] pid=4171, dev=0xe200, auth=0, RADEON_GEM_VA
<3>[  244.849191] radeon 0000:00:01.0: bo ccf78800 va 0x0000000858 conflict with (bo cec3a000 0x0000000869 0x000000086a)
<7>[  244.849199] [drm:drm_ioctl] ret = -22
<7>[  244.849251] [drm:drm_ioctl] pid=4171, dev=0xe200, auth=0, DRM_IOCTL_GEM_CLOSE
<7>[  244.853770] [drm:drm_release] open_count = 5
<7>[  244.853782] [drm:drm_release] pid = 3464, device = 0xe200, open_count = 5
...
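
The conflict line in the log above names a new mapping that starts at 0x858 and an existing mapping bounded by 0x869 and 0x86a. As a rough illustration of the kind of overlap test that produces such a message, here is a minimal sketch. It assumes the printed numbers are bounds of half-open ranges; `va_range`, `va_ranges_overlap` and `va_find_conflict` are hypothetical names, not kernel functions, and the size of the new BO is not in the log.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* A GPU virtual address range, half-open: [start, end). */
struct va_range {
    uint64_t start;
    uint64_t end;
};

/* Two half-open ranges overlap when each one starts before the other ends. */
static bool va_ranges_overlap(struct va_range a, struct va_range b)
{
    return a.start < b.end && b.start < a.end;
}

/*
 * Return the index of the first existing mapping that overlaps `req`,
 * or -1 if the requested range is free.
 */
static int va_find_conflict(const struct va_range *mapped, int count,
                            struct va_range req)
{
    assert(req.start < req.end);   /* an empty request makes no sense */
    for (int i = 0; i < count; i++) {
        if (va_ranges_overlap(mapped[i], req))
            return i;
    }
    return -1;
}
```

With one existing mapping {0x869, 0x86a}, a request starting at 0x858 conflicts as soon as its end goes past 0x869, while a request {0x858, 0x869} does not. The point of the sketch is that a stale mapping left behind by an earlier BO is enough to make a later, otherwise valid, request fail.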

I suppose "buffer_from_handle" in this case maps to the "radeon_winsys_bo_from_handle" function. If this assumption is correct, the error occurs during this call:


        /* Ask the kernel to map the BO at its GPU virtual address. */
        va.handle = bo->handle;
        va.operation = RADEON_VA_MAP;
        va.vm_id = 0;
        va.offset = bo->va;
        va.flags = RADEON_VM_PAGE_READABLE |
                   RADEON_VM_PAGE_WRITEABLE |
                   RADEON_VM_PAGE_SNOOPED;
        r = drmCommandWriteRead(ws->fd, DRM_RADEON_GEM_VA, &va, sizeof(va));
        /* This is the check that fails here and makes the function return NULL. */
        if (r && va.operation == RADEON_VA_RESULT_ERROR) {
            fprintf(stderr, "radeon: Failed to assign virtual address space\n");
            radeon_bo_destroy(&bo->base);
            return NULL;
        }

I tried using kernel 4.0.3 (which contains some kernel patches apparently related to this):

drm/radeon: fix lockup when BOs aren't part of the VM on release 
drm/radeon: reset BOs address after clearing it. 
drm/radeon: check new address before removing old one 

But the same error happens.

I'd like some help to find out what's wrong (a bug in Mesa/drm or a wrong configuration on the Android side):

- I'm building for a 32-bit environment. Can this cause the problem I described? Maybe the driver/drm/mesa works better in a 64-bit environment?

- For graphics buffer management (alloc, map, unmap, etc.) we have drm_gralloc, which is based on xf86-video-ati. If possible, can you take a quick look and see if there's something that needs to be changed in particular for radeonsi? (gralloc_drm_radeon.c on http://git.android-x86.org/?p=platform/hardware/drm_gralloc.git;a=tree;h=refs/heads/lollipop-x86;hb=refs/heads/lollipop-x86)

Any help is appreciated. Thank you!
pstglia

PS: It works nicely with the r600g driver/hardware.
Comment 1 pstglia 2015-05-20 14:26:29 UTC
Additional Notes:
 - Mesa version tested is 10.5.4
 - libdrm version used is 2.4.61
 - kernel versions tested are 4.0 and 4.0.3
 - llvm version is 3.5.0svn (used by AOSP)

Hardware tested:
 - AMD E1 2100 (KABINI)
 - AMD R7 240 (OLAND)
Comment 2 Michel Dänzer 2015-05-21 02:12:41 UTC
Created attachment 115936 [details] [review]
winsys/radeon: Unmap VA when destroying BO

I think this is most likely because we're not unmapping the virtual address range when closing a GEM handle.

Does this Mesa patch help? Note that you'll also need the kernel fixes in 4.0.3, or this patch will likely cause kernel hangs.
Comment 3 pstglia 2015-05-21 19:14:59 UTC
Hi Michel,

Thanks a lot for your help!

I tested the patch with AMD E1 2100 (KABINI) and it worked OK (no more bo/va conflicts; I used kernel 4.0.3 as you pointed out). The same app which was producing the crash (Antutu Benchmark) ran successfully (2D and 3D tests).

Now I'm uploading the patched ISO. I will ask on the Android-x86 message group for more volunteers to test it, just to confirm it's OK on more hardware. I'll post the ISO link here when the upload is complete.


Again, thanks a lot for your nice work!

pstglia
Comment 4 Christian König 2015-05-21 19:19:39 UTC
Watch out for memory leaks as well.

When mesa drops the last reference to the BO by closing its handle, the VA mapping should go away automatically. That's the reason why it works for X and EGL.

What you see here is that you manually need to remove the VA mapping because there is still a reference to the BO outside of mesa.

Could be intentional, but could be a bug as well.

Anyway, Michel's patch is clearly a good idea.
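
To make the reference counting argument above concrete, here is a toy model, not driver code. It assumes a BO carries a single VA mapping that is released automatically only when the last reference is dropped; `toy_bo` and its helpers are hypothetical names invented for this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy buffer object: one VA mapping shared by every reference to it. */
struct toy_bo {
    int refcount;      /* references held inside and outside of mesa */
    bool va_mapped;    /* whether the VA range is still reserved */
    uint64_t va;       /* start of the reserved range */
};

/* Reserve a VA range for the BO (models RADEON_VA_MAP). */
static void toy_bo_map(struct toy_bo *bo, uint64_t va)
{
    bo->va = va;
    bo->va_mapped = true;
}

/* Explicitly drop the mapping (models RADEON_VA_UNMAP). */
static void toy_bo_unmap(struct toy_bo *bo)
{
    bo->va_mapped = false;
}

/*
 * Drop one reference. The mapping disappears on its own only when the
 * last reference goes away.
 */
static void toy_bo_unref(struct toy_bo *bo)
{
    assert(bo->refcount > 0);
    if (--bo->refcount == 0)
        bo->va_mapped = false;
}
```

Starting with a refcount of 2 (mesa plus one holder outside of mesa), a single `toy_bo_unref` leaves `va_mapped` set, so the range stays reserved even though mesa considers it free again. Only an explicit `toy_bo_unmap`, or the second `toy_bo_unref`, releases it. With a refcount of 1, the X and EGL case described above, the unref alone is enough.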
Comment 5 pstglia 2015-05-21 20:09:39 UTC
Hi Christian,

> Watch out for memory leaks as well.
> 
> When mesa drops the last reference to the BO by closing its handle the VA
> mapping should go away automatically. That's the reason why it works for X
> and EGL.
> 
> What you see here is that you manually need to remove the VA mapping because
> there is still a reference to the BO outside of mesa.
> 
> Could be intentional, but could be a bug as well.

Thanks for pointing this out. Just a question: in this case (a bug in how the application maps BOs/VAs), shouldn't we see a crash with other drivers too? With r600g we don't have this problem (tested with an ARUBA A10 5800K), just on radeonsi (at least in the tests I performed, which are basically running "Antutu Benchmark" and some other apps).

However, r600g and other drivers/hardware may of course handle this in a different manner, which could explain why we don't experience this on other hardware.

> 
> Anyway, Michel's patch is clearly a good idea.

I've uploaded a test Android-x86 ISO with Michel's patch applied. It uses Lollipop (5.1), Mesa 10.5.4, kernel 4.0.3 and LLVM 3.5svn (AOSP). It also has radeonsi enabled (this is the main purpose). In case anyone wants to test it on radeonsi hardware, here's the link:
https://drive.google.com/file/d/0Bx12U5yGNcQCRWdoWWtNdC0weW8/view?usp=sharing

Thank you all!

pstglia
Comment 6 Christian König 2015-05-22 08:53:15 UTC
(In reply to pstglia from comment #5) 
> Thanks for pointing this out. Just a question: In this case (bug on
> application on mapping BOs/VAs) shouldn't we have a crash using other
> drivers? With r600g we don't have this problem (tested with an ARUBA A10
> 5800K), just on radeonsi (at least on the tests I performed, which are
> basically running "Antutu Benchmark" and some other apps).

No crash, but just no more memory after a while. The reason why it works with X and other desktop environments for radeonsi is that when mesa drops the last reference, the VA range is freed automatically as well.

But in your case it looks like mesa drops the last reference and the buffer isn't freed. This means that somebody still has a reference to the buffer, which is a bit odd.

It probably points to a bug somewhere.

> 
> However r600g/other drivers/hardwares can handle this in a different manner
> of course, which could explain why we don't experience this on other
> hardwares.

R600 hardware uses command stream patching instead of virtual memory, so you never run into this issue.

Regards,
Christian.
Comment 7 Michel Dänzer 2015-05-22 09:30:24 UTC
(In reply to Christian König from comment #6)
> 
> But in your case it looks like that mesa drops the last reference and the
> buffer isn't freed. This means that somebody has still a reference to the
> buffer which is a bit odd.
> 
> It probably points to a bug somewhere.

Not necessarily if the BO is shared. E.g. in the Xorg radeon driver we have to leak a reference to the fbcon BO because of this issue.

Another possibility is that the same BO is referenced by several GEM handles in the same DRM fd, e.g. because it's re-imported via GEM flink. In that case, my patch is probably the wrong solution, as it would make the BO unusable via the other GEM handle.
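
The second possibility can be illustrated with a toy model. It assumes, as discussed in this thread, that the VA mapping is tracked per BO rather than per GEM handle; `shared_bo`, `toy_handle` and the helpers are hypothetical names, and `toy_close_with_unmap` only mimics the idea of unmapping on destroy, not the attached patch itself.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy kernel-side BO: the VA mapping is tracked per BO, not per handle. */
struct shared_bo {
    int handle_count;
    bool va_mapped;
};

/* Toy GEM handle: just a pointer to the BO it names. */
struct toy_handle {
    struct shared_bo *bo;
};

/* Import the BO again (for example via flink), creating another handle. */
static struct toy_handle toy_import(struct shared_bo *bo)
{
    bo->handle_count++;
    return (struct toy_handle){ .bo = bo };
}

/* Close a handle, unmapping the VA range first. */
static void toy_close_with_unmap(struct toy_handle *h)
{
    assert(h->bo != NULL && h->bo->handle_count > 0);
    h->bo->va_mapped = false;   /* affects every handle to this BO */
    h->bo->handle_count--;
    h->bo = NULL;
}

/* A handle is usable by the GPU only while its BO still has a mapping. */
static bool toy_handle_usable(const struct toy_handle *h)
{
    return h->bo != NULL && h->bo->va_mapped;
}
```

If handles `a` and `b` both come from `toy_import` on the same mapped BO, calling `toy_close_with_unmap(&a)` also makes `toy_handle_usable(&b)` false, although `b` was never closed. That is the failure mode described above for a BO reachable through several GEM handles on the same DRM fd.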
Comment 8 pstglia 2015-05-22 11:49:29 UTC
Christian,
Thanks for the explanations.

Michel/Christian,

If there's anything I can do to provide more info/debug, please let me know.

We've tested current patch on E1 2100 and Radeon HD 7750 - No problems observed so far.

Thanks,
pstglia
Comment 9 Christian König 2015-05-22 12:07:55 UTC
(In reply to pstglia from comment #8)
> We've tested current patch on E1 2100 and Radeon HD 7750 - No problems
> observed so far.

Sounds good.

BTW: Is it sufficient to put the ISO on a USB stick with usb-creator-gtk to get a bootable Android?

If yes then I'm clearly going to try that sooner or later.
Comment 10 pstglia 2015-05-22 12:16:15 UTC
> BTW: Is it sufficient to put the ISO on an USB stick with usb-creator-gtk to
> get a bootable Android?
> 
> If yes then I'm clearly going to try that sooner or later.

Never tried usb-creator-gtk. But "dd" works just fine (/dev/sdb is the device mapped to my usb flash drive):

dd if=android-x86-5.1_kernel_4.0.3_mesa_10.5.4_radeonsi_test_20150521.iso of=/dev/sdb bs=65535
Comment 11 pstglia 2015-05-23 01:22:12 UTC
(In reply to pstglia from comment #10)
> > BTW: Is it sufficient to put the ISO on an USB stick with usb-creator-gtk to
> > get a bootable Android?
> > 
> > If yes then I'm clearly going to try that sooner or later.
> 
> Never tried usb-creator-gtk. But "dd" works just fine (/dev/sdb is the
> device mapped to my usb flash drive):
> 
> dd if=android-x86-5.1_kernel_4.0.3_mesa_10.5.4_radeonsi_test_20150521.iso
> of=/dev/sdb bs=65535

And yes, it is a bootable Android image (bootable in legacy BIOS mode; a UEFI image can also be generated). This image has no Google apps (but you can download apps/APKs from sites like www.apkmirror.com) and no houdini (ARM binary translator). Many apps can run on it (those which depend on Dalvik) and some have x86 versions (like the Angry Birds series).

About the tests, still waiting for more feedback from more users ( Android-x86 topic is https://groups.google.com/forum/#!topic/android-x86/AayRmQiZAlw )
Comment 12 pstglia 2015-05-24 14:25:04 UTC
Updating test results:

These cards worked OK with the patch applied (it solves the bo/va conflict observed in some apps, "Antutu Benchmark" in particular):

 - E1 2100 (KABINI)
 - Radeon HD 7750 (BONAIRE)
 - Radeon R7 240 (OLAND)

One member reported that the patch is causing stability problems on the HD 7950 (TAHITI). I've attached the full dmesg. The volunteer tested the patch with Mesa 10.5.4 and 10.6 RC1, with the same behavior in both cases according to him:

"
Hi,

reporting test results on HD7950.

I could launch Antutu this time, but I could not complete the 3D rendering session.
The RADEON_VA_UNMAP patch is producing a regression in the stability of the GUI.

There are still cases where the bo conflict appears in dmesg, and I noticed crashes that produce a deadlock. The GUI sometimes restarts OK, but at some point it stays perpetually on the Android logo.
In between I've seen very strange things happen: simple OpenGL apps working fine but producing restarts when pressing the home button or closing them.
"

I attached dmesg logs he provided:
 - logs_10_5_4_HD7950_full_debug.zip (Mesa 10.5.4)
 - logs_10_6rc1_HD7950_full_debug.zip (Mesa 10.6 RC1)

Regards,
pstglia
Comment 13 pstglia 2015-05-24 14:26:51 UTC
Created attachment 116008 [details]
logs_10_5_4_HD7950_full_debug.zip - dmesg drm.debug=7/dumpsys/logcat
Comment 14 pstglia 2015-05-24 14:29:13 UTC
Created attachment 116009 [details]
logs_10_6rc1_HD7950_full_debug.zip - dmesg debug.drm=7/logcat
Comment 15 Christian König 2015-05-25 12:10:01 UTC
(In reply to pstglia from comment #12)
> It was reported by one member that patch is causing stability problems for
> HD 7950 (TAHITI). I've attached the full dmesg. The volunteer tested the
> patch with Mesa 10.5.4 and 10.6 RC1. Same behavior in both cases according
> to him

The mesa version is uninteresting, but what kernel version are you using?

Older kernels have stability problems when you unmap the BO VA. That's why we don't use it in mesa by default.

Regards,
Christian.
Comment 16 pstglia 2015-05-25 16:32:52 UTC
> The mesa version is uninteresting, but what kernel version are you using?
> 
> Older kernels have stability problems when you unmap the BO VA. That's why
> we don't use it in mesa by default.
> 
> Regards,
> Christian.

We are using kernel 4.0.3
Comment 17 Michel Dänzer 2015-05-26 08:12:37 UTC
My feeling is that the only proper solution might be to track the VA ranges per GEM handle instead of per BO in the kernel. Christian, would that be feasible in the radeon driver?

BTW, I think the dmesg output would be more useful without all the debugging output enabled. That makes it like looking for a needle in a haystack.
Comment 18 Christian König 2015-05-26 09:06:17 UTC
(In reply to Michel Dänzer from comment #17)
> My feeling is that the only proper solution might be to track the VA ranges
> per GEM handle instead of per BO in the kernel. Christian, would that be
> feasible in the radeon driver?

I've been working on that problem for years now; tracking VA ranges per GEM handle isn't really doable either (e.g. without breaking backward compatibility). I haven't come up with a good solution so far.

> BTW, I think the dmesg output would be more useful without all the debugging
> output enabled. That makes it like looking for a needle in a haystack.

Yeah, agreed. That is a bit too much.
Comment 19 Christian König 2015-05-26 09:06:42 UTC
(In reply to pstglia from comment #16)
> > The mesa version is uninteresting, but what kernel version are you using?
> > 
> > Older kernels have stability problems when you unmap the BO VA. That's why
> > we don't use it in mesa by default.
> > 
> > Regards,
> > Christian.
> 
> We are using kernel 4.0.3

Strange, that kernel should work perfectly fine.
Comment 20 Michel Dänzer 2015-05-26 09:30:54 UTC
(In reply to Christian König from comment #19)
> > We are using kernel 4.0.3
> 
> Strange, that kernel should work perfectly fine.

I suspect the problem of that Tahiti user might not be directly related to the patch. Or maybe he's experiencing the problematic aspects of unmapping the VA range for all GEM handles...

(In reply to Christian König from comment #18)
> I'm working on that problem for years now, tracking VA ranges per GEM handle
> isn't really doable either (e.g. without breaking backward compatibility).

How would it break backwards compatibility? I'm not sure how not tracking the VA ranges per GEM handle could ever work as expected with several GEM handles referencing the same BO.

Anyway, if you think the Mesa patch is the best we can do for now, there would at least need to be a way for userspace to know it's safe to unmap the VA range, e.g. an info query.
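
One possible shape for such a check is sketched below. It uses a comparison against the driver interface version in place of a dedicated info query, which is a substitution made for this illustration, not something proposed in this thread. `iface_version`, `toy_winsys` and `toy_winsys_init` are hypothetical names, and the sketch does not know which real version, if any, would be the first safe one, so the caller supplies it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Kernel driver interface version as seen by userspace. */
struct iface_version {
    int major;
    int minor;
};

/* Toy winsys state: the detected version and the capability derived from it. */
struct toy_winsys {
    struct iface_version drm;
    bool va_unmap_working;
};

/* True when `have` is the same as or newer than `need`. */
static bool version_at_least(struct iface_version have,
                             struct iface_version need)
{
    if (have.major != need.major)
        return have.major > need.major;
    return have.minor >= need.minor;
}

/*
 * Decide once, at winsys creation, whether destroying a BO may unmap its
 * VA range. `first_safe` is a caller-supplied assumption.
 */
static void toy_winsys_init(struct toy_winsys *ws, struct iface_version drm,
                            struct iface_version first_safe)
{
    assert(ws != NULL);
    ws->drm = drm;
    ws->va_unmap_working = version_at_least(drm, first_safe);
}
```

The BO destroy path would then issue the unmap only when `va_unmap_working` is set, so that older kernels keep the previous behavior and avoid the hangs mentioned earlier in this report.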
Comment 21 Mauro Rossi 2015-05-26 16:16:36 UTC
Created attachment 116054 [details]
dmesg (without drm.debug=7) and logcat for HD7950

Hi, 

here is the dmesg, without drm.debug=7, and the logcat collected on HD7750 with the proposed VA BO patch applied.

VA conflict is still happening.

<3>[  807.802668] radeon 0000:01:00.0: bo f085f400 va 0x0000001138 conflict with (bo f183c400 0x000000121c 0x000000121d)

[behavior without RADEON_VA_UNMAP patch, i.e. initial status]

Without the VA BO patch applied, the GUI was stable on HD7750 and HD7950 and many OpenGLES apps were fine (Gears, Rajawali, Harism Effects, Harism Shaderize), but Antutu benchmark could not be launched: it failed to allocate the bo for 'atlas' (NOTE: I don't know what 'atlas' is).
OpenGLES 1.0 Demo com.rtsw.opengldemo cannot reach the second cube demo (app crashing and GUI restarting).

[behavior with RADEON_VA_UNMAP patch]
Logcat logs show "GPU lockups" that cannot be recovered until reboot. We also see intermittent GUI instability, and sometimes the GUI does not start.
Antutu can start and pass the 'atlas' bo allocation, but cannot complete 3D tests.
OpenGLES 1.0 Demo com.rtsw.opengldemo cannot reach the second cube demo (app crashing and GUI restarting).

It's strange that the OpenGLES 1.0 Demo is not working either way.

Question (please be patient, I'm not as skilled as you are): could some of these problems be due to multicore or hyperthreading, or be caching related?

If it can help, I can provide further logs with kernel parameters maxcpus=1, maxcpus=0 or by disabling hyperthreading in the BIOS.

If the problems were caching related, is there a way to highlight this?

Mauro Rossi
Comment 22 Christian König 2015-05-26 18:09:10 UTC
(In reply to Mauro Rossi from comment #21)
> Question (please be patient I'm not skilled as you are): could some of these
> problem due to multicore, hyperthreading or be cacheing related?

No, not at all. What we have here is a driver internal problem, completely unrelated to the number of CPU cores.
Comment 23 Christian König 2015-05-26 18:12:23 UTC
(In reply to Michel Dänzer from comment #20)
> How would it break backwards compatibility?

You would need to allow multiple mappings into the same address space per BO.

Which is exactly what I did for amdgpu, but IIRC that would break the userspace interface because you won't return the mapped address any more when you try to map it multiple times....

> I'm not sure how not tracking
> the VA ranges per GEM handle could ever work as expected with several GEM
> handles referencing the same BO.

Actually it can indeed never work correctly. What we do all the time is try very hard to avoid the case where several GEM handles reference the same BO.
Comment 24 Michel Dänzer 2015-05-27 10:00:45 UTC
(In reply to Christian König from comment #23)
> > How would it break backwards compatibility?
> 
> You would need to allow multiple mappings into the same address space per BO.
> 
> Which is exactly what I've did for amdgpu, but IIRC that would break the
> userspace interface because you won't return the mapped address any more
> when you try to map it multiple times....

What would that break? It could result in the same BO having several representations in userspace, but (why) is that a problem?


> > I'm not sure how not tracking the VA ranges per GEM handle could ever work
> > as expected with several GEM handles referencing the same BO.
> 
> Actually it can indeed never work correctly. What we just do all the time is
> trying to avoid the case that several GEM handles reference the same BO very
> hard.

I'm afraid we can't always avoid that though, e.g. when sharing BOs between glamor and the Xorg driver.

