91704 – Tonga startx fails since mesa/drm 56d8dd6 amdgpu: make vamgr per device v2

Bug 91704 - Tonga startx fails since mesa/drm 56d8dd6 amdgpu: make vamgr per device v2

Summary: Tonga startx fails since mesa/drm 56d8dd6 amdgpu: make vamgr per device v2

Status:	RESOLVED FIXED

Alias:	None

Product:	DRI
Classification:	Unclassified
Component:	DRM/AMDgpu (show other bugs)
Version:	DRI git
Hardware:	Other All

Importance:	medium normal
Assignee:	Default DRI bug account
QA Contact:

URL:
Whiteboard:
Keywords:

Duplicates (1):	91708 (view as bug list)
Depends on:
Blocks:

Reported:	2015-08-20 18:19 UTC by Andy Furniss
Modified:	2015-08-24 11:27 UTC (History)
CC List:	1 user (show)

See Also:
i915 platform:
i915 features:

Attachments
xorg log (8.10 KB, text/plain) 2015-08-20 18:19 UTC, Andy Furniss	no flags	Details
xorg log modesetting aiglx fails (30.85 KB, text/plain) 2015-08-20 22:06 UTC, Andy Furniss	no flags	Details
egl fail causing scrolling issue (29.88 KB, text/plain) 2015-08-21 09:58 UTC, Andy Furniss	no flags	Details
va_test (2.49 KB, text/plain) 2015-08-22 13:29 UTC, Jammy Zhou	no flags	Details
Debug output (2.12 KB, text/plain) 2015-08-23 09:17 UTC, Mathias Tillman	no flags	Details
Potential fix (485 bytes, patch) 2015-08-23 09:44 UTC, Mathias Tillman	no flags	Details \| Splinter Review
View All

Description Andy Furniss 2015-08-20 18:19:03 UTC

Created attachment 117820 [details]
xorg log

With latest libdrm git I can't startx due to a segfault shown in attached xorg log.

dmesg logs several -

amdgpu 0000:01:00.0: bo ffff8800bd34b800 va 0x0000100000-0x0000100400 conflict with 0x0000100000-0x0000100400

This happens with mesa/drm on commit 

56d8dd6 amdgpu: make vamgr per device v2

On the commit before that -

ffa305d amdgpu: add flag to support 32bit VA address v4

I can startx without error but performance is broken eg. scrolling in xterm takes about a second to refresh the screen.

Comment 1 Alex Deucher 2015-08-20 20:50:07 UTC

*** Bug 91708 has been marked as a duplicate of this bug. ***

Comment 2 Andy Furniss 2015-08-20 21:20:53 UTC

Further testing - booting into a pre existing drm-fixes-4.2 has the same issue.

The modesetting ddx works OK and using that I still have OpenGL, though uvd will fail with dmesg errors as above.

Comment 3 Andy Furniss 2015-08-20 22:06:13 UTC

Created attachment 117831 [details]
xorg log modesetting aiglx fails

When I say OK for modesetting there are still errors with dri2 and AIGLX falls back to s/w.

Which makes me a bit confused as to why xonotic-linux64-glx is working OK - in fact it benchmarks 20% faster than I've seen so far on tonga. I can't be using llvm with this test - it's set everything as high as it will go and I get 64 fps on the big keybench.

Comment 4 Jammy Zhou 2015-08-21 03:56:20 UTC

Were you only changing the libdrm_amdgpu for the bisecting? It looks like those two libdrm commits you mentioned shouldn't cause this VA confict and performance issue for xterm scrolling. Which kernel were you using?

Comment 5 Andy Furniss 2015-08-21 09:58:17 UTC

Created attachment 117837 [details]
egl fail causing scrolling issue

I am using agd5f drm-next-4.3-wip but have tested an older drm fixes.

I am just changing libdrm, but I do rebuild mesa after (and ddx usually). On heads I also tried rebuilding xserver + kernel as well, just in case in magically started working by doing that :-)

As for the scrolling issue - I didn't look into it as I was just "passing" while looking at the main issue.

Looking now I can see why - glamor/egl fail log attached. I have after resetting onto ffa305d this time, rebuilt xserver as well and checked output. Mesa xserver ddx all build OK AFAICT and the issue persists.

Maybe these commits are causing some strange build issues for stuff that depends on/uses libdrm headers or something.

Thanks to Mathias taking time to report with obiaf ppas, at least I know it's not just me (using an old LFS set up).

Comment 6 Mathias Tillman 2015-08-21 11:54:52 UTC

I can confirm that reverting to ffa305d0fc926418e4dff432381ead8907dc18d9 makes it work, before it would fail before reaching the dm login page (I'm using sddm), and now it reaches the login page. I also did a manual unpatching of 56d8dd6a9c03680700e0b0043cb56e0af7e3e3de when on the latest commit, which gave me the same result.

The conflict looks to me like it's trying to initialize the device multiple times - I did some debugging, and it reaches the amdgpu_device_initialize function several times with the same virtual address. I'm not sure if this is intentional or not, but it does seem a bit odd to me. Maybe there should be a saved global state of vamgr's that corresponds to the virtual address? Or is it just not supposed to initialize the same device multiple times?

Comment 7 Jammy Zhou 2015-08-22 13:29:58 UTC

Created attachment 117860 [details]
va_test

Comment 8 Jammy Zhou 2015-08-22 13:31:24 UTC

Thanks for the information. But it seems not expected. Even if amdgpu_device_initialize is called multiple times with different fds (returned by opening the device nodes) in one single process, only one amdgpu_device should be created, since a hash table is maitained for the created devices. In this case, the per-device vamgr should only be initialized as well in one process.I wrote a simple test application (see attached), and it worked for me quite well. Could you try it on your failing system?

Comment 9 Andy Furniss 2015-08-22 16:20:39 UTC

(In reply to Jammy Zhou from comment #7)
> Created attachment 117860 [details]
> va_test

Where to compile this from - and why is 

#include "amdgpu_drm.h"

there twice - is the second one supposed to be something else (changing the includes to <libdrm/amdgpu_drm.h> etc. won't build.

Comment 10 Andy Furniss 2015-08-22 19:36:54 UTC

> (changing the
> includes to <libdrm/amdgpu_drm.h> etc. won't build.

Oh NVM it will build with system headers as long as I remember to specify the lib :-)

So outside X it works (assuming silence is success) as user and root.

Inside X it only works as root, device_initialization fails as user - which prompted me to try startx as root - this works with amdgpu (unless its a fluke because root got twm and I use fluxbox).

Starting X as root with modesetting starts but I see the same as the modesetting log I posted.

So maybe some permissions issue.

FWIW card0 is 

crw-rw---- 1 root video 226, 0 Aug 22  2015 /dev/dri/card0

and my user is in group video.

Comment 11 Andy Furniss 2015-08-22 19:52:14 UTC

(In reply to Andy Furniss from comment #10)
> > (changing the
> > includes to <libdrm/amdgpu_drm.h> etc. won't build.
> 
> Oh NVM it will build with system headers as long as I remember to specify
> the lib :-)
> 
> So outside X it works (assuming silence is success) as user and root.
> 
> Inside X it only works as root, device_initialization fails as user - which
> prompted me to try startx as root - this works with amdgpu (unless its a
> fluke because root got twm and I use fluxbox).
> 
> Starting X as root with modesetting starts but I see the same as the
> modesetting log I posted.
> 
> So maybe some permissions issue.

Permissions seems to be a red herring - It's possible for startx as root to fail in the same way as user. So startx may work as root but only if I am lucky - repeated testing and it mostly fails now.

Comment 12 Mathias Tillman 2015-08-23 09:17:15 UTC

Created attachment 117870 [details]
Debug output

I've done some testing by adding debug output in amdgpu_device.c and while testing your va_test does indeed work, I think I have found the reason as to why startx does not work. Basically, the hash generation is broken, if you check util_hash_table_get in the logs, you can see that it returns null (in the startx output), even when the fd exists in the hash table. This seems to be because drmGetPrimaryDeviceNameFromFd in fd_hash returns the directory name of the dri device, as opposed to the dri device itself (/dev/dri/ vs /dev/dri/card0) - you can see that in the log at lines 14 and 25. Strangely it works just fine the first time amdgpu_device_initialize is called (line 4), and in your va_test.

To explain the log a bit more, the amdgpu_callback output is a foreach of the fd_tab hash table right before the util_hash_table_get call. The rest should be self-explanatory I think.

I will see if I can find out why drmGetPrimaryDeviceNameFromFd returns the directory name, so if you have any ideas, please let me know.

Comment 13 Mathias Tillman 2015-08-23 09:44:44 UTC

Created attachment 117871 [details] [review]
Potential fix

Andy: Could you test the attached patch, please? It's now working fine for me.

Comment 14 Andy Furniss 2015-08-23 10:55:42 UTC

(In reply to Mathias Tillman from comment #13)
> Created attachment 117871 [details] [review] [review]
> Potential fix
> 
> Andy: Could you test the attached patch, please? It's now working fine for
> me.

It also seems to be working for me :-)

I just did about 20 startx in a row and all is OK.

va_test behaves the same as before = fail as user from within X, but maybe that is expected?

modesetting still has issues, but they probably are nothing to do with this bug anyway - I had never used it before this issue arose.

Comment 15 Andy Furniss 2015-08-23 11:56:18 UTC

(In reply to Andy Furniss from comment #14)
> (In reply to Mathias Tillman from comment #13)
> > Created attachment 117871 [details] [review] [review] [review]
> > Potential fix
> > 
> > Andy: Could you test the attached patch, please? It's now working fine for
> > me.
> 
> It also seems to be working for me :-)

It gets better - there is a chance that this (well it + working current libdrm it enables) has fixed a long standing UVD issue and a more recent vapau_interop one (basically both would randomly fail). It's slightly possible that something else fixed it - will have to play over coming days.

Comment 16 Mathias Tillman 2015-08-23 14:07:02 UTC

Doubt my patch had anything to do with that, unless the affected function is used from anywhere outside the drm code.
Glad to hear that it works though, hopefully the fix can be pushed to master soon.

Comment 17 Andy Furniss 2015-08-23 14:13:16 UTC

(In reply to Mathias Tillman from comment #16)
> Doubt my patch had anything to do with that, unless the affected function is
> used from anywhere outside the drm code.
> Glad to hear that it works though, hopefully the fix can be pushed to master
> soon.

Yea turned out to be false (for uvd lock at least haven't had time to stress vapau interop) extended testing = I can still lock - just with current setup it takes much longer than I've seen recently.

Comment 18 Jammy Zhou 2015-08-24 04:00:03 UTC

Thanks guys for the investigation. I have posted the patch to the dri-devel mailing list. Please check below.

http://lists.freedesktop.org/archives/dri-devel/2015-August/089003.html

Comment 19 Christian König 2015-08-24 07:53:59 UTC

Nice catch!

And Andy is right this could explain some of the problem we had with VDPAU interop as well, cause there the device is also opened from multiple sides at the same time.

Comment 20 Andy Furniss 2015-08-24 11:25:07 UTC

(In reply to Christian König from comment #19)
> Nice catch!
> 
> And Andy is right this could explain some of the problem we had with VDPAU
> interop as well, cause there the device is also opened from multiple sides
> at the same time.

New version of the patch seems good for startx.

Also tried harder than yesterday with vdpau interop and it also seems good.

Comment 21 Mathias Tillman 2015-08-24 11:27:17 UTC

Marking this as resolved then :)

Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.