Created attachment 117820 [details]
With latest libdrm git I can't startx due to a segfault shown in attached xorg log.
dmesg logs several -
amdgpu 0000:01:00.0: bo ffff8800bd34b800 va 0x0000100000-0x0000100400 conflict with 0x0000100000-0x0000100400
This happens with mesa/drm on commit
56d8dd6 amdgpu: make vamgr per device v2
On the commit before that -
ffa305d amdgpu: add flag to support 32bit VA address v4
I can startx without error but performance is broken eg. scrolling in xterm takes about a second to refresh the screen.
*** Bug 91708 has been marked as a duplicate of this bug. ***
Further testing - booting into a pre existing drm-fixes-4.2 has the same issue.
The modesetting ddx works OK and using that I still have OpenGL, though uvd will fail with dmesg errors as above.
Created attachment 117831 [details]
xorg log modesetting aiglx fails
When I say OK for modesetting there are still errors with dri2 and AIGLX falls back to s/w.
Which makes me a bit confused as to why xonotic-linux64-glx is working OK - in fact it benchmarks 20% faster than I've seen so far on tonga. I can't be using llvm with this test - it's set everything as high as it will go and I get 64 fps on the big keybench.
Were you only changing the libdrm_amdgpu for the bisecting? It looks like those two libdrm commits you mentioned shouldn't cause this VA confict and performance issue for xterm scrolling. Which kernel were you using?
Created attachment 117837 [details]
egl fail causing scrolling issue
I am using agd5f drm-next-4.3-wip but have tested an older drm fixes.
I am just changing libdrm, but I do rebuild mesa after (and ddx usually). On heads I also tried rebuilding xserver + kernel as well, just in case in magically started working by doing that :-)
As for the scrolling issue - I didn't look into it as I was just "passing" while looking at the main issue.
Looking now I can see why - glamor/egl fail log attached. I have after resetting onto ffa305d this time, rebuilt xserver as well and checked output. Mesa xserver ddx all build OK AFAICT and the issue persists.
Maybe these commits are causing some strange build issues for stuff that depends on/uses libdrm headers or something.
Thanks to Mathias taking time to report with obiaf ppas, at least I know it's not just me (using an old LFS set up).
I can confirm that reverting to ffa305d0fc926418e4dff432381ead8907dc18d9 makes it work, before it would fail before reaching the dm login page (I'm using sddm), and now it reaches the login page. I also did a manual unpatching of 56d8dd6a9c03680700e0b0043cb56e0af7e3e3de when on the latest commit, which gave me the same result.
The conflict looks to me like it's trying to initialize the device multiple times - I did some debugging, and it reaches the amdgpu_device_initialize function several times with the same virtual address. I'm not sure if this is intentional or not, but it does seem a bit odd to me. Maybe there should be a saved global state of vamgr's that corresponds to the virtual address? Or is it just not supposed to initialize the same device multiple times?
Created attachment 117860 [details]
Thanks for the information. But it seems not expected. Even if amdgpu_device_initialize is called multiple times with different fds (returned by opening the device nodes) in one single process, only one amdgpu_device should be created, since a hash table is maitained for the created devices. In this case, the per-device vamgr should only be initialized as well in one process.I wrote a simple test application (see attached), and it worked for me quite well. Could you try it on your failing system?
(In reply to Jammy Zhou from comment #7)
> Created attachment 117860 [details]
Where to compile this from - and why is
there twice - is the second one supposed to be something else (changing the includes to <libdrm/amdgpu_drm.h> etc. won't build.
> (changing the
> includes to <libdrm/amdgpu_drm.h> etc. won't build.
Oh NVM it will build with system headers as long as I remember to specify the lib :-)
So outside X it works (assuming silence is success) as user and root.
Inside X it only works as root, device_initialization fails as user - which prompted me to try startx as root - this works with amdgpu (unless its a fluke because root got twm and I use fluxbox).
Starting X as root with modesetting starts but I see the same as the modesetting log I posted.
So maybe some permissions issue.
FWIW card0 is
crw-rw---- 1 root video 226, 0 Aug 22 2015 /dev/dri/card0
and my user is in group video.
(In reply to Andy Furniss from comment #10)
> > (changing the
> > includes to <libdrm/amdgpu_drm.h> etc. won't build.
> Oh NVM it will build with system headers as long as I remember to specify
> the lib :-)
> So outside X it works (assuming silence is success) as user and root.
> Inside X it only works as root, device_initialization fails as user - which
> prompted me to try startx as root - this works with amdgpu (unless its a
> fluke because root got twm and I use fluxbox).
> Starting X as root with modesetting starts but I see the same as the
> modesetting log I posted.
> So maybe some permissions issue.
Permissions seems to be a red herring - It's possible for startx as root to fail in the same way as user. So startx may work as root but only if I am lucky - repeated testing and it mostly fails now.
Created attachment 117870 [details]
I've done some testing by adding debug output in amdgpu_device.c and while testing your va_test does indeed work, I think I have found the reason as to why startx does not work. Basically, the hash generation is broken, if you check util_hash_table_get in the logs, you can see that it returns null (in the startx output), even when the fd exists in the hash table. This seems to be because drmGetPrimaryDeviceNameFromFd in fd_hash returns the directory name of the dri device, as opposed to the dri device itself (/dev/dri/ vs /dev/dri/card0) - you can see that in the log at lines 14 and 25. Strangely it works just fine the first time amdgpu_device_initialize is called (line 4), and in your va_test.
To explain the log a bit more, the amdgpu_callback output is a foreach of the fd_tab hash table right before the util_hash_table_get call. The rest should be self-explanatory I think.
I will see if I can find out why drmGetPrimaryDeviceNameFromFd returns the directory name, so if you have any ideas, please let me know.
Created attachment 117871 [details] [review]
Andy: Could you test the attached patch, please? It's now working fine for me.
(In reply to Mathias Tillman from comment #13)
> Created attachment 117871 [details] [review] [review]
> Potential fix
> Andy: Could you test the attached patch, please? It's now working fine for
It also seems to be working for me :-)
I just did about 20 startx in a row and all is OK.
va_test behaves the same as before = fail as user from within X, but maybe that is expected?
modesetting still has issues, but they probably are nothing to do with this bug anyway - I had never used it before this issue arose.
(In reply to Andy Furniss from comment #14)
> (In reply to Mathias Tillman from comment #13)
> > Created attachment 117871 [details] [review] [review] [review]
> > Potential fix
> > Andy: Could you test the attached patch, please? It's now working fine for
> > me.
> It also seems to be working for me :-)
It gets better - there is a chance that this (well it + working current libdrm it enables) has fixed a long standing UVD issue and a more recent vapau_interop one (basically both would randomly fail). It's slightly possible that something else fixed it - will have to play over coming days.
Doubt my patch had anything to do with that, unless the affected function is used from anywhere outside the drm code.
Glad to hear that it works though, hopefully the fix can be pushed to master soon.
(In reply to Mathias Tillman from comment #16)
> Doubt my patch had anything to do with that, unless the affected function is
> used from anywhere outside the drm code.
> Glad to hear that it works though, hopefully the fix can be pushed to master
Yea turned out to be false (for uvd lock at least haven't had time to stress vapau interop) extended testing = I can still lock - just with current setup it takes much longer than I've seen recently.
Thanks guys for the investigation. I have posted the patch to the dri-devel mailing list. Please check below.
And Andy is right this could explain some of the problem we had with VDPAU interop as well, cause there the device is also opened from multiple sides at the same time.
(In reply to Christian König from comment #19)
> Nice catch!
> And Andy is right this could explain some of the problem we had with VDPAU
> interop as well, cause there the device is also opened from multiple sides
> at the same time.
New version of the patch seems good for startx.
Also tried harder than yesterday with vdpau interop and it also seems good.
Marking this as resolved then :)