Bug 108096 - [amd-staging-drm-next] SDDM screen corruption (not usable) with RX580, amdgpu, dc=1 (of course), regression - [bisected]
Status: CLOSED FIXED
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/AMDgpu
Version: unspecified
Hardware: x86-64 (AMD64) Linux (All)
Importance: medium critical
Assignee: Default DRI bug account
Duplicates: 108533
Reported: 2018-09-29 03:38 UTC by Dieter Nützel
Modified: 2018-11-07 22:13 UTC (History)

Attachments
SDDM corruption 4.18.0-rc1 (1.76 MB, image/jpeg)
2018-09-29 03:38 UTC, Dieter Nützel
SDDM corruption 4.18.0-rc1 (with some recognizable parts) (2.36 MB, image/jpeg)
2018-09-29 03:40 UTC, Dieter Nützel
SDDM corruption 4.19.0-rc1 (2.46 MB, image/jpeg)
2018-09-29 03:43 UTC, Dieter Nützel
dmesg-4.18.0-rc1-1.g7262353-default+.log (66.18 KB, text/x-log)
2018-09-29 03:43 UTC, Dieter Nützel
dmesg-4.18.0-rc1-1.g7262353-default+.log3 (68.59 KB, text/plain)
2018-09-29 03:44 UTC, Dieter Nützel
dmesg-4.19.0-rc1-1.g7262353-default+.log-25.10 (65.93 KB, text/plain)
2018-09-29 03:45 UTC, Dieter Nützel
dmesg-4.19.0-rc1-1.g7262353-default+.log-25.09 (65.93 KB, text/plain)
2018-09-29 03:47 UTC, Dieter Nützel
Xorg.0.log.25.09 (35.18 KB, text/plain)
2018-09-29 03:48 UTC, Dieter Nützel

Description Dieter Nützel 2018-09-29 03:38:40 UTC
Created attachment 141786 [details]
SDDM corruption 4.18.0-rc1

amd-staging-drm-next (since 21/22 August 2018)
no longer works with my nice RX580.

The latest working amd-staging-drm-next kernel was from 16 August 2018,
#5024f8dfe478 (my testing for Huang Rui, ray.huang at amd.com):
https://lists.freedesktop.org/archives/amd-gfx/2018-August/025332.html

On 22 August I got the first SDDM corruption (see attachment) and
[   14.322826] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser !
in dmesg with amd-staging-drm-next 4.18.0-rc1
(see attachment dmesg-4.18.0-rc1-1.g7262353-default+.log).

Later (23 August) I saw SDDM corruption and
[drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser !
plus
[    3.487757] RIP: 0010:amdgpu_bo_gpu_offset+0x56/0x90 [amdgpu]
(see attachment dmesg-4.18.0-rc1-1.g7262353-default+.log3).


After Alex's upgrade of amd-staging-drm-next to 4.19.0-rc1
I got SDDM corruption (see attachment)
and
[   14.439424] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser !
(see attachment dmesg-4.19.0-rc1-1.g7262353-default+.log-25.10).

My first 'bisect' attempt failed because the big merge with Linus' tree did not resolve.
I can no longer find my former amd-staging-drm-next tag (#5024f8dfe478, see above) ;-(
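
When many boots have to be classified as good or bad during a bisect, the error signature quoted above can be checked mechanically. The following is only a minimal sketch, assuming the dmesg output was saved as text; the function name `find_cs_parser_errors` is hypothetical, and the pattern simply matches the message as quoted in this report. The sample lines are taken from the logs quoted in this bug.

```python
import re

# Matches the message as quoted in this report; a timestamp prefix may precede it.
CS_PARSER_ERROR = re.compile(
    r"\[drm:amdgpu_cs_ioctl \[amdgpu\]\] \*ERROR\* Failed to initialize parser"
)


def find_cs_parser_errors(dmesg_text):
    """Return the dmesg lines that contain the amdgpu CS parser error."""
    return [line for line in dmesg_text.splitlines() if CS_PARSER_ERROR.search(line)]


# Two lines quoted in this bug report: one unrelated, one with the error.
sample = """\
[    6.716492] [drm] Fence fallback timer expired on ring kiq_2.1.0
[   14.322826] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser !
"""

hits = find_cs_parser_errors(sample)
print(len(hits))  # 1
```

A boot without any matching line is not proof of a good kernel; it only means this particular message did not appear.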
Comment 1 Dieter Nützel 2018-09-29 03:40:44 UTC
Created attachment 141787 [details]
SDDM corruption 4.18.0-rc1 (with some recognizable parts)
Comment 2 Dieter Nützel 2018-09-29 03:43:11 UTC
Created attachment 141788 [details]
SDDM corruption 4.19.0-rc1
Comment 3 Dieter Nützel 2018-09-29 03:43:56 UTC
Created attachment 141789 [details]
dmesg-4.18.0-rc1-1.g7262353-default+.log
Comment 4 Dieter Nützel 2018-09-29 03:44:23 UTC
Created attachment 141790 [details]
dmesg-4.18.0-rc1-1.g7262353-default+.log3
Comment 5 Dieter Nützel 2018-09-29 03:45:06 UTC
Created attachment 141791 [details]
dmesg-4.19.0-rc1-1.g7262353-default+.log-25.10
Comment 6 Dieter Nützel 2018-09-29 03:47:20 UTC
Created attachment 141792 [details]
dmesg-4.19.0-rc1-1.g7262353-default+.log-25.09
Comment 7 Dieter Nützel 2018-09-29 03:48:02 UTC
Created attachment 141793 [details]
Xorg.0.log.25.09
Comment 8 Dieter Nützel 2018-09-30 13:33:35 UTC
(In reply to Dieter Nützel from comment #6)
> Created attachment 141792 [details]
> dmesg-4.19.0-rc1-1.g7262353-default+.log-25.09

Hi,

could this be a hint?
[    6.716492] [drm] Fence fallback timer expired on ring kiq_2.1.0

Should I send a log with 'amdgpu.dc_log=1 drm.debug=6'?
Comment 9 Michel Dänzer 2018-10-24 08:33:52 UTC
*** Bug 108533 has been marked as a duplicate of this bug. ***
Comment 10 fin4478 2018-10-24 09:01:24 UTC
There are plenty of display managers and desktops in the Linux world. KDE stuff is slow, buggy and uses a lot of hardware resources. The Xfce desktop with LightDM and the Whisker menu is stable, fast, light and freely configurable. I have never had desktop problems with the amdgpu driver in four years of using it. I have an RX560, and the mainline kernel 4.19.0 from kernel.org works fine. Use Mesa git too, for example from the Oibaf PPA.
Comment 11 Alex Deucher 2018-10-25 00:56:27 UTC
Can you bisect 4.19?
Comment 12 Dieter Nützel 2018-10-25 01:24:19 UTC
(In reply to Alex Deucher from comment #11)
> Can you bisect 4.19?
Well, I'll try that, too.
I'm currently trying amd-staging-drm-next, again.

I have some trouble with 4.19 final on my main home server (32 bit, PAE), too.
Now I'm back to 4.18.16 on all systems.

Maybe I will have some results tomorrow.

Thanks, Alex!
Comment 13 Dieter Nützel 2018-10-26 03:36:12 UTC
DONE - amd-staging-drm-next

964d0fbf6301d3dc8dfad19ffab5a06d002d27f1 is the first bad commit
commit 964d0fbf6301d3dc8dfad19ffab5a06d002d27f1
Author: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Date:   Fri Jul 6 14:16:54 2018 -0400

    drm/amdgpu: Allow to create BO lists in CS ioctl v3
    
    This change is to support MESA performace optimization.
    Modify CS IOCTL to allow its input as command buffer and an array of
    buffer handles to create a temporay bo list and then destroy it
    when IOCTL completes.
    This saves on calling for BO_LIST create and destry IOCTLs in MESA
    and by this improves performance.
    
    v2: Avoid inserting the temp list into idr struct.
    
    v3:
    Remove idr alloation from amdgpu_bo_list_create.
    Remove useless argument from amdgpu_cs_parser_fini
    Minor cosmetic stuff.
    
    v4: Revert amdgpu_bo_list_destroy back to static
    
    Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
    Reviewed-by: Christian König <christian.koenig@amd.com>
    Reviewed-by: Chunming Zhou <david1.zhou@amd.com>
    Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

:040000 040000 d621cc2fb523ffcaa877faa8ae682878c268478e 6a758c959d69df05023339fa981b067fa027875c M drivers
:040000 040000 7f32e65fd49cb9305b5c7440b161771f429aad09 9137d92f44e0b34512b3104141412595da64ce96 M include


But 'git revert 964d0fbf6301' does NOT work on amd-staging-drm-next:

error: could not revert 964d0fbf6301... (drm/amdgpu: Allow to create BO lists in CS ioctl v3)
hint: after resolving the conflicts, mark the corrected paths
hint: with 'git add <paths>' or 'git rm <paths>' and commit the result with
hint: 'git commit'
SOURCE/amd-staging-drm-next> git status
On branch amd-staging-drm-next
Your branch is up to date with 'origin/amd-staging-drm-next'.

You are currently reverting commit '964d0fbf6301'.
  (fix conflicts and run "git revert --continue")
  (use "git revert --abort" to cancel the revert operation)

Changes to be committed:
  (use "git reset HEAD <file>..." to unstage)

        modified:       drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
        modified:       include/uapi/drm/amdgpu_drm.h

Unmerged paths:
  (use "git reset HEAD <file>..." to unstage)
  (use "git add/rm <file>..." as appropriate to mark resolution)

        both modified:    drivers/gpu/drm/amd/amdgpu/amdgpu.h
        both modified:    drivers/gpu/drm/amd/amdgpu/amdgpu_bo_list.c
        both modified:    drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
Comment 14 Michel Dänzer 2018-10-26 08:06:32 UTC
Which Git commit of Mesa are you using? Any local patches on top?
Comment 15 Dieter Nützel 2018-10-26 12:05:56 UTC
(In reply to Michel Dänzer from comment #14)
> Which Git commit of Mesa are you using? Any local patches on top?

Every, since ever, as always (even, since _before_ Aug 22, 2018) ...;-)

But kidding aside, _currently_

#0ff1ccca25

(with merged branch from Marek for testing purposes)
04ba4eae68 (HEAD -> ext_gpu_shader4) Merge branch 'ext_gpu_shader4' of git://people.freedesktop.org/~mareko/mesa into ext_gpu_shader4
0ff1ccca25 (origin/master, origin/HEAD, master) radv: call nir_link_xfb_varyings()

Does it have something to do with the DRM version?
DRM 3.26.0 (4.18) vs. DRM 3.27.0 (4.19)? 

[-]
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
index 06aede1..529500c 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
@@ -69,9 +69,10 @@
  * - 3.24.0 - Add high priority compute support for gfx9
  * - 3.25.0 - Add support for sensor query info (stable pstate sclk/mclk).
  * - 3.26.0 - GFX9: Process AMDGPU_IB_FLAG_TC_WB_NOT_INVALIDATE.
+ * - 3.27.0 - Add new chunk to to AMDGPU_CS to enable BO_LIST creation.
  */
 #define KMS_DRIVER_MAJOR	3
-#define KMS_DRIVER_MINOR	26
+#define KMS_DRIVER_MINOR	27
 #define KMS_DRIVER_PATCHLEVEL	0
[-]

But this commit is _in_ since

author	Andrey Grodzovsky <andrey.grodzovsky@amd.com>	2018-07-06 14:16:54 -0400
committer	Alex Deucher <alexander.deucher@amd.com>	2018-07-16 15:29:47 -0500

and I had it running (on stable and amd-staging-drm-next, daily), even with AMD testing code (Huang Rui ray.huang at amd.com), Aug 16, 2018.
https://lists.freedesktop.org/archives/amd-gfx/2018-August/025411.html

Do you need more logs?
With which kernel parameter?
System _is_ running, but with unusable gfx/dri screen.
Comment 16 Andrey Grodzovsky 2018-10-26 20:20:37 UTC
The question is whether you have the "winsys/amdgpu: pass the BO list via the CS ioctl on DRM >= 3.27.0" commit in your Mesa tree. I am not clear on that.
Comment 17 Dieter Nützel 2018-10-26 23:29:10 UTC
(In reply to Andrey Grodzovsky from comment #16)
> Question is do you have " winsys/amdgpu: pass the BO list via the CS ioctl
> on DRM >= 3.27.0" commit in your MESA tree ? I am not clear on that.

461a864316 winsys/amdgpu: pass the BO list via the CS ioctl on DRM >= 3.27.0

commit 461a864316d5b70ea99c9e1dba7d71973af2aacc
Author: Marek Olšák <marek.olsak@amd.com>
Date:   Thu Jul 12 00:50:52 2018 -0400

    winsys/amdgpu: pass the BO list via the CS ioctl on DRM >= 3.27.0

Any other ideas?

Thank you, Andrey, for looking into it!
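
For readers following along, the version gate named in that commit title amounts to a simple comparison against the kernel's DRM version. The sketch below is not Mesa's code (Mesa is written in C); the function name `supports_bo_list_in_cs` is hypothetical and only illustrates the "DRM >= 3.27.0" condition with the two versions mentioned in this report.

```python
def supports_bo_list_in_cs(drm_major, drm_minor):
    """True if the amdgpu DRM version is at least 3.27, per the commit title."""
    # Tuple comparison is lexicographic: major first, then minor.
    return (drm_major, drm_minor) >= (3, 27)


print(supports_bo_list_in_cs(3, 26))  # False (DRM 3.26.0, 4.18 in this report)
print(supports_bo_list_in_cs(3, 27))  # True  (DRM 3.27.0, 4.19 in this report)
```

Note that a gate of this kind looks only at the kernel's reported version; it says nothing about which libdrm build the process has loaded.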
Comment 18 Michel Dänzer 2018-10-29 11:08:28 UTC
Andrey, can you work with Dieter to figure out where the error is coming from? E.g. by attaching patches adding debugging printks.
Comment 19 Andrey Grodzovsky 2018-10-29 14:08:01 UTC
(In reply to Michel Dänzer from comment #18)
> Andrey, can you work with Dieter to figure out where the error is coming
> from? E.g. by attaching patches adding debugging printks.

Yes, i will look into it.
Comment 20 Samuel Pitoiset 2018-10-29 14:56:03 UTC
Make sure to load the right version of libdrm (i.e. 2.4.93 or more recent). I had this problem today because I was loading an old version of libdrm. Something was installed in the wrong place.
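
One way to see which libdrm files a running process has actually mapped is to look at its /proc/<pid>/maps on Linux. The sketch below parses such text; the function name `mapped_libdrm_paths` is hypothetical, and the sample is made up in the /proc/<pid>/maps layout (addresses and inode numbers are illustrative only, the /opt/amdgpu/lib64 path is the one that appears later in this report).

```python
def mapped_libdrm_paths(maps_text):
    """Return the sorted, distinct libdrm* files named in /proc/<pid>/maps text."""
    paths = set()
    for line in maps_text.splitlines():
        # Fields: address perms offset dev inode [pathname]
        fields = line.split(None, 5)
        if len(fields) < 6:
            continue  # anonymous mapping without a pathname
        path = fields[5].strip()
        if path.rsplit("/", 1)[-1].startswith("libdrm"):
            paths.add(path)
    return sorted(paths)


# Illustrative sample only; not taken from a real system.
sample = """\
7f0000000000-7f0000010000 r-xp 00000000 08:01 1234 /opt/amdgpu/lib64/libdrm.so.2.4.0
7f0000010000-7f0000011000 rw-p 00010000 08:01 1234 /opt/amdgpu/lib64/libdrm.so.2.4.0
7f0000020000-7f0000030000 r-xp 00000000 08:01 5678 /usr/lib64/libc-2.26.so
7f0000040000-7f0000041000 rw-p 00000000 00:00 0
"""

print(mapped_libdrm_paths(sample))  # ['/opt/amdgpu/lib64/libdrm.so.2.4.0']
```

On a live system the input would be the contents of /proc/<pid>/maps for the X server or display manager process, read with sufficient privileges.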
Comment 21 Andrey Grodzovsky 2018-10-29 17:08:06 UTC
Please load the driver in debug mode so I can see the error code value in dmesg:
when booting the kernel, add drm.debug=0xff

Also, to trace exactly where the error originated, please install trace-cmd and, before starting X (assuming you get the failure and the dmesg error right at start), run
sudo trace-cmd start -p function_graph -l amdgpu_cs_ioctl
and get the output from /sys/kernel/debug/tracing/trace
Comment 22 Andrey Grodzovsky 2018-10-30 20:01:48 UTC
(In reply to Andrey Grodzovsky from comment #21)
> Please load the driver in debug mode so I can see the error code value in
> dmesg - 
> when loading the kernel add drm.debug=0xff
> 
> Also to trace where exactly the error originated from please install
> trace-cmd and beore starting X (assuming you get the failure and the dmesg
> error right on start)
> sudo trace-cmd start -p function_graph -l amdgpu_cs_ioctl
> and get the output from /sys/kernel/debug/tracing/trace

My bad, the correct command is
sudo trace-cmd start -p function_graph -g amdgpu_cs_ioctl
Comment 23 Dieter Nützel 2018-11-01 01:42:17 UTC
(In reply to Samuel Pitoiset from comment #20)
> Make sure to load the right version of libdrm (ie. 2.4.93 or more recent). I
> had this problem today because I was loading and old version of libdrm.
> Something was installed in the wrong place.

Doh!

Sorry!  Ugh. Development systems...

My latest AMDGPU-PRO OpenCL (amdgpu-pro-18.30-635379-sle-12.tar.xz)
installation brought me

/opt/amdgpu/lib64/
total 268
drwxr-xr-x 2 root root  4096 Aug 18 06:31 .
drwxr-xr-x 4 root root  4096 Aug 18 06:31 ..
lrwxrwxrwx 1 root root    22 Aug  8 18:33 libdrm_amdgpu.so.1 -> libdrm_amdgpu.so.1.0.0
-rwxr-xr-x 1 root root 69192 Aug  8 18:33 libdrm_amdgpu.so.1.0.0
lrwxrwxrwx 1 root root    22 Aug  8 18:33 libdrm_radeon.so.1 -> libdrm_radeon.so.1.0.1
-rwxr-xr-x 1 root root 68968 Aug  8 18:33 libdrm_radeon.so.1.0.1
lrwxrwxrwx 1 root root    15 Aug  8 18:33 libdrm.so.2 -> libdrm.so.2.4.0
-rwxr-xr-x 1 root root 99600 Aug  8 18:33 libdrm.so.2.4.0
lrwxrwxrwx 1 root root    15 Aug  8 18:33 libkms.so.1 -> libkms.so.1.0.0
-rwxr-xr-x 1 root root 22096 Aug  8 18:33 libkms.so.1.0.0

but the screen corruption appeared first around Aug 22, 2018.

So it worked 'halfway' with upstream _and_ AMDGPU-PRO libdrm.

Maybe the AMD developers could include a 'tag' or something like that to differentiate the two versions?!

CONCLUSION

After deleting /opt/amdgpu/lib64/, ALL is fine again.
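
A stray second copy like this can be spotted by checking which directories provide the same library name. The sketch below is an illustration only: the function name `dirs_providing` is hypothetical, and the demo builds a throwaway directory tree (mirroring the /opt/amdgpu/lib64 layout above) instead of touching real system paths.

```python
import os
import tempfile


def dirs_providing(soname, search_dirs):
    """Return the directories in search_dirs that contain an entry named soname."""
    # lexists() also counts symlinks, which is how the .so.1 names are installed above.
    return [d for d in search_dirs if os.path.lexists(os.path.join(d, soname))]


with tempfile.TemporaryDirectory() as root:
    vendor = os.path.join(root, "opt", "amdgpu", "lib64")
    system = os.path.join(root, "usr", "lib64")
    empty = os.path.join(root, "usr", "local", "lib64")
    for d in (vendor, system, empty):
        os.makedirs(d)
    # Place an empty stand-in file in two of the three directories.
    for d in (vendor, system):
        open(os.path.join(d, "libdrm_amdgpu.so.1"), "w").close()

    found = dirs_providing("libdrm_amdgpu.so.1", [vendor, system, empty])
    print(len(found))  # 2
```

When more than one directory is reported on a real system, the copy that gets loaded depends on the dynamic linker's search order, so it is worth checking each of them.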
Comment 24 Dieter Nützel 2018-11-01 01:44:02 UTC
(In reply to Andrey Grodzovsky from comment #22)
> (In reply to Andrey Grodzovsky from comment #21)
> > Please load the driver in debug mode so I can see the error code value in
> > dmesg - 
> > when loading the kernel add drm.debug=0xff
> > 
> > Also to trace where exactly the error originated from please install
> > trace-cmd and beore starting X (assuming you get the failure and the dmesg
> > error right on start)
> > sudo trace-cmd start -p function_graph -l amdgpu_cs_ioctl
> > and get the output from /sys/kernel/debug/tracing/trace
> 
> My bad, the correct command is
> sudo trace-cmd start -p function_graph -g amdgpu_cs_ioctl

Andrey, do you need these logs even after my comment #23?
Comment 25 Andrey Grodzovsky 2018-11-01 14:32:34 UTC
(In reply to Dieter Nützel from comment #24)
> (In reply to Andrey Grodzovsky from comment #22)
> > (In reply to Andrey Grodzovsky from comment #21)
> > > Please load the driver in debug mode so I can see the error code value in
> > > dmesg - 
> > > when loading the kernel add drm.debug=0xff
> > > 
> > > Also to trace where exactly the error originated from please install
> > > trace-cmd and beore starting X (assuming you get the failure and the dmesg
> > > error right on start)
> > > sudo trace-cmd start -p function_graph -l amdgpu_cs_ioctl
> > > and get the output from /sys/kernel/debug/tracing/trace
> > 
> > My bad, the correct command is
> > sudo trace-cmd start -p function_graph -g amdgpu_cs_ioctl
> 
> Andrey, do you need these logs even after my commit #23?

If everything is working fine after whatever you did then no.
Comment 26 Dieter Nützel 2018-11-07 22:11:59 UTC
Fixed with

winsys/amdgpu: Stop using amdgpu_bo_handle_type_kms_noimport

It only behaves any different from amdgpu_bo_handle_type_kms with
libdrm 2.4.93, and it breaks if an older version is picked up.

https://cgit.freedesktop.org/mesa/mesa/commit/?id=32b0eb51a310ef3d6605cdb31c70a10202463e6d

