Bug 107229

Summary: Metro 2033 Redux hangs
Product: DRI
Reporter: Alexander Tsoy <alexander>
Component: DRM/AMDgpu
Assignee: Default DRI bug account <dri-devel>
Status: RESOLVED MOVED
QA Contact:
Severity: normal    
Priority: medium    
Version: unspecified   
Hardware: Other   
OS: All   
Whiteboard:
Attachments:
Description              Flags
linux_metro_bisect.log   none
dmesg                    none

Description Alexander Tsoy 2018-07-14 20:38:42 UTC
Metro 2033 Redux hangs when a certain combination of Mesa version, kernel version and kernel configuration is used. The hang always happens on the loading screen.

I have done some tests using the integrated benchmark (benchmark.sh):

linux-4.14.x + mesa-7.3.x = OK
linux-4.14.x + mesa-8.0.x / mesa-8.1.x = hang
linux-4.17.x with CONFIG_TRANSPARENT_HUGEPAGE=y = OK
linux-4.17.x with CONFIG_TRANSPARENT_HUGEPAGE=n + mesa-8.0.x / mesa-8.1.x = hang

When the hang occurs, it causes a massive slowdown of all other graphical applications. With 4.14 kernels the game process is unkillable so it hangs somewhere in the kernel space. With 4.17 kernels it can be killed but this takes some time.


My GPU:
01:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Tonga PRO [Radeon R9 285/380] [1002:6939] (rev f1) (prog-if 00 [VGA controller])
	Subsystem: PC Partner Limited / Sapphire Technology Tonga PRO [Radeon R9 285/380] [174b:e305]
Comment 1 Alexander Tsoy 2018-07-14 20:44:38 UTC
Created attachment 140639
linux_metro_bisect.log

> # first bad commit: [6ed4e2e673d348df6623012a628a8ab8624e3222] drm/ttm: add transparent huge page support for wc or uc allocations v2

The bisect was done with CONFIG_TRANSPARENT_HUGEPAGE=y. This is how I got the idea to experiment with transparent huge pages. Yes, I forgot about the --term-old/--term-new bisect options :)
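
For readers unfamiliar with the bisect options mentioned above: when searching for the commit that fixed a problem rather than the one that introduced it, git bisect can be started with custom terms so that "bad" does not have to stand in for "fixed". A minimal sketch follows; the function name bisect_for_fix and the term names "broken" and "fixed" are illustrative assumptions, not taken from the attached log.

```shell
# Sketch: bisect for the commit that FIXED a problem, using custom terms.
# "broken" and "fixed" are hypothetical term names chosen for this example.
bisect_for_fix() {
    # $1 = older revision where the hang still occurs
    # $2 = newer revision where the hang is gone
    git bisect start --term-old=broken --term-new=fixed &&
    git bisect broken "$1" &&
    git bisect fixed "$2"
}

# After each build-and-test round, mark the checked-out revision with
#   git bisect broken    (the hang still occurs)
#   git bisect fixed     (the hang is gone)
# and finish with
#   git bisect reset
```

With these terms, the final "first fixed commit" line names the commit that made the hang go away.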
Comment 2 Alexander Tsoy 2018-07-15 16:52:08 UTC
(In reply to Alexander Tsoy from comment #0)
> With 4.14 kernels the game process is unkillable so it hangs somewhere 
> in the kernel space. With 4.17 kernels it can be killed but this
> takes some time.
The process can actually be killed by running kill repeatedly in a while loop.

Perf report:

$ sudo perf report | grep metro | head
    33.33%  metro            metro                          [.] cbackend_OGL::delayed_upload
    31.56%  metro            [kernel.vmlinux]               [k] rb_prev
     2.07%  metro            [kernel.vmlinux]               [k] alloc_iova
     0.20%  metro            [kernel.vmlinux]               [k] __switch_to
     0.18%  metro            [kernel.vmlinux]               [k] native_load_gs_index
     0.13%  metro            [kernel.vmlinux]               [k] __x86_indirect_thunk_rax
     0.12%  metro            [kernel.vmlinux]               [k] entry_SYSCALL_64
     0.08%  metro            [kernel.vmlinux]               [k] __schedule
     0.08%  metro            [kernel.vmlinux]               [k] read_tsc
     0.07%  metro            libc-2.26.so                   [.] __nanosleep
Comment 3 Michel Dänzer 2018-07-16 09:44:48 UTC
(In reply to Alexander Tsoy from comment #0)
> 
> linux-4.17.x with CONFIG_TRANSPARENT_HUGEPAGE=y = OK
> linux-4.17.x with CONFIG_TRANSPARENT_HUGEPAGE=n + mesa-8.0.x / mesa-8.1.x =
> hang

Did you swap CONFIG_TRANSPARENT_HUGEPAGE=y/n here? I.e. CONFIG_TRANSPARENT_HUGEPAGE=y is bad, CONFIG_TRANSPARENT_HUGEPAGE=n is good?

If not, how exactly did you bisect with CONFIG_TRANSPARENT_HUGEPAGE=y?
Comment 4 Alexander Tsoy 2018-07-16 10:15:13 UTC
(In reply to Michel Dänzer from comment #3)
> (In reply to Alexander Tsoy from comment #0)
> > 
> > linux-4.17.x with CONFIG_TRANSPARENT_HUGEPAGE=y = OK
> > linux-4.17.x with CONFIG_TRANSPARENT_HUGEPAGE=n + mesa-8.0.x / mesa-8.1.x =
> > hang
> 
> Did you swap CONFIG_TRANSPARENT_HUGEPAGE=y/n here? I.e.
> CONFIG_TRANSPARENT_HUGEPAGE=y is bad, CONFIG_TRANSPARENT_HUGEPAGE=n is good?

Yes, after getting a clue that this bug could be related to transparent huge pages, I tried disabling CONFIG_TRANSPARENT_HUGEPAGE in the 4.17.6 kernel. This results in the same hang I had with 4.14.x kernels.

Note that transparent huge pages must be disabled at build time. The kernel command-line option "transparent_hugepage=never" doesn't change anything.
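
Since the build-time setting and the runtime setting behave differently here, it may help to check both separately. The following is a minimal sketch; the function names are hypothetical, and the location of the kernel config file (for example /boot/config-$(uname -r)) is distribution-dependent and assumed, not something stated in this report.

```shell
# Build-time view: succeeds if the given plain-text kernel config
# has transparent huge pages compiled in.
# $1 = path to a kernel config file (location is an assumption, see above)
thp_built_in() {
    grep -q '^CONFIG_TRANSPARENT_HUGEPAGE=y$' "$1"
}

# Runtime view: the sysfs file below is only present on kernels built
# with THP support; the active mode is shown in square brackets.
thp_runtime_mode() {
    if [ -r /sys/kernel/mm/transparent_hugepage/enabled ]; then
        cat /sys/kernel/mm/transparent_hugepage/enabled
    else
        echo "not built in"
    fi
}
```

A possible use is `thp_built_in "/boot/config-$(uname -r)" && echo "THP compiled in"`, followed by `thp_runtime_mode` to see what the boot parameter selected.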
Comment 5 Alexander Tsoy 2018-07-16 10:18:34 UTC
To clarify a bit: first bad commit in bisect is actually the first good commit that fixed hangs in Metro.
Comment 6 Alexander Tsoy 2018-07-16 10:19:17 UTC
(In reply to Alexander Tsoy from comment #5)
> To clarify a bit: first bad commit in bisect is actually the first good
> commit that fixed hangs in Metro.
But only when transparent huge pages are enabled, of course.
Comment 7 Alexander Tsoy 2018-08-04 11:34:52 UTC
Created attachment 140964
dmesg

Same problem with the latest amd-staging-drm-next (commit bf1fd52b0632cd17ac875432a36d3e92be96d8cb). Now the kernel gives me the following errors:

[  324.552371] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* amdgpu_cs_list_validate(validated) failed.
[  324.561030] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Not enough memory for command submission!

And with CONFIG_TRANSPARENT_HUGEPAGE=y the same kernel works fine.
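
To pull only lines like the two above out of a long kernel log, a small filter can be used. This is a sketch under the assumption that the relevant messages contain both the literal marker *ERROR* and the string amdgpu, as the quoted lines do; the function name amdgpu_errors is hypothetical.

```shell
# Keep only kernel log lines that carry the DRM "*ERROR*" marker
# and mention amdgpu. Reads the log from standard input.
amdgpu_errors() {
    grep -F '*ERROR*' | grep -F 'amdgpu'
}

# Example: dmesg | amdgpu_errors
```

Both patterns are matched as fixed strings (-F), so the asterisks in *ERROR* are not treated as regular expression operators.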
Comment 8 Martin Peres 2019-11-19 08:43:36 UTC
-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/drm/amd/issues/447.
