Bug 107229 - Metro 2033 Redux hangs
Status: NEW
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/AMDgpu
Version: unspecified
Hardware: Other All
Importance: medium normal
Assignee: Default DRI bug account
 
Reported: 2018-07-14 20:38 UTC by Alexander Tsoy
Modified: 2018-08-04 11:34 UTC (History)

Attachments
linux_metro_bisect.log (4.56 KB, text/plain)
2018-07-14 20:44 UTC, Alexander Tsoy
dmesg (76.32 KB, text/plain)
2018-08-04 11:34 UTC, Alexander Tsoy

Description Alexander Tsoy 2018-07-14 20:38:42 UTC
Metro 2033 Redux hangs when a certain combination of mesa version, kernel version and kernel configuration is used. The hang always happens on the loading screen.

I have done some tests using the integrated benchmark (benchmark.sh):

linux-4.14.x + mesa-7.3.x = OK
linux-4.14.x + mesa-8.0.x / mesa-8.1.x = hang
linux-4.17.x with CONFIG_TRANSPARENT_HUGEPAGE=y = OK
linux-4.17.x with CONFIG_TRANSPARENT_HUGEPAGE=n + mesa-8.0.x / mesa-8.1.x = hang

When the hang occurs, it causes a massive slowdown of all other graphical applications. With 4.14 kernels the game process is unkillable so it hangs somewhere in the kernel space. With 4.17 kernels it can be killed but this takes some time.


My GPU:
01:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Tonga PRO [Radeon R9 285/380] [1002:6939] (rev f1) (prog-if 00 [VGA controller])
	Subsystem: PC Partner Limited / Sapphire Technology Tonga PRO [Radeon R9 285/380] [174b:e305]
Comment 1 Alexander Tsoy 2018-07-14 20:44:38 UTC
Created attachment 140639
linux_metro_bisect.log

> # first bad commit: [6ed4e2e673d348df6623012a628a8ab8624e3222] drm/ttm: add transparent huge page support for wc or uc allocations v2

The bisect was done with CONFIG_TRANSPARENT_HUGEPAGE=y, which is how I got the idea to experiment with transparent huge pages. Yes, I forgot about the --term-old/--term-new bisect options :)
Comment 2 Alexander Tsoy 2018-07-15 16:52:08 UTC
(In reply to Alexander Tsoy from comment #0)
> With 4.14 kernels the game process is unkillable so it hangs somewhere 
> in the kernel space. With 4.17 kernels it can be killed but this
> takes some time.
The process can actually be killed by sending the kill signal repeatedly in a while loop.

Perf report:

$ sudo perf report | grep metro | head
    33.33%  metro            metro                          [.] cbackend_OGL::delayed_upload
    31.56%  metro            [kernel.vmlinux]               [k] rb_prev
     2.07%  metro            [kernel.vmlinux]               [k] alloc_iova
     0.20%  metro            [kernel.vmlinux]               [k] __switch_to
     0.18%  metro            [kernel.vmlinux]               [k] native_load_gs_index
     0.13%  metro            [kernel.vmlinux]               [k] __x86_indirect_thunk_rax
     0.12%  metro            [kernel.vmlinux]               [k] entry_SYSCALL_64
     0.08%  metro            [kernel.vmlinux]               [k] __schedule
     0.08%  metro            [kernel.vmlinux]               [k] read_tsc
     0.07%  metro            libc-2.26.so                   [.] __nanosleep
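
To make the kernel/user split in a report like this easier to read, the lines can be summed by the `[k]` / `[.]` marker. This is only a minimal sketch: the function name `split_overhead` and the regular expression are illustrative assumptions based on the column layout shown above, not part of perf itself.

```python
import re

# Matches lines such as:
#   "    31.56%  metro   [kernel.vmlinux]   [k] rb_prev"
# Columns assumed: overhead, command, shared object, [k]/[.] marker, symbol.
LINE_RE = re.compile(
    r"^\s*(?P<pct>\d+(?:\.\d+)?)%\s+(?P<comm>\S+)\s+(?P<dso>\S+)"
    r"\s+\[(?P<kind>.)\]\s+(?P<symbol>.+?)\s*$"
)


def split_overhead(report_text):
    """Sum the overhead percentages of kernel ([k]) and other ([.]) symbols."""
    totals = {"kernel": 0.0, "user": 0.0}
    for line in report_text.splitlines():
        match = LINE_RE.match(line)
        if match is None:
            continue  # skip headers, blank lines and anything unparsable
        key = "kernel" if match.group("kind") == "k" else "user"
        totals[key] += float(match.group("pct"))
    return totals
```

Fed the ten lines above, this attributes roughly 34.4% to kernel symbols (dominated by rb_prev and alloc_iova) and 33.4% to user-space symbols.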
Comment 3 Michel Dänzer 2018-07-16 09:44:48 UTC
(In reply to Alexander Tsoy from comment #0)
> 
> linux-4.17.x with CONFIG_TRANSPARENT_HUGEPAGE=y = OK
> linux-4.17.x with CONFIG_TRANSPARENT_HUGEPAGE=n + mesa-8.0.x / mesa-8.1.x =
> hang

Did you swap CONFIG_TRANSPARENT_HUGEPAGE=y/n here? I.e. CONFIG_TRANSPARENT_HUGEPAGE=y is bad, CONFIG_TRANSPARENT_HUGEPAGE=n is good?

If not, how exactly did you bisect with CONFIG_TRANSPARENT_HUGEPAGE=y ?
Comment 4 Alexander Tsoy 2018-07-16 10:15:13 UTC
(In reply to Michel Dänzer from comment #3)
> (In reply to Alexander Tsoy from comment #0)
> > 
> > linux-4.17.x with CONFIG_TRANSPARENT_HUGEPAGE=y = OK
> > linux-4.17.x with CONFIG_TRANSPARENT_HUGEPAGE=n + mesa-8.0.x / mesa-8.1.x =
> > hang
> 
> Did you swap CONFIG_TRANSPARENT_HUGEPAGE=y/n here? I.e.
> CONFIG_TRANSPARENT_HUGEPAGE=y is bad, CONFIG_TRANSPARENT_HUGEPAGE=n is good?

Yes, after getting a clue that this bug could be related to transparent huge pages, I tried disabling CONFIG_TRANSPARENT_HUGEPAGE in the 4.17.6 kernel. This results in the same hang I had with 4.14.x kernels.

Note that transparent huge pages must be disabled at build time. The kernel command-line option "transparent_hugepage=never" doesn't change anything.
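
For anyone reproducing this, the two cases (disabled at build time versus disabled on the command line) can be told apart at runtime through sysfs. The sketch below assumes the standard `/sys/kernel/mm/transparent_hugepage/enabled` file, which lists the modes with the active one in brackets and is absent when the kernel provides no THP support. The helper names `parse_thp_mode` and `thp_status` are made up for illustration.

```python
from pathlib import Path

# Standard sysfs location of the THP mode switch.
THP_ENABLED = Path("/sys/kernel/mm/transparent_hugepage/enabled")


def parse_thp_mode(text):
    """Return the bracketed mode from a string like 'always [madvise] never'."""
    for token in text.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    return None


def thp_status(path=THP_ENABLED):
    """Return the active THP mode, or 'unavailable' if the sysfs file is missing."""
    try:
        return parse_thp_mode(path.read_text())
    except FileNotFoundError:
        # No sysfs entry: THP is not provided by this kernel
        # (for example CONFIG_TRANSPARENT_HUGEPAGE=n).
        return "unavailable"
```

With "transparent_hugepage=never" on the command line, `thp_status()` would be expected to return "never", while a kernel built without CONFIG_TRANSPARENT_HUGEPAGE would give "unavailable".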
Comment 5 Alexander Tsoy 2018-07-16 10:18:34 UTC
To clarify a bit: first bad commit in bisect is actually the first good commit that fixed hangs in Metro.
Comment 6 Alexander Tsoy 2018-07-16 10:19:17 UTC
(In reply to Alexander Tsoy from comment #5)
> To clarify a bit: first bad commit in bisect is actually the first good
> commit that fixed hangs in Metro.
But only when transparent huge pages are enabled, of course.
Comment 7 Alexander Tsoy 2018-08-04 11:34:52 UTC
Created attachment 140964
dmesg

Same problem with the latest amd-staging-drm-next (commit bf1fd52b0632cd17ac875432a36d3e92be96d8cb). Now the kernel gives me the following errors:

[  324.552371] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* amdgpu_cs_list_validate(validated) failed.
[  324.561030] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Not enough memory for command submission!

And with CONFIG_TRANSPARENT_HUGEPAGE=y the same kernel works fine.
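
To pull such messages out of a long dmesg attachment, a small filter along these lines may help. It is a rough sketch: the function name `drm_errors` and the pattern are assumptions derived from the two error lines quoted above, and other drm message formats may not match.

```python
import re

# Matches lines such as:
#   "[  324.561030] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Not enough memory ..."
ERROR_RE = re.compile(
    r"^\[\s*(?P<ts>\d+\.\d+)\]\s+"                    # kernel timestamp
    r"\[drm:(?P<func>\w+)(?:\s+\[(?P<module>\w+)\])?\]"  # function and optional module
    r"\s+\*ERROR\*\s+(?P<msg>.*)$"                     # error text
)


def drm_errors(dmesg_text):
    """Return (timestamp, function, message) for each drm *ERROR* line."""
    found = []
    for line in dmesg_text.splitlines():
        match = ERROR_RE.match(line.strip())
        if match:
            found.append(
                (float(match.group("ts")), match.group("func"), match.group("msg"))
            )
    return found
```

Calling `drm_errors(open("dmesg").read())` on a saved log returns the matches in the order they appear; the file name here is only an example.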

