Bug 107367 - [regression, bisected] Games freeze the PC with newest AMD Staging DRM Next Kernel
Status: RESOLVED FIXED
Alias: None
Product: Mesa
Classification: Unclassified
Component: Drivers/Gallium/radeonsi
Version: git
Hardware: Other All
Importance: medium normal
Assignee: Default DRI bug account
QA Contact: Default DRI bug account
Reported: 2018-07-24 19:10 UTC by Gregor Münch
Modified: 2018-07-31 19:22 UTC (History)

Attachments
Possible fix (2.36 KB, patch)
2018-07-27 07:55 UTC, Christian König
dmesg with HEAD at drm/scheduler: add NULL pointer check for run queue (v2) (142.44 KB, text/plain)
2018-07-28 12:33 UTC, Gregor Münch
dmesg crash with current HEAD (148.18 KB, text/plain)
2018-07-28 12:36 UTC, Gregor Münch
Possible fix (3.40 KB, patch)
2018-07-30 11:05 UTC, Christian König
Freeze with Shadow of Mordor (195.44 KB, text/plain)
2018-07-30 14:28 UTC, Gregor Münch

Description Gregor Münch 2018-07-24 19:10:44 UTC
This is reproducible every time with Dota2 when starting a benchmark at 1920x1080 with Vulkan using the Phoronix Test Suite.
Starting the benchmark freezes the machine, and the only way out is to hit the reset button.
I also experienced the freeze with Shadow of Mordor, but not every time.

fc9a53946d9c107a36b79de3dd1a4eac43f13f3f is the first bad commit
commit fc9a53946d9c107a36b79de3dd1a4eac43f13f3f
Author: Junwei Zhang <Jerry.Zhang@amd.com>
Date:   Mon Jul 16 10:53:43 2018 +0800

    drm/scheduler: add NULL pointer check for run queue (v2)
    
    To check rq pointer before adding entity into it.
    That avoids NULL pointer access in some case.
    
    v2: move the check to caller
    
    Suggested-by: Christian König <christian.koenig@amd.com>
    Signed-off-by: Junwei Zhang <Jerry.Zhang@amd.com>
    Reviewed-by: Christian König <christian.koenig@amd.com>

:040000 040000 aa8b4eb640e3995a1f77907edd6ec8fc6fccd1b3 4a061e532ddb8a1293e781a4ccd5ece2efe35d8a M	drivers


Reverting the commit makes the freeze go away.
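
For readers unfamiliar with how a "first bad commit" result like the one above is reached, here is a minimal sketch of the binary search that bisection performs. It is an illustration only, not git's implementation: the function name `first_bad`, the commit labels, and the `is_bad` predicate are hypothetical, and it assumes a linear history with a single good-to-bad transition. A second regression introduced later would not be detected by a search like this.

```python
def first_bad(commits, is_bad):
    """Binary search over an ordered commit list.

    Assumes commits[0] is known good and commits[-1] is known bad,
    and that every commit after the first bad one is also bad.
    """
    lo, hi = 0, len(commits) - 1  # lo: last known good, hi: first known bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(commits[mid]):
            hi = mid  # the regression is at mid or earlier
        else:
            lo = mid  # the regression is after mid
    return commits[hi]


# Hypothetical history: the regression is introduced in "c5".
history = ["c0", "c1", "c2", "c3", "c4", "c5", "c6", "c7"]
bad_from = history.index("c5")
result = first_bad(history, lambda c: history.index(c) >= bad_from)
print(result)
```

In practice the predicate is a manual step: build the kernel at the candidate commit, run the benchmark, and report whether the machine froze.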
Comment 1 Gregor Münch 2018-07-25 09:05:39 UTC
Forgot to add:

direct rendering: Yes
Extended renderer info (GLX_MESA_query_renderer):
    Vendor: X.Org (0x1002)
    Device: AMD Radeon HD 7900 Series (TAHITI, DRM 3.27.0, 4.18.0-rc1-amd-staging-drm-next-git, LLVM 7.0.0) (0x6798)
    Version: 18.2.0
    Accelerated: yes
    Video memory: 3072MB
    Unified memory: no
    Preferred profile: core (0x1)
    Max core profile version: 4.5
    Max compat profile version: 4.4
    Max GLES1 profile version: 1.1
    Max GLES[23] profile version: 3.1
OpenGL vendor string: X.Org
OpenGL renderer string: AMD Radeon HD 7900 Series (TAHITI, DRM 3.27.0, 4.18.0-rc1-amd-staging-drm-next-git, LLVM 7.0.0)
OpenGL core profile version string: 4.5 (Core Profile) Mesa 18.2.0-devel (git-6853862a58)
OpenGL core profile shading language version string: 4.50
OpenGL core profile context flags: (none)
OpenGL core profile profile mask: core profile

OpenGL version string: 4.4 (Compatibility Profile) Mesa 18.2.0-devel (git-6853862a58)
OpenGL shading language version string: 4.40
OpenGL context flags: (none)
OpenGL profile mask: compatibility profile

OpenGL ES profile version string: OpenGL ES 3.1 Mesa 18.2.0-devel (git-6853862a58)
OpenGL ES profile shading language version string: OpenGL ES GLSL ES 3.10

I use the latest linux-firmware git to be able to use my Tahiti card with AMDGPU.
Comment 2 Christian König 2018-07-25 12:14:27 UTC
Well that is more than just a bit strange.

Do you have an error message in dmesg that the entity was killed?
Comment 3 Gregor Münch 2018-07-26 14:53:12 UTC
This is the crash with kernel from today:

Jul 26 16:47:57 greg-pc kernel: Fixing recursive fault but reboot is needed!
Jul 26 16:47:57 greg-pc kernel: CR2: 00007f14fd433fd2 CR3: 0000000003009005 CR4: 00000000001606f0
Jul 26 16:47:57 greg-pc kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jul 26 16:47:57 greg-pc kernel: FS:  0000000000000000(0000) GS:ffff88041ec00000(0000) knlGS:0000000000000000
Jul 26 16:47:57 greg-pc kernel: R13: ffff880407a32200 R14: 0000000000000206 R15: dead000000000200
Jul 26 16:47:57 greg-pc kernel: R10: ffff8803fd74e018 R11: 000000000000000d R12: ffff880408f42768
Jul 26 16:47:57 greg-pc kernel: RBP: ffff8803d62a18b8 R08: 0000000000000000 R09: 000000000000000c
Jul 26 16:47:57 greg-pc kernel: RDX: ffff880408f46d40 RSI: 0000000000000000 RDI: ffff8803d62a18b8
Jul 26 16:47:57 greg-pc kernel: RAX: 0000000000000000 RBX: ffff8803d62a1800 RCX: 00000000ffffffff
Jul 26 16:47:57 greg-pc kernel: RSP: 0018:ffffc9000821fb88 EFLAGS: 00010286
Jul 26 16:47:57 greg-pc kernel: Code: 89 df 48 89 04 24 e8 9d fd ff ff 48 8b 04 24 eb bc 31 c0 eb 8d 0f 1f 00 0f 1f 44 00 00 41 55 41 54 55 48 89 fd 53 48 8b 47 10 <48> 8b 58 08 48 85 c0 74 07 31 f6 e8 6e fd ff ff 48 83 7d 20 00 0f 
Jul 26 16:47:57 greg-pc kernel: RIP: 0010:drm_sched_entity_fini+0x12/0x190 [gpu_sched]
Jul 26 16:47:57 greg-pc kernel: ---[ end trace 2d06a2d2eeb82fce ]---
Jul 26 16:47:57 greg-pc kernel: CR2: 0000000000000008
Jul 26 16:47:57 greg-pc kernel:  ehci_pci xhci_hcd ehci_hcd scsi_mod usbcore usb_common amdgpu chash i2c_algo_bit gpu_sched drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm
Jul 26 16:47:57 greg-pc kernel: Modules linked in: ccm fuse nls_iso8859_1 nls_cp437 vfat fat intel_rapl x86_pkg_temp_thermal intel_powerclamp coretemp arc4 kvm ath9k ath9k_common irqbypass ath9k_hw crct10dif_pclmul crc32_pclmul eeepc_wmi>
Jul 26 16:47:57 greg-pc kernel: R13: 000055d5bb149e78 R14: 0000000000000000 R15: 000055d5bb149ecc
Jul 26 16:47:57 greg-pc kernel: R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000003
Jul 26 16:47:57 greg-pc kernel: RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
Jul 26 16:47:57 greg-pc kernel: RDX: 0000000000000000 RSI: 0000000000000080 RDI: 000055d5bb149ecc
Jul 26 16:47:57 greg-pc kernel: RAX: fffffffffffffe00 RBX: 000055d5bb149ea0 RCX: 00007f14fd433ffc
Jul 26 16:47:57 greg-pc kernel: RSP: 002b:00007f14ea320d60 EFLAGS: 00000246 ORIG_RAX: 00000000000000ca
Jul 26 16:47:57 greg-pc kernel: Code: Bad RIP value.
Jul 26 16:47:57 greg-pc kernel: RIP: 0033:0x7f14fd433ffc
Jul 26 16:47:57 greg-pc kernel:  entry_SYSCALL_64_after_hwframe+0x44/0xa9
Jul 26 16:47:57 greg-pc kernel:  do_syscall_64+0xcb/0x100
Jul 26 16:47:57 greg-pc kernel:  exit_to_usermode_loop+0x8e/0xb0
Jul 26 16:47:57 greg-pc kernel:  ? task_work_run+0x7c/0xb0
Jul 26 16:47:57 greg-pc kernel:  do_signal+0x36/0x630
Jul 26 16:47:57 greg-pc kernel:  get_signal+0x240/0x590
Jul 26 16:47:57 greg-pc kernel:  do_group_exit+0x33/0xa0
Jul 26 16:47:57 greg-pc kernel:  ? ___preempt_schedule+0x16/0x18
Jul 26 16:47:57 greg-pc kernel:  ? preempt_schedule_common+0x11/0x30
Jul 26 16:47:57 greg-pc kernel:  do_exit+0x301/0xa90
Jul 26 16:47:57 greg-pc kernel:  task_work_run+0x90/0xb0
Jul 26 16:47:57 greg-pc kernel:  __fput+0xa0/0x1f0
Jul 26 16:47:57 greg-pc kernel:  drm_release+0x25a/0x390 [drm]
Jul 26 16:47:57 greg-pc kernel:  amdgpu_driver_postclose_kms+0x102/0x230 [amdgpu]
Jul 26 16:47:57 greg-pc kernel:  amdgpu_vm_fini+0x84/0x400 [amdgpu]
Jul 26 16:47:57 greg-pc kernel: Call Trace:
Jul 26 16:47:57 greg-pc kernel: CR2: 0000000000000008 CR3: 0000000003009005 CR4: 00000000001606f0
Jul 26 16:47:57 greg-pc kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jul 26 16:47:57 greg-pc kernel: FS:  0000000000000000(0000) GS:ffff88041ec00000(0000) knlGS:0000000000000000
Jul 26 16:47:57 greg-pc kernel: R13: ffff880407a32200 R14: 0000000000000206 R15: dead000000000200
Jul 26 16:47:57 greg-pc kernel: R10: ffff8803fd74e018 R11: 000000000000000d R12: ffff880408f42768
Jul 26 16:47:57 greg-pc kernel: RBP: ffff8803d62a18b8 R08: 0000000000000000 R09: 000000000000000c
Jul 26 16:47:57 greg-pc kernel: RDX: ffff880408f46d40 RSI: 0000000000000000 RDI: ffff8803d62a18b8
Jul 26 16:47:57 greg-pc kernel:  get_signal+0x240/0x590
Jul 26 16:47:57 greg-pc kernel:  do_group_exit+0x33/0xa0
Jul 26 16:47:57 greg-pc kernel:  ? ___preempt_schedule+0x16/0x18
Jul 26 16:47:57 greg-pc kernel:  ? preempt_schedule_common+0x11/0x30
Jul 26 16:47:57 greg-pc kernel:  do_exit+0x301/0xa90
Jul 26 16:47:57 greg-pc kernel:  task_work_run+0x90/0xb0
Jul 26 16:47:57 greg-pc kernel:  __fput+0xa0/0x1f0
Jul 26 16:47:57 greg-pc kernel:  drm_release+0x25a/0x390 [drm]
Jul 26 16:47:57 greg-pc kernel:  amdgpu_driver_postclose_kms+0x102/0x230 [amdgpu]
Jul 26 16:47:57 greg-pc kernel:  amdgpu_vm_fini+0x84/0x400 [amdgpu]
Jul 26 16:47:57 greg-pc kernel: Call Trace:
Jul 26 16:47:57 greg-pc kernel: CR2: 0000000000000008 CR3: 0000000003009005 CR4: 00000000001606f0
Jul 26 16:47:57 greg-pc kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jul 26 16:47:57 greg-pc kernel: FS:  0000000000000000(0000) GS:ffff88041ec00000(0000) knlGS:0000000000000000
Jul 26 16:47:57 greg-pc kernel: R13: ffff880407a32200 R14: 0000000000000206 R15: dead000000000200
Jul 26 16:47:57 greg-pc kernel: R10: ffff8803fd74e018 R11: 000000000000000d R12: ffff880408f42768
Jul 26 16:47:57 greg-pc kernel: RBP: ffff8803d62a18b8 R08: 0000000000000000 R09: 000000000000000c
Jul 26 16:47:57 greg-pc kernel: RDX: ffff880408f46d40 RSI: 0000000000000000 RDI: ffff8803d62a18b8
Jul 26 16:47:57 greg-pc kernel: RAX: 0000000000000000 RBX: ffff8803d62a1800 RCX: 00000000ffffffff
Jul 26 16:47:57 greg-pc kernel: RSP: 0018:ffffc9000821fb88 EFLAGS: 00010286
Jul 26 16:47:57 greg-pc kernel: Code: 89 df 48 89 04 24 e8 9d fd ff ff 48 8b 04 24 eb bc 31 c0 eb 8d 0f 1f 00 0f 1f 44 00 00 41 55 41 54 55 48 89 fd 53 48 8b 47 10 <48> 8b 58 08 48 85 c0 74 07 31 f6 e8 6e fd ff ff 48 83 7d 20 00 0f 
Jul 26 16:47:57 greg-pc kernel: RIP: 0010:drm_sched_entity_fini+0x12/0x190 [gpu_sched]
Jul 26 16:47:57 greg-pc kernel: Hardware name: ASUS All Series/MAXIMUS VI IMPACT, BIOS 1603 08/15/2014
Jul 26 16:47:57 greg-pc kernel: CPU: 0 PID: 1746 Comm: dota2:disk$0 Tainted: G           O      4.18.0-2-amd-staging-drm-next-git #1
Jul 26 16:47:57 greg-pc kernel: Oops: 0000 [#1] PREEMPT SMP
Jul 26 16:47:57 greg-pc kernel: PGD 0 P4D 0 
Jul 26 16:47:57 greg-pc kernel: BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
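
The paste above appears to be in newest-first order, so the key facts are spread across it: the fault address (0000000000000008), the faulting function (drm_sched_entity_fini+0x12 in gpu_sched), and RAX being zero, which together suggest a read at offset 8 from a NULL base pointer. As a minimal sketch for pulling those fields out of a similar paste, assuming the line formats shown above, something like the following could be used. The helper `summarize_oops` and its regular expressions are hypothetical and not part of any kernel tooling.

```python
import re

# Hypothetical patterns, written against the line formats in the log above.
FAULT_RE = re.compile(r"NULL pointer dereference at ([0-9a-f]+)")
RIP_RE = re.compile(
    r"RIP: 0010:(\w+)\+0x([0-9a-f]+)/0x([0-9a-f]+)(?: \[(\w+)\])?"
)


def summarize_oops(lines):
    """Collect fault address, faulting symbol, offset and module.

    Line order does not matter, so a newest-first paste works too.
    """
    summary = {}
    for line in lines:
        m = FAULT_RE.search(line)
        if m:
            summary["fault_addr"] = int(m.group(1), 16)
        m = RIP_RE.search(line)
        if m and "symbol" not in summary:
            summary["symbol"] = m.group(1)
            summary["offset"] = int(m.group(2), 16)
            summary["module"] = m.group(4)  # None when no module is listed
    return summary


sample = [
    "kernel: RIP: 0010:drm_sched_entity_fini+0x12/0x190 [gpu_sched]",
    "kernel: RIP: 0033:0x7f14fd433ffc",
    "kernel: BUG: unable to handle kernel NULL pointer dereference at 0000000000000008",
]
info = summarize_oops(sample)
print(info)
```

A small fault address like this one usually points at a structure member read through a NULL pointer. The full dmesg is still needed to see what led up to it.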
Comment 4 Christian König 2018-07-27 07:53:47 UTC
(In reply to Gregor Münch from comment #3)
> This is the crash with kernel from today:

Well, that is not very helpful; please add the full dmesg as an attachment.
Comment 5 Christian König 2018-07-27 07:55:43 UTC
Created attachment 140849
Possible fix

A shot in the dark, but maybe the attached patch helps.
Comment 6 Gregor Münch 2018-07-28 12:33:30 UTC
Created attachment 140866
dmesg with HEAD at drm/scheduler: add NULL pointer check for run queue (v2)

The provided patch didn't help. Reverting the "add NULL pointer check" commit no longer helps either with the current git HEAD.

I did a git checkout of 28756ed597d51b3ceb79c29387a40c51ea353e30 to confirm once more that the kernel works at that commit, and then a git checkout of fc9a53946d9c107a36b79de3dd1a4eac43f13f3f to confirm that it is the first commit that makes the kernel crash.
So there must be another commit after that one that is making the problem worse.

Also, the crash in dmesg is quite different.
Comment 7 Gregor Münch 2018-07-28 12:36:01 UTC
Created attachment 140867
dmesg crash with current HEAD
Comment 8 Christian König 2018-07-30 11:05:13 UTC
Created attachment 140889
Possible fix

Yeah, I see. Please test the new fix for this.
Comment 9 Gregor Münch 2018-07-30 14:28:35 UTC
Created attachment 140895
Freeze with Shadow of Mordor

I applied both patches and the crash with Dota2 is fixed.

However I tried with Shadow of Mordor and the third benchmark run froze my PC again.
Comment 10 Christian König 2018-07-31 11:13:02 UTC
(In reply to Gregor Münch from comment #9)
> Created attachment 140895
> Freeze with Shadow of Mordor
> 
> I applied both patches and the crash with Dota2 is fixed.
> 
> However I tried with Shadow of Mordor and the third benchmark run froze my
> PC again.

Completely different issue, please open a new bug report when you see that again.

This is a hardware lockup and not a NULL pointer deref in the software stack.
Comment 11 Gregor Münch 2018-07-31 19:22:19 UTC
Will do if I find a way to reproduce it. It seems some other reports are about exactly the same error message and random freezes.

