This is reproducible every time with Dota2 when starting a benchmark at 1920x1080 with Vulkan using the Phoronix Test Suite. Starting the benchmark freezes the machine, and all you can do is hit the reset button. I also experienced the freeze with Shadow of Mordor, but not every time.

fc9a53946d9c107a36b79de3dd1a4eac43f13f3f is the first bad commit
commit fc9a53946d9c107a36b79de3dd1a4eac43f13f3f
Author: Junwei Zhang <Jerry.Zhang@amd.com>
Date:   Mon Jul 16 10:53:43 2018 +0800

    drm/scheduler: add NULL pointer check for run queue (v2)

    To check rq pointer before adding entity into it.
    That avoids NULL pointer access in some case.

    v2: move the check to caller

    Suggested-by: Christian König <christian.koenig@amd.com>
    Signed-off-by: Junwei Zhang <Jerry.Zhang@amd.com>
    Reviewed-by: Christian König <christian.koenig@amd.com>

:040000 040000 aa8b4eb640e3995a1f77907edd6ec8fc6fccd1b3 4a061e532ddb8a1293e781a4ccd5ece2efe35d8a M drivers

Reverting the commit makes the freeze go away.
Forgot to add:

direct rendering: Yes
Extended renderer info (GLX_MESA_query_renderer):
    Vendor: X.Org (0x1002)
    Device: AMD Radeon HD 7900 Series (TAHITI, DRM 3.27.0, 4.18.0-rc1-amd-staging-drm-next-git, LLVM 7.0.0) (0x6798)
    Version: 18.2.0
    Accelerated: yes
    Video memory: 3072MB
    Unified memory: no
    Preferred profile: core (0x1)
    Max core profile version: 4.5
    Max compat profile version: 4.4
    Max GLES1 profile version: 1.1
    Max GLES[23] profile version: 3.1
OpenGL vendor string: X.Org
OpenGL renderer string: AMD Radeon HD 7900 Series (TAHITI, DRM 3.27.0, 4.18.0-rc1-amd-staging-drm-next-git, LLVM 7.0.0)
OpenGL core profile version string: 4.5 (Core Profile) Mesa 18.2.0-devel (git-6853862a58)
OpenGL core profile shading language version string: 4.50
OpenGL core profile context flags: (none)
OpenGL core profile profile mask: core profile
OpenGL version string: 4.4 (Compatibility Profile) Mesa 18.2.0-devel (git-6853862a58)
OpenGL shading language version string: 4.40
OpenGL context flags: (none)
OpenGL profile mask: compatibility profile
OpenGL ES profile version string: OpenGL ES 3.1 Mesa 18.2.0-devel (git-6853862a58)
OpenGL ES profile shading language version string: OpenGL ES GLSL ES 3.10

I use the latest linux-firmware git to be able to use my Tahiti card with AMDGPU.
Well that is more than just a bit strange. Do you have an error message in dmesg that the entity was killed?
This is the crash with the kernel from today:

Jul 26 16:47:57 greg-pc kernel: Fixing recursive fault but reboot is needed!
Jul 26 16:47:57 greg-pc kernel: CR2: 00007f14fd433fd2 CR3: 0000000003009005 CR4: 00000000001606f0
Jul 26 16:47:57 greg-pc kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jul 26 16:47:57 greg-pc kernel: FS: 0000000000000000(0000) GS:ffff88041ec00000(0000) knlGS:0000000000000000
Jul 26 16:47:57 greg-pc kernel: R13: ffff880407a32200 R14: 0000000000000206 R15: dead000000000200
Jul 26 16:47:57 greg-pc kernel: R10: ffff8803fd74e018 R11: 000000000000000d R12: ffff880408f42768
Jul 26 16:47:57 greg-pc kernel: RBP: ffff8803d62a18b8 R08: 0000000000000000 R09: 000000000000000c
Jul 26 16:47:57 greg-pc kernel: RDX: ffff880408f46d40 RSI: 0000000000000000 RDI: ffff8803d62a18b8
Jul 26 16:47:57 greg-pc kernel: RAX: 0000000000000000 RBX: ffff8803d62a1800 RCX: 00000000ffffffff
Jul 26 16:47:57 greg-pc kernel: RSP: 0018:ffffc9000821fb88 EFLAGS: 00010286
Jul 26 16:47:57 greg-pc kernel: Code: 89 df 48 89 04 24 e8 9d fd ff ff 48 8b 04 24 eb bc 31 c0 eb 8d 0f 1f 00 0f 1f 44 00 00 41 55 41 54 55 48 89 fd 53 48 8b 47 10 <48> 8b 58 08 48 85 c0 74 07 31 f6 e8 6e fd ff ff 48 83 7d 20 00 0f
Jul 26 16:47:57 greg-pc kernel: RIP: 0010:drm_sched_entity_fini+0x12/0x190 [gpu_sched]
Jul 26 16:47:57 greg-pc kernel: ---[ end trace 2d06a2d2eeb82fce ]---
Jul 26 16:47:57 greg-pc kernel: CR2: 0000000000000008
Jul 26 16:47:57 greg-pc kernel: ehci_pci xhci_hcd ehci_hcd scsi_mod usbcore usb_common amdgpu chash i2c_algo_bit gpu_sched drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm
Jul 26 16:47:57 greg-pc kernel: Modules linked in: ccm fuse nls_iso8859_1 nls_cp437 vfat fat intel_rapl x86_pkg_temp_thermal intel_powerclamp coretemp arc4 kvm ath9k ath9k_common irqbypass ath9k_hw crct10dif_pclmul crc32_pclmul eeepc_wmi>
Jul 26 16:47:57 greg-pc kernel: R13: 000055d5bb149e78 R14: 0000000000000000 R15: 000055d5bb149ecc
Jul 26 16:47:57 greg-pc kernel: R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000003
Jul 26 16:47:57 greg-pc kernel: RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
Jul 26 16:47:57 greg-pc kernel: RDX: 0000000000000000 RSI: 0000000000000080 RDI: 000055d5bb149ecc
Jul 26 16:47:57 greg-pc kernel: RAX: fffffffffffffe00 RBX: 000055d5bb149ea0 RCX: 00007f14fd433ffc
Jul 26 16:47:57 greg-pc kernel: RSP: 002b:00007f14ea320d60 EFLAGS: 00000246 ORIG_RAX: 00000000000000ca
Jul 26 16:47:57 greg-pc kernel: Code: Bad RIP value.
Jul 26 16:47:57 greg-pc kernel: RIP: 0033:0x7f14fd433ffc
Jul 26 16:47:57 greg-pc kernel: entry_SYSCALL_64_after_hwframe+0x44/0xa9
Jul 26 16:47:57 greg-pc kernel: do_syscall_64+0xcb/0x100
Jul 26 16:47:57 greg-pc kernel: exit_to_usermode_loop+0x8e/0xb0
Jul 26 16:47:57 greg-pc kernel: ? task_work_run+0x7c/0xb0
Jul 26 16:47:57 greg-pc kernel: do_signal+0x36/0x630
Jul 26 16:47:57 greg-pc kernel: get_signal+0x240/0x590
Jul 26 16:47:57 greg-pc kernel: do_group_exit+0x33/0xa0
Jul 26 16:47:57 greg-pc kernel: ? ___preempt_schedule+0x16/0x18
Jul 26 16:47:57 greg-pc kernel: ? preempt_schedule_common+0x11/0x30
Jul 26 16:47:57 greg-pc kernel: do_exit+0x301/0xa90
Jul 26 16:47:57 greg-pc kernel: task_work_run+0x90/0xb0
Jul 26 16:47:57 greg-pc kernel: __fput+0xa0/0x1f0
Jul 26 16:47:57 greg-pc kernel: drm_release+0x25a/0x390 [drm]
Jul 26 16:47:57 greg-pc kernel: amdgpu_driver_postclose_kms+0x102/0x230 [amdgpu]
Jul 26 16:47:57 greg-pc kernel: amdgpu_vm_fini+0x84/0x400 [amdgpu]
Jul 26 16:47:57 greg-pc kernel: Call Trace:
Jul 26 16:47:57 greg-pc kernel: CR2: 0000000000000008 CR3: 0000000003009005 CR4: 00000000001606f0
Jul 26 16:47:57 greg-pc kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jul 26 16:47:57 greg-pc kernel: FS: 0000000000000000(0000) GS:ffff88041ec00000(0000) knlGS:0000000000000000
Jul 26 16:47:57 greg-pc kernel: R13: ffff880407a32200 R14: 0000000000000206 R15: dead000000000200
Jul 26 16:47:57 greg-pc kernel: R10: ffff8803fd74e018 R11: 000000000000000d R12: ffff880408f42768
Jul 26 16:47:57 greg-pc kernel: RBP: ffff8803d62a18b8 R08: 0000000000000000 R09: 000000000000000c
Jul 26 16:47:57 greg-pc kernel: RDX: ffff880408f46d40 RSI: 0000000000000000 RDI: ffff8803d62a18b8
Jul 26 16:47:57 greg-pc kernel: RAX: 0000000000000000 RBX: ffff8803d62a1800 RCX: 00000000ffffffff
Jul 26 16:47:57 greg-pc kernel: RSP: 0018:ffffc9000821fb88 EFLAGS: 00010286
Jul 26 16:47:57 greg-pc kernel: Code: 89 df 48 89 04 24 e8 9d fd ff ff 48 8b 04 24 eb bc 31 c0 eb 8d 0f 1f 00 0f 1f 44 00 00 41 55 41 54 55 48 89 fd 53 48 8b 47 10 <48> 8b 58 08 48 85 c0 74 07 31 f6 e8 6e fd ff ff 48 83 7d 20 00 0f
Jul 26 16:47:57 greg-pc kernel: RIP: 0010:drm_sched_entity_fini+0x12/0x190 [gpu_sched]
Jul 26 16:47:57 greg-pc kernel: Hardware name: ASUS All Series/MAXIMUS VI IMPACT, BIOS 1603 08/15/2014
Jul 26 16:47:57 greg-pc kernel: CPU: 0 PID: 1746 Comm: dota2:disk$0 Tainted: G O 4.18.0-2-amd-staging-drm-next-git #1
Jul 26 16:47:57 greg-pc kernel: Oops: 0000 [#1] PREEMPT SMP
Jul 26 16:47:57 greg-pc kernel: PGD 0 P4D 0
Jul 26 16:47:57 greg-pc kernel: BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
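For readers skimming the oops, the few fields that matter most can be pulled out mechanically. The following is a minimal sketch, not part of the original report: the function name `summarize_oops` and the choice of fields are assumptions made for illustration, and the input lines are copied from the log above with the journal prefix stripped. It shows that the fault address (0x8) sits 8 bytes past a zero RAX, which suggests a structure member being read through a NULL pointer inside `drm_sched_entity_fini`.

```python
import re

# Lines copied from the oops above, journal prefix removed.
OOPS_LINES = [
    "BUG: unable to handle kernel NULL pointer dereference at 0000000000000008",
    "CPU: 0 PID: 1746 Comm: dota2:disk$0 Tainted: G O 4.18.0-2-amd-staging-drm-next-git #1",
    "RIP: 0010:drm_sched_entity_fini+0x12/0x190 [gpu_sched]",
    "RAX: 0000000000000000 RBX: ffff8803d62a1800 RCX: 00000000ffffffff",
]


def summarize_oops(lines):
    """Collect fault address, faulting symbol/offset/module, task name and RAX.

    Hypothetical helper for illustration; it only understands the handful
    of line shapes that appear in this particular oops.
    """
    summary = {}
    for line in lines:
        m = re.search(r"NULL pointer dereference at ([0-9a-f]+)", line)
        if m:
            summary["fault_addr"] = int(m.group(1), 16)
        # 0010 is the kernel code segment; the user-space RIP line uses 0033.
        m = re.search(r"RIP: 0010:(\w+)\+(0x[0-9a-f]+)/0x[0-9a-f]+(?: \[(\w+)\])?", line)
        if m:
            summary["symbol"] = m.group(1)
            summary["offset"] = int(m.group(2), 16)
            summary["module"] = m.group(3)
        m = re.search(r"Comm: (\S+)", line)
        if m:
            summary["comm"] = m.group(1)
        # \b keeps this from matching ORIG_RAX.
        m = re.search(r"\bRAX: ([0-9a-f]{16})", line)
        if m:
            summary["rax"] = int(m.group(1), 16)
    return summary


if __name__ == "__main__":
    info = summarize_oops(OOPS_LINES)
    print(info["symbol"], hex(info["offset"]), info["module"])
    print("fault offset from RAX:", info["fault_addr"] - info["rax"])
```

Only the kernel-mode register line is passed in here; feeding the whole log would also match the user-space RAX line further up, so a real tool would need to track which register dump it is in.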
(In reply to Gregor Münch from comment #3)
> This is the crash with kernel from today:

Well that is not very helpful, please add the full dmesg as attachment.
Created attachment 140849
Possible fix

A shot in the dark, but maybe the attached patch helps.
Created attachment 140866
dmesg with HEAD at drm/scheduler: add NULL pointer check for run queue (v2)

The provided patch didn't help. Reverting the "add NULL pointer check" commit doesn't help anymore either with current git HEAD.

I did a git checkout of 28756ed597d51b3ceb79c29387a40c51ea353e30 to confirm once more that the kernel works here at that point, and confirmed with a git checkout of fc9a53946d9c107a36b79de3dd1a4eac43f13f3f that it is the first commit that makes the kernel crash. So there must be another commit after it that makes the problem worse. The crash in dmesg is also quite different.
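The good/bad reasoning above is the same binary search that a bisect performs. As a rough sketch only: the two full hashes are the ones named in this report, every other entry in `HISTORY` is a made-up placeholder, the adjacency of the two real commits is an assumption, and `crashes` stands in for "build the kernel, boot it, run the Dota2 benchmark".

```python
def first_bad(commits, is_bad):
    """Return the first commit for which is_bad() is true.

    Assumes commits are ordered oldest to newest, commits[0] is good,
    commits[-1] is bad, and every commit after the first bad one is bad too.
    """
    lo, hi = 0, len(commits) - 1  # lo: known good, hi: known bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(commits[mid]):
            hi = mid
        else:
            lo = mid
    return commits[hi]


GOOD = "28756ed597d51b3ceb79c29387a40c51ea353e30"  # confirmed working in the report
BAD = "fc9a53946d9c107a36b79de3dd1a4eac43f13f3f"   # first commit that crashes

# Placeholder history: only GOOD and BAD are real; their adjacency is assumed.
HISTORY = ["older-2", "older-1", GOOD, BAD, "newer-1", "newer-2", "newer-3"]


def crashes(commit):
    # Stand-in for a manual test run on the checked-out commit.
    return HISTORY.index(commit) >= HISTORY.index(BAD)


if __name__ == "__main__":
    print(first_bad(HISTORY, crashes))
```

A later commit that makes the crash worse does not break the search, since it only relies on each commit being classified as good or bad.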
Created attachment 140867
dmesg crash with current HEAD
Created attachment 140889
Possible fix

Yeah, I see. Please test the new fix for this.
Created attachment 140895
Freeze with Shadow of Mordor

I applied both patches and the crash with Dota2 is fixed.

However, I then tried Shadow of Mordor, and the third benchmark run froze my PC again.
(In reply to Gregor Münch from comment #9)
> Created attachment 140895 [details]
> Freeze with Shadow of Mordor
>
> I applied both patches and the crash with Dota2 is fixed.
>
> However I tried with Shadow of Mordor and the third benchmark run froze my
> PC again.

Completely different issue, please open a new bug report when you see that again.

This is a hardware lockup and not a NULL pointer deref in the software stack.
Will do if I find a way to reproduce it. It seems some other reports are about exactly the same error message and random freezes.