Bug 106670 - AMD GPU Error, random lockup, Ryzen 2500U Vega 8 GPU
Status: RESOLVED INVALID
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/AMDgpu
Version: unspecified
Hardware: Other All
Importance: medium normal
Assignee: Default DRI bug account
Reported: 2018-05-26 18:13 UTC by JerryD
Modified: 2018-07-23 08:28 UTC (History)

Attachments
Full dmesg text (137.21 KB, text/plain)
2018-05-29 01:17 UTC, JerryD

Description JerryD 2018-05-26 18:13:01 UTC
I am monitoring an HP laptop via ssh to try to catch a lockup problem. I am not sure which component to select for this bug report. Here is the output from dmesg, captured while running glxgears. It has not locked up yet, but I spotted this first. I will post more as I find it. Please advise what other info is needed. [Aside: I think the PCIe Bus Error is unrelated, but I included it so others can judge.]

[  270.207119] pcieport 0000:00:01.7: AER: Multiple Corrected error received: id=0008
[  270.207136] pcieport 0000:00:01.7: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=000f(Transmitter ID)
[  270.207144] pcieport 0000:00:01.7:   device [1022:15d3] error status/mask=00001000/00006000
[  270.207149] pcieport 0000:00:01.7:    [12] Replay Timer Timeout  
[  397.899405] pcieport 0000:00:01.7: AER: Multiple Corrected error received: id=0008
[  397.899426] pcieport 0000:00:01.7: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=000f(Transmitter ID)
[  397.899434] pcieport 0000:00:01.7:   device [1022:15d3] error status/mask=00001000/00006000
[  397.899439] pcieport 0000:00:01.7:    [12] Replay Timer Timeout  
[  793.776505] pcieport 0000:00:01.7: AER: Multiple Corrected error received: id=0008
[  793.776524] pcieport 0000:00:01.7: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=000f(Transmitter ID)
[  793.776532] pcieport 0000:00:01.7:   device [1022:15d3] error status/mask=00001000/00006000
[  793.776537] pcieport 0000:00:01.7:    [12] Replay Timer Timeout  
[  797.012006] nf_conntrack: default automatic helper assignment has been turned off for security reasons and CT-based  firewall rule not found. Use the iptables CT target to attach helpers instead.
[ 1079.061454] pcieport 0000:00:01.7: AER: Corrected error received: id=0008
[ 1079.061469] pcieport 0000:00:01.7: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=000f(Transmitter ID)
[ 1079.061478] pcieport 0000:00:01.7:   device [1022:15d3] error status/mask=00001000/00006000
[ 1079.061483] pcieport 0000:00:01.7:    [12] Replay Timer Timeout  
[ 1079.061489] pcieport 0000:00:01.7: AER: Corrected error received: id=0008
[ 1079.061503] pcieport 0000:00:01.7: can't find device of ID0008
[ 1145.211182] pcieport 0000:00:01.7: AER: Corrected error received: id=0008
[ 1145.211196] pcieport 0000:00:01.7: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=000f(Transmitter ID)
[ 1145.211214] pcieport 0000:00:01.7:   device [1022:15d3] error status/mask=00001000/00006000
[ 1145.211220] pcieport 0000:00:01.7:    [12] Replay Timer Timeout  
[ 1145.211229] pcieport 0000:00:01.7: AER: Corrected error received: id=0008
[ 1145.211239] pcieport 0000:00:01.7: can't find device of ID0008
[ 1350.594831] [drm:generic_reg_wait [amdgpu]] *ERROR* REG_WAIT timeout 1us * 10 tries - optc1_lock line:553
[ 1350.594955] WARNING: CPU: 3 PID: 1828 at drivers/gpu/drm/amd/amdgpu/../display/dc/dc_helper.c:195 generic_reg_wait+0xf3/0x170 [amdgpu]
[ 1350.594956] Modules linked in: ccm fuse rfcomm xt_CHECKSUM ipt_MASQUERADE nf_nat_masquerade_ipv4 tun nf_conntrack_netbios_ns nf_conntrack_broadcast xt_CT ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 xt_conntrack devlink ip_set nfnetlink ebtable_nat ebtable_broute bridge stp llc ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_raw ip6table_security iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack libcrc32c iptable_mangle cmac iptable_raw iptable_security ebtable_filter ebtables ip6table_filter ip6_tables bnep sunrpc vfat fat arc4 r8822be(C) hp_wmi sparse_keymap wmi_bmof edac_mce_amd kvm_amd ccp kvm snd_hda_codec_realtek snd_hda_codec_generic snd_hda_codec_hdmi mac80211 snd_hda_intel btusb irqbypass btrtl crct10dif_pclmul crc32_pclmul btbcm
[ 1350.594989]  btintel snd_hda_codec bluetooth hid_sensor_accel_3d hid_sensor_incl_3d hid_sensor_gyro_3d ghash_clmulni_intel uvcvideo hid_sensor_rotation hid_sensor_magn_3d snd_hda_core hid_sensor_trigger hid_sensor_iio_common industrialio_triggered_buffer videobuf2_vmalloc videobuf2_memops kfifo_buf videobuf2_v4l2 snd_hwdep videobuf2_common industrialio snd_seq videodev cfg80211 snd_seq_device ecdh_generic joydev snd_pcm media rtsx_pci_ms memstick rfkill snd_timer snd sp5100_tco soundcore shpchp i2c_piix4 k10temp tpm_crb wmi tpm_tis hp_accel tpm_tis_core lis3lv02d tpm i2c_scmi video hp_wireless input_polldev pinctrl_amd acpi_cpufreq amdkfd hid_sensor_hub amd_iommu_v2 amdgpu hid_logitech_hidpp chash i2c_algo_bit gpu_sched drm_kms_helper ttm rtsx_pci_sdmmc drm mmc_core crc32c_intel nvme serio_raw nvme_core
[ 1350.595018]  rtsx_pci i2c_hid hid_logitech_dj
[ 1350.595023] CPU: 3 PID: 1828 Comm: gnome-shell Tainted: G         C       4.16.11-300.fc28.x86_64 #1
[ 1350.595024] Hardware name: HP HP ENVY x360 Convertible 15-bq1xx/83C6, BIOS F.17 03/29/2018
[ 1350.595064] RIP: 0010:generic_reg_wait+0xf3/0x170 [amdgpu]
[ 1350.595065] RSP: 0018:ffffbf7048407948 EFLAGS: 00010297
[ 1350.595066] RAX: 0000000000000229 RBX: 0000000000000001 RCX: 0000000000000000
[ 1350.595067] RDX: 0000000000000000 RSI: ffff9fd2decd6938 RDI: ffff9fd2decd6938
[ 1350.595068] RBP: ffff9fd2cb352a00 R08: 0000000000000005 R09: 000000000000042b
[ 1350.595068] R10: 0000000000000001 R11: ffffffff9d9751ed R12: 000000000000000b
[ 1350.595069] R13: 000000000000504d R14: 0000000000000100 R15: 0000000000000001
[ 1350.595071] FS:  00007f2b0357f280(0000) GS:ffff9fd2decc0000(0000) knlGS:0000000000000000
[ 1350.595072] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1350.595072] CR2: 000055e154e8f3a8 CR3: 00000001ecaa0000 CR4: 00000000003406e0
[ 1350.595073] Call Trace:
[ 1350.595121]  optc1_lock+0xa0/0xb0 [amdgpu]
[ 1350.595165]  dcn10_apply_ctx_for_surface+0xdf/0x13f0 [amdgpu]
[ 1350.595173]  ? __alloc_pages_nodemask+0x11e/0x2b0
[ 1350.595175]  ? free_one_page+0x3d6/0x510
[ 1350.595214]  dc_commit_state+0x262/0x560 [amdgpu]
[ 1350.595252]  ? mod_freesync_set_user_enable+0x11b/0x150 [amdgpu]
[ 1350.595295]  amdgpu_dm_atomic_commit_tail+0x373/0xd90 [amdgpu]
[ 1350.595329]  ? amdgpu_bo_pin_restricted+0x1cb/0x2c0 [amdgpu]
[ 1350.595333]  ? _cond_resched+0x15/0x30
[ 1350.595335]  ? wait_for_completion_timeout+0x3a/0x190
[ 1350.595336]  ? wait_for_completion_interruptible+0x35/0x1d0
[ 1350.595347]  commit_tail+0x3d/0x70 [drm_kms_helper]
[ 1350.595354]  drm_atomic_helper_commit+0x103/0x110 [drm_kms_helper]
[ 1350.595372]  drm_atomic_connector_commit_dpms+0xdb/0x100 [drm]
[ 1350.595384]  drm_mode_obj_set_property_ioctl+0x178/0x280 [drm]
[ 1350.595394]  ? drm_mode_obj_find_prop_id+0x40/0x40 [drm]
[ 1350.595403]  drm_ioctl_kernel+0x5b/0xb0 [drm]
[ 1350.595413]  drm_ioctl+0x1c0/0x380 [drm]
[ 1350.595424]  ? drm_mode_obj_find_prop_id+0x40/0x40 [drm]
[ 1350.595428]  ? eventfd_read+0xe6/0x290
[ 1350.595459]  amdgpu_drm_ioctl+0x49/0x80 [amdgpu]
[ 1350.595463]  do_vfs_ioctl+0xa4/0x610
[ 1350.595465]  SyS_ioctl+0x74/0x80
[ 1350.595469]  do_syscall_64+0x74/0x180
[ 1350.595472]  ? entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[ 1350.595473] Code: d9 48 c7 c2 90 81 82 c0 48 c7 c7 9c fd 82 c0 50 4c 8b 4c 24 58 44 8b 44 24 50 e8 c9 c8 de ff 83 7d 20 01 58 44 8b 54 24 08 74 02 <0f> 0b 48 83 c4 10 44 89 d0 5b 5d 41 5c 41 5d 41 5e 41 5f c3 41 
[ 1350.595495] ---[ end trace 3336f495b7ef729e ]---
Comment 1 Ernst Sjöstrand 2018-05-26 20:58:39 UTC
Please post some information about your hardware and the kernel version you're running. Full dmesg is also good, shows what the driver says when it's loading.
Comment 2 JerryD 2018-05-29 01:17:11 UTC
Created attachment 139819
Full dmesg text

dmesg output
Comment 3 JerryD 2018-05-29 01:19:37 UTC
(In reply to Ernst Sjöstrand from comment #1)
> Please post some information about your hardware and the kernel version
> you're running. Full dmesg is also good, shows what the driver says when
> it's loading.

I have attached the dmesg text. I also tried to use ssh from a remote machine to 'see' what is going on, but the ssh session also locks up completely. I can reproduce this hang by running glxgears with vblank_mode=0. The time it takes is random, roughly 10 to 30 minutes.
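
The reproduction described above could be scripted roughly as follows. This is only a sketch under assumptions: the helper names `repro_env` and `run_until_exit`, the heartbeat log, and the 5 second interval are illustrative and not part of the report. The idea is to run glxgears with `vblank_mode=0` and sync a timestamp to disk regularly, so the last heartbeat gives an approximate time of the hang even when ssh locks up too.

```python
import os
import subprocess
import time


def repro_env(base=None):
    """Return a copy of the environment with vblank_mode=0 set (as in the report)."""
    env = dict(os.environ if base is None else base)
    env["vblank_mode"] = "0"
    return env


def run_until_exit(log_path, cmd=("glxgears",), interval=5.0):
    """Run cmd and append a heartbeat line to log_path every `interval` seconds.

    Hypothetical helper: the heartbeat file is an assumption of this sketch.
    """
    proc = subprocess.Popen(list(cmd), env=repro_env(), stdout=subprocess.DEVNULL)
    with open(log_path, "a") as log:
        while proc.poll() is None:
            log.write(f"{time.time():.0f} still running\n")
            log.flush()
            os.fsync(log.fileno())  # force to disk so the last beat survives a hard hang
            time.sleep(interval)
    return proc.returncode
```

Calling `run_until_exit("heartbeat.log")` from a terminal inside the graphical session would start the test; after a forced reboot, the last line of the log shows when the machine was last alive.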
Comment 4 JerryD 2018-05-30 01:06:35 UTC
Extended renderer info (GLX_MESA_query_renderer):
    Vendor: X.Org (0x1002)
    Device: AMD RAVEN (DRM 3.23.0 / 4.16.12-300.fc28.x86_64, LLVM 6.0.0) (0x15dd)
    Version: 18.0.2
    Accelerated: yes
    Video memory: 223MB
    Unified memory: no
    Preferred profile: core (0x1)
    Max core profile version: 4.5
    Max compat profile version: 3.0
    Max GLES1 profile version: 1.1
    Max GLES[23] profile version: 3.1
Memory info (GL_ATI_meminfo):
    VBO free memory - total: 223 MB, largest block: 223 MB
    VBO free aux. memory - total: 3067 MB, largest block: 3067 MB
    Texture free memory - total: 223 MB, largest block: 223 MB
    Texture free aux. memory - total: 3067 MB, largest block: 3067 MB
    Renderbuffer free memory - total: 223 MB, largest block: 223 MB
    Renderbuffer free aux. memory - total: 3067 MB, largest block: 3067 MB
Memory info (GL_NVX_gpu_memory_info):
    Dedicated video memory: 223 MB
    Total available memory: 3291 MB
    Currently available dedicated video memory: 223 MB
OpenGL vendor string: X.Org
OpenGL renderer string: AMD RAVEN (DRM 3.23.0 / 4.16.12-300.fc28.x86_64, LLVM 6.0.0)
OpenGL core profile version string: 4.5 (Core Profile) Mesa 18.0.2
OpenGL core profile shading language version string: 4.50
Comment 5 Jack Wolf 2018-06-09 16:14:26 UTC
I have the same issue.

AMD Ryzen 1800x 
Sapphire Vega 56
amdgpu git
kernel 4.17.0

This is what dmesg says from time to time, in addition to the hangs.

[Sa Jun  9 17:34:54 2018] [drm:generic_reg_wait] *ERROR* REG_WAIT timeout 10us * 3500 tries - dce_mi_free_dmif line:563
[Sa Jun  9 17:34:54 2018] WARNING: CPU: 14 PID: 175 at drivers/gpu/drm/amd/amdgpu/../display/dc/dc_helper.c:195 generic_reg_wait+0xe2/0x160
[Sa Jun  9 17:34:54 2018] Modules linked in: vboxpci(O) vboxnetadp(O) vboxnetflt(O) vboxdrv(O)
[Sa Jun  9 17:34:54 2018] CPU: 14 PID: 175 Comm: kworker/14:1 Tainted: G           O      4.17.0 #1
[Sa Jun  9 17:34:54 2018] Hardware name: System manufacturer System Product Name/PRIME B350-PLUS, BIOS 4011 04/19/2018
[Sa Jun  9 17:34:54 2018] Workqueue: events dm_irq_work_func
[Sa Jun  9 17:34:54 2018] RIP: 0010:generic_reg_wait+0xe2/0x160
[Sa Jun  9 17:34:54 2018] RSP: 0018:ffffba1941e9fa88 EFLAGS: 00010297
[Sa Jun  9 17:34:54 2018] RAX: 0000000000000000 RBX: 0000000000000dad RCX: 0000000000000000
[Sa Jun  9 17:34:54 2018] RDX: 0000000000000000 RSI: ffff985a9ef953b8 RDI: ffff985a9ef953b8
[Sa Jun  9 17:34:54 2018] RBP: 000000000000000a R08: 0000000000000416 R09: 0000000000000002
[Sa Jun  9 17:34:54 2018] R10: 0000000000000002 R11: 0000000000000001 R12: ffff985a8c8b5280
[Sa Jun  9 17:34:54 2018] R13: 00000000000035af R14: 0000000000000010 R15: 0000000000000001
[Sa Jun  9 17:34:54 2018] FS:  0000000000000000(0000) GS:ffff985a9ef80000(0000) knlGS:0000000000000000
[Sa Jun  9 17:34:54 2018] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[Sa Jun  9 17:34:54 2018] CR2: 000055e803d42c78 CR3: 00000003c621c000 CR4: 00000000003406e0
[Sa Jun  9 17:34:54 2018] Call Trace:
[Sa Jun  9 17:34:54 2018]  dce_mi_free_dmif+0x11c/0x1a0
[Sa Jun  9 17:34:54 2018]  dce110_reset_hw_ctx_wrap+0x13b/0x1c0
[Sa Jun  9 17:34:54 2018]  dce110_apply_ctx_to_hw+0x51/0x8c0
[Sa Jun  9 17:34:54 2018]  ? amdgpu_pm_compute_clocks+0xa2/0x570
[Sa Jun  9 17:34:54 2018]  dc_commit_state+0x333/0x5f0
[Sa Jun  9 17:34:54 2018]  ? set_freesync_on_streams.part.6+0x48/0x240
[Sa Jun  9 17:34:54 2018]  ? mod_freesync_set_user_enable+0x116/0x140
[Sa Jun  9 17:34:54 2018]  amdgpu_dm_atomic_commit_tail+0x359/0xd10
[Sa Jun  9 17:34:54 2018]  ? amdgpu_bo_pin_restricted+0x227/0x2e0
[Sa Jun  9 17:34:54 2018]  ? _cond_resched+0x10/0x40
[Sa Jun  9 17:34:54 2018]  ? wait_for_completion_timeout+0x2f/0x130
[Sa Jun  9 17:34:54 2018]  ? _cond_resched+0x10/0x40
[Sa Jun  9 17:34:54 2018]  ? wait_for_completion_interruptible+0x2c/0x160
[Sa Jun  9 17:34:54 2018]  ? dm_plane_helper_prepare_fb+0xea/0x290
[Sa Jun  9 17:34:54 2018]  commit_tail+0x38/0x70
[Sa Jun  9 17:34:54 2018]  drm_atomic_helper_commit+0x11c/0x130
[Sa Jun  9 17:34:54 2018]  dm_restore_drm_connector_state+0x100/0x190
[Sa Jun  9 17:34:54 2018]  handle_hpd_irq+0x81/0xa0
[Sa Jun  9 17:34:54 2018]  dm_irq_work_func+0x49/0x60
[Sa Jun  9 17:34:54 2018]  process_one_work+0x1cc/0x3c0
[Sa Jun  9 17:34:54 2018]  worker_thread+0x26/0x3f0
[Sa Jun  9 17:34:54 2018]  ? trace_event_raw_event_workqueue_execute_start+0xc0/0xc0
[Sa Jun  9 17:34:54 2018]  kthread+0x10e/0x130
[Sa Jun  9 17:34:54 2018]  ? kthread_create_worker_on_cpu+0x70/0x70
[Sa Jun  9 17:34:54 2018]  ret_from_fork+0x22/0x40
[Sa Jun  9 17:34:54 2018] Code: 24 58 48 8b 4c 24 50 89 ee 8b 54 24 48 48 c7 c7 48 1d 4b 9a 44 89 4c 24 08 e8 6b 70 eb ff 41 83 7c 24 20 01 44 8b 4c 24 08 74 02 <0f> 0b 48 83 c4 10 44 89 c8 5b 5d 41 5c 41 5d 41 5e 41 5f c3 0f
[Sa Jun  9 17:34:54 2018] ---[ end trace b03679a92b01c897 ]---


I can't provide logs of the hangs themselves because I can't get into the machine via ssh.
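
Both traces in this thread start with the same `REG_WAIT timeout` message, differing only in the polling parameters and the calling function. To compare such reports quickly, the lines can be pulled out of a saved dmesg file with a short script. This is a minimal sketch: the function name `find_reg_wait_timeouts` and the `budget_us` field (delay multiplied by tries) are assumptions made for illustration, not anything defined by the driver.

```python
import re

# Matches e.g. "REG_WAIT timeout 1us * 10 tries - optc1_lock line:553"
REG_WAIT = re.compile(
    r"REG_WAIT timeout (?P<delay_us>\d+)us \* (?P<tries>\d+) tries"
    r" - (?P<func>\w+) line:(?P<line>\d+)"
)


def find_reg_wait_timeouts(dmesg_text):
    """Return one dict per REG_WAIT timeout line found in dmesg_text."""
    hits = []
    for raw in dmesg_text.splitlines():
        m = REG_WAIT.search(raw)
        if m:
            hits.append({
                "func": m["func"],
                "line": int(m["line"]),
                # total polling time before the driver gave up, in microseconds
                "budget_us": int(m["delay_us"]) * int(m["tries"]),
            })
    return hits


# The two error lines quoted in this report
sample = (
    "[ 1350.594831] [drm:generic_reg_wait [amdgpu]] *ERROR* REG_WAIT timeout "
    "1us * 10 tries - optc1_lock line:553\n"
    "[Sa Jun  9 17:34:54 2018] [drm:generic_reg_wait] *ERROR* REG_WAIT timeout "
    "10us * 3500 tries - dce_mi_free_dmif line:563\n"
)
for hit in find_reg_wait_timeouts(sample):
    print(hit)
```

The regular expression ignores the timestamp prefix, so it works with both the default dmesg format and the human-readable timestamps used in the second trace.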
Comment 6 JerryD 2018-07-22 03:11:08 UTC
I used grubby to add 'idle=nomwait' to my kernel boot command line and the problem seems resolved. The mwait instruction is known to possibly hang threads on some earlier released Ryzen chips as documented in the AMD Errata.
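
For anyone applying the same workaround, it is worth confirming after a reboot that the parameter actually reached the running kernel. The exact grubby invocation is not given above; a commonly used form is `grubby --update-kernel=ALL --args="idle=nomwait"`, but treat that as an assumption and check the man page for your distribution. The check itself can be sketched as below, where `has_kernel_arg` and `read_cmdline` are hypothetical helper names.

```python
def has_kernel_arg(cmdline, arg):
    """True if arg appears as a whole whitespace-separated token in cmdline."""
    return arg in cmdline.split()


def read_cmdline(path="/proc/cmdline"):
    """Read the command line the running kernel was booted with (Linux only)."""
    with open(path) as f:
        return f.read()
```

On the affected machine, `has_kernel_arg(read_cmdline(), "idle=nomwait")` should return True once the new boot entry is in use.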
Comment 7 Michel Dänzer 2018-07-23 08:28:07 UTC
(In reply to JerryD from comment #6)
> I used grubby to add to my kernel boot command 'idle=nomwait' and the
> problem seems resolved. The mwait instruction is known to possibly hang
> threads on some earlier released ryzen chips as documented in the AMD Errata.

Thanks for the follow-up, resolving accordingly.

