Bug 109978 - Unprivileged user mode program can cause GPU reset
Status: RESOLVED MOVED
Product: DRI
Classification: Unclassified
Component: DRM/amdkfd
Version: XOrg git
Hardware: x86-64 (AMD64) Linux (All)
Importance: medium major
Assignee: Default DRI bug account
 
Reported: 2019-03-12 13:56 UTC by Michal
Modified: 2019-11-19 07:53 UTC

Description Michal 2019-03-12 13:56:07 UTC
https://github.com/RadeonOpenCompute/ROCK-Kernel-Driver/issues/72

Sample program which causes this (needs ROCm):

> #include <hc.hpp>
> int main()
> {
> 	parallel_for_each(hc::extent<1>(1), [=]() [[hc]]
> 	{
> 		asm("s_trap 2");
> 	});
> 	return 0;
> }

> hcc -hc main.cpp
> ./a.out

The process never exits, and pressing Ctrl-C triggers a GPU reset that breaks every other process using ROCm on that GPU. It seems the trap handler expects the queue handle in s[0:1], which is set when __builtin_trap() is used. When "s_trap 2" is issued directly, as in the sample above, the handle is not set and the trap handler raises further exceptions.

System logs:

[  247.428727] qcm fence wait loop timeout expired
[  247.428730] The cp might be in an unrecoverable state due to an unsuccessful queues preemption
[  247.428736] amdgpu 0000:0b:00.0: GPU reset begin!
[  247.619440] amdgpu 0000:0b:00.0: GPU reset
[  248.152762] [drm] psp mode1 reset succeed 
[  248.279461] amdgpu 0000:0b:00.0: GPU reset succeeded, trying to resume
[  248.279584] [drm] PCIE GART of 512M enabled (table at 0x000000F400900000).
[  248.279639] [drm:amdgpu_device_gpu_recover [amdgpu]] *ERROR* VRAM is lost!
[  248.279769] [drm] PSP is resuming...
[  248.428305] [drm] reserve 0x400000 from 0xf400d00000 for PSP TMR SIZE
[  248.472774] WARNING: CPU: 23 PID: 21634 at /build/linux-uQJ2um/linux-4.15.0/kernel/kthread.c:498 kthread_park+0x67/0x80
[  248.472775] Modules linked in: ufs qnx4 hfsplus hfs minix ntfs msdos jfs xfs msr nls_utf8 cifs ccm fscache cmac bnep binfmt_misc nls_iso8859_1 edac_mce_amd arc4 snd_hda_codec_realtek snd_hda_codec_generic kvm_amd snd_hda_codec_hdmi kvm snd_seq_midi irqbypass snd_hda_intel snd_seq_midi_event snd_hda_codec btusb snd_hda_core btrtl wmi_bmof snd_rawmidi iwlmvm snd_hwdep btbcm btintel snd_pcm snd_seq bluetooth mac80211 snd_seq_device ecdh_generic snd_timer iwlwifi ccp snd cfg80211 soundcore k10temp shpchp mac_hid sch_fq_codel ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi nct6775 hwmon_vid parport_pc ppdev lp parport ip_tables x_tables autofs4 btrfs zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1
[  248.472823]  multipath linear raid0 amdgpu(OE) amdchash(OE) amdttm(OE) amd_sched(OE) mxm_wmi crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc aesni_intel aes_x86_64 amdkcl(OE) crypto_simd glue_helper amd_iommu_v2 cryptd drm_kms_helper syscopyarea sysfillrect sysimgblt igb fb_sys_fops drm dca nvme i2c_algo_bit i2c_piix4 nvme_core ptp ahci atlantic libahci pps_core gpio_amdpt wmi gpio_generic
[  248.472846] CPU: 23 PID: 21634 Comm: a.out Tainted: G           OE    4.15.0-45-generic #48-Ubuntu
[  248.472847] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./X399 Professional Gaming, BIOS P3.30 08/14/2018
[  248.472849] RIP: 0010:kthread_park+0x67/0x80
[  248.472850] RSP: 0018:ffffb44fc7e27ad0 EFLAGS: 00010202
[  248.472852] RAX: 0000000000000004 RBX: ffff9ec63f49e480 RCX: 0000000000000000
[  248.472853] RDX: ffff9ec63c717198 RSI: ffff9ec63ea0c0c0 RDI: ffff9ec63dd38000
[  248.472854] RBP: ffffb44fc7e27ae0 R08: 0000000000000051 R09: 0000000000000000
[  248.472855] R10: 0000000000000000 R11: 0000000000000056 R12: ffff9ec63ea0c0c0
[  248.472855] R13: ffff9ec64f4f4200 R14: ffff9ec63c710000 R15: 0000000000000000
[  248.472857] FS:  00007fd52a286c00(0000) GS:ffff9ec65cdc0000(0000) knlGS:0000000000000000
[  248.472858] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  248.472859] CR2: 00007f0c07687a98 CR3: 000000081b5b6000 CR4: 00000000003406e0
[  248.472860] Call Trace:
[  248.472865]  amddrm_sched_entity_fini+0x44/0x1b0 [amd_sched]
[  248.472868]  amddrm_sched_entity_destroy+0x1f/0x30 [amd_sched]
[  248.472907]  amdgpu_vm_fini+0xbb/0x4f0 [amdgpu]
[  248.472942]  amdgpu_driver_postclose_kms+0x15b/0x2b0 [amdgpu]
[  248.472952]  drm_release+0x26b/0x390 [drm]
[  248.472955]  __fput+0xea/0x220
[  248.472957]  ____fput+0xe/0x10
[  248.472959]  task_work_run+0x9d/0xc0
[  248.472961]  do_exit+0x2ec/0xb40
[  248.472963]  do_group_exit+0x43/0xb0
[  248.472965]  get_signal+0x27b/0x590
[  248.472968]  do_signal+0x37/0x730
[  248.472971]  ? __switch_to_asm+0x34/0x70
[  248.472973]  ? __switch_to_asm+0x40/0x70
[  248.472976]  ? do_vfs_ioctl+0xa8/0x630
[  248.472978]  ? __schedule+0x299/0x8a0
[  248.472980]  exit_to_usermode_loop+0x73/0xd0
[  248.472982]  do_syscall_64+0x115/0x130
[  248.472984]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[  248.472986] RIP: 0033:0x7fd528bdd5d7
[  248.472987] RSP: 002b:00007ffe830d4778 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[  248.472988] RAX: fffffffffffffffc RBX: 0000000000000001 RCX: 00007fd528bdd5d7
[  248.472989] RDX: 00007ffe830d47d0 RSI: 00000000c0184b0c RDI: 0000000000000003
[  248.472990] RBP: 00007ffe830d47d0 R08: 00007ffe830d4890 R09: 0000000000000001
[  248.472990] R10: 0000000000c92010 R11: 0000000000000246 R12: 00000000c0184b0c
[  248.472991] R13: 0000000000000003 R14: 0000000000000000 R15: 00000000fffffffe
[  248.472992] Code: 0e e8 6e c0 00 00 48 8d 7b 18 e8 35 d2 8e 00 44 89 e0 5b 41 5c 5d c3 0f 0b 41 bc da ff ff ff 44 89 e0 5b 41 5c 5d c3 0f 0b eb af <0f> 0b 41 bc f0 ff ff ff eb da 0f 1f 44 00 00 66 2e 0f 1f 84 00 
[  248.473020] ---[ end trace 19649ddd4a6314f7 ]---
[  248.648453] [drm] UVD and UVD ENC initialized successfully.
[  248.748509] [drm] VCE initialized successfully.
[  248.749616] [drm] recover vram bo from shadow start
[  248.749666] [drm] recover vram bo from shadow done
[  248.749680] amdgpu 0000:0b:00.0: GPU reset(1) succeeded!
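
The markers in the log above can be checked for mechanically when triaging similar reports. The following is only a minimal sketch: the names ResetSummary, scan_kernel_log and scan_kernel_log_text are hypothetical, and the matched substrings are taken from this one log, so other kernel or driver versions may word these messages differently.

```cpp
#include <istream>
#include <sstream>
#include <string>

// Hypothetical summary of the reset-related markers seen in the log above.
struct ResetSummary {
    bool preemption_timeout = false;  // "qcm fence wait loop timeout expired"
    bool reset_started = false;       // "GPU reset begin!"
    bool vram_lost = false;           // "VRAM is lost!"
    bool reset_succeeded = false;     // "GPU reset(N) succeeded!"
};

// True if `line` contains `needle` anywhere (timestamps are ignored).
static bool contains(const std::string& line, const char* needle) {
    return line.find(needle) != std::string::npos;
}

// Scan kernel log text line by line and record which markers appear.
// The substrings are assumptions based on the messages in this report.
ResetSummary scan_kernel_log(std::istream& in) {
    ResetSummary s;
    std::string line;
    while (std::getline(in, line)) {
        if (contains(line, "qcm fence wait loop timeout expired"))
            s.preemption_timeout = true;
        if (contains(line, "GPU reset begin!"))
            s.reset_started = true;
        if (contains(line, "VRAM is lost!"))
            s.vram_lost = true;
        // Final line has the form "GPU reset(1) succeeded!"; the earlier
        // "GPU reset succeeded, trying to resume" line is deliberately skipped.
        if (contains(line, "GPU reset(") && contains(line, "succeeded!"))
            s.reset_succeeded = true;
    }
    return s;
}

// Convenience wrapper for log text that is already held in a string.
ResetSummary scan_kernel_log_text(const std::string& text) {
    std::istringstream in(text);
    return scan_kernel_log(in);
}
```

A caller could pass std::cin to scan_kernel_log and pipe dmesg output into it. For the log in this report, all four flags would be set.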

Comment 1 Martin Peres 2019-11-19 07:53:53 UTC
-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/drm/amd/issues/5.

