Bug 110888 - 5.0.21 kernel crash when many GPU apps run concurrently; error messages: "amdgpu_vm_validate_pt_bos() failed." and "Not enough memory for command submission!"
Status: RESOLVED MOVED
Product: DRI
Classification: Unclassified
Component: DRM/AMDgpu
Version: DRI git
Hardware: ARM Linux (All)
Importance: medium critical
Assignee: Default DRI bug account
Reported: 2019-06-11 07:19 UTC by wormwang
Modified: 2019-11-19 09:31 UTC

Attachments
radeontop just before kernel crash (23.27 KB, image/png)
2019-06-11 10:20 UTC, wormwang

Description wormwang 2019-06-11 07:19:34 UTC
Env: kernel 5.0.21, mesa 18.2.8, firmware 1.179, drm 2.4.97, binder-dkms 1.3, Android image kydroid cm-13.0-19.05.30-1-clouddisk, 192 GB RAM, AMD RX580 8GB

As a test, we ran 77 GPU apps concurrently. The kernel crashed and the machine rebooted automatically.

journalctl log #100 (comment)

crash dump


[ 3138.636753] [drm:amdgpu_cs_parser_bos.isra.11 [amdgpu]] ERROR amdgpu_vm_validate_pt_bos() failed.
[ 3138.636831] [drm:amdgpu_cs_parser_bos.isra.11 [amdgpu]] ERROR amdgpu_vm_validate_pt_bos() failed.
[ 3138.636915] [drm:amdgpu_cs_parser_bos.isra.11 [amdgpu]] ERROR amdgpu_vm_validate_pt_bos() failed.
[ 3138.636989] [drm:amdgpu_cs_ioctl [amdgpu]] ERROR Not enough memory for command submission!
[ 3138.647377] [drm:amdgpu_cs_ioctl [amdgpu]] ERROR Not enough memory for command submission!
[ 3138.657138] [drm:amdgpu_cs_ioctl [amdgpu]] ERROR Not enough memory for command submission!
[ 3138.801062] Unable to handle kernel access to user memory outside uaccess routines at virtual address 00000000000000a8
[ 3138.801240] [drm:amdgpu_cs_parser_bos.isra.11 [amdgpu]] ERROR amdgpu_vm_validate_pt_bos() failed.
[ 3138.811638] Mem abort info:
[ 3138.811642] ESR = 0x96000004
[ 3138.811644] Exception class = DABT (current EL), IL = 32 bits
[ 3138.811647] SET = 0, FnV = 0
[ 3138.811649] EA = 0, S1PTW = 0
[ 3138.811651] Data abort info:
[ 3138.811653] ISV = 0, ISS = 0x00000004
[ 3138.811655] CM = 0, WnR = 0
[ 3138.811660] user pgtable: 4k pages, 48-bit VAs, pgdp = 000000000787c0fb
[ 3138.811663] [00000000000000a8] pgd=0000000000000000
[ 3138.811669] Internal error: Oops: 96000004 [#1] SMP
[ 3138.811673] Modules linked in: nfnetlink_log veth xt_CHECKSUM iptable_mangle nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo br_netfilter xt_nat ipt_MASQUERADE overlay xt_recent ipt_REJECT nf_reject_ipv4 xt_tcpudp devlink xt_mark xt_comment xt_conntrack bridge stp llc ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter xt_addrtype iptable_nat nf_nat_ipv4 nf_nat bpfilter ip_vs_sh ip_vs_wrr ip_vs_rr ip_vs nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 input_leds joydev nls_iso8859_1 snd_hda_intel snd_hda_codec snd_hda_core snd_hwdep snd_pcm snd_timer snd soundcore ipmi_ssif ipmi_si ipmi_devintf ipmi_msghandler sch_fq_codel ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi binder_dkms(OE) ip_tables x_tables autofs4 ses enclosure btrfs zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor xor_neon raid6_pq libcrc32c raid1 raid0 multipath linear hibmc_drm hid_generic usbhid hid marvell aes_ce_blk
[ 3138.811754] aes_ce_cipher
[ 3138.822304] [drm:amdgpu_cs_ioctl [amdgpu]] ERROR Not enough memory for command submission!
[ 3138.827351] amdgpu crct10dif_ce chash i2c_algo_bit ghash_ce gpu_sched ttm sha2_ce sha256_arm64 drm_kms_helper sha1_ce syscopyarea sysfillrect sysimgblt fb_sys_fops drm hns_enet_drv mpt3sas e1000e hisi_sas_v2_hw raid_class hisi_sas_main ehci_platform libsas hns_dsaf scsi_transport_sas hns_mdio hnae aes_neon_bs aes_neon_blk crypto_simd cryptd aes_arm64
[ 3138.827381] Process BootAnimation (pid: 240132, stack limit = 0x00000000184b1ef3)
[ 3138.827386] CPU: 17 PID: 240132 Comm: BootAnimation Kdump: loaded Tainted: G OE 5.0.0-2106051013-generic #appstreamdebug
[ 3138.827388] Hardware name: Huawei TaiShan 2280 /BC11SPCD, BIOS 1.56 09/20/2018
[ 3138.827391] pstate: 60400005 (nZCv daif +PAN -UAO)
[ 3138.827499] pc : amdgpu_vm_init+0x1e4/0x490 [amdgpu]
[ 3138.827583] lr : amdgpu_vm_init+0x298/0x490 [amdgpu]
[ 3138.867149] [drm:amdgpu_cs_parser_bos.isra.11 [amdgpu]] ERROR amdgpu_vm_validate_pt_bos() failed.
[ 3138.868460] sp : ffff0003b1a5b900
[ 3138.868462] x29: ffff0003b1a5b900 x28: ffff8013f4f36000
[ 3138.868466] x27: ffff8013ae49e0c0 x26: ffff8013ae49e100
[ 3138.868469] x25: ffff0000097de000 x24: 0000000000008143
[ 3138.868472] x23: 0000000000000000 x22: ffff000011994000
[ 3138.868474] x21: 00000000fffffff4 x20: 0000000000000050
[ 3138.868477] x19: ffff8013ae49e000 x18: 0000000000000000
[ 3138.868480] x17: 0000000000000000 x16: 0000000000000101
[ 3138.868483] x15: 0000000000000000 x14: ffff0000110a6748
[ 3138.868485] x13: 0000000000000001 x12: 0000000000000000
[ 3138.873930] [drm:amdgpu_cs_ioctl [amdgpu]] ERROR Not enough memory for command submission!
[ 3138.878709] x11: 0000000000000001 x10: 0000000000000000
[ 3138.878712] x9 : ffff000008f674f0 x8 : ffff000011994b48
[ 3138.878715] x7 : ffff000008f58e20 x6 : 0000000000000000
[ 3138.878718] x5 : 0000000000000000 x4 : ffff000011994b48
[ 3138.878720] x3 : 0000000000000001 x2 : 7d8b3ec762676c00
[ 3138.878723] x1 : 0000000000000000 x0 : 00000000fffffff4
[ 3138.878729] Call trace:
[ 3138.878823] amdgpu_vm_init+0x1e4/0x490 [amdgpu]
[ 3138.878912] amdgpu_driver_open_kms+0x9c/0x200 [amdgpu]
[ 3139.153799] drm_file_alloc+0x134/0x258 [drm]
[ 3139.158515] drm_open+0xac/0x210 [drm]
[ 3139.163037] drm_stub_open+0xec/0x118 [drm]
[ 3139.167537] chrdev_open+0xac/0x1c0
[ 3139.171858] do_dentry_open+0x1c4/0x370
[ 3139.175949] vfs_open+0x38/0x48
[ 3139.179895] do_last+0x32c/0x8b0
[ 3139.183680] path_openat+0x90/0x288
[ 3139.187217] do_filp_open+0x88/0x108
[ 3139.190768] do_sys_open+0x1b0/0x3b0
[ 3139.194222] __arm64_sys_openat+0x2c/0x38
[ 3139.197480] el0_svc_common+0x8c/0x190
[ 3139.200847] el0_svc_handler+0x38/0x78
[ 3139.202961] [drm:amdgpu_cs_parser_bos.isra.11 [amdgpu]] ERROR amdgpu_vm_validate_pt_bos() failed.
[ 3139.203982] el0_svc+0x8/0xc
[ 3139.211009] [drm:amdgpu_cs_ioctl [amdgpu]] ERROR Not enough memory for command submission!
[ 3139.214079] Code: 2a0003f5 34000540 f9406277 910142f4 (b9405a80)
[ 3139.214210] SMP: stopping secondary CPUs
[ 3139.226747] Starting crashdump kernel...
[ 3139.230360] Bye!
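
The ESR value and the x0 register in the dump above can be decoded by hand. The following is a minimal sketch, assuming the ARMv8-A ESR_ELx field layout (EC in bits 31:26, IL in bit 25, ISS in bits 24:0, and for data aborts WnR in bit 6 and DFSC in bits 5:0). The function names are illustrative and not part of any kernel tooling.

```python
import errno

def decode_esr(esr):
    """Split an ARMv8-A ESR_ELx value into the fields the kernel prints."""
    iss = esr & 0x1FFFFFF              # instruction specific syndrome, bits 24:0
    return {
        "ec": (esr >> 26) & 0x3F,      # exception class, bits 31:26
        "il": (esr >> 25) & 1,         # instruction length bit
        "iss": iss,
        "wnr": (iss >> 6) & 1,         # 1 = write, 0 = read (data aborts)
        "dfsc": iss & 0x3F,            # data fault status code
    }

def reg_as_errno(reg):
    """Interpret the low 32 bits of a register as a signed int return code."""
    low32 = reg & 0xFFFFFFFF
    signed = low32 - (1 << 32) if low32 & 0x80000000 else low32
    name = errno.errorcode.get(-signed) if signed < 0 else None
    return signed, name

# Values copied from the oops: "ESR = 0x96000004" and "x0 : 00000000fffffff4".
fields = decode_esr(0x96000004)
print({k: hex(v) for k, v in fields.items()})
print(reg_as_errno(0x00000000FFFFFFF4))
```

For the values in this report, EC decodes to 0x25 (a data abort taken at the current exception level, matching the "DABT (current EL)" line), WnR is 0 (a read), and DFSC is 0x04 (a level 0 translation fault, consistent with pgd=0000000000000000). The x0 and x21 registers hold -12, which is -ENOMEM.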
Comment 1 wormwang 2019-06-11 10:20:36 UTC
Created attachment 144503 [details]
radeontop just before kernel crash

radeontop just before the kernel crash.

About 65% of VRAM is still free.
Comment 2 Christian König 2019-06-12 08:16:56 UTC
Looks like a NULL pointer check is missing somewhere in amdgpu_vm_init() to me.
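
The "Code:" line of the oops appears to support this reading. Below is a minimal sketch that decodes the last three instruction words, assuming the standard A64 encodings for LDR (immediate, unsigned offset) and ADD (immediate). The helper names are made up for illustration, and the decoders handle only these two instruction forms.

```python
def decode_ldr_uimm(word):
    """Decode A64 LDR (immediate, unsigned offset), 32- or 64-bit.

    Returns (rt, rn, byte_offset).
    """
    size = (word >> 30) & 0x3                      # 2 = 32-bit, 3 = 64-bit
    if (word >> 22) & 0xFF != 0b11100101 or size < 2:
        raise ValueError("not a 32/64-bit LDR (unsigned offset)")
    imm12 = (word >> 10) & 0xFFF
    return word & 0x1F, (word >> 5) & 0x1F, imm12 << size

def decode_add_imm(word):
    """Decode 64-bit A64 ADD (immediate). Returns (rd, rn, imm)."""
    if (word >> 23) & 0x1FF != 0b100100010:
        raise ValueError("not a 64-bit ADD (immediate)")
    shift = 12 if (word >> 22) & 1 else 0
    imm12 = (word >> 10) & 0xFFF
    return word & 0x1F, (word >> 5) & 0x1F, imm12 << shift

# Words copied from "Code: ... f9406277 910142f4 (b9405a80)";
# the parenthesised word is the faulting instruction.
load_ptr = decode_ldr_uimm(0xF9406277)    # ldr x23, [x19, #0xc0]
make_addr = decode_add_imm(0x910142F4)    # add x20, x23, #0x50
faulting = decode_ldr_uimm(0xB9405A80)    # ldr w0,  [x20, #0x58]

x23 = 0x0                                  # from the register dump
x20 = x23 + make_addr[2]
fault_address = x20 + faulting[2]
print(load_ptr, make_addr, faulting, hex(fault_address))
```

With x23 = 0 in the register dump, x20 becomes 0x50 and the faulting load reads from 0x50 + 0x58 = 0xa8, which is the address reported in the oops. That pattern fits a member access through a NULL pointer previously loaded from [x19 + 0xc0], although the dump alone does not say which structure field is involved.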

But in general you are running out of system memory, not video memory. So whatever you try to do here won't work in general unless you either add more system memory or add a swap file.
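
To see whether system memory rather than VRAM is the limit, the available RAM and configured swap can be watched while the apps start. This is a minimal sketch, assuming a Linux /proc/meminfo with the usual "Key: value kB" lines; the function names and the returned field names are illustrative only.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values (kB)."""
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            values[key.strip()] = int(parts[0])
    return values

def memory_headroom(info):
    """Summarise available system memory and whether any swap is configured."""
    swap_total = info.get("SwapTotal", 0)
    return {
        "available_kb": info.get("MemAvailable", 0),
        "swap_total_kb": swap_total,
        "swap_free_kb": info.get("SwapFree", 0),
        "has_swap": swap_total > 0,
    }

def read_meminfo(path="/proc/meminfo"):
    """Read and parse the live meminfo file (Linux only)."""
    with open(path) as fh:
        return parse_meminfo(fh.read())
```

Calling memory_headroom(read_meminfo()) periodically during the test run would show whether MemAvailable drops toward zero before the crash and whether any swap is configured at all.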
Comment 5 Martin Peres 2019-11-19 09:31:21 UTC
-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/drm/amd/issues/828.
