| Field | Value |
|---|---|
| Summary | amdgpu 0000:07:00.0: swiotlb buffer is full (sz: 2097152 bytes) |
| Product | DRI |
| Component | DRM/AMDgpu |
| Status | RESOLVED FIXED |
| Severity | normal |
| Priority | medium |
| Version | XOrg git |
| Hardware | Other |
| OS | All |
| Reporter | mikhail.v.gavrilov |
| Assignee | Default DRI bug account <dri-devel> |
| CC | bjo, ckoenig.leichtzumerken, coolo1mc, edt, fdsfgs, jdelvare, jlp.bugs, ken, kilgus, matt.scheirer, mnowak, ojab, oliverml1, soprwa, strzol, sustmidown, vedran, xmakerenx |

Description
mikhail.v.gavrilov
2017-12-04 18:47:59 UTC
Also getting the same issue with my Vega 56, Linux 4.15, mesa-git.

Created attachment 136362 [details]
dmesg

Same issue with an AMD Radeon R9 M385X in my Lenovo Y700ACZ laptop with linux-next and Debian unstable Mesa (17.3.1-1).

Created attachment 136378 [details]
dmesg on M385X

This happens here as well with various applications; however, I do not really notice any problems in the applications themselves when it happens.

lspci: VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Vega 10 XT [Radeon RX Vega 64] (rev c3) (prog-if 00 [VGA controller])
glxinfo: OpenGL renderer string: Radeon RX Vega (VEGA10 / DRM 3.23.0 / 4.15.0-rc2-mainline, LLVM 6.0.0)
OpenGL core profile version string: 4.5 (Core Profile) Mesa 17.4.0-devel (git-3667714ccd)

*** Bug 104435 has been marked as a duplicate of this bug. ***

Created attachment 136745 [details]
dmesg with 4.15.0-rc2 amd-staging-drm-next

*** Bug 104685 has been marked as a duplicate of this bug. *** Same on RX580 (polaris10). Happening mesa 17.3 Kernel 4.15-rc9 on RX480 (4G) too 143575.460499] perf: interrupt took too long (3928 > 3910), lowering kernel.perf_event_max_sample_rate to 50700 [146927.696615] [drm] {1920x1200, 2080x1235@154000Khz} [147985.703511] amdgpu 0000:01:00.0: swiotlb buffer is full (sz: 2097152 bytes) [147985.703515] swiotlb: coherent allocation failed for device 0000:01:00.0 size=2097152 [147985.703517] CPU: 6 PID: 848 Comm: Xorg Not tainted 4.15.0-3-git #1 [147985.703517] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./Z87E-ITX, BIOS P2.50 07/11/2014 [147985.703518] Call Trace: [147985.703525] dump_stack+0x5c/0x85 [147985.703529] swiotlb_alloc_coherent+0xd7/0x150 [147985.703534] ttm_dma_pool_get_pages+0x1f0/0x5c0 [ttm] [147985.703537] ttm_dma_populate+0x24a/0x340 [ttm] [147985.703539] ttm_tt_bind+0x23/0x50 [ttm] [147985.703541] ttm_bo_handle_move_mem+0x5cd/0x600 [ttm] [147985.703544] ttm_bo_validate+0x12f/0x140 [ttm] [147985.703551] ? drm_gem_init+0x11/0xa0 [drm] [147985.703553] ttm_bo_init_reserved+0x3a5/0x470 [ttm] [147985.703577] amdgpu_bo_do_create+0x1b4/0x430 [amdgpu] [147985.703592] ? amdgpu_fill_buffer+0x300/0x300 [amdgpu] [147985.703594] ? unix_write_space+0x60/0xa0 [147985.703605] amdgpu_bo_create+0x50/0x210 [amdgpu] [147985.703608] ? kmem_cache_free+0x1b6/0x1e0 [147985.703618] amdgpu_gem_object_create+0x7f/0x110 [amdgpu] [147985.703628] amdgpu_gem_create_ioctl+0x1ec/0x280 [amdgpu] [147985.703638] ? amdgpu_gem_object_close+0x1e0/0x1e0 [amdgpu] [147985.703643] drm_ioctl_kernel+0x59/0xb0 [drm] [147985.703647] drm_ioctl+0x2cb/0x380 [drm] [147985.703657] ? amdgpu_gem_object_close+0x1e0/0x1e0 [amdgpu] [147985.703659] ? timerqueue_add+0x52/0x80 [147985.703667] amdgpu_drm_ioctl+0x49/0x80 [amdgpu] [147985.703670] do_vfs_ioctl+0xa1/0x610 [147985.703672] ? __sys_recvmsg+0x4e/0x90 [147985.703673] ? 
__sys_recvmsg+0x7d/0x90 [147985.703675] SyS_ioctl+0x74/0x80 [147985.703676] entry_SYSCALL_64_fastpath+0x20/0x83 [147985.703678] RIP: 0033:0x7fa503389d87 [147985.703678] RSP: 002b:00007ffc92250c38 EFLAGS: 00000246 Similar with Radeon R4 APU - a6 6310 Kernel 4.15, mesa 17.3.3: "swiotlb_tbl_map_single: 10 callbacks suppressed [76882.115961] amdgpu 0000:00:01.0: swiotlb buffer is full (sz: 2097152 bytes) [76882.115964] swiotlb: coherent allocation failed for device 0000:00:01.0 size=2097152 [76882.115969] CPU: 3 PID: 12480 Comm: kworker/u8:15 Not tainted 4.15.0-gentoo #4 [76882.115971] Hardware name: LENOVO 80E3/Lancer 5B2, BIOS A2CN45WW(V2.13) 08/04/2016 [76882.115979] Workqueue: events_unbound async_run_entry_fn [76882.115981] Call Trace: [76882.115992] dump_stack+0x46/0x59 [76882.115998] swiotlb_alloc_coherent+0xdc/0x160 [76882.116004] ttm_dma_pool_get_pages+0x1ba/0x460 [76882.116009] ttm_dma_populate+0x24a/0x340 [76882.116012] ttm_tt_bind+0x24/0x58 [76882.116015] ttm_bo_handle_move_mem+0x380/0x3b8 [76882.116018] ? ttm_bo_mem_space+0x397/0x470 [76882.116021] ttm_bo_evict+0xc9/0x298 [76882.116024] ? __slab_free+0x146/0x300 [76882.116027] ? kmem_cache_free+0x138/0x168 [76882.116031] ? drm_add_edid_modes+0x811/0x1338 [76882.116034] ttm_mem_evict_first+0x15b/0x1c8 [76882.116037] ttm_bo_force_list_clean+0x62/0x110 [76882.116040] amdgpu_device_suspend+0x1db/0x3a8 [76882.116044] ? pci_pm_freeze+0xc0/0xc0 [76882.116047] pci_pm_suspend+0x7d/0x138 [76882.116051] dpm_run_callback+0x28/0xf0 [76882.116055] __device_suspend+0xdd/0x378 [76882.116058] async_suspend+0x15/0x88 [76882.116061] async_run_entry_fn+0x32/0xd8 [76882.116065] process_one_work+0x1d6/0x3c8 [76882.116069] worker_thread+0x26/0x3b8 [76882.116073] ? trace_event_raw_event_workqueue_execute_start+0x88/0x88 [76882.116075] kthread+0x10e/0x128 [76882.116078] ? 
kthread_create_worker_on_cpu+0x48/0x48 [76882.116081] ret_from_fork+0x22/0x40 [76882.175426] amdgpu 0000:00:01.0: swiotlb buffer is full (sz: 2097152 bytes) [76882.175429] swiotlb: coherent allocation failed for device 0000:00:01.0 size=2097152 [76882.175433] CPU: 3 PID: 12480 Comm: kworker/u8:15 Not tainted 4.15.0-gentoo #4 [76882.175435] Hardware name: LENOVO 80E3/Lancer 5B2, BIOS A2CN45WW(V2.13) 08/04/2016 [76882.175444] Workqueue: events_unbound async_run_entry_fn [76882.175447] Call Trace: [76882.175456] dump_stack+0x46/0x59 [76882.175462] swiotlb_alloc_coherent+0xdc/0x160 [76882.175468] ttm_dma_pool_get_pages+0x1ba/0x460 [76882.175473] ttm_dma_populate+0x24a/0x340 [76882.175476] ttm_tt_bind+0x24/0x58 [76882.175479] ttm_bo_handle_move_mem+0x380/0x3b8 [76882.175482] ? ttm_bo_mem_space+0x397/0x470 [76882.175485] ttm_bo_evict+0xc9/0x298 [76882.175489] ? __slab_free+0x146/0x300 [76882.175491] ? kmem_cache_free+0x138/0x168 [76882.175495] ttm_mem_evict_first+0x15b/0x1c8 [76882.175498] ttm_bo_force_list_clean+0x62/0x110 [76882.175501] amdgpu_device_suspend+0x1db/0x3a8 [76882.175505] ? pci_pm_freeze+0xc0/0xc0 [76882.175507] pci_pm_suspend+0x7d/0x138 [76882.175512] dpm_run_callback+0x28/0xf0 [76882.175516] __device_suspend+0xdd/0x378 [76882.175519] async_suspend+0x15/0x88 [76882.175522] async_run_entry_fn+0x32/0xd8 [76882.175527] process_one_work+0x1d6/0x3c8 [76882.175530] worker_thread+0x26/0x3b8 [76882.175534] ? trace_event_raw_event_workqueue_execute_start+0x88/0x88 [76882.175537] kthread+0x10e/0x128 [76882.175540] ? 
kthread_create_worker_on_cpu+0x48/0x48 [76882.175543] ret_from_fork+0x22/0x40 [76882.205482] amdgpu 0000:00:01.0: swiotlb buffer is full (sz: 2097152 bytes) [76882.205484] swiotlb: coherent allocation failed for device 0000:00:01.0 size=2097152 [76882.205489] CPU: 3 PID: 12480 Comm: kworker/u8:15 Not tainted 4.15.0-gentoo #4 [76882.205490] Hardware name: LENOVO 80E3/Lancer 5B2, BIOS A2CN45WW(V2.13) 08/04/2016 [76882.205499] Workqueue: events_unbound async_run_entry_fn [76882.205502] Call Trace: [76882.205512] dump_stack+0x46/0x59 [76882.205518] swiotlb_alloc_coherent+0xdc/0x160 [76882.205523] ttm_dma_pool_get_pages+0x1ba/0x460 [76882.205527] ttm_dma_populate+0x24a/0x340 [76882.205531] ttm_tt_bind+0x24/0x58 [76882.205534] ttm_bo_handle_move_mem+0x380/0x3b8 [76882.205537] ? ttm_bo_mem_space+0x397/0x470 [76882.205540] ttm_bo_evict+0xc9/0x298 [76882.205543] ? __slab_free+0x146/0x300 [76882.205546] ? kmem_cache_free+0x138/0x168 [76882.205549] ttm_mem_evict_first+0x15b/0x1c8 [76882.205553] ttm_bo_force_list_clean+0x62/0x110 [76882.205556] amdgpu_device_suspend+0x1db/0x3a8 [76882.205559] ? pci_pm_freeze+0xc0/0xc0 [76882.205562] pci_pm_suspend+0x7d/0x138 [76882.205567] dpm_run_callback+0x28/0xf0 [76882.205570] __device_suspend+0xdd/0x378 [76882.205574] async_suspend+0x15/0x88 [76882.205577] async_run_entry_fn+0x32/0xd8 [76882.205582] process_one_work+0x1d6/0x3c8 [76882.205585] worker_thread+0x26/0x3b8 [76882.205589] ? trace_event_raw_event_workqueue_execute_start+0x88/0x88 [76882.205591] kthread+0x10e/0x128 [76882.205595] ? 
kthread_create_worker_on_cpu+0x48/0x48 [76882.205598] ret_from_fork+0x22/0x40 [76882.207492] amdgpu 0000:00:01.0: swiotlb buffer is full (sz: 2097152 bytes) [76882.207493] swiotlb: coherent allocation failed for device 0000:00:01.0 size=2097152 [76882.207496] CPU: 3 PID: 12480 Comm: kworker/u8:15 Not tainted 4.15.0-gentoo #4 [76882.207498] Hardware name: LENOVO 80E3/Lancer 5B2, BIOS A2CN45WW(V2.13) 08/04/2016 [76882.207501] Workqueue: events_unbound async_run_entry_fn [76882.207503] Call Trace: [76882.207507] dump_stack+0x46/0x59 [76882.207510] swiotlb_alloc_coherent+0xdc/0x160 [76882.207514] ttm_dma_pool_get_pages+0x1ba/0x460 [76882.207518] ttm_dma_populate+0x24a/0x340 [76882.207521] ttm_tt_bind+0x24/0x58 [76882.207523] ttm_bo_handle_move_mem+0x380/0x3b8 [76882.207526] ? ttm_bo_mem_space+0x397/0x470 [76882.207529] ttm_bo_evict+0xc9/0x298 [76882.207532] ? __slab_free+0x146/0x300 [76882.207534] ? kmem_cache_free+0x138/0x168 [76882.207537] ttm_mem_evict_first+0x15b/0x1c8 [76882.207540] ttm_bo_force_list_clean+0x62/0x110 [76882.207543] amdgpu_device_suspend+0x1db/0x3a8 [76882.207546] ? pci_pm_freeze+0xc0/0xc0 [76882.207548] pci_pm_suspend+0x7d/0x138 [76882.207552] dpm_run_callback+0x28/0xf0 [76882.207555] __device_suspend+0xdd/0x378 [76882.207558] async_suspend+0x15/0x88 [76882.207561] async_run_entry_fn+0x32/0xd8 [76882.207564] process_one_work+0x1d6/0x3c8 [76882.207567] worker_thread+0x26/0x3b8 [76882.207571] ? trace_event_raw_event_workqueue_execute_start+0x88/0x88 [76882.207573] kthread+0x10e/0x128 [76882.207576] ? 
kthread_create_worker_on_cpu+0x48/0x48 [76882.207578] ret_from_fork+0x22/0x40 [76882.363670] amdgpu 0000:00:01.0: swiotlb buffer is full (sz: 2097152 bytes) [76882.363674] swiotlb: coherent allocation failed for device 0000:00:01.0 size=2097152 [76882.363680] CPU: 3 PID: 12480 Comm: kworker/u8:15 Not tainted 4.15.0-gentoo #4 [76882.363682] Hardware name: LENOVO 80E3/Lancer 5B2, BIOS A2CN45WW(V2.13) 08/04/2016 [76882.363696] Workqueue: events_unbound async_run_entry_fn [76882.363699] Call Trace: [76882.363714] dump_stack+0x46/0x59 [76882.363721] swiotlb_alloc_coherent+0xdc/0x160 [76882.363730] ttm_dma_pool_get_pages+0x1ba/0x460 [76882.363734] ttm_dma_populate+0x24a/0x340 [76882.363738] ttm_bo_move_memcpy+0xe3/0x488 [76882.363746] amdgpu_bo_move+0xac/0x178 [76882.363749] ttm_bo_handle_move_mem+0x23e/0x3b8 [76882.363753] ? ttm_bo_mem_space+0xcb/0x470 [76882.363756] ttm_bo_evict+0xc9/0x298 [76882.363762] ? __slab_free+0x146/0x300 [76882.363764] ? __slab_free+0x146/0x300 [76882.363768] ttm_mem_evict_first+0x15b/0x1c8 [76882.363771] ttm_bo_force_list_clean+0x62/0x110 [76882.363774] amdgpu_device_suspend+0x1f3/0x3a8 [76882.363779] ? pci_pm_freeze+0xc0/0xc0 [76882.363781] pci_pm_suspend+0x7d/0x138 [76882.363787] dpm_run_callback+0x28/0xf0 [76882.363791] __device_suspend+0xdd/0x378 [76882.363794] async_suspend+0x15/0x88 [76882.363798] async_run_entry_fn+0x32/0xd8 [76882.363804] process_one_work+0x1d6/0x3c8 [76882.363808] worker_thread+0x26/0x3b8 [76882.363812] ? trace_event_raw_event_workqueue_execute_start+0x88/0x88 [76882.363814] kthread+0x10e/0x128 [76882.363818] ? kthread_create_worker_on_cpu+0x48/0x48 [76882.363822] ret_from_fork+0x22/0x40 [76882.485582] ACPI: Preparing to enter system sleep state S3 [76882.486150] ACPI: EC: event blocked [76882.486151] ACPI: EC: EC stopped [76882.486152] PM: Saving platform NVS memory [76882.488070] Disabling non-boot CPUs ... 
[76882.503782] smpboot: CPU 1 is now offline [76882.523634] smpboot: CPU 2 is now offline [76882.550181] smpboot: CPU 3 is now offline"

I don't know if it is related, but: https://lkml.org/lkml/2018/1/16/106

Yes, sorry, my bad. Wrong copy-paste action here. Thanks Alex.

Created attachment 137234 [details]
dmesg with RX460
Hello,
I have the same messages with one RX460 coming from X and plasmashell on my system. Running 4.15.0 and mesa 18.0.0 on opensuse Tumbleweed. I'm attaching some lines from dmesg.
Regards,
Stratos
I get this as well with a 290 on 4.15 ONLY when DC is enabled. Its a dirty enough taint to leave btrfs volumes in inconsistent states so its straight crashing the kernel. Feb 08 17:06:08 system kernel: RSP: 002b:00007f56c62fc1a8 EFLAGS: 00000246 Feb 08 17:06:08 system kernel: amdgpu 0000:01:00.0: swiotlb buffer is full (sz: 2097152 bytes) Feb 08 17:06:08 system kernel: swiotlb: coherent allocation failed for device 0000:01:00.0 size=2097152 Feb 08 17:06:08 system kernel: CPU: 5 PID: 2216 Comm: Compositor Tainted: G O 4.15.1-2-ARCH #1 Feb 08 17:06:08 system kernel: Hardware name: ASUS All Series/Z87I-DELUXE, BIOS 1204 11/26/2014 Feb 08 17:06:08 system kernel: Call Trace: Feb 08 17:06:08 system kernel: dump_stack+0x5c/0x85 Feb 08 17:06:08 system kernel: swiotlb_alloc_coherent+0xe0/0x150 Feb 08 17:06:08 system kernel: ttm_dma_pool_get_pages+0x1f3/0x5c0 [ttm] Feb 08 17:06:08 system kernel: ttm_dma_populate+0x24a/0x340 [ttm] Feb 08 17:06:08 system kernel: ttm_tt_bind+0x29/0x60 [ttm] Feb 08 17:06:08 system kernel: ttm_bo_handle_move_mem+0x5da/0x610 [ttm] Feb 08 17:06:08 system kernel: ttm_bo_validate+0x135/0x150 [ttm] Feb 08 17:06:08 system kernel: ttm_bo_init_reserved+0x3a7/0x480 [ttm] Feb 08 17:06:08 system kernel: amdgpu_bo_do_create+0x1b4/0x420 [amdgpu] Feb 08 17:06:08 system kernel: ? amdgpu_fill_buffer+0x310/0x310 [amdgpu] Feb 08 17:06:08 system kernel: amdgpu_bo_create+0x50/0x210 [amdgpu] Feb 08 17:06:08 system kernel: ? ttm_eu_backoff_reservation+0x4d/0x70 [ttm] Feb 08 17:06:08 system kernel: amdgpu_gem_object_create+0x7f/0x110 [amdgpu] Feb 08 17:06:08 system kernel: amdgpu_gem_create_ioctl+0x1e6/0x280 [amdgpu] Feb 08 17:06:08 system kernel: ? page_add_file_rmap+0x11/0x140 Feb 08 17:06:08 system kernel: ? amdgpu_gem_object_close+0x1e0/0x1e0 [amdgpu] Feb 08 17:06:08 system kernel: drm_ioctl_kernel+0x5b/0xb0 [drm] Feb 08 17:06:08 system kernel: drm_ioctl+0x2d5/0x370 [drm] Feb 08 17:06:08 system kernel: ? 
amdgpu_gem_object_close+0x1e0/0x1e0 [amdgpu] Feb 08 17:06:08 system kernel: ? __handle_mm_fault+0xd30/0x1260 Feb 08 17:06:08 system kernel: amdgpu_drm_ioctl+0x49/0x80 [amdgpu] Feb 08 17:06:08 system kernel: do_vfs_ioctl+0xa4/0x630 Feb 08 17:06:08 system kernel: ? __do_page_fault+0x29d/0x500 Feb 08 17:06:08 system kernel: ? SyS_futex+0x12d/0x180 Feb 08 17:06:08 system kernel: SyS_ioctl+0x74/0x80 Feb 08 17:06:08 system kernel: ? do_page_fault+0x32/0x110 Feb 08 17:06:08 system kernel: entry_SYSCALL_64_fastpath+0x20/0x83 Feb 08 17:06:08 system kernel: RIP: 0033:0x7f56f7570d87

(In reply to Matthew Scheirer from comment #16)
> I get this as well with a 290 on 4.15 ONLY when DC is enabled. Its a dirty
> enough taint to leave btrfs volumes in inconsistent states so its straight
> crashing the kernel.

The swiotlb messages this report is about are harmless (though a fix is on the way anyway); the other issues you mention are probably not directly related.

(In reply to Michel Dänzer from comment #17)
> The swiotlb messages this report is about are harmless (though a fix is on
> the way anyway), the other issues you mention are probably not directly
> related.

I can say that it is not completely harmless. I'm using VirtualBox on my KDE Plasma desktop and I have had several crashes of Windows VMs, which are "aborted" while running. At the time the "aborted" happens, I can see a flood of this type of message in my logs. I didn't have any issue with my Windows VirtualBox VMs up to now, and the "aborted" crash has resulted in broken VMs that need repair. I haven't seen any btrfs corruption or other issues, though.

(In reply to Stratos Zolotas from comment #18)
> I can say that it is not completely harmless. I'm using virtualbox on my KDE
> plasma desktop and I have several crashes on windows VMs which are "aborted"
> while running. At the time the "aborted" is happening, I can see a flood of
> this type of messages on my logs.

That's either simply coincidence, or means the kernel ran out of memory. The messages this report is about are triggered by transient memory allocation failures, for which there is a fallback path.

(In reply to Michel Dänzer from comment #19)
> That's either simply coincidence, or means the kernel ran out of memory. The
> messages this report is about are triggered by transient memory allocation
> failures, for which there is a fallback path.

Maybe it is a coincidence. I'm certainly not running out of memory when it happens, and it started with kernel 4.15 and/or Mesa 18.0 without any change on the VirtualBox side. I'll wait for the fix to see what happens...

(In reply to Stratos Zolotas from comment #20)
> Maybe it is a coincidence, for sure I'm not running out of memory when it is
> happening and started with kernel 4.15 and (or) Mesa 18.0 without any change
> on Virtualbox side. I'll wait for the fix to see what happens...

Well, the message means that we ran out of memory to allocate a 2MB block and instead use many small 4K blocks.

It could be that there is a memory leak somewhere triggering both the message and the VM faults you are seeing.

(In reply to Christian König from comment #21)
> (In reply to Stratos Zolotas from comment #20)
> > Maybe it is a coincidence, for sure I'm not running out of memory when it is
> > happening and started with kernel 4.15 and (or) Mesa 18.0 without any change
> > on Virtualbox side. I'll wait for the fix to see what happens...
>
> Well the message means that we ran out of memory to allocate a 2MB block and
> instead use many small 4K blocks.

I am getting these messages on my machine right now, too.

total used free shared buffers cached
Mem: 16386788 15157252 1229536 761064 88852 2660664
-/+ buffers/cache: 12407736 3979052
Swap: 25165820 156 25165664

There is enough RAM left to allocate a 2 MiB block.

(In reply to Andreas Kilgus from comment #22)
> I am getting these messsages on my machine right now, too.
>
> total used free shared buffers cached
> Mem: 16386788 15157252 1229536 761064 88852 2660664
> -/+ buffers/cache: 12407736 3979052
> Swap: 25165820 156 25165664
>
> There is enough RAM left to allocate a 2 MiB block.

No, there isn't. 1229536 pages free means that there are 1229536 4k pages available; that doesn't tell us anything about 2MB pages, nor whether those are usable for swiotlb.

Take a look at /proc/pagetypeinfo. There must be pages at order 9 or higher for a 2MB swiotlb allocation to succeed.

An alternative workaround is David Zhou's patch, which completely avoids the swiotlb allocator on x86.

(In reply to Christian König from comment #23)
> Take a look at /proc/pagetypeinfo. There must at least be pages at order 9
> or higher for an 2MB swiotlb allocation to succeed.

OK, I had a look and if I got it right: memory fragmentation makes the allocation fail. <nitpick>I still would not call that "running out of memory" - there is enough memory available, its current layout is just not in the desired shape.</nitpick> ;)

(In reply to Andreas Kilgus from comment #24)
> (In reply to Christian König from comment #23)
> > Take a look at /proc/pagetypeinfo. There must at least be pages at order 9
> > or higher for an 2MB swiotlb allocation to succeed.
>
> OK, I had a look and if I got it right: memory fragmentation lets the malloc
> fail. <nitpick>I still would not call that to "run out of memory" - there is
> enough memory available, its current layout is just not in the desired
> shape.</nitpick> ;)

Yeah, correct.

One interesting point is that this only seems to happen when DC is enabled, so that points to a possible memory leak there.

It would be interesting to monitor amdgpu_gem_info to see whether there is an increasing number of buffers around when DC is enabled.

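The "order 9" figure from the comments above can be checked with a short script. This is only a minimal sketch, not kernel code: it assumes 4 KiB base pages (as on x86) and the usual layout of the "Free pages count per migrate type" rows in /proc/pagetypeinfo. The helper names `order_for` and `free_blocks_at_or_above` and the numbers in `SAMPLE` are made up for illustration.

```python
PAGE_SIZE = 4096  # assumption: 4 KiB base pages, as on x86

# Illustrative excerpt in the layout of /proc/pagetypeinfo; the counts are invented.
SAMPLE = """\
Free pages count per migrate type at order       0      1      2      3      4      5      6      7      8      9     10
Node    0, zone   Normal, type    Unmovable    120     40     12      3      1      0      0      0      0      0      0
Node    0, zone   Normal, type      Movable   9000   2100    300     45      8      2      1      0      0      1      2
"""


def order_for(size_bytes, page_size=PAGE_SIZE):
    """Smallest buddy-allocator order whose block covers size_bytes."""
    pages = -(-size_bytes // page_size)  # ceiling division
    return max(0, (pages - 1).bit_length())


def free_blocks_at_or_above(text, min_order):
    """Sum free blocks of order >= min_order per (node, zone, migrate type)."""
    totals = {}
    for line in text.splitlines():
        parts = [p.split() for p in line.split(",")]
        # Only the per-order rows have exactly three comma-separated fields
        # of the form "Node N", "zone Z", "type T <counts...>".
        if len(parts) != 3 or parts[0][:1] != ["Node"] or parts[2][:1] != ["type"]:
            continue
        # Some kernels cap large counts and print them as ">100000".
        counts = [int(c.lstrip(">")) for c in parts[2][2:]]
        key = (int(parts[0][1]), parts[1][1], parts[2][1])
        totals[key] = sum(counts[min_order:])
    return totals


order = order_for(2097152)  # the size reported in the swiotlb message
print("order needed:", order)
print(free_blocks_at_or_above(SAMPLE, order))
```

To look at a live system, pass the contents of /proc/pagetypeinfo (often readable by root only) instead of `SAMPLE`. Plenty of free 4K pages together with a total of zero at order 9 and above would match the fragmentation explanation given above. As noted in the thread, free high-order blocks are still no guarantee that they are usable for swiotlb.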
[65574.681417] amdgpu 0000:07:00.0: swiotlb buffer is full (sz: 2097152 bytes) [65574.681436] swiotlb: coherent allocation failed for device 0000:07:00.0 size=2097152 [65574.681439] CPU: 3 PID: 4658 Comm: steam Not tainted 4.15.0-rc4-amd-vega+ #13 [65574.681441] Hardware name: Gigabyte Technology Co., Ltd. Z87M-D3H/Z87M-D3H, BIOS F11 08/12/2014 [65574.681443] Call Trace: [65574.681452] dump_stack+0x8e/0xd6 [65574.681457] swiotlb_alloc_coherent+0xe8/0x160 [65574.681466] x86_swiotlb_alloc_coherent+0x43/0x50 [65574.681476] ttm_dma_pool_get_pages+0x230/0x630 [ttm] [65574.681492] ttm_dma_populate+0x139/0x360 [ttm] [65574.681534] amdgpu_ttm_tt_populate+0xd0/0xf0 [amdgpu] [65574.681541] ttm_tt_bind+0x2b/0x60 [ttm] [65574.681547] ttm_bo_handle_move_mem+0x566/0x5a0 [ttm] [65574.681552] ? ttm_bo_mem_space+0x37a/0x450 [ttm] [65574.681566] ttm_bo_validate+0x186/0x1a0 [ttm] [65574.681577] ? mutex_trylock+0xd4/0xf0 [65574.681585] ttm_bo_init_reserved+0x472/0x500 [ttm] [65574.681617] amdgpu_bo_do_create+0x216/0x550 [amdgpu] [65574.681646] ? amdgpu_fill_buffer+0x2f0/0x2f0 [amdgpu] [65574.681655] ? __lock_acquire+0x2d4/0x1350 [65574.681686] amdgpu_bo_create+0x4d/0x2e0 [amdgpu] [65574.681721] amdgpu_gem_object_create+0x81/0x110 [amdgpu] [65574.681752] ? amdgpu_gem_object_close+0x210/0x210 [amdgpu] [65574.681776] amdgpu_gem_create_ioctl+0x1f4/0x280 [amdgpu] [65574.681782] ? __might_fault+0x3e/0x90 [65574.681809] ? amdgpu_gem_object_close+0x210/0x210 [amdgpu] [65574.681829] drm_ioctl_kernel+0x5d/0xb0 [drm] [65574.681842] drm_ioctl+0x31b/0x3d0 [drm] [65574.681865] ? amdgpu_gem_object_close+0x210/0x210 [amdgpu] [65574.681875] ? trace_hardirqs_on_caller+0xf4/0x190 [65574.681880] ? trace_hardirqs_on+0xd/0x10 [65574.681908] amdgpu_drm_ioctl+0x4f/0x90 [amdgpu] [65574.681941] amdgpu_kms_compat_ioctl+0x14/0x20 [amdgpu] [65574.681946] compat_SyS_ioctl+0x6f9/0x1d60 [65574.681949] ? 
trace_hardirqs_on_caller+0xf4/0x190 [65574.681959] do_fast_syscall_32+0xb0/0x374 [65574.681965] entry_SYSENTER_compat+0x51/0x60 [65574.681969] RIP: 0023:0xf7f47db9 [65574.681971] RSP: 002b:00000000ff836fe8 EFLAGS: 00200286 ORIG_RAX: 0000000000000036 [65574.681975] RAX: ffffffffffffffda RBX: 0000000000000013 RCX: 00000000c0206440 [65574.681977] RDX: 00000000ff83707c RSI: 0000000058c9cbc0 RDI: 00000000c0206440 [65574.681979] RBP: 0000000000000013 R08: 0000000000000000 R09: 0000000000000000 [65574.681981] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000 [65574.681983] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000 [65736.626861] Xwayland (2026) used greatest stack depth: 9712 bytes left

(In reply to Christian König from comment #25)
> One interesting point is that this only seems to happen when DC is enabled,
> so that points to a possible memory leak there.

I do still get the swiotlb OOM errors without DC enabled, but they are fairly infrequent and don't panic the kernel. With DC enabled, swiotlb OOM messages happen much more often, coinciding with crashes.

OK, setting this as fixed, which hopefully makes people stop adding new comments to this bug report.

(In reply to Matthew Scheirer from comment #27)
> I do still get the swiotlib OOM errors without DC enabled, but they are
> fairly infrequent and don't panic crash the kernel. With DC enabled,
> swiotlib OOM messages happen much more often coinciding with crashes.

Thanks, and yeah, that only means that DC releases the memory later than the classic code path. So actually not DC related.

(In reply to Christian König from comment #28)
> Ok setting this as fixed which hopefully makes people stop adding new
> comments to this bug report.

I don't understand. In which commit is it fixed? And if it is fixed, why do I still see this issue with the latest build of amd-staging-drm-next?

(In reply to mikhail.v.gavrilov from comment #30)
> I didn't understand It's already fixed in which commit?
> If yes why I see this issue with latest build amd-staging-drm-next?

See here https://lkml.org/lkml/2018/1/16/106.

I've already pinged Konrad multiple times as well, but I'm still not sure if and when he will send that patch to Linus.

The real point is that those messages are just false positives, i.e. they are completely harmless, and we haven't rebased amd-staging-drm-next in a while, which is most likely the reason why it isn't in there.

Created attachment 137285 [details]
system log
Why are you sure that this is totally harmless? I have proof that it isn't. Look at my system log:
- I got the message "amdgpu 0000:07:00.0: swiotlb buffer is full (sz: 2097152 bytes)" at Feb 10 18:46:00
- And then gnome-shell crashed 2 seconds later at "Feb 10 18:48:53"

Today gnome-shell did not crash, but I noticed that gnome-terminal was closed after this.

Feb 12 02:30:58 localhost.localdomain gnome-terminal-[2799]: Error flushing display: Resource temporarily unavailable
Feb 12 02:30:58 localhost.localdomain systemd[1872]: gnome-terminal-server.service: Main process exited, code=exited, status=1/FAILURE
Feb 12 02:30:58 localhost.localdomain systemd[1872]: gnome-terminal-server.service: Unit entered failed state.
Feb 12 02:30:58 localhost.localdomain systemd[1872]: gnome-terminal-server.service: Failed with result 'exit-code'.

Of course, this may just be a coincidence, but the unpleasant residue remains.

Two seconds

(In reply to mikhail.v.gavrilov from comment #33)
> Why are you sure what this totally harmless?

Those messages are a symptom and not a cause.

E.g. take a look at your logs again: gnome-shell started to crash before you got the message.

What happens in your case is that some process or the kernel runs away with memory, and because of this you get the message and the crash.

We can disable the warning message, but fixing the root problem is a completely different task.

The same on NVIDIA Corporation C79 [GeForce 9300 / nForce 730i] (rev b1). This happens under high disk I/O load, for example when copying files from HDD to USB.

[30790.063300] nouveau 0000:03:00.0: swiotlb buffer is full (sz: 2097152 bytes) [30790.063304] nouveau 0000:03:00.0: swiotlb: coherent allocation failed, size=2097152 [30790.063308] CPU: 1 PID: 2071 Comm: gnome-shell Not tainted 4.16.0-0.rc3.git2.1.vanilla.knurd.1.fc27.x86_64 #1 [30790.063309] Hardware name: NVIDIA MCP7A/MCP7A, BIOS 6.00 PG 04/22/2009 [30790.063310] Call Trace: [30790.063321] dump_stack+0x5c/0x85 [30790.063326] swiotlb_alloc_coherent+0x1be/0x1d0 [30790.063338] ttm_dma_pool_get_pages+0x235/0x620 [ttm] [30790.063345] ttm_dma_populate+0x25e/0x350 [ttm] [30790.063350] ttm_tt_bind+0x2c/0x60 [ttm] [30790.063356] ttm_bo_handle_move_mem+0x577/0x5b0 [ttm] [30790.063362] ttm_bo_validate+0x120/0x130 [ttm] [30790.063393] ? drm_pcie_get_speed_cap_mask+0x8e/0xe0 [drm] [30790.063398] ttm_bo_init_reserved+0x378/0x420 [ttm] [30790.063403] ttm_bo_init+0x62/0xd0 [ttm] [30790.063473] ? nouveau_bo_invalidate_caches+0x10/0x10 [nouveau] [30790.063516] nouveau_bo_new+0x416/0x590 [nouveau] [30790.063562] ? nouveau_bo_invalidate_caches+0x10/0x10 [nouveau] [30790.063605] ? nouveau_gem_new+0x120/0x120 [nouveau] [30790.063648] nouveau_gem_new+0x5d/0x120 [nouveau] [30790.063692] nouveau_gem_ioctl_new+0x51/0xd0 [nouveau] [30790.063706] drm_ioctl_kernel+0x5b/0xb0 [drm] [30790.063720] drm_ioctl+0x2d5/0x370 [drm] [30790.063764] ? 
nouveau_gem_new+0x120/0x120 [nouveau] [30790.063809] nouveau_drm_ioctl+0x64/0xc0 [nouveau] [30790.063815] do_vfs_ioctl+0xa4/0x620 [30790.063818] SyS_ioctl+0x74/0x80 [30790.063822] do_syscall_64+0x74/0x180 [30790.063826] entry_SYSCALL_64_after_hwframe+0x3d/0xa2 [30790.063829] RIP: 0033:0x7f2a2164b8e7 [30790.063830] RSP: 002b:00007ffdc53b6bd8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 [30790.063833] RAX: ffffffffffffffda RBX: 000055c99031d960 RCX: 00007f2a2164b8e7 [30790.063834] RDX: 00007ffdc53b6c30 RSI: 00000000c0306480 RDI: 000000000000000c [30790.063836] RBP: 00007ffdc53b6c30 R08: 0000000000000004 R09: 000055c99031d950 [30790.063837] R10: ffffffffffffffb0 R11: 0000000000000246 R12: 00000000c0306480 [30790.063839] R13: 000000000000000c R14: 000055c98cfab250 R15: 000055c98a0e4140 [30790.110592] nouveau 0000:03:00.0: swiotlb buffer is full (sz: 2097152 bytes)

In which mainline kernel version was the patch for this issue merged? I see that on the `amd-staging-drm-next` branch, which branched from 4.16-rc1, this issue no longer happens, but on mainline 4.16-rc6 it still occurs again and again.

Created attachment 138301 [details]
dmesg 4.16-rc6
Created attachment 138462 [details]
In 4.16.0-rc7 still not fixed
Issue appears here on 4.16 too: [19470.368310] radeon 0000:00:01.0: swiotlb buffer is full (sz: 2097152 bytes) [19470.368334] radeon 0000:00:01.0: swiotlb: coherent allocation failed, size=2097152 [19470.368341] CPU: 0 PID: 815 Comm: Xorg Not tainted 4.16.0-1-ARCH #1 [19470.368343] Hardware name: LENOVO 20BLS00400/20BLS00400, BIOS GSET69WW (2.14 ) 09/28/2017 [19470.368345] Call Trace: [19470.368362] dump_stack+0x5c/0x85 [19470.368369] swiotlb_alloc_coherent+0x1be/0x1d0 [19470.368387] ttm_dma_pool_get_pages+0x215/0x5d0 [ttm] [19470.368401] ttm_dma_populate+0x25b/0x350 [ttm] [19470.368412] ttm_tt_bind+0x2c/0x60 [ttm] [19470.368423] ttm_bo_handle_move_mem+0x577/0x5b0 [ttm] [19470.368436] ttm_bo_validate+0x120/0x130 [ttm] [19470.368474] ? drm_vma_offset_add+0x41/0x60 [drm] [19470.368478] ? acpi_os_map_iomem+0x52/0x180 [19470.368489] ttm_bo_init_reserved+0x395/0x460 [ttm] [19470.368500] ttm_bo_init+0x62/0xe0 [ttm] [19470.368549] ? radeon_update_memory_usage.isra.0+0x50/0x50 [radeon] [19470.368584] radeon_bo_create+0x180/0x240 [radeon] [19470.368622] ? radeon_update_memory_usage.isra.0+0x50/0x50 [radeon] [19470.368660] radeon_gem_object_create+0xa7/0x190 [radeon] [19470.368700] ? radeon_gem_pwrite_ioctl+0x30/0x30 [radeon] [19470.368737] radeon_gem_create_ioctl+0x66/0x100 [radeon] [19470.368776] ? radeon_gem_pwrite_ioctl+0x30/0x30 [radeon] [19470.368798] drm_ioctl_kernel+0x5b/0xb0 [drm] [19470.368823] drm_ioctl+0x2d5/0x370 [drm] [19470.368862] ? radeon_gem_pwrite_ioctl+0x30/0x30 [radeon] [19470.368869] ? vfs_writev+0xb9/0x110 [19470.368901] radeon_drm_ioctl+0x49/0x80 [radeon] [19470.368909] do_vfs_ioctl+0xa4/0x630 [19470.368915] ? __sys_recvmsg+0x4e/0x90 [19470.368919] ? 
__sys_recvmsg+0x7d/0x90 [19470.368924] SyS_ioctl+0x74/0x80 [19470.368931] do_syscall_64+0x74/0x190 [19470.368937] entry_SYSCALL_64_after_hwframe+0x3d/0xa2 [19470.368942] RIP: 0033:0x7f7b3e17ad87 [19470.368945] RSP: 002b:00007ffe18b59e08 EFLAGS: 00003246 ORIG_RAX: 0000000000000010 [19470.368950] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007f7b3e17ad87 [19470.368952] RDX: 00007ffe18b59e80 RSI: 00000000c020645d RDI: 0000000000000010 [19470.368954] RBP: 00007ffe18b59e80 R08: 0000000000000002 R09: 0000000000000000 [19470.368956] R10: 0000000000207000 R11: 0000000000003246 R12: 00000000c020645d [19470.368958] R13: 0000000000000010 R14: 000055c9b1526e90 R15: 000055c9b1526e90 [19615.347634] swiotlb_tbl_map_single: 46 callbacks suppressed I believe there are 2 incarnations of this bug, with the same symptoms but different root causes. The first bug is affecting kernels v4.14 and v4.15, it is caused by 2 MB allocations failing too verbosely when in fact 4 kB allocations worked just fine as a fallback. That one is fixed in v4.16 by: commit d0bc0c2a31c95002d37c3cc511ffdcab851b3256 Author: Christian König Date: Thu Jan 4 14:24:19 2018 +0100 swiotlb: suppress warning when __GFP_NOWARN is set However kernel v4.16 also includes the following commit: commit 0176adb004065d6815a8e67946752df4cd947c5b Author: Christoph Hellwig Date: Tue Jan 9 22:15:30 2018 +0100 swiotlb: refactor coherent buffer allocation which contains a bug producing exactly the same backtraces. That bug is caused by an improper check of the result of dma_coherent_ok(), which was fixed 10 days ago with: commit 9e7f06c8beee304ee21b791653fefcd713f48b9a Author: Takashi Iwai Date: Tue Apr 10 19:05:13 2018 +0200 swiotlb: fix unexpected swiotlb_alloc_coherent failures I backported this fix on top of v4.16.3 and the backtraces are gone. The commit was tagged for stable inclusion so I expect it to be included in v4.16. Bah, I spoke too fast. 
Even with commit 9e7f06c8beee ("swiotlb: fix unexpected swiotlb_alloc_coherent failures") backported, I end up seeing the backtraces again after several hours. So that commit helps but doesn't solve the problem. It seems there are at least 3 different bugs leading to the same problem.

I can confirm that I'm getting the backtraces on openSUSE Tumbleweed with the 4.16.2 kernel installed. They appear less often now, but every time they do, my monitors flicker (I use a 3-monitor setup).

Same here on a Ryzen 2400G + ASUS A320M-K with IOMMU disabled, vanilla kernel 4.16.4. I'm not sure if it's another issue, since it happens on driver loading and the backtrace looks like:

[Wed Apr 25 20:18:50 2018] dump_stack+0x67/0x93
[Wed Apr 25 20:18:50 2018] swiotlb_alloc_coherent+0x1da/0x1f0
[Wed Apr 25 20:18:50 2018] amdgpu_ih_ring_init+0x1fc/0x2c0 [amdgpu]
[Wed Apr 25 20:18:50 2018] vega10_ih_sw_init+0x18/0xb0 [amdgpu]
[Wed Apr 25 20:18:50 2018] amdgpu_device_init+0xa35/0x11e0 [amdgpu]
[Wed Apr 25 20:18:50 2018] amdgpu_driver_load_kms+0x56/0x170 [amdgpu]

but it happens after the `swiotlb buffer is full` error, so I'm writing here.

Created attachment 139111 [details]
4.16.4 raven ridge dmesg, driver fails to load
(In reply to ojab from comment #43)
> I'm not sure if it's another issue [...]

It is, please file your own report.

Problem still present in kernel 4.16.5. I guess I'll have to bisect it...

The bug also occurs with the radeon driver, not only amdgpu:

radeon 0000:01:00.0: swiotlb buffer is full (sz: 2097152 bytes)
radeon 0000:01:00.0: swiotlb: coherent allocation failed, size=2097152
CPU: 1 PID: 979 Comm: Compositor Not tainted 4.16.4-1-ARCH #1

My graphics card: AMD Radeon HD4650 PCIe
OS: Arch Linux, 64-bit

Full error log below. It seems related to the ttm code; I use the radeon driver (not amdgpu):

[20021.576712] radeon 0000:01:00.0: swiotlb buffer is full (sz: 2097152 bytes)
[20021.576715] radeon 0000:01:00.0: swiotlb: coherent allocation failed, size=2097152
[20021.578048] CPU: 1 PID: 979 Comm: Compositor Not tainted 4.16.4-1-ARCH #1
[20021.578049] Hardware name: Gigabyte Technology Co., Ltd. P35-DS3L/P35-DS3L, BIOS F9 06/19/2009
[20021.578050] Call Trace:
[20021.578062] dump_stack+0x5c/0x85
[20021.578066] swiotlb_alloc_coherent+0x1be/0x1d0
[20021.578078] ttm_dma_pool_get_pages+0x215/0x5d0 [ttm]
[20021.578084] ttm_dma_populate+0x25b/0x350 [ttm]
[20021.578088] ttm_tt_bind+0x2c/0x60 [ttm]
[20021.578092] ttm_bo_handle_move_mem+0x577/0x5b0 [ttm]
[20021.578097] ttm_bo_validate+0x120/0x130 [ttm]
[20021.578119] ? drm_vma_offset_add+0x41/0x60 [drm]
[20021.578122] ? csum_partial_copy_generic+0x9c2/0x1a00
[20021.578126] ttm_bo_init_reserved+0x395/0x460 [ttm]
[20021.578130] ttm_bo_init+0x62/0xe0 [ttm]
[20021.578161] ? radeon_update_memory_usage.isra.0+0x50/0x50 [radeon]
[20021.578174] radeon_bo_create+0x180/0x240 [radeon]
[20021.578189] ? radeon_update_memory_usage.isra.0+0x50/0x50 [radeon]
[20021.578204] radeon_gem_object_create+0xa7/0x190 [radeon]
[20021.578220] ? radeon_gem_pwrite_ioctl+0x30/0x30 [radeon]
[20021.578235] radeon_gem_create_ioctl+0x66/0x100 [radeon]
[20021.578250] ? radeon_gem_pwrite_ioctl+0x30/0x30 [radeon]
[20021.578259] drm_ioctl_kernel+0x5b/0xb0 [drm]
[20021.578268] drm_ioctl+0x2d5/0x370 [drm]
[20021.578284] ? radeon_gem_pwrite_ioctl+0x30/0x30 [radeon]
[20021.578288] ? __handle_mm_fault+0xc88/0x14d0
[20021.578290] ? cap_task_prctl+0x310/0x310
[20021.578302] radeon_drm_ioctl+0x49/0x80 [radeon]
[20021.578305] do_vfs_ioctl+0xa4/0x630
[20021.578308] ? __do_page_fault+0x317/0x5a0
[20021.578310] SyS_ioctl+0x74/0x80
[20021.578313] do_syscall_64+0x74/0x190
[20021.578316] entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[20021.578318] RIP: 0033:0x7f314a1b44c7
[20021.578319] RSP: 002b:00007f311c5fd3c8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[20021.578321] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007f314a1b44c7
[20021.578322] RDX: 00007f311c5fd440 RSI: 00000000c020645d RDI: 0000000000000007
[20021.578323] RBP: 00007f311c5fd440 R08: 0000000000000002 R09: 0000000000000000
[20021.578324] R10: 0000000000000000 R11: 0000000000000246 R12: 00000000c020645d
[20021.578325] R13: 0000000000000007 R14: 00007f31343d0000 R15: 0000000000000002

The remaining issue should be fixed with https://patchwork.freedesktop.org/patch/219765/ . Anyway, the issues in 4.16 were swiotlb regressions, not directly related to the issue originally reported here. If there's ever an issue again with the fix above and Takashi-san's fix referenced in comment 40, please report it to the swiotlb developers instead of here.
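
For readers trying to picture the v4.14/v4.15 variant described earlier in the thread (a 2 MB allocation failing noisily while the 4 kB fallback succeeds), here is a toy userspace model in Python. It is only a sketch of the idea, not kernel code: `ToyPool`, `populate`, and the `nowarn` parameter are hypothetical names invented for illustration, with `nowarn` loosely standing in for the role `__GFP_NOWARN` plays in the fix mentioned above.

```python
from dataclasses import dataclass, field

HUGE = 2 * 1024 * 1024  # 2 MB request size seen in the logs (sz: 2097152)
PAGE = 4 * 1024         # 4 kB fallback size


@dataclass
class ToyPool:
    """Hypothetical stand-in for a bounce-buffer pool; not a kernel API."""

    largest_free_block: int
    log: list = field(default_factory=list)

    def alloc(self, size: int, nowarn: bool = False) -> bool:
        """Return True on success; record a warning on failure unless nowarn."""
        if size <= self.largest_free_block:
            return True
        if not nowarn:
            self.log.append(f"swiotlb buffer is full (sz: {size} bytes)")
        return False


def populate(pool: ToyPool, nowarn: bool) -> int:
    """Try one 2 MB chunk first, then fall back to a 4 kB page.

    Returns the chunk size that was actually used.
    """
    if pool.alloc(HUGE, nowarn=nowarn):
        return HUGE
    if pool.alloc(PAGE):
        return PAGE
    raise MemoryError("both the 2 MB and the 4 kB allocation failed")
```

In this model, calling `populate(pool, nowarn=False)` on a pool that can only satisfy 4 kB requests still succeeds but leaves a warning in `pool.log`, which mirrors the harmless-but-alarming messages reported here. With `nowarn=True` the same call succeeds silently. The later v4.16 regressions discussed in the thread are a different matter and are not modelled by this sketch.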