Summary: | AMD Xorg start fails with non-4K page sizes | ||
---|---|---|---|
Product: | DRI | Reporter: | Matt Corallo <freedesktop> |
Component: | DRM/AMDgpu | Assignee: | Default DRI bug account <dri-devel> |
Status: | RESOLVED FIXED | QA Contact: | |
Severity: | normal | ||
Priority: | medium | CC: | andrey.grodzovsky, bcrocker, dan, tpearson |
Version: | unspecified | ||
Hardware: | PowerPC | ||
OS: | All | ||
Attachments: |
Description
Matt Corallo
2018-04-26 21:12:41 UTC
This also affects the WX7100, same symptoms. Shows up as soon as an OpenGL application tries to use the accelerated graphics. Kernel 4.16.

Oops, forgot to specify this is on a WX4100.

(In reply to Timothy Pearson from comment #1)
> This also affects the WX7100, same symptoms. Shows up as soon as an OpenGL
> application tries to use the accelerated graphics. Kernel 4.16.

On our end Unreal Engine 4 would reliably trigger the problem. A number of other open source 3D applications worked without issues.

Oops, looks like we got our wires crossed on what has and hasn't been tested. It only *may* be PAGE_SIZE related, but it seems the relevant 3D stuff was never tested on PPC64LE with 4K pages, only 64K.

Just checked with today's linus/master (somewhere between 4.17-rc2 and 4.17-rc3) and the issue is still present.

This looks like something that should be filed at bugzilla.kernel.org against the amdgpu driver itself. It will get more visibility by the right people there.

(In reply to Timothy Pearson from comment #6)
> This looks like something that should be filed at bugzilla.kernel.org
> against the amdgpu driver itself. It will get more visibility by the right
> people there.

No, we prefer this bugzilla for tracking kernel driver issues as well.

I wonder if there is a way to reproduce this on an x86 platform; from looking in kflags, there are no CONFIG_x86_4/16/64K_PAGES flags.

Switched the machine over to a 4k-PAGE_SIZE kernel and it works fine, so that does appear to be the issue here.

(In reply to Andrey Grodzovsky from comment #8)
> I wonder if there is a way to reproduce this on an x86 platform; from looking
> in kflags, there are no CONFIG_x86_4/16/64K_PAGES flags.

AFAIK only the 4k page size is possible on x86 for regular (non-huge) pages.

Adding Ben to CC, he might have some ideas based on his work on the graphics drivers for ppc64le in RHEL.

We could get you direct access to test hardware if that would help. Basically, these systems are completely controllable via BMC, so we could give remote access to a Talos machine with a WX7100 installed for you guys to run tests on. Would that help?

(In reply to Timothy Pearson from comment #11)
> We could get you direct access to test hardware if that would help.
> Basically, these systems are completely controllable via BMC, so we could
> give remote access to a Talos machine with a WX7100 installed for you guys
> to run tests on. Would that help?

That's a good idea, since I don't think we have any non-x86 systems here to experiment with. I am currently handling another issue; once I am done, and assuming no one else has jumped in to fix it, I will use your system to debug.

Andrey

(In reply to Andrey Grodzovsky from comment #13)
> (In reply to Timothy Pearson from comment #11)
> > We could get you direct access to test hardware if that would help.
> > Basically, these systems are completely controllable via BMC, so we could
> > give remote access to a Talos machine with a WX7100 installed for you guys
> > to run tests on. Would that help?
>
> That's a good idea, since I don't think we have any non-x86 systems here to
> experiment with. I am currently handling another issue; once I am done, and
> assuming no one else has jumped in to fix it, I will use your system to debug.
>
> Andrey

Works for me. Just let me know when you're ready and we'll get you access.
Hello Matt, Timothy, Michel, Andrey, and Dan,

I have been testing some older cards (FirePro 2270, Embedded Radeon E6465) on a Power8 system with no problems of the type you're describing; I've run the Piglit test suite and the Khronos OpenGL Conformance Test Suite.

Direct access to a Talos machine could be beneficial, OR perhaps one of you (Timothy, Michel, Andrey) could send me a WX 7100 for testing; I've been meaning to ask for one of these anyway for a while now.

Just out of curiosity, what (other) differences in behavior do you see between PPC64LE systems configured for 64K pages and systems configured for 4K pages?

(In reply to Ben Crocker from comment #15)
> Hello Matt, Timothy, Michel, Andrey, and Dan,
>
> I have been testing some older cards (FirePro 2270, Embedded Radeon
> E6465) on a Power8 system with no problems of the type you're
> describing; I've run the Piglit test suite and the Khronos OpenGL
> Conformance Test Suite.
>
> Direct access to a Talos machine could be beneficial, OR perhaps one
> of you (Timothy, Michel, Andrey) could send me a WX 7100 for testing;
> I've been meaning to ask for one of these anyway for a while now.
>
> Just out of curiosity, what (other) differences in behavior do you see
> between PPC64LE systems configured for 64K pages and systems
> configured for 4K pages?

OK, I have a Talos II machine up with a WX4100 installed. Debian is loaded on it right now, and I can grant direct BMC access if anyone wants to take a crack at debug. Email me directly for login credentials.

Thanks!

We have been unable to replicate this with the latest AMD GPU firmware from https://people.freedesktop.org/~agd5f/radeon_ucode/polaris/ . Xorg starts fine with no errors.

System: Debian Testing
Kernel: 4.17.0-rc6

$ getconf PAGESIZE
65536

If you still experience the issues please give detailed instructions on how to reproduce your test environment, including OS installed and packages / configuration files in use.
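For anyone scripting this check rather than running `getconf PAGESIZE` by hand, the same value can be read from C with the POSIX `sysconf(_SC_PAGESIZE)` call. This is only a minimal sketch; the helper names `cpu_page_size` and `has_non_4k_pages` are made up for illustration and are not part of any tool mentioned in this report.

```c
#include <assert.h>
#include <unistd.h>

/* CPU page size in bytes; the same value `getconf PAGESIZE` prints. */
long cpu_page_size(void)
{
    long size = sysconf(_SC_PAGESIZE);

    assert(size > 0);   /* sysconf() returns -1 on error */
    return size;
}

/* Returns 1 when the running kernel uses a page size other than 4 KiB. */
int has_non_4k_pages(void)
{
    return cpu_page_size() != 4096;
}
```

On the 64K-page system described above, `cpu_page_size()` would be expected to return 65536, so `has_non_4k_pages()` would return 1.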
(In reply to Timothy Pearson from comment #17)
> We have been unable to replicate this with the latest AMD GPU firmware from
> https://people.freedesktop.org/~agd5f/radeon_ucode/polaris/ . Xorg starts
> fine with no errors.
>
> System: Debian Testing
> Kernel: 4.17.0-rc6
>
> $ getconf PAGESIZE
> 65536
>
> If you still experience the issues please give detailed instructions on how
> to reproduce your test environment, including OS installed and packages /
> configuration files in use.

To clarify, please install all of the files in that directory into your /lib/firmware/amdgpu/ directory (on Debian), then test to see if the issue persists. The WX7100 uses the polaris10* files, the WX4100 uses the polaris11* files.

(In reply to Timothy Pearson from comment #17)
> We have been unable to replicate this with the latest AMD GPU firmware from
> https://people.freedesktop.org/~agd5f/radeon_ucode/polaris/ . Xorg starts
> fine with no errors.

Those are old firmware files I posted for testing a while ago. Please test with the latest firmware from the linux-firmware tree:
https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git

Please note that the path to the firmware, /lib/firmware/amdgpu, is correct for Red Hat products (RHEL, Fedora, CentOS) as well.

I am still getting amdgpu crashes on POWER9 (ppc64le) using the latest AMD firmware and on kernel 4.17 (release). My card is a WX5100 - Polaris 10.

I can reliably reproduce this by having spectacle attempt to capture an area of the screen. However I also experience crashes randomly while moving a window.
Here is the dmesg output:

[ 75.701274] CPU: 37 PID: 3391 Comm: spectacle Not tainted 4.17.0-foxbat-64k #1
[ 75.701275] NIP: c00800001748cd08 LR: c008000017495944 CTR: 0000000000000000
[ 75.701278] REGS: c000001f8ea53230 TRAP: 0700 Not tainted (4.17.0-foxbat-64k)
[ 75.701278] MSR: 9000000000029033 <SF,HV,EE,ME,IR,DR,RI,LE> CR: 24828848 XER: 20040000
[ 75.701284] CFAR: c00800001748c724 SOFTE: 0
GPR00: c008000017495944 c000001f8ea534b0 c0080000176e6400 c000201cb2303418
GPR04: c000201ca9cfce60 0000000009fff138 0000000000000100 00000000000005a3
GPR08: 3817b594461fff5f 0000000000100000 0000000000000000 c008000017649dd0
GPR12: 0000000000008000 c000001ffffd5600 c000201cb2300000 c000001fe89d9d20
GPR16: 00000000027ffc4e c000201ca9cfce60 00000000fffe2000 ffffffffffffffef
GPR20: 0000000009fff138 0000000000000100 c000201cb23066a8 0000000000000000
GPR24: 00000000027ffc4e 00000000000e5fff c000201cb2303418 00000001fe000000
GPR28: c000201cb2300000 0000000000000000 c000201ca9cfce60 0000000009fff138
[ 75.701323] NIP [c00800001748cd08] amdgpu_sa_bo_new+0x640/0x6c0 [amdgpu]
[ 75.701347] LR [c008000017495944] amdgpu_ib_get+0x8c/0x120 [amdgpu]
[ 75.701348] Call Trace:
[ 75.701370] [c000001f8ea534b0] [c00800001748c7f8] amdgpu_sa_bo_new+0x130/0x6c0 [amdgpu] (unreliable)
[ 75.701395] [c000001f8ea53710] [c008000017495944] amdgpu_ib_get+0x8c/0x120 [amdgpu]
[ 75.701421] [c000001f8ea53790] [c008000017572558] amdgpu_job_alloc_with_ib+0x90/0x110 [amdgpu]
[ 75.701442] [c000001f8ea537d0] [c008000017492124] amdgpu_vm_bo_update_mapping+0x35c/0x480 [amdgpu]
[ 75.701469] [c000001f8ea538c0] [c0080000174925c8] amdgpu_vm_bo_update+0x380/0x740 [amdgpu]
[ 75.701489] [c000001f8ea539d0] [c008000017476d10] amdgpu_gem_va_ioctl+0x5f8/0x620 [amdgpu]
[ 75.701496] [c000001f8ea53b20] [c008000015348ce8] drm_ioctl_kernel+0xa0/0x140 [drm]
[ 75.701502] [c000001f8ea53b70] [c0080000153491b4] drm_ioctl+0x1ac/0x4d0 [drm]
[ 75.701518] [c000001f8ea53cb0] [c008000017450078] amdgpu_drm_ioctl+0x70/0xd0 [amdgpu]
[ 75.701521] [c000001f8ea53d00] [c0000000003e216c] do_vfs_ioctl+0xdc/0x8a0
[ 75.701523] [c000001f8ea53da0] [c0000000003e2a34] ksys_ioctl+0x104/0x120
[ 75.701525] [c000001f8ea53df0] [c0000000003e2a90] sys_ioctl+0x40/0xa0
[ 75.701528] [c000001f8ea53e30] [c00000000000b9e0] system_call+0x58/0x6c
[ 75.701529] Instruction dump:
[ 75.701531] 7fc3f378 38210260 e8010010 81810008 ea21ff88 ea81ffa0 eaa1ffa8 eb41ffd0
[ 75.701535] ebc1fff0 7c0803a6 7d908120 4e800020 <0fe00000> 3bc0ffea 7fc3f378 38210260
[ 75.701540] ---[ end trace 838930e3a806a76d ]---
[ 75.701557] amdgpu 0000:01:00.0: failed to get a new IB (-22)
[ 75.701637] [drm:amdgpu_gem_va_ioctl [amdgpu]] *ERROR* Couldn't update BO_VA (-22)
[ 75.702847] amdgpu 0000:01:00.0: failed to get a new IB (-22)
[ 75.736208] amdgpu 0000:01:00.0: failed to get a new IB (-22)
[ 75.741527] amdgpu 0000:01:00.0: failed to get a new IB (-22)
[ 75.741602] [drm:amdgpu_gem_va_ioctl [amdgpu]] *ERROR* Couldn't update BO_VA (-22)
[ 75.743120] amdgpu 0000:01:00.0: failed to get a new IB (-22)
[ 75.743182] [drm:amdgpu_gem_va_ioctl [amdgpu]] *ERROR* Couldn't update BO_VA (-22)

(In reply to foxbat from comment #21)
> [ 75.701323] NIP [c00800001748cd08] amdgpu_sa_bo_new+0x640/0x6c0 [amdgpu]

What does

scripts/faddr2line drivers/gpu/drm/amd/amdgpu/amdgpu.ko amdgpu_sa_bo_new+0x640/0x6c0

say in the kernel build tree?

(In reply to Michel Dänzer from comment #22)
> (In reply to foxbat from comment #21)
> > [ 75.701323] NIP [c00800001748cd08] amdgpu_sa_bo_new+0x640/0x6c0 [amdgpu]
>
> What does
>
> scripts/faddr2line drivers/gpu/drm/amd/amdgpu/amdgpu.ko
> amdgpu_sa_bo_new+0x640/0x6c0
>
> say in the kernel build tree?
foxbat@colossus:~/build/linux-4.17$ scripts/faddr2line drivers/gpu/drm/amd/amdgpu/amdgpu.ko amdgpu_sa_bo_new+0x640/0x6c0
skipping amdgpu_sa_bo_new address at 0x3cd08 due to size mismatch (0x6c0 != 0x18e)
no match for amdgpu_sa_bo_new+0x640/0x6c0

Created attachment 140046 [details] [review]
Add some debugging output to amdgpu_sa_bo_new

This patch should tell us which of the WARN_ON_ONCE in amdgpu_sa_bo_new is hit, and what the values are.

(In reply to Michel Dänzer from comment #24)
> Created attachment 140046 [details] [review]
> Add some debugging output to amdgpu_sa_bo_new
>
> This patch should tell us which of the WARN_ON_ONCE in amdgpu_sa_bo_new is
> hit, and what the values are.

Hello, I have reproduced the error with the provided patch. Here is the output:

[ 175.501689] WARNING: CPU: 24 PID: 3212 at drivers/gpu/drm/amd/amdgpu/amdgpu_sa.c:288 amdgpu_sa_bo_new+0x648/0x6d0 [amdgpu]
[ 175.501692] Modules linked in: binfmt_misc snd_usb_audio snd_usbmidi_lib snd_rawmidi snd_seq_device amdgpu evdev snd_hda_codec_hdmi chash ast gpu_sched snd_hda_intel ttm snd_hda_codec ghash_generic drm_kms_helper gf128mul ecb snd_hda_core drm xts snd_hwdep snd_pcm drm_panel_orientation_quirks syscopyarea ctr sysfillrect sysimgblt snd_timer fb_sys_fops cbc vmx_crypto snd i2c_algo_bit sg soundcore ofpart ipmi_powernv powernv_flash ipmi_devintf opal_prd mtd ipmi_msghandler at24 ip_tables x_tables autofs4 ext4 crc16 mbcache jbd2 fscrypto raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx hid_generic usbhid hid xor raid6_pq libcrc32c crc32c_generic raid1 raid0 multipath linear sd_mod md_mod ses enclosure xhci_pci xhci_hcd mpt3sas raid_class usbcore scsi_transport_sas nvme tg3 nvme_core
[ 175.501731] aacraid libphy
[ 175.501735] CPU: 24 PID: 3212 Comm: spectacle Not tainted 4.17.0-foxbat-64k #2
[ 175.501736] NIP: c00800001885cd10 LR: c008000018865944 CTR: 0000000000000000
[ 175.501738] REGS: c000001f618b71f0 TRAP: 0700 Not tainted (4.17.0-foxbat-64k)
[ 175.501738] MSR: 9000000000029033 <SF,HV,EE,ME,IR,DR,RI,LE> CR: 24828848 XER: 20040000
[ 175.501744] CFAR: c00800001885c724 SOFTE: 0
GPR00: c008000018865944 c000001f618b7470 c008000018ab6400 c000001fdbea3418
GPR04: c000201ca9246260 0000000000100000 0000000000000100 00000000000017c1
GPR08: 3817b594461fff5f c000201ca9246000 0000000000000000 c008000018a19dd0
GPR12: 0000000000008000 c000001ffffe4000 c000001fdbea0000 c000001feab786a0
GPR16: 00000000027ffc4e c000201ca9246260 00000000fffe2000 ffffffffffffffef
GPR20: 0000000009fff138 0000000000000100 c000001fdbea6980 0000000000000000
GPR24: 00000000027ffc4e 00000000000e5fff c000001fdbea3418 00000001fe000000
GPR28: c000001fdbea0000 0000000000000000 c000201ca9246260 0000000009fff138
[ 175.501780] NIP [c00800001885cd10] amdgpu_sa_bo_new+0x648/0x6d0 [amdgpu]
[ 175.501798] LR [c008000018865944] amdgpu_ib_get+0x8c/0x120 [amdgpu]
[ 175.501799] Call Trace:
[ 175.501820] [c000001f618b7470] [c00800001885c7f8] amdgpu_sa_bo_new+0x130/0x6d0 [amdgpu] (unreliable)
[ 175.501840] [c000001f618b7710] [c008000018865944] amdgpu_ib_get+0x8c/0x120 [amdgpu]
[ 175.501861] [c000001f618b7790] [c008000018942558] amdgpu_job_alloc_with_ib+0x90/0x110 [amdgpu]
[ 175.501880] [c000001f618b77d0] [c008000018862124] amdgpu_vm_bo_update_mapping+0x35c/0x480 [amdgpu]
[ 175.501899] [c000001f618b78c0] [c0080000188625c8] amdgpu_vm_bo_update+0x380/0x740 [amdgpu]
[ 175.501916] [c000001f618b79d0] [c008000018846d10] amdgpu_gem_va_ioctl+0x5f8/0x620 [amdgpu]
[ 175.501923] [c000001f618b7b20] [c008000017748ce8] drm_ioctl_kernel+0xa0/0x140 [drm]
[ 175.501928] [c000001f618b7b70] [c0080000177491b4] drm_ioctl+0x1ac/0x4d0 [drm]
[ 175.501943] [c000001f618b7cb0] [c008000018820078] amdgpu_drm_ioctl+0x70/0xd0 [amdgpu]
[ 175.501946] [c000001f618b7d00] [c0000000003e216c] do_vfs_ioctl+0xdc/0x8a0
[ 175.501948] [c000001f618b7da0] [c0000000003e2a34] ksys_ioctl+0x104/0x120
[ 175.501950] [c000001f618b7df0] [c0000000003e2a90] sys_ioctl+0x40/0xa0
[ 175.501953] [c000001f618b7e30] [c00000000000b9e0] system_call+0x58/0x6c
[ 175.501954] Instruction dump:
[ 175.501956] 892a0000 2f890000 409efd4c 3c620000 e8639aa0 39200001 7ea4ab78 992a0000
[ 175.501959] 481bee31 e8410018 4bfffd2c 60000000 <0fe00000> 3d420000 e94a9a98 3bc0ffea
[ 175.501963] ---[ end trace b955b8bff21188f9 ]---
[ 175.501965] [drm] size=167768376 > sa_manager->size=1048576
[ 175.501968] amdgpu 0000:01:00.0: failed to get a new IB (-22)
[ 175.502036] [drm:amdgpu_gem_va_ioctl [amdgpu]] *ERROR* Couldn't update BO_VA (-22)
[ 175.502153] amdgpu 0000:01:00.0: failed to get a new IB (-22)
[ 175.535050] amdgpu 0000:01:00.0: failed to get a new IB (-22)
[ 175.539783] amdgpu 0000:01:00.0: failed to get a new IB (-22)
[ 175.539870] [drm:amdgpu_gem_va_ioctl [amdgpu]] *ERROR* Couldn't update BO_VA (-22)
[ 175.540410] amdgpu 0000:01:00.0: failed to get a new IB (-22)
[ 175.540489] [drm:amdgpu_gem_va_ioctl [amdgpu]] *ERROR* Couldn't update BO_VA (-22)

foxbat, thanks for posting this!
> [ 175.501965] [drm] size=167768376 > sa_manager->size=1048576
Well, isn't THAT interesting!
Requested size is ALMOST, but not quite, 160 MB, while what
the sub-allocation manager has left is 1 MB.
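The arithmetic behind that observation can be checked with a few lines of C. This is only a sketch: the helper name `shortfall_to_multiple` is made up for illustration, and the two sizes are the ones from the log line quoted above.

```c
#include <assert.h>
#include <stdint.h>

#define MIB (1024ULL * 1024ULL)

/* Bytes by which `size` falls short of the next multiple of `unit`. */
uint64_t shortfall_to_multiple(uint64_t size, uint64_t unit)
{
    uint64_t rem;

    assert(unit != 0);
    rem = size % unit;
    return rem ? unit - rem : 0;
}
```

With `size = 167768376` and `unit = MIB`, the request is 3784 bytes short of 160 MiB (167772160 bytes), which is roughly 160 times the 1 MiB (1048576 bytes) that the sub-allocation manager reports.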
(In reply to Ben Crocker from comment #26)
> foxbat, thanks for posting this!
>
> > [ 175.501965] [drm] size=167768376 > sa_manager->size=1048576
>
> Well, isn't THAT interesting!
> Requested size is ALMOST, but not quite, 160 MB, while what
> the sub-allocation manager has left is 1 MB.

Sounds like we have an overrun in amdgpu_vm_bo_update_mapping() while calculating how many bytes we need to allocate from the sa_manager.

Created attachment 140208 [details] [review]
Patches for additional debug info

foxbat, could you please apply the three attached patches and provide dmesg output again? Thanks!

(In reply to Ben Crocker from comment #28)

Hi Ben,

Please see the dmesg output at https://pastebin.com/i7sZhUKr

I reproduced the graphics "crash" at around 950. However, one thing I did find out is I am able to get video back by simply pressing a key on my keyboard. After doing this video appears to act normally.

I suspect there's an issue in amdgpu_vm_bo_split_mapping, e.g. this code looks suspicious:

    if (pages_addr) {
        uint64_t count;

        max_entries = min(max_entries, 16ull * 1024ull);
        for (count = 1; count < max_entries; ++count) {
            uint64_t idx = pfn + count;

            if (pages_addr[idx] !=
                (pages_addr[idx - 1] + PAGE_SIZE))
                break;
        }

count is compared to max_entries, which is a number of GPU pages, but is also added to pfn, which is a number of CPU pages.

Created attachment 140239 [details] [review]
More patches (2) for additional debug info

Hi foxbat,

Could you please apply these two patches on top of the patches I supplied yesterday, and post the output? (The patch to amdgpu_job.c is a corrected version of yesterday's patch.)

Thanks,
Ben

Created attachment 140252 [details]
amdgpu dmesg output
Crash is at 246.306571
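The unit mismatch suspected earlier in amdgpu_vm_bo_split_mapping can be illustrated outside the kernel. The following is a user-space sketch, not the driver code: it assumes 64K CPU pages and 4K GPU pages (a ratio of 16), the function names are hypothetical, and the extra array-length parameter `n` exists only so the sketch never reads out of bounds.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CPU_PAGE_SIZE 65536ULL  /* assumption: 64K-page kernel, as in this report */
#define GPU_PAGE_SIZE 4096ULL   /* assumption: 4 KiB GPU pages */

/*
 * Count how many consecutive entries of pages_addr[] (one address per CPU
 * page), starting at index pfn, are contiguous. The scan stops after
 * max_cpu_pages entries or at the end of the n-entry array.
 */
uint64_t contiguous_cpu_pages(const uint64_t *pages_addr, size_t n,
                              uint64_t pfn, uint64_t max_cpu_pages)
{
    uint64_t count;

    assert(pfn < n);
    for (count = 1; count < max_cpu_pages && pfn + count < n; ++count) {
        uint64_t idx = pfn + count;

        if (pages_addr[idx] != pages_addr[idx - 1] + CPU_PAGE_SIZE)
            break;
    }
    return count;
}

/* Convert a limit expressed in GPU pages into a limit in CPU pages. */
uint64_t gpu_to_cpu_pages(uint64_t gpu_pages)
{
    return gpu_pages / (CPU_PAGE_SIZE / GPU_PAGE_SIZE);
}
```

Passing a GPU-page limit such as 16 * 1024 straight in as `max_cpu_pages` lets the scan cover 16 times more memory than the limit intends; converting it first with `gpu_to_cpu_pages()` caps the scan at 1024 CPU pages.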
Created attachment 140253 [details] [review]
Refining the amdgpu_vm.c:amdgpu_vm_bo_split_mapping further

Print max_entries in both decimal and hex.

Things start to go haywire at 246.305790, which appears to be the first time the do-loop in amdgpu_vm_bo_update_mapping executes more than once. The value of max_entries is obviously absurd; a little later, the value of start (0x104000) is consistent with the first trip through the loop, but the value of last (0xe5fff) looks wrong, starting with the fact that it is LESS than start.

[ 246.305139] [drm] amdgpu_vm_bo_split_mapping nodes->size=512 pfn=0 max_entries=8192
[ 246.305230] [drm] amdgpu_vm_bo_split_mapping: addr=0x1e2000000 vram_base_offset=0x0
[ 246.305322] [drm] amdgpu_vm_bo_split_mapping: start=0x102000 last=0x103fff
[ 246.305400] [drm] amdgpu_vm_bo_update_mapping l.1304: ndw=64 ncmds=9 fragment_size=9
[ 246.305474] [drm] amdgpu_vm_bo_update_mapping l.1310: resulting ndw=334
[ 246.305533] [drm] amdgpu_vm_bo_update_mapping calls amdgpu_job_alloc_with_ib(..., ndw*4 = 1336 (00000538))
[ 246.305630] [drm] amdgpu_job_alloc_with_ib calls amdgpu_ib_get(..., size=1336 (00000538))
[ 246.305704] [drm] amdgpu_ib_get calls amdgpu_sa_bo_new(..., size=1336 (00000538), align=256

Things go haywire:

[ 246.305790] [drm] amdgpu_vm_bo_split_mapping nodes->size=512 pfn=8192 max_entries=18446744073709428736
[ 246.305878] [drm] amdgpu_vm_bo_split_mapping: addr=0x1e2000000 vram_base_offset=0x0
[ 246.305970] [drm] amdgpu_vm_bo_split_mapping: start=0x104000 last=0xe5fff
[ 246.306029] [drm] amdgpu_vm_bo_update_mapping l.1304: ndw=64 ncmds=4194185 fragment_size=9
[ 246.306111] [drm] amdgpu_vm_bo_update_mapping l.1310: resulting ndw=41942094
[ 246.306197] [drm] amdgpu_vm_bo_update_mapping calls amdgpu_job_alloc_with_ib(..., ndw*4 = 167768376 (09FFF138))
[ 246.306322] [drm] amdgpu_job_alloc_with_ib calls amdgpu_ib_get(..., size=167768376 (09fff138))
[ 246.306429] [drm] amdgpu_ib_get calls amdgpu_sa_bo_new(..., size=167768376 (09fff138), align=256
[ 246.306571] WARNING: CPU: 67 PID: 21839 at amdgpu_sa_bo_new+0x628/0x6b0 [amdgpu]
[ 246.306645] Modules linked in: binfmt_misc ext4 crc16 mbcache jbd2 fscrypto evdev snd_usb_audio amdgpu snd_usbmidi_lib snd_rawmidi snd_seq_device ghash_generic gf128mul ecb xts ctr snd_hda_codec_hdmi chash ast gpu_sched cbc ttm snd_hda_intel vmx_crypto snd_hda_codec drm_kms_helper snd_hda_core drm snd_hwdep snd_pcm snd_timer ofpart drm_panel_orientation_quirks syscopyarea snd sysfillrect ipmi_powernv sysimgblt fb_sys_fops ipmi_devintf i2c_algo_bit sg powernv_flash soundcore mtd ipmi_msghandler opal_prd at24 sunrpc ecryptfs ip_tables x_tables autofs4 zfs(PO) zunicode(PO) zavl(PO) icp(PO) zcommon(PO) znvpair(PO) spl(O) raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx hid_generic usbhid hid xor raid6_pq libcrc32c crc32c_generic raid1 raid0 multipath linear sd_mod md_mod ses

Created attachment 140257 [details] [review]
drm/amdgpu: GPU vs CPU page size fixes in amdgpu_vm_bo_split_mapping

Does this patch help?

(In reply to Michel Dänzer from comment #35)
> Created attachment 140257 [details] [review]
> drm/amdgpu: GPU vs CPU page size fixes in amdgpu_vm_bo_split_mapping
>
> Does this patch help?

It looks like the patch isn't fully applying (using tree 4.17):

patching file drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
Hunk #1 FAILED at 1577.
Hunk #2 succeeded at 1463 with fuzz 2 (offset -127 lines).
Hunk #3 succeeded at 1480 (offset -125 lines).
1 out of 3 hunks FAILED -- saving rejects to file drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c.rej

rejects:

--- drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
+++ drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
@@ -1577,7 +1577,9 @@ static int amdgpu_vm_bo_split_mapping(struct amdgpu_device *adev,
 uint64_t count;

 max_entries = min(max_entries, 16ull * 1024ull);
- for (count = 1; count < max_entries; ++count) {
+ for (count = 1;
+ count < max_entries / (PAGE_SIZE / AMDGPU_GPU_PAGE_SIZE);
+ ++count) {
 uint64_t idx = pfn + count;

 if (pages_addr[idx] !=

(In reply to foxbat from comment #36)
> It looks like the patch isn't fully applying (using tree 4.17):

It should apply to 4.17. Did you revert Ben's debugging patches before applying it?

(In reply to Michel Dänzer from comment #37)
> (In reply to foxbat from comment #36)
> > It looks like the patch isn't fully applying (using tree 4.17):
>
> It should apply to 4.17. Did you revert Ben's debugging patches before
> applying it?

Sorry, I did not. The patch does apply after reverting the debug patches. I'll be able to test this afternoon (US eastern time).

(In reply to Michel Dänzer from comment #35)
> Created attachment 140257 [details] [review]
> drm/amdgpu: GPU vs CPU page size fixes in amdgpu_vm_bo_split_mapping
>
> Does this patch help?

Hi Michel,

I have applied the patch and so far it appears to help a lot. I have not been able to reproduce any amdgpu crashes/errors so far. I will continue running with the patch and let you know if I have any issues.

(In reply to foxbat from comment #39)
> (In reply to Michel Dänzer from comment #35)
> > Created attachment 140257 [details] [review]
> > drm/amdgpu: GPU vs CPU page size fixes in amdgpu_vm_bo_split_mapping
> >
> > Does this patch help?
>
> Hi Michel,
>
> I have applied the patch and so far it appears to help a lot. I have not
> been able to reproduce any amdgpu crashes/errors so far. I will continue
> running with the patch and let you know if I have any issues.
I can confirm that Michel's patch, along with the following patches, also enables full 3D acceleration via 64-bit DMA across a wide range of tested applications:

https://lists.ozlabs.org/pipermail/linuxppc-dev/2018-June/174985.html
https://scm.raptorengineering.com/scm/#changesetPanel;6IQvJHpVSM;578e2d761554130f7c6abfe821c2509912a00ac6;commit
https://scm.raptorengineering.com/scm/#changesetPanel;6IQvJHpVSM;461333cedd470140eb1e1667608da2a6d65a58e5;commit
https://bugs.freedesktop.org/show_bug.cgi?id=107012

All of these patches have been submitted upstream to the respective maintainers for review. When Michel's patch is submitted and merged, we should try to get some of this downstream into the various distros as well.

Thanks for the report and help tracking down the problem. The fix is queued for 4.18, and will be backported to stable branches:

https://cgit.freedesktop.org/~agd5f/linux/commit/?h=drm-fixes-4.18&id=38e624a18f9a05b8c894409be6b14709a7206c7c

For the record, the output of "git describe 38e624a18f9a05b" is:

v4.18-rc1-36-g38e624a18f9a |
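The absurd max_entries value in the debug output earlier in this thread is consistent with an unsigned wrap-around. The sketch below assumes max_entries is derived as (nodes->size - pfn) * (PAGE_SIZE / AMDGPU_GPU_PAGE_SIZE) in 64-bit unsigned arithmetic; the thread does not quote that line of the driver, so the formula is an assumption, and `remaining_gpu_pages` is a hypothetical helper. With a ratio of 16 (64K CPU pages, 4K GPU pages), the logged inputs reproduce both logged results.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: entries left in a node, expressed in GPU pages.
 * node_size and pfn are meant to be CPU-page counts; ratio is the number
 * of GPU pages per CPU page. Unsigned arithmetic wraps if pfn > node_size.
 */
uint64_t remaining_gpu_pages(uint64_t node_size, uint64_t pfn, uint64_t ratio)
{
    assert(ratio != 0);
    return (node_size - pfn) * ratio;
}
```

`remaining_gpu_pages(512, 0, 16)` gives 8192, matching the first logged pass, and `remaining_gpu_pages(512, 8192, 16)` wraps to 18446744073709428736, matching the second. The second pfn (8192) equals the first max_entries, which suggests pfn had been advanced by a GPU-page count even though nodes->size is counted in CPU pages.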