Bug 106258

Summary: AMD Xorg start fails with non-4K page sizes
Product: DRI
Component: DRM/AMDgpu
Status: RESOLVED FIXED
Severity: normal
Priority: medium
Version: unspecified
Hardware: PowerPC
OS: All
Reporter: Matt Corallo <freedesktop>
Assignee: Default DRI bug account <dri-devel>
CC: andrey.grodzovsky, bcrocker, dan, tpearson
Attachments (description; flags):
Add some debugging output to amdgpu_sa_bo_new; none
Patches for additional debug info; none
More patches (2) for additional debug info; none
amdgpu dmesg output; none
Refining the amdgpu_vm.c:amdgpu_vm_bo_split_mapping further; none
drm/amdgpu: GPU vs CPU page size fixes in amdgpu_vm_bo_split_mapping; none

Description Matt Corallo 2018-04-26 21:12:41 UTC
I have two nearly identical boxes, both running Debian testing with a 4.16 kernel. The only major difference is that one is configured with a 4K page size and the other with Debian's default 64K page size (on PPC64LE). The 64K configuration produces corrupted output when running X (with a non-corrupt mouse cursor overlaid on top) and the following WARN in dmesg:

[   33.990146] WARNING: CPU: 8 PID: 1401 at /build/linux-z743uR/linux-4.16/drivers/gpu/drm/amd/amdgpu/amdgpu_sa.c:326 amdgpu_sa_bo_new+0x630/0x6b0 [amdgpu]
[   33.990148] Modules linked in: ext4 crc16 mbcache jbd2 fscrypto amdgpu chash evdev ast gpu_sched ttm snd_hda_codec_hdmi drm_kms_helper snd_hda_intel ghash_generic snd_hda_codec gf128mul drm ctr snd_hda_core snd_hwdep cbc sg snd_pcm vmx_crypto drm_panel_orientation_quirks syscopyarea ofpart snd_timer sysfillrect ipmi_powernv sysimgblt snd powernv_flash ipmi_devintf fb_sys_fops mtd ipmi_msghandler i2c_algo_bit opal_prd soundcore at24 ip_tables x_tables autofs4 btrfs crc32c_generic xor zstd_decompress zstd_compress xxhash raid6_pq ecb xts algif_skcipher af_alg hid_generic usbhid hid sd_mod dm_crypt dm_mod xhci_pci xhci_hcd mpt3sas usbcore tg3 raid_class scsi_transport_sas libphy usb_common
[   33.990187] CPU: 8 PID: 1401 Comm: gnome-shell Not tainted 4.16.0-trunk-powerpc64le #1 Debian 4.16-1~exp1
[   33.990189] NIP:  c00800000e08bfe8 LR: c00800000e0940d4 CTR: 0000000000000000
[   33.990190] REGS: c000000fac573260 TRAP: 0700   Not tainted  (4.16.0-trunk-powerpc64le Debian 4.16-1~exp1)
[   33.990191] MSR:  9000000000029033 <SF,HV,EE,ME,IR,DR,RI,LE>  CR: 24828848  XER: 20040000
[   33.990196] CFAR: c00800000e08ba14 SOFTE: 0 
               GPR00: c00800000e0940d4 c000000fac5734e0 c00800000e2e1c00 c000000ff4673318 
               GPR04: c000000feedc5a58 0000000009fff138 0000000000000100 0000000ffacf0000 
               GPR08: 0000000000000010 0000000000100000 0000000000000000 c00800000e246b98 
               GPR12: 0000000000008000 c00000000fa85800 c000000ff4670000 c000000ff45067e0 
               GPR16: 00000000000e69ff c000000feedc5a58 00000000fffe2000 ffffffffffffffef 
               GPR20: 0000000009fff138 c000000ff4670000 0000000000000100 0000000000000000 
               GPR24: 00000000000e69ff 0000000000104a00 c000000ff4673318 0000000116000000 
               GPR28: c000000ff4670000 0000000000000000 c000000feedc5a58 0000000009fff138 
[   33.990228] NIP [c00800000e08bfe8] amdgpu_sa_bo_new+0x630/0x6b0 [amdgpu]
[   33.990244] LR [c00800000e0940d4] amdgpu_ib_get+0x8c/0x120 [amdgpu]
[   33.990245] Call Trace:
[   33.990261] [c000000fac5734e0] [c00800000e08bae4] amdgpu_sa_bo_new+0x12c/0x6b0 [amdgpu] (unreliable)
[   33.990278] [c000000fac573740] [c00800000e0940d4] amdgpu_ib_get+0x8c/0x120 [amdgpu]
[   33.990298] [c000000fac5737c0] [c00800000e176db8] amdgpu_job_alloc_with_ib+0x90/0x110 [amdgpu]
[   33.990316] [c000000fac573800] [c00800000e090be8] amdgpu_vm_bo_update_mapping+0x360/0x4b0 [amdgpu]
[   33.990332] [c000000fac5738f0] [c00800000e091084] amdgpu_vm_bo_update+0x34c/0x710 [amdgpu]
[   33.990350] [c000000fac573a00] [c00800000e0768b0] amdgpu_gem_va_ioctl+0x5f8/0x620 [amdgpu]
[   33.990356] [c000000fac573b50] [c00800000cd88f48] drm_ioctl_kernel+0xa0/0x140 [drm]
[   33.990361] [c000000fac573ba0] [c00800000cd89424] drm_ioctl+0x1bc/0x4f0 [drm]
[   33.990376] [c000000fac573cf0] [c00800000e050078] amdgpu_drm_ioctl+0x70/0xd0 [amdgpu]
[   33.990379] [c000000fac573d40] [c0000000003dd8dc] do_vfs_ioctl+0xdc/0x8a0
[   33.990381] [c000000fac573de0] [c0000000003de164] SyS_ioctl+0xc4/0x130
[   33.990383] [c000000fac573e30] [c00000000000b8e0] system_call+0x58/0x6c
[   33.990384] Instruction dump:
[   33.990386] 7fc3f378 38210260 e8010010 81810008 ea21ff88 ea81ffa0 eac1ffb0 eb41ffd0 
[   33.990389] ebc1fff0 7c0803a6 7d908120 4e800020 <0fe00000> 3bc0ffea 7fc3f378 38210260 
[   33.990393] ---[ end trace 59adbd4db83fa3b4 ]---
[   33.990397] amdgpu 0000:01:00.0: failed to get a new IB (-22)
[   33.990501] amdgpu 0000:01:00.0: failed to get a new IB (-22)
[   34.007673] amdgpu 0000:01:00.0: failed to get a new IB (-22)
Comment 1 Timothy Pearson 2018-04-26 21:14:59 UTC
This also affects the WX7100, same symptoms.  Shows up as soon as an OpenGL application tries to use the accelerated graphics.  Kernel 4.16.
Comment 2 Matt Corallo 2018-04-26 21:17:25 UTC
Oops, forgot to specify this is on a WX4100.
Comment 3 Timothy Pearson 2018-04-26 21:19:05 UTC
(In reply to Timothy Pearson from comment #1)
> This also affects the WX7100, same symptoms.  Shows up as soon as an OpenGL
> application tries to use the accelerated graphics.  Kernel 4.16.

On our end, Unreal Engine 4 would reliably trigger the problem.  A number of other open source 3D applications worked without issues.
Comment 4 Matt Corallo 2018-04-26 21:22:16 UTC
Oops, looks like we got our wires crossed on what has and hasn't been tested. It only *may* be PAGE_SIZE related, but it seems the relevant 3D stuff was never tested on PPC64LE with 4K pages, only with 64K.
Comment 5 Matt Corallo 2018-04-27 20:25:18 UTC
Just checked with today's linus/master (somewhere between 4.17-rc2 and 4.17-rc3) and the issue is still present.
Comment 6 Timothy Pearson 2018-05-01 10:08:46 UTC
This looks like something that should be filed at bugzilla.kernel.org against the amdgpu driver itself.  It will get more visibility by the right people there.
Comment 7 Michel Dänzer 2018-05-01 13:02:42 UTC
(In reply to Timothy Pearson from comment #6)
> This looks like something that should be filed at bugzilla.kernel.org
> against the amdgpu driver itself.  It will get more visibility by the right
> people there.

No, we prefer this bugzilla for tracking kernel driver issues as well.
Comment 8 Andrey Grodzovsky 2018-05-01 13:06:23 UTC
I wonder whether there is a way to reproduce this on an x86 platform; looking at the kernel config flags, there are no CONFIG_x86_4/16/64K_PAGES flags.
Comment 9 Matt Corallo 2018-05-01 16:05:50 UTC
Switched the machine over to a 4k-PAGE_SIZE kernel and it works fine, so that does appear to be the issue here.
Comment 10 Dan Horák 2018-05-02 09:22:20 UTC
(In reply to Andrey Grodzovsky from comment #8)
> I Wonder is there a way to reproduce this on x86 platform, from looking in
> kflags there is no  CONFIG_x86_4/16/64K_PAGES flags.

AFAIK only a 4K page size is possible on x86 for regular (non-huge) pages.

Adding Ben to CC, he might have some ideas based on his work on the graphic drivers for ppc64le in RHEL.
Comment 11 Timothy Pearson 2018-05-02 19:54:41 UTC
We could get you direct access to test hardware if that would help.  Basically, these systems are completely controllable via BMC, so we could give remote access to a Talos machine with a WX7100 installed for you guys to run tests on.  Would that help?
Comment 12 Andrey Grodzovsky 2018-05-02 20:38:27 UTC
(In reply to Timothy Pearson from comment #11)
> We could get you direct access to test hardware if that would help. 
> Basically, these systems are completely controllable via BMC, so we could
> give remote access to a Talos machine with a WX7100 installed for you guys
> to run tests on.  Would that help?

That's a good idea, since I don't think we have any non-x86 systems here to experiment with. I am currently handling another issue; once I am done, and assuming no one else has jumped in to fix it, I will use your system to debug.

Andrey
Comment 13 Andrey Grodzovsky 2018-05-02 20:38:42 UTC
(In reply to Timothy Pearson from comment #11)
> We could get you direct access to test hardware if that would help. 
> Basically, these systems are completely controllable via BMC, so we could
> give remote access to a Talos machine with a WX7100 installed for you guys
> to run tests on.  Would that help?

That's a good idea, since I don't think we have any non-x86 systems here to experiment with. I am currently handling another issue; once I am done, and assuming no one else has jumped in to fix it, I will use your system to debug.

Andrey
Comment 14 Timothy Pearson 2018-05-02 22:12:52 UTC
(In reply to Andrey Grodzovsky from comment #13)
> (In reply to Timothy Pearson from comment #11)
> > We could get you direct access to test hardware if that would help. 
> > Basically, these systems are completely controllable via BMC, so we could
> > give remote access to a Talos machine with a WX7100 installed for you guys
> > to run tests on.  Would that help?
> 
> That a good idea since I don't think we have any non x86 systems here t
> experiment with. I am currently handling another issue, once I am done and
> assuming no one else jumped in to fix it I will use your system to debug.
> 
> Andrey

Works for me.  Just let me know when you're ready and we'll get you access.
Comment 15 Ben Crocker 2018-05-06 03:09:16 UTC
Hello Matt, Timothy, Michel, Andrey, and Dan,

I have been testing some older cards (FirePro 2270, Embedded Radeon
E6465) on a Power8 system with no problems of the type you're
describing; I've run the Piglit test suite and the Khronos OpenGL
Conformance Test Suite.

Direct access to a Talos machine could be beneficial, OR perhaps one
of you (Timothy, Michel, Andrey) could send me a WX 7100 for testing;
I've been meaning to ask for one of these anyway for a while now.

Just out of curiosity, what (other) differences in behavior do you see
between PPC64LE systems configured for 64K pages and systems
configured for 4K pages?
Comment 16 Timothy Pearson 2018-05-15 11:54:37 UTC
(In reply to Ben Crocker from comment #15)
> Hello Matt, Timothy, Michel, Andrey, and Dan,
> 
> I have been testing some older cards (FirePro 2270, Embedded Radeon
> E6465) on a Power8 system with no problems of the type you're
> describing; I've run the Piglit test suite and the Khronos OpenGL
> Conformance Test Suite.
> 
> Direct access to a Talos machine could be beneficial, OR perhaps one
> of you (Timothy, Michel, Andrey) could send me a WX 7100 for testing;
> I've been meaning to ask for one of these anyway for a while now.
> 
> Just out of curiosity, what (other) differences in behavior do you see
> between PPC64LE systems configured for 64K pages and systems
> configured for 4K pages?

OK, I have a Talos II machine up with a WX4100 installed.  Debian is loaded on it right now, and I can grant direct BMC access if anyone wants to take a crack at debugging.

Email me directly for login credentials.

Thanks!
Comment 17 Timothy Pearson 2018-05-25 18:04:50 UTC
We have been unable to replicate this with the latest AMD GPU firmware from https://people.freedesktop.org/~agd5f/radeon_ucode/polaris/ .  Xorg starts fine with no errors.

System: Debian Testing
Kernel: 4.17.0-rc6

$ getconf PAGESIZE
65536

If you still experience the issue, please give detailed instructions on how to reproduce your test environment, including the OS installed and the packages / configuration files in use.
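
Editorial note: the page size reported by getconf above can also be queried from a program, which makes it easy to see how many GPU pages one CPU page spans. The following is a minimal sketch, not driver code. It assumes a 4 KiB GPU page size (the value the driver names AMDGPU_GPU_PAGE_SIZE); the helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <unistd.h>

/* Assumption for illustration: the GPU page size is 4 KiB. */
#define GPU_PAGE_SIZE_BYTES 4096u

/* CPU page size in bytes as reported by the OS, or 0 on failure. */
static uint64_t cpu_page_size(void)
{
    long sz = sysconf(_SC_PAGESIZE);
    return sz > 0 ? (uint64_t)sz : 0;
}

/* Number of GPU pages covered by one CPU page of the given size. */
static uint64_t gpu_pages_per_cpu_page(uint64_t cpu_page_bytes)
{
    /* The ratio only makes sense when a CPU page is at least one GPU page. */
    assert(cpu_page_bytes >= GPU_PAGE_SIZE_BYTES);
    return cpu_page_bytes / GPU_PAGE_SIZE_BYTES;
}
```

On a 64K kernel such as the one above, the ratio is 16; on a 4K kernel it is 1, which is why any confusion between the two units goes unnoticed there.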
Comment 18 Timothy Pearson 2018-05-25 19:07:42 UTC
(In reply to Timothy Pearson from comment #17)
> We have been unable to replicate this with the latest AMD GPU firmware from
> https://people.freedesktop.org/~agd5f/radeon_ucode/polaris/ .  Xorg starts
> fine with no errors.
> 
> System: Debian Testing
> Kernel: 4.17.0-rc6
> 
> $ getconf PAGESIZE
> 65536
> 
> If you still experience the issues please give detailed instructions on how
> to reproduce your test environment, including OS installed and packages /
> configuration files in use.

To clarify, please install all of the files in that directory into your /lib/firmware/amdgpu/ directory (on Debian), then test to see if the issue persists.  The WX7100 uses the polaris10* files, the WX4100 uses the polaris11* files.
Comment 19 Alex Deucher 2018-05-25 19:09:44 UTC
(In reply to Timothy Pearson from comment #17)
> We have been unable to replicate this with the latest AMD GPU firmware from
> https://people.freedesktop.org/~agd5f/radeon_ucode/polaris/ .  Xorg starts
> fine with no errors.

Those are old firmware files I posted for testing a while ago.  Please test with the latest firmware from the linux-firmware tree:
https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git
Comment 20 Ben Crocker 2018-05-30 17:16:13 UTC
Please note that the path to the firmware,

/lib/firmware/amdgpu,

is correct for Red Hat products (RHEL, Fedora, CentOS) as well.
Comment 21 foxbat 2018-06-05 15:52:51 UTC
I am still getting amdgpu crashes on POWER9 (ppc64le) with the latest AMD firmware on kernel 4.17 (release). My card is a WX5100 (Polaris 10). I can reliably reproduce this by having spectacle attempt to capture an area of the screen. However, I also experience crashes randomly while moving a window.
Here is the dmesg output:

[   75.701274] CPU: 37 PID: 3391 Comm: spectacle Not tainted 4.17.0-foxbat-64k #1
[   75.701275] NIP:  c00800001748cd08 LR: c008000017495944 CTR: 0000000000000000
[   75.701278] REGS: c000001f8ea53230 TRAP: 0700   Not tainted  (4.17.0-foxbat-64k)
[   75.701278] MSR:  9000000000029033 <SF,HV,EE,ME,IR,DR,RI,LE>  CR: 24828848  XER: 20040000
[   75.701284] CFAR: c00800001748c724 SOFTE: 0
               GPR00: c008000017495944 c000001f8ea534b0 c0080000176e6400 c000201cb2303418
               GPR04: c000201ca9cfce60 0000000009fff138 0000000000000100 00000000000005a3
               GPR08: 3817b594461fff5f 0000000000100000 0000000000000000 c008000017649dd0
               GPR12: 0000000000008000 c000001ffffd5600 c000201cb2300000 c000001fe89d9d20
               GPR16: 00000000027ffc4e c000201ca9cfce60 00000000fffe2000 ffffffffffffffef
               GPR20: 0000000009fff138 0000000000000100 c000201cb23066a8 0000000000000000
               GPR24: 00000000027ffc4e 00000000000e5fff c000201cb2303418 00000001fe000000
               GPR28: c000201cb2300000 0000000000000000 c000201ca9cfce60 0000000009fff138
[   75.701323] NIP [c00800001748cd08] amdgpu_sa_bo_new+0x640/0x6c0 [amdgpu]
[   75.701347] LR [c008000017495944] amdgpu_ib_get+0x8c/0x120 [amdgpu]
[   75.701348] Call Trace:
[   75.701370] [c000001f8ea534b0] [c00800001748c7f8] amdgpu_sa_bo_new+0x130/0x6c0 [amdgpu] (unreliable)
[   75.701395] [c000001f8ea53710] [c008000017495944] amdgpu_ib_get+0x8c/0x120 [amdgpu]
[   75.701421] [c000001f8ea53790] [c008000017572558] amdgpu_job_alloc_with_ib+0x90/0x110 [amdgpu]
[   75.701442] [c000001f8ea537d0] [c008000017492124] amdgpu_vm_bo_update_mapping+0x35c/0x480 [amdgpu]
[   75.701469] [c000001f8ea538c0] [c0080000174925c8] amdgpu_vm_bo_update+0x380/0x740 [amdgpu]
[   75.701489] [c000001f8ea539d0] [c008000017476d10] amdgpu_gem_va_ioctl+0x5f8/0x620 [amdgpu]
[   75.701496] [c000001f8ea53b20] [c008000015348ce8] drm_ioctl_kernel+0xa0/0x140 [drm]
[   75.701502] [c000001f8ea53b70] [c0080000153491b4] drm_ioctl+0x1ac/0x4d0 [drm]
[   75.701518] [c000001f8ea53cb0] [c008000017450078] amdgpu_drm_ioctl+0x70/0xd0 [amdgpu]
[   75.701521] [c000001f8ea53d00] [c0000000003e216c] do_vfs_ioctl+0xdc/0x8a0
[   75.701523] [c000001f8ea53da0] [c0000000003e2a34] ksys_ioctl+0x104/0x120
[   75.701525] [c000001f8ea53df0] [c0000000003e2a90] sys_ioctl+0x40/0xa0
[   75.701528] [c000001f8ea53e30] [c00000000000b9e0] system_call+0x58/0x6c
[   75.701529] Instruction dump:
[   75.701531] 7fc3f378 38210260 e8010010 81810008 ea21ff88 ea81ffa0 eaa1ffa8 eb41ffd0
[   75.701535] ebc1fff0 7c0803a6 7d908120 4e800020 <0fe00000> 3bc0ffea 7fc3f378 38210260
[   75.701540] ---[ end trace 838930e3a806a76d ]---
[   75.701557] amdgpu 0000:01:00.0: failed to get a new IB (-22)
[   75.701637] [drm:amdgpu_gem_va_ioctl [amdgpu]] *ERROR* Couldn't update BO_VA (-22)
[   75.702847] amdgpu 0000:01:00.0: failed to get a new IB (-22)
[   75.736208] amdgpu 0000:01:00.0: failed to get a new IB (-22)
[   75.741527] amdgpu 0000:01:00.0: failed to get a new IB (-22)
[   75.741602] [drm:amdgpu_gem_va_ioctl [amdgpu]] *ERROR* Couldn't update BO_VA (-22)
[   75.743120] amdgpu 0000:01:00.0: failed to get a new IB (-22)
[   75.743182] [drm:amdgpu_gem_va_ioctl [amdgpu]] *ERROR* Couldn't update BO_VA (-22)
Comment 22 Michel Dänzer 2018-06-05 16:10:10 UTC
(In reply to foxbat from comment #21)
> [   75.701323] NIP [c00800001748cd08] amdgpu_sa_bo_new+0x640/0x6c0 [amdgpu]

What does

 scripts/faddr2line drivers/gpu/drm/amd/amdgpu/amdgpu.ko amdgpu_sa_bo_new+0x640/0x6c0

say in the kernel build tree?
Comment 23 foxbat 2018-06-05 17:17:25 UTC
(In reply to Michel Dänzer from comment #22)
> (In reply to foxbat from comment #21)
> > [   75.701323] NIP [c00800001748cd08] amdgpu_sa_bo_new+0x640/0x6c0 [amdgpu]
> 
> What does
> 
>  scripts/faddr2line drivers/gpu/drm/amd/amdgpu/amdgpu.ko
> amdgpu_sa_bo_new+0x640/0x6c0
> 
> say in the kernel build tree?

foxbat@colossus:~/build/linux-4.17$ scripts/faddr2line drivers/gpu/drm/amd/amdgpu/amdgpu.ko amdgpu_sa_bo_new+0x640/0x6c0 

skipping amdgpu_sa_bo_new address at 0x3cd08 due to size mismatch (0x6c0 != 0x18e)
no match for amdgpu_sa_bo_new+0x640/0x6c0
Comment 24 Michel Dänzer 2018-06-06 07:18:32 UTC
Created attachment 140046
Add some debugging output to amdgpu_sa_bo_new

This patch should tell us which of the WARN_ON_ONCE checks in amdgpu_sa_bo_new is hit, and what the values are.
Comment 25 foxbat 2018-06-06 20:39:04 UTC
(In reply to Michel Dänzer from comment #24)
> Created attachment 140046
> Add some debugging output to amdgpu_sa_bo_new
> 
> This patch should tell us which of the WARN_ON_ONCE in amdgpu_sa_bo_new is
> hit, and what the values are.

Hello,

I have reproduced the error with the provided patch. Here is the output:

[  175.501689] WARNING: CPU: 24 PID: 3212 at drivers/gpu/drm/amd/amdgpu/amdgpu_sa.c:288 amdgpu_sa_bo_new+0x648/0x6d0 [amdgpu]
[  175.501692] Modules linked in: binfmt_misc snd_usb_audio snd_usbmidi_lib snd_rawmidi snd_seq_device amdgpu evdev snd_hda_codec_hdmi chash ast gpu_sched snd_hda_intel ttm snd_hda_codec ghash_generic drm_kms_helper gf128mul ecb snd_hda_core drm xts snd_hwdep snd_pcm drm_panel_orientation_quirks syscopyarea ctr sysfillrect sysimgblt snd_timer fb_sys_fops cbc vmx_crypto snd i2c_algo_bit sg soundcore ofpart ipmi_powernv powernv_flash ipmi_devintf opal_prd mtd ipmi_msghandler at24 ip_tables x_tables autofs4 ext4 crc16 mbcache jbd2 fscrypto raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx hid_generic usbhid hid xor raid6_pq libcrc32c crc32c_generic raid1 raid0 multipath linear sd_mod md_mod ses enclosure xhci_pci xhci_hcd mpt3sas raid_class usbcore scsi_transport_sas nvme tg3 nvme_core
[  175.501731]  aacraid libphy
[  175.501735] CPU: 24 PID: 3212 Comm: spectacle Not tainted 4.17.0-foxbat-64k #2
[  175.501736] NIP:  c00800001885cd10 LR: c008000018865944 CTR: 0000000000000000
[  175.501738] REGS: c000001f618b71f0 TRAP: 0700   Not tainted  (4.17.0-foxbat-64k)
[  175.501738] MSR:  9000000000029033 <SF,HV,EE,ME,IR,DR,RI,LE>  CR: 24828848  XER: 20040000
[  175.501744] CFAR: c00800001885c724 SOFTE: 0 
               GPR00: c008000018865944 c000001f618b7470 c008000018ab6400 c000001fdbea3418 
               GPR04: c000201ca9246260 0000000000100000 0000000000000100 00000000000017c1 
               GPR08: 3817b594461fff5f c000201ca9246000 0000000000000000 c008000018a19dd0 
               GPR12: 0000000000008000 c000001ffffe4000 c000001fdbea0000 c000001feab786a0 
               GPR16: 00000000027ffc4e c000201ca9246260 00000000fffe2000 ffffffffffffffef 
               GPR20: 0000000009fff138 0000000000000100 c000001fdbea6980 0000000000000000 
               GPR24: 00000000027ffc4e 00000000000e5fff c000001fdbea3418 00000001fe000000 
               GPR28: c000001fdbea0000 0000000000000000 c000201ca9246260 0000000009fff138 
[  175.501780] NIP [c00800001885cd10] amdgpu_sa_bo_new+0x648/0x6d0 [amdgpu]
[  175.501798] LR [c008000018865944] amdgpu_ib_get+0x8c/0x120 [amdgpu]
[  175.501799] Call Trace:
[  175.501820] [c000001f618b7470] [c00800001885c7f8] amdgpu_sa_bo_new+0x130/0x6d0 [amdgpu] (unreliable)
[  175.501840] [c000001f618b7710] [c008000018865944] amdgpu_ib_get+0x8c/0x120 [amdgpu]
[  175.501861] [c000001f618b7790] [c008000018942558] amdgpu_job_alloc_with_ib+0x90/0x110 [amdgpu]
[  175.501880] [c000001f618b77d0] [c008000018862124] amdgpu_vm_bo_update_mapping+0x35c/0x480 [amdgpu]
[  175.501899] [c000001f618b78c0] [c0080000188625c8] amdgpu_vm_bo_update+0x380/0x740 [amdgpu]
[  175.501916] [c000001f618b79d0] [c008000018846d10] amdgpu_gem_va_ioctl+0x5f8/0x620 [amdgpu]
[  175.501923] [c000001f618b7b20] [c008000017748ce8] drm_ioctl_kernel+0xa0/0x140 [drm]
[  175.501928] [c000001f618b7b70] [c0080000177491b4] drm_ioctl+0x1ac/0x4d0 [drm]
[  175.501943] [c000001f618b7cb0] [c008000018820078] amdgpu_drm_ioctl+0x70/0xd0 [amdgpu]
[  175.501946] [c000001f618b7d00] [c0000000003e216c] do_vfs_ioctl+0xdc/0x8a0
[  175.501948] [c000001f618b7da0] [c0000000003e2a34] ksys_ioctl+0x104/0x120
[  175.501950] [c000001f618b7df0] [c0000000003e2a90] sys_ioctl+0x40/0xa0
[  175.501953] [c000001f618b7e30] [c00000000000b9e0] system_call+0x58/0x6c
[  175.501954] Instruction dump:
[  175.501956] 892a0000 2f890000 409efd4c 3c620000 e8639aa0 39200001 7ea4ab78 992a0000 
[  175.501959] 481bee31 e8410018 4bfffd2c 60000000 <0fe00000> 3d420000 e94a9a98 3bc0ffea 
[  175.501963] ---[ end trace b955b8bff21188f9 ]---
[  175.501965] [drm] size=167768376 > sa_manager->size=1048576
[  175.501968] amdgpu 0000:01:00.0: failed to get a new IB (-22)
[  175.502036] [drm:amdgpu_gem_va_ioctl [amdgpu]] *ERROR* Couldn't update BO_VA (-22)
[  175.502153] amdgpu 0000:01:00.0: failed to get a new IB (-22)
[  175.535050] amdgpu 0000:01:00.0: failed to get a new IB (-22)
[  175.539783] amdgpu 0000:01:00.0: failed to get a new IB (-22)
[  175.539870] [drm:amdgpu_gem_va_ioctl [amdgpu]] *ERROR* Couldn't update BO_VA (-22)
[  175.540410] amdgpu 0000:01:00.0: failed to get a new IB (-22)
[  175.540489] [drm:amdgpu_gem_va_ioctl [amdgpu]] *ERROR* Couldn't update BO_VA (-22)
Comment 26 Ben Crocker 2018-06-06 21:10:38 UTC
foxbat, thanks for posting this!

> [  175.501965] [drm] size=167768376 > sa_manager->size=1048576

Well, isn't THAT interesting!
Requested size is ALMOST, but not quite, 160 MB, while the
sub-allocation manager's size (sa_manager->size) is only 1 MB.
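
Editorial note: the arithmetic behind this observation can be checked directly. The sketch below uses only the two numbers from the debug line in comment 25; the helper names are hypothetical and the code is not taken from the driver.

```c
#include <assert.h>
#include <stdint.h>

/* Values copied from the "[drm] size=... > sa_manager->size=..." line. */
#define REQUESTED_BYTES  167768376ull
#define SA_MANAGER_BYTES 1048576ull

/* Bytes by which `bytes` falls short of `mib` MiB (0 if it does not). */
static uint64_t shortfall_from_mib(uint64_t bytes, uint64_t mib)
{
    uint64_t target = mib * 1024ull * 1024ull;
    return bytes < target ? target - bytes : 0;
}

/* Whole multiples of the pool size that the request represents. */
static uint64_t pool_multiples(uint64_t bytes, uint64_t pool_bytes)
{
    assert(pool_bytes != 0);
    return bytes / pool_bytes;
}
```

The request is 3784 bytes short of 160 MiB and more than 159 times the pool size, so it cannot be a legitimate indirect-buffer size. The -22 in the "failed to get a new IB" lines matches EINVAL on Linux.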
Comment 27 Christian König 2018-06-07 07:14:34 UTC
(In reply to Ben Crocker from comment #26)
> foxbat, thanks for posting this!
> 
> > [  175.501965] [drm] size=167768376 > sa_manager->size=1048576
> 
> Well, isn't THAT interesting!
> Requested size is ALMOST, but not quite, 160 MB, while what
> the sub-allocation manager has left is 1 MB.

Sounds like we have an overrun in amdgpu_vm_bo_update_mapping() while calculating how many bytes we need to allocate from the sa_manager.
Comment 28 Ben Crocker 2018-06-18 17:37:57 UTC
Created attachment 140208
Patches for additional debug info

foxbat, could you please apply the three attached patches and provide dmesg output again?

Thanks!
Comment 29 foxbat 2018-06-18 22:24:28 UTC
(In reply to Ben Crocker from comment #28)

Hi Ben,

Please see the dmesg output at https://pastebin.com/i7sZhUKr
I reproduced the graphics "crash" at around 950.

However, one thing I did find out is that I am able to get video back by simply pressing a key on my keyboard. After doing this, video appears to act normally.
Comment 30 Michel Dänzer 2018-06-19 09:07:47 UTC
I suspect there's an issue in amdgpu_vm_bo_split_mapping, e.g. this code looks suspicious:

		if (pages_addr) {
			uint64_t count;

			max_entries = min(max_entries, 16ull * 1024ull);
			for (count = 1; count < max_entries; ++count) {
				uint64_t idx = pfn + count;

				if (pages_addr[idx] !=
				    (pages_addr[idx - 1] + PAGE_SIZE))
					break;
			}

count is compared to max_entries, which is a number of GPU pages, but is also added to pfn, which is a number of CPU pages.
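
Editorial note: the unit mismatch described above can be reproduced outside the kernel. The sketch below is a simplified stand-in for the quoted loop, not the driver code: it assumes 64 KiB CPU pages and 4 KiB GPU pages, and the function names are hypothetical. Instead of reading out of bounds, it reports when the loop index would leave the array.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative constants, not taken from the driver headers. */
#define CPU_PAGE_BYTES 65536ull /* 64K kernel page size */
#define GPU_PAGE_BYTES 4096ull  /* assumed GPU page size */

/* Convert a count of GPU pages into the number of CPU pages it spans. */
static uint64_t cpu_pages_from_gpu_pages(uint64_t gpu_pages)
{
    return gpu_pages / (CPU_PAGE_BYTES / GPU_PAGE_BYTES);
}

/*
 * Scan for physically contiguous CPU pages starting at pfn, in the same
 * shape as the quoted loop. `bound` plays the role of max_entries.
 * Returns 0 if the scan would index past the end of pages_addr, 1 otherwise.
 */
static int scan_stays_in_bounds(const uint64_t *pages_addr, size_t n_cpu_pages,
                                uint64_t pfn, uint64_t bound)
{
    assert(pages_addr != NULL);
    for (uint64_t count = 1; count < bound; ++count) {
        uint64_t idx = pfn + count;

        if (idx >= n_cpu_pages)
            return 0; /* the quoted loop would read past the array here */
        if (pages_addr[idx] != pages_addr[idx - 1] + CPU_PAGE_BYTES)
            break;
    }
    return 1;
}
```

For a fully contiguous buffer of 512 CPU pages (8192 GPU pages), passing the GPU-page count 8192 as the bound walks off the end of the array, while passing cpu_pages_from_gpu_pages(8192), which is 512, stays inside it. With 4 KiB CPU pages the two bounds coincide, which is consistent with the problem not showing up on x86.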
Comment 31 Ben Crocker 2018-06-20 01:14:37 UTC
Created attachment 140239
More patches (2) for additional debug info

Hi foxbat,

Could you please apply these two patches on top of the patches I supplied
yesterday, and post the output?  (The patch to amdgpu_job.c is a corrected
version of yesterday's patch.)

  Thanks,
  Ben
Comment 32 foxbat 2018-06-20 21:50:05 UTC
Created attachment 140252
amdgpu dmesg output

Crash is at 246.306571
Comment 33 Ben Crocker 2018-06-20 22:39:00 UTC
Created attachment 140253
Refining the amdgpu_vm.c:amdgpu_vm_bo_split_mapping further

Print max_entries in both decimal and hex.
Comment 34 Ben Crocker 2018-06-20 22:57:11 UTC
Things start to go haywire at 246.305790, which appears to be the first time
the do-loop in amdgpu_vm_bo_update_mapping executes more than once.
The value of max_entries is obviously absurd; a little later, the value of
start (0x104000) is consistent with the first trip through the loop, but
the value of last (0xe5fff) looks wrong, starting with the fact that it
is LESS than start.

[  246.305139] [drm] amdgpu_vm_bo_split_mapping nodes->size=512 pfn=0 max_entries=8192
[  246.305230] [drm] amdgpu_vm_bo_split_mapping: addr=0x1e2000000 vram_base_offset=0x0
[  246.305322] [drm] amdgpu_vm_bo_split_mapping: start=0x102000 last=0x103fff
[  246.305400] [drm] amdgpu_vm_bo_update_mapping l.1304: ndw=64 ncmds=9 fragment_size=9
[  246.305474] [drm] amdgpu_vm_bo_update_mapping l.1310: resulting ndw=334
[  246.305533] [drm] amdgpu_vm_bo_update_mapping calls amdgpu_job_alloc_with_ib(..., ndw*4 = 1336 (00000538))
[  246.305630] [drm] amdgpu_job_alloc_with_ib calls amdgpu_ib_get(..., size=1336 (00000538))
[  246.305704] [drm] amdgpu_ib_get calls amdgpu_sa_bo_new(..., size=1336 (00000538), align=256
Things go haywire:
[  246.305790] [drm] amdgpu_vm_bo_split_mapping nodes->size=512 pfn=8192 max_entries=18446744073709428736
[  246.305878] [drm] amdgpu_vm_bo_split_mapping: addr=0x1e2000000 vram_base_offset=0x0
[  246.305970] [drm] amdgpu_vm_bo_split_mapping: start=0x104000 last=0xe5fff
[  246.306029] [drm] amdgpu_vm_bo_update_mapping l.1304: ndw=64 ncmds=4194185 fragment_size=9
[  246.306111] [drm] amdgpu_vm_bo_update_mapping l.1310: resulting ndw=41942094
[  246.306197] [drm] amdgpu_vm_bo_update_mapping calls amdgpu_job_alloc_with_ib(..., ndw*4 = 167768376 (09FFF138))
[  246.306322] [drm] amdgpu_job_alloc_with_ib calls amdgpu_ib_get(..., size=167768376 (09fff138))
[  246.306429] [drm] amdgpu_ib_get calls amdgpu_sa_bo_new(..., size=167768376 (09fff138), align=256
[  246.306571] WARNING: CPU: 67 PID: 21839 at amdgpu_sa_bo_new+0x628/0x6b0 [amdgpu]
[  246.306645] Modules linked in: binfmt_misc ext4 crc16 mbcache jbd2 fscrypto evdev snd_usb_audio amdgpu snd_usbmidi_lib snd_rawmidi snd_seq_device ghash_generic gf128mul ecb xts ctr snd_hda_codec_hdmi chash ast gpu_sched cbc ttm snd_hda_intel vmx_crypto snd_hda_codec drm_kms_helper snd_hda_core drm snd_hwdep snd_pcm snd_timer ofpart drm_panel_orientation_quirks syscopyarea snd sysfillrect ipmi_powernv sysimgblt fb_sys_fops ipmi_devintf i2c_algo_bit sg powernv_flash soundcore mtd ipmi_msghandler opal_prd at24 sunrpc ecryptfs ip_tables x_tables autofs4 zfs(PO) zunicode(PO) zavl(PO) icp(PO) zcommon(PO) znvpair(PO) spl(O) raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx hid_generic usbhid hid xor raid6_pq libcrc32c crc32c_generic raid1 raid0 multipath linear sd_mod md_mod ses
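
Editorial note: the absurd values in this log are what unsigned 64-bit wraparound produces if a page frame number counted in GPU pages is subtracted from a node size counted in CPU pages. The sketch below reproduces the logged numbers under two assumptions that are not confirmed by the thread: that max_entries is computed as (nodes->size - pfn) scaled by the CPU/GPU page-size ratio, and that last is start + max_entries - 1 before any clamping. The function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative constants: 64K CPU pages, assumed 4K GPU pages. */
#define CPU_PAGE_BYTES 65536ull
#define GPU_PAGE_BYTES 4096ull

/*
 * Assumed form of the max_entries computation: CPU pages remaining in
 * the node, scaled to GPU pages, in unsigned 64-bit arithmetic.
 */
static uint64_t remaining_gpu_pages(uint64_t node_cpu_pages, uint64_t pfn)
{
    assert(CPU_PAGE_BYTES % GPU_PAGE_BYTES == 0);
    /* Wraps around if pfn exceeds node_cpu_pages. */
    return (node_cpu_pages - pfn) * (CPU_PAGE_BYTES / GPU_PAGE_BYTES);
}

/* Assumed: last GPU page of the chunk, before clamping to the mapping. */
static uint64_t last_gpu_page(uint64_t start, uint64_t max_entries)
{
    return start + max_entries - 1;
}
```

With nodes->size=512 and pfn=0 this gives 8192, matching the first pass. With pfn=8192 (the first pass advanced pfn by 8192, the number of GPU pages it mapped) it gives 18446744073709428736, and start=0x104000 then yields last=0xe5fff, both matching the second pass in the log.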
Comment 35 Michel Dänzer 2018-06-21 09:29:57 UTC
Created attachment 140257
drm/amdgpu: GPU vs CPU page size fixes in amdgpu_vm_bo_split_mapping

Does this patch help?
Comment 36 foxbat 2018-06-21 10:22:02 UTC
(In reply to Michel Dänzer from comment #35)
> Created attachment 140257
> drm/amdgpu: GPU vs CPU page size fixes in amdgpu_vm_bo_split_mapping
> 
> Does this patch help?

It looks like the patch isn't fully applying (using tree 4.17):

patching file drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
Hunk #1 FAILED at 1577.
Hunk #2 succeeded at 1463 with fuzz 2 (offset -127 lines).
Hunk #3 succeeded at 1480 (offset -125 lines).
1 out of 3 hunks FAILED -- saving rejects to file drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c.rej

rejects:

--- drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
+++ drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
@@ -1577,7 +1577,9 @@ static int amdgpu_vm_bo_split_mapping(struct amdgpu_device *adev,
                        uint64_t count;
 
                        max_entries = min(max_entries, 16ull * 1024ull);
-                       for (count = 1; count < max_entries; ++count) {
+                       for (count = 1;
+                            count < max_entries / (PAGE_SIZE / AMDGPU_GPU_PAGE_SIZE);
+                            ++count) {
                                uint64_t idx = pfn + count;
 
                                if (pages_addr[idx] !=
Comment 37 Michel Dänzer 2018-06-21 10:34:52 UTC
(In reply to foxbat from comment #36)
> It looks like the patch isn't fully applying (using tree 4.17):

It should apply to 4.17. Did you revert Ben's debugging patches before applying it?
Comment 38 foxbat 2018-06-21 11:17:11 UTC
(In reply to Michel Dänzer from comment #37)
> (In reply to foxbat from comment #36)
> > It looks like the patch isn't fully applying (using tree 4.17):
> 
> It should apply to 4.17. Did you revert Ben's debugging patches before
> applying it?

Sorry, I did not. The patch does apply after reverting the debug patches. I'll be able to test this afternoon (US eastern time).
Comment 39 foxbat 2018-06-21 22:32:25 UTC
(In reply to Michel Dänzer from comment #35)
> Created attachment 140257
> drm/amdgpu: GPU vs CPU page size fixes in amdgpu_vm_bo_split_mapping
> 
> Does this patch help?

Hi Michel,

I have applied the patch and it appears to help a lot: I have not been able to reproduce any amdgpu crashes/errors so far. I will continue running with the patch and let you know if I have any issues.
Comment 40 Timothy Pearson 2018-06-24 09:15:31 UTC
(In reply to foxbat from comment #39)
> (In reply to Michel Dänzer from comment #35)
> > Created attachment 140257
> > drm/amdgpu: GPU vs CPU page size fixes in amdgpu_vm_bo_split_mapping
> > 
> > Does this patch help?
> 
> Hi Michel,
> 
> I have applied the patch and so far it appears to help a lot. I have not
> been able to reproduce any amdgpu crashes/errors so far. I will continue
> running with the patch and let you know if I have any issues.

I can confirm that Michel's patch, along with the following patches, also enables full 3D acceleration via 64-bit DMA across a wide range of tested applications:

https://lists.ozlabs.org/pipermail/linuxppc-dev/2018-June/174985.html

https://scm.raptorengineering.com/scm/#changesetPanel;6IQvJHpVSM;578e2d761554130f7c6abfe821c2509912a00ac6;commit

https://scm.raptorengineering.com/scm/#changesetPanel;6IQvJHpVSM;461333cedd470140eb1e1667608da2a6d65a58e5;commit

https://bugs.freedesktop.org/show_bug.cgi?id=107012

All of these patches have been submitted upstream to the respective maintainers for review.  When Michel's is submitted and merged, we should try to get some of this downstream into the various distros as well.
Comment 41 Michel Dänzer 2018-06-28 09:12:33 UTC
Thanks for the report and help tracking down the problem. The fix is queued for 4.18, and will be backported to stable branches:

https://cgit.freedesktop.org/~agd5f/linux/commit/?h=drm-fixes-4.18&id=38e624a18f9a05b8c894409be6b14709a7206c7c
Comment 42 Dan Horák 2018-07-12 11:51:01 UTC
For the record, the output of "git describe 38e624a18f9a05b" is:
v4.18-rc1-36-g38e624a18f9a
