Bug 103138

Summary: [regression, vega] BUG: Bad page state in process gnome-shell pfn:77cc33
Product: Mesa
Reporter: Vedran Miletić <vedran>
Component: Drivers/Gallium/radeonsi
Assignee: Default DRI bug account <dri-devel>
Status: RESOLVED FIXED
QA Contact: Default DRI bug account <dri-devel>
Severity: critical
Priority: medium
Keywords: regression
Version: unspecified
Hardware: All
OS: All
Whiteboard:
i915 platform:
i915 features:
Attachments: Possible fix

Description Vedran Miletić 2017-10-07 18:32:23 UTC
With amd-staging-drm-next revision e5f6a57e350a7921e4edc30874679bdff11b13f4 I get:

Lis 07 19:44:22 jaffa kernel: BUG: Bad page state in process gnome-shell  pfn:77cc33
Lis 07 19:44:22 jaffa kernel: page:ffffe3c5ddf30cc0 count:0 mapcount:0 mapping:          (null) index:0x33 compound_mapcount: 1
Lis 07 19:44:22 jaffa kernel: flags: 0x17ffffc0000000()
Lis 07 19:44:22 jaffa kernel: raw: 0017ffffc0000000 0000000000000000 0000000000000000 00000000ffffffff
Lis 07 19:44:22 jaffa kernel: raw: ffffe3c5ddf30001 ffffe3c5ddf30ce0 0000000000000000 0000000000000000
Lis 07 19:44:22 jaffa kernel: page dumped because: corrupted mapping in tail page
Lis 07 19:44:22 jaffa kernel: Modules linked in: bnep fuse xt_CHECKSUM ipt_MASQUERADE nf_nat_masquerade_ipv4 tun nf_conntrack_netbios_ns nf_conntrack_broadcast xt_CT ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 xt_conntrack ip_set nfnetlink ebtable_nat ebtable_broute bridge stp llc ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_raw ip6table_securi
Lis 07 19:44:22 jaffa kernel:  iTCO_vendor_support intel_uncore ppdev mxm_wmi snd_hda_core intel_rapl_perf snd_hwdep snd_seq snd_seq_device pcspkr hci_uart snd_pcm i2c_i801 snd_timer joydev snd btbcm btqca btintel mei_me soundcore mei tpm_tis bluetooth shpchp tpm_tis_core intel_pch_thermal tpm parport_pc pinctrl_sunrisepoint parport ecdh_generic wmi video acpi_als pinctrl_intel 
Lis 07 19:44:22 jaffa kernel: CPU: 2 PID: 2093 Comm: gnome-shell Tainted: G    B   W       4.13.0-rc5+ #15
Lis 07 19:44:22 jaffa kernel: Hardware name: MSI MS-7971/Z170-A PRO (MS-7971), BIOS 1.I0 05/02/2017
Lis 07 19:44:22 jaffa kernel: Call Trace:
Lis 07 19:44:22 jaffa kernel:  dump_stack+0x63/0x8b
Lis 07 19:44:22 jaffa kernel:  bad_page+0xcb/0x120
Lis 07 19:44:22 jaffa kernel:  __free_pages_ok+0x2f7/0x400
Lis 07 19:44:22 jaffa kernel:  __free_pages+0x1f/0x40
Lis 07 19:44:22 jaffa kernel:  free_pages+0x54/0x70
Lis 07 19:44:22 jaffa kernel:  dma_generic_free_coherent+0x25/0x30
Lis 07 19:44:22 jaffa kernel:  x86_swiotlb_free_coherent+0x41/0x70
Lis 07 19:44:22 jaffa kernel:  __ttm_dma_free_page.isra.6+0x52/0x70 [ttm]
Lis 07 19:44:22 jaffa kernel:  ttm_dma_page_put+0xb0/0xf0 [ttm]
Lis 07 19:44:22 jaffa kernel:  ttm_dma_unpopulate+0xcc/0x3d0 [ttm]
Lis 07 19:44:22 jaffa kernel:  amdgpu_ttm_tt_unpopulate+0x7a/0x80 [amdgpu]
Lis 07 19:44:22 jaffa kernel:  ttm_tt_unpopulate.part.6+0x48/0x50 [ttm]
Lis 07 19:44:22 jaffa kernel:  ttm_tt_destroy.part.7+0x49/0x50 [ttm]
Lis 07 19:44:22 jaffa kernel:  ttm_tt_destroy+0x13/0x20 [ttm]
Lis 07 19:44:22 jaffa kernel:  ttm_bo_cleanup_memtype_use+0x30/0x70 [ttm]
Lis 07 19:44:22 jaffa kernel:  ttm_bo_unref+0x318/0x350 [ttm]
Lis 07 19:44:22 jaffa kernel:  amdgpu_bo_unref+0x39/0x70 [amdgpu]
Lis 07 19:44:22 jaffa kernel:  amdgpu_gem_object_free+0x57/0x70 [amdgpu]
Lis 07 19:44:22 jaffa kernel:  drm_gem_object_free+0x1f/0x40 [drm]
Lis 07 19:44:22 jaffa kernel:  drm_gem_object_put_unlocked+0x3a/0x70 [drm]
Lis 07 19:44:22 jaffa kernel:  drm_gem_object_handle_put_unlocked+0x6a/0xb0 [drm]
Lis 07 19:44:22 jaffa kernel:  drm_gem_object_release_handle+0x53/0x90 [drm]
Lis 07 19:44:22 jaffa kernel:  drm_gem_handle_delete+0x58/0x80 [drm]
Lis 07 19:44:22 jaffa kernel:  ? drm_gem_handle_create+0x40/0x40 [drm]
Lis 07 19:44:22 jaffa kernel:  drm_gem_close_ioctl+0x20/0x30 [drm]
Lis 07 19:44:22 jaffa kernel:  drm_ioctl_kernel+0x5d/0xb0 [drm]
Lis 07 19:44:22 jaffa kernel:  drm_ioctl+0x31b/0x3d0 [drm]
Lis 07 19:44:22 jaffa kernel:  ? drm_gem_handle_create+0x40/0x40 [drm]
Lis 07 19:44:22 jaffa kernel:  ? unmap_region+0xf7/0x130
Lis 07 19:44:22 jaffa kernel:  amdgpu_drm_ioctl+0x4f/0x90 [amdgpu]
Lis 07 19:44:22 jaffa kernel:  do_vfs_ioctl+0xa5/0x600
Lis 07 19:44:22 jaffa kernel:  SyS_ioctl+0x79/0x90
Lis 07 19:44:22 jaffa kernel:  entry_SYSCALL_64_fastpath+0x1a/0xa5
Lis 07 19:44:22 jaffa kernel: RIP: 0033:0x7f99749b00d7
Lis 07 19:44:22 jaffa kernel: RSP: 002b:00007ffc28602968 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
Lis 07 19:44:22 jaffa kernel: RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f99749b00d7
Lis 07 19:44:22 jaffa kernel: RDX: 00007ffc286029a8 RSI: 0000000040086409 RDI: 000000000000000c
Lis 07 19:44:22 jaffa kernel: RBP: 00007ffc28603d90 R08: 0000000000000000 R09: 000000000000000e
Lis 07 19:44:22 jaffa kernel: R10: 0000000000000053 R11: 0000000000000246 R12: 00007ffc28603e20
Lis 07 19:44:22 jaffa kernel: R13: 00007ffc28603e18 R14: 000000007fffffff R15: 00007f996eab47d0

This is repeated many times and fills up dmesg. The same does not happen with ea0eda9a882b5df33808dbd85bd64376ed187618.
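
Since the report floods dmesg, it can help to count how often it fires per process when comparing kernel revisions. The following is a minimal sketch, not part of the original report: the helper name `count_bad_page_reports` and the regular expression are assumptions based only on the header line shown in the log above, and the sample simply repeats that header line to imitate the flood.

```python
import re
from collections import Counter

# Assumed pattern, derived from the report header line pasted in this bug:
#   "BUG: Bad page state in process gnome-shell  pfn:77cc33"
BUG_RE = re.compile(r"BUG: Bad page state in process (\S+)\s+pfn:([0-9a-f]+)")


def count_bad_page_reports(lines):
    """Count 'Bad page state' report headers per process name."""
    counts = Counter()
    for line in lines:
        match = BUG_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Sample input: the header line from this report, repeated, plus one
# non-header line from the same dump that must not be counted.
header = ("Lis 07 19:44:22 jaffa kernel: BUG: Bad page state in process "
          "gnome-shell  pfn:77cc33")
sample = [
    header,
    "Lis 07 19:44:22 jaffa kernel: page dumped because: corrupted mapping in tail page",
    header,
]

for process, count in count_bad_page_reports(sample).items():
    print(f"{process}: {count} report(s)")
```

To run it against a real log, the same function could be fed the lines of saved kernel log output (for example, a file captured from dmesg) instead of the sample list.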

I have not tried bisecting since, IIRC, some revisions between those two hang my machine. I can try if that would be useful.
Comment 1 Vedran Miletić 2017-10-07 18:34:19 UTC
Forgot to note that this is on Fedora 27 pre-release with LLVM and Mesa from git, running on:

03:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Vega 10 XT [Radeon RX Vega 64] [1002:687f] (rev c1) (prog-if 00 [VGA controller])
	Subsystem: Advanced Micro Devices, Inc. [AMD/ATI] Device [1002:6b76]
	Flags: bus master, fast devsel, latency 0, IRQ 123
	Memory at c0000000 (64-bit, prefetchable) [size=256M]
	Memory at d0000000 (64-bit, prefetchable) [size=2M]
	I/O ports at e000 [size=256]
	Memory at dfc00000 (32-bit, non-prefetchable) [size=512K]
	Expansion ROM at 000c0000 [disabled] [size=128K]
	Capabilities: <access denied>
	Kernel driver in use: amdgpu
	Kernel modules: amdgpu
Comment 2 Christian König 2017-10-07 19:07:09 UTC
No need for the bisect, there is only one patch which might be the source of the problem.

Going to take a closer look on Monday.
Comment 3 Vedran Miletić 2017-10-07 19:45:55 UTC
(In reply to Christian König from comment #2)
> No need for the bisect, there is only one patch which might be the source of
> the problem.
> 
> Going to take a closer look on Monday.

It's very nice to hear that, especially just 35 minutes after my report, and on a Saturday. Keep up the good work.
Comment 4 Christian König 2017-10-09 12:35:29 UTC
Created attachment 134764
Possible fix

No problem. Does the attached patch help?
Comment 5 Vedran Miletić 2017-10-10 21:18:57 UTC
I can confirm both that the issue is still present in 2c7cb03ed2bb119f146ad0b9d9ab0a9ebb04b5a1 (current tip of amd-staging-drm-next) and that the patch fixes it.
Comment 6 Christian König 2017-10-12 08:25:39 UTC
Thanks, the fix was pushed to Alex's internal branch a minute ago. It should appear on the public mirror today.
