Bug 98638 - Panic on shutdown with AMDGPU and Ubuntu Plymouth
Summary: Panic on shutdown with AMDGPU and Ubuntu Plymouth
Status: RESOLVED DUPLICATE of bug 97980
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/AMDgpu
Version: DRI git
Hardware: Other All
Importance: medium normal
Assignee: Default DRI bug account
Reported: 2016-11-08 10:46 UTC by Ernst Sjöstrand
Modified: 2016-12-07 20:28 UTC

Attachments
possible fix (2.53 KB, patch)
2016-12-06 15:46 UTC, Alex Deucher

Description Ernst Sjöstrand 2016-11-08 10:46:49 UTC
Hi,

I think this is another exciting shutdown panic.
It happens with 4.9-rc4 and drm-next-4.10-wip from right now.

It only happens when I have the Ubuntu bootsplash enabled;
the screen stays frozen.

I managed to capture a few lines with netconsole:

[  352.263493] systemd-shutdow: 31 output lines suppressed due to ratelimiting
[  352.533484] sd 4:0:0:0: [sdd] Synchronizing SCSI cache
[  352.733464] sd 0:0:0:0: [sda] Synchronizing SCSI cache
[  352.791916] [drm] amdgpu: finishing device.
 mac_hid psmouse sysfillrect sysimgblt fb_sys_fops e1000e drm fjes ptp wmi pps_core[  353.908045] CPU: 1 PID: 2072 Comm: plymouthd Not tainted 4.9.0-rc4+ #63
[  353.908045] Hardware name: System manufacturer System Product Name/P8P67 PRO REV 3.1, BIOS 1704 06/08/2011
[  353.908046]  ffffa86f010f7538
[  353.908047]  ffffffff81615882 ffffa86f010f7588 0000000000000000 ffffa86f010f7578
[  353.908049]  ffffffff812834eb 0000016f00000000 ffff97bc34cce400 ffff97bc3001df68
[  353.908051]  ffff97bc2d670e00[  353.908126]  [<ffffffff81230d10>] __die+0xa0/0xe0
[  353.908295]  [<ffffffffc03c90e0>] ? drm_mode_getcrtc+0x140/0x140 [drm]
8b 

That's all I got.
I have 2 4K monitors connected to a Fiji Fury card.
Comment 1 Michel Dänzer 2016-11-09 01:05:49 UTC
It looks like netconsole dropped a lot of information there, and what's left isn't very helpful for figuring out what's going on, I'm afraid. Please keep trying to get more information about it.
Comment 2 Ernst Sjöstrand 2016-11-09 08:42:07 UTC
The bootsplash hides any console prints and if I disable the bootsplash it doesn't happen.
I guess it happens after filesystems go read-only so nothing is written to logs, even if I do alt-sysrq-S.

The computer responds to sysrq but V and G have no effect.
Comment 3 Michel Dänzer 2016-11-09 08:51:16 UTC
The bootsplash doesn't (directly) affect netconsole though, does it? Can you try getting more complete output from netconsole?
Comment 4 Ernst Sjöstrand 2016-11-09 08:54:18 UTC
Perhaps the netconsole runs into problems because of the panic?
I'll try to turn on some more debug options in my kernel perhaps...
And maybe just try a few times and see if I'm lucky.
Comment 5 Alex Deucher 2016-11-09 14:51:52 UTC
Shutdown and remove use the same code in the driver.  Can you reproduce the issue by unloading the amdgpu module?  E.g., as root:

echo 0 > /sys/class/vtconsole/vtcon1/bind
modprobe -r amdgpu
Comment 6 Ernst Sjöstrand 2016-11-14 09:05:05 UTC
Here's a new "backtrace"...
A null pointer dereference in amdgpu_fence_wait_empty can't be that many things...
Is it the rcu_dereference?

https://cgit.freedesktop.org/~agd5f/linux/tree/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c?h=drm-next-4.10-wip#n261

[   85.216191] [drm] amdgpu: finishing device.
[   86.336739] Console: switching to colour VGA+ 80x25
[   86.345463] BUG: unable to handle kernel NULL pointer dereference at           (null)
[   86.345477] IP: [<ffffffffc03882ba>] amdgpu_fence_wait_empty+0x2a/0xd0 [amdgpu]
[   86.345508] PGD 0 [   86.345510] 
[   86.345516] Oops: 0000 [#1] SMP
[   86.345519] Modules linked in: netconsole configfs binfmt_misc eeepc_wmi asus_wmi sparse_keymap video intel_rapl x86_pkg_temp_thermal kvm_intel kvm irqbypass input_leds btusb btrtl btbcm crct10dif_pclmul crc32_pclmul btintel ghash_clmulni_intel bluetooth aesni_intel aes_x86_64 snd_hda_codec_realtek lrw glue_helper ablk_helper snd_hda_codec_generic cryptd snd_hda_codec_hdmi intel_cstate snd_hda_intel intel_rapl_perf snd_hda_codec snd_hda_core serio_raw snd_hwdep snd_pcm snd_seq_midi snd_seq_midi_event snd_rawmidi snd_seq snd_seq_device snd_timer snd mei_me soundcore mei lpc_ich shpchp mac_hid sbs sbshc max6650 coretemp parport_pc ppdev lp parport ip_tables x_tables autofs4 btrfs xor raid6_pq hid_generic usbhid hid amdkfd amd_iommu_v2 mxm_wmi amdgpu i2c_algo_bit ttm drm_kms_helper syscopyarea sysfillrect ahci sysimgblt fb_sys_fops libahci e1000e83 ec 08 23 87 88 00 00 00 48 8b 97 90 00 00 00 48 8d 04 c2 18 48 85 db 74 6f 8b 0b 85 c9 74 69 8d 51 01 89 c8 f0 0f 
[   86.346049] RIP  [<ffffffffc03882ba>] amdgpu_fence_wait_empty+0x2a/0xd0 [amdgpu]
[   86.346071]  RSP <ffffa46a017f3af0>
[   86.346073] CR2: 0000000000000000

I only see the backtrace when amdgpu is unloaded before my ethernet module, which seems to be rare.
Would be nice to force the ethernet driver to be unloaded last.
Comment 7 Christian König 2016-11-14 11:28:43 UTC
Try loading the amdgpu.ko module in gdb like this (obviously, please use the correct one for your kernel version):

gdb /lib/modules/4.7.0+/kernel/drivers/gpu/drm/amd/amdgpu/amdgpu.ko

And then run

l *(amdgpu_fence_wait_empty+0x2a)

This should give you a line number when debug symbols are available.
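
The argument to l *(...) comes straight from the IP/RIP line of the oops, which has the form symbol+offset/size [module]. As a minimal sketch (the struct and function names here are hypothetical and not part of any kernel tool), the three fields can be split out like this:

```c
#include <stdio.h>

/* Hypothetical container for the pieces of "symbol+0xOFF/0xSIZE". */
struct oops_location {
    char symbol[64];       /* function name, e.g. amdgpu_fence_wait_empty */
    unsigned long offset;  /* byte offset of the faulting instruction */
    unsigned long size;    /* total size of the function in bytes */
};

/*
 * Parse text such as "amdgpu_fence_wait_empty+0x2a/0xd0 [amdgpu]".
 * Returns 0 on success, -1 if the text does not match the expected form.
 */
static int parse_oops_location(const char *text, struct oops_location *loc)
{
    int matched = sscanf(text, "%63[^+]+0x%lx/0x%lx",
                         loc->symbol, &loc->offset, &loc->size);
    return matched == 3 ? 0 : -1;
}
```

For the RIP line in comment 6, this yields the symbol amdgpu_fence_wait_empty, offset 0x2a and function size 0xd0, so the fault lies 0x2a bytes into a 0xd0-byte function. The symbol and offset are exactly what the gdb command above expects.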
Comment 8 Ernst Sjöstrand 2016-11-26 14:28:11 UTC
Now I got this interesting blob:

[ 1457.810773] systemd-shutdown[1]: All swaps deactivated.
[ 1457.810777] systemd-shutdown[1]: Detaching loop devices.
[ 1457.814471] kvm: exiting hardware virtualization
[ 1458.021577] sd 4:0:0:0: [sdd] Synchronizing SCSI cache
[ 1458.022423] sd 4:0:0:0: [sdd] Stopping disk
[ 1458.058826] sd 3:0:0:0: [sdc] Synchronizing SCSI cache
[ 1458.059003] sd 3:0:0:0: [sdc] Stopping disk
[ 1458.059179] sd 2:0:0:0: [sdb] Synchronizing SCSI cache
[ 1458.059390] sd 2:0:0:0: [sdb] Stopping disk
[ 1458.248733] sd 0:0:0:0: [sda] Synchronizing SCSI cache
[ 1458.248840] sd 0:0:0:0: [sda] Stopping disk
[ 1458.305064] [drm] amdgpu: finishing device.
[ 1459.409471] Console: switching to colour dummy device 80x25
[ 1459.417354] BUG: unable to handle kernel NULL pointer dereference at 0000000000000010
[ 1459.417367] IP: [<ffffffffc01f92aa>] amdgpu_fence_wait_empty+0x2a/0xd0 [amdgpu]
[ 1459.417398] PGD 0 [ 1459.417401] 
[ 1459.417405] Oops: 0000 [#1] SMP
[ 1459.417409] Modules linked in: netconsole configfs binfmt_misc eeepc_wmi asus_wmi mxm_wmi video sparse_keymap intel_rapl x86_pkg_temp_thermal kvm_intel kvm snd_hda_codec_realtek snd_hda_codec_generic snd_hda_codec_hdmi snd_hda_intel snd_hda_codec snd_hda_core irqbypass snd_pcm input_leds crct10dif_pclmul crc32_pclmul ghash_clmulni_intel snd_hwdep snd_seq_midi aesni_intel snd_seq_midi_event aes_x86_64 snd_rawmidi glue_helper lrw ablk_helper cryptd snd_seq intel_cstate snd_timer intel_rapl_perf snd_seq_device serio_raw snd mei_me mei soundcore lpc_ich mac_hid shpchp wmi sbs sbshc max6650 coretemp parport_pc ppdev lp parport ip_tables x_tables autofs4 btrfs xor raid6_pq hid_generic usbhid hid amdkfd amd_iommu_v2 amdgpu i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm psmouse e1000e ahci ptp libahci pps_core fjes
[ 1459.417534] CPU: 1 PID: 2847 Comm: plymouthd Not tainted 4.9.0-rc4+ #78
[ 1459.417537] Hardware name: System manufacturer System Product Name/P8P67 PRO REV 3.1, BIOS 1704 06/08/2011
[ 1459.417540] task: ffff92d031034380 task.stack: ffffb33482184000
[ 1459.417543] RIP: 0010:[<ffffffffc01f92aa>]  [<ffffffffc01f92aa>] amdgpu_fence_wait_empty+0x2a/0xd0 [amdgpu]
[ 1459.417565] RSP: 0018:ffffb33482187af0  EFLAGS: 00010202
[ 1459.417568] RAX: 0000000000000010 RBX: ffff92d02f28a938 RCX: 0000000000000000
[ 1459.417571] RDX: 0000000000000000 RSI: 0000000000004532 RDI: ffff92d02f28c570
[ 1459.417573] RBP: ffffb33482187b00 R08: 0000000000000004 R09: 0000000000030003
[ 1459.417576] R10: 0000000000030303 R11: 0000000000004000 R12: ffff92d02f288000
[ 1459.417578] R13: ffff92d02f28a9b0 R14: ffff92d02fee4800 R15: 0000000000000000
[ 1459.417582] FS:  00007f86ba048b80(0000) GS:ffff92d03f480000(0000) knlGS:0000000000000000
[ 1459.417585] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1459.417587] CR2: 0000000000000010 CR3: 00000002301b6000 CR4: 00000000000406e0
[ 1459.417590] Stack:
[ 1459.417592]  ffff92d02fee4800 ffff92d02f28a938 ffffb33482187b40 ffffffffc02096d2
[ 1459.417600]  ffffb33482187b40 0000000028d54302 ffff92d02f514000 ffff92d02f288000
[ 1459.417608]  0000000000000000 0000000000000003 ffffb33482187b68 ffffffffc0227945
[ 1459.417615] Call Trace:
[ 1459.417640]  [<ffffffffc02096d2>] amdgpu_pm_compute_clocks+0x92/0x560 [amdgpu]
[ 1459.417666]  [<ffffffffc0227945>] dce_v10_0_crtc_dpms+0xd5/0x110 [amdgpu]
[ 1459.417676]  [<ffffffffc01ae4e2>] drm_helper_connector_dpms+0x82/0x100 [drm_kms_helper]
[ 1459.417700]  [<ffffffffc0227870>] ? dce_v10_0_bandwidth_update+0x260/0x260 [amdgpu]
[ 1459.417709]  [<ffffffffc01af91f>] drm_crtc_helper_set_config+0xabf/0xaf0 [drm_kms_helper]
[ 1459.417731]  [<ffffffffc01ff148>] amdgpu_crtc_set_config+0x48/0x120 [amdgpu]
[ 1459.417749]  [<ffffffffc0118cd5>] drm_mode_set_config_internal+0x65/0x110 [drm]
[ 1459.417765]  [<ffffffffc011a4bd>] drm_mode_setcrtc+0x3fd/0x4f0 [drm]
[ 1459.417779]  [<ffffffffc0110f8b>] drm_ioctl+0x21b/0x4c0 [drm]
[ 1459.417794]  [<ffffffffc011a0c0>] ? drm_mode_getcrtc+0x140/0x140 [drm]
[ 1459.417812]  [<ffffffffc01e804f>] amdgpu_drm_ioctl+0x4f/0x90 [amdgpu]
[ 1459.417819]  [<ffffffffb7440c03>] do_vfs_ioctl+0xa3/0x600
[ 1459.417823]  [<ffffffffb74411d9>] SyS_ioctl+0x79/0x90
[ 1459.417829]  [<ffffffffb7a60b3b>] entry_SYSCALL_64_fastpath+0x1e/0xad
[ 1459.417832] Code: 90 66 66 66 66 90 8b 47 20 85 c0 0f 84 9a 00 00 00 55 48 89 e5 53 48 83 ec 08 23 87 88 00 00 00 48 8b 97 90 00 00 00 48 8d 04 c2 <48> 8b 18 48 85 db 74 6f 8b 0b 85 c9 74 69 8d 51 01 89 c8 f0 0f 
[ 1459.417923] RIP  [<ffffffffc01f92aa>] amdgpu_fence_wait_empty+0x2a/0xd0 [amdgpu]
[ 1459.417945]  RSP <ffffb33482187af0>
[ 1459.417947] CR2: 0000000000000010
[ 1459.417951] ---[ end trace 71c209200bdc61d0 ]---
Comment 9 Ernst Sjöstrand 2016-11-26 14:54:52 UTC
(gdb) l *(amdgpu_fence_wait_empty+0x29)
0x112d9 is in amdgpu_fence_wait_empty (drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c:270).
270		ptr = &ring->fence_drv.fences[seq & ring->fence_drv.num_fences_mask];
(gdb) l *(amdgpu_fence_wait_empty+0x2a)
0x112da is in amdgpu_fence_wait_empty (./include/linux/compiler.h:243).
243		__READ_ONCE_SIZE;
(gdb) l *(amdgpu_fence_wait_empty+0x2b)
0x112db is in amdgpu_fence_wait_empty (./include/linux/compiler.h:243).
243		__READ_ONCE_SIZE;
(gdb) l *(amdgpu_fence_wait_empty+0x2c)
0x112dc is in amdgpu_fence_wait_empty (./include/linux/compiler.h:243).
243		__READ_ONCE_SIZE;
(gdb) l *(amdgpu_fence_wait_empty+0x2d)
0x112dd is in amdgpu_fence_wait_empty (drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c:273).
273		if (!fence || !dma_fence_get_rcu(fence)) {

It's not line 270 and not line 273 so I guess it's line 271 or 272:
	rcu_read_lock();
	fence = rcu_dereference(*ptr);
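
A possible reading of the oops in comment 8, offered as a sketch under assumptions rather than a confirmed diagnosis: in the Code: line, the bytes just before the faulting instruction, 48 8d 04 c2, decode to lea (%rdx,%rax,8),%rax, and the faulting bytes <48> 8b 18 decode to mov (%rax),%rbx. That fits line 270 computing ptr and the rcu_dereference on line 272 loading through it, which would also explain why gdb resolves +0x2a to __READ_ONCE_SIZE. With RDX = 0 and RAX = CR2 = 0x10 in the register dump, the fences array base would have been NULL and the masked index 2. The helper names and the mask value of 3 used below are hypothetical and only model the arithmetic.

```c
#include <assert.h>
#include <stdint.h>

/* Size of one slot in an array of pointers on x86-64; matches the scale
 * factor 8 in lea (%rdx,%rax,8),%rax. */
#define FENCE_SLOT_SIZE 8u

/*
 * Hypothetical model (not driver code) of line 270:
 *     ptr = &ring->fence_drv.fences[seq & ring->fence_drv.num_fences_mask];
 * Returns the address that the following load would read from.
 */
static uint64_t fence_slot_address(uint64_t fences_base, uint32_t seq,
                                   uint32_t num_fences_mask)
{
    /* A mask of this kind is expected to be (power of two) - 1. */
    assert((num_fences_mask & (num_fences_mask + 1u)) == 0);
    return fences_base + (uint64_t)(seq & num_fences_mask) * FENCE_SLOT_SIZE;
}

/* Inverse direction: recover the slot index from a fault address (CR2),
 * assuming the array base was NULL. */
static uint64_t fence_index_from_fault(uint64_t cr2)
{
    return cr2 / FENCE_SLOT_SIZE;
}
```

Under these assumptions, fence_slot_address(0, 2, 3) evaluates to 0x10, the CR2 value in comment 8, and fence_slot_address(0, 0, 3) evaluates to 0, the CR2 value in comment 6. If that reading holds, the fences array was no longer valid when the plymouthd ioctl reached amdgpu_fence_wait_empty after "amdgpu: finishing device".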
Comment 10 Alex Deucher 2016-12-06 15:46:28 UTC
Created attachment 128356
possible fix

Does this patch help?
Comment 11 Ernst Sjöstrand 2016-12-06 19:43:36 UTC
That seems to work great! I can try a few more reboots if you want. It won't make it into 4.9 in time anyway, right?
Comment 12 Alex Deucher 2016-12-07 20:28:03 UTC

*** This bug has been marked as a duplicate of bug 97980 ***

