Bug 104299

Summary: Crash on amdgpu_sync_get_fence
Product: DRI Reporter: higuita
Component: DRM/AMDgpu Assignee: Default DRI bug account <dri-devel>
Status: RESOLVED MOVED QA Contact:
Severity: normal    
Priority: medium CC: andrey.grodzovsky, ckoenig.leichtzumerken
Version: XOrg git   
Hardware: Other   
OS: All   
Whiteboard:
i915 platform: i915 features:
Attachments:
Description                   Flags
dmesg without the crash       none
syslog capture for the oops   none
dmesg oops with kasan         none
dmesg oops with kasan 2       none

Description higuita 2017-12-17 03:20:21 UTC
During the past week I got 2 amdgpu crashes, both with this stack:

Dec 17 02:54:42 Couracado kernel: [69955.112339] Oops: 0000 [#1] SMP
Dec 17 02:54:42 Couracado kernel: [69955.138598] Modules linked in: uinput snd_usb_audio snd_usbmidi_lib snd_rawmidi f71882fg ipt_ECN snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device snd_pcm_oss snd_mixer_oss nf_conntrack_ipv6 nf_defrag_ipv6 ip6t_REJECT nf_reject_ipv6 nf_log_ipv6 ip6table_mangle ip6table_filter ip6_tables xt_DSCP nf_nat_irc nf_nat nf_conntrack_irc nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack nf_log_ipv4 nf_log_common xt_LOG xt_limit ipt_REJECT nf_reject_ipv4 xt_tcpudp iptable_mangle iptable_filter ip_tables x_tables bridge stp llc ipv6 nls_iso8859_1 nls_cp437 vfat fat reiserfs sch_fq_codel pcspkr fuse joydev hid_generic snd_hda_codec_hdmi usbhid hid eeepc_wmi tuner_simple tuner_types tea5767 tuner tda7432 snd_hda_codec_realtek tvaudio snd_hda_codec_generic msp3400 snd_hda_intel snd_hda_codec
Dec 17 02:54:42 Couracado kernel: [69955.735663]  asus_wmi snd_hwdep sparse_keymap bttv tea575x snd_hda_core i2c_dev rfkill wmi_bmof tveeprom crct10dif_pclmul snd_pcm videobuf_dma_sg videobuf_core amdkfd crc32_pclmul rc_core evdev efi_pstore crc32c_intel r8169 v4l2_common ghash_clmulni_intel amd_iommu_v2 serio_raw efivars fam15h_power k10temp snd_timer mii ohci_pci videodev i2c_piix4 snd amdgpu ehci_pci soundcore ohci_hcd ehci_hcd mfd_core parport_pc hwmon xhci_pci ttm parport wmi xhci_hcd video shpchp button acpi_cpufreq loop
Dec 17 02:54:42 Couracado kernel: [69956.099719] CPU: 1 PID: 814 Comm: gfx Not tainted 4.14.6-slack #6
Dec 17 02:54:42 Couracado kernel: [69956.150725] Hardware name: System manufacturer System Product Name/A88X-PLUS, BIOS 3003 03/10/2016
Dec 17 02:54:42 Couracado kernel: [69956.225762] task: ffff884c3d508100 task.stack: ffffb665439b0000
Dec 17 02:54:42 Couracado kernel: [69956.275368] RIP: 0010:amdgpu_sync_get_fence+0x91/0xe0 [amdgpu]
Dec 17 02:54:42 Couracado kernel: [69956.324197] RSP: 0018:ffffb665439b3e20 EFLAGS: 00010246
Dec 17 02:54:42 Couracado kernel: [69956.367931] RAX: 00000000002ae450 RBX: ffff884ab449db60 RCX: 0000000000000000
Dec 17 02:54:42 Couracado kernel: [69956.427677] RDX: 0000000000000064 RSI: ffff884b534e8540 RDI: ffff884c46000e00
Dec 17 02:54:42 Couracado kernel: [69956.487426] RBP: ffffb665439b3e40 R08: 0000000000000008 R09: 0000000000000010
Dec 17 02:54:42 Couracado kernel: [69956.547172] R10: 0000000000000255 R11: 000000000000019f R12: 0000000000000000
Dec 17 02:54:42 Couracado kernel: [69956.606922] R13: ffff884767dbc900 R14: ffff884767dbc968 R15: ffff8848d44b8bd8
Dec 17 02:54:42 Couracado kernel: [69956.666669] FS:  0000000000000000(0000) GS:ffff884c5ec80000(0000) knlGS:0000000000000000
Dec 17 02:54:42 Couracado kernel: [69956.734426] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Dec 17 02:54:42 Couracado kernel: [69956.782525] CR2: 00000000002ae468 CR3: 000000011da6a000 CR4: 00000000000406e0
Dec 17 02:54:42 Couracado kernel: [69956.842274] Call Trace:
Dec 17 02:54:42 Couracado kernel: [69956.862764]  amdgpu_job_dependency+0x93/0x100 [amdgpu]
Dec 17 02:54:42 Couracado kernel: [69956.905816]  amd_sched_main+0xb5/0x450 [amdgpu]
Dec 17 02:54:42 Couracado kernel: [69956.943730]  ? wait_woken+0x80/0x80
Dec 17 02:54:42 Couracado kernel: [69956.972902]  kthread+0x125/0x140
Dec 17 02:54:42 Couracado kernel: [69956.999935]  ? amd_sched_process_job+0xc0/0xc0 [amdgpu]
Dec 17 02:54:42 Couracado kernel: [69957.043674]  ? kthread_create_on_node+0x70/0x70
Dec 17 02:54:42 Couracado kernel: [69957.081583]  ret_from_fork+0x22/0x30
Dec 17 02:54:42 Couracado kernel: [69957.111479] Code: 89 44 24 08 48 c7 06 00 00 00 00 48 c7 46 08 00 00 00 00 48 8b 3d d8 47 15 00 e8 ab 94 d3 da 48 8b 43 48 a8 01 75 9b 48 8b 43 08 <48> 8b 40 18 48 85 c0 74 09 48 89 df ff d0 84 c0 75 0c 48 89 d8 
Dec 17 02:54:42 Couracado kernel: [69957.330761] CR2: 00000000002ae468
Dec 17 02:54:42 Couracado kernel: [69957.358479] ---[ end trace da8374d3133f4c24 ]---
Dec 17 02:54:42 Couracado kernel: [69957.397138] sched: RT throttling activated

It is rare and therefore hard to reproduce, but since amdgpu has been stable for me for the last 6 months, I would say it is caused by something in the latest kernel or mesa code.
I'm using kernel 4.14.6, drm 2.4.88, mesa 17.3.0, llvm 5.0.0

thanks
Comment 1 Christian König 2017-12-18 09:37:39 UTC
Please add the full dmesg output as attachment.
Comment 2 Andrey Grodzovsky 2017-12-18 15:21:41 UTC
Hi, have you noticed any specific scenario under which those crashes happened?

Thanks,
Andrey
Comment 3 higuita 2017-12-19 03:13:18 UTC
Well, both times it happened while playing rimworld, but I didn't notice any particular action that triggers it.

My hardware is an A10-7850k and an RX480, slackware64-current, dual head 1920x1080, with steam+rimworld running on one head and tvtime running on the other.

I usually suspend my machine; I do not know if this also helps trigger it.
Comment 4 higuita 2017-12-19 03:19:52 UTC
Created attachment 136266 [details]
dmesg without the crash

This is my current dmesg
Comment 5 higuita 2017-12-19 03:31:28 UTC
Created attachment 136267 [details]
syslog capture for the oops

I didn't save the dmesg directly, but I could salvage this from the syslog
Comment 6 Andrey Grodzovsky 2017-12-24 04:41:13 UTC
Can you try to reproduce it with KASAN enabled?
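
For reference, KASAN (the Kernel Address Sanitizer) is a build-time option, so it needs a kernel rebuilt with CONFIG_KASAN=y. A minimal sketch for checking a config file before rebuilding; the helper name kasan_enabled is hypothetical, and the config path is whatever your own build uses:

```shell
# kasan_enabled is a hypothetical helper name, not part of any kernel tooling.
# $1: path to a kernel config file (for example the .config used for the build)
# Returns success (0) only if the file contains the line CONFIG_KASAN=y.
kasan_enabled() {
    grep -q '^CONFIG_KASAN=y' "$1"
}
```

Usage would be something like `kasan_enabled .config && echo "KASAN is on"` from the top of the kernel source tree.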

Thanks,
Andrey
Comment 7 higuita 2017-12-26 22:39:20 UTC
Created attachment 136398 [details]
dmesg oops with kasan

Sure, here is the dmesg after a crash with kasan, this time while running warthunder
Comment 8 Andrey Grodzovsky 2017-12-27 04:29:07 UTC
(In reply to higuita from comment #7)
> Created attachment 136398 [details]
> dmesg oops with kasan
> 
> Sure, there is the dmesg after a crash with kasan, this time over warthunder

Thanks. This looks like an attempt to access a fence that was already released, but I can't pinpoint the faulting line in the code for either amdgpu_sync_get_fence or amdgpu_sync_resv. I am using addr2line for this, but the offset into the function shown in the backtrace doesn't make sense, maybe because our builds differ. Can you try it and see if you get the exact offending lines in both functions?

Thanks,
Andrey
Comment 9 higuita 2017-12-27 05:33:53 UTC
Created attachment 136400 [details]
dmesg oops with kasan 2

Another crash, this time in RUST, just in case it helps in any way.

I know how to build stuff, but I have no idea how to debug the kernel :)

Can you please give me some pointers on how to find and give you the needed info?
Comment 10 Andrey Grodzovsky 2017-12-27 13:14:12 UTC
(In reply to higuita from comment #9)
> Created attachment 136400 [details]
> dmesg oops with kasan 2
> 
> Another crash, this time in RUST, just to see if it helps in any way
> 
> i know how to build stuff, but i have no idea how to debug the kernel :)
> 
> can you please give me some pointers how to find and give you the needed
> info?

NP, check the answer here: https://stackoverflow.com/questions/13468286/how-to-read-understand-analyze-and-debug-a-linux-kernel-panic

To obtain the function address within your amdgpu.ko, just run:

nm -C drivers/gpu/drm/amd/amdgpu/amdgpu.ko | grep amdgpu_sync_get_fence
nm -C drivers/gpu/drm/amd/amdgpu/amdgpu.ko | grep amdgpu_sync_resv

The offset into the function can be read from the dmesg dump: in
amdgpu_sync_get_fence+0x91/0xe0, 0x91 is the offset and 0xe0 is the total size of the function.
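
The steps above can be sketched as a small shell helper. This is only a sketch under assumptions: the function names (fault_addr, oops_symbol, oops_offset, resolve_oops) are made up for illustration, amdgpu.ko is assumed to be built with debug info and to be the same build that produced the oops, and the symbol is assumed to live in the module's .text section so that the nm value plus the offset is a usable address for addr2line.

```shell
# All function names below are hypothetical; only nm, awk and addr2line are real tools.

# Add the hex offset from the oops line to the symbol's base address from nm.
# $1: base address as printed by nm (hex digits, no 0x prefix)
# $2: offset from the oops line, e.g. 0x91
fault_addr() {
    printf '0x%x\n' "$(( 0x$1 + $2 ))"
}

# Split "symbol+0xOFF/0xSIZE" (as printed in the oops) into its parts.
oops_symbol() { printf '%s\n' "${1%%+*}"; }
oops_offset() { off=${1#*+}; printf '%s\n' "${off%%/*}"; }

# Look up the symbol in the module and ask addr2line for the source line.
# $1: path to amdgpu.ko
# $2: "symbol+0xOFF/0xSIZE"
resolve_oops() {
    sym=$(oops_symbol "$2")
    off=$(oops_offset "$2")
    # nm prints "address type name"; take the first exact name match.
    base=$(nm -C "$1" | awk -v s="$sym" '$3 == s { print $1; exit }')
    [ -n "$base" ] || { echo "symbol $sym not found in $1" >&2; return 1; }
    addr2line -f -i -e "$1" "$(fault_addr "$base" "$off")"
}
```

It could then be called as `resolve_oops drivers/gpu/drm/amd/amdgpu/amdgpu.ko 'amdgpu_sync_get_fence+0x91/0xe0'`. The kernel tree also ships scripts/faddr2line, which takes the same func+offset/size form directly and may be the simpler option.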

Thanks,
Andrey

Comment 11 Martin Peres 2019-11-19 08:27:37 UTC
-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further in the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/drm/amd/issues/276.
