During the past week I got 2 amdgpu crashes, both with this stack:

Dec 17 02:54:42 Couracado kernel: [69955.112339] Oops: 0000 [#1] SMP
Dec 17 02:54:42 Couracado kernel: [69955.138598] Modules linked in: uinput snd_usb_audio snd_usbmidi_lib snd_rawmidi f71882fg ipt_ECN snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device snd_pcm_oss snd_mixer_oss nf_conntrack_ipv6 nf_defrag_ipv6 ip6t_REJECT nf_reject_ipv6 nf_log_ipv6 ip6table_mangle ip6table_filter ip6_tables xt_DSCP nf_nat_irc nf_nat nf_conntrack_irc nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack nf_log_ipv4 nf_log_common xt_LOG xt_limit ipt_REJECT nf_reject_ipv4 xt_tcpudp iptable_mangle iptable_filter ip_tables x_tables bridge stp llc ipv6 nls_iso8859_1 nls_cp437 vfat fat reiserfs sch_fq_codel pcspkr fuse joydev hid_generic snd_hda_codec_hdmi usbhid hid eeepc_wmi tuner_simple tuner_types tea5767 tuner tda7432 snd_hda_codec_realtek tvaudio snd_hda_codec_generic msp3400 snd_hda_intel snd_hda_codec
Dec 17 02:54:42 Couracado kernel: [69955.735663] asus_wmi snd_hwdep sparse_keymap bttv tea575x snd_hda_core i2c_dev rfkill wmi_bmof tveeprom crct10dif_pclmul snd_pcm videobuf_dma_sg videobuf_core amdkfd crc32_pclmul rc_core evdev efi_pstore crc32c_intel r8169 v4l2_common ghash_clmulni_intel amd_iommu_v2 serio_raw efivars fam15h_power k10temp snd_timer mii ohci_pci videodev i2c_piix4 snd amdgpu ehci_pci soundcore ohci_hcd ehci_hcd mfd_core parport_pc hwmon xhci_pci ttm parport wmi xhci_hcd video shpchp button acpi_cpufreq loop
Dec 17 02:54:42 Couracado kernel: [69956.099719] CPU: 1 PID: 814 Comm: gfx Not tainted 4.14.6-slack #6
Dec 17 02:54:42 Couracado kernel: [69956.150725] Hardware name: System manufacturer System Product Name/A88X-PLUS, BIOS 3003 03/10/2016
Dec 17 02:54:42 Couracado kernel: [69956.225762] task: ffff884c3d508100 task.stack: ffffb665439b0000
Dec 17 02:54:42 Couracado kernel: [69956.275368] RIP: 0010:amdgpu_sync_get_fence+0x91/0xe0 [amdgpu]
Dec 17 02:54:42 Couracado kernel: [69956.324197] RSP: 0018:ffffb665439b3e20 EFLAGS: 00010246
Dec 17 02:54:42 Couracado kernel: [69956.367931] RAX: 00000000002ae450 RBX: ffff884ab449db60 RCX: 0000000000000000
Dec 17 02:54:42 Couracado kernel: [69956.427677] RDX: 0000000000000064 RSI: ffff884b534e8540 RDI: ffff884c46000e00
Dec 17 02:54:42 Couracado kernel: [69956.487426] RBP: ffffb665439b3e40 R08: 0000000000000008 R09: 0000000000000010
Dec 17 02:54:42 Couracado kernel: [69956.547172] R10: 0000000000000255 R11: 000000000000019f R12: 0000000000000000
Dec 17 02:54:42 Couracado kernel: [69956.606922] R13: ffff884767dbc900 R14: ffff884767dbc968 R15: ffff8848d44b8bd8
Dec 17 02:54:42 Couracado kernel: [69956.666669] FS: 0000000000000000(0000) GS:ffff884c5ec80000(0000) knlGS:0000000000000000
Dec 17 02:54:42 Couracado kernel: [69956.734426] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Dec 17 02:54:42 Couracado kernel: [69956.782525] CR2: 00000000002ae468 CR3: 000000011da6a000 CR4: 00000000000406e0
Dec 17 02:54:42 Couracado kernel: [69956.842274] Call Trace:
Dec 17 02:54:42 Couracado kernel: [69956.862764] amdgpu_job_dependency+0x93/0x100 [amdgpu]
Dec 17 02:54:42 Couracado kernel: [69956.905816] amd_sched_main+0xb5/0x450 [amdgpu]
Dec 17 02:54:42 Couracado kernel: [69956.943730] ? wait_woken+0x80/0x80
Dec 17 02:54:42 Couracado kernel: [69956.972902] kthread+0x125/0x140
Dec 17 02:54:42 Couracado kernel: [69956.999935] ? amd_sched_process_job+0xc0/0xc0 [amdgpu]
Dec 17 02:54:42 Couracado kernel: [69957.043674] ? kthread_create_on_node+0x70/0x70
Dec 17 02:54:42 Couracado kernel: [69957.081583] ret_from_fork+0x22/0x30
Dec 17 02:54:42 Couracado kernel: [69957.111479] Code: 89 44 24 08 48 c7 06 00 00 00 00 48 c7 46 08 00 00 00 00 48 8b 3d d8 47 15 00 e8 ab 94 d3 da 48 8b 43 48 a8 01 75 9b 48 8b 43 08 <48> 8b 40 18 48 85 c0 74 09 48 89 df ff d0 84 c0 75 0c 48 89 d8
Dec 17 02:54:42 Couracado kernel: [69957.330761] CR2: 00000000002ae468
Dec 17 02:54:42 Couracado kernel: [69957.358479] ---[ end trace da8374d3133f4c24 ]---
Dec 17 02:54:42 Couracado kernel: [69957.397138] sched: RT throttling activated

It is rare, so hard to reproduce, but as amdgpu has been stable for me over the last 6 months, I would say it's something in the latest kernel or mesa code.

I'm using kernel 4.14.6, drm 2.4.88, mesa 17.3.0, llvm 5.0.0.

Thanks
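A quick sanity check on the register dump in the oops, written as a small Python sketch. It only uses values printed in the log (RAX, CR2, and the instruction bytes marked `<48> 8b 40 18` in the Code: line); the function name is made up for illustration. On x86-64 those four bytes decode to a load from `[rax + 0x18]`, so the faulting address in CR2 should equal RAX plus 0x18. This shows which register held the bad pointer, not why it was bad.

```python
# Values copied from the oops in this report.
RAX = 0x00000000002AE450
CR2 = 0x00000000002AE468

# The bytes "<48> 8b 40 18" in the Code: line decode on x86-64 to
# "mov rax, [rax + 0x18]": a 64-bit load with an 8-bit displacement of 0x18.
DISPLACEMENT = 0x18


def faulting_address(base: int, displacement: int) -> int:
    """Return the address touched by a base + displacement memory access."""
    return base + displacement


if __name__ == "__main__":
    addr = faulting_address(RAX, DISPLACEMENT)
    # CR2 holds the address that caused the page fault; compare the two.
    print(hex(addr), addr == CR2)
```

The two values match, which suggests the pointer loaded into RAX just before the fault was already garbage (0x2ae450 is not a kernel address).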
Please add the full dmesg output as an attachment.
Hi, have you noticed any specific scenario in which those crashes happened? Thanks, Andrey
Well, both times it happened while playing RimWorld, but I didn't notice any special action that triggers it. My hardware is an A10-7850K and an RX480, running slackware64-current, dual head 1920x1080, with Steam + RimWorld running on one head and tvtime running on the other. I usually suspend my machine; I do not know if this also helps trigger it.
Created attachment 136266 [details] dmesg without the crash This is my current dmesg
Created attachment 136267 [details] syslog capture for the oops I didn't save the dmesg directly, but I could salvage this from the syslog
Can you try to reproduce it with KASAN enabled? Thanks, Andrey
Created attachment 136398 [details] dmesg oops with kasan Sure, there is the dmesg after a crash with kasan, this time over warthunder
(In reply to higuita from comment #7)
> Created attachment 136398 [details]
> dmesg oops with kasan
>
> Sure, there is the dmesg after a crash with kasan, this time over warthunder

Thanks, this looks like an attempt to access a fence that had already been released, but I can't pinpoint the faulting line in the code for either amdgpu_sync_get_fence or amdgpu_sync_resv. I am using addr2line for this, but the offset into the function shown in the backtrace doesn't make sense, maybe because our builds differ. Can you try it and see if you get the exact offending lines in both functions?

Thanks, Andrey
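For anyone following along, the "offset into the function" comes from the `symbol+0xOFFSET/0xSIZE [module]` notation used in the RIP and Call Trace lines. Below is a minimal Python sketch that splits such a frame into its parts; the `Frame` type and `parse_frame` helper are hypothetical names for illustration, not part of any kernel tooling.

```python
import re
from typing import NamedTuple, Optional


class Frame(NamedTuple):
    symbol: str            # function name, e.g. amdgpu_sync_get_fence
    offset: int            # bytes from the start of the function
    size: int              # total size of the function in bytes
    module: Optional[str]  # module name, or None for built-in kernel code


# Matches "symbol+0xOFF/0xSIZE" with an optional trailing " [module]".
_FRAME_RE = re.compile(
    r"(?P<symbol>[A-Za-z_][\w.]*)"
    r"\+0x(?P<offset>[0-9a-fA-F]+)"
    r"/0x(?P<size>[0-9a-fA-F]+)"
    r"(?:\s+\[(?P<module>[^\]]+)\])?"
)


def parse_frame(text: str) -> Optional[Frame]:
    """Extract the first symbol+offset/size frame found in a log line."""
    m = _FRAME_RE.search(text)
    if m is None:
        return None
    return Frame(
        symbol=m.group("symbol"),
        offset=int(m.group("offset"), 16),
        size=int(m.group("size"), 16),
        module=m.group("module"),
    )


if __name__ == "__main__":
    print(parse_frame("RIP: 0010:amdgpu_sync_get_fence+0x91/0xe0 [amdgpu]"))
```

Here the crash is 0x91 bytes into a function that is 0xe0 bytes long. If a different build of the module has a different size for the same function, the offset will not line up, which may be why the addresses did not resolve sensibly across builds.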
Created attachment 136400 [details] dmesg oops with kasan 2 Another crash, this time in RUST, just to see if it helps in any way i know how to build stuff, but i have no idea how to debug the kernel :) can you please give me some pointers how to find and give you the needed info?
(In reply to higuita from comment #9)
> Created attachment 136400 [details]
> dmesg oops with kasan 2
>
> Another crash, this time in RUST, just to see if it helps in any way
>
> i know how to build stuff, but i have no idea how to debug the kernel :)
>
> can you please give me some pointers how to find and give you the needed
> info?

NP, check the answer here: https://stackoverflow.com/questions/13468286/how-to-read-understand-analyze-and-debug-a-linux-kernel-panic

To obtain the function address within your amdgpu.ko, just run:

nm -C drivers/gpu/drm/amd/amdgpu/amdgpu.ko | grep amdgpu_sync_get_fence
nm -C drivers/gpu/drm/amd/amdgpu/amdgpu.ko | grep amdgpu_sync_resv

You can read the offset into the function from the dmesg dump: in amdgpu_sync_get_fence+0x91/0xe0, the offset is 0x91.

Thanks, Andrey
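The two steps above (look up the symbol with nm, then add the offset) can be sketched in Python as follows. This is an illustration under assumptions: the nm output line and its address are invented, the helper names are hypothetical, and the resulting addr2line call only gives useful file and line information if the module was built with debug info.

```python
from typing import List, Optional


def symbol_address(nm_output: str, symbol: str) -> Optional[int]:
    """Find the address of an exact symbol name in `nm` output.

    Each nm line is expected to look like "<hex address> <type> <name>".
    """
    for line in nm_output.splitlines():
        fields = line.split()
        if len(fields) == 3 and fields[2] == symbol:
            return int(fields[0], 16)
    return None


def addr2line_command(module_path: str, base: int, offset: int) -> List[str]:
    """Build an addr2line invocation for base + offset inside the module."""
    # -e selects the object file, -f prints function names,
    # -i also shows inlined callers.
    return ["addr2line", "-e", module_path, "-f", "-i", hex(base + offset)]


if __name__ == "__main__":
    # Hypothetical nm output; the real address depends on your build.
    sample = "0000000000012340 t amdgpu_sync_get_fence\n"
    base = symbol_address(sample, "amdgpu_sync_get_fence")
    if base is not None:
        cmd = addr2line_command(
            "drivers/gpu/drm/amd/amdgpu/amdgpu.ko", base, 0x91
        )
        print(" ".join(cmd))
```

Matching the symbol name exactly, rather than with a substring as grep does, avoids picking up similarly named symbols by accident.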
-- GitLab Migration Automatic Message -- This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity. You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/drm/amd/issues/276.