Bug 93341

Summary: Semi-random GPU lockups on radeonsi with a RadeonHD 7770 (when playing videos, running OpenGL games, WebGL apps, or after extended periods of time)
Product: Mesa
Reporter: Jean-François Fortin Tam <nekohayo>
Component: Drivers/Gallium/radeonsi
Assignee: Default DRI bug account <dri-devel>
Status: RESOLVED MOVED
QA Contact: Default DRI bug account <dri-devel>
Severity: major
Priority: medium
CC: austinenglish, ckoenig.leichtzumerken, julien.isorce, kilgus
Version: 11.0
Hardware: x86-64 (AMD64)
OS: Linux (All)
URL: https://bugzilla.redhat.com/show_bug.cgi?id=1335360
Attachments: Excerpt /var/log/messages GPU crash
journalctl output at the time of a deadlock on F25
journalctl output at the time of a deadlock on F25 - X GDM session output only
journalctl output at the time of a deadlock on F25 - take 2
Xorg log

Description Jean-François Fortin Tam 2015-12-11 05:36:44 UTC
Fedora 23, xorg-x11-drv-ati, on a Dell Precision T3500 (latest BIOS, A17) with a RadeonHD 7770 GPU. Running the latest up-to-date stock packages from Fedora.

If I start a game like Xonotic (from the Fedora repos) or Unvanquished (latest alpha binary build downloaded from their GitHub repo), after a minute or two of just looking around as a spectator player, my monitor will suddenly turn off. Sound will continue to play for a while, then it might stop or loop. After a few seconds, the kernel will be locked up, with the CapsLock LED no longer responding.

This also happened to me once simply by watching a video fullscreen in Totem (I'm running GNOME Shell, FWIW), but this is a much rarer occurrence.


Unfortunately I don't have knowledge of debugging such things, and ABRT somehow thinks my kernel is tainted with the "I" status (meaning it's "working around a severe firmware bug"), which I suppose might be the radeon microcode, so I can't get ABRT to create a nice automated retrace/full debug thing for me. But at least it still has stuff stored on disk, if there's anything in there you'd need:

# ls -lh /var/spool/abrt/oops-2015-12-10-21:50:22-777-1/
-rw-r----- 1 root abrt    5 10 déc 21:50 abrt_version
-rw-r----- 1 root abrt    9 10 déc 21:50 analyzer
-rw-r----- 1 root abrt    6 10 déc 21:50 architecture
-rw-r----- 1 root abrt 3,7K 10 déc 21:50 backtrace
-rw-r----- 1 root abrt  124 10 déc 21:50 cmdline
-rw-r----- 1 root abrt   16 10 déc 21:50 component
-rw-r----- 1 root abrt    1 10 déc 21:50 count
-rw-r----- 1 root abrt  71K 10 déc 21:50 dmesg
-rw-r----- 1 root abrt   40 10 déc 21:50 duphash
-rw-r----- 1 root abrt   23 10 déc 21:50 extra-cc
-rw-r----- 1 root abrt    8 10 déc 21:50 hostname
-rw-r----- 1 root abrt   21 10 déc 21:50 kernel
-rw-r----- 1 root abrt   25 10 déc 21:50 kernel_tainted_long
-rw-r----- 1 root abrt    3 10 déc 21:50 kernel_tainted_short
-rw-r----- 1 root abrt   10 10 déc 21:50 last_occurrence
-rw-r----- 1 root abrt  173 10 déc 21:50 not-reportable
-rw-r----- 1 root abrt  518 10 déc 21:50 os_info
-rw-r----- 1 root abrt   32 10 déc 21:50 os_release
-rw-r----- 1 root abrt    6 10 déc 21:50 package
-rw-r----- 1 root abrt    7 10 déc 21:50 pkg_arch
-rw-r----- 1 root abrt    2 10 déc 21:50 pkg_epoch
-rw-r----- 1 root abrt   12 10 déc 21:50 pkg_name
-rw-r----- 1 root abrt    9 10 déc 21:50 pkg_release
-rw-r----- 1 root abrt    6 10 déc 21:50 pkg_version
-rw-r----- 1 root abrt 4,4K 10 déc 21:50 proc_modules
-rw-r----- 1 root abrt   37 10 déc 21:50 reason
-rw-r----- 1 root abrt    8 10 déc 21:50 runlevel
-rw-r----- 1 root abrt  269 10 déc 21:50 suspend_stats
-rw-r----- 1 root abrt   10 10 déc 21:50 time
-rw-r----- 1 root abrt   10 10 déc 21:50 type
-rw-r----- 1 root abrt   40 10 déc 21:50 uuid
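
For readers wondering what the "I" taint flag mentioned above stands for, here is a minimal sketch that decodes the letters after "Tainted:" in an oops header. The flag table follows my reading of the kernel's tainted-kernels documentation; `decode_taint` and `TAINT_FLAGS` are hypothetical names, not part of ABRT or the kernel. Per that documentation, "I" refers to a platform firmware workaround, which is not necessarily the radeon microcode.

```python
# Hypothetical helper: decode the taint letters printed after "Tainted:" in
# a kernel oops header. Meanings are taken from the kernel's tainted-kernels
# documentation; the table is an excerpt, not a complete list.
TAINT_FLAGS = {
    "G": "no proprietary modules loaded",
    "P": "proprietary module loaded",
    "F": "module was force-loaded",
    "D": "kernel died recently (earlier OOPS or BUG)",
    "W": "kernel issued a warning",
    "I": "workaround for a platform firmware bug applied",
    "O": "out-of-tree module loaded",
    "E": "unsigned module loaded",
}

def decode_taint(line):
    """Return [(letter, meaning), ...] for the taint flags in an oops line."""
    if "Tainted:" not in line:
        return []
    flags = []
    for token in line.split("Tainted:", 1)[1].split():
        # Flag letters come first; the kernel version string ends the field.
        if not (token.isalpha() and token.isupper()):
            break
        flags.extend(token)
    return [(flag, TAINT_FLAGS.get(flag, "unknown flag")) for flag in flags]

header = ("CPU: 3 PID: 153 Comm: kworker/u64:7 "
          "Tainted: G      D   I     4.2.6-301.fc23.x86_64 #1")
for flag, meaning in decode_taint(header):
    print(flag, "=", meaning)
```

The sample line is the second oops header from this report; the "D" only appears there because the first oops had already happened.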


This is what I get in journalctl/dmesg:


-- Logs begin at lun 2015-11-30 21:48:19 EST, end at jeu 2015-12-10 23:48:33 EST. --
déc 10 21:49:00 the_PC kernel: radeon 0000:02:00.0: ring 3 stalled for more than 10115msec
déc 10 21:49:00 the_PC kernel: radeon 0000:02:00.0: GPU lockup (current fence id 0x000000000000a5fe last fence id 0x000000000000a600 on ring 3)
déc 10 21:49:01 the_PC kernel: BUG: unable to handle kernel paging request at ffffc90404239ffc
déc 10 21:49:01 the_PC kernel: IP: [<ffffffffa00f850a>] radeon_ring_backup+0xda/0x190 [radeon]
déc 10 21:49:01 the_PC kernel: PGD 6068a8067 PUD 0 
déc 10 21:49:01 the_PC kernel: Oops: 0000 [#1] SMP 
déc 10 21:49:01 the_PC kernel: Modules linked in: fuse xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack tun bridge stp llc ebtable
déc 10 21:49:01 the_PC kernel:  radeon i2c_algo_bit drm_kms_helper ttm drm serio_raw
déc 10 21:49:01 the_PC kernel: CPU: 3 PID: 153 Comm: kworker/u64:7 Tainted: G          I     4.2.6-301.fc23.x86_64 #1
déc 10 21:49:01 the_PC kernel: Hardware name: Dell Inc. Precision WorkStation T3500  /0K095G, BIOS A17 05/28/2013
déc 10 21:49:01 the_PC kernel: Workqueue: radeon-crtc radeon_flip_work_func [radeon]
déc 10 21:49:01 the_PC kernel: task: ffff88060299b880 ti: ffff8805ff5c0000 task.ti: ffff8805ff5c0000
déc 10 21:49:01 the_PC kernel: RIP: 0010:[<ffffffffa00f850a>]  [<ffffffffa00f850a>] radeon_ring_backup+0xda/0x190 [radeon]
déc 10 21:49:01 the_PC kernel: RSP: 0018:ffff8805ff5c3c98  EFLAGS: 00010206
déc 10 21:49:01 the_PC kernel: RAX: ffffc9000fe50000 RBX: 00000000ffffffff RCX: 0000000000000000
déc 10 21:49:01 the_PC kernel: RDX: 0000000000000000 RSI: ffffc90404239ffc RDI: 0000000000080500
déc 10 21:49:01 the_PC kernel: RBP: ffff8805ff5c3cd8 R08: ffff8805771f8cc0 R09: 0000000000082000
déc 10 21:49:01 the_PC kernel: R10: 8000000000000163 R11: ffffffff81a609e9 R12: ffff880036a654d8
déc 10 21:49:01 the_PC kernel: R13: ffff880036a654b0 R14: 0000000000020141 R15: ffff8805ff5c3d30
déc 10 21:49:01 the_PC kernel: FS:  0000000000000000(0000) GS:ffff880606ec0000(0000) knlGS:0000000000000000
déc 10 21:49:01 the_PC kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
déc 10 21:49:01 the_PC kernel: CR2: ffffc90404239ffc CR3: 0000000001c0b000 CR4: 00000000000006e0
déc 10 21:49:01 the_PC kernel: Stack:
déc 10 21:49:01 the_PC kernel:  ffff8805ff5c3cc8 ffffffffa00f9413 ffff880036a64000 ffff880036a64000
déc 10 21:49:01 the_PC kernel:  ffff880036a654d8 ffff8805ff5c3d30 ffff880036a654d8 0000000000000000
déc 10 21:49:01 the_PC kernel:  ffff8805ff5c3da8 ffffffffa00c6c80 ffffffff810df990 ffff880036a64738
déc 10 21:49:01 the_PC kernel: Call Trace:
déc 10 21:49:01 the_PC kernel:  [<ffffffffa00f9413>] ? radeon_irq_kms_disable_hpd+0x73/0x80 [radeon]
déc 10 21:49:01 the_PC kernel:  [<ffffffffa00c6c80>] radeon_gpu_reset+0xd0/0x330 [radeon]
déc 10 21:49:01 the_PC kernel:  [<ffffffff810df990>] ? wake_atomic_t_function+0x70/0x70
déc 10 21:49:01 the_PC kernel:  [<ffffffffa00e058f>] ? radeon_fence_wait+0x9f/0xe0 [radeon]
déc 10 21:49:01 the_PC kernel:  [<ffffffffa00ed960>] radeon_flip_work_func+0x130/0x170 [radeon]
déc 10 21:49:01 the_PC kernel:  [<ffffffff810b650e>] process_one_work+0x19e/0x3f0
déc 10 21:49:01 the_PC kernel:  [<ffffffff810b67ae>] worker_thread+0x4e/0x450
déc 10 21:49:01 the_PC kernel:  [<ffffffff810b6760>] ? process_one_work+0x3f0/0x3f0
déc 10 21:49:01 the_PC kernel:  [<ffffffff810b6760>] ? process_one_work+0x3f0/0x3f0
déc 10 21:49:01 the_PC kernel:  [<ffffffff810bc8b8>] kthread+0xd8/0xf0
déc 10 21:49:01 the_PC kernel:  [<ffffffff810bc7e0>] ? kthread_worker_fn+0x160/0x160
déc 10 21:49:01 the_PC kernel:  [<ffffffff817797df>] ret_from_fork+0x3f/0x70
déc 10 21:49:01 the_PC kernel:  [<ffffffff810bc7e0>] ? kthread_worker_fn+0x160/0x160
déc 10 21:49:01 the_PC kernel: Code: 10 e1 48 85 c0 49 89 07 74 6c 41 8d 7e ff 31 d2 48 c1 e7 02 eb 07 49 8b 07 48 83 c2 04 49 8b 74 24 08 8d 4b 01 89 db 48 8d 34 9e <8b> 36 89 34 10 41 23 4c 24 54 48 39 d7 89 cb 75 da 4c 89 ef e8 
déc 10 21:49:01 the_PC kernel: RIP  [<ffffffffa00f850a>] radeon_ring_backup+0xda/0x190 [radeon]
déc 10 21:49:01 the_PC kernel:  RSP <ffff8805ff5c3c98>
déc 10 21:49:01 the_PC kernel: CR2: ffffc90404239ffc
déc 10 21:49:01 the_PC kernel: ---[ end trace 37e2470f6b251992 ]---
déc 10 21:49:01 the_PC kernel: BUG: unable to handle kernel paging request at ffffffffffffffd8
déc 10 21:49:01 the_PC kernel: IP: [<ffffffff810bcd40>] kthread_data+0x10/0x20
déc 10 21:49:01 the_PC kernel: PGD 1c0e067 PUD 1c10067 PMD 0 
déc 10 21:49:01 the_PC kernel: Oops: 0000 [#2] SMP 
déc 10 21:49:01 the_PC kernel: Modules linked in: fuse xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack tun bridge stp llc ebtable
déc 10 21:49:01 the_PC kernel:  radeon i2c_algo_bit drm_kms_helper ttm drm serio_raw
déc 10 21:49:01 the_PC kernel: CPU: 3 PID: 153 Comm: kworker/u64:7 Tainted: G      D   I     4.2.6-301.fc23.x86_64 #1
déc 10 21:49:01 the_PC kernel: Hardware name: Dell Inc. Precision WorkStation T3500  /0K095G, BIOS A17 05/28/2013
déc 10 21:49:01 the_PC kernel: task: ffff88060299b880 ti: ffff8805ff5c0000 task.ti: ffff8805ff5c0000
déc 10 21:49:01 the_PC kernel: RIP: 0010:[<ffffffff810bcd40>]  [<ffffffff810bcd40>] kthread_data+0x10/0x20
déc 10 21:49:01 the_PC kernel: RSP: 0018:ffff8805ff5c3918  EFLAGS: 00010096
déc 10 21:49:01 the_PC kernel: RAX: 0000000000000000 RBX: 0000000000000003 RCX: 0000000000000005
déc 10 21:49:01 the_PC kernel: RDX: 0000000000000005 RSI: 0000000000000003 RDI: ffff88060299b880
déc 10 21:49:01 the_PC kernel: RBP: ffff8805ff5c3918 R08: ffff88060299b910 R09: 0000000000000000
déc 10 21:49:01 the_PC kernel: R10: 0000000000000000 R11: 0000000000000000 R12: 00000000000167c0
déc 10 21:49:01 the_PC kernel: R13: ffff88060299b880 R14: ffff880606ed67c0 R15: 0000000000000003
déc 10 21:49:01 the_PC kernel: FS:  0000000000000000(0000) GS:ffff880606ec0000(0000) knlGS:0000000000000000
déc 10 21:49:01 the_PC kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
déc 10 21:49:01 the_PC kernel: CR2: 0000000000000028 CR3: 0000000001c0b000 CR4: 00000000000006e0
déc 10 21:49:01 the_PC kernel: Stack:
déc 10 21:49:01 the_PC kernel:  ffff8805ff5c3938 ffffffff810b7385 ffff8805ff5c3938 ffff880606ed67c0
déc 10 21:49:01 the_PC kernel:  ffff8805ff5c3988 ffffffff81774fc0 ffff880500000000 ffff88060299b880
déc 10 21:49:01 the_PC kernel:  ffff8805ff5c3988 ffff8805ff5c4000 ffff8805ff5c39f0 ffff8805ff5c39f0
déc 10 21:49:01 the_PC kernel: Call Trace:
déc 10 21:49:01 the_PC kernel:  [<ffffffff810b7385>] wq_worker_sleeping+0x15/0xa0
déc 10 21:49:01 the_PC kernel:  [<ffffffff81774fc0>] __schedule+0x620/0x950
déc 10 21:49:01 the_PC kernel:  [<ffffffff81775327>] schedule+0x37/0x80
déc 10 21:49:01 the_PC kernel:  [<ffffffff810a103a>] do_exit+0x80a/0xae0
déc 10 21:49:01 the_PC kernel:  [<ffffffff810180fe>] oops_end+0x9e/0xd0
déc 10 21:49:01 the_PC kernel:  [<ffffffff81064c25>] no_context+0x135/0x380
déc 10 21:49:01 the_PC kernel:  [<ffffffff81064ef0>] __bad_area_nosemaphore+0x80/0x1f0
déc 10 21:49:01 the_PC kernel:  [<ffffffff81065073>] bad_area_nosemaphore+0x13/0x20
déc 10 21:49:01 the_PC kernel:  [<ffffffff81065357>] __do_page_fault+0xb7/0x400
déc 10 21:49:01 the_PC kernel:  [<ffffffff810656cf>] do_page_fault+0x2f/0x80
déc 10 21:49:01 the_PC kernel:  [<ffffffff8177b378>] page_fault+0x28/0x30
déc 10 21:49:01 the_PC kernel:  [<ffffffffa00f850a>] ? radeon_ring_backup+0xda/0x190 [radeon]
déc 10 21:49:01 the_PC kernel:  [<ffffffffa00f85b0>] ? radeon_ring_backup+0x180/0x190 [radeon]
déc 10 21:49:01 the_PC kernel:  [<ffffffffa00f9413>] ? radeon_irq_kms_disable_hpd+0x73/0x80 [radeon]
déc 10 21:49:01 the_PC kernel:  [<ffffffffa00c6c80>] radeon_gpu_reset+0xd0/0x330 [radeon]
déc 10 21:49:01 the_PC kernel:  [<ffffffff810df990>] ? wake_atomic_t_function+0x70/0x70
déc 10 21:49:01 the_PC kernel:  [<ffffffffa00e058f>] ? radeon_fence_wait+0x9f/0xe0 [radeon]
déc 10 21:49:01 the_PC kernel:  [<ffffffffa00ed960>] radeon_flip_work_func+0x130/0x170 [radeon]
déc 10 21:49:01 the_PC kernel:  [<ffffffff810b650e>] process_one_work+0x19e/0x3f0
déc 10 21:49:01 the_PC kernel:  [<ffffffff810b67ae>] worker_thread+0x4e/0x450
déc 10 21:49:01 the_PC kernel:  [<ffffffff810b6760>] ? process_one_work+0x3f0/0x3f0
déc 10 21:49:01 the_PC kernel:  [<ffffffff810b6760>] ? process_one_work+0x3f0/0x3f0
déc 10 21:49:01 the_PC kernel:  [<ffffffff810bc8b8>] kthread+0xd8/0xf0
déc 10 21:49:01 the_PC kernel:  [<ffffffff810bc7e0>] ? kthread_worker_fn+0x160/0x160
déc 10 21:49:01 the_PC kernel:  [<ffffffff817797df>] ret_from_fork+0x3f/0x70
déc 10 21:49:01 the_PC kernel:  [<ffffffff810bc7e0>] ? kthread_worker_fn+0x160/0x160
déc 10 21:49:01 the_PC kernel: Code: c4 08 44 89 e8 5b 41 5c 41 5d 5d c3 4c 89 e7 e8 e7 eb fd ff eb 88 0f 1f 44 00 00 66 66 66 66 90 48 8b 87 90 05 00 00 55 48 89 e5 <48> 8b 40 d8 5d c3 66 2e 0f 1f 84 00 00 00 00 00 66 66 66 66 90 
déc 10 21:49:01 the_PC kernel: RIP  [<ffffffff810bcd40>] kthread_data+0x10/0x20
déc 10 21:49:01 the_PC kernel:  RSP <ffff8805ff5c3918>
déc 10 21:49:01 the_PC kernel: CR2: ffffffffffffffd8
déc 10 21:49:01 the_PC kernel: ---[ end trace 37e2470f6b251993 ]---
déc 10 21:49:01 the_PC kernel: Fixing recursive fault but reboot is needed!
-- Reboot --
Comment 1 Jean-François Fortin Tam 2016-01-22 16:57:15 UTC
It also (rarely) locks up when I'm not doing anything in particular. I could be just sitting and staring at my desktop when suddenly the monitor turns off and I get this in dmesg:



[67967.108746] radeon 0000:02:00.0: ring 0 stalled for more than 10252msec
[67967.108750] radeon 0000:02:00.0: GPU lockup (current fence id 0x00000000006c9132 last fence id 0x00000000006c928b on ring 0)
[67967.108772] radeon 0000:02:00.0: failed to get a new IB (-35)
[67967.108805] [drm:radeon_cs_ioctl [radeon]] *ERROR* Failed to get ib !
[67967.977163] BUG: unable to handle kernel paging request at ffffc90404239ffc
[67967.977200] IP: [<ffffffffa013736a>] radeon_ring_backup+0xda/0x190 [radeon]
[67967.977246] PGD 6068a8067 PUD 0 
[67967.977271] Oops: 0000 [#1] SMP 
[67967.977293] Modules linked in: fuse xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack tun bridge stp llc ebtable_filter ebtables ip6table_filter ip6_tables snd_hda_codec_analog snd_hda_codec_generic dell_wmi iTCO_wdt sparse_keymap gpio_ich iTCO_vendor_support video ppdev coretemp kvm_intel dcdbas snd_hda_codec_hdmi dell_smm_hwmon kvm snd_hda_intel snd_hda_codec snd_usb_audio snd_hda_core crc32c_intel snd_usbmidi_lib snd_hwdep snd_seq snd_rawmidi snd_seq_device joydev snd_pcm snd_timer snd tpm_tis lpc_ich parport_pc i2c_i801 soundcore tpm parport wmi i7core_edac shpchp edac_core acpi_cpufreq nfsd auth_rpcgss nfs_acl lockd grace sunrpc hid_logitech_hidpp hid_logitech_dj wacom amdkfd amd_iommu_v2
[67967.977806]  radeon i2c_algo_bit drm_kms_helper ttm tg3 serio_raw drm ptp pps_core
[67967.977875] CPU: 5 PID: 5985 Comm: Xorg Tainted: G          I     4.3.3-301.fc23.x86_64 #1
[67967.977906] Hardware name: Dell Inc. Precision WorkStation T3500  /0K095G, BIOS A17 05/28/2013
[67967.977937] task: ffff8805e5a11cc0 ti: ffff8805e8038000 task.ti: ffff8805e8038000
[67967.977965] RIP: 0010:[<ffffffffa013736a>]  [<ffffffffa013736a>] radeon_ring_backup+0xda/0x190 [radeon]
[67967.978013] RSP: 0018:ffff8805e803ba28  EFLAGS: 00010206
[67967.978033] RAX: ffffc9000c001000 RBX: 00000000ffffffff RCX: 0000000000000000
[67967.978059] RDX: 0000000000000000 RSI: ffffc90404239ffc RDI: 00000000000b0bc0
[67967.978086] RBP: ffff8805e803ba58 R08: ffff8803c68b3880 R09: 00000000000b2000
[67967.978112] R10: 8000000000000163 R11: ffffffff81a68139 R12: ffff8805ff2a54d8
[67967.978138] R13: ffff8805ff2a54b0 R14: 000000000002c2f1 R15: ffff8805e803baa0
[67967.978164] FS:  00007f5fb263f700(0000) GS:ffff880606f40000(0000) knlGS:0000000000000000
[67967.978194] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[67967.978215] CR2: ffffc90404239ffc CR3: 00000005e5b48000 CR4: 00000000000006e0
[67967.978241] Stack:
[67967.978250]  ffff8805ff2a4000 ffff8805ff2a4000 ffff8805ff2a54d8 ffff8805e803baa0
[67967.978287]  ffff8805ff2a54d8 0000000000000000 ffff8805e803bb10 ffffffffa0105c8d
[67967.978322]  ffff8805ff2a4738 00ffffff00000001 ffff8805ff2a4018 0000000000000000
[67967.978359] Call Trace:
[67967.978377]  [<ffffffffa0105c8d>] radeon_gpu_reset+0xcd/0x330 [radeon]
[67967.978415]  [<ffffffffa01dec7f>] ? radeon_sync_free+0x2f/0x40 [radeon]
[67967.978452]  [<ffffffffa01de547>] ? radeon_ib_free+0x37/0x40 [radeon]
[67967.978488]  [<ffffffffa0138df4>] radeon_cs_ioctl+0x64/0x780 [radeon]
[67967.978520]  [<ffffffffa0019408>] drm_ioctl+0x138/0x500 [drm]
[67967.978552]  [<ffffffffa0138d90>] ? radeon_cs_parser_init+0x490/0x490 [radeon]
[67967.978586]  [<ffffffff8178108e>] ? _raw_spin_unlock_irqrestore+0xe/0x10
[67967.978618]  [<ffffffffa010304c>] radeon_drm_ioctl+0x4c/0x80 [radeon]
[67967.978647]  [<ffffffff81236bd5>] do_vfs_ioctl+0x295/0x470
[67967.978671]  [<ffffffff8111e941>] ? SyS_futex+0x81/0x180
[67967.978692]  [<ffffffff81236e29>] SyS_ioctl+0x79/0x90
[67967.978712]  [<ffffffff817815ee>] entry_SYSCALL_64_fastpath+0x12/0x71
[67967.978735] Code: 0c e1 48 85 c0 49 89 07 74 6c 41 8d 7e ff 31 d2 48 c1 e7 02 eb 07 49 8b 07 48 83 c2 04 49 8b 74 24 08 8d 4b 01 89 db 48 8d 34 9e <8b> 36 89 34 10 41 23 4c 24 54 48 39 d7 89 cb 75 da 4c 89 ef e8 
[67967.979054] RIP  [<ffffffffa013736a>] radeon_ring_backup+0xda/0x190 [radeon]
[67967.979094]  RSP <ffff8805e803ba28>
[67967.979108] CR2: ffffc90404239ffc
[67967.988284] ---[ end trace f6fe8c1dbb2ed43c ]---
[68043.714679] Chrome_ChildThr[29558]: segfault at 0 ip 0000557f813adea4 sp 00007fa9867fe3e0 error 6 in plugin-container[557f813a5000+3d000]
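
For anyone comparing lockup reports, a minimal sketch that pulls the ring number and fence IDs out of the "GPU lockup" lines quoted so far. The regular expression is inferred only from the lines in this report, and the "pending" figure assumes that "current fence id" is the last fence the GPU signaled and "last fence id" is the last one submitted; `parse_lockups` is a hypothetical name.

```python
import re

# Pattern inferred from the radeon messages quoted in this report; it works
# on both the journalctl-style and the dmesg-style lines.
LOCKUP_RE = re.compile(
    r"GPU lockup \(current fence id 0x(?P<current>[0-9a-f]+) "
    r"last fence id 0x(?P<last>[0-9a-f]+) on ring (?P<ring>\d+)\)"
)

def parse_lockups(lines):
    """Yield (ring, current, last, pending) for every GPU lockup line."""
    for line in lines:
        match = LOCKUP_RE.search(line)
        if match:
            current = int(match.group("current"), 16)
            last = int(match.group("last"), 16)
            # Assumption: fences between "current" and "last" were submitted
            # but never signaled by the GPU.
            yield int(match.group("ring")), current, last, last - current

sample = [
    "déc 10 21:49:00 the_PC kernel: radeon 0000:02:00.0: GPU lockup "
    "(current fence id 0x000000000000a5fe last fence id 0x000000000000a600 on ring 3)",
    "[67967.108750] radeon 0000:02:00.0: GPU lockup "
    "(current fence id 0x00000000006c9132 last fence id 0x00000000006c928b on ring 0)",
]
for ring, current, last, pending in parse_lockups(sample):
    print(f"ring {ring}: {pending} fence(s) outstanding")
```

Feeding it a whole saved journal (one line per list element) gives a quick overview of which rings stall and how much work was queued at the time.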
Comment 2 Andreas Kilgus 2016-02-18 09:31:53 UTC
Created attachment 121831 [details]
Excerpt /var/log/messages GPU crash
Comment 3 Andreas Kilgus 2016-02-18 09:50:06 UTC
Happens at low system/graphical load, maybe related to Chromium (IIRC, the last two times it occurred I was actively using Chromium).

Radeon R7 260X

Mesa 11.1.2
xorg-x11-server 7.6 1.18.1
kernel 4.4.1
Comment 4 Jean-François Fortin Tam 2016-03-05 22:25:02 UTC
Happened to me again today after 1 day and 22 hours of uptime, with the computer just sitting around, idle, with the screen turned off. It can sometimes happen after 6 days, sometimes after 1-2 days... it doesn't matter what you're doing or not.

At least this time I've been able to eliminate "suspend/resume" from the list of potential causes, as the computer was set to never sleep.
Comment 5 Jean-François Fortin Tam 2016-03-05 22:27:18 UTC
And it's not triggered by Chromium/Epiphany/Firefox, it happens with just a GNOME desktop sitting around in my case. Clearly, something is just FUBAR in the radeon driver or recent Linux kernels...
Comment 6 Jean-François Fortin Tam 2016-05-12 04:33:01 UTC
For what it's worth, compared to my previous comment #5, tonight (same machine, same distro/stack) I was able to trigger the bug pretty frequently by using the Epiphany browser with a particular website, twice within the span of fifteen minutes or so.

So while there is a simple time component (e.g. crashes while the computer isn't doing anything in particular), it can also sometimes be triggered by stressing the graphics card a little with some operations (such as those performed by some browsers).
Comment 7 Jean-François Fortin Tam 2016-07-01 03:55:14 UTC
Um hello, any developers around?

As previously mentioned, although it happens even when idle, it's quite easy to trigger and reproduce with 3D/OpenGL content. It's extremely easy to trigger with http://demo.f4map.com/#lat=45.4946369&lon=-73.5661827&zoom=19 ; you just have to sit on that page for a minute or two, maybe pan around the map, and your driver (and kernel) will crash with the screen turning off.
Comment 8 Nicolai Hähnle 2016-07-01 13:51:05 UTC
Sorry for your troubles. Non-deterministic lockups are just very hard to debug, and silence mostly means that nobody has an idea.

For future record, which browser reproduces the lockup for you on that website?
Comment 9 Jean-François Fortin Tam 2016-07-02 00:30:37 UTC
Hi Nicolai, it was mostly the lack of any response over half a year that bothered me. I was really looking forward to providing whatever information might be needed to investigate this bug, and trying to work for six months on a workstation that can hard-lock at any time is really painful :)

I can see now that it is a somewhat non-deterministic bug indeed. I have been using the latest version of Firefox (v47+) on Fedora 23 and 24 today to trigger the bug easily (usually within 3-10 minutes) by having these pages open all at the same time (what better torture test than a bunch of WebGL demos!):

- appear.in/fdo93341
- demo.f4map.com
- bongiovi.tw/projects/particlesValley/
- jayweeks.com/medusae/

...with a RadeonHD 2600 (instead of the 7770) the bug does not occur so far, but that's a completely different series (r600 instead of radeonsi) so I'm not surprised.

FWIW, this Dell workstation-class computer has a pretty powerful PSU (525 W) compared to the one in the previous computer I used with the Radeon 7770 (which had a 350 W PSU). I measured the GPU's temperatures at all times (nothing unusual going on) and tried different PCI-E slots (since my workstation has two), no luck...
Comment 10 Nicolai Hähnle 2016-07-04 16:15:35 UTC
I've been running the last three in Firefox on a Tonga system that was simultaneously used for other tests for 45 minutes now, without a hang.

@Christian: It's a long shot, but by the rough shape of GPU lockup reports over the last few months I have the impression that the radeon module still has a lockup bug under pressure (especially with multiple apps running simultaneously, but that might just be X/the compositor) which was fixed in amdgpu. Any idea what that might have been?
Comment 11 Jean-François Fortin Tam 2016-07-04 19:55:25 UTC
You are right Nicolai, the stressor to trigger the bug is more subtle than I thought after all... while I was able to trigger this within minutes a few days ago, now my machine has been running with those 3-4 webGL benchmarks for the entire day today without issues.

Just to make sure it's really not a hardware issue, I tried with different power supplies, I measured the consumption (the machine eats between 150 and 220 watts at the very maximum, whereas the PSU can easily supply 500 watts), and tested the "Other OS", which doesn't exhibit the issue... so it does still look like a software bug, at least. I'd be happy to provide any other info you may need.
Comment 12 Arek Ruśniak 2016-07-13 10:07:31 UTC
Hi, 
I have an HD 7770 too and your problem sounds familiar. I use GNOME 3 as well, and I use Mesa/LLVM from the git master tree all the time.
Sometimes clicking on "Activities" was enough to make the GPU go "bunga bunga", but sometimes it was stable as hell. It was extremely random and I found no trigger for it, but some time ago (I don't remember exactly, half a year or so) the problem disappeared.

If you are still using Mesa from Fedora (I can't see which version it is), maybe it's time to consider a change.
There is a repo with mesa-git for Fedora (built against LLVM 3.8). It could be a good start.
Comment 13 Jean-François Fortin Tam 2016-10-09 19:49:57 UTC
Just to be 110% sure: I put a completely new, top-quality 650 W power supply into the machine, and the problem persists with the F4 map WebGL demo.
Comment 14 Jean-François Fortin Tam 2016-11-23 03:14:00 UTC
As an update/additional info: the problem persists on Fedora 25 running a Wayland-based GNOME. I don't know how to determine the driver's version number but I presume it to be the latest released at this time.
Comment 15 Vedran Miletić 2016-11-25 00:25:16 UTC
(In reply to Jean-François Fortin Tam from comment #14)
> As an update/additional info: the problem persists on Fedora 25 running a
> Wayland-based GNOME. I don't know how to determine the driver's version
> number but I presume it to be the latest released at this time.

Do you get the same dmesg errors? I have Wayland locking up randomly, but dmesg stays clean and I can ssh into the machine and reboot.
Comment 16 Jean-François Fortin Tam 2016-11-29 22:33:08 UTC
Created attachment 128278 [details]
journalctl output at the time of a deadlock on F25

> Do you get the same dmesg errors?
> I have Wayland locking up randomly,
> but dmesg stays clean and I can ssh into the machine and reboot.

Pretty much yeah. Attached is the crash I have experienced just now, and the computer wasn't doing anything other than sitting around on the desktop and playing music from Rhythmbox... and you can see the usual:


/usr/libexec/gdm-x-session[18145]: radeon: Failed to deallocate virtual address for buffer:
/usr/libexec/gdm-x-session[18145]: radeon:    size      : 20480 bytes
kernel: radeon 0000:02:00.0: ring 3 stalled for more than 10083msec
kernel: radeon 0000:02:00.0: GPU lockup (current fence id 0x00000000002d46ee last fence id 0x00000000002d4710 on ring 3)
kernel: radeon 0000:02:00.0: failed to get a new IB (-35)
kernel: [drm:radeon_gem_va_ioctl [radeon]] *ERROR* Couldn't update BO_VA (-35)
kernel: radeon 0000:02:00.0: failed to get a new IB (-35)
kernel: radeon 0000:02:00.0: failed to get a new IB (-35)
kernel: [drm:radeon_cs_ioctl [radeon]] *ERROR* Failed to get ib !
kernel: [drm:radeon_gem_va_ioctl [radeon]] *ERROR* Couldn't update BO_VA (-35)
/usr/libexec/gdm-x-session[18145]: radeon:    va        : 0x1f836000
/usr/libexec/gdm-x-session[18145]: radeon: Failed to deallocate virtual address for buffer:
/usr/libexec/gdm-x-session[18145]: radeon:    size      : 45056 bytes
/usr/libexec/gdm-x-session[18145]: radeon:    va        : 0x1f4b0000
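
A note on the "(-35)" in the kernel lines above: it looks like a negated Linux errno, and 35 is EDEADLK on x86-64 Linux. My understanding, which is an assumption and not something confirmed in this thread, is that the radeon module returns it once it has flagged the GPU as locked up. A minimal sketch for decoding such suffixes, with a hand-written excerpt of the Linux errno table and a hypothetical helper name:

```python
import re

# Excerpt of Linux errno numbers (asm-generic numbering, as used on x86-64).
# Hard-coded on purpose so the result does not depend on the OS running
# this script; extend the table as needed.
LINUX_ERRNO = {
    12: "ENOMEM",
    16: "EBUSY",
    22: "EINVAL",
    35: "EDEADLK",
    110: "ETIMEDOUT",
}

def decode_errno_suffix(line):
    """Return (code, name) for a '(-N)' suffix in a kernel log line, else None."""
    match = re.search(r"\(-(\d+)\)", line)
    if match is None:
        return None
    code = int(match.group(1))
    return code, LINUX_ERRNO.get(code, "unknown")

print(decode_errno_suffix("kernel: radeon 0000:02:00.0: failed to get a new IB (-35)"))
```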
Comment 17 Jean-François Fortin Tam 2017-02-02 06:09:03 UTC
I'm unfortunately still seeing this on an up-to-date Fedora 25 with kernel 4.9.6, DRM 2.48.0, LLVM 3.8.1, mesa 13.0.3, xorg-x11-drv-ati 7.7.1 (2016-09-28 git 3fc839ff) etc.

Nicolai, would it help at all to know that I don't recall ever encountering the issue while playing non-fullscreened HTML5 youtube videos in Firefox, but that I can easily encounter it if playing fullscreen or if playing fullscreen videos in Totem (under GNOME Shell, whether Xorg or Wayland session)?

This really doesn't seem related to system load, I was looking at "radeontop" just now while playing a fullscreen video (which made it deadlock within a few minutes) and the graphics pipe was barely 20-30% used, and VRAM about 80-90% used but never 100%.
Comment 18 Michel Dänzer 2017-02-02 06:17:17 UTC
Please attach the current Xorg log file.
Comment 19 Jean-François Fortin Tam 2017-02-02 14:15:35 UTC
What would be the equivalent in the systemd/journalctl world? Apparently Fedora 25 doesn't generate Xorg.log files anymore; the last modification timestamp on that one file is October 10th 2016...
Comment 20 Alex Deucher 2017-02-02 16:31:02 UTC
(In reply to Jean-François Fortin Tam from comment #19)
> What would be the equivalent in the systemd/journalctl world? Apparently
> Fedora 25 doesn't generate Xorg.log files anymore, the last modification
> timestamp on that one file is October 10th 2016...

See this page for how to access the xorg log output on various versions of fedora:
https://fedoraproject.org/wiki/How_to_debug_Xorg_problems
Comment 21 Jean-François Fortin Tam 2017-02-02 19:59:19 UTC
Created attachment 129304 [details]
journalctl output at the time of a deadlock on F25 - X GDM session output only

Hi Alex and thanks for the pointer, here's the output as per those instructions... but the result seems quite useless compared to the full journalctl output (which I'll be attaching as well).
Comment 22 Jean-François Fortin Tam 2017-02-02 20:00:46 UTC
Created attachment 129305 [details]
journalctl output at the time of a deadlock on F25 - take 2

Full journal output at the time of the crash. Exactly the same as before as far as I can tell. If there's any other information I can provide, please tell.
Comment 23 Jean-François Fortin Tam 2017-02-02 20:07:18 UTC
Created attachment 129306 [details]
Xorg log

Xorg.0.log file found in ~/.local/share/xorg as "Xorg.0.log.old"
As you can see it says nothing about the crash. It seems only the global journalctl output caught something.
Comment 24 Julien Isorce 2017-04-10 19:11:42 UTC
Does the test
wget http://www.phoronix-test-suite.com/benchmark-files/GpuTest_Linux_x64_0.7.0.zip
unzip GpuTest_Linux_x64_0.7.0.zip
# run the next command from the directory the archive extracts to
DISPLAY=:0 ./GpuTest /test=fur /fullscreen

reproduce the problem?
Comment 25 Jean-François Fortin Tam 2017-04-11 12:19:07 UTC
Hi Julien, unfortunately with that benchmark I was not able to reproduce it so far (I've had it running for about 9 hours). This might be just "luck" though, as I've sometimes had the issue refuse to reproduce for hours and days, and sometimes the issue would happen right away. As I'm suspecting it's a race condition, I'm thinking it might also be sensitive to the system's software collection at various times of the year (i.e. maybe with one kernel the problem resurfaces more frequently, then another kernel point release changes the stack's timings a bit and the race disappears, rinse & repeat?)

I might as well leave the benchmark running in the coming days, but at least you know that it's (probably) not directly due to the system load or the GPU load... as I mentioned in earlier comments, it seems to be quite random. For some reason, I haven't encountered a random lockup in a month, although I've grown to use my computer in light ways (too scared to play fullscreen videos or use 3D, except composited window managers)
Comment 26 Jean-François Fortin Tam 2017-04-11 22:20:05 UTC
OK, I've got good news... Julien, thanks to the crazy furry donut "torture test" you suggested, I was able to finally pinpoint the real trigger for this bug.

My understanding is that on Radeons (well, at least the Radeon HD 7770), there is an emergency mechanism in the hardware (or maybe in the firmware/microcode) that throttles performance when the GPU reaches a critical temperature. Normally, the video driver is supposed to handle this state change gracefully; however, the radeonsi/radeon/amdgpu driver on Linux does not, so the kernel panics because the driver went belly up.

During additional testing today, where I forced my GPU to overheat, I was able to determine that the critical point is the same as on Windows: 113 degrees Celsius. As soon as you go over 112... boom, dead radeonsi driver + kernel oops (with the same error messages as my previous logs above). Additionally, lm_sensors thinks the temperature has instantly jumped to 511 degrees Celsius (!), and the readings stay stuck at 511 Celsius.

"Duh! Just get better cooling!" might sound like a workaround (just like keeping the case open), but nope, technically, it's still a software/driver issue: the Linux driver should handle such scenarios gracefully just as well as the Windows driver. In Windows, breaching the 110-113 degrees Celsius limit results in the video driver simply dropping frames massively, continuing to function at reduced performance (ie: going from 40-60 fps to 10-15 fps on one of my benchmarks). The system never crashes.

So the bug here, as I understand it, is that the radeonsi driver on Linux does not handle the event where the hardware force-throttles itself.
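
For anyone who wants to check whether their card is approaching that point, a minimal sketch that reads GPU temperatures from the kernel's hwmon sysfs files. It assumes the radeon module registers a hwmon device whose `name` file reads "radeon" and that `temp*_input` values are in millidegrees Celsius (the usual hwmon convention). The 105 degree warning threshold and the 200 degree "implausible" cutoff are arbitrary values picked for illustration; only the 113 and 511 degree figures come from the observations above.

```python
import glob
import os

def read_hwmon_temps(root="/sys/class/hwmon", driver="radeon"):
    """Return {path: degrees Celsius} for temp*_input files of one driver."""
    temps = {}
    for dev in sorted(glob.glob(os.path.join(root, "hwmon*"))):
        try:
            with open(os.path.join(dev, "name")) as f:
                name = f.read().strip()
        except OSError:
            continue
        if name != driver:
            continue
        for path in sorted(glob.glob(os.path.join(dev, "temp*_input"))):
            try:
                with open(path) as f:
                    # hwmon reports millidegrees Celsius.
                    temps[path] = int(f.read().strip()) / 1000.0
            except (OSError, ValueError):
                continue
    return temps

def classify(temp_c, warn_at=105.0, bogus_above=200.0):
    """Label a reading; both thresholds are illustrative assumptions."""
    if temp_c > bogus_above:
        return "implausible"  # e.g. the stuck 511 degree reading
    if temp_c >= warn_at:
        return "near-critical"
    return "ok"

for path, temp_c in read_hwmon_temps().items():
    print(f"{path}: {temp_c:.1f} C ({classify(temp_c)})")
```

Running it in a loop while a torture test is going, and logging the last reading before a lockup, would show whether the hang coincides with the critical temperature.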

---------
Contextual notes:
The reason why I only started experiencing this issue in December 2015 (as I've had the GPU since 2012) was that I changed my PC case then, which means a different airflow and cooling behavior... And the reason why it was so hard to get consistent crashes here was that when I was trying to troubleshoot it, I was sometimes doing it with the case closed, sometimes with the case open (when trying with a different power supply unit using a "siamese transplant" across another computer, for example). If I keep my case open, the card will never reach the critical temperature and so the issue will not happen. I might get a system "freeze" (possibly saying "*ERROR* si_restrict_performance_levels_before_switch failed") after many hours of torture testing, but the symptoms are different (the screen does not turn off, image stays on with everything frozen, and nothing else in the logs) and so I presume that to be a different issue.
Comment 27 Julien Isorce 2017-04-20 15:36:07 UTC
About your comment #26, do you get logs similar to those attached? I.e. ring N stalled, then GPU soft reset, then a freeze which requires a reboot?

Can you try https://bugs.freedesktop.org/show_bug.cgi?id=100712#c6 ?
Comment 28 Jean-François Fortin Tam 2017-07-09 14:25:06 UTC
Hi Julien, sorry I missed the mail notification in the pile. To answer your question:

> About your comment #26, do you get logs similar to those attached?
> I.e. ring N stalled, then GPU soft reset, then a freeze which requires a reboot?

Yeah I was getting the exact same output as usual (forgot to mention that).


> Can you try https://bugs.freedesktop.org/show_bug.cgi?id=100712#c6 ?

Not easily as I'd have to wait for that to trickle down into whatever kernel Fedora is packaging and compare versions, and would need to be able to make my GPU overheat which is no longer easy since I completely changed the thermal design and ventilation of my case (even under 100% GPU load it stays under 60-70 Celsius now).

Though maybe Andreas or Arek could also try this, if they have a similar issue with an "open air" GPU fan design that exhausts into a not-so-well-ventilated case (instead of a "blower" GPU cooler that directly extracts the hot air)...
Comment 29 GitLab Migration User 2019-09-25 17:53:50 UTC
-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/mesa/mesa/issues/1226.
