Summary: | Semi-random GPU lockups on radeonsi with a RadeonHD 7770 (when playing videos, running OpenGL games, WebGL apps, or after extended periods of time) | ||
---|---|---|---|
Product: | Mesa | Reporter: | Jean-François Fortin Tam <nekohayo> |
Component: | Drivers/Gallium/radeonsi | Assignee: | Default DRI bug account <dri-devel> |
Status: | RESOLVED MOVED | QA Contact: | Default DRI bug account <dri-devel> |
Severity: | major | ||
Priority: | medium | CC: | austinenglish, ckoenig.leichtzumerken, julien.isorce, kilgus |
Version: | 11.0 | ||
Hardware: | x86-64 (AMD64) | ||
OS: | Linux (All) | ||
URL: | https://bugzilla.redhat.com/show_bug.cgi?id=1335360 | ||
Whiteboard: | |||
i915 platform: | i915 features: | ||
Attachments: |
Excerpt /var/log/messages GPU crash
journalctl output at the time of a deadlock on F25
journalctl output at the time of a deadlock on F25 - X GDM session output only
journalctl output at the time of a deadlock on F25 - take 2
Xorg log |
Description
Jean-François Fortin Tam
2015-12-11 05:36:44 UTC
I also get it to (rarely) lockup when not doing anything in particular. I could be just sitting and staring at my desktop when suddenly the monitor turns off and I get this in dmesg:

[67967.108746] radeon 0000:02:00.0: ring 0 stalled for more than 10252msec
[67967.108750] radeon 0000:02:00.0: GPU lockup (current fence id 0x00000000006c9132 last fence id 0x00000000006c928b on ring 0)
[67967.108772] radeon 0000:02:00.0: failed to get a new IB (-35)
[67967.108805] [drm:radeon_cs_ioctl [radeon]] *ERROR* Failed to get ib !
[67967.977163] BUG: unable to handle kernel paging request at ffffc90404239ffc
[67967.977200] IP: [<ffffffffa013736a>] radeon_ring_backup+0xda/0x190 [radeon]
[67967.977246] PGD 6068a8067 PUD 0
[67967.977271] Oops: 0000 [#1] SMP
[67967.977293] Modules linked in: fuse xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack tun bridge stp llc ebtable_filter ebtables ip6table_filter ip6_tables snd_hda_codec_analog snd_hda_codec_generic dell_wmi iTCO_wdt sparse_keymap gpio_ich iTCO_vendor_support video ppdev coretemp kvm_intel dcdbas snd_hda_codec_hdmi dell_smm_hwmon kvm snd_hda_intel snd_hda_codec snd_usb_audio snd_hda_core crc32c_intel snd_usbmidi_lib snd_hwdep snd_seq snd_rawmidi snd_seq_device joydev snd_pcm snd_timer snd tpm_tis lpc_ich parport_pc i2c_i801 soundcore tpm parport wmi i7core_edac shpchp edac_core acpi_cpufreq nfsd auth_rpcgss nfs_acl lockd grace sunrpc hid_logitech_hidpp hid_logitech_dj wacom amdkfd amd_iommu_v2
[67967.977806] radeon i2c_algo_bit drm_kms_helper ttm tg3 serio_raw drm ptp pps_core
[67967.977875] CPU: 5 PID: 5985 Comm: Xorg Tainted: G I 4.3.3-301.fc23.x86_64 #1
[67967.977906] Hardware name: Dell Inc. Precision WorkStation T3500 /0K095G, BIOS A17 05/28/2013
[67967.977937] task: ffff8805e5a11cc0 ti: ffff8805e8038000 task.ti: ffff8805e8038000
[67967.977965] RIP: 0010:[<ffffffffa013736a>] [<ffffffffa013736a>] radeon_ring_backup+0xda/0x190 [radeon]
[67967.978013] RSP: 0018:ffff8805e803ba28 EFLAGS: 00010206
[67967.978033] RAX: ffffc9000c001000 RBX: 00000000ffffffff RCX: 0000000000000000
[67967.978059] RDX: 0000000000000000 RSI: ffffc90404239ffc RDI: 00000000000b0bc0
[67967.978086] RBP: ffff8805e803ba58 R08: ffff8803c68b3880 R09: 00000000000b2000
[67967.978112] R10: 8000000000000163 R11: ffffffff81a68139 R12: ffff8805ff2a54d8
[67967.978138] R13: ffff8805ff2a54b0 R14: 000000000002c2f1 R15: ffff8805e803baa0
[67967.978164] FS: 00007f5fb263f700(0000) GS:ffff880606f40000(0000) knlGS:0000000000000000
[67967.978194] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[67967.978215] CR2: ffffc90404239ffc CR3: 00000005e5b48000 CR4: 00000000000006e0
[67967.978241] Stack:
[67967.978250] ffff8805ff2a4000 ffff8805ff2a4000 ffff8805ff2a54d8 ffff8805e803baa0
[67967.978287] ffff8805ff2a54d8 0000000000000000 ffff8805e803bb10 ffffffffa0105c8d
[67967.978322] ffff8805ff2a4738 00ffffff00000001 ffff8805ff2a4018 0000000000000000
[67967.978359] Call Trace:
[67967.978377] [<ffffffffa0105c8d>] radeon_gpu_reset+0xcd/0x330 [radeon]
[67967.978415] [<ffffffffa01dec7f>] ? radeon_sync_free+0x2f/0x40 [radeon]
[67967.978452] [<ffffffffa01de547>] ? radeon_ib_free+0x37/0x40 [radeon]
[67967.978488] [<ffffffffa0138df4>] radeon_cs_ioctl+0x64/0x780 [radeon]
[67967.978520] [<ffffffffa0019408>] drm_ioctl+0x138/0x500 [drm]
[67967.978552] [<ffffffffa0138d90>] ? radeon_cs_parser_init+0x490/0x490 [radeon]
[67967.978586] [<ffffffff8178108e>] ? _raw_spin_unlock_irqrestore+0xe/0x10
[67967.978618] [<ffffffffa010304c>] radeon_drm_ioctl+0x4c/0x80 [radeon]
[67967.978647] [<ffffffff81236bd5>] do_vfs_ioctl+0x295/0x470
[67967.978671] [<ffffffff8111e941>] ? SyS_futex+0x81/0x180
[67967.978692] [<ffffffff81236e29>] SyS_ioctl+0x79/0x90
[67967.978712] [<ffffffff817815ee>] entry_SYSCALL_64_fastpath+0x12/0x71
[67967.978735] Code: 0c e1 48 85 c0 49 89 07 74 6c 41 8d 7e ff 31 d2 48 c1 e7 02 eb 07 49 8b 07 48 83 c2 04 49 8b 74 24 08 8d 4b 01 89 db 48 8d 34 9e <8b> 36 89 34 10 41 23 4c 24 54 48 39 d7 89 cb 75 da 4c 89 ef e8
[67967.979054] RIP [<ffffffffa013736a>] radeon_ring_backup+0xda/0x190 [radeon]
[67967.979094] RSP <ffff8805e803ba28>
[67967.979108] CR2: ffffc90404239ffc
[67967.988284] ---[ end trace f6fe8c1dbb2ed43c ]---
[68043.714679] Chrome_ChildThr[29558]: segfault at 0 ip 0000557f813adea4 sp 00007fa9867fe3e0 error 6 in plugin-container[557f813a5000+3d000]

Created attachment 121831 [details]
Excerpt /var/log/messages GPU crash
Happens at low system/graphical load, maybe related to chromium (IIRC, the last two times it occurred I was actively using chromium).
Radeon R7 260X
Mesa 11.1.2
xorg-x11-server 7.6 1.18.1
kernel 4.4.1

Happened to me again today after 1 day and 22 hours of uptime, with the computer just sitting around, idle, with the screen turned off. It can sometimes happen after 6 days, sometimes 1-2 days... it doesn't matter what you're doing or not. At least this time I've been able to eliminate "suspend/resume" from the list of potential causes, as the computer was set to never sleep. And it's not triggered by Chromium/Epiphany/Firefox, it happens with just a GNOME desktop sitting around in my case. Clearly, something is just FUBAR in the radeon driver or recent Linux kernels...

For what it's worth, compared to my previous comment #5, tonight (same machine, same distro/stack) I was able to trigger the bug pretty frequently by using the Epiphany browser with a particular website: twice within the span of fifteen minutes or so. So while there is a simple time component (e.g. crashes while the computer isn't doing anything in particular), it can also sometimes be triggered by stressing the graphics card a little with some operations (such as those some browsers perform).

Um hello, any developers around? As previously mentioned, although it happens even when idle, it's quite easy to trigger and reproduce by using 3D/OpenGL content. And it's extremely easy to trigger with http://demo.f4map.com/#lat=45.4946369&lon=-73.5661827&zoom=19 ; you just have to sit on that page for a minute or two, maybe pan around the map, and your driver (and kernel) will crash with the screen turning off.

Sorry for your troubles. Non-deterministic lockups are just very hard to debug, and silence mostly means that nobody has an idea. For future record, which browser reproduces the lockup for you on that website?
Hi Nicolai, it's more the lack of response after half a year that bothered me. I was really looking forward to providing any information that might be needed to investigate this bug, but trying to work for six months with a workstation that can hardlock at any time is really painful :) I can see now that it is a somewhat non-deterministic bug indeed.

I have been using the latest version of Firefox (v47+) on Fedora 23 and 24 today to trigger the bug easily (usually within 3-10 minutes) by having these pages open all at the same time (what better torture test than a bunch of WebGL demos!):
- appear.in/fdo93341
- demo.f4map.com
- bongiovi.tw/projects/particlesValley/
- jayweeks.com/medusae/

With a RadeonHD 2600 (instead of the 7770) the bug does not occur so far, but that's a completely different series (r600 instead of radeonsi) so I'm not surprised. FWIW, this Dell workstation-class computer has a pretty powerful PSU (525 W) compared to the one in the previous computer I used with the Radeon 7770 (which had a 350 W PSU). I measured the GPU's temperatures at all times (nothing unusual going on), tried different PCI-E slots (since my workstation has two), no luck...

I've been running the last three in Firefox for 45 minutes now, on a Tonga system that was simultaneously used for other tests, without a hang.

@Christian: It's a long shot, but by the rough shape of GPU lockup reports over the last few months I have the impression that the radeon module still has a lockup bug under pressure (especially with multiple apps running simultaneously, but that might just be X/the compositor) which was fixed in amdgpu. Any idea what that might have been?

You are right Nicolai, the stressor that triggers the bug is more subtle than I thought after all... while I was able to trigger this within minutes a few days ago, now my machine has been running those 3-4 WebGL benchmarks for the entire day today without issues.
Just to make sure it's really not a hardware issue, I tried different power supplies, I measured the consumption (the machine draws between 150 and 220 watts at the very maximum, whereas the PSU can easily supply 500 watts), and tested the "Other OS", which doesn't exhibit the issue... so it does still look like a software bug, at least. I'd be happy to provide any other info you may need.

Hi, I have an HD 7770 too and your problem sounds familiar. I use GNOME 3 as well, and I use mesa/llvm from the git master tree all the time. Sometimes clicking on "Activities" was enough to make the GPU go "bunga bunga", but sometimes it was stable as hell. It was extremely random and I never found a trigger for it, but some time ago (I don't remember exactly, half a year or so) the problem disappeared. If you are still using Mesa from Fedora (I can't see which version it is), maybe it's time to consider a change. There is a repo with mesa-git for Fedora (built against LLVM 3.8). It could be a good start.

Just to be 110% sure: I put a completely new, top-quality 650 W power supply into the machine, and the problem persists with the F4 map WebGL demo.

As an update/additional info: the problem persists on Fedora 25 running a Wayland-based GNOME. I don't know how to determine the driver's version number but I presume it to be the latest released at this time.

(In reply to Jean-François Fortin Tam from comment #14)
> As an update/additional info: the problem persists on Fedora 25 running a
> Wayland-based GNOME. I don't know how to determine the driver's version
> number but I presume it to be the latest released at this time.

Do you get the same dmesg errors? I have Wayland locking up randomly, but dmesg stays clean and I can ssh into the machine and reboot.

Created attachment 128278 [details]
journalctl output at the time of a deadlock on F25

> Do you get the same dmesg errors?
> I have Wayland locking up randomly,
> but dmesg stays clean and I can ssh into the machine and reboot.

Pretty much yeah.
Attached is the crash I have experienced just now, and the computer wasn't doing anything other than sitting around on the desktop and playing music from Rhythmbox... and you can see the usual:

/usr/libexec/gdm-x-session[18145]: radeon: Failed to deallocate virtual address for buffer:
/usr/libexec/gdm-x-session[18145]: radeon: size : 20480 bytes
kernel: radeon 0000:02:00.0: ring 3 stalled for more than 10083msec
kernel: radeon 0000:02:00.0: GPU lockup (current fence id 0x00000000002d46ee last fence id 0x00000000002d4710 on ring 3)
kernel: radeon 0000:02:00.0: failed to get a new IB (-35)
kernel: [drm:radeon_gem_va_ioctl [radeon]] *ERROR* Couldn't update BO_VA (-35)
kernel: radeon 0000:02:00.0: failed to get a new IB (-35)
kernel: radeon 0000:02:00.0: failed to get a new IB (-35)
kernel: [drm:radeon_cs_ioctl [radeon]] *ERROR* Failed to get ib !
kernel: [drm:radeon_gem_va_ioctl [radeon]] *ERROR* Couldn't update BO_VA (-35)
/usr/libexec/gdm-x-session[18145]: radeon: va : 0x1f836000
/usr/libexec/gdm-x-session[18145]: radeon: Failed to deallocate virtual address for buffer:
/usr/libexec/gdm-x-session[18145]: radeon: size : 45056 bytes
/usr/libexec/gdm-x-session[18145]: radeon: va : 0x1f4b0000

I'm unfortunately still seeing this on an up-to-date Fedora 25 with kernel 4.9.6, DRM 2.48.0, LLVM 3.8.1, mesa 13.0.3, xorg-x11-drv-ati 7.7.1 (2016-09-28 git 3fc839ff) etc.

Nicolai, would it help at all to know that I don't recall ever encountering the issue while playing non-fullscreened HTML5 youtube videos in Firefox, but that I can easily encounter it if playing fullscreen or if playing fullscreen videos in Totem (under GNOME Shell, whether Xorg or Wayland session)? This really doesn't seem related to system load, I was looking at "radeontop" just now while playing a fullscreen video (which made it deadlock within a few minutes) and the graphics pipe was barely 20-30% used, and VRAM about 80-90% used but never 100%.

Please attach the current Xorg log file.
What would be the equivalent in the systemd/journalctl world? Apparently Fedora 25 doesn't generate Xorg.log files anymore, the last modification timestamp on that one file is October 10th 2016...

(In reply to Jean-François Fortin Tam from comment #19)
> What would be the equivalent in the systemd/journalctl world? Apparently
> Fedora 25 doesn't generate Xorg.log files anymore, the last modification
> timestamp on that one file is October 10th 2016...

See this page for how to access the Xorg log output on various versions of Fedora: https://fedoraproject.org/wiki/How_to_debug_Xorg_problems

Created attachment 129304 [details]
journalctl output at the time of a deadlock on F25 - X GDM session output only
Hi Alex and thanks for the pointer, here's the output as per those instructions... but the result seems quite useless compared to the full journalctl output (which I'll be attaching as well).
Created attachment 129305 [details]
journalctl output at the time of a deadlock on F25 - take 2
Full journal output at the time of the crash. Exactly the same as before as far as I can tell. If there's any other information I can provide, please tell.
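For anyone triaging similar journals, the lockup signature quoted in this report ("ring N stalled" followed by "GPU lockup (current fence id ... last fence id ...)") can be picked out mechanically. The following is only a minimal sketch: the regular expressions are modelled on the exact log lines quoted in this bug, and the names `Lockup` and `find_lockups` are hypothetical helpers, not part of any radeon or Mesa tooling.

```python
import re
from dataclasses import dataclass
from typing import Iterable, List, Optional

# Patterns modelled on the kernel messages quoted in this report (assumption:
# other kernel versions may word these messages differently).
STALL_RE = re.compile(r"radeon (\S+): ring (\d+) stalled for more than (\d+)msec")
LOCKUP_RE = re.compile(
    r"GPU lockup \(current fence id (0x[0-9a-fA-F]+) "
    r"last fence id (0x[0-9a-fA-F]+) on ring (\d+)\)"
)


@dataclass
class Lockup:
    ring: int
    current_fence: int
    last_fence: int
    stall_ms: Optional[int] = None  # from the preceding "stalled" line, if seen

    @property
    def pending_fences(self) -> int:
        # Number of fences emitted but not yet signalled when the ring hung.
        return self.last_fence - self.current_fence


def find_lockups(lines: Iterable[str]) -> List[Lockup]:
    """Collect one Lockup per 'GPU lockup' line, pairing it with the most
    recent 'stalled' line for the same ring."""
    lockups: List[Lockup] = []
    stall_by_ring = {}
    for line in lines:
        stall = STALL_RE.search(line)
        if stall:
            stall_by_ring[int(stall.group(2))] = int(stall.group(3))
            continue
        hit = LOCKUP_RE.search(line)
        if hit:
            ring = int(hit.group(3))
            lockups.append(
                Lockup(
                    ring=ring,
                    current_fence=int(hit.group(1), 16),
                    last_fence=int(hit.group(2), 16),
                    stall_ms=stall_by_ring.pop(ring, None),
                )
            )
    return lockups
```

Feeding it the lines of a saved journal (for example `find_lockups(open("journal.txt"))`, where `journal.txt` is a hypothetical file name) gives one record per lockup, which makes it easier to compare which ring stalled across several crashes.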
Created attachment 129306 [details]
Xorg log
Xorg.0.log file found in ~/.local/share/xorg as "Xorg.0.log.old"
As you can see it says nothing about the crash. It seems only the global journalctl output caught something.
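Since only the journal captured the lockup here, a small wrapper around `journalctl` may help collect the same two views (kernel messages and the X session output) after rebooting. This is a sketch under assumptions: it relies on the journal being persistent so that the previous boot is still available, and on the X session logging under the `gdm-x-session` process name as seen in the excerpts above. `journal_cmd` and `dump` are hypothetical helper names.

```python
import subprocess
from typing import List, Optional


def journal_cmd(boot: int = -1, comm: Optional[str] = None) -> List[str]:
    """Build a journalctl command line.

    boot=-1 selects the previous boot (the one that locked up), 0 the current
    one. With comm=None the kernel ring buffer is requested (-k); otherwise
    entries are filtered on the process name via the _COMM journal field.
    """
    cmd = ["journalctl", "--no-pager", "-b", str(boot)]
    if comm is None:
        cmd.append("-k")
    else:
        cmd.append(f"_COMM={comm}")
    return cmd


def dump(cmd: List[str]) -> str:
    """Run the command and return its standard output as text."""
    result = subprocess.run(cmd, capture_output=True, text=True, check=False)
    return result.stdout
```

For example, `dump(journal_cmd())` would return the kernel messages of the previous boot, and `dump(journal_cmd(comm="gdm-x-session"))` the X session output. The two are kept as separate commands because `-k` restricts the output to kernel messages, so combining it with a `_COMM` match would return nothing useful.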
Does the test
wget http://www.phoronix-test-suite.com/benchmark-files/GpuTest_Linux_x64_0.7.0.zip
DISPLAY=:0 ./GpuTest /test=fur /fullscreen
reproduce the problem?

Hi Julien, unfortunately with that benchmark I was not able to reproduce it so far (I've had it running for about 9 hours). This might be just "luck" though, as I've sometimes had the issue refuse to reproduce for hours and days, and sometimes the issue would happen right away. As I'm suspecting it's a race condition, I'm thinking it might also be sensitive to the system's software collection at various times of the year (i.e. maybe with one kernel the problem resurfaces more frequently, then another point kernel release changes the stack's timings a bit and the race disappears, rinse & repeat?). I might as well leave the benchmark running in the coming days, but at least you know that it's (probably) not directly due to the system load or the GPU load... as I mentioned in earlier comments, it seems to be quite random.

For some reason, I haven't encountered a random lockup in a month, although I've grown to use my computer in light ways (too scared to play fullscreen videos or use 3D, except composited window managers).

OK, I've got good news... Julien, thanks to the crazy furry donut "torture test" you suggested, I was finally able to pinpoint the real trigger for this bug.

My understanding is that on Radeons (well, at least the Radeon HD 7770), there is an emergency mechanism in the hardware (or firmware/microcode maybe) that activates performance self-throttling when the GPU reaches a critical temperature. Normally, the video driver is supposed to handle this state change gracefully; however, the radeonsi/radeon/amdgpu driver on Linux does not, so the kernel panics because the driver went belly up.

During additional testing today, where I forced my GPU to overheat, I was able to determine that the critical point is the same as on Windows: 113 degrees Celsius. As soon as you go over 112...
boom, dead radeonsi driver + kernel oops (with the same error messages as my previous logs above). Additionally, lm_sensors thinks the temperature has instantly jumped to 511 degrees Celsius (!), and the readings stay stuck at 511 Celsius.

"Duh! Just get better cooling!" might sound like a workaround (just like keeping the case open), but nope, technically, it's still a software/driver issue: the Linux driver should handle such scenarios just as gracefully as the Windows driver. In Windows, breaching the 110-113 degrees Celsius limit results in the video driver simply dropping frames massively, continuing to function at reduced performance (i.e. going from 40-60 fps to 10-15 fps on one of my benchmarks). The system never crashes.

So the bug here, as I understand it, is that the radeonsi driver on Linux does not handle the event where the hardware force-throttles itself.

---------

Contextual notes:

The reason why I only started experiencing this issue in December 2015 (as I've had the GPU since 2012) was that I changed my PC case then, which means a different airflow and cooling behavior... And the reason why it was so hard to get consistent crashes here was that when I was trying to troubleshoot it, I was sometimes doing it with the case closed, sometimes with the case open (when trying with a different power supply unit using a "siamese transplant" across another computer, for example).

If I keep my case open, the card will never reach the critical temperature and so the issue will not happen. I might get a system "freeze" (possibly saying "*ERROR* si_restrict_performance_levels_before_switch failed") after many hours of torture testing, but the symptoms are different (the screen does not turn off, the image stays on with everything frozen, and nothing else appears in the logs) and so I presume that to be a different issue.

About your comment #26, do you get similar logs to those attached? i.e. ring N stalled then gpu softreset then freeze which requires reboot?
Can you try https://bugs.freedesktop.org/show_bug.cgi?id=100712#c6 ?

Hi Julien, sorry I missed the mail notification in the pile. To answer your question:

> About your comment #26, do you get similar logs to those attached?
> i.e. ring N stalled then gpu softreset then freeze which requires reboot?

Yeah I was getting the exact same output as usual (forgot to mention that).

> Can you try https://bugs.freedesktop.org/show_bug.cgi?id=100712#c6 ?

Not easily, as I'd have to wait for that to trickle down into whatever kernel Fedora is packaging and compare versions, and I would need to be able to make my GPU overheat, which is no longer easy since I completely changed the thermal design and ventilation of my case (even under 100% GPU load it stays under 60-70 Celsius now).

Though maybe Andreas or Arek could also try this, if they have a similar issue with an "open air" GPU fan design that exhausts into a not-so-well-ventilated case (instead of a "blower" GPU cooler that directly extracts the hot air)...

-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity. You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/mesa/mesa/issues/1226. |
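For anyone who wants to check whether their card is approaching the thermal trigger identified earlier in this thread, a small watcher along these lines may help. It is a sketch under stated assumptions: the 113 °C critical point and the stuck 511 °C reading are the values reported in this bug for one Radeon HD 7770 and may differ on other cards; the sysfs path follows the generic hwmon layout (`temp1_input` in millidegrees Celsius) and the `card0` name is a guess that may need adjusting. `classify` and `read_temp_c` are hypothetical helper names.

```python
from pathlib import Path
from typing import Optional

CRITICAL_C = 113.0  # critical point reported in this thread (assumption for other GPUs)
BOGUS_C = 511.0     # stuck sensor value reported by lm_sensors after the lockup


def classify(temp_c: float, margin_c: float = 10.0) -> str:
    """Map a temperature in degrees Celsius to a coarse status string."""
    if temp_c >= BOGUS_C:
        return "bogus"     # sensor readout is no longer trustworthy
    if temp_c >= CRITICAL_C:
        return "critical"  # at or past the point where the lockup was observed
    if temp_c >= CRITICAL_C - margin_c:
        return "warning"   # within margin_c degrees of the critical point
    return "ok"


def read_temp_c(card: str = "card0") -> Optional[float]:
    """Read the GPU temperature from hwmon, or return None if unavailable."""
    base = Path("/sys/class/drm") / card / "device" / "hwmon"
    if not base.is_dir():
        return None
    for sensor in sorted(base.glob("hwmon*/temp1_input")):
        try:
            # hwmon reports temperatures in millidegrees Celsius.
            return int(sensor.read_text().strip()) / 1000.0
        except (OSError, ValueError):
            continue
    return None
```

Polling `read_temp_c()` every few seconds during a stress test and logging `classify(...)` alongside the journal timestamps should show whether a lockup coincides with the temperature crossing the critical point, which is the correlation described above.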