Created attachment 105841
GPU crash dump from /sys/class/drm/card0/error

After upgrading to Ubuntu 14.04 (with the Linux 3.13 kernel), I am experiencing frequent graphics failures. They come in several manifestations; I do not know whether they all have a common cause. The problems observed:

- kernel warning on boot and sometimes in other situations (e.g. screen shutoff by the power manager):
  WARNING: CPU: 2 PID: 334 at /build/buildd/linux-3.13.0/drivers/gpu/drm/i915/intel_display.c:851 intel_wait_for_pipe_off+0x19b/0x1b0 [i915]
- (intermittent, infrequent) display does not wake up after going blank
- complete GUI rendering freeze on window move or resize. This one is easy to reproduce and appears to trigger the '[drm] stuck on render ring' error, for which I am opening this bug (as a new bug, as requested in the printk output).

Snippet from syslog:

Sep 6 19:15:04 sb kernel: [ 131.782509] [drm] stuck on render ring
Sep 6 19:15:04 sb kernel: [ 131.782520] [drm] GPU crash dump saved to /sys/class/drm/card0/error
Sep 6 19:15:04 sb kernel: [ 131.782524] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
Sep 6 19:15:04 sb kernel: [ 131.782527] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
Sep 6 19:15:04 sb kernel: [ 131.782529] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
Sep 6 19:15:04 sb kernel: [ 131.782531] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
Sep 6 19:15:04 sb kernel: [ 131.786068] [drm:i915_set_reset_status] *ERROR* render ring hung inside bo (0x6ed1000 ctx 0) at 0x6ed1a44
Sep 6 19:15:10 sb kernel: [ 137.776878] [drm] stuck on render ring
Sep 6 19:15:16 sb kernel: [ 143.779239] [drm] stuck on render ring
Sep 6 19:15:16 sb kernel: [ 143.779299] [drm:i915_set_reset_status] *ERROR* render ring hung inside bo (0x6ed1000 ctx 0) at 0x6ed1a44
Sep 6 19:15:16 sb kernel: [ 143.779304] [drm:i915_context_is_banned] *ERROR* context hanging too fast, declaring banned!

The contents of /sys/class/drm/card0/error are attached.

My hardware: an Intel Core i3 board using the CPU's integrated graphics controller. The monitor is a Dell 2560x1440 px, connected via DisplayPort (with the Intel onboard graphics, only DP can drive this resolution at 60Hz).

00:02.0 VGA compatible controller [0300]: Intel Corporation Core Processor Integrated Graphics Controller [8086:0042] (rev 12)

processor : 3
vendor_id : GenuineIntel
cpu family : 6
model : 37
model name : Intel(R) Core(TM) i3 CPU 530 @ 2.93GHz
stepping : 2
microcode : 0x9
cpu MHz : 1197.000
cache size : 4096 KB
physical id : 0
siblings : 4
core id : 2
cpu cores : 2
apicid : 5
initial apicid : 5
fdiv_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 11
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe nx rdtscp lm constant_tsc arch_perfmon pebs bts xtopology nonstop_tsc aperfmperf pni dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm sse4_1 sse4_2 popcnt lahf_lm arat dtherm tpr_shadow vnmi flexpriority ept vpid
bogomips : 5866.38
clflush size : 64
cache_alignment : 64
address sizes : 36 bits physical, 48 bits virtual

$ uname -a
Linux sb 3.13.0-35-lowlatency #62-Ubuntu SMP PREEMPT Fri Aug 15 02:28:37 UTC 2014 i686 i686 i686 GNU/Linux
Hmm, userspace is randomly scribbling over its own memory causing the GPU to hang. Due to the nature of the scribble, it is not clear what the source is (but it is pixel data being copied).
FYI: I am willing to contribute time to get to the bottom of this. I am not an expert in graphics, but I know my C and x86 assembly and I have reasonable experience with the Linux kernel. The machine is set up with ssh access, so I can get in and debug it even if the graphics go south. Let me know where to start.
Cool, recompile xf86-video-intel with --enable-debug and attach gdb to Xorg over ssh. If it is a bug in the ddx, that will likely capture it in an assert.
OK, the sources were actually in xserver-xorg-video-intel (not xf86-). I downloaded the sources with apt-get source ... and installed the whole bunch of -dev packages needed.

I configured with xvmc, because that is how the package is installed on Ubuntu Trusty:
  ./configure --enable-debug --enable-xvmc --prefix=/usr
Then:
  make install DESTDIR=_dist
Then I copied the _dist/usr/lib files onto /usr/lib (saving the old ones first) and ran:
  sudo reboot
From another machine:
  gdb Xorg $xorg_pid
  (gdb) c
  Continuing.

Now, enable full window resize and drag (I had those disabled to have a somewhat usable desktop). Open a window, grab it and wiggle it around - freeze!

Results:
- No output on the GDB console.
- Same or similar error messages in /var/log/syslog (I saved syslog and the new crash dump; I can post them here if needed).
- After 30-40 seconds I get "*ERROR* context hanging too fast, declaring banned!" in the system log and the GUI seems to come back to life, but not really working.

Anyway, at least I have the sources compiling and I can take a stab at getting familiar with them. I will also get the Linux kernel source and see if I can add some printk()s to find which exact commands from userland trigger the errors. If you have any better ideas, let me know.
(In reply to comment #4)
> OK, the sources were actually in xserver-xorg-video-intel (not xf86-).

As you are doing so well, you could actually use the upstream code: http://cgit.freedesktop.org/xorg/driver/xf86-video-intel/ (more bug fixes, more debugging code).

> Results:
> - No output on the GDB console.
> - Same or similar error messages in /var/log/syslog (I saved syslog and the
> new crash dump, I can post them here if needed).
> - after 30-40 seconds I get "*ERROR* context hanging too fast, declaring
> banned!" in the system log and the GUI seems to come back to life, but not
> really working.
>
> Anyway, at least I have the sources compiling and I can take a stab at
> getting familiar with them. I will also get the Linux kernel source and see
> if I can add some printk()s to see on which exact commands from userland
> trigger the errors. If you have any better ideas, let me know.

Don't worry about instrumenting the kernel, it is already providing you with enough information to incriminate userspace...
> Don't worry about instrumenting the kernel, it is already providing you with
> enough information to incriminate userspace...

OK, though I'm still worried about the kernel warning (which happens before Xorg loads, so it is not caused by user mode).

Anyway, off to living dangerously: I got the very latest sources (at least I think I did: git clone http://cgit.freedesktop.org/xorg/driver/xf86-video-intel/). They compiled OK (with plenty of warnings, mostly about overriding 'const'). I copied the new binaries to /usr/lib (carelessly, while X was running, and crashed Xorg :P).

Same result: nothing in gdb, and the display freezes quickly when dragging a large window around. Same 'stuck on render ring' messages (once before the GPU crash message and then repeated many times after that). This time it did not recover even partially and remained frozen solid (or I did not wait long enough; I rebooted after 10 minutes or so).

I also got the kernel warning in drivers/gpu/drm/i915/intel_display.c (some 7 minutes after the 'freeze'; this could have been because the video went to sleep, not necessarily related to the 'stuck ring' problem).

I saved the new drm/card0/error file (I will attach it, bzip2-ed this time).
Created attachment 106012
new GPU crash dump, video-intel driver from git source
Long time no updates, is this still an issue with current kernel and drivers?
Just updated my Ubuntu with the latest kernel, then pulled the latest Xorg sources from git and built them. Result is exactly the same - quick and easy freeze if you open Blender and try to resize its window (other apps cause the same effect, but Blender does it most quickly and reliably). Switching back to UXA :( I am still willing to invest some time to get to the bottom of this. Let me know what I can do to debug it.
If you really think it is the DDX, compile git://anongit.freedesktop.org/xorg/driver/xf86-video-intel with ./configure --enable-debug and tell me if you hit an assert.
(In reply to Chris Wilson from comment #10)
> If you really think it is the DDX, compile
> git://anongit.freedesktop.org/xorg/driver/xf86-video-intel with ./configure
> --enable-debug and tell me if you hit an assert.

I will give this a try, though I did this on the previous round (when I ran it under GDB, too) and there were no asserts.

NOTE: clarification - 'latest kernel' meant what you get with 'apt-get upgrade' on Ubuntu Studio 14.04 (which is 3.13.0-45-lowlatency as of today). I am not playing with mainline kernels for now, on the assumption that Chris was right in his comments about this problem being in userland. If you think I should get a kernel other than the ones that come with Ubuntu, let me know.
It's still likely to be userland, since we require overwriting of a pending batchbuffer, which is only likely to happen through a stray write from userland. Since all the writes inside the DDX should be asserted to lie within the target bo and pixmap, I really don't expect this to be from the DDX.
On the other hand, 3.14 may not be ideal, as there is one report that that kernel has broken relocations...
Just came up with a new experiment (how did I not think of it earlier??): check whether it is related to the resolution or to the fact that I use DisplayPort.

test1: reduce the resolution to full HD (1920x1080 instead of 2560x1440), trying both DP and HDMI cables.
Result: rock-solid, no failures, no freezes.

test2: test with an HDMI cable, resolution set to 1920x1080 and to 2560x1440 (the latter at 30Hz refresh, because the Intel chip cannot clock this resolution at 60Hz over HDMI).
Result: at full HD it works fine; at the high resolution it freezes just as with DP.

Conclusion: this is clearly related to the larger display area. It could be an integer overflow of some kind, which just does not happen at the regular HD resolution.

No asserts were observed in GDB (I have recent sources compiled with these config options: '--enable-xvmc' '--enable-debug' '--prefix=/usr').
To narrow this down a bit further: I got the opportunity to test with a different Intel CPU (also a Core i3). lspci says:

00:02.0 VGA compatible controller [0300]: Intel Corporation 3rd Gen Core processor Graphics Controller [8086:0166] (rev 09)

On this hardware, the large screen size (2560x1440, connected via HDMI at 30Hz refresh rate) works without problems.
*** Bug 89701 has been marked as a duplicate of this bug. ***
I was the reporter of duplicate bug 89701. I have upgraded to Fedora 22 and I am no longer able to reproduce the problem I had (which previously occurred reliably every time I tried to resize xfce4-terminal). The hardware is all the same, but I guess the software changed a lot, although I am still using xfce. Here is my info now:

Linux xxx 4.0.6-300.fc22.x86_64 #1 SMP Tue Jun 23 13:58:53 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux

[ 1.676555] [drm] Initialized drm 1.1.0 20060810
[ 1.837255] [drm] Memory usable by graphics device = 2048M
[ 1.837257] [drm] Replacing VGA console driver
[ 1.838469] [drm] ACPI BIOS requests an excessive sleep of 467631231 ms, using 1500 ms instead
[ 1.838545] [drm] Supports vblank timestamp caching Rev 2 (21.10.2013).
[ 1.838546] [drm] Driver supports precise vblank timestamp query.
[ 1.841135] [drm] Initialized i915 1.6.0 20150130 for 0000:00:02.0 on minor 0
[ 1.865832] fbcon: inteldrmfb (fb0) is primary device
[ 1.909942] i915 0000:00:02.0: fb0: inteldrmfb frame buffer device
Is this still an issue? Can it be closed?
Did 'git pull' (commit e4ef6e9e5b2c8b637356621c60b28d064d40d29c) and tested again, running with the stock Ubuntu 14.04 kernel 3.13.0-85-lowlatency. So far it is working fine in SNA acceleration mode. The newly built driver passes the 'blender resize' test (running Blender and quickly resizing its window repeatedly).

FYI: my previous test (when it was failing) was with commit bc8c161, quite a while ago.

I still have the "pipe_off wait timed out" kernel warning during startup, though this was said not to be related to the present bug.

NOTE: the Ubuntu package was also updated in the meantime, but I have not yet tested whether it works in SNA mode. I will test it and post a separate update here.

In any case, I will leave my workstation in SNA acceleration mode with the manually built driver from git (https://cgit.freedesktop.org/xorg/driver/xf86-video-intel/) to see how it behaves. I put plenty of rendering load on it, so it should flush out any problems quickly.
(continued from comment #19)
Re-tested also with Ubuntu's own xserver-xorg-video-intel package (version 2:2.99.910-0 i386); it seems to work in SNA mode, too. (I will leave my desktop running with the new version from git, so I can test it.)

Let me know if you want me to do any additional checks (e.g., combinations with older kernels, previous git commits, etc.) to find the change that fixed the problem.
After less than a day, it failed. Unfortunately I forgot that the crash dump needs to be saved from /sys/class/drm/card0/error and rebooted the machine. I will try to catch it one more time and save a dump. Here's the kernel log:

May 2 15:21:48 sb kernel: [63357.870540] [drm] stuck on render ring
May 2 15:21:48 sb kernel: [63357.870548] [drm] GPU crash dump saved to /sys/class/drm/card0/error
May 2 15:21:48 sb kernel: [63357.870552] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
May 2 15:21:48 sb kernel: [63357.870554] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
May 2 15:21:48 sb kernel: [63357.870556] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
May 2 15:21:48 sb kernel: [63357.870559] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
May 2 15:21:48 sb kernel: [63357.873992] [drm:i915_set_reset_status] *ERROR* render ring hung inside bo (0xcdca000 ctx 0) at 0xcdca584
May 2 15:21:54 sb kernel: [63363.860931] [drm] stuck on render ring
May 2 15:22:00 sb kernel: [63369.871256] [drm] stuck on render ring
May 2 15:22:06 sb kernel: [63375.869617] [drm] stuck on render ring
May 2 15:22:06 sb kernel: [63376.180611] ------------[ cut here ]------------
May 2 15:22:06 sb kernel: [63376.180669] WARNING: CPU: 1 PID: 1838 at /build/linux-zdaXYD/linux-3.13.0/drivers/gpu/drm/i915/intel_display.c:851 intel_wait_for_pipe_off+0x19b/0x1b0 [i915]()
May 2 15:22:06 sb kernel: [63376.180672] pipe_off wait timed out
May 2 15:22:06 sb kernel: [63376.180675] Modules linked in: dccp_diag dccp tcp_diag udp_diag inet_diag cuse xt_tcpudp nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack iptable_filter ip_tables x_tables rfcomm bnep usblp snd_hda_codec_hdmi snd_hda_codec_realtek uvcvideo videobuf2_vmalloc videobuf2_memops snd_usb_audio videobuf2_core videodev snd_usbmidi_lib gpio_ich snd_hda_intel snd_hda_codec snd_hwdep snd_pcm snd_page_alloc snd_seq_midi snd_seq_midi_event intel_powerclamp snd_rawmidi coretemp kvm_intel kvm i915 snd_seq snd_seq_device btusb snd_timer serio_raw bluetooth snd mei_me video lpc_ich drm_kms_helper drm soundcore mei parport_pc i2c_algo_bit mac_hid ppdev lp parport hid_a4tech usbhid hid raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq raid1 ahci raid0 e1000 libahci multipath linear
May 2 15:22:06 sb kernel: [63376.180758] CPU: 1 PID: 1838 Comm: Xorg Tainted: G W 3.13.0-85-lowlatency #129-Ubuntu
May 2 15:22:06 sb kernel: [63376.180761] Hardware name: Gigabyte Technology Co., Ltd. H55M-UD2H/H55M-UD2H, BIOS F8 02/12/2010
May 2 15:22:06 sb kernel: [63376.180764] 00203286 00203286 edb1fcec c1661655 edb1fd2c f8f26e94 edb1fd1c c105947e
May 2 15:22:06 sb kernel: [63376.180774] f8f31008 edb1fd48 0000072e f8f26e94 00000353 f8ed42ab f8ed42ab eda7c000
May 2 15:22:06 sb kernel: [63376.180783] 00070008 03c2ba42 edb1fd34 c10594d3 00000009 edb1fd2c f8f31008 edb1fd48
May 2 15:22:06 sb kernel: [63376.180792] Call Trace:
May 2 15:22:06 sb kernel: [63376.180802] [<c1661655>] dump_stack+0x58/0x72
May 2 15:22:06 sb kernel: [63376.180812] [<c105947e>] warn_slowpath_common+0x7e/0xa0
May 2 15:22:06 sb kernel: [63376.180850] [<f8ed42ab>] ? intel_wait_for_pipe_off+0x19b/0x1b0 [i915]
May 2 15:22:06 sb kernel: [63376.180886] [<f8ed42ab>] ? intel_wait_for_pipe_off+0x19b/0x1b0 [i915]
May 2 15:22:06 sb kernel: [63376.180892] [<c10594d3>] warn_slowpath_fmt+0x33/0x40
May 2 15:22:06 sb kernel: [63376.180929] [<f8ed42ab>] intel_wait_for_pipe_off+0x19b/0x1b0 [i915]
May 2 15:22:06 sb kernel: [63376.180966] [<f8ed434e>] intel_disable_pipe+0x8e/0xa0 [i915]
May 2 15:22:06 sb kernel: [63376.181002] [<f8ed5a29>] ironlake_crtc_disable+0xb9/0x850 [i915]
May 2 15:22:06 sb kernel: [63376.181010] [<c166664c>] ? __mutex_lock_slowpath+0x16c/0x1a9
May 2 15:22:06 sb kernel: [63376.181046] [<f8edb03e>] intel_crtc_update_dpms+0x5e/0x90 [i915]
May 2 15:22:06 sb kernel: [63376.181083] [<f8edf021>] intel_connector_dpms+0x51/0x60 [i915]
May 2 15:22:06 sb kernel: [63376.181114] [<f89c884a>] drm_mode_obj_set_property_ioctl+0x39a/0x3c0 [drm]
May 2 15:22:06 sb kernel: [63376.181143] [<f89c8870>] ? drm_mode_obj_set_property_ioctl+0x3c0/0x3c0 [drm]
May 2 15:22:06 sb kernel: [63376.181171] [<f89c88a3>] drm_mode_connector_property_set_ioctl+0x33/0x40 [drm]
May 2 15:22:06 sb kernel: [63376.181193] [<f89b9792>] drm_ioctl+0x472/0x500 [drm]
May 2 15:22:06 sb kernel: [63376.181225] [<f89c8870>] ? drm_mode_obj_set_property_ioctl+0x3c0/0x3c0 [drm]
May 2 15:22:06 sb kernel: [63376.181233] [<c1308e10>] ? lockref_put_or_lock+0x20/0x30
May 2 15:22:06 sb kernel: [63376.181240] [<c119b616>] ? mntput_no_expire+0x26/0x140
May 2 15:22:06 sb kernel: [63376.181245] [<c11b9fa3>] ? fsnotify_put_event+0x53/0x90
May 2 15:22:06 sb kernel: [63376.181249] [<c11b9fa3>] ? fsnotify_put_event+0x53/0x90
May 2 15:22:06 sb kernel: [63376.181254] [<c11b9c48>] ? fsnotify+0x208/0x2d0
May 2 15:22:06 sb kernel: [63376.181276] [<f89b9320>] ? drm_free_buffer+0x30/0x30 [drm]
May 2 15:22:06 sb kernel: [63376.181281] [<c11903f2>] do_vfs_ioctl+0x2e2/0x4d0
May 2 15:22:06 sb kernel: [63376.181287] [<c1181481>] ? __sb_end_write+0x31/0x70
May 2 15:22:06 sb kernel: [63376.181292] [<c117f9d5>] ? vfs_write+0x165/0x1b0
May 2 15:22:06 sb kernel: [63376.181297] [<c119921f>] ? fget_light+0x6f/0xc0
May 2 15:22:06 sb kernel: [63376.181302] [<c1190640>] SyS_ioctl+0x60/0x80
May 2 15:22:06 sb kernel: [63376.181308] [<c166e64d>] sysenter_do_call+0x12/0x12
May 2 15:22:06 sb kernel: [63376.181312] ---[ end trace 1f5f8283d766d1aa ]---
May 2 15:22:15 sb kernel: [63384.863231] [drm] stuck on render ring
May 2 15:22:17 sb dbus[752]: [system] Activating service name='org.freedesktop.ConsoleKit' (using servicehelper)
May 2 15:22:17 sb dbus[752]: [system] Successfully activated service 'org.freedesktop.ConsoleKit'
May 2 15:22:21 sb kernel: [63390.861643] [drm] stuck on render ring
May 2 15:22:27 sb kernel: [63396.852028] [drm] stuck on render ring
Created attachment 124047
Another GPU crash, with a recent driver build.

The matching system log snippet is in the following attachment.
Created attachment 124050
The system log matching attachment #124047

In case this helps, here is a summary of the "unusual" things about the system (already noted in previous posts) that might be relevant, because the bug looks like an integer overflow problem:
- running a 32-bit build (both kernel and userland)
- the display size is 2560x1440 (the problem DOES NOT occur on the exact same hardware and drivers at the standard full-HD size of 1920x1080)
There seems to be a point where the fix was working and validated by the submitter, and also by comment 17. I will set this bug as resolved; however, if you encounter the problem again with a recent kernel, please upload new logs.