I've opened bugzilla for a problem discussed on intel-gfx list. My hardware T61, GMA965, I'm running SNA intel driver (current git hash xf86-video-intel: 8ea4ba081de0206351394481f54dcbe6922a085b) I've figured very easy way to trigger the problem - run Xorg in valgrind and just run xterm & firefox inside such slowed Xorg server - after few mouse moves on touchpad I usually get a quick GPU hang report. I'll attach more logs from system if that will help. It's probably worth to note here that I've tried remote drm branch: [remote "drm-intel-next-queued"] url = git://people.freedesktop.org/~danvet/drm-intel fetch = +refs/heads/*:refs/remotes/drm-intel-next-queued/* git hash: 98b6bd998ae057611d2bc040ace1fc847f575b84 and with this kernel 3.4.0-rc6-00441-g98b6bd9 I've NOT been able to reproduce this issue so easily anymore - so it seems like this drm branch has some fix inside. This is report hang with vanilla 3.4 kernel: [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung [drm] capturing error event; look for more information in /debug/dri/0/i915_error_state [drm:init_ring_common] *ERROR* render ring initialization failed ctl 00000000 head 00000000 tail 00000000 start 00000000 [drm] Changing LVDS panel from (+hsync, +vsync) to (-hsync, -vsync) ------------[ cut here ]------------ WARNING: at drivers/gpu/drm/i915/intel_display.c:807 intel_enable_pipe+0x12c/0x150 [i915]() Hardware name: 6464CTO PLL state assertion failure (expected on, current off) Modules linked in: i915 i2c_algo_bit drm_kms_helper drm ip6_tables ebtable_nat ebtables ipt_MASQUERADE iptable_nat nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack ipt_REJECT iptable_mangle xt_tcpudp iptable_filter ip_tables x_tables bridge stp llc ipv6 snd_hda_codec_analog arc4 iwl3945 iwlegacy mac80211 microcode psmouse serio_raw cfg80211 snd_hda_intel i2c_i801 i2c_core iTCO_wdt iTCO_vendor_support snd_hda_codec snd_pcm snd_page_alloc snd_timer e1000e wmi thinkpad_acpi nvram snd soundcore binfmt_misc evdev loop vhost_net tun kvm_intel kvm nfsd exportfs nfs_acl auth_rpcgss lockd sunrpc autofs4 pcmcia sr_mod cdrom sdhci_pci sdhci mmc_core yenta_socket ehci_hcd uhci_hcd usbcore usb_common video backlight Pid: 2281, comm: memcheck-amd64- Not tainted 3.4.0-00006-gf4267bb #68 Call Trace: warn_slowpath_common+0x7f/0xc0 warn_slowpath_fmt+0x46/0x50 intel_enable_pipe+0x12c/0x150 [i915] i9xx_crtc_mode_set+0x8b5/0xe70 [i915] ? _raw_spin_unlock_irqrestore+0x42/0x80 intel_crtc_mode_set+0x71/0xb0 [i915] drm_crtc_helper_set_mode+0x2cd/0x4b0 [drm_kms_helper] drm_crtc_helper_set_config+0xa32/0xb50 [drm_kms_helper] ? get_parent_ip+0x11/0x50 drm_fb_helper_restore_fbdev_mode+0x3d/0x60 [drm_kms_helper] intel_fb_restore_mode+0x39/0xb0 [i915] i915_driver_lastclose+0x30/0x60 [i915] drm_lastclose+0x4a/0x320 [drm] drm_release+0x582/0x790 [drm] fput+0x12c/0x270 filp_close+0x69/0xa0 put_files_struct+0x125/0x410 ? put_files_struct+0x38/0x410 exit_files+0x52/0x60 do_exit+0x178/0xa90 do_group_exit+0x4c/0xc0 get_signal_to_deliver+0x2be/0x860 ? sub_preempt_count+0x79/0xd0 do_signal+0x68/0x950 ? sysret_check+0x1b/0x56 ? sys_kill+0x83/0x300 ? might_fault+0x53/0xb0 ? sysret_signal+0x5/0x47 do_notify_resume+0x65/0x80 ? trace_hardirqs_on_thunk+0x3a/0x3f int_signal+0x12/0x17
Created attachment 62139 [details] grep -r /sys/kernel/debug/dri/0 I've just removed repeated zeros to make it shorter
Created attachment 62140 [details] intel_gpu_dump
Created attachment 62141 [details] intel_reg_dumper
Next time please just attach an unmollified i915_error_state.
So which kernel is this?
Created attachment 62143 [details] Unmodified error state
(In reply to comment #5) > So which kernel is this? GPU hang happens with upstream kernels (report is from 3.4 with my 6 unrelated minor patches)
And I also need to know if you are using any GL applications and if so which version of mesa?
(In reply to comment #8) > And I also need to know if you are using any GL applications and if so which > version of mesa? The only running application has been xterm and firefox started from xterm. (.xinitrc has only exec xterm) No other application has been running (i.e. no window manager) I'm not aware if firefox is using mesa for it's windows html rendering, but it seemed to me the problem is more related to the mouse movement. It's always triggered when I'm moving mouse via touchpad in the firefox window. The rendered page in firefox was just some basic welcome text. Here is the list mesa libs on my system: mesa-dri-drivers-8.1-0.5.fc18.x86_64 mesa-libGLU-8.1-0.5.fc18.i686 mesa-debuginfo-8.1-0.5.fc18.x86_64 mesa-libGL-8.1-0.5.fc18.i686 mesa-libglapi-8.1-0.5.fc18.i686 mesa-libGLU-8.1-0.5.fc18.x86_64 mesa-libGLU-devel-8.1-0.5.fc18.x86_64 mesa-libEGL-8.1-0.5.fc18.i686 mesa-libEGL-8.1-0.5.fc18.x86_64 mesa-libGL-devel-8.1-0.5.fc18.x86_64 mesa-libEGL-devel-8.1-0.5.fc18.x86_64 mesa-libGLES-8.1-0.5.fc18.i686 mesa-dri-filesystem-8.1-0.5.fc18.x86_64 mesa-libgbm-8.1-0.5.fc18.i686 mesa-libGL-8.1-0.5.fc18.x86_64 mesa-libGLES-8.1-0.5.fc18.x86_64 mesa-libgbm-8.1-0.5.fc18.x86_64 mesa-dri-filesystem-8.1-0.5.fc18.i686 mesa-dri-drivers-8.1-0.5.fc18.i686 mesa-libglapi-8.1-0.5.fc18.x86_64
Just going through the known bugs that trigger similar error states...
Could be completely unrelated - but it seem there could be correlation between hang, and my mouse movement - since it seems that this is one of moments, where valgrind gives last message if it doesn't directly coredump (as posted in the list). Syscall param ioctl(generic) points to uninitialised byte(s) at 0x6B06317: ioctl (syscall-template.S:81) by 0x8775807: drmIoctl (xf86drm.c:167) by 0x8779138: drmModeSetCursor (xf86drmMode.c:57) by 0x4B6A1C: xf86_set_cursor_position (xf86Cursors.c:308) by 0x4C0CBF: xf86SetCursor (xf86HWCurs.c:155) by 0x4BFA23: xf86CursorSetCursor (xf86Cursor.c:365) by 0x55AB39: miPointerUpdateSprite (mipointer.c:450) by 0x55ADEC: miPointerDisplayCursor (mipointer.c:201) by 0x4CD922: CursorDisplayCursor (cursor.c:166) by 0x4FD796: AnimCurDisplayCursor (animcur.c:226) by 0x43BBB8: ChangeToCursor (events.c:930) by 0x43F002: CheckMotion (events.c:2968) by 0x43F129: WindowsRestructured (events.c:2985) by 0x45DF2D: MapWindow (window.c:2610) by 0x42EC4F: ProcMapWindow (dispatch.c:793) by 0x4343B9: Dispatch (dispatch.c:425) by 0x423404: main (main.c:288) Location 0x7fefffdc8 is 0 bytes inside arg.x, declared at xf86drmMode.c:389, in frame #2 of thread 1 Uninitialised value was created by a stack allocation at 0x8BEE150: sna_crtc_show_cursor (sna_display.c:794) I've no idea how does works setting mouse position on hardware level, but couldn't there be some hidden async bug?
Valgrind isn't telling you anything useful here except that it doesn't understand the layout of the ioctl and so warning that two unused members of that struct are left uninitialised.
(In reply to comment #12) > Valgrind isn't telling you anything useful here except that it doesn't > understand the layout of the ioctl and so warning that two unused members of > that struct are left uninitialized. Yes - that's clear, I've not meant those uninitialized members - but rather the fact, that it seemed that log got frozen in the moment it's been manipulating with mouse. However I've also tried to play with this more and noticed that even the usage of valgrind is not a reliable way to make it freeze quickly :( I've got an idea to check whether there is some patch between drm-next and vanilla which makes it working - and noticed I couldn't say when it's OK, since even with vanilla I get sometimes quite long run without GPU hang :(. (So it's actually quite well possible, that even drm-next will hang, I've just not played long enough to trigger this problem.)
I've noticed you made a new upstream commit to help with valgrind usage. Anyway - here are few interesting bits - which might be actually problems of Xorg server and not really intel sna driver (so may deserve a new separate bugzilla) When I'll have time - I'll look on them more closely - meanwhile - someone else might get more clue about them. What is interesting about them is the 'Crtc' part of the function name - which is also a part of my kernel GPU hung bug description. So maybe there could be something related together? Source code in use (to match source lines) is: sna driver: 55e6f5f220401318529e81f7c96fe0af3b893a0c xorg-x11-server-Xorg-1.12.1-2.fc18.x86_64 They appear in log when I switch between console and X server. Invalid read of size 1 at 0x4C2C162: __GI_strlen (mc_replace_strmem.c:391) by 0x6AA23A5: strdup (strdup.c:41) by 0x46C26D: XNFstrdup (utils.c:1096) by 0x4B9988: xf86DuplicateMode (xf86Modes.c:209) by 0x4B31DE: xf86CrtcSetModeTransform (xf86Crtc.c:267) by 0x4B3AF3: xf86SetDesiredModes (xf86Crtc.c:2693) by 0x8BF3B89: sna_enter_vt (sna_driver.c:985) by 0x4BCF0A: xf86RandR12EnterVT (xf86RandR12.c:1737) by 0x494A20: xf86XVEnterVT (xf86xv.c:1311) by 0x813B44B: glxDRIEnterVT (glxdri2.c:601) by 0x480CEF: xf86Wakeup (xf86Events.c:527) by 0x43836A: WakeupHandler (dixutils.c:421) by 0x462413: WaitForSomething (WaitFor.c:224) by 0x4340F0: Dispatch (dispatch.c:357) by 0x423494: main (main.c:288) Address 0x775d530 is 0 bytes inside a block of size 10 free'd at 0x4C2A9AE: free (vg_replace_malloc.c:427) by 0x4921C4: xf86DeleteMode (xf86Mode.c:2004) by 0x4B2237: xf86ProbeOutputModes (xf86Crtc.c:1538) by 0x4BBC1B: xf86RandR12GetInfo12 (xf86RandR12.c:1517) by 0x4E8051: RRGetInfo (rrinfo.c:195) by 0x494A20: xf86XVEnterVT (xf86xv.c:1311) by 0x813B44B: glxDRIEnterVT (glxdri2.c:601) by 0x480CEF: xf86Wakeup (xf86Events.c:527) by 0x43836A: WakeupHandler (dixutils.c:421) by 0x462413: WaitForSomething (WaitFor.c:224) by 0x4340F0: Dispatch (dispatch.c:357) by 0x423494: main (main.c:288) Invalid read of size 1 at 0x4C2C174: __GI_strlen (mc_replace_strmem.c:391) by 0x6AA23A5: strdup (strdup.c:41) by 0x46C26D: XNFstrdup (utils.c:1096) by 0x4B9988: xf86DuplicateMode (xf86Modes.c:209) by 0x4B31DE: xf86CrtcSetModeTransform (xf86Crtc.c:267) by 0x4B3AF3: xf86SetDesiredModes (xf86Crtc.c:2693) by 0x8BF3B89: sna_enter_vt (sna_driver.c:985) by 0x4BCF0A: xf86RandR12EnterVT (xf86RandR12.c:1737) by 0x494A20: xf86XVEnterVT (xf86xv.c:1311) by 0x813B44B: glxDRIEnterVT (glxdri2.c:601) by 0x480CEF: xf86Wakeup (xf86Events.c:527) by 0x43836A: WakeupHandler (dixutils.c:421) by 0x462413: WaitForSomething (WaitFor.c:224) by 0x4340F0: Dispatch (dispatch.c:357) by 0x423494: main (main.c:288) Address 0x775d531 is 1 bytes inside a block of size 10 free'd at 0x4C2A9AE: free (vg_replace_malloc.c:427) by 0x4921C4: xf86DeleteMode (xf86Mode.c:2004) by 0x4B2237: xf86ProbeOutputModes (xf86Crtc.c:1538) by 0x4BBC1B: xf86RandR12GetInfo12 (xf86RandR12.c:1517) by 0x4E8051: RRGetInfo (rrinfo.c:195) by 0x494A20: xf86XVEnterVT (xf86xv.c:1311) by 0x813B44B: glxDRIEnterVT (glxdri2.c:601) by 0x480CEF: xf86Wakeup (xf86Events.c:527) by 0x43836A: WakeupHandler (dixutils.c:421) by 0x462413: WaitForSomething (WaitFor.c:224) by 0x4340F0: Dispatch (dispatch.c:357) by 0x423494: main (main.c:288) Invalid read of size 1 at 0x6AAAEAE: __GI_memcpy (memcpy.S:71) by 0x46C26D: XNFstrdup (utils.c:1096) by 0x4B9988: xf86DuplicateMode (xf86Modes.c:209) by 0x4B31DE: xf86CrtcSetModeTransform (xf86Crtc.c:267) by 0x4B3AF3: xf86SetDesiredModes (xf86Crtc.c:2693) by 0x8BF3B89: sna_enter_vt (sna_driver.c:985) by 0x4BCF0A: xf86RandR12EnterVT (xf86RandR12.c:1737) by 0x494A20: xf86XVEnterVT (xf86xv.c:1311) by 0x813B44B: glxDRIEnterVT (glxdri2.c:601) by 0x480CEF: xf86Wakeup (xf86Events.c:527) by 0x43836A: WakeupHandler (dixutils.c:421) by 0x462413: WaitForSomething (WaitFor.c:224) by 0x4340F0: Dispatch (dispatch.c:357) by 0x423494: main (main.c:288) Address 0x775d530 is 0 bytes inside a block of size 10 free'd at 0x4C2A9AE: free (vg_replace_malloc.c:427) by 0x4921C4: xf86DeleteMode (xf86Mode.c:2004) by 0x4B2237: xf86ProbeOutputModes (xf86Crtc.c:1538) by 0x4BBC1B: xf86RandR12GetInfo12 (xf86RandR12.c:1517) by 0x4E8051: RRGetInfo (rrinfo.c:195) by 0x494A20: xf86XVEnterVT (xf86xv.c:1311) by 0x813B44B: glxDRIEnterVT (glxdri2.c:601) by 0x480CEF: xf86Wakeup (xf86Events.c:527) by 0x43836A: WakeupHandler (dixutils.c:421) by 0x462413: WaitForSomething (WaitFor.c:224) by 0x4340F0: Dispatch (dispatch.c:357) by 0x423494: main (main.c:288) Invalid read of size 2 at 0x6AAAEBE: __GI_memcpy (memcpy.S:83) by 0x46C26D: XNFstrdup (utils.c:1096) by 0x4B9988: xf86DuplicateMode (xf86Modes.c:209) by 0x4B31DE: xf86CrtcSetModeTransform (xf86Crtc.c:267) by 0x4B3AF3: xf86SetDesiredModes (xf86Crtc.c:2693) by 0x8BF3B89: sna_enter_vt (sna_driver.c:985) by 0x4BCF0A: xf86RandR12EnterVT (xf86RandR12.c:1737) by 0x494A20: xf86XVEnterVT (xf86xv.c:1311) by 0x813B44B: glxDRIEnterVT (glxdri2.c:601) by 0x480CEF: xf86Wakeup (xf86Events.c:527) by 0x43836A: WakeupHandler (dixutils.c:421) by 0x462413: WaitForSomething (WaitFor.c:224) by 0x4340F0: Dispatch (dispatch.c:357) by 0x423494: main (main.c:288) Address 0x775d531 is 1 bytes inside a block of size 10 free'd at 0x4C2A9AE: free (vg_replace_malloc.c:427) by 0x4921C4: xf86DeleteMode (xf86Mode.c:2004) by 0x4B2237: xf86ProbeOutputModes (xf86Crtc.c:1538) by 0x4BBC1B: xf86RandR12GetInfo12 (xf86RandR12.c:1517) by 0x4E8051: RRGetInfo (rrinfo.c:195) by 0x494A20: xf86XVEnterVT (xf86xv.c:1311) by 0x813B44B: glxDRIEnterVT (glxdri2.c:601) by 0x480CEF: xf86Wakeup (xf86Events.c:527) by 0x43836A: WakeupHandler (dixutils.c:421) by 0x462413: WaitForSomething (WaitFor.c:224) by 0x4340F0: Dispatch (dispatch.c:357) by 0x423494: main (main.c:288) Invalid read of size 8 at 0x6AAAEE6: __GI_memcpy (memcpy.S:107) by 0x46C26D: XNFstrdup (utils.c:1096) by 0x4B9988: xf86DuplicateMode (xf86Modes.c:209) by 0x4B31DE: xf86CrtcSetModeTransform (xf86Crtc.c:267) by 0x4B3AF3: xf86SetDesiredModes (xf86Crtc.c:2693) by 0x8BF3B89: sna_enter_vt (sna_driver.c:985) by 0x4BCF0A: xf86RandR12EnterVT (xf86RandR12.c:1737) by 0x494A20: xf86XVEnterVT (xf86xv.c:1311) by 0x813B44B: glxDRIEnterVT (glxdri2.c:601) by 0x480CEF: xf86Wakeup (xf86Events.c:527) by 0x43836A: WakeupHandler (dixutils.c:421) by 0x462413: WaitForSomething (WaitFor.c:224) by 0x4340F0: Dispatch (dispatch.c:357) by 0x423494: main (main.c:288) Address 0x775d533 is 3 bytes inside a block of size 10 free'd at 0x4C2A9AE: free (vg_replace_malloc.c:427) by 0x4921C4: xf86DeleteMode (xf86Mode.c:2004) by 0x4B2237: xf86ProbeOutputModes (xf86Crtc.c:1538) by 0x4BBC1B: xf86RandR12GetInfo12 (xf86RandR12.c:1517) by 0x4E8051: RRGetInfo (rrinfo.c:195) by 0x494A20: xf86XVEnterVT (xf86xv.c:1311) by 0x813B44B: glxDRIEnterVT (glxdri2.c:601) by 0x480CEF: xf86Wakeup (xf86Events.c:527) by 0x43836A: WakeupHandler (dixutils.c:421) by 0x462413: WaitForSomething (WaitFor.c:224) by 0x4340F0: Dispatch (dispatch.c:357) by 0x423494: main (main.c:288) Invalid read of size 1 at 0x4C2C331: strncpy (mc_replace_strmem.c:463) by 0x8BF3621: sna_crtc_set_mode_major (sna_display.c:393) by 0x4B32BC: xf86CrtcSetModeTransform (xf86Crtc.c:294) by 0x4B3AF3: xf86SetDesiredModes (xf86Crtc.c:2693) by 0x8BF3B89: sna_enter_vt (sna_driver.c:985) by 0x4BCF0A: xf86RandR12EnterVT (xf86RandR12.c:1737) by 0x494A20: xf86XVEnterVT (xf86xv.c:1311) by 0x813B44B: glxDRIEnterVT (glxdri2.c:601) by 0x480CEF: xf86Wakeup (xf86Events.c:527) by 0x43836A: WakeupHandler (dixutils.c:421) by 0x462413: WaitForSomething (WaitFor.c:224) by 0x4340F0: Dispatch (dispatch.c:357) by 0x423494: main (main.c:288) Address 0x775d530 is 0 bytes inside a block of size 10 free'd at 0x4C2A9AE: free (vg_replace_malloc.c:427) by 0x4921C4: xf86DeleteMode (xf86Mode.c:2004) by 0x4B2237: xf86ProbeOutputModes (xf86Crtc.c:1538) by 0x4BBC1B: xf86RandR12GetInfo12 (xf86RandR12.c:1517) by 0x4E8051: RRGetInfo (rrinfo.c:195) by 0x494A20: xf86XVEnterVT (xf86xv.c:1311) by 0x813B44B: glxDRIEnterVT (glxdri2.c:601) by 0x480CEF: xf86Wakeup (xf86Events.c:527) by 0x43836A: WakeupHandler (dixutils.c:421) by 0x462413: WaitForSomething (WaitFor.c:224) by 0x4340F0: Dispatch (dispatch.c:357) by 0x423494: main (main.c:288) Invalid read of size 1 at 0x4C2C350: strncpy (mc_replace_strmem.c:463) by 0x8BF3621: sna_crtc_set_mode_major (sna_display.c:393) by 0x4B32BC: xf86CrtcSetModeTransform (xf86Crtc.c:294) by 0x4B3AF3: xf86SetDesiredModes (xf86Crtc.c:2693) by 0x8BF3B89: sna_enter_vt (sna_driver.c:985) by 0x4BCF0A: xf86RandR12EnterVT (xf86RandR12.c:1737) by 0x494A20: xf86XVEnterVT (xf86xv.c:1311) by 0x813B44B: glxDRIEnterVT (glxdri2.c:601) by 0x480CEF: xf86Wakeup (xf86Events.c:527) by 0x43836A: WakeupHandler (dixutils.c:421) by 0x462413: WaitForSomething (WaitFor.c:224) by 0x4340F0: Dispatch (dispatch.c:357) by 0x423494: main (main.c:288) Address 0x775d531 is 1 bytes inside a block of size 10 free'd at 0x4C2A9AE: free (vg_replace_malloc.c:427) by 0x4921C4: xf86DeleteMode (xf86Mode.c:2004) by 0x4B2237: xf86ProbeOutputModes (xf86Crtc.c:1538) by 0x4BBC1B: xf86RandR12GetInfo12 (xf86RandR12.c:1517) by 0x4E8051: RRGetInfo (rrinfo.c:195) by 0x494A20: xf86XVEnterVT (xf86xv.c:1311) by 0x813B44B: glxDRIEnterVT (glxdri2.c:601) by 0x480CEF: xf86Wakeup (xf86Events.c:527) by 0x43836A: WakeupHandler (dixutils.c:421) by 0x462413: WaitForSomething (WaitFor.c:224) by 0x4340F0: Dispatch (dispatch.c:357) by 0x423494: main (main.c:288) Invalid read of size 4 at 0x428CAD: FreeCursor (cursor.c:117) by 0x44B89B: FreeGrab (grabs.c:253) by 0x4299DD: CloseDevice (devices.c:932) by 0x42A012: CloseDownDevices (devices.c:996) by 0x4234FB: main (main.c:321) Address 0x76df5a4 is 20 bytes inside a block of size 112 free'd at 0x4C2A9AE: free (vg_replace_malloc.c:427) by 0x428D0E: FreeCursor (cursor.c:126) by 0x455151: doFreeResource (resource.c:549) by 0x455F54: FreeClientResources (resource.c:806) by 0x433869: CloseDownClient (dispatch.c:3379) by 0x434445: Dispatch (dispatch.c:437) by 0x423494: main (main.c:288)
We've chased those valgrind warnings in the past: bug 36108.
(In reply to comment #15) > We've chased those valgrind warnings in the past: bug 36108. Ohh great even my own report ;)
Ok - I could confirm that with this kernel: 3.4.0-rc6-00441-g98b6bd9 - which is supposedly this commit: commit 98b6bd998ae057611d2bc040ace1fc847f575b84 Author: Daniel Vetter <daniel.vetter@ffwll.ch> Date: Sun May 20 20:00:25 2012 +0200 I'm also getting the GPU hang on my machine. So this issue is still present in drm-intel-next-queued for my machine. The only thing I've been doing is - I've been moving firefox window above another firefox window.
Can you attach a recent error-state, useful to confirm it is the same pattern? Also if you have a simple test case, then --enable-debug=full would be useful. Perhaps even #define DEBUG_FLUSH_SYNC 1.
(In reply to comment #18) > Can you attach a recent error-state, useful to confirm it is the same pattern? > Also if you have a simple test case, then --enable-debug=full would be useful. > Perhaps even #define DEBUG_FLUSH_SYNC 1. So first I've just tried to get the full error state log with this drm-next kernel - and it's been apparently quite simple - I've just want to create new bz with some display artifacts - so I wanted to use some minimal environment, and it has pretty quickly locked - so to check for GPU hang simple - then I want to use you build advice However as soon as I try to run such sna driver within my valgrind setup - I get instant coredump with the first mouse move. Process terminating with default action of signal 11 (SIGSEGV): dumping core General Protection Fault at 0x6B25165: __sprintf_chk (sprintf_chk.c:27) by 0x468E7C: LogVWrite (stdio2.h:33) by 0x468BAB: ErrorF (log.c:641) by 0x8C15CAC: sna_crtc_set_cursor_position (sna_display.c:815) by 0x4B6B43: xf86_set_cursor_position (xf86Cursors.c:391) by 0x55A3C2: miPointerMoveNoEvent (mipointer.c:544) by 0x55B3E7: miPointerSetPosition (mipointer.c:625) by 0x448486: positionSprite.part.6 (getevents.c:913) by 0x44895C: fill_pointer_events (getevents.c:1331) by 0x449CD2: GetPointerEvents (getevents.c:1567) by 0x44A19C: QueuePointerEvents (getevents.c:1215) by 0xAEF30E3: EvdevReadInput (evdev.c:1012) by 0x480316: xf86SigioReadInput (xf86Events.c:298) by 0x4A44EF: xf86SIGIO (sigio.c:108) by 0x582254F: ??? (in /usr/lib64/libpthread-2.15.90.so) by 0x6B027DD: __write_nocancel (syscall-template.S:81) by 0x6A925C2: _IO_file_write@@GLIBC_2.2.5 (fileops.c:1286) by 0x6A924A1: new_do_write (fileops.c:540) by 0x6A9307D: _IO_file_xsputn@@GLIBC_2.2.5 (fileops.c:1369) by 0x6A88ABC: fwrite (iofwrite.c:44) by 0x468ED8: LogVWrite (log.c:280) by 0x468BAB: ErrorF (log.c:641) by 0x8C15CAC: sna_crtc_set_cursor_position (sna_display.c:815) by 0x4B6B43: xf86_set_cursor_position (xf86Cursors.c:391) by 0x550E25: mieqProcessInputEvents (mieq.c:628) by 0x480398: ProcessInputEvents (xf86Events.c:164) by 0x4340E1: Dispatch (dispatch.c:353) by 0x423494: main (main.c:288) What looks weird is the rentrancy - it looks like things are being handled via signal in parallel with normal ProcessInputEvents - should there be some protection for debugging? And in fact isn't this issue related to my original problem - since some parts of the code could be executed easily in parallel. Thus drmIoctl(DRM_IOCTL_MODE_CURSOR) might be called by normal task as well as via signal handler ?
Interesting, I've been having a discussion with Dave Airlie about the native of SIGIO handling of cursors and to have an explicit example of badness (albeit due to extra debugging code) is fascinating. In short, the SIGIO has been known to exercise kernel bugs in the past, but this callstack does not directly relate to the GPU hang.
(In reply to comment #20) > Interesting, I've been having a discussion with Dave Airlie about the native of > SIGIO handling of cursors and to have an explicit example of badness (albeit > due to extra debugging code) is fascinating. In short, the SIGIO has been known > to exercise kernel bugs in the past, but this callstack does not directly > relate to the GPU hang. Yes - I'll probably need to modify debugging somehow for this case. But I'm just wondering - it the Signal handler is operating on kgem objects - how is the locking working then ? How do you ensure kgem object used during such handling remains valid ? What should be the result of such operation ? Is the cursor position influencing GPU processing - i.e. you expect to render cursor somewhere - and signal handler modifies the position again - is this safe operation ? Should there be 'kernel queue' for mouse positions.
Posting valgrind options for bug 50477 MALLOC="--malloc-fill=aa" FREE="--free-fill=21" TRACKFD="--track-fds=yes" STACK="--max-stackframe=300000" EXTRAS="--read-var-info=yes" LOGFILE="--log-file=/tmp/valglog.$SUFFIX" TIMESTAMP="--time-stamp=yes" LEAKCHECK="--leak-check=yes" LEAKCHECK="--leak-check=full --track-origins=yes" CALLERS="--num-callers=40" actually reminded me - that '21' is my clear pattern - which could be seen in GPU error status on several places - so it seems like such free memory gets in this GPU buffer. I'll try to change hex number to confirm this...
(In reply to comment #22) > Posting valgrind options for bug 50477 > > MALLOC="--malloc-fill=aa" > FREE="--free-fill=21" > TRACKFD="--track-fds=yes" > STACK="--max-stackframe=300000" > EXTRAS="--read-var-info=yes" > LOGFILE="--log-file=/tmp/valglog.$SUFFIX" > TIMESTAMP="--time-stamp=yes" > LEAKCHECK="--leak-check=yes" > LEAKCHECK="--leak-check=full --track-origins=yes" > CALLERS="--num-callers=40" > > > actually reminded me - that '21' is my clear pattern - which could be seen in > GPU error status on several places - so it seems like such free memory gets in > this GPU buffer. > > I'll try to change hex number to confirm this... Ok, that is worrisome. :)
Created attachment 62268 [details] i915_error_state with changed vg pattern (21 -> EE) As one may see quite a few 'EE' sections could be found there. So it seems that released buffers are GPU queue (which could be fine, if GPU has already executed them - but since they are on multiple places - it looks suspicious.
Created attachment 62270 [details] Coredump of Xserver I'm now getting almost instant GPU hang and Xorg crash when I use Xorg with valgrind - sna driver 740368c4c6eb547adad247ff529d16e594d6459b + Flush patch from Bug 50477 and start xterm + twm + thunderbird. After TB starts it wants to open two windows and I want to position them with twm - and with few mouse moves I got GPU hung - I'm leaving some timeinfo from kernel in trace here - so it could be seen it's quite soon after boot - it doesn't matter if 3.4 or 3.4-rc6-drm-next is in use. 64.668973] drm: registered panic notifier 64.678853] acpi device:02: registered as cooling_device2 64.684729] input: Video Bus as /devices/LNXSYSTM:00/device:00/PNP0A08:00/LNXVIDEO:00/input/input9 64.696343] ACPI: Video Device [VID] (multi-head: yes rom: no post: no) 64.703397] [drm] Initialized i915 1.6.0 20080730 for 0000:00:02.0 on minor 0 187.640080] [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung 187.648131] [drm] capturing error event; look for more information in /debug/dri/0/i915_error_state 187.658803] [drm:init_ring_common] *ERROR* render ring initialization failed ctl 00000000 head 00000000 tail 00000000 start 00000000 187.780290] [drm] Changing LVDS panel from (+hsync, +vsync) to (-hsync, -vsync) 187.787833] ------------[ cut here ]------------ 187.792480] WARNING: at drivers/gpu/drm/i915/intel_display.c:807 intel_enable_pipe+0x12c/0x150 [i915]() 187.802121] Hardware name: 6464CTO 187.805627] PLL state assertion failure (expected on, current off) 187.811928] Modules linked in: i915 i2c_algo_bit drm_kms_helper drm ip6_tables ebtable_nat ebtables ipt_MASQUERADE iptable_nat nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack ipt_REJECT iptable_mangle xt_tcpudp iptable_filter ip_tables x_tables bridge stp llc ipv6 snd_hda_codec_analog usbhid hid arc4 microcode psmouse iwl3945 serio_raw iwlegacy i2c_i801 i2c_core mac80211 snd_hda_intel snd_hda_codec iTCO_wdt snd_pcm iTCO_vendor_support cfg80211 snd_page_alloc snd_timer e1000e wmi thinkpad_acpi nvram snd soundcore binfmt_misc loop vhost_net evdev tun kvm_intel kvm nfsd exportfs nfs_acl auth_rpcgss lockd sunrpc autofs4 pcmcia sr_mod cdrom sdhci_pci sdhci mmc_core yenta_socket ehci_hcd uhci_hcd usbcore usb_common video backlight 187.879457] Pid: 1139, comm: memcheck-amd64- Not tainted 3.4.0-00006-gf4267bb #68 187.886961] Call Trace: 187.889416] [<ffffffff8103b50f>] warn_slowpath_common+0x7f/0xc0 187.895443] [<ffffffff8103b606>] warn_slowpath_fmt+0x46/0x50 187.901222] [<ffffffffa056025c>] intel_enable_pipe+0x12c/0x150 [i915] 187.907781] [<ffffffffa0563c85>] i9xx_crtc_mode_set+0x8b5/0xe70 [i915] 187.914416] [<ffffffff815159e2>] ? _raw_spin_unlock_irqrestore+0x42/0x80 187.921233] [<ffffffffa055d761>] intel_crtc_mode_set+0x71/0xb0 [i915] 187.927787] [<ffffffffa029a01d>] drm_crtc_helper_set_mode+0x2cd/0x4b0 [drm_kms_helper] 187.935806] [<ffffffffa029b262>] drm_crtc_helper_set_config+0xa32/0xb50 [drm_kms_helper] 187.944010] [<ffffffff81078611>] ? get_parent_ip+0x11/0x50 187.949609] [<ffffffffa0297c6d>] drm_fb_helper_restore_fbdev_mode+0x3d/0x60 [drm_kms_helper] 187.958167] [<ffffffffa0577d69>] intel_fb_restore_mode+0x39/0xb0 [i915] 187.964898] [<ffffffffa053e6a0>] i915_driver_lastclose+0x30/0x60 [i915] 187.971633] [<ffffffffa050079a>] drm_lastclose+0x4a/0x320 [drm] 187.977666] [<ffffffffa05010f2>] drm_release+0x582/0x790 [drm] 187.983607] [<ffffffff81177dcc>] fput+0x12c/0x270 187.988422] [<ffffffff81172d29>] filp_close+0x69/0xa0 187.993592] [<ffffffff81040135>] put_files_struct+0x125/0x410 187.999445] [<ffffffff81040048>] ? put_files_struct+0x38/0x410 188.005385] [<ffffffff810404e2>] exit_files+0x52/0x60 188.010551] [<ffffffff81040668>] do_exit+0x178/0xa90 188.015627] [<ffffffff810412dc>] do_group_exit+0x4c/0xc0 188.021053] [<ffffffff81053c1e>] get_signal_to_deliver+0x2be/0x860 188.027340] [<ffffffff815192c9>] ? sub_preempt_count+0x79/0xd0 188.033287] [<ffffffff81002298>] do_signal+0x68/0x950 188.038444] [<ffffffff8151d2fb>] ? sysret_check+0x1b/0x56 188.043948] [<ffffffff81054b53>] ? sys_kill+0x83/0x300 188.049189] [<ffffffff81141b33>] ? might_fault+0x53/0xb0 188.054613] [<ffffffff8151d350>] ? sysret_signal+0x5/0x47 188.060125] [<ffffffff81002c05>] do_notify_resume+0x65/0x80 188.065807] [<ffffffff812f4cfe>] ? trace_hardirqs_on_thunk+0x3a/0x3f 188.072270] [<ffffffff8151d5e2>] int_signal+0x12/0x17 188.077447] ---[ end trace 134029284d004294 ]---
Ah. That looks to be an issue with valgrind writing its free-pattern over a released CPU mapping that is still on the GPU. If you comment out the two VALGRIND_FREELIKE_BLOCK() in src/sna/kgem.c, does the problem go away?
This should do the trick and keep valgrind happy with the unknown GPU mappings. commit 47e2db0ba8ae3fbcdf58cba26c58ec0932b4c90b Author: Chris Wilson <chris@chris-wilson.co.uk> Date: Wed May 30 10:37:22 2012 +0100 sna: Replace FREELIKE_BLOCK with MAKE_MEM_NOACCESS And similarly for MALLOCLIKE_BLOCK. The issue being that valgrind may overwrite such blocks with alloc-fill and free-fill values, but when in fact they are defined and still in use by the GPU. This can lead to the GPU processing garbage, and GPU hangs. Reported-by: Zdenek Kabelac <zdenek.kabelac@gmail.com> Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=50393 Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Ok, with valgrind modification - you have basically eliminated my easy way to trigger GPU hang ;) - but obviously since my report has been originally started without using --enable-debug compiled sna driver it might not be definite fix. Let's see if the combination with other fixes (flush) will help. So far - I've not been able to easily hang GPU like I used to be. Btw now one instance of Xorg made this report: Memcheck: mc_leakcheck.c:1112 (vgMemCheck_detect_memory_leaks): the 'impossible' happened. ==00:00:07:51.006 8850== Block 0x7f9edb9b6000..0x7f9edb9f5fff overlaps with block 0x7f9edb9b7000..0x7f9edb9f6fffThis is usually caused by using VALGRIND_MALLOCLIKE_BLOCKin an inappropriate way. at 0x3804C60F: report_and_quit (m_libcassert.c:210)
/me mutters something about the summary being "hang with valgrind" :-p Both the headers and code claim not to impose any restrictions on those MAKE_MEM requests. Not sure, try with valgrind -v as well.
(In reply to comment #29) > /me mutters something about the summary being "hang with valgrind" :-p > It's been just really easy to reproduce with valgrind :) But the original issue was definitely there without valgrind. Let's just see how often I'll hit it with current sna version.
Looking back at some collected coredumps - here is one which might be worth to check by you. #0 __kgem_flush (kgem=kgem@entry=0x76ed638, bo=0x0) at kgem.c:2953 #1 0x0000000008be715d in sna_accel_do_flush (sna=0x76ed430) at sna_accel.c:11928 #2 sna_accel_block_handler (sna=0x76ed430, tv=0x7fefffed8) at sna_accel.c:12328 #3 0x00000000004381ba in BlockHandler (pTimeout=pTimeout@entry=0x7fefffed8, pReadmask=pReadmask@entry=0x7de540) at dixutils.c:387 #4 0x0000000000462384 in WaitForSomething (pClientsReady=pClientsReady@entry=0xae44580) at WaitFor.c:210 #5 0x00000000004340f1 in Dispatch () at dispatch.c:357 #6 0x0000000000423495 in main (argc=4, argv=0x7ff0002e8, envp=<optimized out>) at main.c:288 I'm not able to get this one anymore - it's been with your artifacts patch in. For some unknown reason bo has been pointing to 0 in !__kgem_flush(&sna->kgem, priv->gpu_bo). Crash happened directly during initialization so I've got only black screen. So is it always safe to assume priv->gpu_bo is not NULL? (since you only check for priv->cpu_damage == NULL)
Once upon a time it checked for priv->gpu_bo precisely because it was possible for the block_handler to fire before the screen was constructed... commit cf5b3e2ebf4ee0330f5421b9377bb512a94ec284 Author: Chris Wilson <chris@chris-wilson.co.uk> Date: Wed May 30 14:44:59 2012 +0100 sna: Don't attempt to flush scanout if unattached This was lost in the midst of the OSTimer overhaul. Reported-by: Zdenek Kabelac <zdenek.kabelac@gmail.com> References: https://bugs.freedesktop.org/show_bug.cgi?id=50393 Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Do you have any insight into the valgrind assertion failure? Does it always fail in the same way? Can you attach debug=full log for such a failure?
Just to be clear, in this bug report I'm focusing on making sure we can run under valgrind. (Which version of valgrind do you have btw?) Can you please file the instability issue separately so that it can be tracked without me further confusing it with the valgrind bug.
I'll try later to play again with valgrind - but so far it looks pretty good - while using still the same vanilla 3.4 kernel - I'm not getting GPU hangs now (at least haven't seen one for couple days now I think) (though now it's replaced with coredumps - but they could be solved more easily :)) My rawhide currently runs: valgrind-3.7.0-4.fc18.x86_64
Created attachment 62798 [details] debug dri log dir
Created attachment 62799 [details] error state captured
Created attachment 62800 [details] dmesg from hang
Created attachment 62801 [details] intel_reg_dumper output
Ok - after some time I've got GPU hang again. It's with current git:adc872a9654dc18d778323ca0721704878ad3851 (So I'm not sure how 'unique' is this - I've been having just very few windows on desktop and typing something in gnome-terminal while screen went black) Is there anything readable from those GPU hangs ? This time no valgrind involved - just relatively freshly booted system.
The batch buffer got clobbered; overwritten by a clear/yellow fill to a tiled surface. The batch itself looks perfectly innocuous. Afaict that would only be possible through a wild BLT command (or through a strange GTT write). Did you have assertions enabled at the time? And this was a plain uncomposited Xsession?
Two things that spring to mind that may help. --enable-debug should spot any out-of-bound writes, and #define DEBUG_FLUSH_SYNC 1 would restrict it to only one batch buffer in flight at a time. That would help define the problem.
(In reply to comment #41) > The batch buffer got clobbered; overwritten by a clear/yellow fill to a tiled > surface. The batch itself looks perfectly innocuous. Afaict that would only be > possible through a wild BLT command (or through a strange GTT write). Did you > have assertions enabled at the time? And this was a plain uncomposited > Xsession? Compiled with this options (and using SNA) --enable-sna --enable-uxa --enable-debug Using openbox uncomposited WM. BTW my primary suspect is still the SIGIO handling. The lock always happened to me when I type on keyboard and move mouse. So could not setting the mouse position in the signal interrupt do something unexpected ? I'll try to run for a while now with --enable-debug=full - but it generates tons of logs and makes the machine not really usable. (It would be probably nice to have some circular log for at most 50MB)
(In reply to comment #43) > (In reply to comment #41) > > The batch buffer got clobbered; overwritten by a clear/yellow fill to a tiled > > surface. The batch itself looks perfectly innocuous. Afaict that would only be > > possible through a wild BLT command (or through a strange GTT write). Did you > > have assertions enabled at the time? And this was a plain uncomposited > > Xsession? > > Compiled with this options (and using SNA) > --enable-sna --enable-uxa --enable-debug Using the Option "AccelMethod" "sna", right? > Using openbox uncomposited WM. Just checking, it didn't look like the mesa bug but... > BTW my primary suspect is still the SIGIO handling. > The lock always happened to me when I type on keyboard and move mouse. Strong correlation, but it could just be the rendering of the text... > So could not setting the mouse position in the signal interrupt do something > unexpected ? No idea, never tried it. I suspect the only reason for the signal handler is fake interactivity on an otherwise hung server. > I'll try to run for a while now with --enable-debug=full - but it generates > tons > of logs and makes the machine not really usable. > (It would be probably nice to have some circular log for at most 50MB) Yeah, I've been meaning to think of a more useful means of capturing live debug logs -- unless you have a good idea of what you are looking for, there is simply too much information.
So I've played for about 30 minutes with full=debug - it has generate 2GB logfile and it has got very slow over the time - is the full debug compilation doing some large list collections that slows things down ? Especially in the firefox it has got so slow with reaction on mouse after 30 minutes that I had to stop this experiment. I've not get any GPU hung nor coredump for flash playback - thus it's really not easy to catch in debug mode - probably the condition to trigger hang could be even harder to reach in this slow mode.
For testing purposed, --enable-debug [i.e. not --enable-debug=full] is fine as that turns on the assertions without the logging. Yes it does perform additional walks over lists to ensure consistency, but the absolute performance killer is actuallly the batch buffer dumping.
Created attachment 63021 [details] i915_error_state Just another hang report - again mouse was involved - I've been just trying to move window holding Alt key.
Continuing the trend that something (a stray GPU write) clobbered that batch before it was executed. Was this with --enable-debug?
(In reply to comment #48) > Continuing the trend that something (a stray GPU write) clobbered that batch > before it was executed. Was this with --enable-debug? --enable-debug=memory - but I guess this will not help much cut&paste just before hang: [ 28209.006] Allocated bo: 158, 44642304 bytes [ 28219.047] Allocated shadow pixels: 74, 40953072 bytes, as CPU bo: 0, 0 bytes [ 28219.047] Allocated bo: 159, 44752896 bytes [ 28227.114] [mi] EQ overflowing. Additional events will be discarded until existing events are processed. [ 28227.114] [ 28227.114] Backtrace: [ 28227.115] 0: /usr/bin/X (xorg_backtrace+0x36) [0x465106] [ 28227.115] 1: /usr/bin/X (mieqEnqueue+0x26b) [0x55112b] [ 28227.116] 2: /usr/bin/X (0x400000+0x480c2) [0x4480c2] [ 28227.116] 3: /usr/lib64/xorg/modules/input/evdev_drv.so (0x7f17108d5000+0x60e4) [0x7f17108db0e4] [ 28227.116] 4: /usr/bin/X (0x400000+0x805b7) [0x4805b7] [ 28227.116] 5: /usr/bin/X (0x400000+0xa4790) [0x4a4790] [ 28227.116] 6: /lib64/libpthread.so.0 (0x7f1714f4f000+0xf5
(In reply to comment #49) > (In reply to comment #48) > > Continuing the trend that something (a stray GPU write) clobbered that batch > > before it was executed. Was this with --enable-debug? > > --enable-debug=memory - but I guess this will not help much It helps, that means that we had assertions enabled and so we should have validated that all writes through the BLT (which are not clipped by the hardware) were within the bounds of the target surface. Which unfortunately also means that we need to look for a more complex failure mode such as re-using active surfaces either in the kernel or in the ddx, or some GPU state mismanagement.
Hmm I've got my 2nd hang today :( and again with just plain window move with mouse. Anyway - tried again with valgrind after reboot - haven't reached the hang.- but got at least this one: Source and destination overlap in memcpy(0xd3fe6a0, 0xd3fe6c0, 88) at 0x4C2D080: memcpy@@GLIBC_2.14 (mc_replace_strmem.c:837) by 0x8BB9F0D: memcpy_blt (blt.c:211) by 0x8BD2C7B: sna_self_copy_boxes (sna_accel.c:3501) by 0x54F324: miCopyRegion (micopy.c:121) by 0x8BD2986: sna_copy_window (sna_accel.c:11991) by 0x50246D: damageCopyWindow (damage.c:1581) by 0x560321: miSpriteCopyWindow (misprite.c:518) by 0x4CA8D6: compCopyWindow (compwindow.c:522) by 0x5685BF: miMoveWindow (miwindow.c:290) by 0x4CB688: compMoveWindow (compwindow.c:375) by 0x45D560: ConfigureWindow (window.c:2309) by 0x42EEC3: ProcConfigureWindow (dispatch.c:867) by 0x4344A9: Dispatch (dispatch.c:428) by 0x423504: main (main.c:288) If the region may overlap - should there be memmove ? And isn't that some bug which should be asserted ?
It's not a bug I'd been worrying about since it only happens rarely with fallback surfaces. The fix is trivial (use memmove along the self-copy path), but it is definitely not related to the hangs.
commit b415ca05c2e1c4f09f85d8769d39e5369661ed3a Author: Chris Wilson <chris@chris-wilson.co.uk> Date: Thu Jun 14 17:06:19 2012 +0100 sna: Reset damage extents after reduction goes to zero If the reduction of the damage clears all of the boxes, we need to reset the -infinite extents so that we continue to accumulate further damage. Reported-by: Zdenek Kabelac <zdenek.kabelac@gmail.com> References: https://bugs.freedesktop.org/show_bug.cgi?id=50744 Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Onwards!
Oops, that was for the other bug, this one merits: commit 0df7c488640d3590d2a88dc353b72167b6644eaa Author: Chris Wilson <chris@chris-wilson.co.uk> Date: Thu Jun 14 16:09:38 2012 +0100 sna: Supporting overlapping copies for fallback blits Reported-by: Zdenek Kabelac <zkabelac@redhat.com> References: https://bugs.freedesktop.org/show_bug.cgi?id=50393 Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Created attachment 63380 [details] Crashed Xorg when GPU hang So after quite some time I've got another GPU hang - at least it is not so frequent now - but still not in good frequency. Using 565297e6bd3457a150036af9c62fe0dc67b794ac Again mouse was in use.
Created attachment 63381 [details] dmesg from GPU hang It interesting the GPU could not be reinitialized - since I remember I'be using some shell script to reset GPU and restore it into functional VGA state - so it's interesting kernel cannot do that these days.
Created attachment 63382 [details] error state from GPU Just in case some 'pattern' of failure could be found.
Looks pretty similar to Bug 39387 So it's most probably duplication.
And Bug 48826 looks close as well - even same HW T61 in this case.
(In reply to comment #58) > Looks pretty similar to Bug 39387 (In reply to comment #59) > And Bug 48826 looks close as well - even same HW T61 in this case. Both of these bugs match the pattern for a known bug in mesa. This does not, as the batch is overwritten with varying data.
(In reply to comment #60) > Both of these bugs match the pattern for a known bug in mesa. This does not, as > the batch is overwritten with varying data. Hmmm, while I do not generally use Mesa - i.e. typical apps - editors/terminal/browser/mailer/chat I'm using i.e. xlock screen blanker which it using some 3D modes. Also playing movies with 'xv' surface happens from time to time. While during hang those 3d apps and movies were not involved - they were run usually run during session. So is there a chance, that 3D pipe would break something in a way - I'd get GPU hang after 30 minutes or so with 2D desktop? Also how the better trace could be made ? How big history is need ? Since the lock happens 1-2 times per week - it's not really feasible to run debug=full and keep TB of logs on my system. Some ring buffer which would be printed before assert, that would keep history of submitted commands ? Is the reason of hang known ? Access of unmapped page ? Hitting some internal gm965 hw fault (since it cannot be reset) ? Is there any 'disassembler' for instruction in the ring buffer ?
(In reply to comment #61) > (In reply to comment #60) > > Both of these bugs match the pattern for a known bug in mesa. This does not, as > > the batch is overwritten with varying data. > > Hmmm, while I do not generally use Mesa - i.e. typical apps - > editors/terminal/browser/mailer/chat I'm using i.e. xlock screen blanker which > it using some 3D modes. Also playing movies with 'xv' surface happens from time > to time. > > While during hang those 3d apps and movies were not involved - they were run > usually run during session. So is there a chance, that 3D pipe would break > something in a way - I'd get GPU hang after 30 minutes or so with 2D desktop? No, this is extremely timing sensitive and would require that Mesa overwrote a batch submitted concurrently by the ddx (i.e between the time the contents where written by CPU and then executed by the GPU). The mesa bug in question was introduce in 8.0 and fixed by 8.0.2; the glClear operation was using uninitialised coordinates and so was writing into random blocks of memory. > Also how the better trace could be made ? > How big history is need ? Indeed, just the keeping the log from when the GPU was last idle should be enough to detect this style of bug. Hmm, something to think about. In the meantime if you verify that the hang still occurs with --enable-debug (i.e. assertions only), I'm at a loss to explain where in the ddx this bug is going to come from. (The only operations not clipped to the surface are generated as BLT commands, which I think I've covered by asserting that they are all within bounds. So the remaining possibility is use-after-free of a GPU bo, which should be impossible...) > Since the lock happens 1-2 times per week - it's not really feasible to run > debug=full and keep TB of logs on my system. > > Some ring buffer which would be printed before assert, that would keep history > of submitted commands ? > > Is the reason of hang known ? Access of unmapped page ? Hitting some internal > gm965 hw fault (since it cannot be reset) ? > Is there any 'disassembler' for instruction in the ring buffer ? Look at intel-gpu-tools and intel_error_decode. If you look at the batch buffer currently being executed you will see a large blob of pixel data rather than commands.
Use after free would be ruled out as well since the act of writing the batch is serialising (userspace asserts that the bo is not busy on the GPU and the kernel itself flushes any activity on the bo before writing to it). So that leaves a wild write that nevertheless was valid enough to pass all the sanity checks. The other possibility is that this is a kernel bug. It wouldn't be the first, but bugs of this nature are getting harder to find...
Created attachment 63413 [details] valgrind run So I've once again tried to play a bit with valgrind - and here is result of few things that might need some fixes - though most probably they are not sna issues. And also they do not look like 'big issue' to cause troubles. They were generate by switching between terminal and Xorg. It looks like something is being freed when leaving VT and it's tried to be reused on EnterVT. i.e.: Invalid read of size 8 at 0x6AA86C3: __GI_memcpy (memcpy.S:215) by 0x46C43D: XNFstrdup (utils.c:1096) by 0x4B9C28: xf86DuplicateMode (xf86Modes.c:209) by 0x4B347E: xf86CrtcSetModeTransform (xf86Crtc.c:267) by 0x4B3D93: xf86SetDesiredModes (xf86Crtc.c:2693) by 0x8BF5714: sna_enter_vt (sna_driver.c:1019) by 0x4BD1AA: xf86RandR12EnterVT (xf86RandR12.c:1737) by 0x494CC0: xf86XVEnterVT (xf86xv.c:1311) by 0x813944B: glxDRIEnterVT (glxdri2.c:601) by 0x480F8F: xf86Wakeup (xf86Events.c:527) by 0x4383EA: WakeupHandler (dixutils.c:421) by 0x462693: WaitForSomething (WaitFor.c:224) by 0x434170: Dispatch (dispatch.c:357) by 0x423504: main (main.c:288) Address 0x775ab90 is 96 bytes inside a block of size 128 free'd at 0x4C2A9CE: free (vg_replace_malloc.c:427) by 0x5A76231: pixman_op (pixman-region.c:1007) by 0x5A777EB: pixman_region_subtract (pixman-region.c:2010) by 0x562583: miValidateTree (regionstr.h:265) by 0x56859C: miMoveWindow (miwindow.c:289) by 0x4CB688: compMoveWindow (compwindow.c:375) by 0x45D560: ConfigureWindow (window.c:2309) by 0x42EEC3: ProcConfigureWindow (dispatch.c:867) by 0x4344A9: Dispatch (dispatch.c:428) by 0x423504: main (main.c:288) This seems like might be eventually touching some date ?? But all issues looks like 'read' operation, so it should not cause probably GPU problems ? Also I've not been switching terminals when hang happened. I could look later for reasons - but you might have better experience with this code (and probably even commit right to submit Xorg fixes). xorg-x11-server-Xorg-1.12.2-3.fc18.x86_64 sna git: fa10005ce31483827547b7f71eae066899f0026c
Seems to be same issue as in comment 15
Also this reminded me one thing - when this crash happened - and Xorg has core dump on assert - I've seen switch to console screen - so while in dmesg kernel is writing this: [drm:intel_lvds_enable] *ERROR* timed out waiting for panel to power on I've definitely seen an enabled panel. (And probably properly restored after Xorg has aborted.) Also why is the 'gem_read()' using assert(0) - if the wedged GPU is 'expectable', it should be able to handle this case ? Or the system is not really able to continue in this moment?
(In reply to comment #66) > Also this reminded me one thing - when this crash happened - and Xorg has core > dump on assert - I've seen switch to console screen - so while in dmesg > kernel is writing this: > > [drm:intel_lvds_enable] *ERROR* timed out waiting for panel to power on Shrug. That wait is superfluous anyway... But surprising that it should ever time out. > I've definitely seen an enabled panel. > (And probably properly restored after Xorg has aborted.) > > > Also why is the 'gem_read()' using assert(0) - if the wedged GPU is > 'expectable', > it should be able to handle this case ? The code should handle the case, but really to have hit that condition in the first place means that we have hit a garbage code path (that read is only used for GPU readback of !wedged data, so if we find ourselves wedged at that point terror ensues). The idea behind the assert was to make it easier to debug. (In reply to comment #65) > Seems to be same issue as in comment 15 Agreed. 'Tis the very same set of dangling pointer.
(In reply to comment #67) > > > I've definitely seen an enabled panel. > > (And probably properly restored after Xorg has aborted.) > > > > > > Also why is the 'gem_read()' using assert(0) - if the wedged GPU is > > 'expectable', > > it should be able to handle this case ? > > The code should handle the case, but really to have hit that condition in the > first place means that we have hit a garbage code path (that read is only used > for GPU readback of !wedged data, so if we find ourselves wedged at that point > terror ensues). The idea behind the assert was to make it easier to debug. > I guess no one would complain if he would get temporarily damaged screen which may even need full 'redraw' in software mode - it would be still 1000x better then current 'dead' end. (Thought possibly if I'd use non debug compilation - maybe I'd have got exactly this ?? So is there anything to fix to make assert(0) go away? (except for commenting it out from sources in my local git)
It's only a dead end in an almost impossible condition with --enable-debug. But sure, assert(0) can die. I'd vote for something like replacing it with an #ifndef NDEBUG #define AWOOGA() awooga(__FUNCTION__, __LINE__) void awooga(const char *function, int line) { ErrorF("impossible code-path at %s:%d, expect corruption\n", function, line); } #else #define AWOOGA() #endif which makes you happy by removing the assert, me happy as I have a breakpoint I can place for any such silly paths, and the user happy as it just works.
I'm still mulling over how to enhance debugging for production systems. In the meantime, am I right in considering the underlying bugs here begone?
Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.