Bug 50393

Summary:

Assertion failure which valgrinding

Product:

xorg

Reporter:

Zdenek Kabelac <zdenek.kabelac>

Component:

Driver/intel

Assignee:

Chris Wilson <chris>

Status:

RESOLVED FIXED

QA Contact:

Xorg Project Team <xorg-team>

Severity:

normal

Priority:

medium

CC:

ben, chris, daniel, jbarnes

Version:

git

Hardware:

Other

OS:

All

Whiteboard:

i915 platform:

i915 features:

Attachments:

Description	Flags
grep -r /sys/kernel/debug/dri/0	none
intel_gpu_dump	none
intel_reg_dumper	none
Unmodified error state	none
i915_error_state with changed vg pattern (21 -> EE)	none
Coredump of Xserver	none
debug dri log dir	none
error state captured	none
dmesg from hang	none
intel_reg_dumper output	none
i915_error_state	none
Crashed Xorg when GPU hang	none
dmesg from GPU hang	none
error state from GPU	none
valgrind run	none

Description Zdenek Kabelac 2012-05-27 08:15:16 UTC

I've opened bugzilla for a problem discussed on intel-gfx list.

My hardware T61, GMA965, I'm running  SNA intel driver 
(current git hash xf86-video-intel: 8ea4ba081de0206351394481f54dcbe6922a085b)

I've figured very easy way to trigger the problem - run Xorg in valgrind and just run xterm & firefox inside such slowed Xorg server - after few mouse moves on touchpad I usually get a quick GPU hang report.

I'll attach more logs from system if that will help.

It's probably worth to note here that I've tried remote drm branch:
[remote "drm-intel-next-queued"]
        url = git://people.freedesktop.org/~danvet/drm-intel
        fetch = +refs/heads/*:refs/remotes/drm-intel-next-queued/*

git hash: 98b6bd998ae057611d2bc040ace1fc847f575b84
and with this kernel 3.4.0-rc6-00441-g98b6bd9 I've NOT been able to reproduce this issue so easily anymore - so it seems like this drm branch has some fix inside.


This is report hang with vanilla 3.4 kernel:

[drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
[drm] capturing error event; look for more information in /debug/dri/0/i915_error_state
[drm:init_ring_common] *ERROR* render ring initialization failed ctl 00000000 head 00000000 tail 00000000 start 00000000
[drm] Changing LVDS panel from (+hsync, +vsync) to (-hsync, -vsync)
------------[ cut here ]------------
WARNING: at drivers/gpu/drm/i915/intel_display.c:807 intel_enable_pipe+0x12c/0x150 [i915]()
Hardware name: 6464CTO
PLL state assertion failure (expected on, current off)
Modules linked in: i915 i2c_algo_bit drm_kms_helper drm ip6_tables ebtable_nat ebtables ipt_MASQUERADE iptable_nat nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack ipt_REJECT iptable_mangle xt_tcpudp iptable_filter ip_tables x_tables bridge stp llc ipv6 snd_hda_codec_analog arc4 iwl3945 iwlegacy mac80211 microcode psmouse serio_raw cfg80211 snd_hda_intel i2c_i801 i2c_core iTCO_wdt iTCO_vendor_support snd_hda_codec snd_pcm snd_page_alloc snd_timer e1000e wmi thinkpad_acpi nvram snd soundcore binfmt_misc evdev loop vhost_net tun kvm_intel kvm nfsd exportfs nfs_acl auth_rpcgss lockd sunrpc autofs4 pcmcia sr_mod cdrom sdhci_pci sdhci mmc_core yenta_socket ehci_hcd uhci_hcd usbcore usb_common video backlight
Pid: 2281, comm: memcheck-amd64- Not tainted 3.4.0-00006-gf4267bb #68
Call Trace:
warn_slowpath_common+0x7f/0xc0
warn_slowpath_fmt+0x46/0x50
intel_enable_pipe+0x12c/0x150 [i915]
i9xx_crtc_mode_set+0x8b5/0xe70 [i915]
? _raw_spin_unlock_irqrestore+0x42/0x80
intel_crtc_mode_set+0x71/0xb0 [i915]
drm_crtc_helper_set_mode+0x2cd/0x4b0 [drm_kms_helper]
drm_crtc_helper_set_config+0xa32/0xb50 [drm_kms_helper]
? get_parent_ip+0x11/0x50
drm_fb_helper_restore_fbdev_mode+0x3d/0x60 [drm_kms_helper]
intel_fb_restore_mode+0x39/0xb0 [i915]
i915_driver_lastclose+0x30/0x60 [i915]
drm_lastclose+0x4a/0x320 [drm]
drm_release+0x582/0x790 [drm]
fput+0x12c/0x270
filp_close+0x69/0xa0
put_files_struct+0x125/0x410
? put_files_struct+0x38/0x410
exit_files+0x52/0x60
do_exit+0x178/0xa90
do_group_exit+0x4c/0xc0
get_signal_to_deliver+0x2be/0x860
? sub_preempt_count+0x79/0xd0
do_signal+0x68/0x950
? sysret_check+0x1b/0x56
? sys_kill+0x83/0x300
? might_fault+0x53/0xb0
? sysret_signal+0x5/0x47
do_notify_resume+0x65/0x80
? trace_hardirqs_on_thunk+0x3a/0x3f
int_signal+0x12/0x17

Comment 1 Zdenek Kabelac 2012-05-27 08:17:12 UTC

Created attachment 62139 [details]
grep -r /sys/kernel/debug/dri/0

I've just removed repeated zeros to make it shorter

Comment 2 Zdenek Kabelac 2012-05-27 08:18:12 UTC

Created attachment 62140 [details]
intel_gpu_dump

Comment 3 Zdenek Kabelac 2012-05-27 08:19:00 UTC

Created attachment 62141 [details]
intel_reg_dumper

Comment 4 Chris Wilson 2012-05-27 08:38:08 UTC

Next time please just attach an unmollified i915_error_state.

Comment 5 Chris Wilson 2012-05-27 08:41:03 UTC

So which kernel is this?

Comment 6 Zdenek Kabelac 2012-05-27 09:19:17 UTC

Created attachment 62143 [details]
Unmodified error state

Comment 7 Zdenek Kabelac 2012-05-27 09:20:29 UTC

(In reply to comment #5)
> So which kernel is this?

GPU hang happens with upstream kernels (report is from 3.4 with my 6 unrelated minor patches)

Comment 8 Chris Wilson 2012-05-27 09:22:57 UTC

And I also need to know if you are using any GL applications and if so which version of mesa?

Comment 9 Zdenek Kabelac 2012-05-27 09:31:44 UTC

(In reply to comment #8)
> And I also need to know if you are using any GL applications and if so which
> version of mesa?

The only running application has been  

xterm  and  firefox started from xterm.
(.xinitrc  has only   exec xterm)

No other application has been running (i.e. no window manager)

I'm not aware if firefox is using mesa for it's windows html rendering,
but it seemed to me the problem is more related to the mouse movement.
It's always triggered when I'm moving mouse via touchpad in the firefox window.
The rendered page in firefox was just some basic welcome text.

Here is the list mesa libs on my system:

mesa-dri-drivers-8.1-0.5.fc18.x86_64
mesa-libGLU-8.1-0.5.fc18.i686
mesa-debuginfo-8.1-0.5.fc18.x86_64
mesa-libGL-8.1-0.5.fc18.i686
mesa-libglapi-8.1-0.5.fc18.i686
mesa-libGLU-8.1-0.5.fc18.x86_64
mesa-libGLU-devel-8.1-0.5.fc18.x86_64
mesa-libEGL-8.1-0.5.fc18.i686
mesa-libEGL-8.1-0.5.fc18.x86_64
mesa-libGL-devel-8.1-0.5.fc18.x86_64
mesa-libEGL-devel-8.1-0.5.fc18.x86_64
mesa-libGLES-8.1-0.5.fc18.i686
mesa-dri-filesystem-8.1-0.5.fc18.x86_64
mesa-libgbm-8.1-0.5.fc18.i686
mesa-libGL-8.1-0.5.fc18.x86_64
mesa-libGLES-8.1-0.5.fc18.x86_64
mesa-libgbm-8.1-0.5.fc18.x86_64
mesa-dri-filesystem-8.1-0.5.fc18.i686
mesa-dri-drivers-8.1-0.5.fc18.i686
mesa-libglapi-8.1-0.5.fc18.x86_64

Comment 10 Chris Wilson 2012-05-27 09:36:27 UTC

Just going through the known bugs that trigger similar error states...

Comment 11 Zdenek Kabelac 2012-05-27 10:01:02 UTC

Could be completely unrelated - but it seem there could be correlation between hang, and my mouse movement - since it seems that this is one of moments,
where valgrind gives last message if it doesn't directly coredump (as posted in the list).

Syscall param ioctl(generic) points to uninitialised byte(s)
   at 0x6B06317: ioctl (syscall-template.S:81)
   by 0x8775807: drmIoctl (xf86drm.c:167)
   by 0x8779138: drmModeSetCursor (xf86drmMode.c:57)
   by 0x4B6A1C: xf86_set_cursor_position (xf86Cursors.c:308)
   by 0x4C0CBF: xf86SetCursor (xf86HWCurs.c:155)
   by 0x4BFA23: xf86CursorSetCursor (xf86Cursor.c:365)
   by 0x55AB39: miPointerUpdateSprite (mipointer.c:450)
   by 0x55ADEC: miPointerDisplayCursor (mipointer.c:201)
   by 0x4CD922: CursorDisplayCursor (cursor.c:166)
   by 0x4FD796: AnimCurDisplayCursor (animcur.c:226)
   by 0x43BBB8: ChangeToCursor (events.c:930)
   by 0x43F002: CheckMotion (events.c:2968)
   by 0x43F129: WindowsRestructured (events.c:2985)
   by 0x45DF2D: MapWindow (window.c:2610)
   by 0x42EC4F: ProcMapWindow (dispatch.c:793)
   by 0x4343B9: Dispatch (dispatch.c:425)
   by 0x423404: main (main.c:288)
 Location 0x7fefffdc8 is 0 bytes inside arg.x,
 declared at xf86drmMode.c:389, in frame #2 of thread 1
 Uninitialised value was created by a stack allocation
   at 0x8BEE150: sna_crtc_show_cursor (sna_display.c:794)


I've no idea how does works setting mouse position on hardware level, but couldn't there be some hidden async bug?

Comment 12 Chris Wilson 2012-05-28 04:37:59 UTC

Valgrind isn't telling you anything useful here except that it doesn't understand the layout of the ioctl and so warning that two unused members of that struct are left uninitialised.

Comment 13 Zdenek Kabelac 2012-05-28 04:49:27 UTC

(In reply to comment #12)
> Valgrind isn't telling you anything useful here except that it doesn't
> understand the layout of the ioctl and so warning that two unused members of
> that struct are left uninitialized.

Yes - that's clear, I've not meant those uninitialized members - but rather the fact, that it seemed that log got frozen in the moment it's been manipulating with mouse. 

However I've also tried to play with this more and noticed that even the usage of valgrind is not a reliable way to make it freeze quickly :(

I've got an idea to check whether there is some patch between drm-next and vanilla which makes it working - and noticed I couldn't say when it's OK,
since even with vanilla I get sometimes quite long run without GPU hang :(.
(So it's actually quite well possible, that even drm-next will hang, I've just not played long enough to trigger this problem.)

Comment 14 Zdenek Kabelac 2012-05-29 02:12:01 UTC

I've noticed you made a new upstream commit to help with valgrind usage.

Anyway - here are few interesting bits - which might be actually problems of Xorg server and not really intel sna driver (so may deserve a new separate bugzilla)

When I'll have time - I'll look on them more closely - meanwhile - someone else might get more clue about them.

What is interesting about them is the 'Crtc' part of the function name - which is also a part of my kernel GPU hung bug description.
So maybe there could be something related together?

Source code in use (to match source lines) is:
sna driver: 55e6f5f220401318529e81f7c96fe0af3b893a0c
xorg-x11-server-Xorg-1.12.1-2.fc18.x86_64

They appear in log when I switch between console and X server.


Invalid read of size 1
   at 0x4C2C162: __GI_strlen (mc_replace_strmem.c:391)
   by 0x6AA23A5: strdup (strdup.c:41)
   by 0x46C26D: XNFstrdup (utils.c:1096)
   by 0x4B9988: xf86DuplicateMode (xf86Modes.c:209)
   by 0x4B31DE: xf86CrtcSetModeTransform (xf86Crtc.c:267)
   by 0x4B3AF3: xf86SetDesiredModes (xf86Crtc.c:2693)
   by 0x8BF3B89: sna_enter_vt (sna_driver.c:985)
   by 0x4BCF0A: xf86RandR12EnterVT (xf86RandR12.c:1737)
   by 0x494A20: xf86XVEnterVT (xf86xv.c:1311)
   by 0x813B44B: glxDRIEnterVT (glxdri2.c:601)
   by 0x480CEF: xf86Wakeup (xf86Events.c:527)
   by 0x43836A: WakeupHandler (dixutils.c:421)
   by 0x462413: WaitForSomething (WaitFor.c:224)
   by 0x4340F0: Dispatch (dispatch.c:357)
   by 0x423494: main (main.c:288)
 Address 0x775d530 is 0 bytes inside a block of size 10 free'd
   at 0x4C2A9AE: free (vg_replace_malloc.c:427)
   by 0x4921C4: xf86DeleteMode (xf86Mode.c:2004)
   by 0x4B2237: xf86ProbeOutputModes (xf86Crtc.c:1538)
   by 0x4BBC1B: xf86RandR12GetInfo12 (xf86RandR12.c:1517)
   by 0x4E8051: RRGetInfo (rrinfo.c:195)
   by 0x494A20: xf86XVEnterVT (xf86xv.c:1311)
   by 0x813B44B: glxDRIEnterVT (glxdri2.c:601)
   by 0x480CEF: xf86Wakeup (xf86Events.c:527)
   by 0x43836A: WakeupHandler (dixutils.c:421)
   by 0x462413: WaitForSomething (WaitFor.c:224)
   by 0x4340F0: Dispatch (dispatch.c:357)
   by 0x423494: main (main.c:288)

Invalid read of size 1
   at 0x4C2C174: __GI_strlen (mc_replace_strmem.c:391)
   by 0x6AA23A5: strdup (strdup.c:41)
   by 0x46C26D: XNFstrdup (utils.c:1096)
   by 0x4B9988: xf86DuplicateMode (xf86Modes.c:209)
   by 0x4B31DE: xf86CrtcSetModeTransform (xf86Crtc.c:267)
   by 0x4B3AF3: xf86SetDesiredModes (xf86Crtc.c:2693)
   by 0x8BF3B89: sna_enter_vt (sna_driver.c:985)
   by 0x4BCF0A: xf86RandR12EnterVT (xf86RandR12.c:1737)
   by 0x494A20: xf86XVEnterVT (xf86xv.c:1311)
   by 0x813B44B: glxDRIEnterVT (glxdri2.c:601)
   by 0x480CEF: xf86Wakeup (xf86Events.c:527)
   by 0x43836A: WakeupHandler (dixutils.c:421)
   by 0x462413: WaitForSomething (WaitFor.c:224)
   by 0x4340F0: Dispatch (dispatch.c:357)
   by 0x423494: main (main.c:288)
 Address 0x775d531 is 1 bytes inside a block of size 10 free'd
   at 0x4C2A9AE: free (vg_replace_malloc.c:427)
   by 0x4921C4: xf86DeleteMode (xf86Mode.c:2004)
   by 0x4B2237: xf86ProbeOutputModes (xf86Crtc.c:1538)
   by 0x4BBC1B: xf86RandR12GetInfo12 (xf86RandR12.c:1517)
   by 0x4E8051: RRGetInfo (rrinfo.c:195)
   by 0x494A20: xf86XVEnterVT (xf86xv.c:1311)
   by 0x813B44B: glxDRIEnterVT (glxdri2.c:601)
   by 0x480CEF: xf86Wakeup (xf86Events.c:527)
   by 0x43836A: WakeupHandler (dixutils.c:421)
   by 0x462413: WaitForSomething (WaitFor.c:224)
   by 0x4340F0: Dispatch (dispatch.c:357)
   by 0x423494: main (main.c:288)

Invalid read of size 1
   at 0x6AAAEAE: __GI_memcpy (memcpy.S:71)
   by 0x46C26D: XNFstrdup (utils.c:1096)
   by 0x4B9988: xf86DuplicateMode (xf86Modes.c:209)
   by 0x4B31DE: xf86CrtcSetModeTransform (xf86Crtc.c:267)
   by 0x4B3AF3: xf86SetDesiredModes (xf86Crtc.c:2693)
   by 0x8BF3B89: sna_enter_vt (sna_driver.c:985)
   by 0x4BCF0A: xf86RandR12EnterVT (xf86RandR12.c:1737)
   by 0x494A20: xf86XVEnterVT (xf86xv.c:1311)
   by 0x813B44B: glxDRIEnterVT (glxdri2.c:601)
   by 0x480CEF: xf86Wakeup (xf86Events.c:527)
   by 0x43836A: WakeupHandler (dixutils.c:421)
   by 0x462413: WaitForSomething (WaitFor.c:224)
   by 0x4340F0: Dispatch (dispatch.c:357)
   by 0x423494: main (main.c:288)
 Address 0x775d530 is 0 bytes inside a block of size 10 free'd
   at 0x4C2A9AE: free (vg_replace_malloc.c:427)
   by 0x4921C4: xf86DeleteMode (xf86Mode.c:2004)
   by 0x4B2237: xf86ProbeOutputModes (xf86Crtc.c:1538)
   by 0x4BBC1B: xf86RandR12GetInfo12 (xf86RandR12.c:1517)
   by 0x4E8051: RRGetInfo (rrinfo.c:195)
   by 0x494A20: xf86XVEnterVT (xf86xv.c:1311)
   by 0x813B44B: glxDRIEnterVT (glxdri2.c:601)
   by 0x480CEF: xf86Wakeup (xf86Events.c:527)
   by 0x43836A: WakeupHandler (dixutils.c:421)
   by 0x462413: WaitForSomething (WaitFor.c:224)
   by 0x4340F0: Dispatch (dispatch.c:357)
   by 0x423494: main (main.c:288)

Invalid read of size 2
   at 0x6AAAEBE: __GI_memcpy (memcpy.S:83)
   by 0x46C26D: XNFstrdup (utils.c:1096)
   by 0x4B9988: xf86DuplicateMode (xf86Modes.c:209)
   by 0x4B31DE: xf86CrtcSetModeTransform (xf86Crtc.c:267)
   by 0x4B3AF3: xf86SetDesiredModes (xf86Crtc.c:2693)
   by 0x8BF3B89: sna_enter_vt (sna_driver.c:985)
   by 0x4BCF0A: xf86RandR12EnterVT (xf86RandR12.c:1737)
   by 0x494A20: xf86XVEnterVT (xf86xv.c:1311)
   by 0x813B44B: glxDRIEnterVT (glxdri2.c:601)
   by 0x480CEF: xf86Wakeup (xf86Events.c:527)
   by 0x43836A: WakeupHandler (dixutils.c:421)
   by 0x462413: WaitForSomething (WaitFor.c:224)
   by 0x4340F0: Dispatch (dispatch.c:357)
   by 0x423494: main (main.c:288)
 Address 0x775d531 is 1 bytes inside a block of size 10 free'd
   at 0x4C2A9AE: free (vg_replace_malloc.c:427)
   by 0x4921C4: xf86DeleteMode (xf86Mode.c:2004)
   by 0x4B2237: xf86ProbeOutputModes (xf86Crtc.c:1538)
   by 0x4BBC1B: xf86RandR12GetInfo12 (xf86RandR12.c:1517)
   by 0x4E8051: RRGetInfo (rrinfo.c:195)
   by 0x494A20: xf86XVEnterVT (xf86xv.c:1311)
   by 0x813B44B: glxDRIEnterVT (glxdri2.c:601)
   by 0x480CEF: xf86Wakeup (xf86Events.c:527)
   by 0x43836A: WakeupHandler (dixutils.c:421)
   by 0x462413: WaitForSomething (WaitFor.c:224)
   by 0x4340F0: Dispatch (dispatch.c:357)
   by 0x423494: main (main.c:288)

Invalid read of size 8
   at 0x6AAAEE6: __GI_memcpy (memcpy.S:107)
   by 0x46C26D: XNFstrdup (utils.c:1096)
   by 0x4B9988: xf86DuplicateMode (xf86Modes.c:209)
   by 0x4B31DE: xf86CrtcSetModeTransform (xf86Crtc.c:267)
   by 0x4B3AF3: xf86SetDesiredModes (xf86Crtc.c:2693)
   by 0x8BF3B89: sna_enter_vt (sna_driver.c:985)
   by 0x4BCF0A: xf86RandR12EnterVT (xf86RandR12.c:1737)
   by 0x494A20: xf86XVEnterVT (xf86xv.c:1311)
   by 0x813B44B: glxDRIEnterVT (glxdri2.c:601)
   by 0x480CEF: xf86Wakeup (xf86Events.c:527)
   by 0x43836A: WakeupHandler (dixutils.c:421)
   by 0x462413: WaitForSomething (WaitFor.c:224)
   by 0x4340F0: Dispatch (dispatch.c:357)
   by 0x423494: main (main.c:288)
 Address 0x775d533 is 3 bytes inside a block of size 10 free'd
   at 0x4C2A9AE: free (vg_replace_malloc.c:427)
   by 0x4921C4: xf86DeleteMode (xf86Mode.c:2004)
   by 0x4B2237: xf86ProbeOutputModes (xf86Crtc.c:1538)
   by 0x4BBC1B: xf86RandR12GetInfo12 (xf86RandR12.c:1517)
   by 0x4E8051: RRGetInfo (rrinfo.c:195)
   by 0x494A20: xf86XVEnterVT (xf86xv.c:1311)
   by 0x813B44B: glxDRIEnterVT (glxdri2.c:601)
   by 0x480CEF: xf86Wakeup (xf86Events.c:527)
   by 0x43836A: WakeupHandler (dixutils.c:421)
   by 0x462413: WaitForSomething (WaitFor.c:224)
   by 0x4340F0: Dispatch (dispatch.c:357)
   by 0x423494: main (main.c:288)

Invalid read of size 1
   at 0x4C2C331: strncpy (mc_replace_strmem.c:463)
   by 0x8BF3621: sna_crtc_set_mode_major (sna_display.c:393)
   by 0x4B32BC: xf86CrtcSetModeTransform (xf86Crtc.c:294)
   by 0x4B3AF3: xf86SetDesiredModes (xf86Crtc.c:2693)
   by 0x8BF3B89: sna_enter_vt (sna_driver.c:985)
   by 0x4BCF0A: xf86RandR12EnterVT (xf86RandR12.c:1737)
   by 0x494A20: xf86XVEnterVT (xf86xv.c:1311)
   by 0x813B44B: glxDRIEnterVT (glxdri2.c:601)
   by 0x480CEF: xf86Wakeup (xf86Events.c:527)
   by 0x43836A: WakeupHandler (dixutils.c:421)
   by 0x462413: WaitForSomething (WaitFor.c:224)
   by 0x4340F0: Dispatch (dispatch.c:357)
   by 0x423494: main (main.c:288)
 Address 0x775d530 is 0 bytes inside a block of size 10 free'd
   at 0x4C2A9AE: free (vg_replace_malloc.c:427)
   by 0x4921C4: xf86DeleteMode (xf86Mode.c:2004)
   by 0x4B2237: xf86ProbeOutputModes (xf86Crtc.c:1538)
   by 0x4BBC1B: xf86RandR12GetInfo12 (xf86RandR12.c:1517)
   by 0x4E8051: RRGetInfo (rrinfo.c:195)
   by 0x494A20: xf86XVEnterVT (xf86xv.c:1311)
   by 0x813B44B: glxDRIEnterVT (glxdri2.c:601)
   by 0x480CEF: xf86Wakeup (xf86Events.c:527)
   by 0x43836A: WakeupHandler (dixutils.c:421)
   by 0x462413: WaitForSomething (WaitFor.c:224)
   by 0x4340F0: Dispatch (dispatch.c:357)
   by 0x423494: main (main.c:288)

Invalid read of size 1
   at 0x4C2C350: strncpy (mc_replace_strmem.c:463)
   by 0x8BF3621: sna_crtc_set_mode_major (sna_display.c:393)
   by 0x4B32BC: xf86CrtcSetModeTransform (xf86Crtc.c:294)
   by 0x4B3AF3: xf86SetDesiredModes (xf86Crtc.c:2693)
   by 0x8BF3B89: sna_enter_vt (sna_driver.c:985)
   by 0x4BCF0A: xf86RandR12EnterVT (xf86RandR12.c:1737)
   by 0x494A20: xf86XVEnterVT (xf86xv.c:1311)
   by 0x813B44B: glxDRIEnterVT (glxdri2.c:601)
   by 0x480CEF: xf86Wakeup (xf86Events.c:527)
   by 0x43836A: WakeupHandler (dixutils.c:421)
   by 0x462413: WaitForSomething (WaitFor.c:224)
   by 0x4340F0: Dispatch (dispatch.c:357)
   by 0x423494: main (main.c:288)
 Address 0x775d531 is 1 bytes inside a block of size 10 free'd
   at 0x4C2A9AE: free (vg_replace_malloc.c:427)
   by 0x4921C4: xf86DeleteMode (xf86Mode.c:2004)
   by 0x4B2237: xf86ProbeOutputModes (xf86Crtc.c:1538)
   by 0x4BBC1B: xf86RandR12GetInfo12 (xf86RandR12.c:1517)
   by 0x4E8051: RRGetInfo (rrinfo.c:195)
   by 0x494A20: xf86XVEnterVT (xf86xv.c:1311)
   by 0x813B44B: glxDRIEnterVT (glxdri2.c:601)
   by 0x480CEF: xf86Wakeup (xf86Events.c:527)
   by 0x43836A: WakeupHandler (dixutils.c:421)
   by 0x462413: WaitForSomething (WaitFor.c:224)
   by 0x4340F0: Dispatch (dispatch.c:357)
   by 0x423494: main (main.c:288)

Invalid read of size 4
   at 0x428CAD: FreeCursor (cursor.c:117)
   by 0x44B89B: FreeGrab (grabs.c:253)
   by 0x4299DD: CloseDevice (devices.c:932)
   by 0x42A012: CloseDownDevices (devices.c:996)
   by 0x4234FB: main (main.c:321)
 Address 0x76df5a4 is 20 bytes inside a block of size 112 free'd
   at 0x4C2A9AE: free (vg_replace_malloc.c:427)
   by 0x428D0E: FreeCursor (cursor.c:126)
   by 0x455151: doFreeResource (resource.c:549)
   by 0x455F54: FreeClientResources (resource.c:806)
   by 0x433869: CloseDownClient (dispatch.c:3379)
   by 0x434445: Dispatch (dispatch.c:437)
   by 0x423494: main (main.c:288)

Comment 15 Chris Wilson 2012-05-29 03:57:41 UTC

We've chased those valgrind warnings in the past: bug 36108.

Comment 16 Zdenek Kabelac 2012-05-29 04:06:18 UTC

(In reply to comment #15)
> We've chased those valgrind warnings in the past: bug 36108.

Ohh great even my own report ;)

Comment 17 Zdenek Kabelac 2012-05-29 07:29:17 UTC

Ok - I could confirm that with this kernel:

3.4.0-rc6-00441-g98b6bd9  - which is supposedly this commit:

commit 98b6bd998ae057611d2bc040ace1fc847f575b84
Author: Daniel Vetter <daniel.vetter@ffwll.ch>
Date:   Sun May 20 20:00:25 2012 +0200

I'm also getting the GPU hang on my machine.

So this issue is still present in drm-intel-next-queued for my machine.

The only thing I've been doing is - I've been moving firefox window above another firefox window.

Comment 18 Chris Wilson 2012-05-29 07:33:36 UTC

Can you attach a recent error-state, useful to confirm it is the same pattern? Also if you have a simple test case, then --enable-debug=full would be useful. Perhaps even #define DEBUG_FLUSH_SYNC 1.

Comment 19 Zdenek Kabelac 2012-05-29 13:47:07 UTC

(In reply to comment #18)
> Can you attach a recent error-state, useful to confirm it is the same pattern?
> Also if you have a simple test case, then --enable-debug=full would be useful.
> Perhaps even #define DEBUG_FLUSH_SYNC 1.

So first I've just tried to get the full error state log with this drm-next kernel - and it's been apparently quite simple - I've just want to create new bz with some display artifacts - so I wanted to use some minimal environment, and it has pretty quickly locked - so to check for GPU hang simple - then I want to use you build advice 

However as soon as I try to run such sna driver within my valgrind setup - I get instant coredump with the first mouse move.

Process terminating with default action of signal 11 (SIGSEGV): dumping core
 General Protection Fault
   at 0x6B25165: __sprintf_chk (sprintf_chk.c:27)
   by 0x468E7C: LogVWrite (stdio2.h:33)
   by 0x468BAB: ErrorF (log.c:641)
   by 0x8C15CAC: sna_crtc_set_cursor_position (sna_display.c:815)
   by 0x4B6B43: xf86_set_cursor_position (xf86Cursors.c:391)
   by 0x55A3C2: miPointerMoveNoEvent (mipointer.c:544)
   by 0x55B3E7: miPointerSetPosition (mipointer.c:625)
   by 0x448486: positionSprite.part.6 (getevents.c:913)
   by 0x44895C: fill_pointer_events (getevents.c:1331)
   by 0x449CD2: GetPointerEvents (getevents.c:1567)
   by 0x44A19C: QueuePointerEvents (getevents.c:1215)
   by 0xAEF30E3: EvdevReadInput (evdev.c:1012)
   by 0x480316: xf86SigioReadInput (xf86Events.c:298)
   by 0x4A44EF: xf86SIGIO (sigio.c:108)
   by 0x582254F: ??? (in /usr/lib64/libpthread-2.15.90.so)
   by 0x6B027DD: __write_nocancel (syscall-template.S:81)
   by 0x6A925C2: _IO_file_write@@GLIBC_2.2.5 (fileops.c:1286)
   by 0x6A924A1: new_do_write (fileops.c:540)
   by 0x6A9307D: _IO_file_xsputn@@GLIBC_2.2.5 (fileops.c:1369)
   by 0x6A88ABC: fwrite (iofwrite.c:44)
   by 0x468ED8: LogVWrite (log.c:280)
   by 0x468BAB: ErrorF (log.c:641)
   by 0x8C15CAC: sna_crtc_set_cursor_position (sna_display.c:815)
   by 0x4B6B43: xf86_set_cursor_position (xf86Cursors.c:391)
   by 0x550E25: mieqProcessInputEvents (mieq.c:628)
   by 0x480398: ProcessInputEvents (xf86Events.c:164)
   by 0x4340E1: Dispatch (dispatch.c:353)
   by 0x423494: main (main.c:288)
 

What looks weird is the rentrancy  - it looks like things are being handled via signal in parallel with normal ProcessInputEvents - should there be some protection for debugging?

And in fact isn't this issue related to my original problem  - since
some parts of the code could be executed easily in parallel.
Thus drmIoctl(DRM_IOCTL_MODE_CURSOR) might be called by normal task as well as via signal handler ?

Comment 20 Chris Wilson 2012-05-29 14:00:09 UTC

Interesting, I've been having a discussion with Dave Airlie about the native of SIGIO handling of cursors and to have an explicit example of badness (albeit due to extra debugging code) is fascinating. In short, the SIGIO has been known to exercise kernel bugs in the past, but this callstack does not directly relate to the GPU hang.

Comment 21 Zdenek Kabelac 2012-05-29 14:18:46 UTC

(In reply to comment #20)
> Interesting, I've been having a discussion with Dave Airlie about the native of
> SIGIO handling of cursors and to have an explicit example of badness (albeit
> due to extra debugging code) is fascinating. In short, the SIGIO has been known
> to exercise kernel bugs in the past, but this callstack does not directly
> relate to the GPU hang.

Yes - I'll probably need to modify debugging somehow for this case.

But I'm just wondering - it the Signal handler is operating on kgem objects - how is the locking working then ?  How do you ensure  kgem object used during such handling remains valid ?  What should be the result of such operation ?

Is the cursor position influencing GPU processing  - i.e. you expect to render cursor somewhere - and signal handler modifies the position again - is this safe operation ?  Should there be 'kernel queue' for mouse positions.

Comment 22 Zdenek Kabelac 2012-05-30 00:57:18 UTC

Posting valgrind options for bug 50477  

MALLOC="--malloc-fill=aa"
FREE="--free-fill=21"
TRACKFD="--track-fds=yes"
STACK="--max-stackframe=300000"
EXTRAS="--read-var-info=yes"
LOGFILE="--log-file=/tmp/valglog.$SUFFIX"
TIMESTAMP="--time-stamp=yes"
LEAKCHECK="--leak-check=yes"
LEAKCHECK="--leak-check=full --track-origins=yes"
CALLERS="--num-callers=40"


actually reminded me - that '21' is my clear pattern - which could be seen in GPU error status on several places - so it seems like such free memory gets in this GPU buffer.

I'll try to change hex number to confirm this...

Comment 23 Chris Wilson 2012-05-30 01:01:28 UTC

(In reply to comment #22)
> Posting valgrind options for bug 50477  
> 
> MALLOC="--malloc-fill=aa"
> FREE="--free-fill=21"
> TRACKFD="--track-fds=yes"
> STACK="--max-stackframe=300000"
> EXTRAS="--read-var-info=yes"
> LOGFILE="--log-file=/tmp/valglog.$SUFFIX"
> TIMESTAMP="--time-stamp=yes"
> LEAKCHECK="--leak-check=yes"
> LEAKCHECK="--leak-check=full --track-origins=yes"
> CALLERS="--num-callers=40"
> 
> 
> actually reminded me - that '21' is my clear pattern - which could be seen in
> GPU error status on several places - so it seems like such free memory gets in
> this GPU buffer.
> 
> I'll try to change hex number to confirm this...

Ok, that is worrisome. :)

Comment 24 Zdenek Kabelac 2012-05-30 01:36:53 UTC

Created attachment 62268 [details]
i915_error_state with changed vg pattern (21 -> EE)

As one may see quite a few 'EE' sections could be found there. So it seems that released buffers are GPU queue (which could be fine, if GPU has already executed them - but since they are on multiple places - it looks suspicious.

Comment 25 Zdenek Kabelac 2012-05-30 01:45:03 UTC

Created attachment 62270 [details]
Coredump of Xserver


I'm now getting almost instant GPU hang and Xorg crash when I use Xorg with valgrind - sna driver 740368c4c6eb547adad247ff529d16e594d6459b + Flush patch from Bug 50477 and start  xterm + twm + thunderbird.

After TB starts it wants to open two windows and I want to position them with twm - and with few mouse moves I got GPU hung - I'm leaving some timeinfo from kernel in trace here - so it could be seen it's quite soon after boot - it doesn't matter if 3.4 or 3.4-rc6-drm-next is in use.


 64.668973] drm: registered panic notifier
 64.678853] acpi device:02: registered as cooling_device2
 64.684729] input: Video Bus as /devices/LNXSYSTM:00/device:00/PNP0A08:00/LNXVIDEO:00/input/input9
 64.696343] ACPI: Video Device [VID] (multi-head: yes  rom: no  post: no)
 64.703397] [drm] Initialized i915 1.6.0 20080730 for 0000:00:02.0 on minor 0
187.640080] [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
187.648131] [drm] capturing error event; look for more information in /debug/dri/0/i915_error_state
187.658803] [drm:init_ring_common] *ERROR* render ring initialization failed ctl 00000000 head 00000000 tail 00000000 start 00000000
187.780290] [drm] Changing LVDS panel from (+hsync, +vsync) to (-hsync, -vsync)
187.787833] ------------[ cut here ]------------
187.792480] WARNING: at drivers/gpu/drm/i915/intel_display.c:807 intel_enable_pipe+0x12c/0x150 [i915]()
187.802121] Hardware name: 6464CTO
187.805627] PLL state assertion failure (expected on, current off)
187.811928] Modules linked in: i915 i2c_algo_bit drm_kms_helper drm ip6_tables ebtable_nat ebtables ipt_MASQUERADE iptable_nat nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack ipt_REJECT iptable_mangle xt_tcpudp iptable_filter ip_tables x_tables bridge stp llc ipv6 snd_hda_codec_analog usbhid hid arc4 microcode psmouse iwl3945 serio_raw iwlegacy i2c_i801 i2c_core mac80211 snd_hda_intel snd_hda_codec iTCO_wdt snd_pcm iTCO_vendor_support cfg80211 snd_page_alloc snd_timer e1000e wmi thinkpad_acpi nvram snd soundcore binfmt_misc loop vhost_net evdev tun kvm_intel kvm nfsd exportfs nfs_acl auth_rpcgss lockd sunrpc autofs4 pcmcia sr_mod cdrom sdhci_pci sdhci mmc_core yenta_socket ehci_hcd uhci_hcd usbcore usb_common video backlight
187.879457] Pid: 1139, comm: memcheck-amd64- Not tainted 3.4.0-00006-gf4267bb #68
187.886961] Call Trace:
187.889416]  [<ffffffff8103b50f>] warn_slowpath_common+0x7f/0xc0
187.895443]  [<ffffffff8103b606>] warn_slowpath_fmt+0x46/0x50
187.901222]  [<ffffffffa056025c>] intel_enable_pipe+0x12c/0x150 [i915]
187.907781]  [<ffffffffa0563c85>] i9xx_crtc_mode_set+0x8b5/0xe70 [i915]
187.914416]  [<ffffffff815159e2>] ? _raw_spin_unlock_irqrestore+0x42/0x80
187.921233]  [<ffffffffa055d761>] intel_crtc_mode_set+0x71/0xb0 [i915]
187.927787]  [<ffffffffa029a01d>] drm_crtc_helper_set_mode+0x2cd/0x4b0 [drm_kms_helper]
187.935806]  [<ffffffffa029b262>] drm_crtc_helper_set_config+0xa32/0xb50 [drm_kms_helper]
187.944010]  [<ffffffff81078611>] ? get_parent_ip+0x11/0x50
187.949609]  [<ffffffffa0297c6d>] drm_fb_helper_restore_fbdev_mode+0x3d/0x60 [drm_kms_helper]
187.958167]  [<ffffffffa0577d69>] intel_fb_restore_mode+0x39/0xb0 [i915]
187.964898]  [<ffffffffa053e6a0>] i915_driver_lastclose+0x30/0x60 [i915]
187.971633]  [<ffffffffa050079a>] drm_lastclose+0x4a/0x320 [drm]
187.977666]  [<ffffffffa05010f2>] drm_release+0x582/0x790 [drm]
187.983607]  [<ffffffff81177dcc>] fput+0x12c/0x270
187.988422]  [<ffffffff81172d29>] filp_close+0x69/0xa0
187.993592]  [<ffffffff81040135>] put_files_struct+0x125/0x410
187.999445]  [<ffffffff81040048>] ? put_files_struct+0x38/0x410
188.005385]  [<ffffffff810404e2>] exit_files+0x52/0x60
188.010551]  [<ffffffff81040668>] do_exit+0x178/0xa90
188.015627]  [<ffffffff810412dc>] do_group_exit+0x4c/0xc0
188.021053]  [<ffffffff81053c1e>] get_signal_to_deliver+0x2be/0x860
188.027340]  [<ffffffff815192c9>] ? sub_preempt_count+0x79/0xd0
188.033287]  [<ffffffff81002298>] do_signal+0x68/0x950
188.038444]  [<ffffffff8151d2fb>] ? sysret_check+0x1b/0x56
188.043948]  [<ffffffff81054b53>] ? sys_kill+0x83/0x300
188.049189]  [<ffffffff81141b33>] ? might_fault+0x53/0xb0
188.054613]  [<ffffffff8151d350>] ? sysret_signal+0x5/0x47
188.060125]  [<ffffffff81002c05>] do_notify_resume+0x65/0x80
188.065807]  [<ffffffff812f4cfe>] ? trace_hardirqs_on_thunk+0x3a/0x3f
188.072270]  [<ffffffff8151d5e2>] int_signal+0x12/0x17
188.077447] ---[ end trace 134029284d004294 ]---

Comment 26 Chris Wilson 2012-05-30 01:50:17 UTC

Ah. That looks to be an issue with valgrind writing its free-pattern over a released CPU mapping that is still on the GPU.

If you comment out the two VALGRIND_FREELIKE_BLOCK() in src/sna/kgem.c, does the problem go away?

Comment 27 Chris Wilson 2012-05-30 02:45:58 UTC

This should do the trick and keep valgrind happy with the unknown GPU mappings.

commit 47e2db0ba8ae3fbcdf58cba26c58ec0932b4c90b
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Wed May 30 10:37:22 2012 +0100

    sna: Replace FREELIKE_BLOCK with MAKE_MEM_NOACCESS
    
    And similarly for MALLOCLIKE_BLOCK. The issue being that valgrind may
    overwrite such blocks with alloc-fill and free-fill values, but when in
    fact they are defined and still in use by the GPU. This can lead to the
    GPU processing garbage, and GPU hangs.
    
    Reported-by: Zdenek Kabelac <zdenek.kabelac@gmail.com>
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=50393
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>

Comment 28 Zdenek Kabelac 2012-05-30 05:05:09 UTC

Ok,  with valgrind modification - you have basically eliminated my easy way to trigger GPU hang ;)  - but obviously since my report has been originally started without using --enable-debug compiled sna driver  it might not be definite fix.

Let's see if the combination with other fixes (flush) will help.
So far - I've not been able to easily hang GPU like I used to be.

Btw now one instance of Xorg made this report:

Memcheck: mc_leakcheck.c:1112 (vgMemCheck_detect_memory_leaks): the 'impossible' happened.
==00:00:07:51.006 8850== Block 0x7f9edb9b6000..0x7f9edb9f5fff overlaps with block 0x7f9edb9b7000..0x7f9edb9f6fffThis is usually caused by using VALGRIND_MALLOCLIKE_BLOCKin an inappropriate way.   at 0x3804C60F: report_and_quit (m_libcassert.c:210)

Comment 29 Chris Wilson 2012-05-30 05:31:13 UTC

/me mutters something about the summary being "hang with valgrind" :-p

Both the headers and code claim not to impose any restrictions on those MAKE_MEM requests. Not sure, try with valgrind -v as well.

Comment 30 Zdenek Kabelac 2012-05-30 06:00:45 UTC

(In reply to comment #29)
> /me mutters something about the summary being "hang with valgrind" :-p
> 

It's been just really easy to reproduce with valgrind :)
But the original issue was definitely there without valgrind.
Let's just see  how often I'll hit it with current sna version.

Comment 31 Zdenek Kabelac 2012-05-30 06:14:24 UTC

Looking back at some collected coredumps - here is one which might be worth to check by you.

#0  __kgem_flush (kgem=kgem@entry=0x76ed638, bo=0x0) at kgem.c:2953
#1  0x0000000008be715d in sna_accel_do_flush (sna=0x76ed430) at sna_accel.c:11928
#2  sna_accel_block_handler (sna=0x76ed430, tv=0x7fefffed8) at sna_accel.c:12328
#3  0x00000000004381ba in BlockHandler (pTimeout=pTimeout@entry=0x7fefffed8, pReadmask=pReadmask@entry=0x7de540) at dixutils.c:387
#4  0x0000000000462384 in WaitForSomething (pClientsReady=pClientsReady@entry=0xae44580) at WaitFor.c:210
#5  0x00000000004340f1 in Dispatch () at dispatch.c:357
#6  0x0000000000423495 in main (argc=4, argv=0x7ff0002e8, envp=<optimized out>) at main.c:288


I'm not able to get this one anymore - it's been with your artifacts patch in.
For some unknown reason bo  has been pointing to 0 in  !__kgem_flush(&sna->kgem, priv->gpu_bo).

Crash happened directly during initialization so I've got only black screen.

So is it always safe to assume priv->gpu_bo is not NULL?  (since you only check for  priv->cpu_damage == NULL)

Comment 32 Chris Wilson 2012-05-30 06:49:00 UTC

Once upon a time it checked for priv->gpu_bo precisely because it was possible for the block_handler to fire before the screen was constructed...

commit cf5b3e2ebf4ee0330f5421b9377bb512a94ec284
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Wed May 30 14:44:59 2012 +0100

    sna: Don't attempt to flush scanout if unattached
    
    This was lost in the midst of the OSTimer overhaul.
    
    Reported-by: Zdenek Kabelac <zdenek.kabelac@gmail.com>
    References: https://bugs.freedesktop.org/show_bug.cgi?id=50393
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>

Comment 33 Chris Wilson 2012-06-04 03:19:44 UTC

Do you have any insight into the valgrind assertion failure? Does it always fail in the same way? Can you attach debug=full log for such a failure?

Comment 34 Chris Wilson 2012-06-04 03:21:55 UTC

Just to be clear, in this bug report I'm focusing on making sure we can run under valgrind. (Which version of valgrind do you have btw?) Can you please file the instability issue separately so that it can be tracked without me further confusing it with the valgrind bug.

Comment 35 Zdenek Kabelac 2012-06-06 01:14:54 UTC

I'll try later to play again with valgrind - but so far it looks pretty good - while using  still the same vanilla 3.4 kernel - I'm not getting GPU hangs now (at least haven't seen one for couple days now I think) (though now it's replaced with coredumps - but they could be solved more easily :))

My rawhide currently runs:

valgrind-3.7.0-4.fc18.x86_64

Comment 36 Zdenek Kabelac 2012-06-08 07:43:52 UTC

Created attachment 62798 [details]
debug dri log dir

Comment 37 Zdenek Kabelac 2012-06-08 07:48:14 UTC

Created attachment 62799 [details]
error state captured

Comment 38 Zdenek Kabelac 2012-06-08 07:49:06 UTC

Created attachment 62800 [details]
dmesg from hang

Comment 39 Zdenek Kabelac 2012-06-08 07:49:44 UTC

Created attachment 62801 [details]
intel_reg_dumper output

Comment 40 Zdenek Kabelac 2012-06-08 07:52:42 UTC

Ok - after some time I've got GPU hang again.

It's with current git:adc872a9654dc18d778323ca0721704878ad3851
(So I'm not sure how 'unique' is this - I've been having just very few windows on desktop and typing something in gnome-terminal while screen went black)

Is there anything readable from those GPU hangs ?

This time no valgrind involved - just relatively freshly booted system.

Comment 41 Chris Wilson 2012-06-08 08:22:37 UTC

The batch buffer got clobbered; overwritten by a clear/yellow fill to a tiled surface. The batch itself looks perfectly innocuous. Afaict that would only be possible through a wild BLT command (or through a strange GTT write). Did you have assertions enabled at the time? And this was a plain uncomposited Xsession?

Comment 42 Chris Wilson 2012-06-08 09:53:33 UTC

Two things that spring to mind that may help. --enable-debug should spot any out-of-bound writes, and #define DEBUG_FLUSH_SYNC 1 would restrict it to only one batch buffer in flight at a time. That would help define the problem.

Comment 43 Zdenek Kabelac 2012-06-08 11:17:15 UTC

(In reply to comment #41)
> The batch buffer got clobbered; overwritten by a clear/yellow fill to a tiled
> surface. The batch itself looks perfectly innocuous. Afaict that would only be
> possible through a wild BLT command (or through a strange GTT write). Did you
> have assertions enabled at the time? And this was a plain uncomposited
> Xsession?

Compiled with this options (and using SNA)
 --enable-sna --enable-uxa --enable-debug

Using openbox uncomposited WM.

BTW my primary suspect is still the SIGIO handling.
The lock always happened to me when I type on keyboard and move mouse.

So could not setting the mouse position in the signal interrupt do something unexpected ?

I'll try to run for a while now with --enable-debug=full - but it generates tons
of logs and makes the machine not really usable.
(It would be probably nice to have some circular log for at most 50MB)

Comment 44 Chris Wilson 2012-06-08 11:28:04 UTC

(In reply to comment #43)
> (In reply to comment #41)
> > The batch buffer got clobbered; overwritten by a clear/yellow fill to a tiled
> > surface. The batch itself looks perfectly innocuous. Afaict that would only be
> > possible through a wild BLT command (or through a strange GTT write). Did you
> > have assertions enabled at the time? And this was a plain uncomposited
> > Xsession?
> 
> Compiled with this options (and using SNA)
>  --enable-sna --enable-uxa --enable-debug

Using the Option "AccelMethod" "sna", right?


> Using openbox uncomposited WM.

Just checking, it didn't look like the mesa bug but...
 
> BTW my primary suspect is still the SIGIO handling.
> The lock always happened to me when I type on keyboard and move mouse.

Strong correlation, but it could just be the rendering of the text...

> So could not setting the mouse position in the signal interrupt do something
> unexpected ?

No idea, never tried it. I suspect the only reason for the signal handler is fake interactivity on an otherwise hung server.

> I'll try to run for a while now with --enable-debug=full - but it generates
> tons
> of logs and makes the machine not really usable.
> (It would be probably nice to have some circular log for at most 50MB)

Yeah, I've been meaning to think of a more useful means of capturing live debug logs -- unless you have a good idea of what you are looking for, there is simply too much information.

Comment 45 Zdenek Kabelac 2012-06-08 12:43:31 UTC

So I've played for about 30 minutes with full=debug - it has generate 2GB logfile and it has got very slow over the time - is the full debug compilation doing some large list collections that slows things down ?  Especially in the firefox it has got so slow with reaction on mouse after 30 minutes that I had to stop this experiment.

I've not get any GPU hung nor coredump for flash playback - thus it's really not easy to catch in debug mode - probably the condition to trigger hang could be even harder to reach in this slow mode.

Comment 46 Chris Wilson 2012-06-08 13:01:40 UTC

For testing purposed, --enable-debug [i.e. not --enable-debug=full] is fine as that turns on the assertions without the logging. Yes it does perform additional walks over lists to ensure consistency, but the absolute performance killer is actuallly the batch buffer dumping.

Comment 47 Zdenek Kabelac 2012-06-14 05:28:49 UTC

Created attachment 63021 [details]
i915_error_state

Just another hang report - again mouse was involved - I've been just trying to move window holding Alt key.

Comment 48 Chris Wilson 2012-06-14 05:55:11 UTC

Continuing the trend that something (a stray GPU write) clobbered that batch before it was executed. Was this with --enable-debug?

Comment 49 Zdenek Kabelac 2012-06-14 05:58:40 UTC

(In reply to comment #48)
> Continuing the trend that something (a stray GPU write) clobbered that batch
> before it was executed. Was this with --enable-debug?

--enable-debug=memory - but I guess this will not help much

cut&paste just before hang:


[ 28209.006] Allocated bo: 158, 44642304 bytes
[ 28219.047] Allocated shadow pixels: 74, 40953072 bytes, as CPU bo: 0, 0 bytes
[ 28219.047] Allocated bo: 159, 44752896 bytes
[ 28227.114] [mi] EQ overflowing.  Additional events will be discarded until existing events are processed.
[ 28227.114] 
[ 28227.114] Backtrace:
[ 28227.115] 0: /usr/bin/X (xorg_backtrace+0x36) [0x465106]
[ 28227.115] 1: /usr/bin/X (mieqEnqueue+0x26b) [0x55112b]
[ 28227.116] 2: /usr/bin/X (0x400000+0x480c2) [0x4480c2]
[ 28227.116] 3: /usr/lib64/xorg/modules/input/evdev_drv.so (0x7f17108d5000+0x60e4) [0x7f17108db0e4]
[ 28227.116] 4: /usr/bin/X (0x400000+0x805b7) [0x4805b7]
[ 28227.116] 5: /usr/bin/X (0x400000+0xa4790) [0x4a4790]
[ 28227.116] 6: /lib64/libpthread.so.0 (0x7f1714f4f000+0xf5

Comment 50 Chris Wilson 2012-06-14 06:03:31 UTC

(In reply to comment #49)
> (In reply to comment #48)
> > Continuing the trend that something (a stray GPU write) clobbered that batch
> > before it was executed. Was this with --enable-debug?
> 
> --enable-debug=memory - but I guess this will not help much

It helps, that means that we had assertions enabled and so we should have validated that all writes through the BLT (which are not clipped by the hardware) were within the bounds of the target surface.

Which unfortunately also means that we need to look for a more complex failure mode such as re-using active surfaces either in the kernel or in the ddx, or some GPU state mismanagement.

Comment 51 Zdenek Kabelac 2012-06-14 07:52:17 UTC

Hmm I've got my 2nd hang today :( and again with just plain window move with mouse.

Anyway - tried again with valgrind after reboot - haven't reached the hang.-
but got at least this one:

Source and destination overlap in memcpy(0xd3fe6a0, 0xd3fe6c0, 88)
   at 0x4C2D080: memcpy@@GLIBC_2.14 (mc_replace_strmem.c:837)
   by 0x8BB9F0D: memcpy_blt (blt.c:211)
   by 0x8BD2C7B: sna_self_copy_boxes (sna_accel.c:3501)
   by 0x54F324: miCopyRegion (micopy.c:121)
   by 0x8BD2986: sna_copy_window (sna_accel.c:11991)
   by 0x50246D: damageCopyWindow (damage.c:1581)
   by 0x560321: miSpriteCopyWindow (misprite.c:518)
   by 0x4CA8D6: compCopyWindow (compwindow.c:522)
   by 0x5685BF: miMoveWindow (miwindow.c:290)
   by 0x4CB688: compMoveWindow (compwindow.c:375)
   by 0x45D560: ConfigureWindow (window.c:2309)
   by 0x42EEC3: ProcConfigureWindow (dispatch.c:867)
   by 0x4344A9: Dispatch (dispatch.c:428)
   by 0x423504: main (main.c:288)


If the region may overlap - should there be memmove ?
And isn't that some bug which should be asserted ?

Comment 52 Chris Wilson 2012-06-14 08:15:15 UTC

It's not a bug I'd been worrying about since it only happens rarely with fallback surfaces. The fix is trivial (use memmove along the self-copy path), but it is definitely not related to the hangs.

Comment 53 Chris Wilson 2012-06-14 09:12:11 UTC

commit b415ca05c2e1c4f09f85d8769d39e5369661ed3a
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Thu Jun 14 17:06:19 2012 +0100

    sna: Reset damage extents after reduction goes to zero
    
    If the reduction of the damage clears all of the boxes, we need to reset
    the -infinite extents so that we continue to accumulate further damage.
    
    Reported-by: Zdenek Kabelac <zdenek.kabelac@gmail.com>
    References: https://bugs.freedesktop.org/show_bug.cgi?id=50744
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>

Onwards!

Comment 54 Chris Wilson 2012-06-14 09:12:45 UTC

Oops, that was for the other bug, this one merits:

commit 0df7c488640d3590d2a88dc353b72167b6644eaa
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Thu Jun 14 16:09:38 2012 +0100

    sna: Supporting overlapping copies for fallback blits
    
    Reported-by: Zdenek Kabelac <zkabelac@redhat.com>
    References: https://bugs.freedesktop.org/show_bug.cgi?id=50393
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>

Comment 55 Zdenek Kabelac 2012-06-23 13:48:56 UTC

Created attachment 63380 [details]
Crashed Xorg when GPU hang

So after quite some time I've got another GPU hang - at least it is not so frequent now - but still not in good frequency.

Using 565297e6bd3457a150036af9c62fe0dc67b794ac

Again mouse was in use.

Comment 56 Zdenek Kabelac 2012-06-23 13:51:47 UTC

Created attachment 63381 [details]
dmesg from GPU hang

It interesting the GPU could not be reinitialized - since I remember I'be using some shell script to reset GPU and restore it into functional VGA state - so it's interesting kernel cannot do that these days.

Comment 57 Zdenek Kabelac 2012-06-23 13:53:58 UTC

Created attachment 63382 [details]
error state from GPU

Just in case some 'pattern' of failure could be found.

Comment 58 Zdenek Kabelac 2012-06-23 13:59:43 UTC

Looks pretty similar to  Bug 39387
So it's most probably duplication.

Comment 59 Zdenek Kabelac 2012-06-23 14:02:44 UTC

And Bug 48826  looks close as well - even same HW T61 in this case.

Comment 60 Chris Wilson 2012-06-24 01:59:31 UTC

(In reply to comment #58)
> Looks pretty similar to  Bug 39387

(In reply to comment #59)
> And Bug 48826  looks close as well - even same HW T61 in this case.

Both of these bugs match the pattern for a known bug in mesa. This does not, as the batch is overwritten with varying data.

Comment 61 Zdenek Kabelac 2012-06-24 03:03:15 UTC

(In reply to comment #60)
> Both of these bugs match the pattern for a known bug in mesa. This does not, as
> the batch is overwritten with varying data.

Hmmm, while I do not generally use  Mesa - i.e. typical apps - editors/terminal/browser/mailer/chat  I'm using i.e. xlock screen blanker which it using some 3D modes. Also playing movies with 'xv' surface happens from time to time.

While during hang those 3d apps and movies were not involved - they were run usually run during session.  So is there a chance, that 3D pipe would break something in a way - I'd get GPU hang after 30 minutes or so with 2D desktop?

Also how the better trace could be made ?
How big history is need ?

Since the lock happens 1-2 times per week - it's not really feasible to run debug=full and keep TB of logs on my system.

Some ring buffer which would be printed before assert, that would keep history of submitted commands ?

Is the reason of hang known ? Access of unmapped page ?  Hitting some internal gm965 hw fault (since it cannot be reset) ?
Is there any 'disassembler' for instruction in the ring buffer ?

Comment 62 Chris Wilson 2012-06-24 04:03:55 UTC

(In reply to comment #61)
> (In reply to comment #60)
> > Both of these bugs match the pattern for a known bug in mesa. This does not, as
> > the batch is overwritten with varying data.
> 
> Hmmm, while I do not generally use  Mesa - i.e. typical apps -
> editors/terminal/browser/mailer/chat  I'm using i.e. xlock screen blanker which
> it using some 3D modes. Also playing movies with 'xv' surface happens from time
> to time.
> 
> While during hang those 3d apps and movies were not involved - they were run
> usually run during session.  So is there a chance, that 3D pipe would break
> something in a way - I'd get GPU hang after 30 minutes or so with 2D desktop?

No, this is extremely timing sensitive and would require that Mesa overwrote a batch submitted concurrently by the ddx (i.e between the time the contents where written by CPU and then executed by the GPU).

The mesa bug in question was introduce in 8.0 and fixed by 8.0.2; the glClear operation was using uninitialised coordinates and so was writing into random blocks of memory.
 
> Also how the better trace could be made ?
> How big history is need ?

Indeed, just the keeping the log from when the GPU was last idle should be enough to detect this style of bug. Hmm, something to think about. In the meantime if you verify that the hang still occurs with --enable-debug (i.e. assertions only), I'm at a loss to explain where in the ddx this bug is going to come from. (The only operations not clipped to the surface are generated as BLT commands, which I think I've covered by asserting that they are all within bounds. So the remaining possibility is use-after-free of a GPU bo, which should be impossible...)
 
> Since the lock happens 1-2 times per week - it's not really feasible to run
> debug=full and keep TB of logs on my system.
> 
> Some ring buffer which would be printed before assert, that would keep history
> of submitted commands ?
> 
> Is the reason of hang known ? Access of unmapped page ?  Hitting some internal
> gm965 hw fault (since it cannot be reset) ?
> Is there any 'disassembler' for instruction in the ring buffer ?

Look at intel-gpu-tools and intel_error_decode. If you look at the batch buffer currently being executed you will see a large blob of pixel data rather than commands.

Comment 63 Chris Wilson 2012-06-24 07:28:19 UTC

Use after free would be ruled out as well since the act of writing the batch is serialising (userspace asserts that the bo is not busy on the GPU and the kernel itself flushes any activity on the bo before writing to it). So that leaves a wild write that nevertheless was valid enough to pass all the sanity checks.

The other possibility is that this is a kernel bug. It wouldn't be the first, but bugs of this nature are getting harder to find...

Comment 64 Zdenek Kabelac 2012-06-24 13:12:24 UTC

Created attachment 63413 [details]
valgrind run

So I've once again tried to play a bit with valgrind - and here is result of few things that might need some fixes - though most probably they are not sna issues.

And also they do not look like 'big issue' to cause troubles.
They were generate by switching between terminal and Xorg.
It looks like something is being freed when leaving VT and it's tried to be reused on EnterVT.

i.e.:

Invalid read of size 8
   at 0x6AA86C3: __GI_memcpy (memcpy.S:215)
   by 0x46C43D: XNFstrdup (utils.c:1096)
   by 0x4B9C28: xf86DuplicateMode (xf86Modes.c:209)
   by 0x4B347E: xf86CrtcSetModeTransform (xf86Crtc.c:267)
   by 0x4B3D93: xf86SetDesiredModes (xf86Crtc.c:2693)
   by 0x8BF5714: sna_enter_vt (sna_driver.c:1019)
   by 0x4BD1AA: xf86RandR12EnterVT (xf86RandR12.c:1737)
   by 0x494CC0: xf86XVEnterVT (xf86xv.c:1311)
   by 0x813944B: glxDRIEnterVT (glxdri2.c:601)
   by 0x480F8F: xf86Wakeup (xf86Events.c:527)
   by 0x4383EA: WakeupHandler (dixutils.c:421)
   by 0x462693: WaitForSomething (WaitFor.c:224)
   by 0x434170: Dispatch (dispatch.c:357)
   by 0x423504: main (main.c:288)
 Address 0x775ab90 is 96 bytes inside a block of size 128 free'd
   at 0x4C2A9CE: free (vg_replace_malloc.c:427)
   by 0x5A76231: pixman_op (pixman-region.c:1007)
   by 0x5A777EB: pixman_region_subtract (pixman-region.c:2010)
   by 0x562583: miValidateTree (regionstr.h:265)
   by 0x56859C: miMoveWindow (miwindow.c:289)
   by 0x4CB688: compMoveWindow (compwindow.c:375)
   by 0x45D560: ConfigureWindow (window.c:2309)
   by 0x42EEC3: ProcConfigureWindow (dispatch.c:867)
   by 0x4344A9: Dispatch (dispatch.c:428)
   by 0x423504: main (main.c:288)


This seems like might be eventually touching some date ??
But all issues looks like 'read' operation, so it should not cause
probably GPU problems ?

Also I've not been switching terminals when hang happened.

I could look later for reasons - but you might have better experience with this code (and probably even commit right to submit Xorg fixes).

xorg-x11-server-Xorg-1.12.2-3.fc18.x86_64
sna git: fa10005ce31483827547b7f71eae066899f0026c

Comment 65 Zdenek Kabelac 2012-06-24 13:14:25 UTC

Seems to be same issue as in comment 15

Comment 66 Zdenek Kabelac 2012-06-25 01:48:41 UTC

Also this reminded me one thing -  when this crash happened - and Xorg has core dump on assert -  I've seen switch to console screen -  so while in dmesg kernel is writing this:

[drm:intel_lvds_enable] *ERROR* timed out waiting for panel to power on

I've definitely seen an enabled panel.
(And probably properly restored after Xorg has aborted.)


Also why is the 'gem_read()' using assert(0) -  if the wedged GPU is 'expectable',
it should be able to handle this case ?

Or the system is not really able to continue in this moment?

Comment 67 Chris Wilson 2012-06-25 02:28:46 UTC

(In reply to comment #66)
> Also this reminded me one thing -  when this crash happened - and Xorg has core
> dump on assert -  I've seen switch to console screen -  so while in dmesg
> kernel is writing this:
> 
> [drm:intel_lvds_enable] *ERROR* timed out waiting for panel to power on

Shrug. That wait is superfluous anyway... But surprising that it should ever time out.
 
> I've definitely seen an enabled panel.
> (And probably properly restored after Xorg has aborted.)
> 
> 
> Also why is the 'gem_read()' using assert(0) -  if the wedged GPU is
> 'expectable',
> it should be able to handle this case ?

The code should handle the case, but really to have hit that condition in the first place means that we have hit a garbage code path (that read is only used for GPU readback of !wedged data, so if we find ourselves wedged at that point terror ensues). The idea behind the assert was to make it easier to debug.

(In reply to comment #65)
> Seems to be same issue as in comment 15

Agreed. 'Tis the very same set of dangling pointer.

Comment 68 Zdenek Kabelac 2012-06-25 02:53:39 UTC

(In reply to comment #67)

> 
> > I've definitely seen an enabled panel.
> > (And probably properly restored after Xorg has aborted.)
> > 
> > 
> > Also why is the 'gem_read()' using assert(0) -  if the wedged GPU is
> > 'expectable',
> > it should be able to handle this case ?
> 
> The code should handle the case, but really to have hit that condition in the
> first place means that we have hit a garbage code path (that read is only used
> for GPU readback of !wedged data, so if we find ourselves wedged at that point
> terror ensues). The idea behind the assert was to make it easier to debug.
> 

I guess no one would complain if he would get temporarily damaged screen which may even need full 'redraw' in software mode - it would be still 1000x better then current 'dead' end.

(Thought possibly if I'd use  non debug compilation - maybe I'd have got exactly this ??

So is there anything to fix to make assert(0) go away?
(except for commenting it out from sources in my local git)

Comment 69 Chris Wilson 2012-06-25 03:27:36 UTC

It's only a dead end in an almost impossible condition with --enable-debug. But sure, assert(0) can die. I'd vote for something like replacing it with an


#ifndef NDEBUG
#define AWOOGA() awooga(__FUNCTION__, __LINE__)

void awooga(const char *function, int line)
{
  ErrorF("impossible code-path at %s:%d, expect corruption\n", function, line);
}
#else
#define AWOOGA()
#endif

which makes you happy by removing the assert, me happy as I have a breakpoint I can place for any such silly paths, and the user happy as it just works.

Comment 70 Chris Wilson 2012-07-14 09:22:20 UTC

I'm still mulling over how to enhance debugging for production systems. In the meantime, am I right in considering the underlying bugs here begone?

Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.