Created attachment 47773 [details] (debian) reportbug log containing Xorg.log, dmesg, .. Bug description: Running a specific scenario (loading/rendering a "Robot Model" in rviz - http://www.ros.org/wiki/rviz ) leads reproducibly to a full X freeze. Bugreport was created logging in via ssh (while frozen). See the attachment for logs of dmesg and Xorg.log. System environment: -- chipset: 965GM -- system architecture: x86_64 -- xf86-video-intel: git 021209d5d3add8b28143611cfad4c5481a2945a3 (merged master into debian-unstable) -- libc6 2.13-4 -- libdrm-intel1 2.4.25-2 -- libdrm2 2.4.25-2 -- libpciaccess0 0.12.1-1 -- libudev0 167-3 -- xserver-xorg-core [xorg-video 2:1.10.1-2 -- kernel: 2.6.32-5-686 -- Linux distribution: debian unstable -- Machine or mobo model: Dell Inspiron 1420 -- Display connector: VGA Additional info: Running 'intel_gpu_dump' hangs after printing: ACTHD: 0xffffffff EIR: 0x00000000 EMR: 0xffffffcd ESR: 0x00000001 PGTBL_ER: 0x00000000 IPEHR: 0x01000000 IPEIR: 0x00000000 INSTDONE: 0xffe5fafe INSTDONE1: 0x000fffff busy: Projection and LOD busy: Bypass FIFO busy: Color calculator oddly i915_error_state (both in /sys/kernel/debug/dri/0 and /sys/kernel/debug/dri/64) say "no error state collected".
Oh no, I promised Daniel we would *never* do anything this silly.
Now that Daniel has had a good laugh, can you please attach gdb to rvis during the hang and attach the backtrace. Thanks.
Glad you guys are having fun. Attaching with gdb does not work (hangs saying "Attaching to ...") This (from kernel log, as attached) does not help? [ 720.544650] Call Trace: [ 720.544659] [<ffffffff813262f2>] ? __mutex_lock_common+0x127/0x193 [ 720.544669] [<ffffffff810cfdef>] ? vm_insert_pfn+0xff/0x13b [ 720.544677] [<ffffffff8132642a>] ? mutex_lock+0x1a/0x33 [ 720.544699] [<ffffffffa07a1cce>] ? i915_gem_fault+0x4a/0x1bc [i915] [ 720.544708] [<ffffffff810d0cb1>] ? __do_fault+0x52/0x3ed [ 720.544716] [<ffffffff810d1e62>] ? handle_pte_fault+0x322/0x8f2 [ 720.544726] [<ffffffff8132a147>] ? do_page_fault+0x32a/0x34c [ 720.544737] [<ffffffff8104d3ad>] ? walk_system_ram_range+0x82/0x101 [ 720.544746] [<ffffffff8104d443>] ? page_is_ram+0x17/0x21 [ 720.544754] [<ffffffff8102fb9f>] ? pat_pagerange_is_ram+0x30/0x5e [ 720.544762] [<ffffffff813274d5>] ? page_fault+0x25/0x30 [ 720.544771] [<ffffffff811a3a7f>] ? __copy_user_nocache+0x2f/0xb8 [ 720.544793] [<ffffffffa07a1649>] ? i915_gem_pwrite_ioctl+0x237/0x872 [i915] [ 720.544815] [<ffffffffa079f42a>] ? i915_gem_object_set_to_gtt_domain+0x8b/0x92 [i915] [ 720.544834] [<ffffffffa0746fb4>] ? drm_gem_object_lookup+0x37/0x47 [drm] [ 720.544849] [<ffffffffa07457c0>] ? drm_ioctl+0x24e/0x321 [drm] [ 720.544871] [<ffffffffa07a1412>] ? i915_gem_pwrite_ioctl+0x0/0x872 [i915] [ 720.544880] [<ffffffff8132a131>] ? do_page_fault+0x314/0x34c [ 720.544891] [<ffffffff81104b0b>] ? do_vfs_ioctl+0x467/0x4b4 [ 720.544898] [<ffffffff811084d2>] ? dput+0x29/0xe9 [ 720.544907] [<ffffffff810f8d08>] ? fput+0x1c5/0x1ec [ 720.544915] [<ffffffff81104ba3>] ? sys_ioctl+0x4b/0x70 [ 720.544923] [<ffffffff81009952>] ? system_call_fastpath+0x16/0x1b Any other idea on how I can get valuable information on what's going on? Cheers,
Ok, if you enable lockdep (CONFIG_PROVE_LOCKING and friends under Kernel Hacking) that should detect the deadlock as it occurs and allow us to get the stacktrace from userspace. Hmm, we may need a few more tricks as well. The problem is that we are doing a pwrite from inside one mmap'ed bo to another bo. We hit a pagefault during that operation, and re-enter our code and find ourselves in a deadlock. Hmm, this can happen if the application does glBufferData(dst, glBufferMap(src)); Woe, woe, and thrice woe.
I got the following backtrace (after which X freezes anyhow...): ============================================= [ INFO: possible recursive locking detected ] 2.6.38.20110610.1 #1 --------------------------------------------- rviz/2939 is trying to acquire lock: (&dev->struct_mutex){+.+.+.}, at: [<ffffffffa05c98ee>] i915_gem_fault+0x50/0x1bf [i915] but task is already holding lock: (&dev->struct_mutex){+.+.+.}, at: [<ffffffffa05c6edf>] i915_mutex_lock_interruptible+0x28/0x4a [i915] other info that might help us debug this: 2 locks held by rviz/2939: #0: (&dev->struct_mutex){+.+.+.}, at: [<ffffffffa05c6edf>] i915_mutex_lock_interruptible+0x28/0x4a [i915] #1: (&mm->mmap_sem){++++++}, at: [<ffffffff81356aca>] do_page_fault+0x163/0x34c stack backtrace: Pid: 2939, comm: rviz Not tainted 2.6.38.20110610.1 #1 Call Trace: [<ffffffff810764df>] ? __lock_acquire+0xd38/0xdc0 [<ffffffff810e0cad>] ? __get_locked_pte+0x8e/0x9e [<ffffffff81076a7e>] ? lock_acquire+0xef/0xfe [<ffffffffa05c98ee>] ? i915_gem_fault+0x50/0x1bf [i915] [<ffffffff81076a63>] ? lock_acquire+0xd4/0xfe [<ffffffffa05c98ee>] ? i915_gem_fault+0x50/0x1bf [i915] [<ffffffff81352093>] ? __mutex_lock_common+0x47/0x355 [<ffffffffa05c98ee>] ? i915_gem_fault+0x50/0x1bf [i915] [<ffffffffa05c98ee>] ? i915_gem_fault+0x50/0x1bf [i915] [<ffffffff810749df>] ? trace_hardirqs_on_caller+0x13f/0x172 [<ffffffff81352427>] ? mutex_lock_nested+0x28/0x2f [<ffffffffa05c98ee>] ? i915_gem_fault+0x50/0x1bf [i915] [<ffffffff8107464b>] ? mark_lock+0x2d/0x210 [<ffffffff810dd5ec>] ? __do_fault+0x50/0x32c [<ffffffff810dfaac>] ? handle_pte_fault+0x240/0x676 [<ffffffff81356aca>] ? do_page_fault+0x163/0x34c [<ffffffff81076a7e>] ? lock_acquire+0xef/0xfe [<ffffffff810dcf31>] ? pmd_offset+0x14/0x3b [<ffffffff810e0207>] ? handle_mm_fault+0x181/0x194 [<ffffffff81356c91>] ? do_page_fault+0x32a/0x34c [<ffffffff8107487e>] ? mark_held_locks+0x50/0x72 [<ffffffff8135237f>] ? __mutex_lock_common+0x333/0x355 [<ffffffff81072d26>] ? trace_hardirqs_off_caller+0xa3/0x10b [<ffffffff81353099>] ? trace_hardirqs_off_thunk+0x3a/0x3c [<ffffffff81353e55>] ? page_fault+0x25/0x30 [<ffffffff811be7cf>] ? __copy_user_nocache+0x2f/0xb8 [<ffffffffa05ca046>] ? i915_gem_pwrite_ioctl+0x244/0x840 [i915] [<ffffffffa05618a9>] ? drm_ioctl+0x2a9/0x38b [drm] [<ffffffffa05c9e02>] ? i915_gem_pwrite_ioctl+0x0/0x840 [i915] [<ffffffff810684e6>] ? up_read+0x26/0x2c [<ffffffff81108340>] ? rcu_read_lock+0x0/0x30 [<ffffffff81115a32>] ? do_vfs_ioctl+0x492/0x4d3 [<ffffffff81076983>] ? lock_release+0x1cd/0x1d9 [<ffffffff8110844a>] ? rcu_read_unlock+0x1c/0x1e [<ffffffff81115ac4>] ? sys_ioctl+0x51/0x75 [<ffffffff81009992>] ? system_call_fastpath+0x16/0x1b Does this help?
Created attachment 47828 [details] [review] Disable page faults for the fast paths It confirms that we indeed seeing a reentrant deadlock... Hmm, actually doesn't look that difficult to fix in this case. Please can you try the attached.
Thanks, the attached patch worked for me - no freeze anymore! However I got the following kernel failure message: BUG: scheduling while atomic: rviz/2857/0x10000001 1 lock held by rviz/2857: #0: (&dev->struct_mutex){+.+.+.}, at: [<ffffffffa05bfef0>] i915_mutex_lock_interruptible+0x28/0x4a [i915] Modules linked in: hidp i915 drm_kms_helper drm i2c_algo_bit uinput binfmt_misc parport_pc ppdev lp parport sco rfcomm bnep l2cap crc16 acpi_cpufreq mperf cpufreq_stats cpufreq_powersave cpufreq_conservative cpufreq_userspace nfsd exportfs nfs lockd fscache nfs_acl auth_rpcgss sunrpc cryptd aes_x86_64 aes_generic ecb ecryptfs fuse firewire_sbp2 loop snd_hda_codec_idt snd_hda_intel snd_hda_codec snd_hwdep snd_pcm r852 snd_seq sm_common snd_timer snd_seq_device nand nand_ids btusb bluetooth snd uvcvideo videodev nand_ecc v4l2_compat_ioctl32 usbhid hid joydev tpm_tis tpm tpm_bios i2c_i801 mtd pcspkr dell_laptop rfkill evdev dcdbas video battery dell_wmi sparse_keymap psmouse soundcore snd_page_alloc i2c_core processor ac button serio_raw power_supply wmi ext3 jbd mbcache sg sr_mod sd_mod cdrom crc_t10dif ata_generic ahci libahci sdhci_pci sdhci ata_piix uhci_hcd tg3 firewire_ohci libata mmc_core ehci_hcd firewire_core scsi_mod crc_itu_t usbcore libphy thermal thermal_sys nls_base [last unloaded: scsi_wait_scan] Pid: 2857, comm: rviz Not tainted 2.6.38.20110611.1 #1 Call Trace: [<ffffffff8103db08>] ? __schedule_bug+0x70/0x75 [<ffffffff8135138f>] ? schedule+0xa3/0x5cb [<ffffffff810451d1>] ? __cond_resched+0x25/0x30 [<ffffffff81351968>] ? _cond_resched+0x16/0x1d [<ffffffff810ca462>] ? __alloc_pages_nodemask+0x92/0x730 [<ffffffff810d3b8a>] ? shmem_swp_entry+0x41/0x12b [<ffffffff810c37e6>] ? rcu_read_lock+0x0/0x30 [<ffffffff810c37e6>] ? rcu_read_lock+0x0/0x30 [<ffffffff810f3a24>] ? alloc_pages_current+0xac/0xc4 [<ffffffff810c3e5c>] ? __page_cache_alloc+0x72/0x79 [<ffffffff810c407d>] ? do_read_cache_page+0x43/0x125 [<ffffffff810d44c1>] ? shmem_readpage+0x0/0x3c [<ffffffff810c4178>] ? read_cache_page_gfp+0x19/0x23 [<ffffffffa05c339b>] ? i915_gem_pwrite_ioctl+0x57a/0x86f [i915] [<ffffffff810dd0dc>] ? might_fault+0x40/0x90 [<ffffffffa055a8a9>] ? drm_ioctl+0x2a9/0x38b [drm] [<ffffffffa05c2e21>] ? i915_gem_pwrite_ioctl+0x0/0x86f [i915] [<ffffffff810684e6>] ? up_read+0x26/0x2c [<ffffffff81108340>] ? rcu_read_lock+0x0/0x30 [<ffffffff81115a32>] ? do_vfs_ioctl+0x492/0x4d3 [<ffffffff81076983>] ? lock_release+0x1cd/0x1d9 [<ffffffff8110844a>] ? rcu_read_unlock+0x1c/0x1e [<ffffffff81115ac4>] ? sys_ioctl+0x51/0x75 [<ffffffff81009992>] ? system_call_fastpath+0x16/0x1b While this is no real problem for me (the application runs smoothly), it may indicate a general problem? I failed to mention, that rviz uses OGRE lib... maybe that's using the driver in some unorthodox way.
(In reply to comment #7) > Thanks, the attached patch worked for me - no freeze anymore! > However I got the following kernel failure message: > > BUG: scheduling while atomic: rviz/2857/0x10000001 > 1 lock held by rviz/2857: > #0: (&dev->struct_mutex){+.+.+.}, > at: [<ffffffffa05bfef0>] i915_mutex_lock_interruptible+0x28/0x4a [i915] Yeah, that will be a side-effect of those patches, they disable preemption and so the warnings about scheduling. The question is: can we make those fast path allocations atomic?
Created attachment 48075 [details] [review] Disable page faults for the fast paths Once more with setting the ATOMIC flags for the page allocations along the fast path. Some cursory performance testing suggests that we don't regress too badly by taking the slow path more often.
Created attachment 48197 [details] [review] Prefault all pages before io Does this help without the disable pagefaults patch? As I need to reduce the critical region for pagefaults_disable() to avoid holding it across the shmemfs operations...
Created attachment 48198 [details] [review] Disable page faults for the fast paths (v3) As said, the solution is simply to reduce the critical region.
A patch referencing this bug report has been merged in Linux v3.2-rc1: commit 24dd85ff723f142093f44244764b9b5c152235b8 Author: Daniel Vetter <daniel.vetter@ffwll.ch> Date: Wed Sep 28 11:57:23 2011 +0200 io-mapping: ensure io_mapping_map_atomic _is_ atomic
The patch that landed prevents the freeze, but will probably instead cause GL to segfault or hang the GPU as it returns an error rather than fixup the write paths to handle the inaccessible memory. Daniel has an addition series of patches to extend pwrite to support this mode of operation.
Patches merged to drm-intel-next and is on track to get merged to 3.4
A patch referencing this bug report has been merged in Linux v3.4-rc1: commit 8461d2267726dddcb9b5d6ae6b32769b38a326cc Author: Daniel Vetter <daniel.vetter@ffwll.ch> Date: Wed Dec 14 13:57:32 2011 +0100 drm/i915: rewrite shmem_pread_slow to use copy_to_user
Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.