Bug 38115 - 965GM X freezes (when rendering in rviz) aka pwrite/pread on gtt mmaps
Status: RESOLVED FIXED
Product: DRI
Classification: Unclassified
Component: DRM/Intel
Version: unspecified
Hardware: x86-64 (AMD64) Linux (All)
Importance: medium normal
Assigned To: Daniel Vetter
Reported: 2011-06-09 08:02 UTC by Chris
Modified: 2012-04-05 06:50 UTC (History)
CC: 4 users

Attachments:
  (debian) reportbug log containing Xorg.log, dmesg, .. (125.33 KB, text/plain) - 2011-06-09 08:02 UTC, Chris
  Disable page faults for the fast paths (2.03 KB, patch) - 2011-06-10 13:10 UTC, Chris Wilson
  Disable page faults for the fast paths (3.40 KB, patch) - 2011-06-17 00:01 UTC, Chris Wilson
  Prefault all pages before io (2.51 KB, patch) - 2011-06-20 08:42 UTC, Chris Wilson
  Disable page faults for the fast paths (v3) (2.79 KB, patch) - 2011-06-20 08:53 UTC, Chris Wilson

Description Chris 2011-06-09 08:02:03 UTC
Created attachment 47773 [details]
(debian) reportbug log containing Xorg.log, dmesg, ..

Bug description:

Running a specific scenario (loading/rendering a "Robot Model" in rviz -
http://www.ros.org/wiki/rviz ) leads reproducibly to a full X freeze.
The bug report was created by logging in via ssh while X was frozen.

See the attachment for logs of dmesg and Xorg.log.

System environment: 
-- chipset: 965GM 
-- system architecture: x86_64
-- xf86-video-intel: git 021209d5d3add8b28143611cfad4c5481a2945a3 (merged master into debian-unstable)
-- libc6 2.13-4
-- libdrm-intel1 2.4.25-2
-- libdrm2 2.4.25-2 
-- libpciaccess0 0.12.1-1                                                                                                                                                           
-- libudev0 167-3
-- xserver-xorg-core: 2:1.10.1-2
-- kernel: 2.6.32-5-686
-- Linux distribution: debian unstable
-- Machine or mobo model: Dell Inspiron 1420
-- Display connector: VGA

Additional info:

Running 'intel_gpu_dump' hangs after printing:

  ACTHD: 0xffffffff
  EIR: 0x00000000
  EMR: 0xffffffcd
  ESR: 0x00000001
  PGTBL_ER: 0x00000000
  IPEHR: 0x01000000
  IPEIR: 0x00000000
  INSTDONE: 0xffe5fafe
  INSTDONE1: 0x000fffff
    busy: Projection and LOD
    busy: Bypass FIFO
    busy: Color calculator

Oddly, i915_error_state (both in /sys/kernel/debug/dri/0 and
/sys/kernel/debug/dri/64) says "no error state collected".
Comment 1 Chris Wilson 2011-06-09 08:44:04 UTC
Oh no, I promised Daniel we would *never* do anything this silly.
Comment 2 Chris Wilson 2011-06-09 11:11:14 UTC
Now that Daniel has had a good laugh, can you please attach gdb to rviz during the hang and attach the backtrace? Thanks.
Comment 3 Chris 2011-06-09 12:30:01 UTC
Glad you guys are having fun.

Attaching with gdb does not work (it hangs at "Attaching to ...").
Does this (from the kernel log, as attached) help?

[  720.544650] Call Trace:
[  720.544659]  [<ffffffff813262f2>] ? __mutex_lock_common+0x127/0x193
[  720.544669]  [<ffffffff810cfdef>] ? vm_insert_pfn+0xff/0x13b
[  720.544677]  [<ffffffff8132642a>] ? mutex_lock+0x1a/0x33
[  720.544699]  [<ffffffffa07a1cce>] ? i915_gem_fault+0x4a/0x1bc [i915]
[  720.544708]  [<ffffffff810d0cb1>] ? __do_fault+0x52/0x3ed
[  720.544716]  [<ffffffff810d1e62>] ? handle_pte_fault+0x322/0x8f2
[  720.544726]  [<ffffffff8132a147>] ? do_page_fault+0x32a/0x34c
[  720.544737]  [<ffffffff8104d3ad>] ? walk_system_ram_range+0x82/0x101
[  720.544746]  [<ffffffff8104d443>] ? page_is_ram+0x17/0x21
[  720.544754]  [<ffffffff8102fb9f>] ? pat_pagerange_is_ram+0x30/0x5e
[  720.544762]  [<ffffffff813274d5>] ? page_fault+0x25/0x30
[  720.544771]  [<ffffffff811a3a7f>] ? __copy_user_nocache+0x2f/0xb8
[  720.544793]  [<ffffffffa07a1649>] ? i915_gem_pwrite_ioctl+0x237/0x872 [i915]
[  720.544815]  [<ffffffffa079f42a>] ? i915_gem_object_set_to_gtt_domain+0x8b/0x92 [i915]
[  720.544834]  [<ffffffffa0746fb4>] ? drm_gem_object_lookup+0x37/0x47 [drm]
[  720.544849]  [<ffffffffa07457c0>] ? drm_ioctl+0x24e/0x321 [drm]
[  720.544871]  [<ffffffffa07a1412>] ? i915_gem_pwrite_ioctl+0x0/0x872 [i915]
[  720.544880]  [<ffffffff8132a131>] ? do_page_fault+0x314/0x34c
[  720.544891]  [<ffffffff81104b0b>] ? do_vfs_ioctl+0x467/0x4b4
[  720.544898]  [<ffffffff811084d2>] ? dput+0x29/0xe9
[  720.544907]  [<ffffffff810f8d08>] ? fput+0x1c5/0x1ec
[  720.544915]  [<ffffffff81104ba3>] ? sys_ioctl+0x4b/0x70
[  720.544923]  [<ffffffff81009952>] ? system_call_fastpath+0x16/0x1b

Any other ideas on how I can get useful information on what's going on?

Cheers,
Comment 4 Chris Wilson 2011-06-09 15:32:37 UTC
Ok, if you enable lockdep (CONFIG_PROVE_LOCKING and friends under Kernel Hacking) that should detect the deadlock as it occurs and allow us to get the stacktrace from userspace. Hmm, we may need a few more tricks as well.

The problem is that we are doing a pwrite whose source pointer lies inside a GTT mmap of another bo. We hit a page fault during the copy, re-enter our own fault handler while still holding the lock, and find ourselves in a deadlock.

Hmm, this can happen if the application does the equivalent of glBufferData(dst, glMapBuffer(src));

Woe, woe, and thrice woe.
Comment 5 Chris 2011-06-10 12:37:10 UTC
I got the following backtrace (after which X freezes anyhow...):

=============================================
[ INFO: possible recursive locking detected ]
2.6.38.20110610.1 #1
---------------------------------------------
rviz/2939 is trying to acquire lock:
 (&dev->struct_mutex){+.+.+.}, 
  at: [<ffffffffa05c98ee>] i915_gem_fault+0x50/0x1bf [i915]

but task is already holding lock:
 (&dev->struct_mutex){+.+.+.}, 
  at: [<ffffffffa05c6edf>] i915_mutex_lock_interruptible+0x28/0x4a [i915]

other info that might help us debug this:
2 locks held by rviz/2939:
 #0:  (&dev->struct_mutex){+.+.+.}, 
  at: [<ffffffffa05c6edf>] i915_mutex_lock_interruptible+0x28/0x4a [i915]
 #1:  (&mm->mmap_sem){++++++}, 
  at: [<ffffffff81356aca>] do_page_fault+0x163/0x34c

stack backtrace:
Pid: 2939, comm: rviz Not tainted 2.6.38.20110610.1 #1
Call Trace:
 [<ffffffff810764df>] ? __lock_acquire+0xd38/0xdc0
 [<ffffffff810e0cad>] ? __get_locked_pte+0x8e/0x9e
 [<ffffffff81076a7e>] ? lock_acquire+0xef/0xfe
 [<ffffffffa05c98ee>] ? i915_gem_fault+0x50/0x1bf [i915]
 [<ffffffff81076a63>] ? lock_acquire+0xd4/0xfe
 [<ffffffffa05c98ee>] ? i915_gem_fault+0x50/0x1bf [i915]
 [<ffffffff81352093>] ? __mutex_lock_common+0x47/0x355
 [<ffffffffa05c98ee>] ? i915_gem_fault+0x50/0x1bf [i915]
 [<ffffffffa05c98ee>] ? i915_gem_fault+0x50/0x1bf [i915]
 [<ffffffff810749df>] ? trace_hardirqs_on_caller+0x13f/0x172
 [<ffffffff81352427>] ? mutex_lock_nested+0x28/0x2f
 [<ffffffffa05c98ee>] ? i915_gem_fault+0x50/0x1bf [i915]
 [<ffffffff8107464b>] ? mark_lock+0x2d/0x210
 [<ffffffff810dd5ec>] ? __do_fault+0x50/0x32c
 [<ffffffff810dfaac>] ? handle_pte_fault+0x240/0x676
 [<ffffffff81356aca>] ? do_page_fault+0x163/0x34c
 [<ffffffff81076a7e>] ? lock_acquire+0xef/0xfe
 [<ffffffff810dcf31>] ? pmd_offset+0x14/0x3b
 [<ffffffff810e0207>] ? handle_mm_fault+0x181/0x194
 [<ffffffff81356c91>] ? do_page_fault+0x32a/0x34c
 [<ffffffff8107487e>] ? mark_held_locks+0x50/0x72
 [<ffffffff8135237f>] ? __mutex_lock_common+0x333/0x355
 [<ffffffff81072d26>] ? trace_hardirqs_off_caller+0xa3/0x10b
 [<ffffffff81353099>] ? trace_hardirqs_off_thunk+0x3a/0x3c
 [<ffffffff81353e55>] ? page_fault+0x25/0x30
 [<ffffffff811be7cf>] ? __copy_user_nocache+0x2f/0xb8
 [<ffffffffa05ca046>] ? i915_gem_pwrite_ioctl+0x244/0x840 [i915]
 [<ffffffffa05618a9>] ? drm_ioctl+0x2a9/0x38b [drm]
 [<ffffffffa05c9e02>] ? i915_gem_pwrite_ioctl+0x0/0x840 [i915]
 [<ffffffff810684e6>] ? up_read+0x26/0x2c
 [<ffffffff81108340>] ? rcu_read_lock+0x0/0x30
 [<ffffffff81115a32>] ? do_vfs_ioctl+0x492/0x4d3
 [<ffffffff81076983>] ? lock_release+0x1cd/0x1d9
 [<ffffffff8110844a>] ? rcu_read_unlock+0x1c/0x1e
 [<ffffffff81115ac4>] ? sys_ioctl+0x51/0x75
 [<ffffffff81009992>] ? system_call_fastpath+0x16/0x1b

Does this help?
Comment 6 Chris Wilson 2011-06-10 13:10:55 UTC
Created attachment 47828 [details] [review]
Disable page faults for the fast paths

It confirms that we are indeed seeing a reentrant deadlock...

Hmm, it actually doesn't look that difficult to fix in this case. Can you please try the attached patch?
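The attached patch is not reproduced in this report, but the approach can be sketched as kernel-style pseudocode (helper names fast_user_write/slow_user_write are illustrative, not the actual symbols): with page faults disabled, a fault during the inline copy returns -EFAULT instead of recursing into i915_gem_fault under struct_mutex, and we fall back to a slow path that drops the lock first.

```c
/* Sketch only, assuming hypothetical helpers.  The key idea: no page
 * fault may be serviced while dev->struct_mutex is held, because the
 * fault handler would try to take the same mutex again. */
ret = i915_mutex_lock_interruptible(dev);
if (ret)
	return ret;

pagefault_disable();
ret = fast_user_write(obj, args);  /* __copy_from_user_inatomic inside */
pagefault_enable();

if (ret == -EFAULT) {
	/* Slow path: drop the lock, then copy with faults allowed. */
	mutex_unlock(&dev->struct_mutex);
	ret = slow_user_write(obj, args);
} else {
	mutex_unlock(&dev->struct_mutex);
}
```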
Comment 7 Chris 2011-06-11 10:28:07 UTC
Thanks, the attached patch worked for me - no freeze anymore!
However I got the following kernel failure message:

BUG: scheduling while atomic: rviz/2857/0x10000001
1 lock held by rviz/2857:
 #0:  (&dev->struct_mutex){+.+.+.}, 
   at: [<ffffffffa05bfef0>] i915_mutex_lock_interruptible+0x28/0x4a [i915]

Modules linked in: hidp i915 drm_kms_helper drm i2c_algo_bit uinput binfmt_misc parport_pc ppdev lp parport sco rfcomm bnep l2cap crc16 acpi_cpufreq mperf cpufreq_stats cpufreq_powersave cpufreq_conservative cpufreq_userspace nfsd exportfs nfs lockd fscache nfs_acl auth_rpcgss sunrpc cryptd aes_x86_64 aes_generic ecb ecryptfs fuse firewire_sbp2 loop snd_hda_codec_idt snd_hda_intel snd_hda_codec snd_hwdep snd_pcm r852 snd_seq sm_common snd_timer snd_seq_device nand nand_ids btusb bluetooth snd uvcvideo videodev nand_ecc v4l2_compat_ioctl32 usbhid hid joydev tpm_tis tpm tpm_bios i2c_i801 mtd pcspkr dell_laptop rfkill evdev dcdbas video battery dell_wmi sparse_keymap psmouse soundcore snd_page_alloc i2c_core processor ac button serio_raw power_supply wmi ext3 jbd mbcache sg sr_mod sd_mod cdrom crc_t10dif ata_generic ahci libahci sdhci_pci sdhci ata_piix uhci_hcd tg3 firewire_ohci libata mmc_core ehci_hcd firewire_core scsi_mod crc_itu_t usbcore libphy thermal thermal_sys nls_base [last unloaded: scsi_wait_scan]

Pid: 2857, comm: rviz Not tainted 2.6.38.20110611.1 #1
Call Trace:
 [<ffffffff8103db08>] ? __schedule_bug+0x70/0x75
 [<ffffffff8135138f>] ? schedule+0xa3/0x5cb
 [<ffffffff810451d1>] ? __cond_resched+0x25/0x30
 [<ffffffff81351968>] ? _cond_resched+0x16/0x1d
 [<ffffffff810ca462>] ? __alloc_pages_nodemask+0x92/0x730
 [<ffffffff810d3b8a>] ? shmem_swp_entry+0x41/0x12b
 [<ffffffff810c37e6>] ? rcu_read_lock+0x0/0x30
 [<ffffffff810c37e6>] ? rcu_read_lock+0x0/0x30
 [<ffffffff810f3a24>] ? alloc_pages_current+0xac/0xc4
 [<ffffffff810c3e5c>] ? __page_cache_alloc+0x72/0x79
 [<ffffffff810c407d>] ? do_read_cache_page+0x43/0x125
 [<ffffffff810d44c1>] ? shmem_readpage+0x0/0x3c
 [<ffffffff810c4178>] ? read_cache_page_gfp+0x19/0x23
 [<ffffffffa05c339b>] ? i915_gem_pwrite_ioctl+0x57a/0x86f [i915]
 [<ffffffff810dd0dc>] ? might_fault+0x40/0x90
 [<ffffffffa055a8a9>] ? drm_ioctl+0x2a9/0x38b [drm]
 [<ffffffffa05c2e21>] ? i915_gem_pwrite_ioctl+0x0/0x86f [i915]
 [<ffffffff810684e6>] ? up_read+0x26/0x2c
 [<ffffffff81108340>] ? rcu_read_lock+0x0/0x30
 [<ffffffff81115a32>] ? do_vfs_ioctl+0x492/0x4d3
 [<ffffffff81076983>] ? lock_release+0x1cd/0x1d9
 [<ffffffff8110844a>] ? rcu_read_unlock+0x1c/0x1e
 [<ffffffff81115ac4>] ? sys_ioctl+0x51/0x75
 [<ffffffff81009992>] ? system_call_fastpath+0x16/0x1b

While this is no real problem for me (the application runs smoothly), it may indicate a more general problem.
I failed to mention that rviz uses the OGRE library... maybe that is using the driver in some unorthodox way.
Comment 8 Chris Wilson 2011-06-12 12:36:14 UTC
(In reply to comment #7)
> Thanks, the attached patch worked for me - no freeze anymore!
> However I got the following kernel failure message:
> 
> BUG: scheduling while atomic: rviz/2857/0x10000001
> 1 lock held by rviz/2857:
>  #0:  (&dev->struct_mutex){+.+.+.}, 
>    at: [<ffffffffa05bfef0>] i915_mutex_lock_interruptible+0x28/0x4a [i915]

Yeah, that will be a side effect of those patches: they disable preemption, hence the warnings about scheduling while atomic. The question is: can we make those fast-path allocations atomic?
Comment 9 Chris Wilson 2011-06-17 00:01:06 UTC
Created attachment 48075 [details] [review]
Disable page faults for the fast paths

Once more with setting the ATOMIC flags for the page allocations along the fast path. Some cursory performance testing suggests that we don't regress too badly by taking the slow path more often.
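The trace in comment 7 shows the sleep happening inside read_cache_page_gfp() while allocating the shmem backing page. The fix described here can be sketched as follows (a sketch of the idea, not the actual patch; the slow_path label is illustrative): request the page with a nonblocking allocation, and punt to the slow path outside the critical section if that fails.

```c
/* Sketch: while preemption is disabled we must not sleep, so allocate
 * the backing page without blocking; on failure, drop the locks and
 * retry the whole operation via the fault-tolerant slow path. */
page = read_cache_page_gfp(mapping, index,
			   GFP_NOWAIT | __GFP_NOWARN);
if (IS_ERR(page))
	goto slow_path;  /* unlock, then retry with GFP_KERNEL */
```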
Comment 10 Chris Wilson 2011-06-20 08:42:09 UTC
Created attachment 48197 [details] [review]
Prefault all pages before io

Does this help without the disable-pagefaults patch? I need to reduce the critical region for pagefault_disable() to avoid holding it across the shmemfs operations...
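The prefault idea can be sketched like this (simplified, assuming fault_in_pages_readable() covers the whole user range): touch the source pages before taking struct_mutex, so the later copy under pagefault_disable() rarely hits an unpopulated page.

```c
/* Sketch of "Prefault all pages before io": fault the user buffer in
 * up front, while no locks are held, so page faults during the actual
 * copy become the rare exception rather than the rule. */
if (fault_in_pages_readable(user_data, args->size))
	return -EFAULT;

ret = i915_mutex_lock_interruptible(dev);
if (ret)
	return ret;
/* ... copy with pagefault_disable() as before ... */
```

Note this is only an optimization: another thread can still evict the pages between the prefault and the copy, so the -EFAULT fallback path is still required for correctness.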
Comment 11 Chris Wilson 2011-06-20 08:53:00 UTC
Created attachment 48198 [details] [review]
Disable page faults for the fast paths (v3)

As said, the solution is simply to reduce the critical region.
Comment 12 Florian Mickler 2012-01-21 08:44:23 UTC
A patch referencing this bug report has been merged in Linux v3.2-rc1:

commit 24dd85ff723f142093f44244764b9b5c152235b8
Author: Daniel Vetter <daniel.vetter@ffwll.ch>
Date:   Wed Sep 28 11:57:23 2011 +0200

    io-mapping: ensure io_mapping_map_atomic _is_ atomic
Comment 13 Chris Wilson 2012-01-21 09:02:27 UTC
The patch that landed prevents the freeze, but will probably instead cause GL to segfault or hang the GPU, as it returns an error rather than fixing up the write paths to handle the inaccessible memory. Daniel has an additional series of patches to extend pwrite to support this mode of operation.
Comment 14 Daniel Vetter 2012-02-08 02:19:00 UTC
Patches are merged to drm-intel-next and on track to be merged in 3.4.
Comment 15 Florian Mickler 2012-04-05 06:50:55 UTC
A patch referencing this bug report has been merged in Linux v3.4-rc1:

commit 8461d2267726dddcb9b5d6ae6b32769b38a326cc
Author: Daniel Vetter <daniel.vetter@ffwll.ch>
Date:   Wed Dec 14 13:57:32 2011 +0100

    drm/i915: rewrite shmem_pread_slow to use copy_to_user