Bug 25911 - 2.10.0 causes kernel oops and system hangs
Status: RESOLVED FIXED
Alias: None
Product: xorg
Classification: Unclassified
Component: Driver/intel
Version: unspecified
Hardware: x86 (IA32) Linux (All)
Importance: high critical
Assignee: Carl Worth
QA Contact: Xorg Project Team
Keywords: regression
Reported: 2010-01-06 01:57 UTC by Łukasz Maśko
Modified: 2010-02-09 00:43 UTC

Attachments
intel_gpu_dump for driver 2.10 on 2.6.32 kernel (126.53 KB, application/x-gzip)
2010-01-13 03:14 UTC, Octavian Petre
Xorg log file showing the error message (8.42 KB, application/x-gzip)
2010-01-13 03:15 UTC, Octavian Petre

Description Łukasz Maśko 2010-01-06 01:57:56 UTC
Yesterday I tried the latest intel driver. Unfortunately, after some time in use I got a system crash. I gave it another try, but the result was the same. My system log contains the following entries:

Jan  5 14:15:39 laptok kernel: : [drm:i915_gem_object_pin] *ERROR* Failure to install fence: -28
Jan  5 14:15:39 laptok kernel: : ------------[ cut here ]------------
Jan  5 14:15:39 laptok kernel: : kernel BUG at drivers/gpu/drm/i915/i915_gem.c:2123!
Jan  5 14:15:39 laptok kernel: : invalid opcode: 0000 [#1] PREEMPT SMP
Jan  5 14:15:39 laptok kernel: : last sysfs file: /sys/devices/LNXSYSTM:00/LNXSYBUS:00/PNP0C0A:00/power_supply/BAT0/charge_full
Jan  5 14:15:39 laptok kernel: : Modules linked in: binfmt_misc ipv6 rfcomm bridge stp bnep hidp l2cap crc16 nls_iso8859_2 nls_cp852 vfat fat sg sr_mod cdrom snd_pcm_oss snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device snd_mixer_oss i8k vboxdrv fuse ircomm_tty ircomm irda crc_ccitt dm_mod cpufreq_powersave cpufreq_ondemand cpufreq_stats acpi_cpufreq freq_table hid_a4tech usbhid hid ipaq usbserial usb_storage scsi_mod usb_libusual btusb zaurus bluetooth cdc_acm cdc_ether usbnet mii cdc_wdm arc4 ecb sdhci_pci sdhci mmc_core iwl3945 snd_hda_codec_idt ohci1394 iwlcore mac80211 ieee1394 led_class iTCO_wdt snd_hda_intel uhci_hcd ehci_hcd firewire_ohci firewire_core snd_hda_codec snd_hwdep snd_pcm snd_timer crc_itu_t snd soundcore snd_page_alloc pcmcia usbcore joydev rng_core thermal cfg80211 rfkill yenta_socket rsrc_nonstatic pcmcia_core tg3 libphy evdev i2c_i801 battery processor ppdev parport_pc parport dcdbas ac wmi pcspkr psmouse serio_raw tuxonice_userui tuxonice_swap tuxonice_bio tuxonice_compress i915 drm_kms_helper drm i2c_algo_bit button video output intel_agp ext3 jbd mbcache ide_gd_mod piix ide_core
Jan  5 14:15:39 laptok kernel: :
Jan  5 14:15:39 laptok kernel: : Pid: 3556, comm: Xorg Not tainted (2.6.32.2-laptop-bfs-1 #1) Latitude D430
Jan  5 14:15:39 laptok kernel: : EIP: 0060:[<f81d03b5>] EFLAGS: 00213246 CPU: 1
Jan  5 14:15:39 laptok kernel: : EIP is at i915_gem_leavevt_ioctl+0x115/0x150 [i915]
Jan  5 14:15:39 laptok kernel: : EAX: f454c000 EBX: f69a5000 ECX: 00000000 EDX: 00003232
Jan  5 14:15:39 laptok kernel: : ESI: f6968000 EDI: f69a5e0c EBP: f69a5e20 ESP: f454dda8
Jan  5 14:15:39 laptok kernel: : DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
Jan  5 14:15:39 laptok kernel: : Process Xorg (pid: 3556, ti=f454c000 task=f698e810 task.ti=f454c000)
Jan  5 14:15:39 laptok kernel: : Stack:
Jan  5 14:15:39 laptok kernel: : 00000000 00000015 f430cbc0 00000015 f101b540 f81d1a30 f454de34 00100000
Jan  5 14:15:39 laptok kernel: : <0> f454de18 f454dde0 f454dde4 c118ee03 b3d2e000 f3bbf224 f69a5000 0010ebc4
Jan  5 14:15:39 laptok kernel: : <0> f3b75aa0 f36a4000 f68050c0 f1119840 f69a5000 f430cbc0 f3b75800 f36a4000
Jan  5 14:15:39 laptok kernel: : Call Trace:
Jan  5 14:15:39 laptok kernel: : [<f81d1a30>] ? i915_gem_execbuffer+0x830/0x1310 [i915]
Jan  5 14:15:39 laptok kernel: : [<c118ee03>] ? prio_tree_next+0x93/0x220
Jan  5 14:15:39 laptok kernel: : [<f810626d>] ? drm_ioctl+0x14d/0x300 [drm]
Jan  5 14:15:39 laptok kernel: : [<f81d1200>] ? i915_gem_execbuffer+0x0/0x1310 [i915]
Jan  5 14:15:39 laptok kernel: : [<c10c866c>] ? vm_insert_pfn+0xac/0x110
Jan  5 14:15:39 laptok kernel: : [<f81d2928>] ? i915_gem_fault+0x98/0x150 [i915]
Jan  5 14:15:39 laptok kernel: : [<c10eee49>] ? vfs_ioctl+0x89/0xa0
Jan  5 14:15:39 laptok kernel: : [<c10eefc9>] ? do_vfs_ioctl+0x79/0x5c0
Jan  5 14:15:39 laptok kernel: : [<c10c7498>] ? handle_mm_fault+0x1a8/0x720
Jan  5 14:15:39 laptok kernel: : [<c10ef586>] ? sys_ioctl+0x76/0x90
Jan  5 14:15:39 laptok kernel: : [<c1002e90>] ? sysenter_do_call+0x12/0x22
Jan  5 14:15:39 laptok kernel: : Code: c0 89 c1 75 a6 89 f0 e8 ca fa ff ff 85 c0 89 c1 75 99 89 f8 89 0c 24 e8 2a 1e 11 c9 3b ab 20 0e 00 00 74 0b 89 f8 e8 bb 20 11 c9 <0f> 0b eb fe 8d 83 18 0e 00 00 39 83 18 0e 00 00 75 e7 8d 83 10
Jan  5 14:15:39 laptok kernel: : EIP: [<f81d03b5>] i915_gem_leavevt_ioctl+0x115/0x150 [i915] SS:ESP 0068:f454dda8
Jan  5 14:15:39 laptok kernel: : ---[ end trace b8ce43fc5aef224b ]---

I'm using an almost plain 2.6.32.2 kernel.
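
The -28 in the first log line above is a negated errno value. On Linux, errno 28 is ENOSPC ("No space left on device"); in this context it most likely means the driver found no free fence register, not that a disk is full. That reading is an assumption based on the message text. A minimal sketch for decoding such codes, using a hypothetical helper that is not part of the kernel or libdrm:

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

/*
 * Hypothetical helper (not part of the kernel or libdrm): map a
 * kernel-style return code such as -28 to its errno description.
 */
static const char *describe_kernel_error(int ret)
{
    assert(ret != 0); /* 0 means success, so there is nothing to describe */
    /* Kernel code returns -errno; flip the sign before the lookup. */
    return strerror(ret < 0 ? -ret : ret);
}
```

Calling describe_kernel_error(-28) returns the C library's description of errno 28.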
Comment 1 Gordon Jin 2010-01-06 18:20:32 UTC
What's the previous working version?
Comment 2 Łukasz Maśko 2010-01-07 01:26:20 UTC
I can use without problems intel driver 2.9.1 with Mesa 7.7, libdrm 2.4.17 and kernel 2.6.32.x. The problems appear when I change the intel driver.
Comment 3 Octavian Petre 2010-01-13 03:13:35 UTC
My system:
00:02.1 Display controller: Intel Corporation Mobile 945GM/GMS/GME, 943/940GML Express Integrated Graphics Controller (rev 03)

It also became unstable with 2.10.0. No kernel crash, but the display is frozen.

I had:
xf86-video-intel-2.10.0
libdrm-2.4.17
mesa-7.7
xorg-server-1.7.4
kernel-2.6.33_rc3

After a few minutes of using X I got this in Xorg.0.log:
(EE) intel(0): Failed to submit batch buffer, expect rendering corruption or even a frozen display: Input/output error.

I could not use the display anymore and had to reboot.

Then I tried:
xf86-video-intel-2.10.0
libdrm-2.4.17
mesa-7.7
xorg-server-1.7.4
kernel-2.6.32

The issues appeared again, so I dumped the registers with intel_gpu_dump; the dump is attached, along with the Xorg.0.log.

The working configuration right now is:
xf86-video-intel-2.9.1
libdrm-2.4.17
mesa-7.7
xorg-server-1.7.4
kernel-2.6.32
Comment 4 Octavian Petre 2010-01-13 03:14:42 UTC
Created attachment 32610
intel_gpu_dump for driver 2.10 on 2.6.32 kernel
Comment 5 Octavian Petre 2010-01-13 03:15:09 UTC
Created attachment 32611
Xorg log file showing the error message
Comment 6 Carl Worth 2010-02-08 18:12:30 UTC
(In reply to comment #2)
> I can use without problems intel driver 2.9.1 with Mesa 7.7, libdrm 2.4.17 and
> kernel 2.6.32.x. The problems appear when I change the intel driver.

Hi Łukasz,

Is this failure easy for you to replicate?

I haven't seen a trend of similar kernel crashes from other users with version 2.9.1, so there might be something unique about your system.

If so, could you perform a git-bisect between version 2.9.1 and 2.10 of the driver to identify what the commit is that introduced the problem?

-Carl
Comment 7 Carl Worth 2010-02-08 18:14:20 UTC
(In reply to comment #3)
> My system:
> 00:02.1 Display controller: Intel Corporation Mobile 945GM/GMS/GME, 943/940GML
> Express Integrated Graphics Controller (rev 03)
> 
> also become unstable with 2.10.0. No kernel crash but display is frozen

Hi Octavian,

I appreciate you sharing your report.

However, with graphics driver bugs we really want one bug report per user symptom. Certainly one bug that results in a kernel crash is distinct from a bug that does not. Could you please open a second bug report for your issue so that we can track it separately?

Thanks,

-Carl
Comment 8 Łukasz Maśko 2010-02-08 23:24:26 UTC
(In reply to comment #6)
[...]
> Is this failure easy for you to replicate?
> 
> I haven't seen a trend of similar kernel crashes from other users with version
> 2.9.1, so there might be something unique about your system.

Yes, it is (or at least was) easy to replicate. I gave 2.10.0 a chance and tried it several times, because I know that crashes are sometimes caused by other parts of the system, but every time, sooner or later, the result was the same (the system hung). Since then I've been using only 2.9.1, which is rock-steady (as long as I don't use a 2-screen configuration, which gives me a workspace wider than 2048).

> If so, could you perform a git-bisect between version 2.9.1 and 2.10 of the
> driver to identify what the commit is that introduced the problem?

I'm afraid I'll have two problems with that. First, right now I have no time to do it: too much work. Second, I've never done a bisection like this, so I need some instructions.

Lukasz
Comment 9 Łukasz Maśko 2010-02-09 00:15:44 UTC
Just to confirm: I've just tried again, this time on 2.6.32.7. Half an hour later: crash. Went back to 2.9.1.
Comment 10 Chris Wilson 2010-02-09 00:43:11 UTC
Eric realised that we were not accounting for pinned buffers when working out the number of fences required for the batch. As we don't actually know how many of these fences are lost due to pinned buffers, we have to make a conservative guess of 2 instead.

The reason this started having a pronounced effect with 2.10.0 is that the put_image acceleration introduced with that release uses a tiled blit (requiring a fence on i915 and prior), which causes fence starvation (previously we never even attempted to use fences).

I've pushed this patch to drm, which should work around the issue:

commit fdcde592c2c48e143251672cf2e82debb07606bd
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Tue Feb 9 08:32:54 2010 +0000

    intel: Account for potential pinned buffers hogging fences
    
    As the kernel reports the total number of fences, we must guess how many
    fences are likely to be pinned. In the typical system these will be only
    used by the scanout buffers, of which there may be one per pipe, and any
    number of manually pinned fenced buffers. So take a conservative guess
    and reserve two fences for use by the system.
    
    Note this reduces the number of fences to 3 for i915 and prior.
    
    Reference:
      http://bugs.freedesktop.org/show_bug.cgi?id=25911
      The latest intel driver 2.10.0 causes kernel oops and system hangs
    
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
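
The accounting described in the commit message can be sketched as follows. This is an illustration only, not the code from the commit: the names RESERVED_PINNED_FENCES, usable_fences and batch_fits are hypothetical, and the only figures taken from the message are the reserve of two fences and the resulting 3 on i915 and prior.

```c
#include <assert.h>

/* Assumed name: fences held back for pinned buffers, per the commit message. */
#define RESERVED_PINNED_FENCES 2

/* Fences a batch may use, given the total the kernel reports; never negative. */
static int usable_fences(int reported_total)
{
    int usable = reported_total - RESERVED_PINNED_FENCES;
    return usable > 0 ? usable : 0;
}

/* Nonzero if a batch that needs `required` fences stays within the budget. */
static int batch_fits(int required, int reported_total)
{
    assert(required >= 0);
    return required <= usable_fences(reported_total);
}
```

With a reported total of 5 (the value implied by "3 for i915 and prior" after reserving 2), usable_fences(5) gives 3, so a batch needing 4 fenced buffers no longer fits and would presumably have to be split before submission.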

