Bug 23919

Summary: [855GM KMS] allocate memory fail
Product: xorg Reporter: Tony White <tonywhite100>
Component: Driver/intel Assignee: Wang Zhenyu <zhenyu.z.wang>
Status: RESOLVED FIXED QA Contact: Xorg Project Team <xorg-team>
Severity: critical    
Priority: medium CC: cworth
Version: 7.4 (2008.09) Keywords: NEEDINFO
Hardware: x86 (IA32)   
OS: Linux (All)   
Whiteboard:
i915 platform: i915 features:
Attachments:
Description Flags
Xorg.0.log none
lspci -vv none
dmesg none
xorg crashed but logged no errors none
intel-drm kernel dmesg none
intel-drm kernel messages none
intel-drm syslog none
xsession-errors none
intel-drm Xorg.0.log none

Description Tony White 2009-09-14 03:30:11 UTC
I've been having intermittent problems with xorg since the KMS work was merged into the kernel: the xserver has been freezing, crashing and not responding to input. To add to that, there is nothing in the logs, so it's a complete mystery.
Then finally today I found something in Xorg.0.log.old, because I was randomly kicked out of kde straight into init 3.
I've attached the log; maybe someone here will know if it's a bug, because there doesn't appear to be any way I can backtrace it.
When it happens, kde freezes and nothing responds to input. Neither Ctrl + Alt + Backspace nor Ctrl + Alt + Delete does anything, and the xserver doesn't automatically restart after the crash. The desktop completely locks up and I can only hold down the power button to shut down and power up again.
Comment 1 Tony White 2009-09-14 03:35:08 UTC
Created attachment 29509 [details]
Xorg.0.log

The error is at the end of the log.
Comment 2 Tony White 2009-09-14 03:39:49 UTC
Created attachment 29510 [details]
lspci -vv

If any other data or info is required, please just ask and I'll try my best to add what I know.
Comment 3 Julien Cristau 2009-09-14 04:19:42 UTC
> --- Comment #1 from Tony White <tonywhite100@googlemail.com>  2009-09-14 03:35:08 PST ---
> Created an attachment (id=29509)
>  --> (http://bugs.freedesktop.org/attachment.cgi?id=29509)
> Xorg.0.log
> 
> The error is at the end of the log.
> 
That error looks like bug #20516.
Comment 4 Tony White 2009-09-14 13:54:27 UTC
Nope, doesn't look like it. I can log in and out just fine.
Also, I'm using debian sid right now, and this exact same issue has also occurred on two other Linux installs from two other Linux vendors, but this is the first time I've been able to find a record of anything going wrong in the logs.
Comment 5 Gordon Jin 2009-09-14 20:07:04 UTC
the log says:

(EE) intel(0): Failed to initialize kernel memory manager
(==) intel(0): VideoRam: 131072 KB
(II) intel(0): Attempting memory allocation with tiled buffers.
(WW) intel(0): xf86AllocateGARTMemory: allocation of 1536 pages failed
	(Cannot allocate memory)
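
For anyone triaging a similar log, here is a minimal Python sketch that pulls the error and warning lines out of an Xorg log and converts a failed GART allocation into a size. The function names are made up for illustration, the 4 KiB page size is an assumption, and it assumes each line starts with the marker (as in the excerpt above, with no timestamp prefix).

```python
import re

# Xorg log lines begin with a marker: (EE) for errors, (WW) for warnings.
MARKER = re.compile(r"^\((EE|WW)\)\s*(.*)$")
PAGES = re.compile(r"allocation of (\d+) pages failed")


def find_problems(log_text):
    """Return (level, message) pairs for every (EE) or (WW) line."""
    problems = []
    for line in log_text.splitlines():
        match = MARKER.match(line.strip())
        if match:
            problems.append((match.group(1), match.group(2)))
    return problems


def failed_allocation_kib(message, page_kib=4):
    """Size of a failed GART allocation in KiB (assumes 4 KiB pages)."""
    match = PAGES.search(message)
    return int(match.group(1)) * page_kib if match else None


# The excerpt quoted in this comment.
sample = """\
(EE) intel(0): Failed to initialize kernel memory manager
(==) intel(0): VideoRam: 131072 KB
(II) intel(0): Attempting memory allocation with tiled buffers.
(WW) intel(0): xf86AllocateGARTMemory: allocation of 1536 pages failed
\t(Cannot allocate memory)
"""

for level, message in find_problems(sample):
    print(level, message, failed_allocation_kib(message))
```

Under the 4 KiB assumption, the 1536 pages in the warning come to 6144 KiB, i.e. 6 MiB.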
Comment 6 Wang Zhenyu 2009-09-15 01:52:36 UTC
Could you paste dmesg? 

Eric has done a fix for 8xx which is on his drm-intel-next branch; could you test it? (passing i915.powersave=0 if that's not relevant to you).

git clone git://git.kernel.org/pub/scm/linux/kernel/git/anholt/drm-intel.git


Comment 7 Tony White 2009-09-16 08:54:45 UTC
I pulled the latest snapshot from here :
http://git.kernel.org/?p=linux/kernel/git/anholt/drm-intel.git;a=summary
the last commit was :
4 days ago	Chris Wilson	drm/i915: Only destroy a constructed mmap offset drm-intel-next.

It's building now and I'll test it after it's built. I'll post back either way with a yes it's fixed here or not in about a week (Hopefully.)

I don't use :
i915.powersave=0 or anything like that on this machine, when booting Linux.

If this fix works, is there any way to know if/when it will get merged into mainline?
Sort of like the next bug fix release for the 2.6.31 kernel, or 2.6.32?
I'm guessing its intel-next tag means 2.6.32 possibly? Or is that a silly question?
Comment 8 Tony White 2009-09-16 08:56:13 UTC
Created attachment 29596 [details]
dmesg
Comment 9 Tony White 2009-09-16 14:53:32 UTC
I don't know how, but that snapshot made things a lot worse. There were a few small horizontal black lines in a Qt 4 application (Opera) that were not supposed to be there (I'd call them artifacts, about 2 cm long). It took this kernel a grand total of two minutes and thirty seconds before it crashed as described: complete lock-up and nothing in the logs. I've attached the Xorg.0.log but it has logged no error.

Can't the intel xorg driver just be regressed right back to the 2.6.27.x version? This is a major problem here and has been since that kernel, getting worse with every new version released.
I know it seems like I'm moaning but I simply cannot use Linux like this.
10 year old Windows XP is actually more reliable right now.

Please fix this bug. I'm willing to test any patches you guys can post but I can only put up with this for another three months. I will just buy/build another (non-Intel) machine to solve it if you can't.

Thanks for the patch suggestion but it very much did not work.
Comment 10 Tony White 2009-09-16 14:56:27 UTC
Created attachment 29607 [details]
xorg crashed but logged no errors
Comment 11 Wang Zhenyu 2009-09-16 18:10:42 UTC
From your kernel dmesg (which seems not to be from the drm-intel-next kernel), you don't load the i915 module at all. And you have vesafb loaded; don't do that. The X log also shows that you're using UMS instead of KMS.

Please check your kernel config and make sure you have enabled the following options:
CONFIG_AGP=y                                
CONFIG_AGP_INTEL=y                                              
CONFIG_DRM=y                                        
CONFIG_DRM_I915=y                                              
CONFIG_DRM_I915_KMS=y                                         
CONFIG_FRAMEBUFFER_CONSOLE=y                                                                                                                
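
A minimal Python sketch for checking a kernel config against the list above. The function names are hypothetical, and where the config text comes from is an assumption (a .config in the build tree, or a file such as /boot/config-<version> if the distribution ships one). It reports anything not set to =y, because that is how the list is written; treating =m as a mismatch is a choice made here, not a statement about what the driver requires.

```python
REQUIRED = [
    "CONFIG_AGP",
    "CONFIG_AGP_INTEL",
    "CONFIG_DRM",
    "CONFIG_DRM_I915",
    "CONFIG_DRM_I915_KMS",
    "CONFIG_FRAMEBUFFER_CONSOLE",
]


def parse_config(config_text):
    """Map option name -> value, skipping blanks and '# ... is not set' lines."""
    options = {}
    for line in config_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, sep, value = line.partition("=")
        if sep:
            options[name] = value
    return options


def not_built_in(config_text, required=REQUIRED):
    """Return the required options that are missing or not set to 'y'."""
    options = parse_config(config_text)
    return [name for name in required if options.get(name) != "y"]


# Made-up config fragment for illustration only.
example = """\
CONFIG_AGP=y
CONFIG_AGP_INTEL=y
CONFIG_DRM=y
CONFIG_DRM_I915=m
# CONFIG_DRM_I915_KMS is not set
"""

print(not_built_in(example))
```

To use it on a real file, read the file and pass its contents in, e.g. not_built_in(open(path).read()), where path is whatever config file matches the kernel being tested.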
                                     
Comment 12 Tony White 2009-09-17 01:31:46 UTC
Firstly, are you saying that I should use intelfb on the kernel command line, and that the kernel (or udev) is incorrectly loading the fallback vesa framebuffer device driver instead of intelfb?

Secondly, debian isn't configured to use kms to my knowledge, and I don't expect it will be for at least a year. Doesn't there need to be something that sends a signal to the kernel to stop kms when xorg is run from userspace if CONFIG_DRM_I915_KMS=y is set, because that means kernel mode setting is turned on by default on the intelfb?
kms has never worked on this machine when the intelfb driver is set to kms by default. It gets to x and then x won't load.
Nevertheless, I'll try it again.

@ seems not drm-intel-next kernel - Sorry, yes that is the wrong dmesg. I will post the right one.
Comment 13 Julien Cristau 2009-09-17 01:46:23 UTC
> --- Comment #12 from Tony White <tonywhite100@googlemail.com>  2009-09-17 01:31:46 PST ---
> Firstly, are you saying that I should use intelfb on the kernel command line
> and in fact the kernel (Or udev) Is incorrectly loading the fallback vesa
> framebuffer device driver instead of intelfb?

No, using intelfb leads to lots of pain.

> Secondly, debian isn't configured to my knowledge to use kms and I don't expect
> it will for at least a year. Doesn't there need to be something that sends a
> signal to the kernel to stop kms when xorg is run from userspace if
> CONFIG_DRM_I915_KMS=y is set because that means kernel mode settings turned on
> by default on the intelfb?

Forget about intelfb.  For kms, adding 'options i915 modeset=1' to
/etc/modprobe.d/kms.conf (and regenerating your initramfs) should do the
trick.  Or simply adding i915.modeset=1 to the kernel command line
should work too.
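
To double-check which of the two settings above is actually in effect, a small Python sketch along these lines can parse both forms. The function names are hypothetical; it only looks at text handed to it, so reading /proc/cmdline or the modprobe.d file is left to the caller.

```python
def modeset_from_cmdline(cmdline):
    """Return the last i915.modeset value on a kernel command line, or None."""
    value = None
    for token in cmdline.split():
        if token.startswith("i915.modeset="):
            value = token.split("=", 1)[1]
    return value


def modeset_from_modprobe(conf_text):
    """Return the last modeset value on an 'options i915 ...' line, or None."""
    value = None
    for line in conf_text.splitlines():
        # Drop trailing comments, then split into whitespace-separated fields.
        fields = line.split("#", 1)[0].split()
        if len(fields) >= 3 and fields[0] == "options" and fields[1] == "i915":
            for field in fields[2:]:
                if field.startswith("modeset="):
                    value = field.split("=", 1)[1]
    return value


# Illustrative inputs; the root= and quiet tokens are made up.
print(modeset_from_cmdline("root=/dev/sda1 ro quiet i915.modeset=1"))
print(modeset_from_modprobe("options i915 modeset=1\n"))
```

A return value of None from both means neither place sets it, in which case the driver's built-in default applies.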
Comment 14 Tony White 2009-09-17 10:34:00 UTC
OK. So I did as asked, pulled the snapshot again, used :
CONFIG_AGP=y                                
CONFIG_AGP_INTEL=y                                              
CONFIG_DRM=y                                        
CONFIG_DRM_I915=y                                              
CONFIG_DRM_I915_KMS=y                                         
CONFIG_FRAMEBUFFER_CONSOLE=y

booted and the result is the same. The xserver won't stay up without crashing and freezing for any longer than two minutes. I've booted into it about eight times now. It freezes every go.

In the logs :
/var/log/messages :

Sep 17 13:35:05 pentium-m kernel: i915 0000:00:02.0: VGA-1: EDID invalid.
Sep 17 13:35:05 pentium-m kernel:
Sep 17 13:35:05 pentium-m kernel: i915 0000:00:02.0: VGA-1: EDID invalid.
Sep 17 13:35:05 pentium-m kernel: [drm] DAC-6: set mode 1024x768 1c
Sep 17 13:35:23 pentium-m kernel:
Sep 17 13:35:23 pentium-m kernel: i915 0000:00:02.0: VGA-1: EDID invalid.
Sep 17 13:35:23 pentium-m kernel:
Sep 17 13:35:23 pentium-m kernel: i915 0000:00:02.0: VGA-1: EDID invalid.

I guess that means a wrong display resolution; however, that can't be right, because .xsession-errors says the resolution is supported.

/var/log/syslog :

Sep 17 14:14:29 pentium-m kernel: [drm:edid_is_valid] *ERROR* Raw EDID:
Sep 17 14:14:29 pentium-m kernel: <3>00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
Sep 17 14:14:29 pentium-m kernel: <3>00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
Sep 17 14:14:29 pentium-m kernel: <3>00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
Sep 17 14:14:29 pentium-m kernel: <3>00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
Sep 17 14:14:29 pentium-m kernel: <3>00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
Sep 17 14:14:29 pentium-m kernel: <3>00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
Sep 17 14:14:29 pentium-m kernel: <3>00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
Sep 17 14:14:29 pentium-m kernel: <3>00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
Sep 17 14:14:29 pentium-m kernel:
Sep 17 14:14:29 pentium-m kernel: i915 0000:00:02.0: VGA-1: EDID invalid.
Sep 17 14:14:29 pentium-m kernel: fb: conflicting fb hw usage inteldrmfb vs VESA VGA - removing generic driver
Sep 17 14:14:29 pentium-m kernel: Console: switching to colour dummy device 80x25
Sep 17 14:14:29 pentium-m kernel: fbcon: inteldrmfb (fb0) is primary device
Sep 17 14:14:29 pentium-m kernel: render error detected, EIR: 0x00000010
Sep 17 14:14:29 pentium-m kernel: [drm:i915_handle_error] *ERROR* EIR stuck: 0x00000010, masking
Sep 17 14:14:29 pentium-m kernel: render error detected, EIR: 0x00000010
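
For context on the "EDID invalid" lines: a base EDID block is 128 bytes, starts with the fixed header 00 FF FF FF FF FF FF 00, and its bytes sum to 0 modulo 256. The Python sketch below applies just those two checks to the all-zero dump above. It is an illustration only, not the kernel's edid_is_valid code, and the function name is made up.

```python
EDID_HEADER = bytes([0x00, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0x00])


def edid_block_problems(block):
    """Return a list of reasons why a 128-byte EDID base block looks invalid."""
    problems = []
    if len(block) != 128:
        problems.append("length is %d, expected 128" % len(block))
        return problems
    if block[:8] != EDID_HEADER:
        problems.append("bad header")
    if sum(block) % 256 != 0:
        problems.append("bad checksum")
    return problems


# The dump in the syslog above: 8 rows of 16 zero bytes.
all_zero = bytes(128)
print(edid_block_problems(all_zero))
```

An all-zero block passes the checksum test trivially but fails the header test, which fits a read that returned no data at all (for example, nothing answering on the VGA connector) rather than a monitor reporting a wrong resolution.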

~/.xsession-errors :

X Error: XSyncBadAlarm 152
  Extension:    143 (Uknown extension)
  Minor opcode: 11 (Unknown request)
  Resource id:  0x0
kdeinit4: preparing to launch /usr/lib/libkdeinit4_kcminit_startup.so
X Error: XSyncBadAlarm 152
  Extension:    143 (Uknown extension)
  Minor opcode: 11 (Unknown request)
  Resource id:  0x0

So further instructions please, it's definitely worse. If this gets merged as it is, it doesn't look like I'll be using Linux on this machine.
I've attached the logs if they are any use.
Comment 15 Tony White 2009-09-17 10:34:54 UTC
Created attachment 29642 [details]
intel-drm kernel dmesg
Comment 16 Tony White 2009-09-17 10:36:08 UTC
Created attachment 29643 [details]
intel-drm kernel messages
Comment 17 Tony White 2009-09-17 10:37:28 UTC
Created attachment 29644 [details]
intel-drm syslog
Comment 18 Tony White 2009-09-17 10:38:03 UTC
Created attachment 29645 [details]
xsession-errors
Comment 19 Tony White 2009-09-17 10:38:30 UTC
Created attachment 29646 [details]
intel-drm Xorg.0.log
Comment 20 Tony White 2009-09-18 07:04:33 UTC
I also see this as the first message when booting the intel drm kernel :

[drm : edid_is_valid] *ERROR* Raw EDID
[drm : i915_handle_error] *ERROR* EIR stuck : 0x00000010, Masking
render error detected, EIR 0x00000010
Comment 21 Tony White 2009-09-18 13:54:26 UTC
I'm just posting again to say how angry I am about this bug and that I bought intel hardware because I require reliability. I've used Linux for ten years and this is the worst problem I have ever seen. It is getting worse and worse and worse with every update you guys push.
I've just received an xserver pre-release copy along with what I guess is the intel driver pre-release, and it is even worse than when I first posted this report. The xserver used to crash and freeze randomly once or twice a week but now it actually does it every two hours!

I need to know if you guys know what the problem is and whether you guys are committed to solving it because I cannot work like this.
Three months of this random freezing here and others also reporting it happening with this driver and still no fix.
What is going on???
Comment 22 Carl Worth 2009-09-25 14:19:30 UTC
(In reply to comment #21)
> I'm just posting again to say how angry I am about this bug and that I bought
> intel hardware because I require reliability.

Hi Tony,

I know it's frustrating to encounter bugs like this, and I'm sorry you haven't
seen any improvements yet.

> I need to know if you guys know what the problem is and whether you guys are
> committed to solving it because I cannot work like this.
> Three months of this random freezing here and others also reporting it
> happening with this driver and still no fix.
> What is going on???

We've definitely seen that lots of people with 855 and 865 hardware were having
lots of pain. And yes, we've been working hard to fix these issues.

We very recently made a couple of important breakthroughs that fix things for
many users. The first is a commit by Eric Anholt to the kernel:

commit e517a5e97080bbe52857bd0d7df9b66602d53c4d
Author: Eric Anholt <eric@anholt.net>
Date:   Thu Sep 10 17:48:48 2009 -0700

    agp/intel: Fix the pre-9xx chipset flush.
    
    Ever since we enabled GEM, the pre-9xx chipsets (particularly 865) have had
    serious stability issues.  Back in May a wbinvd was added to the DRM to
    work around much of the problem.  Some failure remained -- easily visible
    by dragging a window around on an X -retro desktop, or by looking at bugzill
    
    The chipset flush was on the right track -- hitting the right amount of
    memory, and it appears to be the only way to flush on these chipsets, but th
    flush page was mapped uncached.  As a result, the writes trying to clear the
    writeback cache ended up bypassing the cache, and not flushing anything!  Th
    wbinvd would flush out other writeback data and often cause the data we want
    to get flushed, but not always.  By removing the setting of the page to UC
    and instead just clflushing the data we write to try to flush it, we get the
    desired behavior with no wbinvd.
    
    This exports clflush_cache_range(), which was laying around and happened to
    basically match the code I was otherwise going to copy from the DRM.
    
    Signed-off-by: Eric Anholt <eric@anholt.net>
    Signed-off-by: Brice Goglin <Brice.Goglin@ens-lyon.org>
    Cc: stable@kernel.org

If you can verify that your kernel includes that, (or update if it doesn't),
and report back whether that helps, that would be very useful.
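
One way to check this, assuming you have the git tree your kernel was built from, is to ask git whether the commit is an ancestor of the checked-out revision. A minimal Python sketch follows; the function name and the injectable run parameter are illustrative choices, and it relies on git merge-base --is-ancestor exiting with status 0 for "yes" and 1 for "no".

```python
import subprocess


def commit_is_ancestor(sha, ref="HEAD", repo=".", run=subprocess.run):
    """Return True if commit `sha` is contained in the history of `ref`.

    `run` is injectable so the function can be exercised without a real
    repository; by default it shells out to git.
    """
    result = run(["git", "-C", repo, "merge-base", "--is-ancestor", sha, ref])
    if result.returncode == 0:
        return True
    if result.returncode == 1:
        return False
    # Any other status means git itself failed (unknown commit, not a repo...).
    raise RuntimeError("git merge-base failed with status %d" % result.returncode)
```

For example, commit_is_ancestor("e517a5e97080bbe52857bd0d7df9b66602d53c4d", repo=tree), where tree is the path to your kernel checkout. This says nothing about a distribution kernel built from a tarball plus patches; for that you would need the package changelog instead.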

We also recently fixed some issues with the xf86-video-intel driver in the
2.8.99.902 release, (which is the 2nd release candidate for 2.9.0). That
release includes this fix:

commit 2cc1f3cb6034dddd65b3781b0cde7dff4ac1e803
Author: Keith Packard <keithp@keithp.com>
Date:   Sat Sep 19 17:30:57 2009 -0700

    i8xx: Format projective texture coordinates correctly.
    
    Projective texture coordinates must be delivered as TEXCOORDFMT_3D
    using TEXCOORDTYPE_HOMOGENOUS. This meant selecting the correct type
    in i830_texture_setup, the correct format in i830_emit_composite_state
    and sending only 3 coordinates in i830_emit_composite_primitive.
    
    Signed-off-by: Keith Packard <keithp@keithp.com>
    [ickle: tweaked to fix up a couple of use-before-initialised]
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>

Please let us know if these commits don't fix your issues, (and please
remove the NEEDINFO keyword when you reply).

Thanks,

-Carl
Comment 23 Tony White 2009-09-27 15:57:43 UTC
Thanks Carl,
I have realised what's given me a bee in my bonnet. I reported "[855GM KMS] allocate memory fail" as the problem, and with the latest stuff that failure is fixed, but the problem I thought it was causing is worse with the new stuff, if that makes any sense. So it was just confusion on my part: the problem is not the memory allocation failure, which is now fixed, looking at the logs.

As far as where I'm at with this :
I'm using the VESA driver with xorg for the time being, until I can verify the problem is gone; so at least my worst fear (no Linux for me) is unfounded. The VESA driver actually works very well and I've not had one single crash or freeze whilst using it.
However I do want to use this intel driver instead for dual display.
So I need to debug the xserver over ssh to catch the freeze and get the crash data to the developers. I'm just waiting for a part for a second machine I have here and then I can debug the crash over ssh.

If I succeed, I'll create a new report.
Comment 24 Wang Zhenyu 2009-10-19 23:40:56 UTC
Please open a new bug report for your new problem.
