Bug 17993

Summary: [GEM] kernel with PAE panics when starting X
Product: xorg
Reporter: Pierre Willenbrock <pierre>
Component: Driver/intel
Assignee: Eric Anholt <eric>
Status: RESOLVED FIXED
QA Contact: Xorg Project Team <xorg-team>
Severity: critical
Priority: medium
CC: d13f00l, konstantin.sobolev, mh, peter.ganzhorn, remi, sven.koehler, wenrui
Version: git
Hardware: x86 (IA32)
OS: Linux (All)
Whiteboard:
Attachments (description | flags):
Dmesg snippet with added backtrace | none
kernel .config | none
Xorg.log with nopat | none
Guard against highmem pages before putting them into the agp subsystem | none
Guard against highmem pages fetched from shmem file | none
Make the gem shmem file only allocate in GFP_DMA32 | none
Modify agp subsystem to handle dma_addr_t physical addresses | none
Modify agp subsystem to handle dma_addr_t physical addresses | none
Use offset of agp area for gtt variant of i915_gem_gtt_pwrite | none
Modify agp subsystem to handle dma_addr_t physical addresses | none
Use set_memory_{uc,wb} instead of set_memory_array_{uc,wb} for CONFIG_HIGHMEM64G | none

Description Pierre Willenbrock 2008-10-09 08:56:29 UTC
Created attachment 19535 [details]
Dmesg snippet with added backtrace

Starting X on a kernel with CONFIG_HIGHMEM64G leads to panics, oopses and bugs with backtraces pointing to different unrelated places. This does not happen with either CONFIG_NOHIGHMEM or CONFIG_HIGHMEM4G. 

I put a breakpoint on drmIoctl and found that drmIoctl(fd=16, request=25689, arg=0x0) leads to the above mentioned behaviour. Full backtrace:
#0  drmIoctl (fd=16, request=25689, arg=0x0) at xf86drm.c:183
#1  0xb7abf4da in drmCommandNone (fd=16, drmCommandIndex=25) at xf86drm.c:2285
#2  0xb79ecf70 in I830EnterVT (scrnIndex=0, flags=0) at i830_driver.c:3644
#3  0xb79f1e68 in I830ScreenInit (scrnIndex=0, pScreen=0x8219638, argc=1,
    argv=0xbfd17d04) at i830_driver.c:3412
#4  0x08070cfc in AddScreen (pfnInit=0xb79f0970 <I830ScreenInit>, argc=1,
    argv=0xbfd17d04) at main.c:690
#5  0x080ae954 in InitOutput (pScreenInfo=0x81f2360, argc=1, argv=0xbfd17d04)
    at xf86Init.c:1089
#6  0x08071409 in main (argc=1, argv=0xbfd17d04, envp=0xbfd17d0c) at main.c:310
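
As an aside, the request number in the backtrace can be decoded by hand. The following is a minimal sketch, assuming the generic Linux ioctl bit layout used on x86 (8 bits command number, 8 bits type, 14 bits size, 2 bits direction) and DRM_COMMAND_BASE = 0x40 as in drm.h. The helper names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed field layout of an ioctl request number on x86:
 * bits 0-7 command number, bits 8-15 type character,
 * bits 16-29 argument size, bits 30-31 direction. */
static unsigned ioc_nr(uint32_t req)   { return req & 0xffu; }
static unsigned ioc_type(uint32_t req) { return (req >> 8) & 0xffu; }
static unsigned ioc_size(uint32_t req) { return (req >> 16) & 0x3fffu; }
static unsigned ioc_dir(uint32_t req)  { return (req >> 30) & 0x3u; }

/* Driver-private DRM ioctls are numbered from this base (value as in drm.h). */
#define DRM_COMMAND_BASE 0x40u

/* Hypothetical helper: turn a driver-private request into the index
 * that drmCommandNone() was called with. */
static unsigned drm_driver_command_index(uint32_t req)
{
    assert(ioc_nr(req) >= DRM_COMMAND_BASE);
    return ioc_nr(req) - DRM_COMMAND_BASE;
}
```

For request=25689 (0x6459) this gives type 'd' (the DRM ioctl base), command number 0x59 and driver command index 25, matching drmCommandIndex=25 in frame #1. Size and direction are both zero, consistent with drmCommandNone passing arg=0x0. Index 25 (0x19) should correspond to DRM_I915_GEM_ENTERVT in i915_drm.h, but check that against your own tree.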

I am using:
drm-intel-next branch of Eric's linux-2.6:
b34c87315b1a2822111fc8ef744ef504f9be2f85

master of xf86-video-intel (with a patch increasing the initial allocation of offscreen memory; I have not figured out why I need it yet):
74571363539426abeb0a1af11f3bb545d91ed6c2

master of xserver (plus setxkbmap workaround):
1feb69eb63e6739ff5db255ad529e84adf941a10

master of drm:
728d8e226f1bc12f50f710cc96bbb2a25f72ada3

This is using EXA acceleration and DRI, not DRI2.

Attached is part of the output of dmesg, with the backtrace added at the correct place. This is actually the result of three runs: in the first run I got the long dmesg snippet, in the second I found exactly which drmIoctl call leads to the panics, and the third provided the actual backtrace.
Comment 1 Pierre Willenbrock 2008-10-13 10:59:19 UTC
Created attachment 19626 [details]
kernel .config

Using nopat on the kernel command line causes the X server to "only" hang with a movable cursor instead of panicking the kernel. For non-CONFIG_HIGHMEM64G kernels, the X server works with and without nopat. This is a Lenovo T61 with 965GM integrated graphics. Adding kernel .config and Xorg.log from the nopat case.
Comment 2 Pierre Willenbrock 2008-10-13 11:00:15 UTC
Created attachment 19627 [details]
Xorg.log with nopat
Comment 3 Pierre Willenbrock 2008-10-17 16:36:34 UTC
The actual problem is this: the i915 DRM driver feeds pages allocated using kcalloc into the AGP subsystem. On systems with more than 3G of RAM, these pages may have physical addresses beyond the 4GB boundary, which makes them unreachable for the (current?) AGP implementation. On their way into the AGP subsystem the upper address bits are chopped off, and if the GPU writes anything to that space, it is probably overwriting kernel memory. If I find out how to allocate memory in the low 4GB, a patch will follow.
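
The truncation itself can be modelled in userspace. This is only a sketch: the type and function names are hypothetical, uint32_t stands in for the 32-bit unsigned long of an IA32 kernel, and uint64_t for a PAE physical address.

```c
#include <stdint.h>

typedef uint64_t phys_addr_model_t;  /* models a PAE physical address */
typedef uint32_t ulong32_model_t;    /* models unsigned long on IA32 */

/* Storing a physical address in a 32-bit unsigned long silently
 * drops every bit above bit 31. */
static ulong32_model_t store_in_ulong(phys_addr_model_t phys)
{
    return (ulong32_model_t)phys;
}

/* Returns 1 if the address comes back unchanged, 0 if bits were lost. */
static int survives_truncation(phys_addr_model_t phys)
{
    return (phys_addr_model_t)store_in_ulong(phys) == phys;
}
```

With these definitions a page at physical address 0x123456000 is stored as 0x23456000, which is a different and perfectly valid page below 4GB. Any address below 4GB passes through unchanged, which would explain why CONFIG_NOHIGHMEM and CONFIG_HIGHMEM4G kernels are not affected.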
Comment 4 Pierre Willenbrock 2008-10-18 08:01:25 UTC
Sorry, the above was incorrect. The bad physical addresses come from the mapping of an inode at i915_gem.c:1087.
Comment 5 Pierre Willenbrock 2008-10-18 11:29:35 UTC
Created attachment 19734 [details] [review]
Guard against highmem pages before putting them into the agp subsystem

This attachment adds a simple check against highmem pages, making memory corruption by this route impossible. This may be a problem on 64-bit kernels, if the "unsigned long" used as the address data type in the AGP subsystem is 64 bits wide. On the other hand, I suspect that the AGP subsystem doesn't handle addresses above 4G well.
Comment 6 Pierre Willenbrock 2008-10-18 11:31:49 UTC
Created attachment 19735 [details] [review]
Guard against highmem pages fetched from shmem file

The attached patch is purely diagnostic, to bail out early when we get highmem pages.
Comment 7 Pierre Willenbrock 2008-10-18 11:33:54 UTC
Created attachment 19736 [details] [review]
Make the gem shmem file only allocate in GFP_DMA32

The attached patch finally fixes the problem in this bug by making the shmem subsystem return only pages from GFP_DMA32 instead of GFP_HIGHMEM.
Comment 8 Eric Anholt 2008-10-19 17:16:42 UTC
AGP really just needs to get fixed for >32-bit addresses. It shouldn't be too hard of a job.
Comment 9 Pierre Willenbrock 2008-10-21 07:34:50 UTC
Created attachment 19786 [details] [review]
Modify agp subsystem to handle dma_addr_t physical addresses

This one lets my session start up until the OpenGL compositing manager starts. I suspect DRM_IOCTL_AGP_ALLOC needs to be extended to pass 64-bit physical addresses to userspace. I can imagine that the __u32 is too small for 64-bit kernels, too (given a sufficient amount of memory). Another place I left off was passing physical addresses to intelfb, which then passes them to mtrr, which again uses unsigned longs for physical addresses.
Comment 10 Pierre Willenbrock 2008-10-22 12:19:10 UTC
Created attachment 19821 [details] [review]
Modify agp subsystem to handle dma_addr_t physical addresses

Removed the accidental DRM API change. Physical pages are not used in userspace for i9xx after all.
Comment 11 Pierre Willenbrock 2008-10-22 13:23:10 UTC
Created attachment 19824 [details] [review]
Use offset of agp area for gtt variant of i915_gem_gtt_pwrite

This one corrects 264c96fe844237c3a5af92a7ee1f2bea4836ad4d.

Using this patch and the AGP changes in attachment 19821, I can get my kde4 session to start up completely. I tested on top of the old for-review branch, forcing the slow path in i915_gem_gtt_pwrite, so I ran into a few issues ;-).
Comment 12 Pierre Willenbrock 2008-10-23 10:04:15 UTC
Created attachment 19827 [details] [review]
Modify agp subsystem to handle dma_addr_t physical addresses

This one should be the last revision of this patch. I removed the remaining API change in agpgart.h and included changes to the other AGP backends, changing the prototype of the mask_memory functions. This one is tested and works on top of 57742578dc476ef5d1a06b08f61da0aae32185f4.
Comment 13 Eric Anholt 2008-11-03 23:12:45 UTC
Applied 19827 to for-airlied and drm-intel-next.  Was reviewed by Arjan as well.
Comment 14 Pierre Willenbrock 2008-12-11 16:10:12 UTC
The patch does not work correctly, quoting Dave Airlie:

> So we have calls to set_memory_array_uc that used to take unsigned
> long *, they now take dma_addr_t *... this would be an issue.

This will break all users of agp_generic_alloc_pages and agp_generic_destroy_pages on CONFIG_HIGHMEM64G systems.

So, for me, the obvious solution would be to iterate over the array, calling set_memory_uc for each entry. Since the code obviously does not check for errors from set_memory_array_uc, this should work the same. The same goes for set_memory_array_wb. Patch will follow.
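
The per-entry loop could look roughly like the userspace sketch below. It is not the attached patch: set_memory_uc_stub only mimics the shape of the kernel's set_memory_uc(addr, numpages), dma_addr_model_t is a stand-in for a 64-bit dma_addr_t, and the sketch assumes each entry holds a kernel virtual address that fits in an unsigned long at the point of the call.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t dma_addr_model_t;  /* models dma_addr_t on CONFIG_HIGHMEM64G */

static int uc_calls;  /* counts how often the stub was invoked */

/* Userspace stub standing in for the kernel's set_memory_uc(). */
static int set_memory_uc_stub(unsigned long addr, int numpages)
{
    (void)addr;
    (void)numpages;
    uc_calls++;
    return 0;
}

/* Hypothetical replacement for the array call: one set_memory_uc per
 * entry. The array helper expects unsigned long *, so an array of wider
 * entries cannot be passed to it directly. */
static int set_pages_uc_one_by_one(const dma_addr_model_t *pages, size_t count)
{
    int ret = 0;

    assert(pages != NULL || count == 0);
    for (size_t i = 0; i < count; i++) {
        int err = set_memory_uc_stub((unsigned long)pages[i], 1);
        if (err && !ret)
            ret = err;  /* remember the first failure, keep going */
    }
    return ret;
}
```

The write-back direction would follow the same pattern with set_memory_wb. Returning the first error is a choice made for this sketch only, since the existing callers ignore the return value anyway.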
Comment 15 Pierre Willenbrock 2008-12-11 16:40:11 UTC
Created attachment 21075 [details] [review]
Use set_memory_{uc,wb} instead of set_memory_array_{uc,wb} for CONFIG_HIGHMEM64G

This one also fixes two unrelated warnings. I should have looked for warnings in an earlier iteration.
Comment 16 Michael Fu 2008-12-12 23:57:05 UTC
*** Bug 19003 has been marked as a duplicate of this bug. ***
Comment 17 Peter Ganzhorn 2008-12-24 17:19:59 UTC
Is there a patch that is safe to use yet?
Kernel 2.6.28 is released and I can't get my X4500HD (G45) to work with it - X just does not want to start and leaves me with the following:

(EE) intel(0): Failed to pin front buffer: Cannot allocate memory

Fatal server error:
Couldn't bind memory for BO front buffer

I had a look at my kernel config but can't find CONFIG_HIGHMEM* anywhere. As far as I recall, this config option isn't available on x86-64 kernels, right?

If you need someone to test some patches for the kernel - I volunteer, you just have to tell me what patches I have to apply and in what order.
The testing system I have is the following:
Asus P5Q-EM Board, G45 chipset
8 GB of RAM (it's obvious I'd like to have 64GB highmem support)

Ubuntu 8.10 for 64-Bit CPUs running a self-compiled kernel (2.6.27.10 for now)
xf86-video-intel 2.5.1
libdrm 2.4.1

I tried to run kernel 2.6.28 with xf86-video-intel 2.5.99.1 and libdrm 2.4.3 ending up with the mentioned allocation error.
So if you have a patch I can test with 2.6.28, please tell me how to use it and I'll gladly tell you if it works for me ;)
Comment 18 Gordon Jin 2008-12-24 18:29:45 UTC
(In reply to comment #17)
> Is there a patch that is safe to use yet?
> Kernel 2.6.28 is released and I can't get my X4500HD (G45) to work with it - X
> just does not want to start and leaves me with the following:
> 
> (EE) intel(0): Failed to pin front buffer: Cannot allocate memory

This seems to be a separate issue, i.e. bug#19179. Are you using a big "Virtual" value in xorg.conf?
Comment 19 Peter Ganzhorn 2008-12-25 04:40:59 UTC
I am not using the "virtual" value at all - my screen size is 1920x1200 if that matters.

The problem only occurs with 2.6.28. Before 2.6.28-rc4, X simply freezes on startup; any kernel >= 2.6.28-rc4 gives me the mentioned allocation error.

Since there was a patch "drm/i915: GEM on PAE has problems - disable it for now." in -rc4, I think my problem is somehow related to it (I didn't get the freezes after it, but X still did not start).
Are you sure my problem is unrelated to this bug?
Comment 20 Gordon Jin 2008-12-25 16:59:48 UTC
(In reply to comment #19)
> Are you sure my problem is unrelated to this bug?
Unsure. I take back my comment#18, as you're not using "Virtual".
I'll let Eric comment.
Eric, is this bug a dup of bug#18082?
Comment 21 Peter Ganzhorn 2008-12-28 12:24:31 UTC
Here's a bit more information, gathered with 2.6.28 (Vanilla), libdrm 2.4.1 and xf86-video-intel 2.5.1:

cat /var/log/Xorg.0.log | grep -e '(WW)' -e '(EE)'
(WW) intel(0): libpciaccess reported 0 rom size, guessing 64kB
(WW) intel(0): Allocation error, framebuffer compression disabled
(EE) intel(0): Failed to pin front buffer: Cannot allocate memory

And here are some (I guess serious) errors in dmesg of 2.6.28, produced by the attempt to start X:

mtrr: type mismatch for d0000000,10000000 old: write-back new: write-combining
resource map sanity check conflict: 0xd0000000 0xdfffffff 0xd0000000 0xd7feffff vesafb
------------[ cut here ]------------
WARNING: at arch/x86/mm/ioremap.c:226 __ioremap_caller+0x339/0x380()
Modules linked in: w83627ehf hwmon_vid
Pid: 3057, comm: Xorg Not tainted 2.6.28-pgzh #1
Call Trace:
 [<ffffffff8025b294>] warn_on_slowpath+0x64/0xa0
 [<ffffffff80261228>] iomem_map_sanity_check+0x98/0xc0
 [<ffffffff80247ad9>] __ioremap_caller+0x339/0x380
 [<ffffffff80491bdf>] i915_gem_entervt_ioctl+0x2cf/0x5a0
 [<ffffffff80491bdf>] i915_gem_entervt_ioctl+0x2cf/0x5a0
 [<ffffffff80491910>] i915_gem_entervt_ioctl+0x0/0x5a0
 [<ffffffff80480bf2>] drm_ioctl+0x112/0x340
 [<ffffffff802cc335>] vfs_ioctl+0x85/0xb0
 [<ffffffff802cc3dc>] do_vfs_ioctl+0x7c/0x460
 [<ffffffff802cc809>] sys_ioctl+0x49/0x80
 [<ffffffff8022a36b>] system_call_fastpath+0x16/0x1b
---[ end trace 7b9ce6d857e6ff4d ]---
[drm:i915_gem_object_bind_to_gtt] *ERROR* GTT full, but LRU list empty
[drm:i915_gem_object_pin] *ERROR* Failure to bind: -12<4>Clocksource tsc unstable (delta = -143772081 ns)
iounmap: bad address ffffc20011780000
Pid: 3057, comm: Xorg Tainted: G        W  2.6.28-pgzh #1
Call Trace:
 [<ffffffff804911d9>] i915_gem_leavevt_ioctl+0x39/0x50
 [<ffffffff80480bf2>] drm_ioctl+0x112/0x340
 [<ffffffff802cc335>] vfs_ioctl+0x85/0xb0
 [<ffffffff802cc3dc>] do_vfs_ioctl+0x7c/0x460
 [<ffffffff802748d0>] hrtimer_wakeup+0x0/0x30
 [<ffffffff80673f9e>] do_nanosleep+0x7e/0xd0
 [<ffffffff802cc809>] sys_ioctl+0x49/0x80
 [<ffffffff80275487>] sys_nanosleep+0x77/0x80
 [<ffffffff8022a36b>] system_call_fastpath+0x16/0x1b
Xorg[3057]: segfault at 0 ip 00007f8c04da265f sp 00007fff10a12f10 error 6 in intel_drv.so[7f8c04d5a000+69000]

Without vesafb (just tried it because of the vesafb-related error):
mtrr: type mismatch for d0000000,10000000 old: write-back new: write-combining
[drm:i915_gem_object_bind_to_gtt] *ERROR* GTT full, but LRU list empty
[drm:i915_gem_object_pin] *ERROR* Failure to bind: -12<3>iounmap: bad address ffffc20011280000
Pid: 3037, comm: Xorg Not tainted 2.6.28-pgzh #3
Call Trace:
 [<ffffffff8048d4b9>] i915_gem_leavevt_ioctl+0x39/0x50
 [<ffffffff8047ced2>] drm_ioctl+0x112/0x340
 [<ffffffff802545ea>] set_next_entity+0x3a/0x80
 [<ffffffff802cc335>] vfs_ioctl+0x85/0xb0
 [<ffffffff802cc3dc>] do_vfs_ioctl+0x7c/0x460
 [<ffffffff802cc809>] sys_ioctl+0x49/0x80
 [<ffffffff8022a36b>] system_call_fastpath+0x16/0x1b
Xorg[3037]: segfault at 0 ip 00007f49f10e765f sp 00007ffffcd57f50 error 6 in intel_drv.so[7f49f109f000+69000]

Please tell me if this is some different bug and if I should file a new bug report in that case.
Comment 22 Eric Anholt 2008-12-29 14:16:43 UTC
Peter: If you ever have doubt about whether you've got the same bug, just open a new report.  Your information doesn't seem to be related to this bug at all.
Comment 23 Eric Anholt 2008-12-29 14:17:10 UTC
*** Bug 18082 has been marked as a duplicate of this bug. ***
Comment 24 Eric Anholt 2008-12-29 14:25:23 UTC
GEM should now be disabled with PAE, which at least fixes the corruption.  We need to get a version of these patches that airlied will accept.
Comment 25 Matthias Heinz 2009-01-11 05:33:17 UTC
I just updated to 2.6.29-rc1 with CONFIG_HIGHMEM64G=y and without applying any additional patches, and it didn't crash. But I took a closer look and saw that those patches are not part of -rc1.

Was this bug fixed some other way?
Comment 26 Li Peng 2009-01-12 22:29:37 UTC
Please look at another bug, http://bugs.freedesktop.org/show_bug.cgi?id=19415; I got a totally different result from this one.
Comment 27 Eric Anholt 2009-02-19 11:18:46 UTC
*** Bug 19739 has been marked as a duplicate of this bug. ***
Comment 28 Sven 2009-02-19 12:55:36 UTC
(In reply to comment #24)
> GEM should now be disabled with PAE, which at least fixes the corruption.  We
> need to get a version of these patches that airlied will accept.

You should at least print some warning or information to the logs that PAE is not supported.

Will GEM work with PAE at some future point? There is no NX bit protection without PAE.
Comment 29 Jesse Barnes 2009-05-11 11:21:31 UTC
Adjusting severity: crashes & hangs should be marked critical.
Comment 30 Pierre Willenbrock 2009-06-20 13:40:22 UTC
Fixed by commits
07613ba2f464f59949266f4337b75b91eb610795: agp: switch AGP to use page array instead of unsigned long array
95934f939c46ea2b37f3c91a4f8c82e003727761: drm/i915: enable GEM on PAE.
0b7af262aba912f52bc6ef76f1bc0960b01b8502: agp/intel: Make intel_i965_mask_memory use dma_addr_t for physical addresses