Bug 17085 - [GEM] PAT breaks GEM with "*ERROR* Failed to map ringbuffer"
Summary: [GEM] PAT breaks GEM with "*ERROR* Failed to map ringbuffer"
Status: RESOLVED FIXED
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/other (show other bugs)
Version: DRI git
Hardware: x86-64 (AMD64) Linux (All)
: medium critical
Assignee: Eric Anholt
QA Contact:
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2008-08-11 14:17 UTC by Steven Newbury
Modified: 2008-09-23 14:58 UTC (History)
1 user (show)

See Also:
i915 platform:
i915 features:


Attachments
Kernel log of failure to map ringbuffer (44.04 KB, text/plain)
2008-08-11 14:17 UTC, Steven Newbury
no flags Details
Xorg.0.log of initialisation failure (12.50 KB, text/plain)
2008-08-11 14:18 UTC, Steven Newbury
no flags Details
dmesg output with debugging (36.32 KB, text/plain)
2008-08-11 16:49 UTC, Steven Newbury
no flags Details
dmesg output with debugging enabled (37.88 KB, text/plain)
2008-08-15 14:45 UTC, Tomas Carnecky
no flags Details
use ioremap_wc() in place of ioremap() (629 bytes, patch)
2008-08-19 16:47 UTC, Steven Newbury
no flags Details | Splinter Review
Xorg log with GEM enabled after ioremap_wc() change (24.78 KB, text/plain)
2008-08-19 16:48 UTC, Steven Newbury
no flags Details

Description Steven Newbury 2008-08-11 14:17:17 UTC
Created attachment 18224 [details]
Kernel log of failure to map ringbuffer

I reported in https://bugs.freedesktop.org/show_bug.cgi?id=16474 my inability to bring up GEM DRM on my system.  Since GEM has now hit master and I'm still unable to bring it up, even using Eric Anholt's drm-gem-merge tree, I thought I should open a specific bug.

I'm attaching my full kernel log and Xorg.0.log to this bug in the hope that it helps.

I tried using a different gcc version in case of a compiler bug, but no avail.

Hardware is a Dell Latitude D830 (i965GM)
Comment 1 Steven Newbury 2008-08-11 14:18:26 UTC
Created attachment 18225 [details]
Xorg.0.log of initialisation failure
Comment 2 Eric Anholt 2008-08-11 15:01:53 UTC
Weird.  What does the dmesg look like with git://people.freedesktop.org/~anholt/linux-2.6#drm-gem-fd-17085 ?
Comment 3 Steven Newbury 2008-08-11 16:19:15 UTC
(In reply to comment #2)
> Weird.  What does the dmesg look like with
> git://people.freedesktop.org/~anholt/linux-2.6#drm-gem-fd-17085 ?
> 

I'll test that now.. give me a few minutes...
Comment 4 Steven Newbury 2008-08-11 16:49:57 UTC
Created attachment 18226 [details]
dmesg output with debugging
Comment 5 Steven Newbury 2008-08-12 14:47:37 UTC
Is there any more information or testing I can do to help resolve this?
Comment 6 Steven Newbury 2008-08-12 19:06:44 UTC
(In reply to comment #5)
> Is there any more information or testing I can do to help resolve this?
> 

Has anyone got this working on x86-64 yet?  Maybe there is a int where there should be an unsigned long somewhere?
Comment 7 Steven Newbury 2008-08-12 19:12:51 UTC
(In reply to comment #6)
> (In reply to comment #5)
> > Is there any more information or testing I can do to help resolve this?
> > 
> 
> Has anyone got this working on x86-64 yet?  Maybe there is a int where there
> should be an unsigned long somewhere?
> 

That wasn't quite what I meant to say, but my meaning should be clear enough..
Comment 8 Kristian Høgsberg 2008-08-13 08:13:44 UTC
Try the latest git version of libdrm, specifically this commit

http://cgit.freedesktop.org/mesa/drm/commit/?id=b0e68829462aad00ce68be998da6313bca754e9a

fixes a bug where libdrm overwrites the gem handles of some buffers.
Comment 9 Steven Newbury 2008-08-13 09:32:12 UTC
(In reply to comment #8)
> Try the latest git version of libdrm, specifically this commit
> 
> http://cgit.freedesktop.org/mesa/drm/commit/?id=b0e68829462aad00ce68be998da6313bca754e9a
> 
> fixes a bug where libdrm overwrites the gem handles of some buffers.
> 

No change.
Comment 10 Steven Newbury 2008-08-13 09:53:04 UTC
(In reply to comment #7)
> (In reply to comment #6)
> > (In reply to comment #5)
> > > Is there any more information or testing I can do to help resolve this?
> > > 
> > 
> > Has anyone got this working on x86-64 yet?  Maybe there is a int where there
> > should be an unsigned long somewhere?
> > 
> 
> That wasn't quite what I meant to say, but my meaning should be clear enough..
> 

What I was trying to say; the debugging output you put in is expecting the addresses to be in 64bits, which would make sense, but the actual addresses used seem to be only 32bits.
Comment 11 Tomas Carnecky 2008-08-15 14:45:39 UTC
Created attachment 18304 [details]
dmesg output with debugging enabled

Linux fu 2.6.27-rc3-00227-g54b288f #22 SMP PREEMPT Fri Aug 15 23:30:39 CEST 2008 x86_64 Intel(R) Core(TM)2 Duo CPU L7500 @ 1.60GHz GenuineIntel GNU/Linux

That is Linus' tree as of Fri Aug 15 plus drm-gem-merge and drm-gem-merge-fd-17085 pulled in.
Comment 12 Tomas Carnecky 2008-08-15 15:33:07 UTC
It looks like the mtrr mapping isn't being set up correctly. mtrr_add() returns 0 and the mapping doesn't show up in /proc/mtrr. Will try to investigate further but any help is appreciated.
Comment 13 Tomas Carnecky 2008-08-15 17:43:16 UTC
The drm_core_ioremap() at [1] eventually calls ioremap() in include/asm-x86/io_64.h which calls ioremap_nocache() in arch/x86/mm/ioremap.c which calls __ioremap_caller() in the same file which fails in this block:

        if (prot_val != new_prot_val) {
                /*
                 * Do not fallback to certain memory types with certain
                 * requested type:
                 * - request is uc-, return cannot be write-back
                 * - request is uc-, return cannot be write-combine
                 * - request is write-combine, return cannot be write-back
                 */
                if ((prot_val == _PAGE_CACHE_UC_MINUS &&
                     (new_prot_val == _PAGE_CACHE_WB ||
                      new_prot_val == _PAGE_CACHE_WC)) ||
                    (prot_val == _PAGE_CACHE_WC &&
                     new_prot_val == _PAGE_CACHE_WB)) {
                        printk(
                "ioremap error for 0x%llx-0x%llx, requested 0x%lx, got 0x%lx\n",
                                (unsigned long long)phys_addr,
                                (unsigned long long)(phys_addr + size),
                                prot_val, new_prot_val);
                        free_memtype(phys_addr, phys_addr + size);
                        return NULL;
                }
                prot_val = new_prot_val;
        }

with the following values:

ioremap error for 0xe077f000-0xe079f000, requested 0x10, got 0x8

The types are:
 - requested: _PAGE_CACHE_UC_MINUS (this one comes directly from ioremap_nocache())
 - got: _PAGE_CACHE_WC

Interesting is also the comment inside ioremap_nocache() (mentions the X drivers and ioremap_wc()).

new_prot_val comes from arch/x86/mm/pat.c:reserve_memtype() which calls chk_conflict() because of this conflict:

Xorg:3880 memory types e077f000-e079f000 uncached-minus <-> e0000000-f0000000 write-combining

So someone must be setting up e0000000-f0000000 as WC and drm can't map that as uncached. Maybe AGP?
agpgart-intel 0000:00:00.0: AGP aperture is 256M @ 0xe0000000


tom

[1] http://gitweb.freedesktop.org/?p=users/anholt/anholt/linux-2.6.git;a=blob;h=2394f4568005fb0eda07742a343437df943ec67f;hb=93e91293c9d7a3fcdd3a7874cd43295786e9a614;f=drivers/gpu/drm/i915/i915_gem.c#l2401
Comment 14 Steven Newbury 2008-08-16 05:42:25 UTC
During startup of the working TTM DRM/Mesa/Xorg (based off git master with TTM patched back in) I get these messages in dmesg:

X:4367 conflicting memory types e0000000-f0000000 write-combining<->uncached-minus
reserve_memtype failed 0xe0000000-0xf0000000, track write-combining, req write-combining
X:4367 conflicting memory types e0000000-f0000000 write-combining<->uncached-minus
reserve_memtype failed 0xe0000000-0xf0000000, track write-combining, req write-combining
X:4378 freeing invalid memtype e0000000-f0000000
X:4367 conflicting memory types e0000000-f0000000 write-combining<->uncached-minus
reserve_memtype failed 0xe0000000-0xf0000000, track write-combining, req write-combining
X:4379 freeing invalid memtype e0000000-f0000000
X:4367 conflicting memory types e0000000-f0000000 write-combining<->uncached-minus
reserve_memtype failed 0xe0000000-0xf0000000, track write-combining, req write-combining
X:4380 freeing invalid memtype e0000000-f0000000
X:4367 conflicting memory types e0000000-f0000000 write-combining<->uncached-minus
reserve_memtype failed 0xe0000000-0xf0000000, track write-combining, req write-combining
X:4381 freeing invalid memtype e0000000-f0000000

Seems it may be relevant?
Comment 15 Steven Newbury 2008-08-18 15:57:10 UTC
I was reading through the intel-agp driver and I was wondering since it uses the mappings as set up during POST, could it be on some systems the BIOS is setting the aperture mapping as WC by default?  This is the only logical conclusion I have been able to come to after following on from Tomas Carnecky's tracing of the code paths.
Comment 16 Thomas Hellström 2008-08-19 00:09:23 UTC
Isn't the root cause of this that the intel module is using
drm_core_ioremap() when it should in fact be using
ioremap_wc() with newer kernels.
That would make the conflict go away.

/Thomas
Comment 17 Steven Newbury 2008-08-19 08:32:36 UTC
(In reply to comment #16)
> Isn't the root cause of this that the intel module is using
> drm_core_ioremap() when it should in fact be using
> ioremap_wc() with newer kernels.
> That would make the conflict go away.
> 
> /Thomas
> 

Yes it would, if it is indeed acceptable to have the ringbuffer as WC.  I'll make this change and see how it goes...
Comment 18 Thomas Hellström 2008-08-19 08:43:28 UTC
(In reply to comment #17)
> (In reply to comment #16)
> > Isn't the root cause of this that the intel module is using
> > drm_core_ioremap() when it should in fact be using
> > ioremap_wc() with newer kernels.
> > That would make the conflict go away.
> > 
> > /Thomas
> > 
> 
> Yes it would, if it is indeed acceptable to have the ringbuffer as WC.  I'll
> make this change and see how it goes...
> 

Yes, the ringbuffer _should_ be WC. 
Before ioremap_wc() was introduced, drm_core_ioremap() combined with a WC MTRR meant a WC mapping. 

/Thomas

 
Comment 19 Tomas Carnecky 2008-08-19 11:51:11 UTC
After I located the conflict with the WC mapping, I changed the code to use ioremap_wc() just to see if that improves anything. It didn't. When I started up Xorg, it locked up the computer and I had to do a hard reboot.

Maybe I did something wrong, so if you give me a patch or an git url I could pull the needed changes from, I will try again (patch preferably against anholt's linux-2.6/drm-gem-merge tree).
Comment 20 Steven Newbury 2008-08-19 16:47:25 UTC
Created attachment 18396 [details] [review]
use ioremap_wc() in place of ioremap()

Okay, this works as far as initialisation.  Ringbuffer is allocated and enabled, though there a some errors during server startup.

dmesg contains:
[drm] Set up GEM GTT of 144896kb
[drm] Creating ringbuffer
[drm:i915_gem_object_pin] *ERROR* pinning object ffff88007bf53180 size 32768: 1

Also the next attached Xorg log, notice "(WW) intel(0): Existing errors found in hardware state." and "(EE) intel(0): underrun on pipe B!"

Furthermore, attempting to run a glxgears results in the machine hard locking.
Comment 21 Steven Newbury 2008-08-19 16:48:25 UTC
Created attachment 18397 [details]
Xorg log with GEM enabled after ioremap_wc() change
Comment 22 Steven Newbury 2008-08-19 16:49:39 UTC
I'm going to be away now for the next week, so maybe it'll all be working when I get back! :)
Comment 23 Brian Rogers 2008-08-23 03:48:18 UTC
I was experiencing this issue. It went away when I removed CONFIG_X86_PAT from my kernel config. I suspect that option is the only requirement to trigger this bug.

The option can be found in the kernel configuration menu under 'Processor type and features.'
Comment 24 Eric Anholt 2008-08-25 15:08:14 UTC
I'm uncomfortable changing the drm core stuff without reviewing what all the other drivers intended with it.  I'll probably just change the intel driver to do what we meant using linux calls directly now that ioremap()'s semantics changed under us.  But, looking at it we've been doing things wrong anyway.  We want cached access to the status page -- that's kind of the point of having a status page.  The newer hardware actually expects you to access it cached and that may be required for coherency.
Comment 25 Eric Anholt 2008-09-23 14:58:33 UTC
Fixed in drm-gem-merge branch of my tree.

commit f2b39cd19b63ec8b82c198004d0e45044abce103
Author: Eric Anholt <eric@anholt.net>
Date:   Tue Sep 23 14:50:57 2008 -0700

    drm: Use ioremap_wc in i915_driver instead of ioremap, since we always want WC.
    
    Fixes failure to map the ringbuffer when PAT tells us we don't get to do
    uncached on something that's already mapped WC, or something along those lines.



Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.