Bug 15807

Summary:

[865G rev 02] Xserver crashes during VT switch

Product:

xorg

Reporter:

Hugo Mildenberger <Hugo.Mildenberger>

Component:

Driver/intel

Assignee:

Eric Anholt <eric>

Status:

RESOLVED FIXED

QA Contact:

Xorg Project Team <xorg-team>

Severity:

normal

Priority:

medium

CC:

eric, remi, steeve

Version:

git

Hardware:

x86 (IA32)

OS:

Linux (All)

URL:

http://bugs.gentoo.org/show_bug.cgi?id=219834

Whiteboard:

i915 platform:

i915 features:

Attachments:

Description	Flags
output of lshw	none
output of lspci -v	none
output of lsmod	none
symbolic stack trace	none
full driver trace around the moment x dies.	none
Problem of first session, tries to write a message, but segv's	none
commented debug log, with structure dumps	none
gdb steps through a memory write which he can't read	none

Description Hugo Mildenberger 2008-05-02 13:11:39 UTC

I already described the problem on http://bugs.gentoo.org/show_bug.cgi?id=219834.

In a nutshell: switching back from another session crashes the server always since "xf86-video-intel-2.2.99.902", whereas revision bearing the tag "xf86-video-intel-2.2.99.901" is the last working.

Doing a git bisection between "origin/xf86-video-intel-2.2-branch"
and "origin/xf86-video-intel-2.3-branch" turned out commit

'[0af692e9ee5857e41ffdbaf760752a37737b21b7] Revert "Use mprotect on unbound AGP memory to attempt to catch use while unbound.'

as the prime suspect. This commit actually reverted commit 'a1612b7728d4153499fe86b6713a13c8702cc7d9' for fear of the mprotect calls being unstable under Linux.

Further investigation using the current master branch showed a ring buffer stall in I830WaitLpRing(). A detailed log showed that the GPU did go out for lunch (at least I interpret it that way).

The most attractive idea (which obviously was reverting the reversion) gave mixed results: VT switch then does not work any more consistently (kills the active session without opening an new one, only root may create a working session afterwards, but no X crash any more.)

Setting Option "Tiling" "false" and Option "AccelMethod" "EXA" does not change anything. The segmentation fault appears to be only a consequence of a somewhat airy implementation of CloseDownDevices(), calling CloseDevice() directly whereas it shouldn't do that, thus preventing X from printing a diagnostic message naming the problem and exiting gracefully. After what I have seen here searching for comparable bugs, this unfriendly behaviour persists since long time.

For convenience, I only used the git master branch for xf86-video-intel, because it's tedious to set-up the git sources in such a manner that they do not taint the file structure given by Gentoo.

If anybody should need more info, just drop me a line -- but do it soon, because I will give away the machine the next days.

Comment 1 Hugo Mildenberger 2008-05-02 13:12:52 UTC

Created attachment 16311 [details]
output of lshw

Comment 2 Hugo Mildenberger 2008-05-02 13:13:35 UTC

Created attachment 16312 [details]
output of lspci -v

Comment 3 Hugo Mildenberger 2008-05-02 13:16:59 UTC

Created attachment 16313 [details]
output of lsmod

Kernel version is
Linux 2.6.25-gentoo-r1 #27 PREEMPT Thu May 1 04:16:45 CEST 2008 i686 Intel(R) Pentium(R) 4 CPU 3.00GHz GenuineIntel GNU/Linux

Comment 4 Hugo Mildenberger 2008-05-02 13:45:52 UTC

Created attachment 16314 [details]
symbolic stack trace

Comment 5 Hugo Mildenberger 2008-05-02 13:47:32 UTC

Created attachment 16315 [details]
full driver trace around the moment x dies. 

Complete trace (1.5 Megabyte) will be sent on request

Comment 6 Gordon Jin 2008-05-03 03:04:07 UTC

Reassigning to Eric as he's the committer.

Comment 7 Hugo Mildenberger 2008-05-04 07:52:20 UTC

I narrowed it further down. Remember, the problem was:

1.) login (via kde); 2.) login to another user; 3.) switch back to first login

If I do that with current git master versions for xf86-video-intel, xserver, dri, mesa, updated drm + updated kernel part (i915 1.13.0 20080312) and updated keyboard, mouse + libpciaccess, switch back seems to succeed (no segv), but now the first session dies, as usual due to ring buffer stall. 

If I revert the reversion of the mprotect patch, I get a different behavior: VT number two now dies due to a segfault, located at i830_accel.c:221. 

This routine is short enough to print it here:

210 void
211 I830EmitFlush(ScrnInfoPtr pScrn)
212 {
213    I830Ptr pI830 = I830PTR(pScrn);
214    int flags = MI_WRITE_DIRTY_STATE | MI_INVALIDATE_MAP_CACHE;
215
216    if (IS_I965G(pI830))
217       flags = 0;
218
219    {
220        BEGIN_BATCH(2);
221        OUT_BATCH(MI_FLUSH | flags);
222        OUT_BATCH(MI_NOOP);              /* pad to quadword */
223        ADVANCE_BATCH();
224    }
225 }

Because I wanted to get it more precise, I preprocessed i830_accel.c to i830_accel.i (by passing -save-temps to CFLAGS of original make), piped the result through indent and sed (removing #line statetments), and remade the driver with the processed file instead of i830_accel.c -- in order to get these ugly macros expanded and line by line traceable, and have compiler options remaining the same. 

The exact statement where the process dies (the line numbers are those of preprocessed file) is the following:

29235 *(volatile uint32_t *) (pI830->LpRing->virtual_start +
29236                         pI830->ring_next) = (0x04 << 23) | flags;

(gdb) print *pI830->LpRing
$33 = {tail_mask = 131071, mem = 0x8239a38, virtual_start = 0xafa1e000 <Address 0xafa1e000 out of bounds>, head = 119616,
  tail = 119616, space = 131064}

(gdb) print pI830->LpRing->virtual_start + pI830->ring_next
$3 = (unsigned char *) 0xaf9429e8 <Address 0xaf9429e8 out of bounds>
(gdb) print pI830->ring_next
$4 = 88552

I think the problem lies within pI830->virtual_start beeing invalid. If this really is the sole problem, than pI830->head == pI830->tail indicates a currently empty ring buffer.

Because I wanted to see a normal situation, I set a break point to I830Emitflush shortly after creation of the VT2 serving process. But that routine got called often, and gdb constantly indicated pI830->virtual_start beeing invalid. I have not set a breakpoint to I830_accel.c:221, possibly this statement will not be effective until switching back to original session?

Will test it later on. 

By the way: As a late consequence of beeing unable to reinitialize DRM, the first process also dies with a segv, keeping him from exiting gracefully. Excerpt from X server log (this was produced using the unmodified master branch of xf86-video-intel)

Backtrace:
0: /usr/bin/X(xf86SigHandler+0x79) [0x80c5b79]
1: [0xb7f13400]
2: /usr/lib/xorg/modules/drivers//intel_drv.so [0xb79985b6]
3: /usr/bin/X [0x8164183]
4: /usr/bin/X [0x813ccb2]
5: /usr/bin/X(BlockHandler+0x58) [0x808e598]
6: /usr/bin/X(WaitForSomething+0x12a) [0x812b2fa]
7: /usr/bin/X(Dispatch+0x7d) [0x808a9bd]
8: /usr/bin/X(main+0x46a) [0x80716ea]
9: /lib/libc.so.6(__libc_start_main+0xe0) [0xb7ac9400]
10: /usr/bin/X(FontFileCompleteXLFD+0x20d) [0x8070b31]

Fatal server error:
Caught signal 11.  Server aborting

Which translates to:
#0  DRILock (pScreen=0x8238418, flags=0) at dri.c:2181
#1  0xb7a574fc in I830LeaveVT (scrnIndex=0, flags=0) at i830_driver.c:3162
#2  0x080c9e78 in xf86XVLeaveVT (index=0, flags=0) at xf86xv.c:1268
#3  0xb7b0a809 in glxDRILeaveVT (index=0, flags=0) at glxdri.c:889
#4  0x080a7328 in AbortDDX () at xf86Init.c:1300
#5  0x08137df8 in AbortServer () at log.c:406
#6  0x08138366 in FatalError (f=0xb7a7fc9c "lockup\n") at log.c:552
#7  0xb7a4a8cb in I830WaitLpRing (pScrn=0x820c608, n=131064, timeout_millis=0) at i830_accel.c:150
#8  0xb7a4accc in I830Sync (pScrn=0x820c608) at i830_accel.c:204
#9  0xb7a572c0 in i830_stop_ring (pScrn=0x820c608, flush=<value optimized out>) at i830_driver.c:1814
#10 0xb7a57e7b in I830EnterVT (scrnIndex=0, flags=0) at i830_driver.c:3282
#11 0x080cc412 in CMapEnterVT (index=0, flags=0) at xf86cmap.c:454
#12 0x080c9f05 in xf86XVEnterVT (index=0, flags=0) at xf86xv.c:1227
#13 0xb7b0a874 in glxDRIEnterVT (index=0, flags=0) at glxdri.c:871
#14 0x080c6856 in xf86Wakeup (blockData=0x0, err=-1, pReadmask=0x81f3360) at xf86Events.c:963
#15 0x0808e6c9 in WakeupHandler (result=-1, pReadmask=0x81f3360) at dixutils.c:418
#16 0x0812b396 in WaitForSomething (pClientsReady=0x85831e8) at WaitFor.c:239
#17 0x0808a9bd in Dispatch () at dispatch.c:372
#18 0x080716ea in main (argc=8, argv=0xbfdd2a34, envp=0x831b938) at main.c:433


The reason pDRIPriv beeing non zero, but invalid (0x104 or something), which 
seems to be far consequence of a previous server state, when message "DRM disabled" was printed. Exact entry was:
(EE) [drm] Could not set DRM device bus ID.
(EE) intel(0): [dri] DRIScreenInit failed. Disabling DRI.

The point of that failure is line 2181, which I just have not available the moment of writing this comment.
void
2175    DRILock(ScreenPtr pScreen, int flags)
2176    {
2177        DRIScreenPrivPtr pDRIPriv = DRI_SCREEN_PRIV(pScreen);
2178
2179        if(!pDRIPriv || !pDRIPriv->pLockRefCount) return;

Have to quit this now, will attach some debug logs later.

Comment 8 Hugo Mildenberger 2008-05-04 07:53:55 UTC

Created attachment 16346 [details]
Problem of first session, tries to write a message, but segv's

Comment 9 Hugo Mildenberger 2008-05-04 13:14:17 UTC

Created attachment 16348 [details]
commented debug log, with structure dumps 

This gdb log was obtained directly after switching from second login back to first, using the current git master version after having locally reverted that commit which removed the mprotect calls. 

Notice that at least two pointer are marked as "out of bounds": pI830->MMIOBase and pI830->LpRing->virtual_start. 

A delighting marginal bug is, that there are two booleans of opposite semantics: pI830->directRenderingEnabled (==0) and pI830->direcRenderingDisabled, which is also = 0. 

Effectively, only about five effective statements will be executed in this routine before the crash:

29204           pI830->ring_emitting = (2) * 4;
29207           if (pI830->LpRing->space < pI830->ring_emitting)
29208             I830WaitLpRing (pScrn, pI830->ring_emitting, 0);
29209           pI830->ring_next = pI830->LpRing->tail;
29212           pI830->ring_used = 0;

All other branches are unreachable.

Comment 10 Hugo Mildenberger 2008-05-04 14:01:53 UTC

Created attachment 16350 [details]
gdb steps through a memory write which he can't read

For the sake of completness, I just stopped the current session (which I now continue to use for writing this message) in I830EmitFlush(), and stepped through it. gdb continues to mark pI830->LpRing->virtual_start as invalid. However, that very same gdb steps through the following without any problem:


29235 *(volatile uint32_t *) (pI830->LpRing->virtual_start +
29236                         pI830->ring_next) = (0x04 << 23) | flags;


Program is now halted at line no 2937, just after the statement above. 
If I now say:

(gdb) print *(uint32_t*)(pI830->LpRing->virtual_start+pI830->ring_next)


gdb says:

Cannot access memory at address 0xaf948980


So, is this a bug in gdb or is here somthing really wired happening? I never saw something like this before. And while I continue to write this, the statement above had probably been executed ten million times. 
Or was this chunk of memory just protected write-only, i.e. without read access?

Log is attached.

Comment 11 Hugo Mildenberger 2008-05-05 03:47:50 UTC

All the hassle seems to be a late consequence of session no. 2 not beeing able to (re-)initialize DRI. This was somewhat hard to see immediately, also because X logging facility is a mess when trying to track multiple sessions. In addition to various  Xorg logs we have kdm.log, logging something similar, but not identical. And in addition to Xorg.0.log Xorg.1.log, ... we have also have Xorg.1.log.old Xorg.1.log.old ..). If a reborn Herakles please could cleanup also that pigsty? Even Microsoft has a consistent system logging facility.

I finally found a solution. User switching now works fine, at least with the current master branch, and with the mprotect patch reactivated (I did not try the official, unreverted version yet). 

It seems as if X really has a very big problem to act consistently upon presence, absence or usabilty of one of its features.

When looking at Xorg.1.log (which is the log output of the newly started session,  created when trying to switch to another user), there is an entry which looked tolerable, but really wasn‘t: 

(II) [drm] DRM interface version 1.0
(EE) [drm] Could not set DRM device bus ID.
(EE) intel(0): [dri] DRIScreenInit failed. Disabling DRI.

Extending the debug output resulted in:

(II) [drm] DRM interface version 1.0
(EE) [drm] Could not set DRM device bus ID:-13:'Permission denied', uid:gid = 0:0, euid:egid = 0:0
(EE) intel(0): [dri] DRIScreenInit failed. Disabling DRI. (I830DRIScreenInit: i830_dri.c: 633)


I wrote previously, that a marginal bug seems to be pI830->directRenderingEnabled == pI830->directRenderingDisabled == 0. This was not fully correct. Looking closely at the sources, „directRenderingDisabled“ has the semantic meaning „Don‘t even try to enable direct rendering“  -- and should therefore be renamed as such, for clarity.

I repeated the experiment I mentioned in comment #10, this time writing via gdb to the address in question. 

(gdb) x *(uint32_t*)(pI830->LpRing->virtual_start+pI830->ring_next) = (0x04 << 23) | flags

I got the very same error message as for reading:

(gdb) Cannot access memory at address 0xaf94898


Beeing at loss here, I walked the stack up to see if I could recognize something obviously wrong. I stopped within I830BlockHandler() (in i830_driver.c):

2425     if (!pI830->noAccel && !pI830->directRenderingEnabled)
2426         I830EmitFlush(pScrn);

Seeing this I thought that probably the above mentioned Enabled/Disabled inconsistency could be the real problem. And commenting out these two lines made user switching work again!  At least with status „works for me“.

Because this is only a workaround, not a general solution. There are more problems.

Looking at Xorg.0.log again: 

(WW) intel(0): ESR is 0x00000010, page table error
(WW) intel(0): PGTBL_ER is 0x00000029
(WW) intel(0): Existing errors found in hardware state.

And remember that strange notice, that even root could not execute drmSetBusID():

(EE) [drm] Could not set DRM device bus ID:-13:'Permission denied', uid:gid = 0:0, euid:egid = 0:0


This was triggered from git/drm/xf86drm.c returning -13:

 780 /**
 781  * Set the bus ID of the device.
 782  *
 783  * \param fd file descriptor.
 784  * \param busid bus ID string.
 785  *
 786  * \return zero on success, negative on failure.
 787  *
 788  * \internal
 789  * This function is a wrapper around the DRM_IOCTL_SET_UNIQUE ioctl, passing
 790  * the arguments in a drm_unique structure.
 791  */
 792 int drmSetBusid(int fd, const char *busid)
 793 {
 794     drm_unique_t u;
 795
 796     u.unique     = (char *)busid;
 797     u.unique_len = strlen(busid);
 798
 799     if (ioctl(fd, DRM_IOCTL_SET_UNIQUE, &u)) {
 800         return -errno;
 801     }
 802     return 0;
 803 }


To draw the line: all this eventually points to an underlying resource management problem. Possibly drmSetBusid can not legally called twice for the same device. Possibly this only is a respectless, naughty behaviour (next to an error) anywhere in a kernel driver, simply beeing a bit too strict, a bit too harsh with poor old X.


Access is probably denied by drm_unlocked_ioctl() in git/drm/linux-core/drm_drv.c:

647         if (!func) {
648                 DRM_DEBUG("no function\n");
649                 retcode = -EINVAL;
650         } else if (((ioctl->flags & DRM_ROOT_ONLY) && !capable(CAP_SYS_ADMIN)) ||
651                    ((ioctl->flags & DRM_AUTH) && !file_priv->authenticated) ||
652                    ((ioctl->flags & DRM_MASTER) && !file_priv->master)) {
653                 retcode = -EACCES;
654         } else {
655                 retcode = func(dev, kdata, file_priv);
656         }


Looking closer at this particular piece of drm kernel driver code, I guess its the DRM_MASTER branch witch makes the function return -EACCES(=-13), which would strongly support my resource quench hypothesis: drmSetBusid can't legally be called twice.

Comment 12 Hugo Mildenberger 2008-05-06 02:58:40 UTC

As a result of my discussion with on Rémi Cardona at http://bugs.gentoo.org/show_bug.cgi?id=219834 I hereby reopen this report, because the workaround I mentioned would else be considered by everyonle as solving the whole issue. This is certainly not the case, and that consequence was likewise not intended when I changed status from "NEW" to "WORKSFORME". The workaround cures only makes the system working again.

To root of the hole problem is, that the initialization of DRI fails for second session only, whereas it succeeds for the first. The symptoms are that second session and / or also first sesion dies (depending on mprotect calls beeing in effect or not), after switching back from the second session to the first. 

I already pointed out a possible reason for this behaviour. Grepping through DRM sources, looking at these drm_magic functions, I now think it may be also a DRM_AUTH issue.

Comment 13 Eric Anholt 2008-05-06 18:53:36 UTC

Thanks for the detailed debugging.  The issue was doing a flush emit while VT switched away -- the switched-away server would mash your ring while the current X  server was unaware.

As far as the driDisabled versus driEnabled flags -- sorry it's so confusing.  Things are kind of complicated because of the server regeneration support (when you log out of gdm and almost everything about the session is torn down and then restarted).  driDisabled is the flag for "the user requested that you please not ever try to do DRI".  driEnabled is the flag for "did we successfully turn on DRI this server generation".  If we didn't want to try enabling DRI in future generations, we could get down to just one flag, but that would be kind of counter to how server regeneration is supposed to work as I understand.  I would actually submit that server regeneration is a misfeature, but it exists.

Anyway, I think your bug should be fixed in master now.

Comment 14 Hugo Mildenberger 2008-05-07 02:53:48 UTC

Thank you very much for fixing this. Now I can give Amira a second try to run on top of the current intel driver. However, could you please also fix some of the issues I encountered within error handling, which made debugging of X so difficult?

These issues had been:

1.) The segmentation fault in DRILock (with master branch) after DRI-Reinitialization also failed for the first session (see https://bugs.freedesktop.org/show_bug.cgi?id=15807#c7).
This lead to a complete lock up of the system, as it did with the 2.2.902 codebase for another, but similar reason. Today I feel unsure, if X-Server's non-symbolic backtrace I attached in my comment is the right one, but the symbolic stack trace retrieved with gdb pointing to a segv in DRILock() is. I simply did not analyze this further, because I was after the primary cause.

2.) Could you please document the proper use of "I830_DEBUG" and that of the synonymous "INTEL_DEBUG" environment variable within driver's man page? I actually found it's usage grepping through the driver code, and recognized that in addition to exporting I830_DEBUG="all" within process environment one has to #undef I830_DEBUG to force the driver to actually make use of it.

Concerning DRI reinitalization: Having posted that question (if disabling DRI for the second X server would currently be a normal situation, on xorg@lists.freedesktop.org), I got the following answer from Dave Arlie:
 >Yes, currently DRI can only work on the first X server.
 >I've done a lot of work towards fixing this but it'll be around kernel
 >modesetting time before its truly fixed.

Finally, until today I do not understand, why gdb constantly complained about pI830->LpRing->virtual_start beeing invalid, refusing to read from this address range (and also refusing to write to), while running the related statement until it got run within wrong context, as you explained above. Is this really only a gdb problem? 

While I can't prove a cause-effect relationship for the moment, I have the impression that there actually is a positive correlation between this and the bug you now fixed, assuming you reverted "Use mprotect on unbound AGP memory to attempt to catch use while unbound" to get "Fix 965+ rendering issues with DRI disabled" running, which in turn introduced a call to I830EmitFlush() out of context. Git bisection identified the revertion of the mprotect calls as beeing the primary issue, and this was committed just before your 965+rendering issue fix:

http://gitweb.freedesktop.org/?p=xorg/driver/xf86-video-intel.git;a=shortlog;h=21783ec9dfad9aae0837fd2d8eb313a77f031046

2008-03-26  Eric Anholt  Fix 965+ rendering issues with DRI disabled.  
2008-03-25  Eric Anholt  Revert "Use mprotect on unbound AGP memory to attempt

Comment 15 Eric Anholt 2008-05-07 09:22:46 UTC

The mprotect calls were reverted because they failed on many linux systems, and they never once identified a bug.  They were *supposed* to make your server segfault when a bug such as the one you were experiencing occurred, right at the point where we did the wrong thing (poke device memory while switched away) but as you saw, it didn't.

As far as gdb goes, the linux kernel has been broken for reading device memory (such as agp, or mmio registers) for a long time.  The result would be refusing to read/write the memory.  Supposedly this is fixed upstream now or very soon.

Comment 16 Hugo Mildenberger 2008-05-07 09:43:06 UTC

(In reply to comment #15)
> The mprotect calls were reverted because they failed on many linux systems, and
> they never once identified a bug.  They were *supposed* to make your server
> segfault when a bug such as the one you were experiencing occurred, right at
> the point where we did the wrong thing (poke device memory while switched away)
> but as you saw, it didn't.

To the contrary what you wrote here, I at least saw your bugfix segfaulting  shortly after the momemt I reactivated the mprotect calls by locally reverting your commit '0af692e9ee5857e41ffdbaf760752a37737b21b7', which disabled them. So you now got bug report number one in a hopefully short row of Intel powered bugs. See https://bugs.freedesktop.org/show_bug.cgi?id=15807#c9.


> 
> As far as gdb goes, the linux kernel has been broken for reading device memory
> (such as agp, or mmio registers) for a long time.  The result would be refusing
> to read/write the memory.  Supposedly this is fixed upstream now or very soon.
> 
Aha, thank you for that information.

Comment 17 Steeve McCauley 2008-06-13 06:58:54 UTC

Can someone point me to the patch that fixes this bug?

Comment 18 Steeve McCauley 2008-06-13 08:11:32 UTC

Please excuse the previous request, this was my first time using git and was having a hard time figuring out the repository structure since there doesn't seem to be a builtin git list repository command.  I found the update for commit 36ec93300926084fb2951d69b001e4c67bc6ff79 on May 6, 2008.

$ git diff 0c00a638ef 36ec933009
diff --git a/src/i830_driver.c b/src/i830_driver.c
index b776ff6..2efad47 100644
--- a/src/i830_driver.c
+++ b/src/i830_driver.c
@@ -2411,7 +2411,7 @@ I830BlockHandler(int i,
      * after the page flipping updates, so there's no need to duplicate
      * the effort here.
      */
-    if (!pI830->noAccel && !pI830->directRenderingEnabled)
+    if (pScrn->vtSema && !pI830->noAccel && !pI830->directRenderingEnabled)
        I830EmitFlush(pScrn);
 
     I830VideoBlockHandler(i, blockData, pTimeout, pReadmask);

Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.