Summary: | [865G rev 02] Xserver crashes during VT switch | ||
---|---|---|---|
Product: | xorg | Reporter: | Hugo Mildenberger <Hugo.Mildenberger> |
Component: | Driver/intel | Assignee: | Eric Anholt <eric> |
Status: | RESOLVED FIXED | QA Contact: | Xorg Project Team <xorg-team> |
Severity: | normal | ||
Priority: | medium | CC: | eric, remi, steeve |
Version: | git | ||
Hardware: | x86 (IA32) | ||
OS: | Linux (All) | ||
URL: | http://bugs.gentoo.org/show_bug.cgi?id=219834 | ||
Whiteboard: | |||
i915 platform: | i915 features: | ||
Attachments: |
Description
Hugo Mildenberger
2008-05-02 13:11:39 UTC
Created attachment 16311 [details]
output of lshw
Created attachment 16312 [details]
output of lspci -v
Created attachment 16313 [details]
output of lsmod
Kernel version is
Linux 2.6.25-gentoo-r1 #27 PREEMPT Thu May 1 04:16:45 CEST 2008 i686 Intel(R) Pentium(R) 4 CPU 3.00GHz GenuineIntel GNU/Linux
Created attachment 16314 [details]
symbolic stack trace
Created attachment 16315 [details]
full driver trace around the moment x dies.
Complete trace (1.5 Megabyte) will be sent on request
Reassigning to Eric as he's the committer. I narrowed it further down. Remember, the problem was: 1.) login (via kde); 2.) login to another user; 3.) switch back to first login If I do that with current git master versions for xf86-video-intel, xserver, dri, mesa, updated drm + updated kernel part (i915 1.13.0 20080312) and updated keyboard, mouse + libpciaccess, switch back seems to succeed (no segv), but now the first session dies, as usual due to ring buffer stall. If I revert the reversion of the mprotect patch, I get a different behavior: VT number two now dies due to a segfault, located at i830_accel.c:221. This routine is short enough to print it here: 210 void 211 I830EmitFlush(ScrnInfoPtr pScrn) 212 { 213 I830Ptr pI830 = I830PTR(pScrn); 214 int flags = MI_WRITE_DIRTY_STATE | MI_INVALIDATE_MAP_CACHE; 215 216 if (IS_I965G(pI830)) 217 flags = 0; 218 219 { 220 BEGIN_BATCH(2); 221 OUT_BATCH(MI_FLUSH | flags); 222 OUT_BATCH(MI_NOOP); /* pad to quadword */ 223 ADVANCE_BATCH(); 224 } 225 } Because I wanted to get it more precise, I preprocessed i830_accel.c to i830_accel.i (by passing -save-temps to CFLAGS of original make), piped the result through indent and sed (removing #line statetments), and remade the driver with the processed file instead of i830_accel.c -- in order to get these ugly macros expanded and line by line traceable, and have compiler options remaining the same. The exact statement where the process dies (the line numbers are those of preprocessed file) is the following: 29235 *(volatile uint32_t *) (pI830->LpRing->virtual_start + 29236 pI830->ring_next) = (0x04 << 23) | flags; (gdb) print *pI830->LpRing $33 = {tail_mask = 131071, mem = 0x8239a38, virtual_start = 0xafa1e000 <Address 0xafa1e000 out of bounds>, head = 119616, tail = 119616, space = 131064} (gdb) print pI830->LpRing->virtual_start + pI830->ring_next $3 = (unsigned char *) 0xaf9429e8 <Address 0xaf9429e8 out of bounds> (gdb) print pI830->ring_next $4 = 88552 I think the problem lies within pI830->virtual_start beeing invalid. If this really is the sole problem, than pI830->head == pI830->tail indicates a currently empty ring buffer. Because I wanted to see a normal situation, I set a break point to I830Emitflush shortly after creation of the VT2 serving process. But that routine got called often, and gdb constantly indicated pI830->virtual_start beeing invalid. I have not set a breakpoint to I830_accel.c:221, possibly this statement will not be effective until switching back to original session? Will test it later on. By the way: As a late consequence of beeing unable to reinitialize DRM, the first process also dies with a segv, keeping him from exiting gracefully. Excerpt from X server log (this was produced using the unmodified master branch of xf86-video-intel) Backtrace: 0: /usr/bin/X(xf86SigHandler+0x79) [0x80c5b79] 1: [0xb7f13400] 2: /usr/lib/xorg/modules/drivers//intel_drv.so [0xb79985b6] 3: /usr/bin/X [0x8164183] 4: /usr/bin/X [0x813ccb2] 5: /usr/bin/X(BlockHandler+0x58) [0x808e598] 6: /usr/bin/X(WaitForSomething+0x12a) [0x812b2fa] 7: /usr/bin/X(Dispatch+0x7d) [0x808a9bd] 8: /usr/bin/X(main+0x46a) [0x80716ea] 9: /lib/libc.so.6(__libc_start_main+0xe0) [0xb7ac9400] 10: /usr/bin/X(FontFileCompleteXLFD+0x20d) [0x8070b31] Fatal server error: Caught signal 11. Server aborting Which translates to: #0 DRILock (pScreen=0x8238418, flags=0) at dri.c:2181 #1 0xb7a574fc in I830LeaveVT (scrnIndex=0, flags=0) at i830_driver.c:3162 #2 0x080c9e78 in xf86XVLeaveVT (index=0, flags=0) at xf86xv.c:1268 #3 0xb7b0a809 in glxDRILeaveVT (index=0, flags=0) at glxdri.c:889 #4 0x080a7328 in AbortDDX () at xf86Init.c:1300 #5 0x08137df8 in AbortServer () at log.c:406 #6 0x08138366 in FatalError (f=0xb7a7fc9c "lockup\n") at log.c:552 #7 0xb7a4a8cb in I830WaitLpRing (pScrn=0x820c608, n=131064, timeout_millis=0) at i830_accel.c:150 #8 0xb7a4accc in I830Sync (pScrn=0x820c608) at i830_accel.c:204 #9 0xb7a572c0 in i830_stop_ring (pScrn=0x820c608, flush=<value optimized out>) at i830_driver.c:1814 #10 0xb7a57e7b in I830EnterVT (scrnIndex=0, flags=0) at i830_driver.c:3282 #11 0x080cc412 in CMapEnterVT (index=0, flags=0) at xf86cmap.c:454 #12 0x080c9f05 in xf86XVEnterVT (index=0, flags=0) at xf86xv.c:1227 #13 0xb7b0a874 in glxDRIEnterVT (index=0, flags=0) at glxdri.c:871 #14 0x080c6856 in xf86Wakeup (blockData=0x0, err=-1, pReadmask=0x81f3360) at xf86Events.c:963 #15 0x0808e6c9 in WakeupHandler (result=-1, pReadmask=0x81f3360) at dixutils.c:418 #16 0x0812b396 in WaitForSomething (pClientsReady=0x85831e8) at WaitFor.c:239 #17 0x0808a9bd in Dispatch () at dispatch.c:372 #18 0x080716ea in main (argc=8, argv=0xbfdd2a34, envp=0x831b938) at main.c:433 The reason pDRIPriv beeing non zero, but invalid (0x104 or something), which seems to be far consequence of a previous server state, when message "DRM disabled" was printed. Exact entry was: (EE) [drm] Could not set DRM device bus ID. (EE) intel(0): [dri] DRIScreenInit failed. Disabling DRI. The point of that failure is line 2181, which I just have not available the moment of writing this comment. void 2175 DRILock(ScreenPtr pScreen, int flags) 2176 { 2177 DRIScreenPrivPtr pDRIPriv = DRI_SCREEN_PRIV(pScreen); 2178 2179 if(!pDRIPriv || !pDRIPriv->pLockRefCount) return; Have to quit this now, will attach some debug logs later. Created attachment 16346 [details]
Problem of first session, tries to write a message, but segv's
Created attachment 16348 [details]
commented debug log, with structure dumps
This gdb log was obtained directly after switching from second login back to first, using the current git master version after having locally reverted that commit which removed the mprotect calls.
Notice that at least two pointer are marked as "out of bounds": pI830->MMIOBase and pI830->LpRing->virtual_start.
A delighting marginal bug is, that there are two booleans of opposite semantics: pI830->directRenderingEnabled (==0) and pI830->direcRenderingDisabled, which is also = 0.
Effectively, only about five effective statements will be executed in this routine before the crash:
29204 pI830->ring_emitting = (2) * 4;
29207 if (pI830->LpRing->space < pI830->ring_emitting)
29208 I830WaitLpRing (pScrn, pI830->ring_emitting, 0);
29209 pI830->ring_next = pI830->LpRing->tail;
29212 pI830->ring_used = 0;
All other branches are unreachable.
Created attachment 16350 [details]
gdb steps through a memory write which he can't read
For the sake of completness, I just stopped the current session (which I now continue to use for writing this message) in I830EmitFlush(), and stepped through it. gdb continues to mark pI830->LpRing->virtual_start as invalid. However, that very same gdb steps through the following without any problem:
29235 *(volatile uint32_t *) (pI830->LpRing->virtual_start +
29236 pI830->ring_next) = (0x04 << 23) | flags;
Program is now halted at line no 2937, just after the statement above.
If I now say:
(gdb) print *(uint32_t*)(pI830->LpRing->virtual_start+pI830->ring_next)
gdb says:
Cannot access memory at address 0xaf948980
So, is this a bug in gdb or is here somthing really wired happening? I never saw something like this before. And while I continue to write this, the statement above had probably been executed ten million times.
Or was this chunk of memory just protected write-only, i.e. without read access?
Log is attached.
All the hassle seems to be a late consequence of session no. 2 not beeing able to (re-)initialize DRI. This was somewhat hard to see immediately, also because X logging facility is a mess when trying to track multiple sessions. In addition to various Xorg logs we have kdm.log, logging something similar, but not identical. And in addition to Xorg.0.log Xorg.1.log, ... we have also have Xorg.1.log.old Xorg.1.log.old ..). If a reborn Herakles please could cleanup also that pigsty? Even Microsoft has a consistent system logging facility. I finally found a solution. User switching now works fine, at least with the current master branch, and with the mprotect patch reactivated (I did not try the official, unreverted version yet). It seems as if X really has a very big problem to act consistently upon presence, absence or usabilty of one of its features. When looking at Xorg.1.log (which is the log output of the newly started session, created when trying to switch to another user), there is an entry which looked tolerable, but really wasn‘t: (II) [drm] DRM interface version 1.0 (EE) [drm] Could not set DRM device bus ID. (EE) intel(0): [dri] DRIScreenInit failed. Disabling DRI. Extending the debug output resulted in: (II) [drm] DRM interface version 1.0 (EE) [drm] Could not set DRM device bus ID:-13:'Permission denied', uid:gid = 0:0, euid:egid = 0:0 (EE) intel(0): [dri] DRIScreenInit failed. Disabling DRI. (I830DRIScreenInit: i830_dri.c: 633) I wrote previously, that a marginal bug seems to be pI830->directRenderingEnabled == pI830->directRenderingDisabled == 0. This was not fully correct. Looking closely at the sources, „directRenderingDisabled“ has the semantic meaning „Don‘t even try to enable direct rendering“ -- and should therefore be renamed as such, for clarity. I repeated the experiment I mentioned in comment #10, this time writing via gdb to the address in question. (gdb) x *(uint32_t*)(pI830->LpRing->virtual_start+pI830->ring_next) = (0x04 << 23) | flags I got the very same error message as for reading: (gdb) Cannot access memory at address 0xaf94898 Beeing at loss here, I walked the stack up to see if I could recognize something obviously wrong. I stopped within I830BlockHandler() (in i830_driver.c): 2425 if (!pI830->noAccel && !pI830->directRenderingEnabled) 2426 I830EmitFlush(pScrn); Seeing this I thought that probably the above mentioned Enabled/Disabled inconsistency could be the real problem. And commenting out these two lines made user switching work again! At least with status „works for me“. Because this is only a workaround, not a general solution. There are more problems. Looking at Xorg.0.log again: (WW) intel(0): ESR is 0x00000010, page table error (WW) intel(0): PGTBL_ER is 0x00000029 (WW) intel(0): Existing errors found in hardware state. And remember that strange notice, that even root could not execute drmSetBusID(): (EE) [drm] Could not set DRM device bus ID:-13:'Permission denied', uid:gid = 0:0, euid:egid = 0:0 This was triggered from git/drm/xf86drm.c returning -13: 780 /** 781 * Set the bus ID of the device. 782 * 783 * \param fd file descriptor. 784 * \param busid bus ID string. 785 * 786 * \return zero on success, negative on failure. 787 * 788 * \internal 789 * This function is a wrapper around the DRM_IOCTL_SET_UNIQUE ioctl, passing 790 * the arguments in a drm_unique structure. 791 */ 792 int drmSetBusid(int fd, const char *busid) 793 { 794 drm_unique_t u; 795 796 u.unique = (char *)busid; 797 u.unique_len = strlen(busid); 798 799 if (ioctl(fd, DRM_IOCTL_SET_UNIQUE, &u)) { 800 return -errno; 801 } 802 return 0; 803 } To draw the line: all this eventually points to an underlying resource management problem. Possibly drmSetBusid can not legally called twice for the same device. Possibly this only is a respectless, naughty behaviour (next to an error) anywhere in a kernel driver, simply beeing a bit too strict, a bit too harsh with poor old X. Access is probably denied by drm_unlocked_ioctl() in git/drm/linux-core/drm_drv.c: 647 if (!func) { 648 DRM_DEBUG("no function\n"); 649 retcode = -EINVAL; 650 } else if (((ioctl->flags & DRM_ROOT_ONLY) && !capable(CAP_SYS_ADMIN)) || 651 ((ioctl->flags & DRM_AUTH) && !file_priv->authenticated) || 652 ((ioctl->flags & DRM_MASTER) && !file_priv->master)) { 653 retcode = -EACCES; 654 } else { 655 retcode = func(dev, kdata, file_priv); 656 } Looking closer at this particular piece of drm kernel driver code, I guess its the DRM_MASTER branch witch makes the function return -EACCES(=-13), which would strongly support my resource quench hypothesis: drmSetBusid can't legally be called twice. As a result of my discussion with on Rémi Cardona at http://bugs.gentoo.org/show_bug.cgi?id=219834 I hereby reopen this report, because the workaround I mentioned would else be considered by everyonle as solving the whole issue. This is certainly not the case, and that consequence was likewise not intended when I changed status from "NEW" to "WORKSFORME". The workaround cures only makes the system working again. To root of the hole problem is, that the initialization of DRI fails for second session only, whereas it succeeds for the first. The symptoms are that second session and / or also first sesion dies (depending on mprotect calls beeing in effect or not), after switching back from the second session to the first. I already pointed out a possible reason for this behaviour. Grepping through DRM sources, looking at these drm_magic functions, I now think it may be also a DRM_AUTH issue. Thanks for the detailed debugging. The issue was doing a flush emit while VT switched away -- the switched-away server would mash your ring while the current X server was unaware. As far as the driDisabled versus driEnabled flags -- sorry it's so confusing. Things are kind of complicated because of the server regeneration support (when you log out of gdm and almost everything about the session is torn down and then restarted). driDisabled is the flag for "the user requested that you please not ever try to do DRI". driEnabled is the flag for "did we successfully turn on DRI this server generation". If we didn't want to try enabling DRI in future generations, we could get down to just one flag, but that would be kind of counter to how server regeneration is supposed to work as I understand. I would actually submit that server regeneration is a misfeature, but it exists. Anyway, I think your bug should be fixed in master now. Thank you very much for fixing this. Now I can give Amira a second try to run on top of the current intel driver. However, could you please also fix some of the issues I encountered within error handling, which made debugging of X so difficult? These issues had been: 1.) The segmentation fault in DRILock (with master branch) after DRI-Reinitialization also failed for the first session (see https://bugs.freedesktop.org/show_bug.cgi?id=15807#c7). This lead to a complete lock up of the system, as it did with the 2.2.902 codebase for another, but similar reason. Today I feel unsure, if X-Server's non-symbolic backtrace I attached in my comment is the right one, but the symbolic stack trace retrieved with gdb pointing to a segv in DRILock() is. I simply did not analyze this further, because I was after the primary cause. 2.) Could you please document the proper use of "I830_DEBUG" and that of the synonymous "INTEL_DEBUG" environment variable within driver's man page? I actually found it's usage grepping through the driver code, and recognized that in addition to exporting I830_DEBUG="all" within process environment one has to #undef I830_DEBUG to force the driver to actually make use of it. Concerning DRI reinitalization: Having posted that question (if disabling DRI for the second X server would currently be a normal situation, on xorg@lists.freedesktop.org), I got the following answer from Dave Arlie: >Yes, currently DRI can only work on the first X server. >I've done a lot of work towards fixing this but it'll be around kernel >modesetting time before its truly fixed. Finally, until today I do not understand, why gdb constantly complained about pI830->LpRing->virtual_start beeing invalid, refusing to read from this address range (and also refusing to write to), while running the related statement until it got run within wrong context, as you explained above. Is this really only a gdb problem? While I can't prove a cause-effect relationship for the moment, I have the impression that there actually is a positive correlation between this and the bug you now fixed, assuming you reverted "Use mprotect on unbound AGP memory to attempt to catch use while unbound" to get "Fix 965+ rendering issues with DRI disabled" running, which in turn introduced a call to I830EmitFlush() out of context. Git bisection identified the revertion of the mprotect calls as beeing the primary issue, and this was committed just before your 965+rendering issue fix: http://gitweb.freedesktop.org/?p=xorg/driver/xf86-video-intel.git;a=shortlog;h=21783ec9dfad9aae0837fd2d8eb313a77f031046 2008-03-26 Eric Anholt Fix 965+ rendering issues with DRI disabled. 2008-03-25 Eric Anholt Revert "Use mprotect on unbound AGP memory to attempt The mprotect calls were reverted because they failed on many linux systems, and they never once identified a bug. They were *supposed* to make your server segfault when a bug such as the one you were experiencing occurred, right at the point where we did the wrong thing (poke device memory while switched away) but as you saw, it didn't. As far as gdb goes, the linux kernel has been broken for reading device memory (such as agp, or mmio registers) for a long time. The result would be refusing to read/write the memory. Supposedly this is fixed upstream now or very soon. (In reply to comment #15) > The mprotect calls were reverted because they failed on many linux systems, and > they never once identified a bug. They were *supposed* to make your server > segfault when a bug such as the one you were experiencing occurred, right at > the point where we did the wrong thing (poke device memory while switched away) > but as you saw, it didn't. To the contrary what you wrote here, I at least saw your bugfix segfaulting shortly after the momemt I reactivated the mprotect calls by locally reverting your commit '0af692e9ee5857e41ffdbaf760752a37737b21b7', which disabled them. So you now got bug report number one in a hopefully short row of Intel powered bugs. See https://bugs.freedesktop.org/show_bug.cgi?id=15807#c9. > > As far as gdb goes, the linux kernel has been broken for reading device memory > (such as agp, or mmio registers) for a long time. The result would be refusing > to read/write the memory. Supposedly this is fixed upstream now or very soon. > Aha, thank you for that information. Can someone point me to the patch that fixes this bug? Please excuse the previous request, this was my first time using git and was having a hard time figuring out the repository structure since there doesn't seem to be a builtin git list repository command. I found the update for commit 36ec93300926084fb2951d69b001e4c67bc6ff79 on May 6, 2008. $ git diff 0c00a638ef 36ec933009 diff --git a/src/i830_driver.c b/src/i830_driver.c index b776ff6..2efad47 100644 --- a/src/i830_driver.c +++ b/src/i830_driver.c @@ -2411,7 +2411,7 @@ I830BlockHandler(int i, * after the page flipping updates, so there's no need to duplicate * the effort here. */ - if (!pI830->noAccel && !pI830->directRenderingEnabled) + if (pScrn->vtSema && !pI830->noAccel && !pI830->directRenderingEnabled) I830EmitFlush(pScrn); I830VideoBlockHandler(i, blockData, pTimeout, pReadmask); |
Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.