21622 – Multiple Bugs Caused System hangs completely after a while if Depth 16 is used

Status:	RESOLVED MOVED

Alias:	None

Product:	xorg
Classification:	Unclassified
Component:	Driver/siliconmotion
Version:	git
Hardware:	Other Linux (All)

Importance:	medium critical
Assignee:	Xorg Project Team
QA Contact:	Xorg Project Team

URL:
Whiteboard:
Keywords:

Depends on:
Blocks:

Reported:	2009-05-07 13:45 UTC by Zhang Le
Modified:	2018-08-10 20:46 UTC
CC List:	5 users

Attachments
xorg.conf (1.17 KB, text/plain) 2009-05-07 13:48 UTC, Zhang Le

Description Zhang Le 2009-05-07 13:45:42 UTC

I haven't found a specific sequence of actions which could reproduce this for sure.

After startx, dragging any window back and forth across the screen several times will hang the system, but there is no telling exactly when it will hang.

When the system hangs, I got this message on stdout:
exaCopyDirty: Pending damage region empty!

I will attach the xorg.conf I am using and the Xorg.0.log produced.

Comment 1 Zhang Le 2009-05-07 13:48:21 UTC

Created attachment 25609
xorg.conf

Comment 2 Zhang Le 2009-05-07 14:08:36 UTC

http://www.gentoo-cn.org/~zhangle/Xorg.0.log.bz2

The Xorg.0.log is too large, I bzip2'ed it.

Comment 3 Zhang Le 2009-05-07 14:11:58 UTC

BTW, please take a look at this picture:
http://www.gentoo-cn.org/~zhangle/2009-05-08-045027_1024x600_scrot.png
There are some black lines which should not exist.

Any idea?

Comment 4 Francisco Jerez 2009-05-08 02:03:23 UTC

Does it work right when using XAA instead of EXA?

You might also try:
> Option "NoAccel"

or
> Option "MCLK" "0Hz"

Comment 5 Zhang Le 2009-05-08 09:47:52 UTC

I have tried XAA, NoAccel and 0Hz, but no luck, :(.

I also found that this "exaCopyDirty: Pending damage region empty!" message sometimes appears before the crash.

Comment 6 Francisco Jerez 2009-05-08 10:15:53 UTC

(In reply to comment #5)
> I have tried XAA, NoAccel and OHz, but no luck, :(.
> 
> Also I found sometimes, this "exaCopyDirty: Pending damage region empty!"
> happened before the crash.
> 

It is very strange that it crashes with acceleration disabled. Could you attach the logs generated with some high verbosity level (like "-logverbose 7") and both config options? Like:

> Option "NoAccel"
> Option "MCLK" "0Hz"

Thanks.

Comment 7 Francisco Jerez 2009-05-10 11:48:40 UTC

Is there any specific application that tends to trigger it? Does it also happen when you leave the X server idling without any clients, not even a window manager?

To rule out that this is a duplicate of #15898 (it doesn't sound like the same bug, but just to be sure), you might try something like:
$ xset dpms force off; sleep 1; xset dpms force on

Does that hang your computer?

Comment 8 Zhang Le 2009-05-11 09:51:47 UTC

(In reply to comment #7)
> To discard that this is a duplicate of #15898 (It doesn't sound like the same,
> but just to be sure), you might try something like:
> $ xset dpms force off; sleep 1; xset dpms force on
> 
> Does that hang your computer?

I have tried both depth 24 and depth 16, and the system does not hang in either case. However, "xset dpms force on" does not turn the screen back on; I have to touch the keyboard to light it up.

Comment 9 Zhang Le 2009-05-14 02:41:57 UTC

(In reply to comment #7)
> Is there any specific application that tends to trigger it? Does it also happen
> when you leave the X server idling without any clients, not even a window
> manager?

I haven't found any specific application which tends to trigger it.
If I start X alone, then the system won't hang.

I also found that if I mount the partition where the log resides with the sync option, it becomes harder to hang the system.

Comment 10 Francisco Jerez 2009-05-14 03:19:47 UTC

(In reply to comment #9)
> (In reply to comment #7)
> > Is there any specific application that tends to trigger it? Does it also happen
> > when you leave the X server idling without any clients, not even a window
> > manager?
> 
> I haven't found any specific application which tends to trigger it.
> If I start X alone, then the system won't hang.
> 
> Also I found if I mount the partition where log resides as sync, it became
> harder to hang the system.
> 

You could try something like "x11perf -repeat 1 -all" to find out whether there is a specific request that tends to hang your server.

I think I would do it with acceleration disabled (Option "NoAccel" set in the "Device" section) and no wm running, to avoid other interactions.

Comment 11 Zhang Le 2009-05-14 03:40:14 UTC

(In reply to comment #10)
> I think I would do it with acceleration disabled (Option "NoAccel" set in the
> "Device" section) and no wm running, to avoid other interactions.

Yes, I have been using these two options:
> Option "NoAccel"
> Option "MCLK" "0Hz"

Comment 12 Zhang Le 2009-05-14 11:32:21 UTC

(In reply to comment #10)
> (In reply to comment #9)
> > (In reply to comment #7)
> > > Is there any specific application that tends to trigger it? Does it also happen
> > > when you leave the X server idling without any clients, not even a window
> > > manager?
> > 
> > I haven't found any specific application which tends to trigger it.
> > If I start X alone, then the system won't hang.
> > 
> > Also I found if I mount the partition where log resides as sync, it became
> > harder to hang the system.
> > 
> 
> You could try to use something like "x11perf -repeat 1 -all" to find out if
> there is an specific request that tends to hang your server.

X got a bus error. The last three lines were:

 160000 reps @   0.0497 msec ( 20100.0/sec): Char in 80-char rgb core line (Charter 10)

  22400 reps @   0.2502 msec (  4000.0/sec): Char in 30-char rgb core line (Charter 24)

 480000 reps @   0.0106 msec ( 94500.0/sec): Char in 80-char rgb core line (Courier 12)

I will try to fix it and run x11perf again.

Comment 13 Zhang Le 2009-05-18 23:20:29 UTC

I found that this test tends to crash X when depth 16 is used:

x11perf -repeat 1 -scroll10

Sometimes it is a bus error, sometimes a segfault, and sometimes a complete hang.

It works well when depth 24 is used.

Comment 14 Francisco Jerez 2009-05-19 01:38:57 UTC

(In reply to comment #13)
> I found this test tends to crash X when 16 depth is used:
> 
> x11perf -repeat 1 -scroll10
> 
> Sometimes bus error, sometimes segfault, sometimes completely hang.
> 
> It works well if 24 depth is used.
> 

You might be able to get a backtrace if you run the server with gdb and it doesn't completely hang. That might be useful.

Comment 15 Zhang Le 2009-05-22 10:51:21 UTC

(In reply to comment #14)
> You might be able to get a backtrace if you run the server with gdb and it
> doesn't completely hang. That might be useful.

I forgot to save the core file I generated previously. The current core file doesn't give any useful information.
(gdb) bt
#0  0xbe754028 in ?? ()
#1  0xbe754028 in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

The previous core file does not make sense to me. It segfaults because it tries to load a value from 0xffffffff. The correct address is in memory, as can be seen from the core file, and the assembly instruction sequence that loads the address seems correct too. I don't know why the loaded address would become 0xffffffff.

BTW, I found this command is enough to crash X:
x11perf -scroll10

I have also tried the -sync option and found that, whether the system hangs completely or X just gets a bus error/segfault, it happens at an XCopyArea call in the DoScroll function. I will trace into that function to find out what is really going on.

And with depth 24, x11perf -scroll10 works well.

Comment 16 rixed 2010-02-02 11:00:17 UTC

I'm experiencing the same problem. I compiled xorg-server 1.6.5 with debug symbols and got this when attaching gdb to X and running x11perf -scroll10:

Program received signal SIGBUS, Bus error.
0x2b1914a4 in _fbGetWindowPixmap (pWindow=Cannot access memory at address 0x4002e02c
) at fbscreen.c:88
88	}
(gdb) bt
#0  0x2b1914a4 in _fbGetWindowPixmap (pWindow=Cannot access memory at address 0x4002e02c
) at fbscreen.c:88
Cannot access memory at address 0x4002e054
(gdb) info threads 
* 1 Thread 0x2b138000 (LWP 2421)  0x2b1914a4 in _fbGetWindowPixmap (pWindow=Cannot access memory at address 0x4002e02c
)
    at fbscreen.c:88

But I think the real problem happens earlier since, although the bus error seems to trigger instantaneously, the display is already wrong (there is a wrong pattern of approximately 8x80 pixels in the bottom left of the picture).

Also, sometimes the system just hangs completely, usually without anything on the kernel console, yet once I got this :

spurious 8259A interrupt: IRQ13.

I wouldn't be surprised if on this hardware (lemote yeeloong) irq13 were bound to the SM712 (got to check this).

Comment 17 rixed 2010-02-02 11:29:17 UTC

Another time I was luckier:

Program received signal SIGSEGV, Segmentation fault.
0x10216c58 in getDrawableDamageRef (pDrawable=0x10378ed8) at damage.c:92
92		if (pScreen->GetWindowPixmap
(gdb) bt
#0  0x10216c58 in getDrawableDamageRef (pDrawable=0x10378ed8) at damage.c:92
#1  0x10217ac0 in damageRegionProcessPending (pDrawable=0x10378ed8)
    at damage.c:386
#2  0x1021a3fc in damageCopyArea (pSrc=0x10378ed8, pDst=0x10378ed8, 
    pGC=0x10379bd8, srcx=10, srcy=23, width=10, height=10, dstx=10, dsty=10)
    at damage.c:951
#3  0x1004bb90 in ProcCopyArea (client=0x10376980) at dispatch.c:1575
#4  0x10047a18 in Dispatch () at dispatch.c:456
#5  0x100211b4 in main (argc=4, argv=0x7fda8154, envp=0x7fda8168) at main.c:397

Also :

(gdb) l
87	    if (pDrawable->type == DRAWABLE_WINDOW)
88	    {
89		ScreenPtr   pScreen = pDrawable->pScreen;
90	
91		pPixmap = 0;
92		if (pScreen->GetWindowPixmap
93	#ifdef ROOTLESS_WORKAROUND
94		    && ((WindowPtr)pDrawable)->viewable
95	#endif
96		    )
(gdb) p *pDrawable
$1 = {type = 0 '\000', class = 1 '\001', depth = 16 '\020', 
  bitsPerPixel = 16 '\020', id = 2097153, x = 3, y = 0, width = 600, 
  height = 600, pScreen = 0x10302e28, serialNumber = 11}
(gdb) p pScreen
$2 = (ScreenPtr) 0xffffffff

This is strange since it was compiled with -O0 !?

So :

(gdb) p *pDrawable->pScreen
$4 = {myNum = 0, id = 0, width = 1024, height = 600, mmWidth = 270, 
  mmHeight = 158, numDepths = 7, rootDepth = 16 '\020', 
  allowedDepths = 0x103031d8, rootVisual = 33, defColormap = 32, 
  minInstalledCmaps = 1, maxInstalledCmaps = 1, 
  backingStoreSupport = 0 '\000', saveUnderSupport = 0 '\000', 
  whitePixel = 65535, blackPixel = 0, rgf = 0, GCperDepth = {0x1030c730, 
    0x1030c818, 0x1030c900, 0x1030ca10, 0x1030cb30, 0x1030cc50, 0x1030cd70, 
    0x1030ce90, 0x0}, PixmapPerDepth = {0x1030cfb0}, devPrivate = 0x1030c1c0, 
  numVisuals = 3, visuals = 0x1030c468, 
  CloseScreen = 0x101a2ff0 <compCloseScreen>, 
  QueryBestSize = 0x2b4a1300 <fbQueryBestSize>, 
  SaveScreen = 0x2b431030 <SMI_SaveScreen>, 
  GetImage = 0x1016992c <miSpriteGetImage>, 
  GetSpans = 0x10169c10 <miSpriteGetSpans>, 
  PointerNonInterestBox = 0x1015e888 <miPointerPointerNonInterestBox>, 
  SourceValidate = 0x10169f58 <miSpriteSourceValidate>, 
  CreateWindow = 0x101a64b0 <compCreateWindow>, 
  DestroyWindow = 0x101a66d4 <compDestroyWindow>, 
  PositionWindow = 0x101a4e44 <compPositionWindow>, 
  ChangeWindowAttributes = 0x101a3300 <compChangeWindowAttributes>, 
  RealizeWindow = 0x101a5070 <compRealizeWindow>, 
  UnrealizeWindow = 0x101a5158 <compUnrealizeWindow>, 
  ValidateTree = 0x1016d93c <miValidateTree>, PostValidateTree = 0, 
  WindowExposures = 0x1015051c <miWindowExposures>, PaintWindowBackground = 0, 
  PaintWindowBorder = 0, CopyWindow = 0x101a5e40 <compCopyWindow>, 
  ClearToBackground = 0x10175880 <miClearToBackground>, 
  ClipNotify = 0x101a5240 <compClipNotify>, RestackWindow = 0, 
  CreatePixmap = 0x2b49f348 <fbCreatePixmap>, 
  DestroyPixmap = 0x101c482c <ShmDestroyPixmap>, SaveDoomedAreas = 0, 
  RestoreAreas = 0, ExposeCopy = 0, TranslateBackingStore = 0, 
  ClearBackingStore = 0, DrawGuarantee = 0, BackingStoreFuncs = {
    SaveAreas = 0, RestoreAreas = 0, SetClipmaskRgn = 0, GetImagePixmap = 0, 
    GetSpansPixmap = 0}, RealizeFont = 0x2b4a12a8 <fbRealizeFont>, 
  UnrealizeFont = 0x2b4a12d4 <fbUnrealizeFont>, 
  ConstrainCursor = 0x1015e764 <miPointerConstrainCursor>, 
  CursorLimits = 0x102124f4 <AnimCurCursorLimits>, 
  DisplayCursor = 0x102129b0 <AnimCurDisplayCursor>, 
  RealizeCursor = 0x10212d54 <AnimCurRealizeCursor>, 
  UnrealizeCursor = 0x10212e54 <AnimCurUnrealizeCursor>, 
  RecolorCursor = 0x10212fd0 <AnimCurRecolorCursor>, 
  SetCursorPosition = 0x10212c2c <AnimCurSetCursorPosition>, 
  CreateGC = 0x10217dd4 <damageCreateGC>, 
  CreateColormap = 0x100a7710 <CMapCreateColormap>, 
  DestroyColormap = 0x100b3358 <DGADestroyColormap>, 
  InstallColormap = 0x101a31d0 <compInstallColormap>, 
  UninstallColormap = 0x100b3584 <DGAUninstallColormap>, 
  ListInstalledColormaps = 0x2b47b1a0 <fbListInstalledColormaps>, 
  StoreColors = 0x100a79cc <CMapStoreColors>, 
  ResolveColor = 0x2b47b2a4 <fbResolveColor>, 
  BitmapToRegion = 0x2b49f4e4 <fbPixmapToRegion>, 
  SendGraphicsExpose = 0x1014fe34 <miSendGraphicsExpose>, 
  BlockHandler = 0x101a3514 <compBlockHandler>, 
  WakeupHandler = 0x1005a4dc <NoopDDA>, blockData = 0x0, wakeupData = 0x0, 
  devPrivates = 0x10306110, 
  CreateScreenResources = 0x10168a24 <miCreateScreenResources>, 
  ModifyPixmapHeader = 0x10168630 <miModifyPixmapHeader>, 
  GetWindowPixmap = 0x2b4a1448 <_fbGetWindowPixmap>, 
  SetWindowPixmap = 0x1021f0e4 <damageSetWindowPixmap>, 
  GetScreenPixmap = 0x10168d2c <miGetScreenPixmap>, 
  SetScreenPixmap = 0x10168d58 <miSetScreenPixmap>, pScratchPixmap = 0x0, 
  totalPixmapSize = 48, MarkWindow = 0x10175bc0 <miMarkWindow>, 
  MarkOverlappedWindows = 0x10175c70 <miMarkOverlappedWindows>, 
  ChangeSaveUnder = 0, PostChangeSaveUnder = 0, 
  MoveWindow = 0x101a5618 <compMoveWindow>, 
  ResizeWindow = 0x101a583c <compResizeWindow>, 
  GetLayerWindow = 0x10177810 <miGetLayerWindow>, 
  HandleExposures = 0x10175fd8 <miHandleValidateExposures>, 
  ReparentWindow = 0x101a5c28 <compReparentWindow>, 
  SetShape = 0x1017783c <miSetShape>, 
  ChangeBorderWidth = 0x101a5a50 <compChangeBorderWidth>, 
  MarkUnrealizedWindow = 0x10177db8 <miMarkUnrealizedWindow>, 
  DeviceCursorInitialize = 0x1015e9dc <miPointerDeviceInitialize>, 
  DeviceCursorCleanup = 0x1015eb54 <miPointerDeviceCleanup>}

(just in case it's informative).

The places are not always the same, and neither is the error (segfault, bus error or hang); sometimes the whole register set seems bogus (including the stack pointer)...

Notice that here too it works well if the depth is 24 or 8. I didn't find any other config parameter that seems to have any influence.

Comment 18 rixed 2010-02-02 12:18:56 UTC

If someone want to have a look I've got a core file now.
Download it here :

http://happyleptic.org/~rixed/X_siliconmotion_core.tgz

It again crashed after the ProcCopyArea()

Also, while playing with this bug I got another spurious IRQ13.
Not sure it's related.

I also changed the x11perf test windows so that they are not clipped and moved them away from the screen border (and thus the video memory borders), but this didn't change anything.
I also noticed that the bug seems to hang the machine if you start X at 16bpp after a fresh reboot, but not if you first run X at 32bpp, then kill it and restart at 16bpp. You can then do many experiments before hanging the host.

I keep looking for clues...

Comment 19 Francisco Jerez 2010-02-02 12:46:06 UTC

It would be interesting to know what happens with some other DDX, e.g. xf86-video-dummy.

Comment 20 rixed 2010-02-02 13:07:27 UTC

As for the strange register that gets the value 0xffffffff from nowhere, could it be possible that the 16bpp code triggers an exception that's not handled properly by kernel or hardware, thus leaving -1 in a register ?
I can't think of any other explanation for now.

Comment 21 rixed 2010-02-02 13:08:24 UTC

(In reply to comment #19)
> It would be interesting to know what happens with some other DDX, e.g.
> xf86-video-dummy.
> 

I already tried the fb driver at 16bpp and it worked all right.

Comment 22 rixed 2010-02-03 03:14:09 UTC

All the bugs I was able to catch with gcc seemed to be related to the stack being corrupted asynchronously. For instance, 0xffffffff is present on the stack instead of a valid address in this kind of instruction sequence:

lw v0,4(s8) ; load the word at offset 4 from the stack frame pointed by s8 into v0

lw v0,0(v0)  ; dereference it, successfully
...do other reads...
lw v0,4(s8) ; reload the same value, which was not modified
lw v0,0(v0)  ; <- segfault here, v0=0xffffffff and you can see with gdb x cmd that $s8+4 actually holds 0xffffffff.

So my theory is that when copying an area, some pixels get written to stack addresses (0xffffffff would be a pixel, and the stack appears to be not very far from the /dev/mem mapping used to access video memory), but with different cache settings, so that the former correct values were still read from the stack up to the point where the cache gets refreshed?
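
To illustrate why 0xffffffff is plausible as stray pixel data at depth 16, here is a minimal sketch. It assumes the usual RGB565 layout, which this report does not state explicitly (the screen dump in comment 17 only shows whitePixel = 65535), and all function names are hypothetical. A white pixel is 0xFFFF, so two adjacent white pixels read back as one 32-bit word give 0xffffffff.

```c
#include <assert.h>
#include <stdint.h>

/* Pack 8-bit channels into a 16-bit RGB565 pixel (assumed layout). */
uint16_t pack_rgb565(uint8_t r, uint8_t g, uint8_t b)
{
    return (uint16_t)(((r >> 3) << 11) | ((g >> 2) << 5) | (b >> 3));
}

/* Combine two adjacent 16-bit pixels into one 32-bit word, first pixel in
 * the low half (little-endian view). For identical pixels the order is
 * irrelevant. */
uint32_t two_pixels_as_word(uint16_t first, uint16_t second)
{
    return ((uint32_t)second << 16) | first;
}

/* Sanity check: white matches whitePixel = 65535 from the screen dump,
 * and a pair of white pixels matches the bogus value seen on the stack. */
void check_white_pixels(void)
{
    uint16_t white = pack_rgb565(255, 255, 255);
    assert(white == 0xFFFF);
    assert(two_pixels_as_word(white, white) == 0xFFFFFFFFu);
}
```

This only shows that the value is consistent with the theory; it does not prove that pixel data is what ends up on the stack.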

I've looked into the xorg-server CopyArea code for some other kind of asynchronous memory write but couldn't find any (no hardware ROP, no DMA transfer and no pixman).

Comment 23 rixed 2010-02-03 11:05:42 UTC

An even simpler case of magic:

Program received signal SIGSEGV, Segmentation fault.
0x2ac4a118 in pixman_region_selfcheck (reg=0x4008802c) at pixman-region.c:2415
2415	    if ((reg->extents.x1 > reg->extents.x2) ||
(gdb) disassemble pixman_region_selfcheck 
Dump of assembler code for function pixman_region_selfcheck:
0x2ac4a0f8 <pixman_region_selfcheck+0>:	addiu	sp,sp,-64
0x2ac4a0fc <pixman_region_selfcheck+4>:	sd	s8,56(sp)
0x2ac4a100 <pixman_region_selfcheck+8>:	move	s8,sp
0x2ac4a104 <pixman_region_selfcheck+12>:	lui	a1,0x6
0x2ac4a108 <pixman_region_selfcheck+16>:	addu	a1,a1,t9
0x2ac4a10c <pixman_region_selfcheck+20>:	addiu	a1,a1,30328
0x2ac4a110 <pixman_region_selfcheck+24>:	sw	a0,32(s8)
0x2ac4a114 <pixman_region_selfcheck+28>:	lw	v0,32(s8)
0x2ac4a118 <pixman_region_selfcheck+32>:	lh	v1,0(v0)

So we segfaulted here, right after storing a0 (the reg address) onto the stack, re-reading it into v0, then trying to dereference v0. Now guess what:

(gdb) info registers
                  zero               at               v0               v1
 R0   0000000000000000 ffffffffcfffffff 000000004008802c 00000000103888e8 
                    a0               a1               a2               a3
 R4   00000000103888e8 000000002acb1770 00000000103894b8 000000000000000a 

v0 is 4008802c thus the segfault, but a0 is correct (103888e8), and the stack
location holds :

(gdb) x $s8+32
0x7f978ce0:	0x4008802c

The wrong one !
(notice how those wrong values appearing in the stack are always either 0xffffffff or 0x400XXXXX).

So either "sw a0,32(s8)" is bogus, or some vicious interrupt changed the stack just after that, or we jumped to this instruction from somewhere else.
Notice that the other values on the stack seem all right and the stack frame is OK, so I do not believe we came here after a misplaced jump.

This bug is driving me mad. I'm not familiar with MIPS nor X11, but this looks like magic to me.

Comment 24 rixed 2010-02-03 13:48:07 UTC

If an X11 guru would like to check this code in fbBlt I would be thankful, since the scroll10 test works when I comment it out:

    if (alu == GXcopy && pm == FB_ALLONES && !reverse &&
            !(srcX & 7) && !(dstX & 7) && !(width & 7)) {
        int i;
        CARD8 *src = (CARD8 *) srcLine;
        CARD8 *dst = (CARD8 *) dstLine;

        srcStride *= sizeof(FbBits);
        dstStride *= sizeof(FbBits);
        width >>= 3;
        src += (srcX >> 3);
        dst += (dstX >> 3);

        if (!upsidedown)
            for (i = 0; i < height; i++)
                MEMCPY_WRAPPED(dst + i * dstStride, src + i * srcStride, width);
        else
            for (i = height - 1; i >= 0; i--)
                MEMCPY_WRAPPED(dst + i * dstStride, src + i * srcStride, width);

        return;
    }

The code that follows handles this case correctly as well as the more general case anyway.

I'm currently running the whole x11perf test suite, but I already managed to display the reddit homepage in firefox without any destroyed glyphs (wrong glyphs were the main manifestation of the bug for me), so it's clearly better without this code.

Comment 25 rixed 2010-02-03 16:42:32 UTC

OK, so now I have recompiled my whole xorg server with the usual CFLAGS, with the above code commented out.

I keep getting the wrong glyphs (black lines on top of some chars), but only with EXA acceleration. If I choose NoAccel or XAA it works all right (so far).

With EXA some fonts are corrupted and after a while the computer freezes.

With XAA, so far so good.

Comment 26 rixed 2010-02-04 06:41:55 UTC

The cited code apparently causes problems because the copied source and destination lines are allowed to overlap. Replacing memcpy in fb/fb.h with memmove solves the problem with NoAccel.
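
A minimal sketch of the overlap issue, assuming a byte-aligned row copy similar to the fast path quoted in comment 24. copy_rows is a hypothetical helper, not the actual fb code. It uses memmove, which is defined for overlapping source and destination ranges, whereas the behaviour of memcpy is undefined in that case.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copy `height` rows of `width` bytes each; rows are `stride` bytes apart.
 * memmove keeps every row copy correct even when the source and
 * destination byte ranges overlap, which memcpy does not guarantee. */
void copy_rows(unsigned char *dst, const unsigned char *src,
               size_t stride, size_t width, int height, int upsidedown)
{
    assert(dst != NULL && src != NULL);
    assert(width <= stride);

    if (!upsidedown) {
        for (int i = 0; i < height; i++)
            memmove(dst + (size_t)i * stride, src + (size_t)i * stride, width);
    } else {
        for (int i = height - 1; i >= 0; i--)
            memmove(dst + (size_t)i * stride, src + (size_t)i * stride, width);
    }
}
```

For example, shifting six bytes of "abcdefgh" right by two within the same buffer yields "ababcdef" with this helper; the same call built on memcpy has no guaranteed result.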

I still have to look for what's causing EXA accel mode to crash.

I noticed that the memory barrier for MIPS is merely a sequence of nops.
I have to check the manual, but I guess it's inappropriate for the Loongson.

Comment 27 Matt Turner 2012-05-13 09:34:57 UTC

I'm seeing this too on my Yeeloong, SM712.

Thanks to rixed for putting 'spurious 8259A interrupt' in his post, that's how I found this bug.

I've started using Depth 24 (which requires putting Virtual 1024 600 into xorg.conf so that it allocates a 1024x600 framebuffer instead of 1024x1024 which won't fit into the VRAM given the tiny amount needed by the cursor).
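
The arithmetic behind this can be sketched as follows. Two things are assumptions here and are not stated in this report: that depth 24 is stored as 4 bytes per pixel, and that the SM712 has 4 MiB of video memory. The function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes needed for a width x height framebuffer at the given pixel size. */
uint64_t framebuffer_bytes(uint32_t width, uint32_t height,
                           uint32_t bytes_per_pixel)
{
    assert(bytes_per_pixel > 0);
    return (uint64_t)width * height * bytes_per_pixel;
}

/* Returns 1 if the framebuffer plus `reserved` bytes (e.g. for the
 * hardware cursor) fits into `vram_bytes`, 0 otherwise. */
int fits_in_vram(uint64_t fb_bytes, uint64_t reserved, uint64_t vram_bytes)
{
    return fb_bytes + reserved <= vram_bytes;
}
```

Under these assumptions a 1024x1024 framebuffer needs 4194304 bytes, which is the whole 4 MiB, so even a small cursor reservation no longer fits. A 1024x600 framebuffer needs 2457600 bytes and leaves room to spare.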

Comment 28 Petr Pisar 2012-08-14 18:52:20 UTC

Since XAA was removed in xorg-server-1.12.99, only EXA remains, and I experience lock-ups when using any GTK2 application (Loongson MIPS, 16 bpp depth). The common visual symptom is that some widgets (icons, shaded buttons) get corrupted. I suspected pixman, but downgrading pixman or disabling the Loongson-optimized paths in pixman did not help. So I think it has something to do with the SMI driver.

Comment 29 Tom Li 2014-05-30 03:01:30 UTC

Same problem here.

I'll try the solution from Comment #24.

Comment 30 Tom Li 2014-05-31 15:50:37 UTC

After some tests, I realized that I have found another issue.

It can be reproduced by

    x11perf --copywinpix100

then the system will completely hang.

The last log lines from X are:

> SMI_SetupForSolidFill
color=0000FFFF rop=03
DPR14 = 0000FFFF
DPR34 = FFFFFFFF
DPR38 = FFFFFFFF
< SMI_SetupForSolidFill
> SMI_SubsequentSolidFillRect
x=3 y=0 w=600 h=600
DPR04 = 00030000
DPR08 = 02580258
DPR0C = 800000F0
< SMI_SubsequentSolidFillRect
> SMI_AccelSync
< SMI_AccelSync
(completely hang)

The last SMI_AccelSync returns. Then what happens? Because the system hangs completely, neither SysRq nor networking works, so I can't find out the name of the next function.

But running with depth 24 doesn't have this problem.
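
For anyone reading the register dump above: the values line up with the logged coordinates if each DPR word packs two 16-bit fields, high half first. DPR04 = 00030000 matches x=3, y=0, and DPR08 = 02580258 matches w=600, h=600 (0x258 is 600). This layout is inferred from the log only, not taken from the SM712 documentation, and the helper names below are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Pack two 16-bit fields into one 32-bit register word, high field first. */
uint32_t pack_pair(uint16_t hi, uint16_t lo)
{
    return ((uint32_t)hi << 16) | lo;
}

/* Split a register word back into its two 16-bit fields. */
void unpack_pair(uint32_t word, uint16_t *hi, uint16_t *lo)
{
    assert(hi != NULL && lo != NULL);
    *hi = (uint16_t)(word >> 16);
    *lo = (uint16_t)(word & 0xFFFFu);
}
```

So the last fill before the hang covers the same 600x600 rectangle at x=3, y=0 that shows up as the drawable in comment 17.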

Comment 31 Tom Li 2014-05-31 15:52:50 UTC

I'm using an old version of X with XAA, so the problem can't be caused by EXA.

Comment 32 Tom Li 2014-06-01 13:27:49 UTC

When the kernel hangs, I got

    [ 1222.876000] spurious 8259A interrupt: IRQ0.

from netconsole.

Comment 33 Tom Li 2014-06-01 14:12:52 UTC

I attach gdb to a running X server, then set a breakpoint and use

    while 1
       shell sleep 0.1
       next
    end

to see which line of code crash the machine.

437     in /var/tmp/portage/x11-base/xorg-server-1.11.4-r3/work/xorg-server-1.11.4/dix/dispatch.c
438     in /var/tmp/portage/x11-base/xorg-server-1.11.4-r3/work/xorg-server-1.11.4/dix/dispatch.c
439     in /var/tmp/portage/x11-base/xorg-server-1.11.4-r3/work/xorg-server-1.11.4/dix/dispatch.c
(hang)

Code:

437  result = XaceHookDispatch(client, client->majorOp);
438  if (result == Success)
439      result = (* client->requestVector[client->majorOp])(client);

So, a memory dereference crashes the machine. It may mean that some areas of memory were corrupted by the buggy driver.

Luckily, this time a kernel panic occurred instead of a complete hang:

[  583.648000] spurious 8259A interrupt: IRQ0.
[  583.656000] spurious 8259A interrupt: IRQ13.
[  583.660000] spurious 8259A interrupt: IRQ6.
[  583.664000] CPU 0 Unable to handle kernel paging request at virtual address 0000000000000020, epc == 0000000000000020, ra == ffffffff80279ae4
[  583.664000] Oops[#1]:
[  583.664000] CPU: 0 PID: 235 Comm: X Not tainted 3.14.4-yeeloong-gaizi+ #10
[  583.664000] task: 98000000bfbb6db0 ti: 98000000b8370000 task.ti: 98000000b8370000
[  583.664000] $ 0   : 0000000000000000 ffffffffcfffffff 0000000000000020 0000000000000000
[  583.664000] $ 4   : 0000000000000008 98000000bf360000 0000000000000020 0000000000000002
[  583.664000] $ 8   : 0000000000000001 000000000000ffff 000000000000ffff 000000007628fafc
[  583.664000] $12   : 00000000140044e0 000000001000001f 00000000760b4dfa ffffffffffffffff
[  583.664000] $16   : 0000000000000000 0000000000000000 0000000000000008 ffffffff80b61cc0
[  583.664000] $20   : 000000000000ffff ffffffff80adef58 0000000000000008 ffffffff80ba0000
[  583.664000] $24   : 00000000000004b0 0000000000000800
[  583.664000] $28   : 98000000b8370000 98000000b8373e00 98000000bf345080 ffffffff80279ae4
[  583.664000] Hi    : 0000000000000000
[  583.664000] Lo    : 0000000000000000
[  583.664000] epc   : 0000000000000020 0x20
[  583.664000]     Not tainted
[  583.664000] ra    : ffffffff80279ae4 handle_irq_event_percpu+0x6c/0x220
[  583.664000] Status: 140044e2 KX SX UX KERNEL EXL
[  583.664000] Cause : 10008408
[  583.664000] BadVA : 0000000000000020
[  583.664000] PrId  : 00006303 (ICT Loongson-2)
[  583.664000] Modules linked in: netconsole configfs arc4 rtl8187 eeprom_93cx6 led_class mac80211 cfg80211 rfkill loongson2_cpufreq psmouse snd_cs5535audio 8139too mii snd_ac97_codec ac97_bus snd_pcm snd_timer snd soundcore ipv6
[  583.664000] Process X (pid: 235, threadinfo=98000000b8370000, task=98000000bfbb6db0, tls=000000007712b4a0)
[  583.664000] Stack : ffffffff80b61cc0 000000000000ffff 000000000000ffff 000000000000ffff
          000000000000ffff ffff00000000ffff 000000000000ffff 000000001089c848
          00000000000001e0 ffffffff80279d04 000000000000ffff 000000000000ffff
          ffffffff80b61cc0 ffffffff8027d278 0000000000000000 ffffffff80279154
          ffff000000000000 ffffffff80209310 0000000000000008 ffffffff80203f38
          00000000000001e0 ffffffff80206f40 0000000000000000 ffffffffcfffffff
          000000007628f382 00000000760b4a10 000000007628f302 00000000760b4950
          00000000000000c2 0000000000000002 0000000000000001 000000000000ffff
          000000000000ffff 000000007628fafc 000000000000001d 00000000000000c8
          00000000760b4dfa ffffffffffffffff 0000000000000000 000000000000ffff
          ...
[  583.664000] Call Trace:
[  583.664000] [<ffffffff80279d04>] handle_irq_event+0x6c/0xa8
[  583.664000] [<ffffffff8027d278>] handle_level_irq+0xb0/0x170
[  583.664000] [<ffffffff80279154>] generic_handle_irq+0x5c/0x80
[  583.664000] [<ffffffff80209310>] do_IRQ+0x18/0x28
[  583.664000] [<ffffffff80203f38>] mach_irq_dispatch+0x50/0x78
[  583.664000] [<ffffffff80206f40>] ret_from_irq+0x0/0x4
[  583.664000]
[  583.664000]
Code: (Bad address in epc)
[  583.664000]
[  583.672000] ---[ end trace b4344c9ded821fc4 ]---
[  583.672000] Kernel panic - not syncing: Fatal exception in interrupt

Comment 34 Tom Li 2014-06-01 16:10:38 UTC

BTW, the spurious IRQs seen before the hang are almost random:

[  168.972000] spurious 8259A interrupt: IRQ3.
[  251.516000] spurious 8259A interrupt: IRQ13.
[  254.968000] spurious 8259A interrupt: IRQ3.
[  254.968000] spurious 8259A interrupt: IRQ10.
[  260.968000] spurious 8259A interrupt: IRQ6.
[   46.704000] spurious 8259A interrupt: IRQ13.
[   47.940000] spurious 8259A interrupt: IRQ10.
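
To compare such logs across runs, the IRQ numbers can be pulled out of the kernel messages mechanically. The following is a small sketch that assumes the exact message format shown above; parse_spurious_irq is a hypothetical helper.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Extract the IRQ number from one kernel log line of the form
 * "[  168.972000] spurious 8259A interrupt: IRQ3."
 * Returns the IRQ number, or -1 if the line does not match. */
int parse_spurious_irq(const char *line)
{
    int irq;

    assert(line != NULL);
    /* %*f skips the timestamp; whitespace in the format matches any run
     * of blanks, including none. */
    if (sscanf(line, " [ %*f ] spurious 8259A interrupt: IRQ%d", &irq) == 1)
        return irq;
    return -1;
}
```

Feeding each dmesg line through this function and counting the results gives a quick tally of which IRQ lines are reported.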

Comment 35 Tom Li 2014-06-01 16:17:34 UTC

It is not a problem of XAA implementation. If I use NoAccel, it also crashes.

> SMI_SaveScreen
 > SMILynx_DisplayPowerManagementSet
 < SMILynx_DisplayPowerManagementSet
< SMI_SaveScreen

dmesg:
[   65.168000] spurious 8259A interrupt: IRQ13.
[   65.172000] spurious 8259A interrupt: IRQ0.

Comment 36 Tom Li 2014-06-01 16:28:03 UTC

With EXA, I get corrupted fonts when running any GTK application, the kernel receives a spurious interrupt, and then the system hangs completely after a while.

But the x11perf --copywinpix100 test works fine. It is very strange.

Comment 37 Tom Li 2014-06-01 16:51:17 UTC

Another kernel panic:

[ 1556.616000] spurious 8259A interrupt: IRQ0.
[ 1556.620000] CPU 0 Unable to handle kernel paging request at virtual address 000000000000a400, epc == 000000000000a400, ra == ffffffff80279ae4
[ 1556.620000] Oops[#1]:
[ 1556.620000] CPU: 0 PID: 4661 Comm: X Not tainted 3.14.4-yeeloong-gaizi+ #10
[ 1556.620000] task: 98000000b86dba80 ti: 98000000bfef8000 task.ti: 98000000bfef8000
[ 1556.620000] $ 0   : 0000000000000000 ffffffffcfffffff 000000000000a400 0000000000000000
[ 1556.620000] $ 4   : 0000000000000008 98000000bf360000 0000000000000020 0000000000000006
[ 1556.620000] $ 8   : 0000000000000001 000000000000ffff 000000000000ffff 00000000764cf0d0
[ 1556.620000] $12   : 00000000140044e0 000000001000001f 0000000076474a76 ffffffffffffffff
[ 1556.620000] $16   : 0000000000000000 0000000000000000 0000000000000008 ffffffff80b61cc0
[ 1556.620000] $20   : 000000000000ffff ffffffff80adef58 0000000000000008 ffffffff80ba0000
[ 1556.620000] $24   : 00000000000004b0 0000000000000800
[ 1556.620000] $28   : 98000000bfef8000 98000000bfefbe00 98000000bf345080 ffffffff80279ae4
[ 1556.620000] Hi    : 0000000000000000
[ 1556.620000] Lo    : 0000000000000000
[ 1556.620000] epc   : 000000000000a400 0xa400
[ 1556.620000]     Not tainted
[ 1556.620000] ra    : ffffffff80279ae4 handle_irq_event_percpu+0x6c/0x220
[ 1556.620000] Status: 140044e2 KX SX UX KERNEL EXL
[ 1556.620000] Cause : 10008408
[ 1556.620000] BadVA : 000000000000a400
[ 1556.620000] PrId  : 00006303 (ICT Loongson-2)
[ 1556.620000] Modules linked in: ctr ccm netconsole configfs arc4 rtl8187 eeprom_93cx6 led_class mac80211 cfg80211 psmouse loongson2_cpufreq rfkill 8139too mii snd_cs5535audio snd_ac97_codec ac97_bus snd_pcm snd_timer snd soundcore ipv6
[ 1556.620000] Process X (pid: 4661, threadinfo=98000000bfef8000, task=98000000b86dba80, tls=00000000773e74a0)
[ 1556.620000] Stack : ffffffff80b61cc0 000000000000ffff 000000000000ffff 000000000000ffff
          000000000000ffff ffff00000000ffff 000000000000ffff 00000000102e4848
          0000000000000000 ffffffff80279d04 000000000000ffff 000000000000ffff
          ffffffff80b61cc0 ffffffff8027d278 0000000000000000 ffffffff80279154
          0000000000000000 ffffffff80209310 0000000000000008 ffffffff80203f38
          0000000000000000 ffffffff80206f40 0000000000000000 ffffffffcfffffff
          00000000764ce992 0000000076474688 00000000764ce8d2 00000000764745c8
          00000000000000c6 0000000000000006 0000000000000001 000000000000ffff
          000000000000ffff 00000000764cf0d0 000000000000002e 00000000000000c8
          0000000076474a76 ffffffffffffffff 0000000000000000 000000000000ffff
          ...
[ 1556.620000] Call Trace:
[ 1556.620000] [<ffffffff80279d04>] handle_irq_event+0x6c/0xa8
[ 1556.620000] [<ffffffff8027d278>] handle_level_irq+0xb0/0x170
[ 1556.620000] [<ffffffff80279154>] generic_handle_irq+0x5c/0x80
[ 1556.620000] [<ffffffff80209310>] do_IRQ+0x18/0x28
[ 1556.620000] [<ffffffff80203f38>] mach_irq_dispatch+0x50/0x78
[ 1556.620000] [<ffffffff80206f40>] ret_from_irq+0x0/0x4
[ 1556.620000]
[ 1556.620000]
Code: (Bad address in epc)
[ 1556.620000]
[ 1556.624000] ---[ end trace 6928418bef65e208 ]---
[ 1556.624000] Kernel panic - not syncing: Fatal exception in interrupt

Comment 38 Tom Li 2014-08-21 15:59:30 UTC

It isn't a memory corruption issue, I think. I said the system hangs at:

438  if (result == Success)
439      result = (* client->requestVector[client->majorOp])(client);

in fact, it hangs at:

    upsidedown=0, bitplane=0, closure=0x0) at /var/tmp/portage/x11-base/xorg-server-1.12.4-r2/work/xorg-server-1.12.4/fb/fbcopy.c:79
79          fbGetDrawable(pDstDrawable, dst, dstStride, dstBpp, dstXoff, dstYoff);
81          while (nbox--) {
83              if (pm == FB_ALLONES && alu == GXcopy && !reverse && !upsidedown) {
86                       srcBpp, dstBpp, (pbox->x1 + dx + srcXoff),
85                      ((uint32_t *) src, (uint32_t *) dst, srcStride, dstStride,
87                       (pbox->y1 + dy + srcYoff), (pbox->x1 + dstXoff),
85                      ((uint32_t *) src, (uint32_t *) dst, srcStride, dstStride,
87                       (pbox->y1 + dy + srcYoff), (pbox->x1 + dstXoff),
85                      ((uint32_t *) src, (uint32_t *) dst, srcStride, dstStride,
88                       (pbox->y1 + dstYoff), (pbox->x2 - pbox->x1),
85                      ((uint32_t *) src, (uint32_t *) dst, srcStride, dstStride,
88                       (pbox->y1 + dstYoff), (pbox->x2 - pbox->x1),
85                      ((uint32_t *) src, (uint32_t *) dst, srcStride, dstStride,
88                       (pbox->y1 + dstYoff), (pbox->x2 - pbox->x1),
85                      ((uint32_t *) src, (uint32_t *) dst, srcStride, dstStride,
89                       (pbox->y2 - pbox->y1)))
85                      ((uint32_t *) src, (uint32_t *) dst, srcStride, dstStride,
89                       (pbox->y2 - pbox->y1)))
85                      ((uint32_t *) src, (uint32_t *) dst, srcStride, dstStride,

Code:

        if (pm == FB_ALLONES && alu == GXcopy && !reverse && !upsidedown) {
            if (!pixman_blt
                ((uint32_t *) src, (uint32_t *) dst, srcStride, dstStride,
                 srcBpp, dstBpp, (pbox->x1 + dx + srcXoff),
                 (pbox->y1 + dy + srcYoff), (pbox->x1 + dstXoff),
                 (pbox->y1 + dstYoff), (pbox->x2 - pbox->x1),
                 (pbox->y2 - pbox->y1)))
                goto fallback;
            else
                goto next;
        }

So I think there was something wrong with pixman.

I recompiled pixman with:

    USE="-loongson2f" emerge pixman

and all the problems went away.

Comment 39 GitLab Migration User 2018-08-10 20:46:46 UTC

-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/xorg/driver/xf86-video-siliconmotion/issues/2.
