Bug 94515 - xorg-server 1.18.2 freeze desktop on i915
Summary: xorg-server 1.18.2 freeze desktop on i915
Status: RESOLVED NOTOURBUG
Alias: None
Product: xorg
Classification: Unclassified
Component: Driver/intel (show other bugs)
Version: unspecified
Hardware: x86-64 (AMD64) Linux (All)
: medium normal
Assignee: Chris Wilson
QA Contact: Intel GFX Bugs mailing list
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2016-03-12 19:28 UTC by FadeMind
Modified: 2016-09-16 22:10 UTC (History)
18 users (show)

See Also:
i915 platform:
i915 features:


Attachments
dmesg, journalctl, xorg.0.log (69.28 KB, application/x-compressed-tar)
2016-03-12 19:28 UTC, FadeMind
no flags Details
logs-debug_enabled.tar.gz (2.01 MB, application/x-compressed-tar)
2016-03-12 20:26 UTC, FadeMind
no flags Details
xserver 1.18.2 log (408.11 KB, text/plain)
2016-03-12 22:31 UTC, Tobias Klausmann
no flags Details
dmesg drm.debug=0x7 (500.01 KB, text/plain)
2016-03-12 22:33 UTC, Tobias Klausmann
no flags Details
xserver 1.18.2 log w 74cd4d0 (442.12 KB, text/plain)
2016-03-22 19:35 UTC, Tobias Klausmann
no flags Details

Description FadeMind 2016-03-12 19:28:36 UTC
Created attachment 122254 [details]
dmesg, journalctl, xorg.0.log

Reported here too: https://bugs.archlinux.org/task/48549

OS: Arch Linux x86_64
Kernel: 4.4.5-zen
xf86-video-intel 1:2.99.917+560+gd167280
Mesa: mesa 11.1.2-1

After installing lastest Xorg update 1.18.2 and reboot I have serious issue . After while whole GUI just freeze Mouse and Keyboard not repond at all. Only hard power off (via pushing power button 5 sec. works) shutdown laptop. 
I using linux-zen kernel 4.4.5 now.
In Xorg I have weird:

[ 232.965] (II) GetModeLine - scrn: 0 clock: 71800
[ 232.965] (II) GetModeLine - hdsp: 1366 hbeg: 1414 hend: 1446 httl: 1526

in lots of lines after launching google-chrome too.

After revert to 1.18.1 freeze issue not happened. GetModeLine msgs gone in Xorg log.

HW
inxi -G
Graphics: Card-1: Intel 3rd Gen Core processor Graphics Controller
Card-2: NVIDIA GF108M [GeForce GT 630M]
Display Server: X.Org 1.18.2 driver: intel Resolution: 1366x768@60.01hz
GLX Renderer: Mesa DRI Intel Ivybridge Mobile GLX Version: 3.0 Mesa 11.1.2

During writing this report issue not happened jet but will happened.

Regards

Tomasz

PS. I do not know if the selected category is correct.
Comment 1 FadeMind 2016-03-12 19:34:07 UTC
Comment on attachment 122254 [details]
dmesg, journalctl, xorg.0.log

Save as file: logs.tar.gz
Comment 2 Tobias Klausmann 2016-03-12 19:59:19 UTC
i can confirm this bug (sometimes right at startup), it always leads to the following backtrace (sorry for no line numbers):

#0  0x00007f1c47e8b487 in ioctl () at /lib64/libc.so.6
#1  0x00007f1c491cdbf8 in drmIoctl () at /usr/lib64/libdrm.so.2
#2  0x00007f1c43972fff in sna_present_queue_vblank () at /usr/lib64/xorg/modules/drivers/intel_drv.so
#3  0x000000000050ff4c in present_execute ()
#4  0x00007f1c43973161 in sna_present_vblank_handler () at /usr/lib64/xorg/modules/drivers/intel_drv.so
#5  0x00007f1c438e4cff in sna_mode_wakeup () at /usr/lib64/xorg/modules/drivers/intel_drv.so
#6  0x00007f1c438e7ff8 in sna_wakeup_handler () at /usr/lib64/xorg/modules/drivers/intel_drv.so
#7  0x000000000043db1a in WakeupHandler ()
#8  0x0000000000586ce7 in WaitForSomething ()
#9  0x0000000000438f5e in Dispatch ()
#10 0x000000000043cfe3 in dix_main ()
#11 0x00007f1c47dc9610 in __libc_start_main () at /lib64/libc.so.6
#12 0x0000000000428539 in _start ()

Hardware is an Notebook with intel (gen7/ivb)+ nvidia.
Comment 3 Chris Wilson 2016-03-12 20:09:26 UTC
(In reply to Tobias Klausmann from comment #2)
> i can confirm this bug (sometimes right at startup), it always leads to the
> following backtrace (sorry for no line numbers):
> 
> #0  0x00007f1c47e8b487 in ioctl () at /lib64/libc.so.6
> #1  0x00007f1c491cdbf8 in drmIoctl () at /usr/lib64/libdrm.so.2
> #2  0x00007f1c43972fff in sna_present_queue_vblank () at
> /usr/lib64/xorg/modules/drivers/intel_drv.so

That can't be the same bug as FadeMind isn't using DRI3/Present. Please file a separate bug, and here both drm.debug=0x7 dmesg and/or Xorg.0.log with xf86-video-intel compiled with --enable-debug=full will be vital.
Comment 4 Chris Wilson 2016-03-12 20:16:37 UTC
(In reply to FadeMind from comment #0)
> After installing lastest Xorg update 1.18.2 and reboot I have serious issue
> . After while whole GUI just freeze Mouse and Keyboard not repond at all.
> Only hard power off (via pushing power button 5 sec. works) shutdown laptop. 
> I using linux-zen kernel 4.4.5 now.
> In Xorg I have weird:
> 
> [ 232.965] (II) GetModeLine - scrn: 0 clock: 71800
> [ 232.965] (II) GetModeLine - hdsp: 1366 hbeg: 1414 hend: 1446 httl: 1526
> 
> in lots of lines after launching google-chrome too.

That's extremely suspect indeed. That message comes from Xext/vidmode.c which was recently rewritten. Maybe using x11trace with google-chrome would reveal the loop it gets stuck in, or gdb'ing the Xserver, or using perf. Bisecting the Xserver would also be useful, my guess would be 

commit f175cf45aebcdda53f3ae49c0eaf27da1f194e92
Author: Olivier Fourdan <ofourdan@redhat.com>
Date:   Wed Feb 10 09:34:34 2016 +0100

    vidmode: move to a separate library of its own

but hopefully it is a smaller commit.
Comment 5 FadeMind 2016-03-12 20:26:56 UTC
Created attachment 122257 [details]
logs-debug_enabled.tar.gz

Xorg file with extended msgs. 

I forgot: I using KDE Plasma 5 with enabled Rendering OpenGL 3.1 AND I tested two scenarios with DRI2 and DRI3. The same freeze happened.
Comment 6 Chris Wilson 2016-03-12 20:33:50 UTC
Ok, the GetModeLine spam is a separate issue, from the looks of the debug log it doesn't look to be related to any hw/ddx interaction.
Comment 7 Chris Wilson 2016-03-12 20:40:59 UTC
Logs look fairly normal up to the end. I presume the event has to yet to happen, could you keep capturing and attach the logs from after the freeze? Probably need to grab the logs remotely, or else disable filesystem caching (depending on hard you need to reboot).
Comment 8 FadeMind 2016-03-12 21:00:15 UTC
Mr Chris Wilson how to do this debug catch safety?
I using SSD. One partition with Ext4 standard options.

/               ext4            rw,relatime,data=ordered        0 1


Like I mentioned whole GUI and input devices are frozen when this issue happened. IMO nothing is saved except this what was saved before hangs.

I can trying boot from live usb AFTER this happened, so xorg file will be not overwritten.
Comment 9 FadeMind 2016-03-12 21:22:55 UTC
UPDATE

I have this at beginning error (loop?)

[   158.158] vblank_complete: 1 events complete
[   158.158] vblank_complete: pipe=0 tv=158.155330 msc=9449 (target=4294976744), event=10093 complete: MISS
[   158.158] sna_present_queue_vblank(pipe=0, event=10093, msc=4294976744)
[   158.158] sna_present_queue_vblank: appending event=10093 to vblank 4294976744 x 2

at the end of file I have:

174.213] sna_present_queue_vblank: appending event=10093 to vblank 4294976744 x 1399566
[   174.213] vblank_complete: pipe=0 tv=158.155330 msc=9449 (target=4294976744), event=10093 complete: MISS
[   174.213] sna_present_queue_vblank(pipe=0, event=10093, msc=4294976744)
[   174.213] sna_present_queue_vblank: appending event=10093 to vblank 4294976744 x 1399567
[   174.213

Xorg log have 470 MB (with debug flags enabled in xorg intel driver)

HERE it is: https://www.dropbox.com/s/25lzliobalk7861/error_94515.tar.xz?dl=1
SHA-1: a587d8329196eca2a2c0694176e22aa185471735
Comment 10 Tobias Klausmann 2016-03-12 21:25:43 UTC
(In reply to FadeMind from comment #9)
> UPDATE
> 
> I have this at beginning error (loop?)
> 
> [   158.158] vblank_complete: 1 events complete
> [   158.158] vblank_complete: pipe=0 tv=158.155330 msc=9449
> (target=4294976744), event=10093 complete: MISS
> [   158.158] sna_present_queue_vblank(pipe=0, event=10093, msc=4294976744)
> [   158.158] sna_present_queue_vblank: appending event=10093 to vblank
> 4294976744 x 2
> 
> at the end of file I have:
> 
> 174.213] sna_present_queue_vblank: appending event=10093 to vblank
> 4294976744 x 1399566
> [   174.213] vblank_complete: pipe=0 tv=158.155330 msc=9449
> (target=4294976744), event=10093 complete: MISS
> [   174.213] sna_present_queue_vblank(pipe=0, event=10093, msc=4294976744)
> [   174.213] sna_present_queue_vblank: appending event=10093 to vblank
> 4294976744 x 1399567
> [   174.213
> 
> Xorg log have 470 MB (with debug flags enabled in xorg intel driver)
> 
> HERE it is: https://www.dropbox.com/s/25lzliobalk7861/error_94515.tar.xz?dl=1
> SHA-1: a587d8329196eca2a2c0694176e22aa185471735

Hmm that is the same as in my backtrace. Maybe (just to make sure): Can you ssh into your maschine and debug the hanging xserver with gdb?
Comment 11 FadeMind 2016-03-12 21:27:26 UTC
Sorry, but I cannot do SSH connect here.
Comment 12 Tobias Klausmann 2016-03-12 22:31:38 UTC
Created attachment 122258 [details]
xserver 1.18.2 log
Comment 13 Tobias Klausmann 2016-03-12 22:33:51 UTC
Created attachment 122259 [details]
dmesg drm.debug=0x7

Despite your request to open another bug, following the information from the xorg log, i think this is indeed the same bug. The above posted backtrace is still valid.
Comment 14 Sait Umar 2016-03-13 00:39:11 UTC
I filed a bug with Fedora earlier on this problem.

Using Fedora 23 with xorg-x11-xserver 1.18.2, KDE plasma 5.19 on a Lenovo
Thinkpad 450s. The mouse/keyboard/display freezes for periods of 20-40 seconds
but if you wait long enough it starts until the next freeze. I get messages:
===========================
[  1120.069] (II) SYN_DROPPED event from "TPPS/2 IBM TrackPoint" - some input events have been lost.
[  11[  1120.069] (II) SYN_DROPPED event from "TPPS/2 IBM TrackPoint" - some input events have been lost.
[  1120.070] (EE) libinput bug: timer offset negative
[  1120.070] (EE) libinput bug: timer offset negative
[  1120.070] (EE) libinput bug: timer offset negative
[  1120.070] (EE) libinput bug: timer offset negative
[  1120.070] (EE) libinput bug: timer offset negative
[  1120.070] (EE) libinput bug: timer offset negative
[  1120.070] (EE) libinput bug: timer offset negative
[  1120.071] (EE) libinput bug: timer offset negative
[  1120.071] (II) SYN_DROPPED event from "SynPS/2 Synaptics TouchPad" - some input events have been lost.
============================

Yje problem I think actually exists with 1.18.1 as well but less evident. Here is
my analysis:
xorg-x11-server-1.18.1-3, EGL interface, libinput-1.2.1-3 NO PROBLEMS

xorg-x11-server-1.18.1-3, glx interface, libinput-1.2.1-3 OCCASIONAL FREEZE

xorg-x11-server-1.18.2-1, either EGL or glx interface, FREQUENT FREEZE
Comment 15 Sait Umar 2016-03-13 00:50:00 UTC
Tobias, are you using KDE plasma 5 as well?
Comment 16 Chris Wilson 2016-03-13 08:47:14 UTC
(In reply to Tobias Klausmann from comment #13)
> Created attachment 122259 [details]
> dmesg drm.debug=0x7
> 
> Despite your request to open another bug, following the information from the
> xorg log, i think this is indeed the same bug. The above posted backtrace is
> still valid.

Yes, but we have 2 bugs here! The first bug report does not involve Present, the latter backtrace and logs are a Present bug. Both will freeze the UI, but they have different underlying causes.
Comment 17 Leho Kraav (:macmaN :lkraav) 2016-03-13 11:48:31 UTC
I got the same GetModeLine flood starting on Gentoo + Chrome, after updating to 1.18.2. Haven't experienced freezes yet, but the upgrade may be too fresh and hasn't received much workload yet. Perhaps also configuration differences.

Is the GetModeLine flood merely cosmetic (in which case, any way to silence it?) or should I downgrade back to 1.18.1 for time being if actual work needs to be done?
Comment 18 Chris Wilson 2016-03-13 12:14:28 UTC
(In reply to Leho Kraav (:macmaN :lkraav) from comment #17)
> Is the GetModeLine flood merely cosmetic (in which case, any way to silence
> it?) or should I downgrade back to 1.18.1 for time being if actual work
> needs to be done?

It seems to be in response to a XVidMode requests from the client, so both log spam and excess X11 traffic. The debug log showed that it only the Xext/vidmode.c layer and wasn't causing extra work for the hardware (we were not continuing reprobing the hardware which is what I feared, it's both slow and any time we touch the hardware there be dragons). From my pov, it is cosmetic, but it may also be impacting system performance.
Comment 19 Leho Kraav (:macmaN :lkraav) 2016-03-13 12:19:03 UTC
(In reply to Chris Wilson from comment #18)
> 
> It seems to be in response to a XVidMode requests from the client, so both
> log spam and excess X11 traffic. The debug log showed that it only the
> Xext/vidmode.c layer and wasn't causing extra work for the hardware (we were
> not continuing reprobing the hardware which is what I feared, it's both slow
> and any time we touch the hardware there be dragons). From my pov, it is
> cosmetic, but it may also be impacting system performance.

Thanks for the breakdown. Should this thing be branched off into a separate bug? The original problem description also includes freezing and I for one don't have that (at least yet...).

Would be useful to have a drop-in patch attached here to eliminate the cosmetics perhaps. That'd enable further testing 1.18.2 for regressions and wouldn't force reverting back. This message flood is really bad.
Comment 20 Chris Wilson 2016-03-13 13:31:16 UTC
Could I please ask you to regnerate the debug Xorg.log with current xf86-video-intel.git? There are some details missing from the log files that are included in

commit 5489402ca1fe17d5b4d20e4dcdc6ac8e6e415b0b
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Sun Mar 13 13:28:42 2016 +0000

    sna/present: Add a modicum more debug info
    
    References: https://bugs.freedesktop.org/show_bug.cgi?id=94515
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Comment 21 Chris Wilson 2016-03-13 13:37:21 UTC
(In reply to Leho Kraav (:macmaN :lkraav) from comment #19)
> (In reply to Chris Wilson from comment #18)
> > 
> > It seems to be in response to a XVidMode requests from the client, so both
> > log spam and excess X11 traffic. The debug log showed that it only the
> > Xext/vidmode.c layer and wasn't causing extra work for the hardware (we were
> > not continuing reprobing the hardware which is what I feared, it's both slow
> > and any time we touch the hardware there be dragons). From my pov, it is
> > cosmetic, but it may also be impacting system performance.
> 
> Thanks for the breakdown. Should this thing be branched off into a separate
> bug? The original problem description also includes freezing and I for one
> don't have that (at least yet...).

Right, I think the GetModeLine spam is a separate issue from the freezing and separating that out into its own bug report should help prevent it being forgotten.
 
> Would be useful to have a drop-in patch attached here to eliminate the
> cosmetics perhaps. That'd enable further testing 1.18.2 for regressions and
> wouldn't force reverting back. This message flood is really bad.

Looking at the changes, it appears entirely cosmetic. That logging used to exist in 1.18.1 but was hidden behind a verbosity check, now it is just using a standard LogMessage():

diff --git a/Xext/vidmode.c b/Xext/vidmode.c
index 7c838f4..e7f9947 100644
--- a/Xext/vidmode.c
+++ b/Xext/vidmode.c
@@ -267,13 +267,13 @@ ProcVidModeGetModeLine(ClientPtr client)
     rep.vtotal = VidModeGetModeValue(mode, VIDMODE_V_TOTAL);
     rep.flags = VidModeGetModeValue(mode, VIDMODE_FLAGS);
 
-    LogMessage(X_INFO, "GetModeLine - scrn: %d clock: %ld\n",
-               stuff->screen, (unsigned long) rep.dotclock);
-    LogMessage(X_INFO, "GetModeLine - hdsp: %d hbeg: %d hend: %d httl: %d\n",
-               rep.hdisplay, rep.hsyncstart, rep.hsyncend, rep.htotal);
-    LogMessage(X_INFO, "              vdsp: %d vbeg: %d vend: %d vttl: %d flags: %ld\n",
-               rep.vdisplay, rep.vsyncstart, rep.vsyncend,
-               rep.vtotal, (unsigned long) rep.flags);
+    DebugF("GetModeLine - scrn: %d clock: %ld\n",
+          stuff->screen, (unsigned long) rep.dotclock);
+    DebugF("GetModeLine - hdsp: %d hbeg: %d hend: %d httl: %d\n",
+          rep.hdisplay, rep.hsyncstart, rep.hsyncend, rep.htotal);
+    DebugF("              vdsp: %d vbeg: %d vend: %d vttl: %d flags: %ld\n",
+          rep.vdisplay, rep.vsyncstart, rep.vsyncend,
+          rep.vtotal, (unsigned long) rep.flags);
 
     /*
      * Older servers sometimes had server privates that the VidMode

will quieten it up again. Though any client actually still using XVidMode should be fixed!
Comment 22 Carlos Jimenez 2016-03-17 18:37:43 UTC
So This bug has been fixed ? because i hit on it too 
on gentoo and archlinux .

Graphics:  Card: Intel Core Processor Integrated Graphics Controller
           Display Server: X.Org 1.18.2 driver: Intel
           Resolution: 1366x768@60.17hz
           GLX Renderer: Mesa DRI Intel Ironlake Mobile
           GLX Version: 2.1 Mesa 11.1.2

the system freezes a lot and lagging a lot.
With Sna + Dri3 + Mate Desktop.
Comment 23 Chris Wilson 2016-03-17 19:19:24 UTC
(In reply to Carlos Jimenez from comment #22)
> So This bug has been fixed ? because i hit on it too 
> on gentoo and archlinux .

No, still waiting on the debug logs.
Comment 24 Stefan Neufeind 2016-03-18 08:07:23 UTC
I experienced similar frequent hangs of maybe 15-20sec with a Lenovo ThinkPad X1 Carbon. Downgrading to xorg-x11-server-Xorg-1.18.1-3.fc23.x86_64 (Fedora, from 1.18.2-something) did the trick, it seems.
Comment 25 Leho Kraav (:macmaN :lkraav) 2016-03-18 09:58:04 UTC
It seems I also got my first hang on 1.18.2. Resumed from suspend, and i3lock password circle suddenly just hung on flashing that green bar on a password character entry. Soon after Caps-Lock started flashing. Magic-SysRq was unavailable after the hang.
Comment 26 Tobias Klausmann 2016-03-19 10:25:33 UTC
(In reply to Chris Wilson from comment #23)
> ...
> No, still waiting on the debug logs.

Hmm, aren't the ones in #12 and #13 enough? If something is missing, I'll gladly provide it...
Comment 27 Chris Wilson 2016-03-19 12:19:41 UTC
(In reply to Tobias Klausmann from comment #26)
> (In reply to Chris Wilson from comment #23)
> > ...
> > No, still waiting on the debug logs.
> 
> Hmm, aren't the ones in #12 and #13 enough? If something is missing, I'll
> gladly provide it...

No, they are missing some debug information from later builds which should help understand the issue.
Comment 28 Sait Umar 2016-03-20 13:37:31 UTC
A patch was introduced as a part of the below bug report that is working. I
also tested it on Fedora 23 and freezes are gone.
FYI

https://bugs.archlinux.org/task/48549
Comment 29 Leho Kraav (:macmaN :lkraav) 2016-03-20 13:42:13 UTC
(In reply to Sait Umar from comment #28)
> A patch was introduced as a part of the below bug report that is working. I
> also tested it on Fedora 23 and freezes are gone.
> FYI
> 
> https://bugs.archlinux.org/task/48549

I am unable to locate the source for the specific patch. What am I missing?
Comment 30 Sait Umar 2016-03-20 13:53:48 UTC
I posted it (with due credits) in another bug report:

https://bugzilla.redhat.com/show_bug.cgi?id=1317153
Comment 31 FadeMind 2016-03-20 14:55:09 UTC
(In reply to Leho Kraav (:macmaN :lkraav) from comment #29)
> (In reply to Sait Umar from comment #28)
> > A patch was introduced as a part of the below bug report that is working. I
> > also tested it on Fedora 23 and freezes are gone.
> > FYI
> > 
> > https://bugs.archlinux.org/task/48549
> 
> I am unable to locate the source for the specific patch. What am I missing?

FYI https://projects.archlinux.org/svntogit/packages.git/commit/trunk?h=packages/xorg-server&id=bb37fae55e67f14bc491ca6bcccde3907f73b21c
Comment 32 Lorenz 2016-03-21 15:56:42 UTC
I have the GetModeline logspam as well on Debian unstable (1.18.2). It also severely impacts UI performance, switching between windows or workspaces has a noticeable (~1s) delay as soon as I start Chrome or Spotify (both of which trigger the logspam). Unfortunately I can't downgrade to 1.18.1 as that had a different issue that caused Xwayland to segfault (resolved in 1.18.2) :)
Comment 33 Vadim Dyadkin 2016-03-22 16:02:30 UTC
I have the same issue, gentoo, xorg-server-1.18.2, kde-plasma-5.19 and VGA compatible controller: Intel Corporation Broadwell-U Integrated Graphics (i915). It always happens when I click on an icon in the kde tray, the Xorg process uses 100% of one of the cpus, everything hangs for 20-30 seconds. Downgrade to xorg-server-1.18.1 resolves the issue.
Comment 34 Tobias Klausmann 2016-03-22 19:35:28 UTC
Created attachment 122483 [details]
xserver 1.18.2 log w 74cd4d0

Another take on the desired log. Cut from an 1.8G sized file
Comment 35 Chris Wilson 2016-03-22 23:14:29 UTC
It looks like the Client sent PresentPixmap with a divisor=1<<32, or even target_msc of 1<<32.

I think

commit f656f6afa288c63968f45d6d32c96a4cf6213117
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Tue Mar 22 22:11:18 2016 +0000

    sna/present: Report BadValue if target_msc exceeds last known msc by 1<<32
    
    As we cannot queue a vblank further than 1<<32 into the future, nor can
    we set a Timer to expire that far into the future, creating such a
    vblank event is futile. Report BadValue and hope for the best.
    
    References: https://bugs.freedesktop.org/show_bug.cgi?id=94515
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>

should hide the immediate issue with the freeze / busy spin.

Can someone attach gdb and put a breakpoint on present_pixmap() line 953, just after present_queue_vblank() returns an error and print the locals? I am most interested in getting the arguments to the function so we can compare what the client requested with the computed vblank.
Comment 36 Vadim Dyadkin 2016-03-24 15:55:40 UTC
The problem seems to be gone after this patch:
https://cgit.freedesktop.org/xorg/driver/xf86-video-intel/commit/?id=f656f6afa288c63968f45d6d32c96a4cf6213117
Comment 37 Vadim Dyadkin 2016-03-24 16:19:18 UTC
Sorry, I was wrong, after an hour it appeared again.
Comment 38 Tobias Klausmann 2016-03-24 16:24:05 UTC
(In reply to Vadim Dyadkin from comment #36, #37)
> The problem seems to be gone after this patch:
> https://cgit.freedesktop.org/xorg/driver/xf86-video-intel/commit/
> ?id=f656f6afa288c63968f45d6d32c96a4cf6213117

> Sorry, I was wrong, after an hour it appeared again.

i can confirm it is still possible to hit the problem, but way more unlikely to happen after the commit.

Thanks for the emergency (half)fix!
Comment 39 Joakim Tjernlund 2016-03-24 16:36:22 UTC
(In reply to Chris Wilson from comment #35)
> It looks like the Client sent PresentPixmap with a divisor=1<<32, or even
> target_msc of 1<<32.
> 
> I think
> 
> commit f656f6afa288c63968f45d6d32c96a4cf6213117
> Author: Chris Wilson <chris@chris-wilson.co.uk>
> Date:   Tue Mar 22 22:11:18 2016 +0000
> 
>     sna/present: Report BadValue if target_msc exceeds last known msc by
> 1<<32
>     
>     As we cannot queue a vblank further than 1<<32 into the future, nor can
>     we set a Timer to expire that far into the future, creating such a
>     vblank event is futile. Report BadValue and hope for the best.
>     
>     References: https://bugs.freedesktop.org/show_bug.cgi?id=94515
>     Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> 
> should hide the immediate issue with the freeze / busy spin.
> 
> Can someone attach gdb and put a breakpoint on present_pixmap() line 953,
> just after present_queue_vblank() returns an error and print the locals? I
> am most interested in getting the arguments to the function so we can
> compare what the client requested with the computed vblank.

+if (warn_unless(msc - swap->msc < 1ull<<32))
Should that be <= ?

or possibly just cap the time to 1<<32 ?
Comment 40 Leho Kraav (:macmaN :lkraav) 2016-03-24 16:41:31 UTC
I got the freeze and I'm on UXA GLAMOR, not SNA. Are we sure this patch is related?
Comment 41 Chris Wilson 2016-03-24 16:49:10 UTC
(In reply to Joakim Tjernlund from comment #39)
> +if (warn_unless(msc - swap->msc < 1ull<<32))
> Should that be <= ?

No. if msc was current + 1<<32 (swap->msc should be the current msc give or take a vblank occurring between the start of PresentPixmap and the end of the function), then (uint32_t)msc == swap->msc and the scheduled event returns instantly. We then see that that the target msc hasn't occurred and requeue the event, which fires instantly again. And we get trapped into the loop.

> or possibly just cap the time to 1<<32 ?

Possibly, but given the choice of reporting "I can't support this value" and faking it, we should punt that decision up. It saves having to make a policy decision in the driver.
Comment 42 Chris Wilson 2016-03-24 16:53:38 UTC
(In reply to Leho Kraav (:macmaN :lkraav) from comment #40)
> I got the freeze and I'm on UXA GLAMOR, not SNA. Are we sure this patch is
> related?

Yes. The bug is the client is requesting a target_msc that wraps around onto the current msc. No backend handled that overflow and so all are susceptible. The patch adds a check for the overflow when casting from the 64bit target_msc into the 32bit hardware sequence number and reporting the error from the driver.

The bug as far as I can tell so far is that the requested divisor received from the Client is 1<<32 rather than 1. That suggests that the two CARD32 in the protocol have been swapped, but that seems unlikely...
Comment 43 Chris Wilson 2016-03-24 19:21:17 UTC
On a slight tangent, it seems the bogus MSC is not limited to current + 1<<32, but we can also get current + 1<<32-1, which at a guess would be the vblank before a flip for a missed divisor=1<<32.
Comment 44 Aleksey Rybalkin 2016-03-26 09:01:17 UTC
I had these freezes, on 1.18.1 and more frequently on 1.18.2, until I changed my cpu_freq governor from "ondemand" to "conservative". Have been running without freezes for days (tested with different versions of xf86-video-intel, e.g. with commits 0204e55b618af285eb31c6eecc3c57b54319a0f4 and 4e108afe55f6b13aa946d1252ff0b828929245b3 for the last couple of days). The freezes are either gone or maybe became much less frequent (with "conservative" governor I have encountered zero).

May be this will help finding the issue.
Comment 45 Sait Umar 2016-03-26 14:54:13 UTC
There is a new one line patch at the xorg/xserver freedesktop git that solved
the problem for me. The previous proposed patch had taken this section of code
out. This one seems to fix it instead. I have been using it for a few hours and
had no freezes.
Comment 47 Michel Dänzer 2016-03-28 07:51:55 UTC
(In reply to Sait Umar from comment #45)
> There is a new one line patch at the xorg/xserver freedesktop git that solved
> the problem for me.

Note that it would still be interesting to find out where the bogus MSC values are coming from, per the last paragraph of comment 35, in case there's a client side bug which needs to be addressed there.


Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.