Bug 105308 - X log ballooning in size with "drmmode_wait_vblank failed for scanout update" and "get vblank counter failed"
Summary: X log ballooning in size with "drmmode_wait_vblank failed for scanout update"...
Status: RESOLVED FIXED
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/AMDgpu (show other bugs)
Version: unspecified
Hardware: x86-64 (AMD64) Linux (All)
: medium normal
Assignee: Default DRI bug account
QA Contact:
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2018-03-01 11:25 UTC by Bob Ham
Modified: 2018-07-30 08:25 UTC (History)
0 users

See Also:
i915 platform:
i915 features:


Attachments
First 3900 lines of Xorg log (311.28 KB, text/plain)
2018-03-01 11:25 UTC, Bob Ham
no flags Details
Kernel log (95.08 KB, text/plain)
2018-03-01 11:26 UTC, Bob Ham
no flags Details
Output of xrandr --verbose (10.02 KB, text/plain)
2018-03-01 12:32 UTC, Bob Ham
no flags Details
Kernel output while setting /sys/module/drm/parameters/debug to 255 (126.63 KB, text/plain)
2018-03-01 16:10 UTC, Bob Ham
no flags Details
Kernel output while setting /sys/module/drm/parameters/debug to 255 with monitor off (23.62 KB, text/plain)
2018-03-04 12:31 UTC, Bob Ham
no flags Details
Kernel output while setting /sys/module/drm/parameters/debug to 255 with monitor off while playing video (89.27 KB, text/plain)
2018-03-04 12:32 UTC, Bob Ham
no flags Details
Output of xrandr --verbose (2.50 KB, text/plain)
2018-03-04 12:36 UTC, Bob Ham
no flags Details
First 1000 lines of /var/log/Xorg.0.log (2018-03-04) (82.69 KB, text/plain)
2018-03-04 12:40 UTC, Bob Ham
no flags Details

Description Bob Ham 2018-03-01 11:25:33 UTC
Created attachment 137717 [details]
First 3900 lines of Xorg log

Hi,

My X log is on occassion filling up my hard disk, reaching a size of 18G with mostly repeats of two lines.  The X server is working and running fine; I'm typing this on it now.  The first indication that there is a problem is seeing "No space left on device".  It looks like the woes begin with this line:

[   668.860] (WW) AMDGPU(0): flip queue failed in amdgpu_scanout_flip: Invalid argument, TearFree inactive until next modeset

The kernel also reported this although I've no idea when in relation to other events:

[57903.995504] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* amdgpu_cs_list_validate(validated) failed.
[57903.995539] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to process the buffer list -22!
[58014.990950] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* amdgpu_cs_list_validate(validated) failed.
[58014.990998] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to process the buffer list -22!


rah@lotus:~/store-f$ ls -lh Xorg.0.log 
-rw-r--r-- 1 rah rah 18G Mar  1 09:52 Xorg.0.log
rah@lotus:~/store-f$ grep "drmmode_wait_vblank failed for scanout update" Xorg.0.log | head -n 1
[   668.860] (WW) AMDGPU(0): drmmode_wait_vblank failed for scanout update: Invalid argument
rah@lotus:~/store-f$ grep "get vblank counter failed" Xorg.0.log | head -n 1
[ 10749.146] (WW) AMDGPU(0): get vblank counter failed: Invalid argument
rah@lotus:~/store-f$ wc -l Xorg.0.log 
203212490 Xorg.0.log
rah@lotus:~/store-f$ egrep -vc "(drmmode_wait_vblank failed for scanout update|get vblank counter failed)" Xorg.0.log 
1509
rah@lotus:~/store-f$ grep -c "drmmode_wait_vblank failed for scanout update" Xorg.0.log 
200689290
rah@lotus:~/store-f$ grep -c "get vblank counter failed" Xorg.0.log 
2521692

I'll attach the first 3900 lines of the Xorg log and the kernel log.

Thanks,

Bob
Comment 1 Bob Ham 2018-03-01 11:26:03 UTC
Created attachment 137718 [details]
Kernel log
Comment 2 Michel Dänzer 2018-03-01 12:03:02 UTC
(In reply to Bob Ham from comment #0)
> 
> [   668.860] (WW) AMDGPU(0): flip queue failed in amdgpu_scanout_flip:
> Invalid argument, TearFree inactive until next modeset

Did anything in particular happen, or did you do anything in particular, around this time (about 11 minutes after system boot)?


> The kernel also reported this although I've no idea when in relation to
> other events:
> 
> [57903.995504] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR*
> amdgpu_cs_list_validate(validated) failed.
> [57903.995539] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to process the
> buffer list -22!

The timestamps should be comparable, so this happened much later and is probably not directly related. Looks like the kernel ran out of memory or some other resource while processing GPU commands.


Please attach the output of

 xrandr --verbose

while the problem is occurring.
Comment 3 Bob Ham 2018-03-01 12:32:57 UTC
Created attachment 137719 [details]
Output of xrandr --verbose

(In reply to Michel Dänzer from comment #2)
> (In reply to Bob Ham from comment #0)
> > 
> > [   668.860] (WW) AMDGPU(0): flip queue failed in amdgpu_scanout_flip:
> > Invalid argument, TearFree inactive until next modeset
> 
> Did anything in particular happen, or did you do anything in particular,
> around this time (about 11 minutes after system boot)?

Not that I recall, no.  FYI, current uptime is 2 days, 17:25.


> Please attach the output of
> 
>  xrandr --verbose
> 
> while the problem is occurring.

I'll attach the output as requested.  It's from the same X session but I ran "truncate /var/log/Xorg.0.log" and since then the file has been at 0 size so I can only assume the problem is still occurring.

Thanks,

Bob
Comment 4 Michel Dänzer 2018-03-01 15:57:50 UTC
While the problem is occurring, and the screen contents are being updated at least once per second, please run the following as root:

 echo 255 >/sys/module/drm/parameters/debug"; sleep 1; echo 0 >/sys/module/drm/parameters/debug

and attach the resulting dmesg output.
Comment 5 Bob Ham 2018-03-01 16:10:42 UTC
Created attachment 137723 [details]
Kernel output while setting /sys/module/drm/parameters/debug to 255

This was done while playing video on the display.
Comment 6 Michel Dänzer 2018-03-01 16:25:25 UTC
There's nothing in there that would correspond to the Xorg log messages. Please make sure the problem is actually occurring when you capture the dmesg debugging output.
Comment 7 Bob Ham 2018-03-04 12:30:45 UTC
(In reply to Michel Dänzer from comment #6)
> Please make sure the problem is actually occurring when you capture the
> dmesg debugging output.

I've discovered that the problem occurs when the monitor is switched off.  With just the monitor off, the "get vblank counter failed: Invalid argument" error appears.  With the monitor off while playing a video, both the aforementioned error and "drmmode_wait_vblank failed for scanout update: Invalid argument" appears.  I'll attach two kernel logs as instructed, one for each condition.
Comment 8 Bob Ham 2018-03-04 12:31:27 UTC
Created attachment 137774 [details]
Kernel output while setting /sys/module/drm/parameters/debug to 255 with monitor off
Comment 9 Bob Ham 2018-03-04 12:32:09 UTC
Created attachment 137775 [details]
Kernel output while setting /sys/module/drm/parameters/debug to 255 with monitor off while playing video
Comment 10 Bob Ham 2018-03-04 12:36:00 UTC
Created attachment 137776 [details]
Output of xrandr --verbose

Output of xrandr --verbose while the error is really occurring
Comment 11 Bob Ham 2018-03-04 12:40:20 UTC
Created attachment 137777 [details]
First 1000 lines of /var/log/Xorg.0.log (2018-03-04)

Here is an updated version of the Xorg.0.log.  The "flip queue failed in amdgpu_scanout_flip" error is present again.
Comment 12 Michel Dänzer 2018-03-08 18:24:04 UTC
This might work better if you enable DC with

 amdgpu.dc=1

on the kernel command line.
Comment 13 Michel Dänzer 2018-03-09 17:29:49 UTC
https://patchwork.freedesktop.org/patch/209464/ fixes this for me with DC disabled.
Comment 14 Mariusz Mazur 2018-07-29 18:49:16 UTC
So, I'm not 100% sure yet, I need time to rebuild 4.14.28 with Michel's patch and run it for a bit, but at this point I'm about 90% certain that a patch in 4.14.29 introduces a regression, as described in bug 106529. And from reading the changelog and looking at the code, it does seem to me that this patch is the most likely culprit.

Short version of what's wrong: on a multi-monitor setup, if the primary display is on DP and this code toggles it off/on, the window manager doesn't take it well and you end up with windows all over the place. 

(I don't know if this is the same codepath as is used with amdgpu.dc=1 on more recent kernels, though I do now the same bug occurs on 4.16+ with dc=1.)
Comment 15 Michel Dänzer 2018-07-30 08:25:48 UTC
Resolving this report, as it should be fixed or at least much better with the kernel change referenced in comment 13.

Mariusz, please keep the discussion about your issue on bug 106529, or maybe file a new report for the non-DC case.


Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.