Bug 93147

Summary: [regression bisected] Stuttering in games caused by commit 4dfd6486 "drm: Use vblank timestamps to guesstimate how many vblanks were missed"
Product: DRI Reporter: Dave Witbrodt <dawitbro>
Component: DRM/RadeonAssignee: Default DRI bug account <dri-devel>
Status: RESOLVED FIXED QA Contact:
Severity: minor    
Priority: medium CC: dawitbro, ernstp, mario.kleiner, ville.syrjala
Version: DRI git   
Hardware: x86-64 (AMD64)   
OS: Linux (All)   
Whiteboard:
i915 platform: i915 features:
Attachments:
Description Flags
dmesg from kernel built at "bad" commit
none
Xorg.0.log when running kernel built at "bad" commit
none
First proposed patch to fix this on radeon-kms
none
port fix to amdgpu
none
v2 of the radeon-kms fix: Slightly more efficient.
none
v2 of the amdgpu-kms fix: Slightly more efficient.
none
v3 for amdgpu none

Description Dave Witbrodt 2015-11-28 03:08:57 UTC
Created attachment 120187 [details]
dmesg from kernel built at "bad" commit

Building Linux from drm-next or drm-fixes gives me a kernel which causes what appears to be stuttering/hesitations in test applications (particularly severe in 'prboom-plus') which were not present in Linux 4.3.


    HARDWARE
    GPU:  Radeon HD 7850 (Pitcairn)
    CPU:  64-bit AMD FX-8320E Eight-Core Processor
chipset:  AMD 990FX
 mboard:  ASUS M5A99FX PRO R2.0

    SOFTWARE
xf86-video-ati:  7.6.99+ (git master at commit 10b7c3de, Nov 17 2015)
   xorg-server:  1.18.0+ (git master at commit 51984ddd, Nov 18 2015)
          mesa:  11.1.0+ (git master at commit d8c26969, Nov 20 2015)
        libdrm:  2.4.65+ (git master at commit c3deddd9, Nov 10 2015)
        kernel:  drm-fixes (at commit 2d591ab1, Nov 20 2015)
  distribution:  Debian unstable


Reproducing steps:

I apply future Radeon DRM and core DRM code to latest stable Linux kernels, and
I use several applications to test the results.  I use a less demanding game,
'prboom-plus', as a kind of canary; I use more demanding games like 'alien-arena' and 'torcs' as well.  All 3 of the games mentioned show a kind of stuttering or hesitation if using the current git HEAD of drm-next or drm-fixes, but 'prboom-plus' is affected to the point of being extremely annoying, almost unplayable.

To reproduce, boot Linux 4.3 and play a few moments of 'prboom-plus'; then boot
a kernel built from drm-next or drm-fixes, then play 'prboom-plus' again.  The
effects are quite pronounced and obvious.  (The effects on 'alien-arena' and
'torcs' are much more subtle, and might easily go unnoticed, though someone
looking for a difference should easily be able to notice.)  None of the games crash, but they are simply affected (to varying degrees) by a kind of stuttering lag caused by the commit identified below by bisection.


Additional info:

I have reported this against DRI/Radeon but problem commit is touching code in
'drivers/gpu/drm/drm_irq.c', which does not seem limited to affecting only Radeon.  (I am not a developer, so any comment I make about code can probably be ignored.)  I would appreciate if the developers reading this would consider whether I have reported the bug against the wrong product and/or component, and make changes as appropriate!

In 'xorg.conf' I usually have "SwapbuffersWait" disabled with a stanza like this:

    Section "Device"
      Identifier	"Configured Video Device"
      Driver		"radeon"
      Option		"DRI"              "3"
      Option		"SwapbuffersWait"  "off"
    EndSection

Changing the "SwapbuffersWait" setting has no effect on the symptoms of this bug, however.


Bisection:

[I test future Radeon code in advance by cherry picking from drm-next and drm-fixes changesets which touch Radeon DRM and core DRM.  Because of the likelihood of error on my part, I never report bugs on one of these Frankenstein kernels of my own making.  If I bisect a bug, I confirm that the same buggy behavior is present upstream before reporting it.]

I first noticed this problem in early October, when I had made a local branch from Linux 4.2.3 and cherry picked from DRM 4.3 and DRM 4.4 (up to cf648305).  I noticed the bug described earlier immediately.  I didn't have much time to troubleshoot, and assumed the error was my own, so I made a second local branch from 4.2.3 and only cherry-picked from DRM 4.3 (up to 30c64664, my previously working list of commits).

After Linux 4.3 was released, I tried again.  I made a local branch based on 4.3 and applied DRM 4.4 up to commit 5481c8fb.  The process went very smooth, but the buggy behavior of my test programs (particularly 'prboom-plus') was back again.  I decided to bisect my 4.3 + DRM 4.4 tree to identify where things were going wrong.

The HEAD of my tree was clearly "bad" so I had a place to start.  I built Linux 4.3 without any cherry picks, and it was "good".  After 6 more builds, 'git bisect' pointed at the following commit:

    commit 4dfd64862ff852df7b1198d667dda778715ee88f
    Author: Ville Syrjälä <ville.syrjala@linux.intel.com>
    Date:   Mon Sep 14 22:43:51 2015 +0300

        drm: Use vblank timestamps to guesstimate how many vblanks were missed

Since the problem could still have been only in my Frankenstein local tree, I built kernels from drm-next at commits 4dfd6486 and 4dfd6486^ (= 1b2eb710).  I found that 4dfd6486 showed the buggy behavior, while 1b2eb710 did not.


I did notice that a regression against this same commit was noticed and apparently fixed in this commit:

    commit fa4270d8e0257b4b76f11baa2866f4313d29aaf5
    Author: Ville Syrjälä <ville.syrjala@linux.intel.com>
    Date:   Wed Sep 30 19:21:34 2015 +0300

        drm: Don't zero vblank timestamps from the irq handler

        [...]    
        This fixes a regression from
        4dfd64862ff852df drm: Use vblank timestamps to guesstimate how many
        vblanks were missed

This commit was present in my local Frankenstein tree, as well as both drm-next
and drm-fixes, and (obviously) does not fix the bug I am reporting.  The problem
commit no doubt had other effects that were fixed by fa4270d8, but the stuttering I am seeing was not fixed by it.
Comment 1 Dave Witbrodt 2015-11-28 03:10:56 UTC
Created attachment 120188 [details]
Xorg.0.log when running kernel built at "bad" commit
Comment 2 Michel Dänzer 2015-11-28 06:43:13 UTC
Does running the affected apps with the environment variable LIBGL_DRI3_DISABLE=1 avoid the problem?
Comment 3 Dave Witbrodt 2015-11-28 13:11:01 UTC
(In reply to Michel Dänzer from comment #2)
> Does running the affected apps with the environment variable
> LIBGL_DRI3_DISABLE=1 avoid the problem?

Yes, symptoms are completely eliminated in prboom-plus when using that variable.

(I am currently using a kernel built from a local branch with fa4270d8 and 4dfd6486 reverted, but I have left my previous kernel installed.  I tested by rebooting to the previous kernel, and running prboom-plus with and w/o the variable set.)
Comment 4 Michel Dänzer 2015-11-30 03:52:44 UTC
I recently ran into this with armagetronad. I've now confirmed that reverting these changes fixes that as well, thanks Dave for saving me the time to track this down!

Ville, any ideas what's going on / how to fix it? If not, please revert these changes.
Comment 5 Ville Syrjala 2015-11-30 20:12:20 UTC
(In reply to Michel Dänzer from comment #4)
> I recently ran into this with armagetronad. I've now confirmed that
> reverting these changes fixes that as well, thanks Dave for saving me the
> time to track this down!
> 
> Ville, any ideas what's going on / how to fix it? If not, please revert
> these changes.

I'm guessing this is the stuff Mario was trying to fix. Oh FFS, apparently he didn't cc dri-devel so the entire discussion has all been in private :( Though you were cc:d on most of it I think. Maybe I can bounce the entire thing onto the list...
Comment 6 Michel Dänzer 2015-12-01 02:56:02 UTC
(In reply to Ville Syrjala from comment #5)
> I'm guessing this is the stuff Mario was trying to fix.

You're right, the patch from http://lists.freedesktop.org/archives/dri-devel/2015-November/095679.html fixes it for me.
Comment 7 Michel Dänzer 2015-12-01 02:57:46 UTC
Mario, can you submit your fix(es) for inclusion?
Comment 8 Dave Witbrodt 2015-12-01 12:55:52 UTC
(In reply to Michel Dänzer from comment #6)
> (In reply to Ville Syrjala from comment #5)
> > I'm guessing this is the stuff Mario was trying to fix.
> 
> You're right, the patch from
> http://lists.freedesktop.org/archives/dri-devel/2015-November/095679.html
> fixes it for me.

Applying the patch in question to my local 4.3 + DRM 4.4 kernel (with no reverts) makes the symptoms go away for me also.
Comment 9 Mario Kleiner 2015-12-01 13:26:55 UTC
(In reply to Michel Dänzer from comment #7)
> Mario, can you submit your fix(es) for inclusion?

Still on it. The proper patch for radeon-kms will hopefully be finished and tested within a day or so. But ideally i'd need some feedback on the line buffer sizes/partitions/watermarks - see my last e-mail on that. Otherwise i'll have to use very conservative guesstimates for the fudge constant in the current patch to be on the safe side.

amdgpu-kms will have the same problem and solution, but i haven't looked into how to hook that up, with the difference in how the display engine code is hooked up.

thanks,
-mario
Comment 10 Mario Kleiner 2015-12-03 15:02:23 UTC
Created attachment 120301 [details] [review]
First proposed patch to fix this on radeon-kms

Ok, the attached patch works on my test systems with my timing tests and hardware timing measurement equipment and i think it hopefully avoids the races Ville and me could come up with so far.

One thing which might make sense for improved efficiency is to replace the udelay(5) in the radeon_flip_work_func() with a usleep_range() that avoids polling and sleeps the right minimum amount of time. I'll try to try that today.

For the line buffer sizes for < DCE4 i just guessed values which are hopefully big enough to cover earlier asics. I think too big is not a big problem for correctness, but potentially for performance under higher graphics load.
Comment 11 Alex Deucher 2015-12-03 17:35:33 UTC
Created attachment 120310 [details] [review]
port fix to amdgpu
Comment 12 Dave Witbrodt 2015-12-04 05:00:33 UTC
(In reply to Mario Kleiner from comment #10)
> Created attachment 120301 [details] [review] [review]
> First proposed patch to fix this on radeon-kms

Works good on my hardware.  Thanks.

Will test any proposed fixes that follow, as well.
Comment 13 Mario Kleiner 2015-12-04 15:10:02 UTC
Created attachment 120350 [details] [review]
v2 of the radeon-kms fix: Slightly more efficient.

Same as original, but busy wait in flip_work_func replaced by hr-timer sleep with event lock released, for more efficiency. Ditto for the following amdgpu fix v2.
Comment 14 Mario Kleiner 2015-12-04 15:11:32 UTC
Created attachment 120351 [details] [review]
v2 of the amdgpu-kms fix: Slightly more efficient.

Like v2 of radeon-kms. Alex original port is R-b'ed by me.
Comment 15 Mario Kleiner 2015-12-04 15:13:40 UTC
Thanks Dave for testing the v1 of the patch.

I think i'm done with these, v2 should be as good as it gets from my side.
-mario
Comment 16 Ernst Sjöstrand 2015-12-04 17:52:41 UTC
I tried the v2 amdgpu patch on top of agd5f/amdgpu-powerplay and my screens never turn on with it.
Comment 17 Alex Deucher 2015-12-04 20:16:25 UTC
Created attachment 120356 [details] [review]
v3 for amdgpu

Fix a crash in the crtc disabled case.
Comment 18 Dave Witbrodt 2015-12-04 20:44:21 UTC
(In reply to Mario Kleiner from comment #13)
> Created attachment 120350 [details] [review] [review]
> v2 of the radeon-kms fix: Slightly more efficient.

Works very well on my hardware.  Thanks for "Tested By:"

Will wait to see if agd5f picks this up (then airlied), or if others will be commenting/changing it first.

DW
Comment 19 Ernst Sjöstrand 2015-12-05 01:04:26 UTC
Tested the amdgpu v3 patch on top of agd5f/amdgpu-powerplay, seems to work fine.
Comment 20 Dave Witbrodt 2015-12-06 02:29:50 UTC
Just brought my local tree up to date with drm-fixes.  The patch from comment 13 is now present there (and has been picked up by Linus), and updating with this patch and the others relevant to my system produces a nicely working kernel.  Thanks for getting this fixed!

I'm not sure what the protocol is for closing this bug.  Usually if a bug is in a released kernel, the bug is not closed until the patch appears in a released kernel.  This bug never appeared in a released kernel, though, so it seems like it can be closed immediately?
Comment 21 Ernst Sjöstrand 2015-12-18 08:16:37 UTC
Don't forget to merge the AMDGPU version also.

Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.