Bug 109830 - CPU iowait higher when running multiple GPU decoding threads
Summary: CPU iowait higher when running multiple GPU decoding threads
Status: RESOLVED NOTOURBUG
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/Intel
Version: DRI git
Hardware: x86-64 (AMD64) Linux (All)
Importance: lowest trivial
Assignee: Intel GFX Bugs mailing list
QA Contact: Intel GFX Bugs mailing list
URL:
Whiteboard: Triaged, ReadyForDev
Keywords:
Depends on:
Blocks:
 
Reported: 2019-03-05 03:25 UTC by Owen Zhang
Modified: 2019-05-14 09:32 UTC
CC List: 3 users

See Also:
i915 platform: ALL
i915 features: GEM/execlists, GEM/Other


Attachments
iostat_in_4.19 (89.30 KB, text/plain)
2019-03-05 03:25 UTC, Owen Zhang
no flags Details
patch_bisect.log (2.60 KB, text/plain)
2019-03-05 03:26 UTC, Owen Zhang
no flags Details
reproducer (100.95 MB, application/x-gzip)
2019-03-05 03:38 UTC, Owen Zhang
no flags Details
increase_jobs (41.08 KB, text/plain)
2019-03-05 10:16 UTC, Owen Zhang
no flags Details
decrease_jobs (128.00 KB, text/plain)
2019-03-05 10:17 UTC, Owen Zhang
no flags Details

Description Owen Zhang 2019-03-05 03:25:46 UTC
Created attachment 143528 [details]
iostat_in_4.19

Hi,

Our customer reported a high CPU iowait issue when running multiple decoding threads. I reproduced this issue on kernel 4.19 and kernel 5.0 RC.

The logs (the full log is attached):
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           4.99    0.00    1.87   93.14    0.00    0.00

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00     0.00   71.00    0.00  8256.00     0.00   232.56     0.11    1.66    1.66    0.00   0.87   6.20
dm-0              0.00     0.00    6.00    0.00    60.00     0.00    20.00     0.00    0.17    0.17    0.00   0.17   0.10
dm-1              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-2              0.00     0.00   65.00    0.00  8196.00     0.00   252.18     0.11    1.75    1.75    0.00   0.94   6.10

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           5.25    0.00    1.38   91.88    0.00    1.50

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00     0.00   11.00    0.00    92.00     0.00    16.73     0.00    0.18    0.18    0.00   0.09   0.10
dm-0              0.00     0.00   11.00    0.00    92.00     0.00    16.73     0.00    0.09    0.09    0.00   0.09   0.10
dm-1              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-2              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00

I also tested drm-tip (last commit: 3dd976663b985375368c6229903a1c47971c2a71); the issue is somewhat less severe than on kernel 4.19/5.0, with the CPU iowait reduced to below ~80%. So I bisected on drm-tip and found that the following patch resolves it.

"
drm/i915: Replace global breadcrumbs with per-context interrupt tracking

    A few years ago, see commit 688e6c725816 ("drm/i915: Slaughter the
    thundering i915_wait_request herd"), the issue of handling multiple
    clients waiting in parallel was brought to our attention. The
    requirement was that every client should be woken immediately upon its
    request being signaled, without incurring any cpu overhead.

    To handle certain fragility of our hw meant that we could not do a
    simple check inside the irq handler (some generations required almost
    unbounded delays before we could be sure of seqno coherency) and so
    request completion checking required delegation.

    Before commit 688e6c725816, the solution was simple. Every client
    waiting on a request would be woken on every interrupt and each would do
    a heavyweight check to see if their request was complete. Commit
    688e6c725816 introduced an rbtree so that only the earliest waiter on
    the global timeline would be woken, and would wake the next and so on.
    (Along with various complications to handle requests being reordered
    along the global timeline, and also a requirement for kthread to provide
    a delegate for fence signaling that had no process context.)

    The global rbtree depends on knowing the execution timeline (and global
    seqno). Without knowing that order, we must instead check all contexts
    queued to the HW to see which may have advanced. We trim that list by
    only checking queued contexts that are being waited on, but still we
    keep a list of all active contexts and their active signalers that we
    inspect from inside the irq handler. By moving the waiters onto the fence
    signal list, we can combine the client wakeup with the dma_fence
    signaling (a dramatic reduction in complexity, but does require the HW
    being coherent, the seqno must be visible from the cpu before the
    interrupt is raised - we keep a timer backup just in case).

    Having previously fixed all the issues with irq-seqno serialisation (by
    inserting delays onto the GPU after each request instead of random delays
    on the CPU after each interrupt), we can rely on the seqno state to
    perform direct wakeups from the interrupt handler. This allows us to
    preserve our single context switch behaviour of the current routine,
    with the only downside that we lose the RT priority sorting of wakeups.
    In general, direct wakeup latency of multiple clients is about the same
    (about 10% better in most cases) with a reduction in total CPU time spent
    in the waiter (about 20-50% depending on gen). Average herd behaviour is
    improved, but at the cost of not delegating wakeups on task_prio.

    v2: Capture fence signaling state for error state and add comments to
    warm even the most cold of hearts.
    v3: Check if the request is still active before busywaiting
    v4: Reduce the amount of pointer misdirection with list_for_each_safe
    and using a local i915_request variable inside the loops
    v5: Add a missing pluralisation to a purely informative selftest message.

    Change-Id: Ibf5251cc874f4ee338266641df072cfa1012faae
    References: 688e6c725816 ("drm/i915: Slaughter the thundering i915_wait_request herd")
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
    Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
"

The bisect log is attached.

I tried to backport this patch to kernel 4.19, but the patch is big; I need your help and suggestions to get it right.

The reproduction steps:
1) Build this stack: https://software.intel.com/en-us/articles/build-and-debug-open-source-media-stack
2) Run ./decode_video_avc.sh
3) Meanwhile, use top or iostat to check the CPU iowait value.
Comment 1 Owen Zhang 2019-03-05 03:26:26 UTC
Created attachment 143529 [details]
patch_bisect.log
Comment 2 Owen Zhang 2019-03-05 03:38:31 UTC
Created attachment 143530 [details]
reproducer
Comment 3 Chris Wilson 2019-03-05 08:58:38 UTC
iowait is an indication of userspace badness, that userspace is sleeping for the gpu. That it goes down is quite possibly a regression... But the only numbers that matter here are GPU throughput; did the rate of decoding jobs increase or decrease (jobs per second)? Did the energy consumption change?
Comment 4 Owen Zhang 2019-03-05 10:16:53 UTC
Created attachment 143532 [details]
increase_jobs
Comment 5 Owen Zhang 2019-03-05 10:17:17 UTC
Created attachment 143533 [details]
decrease_jobs
Comment 6 Owen Zhang 2019-03-05 10:19:11 UTC
(In reply to Chris Wilson from comment #3)
> iowait is an indication of userspace badness, that userspace is sleeping for
> the gpu. That it goes down is quite possibly a regression... But the only
> numbers that matter here are GPU throughput; did the rate of decoding jobs
> increase or decrease (jobs per second)? Did the energy consumption change?

I tried increasing and decreasing the number of jobs, and the iowait increases and decreases correspondingly. Thanks very much.
Comment 7 Tvrtko Ursulin 2019-03-06 14:04:31 UTC
Is there a difference in actual decoding performance?
Comment 8 Owen Zhang 2019-03-06 14:32:11 UTC
(In reply to Tvrtko Ursulin from comment #7)
> Is there a difference in actual decoding performance?

from our tests, there haven't obvious performance drop.
Comment 9 Tvrtko Ursulin 2019-03-07 09:14:33 UTC
Why is the change in reported iowait a concern then? We could replace io_schedule_timeout with schedule_timeout and make it even lower, but we chose to use io waits to reflect that a process is waiting on a device.
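
For reference, the only difference between the two calls is the iowait accounting around the sleep; a simplified paraphrase of the kernel's io_schedule_timeout() (for illustration only, details vary slightly between kernel versions) looks roughly like:

/* Simplified paraphrase of kernel/sched/core.c:
 * io_schedule_timeout() is schedule_timeout() plus the in_iowait flag,
 * which is what /proc/stat samples to report the %iowait figure.
 * No extra work is done while sleeping. */
long io_schedule_timeout(long timeout)
{
	int token;
	long ret;

	token = io_schedule_prepare();		/* sets current->in_iowait = 1 */
	ret = schedule_timeout(timeout);	/* the actual sleep */
	io_schedule_finish(token);		/* restores the previous in_iowait value */

	return ret;
}

So when a waiter sleeps here instead of in plain schedule_timeout(), the idle time it causes is reported as iowait rather than idle; the CPU is not doing any more work either way.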
Comment 10 Owen Zhang 2019-03-08 07:05:41 UTC
(In reply to Tvrtko Ursulin from comment #9)
> Why is the change in reported iowait a concern then? We could replace
> io_schedule_timeout with schedule_timeout and make it even lower, but we
> chose to use io waits to reflect that a process is waiting on a device.

The CPU usage (including iowait) is part of the customer's acceptance criteria.
I tried changing io_schedule_timeout to schedule_timeout, but the CPU iowait stays at the same high value. Any other suggestions? Thanks very much in advance.

Our customer is waiting on a fix for this issue. Is it possible to raise the priority? Thanks very much.
Comment 11 Tvrtko Ursulin 2019-03-08 07:58:49 UTC
I downloaded the reproducer and tried running it but I get:

root@sc:~/bug109830/reproduce# ./decode_video_avc.sh 

[ERROR], sts=MFX_ERR_UNSUPPORTED(-3), Init, m_mfxSession.Init failed at /home/media/jenkins/workspace/MEDIA/DCG_MEDIA_DRIVER/PV5_BUILD_VPG_BASE/build_mss_pv5_external/build/WS/GENERIC/build_external/msdk/MediaSDK/samples/sample_decode/src/pipeline_decode.cpp:296

[ERROR], sts=MFX_ERR_UNSUPPORTED(-3), main, Pipeline.Init failed at /home/media/jenkins/workspace/MEDIA/DCG_MEDIA_DRIVER/PV5_BUILD_VPG_BASE/build_mss_pv5_external/build/WS/GENERIC/build_external/msdk/MediaSDK/samples/sample_decode/src/sample_decode.cpp:666

[ERROR], sts=MFX_ERR_UNSUPPORTED(-3), Init, m_mfxSession.Init failed at /home/media/jenkins/workspace/MEDIA/DCG_MEDIA_DRIVER/PV5_BUILD_VPG_BASE/build_mss_pv5_external/build/WS/GENERIC/build_external/msdk/MediaSDK/samples/sample_decode/src/pipeline_decode.cpp:296

... and so on ...
Comment 12 Tvrtko Ursulin 2019-03-08 08:06:30 UTC
What was iowait like before 4.19?
Comment 13 Tvrtko Ursulin 2019-03-08 08:11:34 UTC
Also, are you sure you changed io_schedule_timeout to schedule_timeout in i915_request_wait and deployed the correct kernel for testing?
Comment 14 Owen Zhang 2019-03-08 08:15:56 UTC
(In reply to Tvrtko Ursulin from comment #11)
> I downloaded the reproducer and tried running it but I get:
> 
> root@sc:~/bug109830/reproduce# ./decode_video_avc.sh 
> 
> [ERROR], sts=MFX_ERR_UNSUPPORTED(-3), Init, m_mfxSession.Init failed at
> /home/media/jenkins/workspace/MEDIA/DCG_MEDIA_DRIVER/PV5_BUILD_VPG_BASE/
> build_mss_pv5_external/build/WS/GENERIC/build_external/msdk/MediaSDK/samples/
> sample_decode/src/pipeline_decode.cpp:296
> 
> [ERROR], sts=MFX_ERR_UNSUPPORTED(-3), main, Pipeline.Init failed at
> /home/media/jenkins/workspace/MEDIA/DCG_MEDIA_DRIVER/PV5_BUILD_VPG_BASE/
> build_mss_pv5_external/build/WS/GENERIC/build_external/msdk/MediaSDK/samples/
> sample_decode/src/sample_decode.cpp:666
> 
> [ERROR], sts=MFX_ERR_UNSUPPORTED(-3), Init, m_mfxSession.Init failed at
> /home/media/jenkins/workspace/MEDIA/DCG_MEDIA_DRIVER/PV5_BUILD_VPG_BASE/
> build_mss_pv5_external/build/WS/GENERIC/build_external/msdk/MediaSDK/samples/
> sample_decode/src/pipeline_decode.cpp:296
> 
> ... and so on ...

can you try this cmd: vainfo?
Comment 15 Owen Zhang 2019-03-08 08:16:48 UTC
(In reply to Tvrtko Ursulin from comment #12)
> What was iowait like before 4.19?

we tested 4.4/4.9/4.10/4.12/4.14/4.19.

for 4.4, the iowait is good, others will have high value.
Comment 16 Owen Zhang 2019-03-08 08:17:32 UTC
(In reply to Tvrtko Ursulin from comment #13)
> Also, are you sure you changed io_schedule_timeout to schedule_timeout in
> i915_request_wait, deployed the correct kernel for testing?

Yes, the following are my changes:

diff --git a/drivers/gpu/drm/i915/i915_request.c b/drivers/gpu/drm/i915/i915_request.c
index 5c2c93c..21950c1 100644
--- a/drivers/gpu/drm/i915/i915_request.c
+++ b/drivers/gpu/drm/i915/i915_request.c
@@ -1330,7 +1330,7 @@ long i915_request_wait(struct i915_request *rq,
                        goto complete;
                }

-               timeout = io_schedule_timeout(timeout);
+               timeout = schedule_timeout(timeout);
        } while (1);

        GEM_BUG_ON(!intel_wait_has_seqno(&wait));
@@ -1363,7 +1363,7 @@ long i915_request_wait(struct i915_request *rq,
                        break;
                }

-               timeout = io_schedule_timeout(timeout);
+               timeout = schedule_timeout(timeout);

                if (intel_wait_complete(&wait) &&
                    intel_wait_check_request(&wait, rq))
[media@localhost linux-4.19]$
Comment 17 Tvrtko Ursulin 2019-03-08 08:25:43 UTC
What is the good iowait in percentage?

I don't have vaapi installed - it looks like I incorrectly assumed the reproducer is self-contained? Do I still need to build the stack from #1?

Can you try with drm-tip? With schedule_timeout I get iowait of 0% for simulated media pipelines. :)
Comment 18 Chris Wilson 2019-03-08 08:38:36 UTC
If you s/io_schedule_timeout/schedule_timeout/ inside i915.ko and the iowait remains... The iowait itself is nothing to do with us and we are just a victim; merely a symptom -- and in all likelihood that iowait goes down is indicative that _we_ regressed and are preventing the system from waiting for io elsewhere by keeping it busy instead.
Comment 19 Owen Zhang 2019-03-11 03:42:22 UTC
(In reply to Tvrtko Ursulin from comment #17)
> What is the good iowait in percentage?
> 
> I don't have vaapi installed - it looks I incorrectly assumed the reproducer
> is self contained? I still need to build the stack from #1?
> 
> Can you try with drm-tip? With schedule_timeout I get iowait of 0% for
> simulated media pipelines. :)

From our test on kernel 4.4, the iowait is around 60% on average for the whole media process. For kernel 4.19, the iowait is 90%+.

Our reproducer doesn't include the VA-API binaries, so is it possible to give me your IP address so I can remote in and check your environment? I will also upload the binaries later.

Yes, I tried drm-tip with schedule_timeout; the iowait reduces to around 60%. But I think Chris's patch can also fix this:
drm/i915: Replace global breadcrumbs with per-context interrupt tracking


BTW, you may need the following command to see the full picture while running the media pipelines:
iostat -x 1 120
Comment 20 Owen Zhang 2019-03-11 03:44:08 UTC
(In reply to Chris Wilson from comment #18)
> If you s/io_schedule_timeout/schedule_timeout/ inside i915.ko and the iowait
> remains... The iowait itself is nothing to do with us and we are just a
> victim; merely a symptom -- and in all likelihood that iowait goes down is
> indicative that _we_ regressed and are preventing the system from waiting
> for io elsewhere by keeping it busy instead.

From my test, changing io_schedule_timeout to schedule_timeout on drm-tip brings the iowait value down, but it doesn't work on Linux 4.19.
Comment 21 Tvrtko Ursulin 2019-03-11 09:03:56 UTC
Why can't it work for 4.19? Or do you mean it doesn't work?

What are the iowait numbers in drm-tip and 4.19 with the schedule_timeout hack?
Comment 22 Owen Zhang 2019-03-11 10:39:51 UTC
(In reply to Tvrtko Ursulin from comment #21)
> Why it can't work for 4.19? Or you mean doesn't work?
> 
> What are the iowait numbers in drm-tip and 4.19 with the schedule_timeout
> hack?

I changed io_schedule_timeout to schedule_timeout in kernel 4.19 and the iowait is still 90%+.

But on drm-tip, the iowait number is around 60% on average.

I will test on kernel 5.0 next.
Comment 23 Tvrtko Ursulin 2019-03-11 17:12:21 UTC
I built the media-driver but the reproducer still doesn't work for me:

root@sc:~/bug109830/reproduce# vainfo
error: can't connect to X server!
libva info: VA-API version 1.5.0
libva info: va_getDriverName() returns 0
libva info: User requested driver 'iHD'
libva info: Trying to open /usr/local/lib/dri//iHD_drv_video.so
libva info: Found init function __vaDriverInit_1_5
libva info: va_openDriver() returns 0
vainfo: VA-API version: 1.5 (libva 2.2.0)
vainfo: Driver version: Intel iHD driver - 1.0.0
vainfo: Supported profile and entrypoints
      VAProfileNone                   : VAEntrypointVideoProc
      VAProfileNone                   : VAEntrypointStats
      VAProfileMPEG2Simple            : VAEntrypointVLD
      VAProfileMPEG2Simple            : VAEntrypointEncSlice
      VAProfileMPEG2Main              : VAEntrypointVLD
      VAProfileMPEG2Main              : VAEntrypointEncSlice
      VAProfileH264Main               : VAEntrypointVLD
      VAProfileH264Main               : VAEntrypointEncSlice
      VAProfileH264Main               : VAEntrypointFEI
      VAProfileH264Main               : VAEntrypointEncSliceLP
      VAProfileH264High               : VAEntrypointVLD
      VAProfileH264High               : VAEntrypointEncSlice
      VAProfileH264High               : VAEntrypointFEI
      VAProfileH264High               : VAEntrypointEncSliceLP
      VAProfileVC1Simple              : VAEntrypointVLD
      VAProfileVC1Main                : VAEntrypointVLD
      VAProfileVC1Advanced            : VAEntrypointVLD
      VAProfileJPEGBaseline           : VAEntrypointVLD
      VAProfileJPEGBaseline           : VAEntrypointEncPicture
      VAProfileH264ConstrainedBaseline: VAEntrypointVLD
      VAProfileH264ConstrainedBaseline: VAEntrypointEncSlice
      VAProfileH264ConstrainedBaseline: VAEntrypointFEI
      VAProfileH264ConstrainedBaseline: VAEntrypointEncSliceLP
      VAProfileVP8Version0_3          : VAEntrypointVLD
      VAProfileHEVCMain               : VAEntrypointVLD
      VAProfileHEVCMain               : VAEntrypointEncSlice
      VAProfileHEVCMain               : VAEntrypointFEI
root@sc:~/bug109830/reproduce# ./decode_video_avc.sh 

[ERROR], sts=MFX_ERR_UNSUPPORTED(-3), Init, m_mfxSession.Init failed at /home/media/jenkins/workspace/MEDIA/DCG_MEDIA_DRIVER/PV5_BUILD_VPG_BASE/build_mss_pv5_external/build/WS/GENERIC/build_external/msdk/MediaSDK/samples/sample_decode/src/pipeline_decode.cpp:296

[ERROR], sts=MFX_ERR_UNSUPPORTED(-3), main, Pipeline.Init failed at /home/media/jenkins/workspace/MEDIA/DCG_MEDIA_DRIVER/PV5_BUILD_VPG_BASE/build_mss_pv5_external/build/WS/GENERIC/build_external/msdk/MediaSDK/samples/sample_decode/src/sample_decode.cpp:666
Comment 24 Owen Zhang 2019-03-12 05:57:25 UTC
(In reply to Tvrtko Ursulin from comment #23)
> I built the media-driver but the reproducer still doesn't work for me:
> 
> root@sc:~/bug109830/reproduce# vainfo
> error: can't connect to X server!
> libva info: VA-API version 1.5.0
> libva info: va_getDriverName() returns 0
> libva info: User requested driver 'iHD'
> libva info: Trying to open /usr/local/lib/dri//iHD_drv_video.so
> libva info: Found init function __vaDriverInit_1_5
> libva info: va_openDriver() returns 0
> vainfo: VA-API version: 1.5 (libva 2.2.0)
> vainfo: Driver version: Intel iHD driver - 1.0.0
> vainfo: Supported profile and entrypoints
>       VAProfileNone                   : VAEntrypointVideoProc
>       VAProfileNone                   : VAEntrypointStats
>       VAProfileMPEG2Simple            : VAEntrypointVLD
>       VAProfileMPEG2Simple            : VAEntrypointEncSlice
>       VAProfileMPEG2Main              : VAEntrypointVLD
>       VAProfileMPEG2Main              : VAEntrypointEncSlice
>       VAProfileH264Main               : VAEntrypointVLD
>       VAProfileH264Main               : VAEntrypointEncSlice
>       VAProfileH264Main               : VAEntrypointFEI
>       VAProfileH264Main               : VAEntrypointEncSliceLP
>       VAProfileH264High               : VAEntrypointVLD
>       VAProfileH264High               : VAEntrypointEncSlice
>       VAProfileH264High               : VAEntrypointFEI
>       VAProfileH264High               : VAEntrypointEncSliceLP
>       VAProfileVC1Simple              : VAEntrypointVLD
>       VAProfileVC1Main                : VAEntrypointVLD
>       VAProfileVC1Advanced            : VAEntrypointVLD
>       VAProfileJPEGBaseline           : VAEntrypointVLD
>       VAProfileJPEGBaseline           : VAEntrypointEncPicture
>       VAProfileH264ConstrainedBaseline: VAEntrypointVLD
>       VAProfileH264ConstrainedBaseline: VAEntrypointEncSlice
>       VAProfileH264ConstrainedBaseline: VAEntrypointFEI
>       VAProfileH264ConstrainedBaseline: VAEntrypointEncSliceLP
>       VAProfileVP8Version0_3          : VAEntrypointVLD
>       VAProfileHEVCMain               : VAEntrypointVLD
>       VAProfileHEVCMain               : VAEntrypointEncSlice
>       VAProfileHEVCMain               : VAEntrypointFEI
> root@sc:~/bug109830/reproduce# ./decode_video_avc.sh 
> 
> [ERROR], sts=MFX_ERR_UNSUPPORTED(-3), Init, m_mfxSession.Init failed at
> /home/media/jenkins/workspace/MEDIA/DCG_MEDIA_DRIVER/PV5_BUILD_VPG_BASE/
> build_mss_pv5_external/build/WS/GENERIC/build_external/msdk/MediaSDK/samples/
> sample_decode/src/pipeline_decode.cpp:296
> 
> [ERROR], sts=MFX_ERR_UNSUPPORTED(-3), main, Pipeline.Init failed at
> /home/media/jenkins/workspace/MEDIA/DCG_MEDIA_DRIVER/PV5_BUILD_VPG_BASE/
> build_mss_pv5_external/build/WS/GENERIC/build_external/msdk/MediaSDK/samples/
> sample_decode/src/sample_decode.cpp:666

Maybe our sample_decode binary doesn't match the one you built. Could you try the sample_decode binary from your build folder:
cd MediaSDK/__cmake/intel64.make.release/__bin/release/sample_decode?

Or is it possible to give me your IP address so I can check the environment remotely?

I will also offer a binary later. Thanks very much.
Comment 25 Tvrtko Ursulin 2019-03-12 06:35:59 UTC
I can't build MediaSDK since the instructions don't work for me. This step:

  perl tools/builder/build_mfx.pl --cmake=intel64.make.release

There is no tools/builder/build_mfx.pl in my checkout, neither in master nor at the SHA referenced as tested by the build script.

P.S. My test machine is not accessible from the outside.
Comment 26 Owen Zhang 2019-03-12 08:05:17 UTC
(In reply to Tvrtko Ursulin from comment #25)
> I can't build MediaSDK since the instruction don't work for me. This step:
> 
>   perl tools/builder/build_mfx.pl --cmake=intel64.make.release
> 
> There is no tools/builder/build_mfx.pl in my checkout. Neither master or the
> SHA referenced as tested by the build script.
> 
> P.S. My test machine is not accessible from the outside.

I have changed the reproduction steps; these should work on your side.

1.	Get the UMD build from github: https://github.com/Intel-Media-SDK/MediaSDK/releases/download/intel-mediasdk-18.4.1/MediaStack.tar.gz
2.	tar xzvf MediaStack.tar.gz
3.	cd MediaStack && sudo ./install_media.sh   # Install UMD
4.	Reboot
5.	tar xzvf reproduces.tar.gz
6.	cd reproduces
7.	./make_clips.sh  # Generate multiple clips for testing
8.	cp /opt/intel/mediasdk/share/mfx/samples/_bin/x64/sample_decode .   # Use the sample_decode from MediaStack
9.	./decode_video_avc.sh
Comment 27 Owen Zhang 2019-03-19 01:59:43 UTC
Hi,

I have made the following changes on kernel 4.19:

1) Apply the following patch:
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-4.20.y&id=0a2fff2428f1e175932dc3cf115b68c868e3a839

2) Change io_schedule_timeout to schedule_timeout:
+++ b/drivers/gpu/drm/i915/i915_request.c
@@ -1322,7 +1322,7 @@ long i915_request_wait(struct i915_request *rq,
                        break;
                }

-               timeout = io_schedule_timeout(timeout);
+               timeout = schedule_timeout(timeout);
        }


The iowait value is reduced to below 80%. So is it possible to provide a workaround (W/R) patch for temporary customer deployment before the formal fix is released?
Any suggestions? Thanks very much.
Comment 28 Chris Wilson 2019-03-19 08:51:58 UTC
There is no reason to make any change upstream.

Performance improvements even for bogus metrics, especially for bogus metrics, do not get backported as they have a much greater risk of regression than their benefit.

If it was a regression in real-world performance, then it would be a bug. By your own statement it is not a regression to previous kernels, and the performance has not changed.
Comment 29 Owen Zhang 2019-03-19 08:57:02 UTC
(In reply to Chris Wilson from comment #28)
> There is no reason to make any change upstream.
> 
> Performance improvements even for bogus metrics, especially for bogus
> metrics, do not get backported as they have a much greater risk of
> regression than their benefit.
> 
> If it was a regression in real-world performance, then it would be a bug. By
> your own statement it is not a regression to previous kernels, and the
> performance has not changed.

Yes, the performance (fps) hasn't changed at all.
But the CPU usage (iowait) is much higher. In our customer's view, the CPU usage is also an important performance reference point. Thanks very much.
Comment 30 Tvrtko Ursulin 2019-03-19 09:49:12 UTC
(In reply to Owen Zhang from comment #27)
> Hi,
> 
> i have made the following changes on kernel 4.19:
> 
> 1) apply the following patch:
> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/
> ?h=linux-4.20.y&id=0a2fff2428f1e175932dc3cf115b68c868e3a839

This patch makes a difference on its own? How much?

If it does, I think you have to test by ensuring input data is cached before the timing run, and dump the output to /dev/null.

Otherwise you are accounting for the changes in kernel's block and filesystem subsystems as well.
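
A minimal sketch of pre-caching an input clip before the timing run (the clip name is a hypothetical placeholder for the files generated by make_clips.sh):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	const char *clip = "clip_000.h264";	/* hypothetical clip name */
	char buf[1 << 16];
	int fd = open(clip, O_RDONLY);

	if (fd < 0) {
		perror("open");
		return 1;
	}

	/* Ask the kernel to read the whole file into the page cache ... */
	posix_fadvise(fd, 0, 0, POSIX_FADV_WILLNEED);

	/* ... and touch it once so the data is resident before the test run. */
	while (read(fd, buf, sizeof(buf)) > 0)
		;

	close(fd);
	return 0;
}

With the input cached and the decoder output sent to /dev/null, neither reads nor writes hit the block device during the measured run.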

P.S. I haven't had the time yet to try the repro using the latest instructions.
Comment 31 Owen Zhang 2019-03-20 04:27:39 UTC
(In reply to Tvrtko Ursulin from comment #30)
> (In reply to Owen Zhang from comment #27)
> > Hi,
> > 
> > i have made the following changes on kernel 4.19:
> > 
> > 1) apply the following patch:
> > https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/
> > ?h=linux-4.20.y&id=0a2fff2428f1e175932dc3cf115b68c868e3a839
> 
> This patch makes a difference on it's own? How much?
> 
> If it does, I think you have to test by ensuring input data is cached before
> the timing run, and dump the output to /dev/null.
> 
> Otherwise you are accounting for the changes in kernel's block and
> filesystem subsystems as well.
> 
> P.S. I haven't had the time yet to try the repro using the latest
> instructions.

Applying only this patch makes no difference on its own; it also needs the io_schedule_timeout to schedule_timeout change.

With both changes applied, the CPU iowait number goes down.
Comment 32 Tvrtko Ursulin 2019-03-20 07:25:30 UTC
(In reply to Owen Zhang from comment #31)
> (In reply to Tvrtko Ursulin from comment #30)
> > (In reply to Owen Zhang from comment #27)
> > > Hi,
> > > 
> > > i have made the following changes on kernel 4.19:
> > > 
> > > 1) apply the following patch:
> > > https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/
> > > ?h=linux-4.20.y&id=0a2fff2428f1e175932dc3cf115b68c868e3a839
> > 
> > This patch makes a difference on it's own? How much?
> > 
> > If it does, I think you have to test by ensuring input data is cached before
> > the timing run, and dump the output to /dev/null.
> > 
> > Otherwise you are accounting for the changes in kernel's block and
> > filesystem subsystems as well.
> > 
> > P.S. I haven't had the time yet to try the repro using the latest
> > instructions.
> 
> only apply this patch, there hasn't any difference. it also need to change
> the io_schedule_timeout to schedule_timeout. 
> 
> both apply these two changes, the cpu iowait number will be down.

So only the two changes in conjunction work? Is the test system using device mapper?

Can you try pre-caching, a different block device and output to /dev/null? (Without either patch applied.)
Comment 33 Owen Zhang 2019-03-25 14:53:20 UTC
(In reply to Tvrtko Ursulin from comment #32)
> (In reply to Owen Zhang from comment #31)
> > (In reply to Tvrtko Ursulin from comment #30)
> > > (In reply to Owen Zhang from comment #27)
> > > > Hi,
> > > > 
> > > > i have made the following changes on kernel 4.19:
> > > > 
> > > > 1) apply the following patch:
> > > > https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/
> > > > ?h=linux-4.20.y&id=0a2fff2428f1e175932dc3cf115b68c868e3a839
> > > 
> > > This patch makes a difference on it's own? How much?
> > > 
> > > If it does, I think you have to test by ensuring input data is cached before
> > > the timing run, and dump the output to /dev/null.
> > > 
> > > Otherwise you are accounting for the changes in kernel's block and
> > > filesystem subsystems as well.
> > > 
> > > P.S. I haven't had the time yet to try the repro using the latest
> > > instructions.
> > 
> > only apply this patch, there hasn't any difference. it also need to change
> > the io_schedule_timeout to schedule_timeout. 
> > 
> > both apply these two changes, the cpu iowait number will be down.
> 
> So only two changes in conjunction work? Test system is using device mapper?
> 
> Can you try pre-caching, a different block device and output to /dev/null?
> (Without either patch applied.)

Yes, only the two changes together work, and we use device mapper.
We now pre-allocate the buffer and produce no output in our test case.
Comment 34 Tvrtko Ursulin 2019-03-25 15:56:51 UTC
And what were the results?
Comment 35 Owen Zhang 2019-03-27 05:05:01 UTC
(In reply to Tvrtko Ursulin from comment #34)
> And what were the results?

We are running the full test now and will update the results when it finishes. If you have any suggestions, please let me know. Thanks very much.
Comment 36 Owen Zhang 2019-03-28 05:56:52 UTC
(In reply to Tvrtko Ursulin from comment #34)
> And what were the results?

We finished the full test: no regression found and no performance drop after applying these two changes, and the iowait value is close to zero.
Any suggestions for these patches? Thanks very much.
Comment 37 Tvrtko Ursulin 2019-03-28 06:52:19 UTC
Ignoring these patches for now, but with pre-caching and output to /dev/null only, is the iowait still the same? I am still curious about that device mapper patch, since you said the i915 patch on its own is not enough to bring the iowait down.

Otherwise, a single test job from your scripts spends ~12% of its runtime in the GEM_WAIT ioctl, which is the source of the iowait.

When you run 34 of these jobs in parallel, the waits will naturally scale up due to the GPU being more congested.

In other words, userspace seems to be deliberately waiting on the GPU, and if you want to decrease the reported iowait you could also try making userspace more asynchronous.
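
To illustrate the pattern being described, a hedged sketch of the kind of synchronous wait that shows up as time in the GEM_WAIT ioctl (assuming an already-open DRM fd and a valid buffer-object handle; the header path depends on how libdrm is installed):

#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <drm/i915_drm.h>	/* may be <libdrm/i915_drm.h> on some installs */

/* Block until the GPU has finished with the given buffer object.
 * While the thread sleeps here, i915_request_wait() uses
 * io_schedule_timeout(), so the wait is accounted as iowait. */
static int wait_for_bo(int drm_fd, uint32_t bo_handle)
{
	struct drm_i915_gem_wait wait;

	memset(&wait, 0, sizeof(wait));
	wait.bo_handle = bo_handle;
	wait.timeout_ns = -1;	/* wait indefinitely */

	return ioctl(drm_fd, DRM_IOCTL_I915_GEM_WAIT, &wait);
}

A more asynchronous pipeline would queue further decode work, or poll with a zero timeout, instead of blocking on each buffer in turn.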
Comment 38 Owen Zhang 2019-03-29 02:41:14 UTC
(In reply to Tvrtko Ursulin from comment #37)
> Ignoring these patches for now, but with pre-caching and output to /dev/null
> only. The iowait is still the same? 

from our test, yes.  

and from QA test:
only device mapper patch, the iowait can't down.
only io_schedule_timeout to schedule_timeout, the iowait can't down.

both device mapper patch and schedule timeout changes, the iowait can down.
thanks very much.
Comment 39 Chris Wilson 2019-05-14 09:32:00 UTC
The discussion here has petered out, away from the io_schedule_timeout being the issue, and we are still debating the merits of the global iowait as being a valid statistic. (Note that customers probably want to watch blockstats instead if they want to track an overloaded block device.)
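
For completeness, a minimal sketch of watching per-device block statistics rather than the global iowait percentage (the device name sda is only an example; the counter layout follows the first 11 fields of the kernel's Documentation/block/stat):

#include <stdio.h>

int main(void)
{
	unsigned long long rd_ios, rd_merges, rd_sectors, rd_ticks;
	unsigned long long wr_ios, wr_merges, wr_sectors, wr_ticks;
	unsigned long long in_flight, io_ticks, time_in_queue;
	FILE *f = fopen("/sys/block/sda/stat", "r");	/* example device */

	if (!f) {
		perror("fopen");
		return 1;
	}

	if (fscanf(f, "%llu %llu %llu %llu %llu %llu %llu %llu %llu %llu %llu",
		   &rd_ios, &rd_merges, &rd_sectors, &rd_ticks,
		   &wr_ios, &wr_merges, &wr_sectors, &wr_ticks,
		   &in_flight, &io_ticks, &time_in_queue) != 11) {
		fclose(f);
		return 1;
	}
	fclose(f);

	/* io_ticks is the time (ms) the device had I/O in flight; sampled
	 * twice and divided by wall time it gives per-device utilisation. */
	printf("reads=%llu writes=%llu io_ticks_ms=%llu queue_time_ms=%llu\n",
	       rd_ios, wr_ios, io_ticks, time_in_queue);
	return 0;
}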

