Created attachment 143528 [details]
iostat_in_4.19

Hi,

Our customer reports a high CPU iowait issue when running multiple decoding threads. I reproduced this issue on kernel 4.19 and the 5.0 RC.

The logs (the full log is in the attachment):

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           4.99    0.00    1.87   93.14    0.00    0.00

Device:  rrqm/s wrqm/s   r/s   w/s    rkB/s  wkB/s avgrq-sz avgqu-sz  await r_await w_await  svctm  %util
sda        0.00   0.00 71.00  0.00  8256.00   0.00   232.56     0.11   1.66    1.66    0.00   0.87   6.20
dm-0       0.00   0.00  6.00  0.00    60.00   0.00    20.00     0.00   0.17    0.17    0.00   0.17   0.10
dm-1       0.00   0.00  0.00  0.00     0.00   0.00     0.00     0.00   0.00    0.00    0.00   0.00   0.00
dm-2       0.00   0.00 65.00  0.00  8196.00   0.00   252.18     0.11   1.75    1.75    0.00   0.94   6.10

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           5.25    0.00    1.38   91.88    0.00    1.50

Device:  rrqm/s wrqm/s   r/s   w/s    rkB/s  wkB/s avgrq-sz avgqu-sz  await r_await w_await  svctm  %util
sda        0.00   0.00 11.00  0.00    92.00   0.00    16.73     0.00   0.18    0.18    0.00   0.09   0.10
dm-0       0.00   0.00 11.00  0.00    92.00   0.00    16.73     0.00   0.09    0.09    0.00   0.09   0.10
dm-1       0.00   0.00  0.00  0.00     0.00   0.00     0.00     0.00   0.00    0.00    0.00   0.00   0.00
dm-2       0.00   0.00  0.00  0.00     0.00   0.00     0.00     0.00   0.00    0.00    0.00   0.00   0.00

I also tested drm-tip (last commit: 3dd976663b985375368c6229903a1c47971c2a71); the issue is somewhat less severe than on kernel 4.19/5.0, with CPU iowait reduced to below ~80%. So I bisected on drm-tip and found that the following patch is what helps:

"
drm/i915: Replace global breadcrumbs with per-context interrupt tracking

A few years ago, see commit 688e6c725816 ("drm/i915: Slaughter the thundering i915_wait_request herd"), the issue of handling multiple clients waiting in parallel was brought to our attention. The requirement was that every client should be woken immediately upon its request being signaled, without incurring any cpu overhead.

To handle certain fragility of our hw meant that we could not do a simple check inside the irq handler (some generations required almost unbounded delays before we could be sure of seqno coherency) and so request completion checking required delegation.

Before commit 688e6c725816, the solution was simple. Every client waiting on a request would be woken on every interrupt and each would do a heavyweight check to see if their request was complete. Commit 688e6c725816 introduced an rbtree so that only the earliest waiter on the global timeline would be woken, and would wake the next and so on. (Along with various complications to handle requests being reordered along the global timeline, and also a requirement for kthread to provide a delegate for fence signaling that had no process context.)

The global rbtree depends on knowing the execution timeline (and global seqno). Without knowing that order, we must instead check all contexts queued to the HW to see which may have advanced. We trim that list by only checking queued contexts that are being waited on, but still we keep a list of all active contexts and their active signalers that we inspect from inside the irq handler. By moving the waiters onto the fence signal list, we can combine the client wakeup with the dma_fence signaling (a dramatic reduction in complexity, but does require the HW being coherent, the seqno must be visible from the cpu before the interrupt is raised - we keep a timer backup just in case).
Having previously fixed all the issues with irq-seqno serialisation (by inserting delays onto the GPU after each request instead of random delays on the CPU after each interrupt), we can rely on the seqno state to perform direct wakeups from the interrupt handler. This allows us to preserve our single context switch behaviour of the current routine, with the only downside that we lose the RT priority sorting of wakeups. In general, direct wakeup latency of multiple clients is about the same (about 10% better in most cases) with a reduction in total CPU time spent in the waiter (about 20-50% depending on gen). Average herd behaviour is improved, but at the cost of not delegating wakeups on task_prio.

v2: Capture fence signaling state for error state and add comments to warm even the most cold of hearts.
v3: Check if the request is still active before busywaiting
v4: Reduce the amount of pointer misdirection with list_for_each_safe and using a local i915_request variable inside the loops
v5: Add a missing pluralisation to a purely informative selftest message.

Change-Id: Ibf5251cc874f4ee338266641df072cfa1012faae
References: 688e6c725816 ("drm/i915: Slaughter the thundering i915_wait_request herd")
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
"

The bisect log is in the attachment. I tried to backport this patch to kernel 4.19, but the patch is big; I need your help to fix it up and would appreciate any suggestions.

The reproduce steps:
1) Build this stack: https://software.intel.com/en-us/articles/build-and-debug-open-source-media-stack
2) Run ./decode_video_avc.sh
3) Meanwhile, use top or iostat to check the CPU iowait value.
Created attachment 143529 [details] patch_bisect.log
Created attachment 143530 [details] reproducer
iowait is an indication of userspace badness, i.e. that userspace is sleeping waiting for the gpu. That it goes down is quite possibly a regression... But the only numbers that matter here are GPU throughput: did the rate of decoding jobs increase or decrease (jobs per second)? Did the energy consumption change?
Created attachment 143532 [details] increase_jobs
Created attachment 143533 [details] decrease_jobs
(In reply to Chris Wilson from comment #3)
> iowait is an indication of userspace badness, i.e. that userspace is
> sleeping waiting for the gpu. That it goes down is quite possibly a
> regression... But the only numbers that matter here are GPU throughput:
> did the rate of decoding jobs increase or decrease (jobs per second)? Did
> the energy consumption change?

I tried increasing and decreasing the number of jobs; the iowait increases and decreases correspondingly. Thanks very much.
Is there a difference in actual decoding performance?
(In reply to Tvrtko Ursulin from comment #7)
> Is there a difference in actual decoding performance?

From our tests, there is no obvious performance drop.
Why is the change in reported iowait a concern then? We could replace io_schedule_timeout with schedule_timeout and make it even lower, but we chose to use io waits to reflect that a process is waiting on a device.
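A minimal sketch of the difference, paraphrased from the v4.19 scheduler code rather than copied verbatim: io_schedule_timeout() flags the sleeping task as in_iowait, and CPU idle time that elapses while such tasks are blocked is what /proc/stat (and hence top/iostat) reports as %iowait; schedule_timeout() performs exactly the same sleep, just without that accounting.

/* Simplified sketch, assuming the ~v4.19 layout of kernel/sched/core.c;
 * the real code splits this into io_schedule_prepare()/io_schedule_finish(). */
long io_schedule_timeout(long timeout)
{
	int old = current->in_iowait;	/* remember previous state */
	long ret;

	current->in_iowait = 1;		/* this flag is what the idle-time
					 * accounting samples as "iowait" */
	blk_schedule_flush_plug(current);

	ret = schedule_timeout(timeout); /* the actual sleep is identical to
					  * a bare schedule_timeout() call */

	current->in_iowait = old;
	return ret;
}

So swapping io_schedule_timeout() for schedule_timeout() only changes how the sleep is classified, not how long the process sleeps - which is consistent with the decode throughput being unchanged either way.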
(In reply to Tvrtko Ursulin from comment #9)
> Why is the change in reported iowait a concern then? We could replace
> io_schedule_timeout with schedule_timeout and make it even lower, but we
> chose to use io waits to reflect that a process is waiting on a device.

The CPU usage (including iowait) is part of the customer's acceptance criteria. I tried changing io_schedule_timeout to schedule_timeout, but the CPU iowait value is still just as high. Any other suggestions? Thanks very much in advance.

Our customer is waiting on a fix for this issue; is it possible to set this to high priority? Thanks very much.
I downloaded the reproducer and tried running it but I get:

root@sc:~/bug109830/reproduce# ./decode_video_avc.sh

[ERROR], sts=MFX_ERR_UNSUPPORTED(-3), Init, m_mfxSession.Init failed at /home/media/jenkins/workspace/MEDIA/DCG_MEDIA_DRIVER/PV5_BUILD_VPG_BASE/build_mss_pv5_external/build/WS/GENERIC/build_external/msdk/MediaSDK/samples/sample_decode/src/pipeline_decode.cpp:296

[ERROR], sts=MFX_ERR_UNSUPPORTED(-3), main, Pipeline.Init failed at /home/media/jenkins/workspace/MEDIA/DCG_MEDIA_DRIVER/PV5_BUILD_VPG_BASE/build_mss_pv5_external/build/WS/GENERIC/build_external/msdk/MediaSDK/samples/sample_decode/src/sample_decode.cpp:666

[ERROR], sts=MFX_ERR_UNSUPPORTED(-3), Init, m_mfxSession.Init failed at /home/media/jenkins/workspace/MEDIA/DCG_MEDIA_DRIVER/PV5_BUILD_VPG_BASE/build_mss_pv5_external/build/WS/GENERIC/build_external/msdk/MediaSDK/samples/sample_decode/src/pipeline_decode.cpp:296

... and so on ...
What was iowait like before 4.19?
Also, are you sure you changed io_schedule_timeout to schedule_timeout in i915_request_wait and deployed the correct kernel for testing?
(In reply to Tvrtko Ursulin from comment #11)
> I downloaded the reproducer and tried running it but I get:
>
> root@sc:~/bug109830/reproduce# ./decode_video_avc.sh
>
> [ERROR], sts=MFX_ERR_UNSUPPORTED(-3), Init, m_mfxSession.Init failed at
> /home/media/jenkins/workspace/MEDIA/DCG_MEDIA_DRIVER/PV5_BUILD_VPG_BASE/
> build_mss_pv5_external/build/WS/GENERIC/build_external/msdk/MediaSDK/samples/
> sample_decode/src/pipeline_decode.cpp:296
>
> [ERROR], sts=MFX_ERR_UNSUPPORTED(-3), main, Pipeline.Init failed at
> /home/media/jenkins/workspace/MEDIA/DCG_MEDIA_DRIVER/PV5_BUILD_VPG_BASE/
> build_mss_pv5_external/build/WS/GENERIC/build_external/msdk/MediaSDK/samples/
> sample_decode/src/sample_decode.cpp:666
>
> [ERROR], sts=MFX_ERR_UNSUPPORTED(-3), Init, m_mfxSession.Init failed at
> /home/media/jenkins/workspace/MEDIA/DCG_MEDIA_DRIVER/PV5_BUILD_VPG_BASE/
> build_mss_pv5_external/build/WS/GENERIC/build_external/msdk/MediaSDK/samples/
> sample_decode/src/pipeline_decode.cpp:296
>
> ... and so on ...

Can you try this command: vainfo?
(In reply to Tvrtko Ursulin from comment #12)
> What was iowait like before 4.19?

We tested 4.4/4.9/4.10/4.12/4.14/4.19. On 4.4 the iowait is good; the others all show a high value.
(In reply to Tvrtko Ursulin from comment #13)
> Also, are you sure you changed io_schedule_timeout to schedule_timeout in
> i915_request_wait and deployed the correct kernel for testing?

Yes, the following are my changes:

diff --git a/drivers/gpu/drm/i915/i915_request.c b/drivers/gpu/drm/i915/i915_request.c
index 5c2c93c..21950c1 100644
--- a/drivers/gpu/drm/i915/i915_request.c
+++ b/drivers/gpu/drm/i915/i915_request.c
@@ -1330,7 +1330,7 @@ long i915_request_wait(struct i915_request *rq,
 			goto complete;
 		}

-		timeout = io_schedule_timeout(timeout);
+		timeout = schedule_timeout(timeout);
 	} while (1);

 	GEM_BUG_ON(!intel_wait_has_seqno(&wait));
@@ -1363,7 +1363,7 @@ long i915_request_wait(struct i915_request *rq,
 			break;
 		}

-		timeout = io_schedule_timeout(timeout);
+		timeout = schedule_timeout(timeout);

 		if (intel_wait_complete(&wait) &&
 		    intel_wait_check_request(&wait, rq))
What is the good iowait in percentage?

I don't have vaapi installed - it looks like I incorrectly assumed the reproducer is self-contained? I still need to build the stack from #1?

Can you try with drm-tip? With schedule_timeout I get iowait of 0% for simulated media pipelines. :)
If you s/io_schedule_timeout/schedule_timeout/ inside i915.ko and the iowait remains... The iowait itself has nothing to do with us and we are just a victim, merely a symptom -- and in all likelihood, the fact that the iowait goes down is indicative that _we_ regressed and are preventing the system from waiting for io elsewhere by keeping it busy instead.
(In reply to Tvrtko Ursulin from comment #17)
> What is the good iowait in percentage?
>
> I don't have vaapi installed - it looks like I incorrectly assumed the
> reproducer is self-contained? I still need to build the stack from #1?
>
> Can you try with drm-tip? With schedule_timeout I get iowait of 0% for
> simulated media pipelines. :)

From our test on kernel 4.4, the iowait is around 60% on average over the whole media process. On kernel 4.19, the iowait is 90%+.

Our reproducer doesn't include the vaapi binaries, so is it possible to give me your IP address so I can remote in and check your environment? I will also upload the binaries later.

Yes, I tried drm-tip with schedule_timeout; the iowait drops to around 60%. But I think Chris's patch can also fix this:

drm/i915: Replace global breadcrumbs with per-context interrupt tracking

BTW, you may need the following command to see the full changes while the media pipelines are running:

iostat -x 1 120
(In reply to Chris Wilson from comment #18)
> If you s/io_schedule_timeout/schedule_timeout/ inside i915.ko and the iowait
> remains... The iowait itself has nothing to do with us and we are just a
> victim, merely a symptom -- and in all likelihood, the fact that the iowait
> goes down is indicative that _we_ regressed and are preventing the system
> from waiting for io elsewhere by keeping it busy instead.

From my test, changing io_schedule_timeout to schedule_timeout on drm-tip does bring the iowait value down, but it doesn't work on Linux 4.19.
Why can't it work for 4.19? Or do you mean it doesn't work?

What are the iowait numbers on drm-tip and 4.19 with the schedule_timeout hack?
(In reply to Tvrtko Ursulin from comment #21)
> Why can't it work for 4.19? Or do you mean it doesn't work?
>
> What are the iowait numbers on drm-tip and 4.19 with the schedule_timeout
> hack?

I changed io_schedule_timeout to schedule_timeout in kernel 4.19 and the iowait is still 90%+. But on drm-tip the iowait number is around 60% on average. I will try to test on kernel 5.0 next.
I built the media-driver but the reproducer still doesn't work for me:

root@sc:~/bug109830/reproduce# vainfo
error: can't connect to X server!
libva info: VA-API version 1.5.0
libva info: va_getDriverName() returns 0
libva info: User requested driver 'iHD'
libva info: Trying to open /usr/local/lib/dri//iHD_drv_video.so
libva info: Found init function __vaDriverInit_1_5
libva info: va_openDriver() returns 0
vainfo: VA-API version: 1.5 (libva 2.2.0)
vainfo: Driver version: Intel iHD driver - 1.0.0
vainfo: Supported profile and entrypoints
      VAProfileNone                   : VAEntrypointVideoProc
      VAProfileNone                   : VAEntrypointStats
      VAProfileMPEG2Simple            : VAEntrypointVLD
      VAProfileMPEG2Simple            : VAEntrypointEncSlice
      VAProfileMPEG2Main              : VAEntrypointVLD
      VAProfileMPEG2Main              : VAEntrypointEncSlice
      VAProfileH264Main               : VAEntrypointVLD
      VAProfileH264Main               : VAEntrypointEncSlice
      VAProfileH264Main               : VAEntrypointFEI
      VAProfileH264Main               : VAEntrypointEncSliceLP
      VAProfileH264High               : VAEntrypointVLD
      VAProfileH264High               : VAEntrypointEncSlice
      VAProfileH264High               : VAEntrypointFEI
      VAProfileH264High               : VAEntrypointEncSliceLP
      VAProfileVC1Simple              : VAEntrypointVLD
      VAProfileVC1Main                : VAEntrypointVLD
      VAProfileVC1Advanced            : VAEntrypointVLD
      VAProfileJPEGBaseline           : VAEntrypointVLD
      VAProfileJPEGBaseline           : VAEntrypointEncPicture
      VAProfileH264ConstrainedBaseline: VAEntrypointVLD
      VAProfileH264ConstrainedBaseline: VAEntrypointEncSlice
      VAProfileH264ConstrainedBaseline: VAEntrypointFEI
      VAProfileH264ConstrainedBaseline: VAEntrypointEncSliceLP
      VAProfileVP8Version0_3          : VAEntrypointVLD
      VAProfileHEVCMain               : VAEntrypointVLD
      VAProfileHEVCMain               : VAEntrypointEncSlice
      VAProfileHEVCMain               : VAEntrypointFEI
root@sc:~/bug109830/reproduce# ./decode_video_avc.sh

[ERROR], sts=MFX_ERR_UNSUPPORTED(-3), Init, m_mfxSession.Init failed at /home/media/jenkins/workspace/MEDIA/DCG_MEDIA_DRIVER/PV5_BUILD_VPG_BASE/build_mss_pv5_external/build/WS/GENERIC/build_external/msdk/MediaSDK/samples/sample_decode/src/pipeline_decode.cpp:296

[ERROR], sts=MFX_ERR_UNSUPPORTED(-3), main, Pipeline.Init failed at /home/media/jenkins/workspace/MEDIA/DCG_MEDIA_DRIVER/PV5_BUILD_VPG_BASE/build_mss_pv5_external/build/WS/GENERIC/build_external/msdk/MediaSDK/samples/sample_decode/src/sample_decode.cpp:666
(In reply to Tvrtko Ursulin from comment #23)
> I built the media-driver but the reproducer still doesn't work for me:
>
> root@sc:~/bug109830/reproduce# vainfo
> error: can't connect to X server!
> libva info: VA-API version 1.5.0
> libva info: va_getDriverName() returns 0
> libva info: User requested driver 'iHD'
> libva info: Trying to open /usr/local/lib/dri//iHD_drv_video.so
> libva info: Found init function __vaDriverInit_1_5
> libva info: va_openDriver() returns 0
> vainfo: VA-API version: 1.5 (libva 2.2.0)
> vainfo: Driver version: Intel iHD driver - 1.0.0
> vainfo: Supported profile and entrypoints
>       VAProfileNone                   : VAEntrypointVideoProc
>       VAProfileNone                   : VAEntrypointStats
>       VAProfileMPEG2Simple            : VAEntrypointVLD
>       VAProfileMPEG2Simple            : VAEntrypointEncSlice
>       VAProfileMPEG2Main              : VAEntrypointVLD
>       VAProfileMPEG2Main              : VAEntrypointEncSlice
>       VAProfileH264Main               : VAEntrypointVLD
>       VAProfileH264Main               : VAEntrypointEncSlice
>       VAProfileH264Main               : VAEntrypointFEI
>       VAProfileH264Main               : VAEntrypointEncSliceLP
>       VAProfileH264High               : VAEntrypointVLD
>       VAProfileH264High               : VAEntrypointEncSlice
>       VAProfileH264High               : VAEntrypointFEI
>       VAProfileH264High               : VAEntrypointEncSliceLP
>       VAProfileVC1Simple              : VAEntrypointVLD
>       VAProfileVC1Main                : VAEntrypointVLD
>       VAProfileVC1Advanced            : VAEntrypointVLD
>       VAProfileJPEGBaseline           : VAEntrypointVLD
>       VAProfileJPEGBaseline           : VAEntrypointEncPicture
>       VAProfileH264ConstrainedBaseline: VAEntrypointVLD
>       VAProfileH264ConstrainedBaseline: VAEntrypointEncSlice
>       VAProfileH264ConstrainedBaseline: VAEntrypointFEI
>       VAProfileH264ConstrainedBaseline: VAEntrypointEncSliceLP
>       VAProfileVP8Version0_3          : VAEntrypointVLD
>       VAProfileHEVCMain               : VAEntrypointVLD
>       VAProfileHEVCMain               : VAEntrypointEncSlice
>       VAProfileHEVCMain               : VAEntrypointFEI
> root@sc:~/bug109830/reproduce# ./decode_video_avc.sh
>
> [ERROR], sts=MFX_ERR_UNSUPPORTED(-3), Init, m_mfxSession.Init failed at
> /home/media/jenkins/workspace/MEDIA/DCG_MEDIA_DRIVER/PV5_BUILD_VPG_BASE/
> build_mss_pv5_external/build/WS/GENERIC/build_external/msdk/MediaSDK/samples/
> sample_decode/src/pipeline_decode.cpp:296
>
> [ERROR], sts=MFX_ERR_UNSUPPORTED(-3), main, Pipeline.Init failed at
> /home/media/jenkins/workspace/MEDIA/DCG_MEDIA_DRIVER/PV5_BUILD_VPG_BASE/
> build_mss_pv5_external/build/WS/GENERIC/build_external/msdk/MediaSDK/samples/
> sample_decode/src/sample_decode.cpp:666

Maybe our sample_decode binary doesn't match your build. Could you try the sample_decode binary from your build folder instead: MediaSDK/__cmake/intel64.make.release/__bin/release/sample_decode? Or is it possible to give me your IP address so I can check the environment remotely? I will also provide a binary later. Thanks very much.
I can't build MediaSDK since the instructions don't work for me. This step:

perl tools/builder/build_mfx.pl --cmake=intel64.make.release

There is no tools/builder/build_mfx.pl in my checkout. Neither in master nor at the SHA referenced as tested by the build script.

P.S. My test machine is not accessible from the outside.
(In reply to Tvrtko Ursulin from comment #25)
> I can't build MediaSDK since the instructions don't work for me. This step:
>
> perl tools/builder/build_mfx.pl --cmake=intel64.make.release
>
> There is no tools/builder/build_mfx.pl in my checkout. Neither in master nor
> at the SHA referenced as tested by the build script.
>
> P.S. My test machine is not accessible from the outside.

I have changed the reproduce steps; these should work on your side:

1. Get the UMD build from github: https://github.com/Intel-Media-SDK/MediaSDK/releases/download/intel-mediasdk-18.4.1/MediaStack.tar.gz
2. tar xzvf MediaStack.tar.gz
3. cd MediaStack && sudo ./install_media.sh   # Install UMD
4. Reboot
5. tar xzvf reproduces.tar.gz
6. cd reproduces
7. ./make_clips.sh   # Generate multiple clips for testing
8. cp /opt/intel/mediasdk/share/mfx/samples/_bin/x64/sample_decode .   # Use the sample_decode from MediaStack
9. ./decode_video_avc.sh
Hi,

I have made the following changes on kernel 4.19:

1) Apply the following patch:
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-4.20.y&id=0a2fff2428f1e175932dc3cf115b68c868e3a839

2) Change io_schedule_timeout to schedule_timeout:

+++ b/drivers/gpu/drm/i915/i915_request.c
@@ -1322,7 +1322,7 @@ long i915_request_wait(struct i915_request *rq,
 			break;
 		}

-		timeout = io_schedule_timeout(timeout);
+		timeout = schedule_timeout(timeout);
 	}

With these the iowait value drops below 80%. So is it possible to do one W/R as a temporary deployment patch for the customer before the formal fix patch is released? Any suggestions? Thanks very much.
There is no reason to make any change upstream.

Performance improvements even for bogus metrics, especially for bogus metrics, do not get backported as they have a much greater risk of regression than their benefit.

If it was a regression in real-world performance, then it would be a bug. By your own statement it is not a regression to previous kernels, and the performance has not changed.
(In reply to Chris Wilson from comment #28)
> There is no reason to make any change upstream.
>
> Performance improvements even for bogus metrics, especially for bogus
> metrics, do not get backported as they have a much greater risk of
> regression than their benefit.
>
> If it was a regression in real-world performance, then it would be a bug. By
> your own statement it is not a regression to previous kernels, and the
> performance has not changed.

Yes, the performance (fps) hasn't changed at all, but the CPU usage (iowait) is much higher. In our customer's view, the CPU usage is also an important performance reference point. Thanks very much.
(In reply to Owen Zhang from comment #27)
> Hi,
>
> I have made the following changes on kernel 4.19:
>
> 1) Apply the following patch:
> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/
> ?h=linux-4.20.y&id=0a2fff2428f1e175932dc3cf115b68c868e3a839

This patch makes a difference on its own? How much?

If it does, I think you have to test by ensuring input data is cached before the timing run, and dump the output to /dev/null.

Otherwise you are accounting for the changes in the kernel's block and filesystem subsystems as well.

P.S. I haven't had the time yet to try the repro using the latest instructions.
(In reply to Tvrtko Ursulin from comment #30)
> (In reply to Owen Zhang from comment #27)
> > Hi,
> >
> > I have made the following changes on kernel 4.19:
> >
> > 1) Apply the following patch:
> > https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/
> > ?h=linux-4.20.y&id=0a2fff2428f1e175932dc3cf115b68c868e3a839
>
> This patch makes a difference on its own? How much?
>
> If it does, I think you have to test by ensuring input data is cached before
> the timing run, and dump the output to /dev/null.
>
> Otherwise you are accounting for the changes in the kernel's block and
> filesystem subsystems as well.
>
> P.S. I haven't had the time yet to try the repro using the latest
> instructions.

Applying only this patch makes no difference; it also needs the io_schedule_timeout to schedule_timeout change.

With both changes applied, the CPU iowait number comes down.
(In reply to Owen Zhang from comment #31)
> (In reply to Tvrtko Ursulin from comment #30)
> > (In reply to Owen Zhang from comment #27)
> > > Hi,
> > >
> > > I have made the following changes on kernel 4.19:
> > >
> > > 1) Apply the following patch:
> > > https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/
> > > ?h=linux-4.20.y&id=0a2fff2428f1e175932dc3cf115b68c868e3a839
> >
> > This patch makes a difference on its own? How much?
> >
> > If it does, I think you have to test by ensuring input data is cached before
> > the timing run, and dump the output to /dev/null.
> >
> > Otherwise you are accounting for the changes in the kernel's block and
> > filesystem subsystems as well.
> >
> > P.S. I haven't had the time yet to try the repro using the latest
> > instructions.
>
> Applying only this patch makes no difference; it also needs the
> io_schedule_timeout to schedule_timeout change.
>
> With both changes applied, the CPU iowait number comes down.

So only the two changes in conjunction work? Is the test system using device mapper?

Can you try pre-caching, a different block device and output to /dev/null? (Without either patch applied.)
(In reply to Tvrtko Ursulin from comment #32)
> (In reply to Owen Zhang from comment #31)
> > (In reply to Tvrtko Ursulin from comment #30)
> > > (In reply to Owen Zhang from comment #27)
> > > > Hi,
> > > >
> > > > I have made the following changes on kernel 4.19:
> > > >
> > > > 1) Apply the following patch:
> > > > https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/
> > > > ?h=linux-4.20.y&id=0a2fff2428f1e175932dc3cf115b68c868e3a839
> > >
> > > This patch makes a difference on its own? How much?
> > >
> > > If it does, I think you have to test by ensuring input data is cached before
> > > the timing run, and dump the output to /dev/null.
> > >
> > > Otherwise you are accounting for the changes in the kernel's block and
> > > filesystem subsystems as well.
> > >
> > > P.S. I haven't had the time yet to try the repro using the latest
> > > instructions.
> >
> > Applying only this patch makes no difference; it also needs the
> > io_schedule_timeout to schedule_timeout change.
> >
> > With both changes applied, the CPU iowait number comes down.
>
> So only the two changes in conjunction work? Is the test system using device
> mapper?
>
> Can you try pre-caching, a different block device and output to /dev/null?
> (Without either patch applied.)

Yes, only the two changes together work now, and we do use device mapper. We now pre-allocate the buffer and produce no output in our test case.
And what were the results?
(In reply to Tvrtko Ursulin from comment #34)
> And what were the results?

We are running the full test now and will update the results when it finishes. If you have any suggestions, please let me know. Thanks very much.
(In reply to Tvrtko Ursulin from comment #34)
> And what were the results?

We finished the full test: no regression found and no performance drop after applying these two changes, and the iowait value is close to zero. Any suggestions for these patches? Thanks very much.
Ignoring these patches for now, but with pre-caching and output to /dev/null only, is the iowait still the same?

I am still curious about that device mapper patch - since you said the i915 patch on its own is not enough to bring the iowait down.

Otherwise, a single test job from your scripts spends ~12% of its runtime in the GEM_WAIT ioctl - which is the source of iowait. When you run 34 of these jobs in parallel the waits will naturally scale up due to the GPU being more congested.

In other words userspace seems to be deliberately waiting on the GPU, and if you want to decrease reported iowait you could also try making userspace more asynchronous.
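To illustrate the last point, a minimal userspace sketch (not code from the media stack; the fd and bo_handle are assumed to come from the application's existing libdrm setup) contrasting a blocking GEM_WAIT - the sleep that gets reported as iowait - with a non-blocking poll that lets the application do other work instead of sleeping in the ioctl:

#include <errno.h>
#include <stdint.h>
#include <xf86drm.h>      /* drmIoctl(), via `pkg-config --cflags libdrm` */
#include <i915_drm.h>     /* DRM_IOCTL_I915_GEM_WAIT, struct drm_i915_gem_wait */

/* Blocking wait: the process sleeps in the kernel until the buffer is idle,
 * and that sleep is accounted against the caller as iowait. */
static int wait_for_bo(int fd, uint32_t bo_handle)
{
	struct drm_i915_gem_wait wait = {
		.bo_handle = bo_handle,
		.timeout_ns = -1,	/* negative timeout = wait indefinitely */
	};

	return drmIoctl(fd, DRM_IOCTL_I915_GEM_WAIT, &wait);
}

/* Non-blocking check: with a zero timeout the ioctl returns immediately,
 * succeeding if the buffer is idle and failing with ETIME if the GPU is
 * still using it, so the application can keep feeding other streams. */
static int bo_is_idle(int fd, uint32_t bo_handle)
{
	struct drm_i915_gem_wait wait = {
		.bo_handle = bo_handle,
		.timeout_ns = 0,
	};

	if (drmIoctl(fd, DRM_IOCTL_I915_GEM_WAIT, &wait) == 0)
		return 1;
	return (errno == ETIME) ? 0 : -1;	/* -1 on real errors */
}

Whether restructuring around the non-blocking form is worthwhile is a media-stack decision; the kernel side is simply reporting that the process chose to sleep.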
(In reply to Tvrtko Ursulin from comment #37)
> Ignoring these patches for now, but with pre-caching and output to /dev/null
> only, is the iowait still the same?

From our test, yes. And from the QA test:

With only the device mapper patch, the iowait does not go down.
With only the io_schedule_timeout to schedule_timeout change, the iowait does not go down.
With both the device mapper patch and the schedule_timeout change, the iowait goes down.

Thanks very much.
The discussion here has petered out, away from the io_schedule_timeout being the issue, and we are still debating the merits of the global iowait as being a valid statistic. (Note that customers probably want to watch blockstats instead if they want to track an overloaded block device.)
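On the blockstats point, a minimal sketch of reading the per-device block statistics directly; the device name sda is an assumption, and the first eleven fields of /sys/block/<dev>/stat are read in the layout described in Documentation/block/stat.txt:

#include <stdio.h>

/* Print a few of the /sys/block/<dev>/stat fields that indicate whether the
 * block device itself is overloaded (requests in flight, time spent busy). */
int main(void)
{
	unsigned long long rd_ios, rd_merges, rd_sectors, rd_ticks;
	unsigned long long wr_ios, wr_merges, wr_sectors, wr_ticks;
	unsigned long long in_flight, io_ticks, time_in_queue;
	FILE *f = fopen("/sys/block/sda/stat", "r");	/* assumed device name */

	if (!f)
		return 1;
	if (fscanf(f, "%llu %llu %llu %llu %llu %llu %llu %llu %llu %llu %llu",
		   &rd_ios, &rd_merges, &rd_sectors, &rd_ticks,
		   &wr_ios, &wr_merges, &wr_sectors, &wr_ticks,
		   &in_flight, &io_ticks, &time_in_queue) == 11)
		printf("reads=%llu writes=%llu in_flight=%llu busy_ms=%llu\n",
		       rd_ios, wr_ios, in_flight, io_ticks);
	fclose(f);
	return 0;
}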