Bug 103314 - Strange slow-fast performance latching between gfxbench3 (and 4) test runs
Status: RESOLVED MOVED
Alias: None
Product: xorg
Classification: Unclassified
Component: Driver/intel
Version: git
Hardware: x86-64 (AMD64) Linux (All)
Importance: medium normal
Assignee: Chris Wilson
QA Contact: Intel GFX Bugs mailing list
Reported: 2017-10-17 17:53 UTC by Tvrtko Ursulin
Modified: 2019-11-27 13:49 UTC
CC List: 3 users

Attachments
Only use staged uploads for the same batch. (2.04 KB, patch)
2017-10-18 09:32 UTC, Chris Wilson

Description Tvrtko Ursulin 2017-10-17 17:53:05 UTC
I have observed a strange phenomenon where benchmark scores vary, to differing degrees, between runs. Looking at what the system is doing, I noticed that the fast case only ever uses the RCS engine, while the slow case runs BCS at 100% and RCS at a lower percentage (exactly how much depends on the benchmark).

I have already chatted with Kenneth (and some other people) about this and apparently this is somewhat known and caused by different BO upload paths. It was supposed to be alleviated in the master branch but for me the results are exactly the same (bad) as what I originally discovered.

The most obvious example is the gl_driver test:


Slow run, 17.5% RCS busy, 100% BCS busy:
========================================

root@e31:~/benchmarks/gfxbench3_desktop# INTEL_DEBUG=perf ~/bin/run-mesa ~/mesa ./gfxbench-driver.sh
Running following GfxBench 3.x test-cases:
- gl_driver

In following resolutions:
- 1920x1080

Fullscreened:
- 1

On/offscreen:
- on

COMMAND: build/linux/gfxbench_Release/mainapp/mainapp -w 1920 -ow 1920 -h 1080 -oh 1080 -t gl_driver -fullscreen 1
ATTENTION: default value of option vblank_mode overridden by environment.
Scanning index buffer to compute index buffer bounds.  Use glDrawRangeElements() to avoid this.
CPU mapping a busy "statebuffer" BO stalled and took 0.017 ms.
CPU mapping a busy "batchbuffer" BO stalled and took 0.035 ms.
CPU mapping a busy "streamed data" BO stalled and took 0.012 ms.
CPU mapping a busy "statebuffer" BO stalled and took 0.010 ms.
CPU mapping a busy "batchbuffer" BO stalled and took 0.022 ms.
CPU mapping a busy "batchbuffer" BO stalled and took 0.013 ms.
CPU mapping a busy "batchbuffer" BO stalled and took 0.025 ms.
CPU mapping a busy "batchbuffer" BO stalled and took 0.038 ms.
CPU mapping a busy "batchbuffer" BO stalled and took 0.039 ms.
CPU mapping a busy "batchbuffer" BO stalled and took 0.018 ms.
#Name, FPS, Score, Unit, Width, Height, GL version:
GLB30_gl_driver, 34.2, 1026.0, frames, 1920, 1080, 3.0 Mesa 17.3.0-devel (git-31fb7bbe0b)

Results are in:
- gfxbench-result-fullscreen-1.csv

Full output logs are in:
- gfxbench-result-fullscreen-1.txt

Fast run, 75% RCS busy, 0% BCS busy:
=====================================

root@e31:~/benchmarks/gfxbench3_desktop# INTEL_DEBUG=perf ~/bin/run-mesa ~/mesa ./gfxbench-driver.sh
Running following GfxBench 3.x test-cases:
- gl_driver

In following resolutions:
- 1920x1080

Fullscreened:
- 1

On/offscreen:
- on

COMMAND: build/linux/gfxbench_Release/mainapp/mainapp -w 1920 -ow 1920 -h 1080 -oh 1080 -t gl_driver -fullscreen 1
ATTENTION: default value of option vblank_mode overridden by environment.
Scanning index buffer to compute index buffer bounds.  Use glDrawRangeElements() to avoid this.
CPU mapping a busy "batchbuffer" BO stalled and took 0.010 ms.
CPU mapping a busy "batchbuffer" BO stalled and took 0.020 ms.
CPU mapping a busy "batchbuffer" BO stalled and took 0.021 ms.
CPU mapping a busy "streamed data" BO stalled and took 0.019 ms.
CPU mapping a busy "batchbuffer" BO stalled and took 0.015 ms.
CPU mapping a busy "statebuffer" BO stalled and took 0.017 ms.
CPU mapping a busy "batchbuffer" BO stalled and took 0.037 ms.
CPU mapping a busy "batchbuffer" BO stalled and took 0.036 ms.
CPU mapping a busy "statebuffer" BO stalled and took 0.021 ms.
CPU mapping a busy "batchbuffer" BO stalled and took 0.029 ms.
CPU mapping a busy "batchbuffer" BO stalled and took 0.019 ms.
CPU mapping a busy "batchbuffer" BO stalled and took 0.017 ms.
CPU mapping a busy "streamed data" BO stalled and took 0.019 ms.
CPU mapping a busy "statebuffer" BO stalled and took 0.016 ms.
CPU mapping a busy "batchbuffer" BO stalled and took 0.023 ms.
CPU mapping a busy "batchbuffer" BO stalled and took 0.011 ms.
CPU mapping a busy "batchbuffer" BO stalled and took 0.014 ms.
CPU mapping a busy "statebuffer" BO stalled and took 0.015 ms.
CPU mapping a busy "batchbuffer" BO stalled and took 0.011 ms.
CPU mapping a busy "batchbuffer" BO stalled and took 0.013 ms.
CPU mapping a busy "batchbuffer" BO stalled and took 0.015 ms.
CPU mapping a busy "batchbuffer" BO stalled and took 0.027 ms.
CPU mapping a busy "batchbuffer" BO stalled and took 0.010 ms.
CPU mapping a busy "batchbuffer" BO stalled and took 0.015 ms.
CPU mapping a busy "statebuffer" BO stalled and took 0.011 ms.
CPU mapping a busy "batchbuffer" BO stalled and took 0.031 ms.
CPU mapping a busy "batchbuffer" BO stalled and took 0.038 ms.
CPU mapping a busy "batchbuffer" BO stalled and took 0.023 ms.
CPU mapping a busy "batchbuffer" BO stalled and took 0.026 ms.
#Name, FPS, Score, Unit, Width, Height, GL version:
GLB30_gl_driver, 101.4, 3042.0, frames, 1920, 1080, 3.0 Mesa 17.3.0-devel (git-31fb7bbe0b)

Results are in:
- gfxbench-result-fullscreen-1.csv

Full output logs are in:
- gfxbench-result-fullscreen-1.txt



Other tests show a smaller margin between slow and fast modes, but the crux of the problem is still there.
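
To compare two such logs without eyeballing them, something like the following Python sketch could be used. It is not part of gfxbench or Mesa; the function names (parse_result, summarize_stalls, fps_ratio) are made up for illustration, and it assumes the log keeps the "#Name, FPS, ..." header directly above the result line and the "CPU mapping a busy ..." message format shown above.

```python
import re

# Matches the INTEL_DEBUG=perf stall messages as they appear in the logs above.
STALL_RE = re.compile(
    r'CPU mapping a busy "(?P<bo>[^"]+)" BO stalled and took (?P<ms>[0-9.]+) ms\.'
)


def parse_result(log_text):
    """Return the result row as a dict keyed by the '#Name, FPS, ...' header.

    Returns None if no header followed by a data line is found.
    """
    lines = log_text.splitlines()
    for i, line in enumerate(lines[:-1]):
        if line.startswith("#Name"):
            header = [h.strip() for h in line.lstrip("#").rstrip(":").split(",")]
            values = [v.strip() for v in lines[i + 1].split(",")]
            return dict(zip(header, values))
    return None


def summarize_stalls(log_text):
    """Map each BO name to (number of stalls, total stall time in ms)."""
    stalls = {}
    for line in log_text.splitlines():
        m = STALL_RE.search(line)
        if m:
            count, total = stalls.get(m["bo"], (0, 0.0))
            stalls[m["bo"]] = (count + 1, total + float(m["ms"]))
    return stalls


def fps_ratio(slow_log, fast_log):
    """Ratio of fast FPS to slow FPS for two logs of the same test."""
    slow = float(parse_result(slow_log)["FPS"])
    fast = float(parse_result(fast_log)["FPS"])
    return fast / slow


# Short excerpts copied from the two runs above.
SLOW_EXCERPT = """
CPU mapping a busy "statebuffer" BO stalled and took 0.017 ms.
CPU mapping a busy "batchbuffer" BO stalled and took 0.035 ms.
#Name, FPS, Score, Unit, Width, Height, GL version:
GLB30_gl_driver, 34.2, 1026.0, frames, 1920, 1080, 3.0 Mesa 17.3.0-devel (git-31fb7bbe0b)
"""

FAST_EXCERPT = """
CPU mapping a busy "batchbuffer" BO stalled and took 0.010 ms.
#Name, FPS, Score, Unit, Width, Height, GL version:
GLB30_gl_driver, 101.4, 3042.0, frames, 1920, 1080, 3.0 Mesa 17.3.0-devel (git-31fb7bbe0b)
"""

if __name__ == "__main__":
    print("fast/slow FPS ratio: %.2f" % fps_ratio(SLOW_EXCERPT, FAST_EXCERPT))
    print("slow-run stalls:", summarize_stalls(SLOW_EXCERPT))
```

Feeding it the full gfxbench-result-fullscreen-1.txt from each run should give the per-BO stall totals side by side with the FPS ratio, which makes it easier to see whether the stall messages differ between the slow and fast modes at all.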

I think this is quite a problem for benchmarking, and unless people say something along the lines of "Don't use gfxbench3/4 ever because A B C", I think it would be good to get to the bottom of this and have something which behaves more predictably.

I also wonder whether something in the kernel (i915) could be causing Mesa to behave like this. I am pretty sure I have used these benchmarks before, and I don't remember noticing this issue. A 3x difference in gl_driver certainly looks like something people would have noticed earlier if it were old behaviour.
Comment 1 Chris Wilson 2017-10-18 09:32:39 UTC
Created attachment 134902
Only use staged uploads for the same batch.

An idea to cut out the flip flops on staging uploads.
Comment 2 Tvrtko Ursulin 2017-10-18 10:09:25 UTC
(In reply to Chris Wilson from comment #1)
> Created attachment 134902
> Only use staged uploads for the same batch.
> 
> An idea to cut out the flip flops on staging uploads.

No apparent effect in testing with this one.
Comment 3 Tvrtko Ursulin 2017-10-18 10:12:26 UTC
Btw, should I be seeing "Using a blit copy to avoid stalling on..." messages since I have INTEL_DEBUG=perf turned on? Or is there some other path without perf_debug which does blitter uploads as well?
Comment 4 Chris Wilson 2017-10-18 11:28:37 UTC
(In reply to Tvrtko Ursulin from comment #3)
> Btw, should I be seeing "Using a blit copy to avoid stalling on..." messages
> since I have INTEL_DEBUG=perf turned on? Or there is some other path wo/
> perf_debug which does blitter uploads as well?

Yes... And you still see high BCS usage on master? That too shouldn't happen for brw_blorp_copy_buffers, so another indication of barking up the wrong tree.

Let's see if we can perf_debug() the switch from RCS to BCS.
Comment 5 Eero Tamminen 2017-10-18 13:21:13 UTC
Tvrtko, do you see the same issue also with the offscreen version of the test?

Benchmarks shouldn't normally be doing uploads after test startup, unless it's a benchmark for texture upload.

The only thing that I've seen using a lot of blitter during test run-time is the X server, when it copies the non-vsynced frame.  This would be most visible when using the Intel DDX with DRI2.

Which X server and X driver are you using?  The Intel DDX, or modesetting? If the former, do you use DRI2 or DRI3?

(LIBGL_DEBUG=verbose should output whether Mesa uses DRI2 or DRI3.)
Comment 6 Tvrtko Ursulin 2017-10-18 14:02:03 UTC
On that machine I have the Intel DDX with DRI 3 turned on, and Mesa confirms it is using DRI 3.

Offscreen version of the test does not seem to suffer from this problem.

So I guess user error of some sort?
Comment 7 Eero Tamminen 2017-10-18 14:28:51 UTC
(In reply to Tvrtko Ursulin from comment #6)
> On that machine I have the Intel DDX with DRI 3 turned on, and Mesa confirms
> it is using DRI 3.
> 
> Offscreen version of the test does not seem to suffer from this problem.
>
> So I guess user error of some sort?

I think the dual results issue we've discussed is still real.

You could also test with modesetting, to make sure blitter usage really goes away and, if so, whether that makes the performance results more consistent (mostly/partly CPU-bound tests like gl_driver will still have at least 5x more variance than GPU-bound tests).

Even if the cause turns out to be the Intel DDX rather than Mesa, it's quite suspicious that it would randomly use the blitter for frame copies.
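
For checking whether results become more consistent, a rough Python sketch along these lines might help separate ordinary run-to-run variance from the two-mode latching. It is only an illustration: the helper names (split_modes, coefficient_of_variation) and the 1.5x gap threshold are assumptions, and of the sample FPS values only 34.2 and 101.4 come from this report; the rest are made up.

```python
from statistics import mean, pstdev


def split_modes(fps_values, min_ratio=1.5):
    """Split positive FPS samples at the largest relative gap between sorted neighbours.

    Returns (slow, fast). If no neighbouring pair differs by at least
    min_ratio, the samples are treated as a single mode and fast is empty.
    """
    ordered = sorted(fps_values)
    best_index, best_ratio = None, 1.0
    for i in range(len(ordered) - 1):
        ratio = ordered[i + 1] / ordered[i]
        if ratio > best_ratio:
            best_index, best_ratio = i, ratio
    if best_index is None or best_ratio < min_ratio:
        return ordered, []
    return ordered[: best_index + 1], ordered[best_index + 1 :]


def coefficient_of_variation(values):
    """Population standard deviation divided by the mean."""
    return pstdev(values) / mean(values)


if __name__ == "__main__":
    # 34.2 and 101.4 are from the gl_driver runs in this bug;
    # the other values are hypothetical repeat runs.
    samples = [34.2, 101.4, 33.8, 100.9, 34.5]
    slow, fast = split_modes(samples)
    print("slow mode:", slow, "fast mode:", fast)
    print("CV over all runs:", coefficient_of_variation(samples))
    if fast:
        print("CV within slow mode:", coefficient_of_variation(slow))
        print("CV within fast mode:", coefficient_of_variation(fast))
```

If the latching is gone (for example under modesetting), split_modes should return an empty fast list and the overall coefficient of variation should drop to the level seen within a single mode.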
Comment 8 Tvrtko Ursulin 2017-10-23 09:24:43 UTC
Can't repro with modesetting. Let's see if I can move the bug to xorg/driver/intel.
Comment 9 Martin Peres 2019-11-27 13:49:06 UTC
-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/xorg/driver/xf86-video-intel/issues/150.