Copying from the comment made on bug #102575 to here.

> Pardon my intrusion.
>
> Running subtest basic-small-bo-tiledX with latest igt-gpu-tools
> (1.22+173+gf560ae5a-1) and drm-tip (4.17rc6+1560+g9d5095539d5f+755171-1)
> yields high failure rates on my GM45: out of 1000 iterations, only 360 are
> successful.
>
> Neither basic-small-bo-tiledY nor basic-small-bo have any failures with 100
> iterations each.
>
> Platform:
> Dell Inspiron 1545
> Eagle Lake / Core2Duo (Pentium(R) Dual-Core CPU T4200 @ 2.00GHz) / GMA4500
> LVDS (VGA)
>
> I tested it out of curiosity to its relation to my other bug. [#103025]

In my own investigation I have found the failure to start between kernels 4.7 and 4.8, making it easy to bisect. I haven't finished the bisect yet, but I believe there is some relation to the mm's compaction and oom changes that were made around that time; as I recall, it took quite a bit of time to get those mm changes working right. I can have the bisect results within a day.
Note that this is not a CI bug (I removed the [NOT CI] tag as it's not used anywhere else)
Can you attach debug logs when reproducing the issue on drm-tip?
Created attachment 140071 [details]
drm.debug=0x1f log while running test x100

Reproduced and attached the output of journalctl -b -k -o short-monotonic

Additional kernel info:
linux-drm-tip-git 4.17rc7+1947+gc1064b9be065+755888-1 (via Archlinux's pacman -Q)
Build Date: Tue 05 Jun 2018 10:53:05 PM EDT
I have bisected the kernel, which revealed commit e6cbd7f2efb433d717af72aa8510a9db6f7a7e05 to be the first bad commit.

commit e6cbd7f2efb433d717af72aa8510a9db6f7a7e05 (HEAD, refs/bisect/bad)
Author: Mel Gorman <mgorman@techsingularity.net>
Date:   Thu Jul 28 15:46:50 2016 -0700

    mm, page_alloc: remove fair zone allocation policy

    The fair zone allocation policy interleaves allocation requests between
    zones to avoid an age inversion problem whereby new pages are reclaimed
    to balance a zone. Reclaim is now node-based so this should no longer be
    an issue and the fair zone allocation policy is not free. This patch
    removes it.

    Link: http://lkml.kernel.org/r/1467970510-21195-30-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
    Acked-by: Vlastimil Babka <vbabka@suse.cz>
    Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
    Acked-by: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
    Cc: Michal Hocko <mhocko@kernel.org>
    Cc: Minchan Kim <minchan@kernel.org>
    Cc: Rik van Riel <riel@surriel.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

git bisect log:

git bisect start
# good: [523d939ef98fd712632d93a5a2b588e477a7565e] Linux 4.7
git bisect good 523d939ef98fd712632d93a5a2b588e477a7565e
# bad: [c8d2bc9bc39ebea8437fd974fdbc21847bb897a3] Linux 4.8
git bisect bad c8d2bc9bc39ebea8437fd974fdbc21847bb897a3
# skip: [e61c10e468a42512f5fad74c00b62af5cc19f65f] sh: add device tree source for J2 FPGA on Mimas v2 board
git bisect skip e61c10e468a42512f5fad74c00b62af5cc19f65f
# good: [6d51c813b172b4374fe7a6b732b6666f8d77bfea] drm/amdgpu: update golden setting of stoney
git bisect good 6d51c813b172b4374fe7a6b732b6666f8d77bfea
# bad: [e7b4f2d8edbbc58c8e2c3134ff884611433ba3db] Merge branch 'overlayfs-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs
git bisect bad e7b4f2d8edbbc58c8e2c3134ff884611433ba3db
# good: [0e6acf0204da5b8705722a5f6806a4f55ed379d6] Merge tag 'xfs-for-linus-4.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs
git bisect good 0e6acf0204da5b8705722a5f6806a4f55ed379d6
# good: [da54bb13c02660544c286e7922b2ec660e5b1e77] Merge branch '40GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue
git bisect good da54bb13c02660544c286e7922b2ec660e5b1e77
# good: [6a492b0f23d28e1f946cdf08e54617484400dafb] Merge tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
git bisect good 6a492b0f23d28e1f946cdf08e54617484400dafb
# bad: [c3486f5376696034d0fcbef8ba70c70cfcb26f51] mm, compaction: simplify contended compaction handling
git bisect bad c3486f5376696034d0fcbef8ba70c70cfcb26f51
# good: [92effdf8b8b214165d5437f02b0ccbe80ba244cf] [media] doc-rst: Remove deprecated API.html document
git bisect good 92effdf8b8b214165d5437f02b0ccbe80ba244cf
# good: [818e607b57c94ade9824dad63a96c2ea6b21baf3] Merge tag 'random_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/random
git bisect good 818e607b57c94ade9824dad63a96c2ea6b21baf3
# good: [c4a25635b60d08853a3e4eaae3ab34419a36cfa2] mm: move vmscan writes and file write accounting to the node
git bisect good c4a25635b60d08853a3e4eaae3ab34419a36cfa2
# bad: [35b3445e97352732f0d64a7e629f629b1d81827e] mm/zsmalloc: add __init,__exit attribute
git bisect bad 35b3445e97352732f0d64a7e629f629b1d81827e
# bad: [68eb0731c4ce1d64aa59b244abae4e72300719b6] mm, pagevec: release/reacquire lru_lock on pgdat change
git bisect bad 68eb0731c4ce1d64aa59b244abae4e72300719b6
# good: [e5146b12e2d02af04608301c958d95b2fc47a0f9] mm, vmscan: add classzone information to tracepoints
git bisect good e5146b12e2d02af04608301c958d95b2fc47a0f9
# bad: [7cc30fcfd2a894589d832a192cac3dc5cd302bb8] mm: vmstat: account per-zone stalls and pages skipped during reclaim
git bisect bad 7cc30fcfd2a894589d832a192cac3dc5cd302bb8
# bad: [3b8c0be43cb844b3cd26fac00e7663a1201176fd] mm: page_alloc: cache the last node whose dirty limit is reached
git bisect bad 3b8c0be43cb844b3cd26fac00e7663a1201176fd
# bad: [e6cbd7f2efb433d717af72aa8510a9db6f7a7e05] mm, page_alloc: remove fair zone allocation policy
git bisect bad e6cbd7f2efb433d717af72aa8510a9db6f7a7e05
# first bad commit: [e6cbd7f2efb433d717af72aa8510a9db6f7a7e05] mm, page_alloc: remove fair zone allocation policy
Not to cast any aspersions or anything, but I wasn't expecting a non-i915 result. Do you mind testing e6cbd7f2efb433d717af72aa8510a9db6f7a7e05^ and e6cbd7f2efb433d717af72aa8510a9db6f7a7e05 for a few hours apiece (or until failure) to confirm the reliability of the result?
Just to be sure, you want me to run the subtest igt@gem_mmap_gtt@basic-small-bo-tiledx repeatedly for a few hours for the last good commit and the first bad commit? If so, I have already run that test 250 times (about 10 min total) on both kernels with rather definitive results. (On the "good" kernel, it succeeds 250 times, while the "bad" one fails 139 of those 250 tries, although with large variance.) I can do so if you still think it's necessary. Could the bad commit also be a cause for the issues with "scratch page" allocations mentioned at bug 103025 comment #44?
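As a rough check on why 250 runs per kernel can already count as definitive, the sketch below computes how likely it would be for a kernel with the observed failure rate (139 of 250) to pass 250 runs in a row by chance. It assumes every run fails independently with the same probability, which the large variance noted above suggests is only an approximation; the function name is made up for this illustration.

```c
/*
 * Probability that `runs` consecutive iterations all pass, given a fixed
 * per-run failure probability. Assumes independent runs (an idealisation).
 */
double all_pass_probability(double fail_rate, unsigned runs)
{
    double p = 1.0;

    for (unsigned i = 0; i < runs; i++)
        p *= 1.0 - fail_rate;   /* each run must pass on its own */

    return p;
}
```

Calling all_pass_probability(139.0 / 250.0, 250) gives a value many orders of magnitude below one in a billion, so under that assumption the clean run on the "good" kernel is very unlikely to be luck.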
If my reading is correct e6cbd7f2efb433d717af72aa8510a9db6f7a7e05 just perturbs the physical page address. You have 6GiB, right? Could you try with mem=3G (my guess is that will limit it to the *low* 3G!)

That was a gen4 bug with >4G memory, but we believed it to be Broadwater/Crestline and not G4x; anyway it sounds like we should just restrict our allocations to DMA32. Try:

diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index fd882eb389d2..e6b48adf0fae 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -4720,7 +4720,7 @@ i915_gem_object_create(struct drm_i915_private *dev_priv, u64 size)
                goto fail;
 
        mask = GFP_HIGHUSER | __GFP_RECLAIMABLE;
-       if (IS_I965GM(dev_priv) || IS_I965G(dev_priv)) {
+       if (IS_GEN4(dev_priv)) {
                /* 965gm cannot relocate objects above 4GiB. */
                mask &= ~__GFP_HIGHMEM;
                mask |= __GFP_DMA32;
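The effect of the patch on the allocation mask can be modelled in userspace as below. This is only a sketch: the SKETCH_GFP_* values are hypothetical stand-ins (the kernel's real __GFP_* constants differ and vary between versions), and sketch_backing_page_mask() is an invented name, not a kernel function.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag bits for illustration; NOT the kernel's values. */
enum {
    SKETCH_GFP_HIGHMEM = 1u << 0, /* pages may come from high memory */
    SKETCH_GFP_DMA32   = 1u << 1, /* pages must sit below 4GiB */
    SKETCH_GFP_OTHER   = 1u << 2  /* stands in for every other requested bit */
};

/* Mirror the mask adjustment in the diff: gen4 drops HIGHMEM, adds DMA32. */
uint32_t sketch_backing_page_mask(bool is_gen4)
{
    uint32_t mask = SKETCH_GFP_OTHER | SKETCH_GFP_HIGHMEM;

    if (is_gen4) {
        mask &= ~(uint32_t)SKETCH_GFP_HIGHMEM;
        mask |= SKETCH_GFP_DMA32;
    }
    return mask;
}
```

The point of the change is only which platforms take the branch: widening the check from the two 965 variants to all of gen4 makes GM45 request pages below 4GiB as well.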
I tried testing your theory and it doesn't seem to hold up. I have tested several different mem= and physical module configurations. For the mem= options, I verified the available memory each boot with the "free" program and/or "htop." With 8G (4G + 4G) physical RAM, I tested mem= configs for every multiple of 512M, from 6G (inclusive) to 7.5G, as well as 8G without the mem= option. None of those could produce the bug. I should note that the 6G config here had less available RAM than the physical 6G RAM configuration (by precisely 154MiB). With 6G (4G + 2G) physical RAM, I tested mem= configs for every multiple of 512M, from 3G (inclusive) to 6G. Except for 6G, none of those could produce the bug. The order of the RAM modules on the board did not matter (thankfully). With 5G (4G + 1G), 4G (2G + 2G), and 4G (4G + empty slot) physical RAM configurations, I didn't bother testing mem= configs and booted each with all available RAM. None of these could produce the bug. With 3G (2G + 1G) physical RAM, I could produce the bug! I should note that the kernel logged the messages: mtrr_cleanup: can not find optimal value please specify mtrr_gran_size/mtrr_chunk_size I think this issue only comes up with recent (4.15+?) kernels though. I did not test 2G (2G + empty slot), 2G (1G + 1G), or 1G (1G + empty slot). Other possible configurations for my hardware are (4G + 512M [maybe]), (2G + 512M), (1G + 512M), (1G + empty slot), (512M + 512M), and (512M + empty slot). I can't test these, though, since I don't have any 512M RAM modules. That all said, if we use the 6G and 3G configs as a basis for a pattern, then we could assume that configs of (x + x/2) produce the bug, and that the (1G + 512M) config would also produce the bug. That is all speculation, though. I'm more confused as to how such a bug only decided to show up after the bad commit that I bisected to.
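The (x + x/2) guess can be checked against the full-memory boots listed above with a small table. This is a sketch of the speculation only: it covers just the configurations booted with all RAM available (the mem= runs are left out), and the struct and function names are invented for the illustration.

```c
#include <stdbool.h>
#include <stddef.h>

/* One tested module pairing, sizes in MiB, with the observed outcome. */
struct dimm_config {
    unsigned larger_mib;
    unsigned smaller_mib;
    bool reproduced;
};

/* Full-memory boots reported in this comment. */
static const struct dimm_config observed[] = {
    { 4096, 4096, false }, /* 8G (4G + 4G) */
    { 4096, 2048, true  }, /* 6G (4G + 2G) */
    { 4096, 1024, false }, /* 5G (4G + 1G) */
    { 2048, 2048, false }, /* 4G (2G + 2G) */
    { 4096,    0, false }, /* 4G (4G + empty slot) */
    { 2048, 1024, true  }, /* 3G (2G + 1G) */
};

/* Speculative rule: the bug shows up when one module is twice the other. */
bool predicts_bug(unsigned larger_mib, unsigned smaller_mib)
{
    return smaller_mib != 0 && larger_mib == 2 * smaller_mib;
}

/* Count observations that the speculative rule gets wrong. */
size_t count_mismatches(void)
{
    size_t mismatches = 0;

    for (size_t i = 0; i < sizeof(observed) / sizeof(observed[0]); i++) {
        if (predicts_bug(observed[i].larger_mib, observed[i].smaller_mib)
            != observed[i].reproduced)
            mismatches++;
    }
    return mismatches;
}
```

The rule agrees with all six observations and would also predict a failure for the untested (1G + 512M) pairing, but six data points are far too few to treat it as more than a guess.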
L-shaped memory! In an uneven config (where the different channels have a different number of banks on them) the swizzling depends on whereabouts in memory the page is (as the system splits into a dual-channel portion and a single-channel portion). For the uneven configs where you couldn't reproduce the issue, I expect it just so happened that we didn't get memory placed in different regions (or the imbalance was so great that the system didn't even try a dual/single setup). In light of that, bisecting to a commit that changed the allocation pattern still makes sense (albeit that it just changes the likelihood of receiving differently swizzled pages). Hmm. Oh, I see the problem. We are looking at a CPU mmap of the backing page to check the tiling pattern across the GTT mmap. That doesn't work if the swizzling varies (since the tiling pattern depends on the physical location of the page in memory, which we cannot know in userspace).
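A toy model of the problem, as a sketch only: assume pages in the dual-channel portion have bit 6 of the address XORed with bits 9 and 10 (one of the bit-6 swizzle patterns i915 knows about), while pages in the single-channel portion are not swizzled at all. The function name, the dual_channel_limit parameter, and the choice of pattern are assumptions made for the illustration, not a description of what GM45 actually does.

```c
#include <stdint.h>

/*
 * Toy model: where does byte `offset` within a tiled page end up, given
 * the physical address of that page? Pages below `dual_channel_limit`
 * are treated as interleaved (bit 6 ^= bit 9 ^ bit 10); pages at or
 * above it are treated as single-channel and left untouched.
 */
uint64_t sketch_swizzle(uint64_t offset, uint64_t phys_page,
                        uint64_t dual_channel_limit)
{
    if (phys_page >= dual_channel_limit)
        return offset; /* single-channel portion: no swizzle */

    uint64_t bit9 = (offset >> 9) & 1;
    uint64_t bit10 = (offset >> 10) & 1;

    return offset ^ ((bit9 ^ bit10) << 6);
}
```

In this model the same offset maps to different locations depending only on phys_page, which is exactly the input a userspace test does not have, so it cannot predict the pattern it should see through the CPU mmap.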
Created attachment 140096 [details] [review] Require knonw swizlling Ok, this patch should make it skip on any config that might cause us to fail.
The second part of that patch doesn't apply for me, at least with my version of the latest igt git. I could probably fix it, but I don't trust my results to be reliable in that case.
Created attachment 140097 [details] [review] Require known swizzling Take two.
Okay, so the tiny i915 patch, on drm-tip, does "fix" the issue, at least for the 6G config. It might be a different story for the 3G config since it's all 32-bit (36-bit?) addressable, from what I can understand. I haven't tested it, though. The patched IGT test works, but all it does is skip every time. Is that intentional? It does that both on a 6G system and a 4G (2G + 2G) system.
(In reply to Adric Blake from comment #13)
> Okay, so the tiny i915 patch, on drm-tip, does "fix" the issue, at least for
> the 6G config. It might be a different story for the 3G config since it's
> all 32-bit (36-bit?) addressable, from what I can understand. I haven't
> tested it, though.

Interesting. If we could work out the dual-channel portion, we could try and restrict our physical pages to that portion... But our choice is limited to DMA32, so it'll only work for a few configs.

> The patched IGT test works, but all it does is skip every time. Is that
> intentional? It does that both on a 6G system and a 4G (2G + 2G) system.

Yes. Your system will be reporting that it is using the address of the physical page as a component in its swizzling, which the igt doesn't take into account and so its assertions are flawed.
commit a0f2d23b7d3d4226a0a7637a9240bfa86f08c1d3 (HEAD, upstream/master)
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Fri Jun 8 17:29:46 2018 +0100

    igt/gem_mmap_gtt: Checking tiling pattern requires known swizzling

    As the swizzling is baked into the tiling pattern, the swizzling has to
    be consistent across the entire GTT mmap for our tests to work. However,
    under L-shaped memory configurations on older architectures, the
    swizzling varied depending on which region the page found itself in --
    invalidating our assumptions and ability to predict the tiling pattern.

    Reported-by: Adric Blake <promarbler14@gmail.com>
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=106848
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Reporter, could you please verify and let us know if it is okay to close the issue?
Please, go ahead! The issue is resolved.