Bug 56859

Summary: [SNB regression] i-g-t gem_tiled_swapping fails
Product: DRI
Component: DRM/Intel
Status: CLOSED FIXED
Severity: major
Priority: high
Version: unspecified
Hardware: All
OS: Linux (All)
Whiteboard:
i915 platform:
i915 features:
Reporter: lu hua <huax.lu>
Assignee: Daniel Vetter <daniel>
QA Contact: Intel GFX Bugs mailing list <intel-gfx-bugs>
CC: ben, bingx.a.yan, chris, daniel, jbarnes, yi.sun
Attachments:
Description Flags
dmesg none

Description lu hua 2012-11-08 08:02:43 UTC
System Environment:
--------------------------
Arch:           i386
Platform:       Sandybridge
Mesa:           (master) 5cbc0f00368b9ddc127007be2bd7f60940aa93ed
Kernel:         (drm-intel-nightly) b5a833707960154164cf450647c76547be43a167

Bug detailed description:
-------------------------
gem_tiled_swapping fails on Sandybridge with the -nightly branch. It does not fail on the -fixes branch.
output:
mismatch at 254208: -378754475

The last known bad commit: b5a833707960154164cf450647c76547be43a167 (Merge: afef67f 4a8dece)
The last known good commit: 032e254cefb0485c95aceca269be499b91f48aa0 (Merge: 8c74a16 b6e0e54)

Reproduce steps:
----------------
1. ./gem_tiled_swapping
Comment 1 Daniel Vetter 2012-11-08 15:25:10 UTC
Can you please bisect this regression?
Comment 2 lu hua 2012-11-13 05:48:39 UTC
Bisect shows: 7f1290f2f2a4d2c3f1b7ce8e87256e052ca23125 is the first bad commit
commit 7f1290f2f2a4d2c3f1b7ce8e87256e052ca23125
Author: Jianguo Wu <wujianguo@huawei.com>
Date:   Mon Oct 8 16:33:06 2012 -0700

    mm: fix-up zone present pages

    I think zone->present_pages indicates pages that buddy system can management,
    it should be:

        zone->present_pages = spanned pages - absent pages - bootmem pages,

    but is now:
        zone->present_pages = spanned pages - absent pages - memmap pages.

    spanned pages: total size, including holes.
    absent pages: holes.
    bootmem pages: pages used in system boot, managed by bootmem allocator.
    memmap pages: pages used by page structs.

    This may cause zone->present_pages less than it should be.  For example,
    numa node 1 has ZONE_NORMAL and ZONE_MOVABLE, it's memmap and other
    bootmem will be allocated from ZONE_MOVABLE, so ZONE_NORMAL's
    present_pages should be spanned pages - absent pages, but now it also
    minus memmap pages(free_area_init_core), which are actually allocated from
    ZONE_MOVABLE.  When offlining all memory of a zone, this will cause
    zone->present_pages less than 0, because present_pages is unsigned long
    type, it is actually a very large integer, it indirectly caused
    zone->watermark[WMARK_MIN] becomes a large
    integer(setup_per_zone_wmarks()), than cause totalreserve_pages become a
    large integer(calculate_totalreserve_pages()), and finally cause memory
    allocating failure when fork process(__vm_enough_memory()).

    [root@localhost ~]# dmesg
    -bash: fork: Cannot allocate memory

    I think the bug described in

      http://marc.info/?l=linux-mm&m=134502182714186&w=2

    is also caused by wrong zone present pages.

    This patch intends to fix-up zone->present_pages when memory are freed to
    buddy system on x86_64 and IA64 platforms.
Comment 3 Daniel Vetter 2012-11-13 10:21:12 UTC
Two things to test:

- Can you please check whether reverting the bisected commit on top of dinq resolves the issue?

- Before we report this problem upstream it's good to test whether it's fixed already. I've pushed out a for-QA branch with the latest dinq, -fixes and upstream git from Linus all merged together. Please test that.
Comment 4 lu hua 2012-11-14 05:45:10 UTC
Created attachment 70052 [details]
dmesg
Comment 5 lu hua 2012-11-14 05:46:24 UTC
It works when the bisected commit is reverted.

It also fails on the for-QA branch.
Tested on commit 104ec25077751a0abbd9f523a48b7f84e6842ea3 (Merge: c8928b6 9924a19).
Comment 6 Daniel Vetter 2012-11-14 09:40:32 UTC
For paranoia: Can you please run memtester on the affected box, to rule out memory corruption?
Comment 7 Chris Wilson 2012-11-14 09:56:46 UTC
I also observe the bug on a SNB i5-2520m (32-bit PAE with 3GiB), and can confirm the revert fixes gem_tiled_swapping.
Comment 8 Daniel Vetter 2012-11-14 13:48:40 UTC
Can you please test the patch at https://lkml.org/lkml/2012/11/5/866 ?
Comment 9 Chris Wilson 2012-11-14 14:34:53 UTC
Patch worksforme. I see it already is in mmotm, so close?
Comment 10 Chris Wilson 2012-11-14 15:21:04 UTC
Hmm, machine later died completely whilst idle. Possibly unrelated, but unlikely...
Comment 11 Gordon Jin 2012-11-15 02:28:11 UTC
Looks like Chris has answered, so clearing needinfo.
Comment 12 Daniel Vetter 2012-12-05 20:27:28 UTC
Offending patch has been reverted in upstream Linus' git:

commit 5576646f3c1abd60d72d19829de6f5d8c2ca8ecf
Author: Andrew Morton <akpm@linux-foundation.org>
Date:   Fri Nov 16 14:15:06 2012 -0800

    revert "mm: fix-up zone present pages"

It's not yet in any of the branches merged together with -nightly though.
Comment 13 lu hua 2012-12-10 05:34:08 UTC
Fixed on -nightly branch.
Still happens on -queued branch.
Comment 14 lu hua 2012-12-27 08:12:25 UTC
Verified. Fixed.
Comment 15 Daniel Vetter 2013-01-08 08:09:22 UTC
*** Bug 59095 has been marked as a duplicate of this bug. ***
Comment 16 Elizabeth 2017-10-06 14:47:52 UTC
Closing old verified.
