28813 – [GM45] broken swizzling in swap-in/out paths/L-shaped memory swizzling

Summary: [GM45] broken swizzling in swap-in/out paths/L-shaped memory swizzling

Status:	CLOSED FIXED

Alias:	None

Product:	DRI
Classification:	Unclassified
Component:	DRM/Intel
Version:	unspecified
Hardware:	x86-64 (AMD64) Linux (All)

Importance:	low normal
Assignee:	Intel GFX Bugs mailing list
QA Contact:	Intel GFX Bugs mailing list

URL:
Whiteboard:
Keywords:

Duplicates (4):	31704 86898 87065 88924
Depends on:
Blocks:

Reported:	2010-06-29 04:16 UTC by steubens
Modified:	2017-07-24 23:07 UTC
CC List:	14 users

See Also:
i915 platform:
i915 features:

Attachments
large sample of visible corruption on wallpaper (776.71 KB, image/png) 2010-06-29 04:16 UTC, steubens
bug as seen in browser, returns even after image is fetched and decoded again (clean cache) (13.35 KB, image/png) 2010-06-29 04:17 UTC, steubens
as seen in nautilus, window drop shadows were also uniquely corrupted (not typical of other patterns) (5.09 KB, image/png) 2010-06-29 04:18 UTC, steubens
RAM layout, R500 Lenovo laptop w/GM45 (13.88 KB, text/plain) 2012-01-08 11:51 UTC, Ray Lee
Output from `make test` minus gem_partial_pwrite_pread test (6.66 KB, text/plain) 2012-01-08 11:53 UTC, Ray Lee
Dmesg from the R500, note the warn/oops at the end (77.58 KB, text/plain) 2012-01-08 11:55 UTC, Ray Lee
Register dump from within X (11.13 KB, text/plain) 2012-01-08 11:57 UTC, Ray Lee
Corruption of wallpaper (1.82 MB, image/png) 2012-01-17 07:17 UTC, Ray Lee
Hack to keep tiled pages pinned (1.90 KB, patch) 2012-12-21 09:35 UTC, Chris Wilson
Hack to keep tiled pages pinned (1.86 KB, patch) 2012-12-21 12:38 UTC, Chris Wilson
kernel bug screen (1.67 MB, image/png) 2012-12-21 20:55 UTC, Guillaume Seguin
Hack to keep tiled pages pinned (1.87 KB, patch) 2012-12-27 14:11 UTC, Chris Wilson
Print out more DRAM info (2.53 KB, patch) 2013-07-03 16:42 UTC, Chris Wilson
Print out more DRAM info (mobiles), step 2 (3.00 KB, patch) 2013-07-03 17:00 UTC, Chris Wilson
A lead? (967 bytes, patch) 2013-07-03 17:01 UTC, Chris Wilson
Hack, redux (3.30 KB, patch) 2013-07-03 17:12 UTC, Chris Wilson
screenshots of possibly related bug (124.62 KB, image/png) 2015-02-14 13:23 UTC, fennectech
another screenshot of a possibly related bug (165.05 KB, image/png) 2015-02-14 13:24 UTC, fennectech

Description steubens 2010-06-29 04:16:05 UTC

Created attachment 36603
large sample of visible corruption on wallpaper

Bug description:
getting screen corruption in the form of shifted or transposed small tiles; it seems to happen with certain patterns/colors (refreshing a page in firefox, for example, shows the same image corruption even after the image is decoded again)
i'm _always_ running compiz

System environment:
-- chipset:
[    46.235] (II) intel(0): Integrated Graphics Chipset: Intel(R) GM45

-- system architecture: 64-bit

-- xf86-video-intel:
[    45.954] (II) Module intel: vendor="X.Org Foundation"
[    45.954]    compiled for 1.8.1.902, module version = 2.12.0
[    45.954]    Module class: X.Org Video Driver
[    45.954]    ABI class: X.Org Video Driver, version 7.0

-- xserver:
X.Org X Server 1.8.1.902 (1.8.2 RC 2)

-- mesa:
OpenGL renderer string: Mesa DRI Mobile Intel® GM45 Express Chipset GEM 20100330 DEVELOPMENT 
OpenGL version string: 2.1 Mesa 7.9-devel

-- libdrm:
$ pkg-config --modversion libdrm
2.4.21

-- kernel:
2.6.32-23-generic

-- Linux distribution:
Ubuntu (10.04)
using the xorg-edgers PPA, but the same bug occurs with the stock 10.04 packages

-- Machine or mobo model:
$ cat /sys/class/dmi/id/product_name 
Compaq Presario CQ60 Notebook PC

-- Display connector:
internal LVDS

Reproducing steps:
happens after a period of time (2 hours?) without fail, though i suspect my huge firefox session helps it along

Additional info:

Comment 1 steubens 2010-06-29 04:17:20 UTC

Created attachment 36604
bug as seen in browser, returns even after image is fetched and decoded again (clean cache)

Comment 2 steubens 2010-06-29 04:18:48 UTC

Created attachment 36605
as seen in nautilus, window drop shadows were also uniquely corrupted (not typical of other patterns)

Comment 3 Eric Anholt 2010-07-11 22:09:12 UTC

Probably A17 issues after paging the image out.  Assigning to myself, but unlikely to make progress on it.

Comment 4 Chris Wilson 2010-07-12 15:11:06 UTC

The idle thought I had was that perhaps the MOVABLE flag is allowing the system to move a17 pages without our realisation.

So do things improve after:

commit 985b823b919273fe1327d56d2196b4f92e5d0fae
Author: Linus Torvalds <torvalds@linux-foundation.org>
Date:   Fri Jul 2 10:04:42 2010 +1000

    drm/i915: fix hibernation since i915 self-reclaim fixes
    
    Since commit 4bdadb9785696439c6e2b3efe34aa76df1149c83 ("drm/i915:
    Selectively enable self-reclaim"), we've been passing GFP_MOVABLE to the
    i915 page allocator where we weren't before due to some over-eager
    removal of the page mapping gfp_flags games the code used to play.
    
    This caused hibernate on Intel hardware to result in a lot of memory
    corruptions on resume.  See for example
    
      http://bugzilla.kernel.org/show_bug.cgi?id=13811
    
    Reported-by: Evengi Golov (in bugzilla)
    Signed-off-by: Dave Airlie <airlied@redhat.com>
    Tested-by: M. Vefa Bicakci <bicave@superonline.com>
    Cc: stable@kernel.org
    Cc: Chris Wilson <chris@chris-wilson.co.uk>
    Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
    Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

included with 2.6.35-rc4.

Comment 5 steubens 2010-07-13 03:52:03 UTC

(In reply to comment #4)
> included with 2.6.35-rc4.

it was at rc4 by the time i tried mainline; if there was any change, it's that it happens slightly more rarely and doesn't "spread" as fast (if at all)

Comment 6 Chris Wilson 2010-07-13 03:56:17 UTC

Appears I can't just wish this corruption away. Thanks for checking.

Comment 7 steubens 2010-09-03 01:44:32 UTC

i've been told the memory configuration might have something to do with it, the DCC register in particular, and "bit 11"
                 DCC: 0x000f0002 (dual channel interleaved, XOR randomization: enabled, XOR bit: 11)

and the full output of intel_reg_dumper
                 DCC: 0x000f0002 (dual channel interleaved, XOR randomization: enabled, XOR bit: 11)
           CHDECMISC: 0x00000000 (none, ch2 enh disabled, ch1 enh disabled, ch0 enh disabled, flex disabled, ep not present)
              C0DRB0: 0x000f0002 (0x0002)
              C0DRB1: 0x0000000f (0x000f)
              C0DRB2: 0x00000000 (0x0000)
              C0DRB3: 0x0c000000 (0x0000)
              C1DRB0: 0x00000000 (0x0000)
              C1DRB1: 0x00000000 (0x0000)
              C1DRB2: 0x00000000 (0x0000)
              C1DRB3: 0x00000000 (0x0000)
             C0DRA01: 0x00030c00 (0x0c00)
             C0DRA23: 0x001c0003 (0x0003)
             C1DRA01: 0x00000000 (0x0000)
             C1DRA23: 0x00000000 (0x0000)
          PGETBL_CTL: 0x00000001
   VCLK_DIVISOR_VGA0: 0x00031108 (n = 3, m1 = 17, m2 = 8)
   VCLK_DIVISOR_VGA1: 0x00031406 (n = 3, m1 = 20, m2 = 6)
       VCLK_POST_DIV: 0x00020002 (vga0 p1 = 4, p2 = 2, vga1 p1 = 2, p2 = 2)
           DPLL_TEST: 0x00010001 ()
        CACHE_MODE_0: 0x00006820
             D_STATE: 0x00000000
       DSPCLK_GATE_D: 0x1004000c (clock gates disabled: VRHUNIT DSSUNIT OVRUNIT OVCUNIT)
      RENCLK_GATE_D1: 0x00000000
      RENCLK_GATE_D2: 0x000002c0
               SDVOB: 0x0000001c (disabled, pipe A, stall disabled, detected)
               SDVOC: 0x00000018 (disabled, pipe A, stall disabled, not detected)
             SDVOUDI: 0x00000000
              DSPARB: 0x00000000
              DSPFW1: 0x1a901702
              DSPFW2: 0x00001f00
              DSPFW3: 0x20000000
                ADPA: 0x40008c18 (disabled, pipe B, +hsync, +vsync)
                LVDS: 0xc2308300 (enabled, pipe B, 18 bit, 1 channel)
                DVOA: 0x00000000 (disabled, pipe A, no stall, -hsync, -vsync)
                DVOB: 0x0000001c (disabled, pipe A, no stall, +hsync, +vsync)
                DVOC: 0x00000018 (disabled, pipe A, no stall, +hsync, +vsync)
         DVOA_SRCDIM: 0x00000000
         DVOB_SRCDIM: 0x00000000
         DVOC_SRCDIM: 0x00000000
          PP_CONTROL: 0x00000001 (power target: on)
           PP_STATUS: 0xc0000008 (on, ready, sequencing idle)
        PP_ON_DELAYS: 0x012c07d0
       PP_OFF_DELAYS: 0x012c157c
          PP_DIVISOR: 0x0033f305
        PFIT_CONTROL: 0x00000000
     PFIT_PGM_RATIOS: 0x00000000
     PORT_HOTPLUG_EN: 0x28000200
   PORT_HOTPLUG_STAT: 0x08000300
            DSPACNTR: 0x00000000 (disabled, pipe A)
          DSPASTRIDE: 0x00000000 (0 bytes)
             DSPAPOS: 0x00000000 (0, 0)
            DSPASIZE: 0x00000000 (1, 1)
            DSPABASE: 0x00000000
            DSPASURF: 0x00000000
         DSPATILEOFF: 0x00000000
           PIPEACONF: 0x00000000 (disabled, inactive)
            PIPEASRC: 0x00000000 (1, 1)
           PIPEASTAT: 0x00000000 (status:)
   PIPEA_GMCH_DATA_M: 0x00000000
   PIPEA_GMCH_DATA_N: 0x00000000
     PIPEA_DP_LINK_M: 0x00000000
     PIPEA_DP_LINK_N: 0x00000000
       CURSOR_A_BASE: 0x00000000
    CURSOR_A_CONTROL: 0x00000000
   CURSOR_A_POSITION: 0x00000000
                FPA0: 0x00021509 (n = 2, m1 = 21, m2 = 9)
                FPA1: 0x00021509 (n = 2, m1 = 21, m2 = 9)
              DPLL_A: 0x04040004 (disabled, non-dvo, VGA, default clock, DAC/serial mode, p1 = 3, p2 = 10)
           DPLL_A_MD: 0x00000003
            HTOTAL_A: 0x00000000 (1 active, 1 total)
            HBLANK_A: 0x00000000 (1 start, 1 end)
             HSYNC_A: 0x00000000 (1 start, 1 end)
            VTOTAL_A: 0x00000000 (1 active, 1 total)
            VBLANK_A: 0x00000000 (1 start, 1 end)
             VSYNC_A: 0x00000000 (1 start, 1 end)
           BCLRPAT_A: 0x00000000
        VSYNCSHIFT_A: 0x00000000
            DSPBCNTR: 0xd9000400 (enabled, pipe B)
          DSPBSTRIDE: 0x00001600 (5632 bytes)
             DSPBPOS: 0x00000000 (0, 0)
            DSPBSIZE: 0x00000000 (1, 1)
            DSPBBASE: 0x00000000
            DSPBSURF: 0x0588a000
         DSPBTILEOFF: 0x00000000
           PIPEBCONF: 0xc0000000 (enabled, active)
            PIPEBSRC: 0x055502ff (1366, 768)
           PIPEBSTAT: 0x00440100 (status: LBLC_EVENT_ENABLE SVBLANK_INT_ENABLE DLINE_COMPARE_STATUS)
   PIPEB_GMCH_DATA_M: 0x00000000
   PIPEB_GMCH_DATA_N: 0x00000000
     PIPEB_DP_LINK_M: 0x00000000
     PIPEB_DP_LINK_N: 0x00000000
       CURSOR_B_BASE: 0x00000000
    CURSOR_B_CONTROL: 0x10000000
   CURSOR_B_POSITION: 0x029a0148
                FPB0: 0x00021309 (n = 2, m1 = 19, m2 = 9)
                FPB1: 0x00021309 (n = 2, m1 = 19, m2 = 9)
              DPLL_B: 0x98046c00 (enabled, non-dvo, spread spectrum clock, LVDS mode, p1 = 3, p2 = 14)
           DPLL_B_MD: 0x00000000
            HTOTAL_B: 0x05c70555 (1366 active, 1480 total)
            HBLANK_B: 0x05c70555 (1366 start, 1480 end)
             HSYNC_B: 0x05a50585 (1414 start, 1446 end)
            VTOTAL_B: 0x030b02ff (768 active, 780 total)
            VBLANK_B: 0x030b02ff (768 start, 780 end)
             VSYNC_B: 0x03060301 (770 start, 775 end)
           BCLRPAT_B: 0x00000000
        VSYNCSHIFT_B: 0x00000000
   VCLK_DIVISOR_VGA0: 0x00031108
   VCLK_DIVISOR_VGA1: 0x00031406
       VCLK_POST_DIV: 0x00020002
            VGACNTRL: 0x80000000 (disabled)
              TV_CTL: 0x00000010
              TV_DAC: 0x70000000
            TV_CSC_Y: 0x00000000
           TV_CSC_Y2: 0x00000000
            TV_CSC_U: 0x00000000
           TV_CSC_U2: 0x00000000
            TV_CSC_V: 0x00000000
           TV_CSC_V2: 0x00000000
        TV_CLR_KNOBS: 0x00000000
        TV_CLR_LEVEL: 0x00000000
          TV_H_CTL_1: 0x00000000
          TV_H_CTL_2: 0x00000000
          TV_H_CTL_3: 0x00000000
          TV_V_CTL_1: 0x00000000
          TV_V_CTL_2: 0x00000000
          TV_V_CTL_3: 0x00000000
          TV_V_CTL_4: 0x00000000
          TV_V_CTL_5: 0x00000000
          TV_V_CTL_6: 0x00000000
          TV_V_CTL_7: 0x00000000
         TV_SC_CTL_1: 0x00000000
         TV_SC_CTL_2: 0x00000000
         TV_SC_CTL_3: 0x00000000
          TV_WIN_POS: 0x00000000
         TV_WIN_SIZE: 0x00000000
     TV_FILTER_CTL_1: 0x00000000
     TV_FILTER_CTL_2: 0x00000000
     TV_FILTER_CTL_3: 0x00000000
       TV_CC_CONTROL: 0x00000000
          TV_CC_DATA: 0x00000000
         TV_H_LUMA_0: 0x00000000
        TV_H_LUMA_59: 0x00000000
       TV_H_CHROMA_0: 0x00000000
      TV_H_CHROMA_59: 0x00000000
        FBC_CFB_BASE: 0x00000000
         FBC_LL_BASE: 0x00000000
         FBC_CONTROL: 0xe0000404
         FBC_COMMAND: 0x08c80034
          FBC_STATUS: 0x00000000
        FBC_CONTROL2: 0x00000000
       FBC_FENCE_OFF: 0x00000000
         FBC_MOD_NUM: 0x000000f2
             MI_MODE: 0x00000200
        MI_ARB_STATE: 0x00000040
      MI_RDRET_STATE: 0x00000000
             ECOSKPD: 0x00000307
                DP_B: 0x0000001c
      DPB_AUX_CH_CTL: 0x00050000
    DPB_AUX_CH_DATA1: 0x00000000
    DPB_AUX_CH_DATA2: 0x00000000
    DPB_AUX_CH_DATA3: 0x00000000
    DPB_AUX_CH_DATA4: 0x00000000
    DPB_AUX_CH_DATA5: 0x00000000
                DP_C: 0x00000018
      DPC_AUX_CH_CTL: 0x00050000
    DPC_AUX_CH_DATA1: 0x00000000
    DPC_AUX_CH_DATA2: 0x00000000
    DPC_AUX_CH_DATA3: 0x00000000
    DPC_AUX_CH_DATA4: 0x00000000
    DPC_AUX_CH_DATA5: 0x00000000
                DP_D: 0x0000001c
      DPD_AUX_CH_CTL: 0x00050000
    DPD_AUX_CH_DATA1: 0x00000000
    DPD_AUX_CH_DATA2: 0x00000000
    DPD_AUX_CH_DATA3: 0x00000000
    DPD_AUX_CH_DATA4: 0x00000000
    DPD_AUX_CH_DATA5: 0x00000000
          AUD_CONFIG: 0x00000004
    AUD_HDMIW_STATUS: 0x00000000
      AUD_CONV_CHCNT: 0x00000000
       VIDEO_DIP_CTL: 0x20000600
       AUD_PINW_CNTR: 0x00000140
         AUD_CNTL_ST: 0x00002000
         AUD_PIN_CAP: 0x00000094
        AUD_PINW_CAP: 0x004073bd
  AUD_PINW_UNSOLRESP: 0x00000000
    AUD_OUT_DIG_CNVT: 0x00000001
       AUD_OUT_CWCAP: 0x00006211
         AUD_GRP_CAP: 0x00000004
       FENCE START 0: 0x00000000 (disabled, X tile walk,  128 pitch, 0x00000000 start)
         FENCE END 0: 0x00000000 (                                   0x00000000 end)
       FENCE START 1: 0x00000000 (disabled, X tile walk,  128 pitch, 0x00000000 start)
         FENCE END 1: 0x00000000 (                                   0x00000000 end)
       FENCE START 2: 0x00000000 (disabled, X tile walk,  128 pitch, 0x00000000 start)
         FENCE END 2: 0x00000000 (                                   0x00000000 end)
       FENCE START 3: 0x044920ad ( enabled, X tile walk, 5632 pitch, 0x04492000 start)
         FENCE END 3: 0x04991000 (                                   0x04991000 end)
       FENCE START 4: 0x0588a0ad ( enabled, X tile walk, 5632 pitch, 0x0588a000 start)
         FENCE END 4: 0x05d89000 (                                   0x05d89000 end)
       FENCE START 5: 0x0636e01d ( enabled, X tile walk, 1024 pitch, 0x0636e000 start)
         FENCE END 5: 0x06373000 (                                   0x06373000 end)
       FENCE START 6: 0x0632601d ( enabled, X tile walk, 1024 pitch, 0x06326000 start)
         FENCE END 6: 0x0632b000 (                                   0x0632b000 end)
       FENCE START 7: 0x00000000 (disabled, X tile walk,  128 pitch, 0x00000000 start)
         FENCE END 7: 0x00000000 (                                   0x00000000 end)
       FENCE START 8: 0x0d6e701d ( enabled, X tile walk, 1024 pitch, 0x0d6e7000 start)
         FENCE END 8: 0x0d6ec000 (                                   0x0d6ec000 end)
       FENCE START 9: 0x00000000 (disabled, X tile walk,  128 pitch, 0x00000000 start)
         FENCE END 9: 0x00000000 (                                   0x00000000 end)
      FENCE START 10: 0x00000000 (disabled, X tile walk,  128 pitch, 0x00000000 start)
        FENCE END 10: 0x00000000 (                                   0x00000000 end)
      FENCE START 11: 0x00000000 (disabled, X tile walk,  128 pitch, 0x00000000 start)
        FENCE END 11: 0x00000000 (                                   0x00000000 end)
      FENCE START 12: 0x00000000 (disabled, X tile walk,  128 pitch, 0x00000000 start)
        FENCE END 12: 0x00000000 (                                   0x00000000 end)
      FENCE START 13: 0x00000000 (disabled, X tile walk,  128 pitch, 0x00000000 start)
        FENCE END 13: 0x00000000 (                                   0x00000000 end)
      FENCE START 14: 0x00000000 (disabled, X tile walk,  128 pitch, 0x00000000 start)
        FENCE END 14: 0x00000000 (                                   0x00000000 end)
      FENCE START 15: 0x00000000 (disabled, X tile walk,  128 pitch, 0x00000000 start)
        FENCE END 15: 0x00000000 (                                   0x00000000 end)
             INST_PM: 0x00000000
SDVO phase shift 0 out of range -- probobly not an issue.
pipe A dot 100800 n 2 m1 21 m2 9 p1 3 p2 10
pipe B dot 69047 n 2 m1 19 m2 9 p1 3 p2 14

Comment 8 Serge Gavrilov 2010-10-28 15:12:12 UTC

I can easily reproduce this :(. Thinkpad X301, gentoo, kernel 2.6.36, xorg-server 1.9.1, mesa 7.8.2, xf86-video-intel-2.13.0, libdrm 2.4.22. 

Can I help somehow?

Comment 9 Serge Gavrilov 2010-10-28 15:51:52 UTC

Sorry, in my case this happens after hibernate. Perhaps this is another bug?

Comment 10 steubens 2010-10-28 21:00:46 UTC

yes, the hibernate version is probably a different bug, but might be related

Comment 11 Daniel Vetter 2011-11-07 14:37:26 UTC

If my analysis of the visual corruptions is correct, this should be fixed by my swizzling branch, specifically

http://cgit.freedesktop.org/~danvet/drm/commit/?h=swizzling&id=b337d071edfc0a161667fb5eb67188f6d1e07428

Comment 12 Daniel Vetter 2011-11-08 02:18:43 UTC

Ok, my analysis is wrong. Looking further at the corruptions on the wallpaper, it looks like bit11/10/9 get swizzled into bit6 for an entire X-tiled page. If the DCC from the intel_reg_dump comes from the same configuration as the screenshot, then that would mean that some of the pages get swizzled (with the detected bit11/10/9 configuration) and some don't. Once the pages go through swap once and get reallocated, all hell breaks loose.

Comment 13 Ray Lee 2012-01-08 09:27:31 UTC

I have a GM45 integrated GPU (Lenovo R500 laptop) and suffer from the same bug. The issue seems correlated with page faults, whether those faults come from memory pressure during normal usage or due to all the shenanigans that suspend and hibernate put on the MM subsystem.

Running v1.1 of intel-gpu-tools's make test shows failure of the pread tests designed to find the swizzling bugs (didn't save the log the first time, and I'm in X at the moment so can't rerun). I can attach that log as well as whatever else should someone deem them useful, otherwise I'll keep the noise level down.

Regardless, please ask if there are things you wish me to try. After skimming google on the topic of the GM45 and swizzle bugs it sounds like the core issue is the mixed (L-shaped) memory layout? In which case perhaps the swizzle memory region boundary can be detected as per the i-g-t tests?

Finally, thanks for all your good work on the i915 driver, despite the hiccups it's still nice to finally have a more modern graphics stack under Linux.

Comment 14 Daniel Vetter 2012-01-08 09:49:09 UTC

Well, if you can easily reproduce it with the i-g-t test, we might get
a handle on this. Please attach the following things:
- anything the test spits out.
- detailed description of your ram dimms (i.e. which slot, total
sizes, ranks, anything else you can grab from the bios). Alternatively
you can also read out the dimm details using the i2c tool decode-dimm.
- an i915 register dump from i-g-t/tools/intel_reg_dumper.
- full dmesg.

Thanks, Daniel

Comment 15 Ray Lee 2012-01-08 11:51:48 UTC

Created attachment 55310
RAM layout, R500 Lenovo laptop w/GM45

This is actually the full DMI decode for the laptop BIOS, but the RAM section is located almost at the very top. Two slots filled, 3072MB memory.

Comment 16 Ray Lee 2012-01-08 11:53:52 UTC

Created attachment 55311 [details]
Output from `make test` minus gem_partial_pwrite_pread test

the gem_partial_pwrite_pread test spins at 100% CPU for over 11 minutes, repeatedly (twice, anyway). Killed it and removed the test from tests/Makefile to get the other tests running. If I was merely being impatient, I can rerun it, but looking at the source for that test I'm at a loss as to why it was spinning so long.

Comment 17 Ray Lee 2012-01-08 11:55:42 UTC

Created attachment 55312
Dmesg from the R500, note the warn/oops at the end

After the kernel whinged I saved things off, tried to bring X up again in hopes of it being merely a warning, hit a hard lock. Rebooted.

Comment 18 Ray Lee 2012-01-08 11:57:41 UTC

Created attachment 55313
Register dump from within X

output of the i-g-t/tools/intel_reg_dumper. Run from within X, but can rerun from outside X if that's important.

Comment 19 Daniel Vetter 2012-01-11 02:50:56 UTC

On Sun, Jan 08, 2012 at 07:55:42PM +0000, bugzilla-daemon@freedesktop.org wrote:
> https://bugs.freedesktop.org/show_bug.cgi?id=28813
> 
> --- Comment #17 from Ray Lee <ray-bugzilla@madrabbit.org> 2012-01-08 11:55:42 PST ---
> Created attachment 55312 [details]
>   --> https://bugs.freedesktop.org/attachment.cgi?id=55312
> Dmesg from the R500, note the warn/oops at the end
> 
> After the kernel whinged I saved things off, tried to bring X up again in hopes
> of it being merely a warning, hit a hard lock. Rebooted.

The oops is a known bug in one of the debugfs files. Patch is on track to get
merged.

Comment 20 Daniel Vetter 2012-01-11 03:30:20 UTC

The intel-gpu-tools package also includes tools/intel_reg_dumper to read out arbitrary registers. Can you read out the following ones:
0x10200, 0x10204, 0x100e0, 0x11234, 0x11334

Also, do you by chance have another 1G dimm so that you could create a symmetric 2x 1G memory configuration? If that's possible, please check with the i-g-t tests whether that works (it should) and grab the same set of registers.

Comment 21 Ray Lee 2012-01-11 10:15:08 UTC

I suspect you wanted intel_reg_read? If so:

$ for i in 0x10200 0x10204 0x100e0 0x11234 0x11334; do sudo ./intel_reg_read $i; done
0x10200 : 0xF0002
0x10204 : 0x10
0x100E0 : 0x0
0x11234 : 0x910C1800
0x11334 : 0x910C1800

Sadly, I have no matching memory for this machine. Checking online it looks like I could use this as an excuse to cheaply upgrade to 2 x 4G. OTOH I'm running a 32-bit kernel at the moment, and I've no idea how that layout would interact with the kernel and driver, so I'll let you guide me on that one. (I suppose I could try to reproduce this with a 64-bit kernel, to rule that out.)

Anyway, let me know if the above isn't what you were looking for, or if you think you could serve as my excuse for an upgrade :-)

Comment 22 Daniel Vetter 2012-01-12 02:52:40 UTC

Totally forgotten: Ray, can you please also add a screenshot of typical corruptions? Just to check that the pattern of the corruption is what I expect it to be ... (i.e. some larger picture where it's easy to guess what it should really look like is ideal).

Comment 23 Daniel Vetter 2012-01-12 03:12:18 UTC

A few other things to test:

Can you boot with mem=2G and check with the i-g-t tests that things work as expected?

If you're adventurous and have decent backups, can you try

intel_reg_write 0x10204 0x100010

This will set bit 20 in DCC2, which I hope controls the xor decoding of the gpu. If we're unlucky it also controls xor decoding for the cpu, which will result in random-looking corruptions in the top 1G of memory. Which will surely eat any data that's there :(

Still, from an in-depth reading of the mch docs that bit is about my only hope, so if you dare to test this (maybe on a new install on an otherwise empty hd), it would be interesting to see what happens.

Maybe attach the screenshot with corruptions and do the mem=2G test first ;-)

Comment 24 Ray Lee 2012-01-17 07:17:13 UTC

Created attachment 55678
Corruption of wallpaper

Never a failure around when you need one. This took a while to occur, despite trying to encourage the issue with memory pressure and suspend cycles. Corruption in this case was limited to wallpaper, but on other occasions window borders or fonts would also exhibit corruption in the same way.

Comment 25 Daniel Vetter 2012-01-17 09:35:45 UTC

> --- Comment #24 from Ray Lee <ray-bugzilla@madrabbit.org> 2012-01-17 07:17:13 PST ---
> Created attachment 55678 [details]
>  --> https://bugs.freedesktop.org/attachment.cgi?id=55678
> Corruption of wallpaper

Thanks for the picture; it's definitely a swizzling mismatch failure
exactly like in the screenshot already attached to the bug.

Comment 26 Ray Lee 2012-01-30 09:38:01 UTC

Apologies for the delay, but due to travel constraints I'm going to need a bit more time. I think I can run through the tests you asked, however, by booting from a USB device instead. I'll leave my root partition unmounted which should keep things happy. I am definitely running a backup first, though :-)

Comment 27 Daniel Vetter 2012-01-30 10:11:18 UTC

On Mon, Jan 30, 2012 at 18:38,  <bugzilla-daemon@freedesktop.org> wrote:
> https://bugs.freedesktop.org/show_bug.cgi?id=28813
>
> --- Comment #26 from Ray Lee <ray-bugzilla@madrabbit.org> 2012-01-30 09:38:01 PST ---
> Apologies for the delay, but due to travel constraints I'm going to need a bit
> more time. I think I can run through the tests you asked, however, by booting
> from a USB device instead. I'll leave my root partition unmounted which should
> keep things happy. I am definitely running a backup first, though :-)

Someone else already tried it on an i965gm with the same issues and it
doesn't work: the hw doesn't allow this bit to be flipped after initial
setup :( I'll look into other options, but that might take a while ...

Comment 28 Daniel Vetter 2012-02-20 13:25:43 UTC

*** Bug 31704 has been marked as a duplicate of this bug. ***

Comment 29 Serge Gavrilov 2012-04-01 03:41:32 UTC

Still reproducible with 

kernel 3.2.11
mesa 8.0.2
xf86-video-intel 2.17
xorg-server 1.11.4
libdrm 2.4.33

Gentoo ~x86, Thinkpad X301.

Please, help!

Comment 30 Daniel Vetter 2012-04-01 03:44:41 UTC

> --- Comment #29 from Serge Gavrilov <serge@pdmi.ras.ru> 2012-04-01
> Please, help!

Well, I now have a gm45 machine myself and I can reproduce this issue
with the intel-gpu-tools test. Bad news is that I still have no idea
how we could fix this :(

Comment 31 Serge Gavrilov 2012-05-25 03:02:49 UTC

Seems to be related with https://bugzilla.kernel.org/show_bug.cgi?id=37142

They claim there that tuxonice kernel does not suffer from corruption after hibernate. Trying to check this...

Comment 32 Daniel Vetter 2012-05-25 03:37:43 UTC

On Fri, May 25, 2012 at 12:02 PM,  <bugzilla-daemon@freedesktop.org> wrote:
> --- Comment #31 from Serge Gavrilov <serge@pdmi.ras.ru> 2012-05-25 03:02:49 PDT ---
> Seems to be related with https://bugzilla.kernel.org/show_bug.cgi?id=37142
>
> They claim there that tuxonice kernel does not suffer from corruption after
> hibernate. Trying to check this...

Imo a red herring. If you swap out and then swap in at a different place,
things can get corrupted. Maybe tuxonice tries harder to swap stuff
back in at the same place, but that in no way fixes the underlying
bug.

Comment 33 Guillaume Seguin 2012-12-15 21:50:28 UTC

Is there a way to help track and fix this bug?

I have been having it for years, on both Ubuntu and Arch Linux, with kernels from 2.6.32 to 3.6.9 at least and numerous X.org and xorg-video-intel versions.

Test system is a Dell Latitude E4200 with GM45.

Comment 34 Chris Wilson 2012-12-21 09:35:51 UTC

Created attachment 71911
Hack to keep tiled pages pinned

This should prevent tiled buffers from being paged out whilst they are referenced by userspace.

Comment 35 Chris Wilson 2012-12-21 12:38:17 UTC

Created attachment 71924
Hack to keep tiled pages pinned

Comment 36 Guillaume Seguin 2012-12-21 20:55:49 UTC

Created attachment 71948
kernel bug screen

Hm, do I need some other patch/some updated driver? I applied this patch to linux-3.7.1 (using Arch's testing linux-3.7.1-2 https://www.archlinux.org/packages/testing/x86_64/linux/) and the kernel died when X started (see attached picture for the (blurry) kernel backtrace).

Test setup was running xorg-video-intel 2.20.16 and xorg-server 1.13.1

Comment 37 Chris Wilson 2012-12-27 14:11:53 UTC

Created attachment 72183
Hack to keep tiled pages pinned

Sorry about that bad patch, hopefully this one is less buggy.

Comment 38 Jani Nikula 2013-01-08 14:51:17 UTC

(In reply to comment #37)
> Created attachment 72183
> Hack to keep tiled pages pinned
> 
> Sorry about that bad patch, hopefully this one is less buggy.

Guillaume, please retest with this patch.

Comment 39 Serge Gavrilov 2013-02-02 19:35:56 UTC

(In reply to comment #37)
> Created attachment 72183
> Hack to keep tiled pages pinned
> 
> Sorry about that bad patch, hopefully this one is less buggy.

I have tested the latest patch applied against kernel 3.7.4 (Gentoo, x86, Lenovo X301). Kernel seems to be stable, and I cannot reproduce the problem. Thank you very much for squashing this very old and ugly bug! 

Is it possible to get the patch into the mainline kernel?

Again, thanks a lot.

Comment 40 Chris Wilson 2013-02-02 21:35:30 UTC

That's good to know. The last remaining detail is a way to detect L-shaped memory configurations. There should be enough info in the CSPEC to tell that the memory has different swizzle regions (even if not enough info to work out the swizzle), as we really only want to enable this page pinning as a last resort.

Comment 41 Chris Wilson 2013-07-03 16:26:43 UTC

Found a potential lead in the 4 series chipset datasheet:

5.2.1
CHDECMISC—Channel Decode Miscellaneous

B/D/F/Type: 0/0/0/MCHBAR
Address Offset: 111h
Default Value: 00h
Access: R/W/L, R/W
Size: 8 bits

Bit 4, written by bios then locked:

L-Shaped GFX Tile Cycle (LGFXTLCYC): This bit forces
graphics tiled cycles in L-shaped memory configuration to
modify bit 6 of the address. This field should be set to 1
only when L-mode memory configuration is enabled and
should be set to 0 for all other memory configurations.
This bit is locked by ME stolen Memory lock.

Comment 42 Chris Wilson 2013-07-03 16:42:50 UTC

Created attachment 81967
Print out more DRAM info

Can you please apply this patch and paste the output of /sys/kernel/debug/dri/0/i915_swizzle_info? Thanks.

Comment 43 Chris Wilson 2013-07-03 16:53:04 UTC

Hmm, the mobile chipsets are different. DCC and the channel descriptors are at different locations.

Comment 44 Chris Wilson 2013-07-03 17:00:42 UTC

Created attachment 81968
Print out more DRAM info (mobiles), step 2

Comment 45 Chris Wilson 2013-07-03 17:01:08 UTC

Created attachment 81969
A lead?

Comment 46 Chris Wilson 2013-07-03 17:12:12 UTC

Created attachment 81970
Hack, redux

Comment 47 Daniel Vetter 2013-11-18 17:52:08 UTC

Does anyone still care about this?

We kinda can't fix tiled swapping on these boxes, and ickle's hack breaks a few assumptions here and there. So if we still have unhappy users out there, please speak up.

Comment 48 Serge Gavrilov 2013-11-19 09:31:59 UTC

Yes, I care. Unfortunately, I did not read your messages from July. I will try to patch the kernel and print the necessary output this evening. Which kernel do I need to patch?

Comment 49 Serge Gavrilov 2013-11-19 18:30:53 UTC

bit6 swizzle for X-tiling = bit9/bit10/bit11
bit6 swizzle for Y-tiling = bit9/bit11
C0DRB0 = 0x0020
C0DRB1 = 0x0040
C0DRB2 = 0x0040
C0DRB3 = 0x0040
C1DRB0 = 0x0000
C1DRB1 = 0x0000
C1DRB2 = 0x0000
C1DRB3 = 0x0000
C1DRB3 = 0x0000
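The two swizzle modes printed above describe how the GPU flips address bit 6 for tiled surfaces. A minimal sketch of that mapping, assuming the usual XOR reading of "bit9/bit10/bit11" (bit 6 is XORed with each listed address bit); the function and constant names are hypothetical:

```python
from typing import Iterable

X_TILING = (9, 10, 11)  # "bit6 swizzle for X-tiling = bit9/bit10/bit11"
Y_TILING = (9, 11)      # "bit6 swizzle for Y-tiling = bit9/bit11"


def swizzle_bit6(addr: int, source_bits: Iterable[int]) -> int:
    """Return addr with bit 6 XORed with each of the given address bits."""
    flip = 0
    for n in source_bits:
        flip ^= (addr >> n) & 1
    return addr ^ (flip << 6)


if __name__ == "__main__":
    print(hex(swizzle_bit6(0x200, X_TILING)))  # 0x240
    print(hex(swizzle_bit6(0x600, X_TILING)))  # 0x600
    print(hex(swizzle_bit6(0x600, Y_TILING)))  # 0x640
```

Because bits 9 to 11 are never modified, applying the function twice returns the original address, so the same helper can both swizzle and deswizzle.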

Comment 50 Daniel Vetter 2013-11-19 18:47:49 UTC

(In reply to comment #48)
> Yes, I am care. Unfortunately, I did not read your messages from July. I
> will try to patch the kernel and print the necessary output in the evening.
> What kernel I need to patch?

If Chris's hunch about the bit is right, then we need to know the values of registers 0x10111 and 0x10200 (Chris's patch had a bug when dumping those registers):

# intel_reg_read 0x10111
# intel_reg_read 0x10200

Then I can confirm with my own GM45 (which is also affected, iirc) and update the patch.

Comment 51 Serge Gavrilov 2013-11-19 18:51:09 UTC

What should I do? :)

Comment 52 Daniel Vetter 2013-11-19 18:55:17 UTC

Execute the listed commands as root. You need to install the intel-gpu-tools package to get them.

Comment 53 Serge Gavrilov 2013-11-19 19:09:22 UTC

0x10111 : 0x0
0x10200 : 0xF0002

Comment 54 Ildar Muyukov 2014-03-19 11:24:38 UTC

I have the same:
0x10111 : 0x0
0x10200 : 0xF0002

Are these values helpful?

Comment 55 Serge Gavrilov 2014-11-14 18:20:50 UTC

Could you be so kind as to provide the patch for a newer kernel?

Comment 56 Daniel Vetter 2014-11-18 13:43:06 UTC

OK, I've finally gotten around to polishing Chris's patch and updating the testcase:

http://patchwork.freedesktop.org/patch/37073/

As soon as I have a few Tested-by reports I'll pull this in, so please go wild. The patch applies on top of the latest drm-intel-nightly.

Comment 57 Daniel Vetter 2014-11-20 10:27:27 UTC

The workaround is now merged into drm-intel-nightly and should land in 3.19:

commit 14a369b6c9bdb40cebdac5a248321a05119fe02b
Author: Daniel Vetter <daniel.vetter@ffwll.ch>
Date:   Thu Nov 20 09:26:30 2014 +0100

    drm/i915: Pin tiled objects for L-shaped configs

Note that this is v2; v1 was a bit WARNING-happy.
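Why pinning helps can be illustrated with a toy model of an L-shaped layout. Assume, purely hypothetically, that physical pages below some boundary sit in the dual-channel interleaved region and get the X-tiling bit-6 swizzle, while pages above it are single-channel and are not swizzled. The boundary value and helper names are made up for illustration; this is not the code from the commit above:

```python
from typing import Iterable

PAGE_SIZE = 4096
X_TILING = (9, 10, 11)   # bit6 swizzle for X-tiling, as printed in comment 49
BOUNDARY = 0x4000_0000   # hypothetical end of the interleaved region


def swizzle_bit6(addr: int, source_bits: Iterable[int]) -> int:
    """XOR bit 6 of addr with each listed address bit."""
    flip = 0
    for n in source_bits:
        flip ^= (addr >> n) & 1
    return addr ^ (flip << 6)


def gpu_byte_offset(page_phys: int, offset: int, boundary: int = BOUNDARY) -> int:
    """Byte within a page-aligned 4 KiB page that the GPU touches for
    `offset`, in the toy model."""
    assert page_phys % PAGE_SIZE == 0 and 0 <= offset < PAGE_SIZE
    addr = page_phys + offset
    if page_phys < boundary:
        # Interleaved region: the GPU applies the bit-6 swizzle.
        addr = swizzle_bit6(addr, X_TILING)
    # Otherwise (single-channel region) the address is used unswizzled.
    return addr - page_phys


if __name__ == "__main__":
    low_page, high_page = 0x1000_0000, 0x5000_0000
    print(hex(gpu_byte_offset(low_page, 0x200)))   # 0x240
    print(hex(gpu_byte_offset(high_page, 0x200)))  # 0x200
```

In this model the same logical offset maps to different bytes depending on which physical page backs it, so a tiled page that is swapped out and brought back in at a different address would be read back scrambled. Keeping such pages pinned keeps their physical address, and therefore their swizzle, fixed.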

Comment 58 Chris Wilson 2014-12-02 11:32:00 UTC

*** Bug 86898 has been marked as a duplicate of this bug. ***

Comment 59 Chris Wilson 2014-12-07 09:37:36 UTC

*** Bug 87065 has been marked as a duplicate of this bug. ***

Comment 60 Chris Wilson 2015-02-06 08:48:36 UTC

*** Bug 88924 has been marked as a duplicate of this bug. ***

Comment 61 fennectech 2015-02-14 13:23:32 UTC

Created attachment 113489
Screenshots of a possibly related bug

Possibly related?

Comment 62 fennectech 2015-02-14 13:24:25 UTC

Created attachment 113490
Another screenshot of a possibly related bug

Possibly related bug?

Comment 63 Chris Wilson 2015-05-28 13:36:19 UTC

*** Bug 90725 has been marked as a duplicate of this bug. ***

Comment 64 Fab Stz 2015-05-28 16:40:53 UTC

Actually, I still have the issue with kernel 4.0.2 (see bug 90725, which was marked as a duplicate).

My device is as follows; it is apparently Gen3:
 - 00:02.0 VGA compatible controller: Intel Corporation 82946GZ/GL Integrated Graphics Controller (rev 02)
 - 00:02.0 0300: 8086:2972 (rev 02)

Since your patch is designed for Gen4, could that explain why I'm still facing the issue?

Should I reopen this one, or reopen 90725?

Comment 65 Fab Stz 2015-05-28 16:51:33 UTC

# cat i915_swizzle_info

bit6 swizzle for X-tiling = none
bit6 swizzle for Y-tiling = none
DDC = 0x00200010
DDC2 = 0x00200020
C0DRB3 = 0x0020
C1DRB3 = 0x0010
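Comment 40 hopes the chipset registers reveal whether memory has different swizzle regions. As a purely hypothetical heuristic (not taken from the merged commit), one could parse an i915_swizzle_info dump in the format pasted above and flag machines where C0DRB3 and C1DRB3, which I take to be the last DRAM rank boundary of each channel (an assumption), differ. Comment 43 notes that mobile chipsets keep the channel descriptors at different locations, so on a GM45 the values printed by the debug patch may not be meaningful. Function names are hypothetical:

```python
import re
from typing import Dict

# Matches dump lines such as "C0DRB3 = 0x0020"; lines whose value is not
# hex (e.g. "bit6 swizzle for X-tiling = none") are ignored.
_REG_LINE = re.compile(r"^\s*(\w+)\s*=\s*0x([0-9A-Fa-f]+)\s*$")


def parse_swizzle_info(text: str) -> Dict[str, int]:
    """Collect NAME = 0xVALUE pairs from an i915_swizzle_info dump."""
    regs: Dict[str, int] = {}
    for line in text.splitlines():
        m = _REG_LINE.match(line)
        if m:
            # A repeated name simply keeps the last value seen.
            regs[m.group(1)] = int(m.group(2), 16)
    return regs


def drb_mismatch(regs: Dict[str, int]) -> bool:
    """Hypothetical heuristic: flag unequal channel population when the
    last rank boundary of channel 0 differs from that of channel 1."""
    return regs["C0DRB3"] != regs["C1DRB3"]


if __name__ == "__main__":
    dump = """bit6 swizzle for X-tiling = none
bit6 swizzle for Y-tiling = none
DDC = 0x00200010
DDC2 = 0x00200020
C0DRB3 = 0x0020
C1DRB3 = 0x0010"""
    print(drb_mismatch(parse_swizzle_info(dump)))  # True
```

A missing C0DRB3 or C1DRB3 entry raises KeyError rather than silently comparing absent values.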

Comment 66 Chris Wilson 2015-05-28 16:58:32 UTC

(In reply to Fab Stz from comment #64)
> Since your patch is designed for Gen4 could that explain that I'm still
> facing the issue ?

No, only that I incorrectly guessed you had gen4, given the tiling artifacts.

> Should I reopen this one, or reopen 90725 ?

Hmm, reopen bug 90725 and this time add your Xorg.0.log! Can you also please try xf86-video-intel.git to rule out one gen3 swizzling bug in the process?
