Bug 35796 - [GMA 3150] Graphics corruption after hibernate
Summary: [GMA 3150] Graphics corruption after hibernate
Status: CLOSED FIXED
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/Intel
Version: unspecified
Hardware: x86 (IA32) Linux (All)
Importance: medium normal
Assignee: Chris Wilson
QA Contact:
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2011-03-30 07:13 UTC by Seth Forshee
Modified: 2017-07-24 23:05 UTC

See Also:
i915 platform:
i915 features:


Attachments
Kernel log (59.07 KB, text/plain)
2011-03-30 07:13 UTC, Seth Forshee
Xorg log (30.65 KB, text/plain)
2011-03-30 07:14 UTC, Seth Forshee
intel_reg_dumper output (8.72 KB, text/plain)
2011-03-30 07:15 UTC, Seth Forshee
lspci output (10.42 KB, text/plain)
2011-03-30 07:15 UTC, Seth Forshee
Screenshot of corruption (269.16 KB, image/jpeg)
2011-03-30 07:20 UTC, Seth Forshee
Disable planes on resume (1.32 KB, patch)
2011-04-11 15:01 UTC, Seth Forshee
Sanitize output registers after resume (4.00 KB, patch)
2011-04-11 23:01 UTC, Chris Wilson
Sanitize output registers after resume v2 (4.04 KB, patch)
2011-04-12 08:28 UTC, Seth Forshee

Description Seth Forshee 2011-03-30 07:13:12 UTC
I have three machines with this chipset that all display graphics corruption after resuming from S4 with both i386 and amd64 builds. The leftmost 3/4 of the screen is corrupted and the rightmost 1/4 is correct. The mouse pointer appears correctly anywhere on the screen, even over the corrupted area. The corruption clears up after an S3 cycle. The issue is 100% reproducible but does not appear in console mode.

System environment: 
-- chipset: GMA 3150
-- system architecture: 32-bit or 64-bit
-- xserver-xorg-video-intel: 2.14.0
-- xserver: 1.10.0
-- mesa: 7.10.1
-- libdrm: 2.4.23
-- kernel: 2.6.38
-- Linux distribution: Ubuntu 11.04
-- Machine or mobo model: Asus EeePC T101MT, Samsung N150, Toshiba NB305

Reproducing steps: Hibernate the machine, then wake it back up. Corruption appears during resume.
Comment 1 Seth Forshee 2011-03-30 07:13:50 UTC
Created attachment 45043
Kernel log
Comment 2 Seth Forshee 2011-03-30 07:14:39 UTC
Created attachment 45044
Xorg log
Comment 3 Seth Forshee 2011-03-30 07:15:11 UTC
Created attachment 45045
intel_reg_dumper output
Comment 4 Seth Forshee 2011-03-30 07:15:41 UTC
Created attachment 45046
lspci output
Comment 5 Seth Forshee 2011-03-30 07:20:19 UTC
Created attachment 45047
Screenshot of corruption
Comment 6 Chris Wilson 2011-03-30 07:33:42 UTC
A key question is whether this was recently introduced. Corruption that appears only under X (not on the VT) does suggest a tiling problem, for which there was one related fix in 2.6.38.2.
Comment 7 Seth Forshee 2011-03-30 07:55:46 UTC
I've tested with a 2.6.35 kernel and still seen the corruption, so it's not recently introduced in the kernel at least. I was just about to build a 2.6.38.2 kernel to test something else on one of these machines though, so I'll check hibernate with that kernel as well.
Comment 8 Chris Wilson 2011-03-30 08:03:10 UTC
Having looked at the screenshot, it doesn't match the failure mode I thought you were describing, and I don't think the bug fixes in 2.6.38+ will help. So this is something new.
Comment 9 Seth Forshee 2011-03-30 09:05:26 UTC
Confirmed that this issue is still present with a 2.6.38.2 kernel.
Comment 10 Chris Wilson 2011-04-05 03:12:40 UTC
I wonder if the disable-outputs-first change helps. Can you try drm-intel-staging, which includes:

commit e6793fa5504ac5c09a8f22f907c2b5f4543af7d9
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Tue Mar 29 10:40:27 2011 +0100

    drm/i915: Disable all outputs early, before KMS takeover
    
    If the outputs are active and continuing to access the GATT when we
    teardown the PTEs, then there is a potential for us to hang the GPU.
    The hang tends to be a PGTBL_ER with either an invalid host access or
    an invalid display plane fetch.
    
    Reported-by: Pekka Enberg <penberg@kernel.org>
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Tested-by: Daniel Vetter <daniel.vetter@ffwll.ch> (855GM)
Comment 11 Seth Forshee 2011-04-05 07:27:29 UTC
The problem is not present in drm-intel-staging. I'll cherry pick that commit back to .38 and see if it's fixed.

How safe is that patch? I ask because I saw a thread on lkml that seemed to include some people reporting problems related to those changes. I'd like to get a fix into the Ubuntu kernel before we ship natty later this month, but of course only if it won't introduce any regressions.
Comment 12 Chris Wilson 2011-04-05 08:07:50 UTC
There is a secondary patch to fix the initialisation of the per-ring irq queues and a new version to hopefully eliminate the issue reported.
Comment 13 Seth Forshee 2011-04-05 09:26:33 UTC
Hmm, the patch cherry-picked cleanly to .38, but I'm getting an oops during boot. The output is below. I'll take a look, but you'll probably be able to spot the problem more quickly than I can.

[   17.377624] BUG: unable to handle kernel NULL pointer dereference at 00000004
[   17.377831] IP: [<c106d5fb>] prepare_to_wait+0x5b/0x70
[   17.377951] *pde = 3d764067 
[   17.378031] Oops: 0002 [#1] SMP 
[   17.378148] last sysfs file: /sys/devices/pci0000:00/0000:00:1d.1/usb3/3-2/3-2:1.0/bInterfaceClass
[   17.378316] Modules linked in: parport_pc ppdev arc4 snd_hda_codec_realtek snd_hda_intel snd_hda_codec snd_hwdep binfmt_misc ath9k snd_pcm mac80211 snd_seq_midi i915 snd_rawmidi snd_seq_midi_event snd_seq ath9k_common eeepc_wmi joydev ath9k_hw snd_timer sparse_keymap drm_kms_helper snd_seq_device ath cfg80211 snd drm uvcvideo hid_mosart psmouse videodev serio_raw soundcore snd_page_alloc i2c_algo_bit video netconsole configfs lp parport usbhid hid ahci libahci atl1c
[   17.379516] 
[   17.379555] Pid: 1067, comm: Xorg Not tainted 2.6.38-8-generic #40~saf01drmbackport ASUSTeK Computer INC. T101MT/T101MT
[   17.379797] EIP: 0060:[<c106d5fb>] EFLAGS: 00013046 CPU: 0
[   17.379904] EIP is at prepare_to_wait+0x5b/0x70
[   17.379992] EAX: 00003296 EBX: f28520b0 ECX: 00000000 EDX: f291bccc
[   17.380021] ESI: f291bcc0 EDI: 00000002 EBP: f291bc80 ESP: f291bc70
[   17.380021]  DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
[   17.380021] Process Xorg (pid: 1067, ti=f291a000 task=f1e00000 task.ti=f291a000)
[   17.380021] Stack:
[   17.380021]  f2852054 00000003 f285201c f28531f8 f291bce0 f855dc1b f285201c f291bcd0
[   17.380021]  f291bcac 00000003 f285201c 00000002 00000003 f2ee5200 00000000 f28520b0
[   17.380021]  f291bcc0 f28531f8 f33e2800 f2852000 00000000 f1e00000 c106d380 f291bccc
[   17.380021] Call Trace:
[   17.380021]  [<f855dc1b>] i915_do_wait_request+0x2bb/0x580 [i915]
[   17.380021]  [<c106d380>] ? autoremove_wake_function+0x0/0x50
[   17.380021]  [<f855df21>] i915_gem_object_wait_rendering+0x41/0x60 [i915]
[   17.380021]  [<f8560233>] i915_gem_object_set_to_display_plane+0x43/0xd0 [i915]
[   17.380021]  [<f856dfc1>] intel_pin_and_fence_fb_obj+0xc1/0x100 [i915]
[   17.380021]  [<f856e088>] intel_pipe_set_base+0x88/0x220 [i915]
[   17.380021]  [<f8310001>] ? drm_mode_getresources+0x4b1/0x560 [drm]
[   17.380021]  [<f830926f>] ? drm_ut_debug_printk+0x2f/0x50 [drm]
[   17.380021]  [<f82d3d60>] drm_crtc_helper_set_config+0x6b0/0x920 [drm_kms_helper]
[   17.380021]  [<f8310742>] drm_mode_setcrtc+0x192/0x3d0 [drm]
[   17.380021]  [<f83116e6>] ? drm_mode_gamma_set_ioctl+0x76/0x110 [drm]
[   17.380021]  [<f8303cd1>] drm_ioctl+0x1e1/0x470 [drm]
[   17.380021]  [<f83105b0>] ? drm_mode_setcrtc+0x0/0x3d0 [drm]
[   17.380021]  [<f8303af0>] ? drm_ioctl+0x0/0x470 [drm]
[   17.380021]  [<c11367db>] do_vfs_ioctl+0x7b/0x2e0
[   17.380021]  [<c1136ac7>] sys_ioctl+0x87/0x90
[   17.380021]  [<c106c991>] ? sys_clock_gettime+0x71/0xb0
[   17.380021]  [<c1509c34>] syscall_call+0x7/0xb
[   17.380021] Code: 64 8b 35 ec 74 83 c1 87 0e 89 4d f0 8b 55 f0 89 c2 89 d8 e8 b8 c2 49 00 8b 5d f4 8b 75 f8 8b 7d fc 89 ec 5d c3 8d 76 00 8b 4b 04 <89> 51 04 89 4e 0c 8d 4b 04 89 4e 10 89 53 04 eb be 8d 74 26 00 
[   17.380021] EIP: [<c106d5fb>] prepare_to_wait+0x5b/0x70 SS:ESP 0068:f291bc70
[   17.380021] CR2: 0000000000000004
Comment 14 Chris Wilson 2011-04-05 12:07:11 UTC
I was obviously being too cryptic, sorry. You also need the previous commit:

commit b023d74ad16336ea07fb237b52899df6df63e4b2
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Tue Mar 29 13:19:09 2011 +0100

    drm/i915: Move the irq wait queue initialisation into the ring init
    
    Required so that we don't obliterate the queue if initialising the
    rings after the global IRQ handler is installed.
    
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Comment 15 Seth Forshee 2011-04-05 12:52:54 UTC
Yeah, I realized that a little bit ago. It's my fault; if I had read your earlier comment a little more carefully I would have realized I needed that patch. I'm just now getting a test build ready with both patches.
Comment 16 Seth Forshee 2011-04-05 15:24:54 UTC
I discovered that two of these systems are now fine after hibernate, even with kernels that were previously failing. I had recently updated userspace on both of these machines, so it seems like something there fixed the problem. One of these was the one I tested the drm-intel-staging branch on earlier, so those results are completely invalid.

The third machine has started locking up hard while booting after hibernate, so I haven't been able to give the kernel patches a legitimate test. I'm still interested in finding out what changed in userspace to eliminate the issue, though.
Comment 17 Chris Wilson 2011-04-05 15:33:07 UTC
Anything of note in the packages updated? How do you hibernate? (echo disk > /sys/power/state ?) I assumed that we failed to disable outside interference, but it could equally be outside interference afterwards (say, some app poking directly into memory to restore a fb).

Which machine now dies upon resume?
Comment 18 Seth Forshee 2011-04-05 20:02:07 UTC
I still need to look into the package updates, but it's definitely something that happened in the past week.

I've used pm-hibernate and 'echo disk > /sys/power/state' with the same results. The machine that dies on resume is a Samsung N150. This is a recent regression, so I'll go back to a previous version to test the backport and bisect this regression separately.
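
For reference, the sysfs method mentioned above can be sketched in a few lines. This is a minimal illustration, not part of the original report: the helper names `supported_states` and `request_hibernate` are made up for this example, and the only real interface used is the kernel's `/sys/power/state` file. Writing to it requires root, and calling `request_hibernate()` with the default path will actually hibernate the machine.

```python
from pathlib import Path

# Real kernel interface: writing "disk" here asks the kernel to hibernate (S4).
POWER_STATE = Path("/sys/power/state")


def supported_states(state_file=POWER_STATE):
    """Return the sleep states listed in the state file, e.g. ['freeze', 'mem', 'disk']."""
    return state_file.read_text().split()


def request_hibernate(state_file=POWER_STATE):
    """Write 'disk' to the state file, refusing if hibernation is not offered."""
    if "disk" not in supported_states(state_file):
        raise RuntimeError("hibernation ('disk') is not offered by this kernel")
    state_file.write_text("disk\n")
```

Passing a different `state_file` (for example a scratch file) allows the logic to be exercised without suspending anything.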
Comment 19 Seth Forshee 2011-04-06 08:05:13 UTC
I _finally_ got a legitimate test run. Those kernel patches do not fix the corruption.

I've started looking through the list of updated packages, and I see there is an update to libdrm. The only intel-specific change looks to be this one:

  commit 36d4939343d8789d9066f7245fa2d4fe69119dd8
  Author: Chris Wilson <chris@chris-wilson.co.uk>
  Date:   Mon Feb 14 09:39:06 2011 +0000
  
      intel: Remember named bo
      
      ... and if asked to open a bo by the same global name, return a fresh
      reference to the previously allocated buffer.
      
      Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>

Do you think that would have any effect?
Comment 20 Chris Wilson 2011-04-06 08:24:05 UTC
I associate that patch with fixing the "buffer appears twice in execbuffer" error. You would have seen that message in dmesg, and it seems to involve dual head.
Comment 21 Seth Forshee 2011-04-06 12:52:32 UTC
I have seen that message on occasion in dmesg, but it wasn't consistently associated with this error so I didn't mention it. I wasn't using any of these machines in a dual-head setup though.

I finally tracked down that the issue is fixed by an update to grub2. We've switched to using gfxpayload=keep for a smooth hand-off between grub and the kernel, but apparently it isn't supported on some hardware, including the GMA 3150. The grub update adds a check for hardware support before using that option, and after the update it falls back to gfxpayload=text for these machines.
Comment 22 Jesse Barnes 2011-04-06 12:55:27 UTC
But it *should* work, there's no functional reason we should have to switch back to text mode.  Something in our init or resume from hibernate code must be broken...
Comment 23 Seth Forshee 2011-04-06 13:36:55 UTC
After looking at the changes that got rid of the corruption, I think that it's just picking text mode by accident anyway. It's trying to check the PCI id against a blacklist, but the blacklist file isn't present, so it falls back to text mode. When I install the blacklist (which happens to be empty currently) it goes back to graphics mode and the corruption returns.
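
The accidental fallback described above can be modelled as follows. This is only a sketch of the decision as described in this comment, not grub2's actual code: the function name `choose_gfxpayload`, the representation of a missing blacklist file as `None`, and the PCI ID used in the usage lines are all assumptions made for illustration.

```python
from typing import Optional, Set


def choose_gfxpayload(pci_id: str, blacklist: Optional[Set[str]]) -> str:
    """Pick the gfxpayload value for a device.

    `blacklist` is None when the blacklist file is missing, otherwise the
    set of PCI IDs read from it (possibly empty).
    """
    if blacklist is None:
        # Missing file: the check cannot run, so text mode is chosen.
        return "text"
    # File present: only listed devices are forced to text mode.
    return "text" if pci_id in blacklist else "keep"


# Placeholder ID, not taken from the report.
EXAMPLE_ID = "8086:0000"
print(choose_gfxpayload(EXAMPLE_ID, None))    # blacklist file missing
print(choose_gfxpayload(EXAMPLE_ID, set()))   # blacklist installed but empty
```

Under this model the missing file and the empty file give different answers, which matches the observation that installing the empty blacklist brings graphics mode, and the corruption, back.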
Comment 24 Seth Forshee 2011-04-08 08:03:17 UTC
After getting past my hibernate issue (it was a bug affecting all 64-bit builds), I've started testing 2.6.39-rc2 and am still seeing the corruption. I'm getting this warning after hibernate:

[  145.916311] WARNING: at drivers/gpu/drm/i915/intel_display.c:1240 intel_disable_pipe+0x1b9/0x1c0 [i915]()
[  145.916319] Hardware name: TOSHIBA NB305
[  145.916324] plane B assertion failure, should be off on pipe B but is still active
[  145.916329] Modules linked in: binfmt_misc parport_pc ppdev snd_hda_codec_realtek arc4 snd_hda_intel snd_hda_codec snd_hwdep ath9k snd_pcm mac80211 snd_seq_midi snd_rawmidi snd_seq_midi_event ath9k_common i915 joydev snd_seq uvcvideo ath9k_hw snd_timer snd_seq_device ath videodev drm_kms_helper cfg80211 v4l2_compat_ioctl32 psmouse serio_raw drm snd sparse_keymap i2c_algo_bit soundcore snd_page_alloc video lp parport ahci libahci r8169
[  145.916406] Pid: 33, comm: kworker/u:1 Not tainted 2.6.39-rc2+ #15
[  145.916412] Call Trace:
[  145.916428]  [<ffffffff810626cf>] warn_slowpath_common+0x7f/0xc0
[  145.916439]  [<ffffffff810627c6>] warn_slowpath_fmt+0x46/0x50
[  145.916467]  [<ffffffffa0294049>] intel_disable_pipe+0x1b9/0x1c0 [i915]
[  145.916495]  [<ffffffffa02958e8>] i9xx_crtc_disable+0xb8/0x170 [i915]
[  145.916523]  [<ffffffffa02959de>] i9xx_crtc_prepare+0xe/0x10 [i915]
[  145.916539]  [<ffffffffa0111577>] drm_crtc_helper_set_mode+0x317/0x4e0 [drm_kms_helper]
[  145.916553]  [<ffffffff815c089e>] ? _raw_spin_lock+0xe/0x20
[  145.916572]  [<ffffffffa01117b2>] drm_helper_resume_force_mode+0x72/0x150 [drm_kms_helper]
[  145.916595]  [<ffffffffa0268202>] i915_drm_thaw+0xd2/0x120 [i915]
[  145.916617]  [<ffffffffa0268443>] i915_resume+0x53/0x70 [i915]
[  145.916639]  [<ffffffffa0268476>] i915_pm_resume+0x16/0x20 [i915]
[  145.916650]  [<ffffffff812fc46b>] pci_pm_restore+0x6b/0xc0
[  145.916660]  [<ffffffff813ba369>] pm_op+0x1f9/0x220
[  145.916670]  [<ffffffff8108bf80>] ? async_schedule+0x20/0x20
[  145.916679]  [<ffffffff813baaca>] device_resume+0x8a/0x160
[  145.916687]  [<ffffffff813baf21>] async_resume+0x21/0x60
[  145.916696]  [<ffffffff8108c004>] async_run_entry_fn+0x84/0x180
[  145.916707]  [<ffffffff8107e1ed>] process_one_work+0x11d/0x420
[  145.916717]  [<ffffffff8107eef3>] worker_thread+0x163/0x360
[  145.916728]  [<ffffffff8107ed90>] ? manage_workers.clone.20+0x240/0x240
[  145.916738]  [<ffffffff81083e56>] kthread+0x96/0xa0
[  145.916747]  [<ffffffff815c9aa4>] kernel_thread_helper+0x4/0x10
[  145.916758]  [<ffffffff81083dc0>] ? flush_kthread_worker+0xb0/0xb0
[  145.916766]  [<ffffffff815c9aa0>] ? gs_change+0x13/0x13

It's present when the corruption occurs and not present when there's no corruption. Could it be related?
Comment 25 Seth Forshee 2011-04-08 11:37:03 UTC
I also confirmed the problem is present in drm-intel-staging. I still see the warning on hibernate but not at any other time.
Comment 26 Seth Forshee 2011-04-11 15:01:05 UTC
Created attachment 45493
Disable planes on resume

I'm attaching a patch that gets rid of both the warning and the corruption. It's probably not the correct solution, but it does seem to demonstrate what the problem is. From my rudimentary understanding, it appears the BIOS is setting up things differently from the kernel, so a plane that's been assigned to the pipe is left enabled when the pipe is disabled. So the resume handler needs to be aware that the setup may have changed since when the suspend happened.
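
The failure mode described above can be illustrated with a toy model. This is not the i915 code or the attached patch: `Plane`, `firmware_state`, `disable_pipe`, and `sanitize_pipe` are hypothetical names, and the particular plane-to-pipe assignment is an assumption chosen to reproduce the "plane B ... still active on pipe B" warning from comment 24.

```python
from dataclasses import dataclass


@dataclass
class Plane:
    name: str
    enabled: bool
    pipe: str  # pipe this plane is currently routed to, as read from hardware


def firmware_state():
    """State assumed to be left by the firmware: plane B is scanning out on pipe B."""
    return {
        "A": Plane("A", enabled=False, pipe="A"),
        "B": Plane("B", enabled=True, pipe="B"),
    }


def disable_pipe(planes, pipe, expected_plane):
    """Turn off only the plane the driver believes feeds `pipe`.

    Returns the names of planes still active on that pipe afterwards.
    """
    planes[expected_plane].enabled = False
    return [p.name for p in planes.values() if p.enabled and p.pipe == pipe]


def sanitize_pipe(planes, pipe):
    """Turn off every plane the hardware state says is routed to `pipe`."""
    for p in planes.values():
        if p.pipe == pipe:
            p.enabled = False


# Driver assumes plane A feeds pipe B, but the firmware routed plane B there.
planes = firmware_state()
print(disable_pipe(planes, "B", expected_plane="A"))  # plane B is left running

# Inspecting the actual state first removes the stale plane.
planes = firmware_state()
sanitize_pipe(planes, "B")
print(disable_pipe(planes, "B", expected_plane="A"))
```

The point of the model is that the driver's cached idea of which plane feeds a pipe can be stale after resume, so the hardware state has to be read back and cleaned up before the pipe is disabled.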
Comment 27 Chris Wilson 2011-04-11 15:05:40 UTC
(In reply to comment #26)
> Created an attachment (id=45493) [details]
> Disable planes on resume
> 
> I'm attaching a patch that gets rid of both the warning and the corruption.
> It's probably not the correct solution, but it does seem to demonstrate what
> the problem is. From my rudimentary understanding, it appears the BIOS is
> setting up things differently from the kernel, so a plane that's been assigned
> to the pipe is left enabled when the pipe is disabled. So the resume handler
> needs to be aware that the setup may have changed since when the suspend
> happened

That sounds strangely believable. We have logic in the init path to ensure that the hardware is consistent with our world view; that, and similar logic, needs to be repeated on resume.

Seth, thanks for the invaluable analysis.
Comment 28 Chris Wilson 2011-04-11 23:01:41 UTC
Created attachment 45498
Sanitize output registers after resume

Seth, this I think is a slightly better patch only because it reuses the code we already have to try and fix this problem (at boot).
Comment 29 Seth Forshee 2011-04-12 08:28:33 UTC
Created attachment 45532
Sanitize output registers after resume v2

Chris, after a small change to fix build breakage the patch seems to be doing the job on all three machines. The screen still shows the corruption very briefly, maybe 0.25 seconds, before it gets corrected.

I'm attaching an updated patch with the build fix.

Also, one clarification to the comment I made earlier. In the comment you mention this being similar to a bug caused by the state left by grub2. In reality it's probably the exact same thing, since it's grub2 that's changing to graphics mode when coming out of hibernate. I said BIOS since the BIOS is what's actually manipulating the registers (since grub2 just uses the VESA BIOS extensions afaict).

Do you plan to propose this patch for stable?
Comment 30 Chris Wilson 2011-04-12 10:06:05 UTC
(In reply to comment #29)
> Do you plan to propose this patch for stable?

Sounds like a good plan.
Comment 31 Chris Wilson 2011-04-17 00:13:42 UTC
commit f6e5b1603b8bb7131b6778d0d4e2e5dda120a379
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Tue Apr 12 18:06:51 2011 +0100

    drm/i915: Sanitize the output registers after resume
    
    Similar to booting, we need to inspect the state left by the BIOS and
    remove any conflicting bits before we take over. The example reported by
    Seth Forshee is very similar to the bug we encountered with the state left
    by grub2, that the crtc pipe<->planning mapping was reversed from our
    expectations and so we failed to turn off the outputs when booting or,
    in this case, resuming. This may be in fact the same bug, but triggered
    at resume time.
    
    This patch rearranges the code we already have to clear up the
    conflicting state upon init and calls it from reset (which is called
    after we have lost control of the hardware, i.e. along both the boot and
    resume paths) instead.
    
    Reported-and-tested-by: Seth Forshee <seth.forshee@canonical.com>
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=35796
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Cc: stable@kernel.org
    Reviewed-by: Keith Packard <keithp@keithp.com>
    Signed-off-by: Keith Packard <keithp@keithp.com>

