I have three machines with this chipset that all display graphics corruption after resuming from S4 with both i386 and amd64 builds. The leftmost 3/4 of the screen is corrupted and the rightmost 1/4 is correct. The mouse pointer appears correctly anywhere on the screen, even over the corrupted area. The corruption clears up after an S3 cycle. The issue is 100% reproducible but does not appear in console mode.

System environment:
-- chipset: GMA 3150
-- system architecture: 32-bit or 64-bit
-- xserver-xorg-video-intel: 2.14.0
-- xserver: 1.10.0
-- mesa: 7.10.1
-- libdrm: 2.4.23
-- kernel: 2.6.38
-- Linux distribution: Ubuntu 11.04
-- Machine or mobo model: Asus EeePC T101MT, Samsung N150, Toshiba NB305

Reproducing steps:
Hibernate the machine, then wake it back up. Corruption appears during resume.
Created attachment 45043 Kernel log
Created attachment 45044 Xorg log
Created attachment 45045 intel_reg_dumper output
Created attachment 45046 lspci output
Created attachment 45047 Screenshot of corruption
A key question is whether this was recently introduced. Corruption only under X (not on the VT) does suggest tiling corruption, for which there was one related fix in 2.6.38.2.
I've tested with a 2.6.35 kernel and still saw the corruption, so it's not recently introduced in the kernel at least. I was just about to build a 2.6.38.2 kernel to test something else on one of these machines though, so I'll check hibernate with that kernel as well.
Having looked at the screenshot, it doesn't match the failure mode I thought you were describing, and I don't think it is covered by the bug fixes in 2.6.38+. So something new.
Confirmed that this issue is still present with a 2.6.38.2 kernel.
I wonder if the disable-outputs-first helps. Can you try drm-intel-staging:

commit e6793fa5504ac5c09a8f22f907c2b5f4543af7d9
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Tue Mar 29 10:40:27 2011 +0100

    drm/i915: Disable all outputs early, before KMS takeover

    If the outputs are active and continuing to access the GATT when we
    teardown the PTEs, then there is a potential for us to hang the GPU.
    The hang tends to be a PGTBL_ER with either an invalid host access or
    an invalid display plane fetch.

    Reported-by: Pekka Enberg <penberg@kernel.org>
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Tested-by: Daniel Vetter <daniel.vetter@ffwll.ch> (855GM)
The problem is not present in drm-intel-staging. I'll cherry pick that commit back to .38 and see if it's fixed. How safe is that patch? I ask because I saw a thread on lkml that seemed to include some people reporting problems related to those changes. I'd like to get a fix into the Ubuntu kernel before we ship natty later this month, but of course only if it won't introduce any regressions.
There is a secondary patch to fix the initialisation of the per-ring irq queues and a new version to hopefully eliminate the issue reported.
Hmm, the patch cherry-picked cleanly to .38, but I'm getting an oops during boot. The output is below. I'll take a look, but you'll probably be able to spot the problem more quickly than I can.

[ 17.377624] BUG: unable to handle kernel NULL pointer dereference at 00000004
[ 17.377831] IP: [<c106d5fb>] prepare_to_wait+0x5b/0x70
[ 17.377951] *pde = 3d764067
[ 17.378031] Oops: 0002 [#1] SMP
[ 17.378148] last sysfs file: /sys/devices/pci0000:00/0000:00:1d.1/usb3/3-2/3-2:1.0/bInterfaceClass
[ 17.378316] Modules linked in: parport_pc ppdev arc4 snd_hda_codec_realtek snd_hda_intel snd_hda_codec snd_hwdep binfmt_misc ath9k snd_pcm mac80211 snd_seq_midi i915 snd_rawmidi snd_seq_midi_event snd_seq ath9k_common eeepc_wmi joydev ath9k_hw snd_timer sparse_keymap drm_kms_helper snd_seq_device ath cfg80211 snd drm uvcvideo hid_mosart psmouse videodev serio_raw soundcore snd_page_alloc i2c_algo_bit video netconsole configfs lp parport usbhid hid ahci libahci atl1c
[ 17.379516]
[ 17.379555] Pid: 1067, comm: Xorg Not tainted 2.6.38-8-generic #40~saf01drmbackport ASUSTeK Computer INC. T101MT/T101MT
[ 17.379797] EIP: 0060:[<c106d5fb>] EFLAGS: 00013046 CPU: 0
[ 17.379904] EIP is at prepare_to_wait+0x5b/0x70
[ 17.379992] EAX: 00003296 EBX: f28520b0 ECX: 00000000 EDX: f291bccc
[ 17.380021] ESI: f291bcc0 EDI: 00000002 EBP: f291bc80 ESP: f291bc70
[ 17.380021] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
[ 17.380021] Process Xorg (pid: 1067, ti=f291a000 task=f1e00000 task.ti=f291a000)
[ 17.380021] Stack:
[ 17.380021] f2852054 00000003 f285201c f28531f8 f291bce0 f855dc1b f285201c f291bcd0
[ 17.380021] f291bcac 00000003 f285201c 00000002 00000003 f2ee5200 00000000 f28520b0
[ 17.380021] f291bcc0 f28531f8 f33e2800 f2852000 00000000 f1e00000 c106d380 f291bccc
[ 17.380021] Call Trace:
[ 17.380021] [<f855dc1b>] i915_do_wait_request+0x2bb/0x580 [i915]
[ 17.380021] [<c106d380>] ? autoremove_wake_function+0x0/0x50
[ 17.380021] [<f855df21>] i915_gem_object_wait_rendering+0x41/0x60 [i915]
[ 17.380021] [<f8560233>] i915_gem_object_set_to_display_plane+0x43/0xd0 [i915]
[ 17.380021] [<f856dfc1>] intel_pin_and_fence_fb_obj+0xc1/0x100 [i915]
[ 17.380021] [<f856e088>] intel_pipe_set_base+0x88/0x220 [i915]
[ 17.380021] [<f8310001>] ? drm_mode_getresources+0x4b1/0x560 [drm]
[ 17.380021] [<f830926f>] ? drm_ut_debug_printk+0x2f/0x50 [drm]
[ 17.380021] [<f82d3d60>] drm_crtc_helper_set_config+0x6b0/0x920 [drm_kms_helper]
[ 17.380021] [<f8310742>] drm_mode_setcrtc+0x192/0x3d0 [drm]
[ 17.380021] [<f83116e6>] ? drm_mode_gamma_set_ioctl+0x76/0x110 [drm]
[ 17.380021] [<f8303cd1>] drm_ioctl+0x1e1/0x470 [drm]
[ 17.380021] [<f83105b0>] ? drm_mode_setcrtc+0x0/0x3d0 [drm]
[ 17.380021] [<f8303af0>] ? drm_ioctl+0x0/0x470 [drm]
[ 17.380021] [<c11367db>] do_vfs_ioctl+0x7b/0x2e0
[ 17.380021] [<c1136ac7>] sys_ioctl+0x87/0x90
[ 17.380021] [<c106c991>] ? sys_clock_gettime+0x71/0xb0
[ 17.380021] [<c1509c34>] syscall_call+0x7/0xb
[ 17.380021] Code: 64 8b 35 ec 74 83 c1 87 0e 89 4d f0 8b 55 f0 89 c2 89 d8 e8 b8 c2 49 00 8b 5d f4 8b 75 f8 8b 7d fc 89 ec 5d c3 8d 76 00 8b 4b 04 <89> 51 04 89 4e 0c 8d 4b 04 89 4e 10 89 53 04 eb be 8d 74 26 00
[ 17.380021] EIP: [<c106d5fb>] prepare_to_wait+0x5b/0x70 SS:ESP 0068:f291bc70
[ 17.380021] CR2: 0000000000000004
I was obviously being too cryptic, sorry. You also need the previous commit:

commit b023d74ad16336ea07fb237b52899df6df63e4b2
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Tue Mar 29 13:19:09 2011 +0100

    drm/i915: Move the irq wait queue initialisation into the ring init

    Required so that we don't obliterate the queue if initialising the
    rings after the global IRQ handler is installed.

    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Yeah, I realized that a little while ago. It's my fault; if I had read your earlier comment a little more carefully I would have realized I needed that patch. I'm just now getting a test build ready with both patches.
I discovered that two of these systems are now fine after hibernate, even with kernels that were previously failing. I had recently updated userspace on both of these machines, so it seems like something there fixed the problem. One of these was the one I tested the drm-intel-staging branch on earlier, so those results are completely invalid. The third machine has started locking up hard while booting after hibernate, so I haven't been able to give the kernel patches a legitimate test. I'm still interested in finding out what changed in userspace to eliminate the issue.
Anything of note in the packages updated? How do you hibernate? (echo disk > /sys/power/state ?) I assumed that we failed to disable outside interference, but it could equally be outside interference afterwards (say, some app poking directly into memory to restore a fb). Which machine now dies upon resume?
I still need to look into the package updates, but it's definitely something that happened in the past week. I've used pm-hibernate and 'echo disk > /sys/power/state' with the same results. The machine that dies on resume is a Samsung N150. This is a recent regression, so I'll go back to a previous version to test the backport and bisect this regression separately.
I _finally_ got a legitimate test run. Those kernel patches do not fix the corruption. I've started looking through the list of updated packages, and I see there is an update to libdrm. The only intel-specific change looks to be this one:

commit 36d4939343d8789d9066f7245fa2d4fe69119dd8
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Mon Feb 14 09:39:06 2011 +0000

    intel: Remember named bo

    ... and if asked to open a bo by the same global name, return a fresh
    reference to the previously allocated buffer.

    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>

Do you think that would have any effect?
I associate that patch with fixing the "buffer appears twice in execbuffer" error. You would have seen that message in dmesg, and it seems to involve dual head.
I have seen that message on occasion in dmesg, but it wasn't consistently associated with this error so I didn't mention it. I wasn't using any of these machines in a dual-head setup though. I finally tracked down that the issue is fixed by an update to grub2. We've switched to using gfxpayload=keep for a smooth hand-off between grub and the kernel, but apparently it isn't supported on some hardware, including the GMA 3150. The grub update adds a check for hardware support before using that option, and after the update it falls back to gfxpayload=text for these machines.
But it *should* work, there's no functional reason we should have to switch back to text mode. Something in our init or resume from hibernate code must be broken...
After looking at the changes that got rid of the corruption, I think that it's just picking text mode by accident anyway. It's trying to check the PCI id against a blacklist, but the blacklist file isn't present, so it falls back to text mode. When I install the blacklist (which happens to be empty currently) it goes back to graphics mode and the corruption returns.
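The fallback behaviour described above can be sketched as a small decision function. This is only an illustration of the logic as reported in this thread, not the actual grub2 script: the function name, the blacklist file format (one PCI ID per line, `#` for comments) and the placeholder PCI ID are all assumptions.

```python
from pathlib import Path


def choose_gfxpayload(pci_id: str, blacklist_path: Path) -> str:
    """Return 'keep' or 'text' following the behaviour described in the thread.

    Hypothetical helper: the name and the file format are assumptions.
    """
    # No blacklist file at all: fall back to text mode. This is the
    # accidental "fix" that made the corruption disappear.
    if not blacklist_path.is_file():
        return "text"

    # Assumed format: one PCI ID per line, '#' starts a comment line.
    blacklisted = {
        line.strip().lower()
        for line in blacklist_path.read_text().splitlines()
        if line.strip() and not line.lstrip().startswith("#")
    }

    # Blacklisted hardware gets text mode; everything else keeps graphics mode.
    return "text" if pci_id.lower() in blacklisted else "keep"
```

With an empty blacklist file installed, every device passes the check and gets `keep`, which matches the observation that the corruption returns once the (empty) blacklist is present.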
After getting past my hibernate issue (it was a bug affecting all 64-bit builds), I've started testing 2.6.39-rc2 and am still seeing the corruption. I'm getting this warning after hibernate:

[ 145.916311] WARNING: at drivers/gpu/drm/i915/intel_display.c:1240 intel_disable_pipe+0x1b9/0x1c0 [i915]()
[ 145.916319] Hardware name: TOSHIBA NB305
[ 145.916324] plane B assertion failure, should be off on pipe B but is still active
[ 145.916329] Modules linked in: binfmt_misc parport_pc ppdev snd_hda_codec_realtek arc4 snd_hda_intel snd_hda_codec snd_hwdep ath9k snd_pcm mac80211 snd_seq_midi snd_rawmidi snd_seq_midi_event ath9k_common i915 joydev snd_seq uvcvideo ath9k_hw snd_timer snd_seq_device ath videodev drm_kms_helper cfg80211 v4l2_compat_ioctl32 psmouse serio_raw drm snd sparse_keymap i2c_algo_bit soundcore snd_page_alloc video lp parport ahci libahci r8169
[ 145.916406] Pid: 33, comm: kworker/u:1 Not tainted 2.6.39-rc2+ #15
[ 145.916412] Call Trace:
[ 145.916428] [<ffffffff810626cf>] warn_slowpath_common+0x7f/0xc0
[ 145.916439] [<ffffffff810627c6>] warn_slowpath_fmt+0x46/0x50
[ 145.916467] [<ffffffffa0294049>] intel_disable_pipe+0x1b9/0x1c0 [i915]
[ 145.916495] [<ffffffffa02958e8>] i9xx_crtc_disable+0xb8/0x170 [i915]
[ 145.916523] [<ffffffffa02959de>] i9xx_crtc_prepare+0xe/0x10 [i915]
[ 145.916539] [<ffffffffa0111577>] drm_crtc_helper_set_mode+0x317/0x4e0 [drm_kms_helper]
[ 145.916553] [<ffffffff815c089e>] ? _raw_spin_lock+0xe/0x20
[ 145.916572] [<ffffffffa01117b2>] drm_helper_resume_force_mode+0x72/0x150 [drm_kms_helper]
[ 145.916595] [<ffffffffa0268202>] i915_drm_thaw+0xd2/0x120 [i915]
[ 145.916617] [<ffffffffa0268443>] i915_resume+0x53/0x70 [i915]
[ 145.916639] [<ffffffffa0268476>] i915_pm_resume+0x16/0x20 [i915]
[ 145.916650] [<ffffffff812fc46b>] pci_pm_restore+0x6b/0xc0
[ 145.916660] [<ffffffff813ba369>] pm_op+0x1f9/0x220
[ 145.916670] [<ffffffff8108bf80>] ? async_schedule+0x20/0x20
[ 145.916679] [<ffffffff813baaca>] device_resume+0x8a/0x160
[ 145.916687] [<ffffffff813baf21>] async_resume+0x21/0x60
[ 145.916696] [<ffffffff8108c004>] async_run_entry_fn+0x84/0x180
[ 145.916707] [<ffffffff8107e1ed>] process_one_work+0x11d/0x420
[ 145.916717] [<ffffffff8107eef3>] worker_thread+0x163/0x360
[ 145.916728] [<ffffffff8107ed90>] ? manage_workers.clone.20+0x240/0x240
[ 145.916738] [<ffffffff81083e56>] kthread+0x96/0xa0
[ 145.916747] [<ffffffff815c9aa4>] kernel_thread_helper+0x4/0x10
[ 145.916758] [<ffffffff81083dc0>] ? flush_kthread_worker+0xb0/0xb0
[ 145.916766] [<ffffffff815c9aa0>] ? gs_change+0x13/0x13

It's present when the corruption occurs and not present when there's no corruption. Could it be related?
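For anyone trying to correlate this warning with the corruption across several resume cycles, a small log scanner may help. The following is a rough sketch that searches saved kernel log text for the assertion message quoted above; the function name and the returned dictionary keys are made up for illustration, and the pattern only covers the exact wording seen in this report.

```python
import re

# Matches the message seen in this report, e.g.
# "plane B assertion failure, should be off on pipe B but is still active"
PLANE_ASSERT_RE = re.compile(
    r"plane (\w+) assertion failure, should be (\w+) on pipe (\w+) "
    r"but is still (\w+)"
)


def find_plane_assertions(log_text: str) -> list:
    """Return one dict per plane assertion failure found in log_text."""
    return [
        {
            "plane": m.group(1),
            "expected": m.group(2),
            "pipe": m.group(3),
            "actual": m.group(4),
        }
        for m in PLANE_ASSERT_RE.finditer(log_text)
    ]
```

Feeding it the output of `dmesg` captured after each resume gives a quick yes/no on whether the assertion fired in that cycle.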
I also confirmed the problem is present in drm-intel-staging. I still see the warning on hibernate but not at any other time.
Created attachment 45493 Disable planes on resume

I'm attaching a patch that gets rid of both the warning and the corruption. It's probably not the correct solution, but it does seem to demonstrate what the problem is. From my rudimentary understanding, it appears the BIOS is setting up things differently from the kernel, so a plane that's been assigned to the pipe is left enabled when the pipe is disabled. So the resume handler needs to be aware that the setup may have changed since when the suspend happened.
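The idea can be illustrated with a toy model. This is not the attached patch or any i915 code; it is a simplified sketch in which the `Plane` class, the `sanitize_planes` function and the plane/pipe names are all hypothetical. It only shows the check being argued for: after resume, compare the plane-to-pipe assignment actually found in the hardware against what the driver expects, and switch off any plane that is enabled on an unexpected pipe before touching the pipes.

```python
from dataclasses import dataclass


@dataclass
class Plane:
    name: str      # e.g. "A" or "B"
    enabled: bool  # state found in the hardware after resume
    pipe: str      # pipe the plane is currently attached to


def sanitize_planes(planes, expected):
    """Disable every enabled plane attached to a pipe the driver does not expect.

    `expected` maps plane name -> pipe name assumed by the driver.
    Returns the names of the planes that were disabled.
    """
    disabled = []
    for plane in planes:
        if plane.enabled and plane.pipe != expected[plane.name]:
            plane.enabled = False  # stand-in for clearing the enable bit
            disabled.append(plane.name)
    return disabled


# State as the firmware might leave it: plane B enabled on pipe A,
# while the driver assumes plane B belongs to pipe B.
planes = [Plane("A", True, "A"), Plane("B", True, "A")]
print(sanitize_planes(planes, {"A": "A", "B": "B"}))
```

Running the same check on both the boot and resume paths is the point: the driver cannot assume the state it saved at suspend is the state it finds on resume.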
(In reply to comment #26)
> Created an attachment (id=45493)
> Disable planes on resume
>
> I'm attaching a patch that gets rid of both the warning and the corruption.
> It's probably not the correct solution, but it does seem to demonstrate what
> the problem is. From my rudimentary understanding, it appears the BIOS is
> setting up things differently from the kernel, so a plane that's been assigned
> to the pipe is left enabled when the pipe is disabled. So the resume handler
> needs to be aware that the setup may have changed since when the suspend
> happened

That sounds strangely believable. We have logic at init to ensure that the hardware is consistent with our world view; that, and similar logic, needs to be repeated on resume.

Seth, thanks for the invaluable analysis.
Created attachment 45498 Sanitize output registers after resume

Seth, this I think is a slightly better patch only because it reuses the code we already have to try and fix this problem (at boot).
Created attachment 45532 Sanitize output registers after resume v2

Chris, after a small change to fix build breakage the patch seems to be doing the job on all three machines. The screen still shows the corruption very briefly, maybe 0.25 seconds, before it gets corrected. I'm attaching an updated patch with the build fix.

Also, one clarification to the comment I made earlier. In the comment you mention this being similar to a bug caused by the state left by grub2. In reality it's probably the exact same thing, since it's grub2 that's changing to graphics mode when coming out of hibernate. I said BIOS since the BIOS is what's actually manipulating the registers (since grub2 just uses the VESA BIOS extensions afaict).

Do you plan to propose this patch for stable?
(In reply to comment #29)
> Do you plan to propose this patch for stable?

Sounds like a good plan.
commit f6e5b1603b8bb7131b6778d0d4e2e5dda120a379
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Tue Apr 12 18:06:51 2011 +0100

    drm/i915: Sanitize the output registers after resume

    Similar to booting, we need to inspect the state left by the BIOS and
    remove any conflicting bits before we take over. The example reported by
    Seth Forshee is very similar to the bug we encountered with the state left
    by grub2, that the crtc pipe<->planning mapping was reversed from our
    expectations and so we failed to turn off the outputs when booting or,
    in this case, resuming. This may be in fact the same bug, but triggered
    at resume time.

    This patch rearranges the code we already have to clear up the
    conflicting state upon init and calls it from reset (which is called
    after we have lost control of the hardware, i.e. along both the boot and
    resume paths) instead.

    Reported-and-tested-by: Seth Forshee <seth.forshee@canonical.com>
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=35796
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Cc: stable@kernel.org
    Reviewed-by: Keith Packard <keithp@keithp.com>
    Signed-off-by: Keith Packard <keithp@keithp.com>