Summary: | [965gm regression v3.13] TOSHIBA Satellite U400 intel GM965/GL960 suspend/resume failure kernel 3.14 rc7, rc6, 3.13 | ||
---|---|---|---|
Product: | DRI | Reporter: | Tim Richardson <tim> |
Component: | DRM/Intel | Assignee: | Ville Syrjala <ville.syrjala> |
Status: | CLOSED FIXED | QA Contact: | Intel GFX Bugs mailing list <intel-gfx-bugs> |
Severity: | major | ||
Priority: | highest | CC: | gillesphilippe, intel-gfx-bugs |
Version: | unspecified | ||
Hardware: | x86 (IA32) | ||
OS: | Linux (All) | ||
Whiteboard: | |||
i915 platform: | i915 features: | ||
Attachments: |
Description
Tim Richardson
2014-03-23 17:30:16 UTC
Please note: I ran

# echo mem > /sys/power/state

from a terminal with the gdm service stopped (i.e. as close to init 3 as I can get). This suspended the machine, and it restored successfully. According to "Linux Graphics How to debug suspend-resume issues", this indicates a DRM bug.

Hmm, neither here nor in the linked bug are there any details on what happens across suspend/resume. That it still happens with "init 3" is not conclusive evidence that it is i915.ko, since i915.ko is still active and providing the VT.

(Chris Wilson changed Status from NEW to NEEDINFO with comment #2.)

(In reply to comment #2 from Chris Wilson)
Please tell me what more I can provide; I have followed the guides I found.

Most important is a dmesg following resume, if at all possible. If not, a description of what prevents recovering that dmesg.

(In reply to comment #4 from Chris Wilson)
OK, that sounds easy.

Created attachment 96282 [details]
dmesg.0
this was the dmesg log prior to me doing pm-suspend
Also do we have any older kernels which do work, i.e. is this a regression?

(In reply to comment #7 from Daniel Vetter)
Yes, a regression. This machine has been running Arch and Ubuntu for years. Migrating to pre-release versions of 14.04 introduced the bug. I tried mainline 3.13 and rc6 and rc7 of 3.14; all three have this bug. It did not occur previously, when the machine was on Ubuntu 13.10. I'll provide more detailed info and find a previous mainline kernel without the bug tomorrow.

I was wrong. I cannot reproduce this bug with 3.14-rc7 or with earlier kernels if I first stop the display manager (in my case, via service lightdm stop). From a terminal, pm-suspend leads to suspend and a successful resume. It never works if lightdm is not stopped. I will reopen the Ubuntu bug report.

Please note that the dmesg does not include the suspend/resume event, so we still have no idea what the issue is.

(In reply to comment #10 from Chris Wilson)
I don't know how to get this event logged ... I have to reboot the machine after I resume (hard power off). The log I attached was dmesg.0 after such a reboot.

Is the machine pingable? Can you remotely log in after a failed resume? If not, can you narrow down which kernel introduced the error, perhaps even do a bisect?

(In reply to comment #12 from Chris Wilson)
No, it is not pingable. But more importantly, when I was testing multiple kernels I discovered that I cannot reproduce the failure if I have killed the display manager. So I think the bug is with lightdm, not the kernel.

No, it's a kernel bug if the machine is unpingable - you triggered a kernel panic.

OK. Ubuntu has an archive of mainline kernels. With the last of the 3.12 kernels there is no panic and resume works, i.e. 3.12.14 is OK. The first 3.13.0 kernel fails. I will try to bisect it.

Yeah, for machine death upon resume, bisect is usually the most promising approach. And Ubuntu has a great bisect guide ;-)

Do I have to make clean each time I bisect and build a new kernel?

The kernel build system is robust enough to allow rebuilds without cleaning everything out.

I have concluded the bisect.

18442d08786472c63a0a80c27f92b033dffc26de is the first bad commit
commit 18442d08786472c63a0a80c27f92b033dffc26de
Author: Ville Syrjälä <ville.syrjala@linux.intel.com>
Date: Fri Sep 13 16:00:08 2013 +0300

Let me know if there is more info I can provide.

Poke Ville a bit ...

Meanwhile, can you please install the latest intel-gpu-tools (preferably from git if your distro doesn't have the latest release) and then grab the output of the intel_reg_dumper tool for both a working and a broken kernel? Please try to grab them _after_ a suspend/resume cycle. Also please boot with drm.debug=0xe on a broken kernel and grab dmesg a) right after boot and b) after a failed suspend/resume if possible.

Created attachment 97635 [details]
intel_reg_dumper_before_good.txt
intel_reg_dumper_before_good taken before pm-suspend on a kernel which works
Created attachment 97636 [details]
intel reg dumper after good
Created attachment 97637 [details]
intel reg dumper before bad
When the kernel does not resume, nothing is added to dmesg (as far as I can see), so I am ignoring the request for a dmesg after the failure.

Sorry for the delay. I attach here a dmesg with that kernel log option; sadly, however, dmesg is not updated by the resume on a bad kernel. I made a copy of dmesg "before suspend", did a pm-suspend, resumed (crash), did a hard reboot and compared dmesg.0 to my "before suspend" copy. They are identical; no content was added.

Created attachment 98841 [details]
dmesg with drm debug on a bad kernel
You might try ramoops to catch kernel oopses during resume. I recently used it successfully for that. To do that, rebuild your kernel with:

CONFIG_PSTORE=y
CONFIG_PSTORE_CONSOLE=y
CONFIG_PSTORE_RAM=y

Then set up the kernel command line with something like this:

mem=2G ramoops.mem_address=0x80000000 ramoops.mem_size=0x200000 ramoops.ecc=1

Reboot after the kernel hang and mount the pstore fs to see if it caught something:

mkdir /mnt/pstore && mount -t pstore none /mnt/pstore

I think we need to retest with

commit ed5ca77ed7505cd389003a6d35ca1b7365429d71
Author: Ville Syrjälä <ville.syrjala@linux.intel.com>
Date: Mon Dec 2 19:00:45 2013 +0200

drm/i915: Avoid div-by-zero in clock calculation funcs

I guess best would be latest 3.15-rc.

3.15-rc6 still has the bug.

I compiled a kernel with the three PSTORE options, but after the crash and mounting the pstore file system, there was nothing there.

(In reply to comment #30)
> I compiled a kernel with the three PSTORE options, but after crash and
> mounting the pstore file system, there was nothing there.

It might be that the memory got cleared after the machine was restarted. On one of my machines the reset button and watchdog preserved the memory contents, but a quick off/on with the power button didn't.

Maybe try enabling a few more kernel debug knobs:

CONFIG_LOCKUP_DETECTOR=y
CONFIG_BOOTPARAM_HARDLOCKUP_PANIC=y
CONFIG_BOOTPARAM_SOFTLOCKUP_PANIC=y
CONFIG_DETECT_HUNG_TASK=y
CONFIG_BOOTPARAM_HUNG_TASK_PANIC=y

And also set "oops=panic panic=-1" on the kernel command line.

You may also want to give the watchdog a try. So load the iTCO_wdt driver, tell the watchdog daemon to use /dev/watchdog, and start the daemon before suspending. Although on my machine the watchdog failed to reboot the machine when it hung during resume.

Another option would be to set "no_console_suspend" on the kernel command line (or "echo 0 > /sys/module/printk/parameters/console_suspend") and see if you can simply catch the oops on the console. If there's a serial port, try a serial console; if not, try netconsole.

Created attachment 100625 [details] [review]
[PATCH] drm/i915: Populate pipe_config.pixel_multiplier even for disabled pipes

I was pondering this a bit and figured that SDVO must be the key here, since it's a fairly rare thing these days. So after going through the code again I came up with a pretty decent theory for why it might blow up. Hopefully this patch will fix your problems. Please test and report back.

Also, if the patch works, please attach the dmesg from the suspend/resume cycle. I included an extra WARN there to confirm a theory I have. I'll drop that WARN from the final version, but I just want to see whether it triggers.

Thanks. The patch works. I applied it to v3.15-rc8. I'll attach the dmesg log ...

Ah, there is no warn; do I have to enable debug messages somehow? The patch is definitely applied; I checked the code (and besides, I have an unpatched 3.15-rc8 which does not resume).

Created attachment 100674 [details]
dmesg from resume after successful patch applied
(In reply to comment #34)
> Ah, there is no warn; do I have to enable debug messages somehow? The patch
> is definitely applied; I checked the code (and besides, I have an unpatched
> 3.15-RC8 which does not resume)

I guess that part of my theory is wrong then. That just means the SDVO port was also enabled at the time, which seems a bit weird but doesn't really matter.

Can you still attach dmesg with drm.debug=0xe from the suspend/resume cycle so I can double-check that everything else looks fine?

Created attachment 100699 [details]
dmesg after suspend resume with drm debug on
(In reply to comment #37)
> Created attachment 100699 [details]
> dmesg after suspend resume with drm debug on

Hmm. I don't see the suspend/resume in that log. Nothing from i915, nor from acpi or other subsystems.

I tried again, making a copy of dmesg before suspend, and then after resume, but they are identical files. Nothing is added after the resume. I have attached the "After" one anyway.

Created attachment 100714 [details]
dmesg after suspend resume with drm debug on (attempt 2)
Created attachment 100715 [details]
attempt 3: dmesg >
I tried again, this time via the dmesg command.
dmesg > filename
It is a much bigger file; perhaps this has post-resume data.
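The before/after comparison described in these comments can be scripted. The following is only a minimal sketch and was not posted in this thread: the helper name dmesg_new_lines and the file names are hypothetical, and it assumes both captures are taken with the dmesg command (dmesg > filename) rather than copied from a log file.

```shell
#!/bin/sh
# Sketch: print the lines that appear in a second dmesg capture but not in the first.
# dmesg_new_lines is a hypothetical helper name, not part of any existing tool.
dmesg_new_lines() {
    # $1 = capture taken before suspend, $2 = capture taken after resume
    awk -v before="$1" '
        # Load every line of the "before" capture into a lookup table.
        BEGIN { while ((getline line < before) > 0) seen[line] = 1 }
        # Print only lines of the "after" capture that were not seen before.
        !($0 in seen)
    ' "$2"
}

# Intended use (as root; shown as comments so nothing suspends when this is sourced):
#   dmesg > /tmp/dmesg.before
#   pm-suspend
#   dmesg > /tmp/dmesg.after
#   dmesg_new_lines /tmp/dmesg.before /tmp/dmesg.after
```

If the function prints nothing, the second capture holds no lines beyond the first, which matches what the earlier attempts with copied log files showed.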
OK, it really looks like the "attempt 3" log has what you were looking for. It mentions resume events.

(In reply to comment #42)
> OK, it really looks like the "attempt 3" log has what you were looking for.
> It mentions resume events.

Yeah, that looks good. And now I even see my debug WARN triggering. But it also looks like the way I fixed the bug leads to another WARN, so I'll need to fix it another way. I'll attach a revised patch shortly.

Created attachment 100718 [details] [review]
[PATCH] drm/i915: Avoid div-by-zero when pixel_multiplier is zero

Here's the revised patch. Please test, and again attach a dmesg with debug enabled from the suspend/resume.

Should that patch be applied to a clean kernel or on top of the previous patch?

(In reply to comment #45)
> should that patch be applied to a clean kernel or on top of the previous
> patch?

clean

Created attachment 100733 [details]
dmesg after suspend resume with drm debug on, second patch
this is dmesg output after resume, based on the second patch.
(In reply to comment #47)
> Created attachment 100733 [details]
> dmesg after suspend resume with drm debug on, second patch
>
> this is dmesg output after resume, based on the second patch.

OK. No WARNs during suspend/resume, which is good. I'll submit the patch for inclusion. Thanks for persisting and testing.

There are plenty of earlier WARNs in the log from the modeset state checker. You may want to retest with the latest drm-intel-nightly and open a new bug for those if they still persist.

Thanks for your help. How can I track when this will end up in the mainline kernel?

commit 17218c2c19cff4ce9de132a76822bb0f5bbe3d23
Author: Ville Syrjälä <ville.syrjala@linux.intel.com>
Date: Mon Jun 9 16:20:46 2014 +0300

drm/i915: Avoid div-by-zero when pixel_multiplier is zero

This should get merged into mainline within two weeks, and eventually get backported to stable kernels after that. And thanks for the report and testing!

*** Bug 75379 has been marked as a duplicate of this bug. ***

For completeness, another dupe at kernel.org:
https://bugzilla.kernel.org/show_bug.cgi?id=78381

The patch

drm/i915: Avoid div-by-zero when pixel_multiplier is zero

has been added to the 3.15-stable tree, which can be found at:
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=summary

The filename of the patch is:
drm-i915-avoid-div-by-zero-when-pixel_multiplier-is-zero.patch
and it can be found in the queue-3.15 subdirectory.

...

From: Ville Syrjälä <ville.syrjala@linux.intel.com>

commit 2b85886a5457f5c5dbcd32edbd4e6bba0f4e8678 upstream.
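On the question of tracking when a fix reaches a release: one possible approach, assuming a local clone of the relevant kernel tree with its tags fetched, is to ask git which tags already contain the commit. This is a sketch only; tags_containing is a hypothetical helper name and the clone path in the usage comment is made up. For mainline, the id to look for would be the one the stable notice marks as upstream; a stable tree carries the backport under a different id.

```shell
#!/bin/sh
# Sketch: list the tags in a repository that already contain a given commit.
# tags_containing is a hypothetical helper name.
#   $1 = path to a git clone, $2 = commit id
tags_containing() {
    git -C "$1" tag --contains "$2"
}

# Example (the clone location is an assumption):
#   tags_containing ~/src/linux 2b85886a5457f5c5dbcd32edbd4e6bba0f4e8678
```

An empty result means no tagged release in that clone contains the commit yet.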