Bug 91585

Summary: [BDW] System hard lock-up on resume from suspend
Product: DRI
Reporter: Jerome <an.inbox>
Component: DRM/Intel
Assignee: Humberto Israel Perez Rodriguez <humberto.i.perez.rodriguez>
Status: CLOSED NOTOURBUG
QA Contact: Intel GFX Bugs mailing list <intel-gfx-bugs>
Severity: normal
Priority: medium
CC: intel-gfx-bugs, joonas.lahtinen, lawrence.ong, manuelkrause
Version: unspecified
Hardware: x86-64 (AMD64)
OS: Linux (All)
Whiteboard:
i915 platform: ALL
i915 features: power/suspend-resume
Attachments (description; flags):
dmesg of Intel DRM nightly kernel, DRM logs enabled (0x1e); flags: none
Console mode resume glitch, journal extract (nightly kernel); flags: none
dmesg output after boot + suspend + resume; flags: none

Description Jerome 2015-08-08 11:58:53 UTC
Created attachment 117588 [details]
dmesg of Intel DRM nightly kernel, DRM logs enabled (0x1e)

The starting point is similar to bug 90342. On a Thinkpad X1 3rd gen with an i5 5200U running stock Debian stable (Jessie), hardware acceleration is disabled and llvmpipe is used instead. After upgrading the Intel X driver to 2.99.917 (from Debian stable backports), hardware acceleration is properly enabled, but issues occur when resuming from suspend.

With the stock Debian kernel (3.16 + backports), the GPU hangs but console switching works, as described in bug #90342.

With newer Debian kernels (tried 4.0.8 from testing and 4.1.3 from unstable) resume works most of the time, but sometimes the system locks up hard. This happens when switching to X, with a black screen. The system is then fully unresponsive and a hard power-off is required. On reboot, there is no log available. For me, the hard lock-up typically happens after 3 to 5 suspend/resume cycles.
There were no additional DRM logs when doing those tests.

With Intel DRM nightly kernel (fb4572c00fadc1ac94816061e76c65b65607f66a, 2015y-08m-05d-15h-33m-02s UTC integration manifest) based on 4.2.0 RC5 the issue also happens with no additional DRM logs. When enabling logs as asked in the howto (drm.debug=0x1e log_buf_len=50M --- 1M was not enough) the issue did not happen for 20 suspend/resume cycles. 
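
As a side note on the 0x1e value, here is a minimal Python sketch that decodes a drm.debug mask into category names. The bit names are an assumption taken from the kernel's DRM_UT_* debug categories and should be checked against the headers of the kernel actually being tested; the helper name decode_drm_debug is hypothetical.

```python
# Assumed mapping of drm.debug bits to DRM_UT_* category names.
# Verify against the DRM headers of the kernel under test.
DRM_DEBUG_BITS = {
    0x01: "CORE",
    0x02: "DRIVER",
    0x04: "KMS",
    0x08: "PRIME",
    0x10: "ATOMIC",
    0x20: "VBL",
}


def decode_drm_debug(mask):
    """Return the names of the debug categories enabled by a drm.debug mask."""
    return [name for bit, name in sorted(DRM_DEBUG_BITS.items()) if mask & bit]


if __name__ == "__main__":
    # 0x1e is the value used in this report.
    print(hex(0x1e), decode_drm_debug(0x1e))
```

Under that assumed mapping, 0x1e enables every category listed except CORE and VBL, which may help when judging how much extra logging (and timing change) a given mask introduces.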

So it looks like a racy problem that vanishes when adding a lot of logs and is present since 4.0.8.

I attached the output of dmesg with logs enabled.
Comment 1 Jerome 2015-08-09 19:57:38 UTC
Created attachment 117599 [details]
Console mode resume glitch, journal extract (nightly kernel)

I tried suspend/resume with X disabled (console mode) with the Intel DRM nightly kernel (same as before) just in case, and there is an i915/DRM glitch. The system recovers and works fine, but there is console output and journal info. It occurred 3 times out of 3, so it may be systematic (more tests to do). I attached the relevant journal extracts.

If it's systematic, it may not fully explain the lock-up when X is enabled (as the lock-up is not systematic), but hopefully may be related and could give an idea where to look. TBC.

In any case, when X is enabled and the system locks up it's a hard lock-up: the network is down (no reply to ping), and USB is also down (when hooking up an optical mouse, its bottom LED doesn't light up). So it's not easy to see what happened...

For more info on the laptop environment, see Debian initial bug:
  https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=794393

Let me know if there's anything I can do to collect useful information.

I'll try in console mode with the older kernels I used too to see if I get a similar behavior.
Comment 2 Jerome 2015-08-10 20:51:41 UTC
The console / no X glitch is systematic with the nightly kernel (no error log in /sys/class/drm/card0/error, only journal logs), but I cannot reproduce it on the older kernels I used (where the X lock-up can be reproduced). As the issue is racy, I'm not sure we can draw any conclusion anyway.
Comment 3 Jesse Barnes 2015-08-17 21:07:50 UTC
S3 resume failure is a critical one; marking high prio.  It would be easiest to fix if you could get a bisect, but it sounds like the nature of this problem might make it tough.  Can you reproduce it more easily by suspending while you have a 3D app running, like a game?
Comment 4 Jerome 2015-08-18 12:20:02 UTC
Hi Jesse, thanks for looking into this at a high priority.

When doing previous tests, I tried both without specific 3D app (under KDE Kwin WM, XRender for compositing) and with both glxgears and glxdemo running, and didn't notice a difference in lock-up frequency.

Since then I stumbled into something odd. Because stability is more important to me than 3D performance right now, I had reverted to the default Debian stable X Intel video driver, based on version 2.21.15. And I stayed on kernel 4.1.3 from Debian testing at the same time. With this combination 3D acceleration is not enabled, the system uses LLVMpipe for 3D. Still, I had two lock-ups.
After the second lock-up I tried to reproduce the bug systematically to assess the frequency, without success so far (15 successful resumes in a row). So it looks as if the lock-up can happen without full hardware 3D acceleration, even if it's less frequent.
BTW, I also tried the same Intel kernel 4.2.0rc5 as before with 2.21.15: can't reproduce either after 15 resumes in a row.

The lock-up always occurs at the same point, early on. In a successful resume:
 1) at some point during the boot, the console is cleared, screen only shows a blinking cursor in top-left corner;

 2) there's a short screen flicker, like for a mode change. Screen is black;

 3) after a short duration (~ < 1 sec), previous session graphic image is restored;

 4) graphic environment is usable.

A lock-up occurs after (1) and before (2) as I never noticed the screen flicker, so it's early in the resume process.
And FYI the GPU hang of bug #90342 occurred between (3) and (4) so different timing.

The lock-up bug looks racy (not systematic, and a higher log level hides it), so bisecting on the kernel version looks chancy: some unrelated change may hide the problem. The first symptoms are also old (I saw the lock-up with 4.0.8), with a different but buggy behavior on 3.16 (could be a different root cause). So the bug detection may not always be reliable, and where to start is not clear either.

Instead, it's relatively easy to reproduce the lock-up using Intel kernel 4.2rc5 with video driver 2.99.917. Is there any way to investigate based on this? To try to narrow down in which part of the resume sequence the issue happens?
I have a background in embedded dev on RTOS/RISC systems, but no low-level experience with x86 and Linux kernel dev, so please excuse some possibly naive questions/suggestions. Is there any special debug mode to get some info past the hard lock-up? For example, even if I have to power cycle the laptop on a lock-up, the power-off is short and the system memory may be partially preserved. A log to a buffer, dumped on the next reboot, may (TBC) show some data. Or could a watchdog IRQ reset the video to a basic, safe text mode to dump some logs after the lock-up?

Thanks
Comment 5 Jerome 2015-08-18 16:15:02 UTC
Lock-up reproduced, again unintentionally, with Debian kernel 4.1.3 and driver 2.21.15. The laptop locked up on a resume after a hibernation lasting a few hours. And the two unexpected lock-ups were similar (long hibernation before resume) now that I think of it.

I'm moving back to Intel kernel 4.2.0rc5 now.
Comment 6 Jerome 2015-08-20 10:27:51 UTC
With Intel kernel 4.2.0rc5, with old driver 2.21.15 (so LLVMpipe for 3D): I had a lock-up at the 3rd resume (long hibernations, > 1h). I then tried quick suspend/resume cycles and got 2 lock-ups in 6 cycles.

I then updated the X Intel driver to 2.99.917 (still with Intel kernel 4.2.0rc5). Because glxgears may be too light to make a difference, I installed and ran Tux racer. Then I did suspend/resume cycles, mostly in short succession but with some pauses of a few minutes at times: it locked up only at the 22nd resume.

So, to summarize:
 - the issue occurs both with LLVMpipe (X driver 2.21.15) and real hardware acceleration (X driver 2.99.917). Also both with kernels 4.1.3 (Debian testing) and Intel 4.2.0rc5, all 4 combinations tested;

 - having running 3D clients when suspending does not seem to increase the likelihood of a lock-up;

 - the lock-up seems to happen early, before the mode switch to a graphical setting (still in console mode when it locks up). By this I mean that anytime I saw the glitch/flicker that happens when switching to graphical mode, then the resume went fine (no lock-up);

 - the lock-up frequency is highly variable, so it's hard to draw reliable conclusions from it. So ignore my previous comment on getting a lock-up after a long suspend duration, it's likely just chance and not relevant.

Please let me know how you want to proceed here. Thanks
Comment 7 Rodrigo Vivi 2015-08-20 23:47:36 UTC
Could you please check if you still face the issue if you boot your kernel with

i915.enable_execlists=0 i915.enable_ppgtt=1

?

thanks
Comment 8 Jerome 2015-08-21 12:48:14 UTC
Unfortunately it didn't help, with these options I reproduced a lock-up on third resume from suspend.

Output of /proc/cmdline shows the options are there:
BOOT_IMAGE=/vmlinuz-4.2.0-rc5-dbghang root=/dev/mapper/vgroup1-lvroot ro quiet i915.enable_execlists=0 i915.enable_ppgtt=1
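
For anyone repeating this check, a small Python sketch that verifies the expected options are present in a kernel command line. The helper names parse_cmdline and missing_options are hypothetical, and the parser is deliberately simple (it does not handle quoted values containing spaces).

```python
def parse_cmdline(text):
    """Split a kernel command line into {name: value}; bare flags map to None."""
    params = {}
    for token in text.split():
        name, sep, value = token.partition("=")
        params[name] = value if sep else None
    return params


def missing_options(text, expected):
    """Return the expected options that are absent or set to another value."""
    params = parse_cmdline(text)
    return {k: v for k, v in expected.items() if params.get(k) != v}


if __name__ == "__main__":
    # Sample taken from the /proc/cmdline output quoted in this comment.
    sample = ("BOOT_IMAGE=/vmlinuz-4.2.0-rc5-dbghang "
              "root=/dev/mapper/vgroup1-lvroot ro quiet "
              "i915.enable_execlists=0 i915.enable_ppgtt=1")
    wanted = {"i915.enable_execlists": "0", "i915.enable_ppgtt": "1"}
    print(missing_options(sample, wanted))
```

On a live system the text would come from reading /proc/cmdline instead of the sample string. An empty result means both options reached the kernel command line; it does not by itself prove the driver honored them.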

Let me know if there's anything else you'd like to try.

Thanks
Comment 9 Jerome 2015-09-23 20:35:53 UTC
The lock-up can be reproduced with both the Intel kernel 4.2 above and a Debian 4.1 (backport) while disabling DRI and hardware acceleration in the X configuration with:

  Section "Device"
      Identifier  "Intel Graphics"
      Driver      "intel"
      Option      "NoAccel" "True"
      Option      "DRI" "False"
  EndSection

With the above, 3D is using LLVMpipe and the X log properly indicates that hardware acceleration is disabled. But the lock-up on resume still happens.
As a reminder, the lock-up happens while still in console mode, at about the time the display switches to graphic mode when all goes well.

I tried a recent Intel kernel (2015y-09m-21d-09h-47m-02s, commit 4f1d1fdaff9a6ad45b4c0399171f89b60e080070) but there are other issues that may be different (lots of display corruptions with DRI/hw accel disabled so not convenient to use, lock-up on suspend with hw acceleration on).
Comment 10 Joonas Lahtinen 2015-09-29 11:18:20 UTC
I'm unable to reproduce this bug on BDW-U (60 suspend&resume cycles).

Could you try attaching a USB serial adapter (or using the internal serial port if you have one), logging in from it, running pm-suspend from that command line, and then using the power switch to wake the system from suspend?

The system I used has Ubuntu 15.04 and drm-intel-nightly from last Friday. Xorg was running lightdm during the cycling.
Comment 11 Jerome 2015-09-29 21:19:39 UTC
Hi Joonas,

The bug seems to depend on KWin being the application managing the display on resume.

In all the tests I performed so far I've been using KDE4 on Debian Jessie, and KDE was configured NOT to lock the screen on resume (KDE setting => Hardware section, Power Management => Advanced Settings and unselect "Lock screen on resume") as I use FDE and the DM crypt password is enough protection. This is not the default BTW.

Your comment made me wonder what other part of my environment the bug may depend upon. So I first installed Gnome and did suspend/resume cycles using it: 20 cycles and no problem. On the kernel I used (Debian 4.1 from backport, easiest one to trigger the issue so far) 3-5 cycles was enough to get a lock-up by comparison. So there seems to be a dependency on the DE.

Then I made another test: I went back to KDE, but re-enabled "Lock screen on resume". So on resume the screen is handled not by KWin but by the KDE screenlock application. 20 cycles and no problem.
So it looks like the lock-up also depends on KWin being active on resume.

I'll make further tests with KDE + lock screen on resume to make sure it's really ok. On your side, could you try installing KDE on Ubuntu, and doing cycles with the "Lock screen on resume" option disabled?

Regarding the USB serial adapter, I don't have one right now. I'll try to get one and try what you suggested. When the lock-up happens the USB seems dead though: when I plug in a USB mouse it's not even powered. Maybe I'll still get some log out of it, we'll see.

Thanks
Comment 12 Jerome 2015-10-01 19:54:48 UTC
Lock-up reproduced with KDE + screen lock on suspend. It's just less frequent, but still there.
Comment 13 Jerome 2015-10-04 16:36:34 UTC
I tried attaching a USB serial adapter, but no luck: I can get the serial port functioning, but only later after boot and not as a console. From what I could find on the topic, it's not enough to put the right kernel option (console=ttyUSB0); the serial port must also be recognized and used by grub for systemd to use it automatically as a console. The most helpful resource I found on this is http://www.coreboot.org/GRUB2, but it didn't work for me (no serial port found at grub level).

I tried something else: configure systemd so that the power button triggers a hibernate from the console. Then with KDE on VT7, I switched back to a simple console on VT1 and triggered suspend/resume cycles there. I can reproduce the lock-up this way too (I did this in the hope of finding a work-around, too bad...). It happens just before the console session is restored, and there's no info displayed: just the same lock-up with only the cursor in the top-left corner, exactly as with KDE.

However, if I stop KDM (so no X running at all) and I do hibernation cycles from VT1 then I can't reproduce the lock-up after 40 cycles.

So it seems the lock-up is related to restoring the X session, whether or not it's on screen at the moment.

If there are suggestions on what to try next, I'll be glad to help.
Comment 14 Joonas Lahtinen 2015-10-08 08:32:46 UTC
If you have the power button triggering a suspend, and resume with just a button too, can you try booting the machine to the desktop environment and operating it with the button only, to see if it hangs without other interactions? And if it still does hang, can you try with lightdm too and report back?
Comment 15 cprigent 2015-10-08 16:21:48 UTC
Bug scrub:
Lower priority until we receive feedback from submitter
Comment 16 Jerome 2015-10-08 21:33:23 UTC
Hi Joonas. Regarding your comment #14, the lack of interaction does help. I tried 3 things:

1) as suggested, from boot I waited until the KDM login screen. From there, with no interaction I did suspend to disk / resume cycles using only the power button (plus the keyboard for the DM crypt unlock at boot). No lock-up in 40 cycles.

2) then I logged in through KDM and waited until I got my KDE session. From there, I tried the same: no interaction, suspend/resume cycles using the power button only. No lock-up in 40 cycles either.

3) from there, I continued doing the suspend/resume cycles the way I used to before. It involves very little interaction: at the beginning I check hw acceleration with glxinfo and the kernel version, just to make sure I have the right environment. Then I open a konsole. At each cycle I do a "date ; echo N" where N is the cycle number, just to keep track. Sometimes I clear the notifications, and once in a while browse the web (uncommon, not correlated with the lock-up as far as I can tell). So really limited interaction, mostly it's key up in konsole to get the previous command, edit the number and type return. I got a lock-up at the 8th resume this way.

So interaction does seem to be part of it. If you'd like me to try something else, let me know.
Comment 17 Manuel Krause 2015-11-24 00:53:15 UTC
Any news on this issue?!

regarding the following (here repeated) description in Comment 4:
------>
The lock-up always occurs at the same point, early on. In a successful resume:
 1) at some point during the boot, the console is cleared, screen only shows a blinking cursor in top-left corner;

 2) there's a short screen flicker, like for a mode change. Screen is black;

 3) after a short duration (~ < 1 sec), previous session graphic image is restored;

 4) graphic environment is usable.

A lock-up occurs after (1) and before (2) as I never noticed the screen flicker, so it's early in the resume process.
<------

It's a bit unfortunate that there's so much misleading info in this thread.

I get the same issue (hibernation crashing) on my system for many months now. No logs available for it. Hard-lock.

For my system it's a bit complicated to summarize the kernel's circumstances.
Normally I use Alfred Chen's kernel patches (based on -ck/BFS) and the BFQ disk scheduler patches and the TuxOnIce patches. Atm kernel 4.3.0 from openSUSE with own .config.

So, please note that I've retested with plain CFS + TuxOnIce... And it also crashes at the same point as cited above.

lspci excerpt:
00:02.0 VGA compatible controller: Intel Corporation Mobile 4 Series Chipset Integrated Graphics Controller (rev 07) (prog-if 00 [VGA controller])
        Subsystem: Hewlett-Packard Company Device 30dd
        Flags: bus master, fast devsel, latency 0, IRQ 31
        Memory at d0000000 (64-bit, non-prefetchable) [size=4M]
        Memory at c0000000 (64-bit, prefetchable) [size=256M]
        I/O ports at 60f0 [size=8]
        Expansion ROM at <unassigned> [disabled]
        Capabilities: [90] MSI: Enable+ Count=1/1 Maskable- 64bit-
        Capabilities: [d0] Power Management version 3
        Kernel driver in use: i915
        Kernel modules: i915

The system is a HP Compaq 6730b Laptop.

Running openSUSE 13.1 with updated drivers.

Best regards,
Manuel
Comment 18 Joonas Lahtinen 2015-11-27 13:41:42 UTC
Please, test against kernel from http://cgit.freedesktop.org/drm-intel/ branch drm-intel-nightly.

I was unable to get any hangs or display artifacts while doing 60 suspend resume cycles so it makes it impossible for me to debug.

It would be optimal if you could enable (USB) serial console from kernel and set the drm module debug parameter to -1 before suspending and try to capture as much debug as possible. Then it would give an indication what could possibly be causing the lock-ups.
Comment 19 Joonas Lahtinen 2015-11-27 14:12:45 UTC
I tested this again. With Broadwell based NUC5i7RYH, there's absolutely no trouble running pm-suspend continuously even from the graphical terminal. And I was using the stock Ubuntu 15.10 kernel.
Comment 20 Manuel Krause 2015-11-30 21:05:22 UTC
@Joonas:
Thank you very much for your replies.

Obviously we don't share the same hardware. Only the symptoms are/were the same as from the BUG's reporter.

Unfortunately I can't get you info via external channels like a USB/serial console. Additionally, I don't use Ubuntu.

For the moment, I've recompiled my kernel to include i915 into the kernel and attached myself to the more appropriate freedesktop BUG 91976.

Best regards,
Manuel
Comment 21 Joonas Lahtinen 2016-01-12 12:56:11 UTC
If this can still be reproduced, this should be tested with a kernel with Imre's RPM wakeref patches: http://patchwork.freedesktop.org/series/611/ . Daniel changed the WARN_ONCE into internal debug (http://patchwork.freedesktop.org/patch/69490/), so driver debugging must be enabled to see the message.

Moving this to QA for an attempt to reproduce.
Comment 22 Jerome 2016-01-13 07:22:00 UTC
With Intel DRM nightly from yesterday:
  commit fdbeff6c26904cedc81e0b3383d3174802230a60
  2016y-01m-12d-18h-13m-13s UTC integration manifest

Some progress: on resume, the screen is black and it's not possible to switch VT, BUT the system is still alive: it responds to ping, detects a plugged USB device. Seems systematic over a few trials.

I will install an SSH server and collect logs with debug after resume failure later today.

FYI, I had tried 4.3 kernels from Debian backport and Ubuntu (see comment #19): both were hanging as usual, after only a few resume cycles.
Comment 23 Jerome 2016-01-13 21:41:11 UTC
Created attachment 121007 [details]
dmesg output after boot + suspend + resume

With the nightly kernel as described in comment #22, this is the output of dmesg after boot, suspend and resume with kernel option drm.debug=0x1e.

Resume fails:
  [  111.781637] [drm:intel_display_resume [i915]] *ERROR* Restoring old state failed with -22

/sys/class/drm/card0/error contains no error ("no error state collected").
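
To pick such lines out of a long dmesg capture, a minimal Python sketch follows. It assumes the line layout shown above and the usual kernel convention that a trailing negative number is a negated errno; parse_drm_error is a hypothetical helper name.

```python
import errno
import re

# Matches lines like:
# [  111.781637] [drm:intel_display_resume [i915]] *ERROR* Restoring ...
DRM_ERROR_RE = re.compile(
    r"\[\s*(?P<time>\d+\.\d+)\]\s+"
    r"\[drm:(?P<func>\w+)(?: \[\w+\])?\]\s+"
    r"\*ERROR\*\s+(?P<msg>.*)"
)


def parse_drm_error(line):
    """Return timestamp, function, message and errno name, or None if no match."""
    match = DRM_ERROR_RE.search(line)
    if match is None:
        return None
    msg = match.group("msg").rstrip()
    # Assumption: a trailing "-N" is a negated errno value.
    code = re.search(r"-(\d+)$", msg)
    name = errno.errorcode.get(int(code.group(1))) if code else None
    return {
        "time": float(match.group("time")),
        "func": match.group("func"),
        "msg": msg,
        "errno": name,
    }


if __name__ == "__main__":
    line = ("[  111.781637] [drm:intel_display_resume [i915]] "
            "*ERROR* Restoring old state failed with -22")
    print(parse_drm_error(line))
```

Read this way, the -22 in the message above corresponds to EINVAL, which suggests the restored display state was rejected as invalid rather than the hardware timing out.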

As described, after resume screen is black and VT switching doesn't work but system is alive and can be logged into with SSH, and rebooted cleanly.

Let me know if you want to make other tests.
Comment 24 Joonas Lahtinen 2016-01-14 12:17:15 UTC
Thanks for checking, could you please open a new bug with "[drm:intel_display_resume [i915]] *ERROR* Restoring old state failed with -22" title, that is very likely a different issue.

Please mention in the bug which kind of display you have attached, and attach a full dmesg from boot until the error. Please use the drm.debug=0x04 command line parameter and we'll give it a go.
Comment 25 Jerome 2016-01-14 20:59:35 UTC
Bug #93719 opened. Once it is fixed I'll check kernel 4.4.0 for this bug.

Thanks
Comment 26 Jerome 2016-02-03 21:14:37 UTC
With DRM nightly kernel 4.5.0-rc1:
  commit 29cc50da24351521b481482cb64043304cafbba9
  drm-intel-nightly: 2016y-01m-29d-20h-37m-33s UTC integration manifest
where bug #93719 doesn't happen, I can reproduce this bug here.

There is only one difference: the "mute sound" key reacts and its LED can be turned on or off when the system is hung, which was not the case before. But the system is not replying to ping, and when plugging in a USB mouse it's not even powered up. Still no log after reboot.
Comment 27 john.leuner 2016-05-04 19:52:08 UTC
I experienced the same symptoms and worked around the issue by using 'systemctl hibernate' (or systemctl suspend) instead of the hibernate script in the debian hibernate package.
Comment 28 Jerome 2016-05-05 18:01:05 UTC
Thanks for the info John, but it doesn't apply to my case unfortunately: I don't have the "hibernate" package installed. With systemd there is no need for it, and then whether you close the lid, use the power button or run "systemctl hibernate" it's the same kernel interface (/sys/power) that is used. I tried all variants before and got hang-ups on my laptop in all cases.
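
To illustrate the shared kernel interface, here is a minimal Python sketch around /sys/power/state, which lists the supported sleep states and starts a transition when one of them is written to it (for example "mem" for suspend to RAM, "disk" for hibernation). The helper names are hypothetical, and dry_run defaults to True so that nothing is written unless explicitly requested.

```python
from pathlib import Path

POWER_STATE = Path("/sys/power/state")


def supported_states(text):
    """Parse the space-separated state list read from /sys/power/state."""
    return text.split()


def request_sleep(state, path=POWER_STATE, dry_run=True):
    """Validate the requested state, then write it unless dry_run is set."""
    available = supported_states(Path(path).read_text())
    if state not in available:
        raise ValueError(f"{state!r} not supported, available: {available}")
    if dry_run:
        return f"would write {state!r} to {path}"
    # Needs root. The machine goes to sleep during this write.
    Path(path).write_text(state)
    return f"wrote {state!r} to {path}"
```

Calling request_sleep("mem") on a typical laptop only reports what would be written; passing dry_run=False as root performs the actual transition, which is the same path the lid switch, the power button and systemctl end up using according to the comment above.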

I installed the "hibernate" package to have a look (it also hangs on my laptop BTW), and the suspend/resume sequence is slightly different than with just systemd: hibernate displays the state save/restore progress in text mode, and when restoring the graphic mode I have a transient "screen filled with random noise" state that doesn't happen with systemd only. So there is some difference, but not enough to change the result on my laptop...

While I'm at it, hang-up on resume still happens with Intel nightly kernel 4.6.0-rc6 as follows:
  commit e6160ef8b9b3ddfcb1fd382716887e57a2896710
  2016y-05m-05d-08h-06m-20s UTC integration manifest
Comment 29 john.leuner 2016-05-22 07:19:51 UTC
It seems that I still have the same problem, resumes work some of the time but most often lock up the machine.
Comment 30 Jerome 2016-06-27 19:49:49 UTC
It looks like it's not a graphics issue but a generic x86-64 PM bug, see kernel bug 104771:
   https://bugzilla.kernel.org/show_bug.cgi?id=104771

The discussion mentions a patch for this issue:
   https://patchwork.kernel.org/patch/9172981/

It applies cleanly to the Intel DRM kernel (I used 4.7.0-rc4, commit 5c244f4b128c6274755007e080d46e0a61b71534). From the comments it needs a recent kernel: someone tried a 4.6.2 ok, but the patch didn't apply to a 4.5.

It's too early to say this patch fixes the issue completely for sure, more testing is needed, but it certainly helps a lot. No crash so far with 20 cycles. For those having the issue I recommend trying it.
From comments, the patch should be in the final 4.7.0.

Also from comments, the hang is more likely when more memory is allocated at suspend time. In the kernel bugzilla entry there's a small app to allocate as many blocks of 1 GB as possible, you can try it.
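
The app attached to the kernel bug is not reproduced here. As a stand-in, this Python sketch shows the same idea under stated assumptions: allocate fixed-size blocks and touch one byte per page so the memory is really committed, until allocation fails or a limit is reached. The function name, the 4096-byte page size and the demo values are all assumptions.

```python
def allocate_blocks(block_size=1 << 30, max_blocks=None, page=4096):
    """Allocate and touch blocks until MemoryError or max_blocks; return them."""
    blocks = []
    while max_blocks is None or len(blocks) < max_blocks:
        try:
            block = bytearray(block_size)
            # Touch one byte per (assumed) page so the pages get committed.
            for offset in range(0, block_size, page):
                block[offset] = 1
        except MemoryError:
            break
        blocks.append(block)
    return blocks


if __name__ == "__main__":
    # Small demo values; use the defaults to try 1 GB blocks.
    held = allocate_blocks(block_size=1 << 20, max_blocks=4)
    print(f"holding {len(held)} blocks of {len(held[0])} bytes")
```

With memory overcommit enabled, the kernel's OOM killer may terminate the process instead of Python raising MemoryError, so on a real test it is safer to pass an explicit max_blocks and keep the process alive (for instance waiting on input) while suspending.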

I will report back later after more cycles.
Comment 31 Jerome 2016-07-10 10:34:28 UTC
I keep suspending my system until it crashes, and have used systemd sleep hooks to keep stats. Without the patch, on the last 15 reboots my laptop used to crash on average after ~2 suspend/resume cycles, with a maximum of 4 successful suspend/resume before the hang bug forced me to reboot.
With the patch applied I'm now at 42 successful cycles, not a single hang.
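
The hooks themselves are not shown in this report. A sketch of what such a hook could look like in Python, assuming the usual systemd convention that executables in the system-sleep directory are called with two arguments ("pre" or "post", then the action such as "suspend" or "hibernate"). The log path and function names are hypothetical.

```python
import sys
import time

LOG_PATH = "/var/log/sleep-cycles.log"  # hypothetical location


def record_event(phase, action, log_path=LOG_PATH, now=None):
    """Append one 'timestamp phase action' line (UTC) to the log."""
    stamp = time.strftime("%Y-%m-%dT%H:%M:%S", time.gmtime(now))
    with open(log_path, "a") as log:
        log.write(f"{stamp} {phase} {action}\n")


def completed_cycles(lines):
    """Count 'pre' events that were followed by a 'post' event."""
    count = 0
    pending = False
    for line in lines:
        fields = line.split()
        if len(fields) < 3:
            continue
        if fields[1] == "pre":
            pending = True
        elif fields[1] == "post" and pending:
            count += 1
            pending = False
    return count


if __name__ == "__main__" and len(sys.argv) == 3:
    # systemd passes: <pre|post> <suspend|hibernate|hybrid-sleep>
    record_event(sys.argv[1], sys.argv[2])
```

A trailing "pre" line with no matching "post" marks a resume that never completed, so counting completed cycles between such entries gives the cycles-to-hang figures quoted above.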

I'm closing the bug, people can follow the kernel bug entry:
   https://bugzilla.kernel.org/show_bug.cgi?id=104771

Compared to the patch I used there have been some changes, which I have not tested yet, and the fix is not yet upstreamed.
