On rerun 42/105 of Fast feedback, 20 machines failed the test igt@gem_exec_suspend@basic-s4-devices. The mysterious thing was that all the other re-runs of the same kernel configuration passed. After some investigation from Tomi, it turned out to be caused by the weekly fstrim job firing. Suggested workarounds (W/A):
- mount -o discard
- run fstrim before each run
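The two suggested workarounds could look like this in practice (a sketch; the mount point is a placeholder, adjust for the DUT's actual filesystem layout):

```shell
# W/A 1: continuous discard - TRIM happens inline at unlink time,
# so no batched fstrim job is needed at all
mount -o remount,discard /

# W/A 2: trim all mounted filesystems once, synchronously, before the
# test run starts, so the weekly fstrim job has nothing left to do
# while the machine is suspending
fstrim --all --verbose
```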
The dmesg logs have been copied over for run CI_DRM_4451_42, e.g. https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_4451_42/fi-glk-j4005/dmesg0.log

```
<7>[ 255.076407] [IGT] gem_exec_suspend: starting subtest basic-S3
<6>[ 257.096983] PM: suspend entry (deep)
<6>[ 257.097032] PM: Syncing filesystems ... done.
<6>[ 257.519122] Freezing user space processes ...
<4>[ 257.700828] hpet1: lost 7160 rtc interrupts
<4>[ 258.610436] hpet1: lost 7161 rtc interrupts
<4>[ 259.520732] hpet1: lost 7161 rtc interrupts
<4>[ 260.431583] hpet1: lost 7161 rtc interrupts
<4>[ 261.358618] hpet1: lost 7161 rtc interrupts
<4>[ 262.270512] hpet1: lost 7161 rtc interrupts
<4>[ 263.163399] hpet1: lost 7161 rtc interrupts
<4>[ 264.078469] hpet1: lost 7161 rtc interrupts
<4>[ 264.988525] hpet1: lost 7161 rtc interrupts
<4>[ 265.900422] hpet1: lost 7161 rtc interrupts
<4>[ 266.844790] hpet1: lost 7161 rtc interrupts
<4>[ 267.797851] hpet1: lost 7161 rtc interrupts
<4>[ 268.759081] hpet1: lost 7160 rtc interrupts
<4>[ 269.697541] hpet1: lost 7161 rtc interrupts
<4>[ 270.609418] hpet1: lost 7161 rtc interrupts
<4>[ 271.520478] hpet1: lost 7161 rtc interrupts
<3>[ 277.523861] Freezing of tasks failed after 20.004 seconds (1 tasks refusing to freeze, wq_busy=0):
<6>[ 277.524696] fstrim          D12728  1426      1 0x00000004
<4>[ 277.524772] Call Trace:
<4>[ 277.524807]  ? __schedule+0x364/0xbf0
<4>[ 277.524835]  ? __schedule+0x1df/0xbf0
<4>[ 277.524865]  ? wait_for_common_io.constprop.2+0x106/0x1b0
<4>[ 277.524884]  ? schedule+0x2d/0x90
<4>[ 277.524896]  ? schedule_timeout+0x236/0x540
<4>[ 277.524914]  ? _raw_spin_unlock_irq+0x41/0x50
<4>[ 277.524930]  ? blk_flush_plug_list+0x213/0x270
<4>[ 277.524963]  ? wait_for_common_io.constprop.2+0x106/0x1b0
<4>[ 277.524980]  ? io_schedule_timeout+0x14/0x40
<4>[ 277.525073]  ? wait_for_common_io.constprop.2+0x12e/0x1b0
<4>[ 277.525091]  ? wake_up_q+0x70/0x70
<4>[ 277.525120]  ? submit_bio_wait+0x5a/0x80
<4>[ 277.525156]  ? blkdev_issue_discard+0x7b/0xd0
<4>[ 277.525201]  ? ext4_trim_fs+0x46d/0xbb0
<4>[ 277.525216]  ? rcu_read_lock_sched_held+0x6f/0x80
<4>[ 277.525229]  ? ext4_trim_fs+0x46d/0xbb0
<4>[ 277.525299]  ? ext4_ioctl+0xdc4/0x10d0
<4>[ 277.525314]  ? cp_new_stat+0x155/0x190
<4>[ 277.525360]  ? do_vfs_ioctl+0xa0/0x6d0
<4>[ 277.525378]  ? __se_sys_newfstat+0x3c/0x60
<4>[ 277.525409]  ? ksys_ioctl+0x35/0x60
<4>[ 277.525432]  ? __x64_sys_ioctl+0x11/0x20
<4>[ 277.525445]  ? do_syscall_64+0x55/0x190
<4>[ 277.525462]  ? entry_SYSCALL_64_after_hwframe+0x49/0xbe
<6>[ 277.525520] OOM killer enabled.
<6>[ 277.525529] Restarting tasks ... done.
<6>[ 277.542101] video LNXVIDEO:00: Restoring backlight state
<6>[ 277.542140] PM: suspend exit
<5>[ 277.581046] Setting dangerous option reset - tainting kernel
<7>[ 277.581509] [IGT] gem_exec_suspend: exiting, ret=99
```
We are not running crons on the test machines anymore, so this bug is no longer valid. Closing.
The CI Bug Log issue associated to this bug has been updated.

### New filters associated

* ELK ILK IVB BXT: igt@gem_exec_suspend@basic-s4-devices - fail - Freezing of tasks failed after 20.\d+ seconds (\d+ tasks refusing to freeze, wq_busy=0), fstrim
  - https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_5619/fi-bxt-j4205/igt@gem_exec_suspend@basic-s4-devices.html
  - https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_5619/fi-elk-e7500/igt@gem_exec_suspend@basic-s4-devices.html
  - https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_5619/fi-ilk-650/igt@gem_exec_suspend@basic-s4-devices.html
  - https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_5619/fi-ivb-3520m/igt@gem_exec_suspend@basic-s4-devices.html
  - https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_5619/fi-ivb-3770/igt@gem_exec_suspend@basic-s4-devices.html
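As a sanity check, the filter's regex can be matched against the failing dmesg line from comment 1 (a sketch using GNU grep -E, with the PCRE `\d` classes written out as `[0-9]`):

```shell
# The dmesg line the filter is meant to catch (reproduced from the log)
line='Freezing of tasks failed after 20.004 seconds (1 tasks refusing to freeze, wq_busy=0):'

# ERE equivalent of the CI Bug Log filter pattern; prints the line if it matches
echo "$line" | grep -E 'Freezing of tasks failed after 20\.[0-9]+ seconds \([0-9]+ tasks refusing to freeze, wq_busy=0\)'
```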
And it's back....
So is it valid or not?
All the latest reports (Comment #3) are exactly the same: fstrim refuses to freeze because an fstrim job is still active. On recent Linux distributions fstrim is run as a systemd service (see below for Fedora), so this is not a cron-job issue (Comment #2). Tomi: I suggest that all fstrim services be disabled while the IGT tests are running.

```
[dhiatt ~]$ sudo hdparm -I /dev/sda | grep TRIM
	   *	Data Set Management TRIM supported (limit 4 blocks)
	   *	Deterministic read ZEROs after TRIM
[dhiatt ~]$ systemctl status fstrim.service
● fstrim.service - Discard unused blocks
   Loaded: loaded (/usr/lib/systemd/system/fstrim.service; static; vendor preset: disabled)
   Active: inactive (dead)
```
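Disabling the fstrim services could be done along these lines on a systemd distribution (a sketch; assumes the units are named fstrim.service/fstrim.timer as on Fedora):

```shell
# Stop anything in flight, then mask both units so neither the timer
# nor a manual "systemctl start" can trigger a trim during an IGT run
systemctl stop fstrim.timer fstrim.service
systemctl mask fstrim.timer fstrim.service

# Verify: both units should now report "masked"
systemctl is-enabled fstrim.timer fstrim.service
```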
Tomi, I have assigned this to you as it is a machine configuration issue, please let me know if you think otherwise. don
If fstrim is running, I think it's a configuration issue, too. The systemd fstrim.service and fstrim.timer units are disabled, but the job still runs somehow. Mounting ext4 with -o discard is probably the next best choice. Some DUTs have extremely old SSDs (Intel 520, OCZ Vertex, etc.), so this might also affect performance; I'm rolling the change out slowly.
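For reference, the -o discard variant is a one-line fstab change per filesystem (a sketch; the UUID is a placeholder and the mount point assumes the root filesystem):

```shell
# /etc/fstab: enable inline discard on the root ext4 filesystem, so TRIM
# is issued at unlink time instead of by a batched fstrim job
# (UUID below is a placeholder)
UUID=0000-placeholder  /  ext4  defaults,discard  0  1

# Apply without a reboot; the root fs can usually be remounted in place
mount -o remount,discard /
```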
Not seen for three months, closing.
Seen only once, in CI_DRM_5619 (5 months, 1 week old).
The CI Bug Log issue associated to this bug has been archived. New failures matching the above filters will not be associated to this bug anymore.