On rerun 42/105 of Fast feedback, 20 machines failed the test igt@gem_exec_suspend@basic-s4-devices. The mysterious thing was that all the other re-runs of the same kernel configuration passed. After some investigation from Tomi, it turned out to be caused by the weekly fstrim job firing. Suggested workarounds (W/A):
- mount -o discard
- run fstrim before each run
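The two suggested workarounds could look like this in practice (a sketch; the mount point is a placeholder, adjust for the DUT's actual filesystem layout):

```shell
# W/A 1: continuous discard - TRIM happens inline at unlink time,
# so no batched fstrim job is needed at all
mount -o remount,discard /

# W/A 2: trim all mounted filesystems once, synchronously, before the
# test run starts, so the weekly fstrim job has nothing left to do
# while the machine is suspending
fstrim --all --verbose
```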
The dmesg logs have been copied over for run CI_DRM_4451_42, e.g. https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_4451_42/fi-glk-j4005/dmesg0.log

```
<7>[ 255.076407] [IGT] gem_exec_suspend: starting subtest basic-S3
<6>[ 257.096983] PM: suspend entry (deep)
<6>[ 257.097032] PM: Syncing filesystems ... done.
<6>[ 257.519122] Freezing user space processes ...
<4>[ 257.700828] hpet1: lost 7160 rtc interrupts
<4>[ 258.610436] hpet1: lost 7161 rtc interrupts
<4>[ 259.520732] hpet1: lost 7161 rtc interrupts
<4>[ 260.431583] hpet1: lost 7161 rtc interrupts
<4>[ 261.358618] hpet1: lost 7161 rtc interrupts
<4>[ 262.270512] hpet1: lost 7161 rtc interrupts
<4>[ 263.163399] hpet1: lost 7161 rtc interrupts
<4>[ 264.078469] hpet1: lost 7161 rtc interrupts
<4>[ 264.988525] hpet1: lost 7161 rtc interrupts
<4>[ 265.900422] hpet1: lost 7161 rtc interrupts
<4>[ 266.844790] hpet1: lost 7161 rtc interrupts
<4>[ 267.797851] hpet1: lost 7161 rtc interrupts
<4>[ 268.759081] hpet1: lost 7160 rtc interrupts
<4>[ 269.697541] hpet1: lost 7161 rtc interrupts
<4>[ 270.609418] hpet1: lost 7161 rtc interrupts
<4>[ 271.520478] hpet1: lost 7161 rtc interrupts
<3>[ 277.523861] Freezing of tasks failed after 20.004 seconds (1 tasks refusing to freeze, wq_busy=0):
<6>[ 277.524696] fstrim          D12728  1426      1 0x00000004
<4>[ 277.524772] Call Trace:
<4>[ 277.524807]  ? __schedule+0x364/0xbf0
<4>[ 277.524835]  ? __schedule+0x1df/0xbf0
<4>[ 277.524865]  ? wait_for_common_io.constprop.2+0x106/0x1b0
<4>[ 277.524884]  ? schedule+0x2d/0x90
<4>[ 277.524896]  ? schedule_timeout+0x236/0x540
<4>[ 277.524914]  ? _raw_spin_unlock_irq+0x41/0x50
<4>[ 277.524930]  ? blk_flush_plug_list+0x213/0x270
<4>[ 277.524963]  ? wait_for_common_io.constprop.2+0x106/0x1b0
<4>[ 277.524980]  ? io_schedule_timeout+0x14/0x40
<4>[ 277.525073]  ? wait_for_common_io.constprop.2+0x12e/0x1b0
<4>[ 277.525091]  ? wake_up_q+0x70/0x70
<4>[ 277.525120]  ? submit_bio_wait+0x5a/0x80
<4>[ 277.525156]  ? blkdev_issue_discard+0x7b/0xd0
<4>[ 277.525201]  ? ext4_trim_fs+0x46d/0xbb0
<4>[ 277.525216]  ? rcu_read_lock_sched_held+0x6f/0x80
<4>[ 277.525229]  ? ext4_trim_fs+0x46d/0xbb0
<4>[ 277.525299]  ? ext4_ioctl+0xdc4/0x10d0
<4>[ 277.525314]  ? cp_new_stat+0x155/0x190
<4>[ 277.525360]  ? do_vfs_ioctl+0xa0/0x6d0
<4>[ 277.525378]  ? __se_sys_newfstat+0x3c/0x60
<4>[ 277.525409]  ? ksys_ioctl+0x35/0x60
<4>[ 277.525432]  ? __x64_sys_ioctl+0x11/0x20
<4>[ 277.525445]  ? do_syscall_64+0x55/0x190
<4>[ 277.525462]  ? entry_SYSCALL_64_after_hwframe+0x49/0xbe
<6>[ 277.525520] OOM killer enabled.
<6>[ 277.525529] Restarting tasks ... done.
<6>[ 277.542101] video LNXVIDEO:00: Restoring backlight state
<6>[ 277.542140] PM: suspend exit
<5>[ 277.581046] Setting dangerous option reset - tainting kernel
<7>[ 277.581509] [IGT] gem_exec_suspend: exiting, ret=99
```
We are not running crons on the test machines anymore, so this bug is no longer valid. Closing.
The CI Bug Log issue associated to this bug has been updated.

### New filters associated

* ELK ILK IVB BXT: igt@gem_exec_suspend@basic-s4-devices - fail - Freezing of tasks failed after 20.\d+ seconds (\d+ tasks refusing to freeze, wq_busy=0), fstrim
  - https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_5619/fi-bxt-j4205/igt@gem_exec_suspend@basic-s4-devices.html
  - https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_5619/fi-elk-e7500/igt@gem_exec_suspend@basic-s4-devices.html
  - https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_5619/fi-ilk-650/igt@gem_exec_suspend@basic-s4-devices.html
  - https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_5619/fi-ivb-3520m/igt@gem_exec_suspend@basic-s4-devices.html
  - https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_5619/fi-ivb-3770/igt@gem_exec_suspend@basic-s4-devices.html
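As a sanity check, the filter's regex can be matched against the failing dmesg line from comment 1 (a sketch using GNU grep -E, with the PCRE `\d` classes written out as `[0-9]`):

```shell
# The dmesg line the filter is meant to catch (reproduced from the log)
line='Freezing of tasks failed after 20.004 seconds (1 tasks refusing to freeze, wq_busy=0):'

# ERE equivalent of the CI Bug Log filter pattern; prints the line if it matches
echo "$line" | grep -E 'Freezing of tasks failed after 20\.[0-9]+ seconds \([0-9]+ tasks refusing to freeze, wq_busy=0\)'
```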
And it's back....
So is it valid or not?
All the latest reports (Comment #3) are exactly the same: fstrim refuses to freeze because an fstrim job is still active. On recent Linux distributions fstrim is run as a systemd service (see below for Fedora), so this is not a cron-job issue (Comment #2). Tomi: I suggest that all fstrim services be disabled while the IGT tests are running.

```
[dhiatt ~]$ sudo hdparm -I /dev/sda | grep TRIM
	   *	Data Set Management TRIM supported (limit 4 blocks)
	   *	Deterministic read ZEROs after TRIM
[dhiatt ~]$ systemctl status fstrim.service
● fstrim.service - Discard unused blocks
   Loaded: loaded (/usr/lib/systemd/system/fstrim.service; static; vendor preset: disabled)
   Active: inactive (dead)
```
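Disabling the fstrim services could be done along these lines on a systemd distribution (a sketch; assumes the units are named fstrim.service/fstrim.timer as on Fedora):

```shell
# Stop anything in flight, then mask both units so neither the timer
# nor a manual "systemctl start" can trigger a trim during an IGT run
systemctl stop fstrim.timer fstrim.service
systemctl mask fstrim.timer fstrim.service

# Verify: both units should now report "masked"
systemctl is-enabled fstrim.timer fstrim.service
```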
Tomi, I have assigned this to you as it is a machine configuration issue, please let me know if you think otherwise. don
If fstrim is running, I think it's a configuration issue, too. The systemd fstrim.service and fstrim.timer units are disabled, but the job still runs somehow. Mounting ext4 with -o discard is probably the next best choice. Some DUTs have extremely old SSDs (Intel 520, OCZ Vertex, etc.), so this might also affect performance; I'm rolling the change out slowly.
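For reference, the -o discard variant is a one-line fstab change per filesystem (a sketch; the UUID is a placeholder and the mount point assumes the root filesystem):

```shell
# /etc/fstab: enable inline discard on the root ext4 filesystem, so TRIM
# is issued at unlink time instead of by a batched fstrim job
# (UUID below is a placeholder)
UUID=0000-placeholder  /  ext4  defaults,discard  0  1

# Apply without a reboot; the root fs can usually be remounted in place
mount -o remount,discard /
```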
Not seen for three months, closing.
Seen only once, in CI_DRM_5619 (5 months, 1 week old).
The CI Bug Log issue associated to this bug has been archived. New failures matching the above filters will not be associated to this bug anymore.