Bug 107304 - [CI][BAT] igt@gem_exec_suspend@basic-s4-devices - fail - fstrim prevents the machines from hibernating
Status: CLOSED WORKSFORME
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/Intel
Version: XOrg git
Hardware: Other All
Importance: medium normal
Assignee: Tomi Sarvela
QA Contact: Intel GFX Bugs mailing list
URL:
Whiteboard: ReadyForDev
Keywords:
Depends on:
Blocks:
 
Reported: 2018-07-20 08:33 UTC by Martin Peres
Modified: 2019-07-31 11:59 UTC
CC: 3 users

See Also:
i915 platform: ALL
i915 features: GPU hang


Attachments

Description Martin Peres 2018-07-20 08:33:01 UTC
On rerun 42/105 of Fast feedback, 20 machines failed the test igt@gem_exec_suspend@basic-s4-devices. Mysteriously, all the other re-runs of the same kernel configuration passed.

After some investigation by Tomi, it turned out to be caused by the weekly fstrim job firing.

Suggested workarounds (W/A) are:
 - mount -o discard
 - run fstrim before each run
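The two suggested workarounds might look roughly like this (a sketch only; the device name and mount point are placeholders, not taken from the CI machines):

```shell
# Workaround 1: continuous discard via the ext4 "discard" mount option,
# so there is never a large batched trim for the weekly job to run.
# Example /etc/fstab entry (device and mount point are hypothetical):
#   /dev/sda2  /  ext4  defaults,discard  0  1

# Workaround 2: trim all mounted filesystems once, up front, before
# starting a test run, so no long-running fstrim can collide with a
# later suspend/hibernate attempt. Requires root.
fstrim --all --verbose
```

Either way, the point is that the trim happens at a controlled time instead of racing against the freezer during igt@gem_exec_suspend.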
Comment 1 Tomi Sarvela 2018-07-20 08:44:01 UTC
The dmesg logs have been copied over for run CI_DRM_4451_42, e.g.:

https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_4451_42/fi-glk-j4005/dmesg0.log

<7>[  255.076407] [IGT] gem_exec_suspend: starting subtest basic-S3
<6>[  257.096983] PM: suspend entry (deep)
<6>[  257.097032] PM: Syncing filesystems ... done.
<6>[  257.519122] Freezing user space processes ... 
<4>[  257.700828] hpet1: lost 7160 rtc interrupts
<4>[  258.610436] hpet1: lost 7161 rtc interrupts
<4>[  259.520732] hpet1: lost 7161 rtc interrupts
<4>[  260.431583] hpet1: lost 7161 rtc interrupts
<4>[  261.358618] hpet1: lost 7161 rtc interrupts
<4>[  262.270512] hpet1: lost 7161 rtc interrupts
<4>[  263.163399] hpet1: lost 7161 rtc interrupts
<4>[  264.078469] hpet1: lost 7161 rtc interrupts
<4>[  264.988525] hpet1: lost 7161 rtc interrupts
<4>[  265.900422] hpet1: lost 7161 rtc interrupts
<4>[  266.844790] hpet1: lost 7161 rtc interrupts
<4>[  267.797851] hpet1: lost 7161 rtc interrupts
<4>[  268.759081] hpet1: lost 7160 rtc interrupts
<4>[  269.697541] hpet1: lost 7161 rtc interrupts
<4>[  270.609418] hpet1: lost 7161 rtc interrupts
<4>[  271.520478] hpet1: lost 7161 rtc interrupts
<3>[  277.523861] Freezing of tasks failed after 20.004 seconds (1 tasks refusing to freeze, wq_busy=0):
<6>[  277.524696] fstrim          D12728  1426      1 0x00000004
<4>[  277.524772] Call Trace:
<4>[  277.524807]  ? __schedule+0x364/0xbf0
<4>[  277.524835]  ? __schedule+0x1df/0xbf0
<4>[  277.524865]  ? wait_for_common_io.constprop.2+0x106/0x1b0
<4>[  277.524884]  ? schedule+0x2d/0x90
<4>[  277.524896]  ? schedule_timeout+0x236/0x540
<4>[  277.524914]  ? _raw_spin_unlock_irq+0x41/0x50
<4>[  277.524930]  ? blk_flush_plug_list+0x213/0x270
<4>[  277.524963]  ? wait_for_common_io.constprop.2+0x106/0x1b0
<4>[  277.524980]  ? io_schedule_timeout+0x14/0x40
<4>[  277.525073]  ? wait_for_common_io.constprop.2+0x12e/0x1b0
<4>[  277.525091]  ? wake_up_q+0x70/0x70
<4>[  277.525120]  ? submit_bio_wait+0x5a/0x80
<4>[  277.525156]  ? blkdev_issue_discard+0x7b/0xd0
<4>[  277.525201]  ? ext4_trim_fs+0x46d/0xbb0
<4>[  277.525216]  ? rcu_read_lock_sched_held+0x6f/0x80
<4>[  277.525229]  ? ext4_trim_fs+0x46d/0xbb0
<4>[  277.525299]  ? ext4_ioctl+0xdc4/0x10d0
<4>[  277.525314]  ? cp_new_stat+0x155/0x190
<4>[  277.525360]  ? do_vfs_ioctl+0xa0/0x6d0
<4>[  277.525378]  ? __se_sys_newfstat+0x3c/0x60
<4>[  277.525409]  ? ksys_ioctl+0x35/0x60
<4>[  277.525432]  ? __x64_sys_ioctl+0x11/0x20
<4>[  277.525445]  ? do_syscall_64+0x55/0x190
<4>[  277.525462]  ? entry_SYSCALL_64_after_hwframe+0x49/0xbe
<6>[  277.525520] OOM killer enabled.
<6>[  277.525529] Restarting tasks ... done.
<6>[  277.542101] video LNXVIDEO:00: Restoring backlight state
<6>[  277.542140] PM: suspend exit
<5>[  277.581046] Setting dangerous option reset - tainting kernel
<7>[  277.581509] [IGT] gem_exec_suspend: exiting, ret=99
Comment 2 Lakshmi 2018-09-27 13:59:31 UTC
We are no longer running cron jobs on the test machines, so this bug is no longer valid. Closing this bug.
Comment 4 Martin Peres 2019-02-18 09:49:31 UTC
And it's back....
Comment 5 Francesco Balestrieri 2019-02-19 08:39:02 UTC
So is it valid or not?
Comment 6 Don Hiatt 2019-03-05 18:43:40 UTC
All the latest reports (Comment #3) are exactly the same: fstrim refuses to freeze because an active fstrim job is in progress.

On recent Linux distributions fstrim is run as a systemd service (see below for Fedora), so it's not an issue of cron jobs (Comment #2).

Tomi: I suggest that all fstrim services be disabled while the IGT tests are running.

[dhiatt ~]$ sudo hdparm -I /dev/sda | grep TRIM
           *    Data Set Management TRIM supported (limit 4 blocks)
           *    Deterministic read ZEROs after TRIM

[dhiatt ~]$ systemctl status fstrim.service
● fstrim.service - Discard unused blocks
   Loaded: loaded (/usr/lib/systemd/system/fstrim.service; static; vendor preset: disabled)
   Active: inactive (dead)
Comment 7 Don Hiatt 2019-03-28 17:52:39 UTC
Tomi,

I have assigned this to you as it is a machine configuration issue, please let me know if you think otherwise.

don
Comment 8 Tomi Sarvela 2019-03-29 07:12:44 UTC
If fstrim is running, I think it's a configuration issue, too.

The systemd fstrim.service and fstrim.timer units are disabled, but the job still runs. Mounting ext4 with -o discard is probably the next best choice. There are some extremely old SSDs in the DUTs (Intel 520, OCZ Vertex, etc.), so this might also affect performance, which is why I'm rolling the change out slowly.
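For reference, fully stopping a scheduled trim on a systemd distribution would look something like the sketch below. Whether fstrim.timer is actually what triggers the job on these DUTs is an assumption; masking covers the case where something other than the timer starts the service.

```shell
# Stop the timer now and prevent it from being re-enabled or pulled in
# as a dependency by anything else. Requires root.
systemctl stop fstrim.timer
systemctl disable fstrim.timer
systemctl mask fstrim.service

# Verify that no fstrim run is still scheduled
systemctl list-timers fstrim.timer
```

If the job still fires after this, the trigger is outside systemd (e.g. a stray cron entry or a distribution-specific hook), which would point back at the -o discard approach.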
Comment 9 Francesco Balestrieri 2019-06-03 06:15:40 UTC
Not seen for three months, closing.
Comment 10 Lakshmi 2019-07-31 11:59:39 UTC
Seen only once, in CI_DRM_5619 (5 months, 1 week old).
Comment 11 CI Bug Log 2019-07-31 11:59:44 UTC
The CI Bug Log issue associated to this bug has been archived.

New failures matching the above filters will not be associated to this bug anymore.

