Bug 110586

Summary: [CI][DRMTIP] igt@runner@aborted - fail - TAINT_BAD_PAGE: Bad page reference or an unexpected page flags.
Product: DRI
Reporter: Martin Peres <martin.peres>
Component: DRM/Intel
Assignee: Intel GFX Bugs mailing list <intel-gfx-bugs>
Status: RESOLVED MOVED
QA Contact: Intel GFX Bugs mailing list <intel-gfx-bugs>
Severity: normal
Priority: medium
CC: intel-gfx-bugs
Version: XOrg git
Hardware: Other
OS: All
Whiteboard: ReadyForDev
i915 platform: ICL
i915 features:

Description Martin Peres 2019-05-02 10:01:00 UTC
https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_272/fi-icl-y/igt@runner@aborted.html

Aborting.
Previous test: kms_plane_scaling (pipe-c-scaler-with-clipping-clamping)
Next test: i915_pm_rps (min-max-config-idle)

Kernel badly tainted (0x60) (check dmesg for details):
	(0x20) TAINT_BAD_PAGE: Bad page reference or an unexpected page flags.
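
The taint value 0x60 reported above can be split into its individual bits. The following is a minimal Python sketch, not part of the CI tooling: the helper name `decode_taint` is made up for illustration, and the table is only a subset of the kernel's taint flags. Bits missing from the table are reported by number.

```python
# Subset of the Linux kernel taint bits (bit number -> flag name).
# Bits not listed here are reported by their number.
TAINT_BITS = {
    0: "TAINT_PROPRIETARY_MODULE",
    4: "TAINT_MACHINE_CHECK",
    5: "TAINT_BAD_PAGE",
    6: "TAINT_USER",
    7: "TAINT_DIE",
    9: "TAINT_WARN",
}


def decode_taint(mask):
    """Return the names of all taint bits set in mask, lowest bit first."""
    names = []
    bit = 0
    while mask >> bit:
        if mask & (1 << bit):
            names.append(TAINT_BITS.get(bit, "bit %d" % bit))
        bit += 1
    return names


print(decode_taint(0x60))
```

Under these assumptions 0x60 decodes to TAINT_BAD_PAGE (0x20) plus TAINT_USER (0x40). The second bit matches the `U` in the `Tainted: G     U` field of the dmesg output, so the bad page flag is the only taint that this failure itself added.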

<1>[  114.390809] BUG: Bad page state in process kms_plane_scali  pfn:25d890
<4>[  114.390961] page:fffff9a989762400 count:0 mapcount:0 mapping:0000000000000000 index:0x0
<4>[  114.390964] flags: 0x8000000000100000(unevictable)
<4>[  114.390967] raw: 8000000000100000 0000000000000000 fffff9a989762408 0000000000000000
<4>[  114.390969] raw: 0000000000000000 0000000000000000 00000000ffffffff 0000000000000000
<4>[  114.390971] page dumped because: PAGE_FLAGS_CHECK_AT_PREP flag set
<1>[  114.390972] bad because of flags: 0x100000(unevictable)
<4>[  114.390987] Modules linked in: snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic mei_hdcp i915 x86_pkg_temp_thermal coretemp crct10dif_pclmul ax88179_178a crc32_pclmul snd_hda_intel usbnet mii snd_hda_codec ghash_clmulni_intel snd_hwdep e1000e snd_hda_core snd_pcm ptp pps_core mei_me mei prime_numbers
<4>[  114.391003] CPU: 6 PID: 1120 Comm: kms_plane_scali Tainted: G     U            5.1.0-rc7-g087f11254b9a-drmtip_272+ #1
<4>[  114.391005] Hardware name: Intel Corporation Ice Lake Client Platform/IceLake Y LPDDR4x T4 RVP TLC, BIOS ICLSFWR1.R00.3071.A00.1902120336 02/12/2019
<4>[  114.391007] Call Trace:
<4>[  114.391012]  dump_stack+0x67/0x9b
<4>[  114.391017]  bad_page+0xbf/0x120
<4>[  114.391021]  get_page_from_freelist+0x623/0x13b0
<4>[  114.391034]  __alloc_pages_nodemask+0x15d/0x1130
<4>[  114.391037]  ? free_unref_page_list+0x1c9/0x250
<4>[  114.391039]  ? free_unref_page_list+0x1c9/0x250
<4>[  114.391046]  ? __lru_cache_add+0x90/0x90
<4>[  114.391050]  ? mark_held_locks+0x49/0x70
<4>[  114.391056]  do_huge_pmd_anonymous_page+0xee/0x630
<4>[  114.391059]  ? __lock_acquire+0x49f/0x1590
<4>[  114.391064]  __handle_mm_fault+0xca2/0xfa0
<4>[  114.391073]  handle_mm_fault+0x196/0x3a0
<4>[  114.391078]  __do_page_fault+0x248/0x4f0
<4>[  114.391084]  ? page_fault+0x8/0x30
<4>[  114.391087]  page_fault+0x1e/0x30
<4>[  114.391089] RIP: 0033:0x7feca25d7b1f
<4>[  114.391091] Code: 17 e0 c5 f8 77 c3 48 3b 15 76 1b 26 00 0f 83 25 01 00 00 48 39 f7 72 0f 74 12 4c 8d 0c 16 4c 39 cf 0f 82 c5 01 00 00 48 89 d1 <f3> a4 c3 80 fa 10 73 17 80 fa 08 73 27 80 fa 04 73 33 80 fa 01 77
<4>[  114.391093] RSP: 002b:00007fff1542c248 EFLAGS: 00010287
<4>[  114.391095] RAX: 00007fec95dff200 RBX: 0000000000000000 RCX: 0000000000001000
<4>[  114.391097] RDX: 0000000000001e00 RSI: 00007fff1542d1d0 RDI: 00007fec95e00000
<4>[  114.391098] RBP: 00007fff1542c3d0 R08: 0000000000000000 R09: 0000000000000780
<4>[  114.391100] R10: 00000000000000bf R11: 00007fff1542c370 R12: 00007fff1542c370
<4>[  114.391101] R13: 00000000000000c0 R14: 00007fff1542c2b0 R15: 000055f2efe87330
<4>[  114.391110] Disabling lock debugging due to kernel taint
Comment 1 CI Bug Log 2019-05-02 10:03:36 UTC
The CI Bug Log issue associated to this bug has been updated.

### New filters associated

* ICL: igt@runner@aborted - fail - TAINT_BAD_PAGE: Bad page reference or an unexpected page flags.
  - https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_272/fi-icl-y/igt@runner@aborted.html
Comment 2 Jani Saarinen 2019-05-07 11:43:58 UTC
We should update the BIOS on this system.
Comment 3 Anshuman Gupta 2019-05-08 11:26:21 UTC
From my initial, limited triage, nothing appears to be broken in the i915 driver with respect to this BUG. It is a bad_page assertion hit while allocating a new page for a user-space process (kms_plane_scali).
[~lvudum] you may file this bug with the Linux MM maintainers.
Comment 4 Lakshmi 2019-05-09 07:48:12 UTC
(In reply to Anshuman Gupta from comment #3)
> From my initial, limited triage, nothing appears to be broken in the i915
> driver with respect to this BUG. It is a bad_page assertion hit while
> allocating a new page for a user-space process (kms_plane_scali).
> [~lvudum] you may file this bug with the Linux MM maintainers.

Thanks for your input. This failure has been reported at https://bugzilla.kernel.org/show_bug.cgi?id=203557
Comment 5 Daniel Vetter 2019-05-10 06:21:12 UTC
Please don't do this. By usual bug-filing standards our reports here are completely unactionable, and this could just as well be a bad machine or a bug in our driver.

If you want to file a bug with a foreign subsystem please make damn sure you have solid proof it's a bug with them, and not just a hunch.

Quick comment from Dave Airlie on irc:

* airlied also guesses it's memory corrupt or some pages ending up in wrong state
<airlied> I doubt mm is at fault
<danvet> yeah
<airlied> we marked something unvevictable and freed it?

See the dmesg backtrace

<4>[  114.390971] page dumped because: PAGE_FLAGS_CHECK_AT_PREP flag set
<1>[  114.390972] bad because of flags: 0x100000(unevictable)

drm/i915 does mark pages as unevictable, so this very much could be our own bug. Or bad hw.
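
To make the two dmesg lines quoted above easier to compare across CI runs, the interesting fields can be pulled out mechanically. This is a rough Python sketch. The regular expressions are an assumption based only on the log format shown in this report, and `parse_bad_page` is a hypothetical helper, not an existing tool.

```python
import re

# Patterns derived from the dmesg lines in this report (assumed format).
BAD_PAGE_RE = re.compile(
    r"BUG: Bad page state in process (?P<comm>\S+)\s+pfn:(?P<pfn>[0-9a-f]+)"
)
BAD_FLAGS_RE = re.compile(
    r"bad because of flags: (?P<mask>0x[0-9a-f]+)\((?P<names>[^)]*)\)"
)

SAMPLE = [
    "<1>[  114.390809] BUG: Bad page state in process kms_plane_scali  pfn:25d890",
    "<4>[  114.390971] page dumped because: PAGE_FLAGS_CHECK_AT_PREP flag set",
    "<1>[  114.390972] bad because of flags: 0x100000(unevictable)",
]


def parse_bad_page(lines):
    """Collect process name, pfn and offending flags from dmesg lines."""
    report = {}
    for line in lines:
        m = BAD_PAGE_RE.search(line)
        if m:
            report["comm"] = m.group("comm")
            report["pfn"] = int(m.group("pfn"), 16)
        m = BAD_FLAGS_RE.search(line)
        if m:
            report["flags"] = int(m.group("mask"), 16)
            report["flag_names"] = m.group("names").split("|")
    return report


print(parse_bad_page(SAMPLE))
```

For the sample lines this yields the process name `kms_plane_scali`, pfn 0x25d890, and a single offending flag, `unevictable` (mask 0x100000).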

Internally it's not a big deal if our triaging doesn't really analyze the bug, but that stops being fun as soon as we involve non-intel people.

Lakshmi, can you pls close the kernel bugzilla entry with a quick "oops sry, need to look at this more first internally".

Thanks, Daniel
Comment 6 Lakshmi 2019-05-13 06:12:33 UTC
(In reply to Daniel Vetter from comment #5)
> Please don't do this. By usual bug-filing standards our reports here are
> completely unactionable, and this could just as well be a bad machine or a
> bug in our driver.
> 
> If you want to file a bug with a foreign subsystem please make damn sure you
> have solid proof it's a bug with them, and not just a hunch.
> 
> Quick comment from Dave Airlie on irc:
> 
> * airlied also guesses it's memory corrupt or some pages ending up in wrong
> state
> <airlied> I doubt mm is at fault
> <danvet> yeah
> <airlied> we marked something unvevictable and freed it?
> 
> See the dmesg backtrace
> 
> <4>[  114.390971] page dumped because: PAGE_FLAGS_CHECK_AT_PREP flag set
> <1>[  114.390972] bad because of flags: 0x100000(unevictable)
> 
> drm/i915 does mark pages as unevictable, so this very much could be our own
> bug. Or bad hw.
> 
> Internally it's not a big deal if our triaging doesn't really analyze the
> bug, but that stops being fun as soon as we involve non-intel people.
> 
> Lakshmi, can you pls close the kernel bugzilla entry with a quick "oops sry,
> need to look at this more first internally".
> 
> Thanks, Daniel

Thanks for pointing this out. I will double-check and be more cautious before filing an external bug. For now I have closed the kernel bug.
Comment 7 Lakshmi 2019-05-13 13:15:51 UTC
This failure has happened twice, and in both cases the previous test mentioned in the output had passed.

https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_276/fi-icl-y/igt@runner@aborted.html

https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_272/fi-icl-y/igt@runner@aborted.html

So it might not be related to bug https://bugs.freedesktop.org/show_bug.cgi?id=110040 or https://bugs.freedesktop.org/show_bug.cgi?id=110041

Needs further investigation.
Comment 8 Jani Saarinen 2019-05-13 14:47:27 UTC
This has been seen only twice, on a system that had an old BIOS. The BIOS has since been updated and the failure has not been seen again. If we see the issue on the updated BIOS (ICLSFWR1.R00.3162.A00), we can investigate further.
Comment 9 Jani Saarinen 2019-05-14 06:06:47 UTC
Not seen for 1 week now. Lowering.
Comment 10 Anshuman Gupta 2019-05-27 08:52:19 UTC
i915_gem_object_get_pages_gtt() -> mapping_set_unevictable() marks the mapping unevictable.

i915_gem_object_put_pages_gtt() -> mapping_clear_unevictable() clears the mapping's unevictable state.
A GEM expert needs to comment on whether we are missing a put sequence here, which would cause this assertion.
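
The suspected failure mode, a page freed while still marked unevictable, can be illustrated with a toy model. This is a Python sketch, not kernel or i915 code: `Page`, `FreeList`, `get_pages` and `put_pages` are hypothetical stand-ins, and the model collapses the mapping-level and per-page state into one boolean, which is a deliberate simplification.

```python
class BadPageState(Exception):
    """Raised when the toy allocator finds a flag that should be clear."""


class Page:
    def __init__(self):
        self.unevictable = False


class FreeList:
    """Toy allocator that checks page state when handing a page out."""

    def __init__(self):
        self._pages = []

    def free(self, page):
        self._pages.append(page)

    def alloc(self):
        page = self._pages.pop()
        if page.unevictable:
            # Stand-in for the "bad because of flags" check at allocation.
            raise BadPageState("unevictable flag set on a free page")
        return page


def get_pages(pages):
    # Stand-in for the "get" side: mark the pages unevictable.
    for page in pages:
        page.unevictable = True


def put_pages(pages):
    # Stand-in for the "put" side: clear the mark before release.
    for page in pages:
        page.unevictable = False


free_list = FreeList()

# Balanced get/put: the page can be reused safely.
good = Page()
get_pages([good])
put_pages([good])
free_list.free(good)
assert free_list.alloc() is good

# Missing put: the stale flag is only caught by the next allocation,
# possibly in an unrelated process.
leaked = Page()
get_pages([leaked])
free_list.free(leaked)
try:
    free_list.alloc()
except BadPageState as exc:
    print("caught:", exc)
```

In this model the error surfaces at the next allocation rather than at the faulty free. That would be consistent with the assertion firing in a page fault of kms_plane_scali instead of in the code that released the page, but it does not prove where the real bug is.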
Comment 12 Lakshmi 2019-06-12 10:25:58 UTC
For now this issue occurs only on icl-y and icl-dsi.
Comment 13 Martin Peres 2019-11-29 19:06:15 UTC
-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/drm/intel/issues/287.
