Bug 97088 - [APL] [BAT] GuC firmware goes missing
Summary: [APL] [BAT] GuC firmware goes missing
Status: CLOSED WORKSFORME
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/Intel (show other bugs)
Version: unspecified
Hardware: x86-64 (AMD64) Linux (All)
: high critical
Assignee: Elio
QA Contact: Intel GFX Bugs mailing list
URL:
Whiteboard:
Keywords:
: 97275 (view as bug list)
Depends on:
Blocks:
 
Reported: 2016-07-26 17:46 UTC by Dave Gordon
Modified: 2016-12-16 07:42 UTC (History)
2 users (show)

See Also:
i915 platform: BXT
i915 features: power/Other


Attachments
Dmesg log (206.38 KB, text/plain)
2016-10-20 20:09 UTC, Elio
no flags Details
Success log (222 bytes, text/plain)
2016-10-20 20:09 UTC, Elio
no flags Details
Failure output (444 bytes, text/plain)
2016-10-20 20:10 UTC, Elio
no flags Details

Description Dave Gordon 2016-07-26 17:46:46 UTC
Split off from

Bug 96526 - [APL] [BAT execlists] Sporadic - gem_exec_suspend basic-s4 GPU hang after resume

One of the logs attached to that bug (https://bugs.freedesktop.org/attachment.cgi?id=124911) shows the GuC firmware disappearing between reboots.

Early in the log we have

Jul  5 16:56:27 BXTP5 kernel: [    1.689144] [drm:intel_guc_setup] GuC fw status: path i915/bxt_guc_ver8_7.bin, fetch SUCCESS, load NONE

but 5 minutes later, on the next reboot cycle:

Jul  5 17:01:44 BXTP5 kernel: [    1.711802] i915 0000:00:02.0: Direct firmware load for i915/bxt_guc_ver8_7.bin failed with error -2

The kernel logs for each cycle look generally similar, but the order of some operations is not identical. In particular, the appearance of the MMC devices can come before OR after the attempt to load the GuC firmware.

So this is really a completely different issue, related to the way that devices are initialised asynchronously w.r.t one another. Hence, this bug report has been opened to track it separately.
Comment 1 Dave Gordon 2016-07-26 18:25:57 UTC
So to work out just what's happening here we need to know:
Is the correct version of the firmware present in the root filesystem, an dif so where. We expect it to be /lib/firmware/i915/bxt_guc_ver8_7.bin

And incidentally, is it a binary file or a symlink to the binary? Although they're generally equivalent, there are cases where symlinks will not be followed (although the error would probably be -EACCESS rather than -ENOENT).

Secondly: is the binary also present in the ramdisk? lsinitramfs(8) can be used to examine the contents of the initrd file, e.g.

$ lsinitramfs /boot/initrd.img-4.7.0-rc7-dsg-02211-g3b9ac6a-dsg-work-111 | grep -i guc
lib/firmware/i915/skl_guc_ver6_1.bin

shows that on this SKL machine, we have only version 6.1 of the GuC firmware.

Thirdly, we need the mount table (/etc/fstab) so we know which device contains the root filesystem, and therefore when it gets mounted.

And finally (for now) it would be useful to know what Kconfig options have been selected, as these can affect ramfs, i915, and device probing in general.
Comment 2 yann 2016-08-01 12:51:47 UTC
Updating priority/severity based on Feature failure w/o workaround
Comment 3 Dave Gordon 2016-08-09 13:05:49 UTC
A similar but different occurence

Command 	/opt/igt/tests/drv_module_reload_basic

Stdout 	

Reloading i915.ko with
unbinding /sys/class/vtconsole/vtcon1/: (M) frame buffer device
module successfully unloaded
module successfully loaded again
Reloading i915.ko with inject_load_failure=1
unbinding /sys/class/vtconsole/vtcon1/: (M) frame buffer device
module successfully unloaded
Reloading i915.ko with inject_load_failure=2
module successfully unloaded
Reloading i915.ko with inject_load_failure=3
module successfully unloaded
Reloading i915.ko with inject_load_failure=4
module successfully unloaded
Reloading i915.ko with
module successfully unloaded
module successfully loaded again

dmesg 	

[  164.447845] Setting dangerous option enable_psr - tainting kernel
[  164.482186] i915 0000:00:02.0: Failed to load DMC firmware [https://01.org/linuxgraphics/intel-linux-graphics-firmwares], disabling runtime power management.
[  165.476834] i915 0000:00:02.0: Direct firmware load for i915/skl_guc_ver6_1.bin failed with error -2
[  165.476911] [drm:intel_guc_init [i915]] *ERROR* Failed to fetch GuC firmware from i915/skl_guc_ver6_1.bin (error -2)
[  174.256375] Setting dangerous option enable_psr - tainting kernel
[  174.256378] Setting dangerous option inject_load_failure - tainting kernel
[  174.584082] Setting dangerous option enable_psr - tainting kernel
[  174.584085] Setting dangerous option inject_load_failure - tainting kernel
[  174.948056] Setting dangerous option enable_psr - tainting kernel
[  174.948058] Setting dangerous option inject_load_failure - tainting kernel
[  175.286457] Setting dangerous option enable_psr - tainting kernel
[  175.286460] Setting dangerous option inject_load_failure - tainting kernel
[  175.638344] Setting dangerous option enable_psr - tainting kernel
[  175.667779] i915 0000:00:02.0: Failed to load DMC firmware [https://01.org/linuxgraphics/intel-linux-graphics-firmwares], disabling runtime power management.
[  176.652812] i915 0000:00:02.0: Direct firmware load for i915/skl_guc_ver6_1.bin failed with error -2
[  176.652889] [drm:intel_guc_init [i915]] *ERROR* Failed to fetch GuC firmware from i915/skl_guc_ver6_1.bin (error -2)


Here both the GuC firmware AND the DMC firmware seem to have gone missing.
Unless machine (ro-skl3-i5-6260u) just doesn't have them installed?
Comment 4 Dave Gordon 2016-08-10 11:38:26 UTC
(In reply to david.s.gordon from comment #3)
> A similar but different occurence
> 
> Command 	/opt/igt/tests/drv_module_reload_basic
> 
> Stdout 	
> 
> Reloading i915.ko with
> unbinding /sys/class/vtconsole/vtcon1/: (M) frame buffer device
> module successfully unloaded
> module successfully loaded again
> Reloading i915.ko with inject_load_failure=1
> unbinding /sys/class/vtconsole/vtcon1/: (M) frame buffer device
> module successfully unloaded
> Reloading i915.ko with inject_load_failure=2
> module successfully unloaded
> Reloading i915.ko with inject_load_failure=3
> module successfully unloaded
> Reloading i915.ko with inject_load_failure=4
> module successfully unloaded
> Reloading i915.ko with
> module successfully unloaded
> module successfully loaded again
> 
> dmesg 	
> 
> [  164.447845] Setting dangerous option enable_psr - tainting kernel
> [  164.482186] i915 0000:00:02.0: Failed to load DMC firmware
> [https://01.org/linuxgraphics/intel-linux-graphics-firmwares], disabling
> runtime power management.
> [  165.476834] i915 0000:00:02.0: Direct firmware load for
> i915/skl_guc_ver6_1.bin failed with error -2
> [  165.476911] [drm:intel_guc_init [i915]] *ERROR* Failed to fetch GuC
> firmware from i915/skl_guc_ver6_1.bin (error -2)
> [  174.256375] Setting dangerous option enable_psr - tainting kernel
> [  174.256378] Setting dangerous option inject_load_failure - tainting kernel
> [  174.584082] Setting dangerous option enable_psr - tainting kernel
> [  174.584085] Setting dangerous option inject_load_failure - tainting kernel
> [  174.948056] Setting dangerous option enable_psr - tainting kernel
> [  174.948058] Setting dangerous option inject_load_failure - tainting kernel
> [  175.286457] Setting dangerous option enable_psr - tainting kernel
> [  175.286460] Setting dangerous option inject_load_failure - tainting kernel
> [  175.638344] Setting dangerous option enable_psr - tainting kernel
> [  175.667779] i915 0000:00:02.0: Failed to load DMC firmware
> [https://01.org/linuxgraphics/intel-linux-graphics-firmwares], disabling
> runtime power management.
> [  176.652812] i915 0000:00:02.0: Direct firmware load for
> i915/skl_guc_ver6_1.bin failed with error -2
> [  176.652889] [drm:intel_guc_init [i915]] *ERROR* Failed to fetch GuC
> firmware from i915/skl_guc_ver6_1.bin (error -2)
> 
> 
> Here both the GuC firmware AND the DMC firmware seem to have gone missing.
> Unless machine (ro-skl3-i5-6260u) just doesn't have them installed?

Actually, that is the problem for this specific case; the firmware isn't installed on that machine. That hasn't been causing failures because GuC submission has been disabled :( It also means DMC is not getting tested, which also goes unnoticed because DMC firmware load failure isn't flagged as an error :(
Comment 5 cprigent 2016-09-19 12:07:37 UTC
*** Bug 97275 has been marked as a duplicate of this bug. ***
Comment 6 anusha 2016-10-12 17:50:30 UTC
@cprigent @elio
Tried drv_module_reload_basic and gem_exec_suspend basic -S4. GuC get loaded in each case. Can you confirm this?
Comment 7 Elio 2016-10-20 20:08:47 UTC
gem_exec_suspend basic -S4 is failing 50/100, some times sends errors regarding rtcwake

but all the times the GUC is there

[  220.568647] [drm:intel_guc_setup [i915]] GuC fw status: path i915/bxt_guc_ver8_7.bin, fetch SUCCESS, load SUCCESS
[  220.568681] [drm:intel_guc_setup [i915]] GuC fw status: fetch SUCCESS, load PENDING
[  220.568779] [drm:intel_huc_load [i915]] i915/bxt_huc_ver01_07_1398.bin fw status: fetch SUCCESS, load SUCCESS
[  220.570899] [drm:intel_huc_load [i915]] HuC DMA transfer wait over with ret 0
[  220.570932] [drm:intel_huc_load [i915]] i915/bxt_huc_ver01_07_1398.bin fw status: fetch SUCCESS, load SUCCESS

Output.

rtcwake: assuming RTC uses UTC ...
rtcwake: wakeup from "disk" using /dev/rtc0 at Thu Oct 20 15:51:20 2016
rtcwake: write error
(gem_exec_suspend:1091) igt-aux-CRITICAL: Test assertion failure function suspend_via_rtcwake, file igt_aux.c:710:
(gem_exec_suspend:1091) igt-aux-CRITICAL: Failed assertion: system(cmd) == 0
(gem_exec_suspend:1091) igt-aux-CRITICAL: This failure means that something is wrong with the rtcwake tool or how your distro is set up. This is not a i915.ko or i-g-t bug.
IGT-Version: 1.16-gf7408bf (x86_64) (Linux: 4.8.0-nightly-patch-anusha+ x86_64)
Subtest basic-S4 failed.
**** DEBUG ****
(gem_exec_suspend:1091) DEBUG: Test requirement passed: gem_has_ring(fd, 0)
(gem_exec_suspend:1091) DEBUG: Test requirement passed: can_mi_store_dword(gen, 0)
(gem_exec_suspend:1091) DEBUG: Test requirement passed: nengine
(gem_exec_suspend:1091) DEBUG: Test requirement passed: gem_has_ring(fd, engine)
(gem_exec_suspend:1091) DEBUG: Test requirement passed: can_mi_store_dword(gen, engine)
(gem_exec_suspend:1091) DEBUG: Test requirement passed: nengine
(gem_exec_suspend:1091) ioctl-wrappers-DEBUG: Test requirement passed: __gem_set_caching(fd, handle, caching) == 0
(gem_exec_suspend:1091) DEBUG: Test requirement passed: __gem_execbuf(fd, &execbuf) == 0
(gem_exec_suspend:1091) DEBUG: Verifying result
(gem_exec_suspend:1091) DEBUG: Test requirement passed: gem_has_ring(fd, engine)
(gem_exec_suspend:1091) DEBUG: Test requirement passed: can_mi_store_dword(gen, engine)
(gem_exec_suspend:1091) DEBUG: Test requirement passed: nengine
(gem_exec_suspend:1091) ioctl-wrappers-DEBUG: Test requirement passed: __gem_set_caching(fd, handle, caching) == 0
(gem_exec_suspend:1091) DEBUG: Test requirement passed: __gem_execbuf(fd, &execbuf) == 0
(gem_exec_suspend:1091) DEBUG: Verifying result
(gem_exec_suspend:1091) DEBUG: Test requirement passed: gem_has_ring(fd, engine)
(gem_exec_suspend:1091) DEBUG: Test requirement passed: can_mi_store_dword(gen, engine)
(gem_exec_suspend:1091) DEBUG: Test requirement passed: nengine
(gem_exec_suspend:1091) ioctl-wrappers-DEBUG: Test requirement passed: __gem_set_caching(fd, handle, caching) == 0
(gem_exec_suspend:1091) DEBUG: Test requirement passed: __gem_execbuf(fd, &execbuf) == 0
(gem_exec_suspend:1091) DEBUG: Verifying result
(gem_exec_suspend:1091) DEBUG: Test requirement passed: gem_has_ring(fd, engine)
(gem_exec_suspend:1091) DEBUG: Test requirement passed: can_mi_store_dword(gen, engine)
(gem_exec_suspend:1091) DEBUG: Test requirement passed: nengine
(gem_exec_suspend:1091) ioctl-wrappers-DEBUG: Test requirement passed: __gem_set_caching(fd, handle, caching) == 0
(gem_exec_suspend:1091) DEBUG: Test requirement passed: __gem_execbuf(fd, &execbuf) == 0
(gem_exec_suspend:1091) DEBUG: Verifying result
(gem_exec_suspend:1091) ioctl-wrappers-DEBUG: Test requirement passed: __gem_set_caching(fd, handle, caching) == 0
(gem_exec_suspend:1091) DEBUG: Test requirement passed: __gem_execbuf(fd, &execbuf) == 0
(gem_exec_suspend:1091) igt-core-DEBUG: Test requirement passed: !igt_run_in_simulation()
(gem_exec_suspend:1091) igt-aux-DEBUG: Test requirement passed: (power_dir = open("/sys/power", O_RDONLY)) >= 0
(gem_exec_suspend:1091) igt-aux-DEBUG: Test requirement passed: get_supported_suspend_states(power_dir) & (1 << state)
(gem_exec_suspend:1091) igt-aux-DEBUG: Test requirement passed: test == SUSPEND_TEST_NONE || faccessat(power_dir, "pm_test", R_OK | W_OK, 0) == 0
(gem_exec_suspend:1091) igt-aux-DEBUG: Test requirement passed: system(cmd) == 0
(gem_exec_suspend:1091) igt-aux-CRITICAL: Test assertion failure function suspend_via_rtcwake, file igt_aux.c:710:
(gem_exec_suspend:1091) igt-aux-CRITICAL: Failed assertion: system(cmd) == 0
(gem_exec_suspend:1091) igt-aux-CRITICAL: This failure means that something is wrong with the rtcwake tool or how your distro is set up. This is not a i915.ko or i-g-t bug.
****  END  ****
Stack trace:
  #0 [__igt_fail_assert+0x101]
  #1 [igt_system_suspend_autoresume+0x5a5]
  #2 [run_test+0x387]
  #3 [__real_main241+0x13e]
  #4 [main+0x23]
  #5 [__libc_start_main+0xf0]
  #6 [_start+0x29]
  #7 [<unknown>+0x29]
Subtest basic-S4: FAIL (9.654s)
(11:50 AM) [gfx@gfx-bxt] [tests]$ : dmesg | grep intel_guc_setup                [    3.150691] [drm:intel_guc_setup [i915]] GuC fw status: path i915/bxt_guc_ver8_7.bin, fetch SUCCESS, load NONE
[    3.150726] [drm:intel_guc_setup [i915]] GuC fw status: fetch SUCCESS, load PENDING
[    3.159459] [drm:intel_guc_setup [i915]] GuC fw status: fetch SUCCESS, load SUCCESS
[   64.319473] [drm:intel_guc_setup [i915]] GuC fw status: path i915/bxt_guc_ver8_7.bin, fetch SUCCESS, load SUCCESS
[   64.319499] [drm:intel_guc_setup [i915]] GuC fw status: fetch SUCCESS, load PENDING
[   64.332832] [drm:intel_guc_setup [i915]] GuC fw status: fetch SUCCESS, load SUCCESS
[  155.722796] [drm:intel_guc_setup [i915]] GuC fw status: path i915/bxt_guc_ver8_7.bin, fetch SUCCESS, load SUCCESS
[  155.722830] [drm:intel_guc_setup [i915]] GuC fw status: fetch SUCCESS, load PENDING
[  155.735341] [drm:intel_guc_setup [i915]] GuC fw status: fetch SUCCESS, load SUCCESS


I just put drv_module_reload_basic in a short cycle (100 iterations, not able to reproduce the issue)

Attaching logs
Comment 8 Elio 2016-10-20 20:09:12 UTC
Created attachment 127436 [details]
Dmesg log
Comment 9 Elio 2016-10-20 20:09:36 UTC
Created attachment 127437 [details]
Success log
Comment 10 Elio 2016-10-20 20:10:01 UTC
Created attachment 127438 [details]
Failure output
Comment 11 Elio 2016-10-20 20:11:05 UTC
Configuration:

Kernel:  drm-intel-next-2016-09-19-15336-gcb978ac

IGT: intel-gpu-tools-1.16-90-gf7408bf
Component         : drm
	url       : http://cgit.freedesktop.org/mesa/drm
	tag       : libdrm-2.4.68
	commit    : fc09c5ab84240e9b6bd0bed01685ef004f56c4fa


Component         : mesa
	url       : http://cgit.freedesktop.org/mesa/mesa
	tag       : mesa-12.0.1
	commit    : 04277f058d00238937e664cf546c43b16cea7b2b

Component         : xf86-video-intel
	url       : http://cgit.freedesktop.org/xorg/driver/xf86-video-intel
	tag       : 2.99.917-701-g205146b
	commit    : 205146b0fdc8db016e5cfeeae5a6b25df3470ebc

Component         : libva
	url       : http://cgit.freedesktop.org/libva
	tag       : libva-1.7.2.pre1
	commit    : 9927bd2fbbb1d923fd6a8932a3cdbb5c9185ee22

Component         : intel-driver
	url       : http://cgit.freedesktop.org/vaapi/intel-driver
	tag       : 1.7.2.pre1
	commit    : ce444fb412966ca6afbb1331b7cae8ab621c1108

Component         : cairo
	url       : http://cgit.freedesktop.org/cairo
	tag       : 1.15.2
	commit    : db8a7f1697c49ae4942d2aa49eed52dd73dd9c7a

Component         : xserver
	url       : http://cgit.freedesktop.org/xorg/xserver
	tag       : xorg-server-1.18.3
	commit    : 9454cd51da9b38b974cff7c8b7125901f6403848

Component         : macros
	url       : https://cgit.freedesktop.org/xorg/util/macros
	tag       : util-macros-1.19.0-2-gd7acec2
Comment 12 Elio 2016-12-15 19:39:22 UTC
With latest configuration we are having the following, needs more investigation.

With Latest configuration + kernel patches we are having a failed action just after returning from S4 using "rtcwake"

[91291.107493] [drm:guc_ucode_xfer_dma [i915]] returning 0
[91291.107539] [drm:intel_guc_setup [i915]] GuC fw status: fetch SUCCESS, load SUCCESS
[91291.127091] [drm] INTEL_GUC_SEND: Action 0x502 failed; ret=-110 status=0x00000502 response=0x40000000

Configuration:

(Graphic Stack) Intel® Graphics for Linux* | 01.org

============================================
 Software information
============================================
Kernel version                  : 4.9.0-rc8latestfirmware+
Linux distribution              : Ubuntu 16.04.1 LTS
Architecture                    : 64-bit
Gfx stack code                  : 1771486656
Mesa version                    : Not found << Please see the message at the bottom >>
xf86-video-intel version        :
Xorg-Xserver version            : 1.18.3
DRM version                     : 2.4.67
VAAPI version                   : Not found
Cairo version                   : 1.14.6
Intel GPU Tools version         : Tag [intel-gpu-tools-1.17-41-ga56a1f6] / Commit [a56a1f6]
Kernel driver in use            : i915
Hardware acceleration           :
Bios revision                   : 144.10
Bios release date               : 06/27/2016
KSC revision                    : 1.15

============================================
 Firmwares information
============================================
DMC fw loaded                   : yes
DMC version                     : 1.7
GUC fw loaded                   : SUCCESS
GUC version wanted              : 8.7
GUC version found               : 8.7

 Kernel parameters
============================================
 quiet drm.debug=0xe i915.enable_guc_loading=2 i915.enable_guc_submission=0 resume=/dev/sda3 fastboot
Comment 13 Elio 2016-12-15 20:12:23 UTC
Same configuration, i just put reboot in a 100 iteration cycle, all the times i got something similar:

8_7.bin, fetch SUCCESS, load NONE
[    3.598334] [drm:intel_guc_setup [i915]] GuC fw status: fetch SUCCESS, load PENDING
[    3.606409] [drm:guc_ucode_xfer_dma [i915]] DMA status 0x10, GuC status 0x8002f0ec
[    3.606447] [drm:guc_ucode_xfer_dma [i915]] returning 0
[    3.606484] [drm:intel_guc_setup [i915]] GuC fw status: fetch SUCCESS, load SUCCESS


Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.