Bug 107426 - i915 kernel driver unable to initialize device (-28) ( Xeon E3-1275 v6 ) unless Aperture size <= 128MB
Status: CLOSED NOTOURBUG
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/Intel
Version: unspecified
Hardware: x86-64 (AMD64) Linux (All)
Importance: medium normal
Assignee: Intel GFX Bugs mailing list
QA Contact: Intel GFX Bugs mailing list
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2018-07-30 10:24 UTC by Krist
Modified: 2018-08-21 00:50 UTC

See Also:
i915 platform:
i915 features:


Attachments
dmesg with drm.debug=0xff (99.22 KB, text/plain), 2018-07-30 10:24 UTC, Krist
cat /proc/cpuinfo (9.89 KB, text/plain), 2018-07-30 10:24 UTC, Krist
dmidecode (16.50 KB, text/plain), 2018-07-30 10:24 UTC, Krist
zcat /proc/config.gz (208.83 KB, text/plain), 2018-07-30 10:25 UTC, Krist
lsb_release -a (87 bytes, text/plain), 2018-07-30 10:25 UTC, Krist
lshw (26.07 KB, text/plain), 2018-07-30 10:25 UTC, Krist
lspci (2.87 KB, text/plain), 2018-07-30 10:25 UTC, Krist
uname -a (99 bytes, text/plain), 2018-07-30 10:25 UTC, Krist
dmesg with aperture set to 128MB (127.36 KB, text/plain), 2018-07-30 10:35 UTC, Krist
128MB: lspci -vvv -s 0:0:2 (1.64 KB, text/plain), 2018-07-30 11:08 UTC, Krist
256MB: lspci -vvv -s 0:0:2 (1.57 KB, text/plain), 2018-07-30 11:08 UTC, Krist
dmesg when encoder is being initialized (16.53 KB, text/plain), 2018-07-30 11:42 UTC, Krist
dmesg when the encoder eventually successfully initializes (250.72 KB, text/plain), 2018-07-30 11:53 UTC, Krist
Encoder debug logging and backtrace when init fails (5.30 KB, text/plain), 2018-07-30 12:46 UTC, Krist


Description Krist 2018-07-30 10:24:21 UTC
Created attachment 140880
dmesg with drm.debug=0xff

OS: Ubuntu 16.04.4 LTS
Kernel: 4.17.4
CPU: Intel(R) Xeon(R) CPU E3-1275 v6 @ 3.80GHz
MotherBoard: SuperMicro X11SAT-F

The i915 kernel driver is unable to initialize the Intel GPU unless the aperture size in the BIOS is set to 128MB or less.
Limiting the aperture to 128MB causes considerable problems for our use case, though.

Attached are the logs (dmesg with drm.debug=0xff) and other information about the system.
Comment 1 Krist 2018-07-30 10:24:42 UTC
Created attachment 140881
cat /proc/cpuinfo
Comment 2 Krist 2018-07-30 10:24:53 UTC
Created attachment 140882
dmidecode
Comment 3 Krist 2018-07-30 10:25:10 UTC
Created attachment 140883
zcat /proc/config.gz
Comment 4 Krist 2018-07-30 10:25:22 UTC
Created attachment 140884
lsb_release -a
Comment 5 Krist 2018-07-30 10:25:31 UTC
Created attachment 140885
lshw
Comment 6 Krist 2018-07-30 10:25:41 UTC
Created attachment 140886
lspci
Comment 7 Krist 2018-07-30 10:25:51 UTC
Created attachment 140887
uname -a
Comment 8 Chris Wilson 2018-07-30 10:28:40 UTC
[    1.265582] [drm:i915_ggtt_probe_hw [i915]] GMADR size = 0M

and with aperture set to 128MiB?
Comment 9 Krist 2018-07-30 10:35:59 UTC
Created attachment 140888
dmesg with aperture set to 128MB
Comment 10 Chris Wilson 2018-07-30 10:38:38 UTC
As expected, then we get

[    1.444006] [drm:i915_ggtt_probe_hw [i915]] GMADR size = 128M
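
For anyone comparing the two dmesg captures, the check above can be automated. This is only a minimal sketch: it assumes the debug line keeps the exact "GMADR size = <n>M" wording quoted in this thread, and the helper names gmadr_size_mib and aperture_looks_broken are made up for illustration, not part of any tool.

```python
import re

# Matches the i915 debug line quoted in this thread, e.g.
# "[    1.444006] [drm:i915_ggtt_probe_hw [i915]] GMADR size = 128M"
GMADR_RE = re.compile(r"GMADR size = (\d+)M")


def gmadr_size_mib(dmesg_text):
    """Return the first reported GMADR size as an integer, or None if absent."""
    match = GMADR_RE.search(dmesg_text)
    return int(match.group(1)) if match else None


def aperture_looks_broken(dmesg_text):
    """True when the driver reported a 0M aperture (the failing case here)."""
    return gmadr_size_mib(dmesg_text) == 0
```

Fed the line from comment 8, gmadr_size_mib returns 0 and aperture_looks_broken returns True; fed the line from comment 10, gmadr_size_mib returns 128.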
Comment 11 Chris Wilson 2018-07-30 10:43:08 UTC
Aside, how does having mappable limited to 128MiB affect you?
Comment 12 Chris Wilson 2018-07-30 10:48:18 UTC
And can you please attach lspci -vvv -s 0:0:2 for both configs?
Comment 13 Krist 2018-07-30 11:08:26 UTC
Created attachment 140890
128MB: lspci -vvv -s 0:0:2
Comment 14 Krist 2018-07-30 11:08:41 UTC
Created attachment 140891
256MB: lspci -vvv -s 0:0:2
Comment 15 Krist 2018-07-30 11:10:26 UTC
The mappable limit of 128MB is affecting us, because we are trying to run 4 different HD video encoders using the Intel Media SDK.
For running all these video encoders, we are required to have numerous VA Surface buffers and we need more than 128MB of them.
Comment 16 Chris Wilson 2018-07-30 11:14:07 UTC
(In reply to Krist from comment #14)
> Created attachment 140891
> 256MB: lspci -vvv -s 0:0:2

Ok, we are not going completely mad, your BIOS is buggy.

(In reply to Krist from comment #15)
> The mappable limit of 128MB is affecting us, because we are trying to run 4
> different HD video encoders using the Intel Media SDK.
> For running all these video encoders, we are required to have numerous VA
> Surface buffers and we need more than 128MB of them.

Mappable has no impact on the amount of usable memory. It impacts the number of surfaces that can be directly mapped through the very slow indirect GTT. If userspace is using that as its preferred means of access, it needs to be shouted at. Still, that is handled in 2MiB chunks, so the size of the surfaces and the number of them have little impact; the size of the aperture basically determines how many concurrent accesses you can handle before thrashing forces eviction.

The biggest consumer of mappable is the display engine, but even that can handle unmappable memory for the most part (some features like FBC notwithstanding).
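
To put rough numbers on the 2MiB-chunk remark in comment 16, here is a back-of-the-envelope sketch. It assumes the whole aperture is available for such chunks, which overstates things because the display engine and other users also take mappable space, so treat the results as upper bounds. The function name concurrent_chunks is made up for illustration.

```python
MIB = 1024 * 1024
CHUNK_BYTES = 2 * MIB  # chunk size stated in comment 16


def concurrent_chunks(aperture_mib):
    """Upper bound on 2MiB chunks that fit in an aperture of the given size."""
    return (aperture_mib * MIB) // CHUNK_BYTES


for size_mib in (128, 256):
    print(f"{size_mib}MiB aperture: at most {concurrent_chunks(size_mib)} chunks")
```

Under that assumption a 128MiB aperture holds at most 64 chunks and a 256MiB aperture at most 128, so the aperture size bounds concurrent mapped accesses rather than the total memory the encoders can allocate.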
Comment 17 Krist 2018-07-30 11:20:27 UTC
Ok, I see. I thought it was related because of VA API call failures when trying to allocate buffers.
But perhaps those are also caused by the buggy BIOS?

Could you explain why the BIOS is buggy, so I can contact the manufacturer with that information?
Comment 18 Chris Wilson 2018-07-30 11:28:41 UTC
The BIOS is responsible for setting up the IO regions of the gfx PCI device,

	Region 2: Memory at <unassigned> (64-bit, prefetchable)

is the mappable aperture PCI BAR.

If you can get some specifics on how userspace fails with only 128MiB, we should follow that up as well.
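
The same symptom can be checked on other machines by scanning saved lspci -vvv output for regions without an address. A minimal sketch, assuming the "Region N: Memory at ..." line layout shown in comment 18; the helper name unassigned_regions is made up for illustration.

```python
import re

# Matches lines such as
# "Region 2: Memory at <unassigned> (64-bit, prefetchable)"
REGION_RE = re.compile(r"^\s*Region (\d+): (?:Memory|I/O ports) at (\S+)")


def unassigned_regions(lspci_text):
    """Return the region numbers that lspci reports as <unassigned>."""
    regions = []
    for line in lspci_text.splitlines():
        match = REGION_RE.match(line)
        if match and match.group(2) == "<unassigned>":
            regions.append(int(match.group(1)))
    return regions
```

For the line quoted in comment 18 this returns [2], i.e. the mappable aperture BAR that the BIOS left unassigned in the 256MB configuration.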
Comment 19 Krist 2018-07-30 11:41:39 UTC
I'm not sure how to provide more info, but I can say that when I'm trying to initialize the Intel encoder with API call:
MFXVideoEncode->Init(...);

It is returning result:
-28  (MFX_ERR_DEVICE_FAILED)

It does seem to get past the VA Surface allocation though, so perhaps I remembered incorrectly where the userspace issue is.


Sometimes I can initialize one encoder successfully, but then the second one fails, and so on.

I will attach the dmesg log captured while this is happening.
Comment 20 Krist 2018-07-30 11:42:29 UTC
Created attachment 140892
dmesg when encoder is being initialized
Comment 21 Krist 2018-07-30 11:53:06 UTC
Created attachment 140893
dmesg when the encoder eventually successfully initializes
Comment 22 Krist 2018-07-30 11:54:21 UTC
When we try to use only 1 encoder, it eventually initializes successfully (sometimes only after around 500 occurrences of the failure described above).

I have attached the dmesg log in which, at the end, the single encoder has been successfully initialized and is running.
Comment 23 Krist 2018-07-30 12:14:53 UTC
FYI:

SDK functions return MFX_ERR_DEVICE_LOST or MFX_ERR_DEVICE_FAILED to indicate that there is a complete failure in hardware acceleration. The application must close and reinitialize the SDK function class. If the application has provided a hardware acceleration device handle to the SDK, the application must reset the device.
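
The close-and-reinitialize rule quoted above can be sketched as a bounded retry loop. This is not Media SDK code: open_encoder and close_encoder are hypothetical callables supplied by the application, and statuses are passed as the symbolic names used in this thread rather than numeric codes.

```python
# Statuses that the quoted text says require closing and reinitializing.
DEVICE_ERRORS = {"MFX_ERR_DEVICE_LOST", "MFX_ERR_DEVICE_FAILED"}


def init_with_reinit(open_encoder, close_encoder, max_attempts=3):
    """Try to initialize an encoder, closing it after every failed attempt.

    open_encoder() is a hypothetical callable returning (status, encoder);
    close_encoder(encoder) is a hypothetical callable releasing that encoder.
    Returns (encoder, attempts_used) on success.
    """
    for attempt in range(1, max_attempts + 1):
        status, encoder = open_encoder()
        if status == "MFX_ERR_NONE":
            return encoder, attempt
        close_encoder(encoder)  # always release the failed instance
        if status not in DEVICE_ERRORS:
            raise RuntimeError(f"unrecoverable status: {status}")
    raise RuntimeError(f"device still failing after {max_attempts} attempts")
```

Capping the attempts makes a persistent hardware or firmware fault surface as an error instead of looping hundreds of times as described in comment 22.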
Comment 24 Chris Wilson 2018-07-30 12:20:16 UTC
Yet the kernel didn't report any failures. I think it's all in the imagination of the media-driver.
Comment 25 Krist 2018-07-30 12:46:08 UTC
Created attachment 140894
Encoder debug logging and backtrace when init fails
Comment 26 Krist 2018-07-30 12:46:48 UTC
Not sure if this is applicable to the i915 driver anymore.

I will also continue contact with the Media SDK support team.
Comment 27 Simon Lee 2018-08-04 15:28:45 UTC
(In reply to Krist from comment #26)
> Not sure if this is applicable to the i915 driver anymore.
> 
> I will also continue contact with the Media SDK support team.

Hi Krist,

Did you get any response from MSDK?
Comment 28 Krist 2018-08-06 07:05:18 UTC
Hi Simon,

We continued debugging in userspace with additional MSDK debug output. Unfortunately, this produced no further results.

While in contact with the manufacturer, we also found out that this motherboard, the X11SAT-F, has built-in firmware that apparently acts as some kind of PCIe multiplexer, providing more PCIe lanes than the CPU is actually supposed to support.
However, this is now the most likely cause of all kinds of instabilities and issues with the onboard graphics.

We have therefore sent the motherboard back to the factory for them to take another look at it.
I don't think this is an i915 / drm issue, so I will close the bug.

