Bug 110334 - System freeze while creating a lot of network traffic when using Intel Inference Engine with OpenCL
Status: RESOLVED WORKSFORME
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/Intel
Version: unspecified
Hardware: x86-64 (AMD64) Linux (All)
Importance: medium critical
Assignee: Intel GFX Bugs mailing list
QA Contact: Intel GFX Bugs mailing list
URL:
Whiteboard: Triaged
Keywords:
Depends on:
Blocks:
 
Reported: 2019-04-05 09:24 UTC by Thomas Senfter
Modified: 2019-06-11 08:27 UTC

See Also:
i915 platform: KBL
i915 features:


Attachments
additional information and alternative scripts for the zombie Docker image (15.30 KB, application/zip)
2019-04-05 09:24 UTC, Thomas Senfter
log file from boot to freeze (1.13 MB, text/plain)
2019-04-05 09:25 UTC, Thomas Senfter
kern.log from boot to freeze (210.92 KB, text/plain)
2019-04-05 09:25 UTC, Thomas Senfter
Wakeref_stuff.txt (5.06 KB, text/plain)
2019-05-03 08:33 UTC, Thomas Senfter
syslog.6.gz (206.17 KB, application/gzip)
2019-05-03 08:35 UTC, Thomas Senfter

Description Thomas Senfter 2019-04-05 09:24:00 UTC
Created attachment 143875 [details]
additional information and alternative scripts for the zombie Docker image

We observe a "zombie" mode when running CNNs using the Intel Inference Engine with OpenCL. In this "zombie" mode the system does not react to anything: pressing the power button has no effect (apart from the 5-second forced power-off) and there is no reaction to the Magic SysRq key. The system does, however, generate a lot of network traffic (enough to take down a 100 Mbit switch). More details can be found here: https://software.intel.com/en-us/comment/1933685#comment-1933685

Affected hardware:
 - NUC7BNH with Intel i3-7100U
 - NUC7DNH with Intel i7-7567U
 - NUC8BEH with either Intel i3-8109U or Intel i5-8259U

Affected kernels:
 - 4.7.0.intel.r5.0-1
 - 4.15.18-041518-generic
 - 4.19.31-041931-generic

There is not much to be found in the logs (syslog and kern.log from boot to freeze are attached).

The problem occurs randomly and not very frequently with our software (with about 10 systems running we get the freeze maybe once a week). However, we were able to create a Docker image which usually triggers the problem within a few minutes on kernel 4.7.0.intel.r5.0-1.
The Docker image can be found here: https://hub.docker.com/r/accessio/zombie
An archive with additional information and alternative scripts is attached.
Comment 1 Thomas Senfter 2019-04-05 09:25:03 UTC
Created attachment 143876 [details]
log file from boot to freeze
Comment 2 Thomas Senfter 2019-04-05 09:25:56 UTC
Created attachment 143877 [details]
kern.log from boot to freeze
Comment 3 Lakshmi 2019-04-08 09:46:02 UTC
Have you tried to reproduce with latest drmtip?
(https://cgit.freedesktop.org/drm-tip)

Dmesg from drmtip will be helpful during investigation, can you please verify with drmtip?
Comment 4 Thomas Senfter 2019-04-15 07:27:13 UTC
We are now testing with drmtip. We are also running tests with the Ubuntu build of 5.1rc4 (https://kernel.ubuntu.com/~kernel-ppa/mainline/v5.1-rc4/) because we have problems with Docker and iptables on drmtip.

It might take a while until a freeze happens (or even longer until we can say that the problem was fixed after 4.19).
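
A rough way to estimate how long "a while" is: the sketch below assumes freezes arrive as a Poisson process at the rate given in the description (about one freeze per week across about 10 systems) and computes how many freeze-free weeks are needed before the old rate becomes an unlikely explanation. The Poisson model, the 95% threshold, and the function name are assumptions made for illustration, not something measured in this report.

```python
import math


def weeks_without_freeze(baseline_rate_per_week, fleet_scale=1.0, confidence=0.95):
    """Return the number of freeze-free weeks needed to reject the old rate.

    baseline_rate_per_week: freezes per week observed on the reference fleet.
    fleet_scale: size of the test fleet relative to the reference fleet
                 (hypothetical parameter; 0.5 means half as many systems).
    confidence: 1 minus the accepted probability of seeing no freeze by chance.
    """
    rate = baseline_rate_per_week * fleet_scale
    # Under a Poisson model, P(no freeze in t weeks) = exp(-rate * t).
    # Solve exp(-rate * t) = 1 - confidence for t.
    return -math.log(1.0 - confidence) / rate


# Reference fleet from the description: roughly 1 freeze per week on 10 systems.
ten_systems = weeks_without_freeze(1.0)
# Same per-system rate, but only half the fleet running the new kernel.
five_systems = weeks_without_freeze(1.0, fleet_scale=0.5)

print(f"{ten_systems:.1f} freeze-free weeks needed with 10 systems")
print(f"{five_systems:.1f} freeze-free weeks needed with 5 systems")
```

Under these assumptions, roughly three freeze-free weeks on all 10 systems would be needed, and twice that if only half of them run the new kernel.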
Comment 5 Lakshmi 2019-05-02 13:03:46 UTC
(In reply to Thomas Senfter from comment #4)
> We are now testing with drmtip. We also run tests with the ubuntu build of
> 5.1rc4 (https://kernel.ubuntu.com/~kernel-ppa/mainline/v5.1-rc4/) because we
> have problems with docker and iptables with drmtip.
> 
> This might take a while until a freeze happens (or even longer until we can
> say that the problem was fixed after 4.19)

Any updates from drmtip testing?
Comment 6 Thomas Senfter 2019-05-03 08:33:10 UTC
Created attachment 144139 [details]
Wakeref_stuff.txt
Comment 7 Thomas Senfter 2019-05-03 08:35:32 UTC
Created attachment 144140 [details]
syslog.6.gz
Comment 8 Thomas Senfter 2019-05-03 08:35:50 UTC
No "zombie" yet.

On one NUC we observed another issue. The SSD was working at full load (according to the LED), no command could be executed, and connecting with ssh also failed. However, ping did work and the command line was responsive (we could enter a command and cancel it with CTRL-C, but the command was not executed). We first thought this was caused by a full disk, but that does not seem to be the case: the disk was nearly empty after reboot, and we have not observed increasing disk usage since yesterday.

We did a bit of investigation of the logs (which is difficult as we have no clue when the described behaviour started) and the only "unusual" thing we found in the logs was some Wakeref stuff (can be found in Wakeref_stuff.txt and in "syslog.6.gz", the whole syslog around this time). However we don't know if this is something interesting.
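
For anyone repeating this kind of log search, a minimal sketch of scanning a plain or gzip-compressed syslog for wakeref-related lines is shown below. The search term `wakeref`, the helper names, and the sample log lines are assumptions for illustration; the sample lines are invented and are not taken from the attached files.

```python
import gzip
import io
import re

# Assumed search term; adjust to whatever message is of interest.
WAKEREF_PATTERN = re.compile(r"wakeref", re.IGNORECASE)


def find_matching_lines(stream, pattern=WAKEREF_PATTERN):
    """Yield (line_number, line) for each line of a text stream that matches."""
    for number, line in enumerate(stream, start=1):
        if pattern.search(line):
            yield number, line.rstrip("\n")


def scan_log(path, pattern=WAKEREF_PATTERN):
    """Scan a log file, transparently handling gzip-compressed rotations."""
    opener = gzip.open if path.endswith(".gz") else open
    # errors="replace" keeps the scan going if the log contains stray bytes.
    with opener(path, "rt", errors="replace") as stream:
        return list(find_matching_lines(stream, pattern))


# Invented sample lines, used only to demonstrate the helper.
sample = io.StringIO(
    "May  2 10:00:01 nuc kernel: usb 1-1: new device\n"
    "May  2 10:00:02 nuc kernel: i915: example wakeref message\n"
)
hits = list(find_matching_lines(sample))
for number, line in hits:
    print(f"{number}: {line}")
```

On a real system this would be called as `scan_log("syslog.6.gz")` for a rotated log; the line numbers give a starting point for reading the surrounding context.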
Comment 9 Lakshmi 2019-05-29 12:37:56 UTC
(In reply to Thomas Senfter from comment #8)
> No "zombie" yet.
> 
> On one NUC we observed another issue. The SSD was working at full load
> (according to the LED), no command could be executed and connecting with ssh
> also failed. However, ping did work and the command line was responsive (we
> could enter a command and cancel it with CTRL-C, but the command was not
> executed). We first thought this is because of a full disk, but this does
> not seem to be the case, as the disk was nearly empty after reboot and we
> also do not observe an increasing disk usage since yesterday.
> 
> We did a bit of investigation of the logs (which is difficult as we have no
> clue when the described behaviour started) and the only "unusual" thing we
> found in the logs was some Wakeref stuff (can be found in Wakeref_stuff.txt
> and in "syslog.6.gz", the whole syslog around this time). However we don't
> know if this is something interesting.

Thanks for the feedback. Can you please update the kernel and userspace to the latest and verify the issue?
Comment 10 Thomas Senfter 2019-06-11 07:14:05 UTC
We did not observe any problems in the last month (no original "zombie" and also no other problem). We will come back in case one of the problems occurs again.
Comment 11 Lakshmi 2019-06-11 08:27:06 UTC
(In reply to Thomas Senfter from comment #10)
> We did not observe any problems in the last month (no original "zombie" and
> also no other problem). We will come back in case one of the problems occurs
> again.

Thanks for the feedback. I will close this bug for now. Please reopen if this issue occurs with latest drmtip/userspace.
When you reopen please attach the dmesg from boot with kernel parameters drm.debug=0x1e log_buf_len=4M. Also attach the error file in case of a hang.
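
As a quick check that the requested parameters are active before collecting dmesg, the sketch below parses a kernel command line and decodes the `drm.debug` bitmask. The category bits follow the `DRM_UT_*` definitions in the kernel's `drm_print.h` as far as I know them (only the low bits are listed); the helper names and the example command line are hypothetical.

```python
# Low bits of the drm.debug mask (DRM_UT_* in the kernel's drm_print.h).
DRM_DEBUG_CATEGORIES = {
    0x01: "CORE",
    0x02: "DRIVER",
    0x04: "KMS",
    0x08: "PRIME",
    0x10: "ATOMIC",
    0x20: "VBL",
}


def parse_cmdline(cmdline):
    """Split a kernel command line into a {parameter: value} dictionary."""
    params = {}
    for token in cmdline.split():
        key, _, value = token.partition("=")
        params[key] = value  # flags without '=' get an empty value
    return params


def decode_drm_debug(value):
    """Return the names of the debug categories enabled by a drm.debug value."""
    mask = int(value, 0)  # base 0 accepts both "0x1e" and "30"
    return [name for bit, name in DRM_DEBUG_CATEGORIES.items() if mask & bit]


# Hypothetical command line; on a running system read /proc/cmdline instead.
example = "BOOT_IMAGE=/vmlinuz root=/dev/sda1 ro drm.debug=0x1e log_buf_len=4M"
params = parse_cmdline(example)

print("drm.debug categories:", decode_drm_debug(params["drm.debug"]))
print("log_buf_len:", params.get("log_buf_len", "(not set)"))
```

With `drm.debug=0x1e` the enabled categories are DRIVER, KMS, PRIME and ATOMIC; `log_buf_len=4M` enlarges the kernel ring buffer so that the extra debug output from boot is less likely to be overwritten.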

