63946 – [snb] [h.264] [vaapi-mplayer] GPU hangs with both UXA and SNA after concurrent playback + GTT mapping fails after GPU hang

Status:	RESOLVED DUPLICATE of bug 63921

Alias:	None

Product:	libva
Classification:	Unclassified
Component:	intel
Version:	unspecified
Hardware:	Other All

Importance:	medium normal
Assignee:	haihao
QA Contact:	Sean V Kelley

Reported:	2013-04-26 07:12 UTC by Nicolas Hillegeer
Modified:	2015-11-23 13:51 UTC

See Also:	63921

Attachments
i915_error_state (gzipped) (990.95 KB, application/gzip) 2013-05-16 12:15 UTC, Krzysztof Kotlenga
Patch to try if you get 'Can't allocate memory' when reading i915_error_state (19.30 KB, patch) 2013-05-24 11:16 UTC, Mika Kuoppala
Insert a phantom slice (5.20 KB, patch) 2013-07-03 02:45 UTC, haihao

Description Nicolas Hillegeer 2013-04-26 07:12:30 UTC

Hey guys, this is a cross post of a bug I first posted on the xorg bug tracker:

https://bugs.freedesktop.org/show_bug.cgi?id=63921

Chris Wilson suggested I show you guys as well. Most of the information is available in that bug. Should I cross-post the files as well for easier management? Let me know :).

In short: I can reliably provoke a crash by playing many 1080p H.264 files concurrently. It does take at least an hour, though. After a while either X.org crashes completely (SNA acceleration) or it refuses to load more accelerated surfaces (UXA acceleration). SNA seems to crash much sooner. With UXA, vaapi-mplayer tells me an assertion ('buffers') failed while performing dri2GetRenderingBuffer. This can continue for quite a while until finally either my test program crashes or X.org gives up.

I'm running vaapi-mplayer with the switches -vo vaapi -va vaapi -nolirc -nosound -vsync -fixed-vo (although the mplayer instances are often destroyed and recreated, so the last one isn't that important). I run multiple instances (up to 16).
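
For anyone trying to reproduce this, the setup above could be sketched roughly as follows. This is only an illustration under assumptions: the `launch_players` helper, the `DRY_RUN` variable and the file names are made up for the example, and only the mplayer switches are taken from the description above.

```shell
#!/bin/sh
# Sketch: start N concurrent mplayer instances with the switches quoted above.
# launch_players, DRY_RUN and the file arguments are hypothetical names.

MPLAYER_OPTS="-vo vaapi -va vaapi -nolirc -nosound -vsync -fixed-vo"

launch_players() {
    # Usage: launch_players COUNT FILE [FILE...]
    [ "$#" -ge 2 ] || { echo "usage: launch_players COUNT FILE..." >&2; return 1; }
    count=$1
    shift
    i=0
    while [ "$i" -lt "$count" ]; do
        # Cycle through the given files so every instance gets one.
        idx=$(( i % $# + 1 ))
        eval "file=\${$idx}"
        if [ -n "$DRY_RUN" ]; then
            # Only print the command line instead of running it.
            echo "mplayer $MPLAYER_OPTS $file"
        else
            mplayer $MPLAYER_OPTS "$file" >/dev/null 2>&1 &
        fi
        i=$(( i + 1 ))
    done
}

# Example (not executed here): launch_players 16 trailer1.mov trailer2.mov
```

Setting `DRY_RUN=1` prints the command lines without starting anything, which is handy for checking the instance count before a long run.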

Versions I'm running:

Distribution: debian wheezy
Kernel: 3.9-rc8 (latest as of this writing)
libdrm: 2.4.43 (compiled from debian git)
mesa: 8.0.5-4 (stock)
intel-vaapi-driver: 1.0.21.pre1 (I tried stock too, of course)
libva: 1.1.2.pre1 (VA-API version 0.33.0)
intel-xorg-driver: 2.21.6

If you guys have any idea, I'd be happy to give them a shot!

Thanks a bunch,
Nicolas

Comment 1 Chris Wilson 2013-04-26 07:38:15 UTC

Note the X crashes are irrelevant to this bug (and due to the kernel not handling itself after a GPU hang correctly). What is important here is that it seems to be libva triggering a GPU hang.

Nicolas, would just repeatedly looping through a single h.264 video trigger the hang?

Comment 2 Nicolas Hillegeer 2013-04-26 07:45:57 UTC

(In reply to comment #1)
> Note the X crashes are irrelevant to this bug (and due to the kernel not
> handling itself after a GPU hang correctly). What is important here is that
> it seems to be libva triggering a GPU hang.
> 
> Nicolas, would just repeatedly looping through a single h.264 video trigger
> the hang?

But the bad drawables occur before a full GPU hang, no? On an Ivy Bridge system I have here as well (all stock wheezy btw, except kernel 3.9-rc8), it's capable of generating quite a lot of bad drawables but still playing on. I'll go check up on how that system is performing soon, most likely it has crashed too. Fingers crossed.

It's been quite a while since I last tried playing just a single video. I have a recollection that it does happen but takes much longer (on the order of 3-to-5 days). Do you want me to try anyway? Which acceleration method would you prefer? From my side at least it does seem to make a difference in the time-to-crash. SNA is much more reliable in that regard.

Comment 3 haihao 2013-05-02 06:02:24 UTC

I have repeatedly played back a single 1080P H.264 video over 110 hours on SNB without GPU hang, then another H.264 video was added.  The two video files are being played back concurrently over 6 hours without any issue.

To avoid uncertainty, I didn't start up a desktop environment, instead I just started up X server with a client for testing. 

libva and libva-intel-driver are identical to yours; other components are a little different:

kernel: drm-intel-next with commit bae3699182027525d92b97d904578a533264b242
xf86-video-intel: master branch with commit 308f0208de59620190dd3cb65b3243d2e8a7bd87, UXA acceleration.

Comment 4 Nicolas Hillegeer 2013-05-02 07:58:24 UTC

(In reply to comment #3)
> I have repeatedly played back a single 1080P H.264 video over 110 hours on
> SNB without GPU hang, then another H.264 video was added.  The two video
> files are being played back concurrently over 6 hours without any issue.
> 
> To avoid uncertainty, I didn't start up a desktop environment, instead I
> just started up X server with a client for testing. 
> 
> libva and libva-intel-driver are identical to yours, other components are a
> little different from yours:
> 
> kernel: drm-intel-next with commit bae3699182027525d92b97d904578a533264b242
> xf86-video-intel: master branch with commit
> 308f0208de59620190dd3cb65b3243d2e8a7bd87, UXA acceleration.

Hey Haihao,

Yes, with single or double videos it can take a random (but sometimes very long) amount of time. I've had a single video running for 6 days (!) before it crashed.

Are you using mplayer -vo vaapi -va vaapi?

To expedite the issue, I would start playing 16 or even better 32 concurrent videos. It should start crashing within a day. My heavily upgraded setup (different from stock wheezy) sometimes manages to pull a bit over 12 hours on UXA, but it seems to be random.

Note that I would find it entirely acceptable if the card wasn't able to play 32 videos concurrently, but it can and it does, to my surprise. It's just that after a while it locks up, which is something that I would like to avoid.

If you want the problem to occur sooner, switch to SNA instead of UXA as well. It will crash within 2 hours if your setup is similar.

That said, I'm using neither drm-intel-next (master) nor xf86-video-intel (master). Do you think these are very relevant, or could it be a libva issue? It is far easier for me to upgrade libva and intel-vaapi-driver, but if you think it's something else I will do my very best to upgrade those as well and see if the problem persists.

I always leave 2 units running during the night to see if they crash and sure enough, both crashed again. Sometimes they lock up hard (I can't get any information then), sometimes the X server just quits.

This morning, I came back to find a lot of this in the Xorg.0.log:

__kgem_bo_map__gtt: failed to retrieve GTT offset for handle=103: 5
__kgem_bo_map__gtt: failed to retrieve GTT offset for handle=75: 5
__kgem_bo_map__gtt: failed to retrieve GTT offset for handle=103: 5
__kgem_bo_map__gtt: failed to retrieve GTT offset for handle=19: 5

Dmesg gave me the usual:

[drm:i915_hangcheck_hung] *ERROR* Hangcheck timer elapsed... GPU hung
[drm:] capturing error event; ...
[drm:i915_reset] *ERROR* Failed to reset chip.
[drm:i915_gem_wait_for_error] *ERROR* Timed out waiting for the GPU reset to complete

Unfortunately I once again could not read i915_error_state; it still complains about being unable to allocate memory, even when I only have 6 measly processes running (mostly ttys and a login shell).
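
When the machine is still reachable after a hang, the kernel messages above can be summarised with something like the sketch below. The helper names `count_gpu_hangs` and `reset_failed` are hypothetical, and the patterns are simply copied from the dmesg lines quoted above, so they may need adjusting for other kernel versions.

```shell
#!/bin/sh
# Sketch: summarise i915 hang messages read from standard input.
# count_gpu_hangs and reset_failed are hypothetical helper names; the
# patterns come from the dmesg excerpt quoted above.

count_gpu_hangs() {
    # Prints the number of hangcheck reports seen on stdin.
    grep -c 'Hangcheck timer elapsed'
}

reset_failed() {
    # Succeeds if the kernel reported that the GPU reset did not work.
    grep -q 'Failed to reset chip'
}

# Example (not executed here):
#   dmesg | count_gpu_hangs
#   dmesg | reset_failed && echo "GPU reset failed, expect GTT mapping errors"
```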

Comment 5 haihao 2013-05-06 02:03:46 UTC

> Yes, with single or double videos it can take a random (but sometimes very
> long) amount of time. I've had a single video running for 6 days (!) before
> it crashed.

I started the test on April 27 and the hang has not occurred so far. (I added another 7 videos to the test on May 3.)

> Are you using mplayer -vo vaapi -va vaapi?

I am using mplayer -vo vaapi without -va vaapi; note that -va is deprecated. You can check the log to see whether VA-API acceleration is used.

======
VO: [vaapi] 1920x1080 => 1920x1080 H.264 VA-API Acceleration
[vo_vaapi] Using 1:1 VA surface mapping
[VD_FFMPEG] XVMC-accelerated MPEG-2.
V:  16.6   0/  0  1%  4%  0.0% 0 0
======

> 
> To expedite the issue, I would start playing 16 or even better 32 concurrent
> videos. It should start crashing within a day. My heavily upgraded setup
> (different from stock wheezy) sometimes manages to pull a bit over 12 hours
> on UXA, but it seems to be random.
> 
> Note that I would find it entirely acceptable if the card wasn't able to
> play 32 videos concurrently, but it can and it does, to my surprise. It's
> just that after a while it locks up, which is something that I would like to
> avoid.
> 
> If you want the problem to occur sooner, switch to SNA instead of UXA as
> well. It will crash within 2 hours if your setup is similar.
> 
> That said, I'm not using drm-intel-next (master) nor xf86-video-intel
> (master). Do you think these are very relevant or it could be an libva
> issue? It is far easier for me to upgrade libva and intel-vaapi-driver, but
> if you think it's other things I will do my very best to upgrade those as
> well and see if the problem persists.

I guess some resource is being exhausted, which results in the hang. It probably isn't a libva bug; video playback just reveals it. Have you tried an X server without a window manager/session manager?

Comment 6 Yang Lianyue 2013-05-14 06:09:13 UTC

Environment:
--------------------------
Platform:     SNB
Kernel:       3.9.0-rc5+
X11R7:        X11R7.2013-04-05.unstable
libva:        (staging) 5ec25c3d563d9ebd479a5ff978afe0a32f9cc00b
intel-driver: (staging) 1fd62ffd336293dce7d091bcea8399a40ccea21e
mplayer:      (hwaccel-vaapi) f1ad459a263f8537f6cba3bf479daea61c6104b9

Comments:
--------------------------
We were running mplayer with the switches -nosound -fps 30 -vo vaapi -va vaapi *.* (42 instances at the same time). The test was started 7 days ago and has not hung. It is still running.

Comment 7 Nicolas Hillegeer 2013-05-14 06:41:54 UTC

(In reply to comment #6)
> Environment:
> --------------------------
> Platform:     SNB
> Kernel:       3.9.0-rc5+
> X11R7:        X11R7.2013-04-05.unstable
> libva:        (staging) 5ec25c3d563d9ebd479a5ff978afe0a32f9cc00b
> intel-driver: (staging) 1fd62ffd336293dce7d091bcea8399a40ccea21e
> mplayer:      (hwaccel-vaapi) f1ad459a263f8537f6cba3bf479daea61c6104b9
> 
> Comments:
> --------------------------
> We were running mplayer with switches -nosound -fps 30 -vo vaapi -va vaapi
> *.* (At the same time 42 instances). The test had been started 7days ago,
> and it didn't hang. Now it is still running.

Wow, that's... very impressive. First of all: thank you for taking this bug report seriously. I've been testing with kernel 3.9.0 and am now running with 3.9.2 today, will see if it crashes still.

Since I still haven't been able to get a useful i915_error_state I guess the only thing I've got is to search for any differences (and how that could help me avoid the issue or you guys solve it). The only really different thing I see in the mplayer commandline is the -fps 30 switch. I don't use it, do you think it's necessary? Does it help?

Just for the record: did you use UXA or SNA acceleration?

How much main memory do you guys have? Does sandy bridge actually use main memory as vga memory? I have 1.8GB on my systems.

My version of mplayer is identical to yours. libva is similar, as well as intel-driver. My xorg is quite a bit older though (stock debian wheezy). Do you think it could be the old Xorg? I've been holding off on an upgrade because it's a huge piece of infrastructure that seems to be quite difficult to replace. Before last month I hadn't even created a debian package, but I could try my hand at Xorg if that's necessary. Whatever it takes :)

Another thing that might be different, though I don't see it as a likely cause, is that I continuously create and destroy instances of mplayer. I don't play them in a loop with the -loop switch; rather, I have an external program kill and recreate them.

Did you not get any dri2SwapComplete bugs in your Xorg.0.log? Those always happen to me, even quite a bit before crashing, and on my IVB system they occur but afterwards the system doesn't crash. If you don't get them at all, something must be seriously different.

Anyway, you guys are the experts; if there's anything I could try, feel free to tell me. I'd like to get this niggling bug resolved as much as (likely even more than) you. Thanks in advance for everything.

Kind regards,
Nicolas

Comment 8 Nicolas Hillegeer 2013-05-14 07:05:22 UTC

I forgot to mention it in my response above, but just in case: did you use heavy 1080p videos for all 42 instances? It might be some kind of memory exhaustion issue (pushing up against the limits, maybe some edge case switch which hasn't been exercised much that I'm hitting). I'm using the 1080p trailers downloaded from the apple movie trailers page. They're about 150MB each (usually).

Comment 9 Krzysztof Kotlenga 2013-05-14 11:25:33 UTC

I'm experiencing similar crashes. I have both SNB and IVB for testing - their behaviour is slightly different but both eventually hang.

OS: Fedora 19
SNB: Celeron B810
IVB: i5-3610ME
Board: Aaeon GENE-QM77
libva and intel-driver - tried both staging and master with no difference

I think the two most important things right now are:
- i915_error_state doesn't work after a hang - "Cannot allocate memory" (3.9.1-301.fc19.i686.PAE kernel, but I'm preparing to test drm-intel-testing)
- not every h264 stream leads to a hang. I have a sample that with 100% probability will cause a GPU hang on SNB, but will play fine on IVB (just a single playback from gst-vaapi). Available here: http://u.42.pl/2JWo (gst-launch-0.10 filesrc location=... ! matroskademux ! vaapidecode ! vaapisink)

I can see three kinds of crashes:
- GPU hang, followed by a successful reset
- GPU hang shortly followed by a complete system hang
- immediate system hang

Sometimes, intel_gpu_top is able to show this:

render busy: 100%: ████████████████████
bitstream busy: 100%: ████████████████████
blitter busy: 100%: ████████████████████

task percent busy
CS: 100%: ████████████████████

I'm sure that on SNB it doesn't matter if SNA or UXA is being used. Haven't checked on IVB.

Both SNB and IVB eventually hang while playing multiple RTSP (h264) streams simultaneously.

I'm going to try exact revisions from the previous comment and also mplayer-vaapi and report back.

Can Bug #50719 be related?

Comment 10 Nicolas Hillegeer 2013-05-14 13:40:18 UTC

(In reply to comment #9)
> I'm experiencing similar crashes. I have both SNB and IVB for testing -
> their behaviour is slightly different but both eventually hang.

Nice, now I know I'm not crazy
 
> I think the two most important things right now is that:
> - i915_error_state doesn't work after a hang - "Cannot allocate memory"
> (3.9.1-301.fc19.i686.PAE kernel, but I'm preparing to test drm-intel-testing)
> - not every h264 stream leads to a hang. I have a sample that with 100%
> probability will cause a GPU hang on SNB, but will play fine on IVB (just a
> single playback from gst-vaapi). Available here: http://u.42.pl/2JWo
> (gst-launch-0.10 filesrc location=... ! matroskademux ! vaapidecode !
> vaapisink)
> 
> I can see three kind of crashes:
> - GPU hang, followed by a successful reset
> - GPU hang shortly followed by a complete system hang
> - immediate system hang
> 
> Sometimes, intel_gpu_top is able to show this:
> 
>    render busy: 100%: ████████████████████
> bitstream busy: 100%: ████████████████████
>   blitter busy: 100%: ████████████████████
> 
>           task  percent busy
>             CS: 100%: ████████████████████
> 
> I'm sure that on SNB it doesn't matter if SNA or UXA is being used. Haven't
> checked on IVB.
> 
> Both SNB and IVB eventually hang while playing multiple RTSP (h264) streams
> simultaneously.
> 
> I'm going to try exact revisions from the previous comment and also
> mplayer-vaapi and report back.
> 
> Can Bug #50719 be related?

That stream manages to make the GPU hang on my unupgraded debian wheezy boxes as well. It doesn't bring down the box on the first try though; the GPU seems to recover. But I imagine that it easily could, as the X server was already complaining about its event queue overflowing. Maybe it was able to recover because it only started crashing right at the end of the video.

I'm going to try the box on which I'm testing kernel 3.9.2 and report again. I'm pretty happy this seems to be a reproducible testcase of at least part of the error. Thanks a lot Krzysztof.

Comment 11 Krzysztof Kotlenga 2013-05-14 15:43:01 UTC

(In reply to comment #6)

> Environment:
> --------------------------
> Platform:     SNB
> Kernel:       3.9.0-rc5+
> X11R7:        X11R7.2013-04-05.unstable

Platform:         SNB (Celeron B810)
Kernel:           3.9.0-rc5+ (drm-intel-testing) f59736c314fc8835b5294cb955f3f16a75cd72d2
xserver:          xorg-x11-server 1.14.1-1.fc19
xf86-video-intel: xorg-x11-drv-intel-2.21.6-1.fc19.i686 (UXA)

> libva:        (staging) 5ec25c3d563d9ebd479a5ff978afe0a32f9cc00b
> intel-driver: (staging) 1fd62ffd336293dce7d091bcea8399a40ccea21e
> mplayer:      (hwaccel-vaapi) f1ad459a263f8537f6cba3bf479daea61c6104b9

Same.

Command: mplayer -vo vaapi <sample from comment #9>
Result: reproducible hangs as described previously.

Note: I'm not getting any errors from the X server as mentioned by Nicolas, at least not in tailf /var/log/Xorg.0.log.

Comment 12 Nicolas Hillegeer 2013-05-14 20:58:49 UTC

Just tried it on 3.9.2, every time I try it the GPU seems to recover. No hard hangs yet. I'm wondering if it's just my luck, but I will keep trying. Have you had hard hangs with that video yet Krzysztof? Does it also always happen at the end of the video for you?

Comment 13 Nicolas Hillegeer 2013-05-15 05:02:30 UTC

(In reply to comment #12)
> Just tried it on 3.9.2, every time I try it the GPU seems to recover. No
> hard hangs yet. I'm wondering if it's just my luck, but I will keep trying.
> Have you had hard hangs with that video yet Krzysztof? Does it also always
> happen at the end of the video for you?

Next morning report: the video posted by Krzysztof manages to cause a GPU hang but it seems to recover well after that. However running my usual test (32 concurrent mplayer-vaapi instances) over the night has left the machine in a hard locked state. Nothing moves on the screen and I can't SSH in.

Comment 14 Krzysztof Kotlenga 2013-05-15 07:15:24 UTC

(In reply to comment #10)
> That stream manages to make the GPU hang on my unupgraded debian wheezy
> boxes as well. It doesn't bring down the box on the first try though, the
> GPU seems to recover.
> (...)
> Maybe it was able to recover because it just started crashing right at the
> end of the video.

Indeed, that's the behaviour I'm seeing as well.

(In reply to comment #12)
> Just tried it on 3.9.2, every time I try it the GPU seems to recover. No
> hard hangs yet. I'm wondering if it's just my luck, but I will keep trying.

I haven't tried 3.9.2 yet, but will do shortly.

> Have you had hard hangs with that video yet Krzysztof?

On the older kernels - yes, (almost) always on the second try.

> Does it also always happen at the end of the video for you?

Yes.

This leads me to believe the stream may be somehow incorrect and that the libva/intel-driver doesn't handle it properly. Perhaps it's a different issue than the concurrent playback one, but there's a similarity - not every video is problematic, judging by the fact that Yang has not been able to reproduce the problem. Can you post a link to the exact video you are using for your tests?

(In reply to comment #13)
> However running my usual test (32 concurrent mplayer-vaapi instances) over
> the night has left the machine in a hard locked state. Nothing moves on the
> screen and I can't SSH in.

I can confirm that too, I'm yet to see a GPU reset in that case. It seems to always lock almost immediately.

Comment 15 Nicolas Hillegeer 2013-05-15 07:27:08 UTC

> This leads me to believe the stream may be somehow incorrect and that the
> libva/intel-driver doesn't handle it properly. Perhaps it's a different
> issue than the concurrent playback one, but there's a similarity - not every
> video is problematic, judging by the fact that Yang has not been able to
> reproduce the problem. Can you post a link to the exact video you are using
> for your tests?

There's the catch: I hadn't yet found a video with which it was at all reproducible. The only thing I had done was to collect the heaviest video files I could get my hands on and play them all concurrently. These happen to be some random trailers from the apple movie trailers website. One of them, for example, is the latest Iron Man 3 trailer.

Only by leaving this on for quite a few hours, it locks up hard. Sometimes it makes X crash and doesn't lock up. I'm not sure I've seen that on the 3.9 series of kernels yet, though.

I've been able to provoke the crash quite a bit quicker by enabling SNA acceleration, it quite consistently takes a bit over an hour of 32 concurrent videos to lock the system hard.

So to sum up: I've found no specific video that locks up the system yet (except for yours which causes a GPU hang), which is why I first supposed that it was a resource consumption issue rather than a decoding issue. I had the faint feeling that DRM (GEM) was not releasing all its memory or that there was contention in some kind of lock and the kernel decided to self-interrupt and crash or some such. Can't know for sure without one of those venerable i915_error_state's I guess. So this could be 2 separate issues, one decoding and another resource based.

It is surprising, though, that everything seems to consume a bit more memory with this kernel. My 32 instances push up so close to the memory limit that the kernel sometimes decides to just kill the process that is using the most memory, which forces me to restart the test. Maybe I'll try with 24 concurrent instances to get more reliable results.

Comment 16 Krzysztof Kotlenga 2013-05-16 12:15:43 UTC

Created attachment 79403
i915_error_state (gzipped)

After some babbling on #intel-gfx and silly kernel hacking (https://lkml.org/lkml/2013/1/31/20), I managed to get the i915_error_state.

dmesg output:
[  155.774698] [drm:i915_hangcheck_hung] *ERROR* Hangcheck timer elapsed... GPU hung
[  155.774718] [drm] capturing error event; look for more information in /sys/kernel/debug/dri/0/i915_error_state

Comment 17 Nicolas Hillegeer 2013-05-16 12:24:47 UTC

(In reply to comment #16)
> Created attachment 79403 [details]
> i915_error_state (gzipped)
> 
> After some babbling on #intel-gfx and silly kernel hacking
> (https://lkml.org/lkml/2013/1/31/20), I managed to get the i915_error_state.
> 
> dmesg output:
> [  155.774698] [drm:i915_hangcheck_hung] *ERROR* Hangcheck timer elapsed...
> GPU hung
> [  155.774718] [drm] capturing error event; look for more information
> in/sys/kernel/debug/dri/0/i915_error_state

That's really cool, did you just apply the patch to a 3.9 series kernel and did it work afterwards, or did you do something else too?

Comment 18 Krzysztof Kotlenga 2013-05-16 14:19:14 UTC

(In reply to comment #17)
3.9.2, nothing else except one minor correction. Anyway, there's a better patch in the works by Mika Kuoppala: http://paste.ubuntu.com/5670894/

Should apply cleanly on drm-intel-next-queued and with little help on stable.

Comment 19 Krzysztof Kotlenga 2013-05-20 15:29:51 UTC

Nicolas, could you run the following script:

#!/bin/sh
while true; do
  tail -n 1 /sys/kernel/debug/dri/0/i915_gem_gtt
  sleep 1
done

and report back if the number of objects grows significantly before the system hangs? (simultaneous playback test)
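
If a timestamped log is easier to compare against the moment of the hang, the loop above could be extended roughly like this. This is a sketch under assumptions: `gtt_object_count` and `watch_gtt` are hypothetical names, and the parser assumes the last line of i915_gem_gtt has the form "Total N objects, ..."; check the actual output on your kernel first. Reading debugfs normally requires root.

```shell
#!/bin/sh
# Sketch: log the GTT object count once per second with a timestamp.
# gtt_object_count and watch_gtt are hypothetical helpers. The
# "Total N objects" format of the last line is an assumption.

gtt_object_count() {
    # Reads i915_gem_gtt contents on stdin and prints N from the last line.
    tail -n 1 | sed -n 's/^Total \([0-9][0-9]*\) objects.*/\1/p'
}

watch_gtt() {
    gtt_file=${1:-/sys/kernel/debug/dri/0/i915_gem_gtt}
    while true; do
        n=$(gtt_object_count < "$gtt_file")
        # Print "?" when the last line did not match the assumed format.
        echo "$(date '+%H:%M:%S') ${n:-?} objects"
        sleep 1
    done
}

# Example (not executed here): watch_gtt | tee gtt.log
```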

Comment 20 Nicolas Hillegeer 2013-05-20 15:31:38 UTC

(In reply to comment #19)
> Nicolas, could you run the following script:
> 
> #!/bin/sh
> while true; do
>   tail -n 1 /sys/kernel/debug/dri/0/i915_gem_gtt
>   sleep 1
> done
> 
> and report back if the number of objects grows significantly before the
> system hangs? (simultaneous playback test)

I'll do so when I get back home, still have to apply the other patch as well. Been away for some days.

Comment 21 Nicolas Hillegeer 2013-05-24 08:27:35 UTC

(In reply to comment #20)
> (In reply to comment #19)
> > Nicolas, could you run the following script:
> > 
> > #!/bin/sh
> > while true; do
> >   tail -n 1 /sys/kernel/debug/dri/0/i915_gem_gtt
> >   sleep 1
> > done
> > 
> > and report back if the number of objects grows significantly before the
> > system hangs? (simultaneous playback test)
> 
> I'll do so when I get back home, still have to apply the other patch as
> well. Been away for some days.

Ok, so some preliminary results (nothing fancy I'm afraid, I want to be thorough), some things I've tried:

1) Test: Running with kernel 3.9.3 + patch, stock debian wheezy, not checking i915_gem_gtt: 
- Result: hard lock up, could not get any info

2) Test: Running with kernel 3.9.3 + patch, stock debian wheezy, checking i915_gem_gtt: 
- Result: The number of GTT objects when playing 32 videos concurrently averages 2500, but after a while it drops to about 1500. I can see this on screen because new videos are created much more slowly. I built my test program without output, because the amount of debugging output was making the system too slow to successfully run 32 videos (go figure), so I can't see whether mplayer complains about anything. However, starting up another mplayer instance (33 videos in total) gives no problems whatsoever. So something weird is going on: I can still play accelerated videos, but something is amiss, as the system is clearly having difficulty creating them. Once they are playing, they are smooth. Sadly, the system hasn't locked up yet. It seems that at an average of 1500 GTT objects the system does not crash rapidly. I've tried this twice now, and both times the number of GTT objects dropped after a while and the system just continued merrily. Dmesg shows no special output: no GPU hangs, nothing. /var/log/Xorg.0.log shows many bad drawables, as always.

Conclusion: it seems to be some kind of "heisenbug", I start checking the gtt objects and it doesn't crash outright anymore, it just slows down after a while. 

To alleviate the issue and try to diagnose it further, I will try the following things:

- reduce debug output to just mplayer related things to see if the instances sometimes have trouble creating and if they can tell me something.
- try to upgrade the libva and vaapi driver and libdrm, to see if that can help with the "sudden slowdown" issue and keep it going at full speed until it crashes.

If anybody has any more ideas and/or can explain to me why my testing methodology is faulty and what I should do instead, I'm all ears!

Comment 22 Nicolas Hillegeer 2013-05-24 10:13:42 UTC

> If anybody has any more ideas and/or can explain to me why my testing
> methodology is faulty and what I should do instead, I'm all ears!

So, I was able to get some output from an mplayer process that failed to start up. The current situation is that about half of the 32 concurrent mplayer processes do not start up correctly (they are constantly being respawned). The failing ones produce output like this (ctlHandleResult handles the output from the child mplayer processes):

[24/05 11:58:31] ctlHandleResult(): (0x7f8f20ac7000) read 33 bytes => Maximum number of clients reached
[24/05 11:58:31] ctlHandleResult(): (0x7f8f20ac7000) read 42 bytes => vo: couldn't open the X11 display (:0.0)!
[24/05 11:58:31] ctlHandleResult(): (0x7f8f20ac7000) read 64 bytes => Error opening/initializing the selected video_out (-vo) device    

I was able to find the second line in the mplayer source code in libvo/x11_common.c

    XSetErrorHandler(x11_errorhandler);

    dispName = XDisplayName(mDisplayName);

    mp_msg(MSGT_VO, MSGL_V, "X11 opening display: %s\n", dispName);

    mDisplay = XOpenDisplay(dispName);
    if (!mDisplay)
    {
        mp_msg(MSGT_VO, MSGL_ERR,
               "vo: couldn't open the X11 display (%s)!\n", dispName);
        return 0;
    }
    mScreen = DefaultScreen(mDisplay);  // screen ID
    mRootWin = RootWindow(mDisplay, mScreen);   // root window ID

So it seems the application can no longer connect to the X server above a certain number of clients after a certain amount of time. Something makes the X.org server either clogged or willing to accept fewer clients. Since I couldn't find the message "Maximum number of clients reached" in the mplayer source, I assume it is an X.org error.
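
One way to narrow this down would be to probe the X server just before each respawn and log whether it still accepts a new connection. The sketch below is only an assumption about how that could be done: `x_accepts_clients`, `log_x_state` and `X_PROBE` are hypothetical names, and xdpyinfo is used merely because it exits non-zero when it cannot open the display.

```shell
#!/bin/sh
# Sketch: check whether the X server still accepts a new client.
# x_accepts_clients, log_x_state and X_PROBE are hypothetical names.

x_accepts_clients() {
    display=${1:-:0.0}
    # X_PROBE can be overridden, e.g. to test without an X server.
    ${X_PROBE:-xdpyinfo} -display "$display" >/dev/null 2>&1
}

log_x_state() {
    # Prints one timestamped line describing the probe result.
    if x_accepts_clients "$1"; then
        echo "$(date '+%d/%m %H:%M:%S') X accepts new clients"
    else
        echo "$(date '+%d/%m %H:%M:%S') X refused a new client"
    fi
}

# Example (not executed here): log_x_state :0.0 >> x_state.log
```

Comparing such a log with the GTT object counts should show whether the refused connections start at the same time as the drop in objects.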

I will try upgrading packages to see if that helps, because possibly the reduction of maximum gtt objects is not completely a driver issue. Although it is strange that in the beginning it works perfectly and after a while it sort of breaks down. Will recompile most recent libva, intel-vaapi-driver and intel-xorg-driver and retest asap.

Comment 23 Mika Kuoppala 2013-05-24 11:16:19 UTC

Created attachment 79760
Patch to try if you get 'Can't allocate memory' when reading i915_error_state

Comment 24 Nicolas Hillegeer 2013-05-24 12:46:08 UTC

(In reply to comment #19)
> Nicolas, could you run the following script:
> 
> #!/bin/sh
> while true; do
>   tail -n 1 /sys/kernel/debug/dri/0/i915_gem_gtt
>   sleep 1
> done
> 
> and report back if the number of objects grows significantly before the
> system hangs? (simultaneous playback test)

Krzysztof, that's not what happens in the simultaneous playback test for me; the count just suddenly drops when it crashes (I can see it if it doesn't lock up hard). In the other bug thread Chris just commented that it appears to be the intel-vaapi driver's fault. That's some big progress I think: just one component to try to fix instead of blindly going all over the place.

Comment 25 Krzysztof Kotlenga 2013-06-28 12:53:50 UTC

So, did anyone have any luck with this? The problem is still there in the latest VA releases AFAICS. Interest from the Intel guys seems to have stalled, with no further pointers on how this can be resolved...

Comment 26 Nicolas Hillegeer 2013-06-28 13:02:58 UTC

(In reply to comment #25)
> So, did anyone had any luck with this? The problem is still there in latest
> VA releases AFAICS. Interest from Intel guys seems to have stalled, with no
> further pointers how this can be resolved...

Hey Krzysztof, I've been busy with some other things that haven't concerned video playback for my app so much, but I can attest to this:

The problems are quite a bit better with some of the more recent updates. Now I am running this:

- Kernel 3.9.8 (I believe you can use from 3.9.4 onwards to have better results with the hangchecks, but not sure). The newer kernels seem to help a lot with the hanging, haven't had a hard hang in ages. Try it if you haven't!

- xserver-xorg-video-driver: 2.21.9 with SNA acceleration. Whereas before 2.21.8 SNA made it crash more often (about every hour), now SNA is definitely the most stable, better than UXA. So this is pretty important I believe, be sure to try this out. (even though I remember Chris Wilson saying that it shouldn't matter too much, I've noticed differences).

- libva 1.2.1 and vaapi-driver 1.2.0: I'm not sure how important this is, I haven't noticed too many improvements but it's hard to tell and it can't hurt. Which is something I assume because there have been many many bugfixes since stock debian wheezy 1.0.15. I'll keep using these.

- libdrm 2.4.45: this seems to be required by the newer vaapi-drivers, as said in the release notes on the mailing list. I used to just use stock debian wheezy 2.4.42 and it did play videos, but to take some uncertainty out of the equation I upgraded.

So, in short, could you please try the versions I listed above and tell us if it's still crashing? On my side things seem to be much better, but as I said, I haven't run the heaviest stress test in quite a while.
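
For anyone comparing their own stack against the list above, here is a minimal sketch. The component keys and the `installed` dictionary are hypothetical, and treating the listed versions as minimums is an assumption on my part, not something established in this thread. How you collect the real version strings (for example from your package manager) is left out.

```python
# Versions listed in this comment, treated here as minimums (an assumption).
# The dictionary keys are hypothetical labels, not package names.
REPORTED = {
    "kernel": "3.9.8",
    "intel-ddx": "2.21.9",
    "libva": "1.2.1",
    "vaapi-driver": "1.2.0",
    "libdrm": "2.4.45",
}


def parse_version(text):
    """Turn '2.4.45' into (2, 4, 45) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def outdated_components(installed, reported=REPORTED):
    """Return sorted names of components that are missing or older than listed."""
    return sorted(
        name
        for name, wanted in reported.items()
        if name not in installed
        or parse_version(installed[name]) < parse_version(wanted)
    )
```

This only handles plain dotted numeric versions; distribution suffixes such as a trailing package revision would need to be stripped first.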

Comment 27 Nicolas Hillegeer 2013-06-28 15:04:55 UTC

(In reply to comment #26)
> (In reply to comment #25)
> > So, did anyone have any luck with this? The problem is still there in latest
> > VA releases AFAICS. Interest from Intel guys seems to have stalled, with no
> > further pointers how this can be resolved...
> 
> Hey Krzysztof, I've been busy with some other things that haven't concerned
> video playback for my app so much, but I can attest to this:
> 
> (snip)

By the way, I just tested the video you posted that made vaapi-mplayer crash consistently, and it no longer crashes with my current setup. It appears things have indeed improved a lot. I'm not sure whether this bug still exists; I'll do some stress testing later when I get time, but I already want to express my thanks to the whole Intel team, they've done an awesome job so far.

Comment 28 Krzysztof Kotlenga 2013-07-01 10:42:33 UTC

(In reply to comment #26 and comment #27)

I have no such luck; the bugs are still definitely there, and the exact culprit has not been identified AFAIK. My test video + SNB gen6 GT1 + stock Debian Wheezy i386 + the same VA/kernel/drm/intel stack as yours, and it's still totally crash happy. Same thing on Fedora 19 + latest VA. UXA/SNA - doesn't matter.

The possibly (un)related reliability problem with simultaneous playback is still there too, although recently I only gave it limited testing with the latest VA, an older 3.9.x kernel, an older intel DDX and UXA only.

It's time for some serious debugging I guess.

Comment 29 Nicolas Hillegeer 2013-07-01 14:23:37 UTC

(In reply to comment #28)
> (In reply to comment #26 and comment #27)
> 
> I have no such luck, the bugs are still definitely there, exact culprit has
> not been identified AFAIK. My test video + SNB gen6 gt1 + stock Debian
> Wheezy i386 + VA/kernel/drm/intel stack same as yours and it's still totally
> crash happy. Same thing on Fedora 19 + latest VA. UXA/SNA - doesn't matter.
> 
> Possibly (un)related reliability problem with simultaneous playback is still
> there too, although recently I only gave it limited testing with latest VA,
> some older 3.9.x kernel, older intel DDX and UXA only.
> 
> It's time for some serious debugging I guess.

That's... very disconcerting. I went and rechecked everything, since I'm not running on the same dev platform anymore. Some people delivered new boxes which they claimed were exactly the same except for newer Celerons and SSDs. Since I prefer working with SSDs, I switched to those boxes for development. However, now that I went and checked the Xorg.0.log, it says IvyBridge GT1. So I was *NOT* running on a Sandy Bridge. Therefore, the Sandy Bridges most likely still show the instability that was noted earlier, and my report of a crash-free and butter-smooth experience was erroneous.

My apologies for making you waste time duplicating my setup. The bug still stands...
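
To avoid the same mix-up, here is a minimal sketch of checking which GPU generation a log mentions. It assumes the chipset name shows up somewhere in the Xorg log as "Sandybridge"/"Sandy Bridge" or "Ivybridge"/"Ivy Bridge"; the exact format of the log line is not shown in this thread, and the function name is hypothetical.

```python
import re

# One pattern per generation label; allow an optional space because the
# spelling in logs may vary ("Ivybridge" vs "Ivy Bridge"). This is an assumption.
GENERATIONS = {
    "snb": re.compile(r"sandy\s?bridge", re.IGNORECASE),
    "ivb": re.compile(r"ivy\s?bridge", re.IGNORECASE),
}


def gpu_generations_in_log(log_text):
    """Return the sorted labels ('ivb', 'snb') whose chipset name appears in the text."""
    return sorted(
        label for label, pattern in GENERATIONS.items() if pattern.search(log_text)
    )
```

Pass it the contents of your Xorg.0.log; an empty list means neither name was found, which says nothing about other hardware.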

Comment 30 haihao 2013-07-03 02:45:16 UTC

Created attachment 81919 [details] [review]
Insert a phantom slice

This patch tries to fix the hang caused by the video you posted. I am not sure whether the original issue still exists, as I can't reproduce it.

Comment 31 Krzysztof Kotlenga 2013-07-03 10:47:40 UTC

(In reply to comment #30)
> Created attachment 81919 [details] [review]
> Insert a phantom slice
> 
>  This patch tries to fix the hang issue caused by the video you posted, but I
> am not sure the original issue still exists or not, even if I can't
> reproduce it.

This patch indeed helps. Thank you Haihao, it's greatly appreciated.

(The two cases brought up here are most likely unrelated after all, so sorry for hijacking this thread with a different issue than the original one.)

I'm going to focus on trying to reproduce the original issue with concurrent playback hang on both SNB and IVB - perhaps it was somehow mitigated by some changes outside VA as per comment #26, particularly with SNA. Perhaps Bug #63921 is a better place to continue this topic, with the relevant i915_error_state there.

Nicolas, thank you for all your input! It would be great if we could determine whether the latest changes reliably solve the problem or whether it is just "less crashy".

Comment 32 haihao 2015-11-23 13:51:07 UTC

According to comment #31, Krzysztof's issue was fixed. The original issue is a duplicate of https://bugs.freedesktop.org/show_bug.cgi?id=63921

*** This bug has been marked as a duplicate of bug 63921 ***
