Bug 100091 - Failure to create folder for on-disk shader cache
Summary: Failure to create folder for on-disk shader cache
Status: CLOSED FIXED
Alias: None
Product: Mesa
Classification: Unclassified
Component: Mesa core (show other bugs)
Version: git
Hardware: x86-64 (AMD64) Linux (All)
: medium minor
Assignee: mesa-dev
QA Contact: mesa-dev
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2017-03-07 00:38 UTC by John
Modified: 2017-03-25 00:58 UTC (History)
1 user (show)

See Also:
i915 platform:
i915 features:


Attachments
HACK: disable dynamic, enable symbolic (983 bytes, patch)
2017-03-08 18:23 UTC, Emil Velikov
Details | Splinter Review

Description John 2017-03-07 00:38:35 UTC
Hello,

I am on mesa-git, built from around 1AM this morning.
This has the patches that add the 32/64b folders, but it's still not working in with the game StarCraft II running through wine.

First I remove the ~/.cache/mesa folder, then start the Battle.Net client with wine. This seems to work fine and I see a new 32 folder in ~/.cache/mesa. 
Here's the output of file on that binary:
PE32 executable (GUI) Intel 80386, for MS Windows

Then from the Battle.Net client I start StarCraft II. Per the configuration it starts the 64b client. Then my terminal gets flooded with these:
Failed to create ~/.cache/mesa/64/1488793729_1488791612/AMD TAHITI/89 for shader cache (No such file or directory)---disabling.

The issue is, as the message says, I do not have that folder on disk but:
~/.cache/mesa/64/1488793730_1488791612
notice the 30 instead of 29, the rest matches.
The output of file on the binary that is now in memory (but maybe not the one directly called by the launcher):
PE32+ executable (GUI) x86-64, for MS Windows.

I am not sure what else to provide but would be happy to help with whatever is needed.

Thanks!
John
Comment 1 Timothy Arceri 2017-03-07 01:43:48 UTC
I'm guessing there is no ~/.cache/mesa/64 just ~/.cache/mesa/32?
Comment 2 Timothy Arceri 2017-03-07 01:45:38 UTC
(In reply to Timothy Arceri from comment #1)
> I'm guessing there is no ~/.cache/mesa/64 just ~/.cache/mesa/32?

Oh you said you have ~/.cache/mesa/64/1488793729_1488791612 that is very odd.
Comment 3 John 2017-03-07 05:36:48 UTC
That's right I have both folders.

How do you create the identifier? I find it quite strange that the folder's number on disk is only one away from the one expected. Last week when I reported a similar issue on phoronix, I believe it was similar (but without the 32/64b folders).

Thanks!
Comment 4 Timothy Arceri 2017-03-07 10:05:31 UTC
(In reply to John from comment #3)
> That's right I have both folders.
> 
> How do you create the identifier? I find it quite strange that the folder's
> number on disk is only one away from the one expected. Last week when I
> reported a similar issue on phoronix, I believe it was similar (but without
> the 32/64b folders).
> 
> Thanks!

That id is the time at which the Mesa binary was last modified. Also once the cache is created the path is stored in memory so that makes what you are reporting doubly odd.

Do you have any programming experience? Any chance of you being able to a debug build through gdb? The functions you would want to look at are in disk_cache.c.

I tried to install the trail but I get stuck with missing buttons and text boxes in the Battle.Net app.
Comment 5 Grazvydas Ignotas 2017-03-07 11:12:01 UTC
(In reply to John from comment #0)
> Failed to create ~/.cache/mesa/64/1488793729_1488791612/AMD TAHITI/89 for shader cache (No such file or directory)---disabling.
> ...
> The issue is, as the message says, I do not have that folder on disk but:
> ~/.cache/mesa/64/1488793730_1488791612
> notice the 30 instead of 29, the rest matches.
If you try to run the game again (without updating mesa), do you get the exact same numbers or a new set?
Comment 6 Andy Furniss 2017-03-07 11:15:10 UTC
(In reply to John from comment #0)
> Hello,
> 
> I am on mesa-git, built from around 1AM this morning.
> This has the patches that add the 32/64b folders, but it's still not working
> in with the game StarCraft II running through wine.

Did you restart x so it also used the new mesa?

I saw a few of these late last night, now I normally remember to re-start x but it's possible I forgot.

My test case was using a video player with a shader de-interlace - other settings that (I guess) also use shaders didn't produce the logging.

Anyway after booting up today they are gone.

Trying to reproduce by building newer mesa again today and deliberately not re-starting x doesn't re-produce, but then maybe I would have to put mesa back to where it was commit wise before I updated last night to properly re-produce.

In summary - if you re-boot do the errors go away?
Comment 7 John 2017-03-07 11:48:12 UTC
Oh I see, a timestamp. 


Now I've seen something interesting.
I started B.Net and it created this folder:
~/.cache/mesa/32/1488874151_1488872775/

While launching SCII I saw some similar error message as before but about the 32b folder, so something 32b happens when I click on launching SCII. I then looked again in the cache folder but the ID folder had changed to this:
~/.cache/mesa/32/1488874150_1488872775/

Here's the current 64 one:
1488873224_1488871428/

So it's not like at some point one arch's ID gets overwritten by the other, but somehow the ID changes, and it seems always by just 1 before the underscore.

I am a dev too, though with very little mesa experience :) I'll try to look at that. As for your errors, I haven't had issues with wine-staging 2.1, wine 2.2 and wine-staging 2.2, I think these were more common pre 2. 
Oh also all my cache tests happened with nine. Let me quickly try without it and see if that changes anything.
Comment 8 John 2017-03-07 11:55:01 UTC
(In reply to Grazvydas Ignotas from comment #5)
> If you try to run the game again (without updating mesa), do you get the
> exact same numbers or a new set?
I get the same numbers.

1488874151_1488872775/ when starting BNET
which changes to
1488874150_1488872775/ when starting SCII

(In reply to Andy Furniss from comment #6)
> Did you restart x so it also used the new mesa?
> 
> I saw a few of these late last night, now I normally remember to re-start x
> but it's possible I forgot.
> 
> My test case was using a video player with a shader de-interlace - other
> settings that (I guess) also use shaders didn't produce the logging.
> 
> Anyway after booting up today they are gone.
> 
> Trying to reproduce by building newer mesa again today and deliberately not
> re-starting x doesn't re-produce, but then maybe I would have to put mesa
> back to where it was commit wise before I updated last night to properly
> re-produce.
> 
> In summary - if you re-boot do the errors go away?
I have not, I thought mesa only required the application to be restarted.
I will try that and report.



Actually, I just tried without nine and the id staid the same! So I guess the problem is around that.
Comment 9 Timothy Arceri 2017-03-07 13:00:11 UTC
(In reply to John from comment #8)
> (In reply to Grazvydas Ignotas from comment #5)
> > If you try to run the game again (without updating mesa), do you get the
> > exact same numbers or a new set?
> I get the same numbers.
> 
> 1488874151_1488872775/ when starting BNET
> which changes to
> 1488874150_1488872775/ when starting SCII
> 
> (In reply to Andy Furniss from comment #6)
> > Did you restart x so it also used the new mesa?
> > 
> > I saw a few of these late last night, now I normally remember to re-start x
> > but it's possible I forgot.
> > 
> > My test case was using a video player with a shader de-interlace - other
> > settings that (I guess) also use shaders didn't produce the logging.
> > 
> > Anyway after booting up today they are gone.
> > 
> > Trying to reproduce by building newer mesa again today and deliberately not
> > re-starting x doesn't re-produce, but then maybe I would have to put mesa
> > back to where it was commit wise before I updated last night to properly
> > re-produce.
> > 
> > In summary - if you re-boot do the errors go away?
> I have not, I thought mesa only required the application to be restarted.
> I will try that and report.
> 
> 
> 
> Actually, I just tried without nine and the id staid the same! So I guess
> the problem is around that.

Yeah that makes sense. It's probably grabbing a different time for nine vs opengl because they are separate dynamic libraries I'll take a look at this tomorrow.
Comment 10 John 2017-03-07 13:07:36 UTC
(In reply to Timothy Arceri from comment #9)
> 
> Yeah that makes sense. It's probably grabbing a different time for nine vs
> opengl because they are separate dynamic libraries I'll take a look at this
> tomorrow.

That's what I would expect from the 1s difference. I looked at libGL and d3dadapter9.so but their creation time is 3 minutes different, so these are probably not the 2 binaries used.

Thank you!
Comment 11 Grazvydas Ignotas 2017-03-07 13:20:05 UTC
Try this:
stat -c %Y d3d/d3dadapter9.so dri/radeonsi_dri.so

although I'm not sure if TAHITI uses radeonsi_dri or r600_dri.
Comment 12 John 2017-03-07 13:54:13 UTC
(In reply to Grazvydas Ignotas from comment #11)
> Try this:
> stat -c %Y d3d/d3dadapter9.so dri/radeonsi_dri.so
> 
> although I'm not sure if TAHITI uses radeonsi_dri or r600_dri.

1488874138
1488874151

The 2nd one is correct, but strangely not the first one.
TAHITI is on radeonsi.
Comment 13 John 2017-03-08 13:28:11 UTC
(In reply to John from comment #12)
> (In reply to Grazvydas Ignotas from comment #11)
> > Try this:
> > stat -c %Y d3d/d3dadapter9.so dri/radeonsi_dri.so
> > 
> > although I'm not sure if TAHITI uses radeonsi_dri or r600_dri.
> 
> 1488874138
> 1488874151
> 
> The 2nd one is correct, but strangely not the first one.
> TAHITI is on radeonsi.

Ok the other timestamp is from /usr/lib32/d3d/d3dadapter9.so.1.0.0 not the links to it.

stat -c %Y /usr/lib32/d3d/d3dadapter9.so.1.0.0 /usr/lib32/xorg/modules/dri/radeonsi_dri.so
1488874150
1488874151
Comment 14 Edmondo Tommasina 2017-03-08 16:21:16 UTC
You could try to set the modified time to the so files as a workaround.

Something like this:

touch -m -t 201708080000.00 /usr/lib32/d3d/d3dadapter9.so.1.0.0 /usr/lib32/xorg/modules/dri/radeonsi_dri.so
Comment 15 John 2017-03-08 16:50:07 UTC
(In reply to Edmondo Tommasina from comment #14)
> You could try to set the modified time to the so files as a workaround.
> 
> Something like this:
> 
> touch -m -t 201708080000.00 /usr/lib32/d3d/d3dadapter9.so.1.0.0
> /usr/lib32/xorg/modules/dri/radeonsi_dri.so

That works (well I used -r instead...).

I don't really need a workaround, as the current speed is good enough for me. I'm just trying to help fix out the bug :) 
This does prove what we were all guessing though!
Comment 16 Emil Velikov 2017-03-08 18:23:17 UTC
Created attachment 130129 [details] [review]
HACK: disable dynamic, enable symbolic

Some ideas:
 - a non-timestamp better solution might be applicable/in the works (guessing here)
 - changing timestamps manually is a _very_ bad idea
Please do NOT mention it nor do it. Some poor soul will start copying the commands having zero knowledge of the side effects.
 
 - add a bunch of printfs and establish which timestamp gets picked at which point

 - last but not least
I suspect that the symbols are getting stomped over another. Can you try if the attached patch works ? Note that it will break GL <> VDPAU interop for DRI2, so it's not a real fix.
Comment 17 Michel Dänzer 2017-03-09 00:55:03 UTC
Emil, I don't think there's any mystery anymore given comment 13 — the different cache timestamps are due to *_dri.so and d3dadapter9.so.1.0.0 having different timestamps.

(In reply to Emil Velikov from comment #16)
>  - a non-timestamp better solution might be applicable/in the works
> (guessing here)

If you mean build-id, that won't help for this issue, because different binaries will also have different build-ids.
Comment 18 John 2017-03-09 03:47:44 UTC
(In reply to Emil Velikov from comment #16)
> Some ideas:
>  - a non-timestamp better solution might be applicable/in the works
> (guessing here)

A simple workaround would be to use a combination of the timestamp and the binary's name (or anything else that identifies the binary, the checksum wouldn't work as it'd change with each build), and allow more than one base folder.

So instead of:
~/.cache/mesa/1488874151_1488872775
then overwritten by
~/.cache/mesa/1488874150_1488872775

we'd have:
~/.cache/mesa/radeonsi_dri_1488874151_1488872775
and
~/.cache/mesa/d3dadapter9_1488874150_1488872775

or
~/.cache/mesa/radeonsi_dri/1488874151_1488872775
and
~/.cache/mesa/d3dadapter9/1488874150_1488872775
Comment 19 Emil Velikov 2017-03-10 23:02:19 UTC
(In reply to Michel Dänzer from comment #17)
> Emil, I don't think there's any mystery anymore given comment 13 — the
> different cache timestamps are due to *_dri.so and d3dadapter9.so.1.0.0
> having different timestamps.
> 
IMHO the [file] source of the timestamps is not the bigger mystery, but why we end up with "no such file or directory...---disabling". Afaics that should _never_ happen.

The only thing that comes to mind is that the "wrong" library/symbol gets picked to get the timestamp from.

This may happen due to the dynamic-list we have for the dri module (and the fact that it lacks Bsymbolic). A trivial test is the patch attached.

Tl;Dr; "---disabling." should _never_ kick-in. Drop any hacks and give the patch a try, please ?
Comment 20 Emil Velikov 2017-03-10 23:12:22 UTC
(In reply to John from comment #18)
> (In reply to Emil Velikov from comment #16)
> > Some ideas:
> >  - a non-timestamp better solution might be applicable/in the works
> > (guessing here)
> 
> A simple workaround would be to use a combination of the timestamp and the
> binary's name (or anything else that identifies the binary, the checksum
> wouldn't work as it'd change with each build), and allow more than one base
> folder.
> 
> So instead of:
> ~/.cache/mesa/1488874151_1488872775
> then overwritten by
> ~/.cache/mesa/1488874150_1488872775
> 
> we'd have:
> ~/.cache/mesa/radeonsi_dri_1488874151_1488872775
> and
> ~/.cache/mesa/d3dadapter9_1488874150_1488872775
> 
> or
> ~/.cache/mesa/radeonsi_dri/1488874151_1488872775
> and
> ~/.cache/mesa/d3dadapter9/1488874150_1488872775

I would really appreciate if one can spend 2 minutes and test the patch in question. Pretty please ?

Do move/prune your cache and let me know if you still get the "---disabled" message(s).
Comment 21 John 2017-03-11 02:24:43 UTC
(In reply to Emil Velikov from comment #20)
> I would really appreciate if one can spend 2 minutes and test the patch in
> question. Pretty please ?
> 
> Do move/prune your cache and let me know if you still get the "---disabled"
> message(s).

I still do get these (and the different folders).
Comment 22 John 2017-03-11 08:27:09 UTC
In case it helps, here are the calls for disk_cache_create (always called by r600_disk_cache_create).

First when starting the Battle.Net launcher:
- Called for radeonsi_dri.so
- Called for d3dadapter9.so.1.0.0
- Called for d3dadapter9.so.1.0.0
- Called for d3dadapter9.so.1.0.0
- Called for radeonsi_dri.so

When starting StarCraft II:
- Called for d3dadapter9.so.1.0.0
- Called for d3dadapter9.so.1.0.0
- Called for radeonsi_dri.so

The order seems to stay the same in my various tests.

I've tried all 3 ifs related to path, ie with only MESA_GLSL_CACHE_DIR set, with only XDG_CACHE_HOME set and with no variable set, the behavior is the same.

Without nine the calls are:

First when starting the Battle.Net launcher:
- Called for radeonsi_dri.so
- Called for radeonsi_dri.so

When starting StarCraft II:
- Called for radeonsi_dri.so

I've verified that these were the right libraries in r600_disk_cache_create (since the timestamps are shared I wanted to verify with the filenames.).

And here is a stack trace for each library at that point:
/usr/lib32/d3d/d3dadapter9.so.1(+0x300323) [0x71859323].
/usr/lib32/d3d/d3dadapter9.so.1(+0x30e165) [0x71867165].
/usr/lib32/d3d/d3dadapter9.so.1(+0x3df181) [0x71938181].
/usr/lib32/d3d/d3dadapter9.so.1(+0x274388) [0x717cd388].
/usr/lib32/d3d/d3dadapter9.so.1(+0x43c4c) [0x7159cc4c].
/usr/lib32/d3d/d3dadapter9.so.1(+0x2fdc9) [0x71588dc9].
/usr/bin/../lib32/wine/d3d9-nine.dll.so(present_create_adapter9+0x128) [0x7d581c58].
/usr/bin/../lib32/wine/d3d9-nine.dll.so(+0x1794d) [0x7d57b94d].
/usr/bin/../lib32/wine/d3d9-nine.dll.so(d3dadapter9_new+0x7c) [0x7d57c18c].
/usr/bin/../lib32/wine/d3d9-nine.dll.so(Direct3DCreate9+0x5c) [0x7d579fac].

and

/usr/lib32/xorg/modules/dri/radeonsi_dri.so(+0x65d793) [0x6ae86793].
/usr/lib32/xorg/modules/dri/radeonsi_dri.so(+0x675365) [0x6ae9e365].
/usr/lib32/xorg/modules/dri/radeonsi_dri.so(+0x6a3081) [0x6aecc081].
/usr/lib32/xorg/modules/dri/radeonsi_dri.so(radeon_drm_winsys_create+0x8e8) [0x6add6718].
/usr/lib32/xorg/modules/dri/radeonsi_dri.so(+0x4e264b) [0x6ad0b64b].
/usr/lib32/xorg/modules/dri/radeonsi_dri.so(+0x60b656) [0x6ae34656].
/usr/lib32/xorg/modules/dri/radeonsi_dri.so(+0x4ec550) [0x6ad15550].
/usr/lib32/xorg/modules/dri/radeonsi_dri.so(+0x4edc20) [0x6ad16c20].
/usr/lib32/libGL.so.1(+0x2713c) [0x7d33613c].
/usr/lib32/libGL.so.1(+0x49678) [0x7d358678].

Should libGL really be loaded when using nine?


I hope something is useful in there.

Thanks,
John
Comment 23 Michel Dänzer 2017-03-11 10:35:39 UTC
(In reply to Emil Velikov from comment #19)
> IMHO the [file] source of the timestamps is not the bigger mystery, but why
> we end up with "no such file or directory...---disabling".

Because the shader cache code deletes all but one timestamp tuple directory.
Comment 24 Emil Velikov 2017-03-11 13:01:34 UTC
(In reply to Michel Dänzer from comment #23)
> (In reply to Emil Velikov from comment #19)
> > IMHO the [file] source of the timestamps is not the bigger mystery, but why
> > we end up with "no such file or directory...---disabling".
> 
> Because the shader cache code deletes all but one timestamp tuple directory.

Right, I must have misread 512c046eddc16bec9c309a3c7b4205d94ffc4ef3. Was under the impression that 'old' versions are purged only as we hit the size limit. Maybe it's worth exploring, maybe not.

Another alternative is to "disable" disk caching all together for anyone but the dri module(s).
Comment 25 Timothy Arceri 2017-03-12 06:06:00 UTC
After a bunch of discussion it's been decided to move all cache entries to a single cache dir and instead use the timestamp (in future maybe build-id hash) and gpu id as part of the cache entries sha input.

We will remove old entries as the cache approaches its max size. This solution should solve the issues for nine. It makes things a little harder for collection/distribution of precompilled shaders but not unsolvable. Thanks to everyone for there input, I'll try to get some patches this coming week :)
Comment 26 John 2017-03-20 07:45:00 UTC
Timothy, do you have any patch you'd like me to test?

Thanks!
John
Comment 27 Timothy Arceri 2017-03-21 05:24:07 UTC
https://patchwork.freedesktop.org/series/21582/(In reply to John from comment #26)
> Timothy, do you have any patch you'd like me to test?
> 
> Thanks!
> John

This should do it:

https://patchwork.freedesktop.org/series/21582/
Comment 28 John 2017-03-21 07:04:58 UTC
That seems to work indeed!

Thank you!
Comment 29 Timothy Arceri 2017-03-24 00:25:01 UTC
Should be fixed by:

commit feb716239e1a318eb844fd3bcaca1ffd6903067c
Author: Grazvydas Ignotas <notasas@gmail.com>
Date:   Thu Mar 16 01:09:29 2017 +0200

    util/disk_cache: hash timestamps into the cache keys
    
    Instead of using a directory, hash the timestamps into the cache keys
    themselves. Since there is no more timestamp directory, there is no more
    need for deleting the cache of other mesa versions and we rely on
    eviction to clean up the old cache entries. This solves the problem of
    using several incarnations of disk_cache at the same time, where one
    deletes a directory belonging to the other, like when both OpenGL and
    gallium nine are used simultaneously (or several different mesa
    installations).
    
    v2: using additional blob instead of trying to clone sha1 state
    
    v3: (Timothy Arceri) don't use an opaque data type to store
        timestamp.
    
    V4: (Timothy Arceri) use blob to store driver keys just make sure
        to store null terminator for strings, and make sure blob is
        defined by disk_cache and not it's users.
    
    Signed-off-by: Grazvydas Ignotas <notasas@gmail.com>
    Reviewed-by: Nicolai Hähnle <nicolai.haehnle@amd.com>
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=100091
Comment 30 John 2017-03-25 00:58:56 UTC
That seems to be working fine.

Thanks to Timothy, Grazvydas and all others involved on fixing this!


Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.