100073 – Shader Disk Cache 32/64 bit detection has a flaw. Missed existence of x32 ABI

Bug 100073 - Shader Disk Cache 32/64 bit detection has a flaw. Missed existence of x32 ABI

Summary: Shader Disk Cache 32/64 bit detection has a flaw. Missed existence of x32 ABI

Status:	RESOLVED FIXED

Alias:	None

Product:	Mesa
Classification:	Unclassified
Component:	Mesa core (show other bugs)
Version:	git
Hardware:	All All

Importance:	medium minor
Assignee:	mesa-dev
QA Contact:	mesa-dev

URL:
Whiteboard:
Keywords:

Depends on:
Blocks:

Reported:	2017-03-05 23:56 UTC by oiaohm
Modified:	2017-03-14 03:56 UTC (History)
CC List:	0 users

See Also:
i915 platform:
i915 features:

Attachments

Description oiaohm 2017-03-05 23:56:04 UTC

https://cgit.freedesktop.org/mesa/mesa/commit/?id=11f0efec2e615f5233defdd8ca9693c54ea49b1f

This is the commit with the problem.


https://en.wikipedia.org/wiki/X32_ABI

This explains X32_ABI is.   I don't know if anything else like this is around where (void *) will return 4 and there will be a different register count and different functionality.   This will be important for software rendering pre-building shades in cases where X32_ABI is being used to get the most performance.    Yes Linux you can run a 32 bit and a x32 and a 64 bit at the same time. 

As far as I know Linux is the only thing with this third mode.

I don't know what an acceptable test would be to split X32 and 32bit as they need to be.   This is a platform particular thing.


I don't know if there are any other OS modes with different register counts and functionality that will return the same size pointer with.

get_arch_bitness_str as function maybe totally wrong.   
https://wiki.debian.org/X32Port
In fact I am not sure if this code for bitness is right at all.

Normal convention is that 32 bit code calls 32 bit libraries.  64 bit code calls 64 bit libraries and x32 code calls x32 libraries.
So you short out this bitness stuff with complier preprocessor flags at build time no runtime code.   As what is recommend on the Debian site.

The current code is depending on complier solving out what may not happen in all cases so result in a pointless waste of cpu time.

So I have this horible feeling the complete patch is wrong because it disobeying conventions.   This is why I set this to All I don't believe this patch is right for any platform in it current form as it depending on the complier to transform it out by optimisation.

Yes writing it using preprocessor macros for all the platforms is a pain but that will always solve out and would be good in a header shared.

I know a lot resort to runtime testing but with items like X32 this is not wise.

Comment 1 Ilia Mirkin 2017-03-05 23:58:39 UTC

Can you be specific as to what the problem is? Which situation, precisely, is mishandled?

Comment 2 Darren Salt 2017-03-06 01:05:15 UTC

This code is intended to be generic; it SHOULD work for other pairs such as armel and aarch64 (though breakage will happen should you have more than one 32-bit ARM ABI in use on one system). As such, as it's written, it should not be testing __i386__, __x86_64__ or similar macros, but it does need to test __ILP32__.

(Aside: x32 code is 64-bit.)

Without optimisation, yes, the testing is effectively at run-time; however, the values are constant. So in that sense, it's a small amount of bloat and unreachable code.

FWIW, my implementation (which uses the ABI name from the architecture triplet) is here:
  https://patchwork.freedesktop.org/patch/141891/

Comment 3 Timothy Arceri 2017-03-06 01:19:24 UTC

(In reply to Darren Salt from comment #2)
> This code is intended to be generic; it SHOULD work for other pairs such as
> armel and aarch64 (though breakage will happen should you have more than one
> 32-bit ARM ABI in use on one system).

It's not really breakage, its just that the 32-bit cache would get deleted should you be switching between different Mesa builds.

> As such, as it's written, it should
> not be testing __i386__, __x86_64__ or similar macros, but it does need to
> test __ILP32__.
> 
> (Aside: x32 code is 64-bit.)
> 
> Without optimisation, yes, the testing is effectively at run-time; however,
> the values are constant. So in that sense, it's a small amount of bloat and
> unreachable code.
> 

Yeah and it's tested once at screen creation so it's not even worth worrying about.

Comment 4 Jan Vesely 2017-03-06 13:52:51 UTC

there will always some configuration that is overlooked.

wouldn't it be easier if the subdirectories were based on timestamp itself?
that gives you clean slate for new builds, evicting the oldest will cleanup old builds, and you can safely switch between two builds for the same arch.

Comment 5 oiaohm 2017-03-07 23:50:38 UTC

https://wiki.debian.org/Multiarch

Sorry to say this is not minor the Darren Salt triple patch reminded me exactly why you cannot split by 32 and 64 bit and store in the users created directories.

The reality is under Debian you can be used qemu usermode to have a program every Architecture supported by Debian running at the exactly the same time using the exactly the same user accessible directories.

Timothy Arceri
> It's not really breakage, its just that the 32-bit cache would get deleted should you be switching between different Mesa builds.

Yes this is breakage because this can be happening every context switch between processes on a Multiarch installed Debian due to changing architecture and different architectures been different levels of library versions.   So fairly much resulting in the cache being worthless.

Yes x32 ABI under Linux will be running a different version Mesa to what i386 ABI will be running.   Because both have void * pointers 4 bytes in size calling them 32 bit is wrong.  x32 is closer to 64 bit and uses a different library.

The triplet patch will work in all these cases.   Also it does not result in junk code.

> Yeah and it's tested once at screen creation so it's not even worth worrying about.

This would be fine other than the fact more often than not these things are are done only once and are not done properly become points in the code base of incompatibilities on platform change.  Also extra areas where the complier could attempt to-do a smart modification and get it wrong.   Compliers optimisers will hunt out dead code and attempt to remove it someone incorrectly.  Best option is never put in any code that will not be on a possible code run path in the final binary.  If you do include dead code path code expect at some point a complier to-do something stupid to you.   Yes its better to comment or macro out code than leave it to the complier to think.

Each of these cases of dead code removal errors can be like a random shot gun effect across the final binary.


Jan Vesely
>wouldn't it be easier if the subdirectories were based on timestamp itself?

https://reproducible-builds.org/docs/timestamps/

Sorry the idea of using timestamps is really totally invalid.   It is possible to have two or more different builds with exactly the same timestamp coming more common as systems are getting more set-up for reproducible builds are are basically faking time to the build to get consistency of produced binary.

So you would need to generate something off of the complier flags, complier and source version.

Comment 6 Ilia Mirkin 2017-03-08 00:21:33 UTC

Before I mark this as INVALID...

Please outline a particular usage scenario that would suffer from this. There's no need to educate anyone about architectures, ABIs, library layout, qemu, virtfs, NFS, etc. We all know about it. Perhaps even better than you.

What's missing here is a specific usage pattern that will suffer from Mesa's policy, and a description of what the suffering will be. Depending on the reasonableness of said pattern, and the harm incurred by the user, we can make various adjustments to mesa.

Comment 7 oiaohm 2017-03-08 01:45:59 UTC

<b>Please outline a particular usage scenario that would suffer from this.</b>

I do run some x32 ABI programs so I have syscall.x32=y set in boot-loader under Debian. These are build x32 ABI because they are opengl and what they are doing can take days to complete. So small percentage of difference is important. The fact the current patch calls x32 and i386 the same is wrong. When my mesa updates I am going to have cache failure that was why cache feature was disabled in the first place.

Yes debian ships with x32 compliers the kernel is x32 enabled just turned off by default. For us who have turned it on the defect in Mesa not detect x32 is going to cause us head aches.

https://wiki.debian.org/QemuUserEmulation

Next my system is debian multi-arch as started with Debian normal qemu user emulation set-up. My system is i386 amd64 with x32 enabled and with arm64 installed as well by multiarch.

https://cgit.freedesktop.org/mesa/mesa/commit/?id=11f0efec2e615f5233defdd8ca9693c54ea49b1f
The reality here is the patch I have pointed to does not fix the bug. It fix the bug in 1 case only.

The cases where you are running x32 it does not support. The cases where you are running multiarch the the patch I pointed to does not support. So its basically broken. Does a case of a person using x32 or multiarch behave the same with it constantly nuking cache the answer is yes. So the patch is nothing more than a part fix.

https://patchwork.freedesktop.org/patch/141891/
Where this patch by Darren Salt is a full fix for more usage cases than my 2 cases.

In fact a person testing multi versions of mesa could put the version difference into the triplet so avoid when running two versions of mesa at the same time on the same on the arch them stopping all over each other cache.

So Darren Salt patch fixes my 2 use cases and 3 case.

I had not considered Jan Vesely case of running like 2 64 bit version of mesa at the same time in different programs and the include patch did not allow for that instead that would make mesa return to nuke the cache all the time why disk_cache was disabled by default.

Darren Salt patch might not be absolutely perfect for Jan Vesely case but a person with Jan Vesely case has a workaround with it that they don't have with the existing.

Just like mine the Jan Vesely issue exist because
https://cgit.freedesktop.org/mesa/mesa/commit/?id=11f0efec2e615f5233defdd8ca9693c54ea49b1f
is badly written code disobeying conventions and not covering enough usage cases.

What because I did not list out a usage case straight up I should not bother reporting bugs? Where the problem is breach of convention that will be causing more than 1 problem.

Almost everywhere in a code base you find if( sizeof(void *) == 4 ) being used to work out bit width of a system is wrong.

The reason why I pushed it up to critical instead of minor is if you change the directory once mesa with disc caching is production deployed you may leave behind a directory that never gets deleted. So the cache storage directory is something we need to right now. In fact I would say that this deserves to be pushed to blocker until fixed. This is a mistake that should not have entered the code base in the first place as its a patch is a hack not well considered code.

Comment 8 Jan Vesely 2017-03-08 01:49:41 UTC

(In reply to oiaohm from comment #5)

> 
> Jan Vesely
> >wouldn't it be easier if the subdirectories were based on timestamp itself?
> 
> https://reproducible-builds.org/docs/timestamps/
> 
> Sorry the idea of using timestamps is really totally invalid.   It is
> possible to have two or more different builds with exactly the same
> timestamp coming more common as systems are getting more set-up for
> reproducible builds are are basically faking time to the build to get
> consistency of produced binary.

If the pointer size is the same and the timestamp is the same, the cache will be reused, not deleted, and the problem does not exist.

Comment 9 oiaohm 2017-03-08 02:16:14 UTC

Jan Vesely
<b>If the pointer size is the same and the timestamp is the same, the cache will be reused, not deleted, and the problem does not exist.</b>

I have a little problem. I have x32 and i386.   So pointer size is the same timestamp and contents of cache should be different.   This will happen if I am running a reproducible build with frozen time.   This can also happen if you make to 64 library builds with frozen time because the two builds will have the same timestamps even if they are completely different versions of mesa incompatible disc cache requirements.

https://reproducible-builds.org/docs/timestamps/

The reason I provided this link is time is no longer trust-able as a unique value.  Do note the usage of libfaketime its really simple to forget you still have that loaded after doing a reproducible build so the clock is not in fact ticking forwards for anything you are doing you wake up when you finally do something that is dependant on the clock in fact ticking.   So you need to look at other sources to get unique value.

Comment 10 Ilia Mirkin 2017-03-08 02:34:23 UTC

There are lots of words, with lots of mentions of how Debian is awesome, but still no usage scenario. Let me help you. A usage scenario looks like:

"""
Step 1: run X
Step 2: run Y
Step 3: run Z

Note how when you run Z, things don't work or otherwise suck.
"""

Everything you're talking about is generic hypotheticals with various assumptions built in.. I want an actual use-case.

Comment 11 Timothy Arceri 2017-03-08 02:57:36 UTC

(In reply to oiaohm from comment #9)
> Jan Vesely
> <b>If the pointer size is the same and the timestamp is the same, the cache
> will be reused, not deleted, and the problem does not exist.</b>
> 
> I have a little problem. I have x32 and i386.   So pointer size is the same
> timestamp and contents of cache should be different.   This will happen if I
> am running a reproducible build with frozen time.   This can also happen if
> you make to 64 library builds with frozen time because the two builds will
> have the same timestamps even if they are completely different versions of
> mesa incompatible disc cache requirements.
> 
> https://reproducible-builds.org/docs/timestamps/
> 
> The reason I provided this link is time is no longer trust-able as a unique
> value.  Do note the usage of libfaketime its really simple to forget you
> still have that loaded after doing a reproducible build so the clock is not
> in fact ticking forwards for anything you are doing you wake up when you
> finally do something that is dependant on the clock in fact ticking.   So
> you need to look at other sources to get unique value.

x32 should end up in the ilp-32 directory. That's what the __ILP32__ is meant to do. Even if it didn't and the cache wasn't deleted because all your arch builds had the same timestamp it still wouldn't matter because the timestamp is used to identify the Mesa version not the arch, the arch doesn't matter in terms of the cache it will work just fine. The only reason for the different dirs is because we have no way of knowing if the different timestamps are the same Mesa version

Comment 12 oiaohm 2017-03-09 21:45:31 UTC

(In reply to Timothy Arceri from comment #11)
> (In reply to oiaohm from comment #9)
> > Jan Vesely
> > <b>If the pointer size is the same and the timestamp is the same, the cache
> > will be reused, not deleted, and the problem does not exist.</b>
> > 
> > I have a little problem. I have x32 and i386.   So pointer size is the same
> > timestamp and contents of cache should be different.   This will happen if I
> > am running a reproducible build with frozen time.   This can also happen if
> > you make to 64 library builds with frozen time because the two builds will
> > have the same timestamps even if they are completely different versions of
> > mesa incompatible disc cache requirements.
> > 
> > https://reproducible-builds.org/docs/timestamps/
> > 
> > The reason I provided this link is time is no longer trust-able as a unique
> > value.  Do note the usage of libfaketime its really simple to forget you
> > still have that loaded after doing a reproducible build so the clock is not
> > in fact ticking forwards for anything you are doing you wake up when you
> > finally do something that is dependant on the clock in fact ticking.   So
> > you need to look at other sources to get unique value.
> 
> x32 should end up in the ilp-32 directory. That's what the __ILP32__ is
> meant to do. Even if it didn't and the cache wasn't deleted because all your
> arch builds had the same timestamp it still wouldn't matter because the
> timestamp is used to identify the Mesa version not the arch, the arch
> doesn't matter in terms of the cache it will work just fine. The only reason
> for the different dirs is because we have no way of knowing if the different
> timestamps are the same Mesa version

1 if you are using timestamp it ID mesa version just use mesa version.

Next do read what x32 API is 

https://en.wikipedia.org/wiki/X32_ABI
There is a different count in registers between i386 and x32.  So yes they both report as ilp-32 but if you are running a software rendering engine how the how the shader code has to be optimised is totally different to take advantage of x32 mode.

Comment 13 oiaohm 2017-03-09 22:26:37 UTC

(In reply to Ilia Mirkin from comment #10)
> There are lots of words, with lots of mentions of how Debian is awesome, but
> still no usage scenario. Let me help you. A usage scenario looks like:
> 
> """
> Step 1: run X
> Step 2: run Y
> Step 3: run Z
> 
> Note how when you run Z, things don't work or otherwise suck.
> """
> 
> Everything you're talking about is generic hypotheticals with various
> assumptions built in.. I want an actual use-case.

This is a set of assumptions.  Why do I have to 100 percent wait for real world error.

1) you set your distribution to multiarch.
2) you install programs from different 32 and 64 bit archs as I have now.
This can result in like mesa-utils that contains glxgears being installed for arm64 and having blender installed for amd64


apt-get install mesa-utils:x32  kicad:i386 blender:amd64  gltron:arm64

Yes you can install from different arch using a single command line.

Now you try to run glgears, kicad, blender and gltron. 

You have 4 versions of mesa on the system.   So the first fix handle the kicad to blender that works now.   But gltron:arm64 or anything else I have built for arm64 is going to have trouble.   So I can build binary on x86 PC that is fast test it on x86 PC without needing extra vm stuff.

Yes building binary normally still involves a container/chroot linking container/chroot to x11 is a pain and pointless when you are on debian.

The kicad one installed as i386 can be at times that amd64 does not work correctly when using particular extensions.  gltron would be to test that arm64 it work. 

I don't know what architectures I will have set multi in the future it will be linked to what boards I am dealing with.

The stuff I have in x32 is mostly internal that you don't have permission to have access to. 

Ilia Mirkin the reality what you have been calling hypothetical will break my work-flows.  This is why I opened the bug. 

All I can guess is you are not interested in solving problem for my workflow so I should not bother opening bugs in future.

Comment 14 Ilia Mirkin 2017-03-09 22:30:51 UTC

(In reply to oiaohm from comment #13)
> All I can guess is you are not interested in solving problem for my workflow
> so I should not bother opening bugs in future.

I have been trying to get you to explain your workflow, which you hadn't up until now.

To summarize, your workflow is to run applications using multiple architectures (and thus multiple library binaries) while having $HOME set to the same directory.

Is that an accurate assessment? Is there more to it?

Comment 15 Timothy Arceri 2017-03-09 23:14:13 UTC

(In reply to oiaohm from comment #12)
> (In reply to Timothy Arceri from comment #11)
> > (In reply to oiaohm from comment #9)
> > > Jan Vesely
> > > <b>If the pointer size is the same and the timestamp is the same, the cache
> > > will be reused, not deleted, and the problem does not exist.</b>
> > > 
> > > I have a little problem. I have x32 and i386.   So pointer size is the same
> > > timestamp and contents of cache should be different.   This will happen if I
> > > am running a reproducible build with frozen time.   This can also happen if
> > > you make to 64 library builds with frozen time because the two builds will
> > > have the same timestamps even if they are completely different versions of
> > > mesa incompatible disc cache requirements.
> > > 
> > > https://reproducible-builds.org/docs/timestamps/
> > > 
> > > The reason I provided this link is time is no longer trust-able as a unique
> > > value.  Do note the usage of libfaketime its really simple to forget you
> > > still have that loaded after doing a reproducible build so the clock is not
> > > in fact ticking forwards for anything you are doing you wake up when you
> > > finally do something that is dependant on the clock in fact ticking.   So
> > > you need to look at other sources to get unique value.
> > 
> > x32 should end up in the ilp-32 directory. That's what the __ILP32__ is
> > meant to do. Even if it didn't and the cache wasn't deleted because all your
> > arch builds had the same timestamp it still wouldn't matter because the
> > timestamp is used to identify the Mesa version not the arch, the arch
> > doesn't matter in terms of the cache it will work just fine. The only reason
> > for the different dirs is because we have no way of knowing if the different
> > timestamps are the same Mesa version
> 
> 1 if you are using timestamp it ID mesa version just use mesa version.

We can't, distros patch Mesa we can't depend on the version numbers being an accurate representation of the source that was use to compile a Mesa binaray. There is an option to do a sha of the binaries but that is a little more complex for opengl than it is for vulkan (where it is currently used by Intel).

> 
> Next do read what x32 API is 
> 
> https://en.wikipedia.org/wiki/X32_ABI
> There is a different count in registers between i386 and x32.  So yes they
> both report as ilp-32

gcc only sets the __ILP32__ flag for x32 not i386. Clang may differ I'm not sure, but it's not a major concern for me right now.

> but if you are running a software rendering engine how
> the how the shader code has to be optimised is totally different to take
> advantage of x32 mode.

How about you go read about gpu architectures. Shader code is for the gpu, x32
is for the cpu. There is no shader code optimisations to take advantage of x32 mode.

Comment 16 oiaohm 2017-03-09 23:18:26 UTC

(In reply to Ilia Mirkin from comment #14)
> (In reply to oiaohm from comment #13)
> > All I can guess is you are not interested in solving problem for my workflow
> > so I should not bother opening bugs in future.
> 
> I have been trying to get you to explain your workflow, which you hadn't up
> until now.
> 
> To summarize, your workflow is to run applications using multiple
> architectures (and thus multiple library binaries) while having $HOME set to
> the same directory.
> 
> Is that an accurate assessment? Is there more to it?

That is close enough.  Exact would require the following extra
Multi architectures using qemu usermode registered in binfmt_misc so they binary all straight up run.   So its 1 system and operating system running applications from multiple architectures.   

But these bits really don't change the problem most likely can be ignored. 

Yes they all use 1 home directory someone NFS sharing a home directory between computer of different architectures could also run into a problem with just splitting 32 and 64 bit instead of using a triplet.

Darren Salt patch using triplet
https://patchwork.freedesktop.org/patch/141891/
Will make my workflow work and others with shared home directories with multi-architectures by other means.  I don't see how this would break workflows what the patch I class as defective fixed other than leaving behind not used cache directories.

Of course there is still those who want to custom build own instances of Mesa who I think with the triplet option could be told to use a custom triplet.  If they want to stick date/time/version in the triplet they can.

I have not considered the case of Multi different distributions sharing the same home directory but that is broken most of the time anyhow this would most likely require using version tag or something like that.   Debian same version different architecture sharing home directory has been fairly safe.

If someone wants Multi different distribution support using a single home directory I think that should be a bug in it own right.

Comment 17 oiaohm 2017-03-09 23:34:57 UTC

(In reply to Timothy Arceri from comment #15)
> (In reply to oiaohm from comment #12)
> > (In reply to Timothy Arceri from comment #11)
> > > (In reply to oiaohm from comment #9)
> > > > Jan Vesely
> > > > <b>If the pointer size is the same and the timestamp is the same, the cache
> > > > will be reused, not deleted, and the problem does not exist.</b>
> > > > 
> > > > I have a little problem. I have x32 and i386.   So pointer size is the same
> > > > timestamp and contents of cache should be different.   This will happen if I
> > > > am running a reproducible build with frozen time.   This can also happen if
> > > > you make to 64 library builds with frozen time because the two builds will
> > > > have the same timestamps even if they are completely different versions of
> > > > mesa incompatible disc cache requirements.
> > > > 
> > > > https://reproducible-builds.org/docs/timestamps/
> > > > 
> > > > The reason I provided this link is time is no longer trust-able as a unique
> > > > value.  Do note the usage of libfaketime its really simple to forget you
> > > > still have that loaded after doing a reproducible build so the clock is not
> > > > in fact ticking forwards for anything you are doing you wake up when you
> > > > finally do something that is dependant on the clock in fact ticking.   So
> > > > you need to look at other sources to get unique value.
> > > 
> > > x32 should end up in the ilp-32 directory. That's what the __ILP32__ is
> > > meant to do. Even if it didn't and the cache wasn't deleted because all your
> > > arch builds had the same timestamp it still wouldn't matter because the
> > > timestamp is used to identify the Mesa version not the arch, the arch
> > > doesn't matter in terms of the cache it will work just fine. The only reason
> > > for the different dirs is because we have no way of knowing if the different
> > > timestamps are the same Mesa version
> > 
> > 1 if you are using timestamp it ID mesa version just use mesa version.
> 
> We can't, distros patch Mesa we can't depend on the version numbers being an
> accurate representation of the source that was use to compile a Mesa
> binaray. There is an option to do a sha of the binaries but that is a little
> more complex for opengl than it is for vulkan (where it is currently used by
> Intel).

Distribution will block using timestamp.  So timestamp cannot be used to perform task.

Hashing to support your usage case might be the only way.
> 
> > 
> > Next do read what x32 API is 
> > 
> > https://en.wikipedia.org/wiki/X32_ABI
> > There is a different count in registers between i386 and x32.  So yes they
> > both report as ilp-32
> 
> gcc only sets the __ILP32__ flag for x32 not i386. Clang may differ I'm not
> sure, but it's not a major concern for me right now.
> 
http://stackoverflow.com/questions/34011718/is-ilp32-and-i386-a-valid-configuration
Yes Clang sets __ILP32__ on i386 build modes.

__ILP32__ is a valid description for i386.   Should be presumed as being possible to be set in i386.

In fact gcc not having __ILP32__ set for i386 could be called a bug.  Depending on __ILP32__ to split i386 and x32 mode is depending on complier behaviour not specification defined behaviour.

> > but if you are running a software rendering engine how
> > the how the shader code has to be optimised is totally different to take
> > advantage of x32 mode.
> 
> How about you go read about gpu architectures. Shader code is for the gpu,
> x32
> is for the cpu. There is no shader code optimisations to take advantage of
> x32 mode.

In upstream currently there are none at the moment for software rendering you are right.  In prototype hack patches to get more performance out of software rendering in x32 mode there is.   As I said to take advantage of x32 mode it need to be different.  Not that upstream has it different now.  To make it possible for in future x32 patches to go upstream it need to be kept split at the cache.

Comment 18 Ilia Mirkin 2017-03-09 23:42:57 UTC

Since the different arch's libraries will almost invariably be different, this is comparable to flipping back and forth between versions.

I expect it to be a fairly common developer scenario to have a system mesa install and a "I'm debugging this version" mesa install. It'd be nice to handle these in some moderately reasonable way.

Perhaps we should keep N directories around, and check the mtime (or whatever) of the directory before deleting it?

Comment 19 Timothy Arceri 2017-03-10 00:27:48 UTC

(In reply to oiaohm from comment #17)
> (In reply to Timothy Arceri from comment #15)
> > (In reply to oiaohm from comment #12)
> > > (In reply to Timothy Arceri from comment #11)
> > > > (In reply to oiaohm from comment #9)
> > > > > Jan Vesely
> > > > > <b>If the pointer size is the same and the timestamp is the same, the cache
> > > > > will be reused, not deleted, and the problem does not exist.</b>
> > > > > 
> > > > > I have a little problem. I have x32 and i386.   So pointer size is the same
> > > > > timestamp and contents of cache should be different.   This will happen if I
> > > > > am running a reproducible build with frozen time.   This can also happen if
> > > > > you make to 64 library builds with frozen time because the two builds will
> > > > > have the same timestamps even if they are completely different versions of
> > > > > mesa incompatible disc cache requirements.
> > > > > 
> > > > > https://reproducible-builds.org/docs/timestamps/
> > > > > 
> > > > > The reason I provided this link is time is no longer trust-able as a unique
> > > > > value.  Do note the usage of libfaketime its really simple to forget you
> > > > > still have that loaded after doing a reproducible build so the clock is not
> > > > > in fact ticking forwards for anything you are doing you wake up when you
> > > > > finally do something that is dependant on the clock in fact ticking.   So
> > > > > you need to look at other sources to get unique value.
> > > > 
> > > > x32 should end up in the ilp-32 directory. That's what the __ILP32__ is
> > > > meant to do. Even if it didn't and the cache wasn't deleted because all your
> > > > arch builds had the same timestamp it still wouldn't matter because the
> > > > timestamp is used to identify the Mesa version not the arch, the arch
> > > > doesn't matter in terms of the cache it will work just fine. The only reason
> > > > for the different dirs is because we have no way of knowing if the different
> > > > timestamps are the same Mesa version
> > > 
> > > 1 if you are using timestamp it ID mesa version just use mesa version.
> > 
> > We can't, distros patch Mesa we can't depend on the version numbers being an
> > accurate representation of the source that was use to compile a Mesa
> > binaray. There is an option to do a sha of the binaries but that is a little
> > more complex for opengl than it is for vulkan (where it is currently used by
> > Intel).
> 
> Distribution will block using timestamp.  So timestamp cannot be used to
> perform task.

I'm pretty sure that's what Vulkan drivers were already doing in the previous release. If distros are going to disable features I can't stop that.

> 
> Hashing to support your usage case might be the only way.

Happy for patches to implement this.

> > 
> > > 
> > > Next do read what x32 API is 
> > > 
> > > https://en.wikipedia.org/wiki/X32_ABI
> > > There is a different count in registers between i386 and x32.  So yes they
> > > both report as ilp-32
> > 
> > gcc only sets the __ILP32__ flag for x32 not i386. Clang may differ I'm not
> > sure, but it's not a major concern for me right now.
> > 
> http://stackoverflow.com/questions/34011718/is-ilp32-and-i386-a-valid-
> configuration
> Yes Clang sets __ILP32__ on i386 build modes.
> 
> __ILP32__ is a valid description for i386.   Should be presumed as being
> possible to be set in i386.
> 
> In fact gcc not having __ILP32__ set for i386 could be called a bug. 
> Depending on __ILP32__ to split i386 and x32 mode is depending on complier
> behaviour not specification defined behaviour.

We want to avoid hooks into the build system. If you have a better idea how to do this than is currently done please submit a patch for review.

> 
> > > but if you are running a software rendering engine how
> > > the how the shader code has to be optimised is totally different to take
> > > advantage of x32 mode.
> > 
> > How about you go read about gpu architectures. Shader code is for the gpu,
> > x32
> > is for the cpu. There is no shader code optimisations to take advantage of
> > x32 mode.
> 
> In upstream currently there are none at the moment for software rendering
> you are right.  In prototype hack patches to get more performance out of
> software rendering in x32 mode there is.   As I said to take advantage of
> x32 mode it need to be different.  Not that upstream has it different now. 
> To make it possible for in future x32 patches to go upstream it need to be
> kept split at the cache.

There is no shader cache for software renderers so this, or any other issues to do with caching for software renders in not a current concern. It would be more likely that we would use the gpu_id as a way to distinguish between implementations anyway.

Comment 20 Emil Velikov 2017-03-10 00:36:09 UTC

oiaohm@gmail.com let me put a couple of suggestions:

 - try to keep your answers brief
 - try to keep your analysis distro agnostic
 - keep Wikipedia references to a minimum - it'll get you an F in University ;-)
 - carefully check the code before calling things out (re: the reproducible builds) 

But above all - a patch says a thousand words, when sending out consider a) previous review feedback and b) maintainability.

As a final note: shader cache is not enabled on anything but r600/radeonsi atm, so the current code simply _cannot_ break your work-flows (workflow really).

Comment 21 oiaohm 2017-03-10 05:51:18 UTC

(In reply to Emil Velikov from comment #20)
>
> As a final note: shader cache is not enabled on anything but r600/radeonsi
> atm, so the current code simply _cannot_ break your work-flows (workflow
> really).

Guess what graphic card I have.   I have a AMD graphics card so I can do Multi architecture.   You have to use open source drivers because closed source drivers is not option because you need to have the opengl libraries in the same same architecture as the application.   So yes this direct hits my workflow.

This is why I am so upset.   As soon as this comes though I have possible trouble with miss matched mesa versions crossing with each other in a cache directory and being hard to debug.

Comment 22 oiaohm 2017-03-10 06:20:23 UTC

(In reply to Timothy Arceri from comment #19)
> (In reply to oiaohm from comment #17)
> > (In reply to Timothy Arceri from comment #15)
> > > (In reply to oiaohm from comment #12)
> > > > (In reply to Timothy Arceri from comment #11)
> > > > > (In reply to oiaohm from comment #9)
> > > > > > Jan Vesely
> > > > > > <b>If the pointer size is the same and the timestamp is the same, the cache
> > > > > > will be reused, not deleted, and the problem does not exist.</b>
> > > > > > 
> > > > > > I have a little problem. I have x32 and i386.   So pointer size is the same
> > > > > > timestamp and contents of cache should be different.   This will happen if I
> > > > > > am running a reproducible build with frozen time.   This can also happen if
> > > > > > you make to 64 library builds with frozen time because the two builds will
> > > > > > have the same timestamps even if they are completely different versions of
> > > > > > mesa incompatible disc cache requirements.
> > > > > > 
> > > > > > https://reproducible-builds.org/docs/timestamps/
> > > > > > 
> > > > > > The reason I provided this link is time is no longer trust-able as a unique
> > > > > > value.  Do note the usage of libfaketime its really simple to forget you
> > > > > > still have that loaded after doing a reproducible build so the clock is not
> > > > > > in fact ticking forwards for anything you are doing you wake up when you
> > > > > > finally do something that is dependant on the clock in fact ticking.   So
> > > > > > you need to look at other sources to get unique value.
> > > > > 
> > > > > x32 should end up in the ilp-32 directory. That's what the __ILP32__ is
> > > > > meant to do. Even if it didn't and the cache wasn't deleted because all your
> > > > > arch builds had the same timestamp it still wouldn't matter because the
> > > > > timestamp is used to identify the Mesa version not the arch, the arch
> > > > > doesn't matter in terms of the cache it will work just fine. The only reason
> > > > > for the different dirs is because we have no way of knowing if the different
> > > > > timestamps are the same Mesa version
> > > > 
> > > > 1 if you are using timestamp it ID mesa version just use mesa version.
> > > 
> > > We can't, distros patch Mesa we can't depend on the version numbers being an
> > > accurate representation of the source that was use to compile a Mesa
> > > binaray. There is an option to do a sha of the binaries but that is a little
> > > more complex for opengl than it is for vulkan (where it is currently used by
> > > Intel).
> > 
> > Distribution will block using timestamp.  So timestamp cannot be used to
> > perform task.
> 
> I'm pretty sure that's what Vulkan drivers were already doing in the
> previous release. If distros are going to disable features I can't stop that.
> 
> > 
> > Hashing to support your usage case might be the only way.
> 
> Happy for patches to implement this.
> 
> > > 
> > > > 
> > > > Next do read what x32 API is 
> > > > 
> > > > https://en.wikipedia.org/wiki/X32_ABI
> > > > There is a different count in registers between i386 and x32.  So yes they
> > > > both report as ilp-32
> > > 
> > > gcc only sets the __ILP32__ flag for x32 not i386. Clang may differ I'm not
> > > sure, but it's not a major concern for me right now.
> > > 
> > http://stackoverflow.com/questions/34011718/is-ilp32-and-i386-a-valid-
> > configuration
> > Yes Clang sets __ILP32__ on i386 build modes.
> > 
> > __ILP32__ is a valid description for i386.   Should be presumed as being
> > possible to be set in i386.
> > 
> > In fact gcc not having __ILP32__ set for i386 could be called a bug. 
> > Depending on __ILP32__ to split i386 and x32 mode is depending on complier
> > behaviour not specification defined behaviour.
> 
> We want to avoid hooks into the build system. If you have a better idea how
> to do this than is currently done please submit a patch for review.

https://patchwork.freedesktop.org/patch/141891/
This patch almost does it.

The person here goes around a different.   Since you are using autotools/configure script it has --target= that it either works out or your provide.   This is the triplet.   This ready exist in configure as target_alias and Macro: AC_CANONICAL_TARGET will get you target no matter what.

DEFINES="${DEFINES} -DCACHE_ARCH=\\\"$target\\""

But I will have to go through the complete configure because I would think it strange if you have not already passed the target in somewhere for something to the source code.

Might need something for platforms as fall back for those that are not Linux.  And not using autotools.

When you say build system hooks does that include adding extra to configure or using something it provides.  Since if complier stupid as long a configure does the right thing it does not matter.

The target triplet should be-correct and complier neutral.

> 
> There is no shader cache for software renderers so this, or any other issues
> to do with caching for software renders in not a current concern. It would
> be more likely that we would use the gpu_id as a way to distinguish between
> implementations anyway.

Comment 23 Grazvydas Ignotas 2017-03-10 13:27:13 UTC

(In reply to oiaohm from comment #21)
> This is why I am so upset.   As soon as this comes though I have possible
> trouble with miss matched mesa versions crossing with each other in a cache
> directory and being hard to debug.
That will only happen if you force the same modify timestamp on several variations of mesa that are built for matching pointer size. Do you really think it's likely to happen in practice?

There is also an option of disabling the cache entirely through environment if it makes you so nervous.

Comment 24 Emil Velikov 2017-03-10 19:12:07 UTC

On 10 March 2017 at 18:51, Steven Newbury <steve@snewbury.org.uk> wrote:
> On Fri, 2017-03-10 at 13:27 +0000, bugzilla-daemon@freedesktop.org
> wrote:
>> Comment # 23 on bug 100073 from Grazvydas Ignotas
>> (In reply to oiaohm from comment #21)
>> > This is why I am so upset.   As soon as this comes though I have
>> possible
>> > trouble with miss matched mesa versions crossing with each other in
>> a cache
>> > directory and being hard to debug.
>> That will only happen if you force the same modify timestamp on
>> several
>> variations of mesa that are built for matching pointer size. Do you
>> really
>> think it's likely to happen in practice?
>
> I believe that's exactly what happens with reproducible builds:
> https://reproducible-builds.org/
>
Take another look - there's nothing which mentions or suggests that
the binaries must have the same timestamps.
The rule of the game is to produce identical binaries, which is true
even when we had the timestamp file.

And if you're in doubt, do try it on your end - the instructions and
package(s) will help you double-check ;-)

-Emil

Comment 25 oiaohm 2017-03-11 03:36:19 UTC

(In reply to Emil Velikov from comment #24)
> On 10 March 2017 at 18:51, Steven Newbury <steve@snewbury.org.uk> wrote:
> > On Fri, 2017-03-10 at 13:27 +0000, bugzilla-daemon@freedesktop.org
> > wrote:
> >> Comment # 23 on bug 100073 from Grazvydas Ignotas
> >> (In reply to oiaohm from comment #21)
> >> > This is why I am so upset.   As soon as this comes though I have
> >> possible
> >> > trouble with miss matched mesa versions crossing with each other in
> >> a cache
> >> > directory and being hard to debug.
> >> That will only happen if you force the same modify timestamp on
> >> several
> >> variations of mesa that are built for matching pointer size. Do you
> >> really
> >> think it's likely to happen in practice?
> >
> > I believe that's exactly what happens with reproducible builds:
> > https://reproducible-builds.org/
> >
> Take another look - there's nothing which mentions or suggests that
> the binaries must have the same timestamps.
> The rule of the game is to produce identical binaries, which is true
> even when we had the timestamp file.
> 
> And if you're in doubt, do try it on your end - the instructions and
> package(s) will help you double-check ;-)
> 
> -Emil

https://reproducible-builds.org/docs/timestamps/

I gave this link before you need to read it.

SOURCE_DATE_EPOCH is what you are told to use for time stamp values inside source code.  That is not complier or arch unique.

You will be facing libfaketime or equal methods that freeze the clock at SOURCE_DATE_EPOCH to catch cases of developers who have done the wrong thing.

If you look into the existing package on debian that are built multi arch as reproducible builds you will already see many cases where they are the same source the binary files have the exact identical timestamps on them.

https://packages.debian.org/stretch/mesa-utils
From here download the amd64 and the arm64 check timestamps on the files and they are absolutely identical.

In fact you can pick any arch in the list there and they are all 23 December 2016 at 1:58 I could go right to the second identical.

So there is no way to split architecture on timestamp it does not work in the existing real world.

I know don't want to depend on build system.

I have a nightmare multi arch support means you need something like the GNU triplet.   With complier variation like llvm having __ILP32__ for i386 and gcc not and win32 and win64 defined in visual studio at the same time and so on.   Cannot use C macros to solve this effectively.   

Runtime Detection also end up in a huge mountain of code that ends up buggy as all get out due to the C macro issues need to pick between different arch detections.   Next part is complete hell where llvm can throw error because it finding never used code.  There is no promise that gcc will not add dead code detection.   Basically the first lot of code is complier dependant that I complained about as being wrong.

To make runtime detection be running the right detections you are back to the build system providing something like a GNU triplet.

What is usable is hash the binary that will cost a lot of cpu time or used the GNU triplet from the build system and SOURCE_DATE_EPOCH as 1 value.

Last modification time of the source is the SOURCE_DATE_EPOCH or should be.   So any patches to source should change that value as long as the party doing it right. 

Please note due to mesa being MIT license it lack a clause that GPL where modified source must update timestamp.  That limits the effectiveness of using SOURCE_DATE_EPOCH.

Comment 26 oiaohm 2017-03-11 03:48:03 UTC

(In reply to Grazvydas Ignotas from comment #23)
> (In reply to oiaohm from comment #21)
> > This is why I am so upset.   As soon as this comes though I have possible
> > trouble with miss matched mesa versions crossing with each other in a cache
> > directory and being hard to debug.
> That will only happen if you force the same modify timestamp on several
> variations of mesa that are built for matching pointer size. Do you really
> think it's likely to happen in practice?
> 
https://packages.debian.org/stretch/mesa-utils
From here download the amd64 and the arm64 check timestamps on the files and they are absolutely identical timestamped.

In fact you can pick any arch in the list there and they are all 23 December 2016 at 1:58 I could go right to the second identical.

Yes it really happens in practice today.  If you go download all the mesa parts from debian in testing unless there is a patch difference they are all identical timestamps on the files.   So it 100 percent for sure going to happen to me a Debian user who uses multiarch fairly completely.   Now other users of other distributions as those become reproducible could be facing the same problem.

I have said no to file or build timestamp enough times.  Hopefully now people will get it that as a option to split versions in the real world is no longer valid.   Depending on file or build timestamp is depending on those packaging to use different time what they are not required to-do and have to tools not to-do.

Comment 27 Tobias Droste 2017-03-11 12:06:42 UTC

They are build from the same source code version so it's perfectly fine for them to have the same timestamp.

In fact, this actually helps and would make it unnecessary to have separate folders! 
I think you miss the point of why there are different folders.

There are different folders, because usually 32bit and 64bit builds from the _same_ source version usually have different timestamps. If you would have the cache in only a single directory and run 64bit and 32bit applications mixed, each would invalidate the cache if it was created from the other arch.

So if you run multiple archs and thery have the same timestamp (as you claim) you will not run into problems if different archs share the same cache folder.

As mentioned before, the arch is changing the content of the cache item as the content is defined by our GPU and not CPU.

Comment 28 Emil Velikov 2017-03-11 12:40:30 UTC

Peter Dolding [seems like you like posting huge emails everywhere], attempting to summarise:

Debian's interpretation of "reproducible" is - _all_ the binaries across _any_ arch _must_ have the same build timestamp. Yet the documentation does not even remotely say so, even if [Debian people] they wrote it the first place.

Do I understand this correctly when I say that as you run the arm64 binary that will be used alongside arm64 Mesa - correct ? If so how can one use the arm64 build radeonsi alongside the x86 kernel amdgpu/radeon driver - kindly point us to a reading material.

As a parting gift - being an open-source project does not mean that people can or should accommodate of all the setups out there. Not my call, but personally I'll put this in the WONTFIX pile.

Comment 29 oiaohm 2017-03-12 03:49:52 UTC

(In reply to Tobias Droste from comment #27)
> They are build from the same source code version so it's perfectly fine for
> them to have the same timestamp.
> 
> In fact, this actually helps and would make it unnecessary to have separate
> folders! 

Are these folders unnecessary I am not too sure.

> I think you miss the point of why there are different folders.
> 
> There are different folders, because usually 32bit and 64bit builds from the
> _same_ source version usually have different timestamps. If you would have
> the cache in only a single directory and run 64bit and 32bit applications
> mixed, each would invalidate the cache if it was created from the other arch.
> 
> So if you run multiple archs and thery have the same timestamp (as you
> claim) you will not run into problems if different archs share the same
> cache folder.
> 
> As mentioned before, the arch is changing the content of the cache item as
> the content is defined by our GPU and not CPU.
Its not theory that different archs in debian do share identical timestamps.  This is reality.

You are missing something important here Nouveau already picked up and started using the shared util/disk_cache so its only a matter of time before one of the software renderers do the same thing.

Tobias Droste so for the amd hardware I have there should not be a fatal issue when the match up happens.   That is good to know.  Will that also apply in the Nouveau we would have to ask and it is going to apply in every case that is going to use the disk_cache code?   So this could come a bug that comes back and bites in future.

Tobias step back look at the broader picture.   This is an area of common code and need to be design not for one case but in fact for all.  Now if something like this was inside just the AMD user-space library code everything would be good if a promise was given that compatibility would be maintained.

I would still prefer each arch has there own cache directory in case one of the archs has a suspect gcc or other complier and has in fact built the library wrong.   So that only that archs application go splat.   Currently you have an environmental to turn disc cache off extra option would be an environmental to allow forcing using another cache location so that Multiarch can do diagnostics.

You can look at each arch as it own build environment with its own bugs.

Turning the cache off is not exactly a solution.

Comment 30 Tobias Droste 2017-03-12 03:58:01 UTC

(In reply to oiaohm from comment #29)
> I would still prefer each arch has there own cache directory in case one of
> the archs has a suspect gcc or other complier and has in fact built the
> library wrong.   So that only that archs application go splat.   Currently
> you have an environmental to turn disc cache off extra option would be an
> environmental to allow forcing using another cache location so that
> Multiarch can do diagnostics.

MESA_GLSL_CACHE_DIR=<something>

> 
> You can look at each arch as it own build environment with its own bugs.
> 
> Turning the cache off is not exactly a solution.

Comment 31 oiaohm 2017-03-12 04:33:54 UTC

(In reply to Emil Velikov from comment #28)
> Peter Dolding [seems like you like posting huge emails everywhere],
> attempting to summarise:
> 
> Debian's interpretation of "reproducible" is - _all_ the binaries across
> _any_ arch _must_ have the same build timestamp. Yet the documentation does
> not even remotely say so, even if [Debian people] they wrote it the first
> place.
> 
Its in the documentation just you have missed it.
https://reproducible-builds.org/docs/timestamps/
>>>The Debian reproducible builds effort proposed the SOURCE_DATE_EPOCH >>>environment variable to address the problem. Tools that support it1 will use >>>its value—a number of seconds since January 1st 1970, 00:00 UTC—instead of >>>the current date and time (when set). The variable has been formally defined >>>in the hope of wider adoption.

The lethal bit if you did not see it this time.
>>>instead of the current date and time (when set)
So you have no current date or time and they told you so in the documentation.  So all files will be created with SOURCE_DATE_EPOCH because that is the current date and time in the build.   So they set libfaketime to SOURCE_DATE_EPOCH.

So any system obeying reproducible build documentation to the letter file timestamp is only going to tell you source code version roughly.  So if SOURCE_DATE_EPOCH exists you might as well use your git submit version information at least you will not be fooling yourself that it doing more than what it is.

> Do I understand this correctly when I say that as you run the arm64 binary
> that will be used alongside arm64 Mesa - correct ? If so how can one use the
> arm64 build radeonsi alongside the x86 kernel amdgpu/radeon driver - kindly
> point us to a reading material.
> 
The process is straight forward on debian.
https://wiki.debian.org/QemuUserEmulation
apt-get install qemu binfmt-support qemu-user-static
To install and register qemu usermode for any elf binary the current kernel cannot support directly.
https://wiki.debian.org/Multiarch/HOWTO
Then you just multiarch add arm64
dpkg --add-architecture arm64
update and install the arm64 userspace libraries and any programs you want as arm64 by using :arm64 after it.

At that point you are ready to go and it currently just works that way currently.

Debian does not bother about any special reading materials for it.  Same process no matter the arch binaries you want to run.

So its really no different to adding i386 for 32 bit support other than having to add the qemu usermode support.

Comment 32 Timothy Arceri 2017-03-12 05:59:56 UTC

As has been stated multiple times 32bit, 64bit, x32, arm, x86 etc are all mean nothing in regards to the resulting shader cache entries, you should be able to share them across platforms if not then there is a bug.

The only reason for the 32, ipl-32 and 64 folders was to help keep different Mesa versions for 32 vs 64 from causing the cache to be deleted.

After discussions about another issue it's been decided to move all cache entries to a single cache dir and instead use the timestamp (in future maybe build-id hash) and gpu id as part of the cache entries sha input. 

Since we will no longer be deleting the cache (instead using LRU to remove old entries) arch will no longer be an issue and will be ignored completely, so I'm closing this bug.

As for software renders that will be up to those drivers to decide how to represent the gpu id, but that is a separate issue and can be discussed further when a patch is submitted to the list.

Comment 33 oiaohm 2017-03-12 06:33:55 UTC

(In reply to Timothy Arceri from comment #32)
> As has been stated multiple times 32bit, 64bit, x32, arm, x86 etc are all
> mean nothing in regards to the resulting shader cache entries, you should be
> able to share them across platforms if not then there is a bug.
> 
> The only reason for the 32, ipl-32 and 64 folders was to help keep different
> Mesa versions for 32 vs 64 from causing the cache to be deleted.
> 
> After discussions about another issue it's been decided to move all cache
> entries to a single cache dir and instead use the timestamp (in future maybe
> build-id hash) and gpu id as part of the cache entries sha input.

I would suggest keeping build-id and gpu-id outside the sha value.  Hash collisions are rare but you can fairly much bet someone will do it at some point not point salting the hash and altering risk.   Also if have removed a particular version of mesa from the system for good being able to flush all cache entries that no longer apply by deleting all the cache entries with that build-id.  If you have removed a particular gpu for good same with the gpu-id.  Embedded in the hash will make cleaning cache harder.

Basically make mesa form triplets [build-id]-[gpu-id]-[hash] for platform sensitive [build-id]-[gpu-id]-[arch]-[hash]

That should cover most cases.   Also having the file-name triplet it could be possible to have a update tool to update from one build-id to new build-id cache entries if someone every wants to attempt that.

It is important that everyone learns don't trust timestamps to be unique the world no longer works that way.

So this fix my worry by completely doing away with split.   Seeing a split I was thinking some form of shader uniqueness to arch or it was just masking over a bug that need fixing it turned out to be the case.   So thank you for your time.

Comment 34 Marek Olšák 2017-03-12 18:44:58 UTC

Your argument for keeping the GPU and build IDs out of the hash because of the possibility of collisions has one flaw: The probability of a collision doesn't grow when the cache is already full (if the number of items in the cache remains the same).

Therefore, I think having one cache shared by all GPUs and builds would be a nice simplification without any theoretical disadvantage.

It's a bad idea to wipe the cache on a driver or GPU switch, because DRI PRIME (hybrid graphics) switches between 2 GPUs all the time. It's also not nice to have 4 separate caches ({32-bit, 64-bit} x {dGPU, iGPU}).

Comment 35 oiaohm 2017-03-13 03:18:57 UTC

(In reply to Marek Olšák from comment #34)
> Your argument for keeping the GPU and build IDs out of the hash because of
> the possibility of collisions has one flaw: The probability of a collision
> doesn't grow when the cache is already full (if the number of items in the
> cache remains the same).

Sorry Marek you have your maths wrong you have missed something important predictability of collisions.

The build_id particularly.  As this is a changing value.  When you put this in the hash this does increase the risk of a collision not detected in program development.

So you tested program with X build version no collisions in the hashes.   That you merge build-id into sha the program is running with Y build version and now due to insane bad luck the Y build-id value being different has caused a collision that the programmer who made the program has no clue about.   So now users using Y version are reporting problems that developer does not have and is left totally confused.

So possibility of collisions has barely changed by merging build-id into hash.  But how predictable a collision will be has changed massively.   Yet you still see people making these arguments without the mathematics behind it.    Predictability of when a collision is going to happen has altered massively by merging the build-id so making collisions a lot more of a random event to debug.

Reality you should avoid throwing changing values into hashes where you can as this adds extra level of unpredictability.  With new optimisations to in ways shaders are generated there is enough risk of collisions without making them unpredictable.

gpu-id are less likely to be a cause and should be testable for by a developer.

Basically it bad enough to have a collision without giving it a pseudorandomness factor.   I should have been more clear what I meant by altering risk.   Altering risk for those attempting to provide working programs to end users.   Making collision pseudorandom is not a useful feature not helpful to those making programs.   A predictable hash collision developer of program could possibly work around by not making anything that does that.

So please do remember hash collision have two factors.  Possibility and Predictability.  A good hash solution for a cache decreases the possibility of collision without being insanely unpredictable in case of collision if it ever happens.  If you only look at Possibility you complete end up creating yourself the worse cases if hash collision does happen.

> It's a bad idea to wipe the cache on a driver or GPU switch, because DRI PRIME
> (hybrid graphics) switches between 2 GPUs all the time.

I agree with this.  Its also a bad idea not to be able to clean cache when a GPU is removed for good.   This is the main reason why I suggest keeping the gpu-id exposed not hidden in a hash.

Also not being able to clean out items out of the cache for build_id or gpu_id you are no longer using also increases risk of a collision version to version.

Design a system to clean the cache is just as important as making the cache.   Yes wiping the cache on driver change is over cleaning.   Under cleaning is also bad.   The issue is getting the right balance.

Comment 36 Marek Olšák 2017-03-13 10:14:00 UTC

The build id and GPU id can be compared directly after the hash lookup and decompression to prevent collisions when two shaders with different IDs have the same hash. Thus we only have to worry about collisions between shaders having the same IDs.

Comment 37 oiaohm 2017-03-14 03:06:30 UTC

(In reply to Marek Olšák from comment #36)
> The build id and GPU id can be compared directly after the hash lookup and
> decompression to prevent collisions when two shaders with different IDs have
> the same hash. Thus we only have to worry about collisions between shaders
> having the same IDs.

This read we will deal with a collisions instead of avoiding collisions by using more cpu cycles every time a shader is loaded this does not sound particularly good for performance.

Build-id and gpu-id could be directories in the cache.   This would reduce the number of entries you need to search for hits in and you would not find matches with invalid values of build-id or gpu-id in the first place.   This means you could skip checking for those value when loading a shader from cache because only the correct files should in those directories unless user has done something stupid.

If build-id and gpu-id were directories cache cleaning would be simpler.   Why you could rm -rf the no longer required directories.   This would still keep file names fairly clean.

Marek Olšák is really valid to waste cpu cycles checking Build-id and Gpu-id when it can be done once when opening the directory to search for cache entries.  I think not.   So I don't think build-id and gpu-id mixed into hash would be a good thing.   

Build-id and gpu-id could be in the shader entries to detect if user has done something bad to the cache with a audit tool but this is not something that has to be run when a game is running.   But you do have to be careful with build-id like appending the build-id on the end of the file past the point the hash was calculated to so that collision are consistent.  Also keeping collision possibilities consistent make it possible for application developers to work around them if they ever happen.   

Performance does mean worrying about wasting cycles.   Reducing possibility of collisions by design reduces how many cpu cycles you have to waste preventing collision based errors.   You really don't want to have to be doing any extra checks every time you access the cache thinking a shader cache should be reasonably heavily used at times.

Comment 38 Timothy Arceri 2017-03-14 03:56:26 UTC

This bug report is closed. Please stop spamming it. We have already discussed collisions on the mailing list [1] where these discussions belong.

https://lists.freedesktop.org/archives/mesa-dev/2017-February/145993.html

Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.