Bug 97524

Summary:	Samplers referring to the same texture unit with different types should raise GL_INVALID_OPERATION
Product:	Mesa	Reporter:	Matias N. Goldberg <dark_sylinc>
Component:	Mesa core	Assignee:	mesa-dev
Status:	RESOLVED FIXED	QA Contact:	mesa-dev
Severity:	minor
Priority:	medium
Version:	git
Hardware:	x86-64 (AMD64)
OS:	Linux (All)
Whiteboard:
i915 platform:		i915 features:
Attachments:	Bug repro Kern.log when it managed to reset Kern.log when it failed to reset Validate sampler types across the whole program

Description Matias N. Goldberg 2016-08-29 04:06:14 UTC

Created attachment 126091 [details]
Bug repro

Running invalid the following GL settings hangs Mesa pretty badly. Sometimes it says stuck on ring(0)/(3), other times it just hangs; GPU tries to reset but does a really poor job (at least I can blindly go to tty1 via Ctrl+Alt+F1 then reboot via Ctrl+Alt+Supr).

Attachment provided to repro the bug.

What is causing it:
Vertex Shader must use samplerBuffer on binding point 0.
Pixel Shader must use samplerCube on binding point 0.
Bind a TBO to binding point 0.

What happens:
Near full system hang, it becomes really unstable. System can be soft-rebooted but that's it.

What should happen:
All other GPU drivers I tested with handle this gracefully by raising a GL_INVALID_OPERATION error and continuing rendering the rest normally.

Version tested: Latest git 22cec6dc5e5a3060bc87f4a92871b4f6eef04632

I'm assigning a low priority since this GL setting combination is invalid to begin with and I want to believe software isn't shipped like this (though considering other GPU drivers allow to ignore the problem, I wouldn't be fully surprised if there is faulty software out there...); though I believe the system shouldn't hang because of this.

GL_VERSION = 4.3.0.0
GL_VENDOR = X.Org
GL_RENDERER = Gallium 0.4 on AMD CAPE VERDE (DRM 2.45.0 / 4.7.0-040700-generic, LLVM 3.9.0)

uname -r
4.7.0-040700-generic

Cheers

Comment 1 Nicolai Hähnle 2016-09-02 10:03:43 UTC

Hi Matias, thanks for the report. I cannot reproduce this on Bonaire or Polaris. Could you please send the dmesg of this when running your bug repro program?

Comment 2 Matias N. Goldberg 2016-09-03 21:54:05 UTC

Created attachment 126190 [details]
Kern.log when it managed to reset

Comment 3 Matias N. Goldberg 2016-09-03 21:54:29 UTC

Created attachment 126191 [details]
Kern.log when it failed to reset

Comment 4 Matias N. Goldberg 2016-09-03 21:55:01 UTC

Ah, you're on bleeding edge HW. Stopping showing off!

It's an AMD Radeon HD 7770 1GB. The audio is not in use but I'm including it just in case.

Vendor and Device ID 1682:3231

01:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Cape Verde XT [Radeon HD 7770/8760 / R7 250X] (prog-if 00 [VGA controller])
	Subsystem: XFX Pine Group Inc. Cape Verde XT [Radeon HD 7770/8760 / R7 250X]
	Flags: bus master, fast devsel, latency 0, IRQ 27
	Memory at e0000000 (64-bit, prefetchable) [size=256M]
	Memory at f7d00000 (64-bit, non-prefetchable) [size=256K]
	I/O ports at e000 [size=256]
	Expansion ROM at 000c0000 [disabled] [size=128K]
	Capabilities: [48] Vendor Specific Information: Len=08 <?>
	Capabilities: [50] Power Management version 3
	Capabilities: [58] Express Legacy Endpoint, MSI 00
	Capabilities: [a0] MSI: Enable+ Count=1/1 Maskable- 64bit+
	Capabilities: [100] Vendor Specific Information: ID=0001 Rev=1 Len=010 
	Capabilities: [150] Advanced Error Reporting
	Capabilities: [270] #19
	Kernel driver in use: radeon
	Kernel modules: radeon

Vendor and Device ID 1682:aab0

01:00.1 Audio device: Advanced Micro Devices, Inc. [AMD/ATI] Cape Verde/Pitcairn HDMI Audio [Radeon HD 7700/7800 Series]
	Subsystem: XFX Pine Group Inc. Cape Verde/Pitcairn HDMI Audio [Radeon HD 7700/7800 Series]
	Flags: bus master, fast devsel, latency 0, IRQ 32
	Memory at f7d60000 (64-bit, non-prefetchable) [size=16K]
	Capabilities: [48] Vendor Specific Information: Len=08 <?>
	Capabilities: [50] Power Management version 3
	Capabilities: [58] Express Legacy Endpoint, MSI 00
	Capabilities: [a0] MSI: Enable+ Count=1/1 Maskable- 64bit+
	Capabilities: [100] Vendor Specific Information: ID=0001 Rev=1 Len=010 
	Capabilities: [150] Advanced Error Reporting
	Kernel driver in use: snd_hda_intel


I'll be attaching two logs. One of them from a few days ago where the GPU managed to recover (if we can call that "recover" it was barely functional and had to reset via Ctrl+Alt+Supr). Another from today I just repro again and was totally unable to recover, screen flashed several times as if it tried to soft reset more than once (I'm not certain that's what happened), still could do Ctrl+Alt+Supr to reset.

I've been thinking these few days this bug MIGHT be related (causing): https://bugs.freedesktop.org/show_bug.cgi?id=93649
If the game for just one frame randomly presents an invalid setup for one measly small object, it would hang Radeon 7770s but it would work fine for everybody else including these users in Windows (seriously, in my own program I could not visually tell what was missing, it was some small object in a cubemap render used for reflections) and nobody would notice.

If you have a suspect on where I should look (i.e. you have Mesa code that SHOULD be catching this incorrect setup but isn't; I can analyze why it's not catching the error) I can insert a few printfs or hook gdb. As a graphics programmer I have my pride to keep.

Comment 5 Nicolai Hähnle 2016-09-12 15:11:59 UTC

I've now tried your sample with Verde, and cannot reproduce it there either.

The logs you have submitted are unfortunately not very useful, because all I see are messages about ring stalls and reset attempts. Are there any messages _before_ that?

To be honest, I don't see what the problem could be. Running it with a debug build (which shows all OpenGL errors) shows quite clearly that the bad state is caught at program link time, which then leads to the bad program not being used.

One thing that catches my eye is that your glxinfo says DRM 2.45.0, while mine says DRM 2.47.0.

Comment 6 Matias N. Goldberg 2016-09-12 22:00:58 UTC

Of course you can't repro... nothing is ever easy, isnt it? :D

Regarding the log: Unfortunately that's it. If I go one line up, I can see a crash from nemo, the file browser which had crashed a few minutes earlier for something completely unrelated.

Regarding libdrm: it's hard to know you're on latest everything! I didn't know I wasn't on latest DRM. I will compile, update and try again.

Could you tell me the error msgs you get? (from successfully catching the error). I could Find in Files the error string to see where it's being caught and begin investigating why it never hits on my machine.

Cheers
Matias

Comment 7 Matias N. Goldberg 2016-09-14 00:08:04 UTC

I've researched yesterday and I did not know DRM 2.47 was tied to the kernel version. I thought libdrm was what determined the version.

I would have to update to the kernel 4.8 which I can't do immediately.

In the meantime it would help a lot if I know what errors show up for you so I can research why my machine is just firing up the draw commands.

Comment 8 Nicolai Hähnle 2016-09-14 08:56:55 UTC

Here's the output:

GL Context creation suceeded.
Mesa: User error: GL_INVALID_OPERATION in glUseProgram(program 3 not linked)
^C0:1(1): error: syntax error, unexpected $end0:1(1): error: syntax error, unexpected $endGL_INVALID_OPERATION in glUseProgram(program 3 not linked)Shader Stats: SGPRS: 24 VGPRS: 16 Code Size: 64 LDS: 0 Scratch: 0 Max Waves: 10 Spilled SGPRs: 0 Spilled VGPRs: 0Shader Stats: SGPRS: 16 VGPRS: 16 Code Size: 40 LDS: 0 Scratch: 0 Max Waves: 10 Spilled SGPRs: 0 Spilled VGPRs: 0Shader Stats: SGPRS: 18 VGPRS: 8 Code Size: 252 LDS: 0 Scratch: 0 Max Waves: 10 Spilled SGPRs: 0 Spilled VGPRs: 0Shader Stats: SGPRS: 16 VGPRS: 16 Code Size: 64 LDS: 0 Scratch: 0 Max Waves: 10 Spilled SGPRs: 0 Spilled VGPRs: 0Shader Stats: SGPRS: 16 VGPRS: 16 Code Size: 40 LDS: 0 Scratch: 0 Max Waves: 10 Spilled SGPRs: 0 Spilled VGPRs: 0

Again, this is with a debug build of Mesa, but there should be no difference in error checking -- only the output when an error is detected is different.

Comment 9 Matias N. Goldberg 2016-09-14 22:01:14 UTC

Oh I see what's happening.

That's my fault.

The paths to the shader files were hardcoded as absolute paths.

Look for:
readFile( "/home/matias/Projects/MesaRing0Bug/src/VertexShader_vs.glsl", data, FILE_BUF );

and:
readFile( "/home/matias/Projects/MesaRing0Bug/src/PixelShader_ps.glsl", data, FILE_BUF );

Change the hardcoded paths, and try again. Sorry about that.

Comment 10 Nicolai Hähnle 2016-10-06 13:56:04 UTC

I can reproduce this now. It really seems like this should be fixed in Mesa main, though: there is already code that checks for this condition when it affects a single program stage (in _mesa_update_shader_textures_used) and when it affects a SSO pipeline (in _mesa_sampler_uniforms_pipeline_are_valid). This code needs to be hooked into the non-SSO case as well.

Comment 11 Tapani Pälli 2016-10-25 05:11:09 UTC

> All other GPU drivers I tested with handle this gracefully by raising a
> GL_INVALID_OPERATION error and continuing rendering the rest normally.

For which command in the sample program should GL_INVALID_OPERATION happen? I remember the check for sampler conflict is done for glValidateProgram but TBH I'm not quite sure where this error should happen in the supplied case.

Comment 12 Matias N. Goldberg 2016-10-25 14:11:40 UTC

I believe the error should happen in either glLinkProgram or glUseProgram, and then every time glDraw* is called.

If I recall correctly AMD's Windows driver raised the error every time glDrawArrays was called; but I don't remember what was raised during glLink/UseProgram (if anything was raised at all).

Comment 13 Roland Scheidegger 2016-10-25 17:33:17 UTC

(In reply to Matias N. Goldberg from comment #12)
> I believe the error should happen in either glLinkProgram or glUseProgram,
> and then every time glDraw* is called.
This is impossible, since the values of the (sampler) uniforms are not part of the state which can be considered with these steps (you link/use and then set the sampler uniforms).

> If I recall correctly AMD's Windows driver raised the error every time
> glDrawArrays was called; but I don't remember what was raised during
> glLink/UseProgram (if anything was raised at all).
Sounds correct, glValidateProgram validates if the program would be valid to execute with all current gl state - essentially the same as what should be validated at draw time (the docs on glValidateProgram even mention sampler mismatch specifically). I'm not sure if draw calls actually need to return an error too (but possibly yes).

Comment 14 Matias N. Goldberg 2016-10-25 18:34:02 UTC

Dang it! You're right.

I forgot GL allows (requires?) changing sampler values after the shader is compiled.

As for glValidateProgram, our actual software (not this sample) would often warn about sampler collisions due to a bug in our code where all the samplers would initially be set to 0, but we later correct it and thus still works fine afterwards.

https://www.khronos.org/opengles/sdk/docs/man/xhtml/glValidateProgram.xml

I know the docs isn't the spec, but this is relevant nonetheless:
"The error GL_INVALID_OPERATION will be generated by glDrawArrays or glDrawElements if any two active samplers in the current program object are of different types, but refer to the same texture image unit."

Comment 15 Timothy Arceri 2016-10-26 00:35:53 UTC

(In reply to Nicolai Hähnle from comment #10)
> I can reproduce this now. It really seems like this should be fixed in Mesa
> main, though: there is already code that checks for this condition when it
> affects a single program stage (in _mesa_update_shader_textures_used) and
> when it affects a SSO pipeline (in
> _mesa_sampler_uniforms_pipeline_are_valid). This code needs to be hooked
> into the non-SSO case as well.

It looks like its hooked up ok the _mesa_update_shader_textures_used() function just doesn't do what it should.

It needs to be looping over the other stages attached to the gl_shader_program each time it is called and setting SamplersValidated accordingly.

In its current form we could fail to throw an error when two shaders have a different types at the same texture unit as in the attached program and I believe also when the Vertex shader has two conflicting types but the fragment shader is ok, since the SamplersValidated flag will be overwritten before we check it.

We obviously need some more piglit tests for this. Looking at the history this may have regressed somewhat with 953a0af8e3f73

Comment 16 Marek Olšák 2016-12-01 02:56:37 UTC

Workaround for radeonsi:

CI-VI:
If the resource type and instruction mismatch (e.g. a buffer constant with an image instruction, or an image resource with a buffer instruction), the instruction will be ignored (reads return nothing and writes do not alter memory).
Solution: Move the buffer descriptor to dwords [0:3] or [8:11] of the sampler slot (the image or fmask portion, respectively).
Note that on Southern Islands, this condition causes a hang.

SI:
Move the buffer descriptor to dwords [8:11] of the sampler slot. This will only cause a hang when a buffer is used as sampler2DMS* and vice versa, because dwords[8:11] are unused by other texture types.

Comment 17 Timothy Arceri 2016-12-02 00:22:03 UTC

Created attachment 128306 [details] [review]
Validate sampler types across the whole program

It's untested but this patch should do the trick.

Note it applies on top of my recent refactors to sampler in this series: https://patchwork.freedesktop.org/series/15613/ so it doesn't apply cleanly to master.

Comment 18 Timothy Arceri 2017-04-13 02:13:37 UTC

Fix tested and sent to list:

https://patchwork.freedesktop.org/patch/150080/

New piglit test:

https://patchwork.freedesktop.org/patch/150079/

Comment 19 Timothy Arceri 2017-04-22 00:08:14 UTC

The error should now be flagged as of:

commit d682f8aa8e0edd166166f87fcd774dd2d57b4180
Author: Timothy Arceri <tarceri@itsqueeze.com>
Date:   Fri Apr 21 17:04:10 2017 +1000

    mesa: validate sampler type across the whole program
    
    Currently we were only making sure types were the same within a
    single stage. This looks to have regressed with 953a0af8e3f73.
    
    Fixes: 953a0af8e3f73 ("mesa: validate sampler uniforms during gluniform calls")
    
    Reviewed-by: Nicolai Hähnle <nicolai.haehnle@amd.com>
    Reviewed-by: Tapani Pälli <tapani.palli@intel.com>
    https://bugs.freedesktop.org/show_bug.cgi?id=97524

Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.