96296 – [clover r600g juniper] clpeak causes a GPU hang

Status:	RESOLVED MOVED

Alias:	None

Product:	Mesa
Classification:	Unclassified
Component:	Drivers/Gallium/r600
Version:	git
Hardware:	x86-64 (AMD64) Linux (All)

Importance:	medium normal
Assignee:	Default DRI bug account
QA Contact:	Default DRI bug account

URL:
Whiteboard:
Keywords:

Depends on:
Blocks:	99553

Reported:	2016-05-31 20:47 UTC by Grazvydas Ignotas
Modified:	2019-09-18 19:21 UTC
CC List:	1 user

Attachments
logs (234.05 KB, application/octet-stream) 2016-06-01 00:21 UTC, Grazvydas Ignotas
global_bandwidth_v16_local_offset asm dump (14.98 KB, text/plain) 2016-06-06 17:59 UTC, Jan Vesely
AMD PALM (DRM 2.49.0 / 4.10.0-qtec-standard, LLVM 4.0.1 + MESA 17.0.3 (1.54 MB, text/x-log) 2017-04-19 10:48 UTC, Ricardo Ribalda

Description Grazvydas Ignotas 2016-05-31 20:47:42 UTC

AMD JUNIPER (DRM 2.43.0 / 4.6.0)
Mesa 12.1.0-devel (git-3581812)
llvm-3.8 1:3.8-2ubuntu3

clpeak - https://github.com/krrishnarraj/clpeak.git

As soon as it starts its float8 test (earlier ones run fine), the machine locks up and does not recover. Perhaps it attempts to execute some fp64 instructions that are missing on Juniper?

Comment 1 Jan Vesely 2016-05-31 22:20:51 UTC

(In reply to Grazvydas Ignotas from comment #0)
> AMD JUNIPER (DRM 2.43.0 / 4.6.0)
> Mesa 12.1.0-devel (git-3581812)
> llvm-3.8 1:3.8-2ubuntu3
> 
> clpeak - https://github.com/krrishnarraj/clpeak.git
> 
> As soon as it starts its float8 test (earlier ones run fine), the machine
> locks up and does not recover. Perhaps it attempts to execute some fp64
> instructions that are missing on Juniper?

Any attempt to use doubles should fail to build the kernel (even with llvm 3.8).

Running with CLOVER_DEBUG=llvm,asm CLOVER_OUTPUT=out_file should give you an idea about what the compiled program looks like, though I'd recommend using llvm 3.9.

Comment 2 Grazvydas Ignotas 2016-06-01 00:21:42 UTC

Created attachment 124221
logs

Comment 3 Grazvydas Ignotas 2016-06-01 00:22:27 UTC

OK, so it's the memory bandwidth test that causes the GPU hang; --compute-dp fails with "No double precision support! Skipped", as expected.

llvm 3.9 doesn't seem to have been released yet, so I built trunk, but the hang is still there. I was able to capture the logs before the system died; they are attached.

BTW, CLOVER_OUTPUT doesn't seem to be handled; did you mean CLOVER_DEBUG_FILE?
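
For reference, the debug setup discussed in comments 1 and 3 could be scripted roughly as below. This is only a sketch: the flag names (llvm, asm) and the CLOVER_DEBUG_FILE variable are taken from this thread and may differ between Mesa versions, and the helper name clover_debug_env is made up for illustration.

```python
import os


def clover_debug_env(flags, dump_file, base_env=None):
    """Return a copy of the environment with Clover debug dumping enabled.

    Assumption: CLOVER_DEBUG takes a comma-separated flag list and
    CLOVER_DEBUG_FILE names the dump file, as suggested in this thread.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["CLOVER_DEBUG"] = ",".join(flags)
    env["CLOVER_DEBUG_FILE"] = dump_file
    return env
```

The returned dict can be passed to something like subprocess.run(["clpeak"], env=clover_debug_env(["llvm", "asm"], "clover.out")), assuming clpeak is on PATH. Since the machine may lock up, writing the dump to a file on disk is probably more useful than relying on terminal output.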

Comment 4 Jan Vesely 2016-06-06 17:59:35 UTC

Created attachment 124375
global_bandwidth_v16_local_offset asm dump

Comment 5 Jan Vesely 2016-06-07 15:01:31 UTC

One problem is that, starting with R700, ADD_INT is a VecALU-only instruction (it should not be placed in the Trans slot), but fixing that was not enough to fix the hang on my Turks.
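
The slot constraint described here can be illustrated with a small check over an asm dump. This is a sketch, not the r600 backend's actual logic: the bundle representation (a dict from slot name to opcode), the slot names, and the helper trans_slot_violations are hypothetical, and only ADD_INT is taken from the comment as a vector-only opcode.

```python
# Opcodes that must stay in the vector (x/y/z/w) slots.
# Only ADD_INT comes from the comment above; extend as needed.
VECTOR_ONLY_OPS = {"ADD_INT"}


def trans_slot_violations(bundles, vector_only=VECTOR_ONLY_OPS):
    """Return (bundle_index, opcode) for vector-only ops placed in the 't' slot.

    Each bundle is a hypothetical dict mapping a slot name
    ('x', 'y', 'z', 'w', 't') to an opcode string.
    """
    return [
        (index, bundle["t"])
        for index, bundle in enumerate(bundles)
        if bundle.get("t") in vector_only
    ]
```

An empty result would only mean this particular rule is not violated; as noted above, fixing the ADD_INT placement alone did not resolve the hang.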

Comment 6 Ricardo Ribalda 2017-04-19 10:46:12 UTC

Using llvm 4.0.1 and the latest git commit from libclc ( 17648cd846390e294feafef21c32c7106eac1e24 ):

I am getting an endless CPU loop with clpeak, which can be interrupted with Ctrl+C.

Other samples, such as Matrix Multiply, work fine.

CLOVER_DEBUG=llvm,asm,clc CLOVER_OUTPUT=clover.out clpeak >dump 2>dump.err

Comment 7 Ricardo Ribalda 2017-04-19 10:48:22 UTC

Created attachment 130914
AMD PALM (DRM 2.49.0 / 4.10.0-qtec-standard, LLVM 4.0.1 + MESA 17.0.3

Comment 8 Jan Vesely 2017-07-28 21:08:04 UTC

Got this today. No hang.

Platform: Clover
  Device: AMD TURKS (DRM 2.49.0 / 4.11.11-300.fc26.x86_64, LLVM 6.0.0)
    Driver version  : 17.3.0-devel (Linux x64)
    Compute units   : 6
    Clock frequency : 650 MHz

    Global memory bandwidth (GBPS)
      float   : 40.47
      float2  : 41.01
      float4  : 38.05
      float8  : 25.09
      float16 : 13.33

    Single-precision compute (GFLOPS)
      float   : 124.18
      float2  : 243.14
      float4  : 249.80
      float8  : 285.99
      float16 : 350.36

    No double precision support! Skipped

    Integer compute (GIOPS)
      int   : 62.25
      int2  : 122.03
      int4  : 123.01
      int8  : 122.29
      int16 : 122.11

    Transfer bandwidth (GBPS)
      enqueueWriteBuffer         : 18.15
      enqueueReadBuffer          : 3.06
      enqueueMapBuffer(for read) : 6.53
        memcpy from mapped ptr   : 5.65
      enqueueUnmap(after write)  : 2108.68
        memcpy to mapped ptr     : 7.49

    Kernel launch latency : 67.10 us

Comment 9 Grazvydas Ignotas 2017-07-30 11:47:33 UTC

I've changed hardware and can no longer test, so I'll just trust Jan and close this.

Comment 10 Jan Vesely 2017-08-02 22:18:11 UTC

Turns out I spoke too soon. The GPU still hangs, but Linux is better at recovering.
There are still "GPU hang (ring 0 stalled for more than ...)" messages in dmesg.
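
Because the system now recovers, a benchmark run can look clean while the kernel log still records the stall. A minimal sketch for checking captured dmesg output follows; the regular expression matches only the fragment quoted above, the rest of the kernel message format is not assumed, and the helper name find_ring_stalls is made up for illustration.

```python
import re

# Matches the "ring N stalled for more than" fragment quoted above.
# The remainder of the kernel message is deliberately not assumed.
STALL_RE = re.compile(r"ring (\d+) stalled for more than")


def find_ring_stalls(log_text):
    """Return (line_number, ring_index) for each line reporting a stalled ring."""
    hits = []
    for line_number, line in enumerate(log_text.splitlines(), start=1):
        match = STALL_RE.search(line)
        if match:
            hits.append((line_number, int(match.group(1))))
    return hits
```

The input could be the text of a saved log or the output of dmesg captured right after the clpeak run.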

Comment 11 GitLab Migration User 2019-09-18 19:21:40 UTC

-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/mesa/mesa/issues/586.
