Bug 96296 - [clover r600g juniper] clpeak causes a GPU hang
Status: RESOLVED MOVED
Alias: None
Product: Mesa
Classification: Unclassified
Component: Drivers/Gallium/r600
Version: git
Hardware: x86-64 (AMD64) Linux (All)
Importance: medium normal
Assignee: Default DRI bug account
QA Contact: Default DRI bug account
Blocks: 99553
Reported: 2016-05-31 20:47 UTC by Grazvydas Ignotas
Modified: 2019-09-18 19:21 UTC

Attachments
logs (234.05 KB, application/octet-stream)
2016-06-01 00:21 UTC, Grazvydas Ignotas
global_bandwidth_v16_local_offset asm dump (14.98 KB, text/plain)
2016-06-06 17:59 UTC, Jan Vesely
AMD PALM (DRM 2.49.0 / 4.10.0-qtec-standard, LLVM 4.0.1 + MESA 17.0.3 (1.54 MB, text/x-log)
2017-04-19 10:48 UTC, Ricardo Ribalda

Description Grazvydas Ignotas 2016-05-31 20:47:42 UTC
AMD JUNIPER (DRM 2.43.0 / 4.6.0)
Mesa 12.1.0-devel (git-3581812)
llvm-3.8 1:3.8-2ubuntu3

clpeak - https://github.com/krrishnarraj/clpeak.git

As soon as it starts its float8 test (earlier ones run fine), the machine locks up and does not recover. Perhaps it attempts to execute some fp64 instructions that are missing on Juniper?
Comment 1 Jan Vesely 2016-05-31 22:20:51 UTC
(In reply to Grazvydas Ignotas from comment #0)
> AMD JUNIPER (DRM 2.43.0 / 4.6.0)
> Mesa 12.1.0-devel (git-3581812)
> llvm-3.8 1:3.8-2ubuntu3
> 
> clpeak - https://github.com/krrishnarraj/clpeak.git
> 
> As soon as it starts its float8 test (earlier ones run fine), the machine
> locks up and does not recover. Perhaps it attempts to execute some fp64
> instructions that are missing on Juniper?

Any attempt to use doubles should cause the kernel build to fail (even with llvm 3.8).

Running with CLOVER_DEBUG=llvm,asm CLOVER_OUTPUT=out_file should give you an idea about what the compiled program looks like, though I'd recommend using llvm 3.9.
Comment 2 Grazvydas Ignotas 2016-06-01 00:21:42 UTC
Created attachment 124221
logs
Comment 3 Grazvydas Ignotas 2016-06-01 00:22:27 UTC
OK, so it's the memory bandwidth test that causes the GPU hang; --compute-dp fails with "No double precision support! Skipped", as expected.

llvm 3.9 doesn't seem to be released yet, so I've built trunk, but the hang is still there. I was able to capture the logs before the system died; they are attached.

BTW, CLOVER_OUTPUT doesn't seem to be handled; did you mean CLOVER_DEBUG_FILE?
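
For anyone reproducing the dump, a minimal Python sketch of the environment setup follows. It assumes that CLOVER_DEBUG_FILE (rather than CLOVER_OUTPUT) is the variable that redirects the dump, as suggested above; the helper name `clover_debug_env` and the file name `clover.out` are made up for illustration.

```python
import os


def clover_debug_env(flags, dump_path, base_env=None):
    """Return a copy of the environment with the Clover debug variables set.

    flags     -- debug flags such as ["llvm", "asm"], joined with commas
    dump_path -- file the dump should go to (assumes CLOVER_DEBUG_FILE is honoured)
    base_env  -- environment to start from; defaults to the current process env
    """
    env = dict(os.environ if base_env is None else base_env)
    env["CLOVER_DEBUG"] = ",".join(flags)
    env["CLOVER_DEBUG_FILE"] = dump_path
    return env


# Hypothetical usage: build the environment without launching anything.
env = clover_debug_env(["llvm", "asm"], "clover.out")
print(env["CLOVER_DEBUG"], env["CLOVER_DEBUG_FILE"])
```

Passing the result as `env=` to `subprocess.run(["clpeak"], env=env)` would start the benchmark with the dump enabled; on affected hardware that run may hang the GPU, so it is left out of the sketch.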
Comment 4 Jan Vesely 2016-06-06 17:59:35 UTC
Created attachment 124375
global_bandwidth_v16_local_offset asm dump
Comment 5 Jan Vesely 2016-06-07 15:01:31 UTC
One problem is that, starting from R700, ADD_INT is a VecALU-only instruction (it should not be in the Trans slot), but fixing that was not enough to fix the hang on my Turks.
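
The slot constraint can be pictured with a small Python sketch. This is a toy model only, not the compiler's actual scheduler: it assumes a VLIW bundle is represented as a mapping from slot name to opcode, with "x", "y", "z", "w" as the vector slots and "t" as the Trans slot, and the only vector-only opcode it knows about is the ADD_INT case named above. "OTHER_OP" is a placeholder, not a real opcode.

```python
TRANS_SLOT = "t"                # assumed name for the Trans slot
VECTOR_ONLY_OPS = {"ADD_INT"}   # only the opcode mentioned in this comment


def misplaced_ops(bundle):
    """Return (slot, opcode) pairs where a vector-only op sits in the Trans slot.

    bundle -- dict mapping a slot name ("x", "y", "z", "w", "t") to an opcode
    """
    return [
        (slot, op)
        for slot, op in bundle.items()
        if slot == TRANS_SLOT and op in VECTOR_ONLY_OPS
    ]


# Hypothetical bundles: the first violates the rule, the second does not.
bad = {"x": "OTHER_OP", "t": "ADD_INT"}
good = {"x": "ADD_INT", "t": "OTHER_OP"}
print(misplaced_ops(bad))
print(misplaced_ops(good))
```

A check like this would only flag the one known problem; as noted, fixing it alone did not make the hang go away.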
Comment 6 Ricardo Ribalda 2017-04-19 10:46:12 UTC
Using llvm 4.0.1 and the latest git commit from libclc ( 17648cd846390e294feafef21c32c7106eac1e24 ):

I am getting an endless CPU loop with clpeak, which can be interrupted with Ctrl+C.

Other samples, such as Matrix Multiply, work fine.

CLOVER_DEBUG=llvm,asm,clc CLOVER_OUTPUT=clover.out clpeak >dump 2>dump.err
Comment 7 Ricardo Ribalda 2017-04-19 10:48:22 UTC
Created attachment 130914
AMD PALM (DRM 2.49.0 / 4.10.0-qtec-standard, LLVM 4.0.1 + MESA 17.0.3
Comment 8 Jan Vesely 2017-07-28 21:08:04 UTC
Got this today. No hang.

Platform: Clover
  Device: AMD TURKS (DRM 2.49.0 / 4.11.11-300.fc26.x86_64, LLVM 6.0.0)
    Driver version  : 17.3.0-devel (Linux x64)
    Compute units   : 6
    Clock frequency : 650 MHz

    Global memory bandwidth (GBPS)
      float   : 40.47
      float2  : 41.01
      float4  : 38.05
      float8  : 25.09
      float16 : 13.33

    Single-precision compute (GFLOPS)
      float   : 124.18
      float2  : 243.14
      float4  : 249.80
      float8  : 285.99
      float16 : 350.36

    No double precision support! Skipped

    Integer compute (GIOPS)
      int   : 62.25
      int2  : 122.03
      int4  : 123.01
      int8  : 122.29
      int16 : 122.11

    Transfer bandwidth (GBPS)
      enqueueWriteBuffer         : 18.15
      enqueueReadBuffer          : 3.06
      enqueueMapBuffer(for read) : 6.53
        memcpy from mapped ptr   : 5.65
      enqueueUnmap(after write)  : 2108.68
        memcpy to mapped ptr     : 7.49

    Kernel launch latency : 67.10 us
Comment 9 Grazvydas Ignotas 2017-07-30 11:47:33 UTC
I've changed hardware and can no longer test, so I'll just trust Jan and close this.
Comment 10 Jan Vesely 2017-08-02 22:18:11 UTC
Turns out I spoke too soon. The GPU still hangs, but Linux is better at recovering.
There are still GPU hang ("ring 0 stalled for more than") messages in dmesg.
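
Since a recovered hang is easy to miss, here is a minimal Python sketch for spotting these messages in a saved kernel log. It matches only the "ring N stalled for more than" fragment quoted above; the rest of the message format is not assumed, and the sample lines are made up for illustration.

```python
import re

# Matches the fragment quoted in this comment and captures the ring number.
STALL_RE = re.compile(r"ring (\d+) stalled for more than")


def stalled_rings(log_lines):
    """Return the ring number of every stall message found, in order."""
    rings = []
    for line in log_lines:
        match = STALL_RE.search(line)
        if match:
            rings.append(int(match.group(1)))
    return rings


# Hypothetical log excerpt, not taken from a real dmesg.
sample = [
    "example: ring 0 stalled for more than (value elided)",
    "example: unrelated kernel message",
]
print(stalled_rings(sample))
```

The same function can be fed the lines of `dmesg` output captured after a clpeak run; an empty result would suggest that no stall was logged.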
Comment 11 GitLab Migration User 2019-09-18 19:21:40 UTC
-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed to further activity.

You can subscribe and participate further in the new bug via this link to our GitLab instance: https://gitlab.freedesktop.org/mesa/mesa/issues/586.

