Bug 96296

Summary: [clover r600g juniper] clpeak causes a GPU hang
Product: Mesa
Reporter: Grazvydas Ignotas <notasas>
Component: Drivers/Gallium/r600
Assignee: Default DRI bug account <dri-devel>
Status: RESOLVED MOVED
QA Contact: Default DRI bug account <dri-devel>
Severity: normal
Priority: medium
CC: ricardo.ribalda
Version: git
Hardware: x86-64 (AMD64)
OS: Linux (All)
Whiteboard:
i915 platform:
i915 features:
Bug Depends on:
Bug Blocks: 99553
Attachments:
  logs
  global_bandwidth_v16_local_offset asm dump
  AMD PALM (DRM 2.49.0 / 4.10.0-qtec-standard, LLVM 4.0.1 + MESA 17.0.3

Description Grazvydas Ignotas 2016-05-31 20:47:42 UTC
AMD JUNIPER (DRM 2.43.0 / 4.6.0)
Mesa 12.1.0-devel (git-3581812)
llvm-3.8 1:3.8-2ubuntu3

clpeak - https://github.com/krrishnarraj/clpeak.git

As soon as it starts its float8 test (earlier ones run fine), the machine locks up and does not recover. Perhaps it attempts to execute some fp64 instructions that are missing on Juniper?
Comment 1 Jan Vesely 2016-05-31 22:20:51 UTC
(In reply to Grazvydas Ignotas from comment #0)
> AMD JUNIPER (DRM 2.43.0 / 4.6.0)
> Mesa 12.1.0-devel (git-3581812)
> llvm-3.8 1:3.8-2ubuntu3
> 
> clpeak - https://github.com/krrishnarraj/clpeak.git
> 
> As soon as it starts its float8 test (earlier ones run fine), the machine
> locks up and does not recover. Perhaps it attempts to execute some fp64
> instructions that are missing on Juniper?

Any attempt to use doubles should fail to build the kernel (even with llvm 3.8).

Running with CLOVER_DEBUG=llvm,asm CLOVER_OUTPUT=out_file should give you an idea about what the compiled program looks like, though I'd recommend using llvm 3.9.
Comment 2 Grazvydas Ignotas 2016-06-01 00:21:42 UTC
Created attachment 124221 [details]
logs
Comment 3 Grazvydas Ignotas 2016-06-01 00:22:27 UTC
OK, so it's the memory bandwidth test that causes the GPU hang; --compute-dp fails with "No double precision support! Skipped", as expected.

LLVM 3.9 doesn't seem to have been released yet, so I've built trunk, but the hang is still there. I was able to capture the logs before the system died; they are attached.

BTW CLOVER_OUTPUT doesn't seem to be handled, did you mean CLOVER_DEBUG_FILE?
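
For reference, a minimal sketch of the capture run discussed above. It assumes that CLOVER_DEBUG_FILE (as suggested in this comment) is the variable clover reads for the dump file, and that clpeak is on PATH. The helper names and the default file name are made up for illustration.

```python
import os
import subprocess


def clover_debug_env(dump_path, flags=("llvm", "asm"), base_env=None):
    """Return a copy of the environment with clover debug dumping enabled."""
    env = dict(os.environ if base_env is None else base_env)
    env["CLOVER_DEBUG"] = ",".join(flags)
    # Assumption: CLOVER_DEBUG_FILE (not CLOVER_OUTPUT) names the dump file.
    env["CLOVER_DEBUG_FILE"] = dump_path
    return env


def run_clpeak(dump_path="clover_dump.txt", extra_args=()):
    """Run clpeak with dumping enabled and return its exit code."""
    cmd = ["clpeak", *extra_args]
    return subprocess.run(cmd, env=clover_debug_env(dump_path)).returncode
```

Calling run_clpeak() on an affected machine may hang the GPU, so it is worth syncing disks or logging over ssh first.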
Comment 4 Jan Vesely 2016-06-06 17:59:35 UTC
Created attachment 124375 [details]
global_bandwidth_v16_local_offset asm dump
Comment 5 Jan Vesely 2016-06-07 15:01:31 UTC
One problem is that, starting from R700, ADD_INT is a VecALU-only instruction (it should not be in the Trans slot), but fixing that was not enough to fix the hang on my Turks.
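
The slot constraint can be illustrated with a small check over an instruction bundle. This is only a sketch: the bundle representation (a dict from slot name to opcode), the slot names x/y/z/w/t, and the function name are assumptions for illustration, and the vector-only set contains just the one opcode named in this comment.

```python
# Opcodes that must not be placed in the Trans slot (R700 and later).
# Only ADD_INT is taken from this report; the real list is longer.
VECTOR_ONLY_OPS = {"ADD_INT"}

# Assumed slot naming: four vector slots plus one Trans slot.
VECTOR_SLOTS = ("x", "y", "z", "w")
TRANS_SLOT = "t"


def misplaced_ops(bundle):
    """Return (slot, opcode) pairs where a vector-only op sits in the Trans slot.

    `bundle` is a hypothetical mapping of slot name -> opcode string.
    """
    return [
        (slot, op)
        for slot, op in bundle.items()
        if slot == TRANS_SLOT and op in VECTOR_ONLY_OPS
    ]
```

A scheduler-side fix would avoid emitting such bundles in the first place; a check like this is only useful for spotting them in an asm dump.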
Comment 6 Ricardo Ribalda 2017-04-19 10:46:12 UTC
Using llvm 4.0.1 and the latest git commit from libclc (17648cd846390e294feafef21c32c7106eac1e24):

I am getting an endless CPU loop with clpeak, which can be interrupted with Ctrl+C.

Other samples, such as Matrix Multiply, work fine.

CLOVER_DEBUG=llvm,asm,clc CLOVER_OUTPUT=clover.out clpeak >dump 2>dump.err
Comment 7 Ricardo Ribalda 2017-04-19 10:48:22 UTC
Created attachment 130914 [details]
AMD PALM (DRM 2.49.0 / 4.10.0-qtec-standard, LLVM 4.0.1 + MESA 17.0.3
Comment 8 Jan Vesely 2017-07-28 21:08:04 UTC
Got this today. No hang.

Platform: Clover
  Device: AMD TURKS (DRM 2.49.0 / 4.11.11-300.fc26.x86_64, LLVM 6.0.0)
    Driver version  : 17.3.0-devel (Linux x64)
    Compute units   : 6
    Clock frequency : 650 MHz

    Global memory bandwidth (GBPS)
      float   : 40.47
      float2  : 41.01
      float4  : 38.05
      float8  : 25.09
      float16 : 13.33

    Single-precision compute (GFLOPS)
      float   : 124.18
      float2  : 243.14
      float4  : 249.80
      float8  : 285.99
      float16 : 350.36

    No double precision support! Skipped

    Integer compute (GIOPS)
      int   : 62.25
      int2  : 122.03
      int4  : 123.01
      int8  : 122.29
      int16 : 122.11

    Transfer bandwidth (GBPS)
      enqueueWriteBuffer         : 18.15
      enqueueReadBuffer          : 3.06
      enqueueMapBuffer(for read) : 6.53
        memcpy from mapped ptr   : 5.65
      enqueueUnmap(after write)  : 2108.68
        memcpy to mapped ptr     : 7.49

    Kernel launch latency : 67.10 us
Comment 9 Grazvydas Ignotas 2017-07-30 11:47:33 UTC
I've changed hardware and can no longer test, so I'll just trust Jan and close this.
Comment 10 Jan Vesely 2017-08-02 22:18:11 UTC
Turns out I spoke too soon. The GPU still hangs, but Linux is better at recovering.
There are still GPU hang (ring 0 stalled for more than) messages in dmesg.
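
A quick way to check for such messages after a run is to filter the kernel log. The sketch below matches only the fragment quoted above ("stalled for more than"); the function names are made up, and reading dmesg may require elevated privileges on some systems.

```python
import subprocess

# Fragment taken from the message quoted in this comment.
HANG_MARKER = "stalled for more than"


def find_hang_lines(log_text, marker=HANG_MARKER):
    """Return the log lines that contain the hang marker (case-insensitive)."""
    needle = marker.lower()
    return [line for line in log_text.splitlines() if needle in line.lower()]


def read_dmesg():
    """Return the current kernel log as text by invoking dmesg."""
    result = subprocess.run(
        ["dmesg"], capture_output=True, text=True, check=True
    )
    return result.stdout
```

For example, find_hang_lines(read_dmesg()) returns an empty list when no stall was logged.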
Comment 11 GitLab Migration User 2019-09-18 19:21:40 UTC
-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe to and participate further in the new issue through this link to our GitLab instance: https://gitlab.freedesktop.org/mesa/mesa/issues/586.