Bug 94265

Summary: Add support for running GROMACS OpenCL on Ivy Bridge / Haswell with Beignet
Product: Beignet Reporter: Vedran Miletić <vedran>
Component: Beignet Assignee: Xiuli Pan <xiuli.pan>
Status: RESOLVED MOVED QA Contact:
Severity: major    
Priority: medium CC: dominik, sin.pecado
Version: unspecified   
Hardware: All   
OS: All   
Whiteboard:
i915 platform: i915 features:

Description Vedran Miletić 2016-02-23 17:00:47 UTC
Running GROMACS [1] with the following patch

diff --git a/src/gromacs/gpu_utils/gpu_utils_ocl.cpp b/src/gromacs/gpu_utils/gpu_utils_ocl.cpp
index 2084d8c..8928582 100644
--- a/src/gromacs/gpu_utils/gpu_utils_ocl.cpp
+++ b/src/gromacs/gpu_utils/gpu_utils_ocl.cpp
@@ -131,6 +131,8 @@ static int is_gmx_supported_gpu_id(struct gmx_device_info_t *ocl_gpu_device)
             return egpuCompatible;
         case OCL_VENDOR_AMD:
             return runningOnCompatibleOSForAmd() ? egpuCompatible : egpuIncompatible;
+        case OCL_VENDOR_INTEL:
+            return egpuCompatible;
         default:
             return egpuIncompatible;
     }

results in:

$ gmx mdrun -deffnm em
      :-) GROMACS - gmx mdrun, 2016-dev-20160222-29943fe-dirty-unknown (-:

                            GROMACS is written by:
     Emile Apol      Rossen Apostolov  Herman J.C. Berendsen    Par Bjelkmar   
 Aldert van Buuren   Rudi van Drunen     Anton Feenstra    Gerrit Groenhof  
 Christoph Junghans   Anca Hamuraru    Vincent Hindriksen Dimitrios Karkoulis
    Peter Kasson        Jiri Kraus      Carsten Kutzner      Per Larsson    
  Justin A. Lemkul   Magnus Lundborg   Pieter Meulenhoff    Erik Marklund   
   Teemu Murtola       Szilard Pall       Sander Pronk      Roland Schulz   
  Alexey Shvetsov     Michael Shirts     Alfons Sijbers     Peter Tieleman  
  Teemu Virolainen  Christian Wennberg    Maarten Wolf   
                           and the project leaders:
        Mark Abraham, Berk Hess, Erik Lindahl, and David van der Spoel

Copyright (c) 1991-2000, University of Groningen, The Netherlands.
Copyright (c) 2001-2015, The GROMACS development team at
Uppsala University, Stockholm University and
the Royal Institute of Technology, Sweden.
check out http://www.gromacs.org for more information.

GROMACS is free software; you can redistribute it and/or modify it
under the terms of the GNU Lesser General Public License
as published by the Free Software Foundation; either version 2.1
of the License, or (at your option) any later version.

GROMACS:      gmx mdrun, version 2016-dev-20160222-29943fe-dirty-unknown
Executable:   /usr/local/gromacs/bin/gmx
Data prefix:  /usr/local/gromacs
Command line:
  gmx mdrun -deffnm em


Back Off! I just backed up em.log to ./#em.log.39#
X server found. dri2 connection failed! 
DRM_IOCTL_I915_GEM_APERTURE failed: Invalid argument
Assuming 131072kB available aperture size.
May lead to reduced performance or incorrect rendering.
get chip id failed: -1 [2]
param: 4, val: 0
X server found. dri2 connection failed! 
DRM_IOCTL_I915_GEM_APERTURE failed: Invalid argument
Assuming 131072kB available aperture size.
May lead to reduced performance or incorrect rendering.
get chip id failed: -1 [2]
param: 4, val: 0
X server found. dri2 connection failed! 
DRM_IOCTL_I915_GEM_APERTURE failed: Invalid argument
Assuming 131072kB available aperture size.
May lead to reduced performance or incorrect rendering.
get chip id failed: -1 [2]
param: 4, val: 0
X server found. dri2 connection failed! 
DRM_IOCTL_I915_GEM_APERTURE failed: Invalid argument
Assuming 131072kB available aperture size.
May lead to reduced performance or incorrect rendering.
get chip id failed: -1 [2]
param: 4, val: 0
X server found. dri2 connection failed! 
DRM_IOCTL_I915_GEM_APERTURE failed: Invalid argument
Assuming 131072kB available aperture size.
May lead to reduced performance or incorrect rendering.
get chip id failed: -1 [2]
param: 4, val: 0
X server found. dri2 connection failed! 
DRM_IOCTL_I915_GEM_APERTURE failed: Invalid argument
Assuming 131072kB available aperture size.
May lead to reduced performance or incorrect rendering.
get chip id failed: -1 [2]
param: 4, val: 0
X server found. dri2 connection failed! 
DRM_IOCTL_I915_GEM_APERTURE failed: Invalid argument
Assuming 131072kB available aperture size.
May lead to reduced performance or incorrect rendering.
get chip id failed: -1 [2]
param: 4, val: 0
X server found. dri2 connection failed! 
DRM_IOCTL_I915_GEM_APERTURE failed: Invalid argument
Assuming 131072kB available aperture size.
May lead to reduced performance or incorrect rendering.
get chip id failed: -1 [2]
param: 4, val: 0
X server found. dri2 connection failed! 
DRM_IOCTL_I915_GEM_APERTURE failed: Invalid argument
Assuming 131072kB available aperture size.
May lead to reduced performance or incorrect rendering.
get chip id failed: -1 [2]
param: 4, val: 0
X server found. dri2 connection failed! 
DRM_IOCTL_I915_GEM_APERTURE failed: Invalid argument
Assuming 131072kB available aperture size.
May lead to reduced performance or incorrect rendering.
get chip id failed: -1 [2]
param: 4, val: 0
X server found. dri2 connection failed! 
DRM_IOCTL_I915_GEM_APERTURE failed: Invalid argument
Assuming 131072kB available aperture size.
May lead to reduced performance or incorrect rendering.
get chip id failed: -1 [2]
param: 4, val: 0
X server found. dri2 connection failed! 
DRM_IOCTL_I915_GEM_APERTURE failed: Invalid argument
Assuming 131072kB available aperture size.
May lead to reduced performance or incorrect rendering.
get chip id failed: -1 [2]
param: 4, val: 0

Running on 1 node with total 4 cores, 4 logical cores, 2 compatible GPUs
Hardware detected:
  CPU info:
    Vendor: Intel
    Brand:  Intel(R) Core(TM) i5-3550 CPU @ 3.30GHz
    SIMD instructions most likely to fit this hardware: AVX_256
    SIMD instructions selected at GROMACS compile time: AVX_256
  GPU info:
    Number of GPUs detected: 2
    #0: name: AMD TONGA (DRM 3.1.0, LLVM 3.9.0), vendor: AMD, device version: OpenCL 1.1 MESA 11.2.0-devel, stat: compatible
    #1: name: Intel(R) HD Graphics IvyBridge GT1, vendor: Intel, device version: OpenCL 1.2 beignet 1.1.1 (git-9043d32), stat: compatible

Reading file em.tpr, VERSION 5.1-dev-20150219-7c30fcf-unknown (single precision)
Note: file tpx version 100, software tpx version 109
Using 2 MPI threads
Using 2 OpenMP threads per tMPI thread

2 compatible GPUs are present, with IDs 0,1
2 GPUs auto-selected for this run.
Mapping of GPU IDs to the 2 PP ranks in this node: 0,1

X server found. dri2 connection failed! 
DRM_IOCTL_I915_GEM_APERTURE failed: Invalid argument
Assuming 131072kB available aperture size.
May lead to reduced performance or incorrect rendering.
get chip id failed: -1 [2]
param: 4, val: 0
X server found. dri2 connection failed! 
DRM_IOCTL_I915_GEM_APERTURE failed: Invalid argument
Assuming 131072kB available aperture size.
May lead to reduced performance or incorrect rendering.
get chip id failed: -1 [2]
param: 4, val: 0
X server found. dri2 connection failed! 
DRM_IOCTL_I915_GEM_APERTURE failed: Invalid argument
Assuming 131072kB available aperture size.
May lead to reduced performance or incorrect rendering.
get chip id failed: -1 [2]
param: 4, val: 0
Selecting generic kernel
ASSERTION FAILED: Not supported
  at file /builddir/build/BUILD/Beignet-1.1.1-Source/backend/src/./ir/context.hpp, function gbe::ir::ImmediateIndex gbe::ir::Context::newIntegerImmediate(int64_t, gbe::ir::Type), line 95
Trace/breakpoint trap (core dumped)

Regardless of "DRM_IOCTL_I915_GEM_APERTURE failed: Invalid argument", this assert failure should not happen, right?

[1] http://www.gromacs.org/
Comment 1 Vedran Miletić 2016-02-23 19:10:44 UTC
Another machine, without ioctl failures, same issue:

$ gmx mdrun -deffnm em -v                                                                                                                                                                  
      :-) GROMACS - gmx mdrun, 2016-dev-20160222-29943fe-dirty-unknown (-:

                            GROMACS is written by:
     Emile Apol      Rossen Apostolov  Herman J.C. Berendsen    Par Bjelkmar   
 Aldert van Buuren   Rudi van Drunen     Anton Feenstra    Gerrit Groenhof  
 Christoph Junghans   Anca Hamuraru    Vincent Hindriksen Dimitrios Karkoulis
    Peter Kasson        Jiri Kraus      Carsten Kutzner      Per Larsson    
  Justin A. Lemkul   Magnus Lundborg   Pieter Meulenhoff    Erik Marklund   
   Teemu Murtola       Szilard Pall       Sander Pronk      Roland Schulz   
  Alexey Shvetsov     Michael Shirts     Alfons Sijbers     Peter Tieleman  
  Teemu Virolainen  Christian Wennberg    Maarten Wolf   
                           and the project leaders:
        Mark Abraham, Berk Hess, Erik Lindahl, and David van der Spoel

Copyright (c) 1991-2000, University of Groningen, The Netherlands.
Copyright (c) 2001-2015, The GROMACS development team at
Uppsala University, Stockholm University and
the Royal Institute of Technology, Sweden.
check out http://www.gromacs.org for more information.

GROMACS is free software; you can redistribute it and/or modify it
under the terms of the GNU Lesser General Public License
as published by the Free Software Foundation; either version 2.1
of the License, or (at your option) any later version.

GROMACS:      gmx mdrun, version 2016-dev-20160222-29943fe-dirty-unknown
Executable:   /home/vedranm/software/bin/gmx
Data prefix:  /home/vedranm/software
Command line:
  gmx mdrun -deffnm em -v


Back Off! I just backed up em.log to ./#em.log.20#

Running on 1 node with total 4 cores, 4 logical cores, 1 compatible GPU
Hardware detected:
  CPU info:
    Vendor: Intel
    Brand:  Intel(R) Core(TM) i5-4200M CPU @ 2.50GHz
    SIMD instructions most likely to fit this hardware: AVX2_256
    SIMD instructions selected at GROMACS compile time: AVX2_256
  GPU info:
    Number of GPUs detected: 1
    #0: name: Intel(R) HD Graphics Haswell GT2 Mobile, vendor: Intel, device version: OpenCL 1.2 beignet 1.1.1 (git-2eea2c9), stat: compatible

Reading file em.tpr, VERSION 5.1-dev-20150219-7c30fcf-unknown (single precision)
Note: file tpx version 100, software tpx version 109
Using 1 MPI thread
Using 4 OpenMP threads 

1 compatible GPU is present, with ID 0
1 GPU auto-selected for this run.
Mapping of GPU ID to the 1 PP rank in this node: 0

Selecting generic kernel
ASSERTION FAILED: Not supported
  at file /builddir/build/BUILD/Beignet-1.1.1-Source/backend/src/./ir/context.hpp, function gbe::ir::ImmediateIndex gbe::ir::Context::newIntegerImmediate(int64_t, gbe::ir::Type), line 95
Trace/breakpoint trap (core dumped)
Comment 2 Xiuli Pan 2016-04-28 07:36:54 UTC
Hi Vedran,

Is the problem with GROMACS still present?
I have tried GROMACS with our master branch and the patch https://gerrit.gromacs.org/#/c/5752/
GROMACS runs, but I do not know whether the result is correct.

Thanks
Xiuli
Comment 3 Vedran Miletić 2016-05-02 13:02:27 UTC
(In reply to Xiuli Pan from comment #2)
> Hi Vedran,
> 
> Is the problem with GROMACS still present?
> I have tried GROMACS with our master branch and the patch
> https://gerrit.gromacs.org/#/c/5752/
> GROMACS runs, but I do not know whether the result is correct.
> 
> Thanks
> Xiuli

I tried, but got blocked by bug 95239.
Comment 4 Vedran Miletić 2016-05-02 20:33:55 UTC
Managed to get a Fedora 23 x86_64 machine with GCC 5.3.1 to test. This machine is not Ivy Bridge but Haswell:

vendor_id       : GenuineIntel
cpu family      : 6
model           : 60
model name      : Intel(R) Core(TM) i5-4200M CPU @ 2.50GHz
stepping        : 3
microcode       : 0x1e
cpu MHz         : 2560.156
cache size      : 3072 KB

00:02.0 VGA compatible controller [0300]: Intel Corporation 4th Gen Core Processor Integrated Graphics Controller [8086:0416] (rev 06)

With custom-compiled LLVM/Clang 3.8, the latest Beignet from git, and GROMACS with the patch you linked and the patch from comment 1, I get

Running on 1 node with total 4 cores, 4 logical cores, 1 compatible GPU
Hardware detected:
  CPU info:
    Vendor: Intel
    Brand:  Intel(R) Core(TM) i5-4200M CPU @ 2.50GHz
    SIMD instructions most likely to fit this hardware: AVX2_256
    SIMD instructions selected at GROMACS compile time: AVX2_256

  Hardware topology: Basic
  GPU info:
    Number of GPUs detected: 1
    #0: name: Intel(R) HD Graphics Haswell GT2 Mobile, vendor: Intel, device version: OpenCL 1.2 beignet 1.2 (git-8dfec54), stat: compatible

Reading file em.tpr, VERSION 5.1-dev-20150219-7c30fcf-unknown (single precision)
Note: file tpx version 100, software tpx version 110
Using 1 MPI thread
Using 4 OpenMP threads 

1 compatible GPU is present, with ID 0
1 GPU auto-selected for this run.
Mapping of GPU ID to the 1 PP rank in this node: 0

Selecting generic kernel

Back Off! I just backed up em.trr to ./#em.trr.14#

Back Off! I just backed up em.edr to ./#em.edr.14#

Steepest Descents:
   Tolerance (Fmax)   =  1.00000e+03
   Number of steps    =        50000
ClERROR! -49
Step=    0, Dmax= 1.0e-02 nm, Epot=  3.54278e+04 Fmax= 4.65376e+03, atom= 1309
ClERROR! -49
Step=    1, Dmax= 1.0e-02 nm, Epot=  2.97022e+04 Fmax= 7.33237e+03, atom= 1309
ClERROR! -49

According to [1], error code -49 is CL_INVALID_ARG_INDEX. So, at least on Haswell, the original issue is not there, but it still does not work. I have updated the summary accordingly.

[1] https://streamcomputing.eu/blog/2013-04-28/opencl-error-codes/
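
For readers decoding such numbers by hand, a minimal C sketch of a lookup for a few OpenCL error codes is given here. The numeric values are the ones defined in the Khronos CL/cl.h header; the function name cl_error_name is a hypothetical helper, not part of OpenCL, Beignet, or GROMACS, and the table is deliberately incomplete.

```c
/* Hypothetical helper: map a handful of OpenCL error codes to their names.
 * Only a small subset is listed; anything else is reported as unknown. */
const char *cl_error_name(int code)
{
    switch (code) {
        case 0:   return "CL_SUCCESS";
        case -5:  return "CL_OUT_OF_RESOURCES";
        case -48: return "CL_INVALID_KERNEL";
        case -49: return "CL_INVALID_ARG_INDEX";
        case -50: return "CL_INVALID_ARG_VALUE";
        case -51: return "CL_INVALID_ARG_SIZE";
        case -52: return "CL_INVALID_KERNEL_ARGS";
        default:  return "unknown OpenCL error code";
    }
}
```

In the OpenCL specification, CL_INVALID_ARG_INDEX is one of the errors clSetKernelArg can return, so a -49 suggests that a kernel argument index did not match what the driver expected for the compiled kernel.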
Comment 5 Xiuli Pan 2016-05-03 06:01:09 UTC
I have a build of the GROMACS master branch on Haswell, but when I run the same thing I get the following output and then it just hangs.
What is gmx mdrun -deffnm em used for? What does em stand for? Which files do I need to run this test?

./gmx mdrun -deffnm em
          :-) GROMACS - gmx mdrun, 2016-dev-20160405-2c62aed-dirty (-:

                            GROMACS is written by:
     Emile Apol      Rossen Apostolov  Herman J.C. Berendsen    Par Bjelkmar
 Aldert van Buuren   Rudi van Drunen     Anton Feenstra    Gerrit Groenhof
 Christoph Junghans   Anca Hamuraru    Vincent Hindriksen Dimitrios Karkoulis
    Peter Kasson        Jiri Kraus      Carsten Kutzner      Per Larsson
  Justin A. Lemkul   Magnus Lundborg   Pieter Meulenhoff    Erik Marklund
   Teemu Murtola       Szilard Pall       Sander Pronk      Roland Schulz
  Alexey Shvetsov     Michael Shirts     Alfons Sijbers     Peter Tieleman
  Teemu Virolainen  Christian Wennberg    Maarten Wolf
                           and the project leaders:
        Mark Abraham, Berk Hess, Erik Lindahl, and David van der Spoel

Copyright (c) 1991-2000, University of Groningen, The Netherlands.
Copyright (c) 2001-2015, The GROMACS development team at
Uppsala University, Stockholm University and
the Royal Institute of Technology, Sweden.
check out http://www.gromacs.org for more information.

GROMACS is free software; you can redistribute it and/or modify it
under the terms of the GNU Lesser General Public License
as published by the Free Software Foundation; either version 2.1
of the License, or (at your option) any later version.

GROMACS:      gmx mdrun, version 2016-dev-20160405-2c62aed-dirty
Executable:   /home/pxl/gromacs/build/bin/./gmx
Data prefix:  /home/pxl/gromacs (source tree)
Command line:
  gmx mdrun -deffnm em
Comment 6 Vedran Miletić 2016-05-03 06:44:54 UTC
(In reply to Xiuli Pan from comment #5)
> I have a build of the GROMACS master branch on Haswell, but when I run the
> same thing I get the following output and then it just hangs.
> What is gmx mdrun -deffnm em used for? What does em stand for? Which files
> do I need to run this test?

It's a molecular dynamics run, and EM stands for energy minimization. The reason you have to produce these files is that the only part of GROMACS that actually calls OpenCL is mdrun. These files are very easy to produce, for example you can use the instructions at [1].

[1] http://www.bevanlab.biochem.vt.edu/Pages/Personal/justin/gmx-tutorials/lysozyme/index.html
Comment 7 Xiuli Pan 2016-05-03 07:00:42 UTC
I followed the guide to the em part and here is the log:

---------------------------------------------------------------------------

./gmx mdrun -v -deffnm em
          :-) GROMACS - gmx mdrun, 2016-dev-20160405-2c62aed-dirty (-:

                            GROMACS is written by:
     Emile Apol      Rossen Apostolov  Herman J.C. Berendsen    Par Bjelkmar
 Aldert van Buuren   Rudi van Drunen     Anton Feenstra    Gerrit Groenhof
 Christoph Junghans   Anca Hamuraru    Vincent Hindriksen Dimitrios Karkoulis
    Peter Kasson        Jiri Kraus      Carsten Kutzner      Per Larsson
  Justin A. Lemkul   Magnus Lundborg   Pieter Meulenhoff    Erik Marklund
   Teemu Murtola       Szilard Pall       Sander Pronk      Roland Schulz
  Alexey Shvetsov     Michael Shirts     Alfons Sijbers     Peter Tieleman
  Teemu Virolainen  Christian Wennberg    Maarten Wolf
                           and the project leaders:
        Mark Abraham, Berk Hess, Erik Lindahl, and David van der Spoel

Copyright (c) 1991-2000, University of Groningen, The Netherlands.
Copyright (c) 2001-2015, The GROMACS development team at
Uppsala University, Stockholm University and
the Royal Institute of Technology, Sweden.
check out http://www.gromacs.org for more information.

GROMACS is free software; you can redistribute it and/or modify it
under the terms of the GNU Lesser General Public License
as published by the Free Software Foundation; either version 2.1
of the License, or (at your option) any later version.

GROMACS:      gmx mdrun, version 2016-dev-20160405-2c62aed-dirty
Executable:   /home/pxl/gromacs/build/bin/./gmx
Data prefix:  /home/pxl/gromacs (source tree)
Command line:
  gmx mdrun -v -deffnm em


Back Off! I just backed up em.log to ./#em.log.1#

Running on 1 node with total 8 cores, 8 logical cores, 1 compatible GPU
Hardware detected:
  CPU info:
    Vendor: Intel
    Brand:  Intel(R) Xeon(R) CPU E3-1286 v3 @ 3.70GHz
    SIMD instructions most likely to fit this hardware: AVX2_256
    SIMD instructions selected at GROMACS compile time: AVX2_256

  Hardware topology: Full, with devices
  GPU info:
    Number of GPUs detected: 1
    #0: name: Intel(R) HD Graphics Haswell GT2 Server, vendor: Intel, device version: OpenCL 1.2 beignet 1.2 (git-b39d875), stat: compatible

Reading file em.tpr, VERSION 2016-dev-20160405-2c62aed-dirty (single precision)
Using 1 MPI thread
Using 8 OpenMP threads

1 compatible GPU is present, with ID 0
1 GPU auto-selected for this run.
Mapping of GPU ID to the 1 PP rank in this node: 0

Setting up defines for kernel types for FastGen -DGMX_OCL_FASTGEN_ADD_TWINCUT -DEL_EWALD_ANA -DEELNAME=_ElecEw -DLJ_COMB_GEOM -DVDWNAME=_VdwLJCombGeom
Selecting kernel source automatically
Selecting generic kernel
Setting up kernel vendor spec definitions:  -D_WARPLESS_SOURCE_
The OpenCL compilation log has been saved in "nbnxn_ocl_kernels.cl.SUCCEEDED"

Back Off! I just backed up em.trr to ./#em.trr.1#

Back Off! I just backed up em.edr to ./#em.edr.1#

Steepest Descents:
   Tolerance (Fmax)   =  1.00000e+03
   Number of steps    =        50000
Step=    0, Dmax= 1.0e-02 nm, Epot= -3.31075e+06 Fmax= 6.48321e+04, atom= 2732
Step=    1, Dmax= 1.0e-02 nm, Epot= -3.31357e+06 Fmax= 6.36930e+04, atom= 2732
Step=    4, Dmax= 3.0e-03 nm, Epot= -3.31544e+06 Fmax= 6.18313e+04, atom= 2732
Step=   16, Dmax= 1.8e-06 nm, Epot= -3.30329e+06 Fmax= 6.18301e+04, atom= 2732
Energy minimization has stopped, but the forces have not converged to the
requested precision Fmax < 1000 (which may not be possible for your system).
It stopped because the algorithm tried to make a new step whose size was too
small, or there was no change in the energy since last step. Either way, we
regard the minimization as converged to within the available machine
precision, given your starting configuration and EM parameters.

Double precision normally gives you higher accuracy, but this is often not
needed for preparing to run molecular dynamics.
You might need to increase your constraint accuracy, or turn
off constraints altogether (set constraints = none in mdp file)

writing lowest energy coordinates.

Back Off! I just backed up em.gro to ./#em.gro.1#

Steepest Descents converged to machine precision in 17 steps,
but did not reach the requested Fmax < 1000.
Potential Energy  = -3.3154425e+06
Maximum force     =  6.1831297e+04 on atom 2732
Norm of force     =  7.5550275e+02


NOTE: The GPU has >20% more load than the CPU. This imbalance causes
      performance loss, consider using a shorter cut-off and a finer PME grid.

NOTE: 16 % of the run time was spent in pair search,
      you might want to increase nstlist (this has no effect on accuracy)

---------------------------------------------------------------------------
I actually don't know if this is right, could you have a look here?
Thanks
Xiuli
Comment 8 Vedran Miletić 2016-05-03 10:45:14 UTC
(In reply to Xiuli Pan from comment #7)
> I actually don't know if this is right, could you have a look here?
> Thanks
> Xiuli

I have seen the "Energy minimization has stopped, but the forces have not converged to the requested precision Fmax < 1000 (which may not be possible for your system)." happen on Radeon previously, and I am not sure if it is correct.

Can you try the latest git of both GROMACS and Beignet? Then we would have comparable system software and hardware, and hopefully you will also hit error -49.
Comment 9 Xiuli Pan 2016-05-04 02:46:58 UTC
Hi Vedran,

I am using the master branch of Beignet and a commit from 2016-dev-20160405 of GROMACS. If you can point to a commit that shows the error, I can try to reproduce the problem.

Thanks
Xiuli
Comment 10 Vedran Miletić 2016-05-04 15:19:38 UTC
I tried the same version on Fedora 24 x86_64, and I believe I am getting an incorrect result. It makes no difference whether I use the version of GROMACS you use or the latest git.

I get

Steepest Descents:
   Tolerance (Fmax)   =  1.00000e+03
   Number of steps    =        50000

Energy minimization has stopped, but the forces have not converged to the
requested precision Fmax < 1000 (which may not be possible for your system).
It stopped because the algorithm tried to make a new step whose size was too
small, or there was no change in the energy since last step. Either way, we
regard the minimization as converged to within the available machine
precision, given your starting configuration and EM parameters.

Double precision normally gives you higher accuracy, but this is often not
needed for preparing to run molecular dynamics.
You might need to increase your constraint accuracy, or turn
off constraints altogether (set constraints = none in mdp file)

writing lowest energy coordinates.

Back Off! I just backed up em.gro to ./#em.gro.4#

Steepest Descents converged to machine precision in 32 steps,
but did not reach the requested Fmax < 1000.
Potential Energy  = -5.6638625e+05
Maximum force     =  1.0924077e+04 on atom 1515
Norm of force     =  3.5640625e+02

Another simulation, which works on a machine with Radeon/NVIDIA GPUs, gives me

starting mdrun 'UBIQUITIN in water'
500000 steps,   1000.0 ps.

step 14: One or more water molecules can not be settled.
Check for bad contacts and/or reduce the timestep if appropriate.
Wrote pdb files with previous and current coordinates
Segmentation fault (core dumped)

Is there any info I can provide that would help debugging?
Comment 11 Xiuli Pan 2016-05-05 06:32:44 UTC
Hi Vedran, could you tell us exactly which GROMACS commit you are using?
I know how to run it with the guide you gave, but I cannot tell from the result whether there are any bugs. Is there a unit test for GROMACS that can check the result, or something like that?
Sometimes the result can be badly wrong.

And hi Szilárd, this is another bug with GROMACS. What do you think it is about? Should we use the patch you sent?

Thanks
Xiuli
Comment 12 Vedran Miletić 2016-08-03 13:07:34 UTC
(In reply to Xiuli Pan from comment #11)
> Hi Vedran, could you tell us exactly which GROMACS commit you are using?
> I know how to run it with the guide you gave, but I cannot tell from the
> result whether there are any bugs. Is there a unit test for GROMACS that can
> check the result, or something like that?
> Sometimes the result can be badly wrong.
> 
> And hi Szilárd, this is another bug with GROMACS. What do you think it is
> about? Should we use the patch you sent?
> 
> Thanks
> Xiuli

I'm using Beignet git 4585453a41bd1f88e0225785201927b69591d570 with LLVM/Clang 3.8 on Fedora 24 on 00:02.0 VGA compatible controller [0300]: Intel Corporation 4th Gen Core Processor Integrated Graphics Controller [8086:0416] (rev 06).

As for GROMACS, I'm using release-2016 branch, latest commit which right now is 0722e465c5bc31e493c41359917821545a4f9423. The following patch is necessary to allow using OpenCL on Intel iGPUs:

diff --git a/src/gromacs/gpu_utils/gpu_utils_ocl.cpp b/src/gromacs/gpu_utils/gpu_utils_ocl.cpp
index 2084d8c..8928582 100644
--- a/src/gromacs/gpu_utils/gpu_utils_ocl.cpp
+++ b/src/gromacs/gpu_utils/gpu_utils_ocl.cpp
@@ -131,6 +131,8 @@ static int is_gmx_supported_gpu_id(struct gmx_device_info_t *ocl_gpu_device)
             return egpuCompatible;
         case OCL_VENDOR_AMD:
             return runningOnCompatibleOSForAmd() ? egpuCompatible : egpuIncompatible;
+        case OCL_VENDOR_INTEL:
+            return egpuCompatible;
         default:
             return egpuIncompatible;
     }

Tried both without and with this patch https://gerrit.gromacs.org/#/c/5752/ -- no change.

For testing, I use ftp://ftp.gromacs.org/pub/benchmarks/water_GMX50_bare.tar.gz by going to one of the directories (say 0003) and issuing

$ gmx grompp -f pme.mdp
$ gmx mdrun -s topol.tpr -notunepme -v

Simulation segfaults or warns about water being unable to settle.
Comment 13 Szilárd Páll 2016-08-03 14:37:03 UTC
Hi all,

Sorry for the delay. Let me add some clarifications.

I'd like to note that it hasn't been verified yet whether the GROMACS kernels in question execute correctly on Intel iGPU hardware. I had planned to try the proprietary OpenCL stack, but haven't had the time. Hence, the incorrect results observed (recently reproduced on Haswell and Skylake) could also be due to a bug on the GROMACS side. If somebody has a setup with the Intel proprietary OpenCL stack and could help out with running a test or two, it would be greatly appreciated.


Secondly, the repro steps that Vedran provided are sufficient to show the issue, but not necessary (or convenient) to verify correctness. To do that it's instead more convenient to run the unit and regression tests (configure with -DGMX_BUILD_UNITTESTS=ON for the former, -DREGRESSIONTEST_DOWNLOAD=ON for the latter) and run "make check".

Alternatively, a more dev-oriented but somewhat quick-and-dirty (and limited) check is to run a single test case (e.g. the one Vedran provided):

gmx mdrun -nb cpu -nsteps 0 -g ref-cpu.log;
gmx mdrun -nb gpu -nsteps 0 -g gpu.log

and compare the first step energy values, i.e. the outputs of:

grep -m 1 -A 4 ' Energies (kJ/mol)' ref-cpu.log;
grep -m 1 -A 4 ' Energies (kJ/mol)' gpu.log

Cheers,
Sz.
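
The grep-based comparison can also be scripted. What follows is a minimal C sketch, assuming the log layout that the grep commands print (a header line containing "Energies (kJ/mol)", one line of term names, then one line of values). The helper first_energy_row is hypothetical and not part of GROMACS; it only parses the first row of values.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: parse up to max_values numbers from the first row of
 * values after the first "Energies (kJ/mol)" header in log_text.
 * Assumed layout: header line, one line of term names, one line of values.
 * Returns the number of values stored in `values`. */
size_t first_energy_row(const char *log_text, double *values, size_t max_values)
{
    const char *p = strstr(log_text, "Energies (kJ/mol)");
    if (p == NULL) {
        return 0;
    }
    /* Skip the rest of the header line, then the line of term names. */
    for (int skipped = 0; skipped < 2; skipped++) {
        p = strchr(p, '\n');
        if (p == NULL) {
            return 0;
        }
        p++;
    }
    /* Restrict parsing to the single line of values. */
    const char *line_end = strchr(p, '\n');
    if (line_end == NULL) {
        line_end = p + strlen(p);
    }
    size_t count = 0;
    while (count < max_values) {
        char *next = NULL;
        double v = strtod(p, &next);
        /* Stop when nothing was parsed or the number lies on a later line. */
        if (next == p || next > line_end) {
            break;
        }
        values[count++] = v;
        p = next;
    }
    return count;
}
```

Calling it once on the contents of ref-cpu.log and once on gpu.log yields two arrays that can be compared term by term.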
Comment 14 Szilárd Páll 2016-08-03 21:18:00 UTC
A few more things to add after a bit of testing:
- I suspect the reduction might be the culprit, need to think a bit about it. Is there any way to change or at least know the SIMD execution width? Setting it to 32 would provide helpful feedback.

- The previously noted issue with quoted include seems to still not be fixed.
Comment 15 Xiuli Pan 2016-08-04 02:27:36 UTC
Hi Szilárd,

Thanks for the method to check the result. I will check the result and see whether anything is wrong with the kernel or the result.

Thanks
Xiuli
Comment 16 Xiuli Pan 2016-08-04 02:59:57 UTC
(In reply to Szilárd Páll from comment #14)
> A few more things to add after a bit of testing:
> - I suspect the reduction might be the culprit, need to think a bit about
> it. Is there any way to change or at least know the SIMD execution width?
> Setting it to 32 would provide helpful feedback.
> 
> - The previously noted issue with quoted include seems to still not be fixed.

If you want to get simd width, you can use code like this:

size_t workgroupSize_used;
err = clGetKernelWorkGroupInfo(kernel, device,
                               CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE,
                               sizeof(size_t), &workgroupSize_used, NULL);

We support 16 or 8 for beignet.
Comment 17 Vedran Miletić 2016-08-04 10:52:22 UTC
(In reply to Szilárd Páll from comment #14)
> - The previously noted issue with quoted include seems to still not be fixed.

Related patch for Gallium is here: https://lists.freedesktop.org/archives/mesa-dev/2016-July/122864.html
Comment 18 Szilárd Páll 2016-08-04 13:25:07 UTC
(In reply to Xiuli Pan from comment #15)
> Thanks for the method to check the result. I will check the result and
> see whether anything is wrong with the kernel or the result.

Thanks. Do you have the possibility to compare with the proprietary Intel OpenCL stack?
Comment 19 Xiuli Pan 2016-08-08 07:15:21 UTC
(In reply to Szilárd Páll from comment #14)
> A few more things to add after a bit of testing:
> - I suspect the reduction might be the culprit, need to think a bit about
> it. Is there any way to change or at least know the SIMD execution width?
> Setting it to 32 would provide helpful feedback.
> 
> - The previously noted issue with quoted include seems to still not be fixed.

Hi Szilárd,

I have tried the patch:
https://cgit.freedesktop.org/beignet/commit/?h=Release_v1.1&id=8e9ef20
However, this patch is only in the release branch and not in the master branch. I will ask the maintainer to merge it into master as well.

Thanks
Xiuli
Comment 20 Xiuli Pan 2016-08-08 07:38:09 UTC
Hi,

I have tried the quick-and-dirty dev method to get some data; it seems that both Beignet and the proprietary Intel OpenCL driver give wrong results.
Env: LLVM 3.6.2/3.8; beignet master(dff184) with quote patch; Gromacs release-2016(ebf2a5) with workaround patch.

Something also seems to be wrong with the LLVM version, but the GPU results are all wrong. Is the kernel logic related to SIMD size?

ref-cpu:
   Energies (kJ/mol)
        LJ (SR)   Coulomb (SR)   Coul. recip.      Potential    Kinetic En.
    1.47860e+05   -8.96047e+05    3.62010e+03   -7.44566e+05    0.00000e+00
   Total Energy  Conserved En.    Temperature Pressure (bar)
   -7.44566e+05   -7.44566e+05    0.00000e+00    2.65552e+04
gpu-beg 3.6.2:
   Energies (kJ/mol)
        LJ (SR)   Coulomb (SR)   Coul. recip.      Potential    Kinetic En.
    1.48497e+05   -1.55098e+06    3.62010e+03   -1.39887e+06    0.00000e+00
   Total Energy  Conserved En.    Temperature Pressure (bar)
   -1.39887e+06   -1.39887e+06    0.00000e+00    4.48803e+04
gpu-beg 3.8:
   Energies (kJ/mol)
        LJ (SR)   Coulomb (SR)   Coul. recip.      Potential    Kinetic En.
    1.49634e+05   -1.60296e+06    3.62010e+03   -1.44971e+06    0.00000e+00
   Total Energy  Conserved En.    Temperature Pressure (bar)
   -1.44971e+06   -1.44971e+06    0.00000e+00    4.09773e+04
gpu-intel:
   Energies (kJ/mol)
        LJ (SR)   Coulomb (SR)   Coul. recip.      Potential    Kinetic En.
    1.52678e+05   -1.94930e+06    3.62010e+03   -1.79300e+06    0.00000e+00
   Total Energy  Conserved En.    Temperature Pressure (bar)
   -1.79300e+06   -1.79300e+06    0.00000e+00    3.06926e+04

Thanks
Xiuli
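
To put numbers on "wrong", the first-step energies above can be compared by relative deviation from the CPU reference. This is a minimal C sketch; the function names and the idea of a single tolerance are assumptions for illustration, and no particular tolerance is implied to be the right one for GROMACS.

```c
#include <stddef.h>

/* Hypothetical helper: relative deviation of `value` from `reference`.
 * Falls back to the absolute deviation when the reference is zero. */
double relative_deviation(double reference, double value)
{
    double diff = value - reference;
    if (diff < 0.0) {
        diff = -diff;
    }
    double scale = reference < 0.0 ? -reference : reference;
    return scale == 0.0 ? diff : diff / scale;
}

/* Returns 1 if every term deviates from the reference by at most `tol`
 * (a caller-chosen relative tolerance), 0 otherwise. */
int energies_agree(const double *ref, const double *test, size_t n, double tol)
{
    for (size_t i = 0; i < n; i++) {
        if (relative_deviation(ref[i], test[i]) > tol) {
            return 0;
        }
    }
    return 1;
}
```

For example, with the values listed above, the Coulomb (SR) term of the Beignet run with LLVM 3.6.2 (-1.55098e+06) deviates from the CPU reference (-8.96047e+05) by more than 70 %, while Coul. recip. is identical in all four runs.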
Comment 21 Szilárd Páll 2016-08-11 21:27:38 UTC
Xiuli, thanks for the feedback and the testing!


> Something also seems to be wrong with the LLVM version, but the GPU results are all wrong.

Do you mean that, in addition to the result being off, something else is also wrong in the LLVM version?

> Is the kernel logic related to SIMD size?

It is.

Although the algorithm is designed to be quite general (allowing tuning of parameters like arithmetic intensity, data reuse, register use, etc.), the kernel was originally implemented and tuned for NVIDIA (in CUDA) and later ported to OpenCL for AMD. The current port was my "simple" attempt to remove the execution width >=32 assumptions. I thought the current version had already accomplished that, but that seems not to be the case given the incorrect results with both OpenCL stacks.

I'll give it a thought and will try to find the source of the issue; I think I'll start with eliminating some conditionals to be able to use mem fencing...

By the way, is it possible to do cross-lane operations in Intel with beignet (e.g. using intrinsics)?
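
As a hedged illustration of why an execution-width assumption can matter for a reduction, the toy C model below simulates a barrier-free tree reduction over 32 values. It is not the GROMACS kernel. The function name, the lock-step model, and the schedule in which each group of lanes runs to completion before the next group starts are all assumptions chosen only to make the effect visible.

```c
#define N_LANES 32

/* Toy model (assumption, not real hardware behaviour): `width` lanes run in
 * lock-step, and each group of `width` lanes executes the whole reduction
 * before the next group starts. `width` must divide N_LANES.
 * Returns the value that ends up in lane 0. */
float simulate_reduction(const float *input, int width)
{
    float buf[N_LANES];
    for (int i = 0; i < N_LANES; i++) {
        buf[i] = input[i];
    }

    for (int group = 0; group < N_LANES; group += width) {
        for (int offset = N_LANES / 2; offset > 0; offset /= 2) {
            float tmp[N_LANES];
            /* Lock-step within the group: all active lanes read first... */
            for (int lane = group; lane < group + width; lane++) {
                if (lane < offset) {
                    tmp[lane] = buf[lane] + buf[lane + offset];
                }
            }
            /* ...and only then write, with no barrier across groups. */
            for (int lane = group; lane < group + width; lane++) {
                if (lane < offset) {
                    buf[lane] = tmp[lane];
                }
            }
        }
    }
    return buf[0];
}
```

Under these assumptions, an input of 32 ones reduces to 32 for widths 32 and 16, but to 24 for width 8, because lanes 0-7 consume buf[8..15] before lanes 8-15 have added buf[24..31]. Real hardware scheduling may differ, so this only shows the kind of failure that barriers or memory fences are meant to prevent.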
Comment 22 Vedran Miletić 2016-08-29 15:27:55 UTC
(In reply to Szilárd Páll from comment #21)
> I'll give it a thought and will try to find the source of the issue; I think
> I'll start with eliminating some conditionals to be able to use mem
> fencing...
> 

Szilard, any news here?
Comment 23 Vedran Miletić 2016-12-21 12:55:23 UTC
Xiuli, any news from Intel? Any interesting changes since August? Would it be worth testing again?
Comment 24 Xiuli Pan 2016-12-22 01:50:46 UTC
(In reply to Vedran Miletić from comment #23)
> Xiuli, any news from Intel? Any interesting changes since August? Would it
> be worth testing again?

The program runs here when I test it, but the result does not seem correct. I do not know whether the kernel has a problem or something is wrong with the driver.

If there is anything that can help me figure out which part is broken, I may be able to provide a solution or patches.

Thanks
Xiuli
Comment 25 GitLab Migration User 2018-10-12 21:26:48 UTC
-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/beignet/beignet/issues/66.
