
Artemisa Usage Guide

Introduction

Artemisa is a high-performance computing infrastructure based on GPGPU hardware accelerators, together with the supporting networking and storage infrastructure, for running complex scientific batch jobs.

In these pages we introduce the hardware, the working environment and usage recipes for the end users of this infrastructure.

Infrastructure

The nodes of the infrastructure are divided into three classes:

  • User Interfaces (UI): the entry point for users, providing a working environment where they can compile and test their programs. These machines have a GPGPU and access to the storage services. When the jobs are ready, users can submit their production batch jobs to the Worker Nodes through the Job Management System.
  • Worker Nodes (WN): where the production user jobs are executed. They contain high-end CPUs, a large memory configuration, and up to 4 high-end GPGPUs to execute the jobs.
  • Storage Nodes: disk servers that store user and project data, accessible from both User Interfaces and Worker Nodes.

The current detailed hardware configuration can be found in this page.

Working environment

  • Users access the User Interface (UI) nodes, where they develop and submit production jobs. These nodes provide a complete development environment in which to test and validate programs using an entry-level local GPU. Access to this local GPU is intended for development and validation jobs only, and is therefore granted exclusively for a limited time (5 minute slots).

  • After validation, users can submit their codes as production batch jobs using the Job Management System HTCondor. This provides access to the computing cluster, with high-end GPUs (nodes with up to 4 GPUs interconnected with high-speed NVLink) and longer execution times.

User Interface (UI)

User Interfaces are development nodes, where users can test and validate their programs using the entry-level local GPU. After validation, they can submit their codes as production batch jobs using the Job Management System HTCondor.

Authorized users can log in with the ssh protocol to the User Interface machines (mlui01.ific.uv.es, mlui02.ific.uv.es).
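
For example, assuming your Artemisa account name is <username>, a minimal login from your desktop might look like:

$ ssh <username>@mlui01.ific.uv.es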

The current list of UI machines and detailed hardware configuration can be found in this page.

The user HOME directory resides in the Lustre filesystem (see the Storage section):

/lhome/ific/<initial_letter>/<username>   <= for IFIC users
/lhome/<groupid>/<username>  <= for external users

In addition to the user HOME directory, there is space for projects available upon request, also accessible from the UI:

/lustre/ific.uv.es/ml/<groupid>

For IFIC users with account in the AFS filesystem, their desktop home directory is accessible in the following path:

/afs/ific.uv.es/user/<initial_letter>/<username>

To access it, you first have to obtain a Kerberos ticket with the following commands:
              $ klog
   or
              $ kinit
              $ aklog 
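
To check that the credentials were obtained, the standard Kerberos and AFS client tools can be used (a quick sanity check, assuming they are available on the UI):

              $ klist      # list the Kerberos tickets
              $ tokens     # list the AFS tokens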

Software is accessible on the User Interfaces as described in the Software section.

Gpurun: request exclusive access to User Interface GPU

As the GPU is an exclusive resource, users have to reserve it for interactive access on the local User Interface.

To do so, use the 'gpurun' tool, which grants usage slots valid for 5 minutes.

-bash-4.2$ gpurun ./benchmark.sh 
Connected
Info: OK 0. Tesla P100-PCIE-12GB [00000000:5E:00.0]

Queued clients:1 Estimated waiting time:300 seconds

In case there are other users executing, the command will synchronously wait until it can be executed.

Job Management System: HTCondor

HTCondor is the resource management system that runs in this cluster. It manages the job workflow and allows the users to send jobs to be executed in the worker nodes. Direct access to worker nodes is not allowed.

Each worker node has a partitionable slot that accepts jobs to be processed. HTCondor deals with job sorting and processing. Slots are divided when a job does not require all the node resources, so more jobs can run on the node. CPU and memory resources are subtracted in chunks from the main slot. Requests of 0, 1, 2 or 4 GPUs are permitted.

HTCondor tries to run jobs from different users in a fair-share way. Job priorities among users take into account the time previously consumed by each user, so CPU time is assigned evenly between all users.

The complete HTCondor manual can be found here.

The current Artemisa HTCondor configuration can be found here.

Job submission

Job description file
Before sending a job, you have to prepare a Job Description File which specifies what you want to execute, where you want to write your output, and what your job requirements are.

A simple job description file might be (test.sub):

universe = vanilla

executable          = test.sh
arguments           = $(Cluster) $(Process)

log                 = test.log
output              = outfile.$(Cluster).$(Process).out
error               = errors.$(Cluster).$(process).err

request_Cpus        = 4
request_Memory      = 4000

queue 2

In this file:

  • universe: specifies how HTCondor is going to deal with the job. Use vanilla here.
  • $(Cluster): HTCondor assigns a consecutive number to each group of jobs sent at the same time.
  • $(Process): identifies a specific job inside the group. Process numbers start from 0.
  • executable: tells HTCondor what to execute. This can be a binary or a script.
  • arguments: the arguments HTCondor will pass to the executable; in this example, the cluster and process job identifiers.
  • log: the file where HTCondor will log the job processing information.
  • output: where the job standard output is going to be written. It is important that each job has its own output file, since the file will be overwritten if it already exists. In this example the file name contains the Cluster and Process identifiers, so it is unique.
  • error: same as output, but for the standard error.
  • request_Cpus: how many cores the job needs. It is important to be accurate, since this requirement is used by HTCondor to distribute the available resources among all users.
  • request_Memory: same as CPUs, but for memory. The unit here is MB!
  • queue: number of jobs to be sent to the system using this job configuration.

This is a very simple job, but the Job Description File can be richer, and other options may be more appropriate for other user workloads.
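
The submit file above refers to an executable script test.sh, which is not shown; a minimal sketch of such a script (hypothetical, it simply reports the identifiers it receives) could be:

#!/bin/bash
# test.sh - minimal example payload for the submit file above
# $1 = Cluster id, $2 = Process id (passed via 'arguments')
echo "Running on host: $(hostname)"
echo "Cluster: $1  Process: $2"
sleep 60   # simulate some work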

Actual job submission (condor_submit)

When the Job Description File is ready, we can send the jobs to the system:

$ condor_submit test.sub
Submitting job(s)..
2 job(s) submitted to cluster 110.

110 is the Cluster identification for this group of jobs.

Get the jobs status (condor_q)

To get the job status use:

$ condor_q

-- Schedd: xxx.ific.uv.es : <xx.xx.xx.xx:9618?... @ 03/12/19 11:44:28
OWNER    BATCH_NAME    SUBMITTED   DONE   RUN    IDLE  TOTAL JOB_IDS
fulano ID: 110      3/12 11:44      _      _      2      2 110.0-1

Total for query: 2 jobs; 0 completed, 0 removed, 2 idle, 0 running, 0 held, 0 suspended 
Total for fulano: 2 jobs; 0 completed, 0 removed, 2 idle, 0 running, 0 held, 0 suspended 
Total for all users: 2 jobs; 0 completed, 0 removed, 2 idle, 0 running, 0 held, 0 suspended

Here you can see the jobs in IDLE state, waiting to be considered for running by HTCondor. The log file begins to be filled with information about the HTCondor processing.

When the jobs start running, you can see the output and error files being written, since at IFIC the file system (Lustre) is visible everywhere in the cluster.
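
For example, once a job from cluster 110 starts running, its standard output can be followed from the UI while it is being written (assuming the output file name defined in the submit file above):

$ tail -f outfile.110.0.out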

Removing a job (condor_rm)

A job can be removed from the system using its cluster identification. If several jobs have been sent, a specific job can be removed using its Process identification.

$ condor_rm 110.1
Job 110.1 marked for removal
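
To remove all the jobs of a submission at once, the Cluster identification alone can be given; in the example above this would remove both jobs of cluster 110:

$ condor_rm 110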

Checking cluster status (condor_status)

You can get the cluster (pool) status with:

$ condor_status
Name                    OpSys      Arch   State     Activity LoadAv Mem     ActvtyTime

slot1@mlwn01.ific.uv.es LINUX      X86_64 Unclaimed Idle      0.000  16384  7+20:20:48
slot2@mlwn01.ific.uv.es LINUX      X86_64 Unclaimed Idle      0.000 134964  7+20:21:04
slot3@mlwn01.ific.uv.es LINUX      X86_64 Claimed   Busy      1.030 231368  0+00:06:00
slot1@mlwn02.ific.uv.es LINUX      X86_64 Unclaimed Idle      0.000  16384  7+20:21:07
slot2@mlwn02.ific.uv.es LINUX      X86_64 Unclaimed Idle      0.000 134964  7+20:21:18
slot3@mlwn02.ific.uv.es LINUX      X86_64 Unclaimed Idle      0.000 231368  5+19:46:37
slot1@mlwn03.ific.uv.es LINUX      X86_64 Unclaimed Idle      0.000 309074  7+20:17:45
slot2@mlwn03.ific.uv.es LINUX      X86_64 Unclaimed Idle      0.000 463611  7+20:17:57

               Machines Owner Claimed Unclaimed Matched Preempting  Drain

  X86_64/LINUX        9     0       1         8       0          0      0

         Total        9     0       1         8       0          0      0

Storage

Storage is maintained in several disk servers, as detailed in this page.

A distributed Lustre filesystem is shared and mounted in the different nodes of the cluster, including User Interfaces (UI) and Worker Nodes.

This means that all data is directly available on all nodes, and no explicit file transfer is needed to make it accessible from the worker nodes.

This includes user homes and project areas.
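
For example, a file written from the UI in the project area should be visible from a batch job on any Worker Node under the same path (test.txt is just a hypothetical file name):

# on the User Interface
$ echo "hello" > /lustre/ific.uv.es/ml/<groupid>/test.txt
# from a batch job running on a Worker Node, the same path works
$ cat /lustre/ific.uv.es/ml/<groupid>/test.txt
hello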

Containers

Containers are a form of software distribution that is very convenient for developers and users.

We support Singularity, as it is secure and supports several container types, including Docker and access to Docker Hub.

The current distribution documentation for users can be found here.

Example: download the latest tensorflow nightly gpu container from the docker hub, and convert it into a singularity image for later use:

$ mkdir ~/s.images
$ cd ~/s.images
$ singularity build tensorflow-nightly-gpu docker://tensorflow/tensorflow:nightly-gpu
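
The resulting image can then be tested interactively on the UI; for instance, a quick check of the TensorFlow version inside the container (using gpurun for GPU access and the --nv flag to expose the NVIDIA driver) might look like:

$ gpurun singularity exec --nv ~/s.images/tensorflow-nightly-gpu python3 -c "import tensorflow as tf; print(tf.__version__)"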

HEP Scientific Software

CVMFS: HEP Software distribution

We adopt CVMFS as the main HEP software distribution method. The software packages are distributed in different repositories maintained by the different contributors, and are accessible as locally mounted /cvmfs points on User Interfaces (UI) and Worker Nodes.

The current repositories that can be found are the following:

CERN/SFT Repositories
External software packages are taken from sources external to PH/SFT. They are recompiled, if possible and necessary, on all SFT-provided platforms. External software packages are provided for many different areas, such as:
  • General tools (debugging, testing)
  • Graphics
  • Mathematical Libraries
  • Databases
  • Scripting Languages and modules
  • Grid middleware
  • Compilers

An exhaustive list of all provided packages and the supported platforms is available at http://lcginfo.cern.ch.

The 'lcgenv' configuration tool can be used to set up the environment for the desired tool: https://gitlab.cern.ch/GENSER/lcgenv

E.g., the following example sets the environment variables needed to use the ROOT, GSL and Boost libraries:
export LCGENV_PATH=/cvmfs/sft.cern.ch/lcg/releases/
eval "` $LCGENV_PATH/lcgenv/latest/lcgenv -p LCG_93 x86_64-slc6-gcc62-opt ROOT `"
eval "` $LCGENV_PATH/lcgenv/latest/lcgenv -p LCG_93 x86_64-slc6-gcc62-opt GSL `"
eval "` $LCGENV_PATH/lcgenv/latest/lcgenv -p LCG_93 x86_64-slc6-gcc62-opt Boost`"

Other Repositories

Other CERN CVMFS repositories, maintained by their respective owners, are available at the following mount points, as detailed in the CVMFS repositories list:

/cvmfs/atlas.cern.ch
/cvmfs/lhcb.cern.ch

Local Installed software

  • NVIDIA Drivers :
    • Installed releases: nvidia-driver-418.67
  • CUDA Toolkit : The NVIDIA® CUDA® Toolkit provides a development environment for creating high performance GPU-accelerated applications. With the CUDA Toolkit, you can develop, optimize and deploy your applications on GPU-accelerated embedded systems, desktop workstations, enterprise data centers, cloud-based platforms and HPC supercomputers. The toolkit includes GPU-accelerated libraries, debugging and optimization tools, a C/C++ compiler and a runtime library to deploy your application.
    • Installed releases: 9.1, 10.0, 10.1
  • GPU-Accelerated libraries: NVIDIA GPU-accelerated libraries provide highly-optimized functions that perform 2x-10x faster than CPU-only alternatives.
    • Installed releases: 9.1, 10.0, 10.1

  • Compilers: python2.7, python3.6, gcc4.8.5

  • Tensorflow (https://www.tensorflow.org/): end-to-end open source platform for machine learning.
    • Installed releases: r1.13 (python3)

  • Other Scientific Libraries: scipy, numpy, atlas, blas, lapack
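
To check which of these components are visible in your current UI session, the usual version queries can be used, for example:

$ nvidia-smi           # installed driver and GPU status
$ nvcc --version       # CUDA toolkit version (if nvcc is in your PATH)
$ python3 --version
$ gcc --version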

Usage recipes

Development on User Interface

We use a basic TensorFlow example for development and GPU usage, as described here.

  1. Create a python file named 'tf_helloworld.py' :
    import tensorflow as tf
    # Creates a graph.
    a = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[2, 3], name='a')
    b = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[3, 2], name='b')
    c = tf.matmul(a, b)
    # Creates a session with log_device_placement set to True.
    sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))
    # Runs the op.
    print(sess.run(c))
    
    This program performs a basic matrix multiplication and will help us find out which devices the operations and tensors are assigned to.
  2. Try to execute directly, and it will fail:
    $ python3 tf_helloworld.py
    2019-06-19 12:15:31.077835: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA
    2019-06-19 12:15:31.134991: W tensorflow/compiler/xla/service/platform_util.cc:240] unable to create StreamExecutor for CUDA:0: failed initializing StreamExecutor for CUDA device ordinal 0: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_INVALID_DEVICE: invalid device ordinal
    2019-06-19 12:15:31.135283: F tensorflow/stream_executor/lib/statusor.cc:34] Attempting to fetch value instead of handling error Internal: no supported devices found for platform CUDA
    Aborted
    
    The execution on local GPUs is not directly possible, and we will get errors like CUDA_ERROR_INVALID_DEVICE.
  3. We must use the 'gpurun' tool to execute programs that use the GPU on the User Interface. Simply add 'gpurun' to the previous command:
    $ gpurun python3 tf_helloworld.py
    Connected
    Info: OK 0. Tesla P100-PCIE-12GB [00000000:5E:00.0]
    
    Queued clients:0 Estimated waiting time:0 seconds
    GPU reserved:300 seconds granted
    Details: - Device 0. Tesla P100-PCIE-12GB [00000000:5E:00.0] set to compute mode:Exclusive Process
    Info: Executing program: python3
    2019-06-19 12:18:32.724854: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA
    2019-06-19 12:18:32.904065: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x504aa40 executing computations on platform CUDA. Devices:
    2019-06-19 12:18:32.904157: I tensorflow/compiler/xla/service/service.cc:158]   StreamExecutor device (0): Tesla P100-PCIE-12GB, Compute Capability 6.0
    2019-06-19 12:18:32.911125: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2100000000 Hz
    2019-06-19 12:18:32.921467: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x51c70e0 executing computations on platform Host. Devices:
    2019-06-19 12:18:32.921533: I tensorflow/compiler/xla/service/service.cc:158]   StreamExecutor device (0): <undefined>, <undefined>
    2019-06-19 12:18:32.921899: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 0 with properties: 
    name: Tesla P100-PCIE-12GB major: 6 minor: 0 memoryClockRate(GHz): 1.3285
    pciBusID: 0000:5e:00.0
    totalMemory: 11.91GiB freeMemory: 11.66GiB
    2019-06-19 12:18:32.921966: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0
    2019-06-19 12:18:32.923730: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
    2019-06-19 12:18:32.923766: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990]      0 
    2019-06-19 12:18:32.923791: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0:   N 
    2019-06-19 12:18:32.923995: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 11338 MB memory) -> physical GPU (device: 0, name: Tesla P100-PCIE-12GB, pci bus id: 0000:5e:00.0, compute capability: 6.0)
    Device mapping:
    /job:localhost/replica:0/task:0/device:XLA_GPU:0 -> device: XLA_GPU device
    /job:localhost/replica:0/task:0/device:XLA_CPU:0 -> device: XLA_CPU device
    /job:localhost/replica:0/task:0/device:GPU:0 -> device: 0, name: Tesla P100-PCIE-12GB, pci bus id: 0000:5e:00.0, compute capability: 6.0
    2019-06-19 12:18:32.935916: I tensorflow/core/common_runtime/direct_session.cc:317] Device mapping:
    /job:localhost/replica:0/task:0/device:XLA_GPU:0 -> device: XLA_GPU device
    /job:localhost/replica:0/task:0/device:XLA_CPU:0 -> device: XLA_CPU device
    /job:localhost/replica:0/task:0/device:GPU:0 -> device: 0, name: Tesla P100-PCIE-12GB, pci bus id: 0000:5e:00.0, compute capability: 6.0
    
    MatMul: (MatMul): /job:localhost/replica:0/task:0/device:GPU:0
    2019-06-19 12:18:32.938232: I tensorflow/core/common_runtime/placer.cc:1059] MatMul: (MatMul)/job:localhost/replica:0/task:0/device:GPU:0
    a: (Const): /job:localhost/replica:0/task:0/device:GPU:0
    2019-06-19 12:18:32.938286: I tensorflow/core/common_runtime/placer.cc:1059] a: (Const)/job:localhost/replica:0/task:0/device:GPU:0
    b: (Const): /job:localhost/replica:0/task:0/device:GPU:0
    2019-06-19 12:18:32.938324: I tensorflow/core/common_runtime/placer.cc:1059] b: (Const)/job:localhost/replica:0/task:0/device:GPU:0
    [[22. 28.]
     [49. 64.]]
    
If there are no other clients waiting for the local GPU, the program will be executed immediately; otherwise it will wait and print the expected waiting time:
Queued clients:0 Estimated waiting time:0 seconds
Bear in mind that the User Interface GPU is accessed exclusively (only one program can execute at a time), so user programs are terminated after a short time limit (5 minutes).

After execution we will see the 2x2 matrix resulting from the multiplication, and the TensorFlow logs that tell us where the execution took place: the operands a and b and the MatMul operation were all placed on GPU 0, the Tesla P100-PCIE-12GB installed in the User Interface.
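
If you just want to inspect the UI GPU interactively, the same mechanism should work with any GPU-aware command, for instance:

$ gpurun nvidia-smi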

Submit CPU only job: Data Preparation use case

Data preparation and augmentation jobs can be run in parallel using CPU-only resources in order to speed up the process. There are some basic tasks that can be done in a first step, before all the GPU jobs, and more complex techniques like Data Pipelining.

Here we propose a basic example that executes data preparation tasks requesting CPU-only resources. We use the CIFAR10 images dataset.

  1. First, we prepare the environment with the latest TensorFlow and Keras:
    virtualenv -p `which python3` venv
    source venv/bin/activate
    pip install keras tensorflow
    
  2. Download the executable 01_aug_cifar10.py, which performs the real work. It uses the Keras libraries and the CIFAR10 dataset. Note that you need to remove the .txt suffix from the filename.
  3. Create the script that will setup the environment and run the executable, named '01_aug_cifar10.sh':
    #!/bin/bash
    source venv/bin/activate
    python 01_aug_cifar10.py
    
  4. Prepare the condor submission file to execute on the CPU test resources. With this we can have a high-priority, low-resources test:
    universe = vanilla
    executable              = 01_aug_cifar10.sh
    arguments               = 
    log                     = test.log
    output                  = outfile.$(Cluster).$(Process).out
    error                   = errors.$(Cluster).$(Process).err
    
    +testJob = True
    queue
    
    Note the '+testJob = True' parameter to select the test resources.

  5. Submit the condor job:
    $ condor_submit 01_aug_cifar10.sub
    Submitting job(s).
    1 job(s) submitted to cluster 665.
    
  6. Check that the output is correct. Once validated, more CPU resources can be requested by replacing 'testJob' with more demanding CPU requirements in the condor file:
    request_Cpus        = 56
    request_Memory      = 134964
    

Submit Basic GPU Production Job: Worker Nodes GPUs and Environment info

After development, we can submit production jobs that use one or more production GPUs with higher time limits, not only the User Interface GPU.

In this example we will submit a job requesting a single GPU and retrieving basic environment information, which can help debug many issues.

Check HTCondor submission file:

$ cat 01_test.sub

universe = vanilla

executable              = 01_test.sh
arguments               = $(Process)

log                     = test.log
output                  = outfile.$(Cluster).$(Process).out
error                   = errors.$(Cluster).$(Process).err

request_gpus = 1

queue

Check the executable script:
$ cat 01_test.sh 
#!/bin/sh
echo ">>>> ENVIRONMENT"
printenv
echo ">>>> HOST"
hostname
echo ">>>> CURRENT DIR"
pwd
echo ">>>> USER"
whoami 
echo ">>>> SPACE LEFT"
df -h
echo ">>>> NVIDIA INFO"
set -x #echo on
nvidia-smi

Submit the job:

$ condor_submit 01_test.sub
Submitting job(s).
1 job(s) submitted to cluster 548.

After submitting the job, the output of the script will be recorded in the 'output' and/or 'error' files defined in the parameters. The 'log' file contains information about the HTCondor processing of the job.

>>>> ENVIRONMENT
_CONDOR_JOB_PIDS=
_CONDOR_ANCESTOR_186061=51504:1560788291:637816246
TMPDIR=/var/lib/condor/execute/dir_51504
_CONDOR_ANCESTOR_186024=186061:1560104447:3266347792
_CONDOR_SCRATCH_DIR=/var/lib/condor/execute/dir_51504
_CHIRP_DELAYED_UPDATE_PREFIX=Chirp*
TEMP=/var/lib/condor/execute/dir_51504
BATCH_SYSTEM=HTCondor
_CONDOR_CHIRP_CONFIG=/var/lib/condor/execute/dir_51504/.chirp.config
_CONDOR_ANCESTOR_51504=51505:1560788292:823522762
PWD=/lhome/ific/a/alferca/tutorial_ific/01_BasicCondorGPU
_CONDOR_AssignedGpus=CUDA0
CUDA_VISIBLE_DEVICES=0
_CONDOR_SLOT=slot3
SHLVL=1
_CONDOR_MACHINE_AD=/var/lib/condor/execute/dir_51504/.machine.ad
TMP=/var/lib/condor/execute/dir_51504
GPU_DEVICE_ORDINAL=0
OMP_NUM_THREADS=32
_CONDOR_JOB_AD=/var/lib/condor/execute/dir_51504/.job.ad
_CONDOR_JOB_IWD=/lhome/ific/a/alferca/tutorial_ific/01_BasicCondorGPU
_=/usr/bin/printenv
>>>> HOST
mlwn01.ific.uv.es
>>>> CURRENT DIR
/lhome/ific/a/alferca/tutorial_ific/01_BasicCondorGPU
>>>> USER
alferca
>>>> SPACE LEFT
Filesystem                    Size  Used Avail Use% Mounted on
/dev/mapper/centos-root       100G   20G   81G  20% /
devtmpfs                      189G     0  189G   0% /dev
tmpfs                         189G     0  189G   0% /dev/shm
tmpfs                         189G     0  189G   0% /sys/fs/cgroup
tmpfs                         189G   19M  189G   1% /run
tmpfs                          38G     0   38G   0% /run/user/0
/dev/md0                      1.8T  122G  1.7T   7% /tdata
/dev/md126p1                 1014M  267M  748M  27% /boot
147.156.116.235@tcp:/ific2fs   26T  217M   25T   1% /lustre/ific.uv.es
147.156.116.235@tcp:/atl3fs   217T   52T  156T  25% /lustre/ific.uv.es/grid/atlas/t3
147.156.116.235@tcp:/prjfs    127T   53T   73T  42% /lustre/ific.uv.es/prj
147.156.116.235@tcp:/mlfs     181T  364G  179T   1% /lustre/ific.uv.es/ml
147.156.116.235@tcp:/homefs    55T  105G   54T   1% /lhome
>>>> NVIDIA INFO
Mon Jun 17 18:18:12 2019       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.67       Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-PCIE...  Off  | 00000000:5E:00.0 Off |                    0 |
| N/A   30C    P0    28W / 250W |     11MiB / 32480MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

In the previous output of the job we can find some interesting information about our finished job:

  • Environment: including the variables set by HTCondor, and the CUDA devices visible to our process (only 1 GPU, as requested).
  • Execution machine: host and slot (our job ran in slot3 of machine mlwn01.ific.uv.es).
  • Current dir: the directory from which the job is started.
  • Storage: mounted filesystems and free space (home and project filesystems are mounted; AFS is not mounted on Worker Nodes).
  • NVIDIA GPU info: Tesla V100-PCIE with 32GB, not running a real workload in this example.

Development and Multi-GPU TensorFlow example

For this test we are going to use a benchmark available here. This software uses by default the installed python3 and TensorFlow installation, performing several benchmark tests on the requested GPUs.

Download the benchmark source code:

git clone https://github.com/lambdal/lambda-tensorflow-benchmark.git --recursive

This will download the code into the directory 'lambda-tensorflow-benchmark'.

First we are going to execute the benchmark locally on the User Interface (UI). However, running it directly we will get an error because the GPU is not available (CUDA_ERROR_INVALID_DEVICE):

2019-06-18 15:48:58.336293: W tensorflow/compiler/xla/service/platform_util.cc:240] unable to create StreamExecutor for CUDA:0: failed initializing StreamExecutor for CUDA device ordinal 0: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_INVALID_DEVICE: invalid device ordinal
2019-06-18 15:48:58.336474: F tensorflow/stream_executor/lib/statusor.cc:34] Attempting to fetch value instead of handling error Internal: no supported devices found for platform CUDA
Fatal Python error: Aborted

This is because we must use the 'gpurun' tool to reserve exclusive use of the GPU on the User Interface for our developments. This exclusive use is limited in time, in order to allow users to share the interactive GPU.

gpurun ./benchmark.sh

Now, using 'gpurun', we have exclusive access to the User Interface GPU for 5 minutes. We can produce test results, but after this time our program will be terminated:

2019-06-18 15:34:27.026687: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 11338 MB memory) -> physical GPU (device: 0, name: Tesla P100-PCIE-12GB, pci bus id: 0000:5e:00.0, compute capability: 6.0)
I0618 15:34:28.183031 140514472683328 session_manager.py:491] Running local_init_op.
I0618 15:34:28.235220 140514472683328 session_manager.py:493] Done running local_init_op.
2019-06-18 15:34:29.326831: I tensorflow/stream_executor/dso_loader.cc:152] successfully opened CUDA library libcublas.so.10.0 locally
TensorFlow:  1.13
Model:       resnet50
Dataset:     imagenet (synthetic)
Mode:        training
SingleSess:  False
Batch size:  64 global
             64.0 per device
Num batches: 100
Num epochs:  0.00
Devices:     ['/gpu:0']
Data format: NCHW
Optimizer:   sgd
Variables:   replicated
AllReduce:   None
==========
Generating model
Running warm up
Done warm up
Step   Img/sec   total_loss
1   images/sec: 214.8 +/- 0.0 (jitter = 0.0)   8.348
10   images/sec: 214.8 +/- 0.1 (jitter = 0.1)   8.143
20   images/sec: 214.8 +/- 0.1 (jitter = 0.2)   8.436
30   images/sec: 214.9 +/- 0.1 (jitter = 0.2)   8.151
40   images/sec: 214.7 +/- 0.1 (jitter = 0.2)   8.481
50   images/sec: 214.7 +/- 0.1 (jitter = 0.2)   8.334
60   images/sec: 214.7 +/- 0.1 (jitter = 0.2)   8.289
70   images/sec: 214.7 +/- 0.1 (jitter = 0.2)   7.996
80   images/sec: 214.7 +/- 0.1 (jitter = 0.2)   8.246
90   images/sec: 214.7 +/- 0.0 (jitter = 0.2)   8.303
100   images/sec: 214.8 +/- 0.0 (jitter = 0.2)   8.109
----------------------------------------------------------------
total images/sec: 214.70
----------------------------------------------------------------
--optimizer=sgd --model=resnet50 --num_gpus=1 --batch_size=64 --variable_update=replicated --distortions=false --num_batches=100 --data_name=imagenet
Terminated

If we need to run longer than this test time, we make use of the Job Management System HTCondor to submit our production batch jobs. In this case, we build a submission file:

universe = vanilla

initialdir      = lambda-tensorflow-benchmark
executable              = 01_test_benchmark.sh
arguments               = 
log                     = test.log
output                  = outfile.$(Cluster).$(Process).out
error                   = errors.$(Cluster).$(Process).err

request_gpus = 2
queue

And the executable script:

#!/bin/bash
echo ">>>>> EXECUTE PAYLOAD"
/bin/bash benchmark.sh ${CUDA_VISIBLE_DEVICES}

  • initialdir: where to start executing, which will be the downloaded directory.
  • executable and arguments: the arguments are specific to this benchmark.sh executable and represent the GPU ids to use (we are requesting 2 GPUs). We use the environment variable CUDA_VISIBLE_DEVICES, which contains the id list of the GPUs reserved for the job. We cannot access this variable from within the condor submit file (CUDA is not initialized yet), so we create a script '01_test_benchmark.sh' that passes the required arguments.
  • request_gpus: 2 GPUs requested in this case.

The job will run for several minutes, using the installed python3 and TensorFlow installation and performing several benchmark tests on the requested GPUs. The output will be produced in a results directory named after the CPU-GPU combination where it was executed, for example '8180-Tesla_V100-SXM2-32GB.logs'.

Tensorflow and Keras within a Container

There are situations where the installed software packages do not contain the latest developments or the features needed by our programs. In this case we can download (or build) a Docker/Singularity container that will distribute the needed software to the Worker Nodes. This scenario is supported by HTCondor, following the next steps:

  1. Follow the recipe to download the latest tensorflow container with GPU support, as described here.
  2. Check the HTCondor submit file and submit.

HTCondor submit file:

universe = vanilla

executable              = /usr/bin/python
arguments               = "$ENV(HOME)/tutorial_ific/02_BasicContainer/02_tf_tuto1.py"

log                     = test.log
output                  = outfile.$(Cluster).$(Process).out
error                   = errors.$(Cluster).$(Process).err

+SingularityImage = "$ENV(HOME)/s.images/tensorflow-nightly-gpu"
+SingularityBind = "/lustre:/lustre"

request_GPUs = 1

queue
  • With the '+SingularityImage' variable we select the image that was previously downloaded, in this case the latest nightly build of TensorFlow with GPU support.
  • '+SingularityBind' lets us mount other paths, in this case /lustre, which contains the project disk space. The home disk space /lhome is mounted by default.
  • The 'executable' and 'arguments' contain full paths, since the singularity image may start in a different directory (probably $HOME) than the current submission directory.

The executable is python and the program executed is a TensorFlow example (02_tf_tuto1.py) from the TensorFlow tutorial:

import tensorflow as tf
mnist = tf.keras.datasets.mnist

(x_train, y_train),(x_test, y_test) = mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

model = tf.keras.models.Sequential([
  tf.keras.layers.Flatten(input_shape=(28, 28)),
  tf.keras.layers.Dense(512, activation=tf.nn.relu),
  tf.keras.layers.Dropout(0.2),
  tf.keras.layers.Dense(10, activation=tf.nn.softmax)
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

model.fit(x_train, y_train, epochs=5)
model.evaluate(x_test, y_test)

The program will execute in less than 1 minute and train an example network with the MNIST dataset. Check in the output (err file) that the TensorFlow device was created correctly and is running on one of the physical GPUs:

2019-06-18 07:25:52.395846: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1331] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 30555 MB memory) -> physical GPU (device: 0, name: Tesla V100-PCIE-32GB, pci bus id: 0000:5e:00.0, compute capability: 7.0)
2019-06-18 07:25:55.156858: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcublas.so.10.0

FAQ: Frequently Asked Questions

Job Submission error "Hold reason: Error ... (errno=13: 'Permission denied')"

This error happens when submitting with HTCondor and the 'executable' defined in the .sub file does not have execution permissions (+x).

In this case the job enters the "Hold" state (as can be seen with condor_q) and stays there. You can remove the job with 'condor_rm'.
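
The hold reason can also be inspected directly with condor_q; for example, for a held job <cluster>.<process> (-af prints the requested job ClassAd attribute):

$ condor_q -hold
$ condor_q <cluster>.<process> -af HoldReason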

To solve it, simply set the correct file permissions on your executable before submitting jobs. For example, for a script called 'executable.sh':

chmod +x executable.sh

-- AlvaroFernandez - 17 Jun 2019

Topic attachments:
  • 01_aug_cifar10.py.txt (3.3 K, 19 Jun 2019, AlvaroFernandez): Data augmentation for the CIFAR10 dataset.