
Artemisa Usage Guide

Introduction

Artemisa is a high-performance computing infrastructure based on GPGPU hardware accelerators, together with the supporting networking and storage infrastructure, for running complex scientific batch jobs.

In these pages we introduce the hardware, the working environment and the usage recipes for the end users of this infrastructure.

Infrastructure

When logging in to one of the User Interface machines, a message of the day similar to the following is shown:

==========================================================

Welcome to 
    _         _                 _           
   / \   _ __| |_ ___ _ __ ___ (_)___  __ _ 
  / _ \ | '__| __/ _ \ '_ ` _ \| / __|/ _` |
 / ___ \| |  | ||  __/ | | | | | \__ \ (_| |
/_/   \_\_|   \__\___|_| |_| |_|_|___/\__,_|
                                            
The IFIC's AI and ML infraestructure

This is mlui02.ific.uv.es running CentOS 7.9.2009
Please,   read the "Politica de Seguridad   de la Informacion"

https://artemisa.ific.uv.es/static/Politica_Seguridad.pdf

==========================================================
**ATTENTION**
CUDA Version: 12.0  Driver Version: 525.85.12
Execute GPUstatus to get a summary on GPU utilization
==========================================================

All the infrastructure runs the Linux CentOS 7 operating system, a completely free Linux distribution based on Red Hat Enterprise Linux (RHEL) 7.

For more information about the operating system and its use, refer to the official CentOS sources.

The hardware nodes of the infrastructure are divided into three classes:

  • User Interfaces (UI): the entry point for users; they provide a working environment in which to compile and test programs. These machines have a GPGPU and access to the storage services. When the jobs are ready, users can submit their production batch jobs to the Worker Nodes through the Job Management System.
  • Worker Nodes (WN): where the production user jobs are executed. They contain high-end CPUs, a large memory configuration, and up to 8 high-end GPGPUs to execute the jobs.
  • Storage Nodes: disk servers that store user and project data, accessible from both User Interfaces and Worker Nodes.

The current detailed hardware configuration can be found on this page.

Working environment

  • Users access the User Interface (UI) nodes, where they develop and submit production jobs. These nodes contain a complete development environment in which to test and validate programs using an entry-level local GPU. Access to this local GPU is intended for development and validation jobs, and is therefore granted exclusively for a limited time (5-minute slots).

  • After validation, users can submit their codes as production batch jobs using the Job Management System HTCondor. This provides access to the computing cluster, with high-end GPUs (nodes with up to 8 GPUs interconnected with high-speed NVLink), where longer execution times are allowed.

User Interface (UI)

User Interfaces are development nodes, where users can test and validate their programs using the entry-level local GPU. After validation, they can submit their codes as production batch jobs using the Job Management System HTCondor.

Authorized users can log in with the ssh protocol to the User Interface machines (mlui01.ific.uv.es, mlui02.ific.uv.es).
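
For example, assuming a valid Artemisa account, a session can be opened with (replace <username> with your own account name):

$ ssh <username>@mlui01.ific.uv.es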

The current list of UI machines and detailed hardware configuration can be found in this page.

The user HOME directory resides in the Lustre filesystem (see the Storage section):

/lhome/ific/<initial_letter>/<username>   <= for IFIC users
/lhome/ext/<groupid>/<username>  <= for external users

In addition to the user HOME directory, there is PROJECT space available upon request, also accessible from the UI:

/lustre/ific.uv.es/ml/<groupid>
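
To check the available space on these filesystems from a UI, standard tools can be used; for example (illustrative, the exact quota setup depends on the local configuration):

$ df -h /lhome /lustre/ific.uv.es/ml
$ lfs quota -h -u $USER /lustre/ific.uv.es/ml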

Software is accessible in the User Interfaces as described in the Software section.

Gpurun: request exclusive access to User Interface GPU

As GPUs are an exclusive resource, users have to reserve the local User Interface GPU before using it interactively.

To do so, use the 'gpurun' tool, which grants usage slots valid for 5 minutes.

$ gpurun -i
Connected
Info: OK 0. Tesla V100-PCIE-16GB [00000000:5E:00.0]
1. Tesla V100-PCIE-16GB [00000000:86:00.0]

Total clients:0 Running:0 Estimated waiting time:0 seconds

If other users are already executing, the command will wait synchronously until it can be executed.

Job Management System: HTCondor

HTCondor is the resource management system that runs in this cluster. It manages the job workflow and allows the users to send jobs to be executed in the worker nodes. Direct access to worker nodes is not allowed.

Each worker node has a partitionable slot that accepts jobs to be processed. HTCondor deals with job sorting and processing. Slots are divided when the job does not require all node resources, so more jobs can be run in the node. CPU and Memory resources are subtracted in chunks from the main slot. 0, 1, 2, 4 or 8 GPU requests are permitted.

HTCondor tries to run jobs from different users in a fair-share way. Job priorities among users take into account the time previously consumed by each user, so CPU time is assigned evenly between all users.

The complete HTCondor manual can be found here

The current Artemisa HTCondor configuration can be found here.

Job submission

Job description file
Before sending a job, you have to prepare a Job Description File, which specifies what you want to execute, where you want to write your output, and what your job requirements are.

A simple job description file might be (test.sub):

universe = vanilla

executable          = test.sh
arguments           = $(Cluster) $(Process)

log                 = test.log
output              = outfile.$(Cluster).$(Process).out
error               = errors.$(Cluster).$(Process).err

request_Cpus        = 4
request_Memory      = 4000

queue 2

In this file:

  • universe: specifies how HTCondor is going to deal with the job. Use vanilla here.
  • $(Cluster): HTCondor assigns a consecutive number to each group of jobs sent at the same time.
  • $(Process): identifies a specific job inside the group. Process numbers start from 0.
  • executable: tells HTCondor what to execute. This can be a binary or a script.
  • arguments: the arguments HTCondor will pass to the executable. In this example, the cluster and process job identifiers.
  • log: the file where HTCondor will log the job processing information.
  • output: where the job standard output is going to be written. It is important that each job has its own output file, since the file will be overwritten if it already exists. In this example, the file name contains the Cluster and Process identifiers so that it is unique.
  • error: same as output, but for the standard error.
  • request_Cpus: how many cores the job needs. It is important to be accurate, since this requirement is used by HTCondor to distribute the available resources among all users.
  • request_Memory: same as CPUs, but for memory. The unit here is MB!
  • queue: number of jobs to be sent to the system using this job configuration.

This is a very simple job, but the Job Description File can be richer, and other options may be more appropriate for other user workloads.
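
The 'executable' referenced above (test.sh) is a user-provided script and is not part of this page; a minimal sketch that simply reports the identifiers passed as arguments might be:

#!/bin/bash
# Illustrative payload: print the Cluster and Process identifiers received as arguments
echo "Cluster id: $1"
echo "Process id: $2"
hostname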

Actual job submission (condor_submit)

When the Job Description File is ready, we can send the jobs to the system:

$ condor_submit test.sub
Submitting job(s)..
2 job(s) submitted to cluster 110.

110 is the Cluster identification for this group of jobs.

Get the jobs status (condor_q)

To get the job status use:

$ condor_q

-- Schedd: xxx.ific.uv.es : <xx.xx.xx.xx:9618?... @ 03/12/19 11:44:28
OWNER    BATCH_NAME    SUBMITTED   DONE   RUN    IDLE  TOTAL JOB_IDS
fulano ID: 110      3/12 11:44      _      _      2      2 110.0-1

Total for query: 2 jobs; 0 completed, 0 removed, 2 idle, 0 running, 0 held, 0 suspended 
Total for fulano: 2 jobs; 0 completed, 0 removed, 2 idle, 0 running, 0 held, 0 suspended 
Total for all users: 2 jobs; 0 completed, 0 removed, 2 idle, 0 running, 0 held, 0 suspended

Here you can see the jobs in the IDLE state, waiting to be considered by HTCondor for running. The log file begins to record information about the HTCondor processing.

When the jobs start running, you can see the output and error files being written, since at IFIC the file system (Lustre) is visible everywhere in the cluster.

Removing a job (condor_rm)

A job can be removed from the system using its cluster identification. If several jobs have been sent, a specific job can be removed using its Process identification.

$ condor_rm 110.1
Job 110.1 marked for removal
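
To remove all the jobs of a cluster at once, the cluster identification alone can be used, for example:

$ condor_rm 110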

Checking cluster status (condor_status)

You can get the cluster (pool) status with:

$ condor_status
Name                    OpSys      Arch   State     Activity LoadAv Mem     ActvtyTime

slot1@mlwn01.ific.uv.es LINUX      X86_64 Unclaimed Idle      0.000  16384  7+20:20:48
slot2@mlwn01.ific.uv.es LINUX      X86_64 Unclaimed Idle      0.000 134964  7+20:21:04
slot3@mlwn01.ific.uv.es LINUX      X86_64 Claimed   Busy      1.030 231368  0+00:06:00
slot1@mlwn02.ific.uv.es LINUX      X86_64 Unclaimed Idle      0.000  16384  7+20:21:07
slot2@mlwn02.ific.uv.es LINUX      X86_64 Unclaimed Idle      0.000 134964  7+20:21:18
slot3@mlwn02.ific.uv.es LINUX      X86_64 Unclaimed Idle      0.000 231368  5+19:46:37
slot1@mlwn03.ific.uv.es LINUX      X86_64 Unclaimed Idle      0.000 309074  7+20:17:45
slot2@mlwn03.ific.uv.es LINUX      X86_64 Unclaimed Idle      0.000 463611  7+20:17:57

               Machines Owner Claimed Unclaimed Matched Preempting  Drain

  X86_64/LINUX        9     0       1         8       0          0      0

         Total        9     0       1         8       0          0      0

Storage

Storage is maintained in several disk servers, as detailed on this page.

A distributed Lustre filesystem is shared and mounted in the different nodes of the cluster, including User Interfaces (UI) and Worker Nodes.

This means that all data is directly available on all nodes, and no explicit file transfer is needed for it to be accessible from the worker nodes.

This includes user homes and project areas.

Containers

Containers are a form of software distribution that is very convenient for developers and users.

We support Singularity, as it is secure and supports several container types, including Docker images and access to Docker Hub.

The current distribution documentation for users can be found here.

Example: download the latest TensorFlow nightly GPU container from Docker Hub and convert it into a Singularity image for later use:

$ mkdir ~/s.images
$ cd ~/s.images
$ singularity build tensorflow-nightly-gpu docker://tensorflow/tensorflow:nightly-gpu
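
To quickly check the resulting image on a UI (GPU access on the UI requires the gpurun tool described above), something like the following can be used (illustrative):

$ gpurun singularity exec --nv ~/s.images/tensorflow-nightly-gpu python3 -c "import tensorflow as tf; print(tf.__version__)"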

HEP Scientific Software

CVMFS: HEP Software distribution

We adopt CVMFS as the main HEP software distribution method. The software packages are distributed in different repositories maintained by the different contributors, and are accessible as locally mounted /cvmfs mount points on User Interfaces (UI) and Worker Nodes.

The current repositories that can be found are the following:

CERN/SFT Repositories
External software packages are taken from sources external to PH/SFT. They are recompiled, if possible and necessary, on all SFT-provided platforms. External software packages are provided for many different areas, such as:
  • General tools (debugging, testing)
  • Graphics
  • Mathematical Libraries
  • Databases
  • Scripting Languages and modules
  • Grid middleware
  • Compilers

An exhaustive list of all provided packages and the supported platforms is available at http://lcginfo.cern.ch.

The 'lcgenv' configuration tool can be used to set up the environment for the desired tool: https://gitlab.cern.ch/GENSER/lcgenv

For example, the following commands set the environment variables needed to use the ROOT, GSL and Boost libraries:
export LCGENV_PATH=/cvmfs/sft.cern.ch/lcg/releases/
eval "` $LCGENV_PATH/lcgenv/latest/lcgenv -p LCG_93 x86_64-slc6-gcc62-opt ROOT `"
eval "` $LCGENV_PATH/lcgenv/latest/lcgenv -p LCG_93 x86_64-slc6-gcc62-opt GSL `"
eval "` $LCGENV_PATH/lcgenv/latest/lcgenv -p LCG_93 x86_64-slc6-gcc62-opt Boost`"

Other Repositories

Other CERN CVMFS repositories, maintained by their respective owners, are available at the following mount points, as detailed in the CVMFS repositories list:

/cvmfs/atlas.cern.ch
/cvmfs/lhcb.cern.ch

Local Installed software

  • NVIDIA Drivers :
    • Installed releases: 495.29.05
  • CUDA Toolkit: The NVIDIA® CUDA® Toolkit provides a development environment for creating high performance GPU-accelerated applications. With the CUDA Toolkit, you can develop, optimize and deploy your applications on GPU-accelerated embedded systems, desktop workstations, enterprise data centers, cloud-based platforms and HPC supercomputers. The toolkit includes GPU-accelerated libraries, debugging and optimization tools, a C/C++ compiler and a runtime library to deploy your application.
    • Installed releases: 9.1, 10.0, 10.1, 10.2, 11.3, 11.4, 11.5
  • GPU-Accelerated libraries: NVIDIA GPU-accelerated libraries provide highly-optimized functions that perform 2x-10x faster than CPU-only alternatives.
    • Installed releases: 9.1, 10.0, 10.1, 10.2, 11.3, 11.4, 11.5

  • Compilers: python2.7.5, python3.6.8, gcc4.8.5

  • Tensorflow (https://www.tensorflow.org/): end-to-end open source platform for machine learning.
    • Installed releases: r1.13 (python3)

  • Other Scientific Libraries: scipy, numpy, atlas, blas, lapack

ARTEMISA TUTORIAL : Usage recipes

A basic ARTEMISA tutorial is available at the following git repository: Artemisa Tutorial

You can checkout with the following git command:

git clone https://igit.ific.uv.es/alferca/artemisa-tutorial

In this TWiki we copy a summary of the usage recipes from that repository.

Development on User Interface

Abstract

  • Hands on development on a User Interface (UI)
  • Write a small test program using TensorFlow
  • Custom local environments ( venv / conda / pip )
  • Interactive execution on the UI
  • GPURUN tool to gain exclusive access to the UI GPU

Introduction

We use a basic TensorFlow example for development and GPU usage, as described here: https://www.tensorflow.org/guide/gpu

- Create a python file named 'tf2_helloworld.py' :

 
from __future__ import print_function
import tensorflow as tf

# Log the device where tensor allocations and operations are placed
tf.debugging.set_log_device_placement(True)

# Tensorflow version
print("This is Tensorflow: ", tf.__version__)

# Check GPUs
print("Num GPUs Available: ", len(tf.config.list_physical_devices('GPU')))

# Creates a graph.
a = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[2, 3], name='a')
b = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[3, 2], name='b')
c = tf.matmul(a, b)
# Display
tf.print(c)

- Try to execute the program

Run the program with python3; it will use the system installation of Python 3 and TensorFlow.

$ python3 tf2_helloworld.py 
This is Tensorflow:  1.14.0
Traceback (most recent call last):
  File "tf2_helloworld.py", line 11, in <module>
    print("Num GPUs Available: ", len(tf.config.list_physical_devices('GPU')))
  File "/lhome/ific/a/alferca/.local/lib/python3.6/site-packages/tensorflow/python/util/deprecation_wrapper.py", line 106, in __getattr__
    attr = getattr(self._dw_wrapped_module, name)
AttributeError: module 'tensorflow._api.v1.config' has no attribute 'list_physical_devices'

If you require versions other than those provided by the system, you can use a custom local environment with your own software installation. See the next step.

Custom Local Environment

- We can use different tools for creating a custom local environment

Conda (https://conda.io) : Conda easily creates, saves, loads and switches between environments on your local computer. Conda as a package manager helps you find and install packages. If you need a package that requires a different version of Python, you do not need to switch to a different environment manager, because conda is also an environment manager. In its default configuration, conda can install and manage the thousands of packages at repo.anaconda.com that are built, reviewed and maintained by Anaconda.

Venv (https://docs.python.org/3/library/venv.html) : The venv module provides support for creating lightweight “virtual environments” with their own site directories, optionally isolated from system site directories. Each virtual environment has its own Python binary (which matches the version of the binary that was used to create this environment) and can have its own independent set of installed Python packages in its site directories.

PIP (https://pypi.org/project/pip/) : pip is the package installer for Python. You can use pip to install packages from the Python Package Index and other indexes.

- In this case we want to install another Python and TensorFlow version. At the time of writing this tutorial, this is TensorFlow v2.6.0.

Follow the Conda instructions to install its distribution locally: https://docs.conda.io/en/latest/miniconda.html#linux-installers
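
As an illustration (installer name and paths may change over time; follow the official instructions above), a typical Miniconda installation into the home directory looks like:

$ wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
$ bash Miniconda3-latest-Linux-x86_64.sh -b -p $HOME/miniconda3
$ eval "$($HOME/miniconda3/bin/conda shell.bash hook)"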

Use conda to create a virtual environment. We install the required packages (cudnn) but not yet TensorFlow, as at the time of writing this tutorial the release included in Conda does not support GPUs.

$ conda create -n artemisa-tutorial cudnn

Activate the environment

$ conda activate artemisa-tutorial

Install pip within conda, then install TensorFlow with pip. pip is the recommended way to get the latest release from the official sources:

(artemisa-tutorial) $ conda install pip
(artemisa-tutorial) $ pip install tensorflow

Execute previous program:

(artemisa-tutorial) [alferca@mlui02 01_User_Interface_Development]$  python3 tf2_helloworld.py
This is Tensorflow:  2.6.0
Num GPUs Available:  2
2021-10-27 12:55:05.008162: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-10-27 12:55:05.023252: F tensorflow/core/platform/statusor.cc:33] Attempting to fetch value instead of handling error Internal: failed initializing StreamExecutor for CUDA device ordinal 0: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_INVALID_DEVICE: invalid device ordinal
Aborted

As you can see, the version used is now TensorFlow 2.6.0.

However, execution on the local GPUs is not directly possible, and we will get errors like CUDA_ERROR_INVALID_DEVICE.

In the next section we see how to overcome this and gain interactive access to the UI GPU.

Interactive GPU exclusive access: GPURUN

The UI GPU is not directly accessible; the GPURUN tool has to be used.

Basically, prepend your command that accesses the GPU with the 'gpurun' tool:

(artemisa-tutorial) $ gpurun python3 tf2_helloworld.py
First nonoption argument is "python3" at argv[1]
Connected
Info: OK 0. Tesla V100-PCIE-16GB [00000000:5E:00.0]
1. Tesla V100-PCIE-16GB [00000000:86:00.0]

Total clients:0 Running:0 Estimated waiting time:0 seconds
GPU reserved:300 seconds granted
GPUID reserved:0 Details: - Device 0. Tesla V100-PCIE-16GB [00000000:5E:00.0] set to compute mode:Exclusive Process
Info: Executing program: python3
This is Tensorflow:  2.6.0
Num GPUs Available:  1
2021-10-27 13:05:23.086416: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-10-27 13:05:23.795174: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1510] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 14647 MB memory:  -> device: 0, name: Tesla V100-PCIE-16GB, pci bus id: 0000:5e:00.0, compute capability: 7.0
2021-10-27 13:05:23.948618: I tensorflow/core/common_runtime/eager/execute.cc:1161] Executing op _EagerConst in device /job:localhost/replica:0/task:0/device:GPU:0
2021-10-27 13:05:23.949655: I tensorflow/core/common_runtime/eager/execute.cc:1161] Executing op _EagerConst in device /job:localhost/replica:0/task:0/device:GPU:0
2021-10-27 13:05:23.949874: I tensorflow/core/common_runtime/eager/execute.cc:1161] Executing op Reshape in device /job:localhost/replica:0/task:0/device:GPU:0
2021-10-27 13:05:23.950036: I tensorflow/core/common_runtime/eager/execute.cc:1161] Executing op _EagerConst in device /job:localhost/replica:0/task:0/device:GPU:0
2021-10-27 13:05:23.950149: I tensorflow/core/common_runtime/eager/execute.cc:1161] Executing op _EagerConst in device /job:localhost/replica:0/task:0/device:GPU:0
2021-10-27 13:05:23.950203: I tensorflow/core/common_runtime/eager/execute.cc:1161] Executing op Reshape in device /job:localhost/replica:0/task:0/device:GPU:0
2021-10-27 13:05:23.950376: I tensorflow/core/common_runtime/eager/execute.cc:1161] Executing op MatMul in device /job:localhost/replica:0/task:0/device:GPU:0
2021-10-27 13:05:24.427560: I tensorflow/core/common_runtime/eager/execute.cc:1161] Executing op StringFormat in device /job:localhost/replica:0/task:0/device:CPU:0
2021-10-27 13:05:24.427990: I tensorflow/core/common_runtime/eager/execute.cc:1161] Executing op PrintV2 in device /job:localhost/replica:0/task:0/device:CPU:0
[[22 28]
 [49 64]]

If there are no other clients waiting for the local GPU, the program will be executed immediately; otherwise it will wait and print the expected waiting time:

Queued clients:0 Estimated waiting time:0 seconds

Bear in mind that User Interface GPUs are accessed exclusively (only one program can execute at a time), so user programs will be terminated after a short time limit (5 minutes).

After execution we will see the 2x2 matrix resulting from the multiplication.

[[22 28]
 [49 64]]

The TensorFlow logs give us clues about where the execution took place: the operands a and b and the MatMul operation have all been executed on GPU 0, the Tesla V100-PCIE-16GB installed in the User Interface.

Notes

  • The GPURUN tool is needed to access the GPU on a User Interface (UI)
  • Custom software/environment installations can be done in the user or project directories

CPU only job: Data Preparation use case

Abstract

  • Execute CPU only Jobs on the UI (without GPURUN)
  • Review Keras example for Data Augmentation
  • Submit a batch job with HTCondor (Job Management System)
  • Execute priority job on remote Worker Node with CPU only

Introduction

The ARTEMISA cluster has some resources reserved for CPU-only jobs.

Although modern frameworks like TensorFlow or Keras allow the creation of data pipelines that benefit in parallel from the usage of the CPU (data augmentation) and the GPU (training), some tasks such as data preparation and augmentation can be done in parallel using CPU-only resources, in order to speed up the process.

In this example we propose a basic exercise to execute data preparation tasks requesting CPU-only resources. We are going to build an image classifier. Because the MNIST dataset is a bit overused and too easy, we use the more challenging CIFAR-10 dataset. It consists of 32x32 pixel images with 10 classes. The data is split into 50k training and 10k test images.

First we will augment the data and store it locally. Use the assigned project space to test it.

Preparation

Activate the previously created environment:

$ conda activate artemisa-tutorial

Install needed packages

(artemisa-tutorial) $ pip install matplotlib
(artemisa-tutorial) $ pip install tensorflow-datasets
(artemisa-tutorial) $ pip install scipy

Check provided program:

(artemisa-tutorial) $ python3 tf2_keras_generate_data_CPUonly.py

This program forces execution on the CPU, so it can be run on the UI without the gpurun tool. It will create new augmented data (images) in a subdirectory.
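
The tutorial script handles the CPU pinning itself; as an illustration of an equivalent approach (not the tutorial's own mechanism), the GPUs can also be hidden from TensorFlow via the environment:

(artemisa-tutorial) $ CUDA_VISIBLE_DEVICES="" python3 tf2_keras_generate_data_CPUonly.py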

Execute the program in a remote WORKER NODE (with HTCondor)

- Check HTCondor submission file:

(artemisa-tutorial) $ cat tf2_keras_generate_data_CPUonly.sub
universe = vanilla
executable              = tf2_keras_generate_data_CPUonly.sh
arguments               = 
log                     = condor_logs/test.log
output                  = condor_logs/outfile.$(Cluster).$(Process).out
error                   = condor_logs/errors.$(Cluster).$(Process).err

# Needed to read .bashrc and conda environment
getenv = True

# TestJob CPU
+testJob = True
queue

Note the '+testJob = True' clause, which is needed to use the exclusive CPU resources.

- Check 'executable' that will be run on the Worker Node:

(artemisa-tutorial) $ cat tf2_keras_generate_data_CPUonly.sh
#!/bin/bash
EXERCISE_ENVIRONMENT="artemisa-tutorial"
eval "$(conda shell.bash hook)"
conda activate $EXERCISE_ENVIRONMENT
python3 tf2_keras_generate_data_CPUonly.py

Since the job will run on a remote Worker Node, we need to set up the same environment (conda). This script does that before invoking our real Python executable.

Submit the job (with HTCondor)

- Create the output directories (condor_logs) and submit the HTCondor job:

$ mkdir condor_logs

(artemisa-tutorial) [alferca@mlui02 02.2_Store_Data]$ condor_submit tf2_keras_generate_data_CPUonly.sub 
Submitting job(s).
1 job(s) submitted to cluster 1602

Check results

$ ls -ltr augmented_data/

Notes

  • ARTEMISA has CPU only high priority resources
  • Use '+testJob = True' to select the corresponding CPU only slots.
  • Use 'getenv = true' to maintain the user environment when submitting with HTCondor
  • When using custom environments, initialize the required environment in your 'executable' (e.g. conda) before running the real payload.

Submit Basic GPU Production Job: Worker Nodes GPUs and Environment info

Abstract

  • Submit a batch job with HTCondor (Job Management System)
  • Execute on remote Worker Node with a GPU
  • Obtain basic Worker Node environment information

Introduction

After development, we can submit production jobs that use one or more production GPUs with higher time limits, instead of only the User Interfaces.

In this example we will submit a job requesting a single GPU and retrieving basic environment information, which can help debug many issues.

Preparation

- Check HTCondor submission file:

$ cat 01_test.sub

universe = vanilla

executable              = 01_test.sh
arguments               = $(Process)

log                     = condor_logs/test.log
output                  = condor_logs/outfile.$(Cluster).$(Process).out
error                   = condor_logs/errors.$(Cluster).$(Process).err

request_gpus = 1

queue

With this file we request a node with 1 GPU and the execution of the script defined in the 'executable' parameter.

- Check 'executable' that will be run on the Worker Node:

$ cat 01_test.sh

#!/bin/sh
echo ">>>> ENVIRONMENT"
printenv
echo ">>>> HOST"
hostname
echo ">>>> CURRENT DIR"
pwd
echo ">>>> USER"
whoami 
echo ">>>> SPACE LEFT"
df -h
echo ">>>> NVIDIA INFO"
set -x #echo on
nvidia-smi

This shell script will retrieve the environment variables and some information about the Worker Node where it executes. With the nvidia-smi command we can obtain information about the GPUs installed on that node.

Submit the job

- Create the output directories (condor_logs) and submit the HTCondor job:

$ mkdir condor_logs

$ condor_submit 01_test.sub
Submitting job(s).
1 job(s) submitted to cluster 548.

Check results

After submitting the job, the output of the script will be recorded in the 'output' and/or 'error' files defined in the parameters. The 'log' file contains information about the HTCondor processing of the job.

- 'output' file contains the standard output of your executable:

$ cat condor_logs/outfile.131398.0.out

>>>> ENVIRONMENT
_CONDOR_JOB_PIDS=
TMPDIR=/var/lib/condor/execute/dir_112754
_CONDOR_ANCESTOR_4719=5207:1618416485:3499018107
_CONDOR_SCRATCH_DIR=/var/lib/condor/execute/dir_112754
_CHIRP_DELAYED_UPDATE_PREFIX=Chirp*
_CONDOR_ANCESTOR_5207=112754:1634894211:4224147812
TEMP=/var/lib/condor/execute/dir_112754
BATCH_SYSTEM=HTCondor
_CONDOR_CHIRP_CONFIG=/var/lib/condor/execute/dir_112754/.chirp.config
PWD=/lhome/ific/a/alferca/prep_tutorial_ific_cuda11.3/03_Submit_GPU_job
_CONDOR_AssignedGpus=CUDA0
CUDA_VISIBLE_DEVICES=0
_CONDOR_SLOT=slot2
_CONDOR_ANCESTOR_112754=112755:1634894211:3354999804
SHLVL=1
_CONDOR_MACHINE_AD=/var/lib/condor/execute/dir_112754/.machine.ad
TMP=/var/lib/condor/execute/dir_112754
GPU_DEVICE_ORDINAL=0
OMP_NUM_THREADS=32
_CONDOR_JOB_AD=/var/lib/condor/execute/dir_112754/.job.ad
_CONDOR_JOB_IWD=/lhome/ific/a/alferca/prep_tutorial_ific_cuda11.3/03_Submit_GPU_job
_=/usr/bin/printenv
>>>> HOST
mlwn23.ific.uv.es
>>>> CURRENT DIR
/lhome/ific/a/alferca/prep_tutorial_ific_cuda11.3/03_Submit_GPU_job
>>>> USER
alferca
>>>> SPACE LEFT
Filesystem                    Size  Used Avail Use% Mounted on
/dev/mapper/centos-root        49G   35G   15G  71% /
devtmpfs                      189G     0  189G   0% /dev
tmpfs                         189G  305M  188G   1% /dev/shm
tmpfs                         189G     0  189G   0% /sys/fs/cgroup
tmpfs                         189G  187M  189G   1% /run
/dev/md126p1                 1014M  207M  808M  21% /boot
/dev/mapper/centos-condor     352G   33M  352G   1% /tmp
147.156.116.235@tcp:/homefs    55T  5.0T   49T  10% /lhome
147.156.116.235@tcp:/ific2fs   26T  218M   25T   1% /lustre/ific.uv.es
147.156.116.235@tcp:/mlfs     181T   41T  139T  23% /lustre/ific.uv.es/ml
147.156.116.235@tcp:/prjfs    127T   93T   33T  74% /lustre/ific.uv.es/prj
147.156.116.235@tcp:/atl2fs   2.9P  2.4P  477T  84% /lustre/ific.uv.es/grid/atlas
147.156.116.235@tcp:/atl3fs   280T  169T  103T  63% /lustre/ific.uv.es/grid/atlas/t3
>>>> NVIDIA INFO
Fri Oct 22 11:16:51 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01    Driver Version: 440.33.01    CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-PCIE...  Off  | 00000000:18:00.0 Off |                    0 |
| N/A   36C    P0    26W / 250W |     12MiB / 32510MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

In the previous output of the job we can find some interesting information about our finished job:

  • Environment: including variables set by HTCondor, and the CUDA devices visible to our process (only 1 GPU, as requested).
  • Execution machine: host and slot (our job was run in slot2 of machine mlwn23.ific.uv.es).
  • Current dir: the directory from where the job is started.
  • Storage: mounted filesystems and free space (the home and project filesystems are mounted; AFS is not mounted on Worker Nodes).
  • NVIDIA GPU info: Tesla V100-PCIE, 32 GB, not running a real workload in this example.

Selection of particular GPU resources: NVIDIA A100 GPUs

You can select machines with particular GPU resources for execution. For example, the newest NVIDIA A100 GPUs are selected only if you specify it in your HTCondor submission file with this clause:

+UseNvidiaA100 = True

Try to run the same environment-check test job using this clause, for example with this submission file:

$ cat 01_test_A100.sub

universe = vanilla

executable              = 01_test.sh
arguments               = $(Process)

log                     = condor_logs/test.log
output                  = condor_logs/outfile.$(Cluster).$(Process).out
error                   = condor_logs/errors.$(Cluster).$(Process).err

request_gpus = 1
+UseNvidiaA100 = True

queue

Exercise

Try to submit a previous example from this tutorial (e.g. the Python TensorFlow one) to a Worker Node, and check the results.

Notes

  • An HTCondor job can be submitted to a Worker Node with exclusive access to a GPU.
  • The 'output' file records the standard output of the 'executable' defined in the submission script.
  • The 'error' file records the standard error; the 'log' file records information about the HTCondor submission status.
  • The clause '+UseNvidiaA100 = True' must be used to select NVIDIA A100 GPUs.

Multi GPU TensorFlow example

Abstract

  • Lambda Tensorflow benchmark
  • Native / Virtual Environment with Tensorflow 2.3.1
  • Usage of multiple GPUs (intra-node)
  • Submission of HTCondor jobs requesting a node with the needed GPUs

Introduction

For this test we are going to use a publicly available benchmark: https://github.com/lambdal/lambda-tensorflow-benchmark

By default this software uses the installed python3 and TensorFlow installation, performing several benchmark tests on the requested GPUs.

We can use the native TensorFlow installation, or create a virtual environment if the exact required releases and dependencies are not installed (TensorFlow 2.3.1).

Preparation

- Use new environment

virtualenv -p /usr/bin/python3.6 venv
. venv/bin/activate

pip install matplotlib

pip install tensorflow-gpu==2.3.1

- Clone benchmark repo

git clone https://github.com/lambdal/lambda-tensorflow-benchmark.git --recursive

cd lambda-tensorflow-benchmark

- Prepare test config file:

ln -s config/config_all.sh .

Test on USER-INTERFACE

- Run a quick resnet50 test in FP32 (Note usage of GPURUN tool)

gpurun ./batch_benchmark.sh 1 1 1 100 2 config/config_resnet50_replicated_fp32_train_syn

...
2021-10-25 14:03:24.513360: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudnn.so.7
Done warm up
Step   Img/sec   total_loss
1   images/sec: 700.7 +/- 0.0 (jitter = 0.0)   7.363 1635163410
10   images/sec: 701.6 +/- 0.9 (jitter = 3.6)   7.270 1635163412
20   images/sec: 700.5 +/- 1.5 (jitter = 2.5)   7.365 1635163415
30   images/sec: 700.8 +/- 1.1 (jitter = 2.3)   7.309 1635163418
40   images/sec: 699.9 +/- 0.9 (jitter = 2.6)   7.353 1635163421
50   images/sec: 699.6 +/- 0.8 (jitter = 2.3)   7.335 1635163423
60   images/sec: 699.3 +/- 0.7 (jitter = 2.0)   7.292 1635163426
70   images/sec: 699.5 +/- 0.6 (jitter = 1.9)   7.279 1635163429
80   images/sec: 699.6 +/- 0.6 (jitter = 1.8)   7.277 1635163432
90   images/sec: 699.6 +/- 0.5 (jitter = 1.7)   7.353 1635163434
100   images/sec: 699.8 +/- 0.5 (jitter = 1.7)   7.321 1635163437
----------------------------------------------------------------
total images/sec: 699.26
----------------------------------------------------------------

Submit MULTI-GPU JOB

- Condor submission file requesting 2 GPUs:

$ cat test-lambda-tensorflow-benchmark_venv.sub
universe = vanilla

executable              = 01_test_benchmark_venv.sh
arguments               =
log                     = condor_logs/test_venv.log
output                  = condor_logs/outfile_venv.$(Cluster).$(Process).out
error                   = condor_logs/errors_venv.$(Cluster).$(Process).err

request_gpus = 2
queue

- Executable to activate the virtual environment and run the benchmark:

$ cat 01_test_benchmark_venv.sh
#!/bin/bash 
echo ">>>>> ACTIVATE VIRTUALENV TensorFlow 2.3.1"
. venv/bin/activate
echo ">>>>> EXECUTE PAYLOAD"
cd lambda-tensorflow-benchmark
/bin/bash benchmark.sh ${CUDA_VISIBLE_DEVICES} 1

- Create needed output directories and submit HTCondor job

$ mkdir condor_logs
$ condor_submit test-lambda-tensorflow-benchmark_venv.sub

- The benchmark will run for some time on the assigned Worker Node. Information about the running program can be checked in the 'output' and 'error' files as defined in the condor submission file (condor_logs directory).

Output

The job will run for several minutes, using the configured python3 and TensorFlow installation, performing several benchmark tests on the requested GPUs. The output will be produced in a results directory named after the CPU-GPU where it was executed, for example 'logs/Platinum-Tesla_V100-SXM2-32GB.logs'.

They can be analyzed as explained in the original benchmark suite page: https://github.com/lambdal/lambda-tensorflow-benchmark#step-three-report-results

Notes

  • A Python VIRTUAL ENVIRONMENT can be created with the exact dependencies needed.
  • REQUEST_GPUS is used in the HTCondor submission script to indicate the needed number of GPUs.
  • The CUDA_VISIBLE_DEVICES environment variable is set by HTCondor on the Worker Nodes to indicate the assigned GPUs, and can be used by our scripts and programs.

Containers

Abstract

  • Use containers (Singularity, which can use Docker images) as a software distribution method
  • They can be run on the UI or on the Worker Nodes (HTCondor submission)

Introduction

There are situations where we are provided with a container that already includes the needed packages and environment. We can also build a container ourselves to maintain a particular environment.

In this case we can download (or build) a Docker/Singularity container that will distribute the needed software to the Worker Nodes. This scenario is supported by HTCondor, following the next steps.

Preparation

Follow the recipe to download the latest tensorflow container with gpu support as depicted here:

$ mkdir ~/s.images
$ cd ~/s.images
$ singularity build tensorflow-latest-gpu  docker://tensorflow/tensorflow:latest-gpu

Run container in the UI

We can run the Singularity container on the UI to check that it runs correctly, with a command like this:

$ gpurun singularity run --nv -c -H $PWD:/home  ~/s.images/tensorflow-latest-gpu python3 ./02_tf_tuto1.py
First nonoption argument is "singularity" at argv[1]
Connected
Info: OK 0. Tesla V100-PCIE-16GB [00000000:5E:00.0]
1. Tesla V100-PCIE-16GB [00000000:86:00.0]

Total clients:0 Running:0 Estimated waiting time:0 seconds
GPU reserved:300 seconds granted
GPUID reserved:0 Details: - Device 0. Tesla V100-PCIE-16GB [00000000:5E:00.0] set to compute mode:Exclusive Process
Info: Executing program: singularity
2021-10-22 09:37:48.656353: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-10-22 09:37:51.433740: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1510] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 14647 MB memory:  -> device: 0, name: Tesla V100-PCIE-16GB, pci bus id: 0000:5e:00.0, compute capability: 7.0
2021-10-22 09:37:52.058855: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:185] None of the MLIR Optimization Passes are enabled (registered 2)
Epoch 1/5
1875/1875 [==============================] - 5s 2ms/step - loss: 0.2234 - accuracy: 0.9346 
Epoch 2/5
1875/1875 [==============================] - 2s 1ms/step - loss: 0.0968 - accuracy: 0.9701
Epoch 3/5
1875/1875 [==============================] - 2s 1ms/step - loss: 0.0703 - accuracy: 0.9782
Epoch 4/5
1875/1875 [==============================] - 2s 1ms/step - loss: 0.0543 - accuracy: 0.9823
Epoch 5/5
1875/1875 [==============================] - 2s 1ms/step - loss: 0.0427 - accuracy: 0.9861
313/313 [==============================] - 1s 1ms/step - loss: 0.0742 - accuracy: 0.9763

Notes

  • 'gpurun' is prefixed to access the UI GPU.
  • The 'singularity' command accepts several subcommands, such as 'run', followed by the container to execute ('~/s.images/tensorflow-latest-gpu') and then the command to run inside it ('python3 ./02_tf_tuto1.py').
  • The '--nv' parameter is needed to be able to access the GPU from within the container.
  • By DEFAULT Singularity run will mount $HOME automatically. This can be a problem if a local Python/TensorFlow is installed and the local environment is read inside the container; the '-c' parameter clears the environment.
  • In this case we want to mount the current directory ($PWD) to be able to access the program (the Python executable), which we do with the parameter '-H $PWD:/home'.

Submit container to the WN (with HTCondor)

- Check the HTCondor submit file and submit it:

$ cat 02_tf_tuto1.sub

universe = vanilla

executable              = $ENV(HOME)/prep_tutorial_ific_cuda11.3/05_Containers/02_tf_tuto1.py
arguments               = $(Process)

log                     = condor_logs/test_singularity.log
output                  = condor_logs/test_singularity.outfile.$(Cluster).$(Process).out
error                   = condor_logs/test_singularity.errors.$(Cluster).$(Process).err

+SingularityImage = "$ENV(HOME)/s.images/tensorflow-latest-gpu"
+SingularityBind = "/lustre/ific.uv.es/ml"

request_gpus = 1

queue

Notes:
  • With the '+SingularityImage' variable we select the image that was previously downloaded, in this case the latest build of TensorFlow with GPU support.
  • 'SingularityBind' lets us mount other paths, in this case /lustre/ific.uv.es/ml, which contains the project disk space. The home disk space /lhome is mounted by default.
  • The 'executable' and 'arguments' contain the full path, as the Singularity image may start in a different path (probably $HOME) than the current submission directory.

The program executed is a Python TensorFlow example (02_tf_tuto1.py) from the TensorFlow tutorial:

import tensorflow as tf
mnist = tf.keras.datasets.mnist

(x_train, y_train),(x_test, y_test) = mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

model = tf.keras.models.Sequential([
  tf.keras.layers.Flatten(input_shape=(28, 28)),
  tf.keras.layers.Dense(512, activation=tf.nn.relu),
  tf.keras.layers.Dropout(0.2),
  tf.keras.layers.Dense(10, activation=tf.nn.softmax)
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

model.fit(x_train, y_train, epochs=5)
model.evaluate(x_test, y_test)

Submit the job with HTCondor

Submit the .sub file with condor as in previous exercises, and check the output, which should be the same as when running on the User Interface.
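
For reference, a typical sequence might look like the following (illustrative only; the cluster number and file names will differ):

$ mkdir -p condor_logs
$ condor_submit 02_tf_tuto1.sub
$ condor_q
$ cat condor_logs/test_singularity.outfile.<Cluster>.0.out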

Notes

  • Use containers (Singularity, which can use Docker images) as a software distribution method.
  • They can be run on the UI or on the Worker Nodes (HTCondor submission).
  • When running on the UI, gpurun should be used, together with some Singularity options (--nv), to access the local GPU.
  • When submitting with HTCondor, only a few options like '+SingularityImage' are needed to use the container image.
  • By DEFAULT Singularity will mount $HOME automatically.
  • 'SingularityBind' lets us mount other paths, in this case /lustre/ific.uv.es/ml, which contains the project disk space. The home disk space /lhome is mounted by default.
  • The 'executable' and 'arguments' contain the full path, as the Singularity image may start in a different path (probably $HOME) than the current submission directory.

FAQ: Frequently Asked Questions

Job Submission error "Hold reason: Error ... (errno=13: 'Permission denied')"

This error happens when submitting with HTCondor and the 'executable' defined in the .sub file does not have execute permissions (+x).

In this case the job enters the "Hold" state (as can be seen with condor_q) and stays in that state. You can remove the job with 'condor_rm'.
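
The full hold reason can be inspected, for example, with:

$ condor_q -hold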

To solve it, simply add correct file permissions to your executable before submitting jobs. For example for a script called 'executable.sh' :

chmod +x executable.sh

-- AlvaroFernandez - 28 Oct 2021
