Artemisa Usage Guide
Introduction
Artemisa is a high-performance computing infrastructure built around hardware GPGPU accelerators, together with the supporting networking and storage infrastructure, for running complex scientific batch jobs.
In these pages we introduce the hardware, the working environment and usage recipes for the end users of this infrastructure.
Infrastructure
==========================================================
Welcome to
_ _ _
/ \ _ __| |_ ___ _ __ ___ (_)___ __ _
/ _ \ | '__| __/ _ \ '_ ` _ \| / __|/ _` |
/ ___ \| | | || __/ | | | | | \__ \ (_| |
/_/ \_\_| \__\___|_| |_| |_|_|___/\__,_|
The IFIC's AI and ML infraestructure
This is mlui02.ific.uv.es running CentOS 7.9.2009
Please, read the "Politica de Seguridad de la Informacion"
https://artemisa.ific.uv.es/static/Politica_Seguridad.pdf
==========================================================
**ATTENTION**
CUDA Version: 12.0 Driver Version: 525.85.12
Execute GPUstatus to get a summary on GPU utilization
==========================================================
The whole infrastructure runs the Linux operating system CentOS 7, a completely free distribution based on Red Hat Enterprise Linux (RHEL) 7.
For more information about the operating system and its use, refer to the official sources.
The hardware nodes of the infrastructure are divided into three classes:
- User Interfaces (UI): the entry point for users, providing a working environment where they can compile and test their programs. These machines have a GPGPU and access to the storage services. When their jobs are ready, users can submit production batch jobs to the Worker Nodes through the Job Management System.
- Worker Nodes (WN): where the production user jobs are executed. They contain high-end CPUs, a large memory configuration, and up to 8 high-end GPGPUs to execute the jobs.
- Storage Nodes: disk servers that store user and project data, accessible from both User Interfaces and Worker Nodes.
The current detailed hardware configuration can be found on
this page.
Working environment
- Users access the User Interface (UI) nodes, where they develop and submit production jobs. These nodes contain a complete development environment in which to test and validate programs using an entry-level local GPU. This local GPU is intended for development and validation jobs only, and is therefore granted with exclusive access for a limited time (5 minute slots).
- After validation, users can submit their codes as production batch jobs using the Job Management System HTCondor. This provides access to the computing cluster, with high-end GPUs (nodes with up to 8 GPUs with high-speed NVLink interconnection) and longer execution times.
User Interface (UI)
User Interfaces are development nodes, where users can test and validate their programs using the entry-level local GPU. After validation, they can submit their codes as production batch jobs using the Job Management System HTCondor.
Authorized users can log in with the SSH protocol to the User Interface machines (mlui01.ific.uv.es, mlui02.ific.uv.es).
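For example, assuming your account name is <username>:
$ ssh <username>@mlui01.ific.uv.es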
The current list of UI machines and their detailed hardware configuration can be found on
this page.
The user HOME directory resides on a Lustre filesystem (see the
Storage section):
/lhome/ific/<initial_letter>/<username> <= for IFIC users
/lhome/ext/<groupid>/<username> <= for external users
In addition to the user HOME directory, there is space for PROJECTS available upon request, also accessible from the UI:
/lustre/ific.uv.es/ml/<groupid>
Software is accessible in the User Interfaces as described in the
Software section.
Gpurun: request exclusive access to User Interface GPU
As the GPUs are an exclusive resource, users have to reserve interactive access to the local User Interface GPU on the spot.
This is done with the 'gpurun' tool, which grants usage slots valid for 5 minutes.
$ gpurun -i
Connected
Info: OK 0. Tesla V100-PCIE-16GB [00000000:5E:00.0]
1. Tesla V100-PCIE-16GB [00000000:86:00.0]
Total clients:0 Running:0 Estimated waiting time:0 seconds
If other users are already executing, the command will synchronously wait until it can be executed.
Job Management System: HTCondor
HTCondor is the resource management system that runs in this cluster. It manages the
job workflow and allows the users to send jobs to be executed in the worker nodes.
Direct access to worker nodes is not allowed.
Each worker node has a partitionable slot that accepts jobs to be processed. HTCondor deals with job
sorting and processing. Slots are divided when the job does not require all node resources, so more jobs
can be run in the node. CPU and Memory resources are subtracted in chunks from the main slot.
0, 1, 2, 4 or 8 GPU requests are permitted.
HTCondor tries to run jobs from different users in a fair-share way. Job priorities among users
take into account the time previously spent by each user, so CPU time is assigned evenly between all
users.
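The current per-user priorities can be inspected with the standard HTCondor command:
$ condor_userprio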
The complete HTCondor manual can be found
here
The current Artemisa HTCondor configuration can be found
here.
Job submission
Job description file
Before sending a job, you have to prepare a Job Description File which specifies what you want to
execute, where you want to write your output, and what your job requirements are.
A simple job description file might be (test.sub):
universe = vanilla
executable = test.sh
arguments = $(Cluster) $(Process)
log = test.log
output = outfile.$(Cluster).$(Process).out
error = errors.$(Cluster).$(Process).err
request_Cpus = 4
request_Memory = 4000
queue 2
In this file:
- universe: specifies how HTCondor is going to deal with the job. Use vanilla here.
- $(Cluster): HTCondor assigns a consecutive number to each group of jobs sent at the same time.
- $(Process): identifies a specific job inside the group. Process numbers start from 0.
- executable: tells HTCondor what to execute. This can be a binary or a script.
- arguments: the arguments HTCondor will pass to the executable. In this example, the cluster and process job identifiers.
- log: the file where HTCondor will log the job processing information.
- output: where the job output is going to be written. It is important that each job has its own output file, since the file will be overwritten if it already exists. In this example, the file name contains the Cluster and Process identifiers so that it is unique.
- error: same as output, but for the standard error.
- request_Cpus: how many cores the job needs. It is important to be accurate, since this requirement is used by HTCondor to distribute the available resources among all users.
- request_Memory: same as CPUs, but for memory. Note the unit here is MB!
- queue: number of jobs to be sent to the system using this job configuration.
This is a very simple job, but the Job Description File can be richer, and other options may be more appropriate for other user workloads.
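As an illustration (the file names here are examples, not part of the guide), standard HTCondor options also allow requesting GPUs and transferring input files:
universe = vanilla
executable = analysis.sh
arguments = $(Cluster) $(Process)
log = analysis.log
output = analysis.$(Cluster).$(Process).out
error = analysis.$(Cluster).$(Process).err
request_cpus = 8
request_memory = 16000
request_gpus = 1
# transfer an input tarball to the worker node if needed
transfer_input_files = input_data.tar.gz
should_transfer_files = IF_NEEDED
queue 10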
Actual job submission (condor_submit)
When the Job Description File is ready, we can send the jobs to the system:
$ condor_submit test.sub
Submitting job(s)..
2 job(s) submitted to cluster 110.
Here, 110 is the Cluster identification for this batch of jobs.
Get the jobs status (condor_q)
To get the job status use:
$ condor_q
-- Schedd: xxx.ific.uv.es : <xx.xx.xx.xx:9618?... @ 03/12/19 11:44:28
OWNER BATCH_NAME SUBMITTED DONE RUN IDLE TOTAL JOB_IDS
fulano ID: 110 3/12 11:44 _ _ 2 2 110.0-1
Total for query: 2 jobs; 0 completed, 0 removed, 2 idle, 0 running, 0 held, 0 suspended
Total for fulano: 2 jobs; 0 completed, 0 removed, 2 idle, 0 running, 0 held, 0 suspended
Total for all users: 2 jobs; 0 completed, 0 removed, 2 idle, 0 running, 0 held, 0 suspended
Here you can see the jobs in IDLE state, waiting to be considered by HTCondor for running.
The log file begins to fill with information about the HTCondor processing.
When the job starts running, you can see the output and error files being written, since at IFIC
the file system (Lustre) is visible everywhere in the cluster.
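If a job stays idle longer than expected, the standard HTCondor analysis option explains the matchmaking (the job id is an example):
$ condor_q -better-analyze 110.0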
Removing a job (condor_rm)
A job can be removed from the system using its cluster identification. If several jobs have been sent, a specific job can be removed using its Process identification.
$ condor_rm 110.1
Job 110.1 marked for removal
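Standard condor_rm usage also accepts a whole cluster or a user name:
$ condor_rm 110      # remove all jobs in cluster 110
$ condor_rm fulano   # remove all jobs belonging to user fulano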
Checking cluster status (condor_status)
You can get the cluster (pool) status with:
$ condor_status
Name OpSys Arch State Activity LoadAv Mem ActvtyTime
slot1@mlwn01.ific.uv.es LINUX X86_64 Unclaimed Idle 0.000 16384 7+20:20:48
slot2@mlwn01.ific.uv.es LINUX X86_64 Unclaimed Idle 0.000 134964 7+20:21:04
slot3@mlwn01.ific.uv.es LINUX X86_64 Claimed Busy 1.030 231368 0+00:06:00
slot1@mlwn02.ific.uv.es LINUX X86_64 Unclaimed Idle 0.000 16384 7+20:21:07
slot2@mlwn02.ific.uv.es LINUX X86_64 Unclaimed Idle 0.000 134964 7+20:21:18
slot3@mlwn02.ific.uv.es LINUX X86_64 Unclaimed Idle 0.000 231368 5+19:46:37
slot1@mlwn03.ific.uv.es LINUX X86_64 Unclaimed Idle 0.000 309074 7+20:17:45
slot2@mlwn03.ific.uv.es LINUX X86_64 Unclaimed Idle 0.000 463611 7+20:17:57
Machines Owner Claimed Unclaimed Matched Preempting Drain
X86_64/LINUX 9 0 1 8 0 0 0
Total 9 0 1 8 0 0 0
Storage
Storage is maintained in several disk servers, as detailed on
this page.
A distributed
Lustre filesystem is shared and mounted on the different nodes of the cluster, including User Interfaces (UI) and Worker Nodes.
This means that all data is directly available on all nodes, and no explicit file transfer is needed to make it accessible from the worker nodes.
This includes user homes and project areas.
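Since the home and project areas are Lustre mounts, usage can typically be checked with the standard Lustre client tool (the path follows the HOME layout above):
$ lfs quota -u $USER /lhome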
Containers
Containers are a software distribution form that is very convenient for developers and users.
We support
Singularity, as it is secure and supports several container types, including Docker and access to
Docker Hub.
The current distribution documentation for users can be found
here.
Example: download the latest TensorFlow nightly GPU container from Docker Hub, and convert it into a Singularity image for later use:
$ mkdir ~/s.images
$ cd ~/s.images
$ singularity build tensorflow-nightly-gpu docker://tensorflow/tensorflow:nightly-gpu
HEP Scientific Software
CVMFS: HEP Software distribution
We adopt
CVMFS as the main HEP software distribution method. The software packages are distributed in different repositories maintained by the different contributors, and are accessible as locally mounted /cvmfs points in User Interfaces (UI) and Worker Nodes.
The current repositories that can be found are the following:
CERN/SFT Repositories
External software packages are taken from sources external to PH/SFT.
They are recompiled, if possible and necessary, on all SFT-provided
platforms. External software packages are provided for many different
areas, such as:
- General tools (debugging, testing)
- Graphics
- Mathematical Libraries
- Databases
- Scripting Languages and modules
- Grid middleware
- Compilers
An exhaustive list of all provided packages and the supported platforms is available at
http://lcginfo.cern.ch.
The 'lcgenv' configuration tool can be used to set up the environment for the desired tool:
https://gitlab.cern.ch/GENSER/lcgenv
For example, the following sets the environment variables needed to use the ROOT, GSL and Boost libraries:
export LCGENV_PATH=/cvmfs/sft.cern.ch/lcg/releases/
eval "` $LCGENV_PATH/lcgenv/latest/lcgenv -p LCG_93 x86_64-slc6-gcc62-opt ROOT `"
eval "` $LCGENV_PATH/lcgenv/latest/lcgenv -p LCG_93 x86_64-slc6-gcc62-opt GSL `"
eval "` $LCGENV_PATH/lcgenv/latest/lcgenv -p LCG_93 x86_64-slc6-gcc62-opt Boost`"
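You can then verify that the environment points at the CVMFS installation, for example:
$ which root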
Other Repositories
Other CERN CVMFS repositories, maintained by their respective owners, are available at the following mount points, as detailed in the
CVMFS repositories list:
/cvmfs/atlas.cern.ch
/cvmfs/lhcb.cern.ch
Local Installed software
- NVIDIA Drivers :
- Installed releases: 495.29.05
- CUDA Toolkit : The NVIDIA® CUDA® Toolkit provides a development environment for creating high performance GPU-accelerated applications. With the CUDA Toolkit, you can develop, optimize and deploy your applications on GPU-accelerated embedded systems, desktop workstations, enterprise data centers, cloud-based platforms and HPC supercomputers. The toolkit includes GPU-accelerated libraries, debugging and optimization tools, a C/C++ compiler and a runtime library to deploy your application.
- Installed releases: 9.1, 10.0, 10.1, 10.2, 11.3, 11.4, 11.5
- GPU-Accelerated libraries: NVIDIA GPU-accelerated libraries provide highly-optimized functions that perform 2x-10x faster than CPU-only alternatives.
- Installed releases: 9.1, 10.0, 10.1, 10.2, 11.3, 11.4, 11.5
- Compilers: python2.7.5, python3.6.8, gcc4.8.5
- [[https://www.tensorflow.org/][Tensorflow]]: end-to-end open source platform for machine learning.
- Installed releases: r1.13 (python3)
- Other Scientific Libraries: scipy, numpy, atlas, blas, lapack
ARTEMISA TUTORIAL : Usage recipes
A basic ARTEMISA tutorial is available at the following git repository:
Artemisa Tutorial
You can check it out with the following git command:
git clone https://igit.ific.uv.es/alferca/artemisa-tutorial
In this Twiki we copy the summary of the usage recipes from that repository.
Development on User Interface
Abstract
- Hands on development on a User Interface (UI)
- Write a small test program using TensorFlow
- Custom local environments (venv / conda / pip)
- Interactive execution on the UI
- GPURUN tool to gain exclusive access to the UI GPU
Introduction
We use a basic TensorFlow example for development and GPU usage, as described here:
https://www.tensorflow.org/guide/gpu
- Create a python file named 'tf2_helloworld.py' :
from __future__ import print_function
import tensorflow as tf
# Log the device on which each tensor allocation and operation is placed
tf.debugging.set_log_device_placement(True)
# Tensorflow version
print("This is Tensorflow: ", tf.__version__)
# Check GPUs
print("Num GPUs Available: ", len(tf.config.list_physical_devices('GPU')))
# Creates a graph.
a = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[2, 3], name='a')
b = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[3, 2], name='b')
c = tf.matmul(a, b)
# Display
tf.print(c)
- Try to execute the program
Run the program with python3; it will use the system installation of Python 3 and TensorFlow.
$ python3 tf2_helloworld.py
This is Tensorflow: 1.14.0
Traceback (most recent call last):
File "tf2_helloworld.py", line 11, in <module>
print("Num GPUs Available: ", len(tf.config.list_physical_devices('GPU')))
File "/lhome/ific/a/alferca/.local/lib/python3.6/site-packages/tensorflow/python/util/deprecation_wrapper.py", line 106, in __getattr__
attr = getattr(self._dw_wrapped_module, name)
AttributeError: module 'tensorflow._api.v1.config' has no attribute 'list_physical_devices'
If you require versions other than those provided by the system, you can use a custom local environment with your own software installation. Check the next step.
Custom Local Environment
- We can use different tools for creating a custom local environment:
Conda (
https://conda.io) : Conda easily creates, saves, loads and switches between environments on your local computer. As a package manager, Conda helps you find and install packages. If you need a package that requires a different version of Python, you do not need to switch to a different environment manager, because Conda is also an environment manager. In its default configuration, Conda can install and manage the thousands of packages at repo.anaconda.com that are built, reviewed and maintained by Anaconda.
Venv (
https://docs.python.org/3/library/venv.html) : The venv module provides support for creating lightweight “virtual environments” with their own site directories, optionally isolated from system site directories. Each virtual environment has its own Python binary (which matches the version of the binary that was used to create this environment) and can have its own independent set of installed Python packages in its site directories.
PIP (
https://pypi.org/project/pip/) : pip is the package installer for Python. You can use pip to install packages from the Python Package Index and other indexes.
- In this case we want to install a different Python and TensorFlow version. At the time of writing this tutorial, this is TensorFlow v2.6.0.
Follow the Conda instructions to install its distribution locally:
https://docs.conda.io/en/latest/miniconda.html#linux-installers
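For example, at the time of writing, a local Miniconda installation can be done as follows (the installer name may change; check the page above):
$ wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
$ bash Miniconda3-latest-Linux-x86_64.sh -b -p $HOME/miniconda3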
Use conda to create a virtual environment. We install the required packages (cudnn) but not yet
TensorFlow, since at the time of writing this tutorial the release included in Conda does not support GPU:
$ conda create -n artemisa-tutorial cudnn
Activate the environment
$ conda activate artemisa-tutorial
Install PIP within conda, then install
TensorFlow (with PIP). PIP is the recommended installation method to get the latest release from the official sources:
(artemisa-tutorial) $ conda install pip
(artemisa-tutorial) $ pip install tensorflow
Execute the previous program:
(artemisa-tutorial) [alferca@mlui02 01_User_Interface_Development]$ python3 tf2_helloworld.py
This is Tensorflow: 2.6.0
Num GPUs Available: 2
2021-10-27 12:55:05.008162: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-10-27 12:55:05.023252: F tensorflow/core/platform/statusor.cc:33] Attempting to fetch value instead of handling error Internal: failed initializing StreamExecutor for CUDA device ordinal 0: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_INVALID_DEVICE: invalid device ordinal
Aborted
As you can see, the version used is now TensorFlow 2.6.0.
However, execution on the local GPUs is not directly possible, and we will get errors like CUDA_ERROR_INVALID_DEVICE.
In the next section we see how to overcome this and gain interactive access to the UI GPU.
Interactive GPU exclusive access: GPURUN
The UI GPU is not directly accessible; the GPURUN tool has to be used.
Simply prepend the command that accesses the GPU with the 'gpurun' tool:
(artemisa-tutorial) $ gpurun python3 tf2_helloworld.py
First nonoption argument is "python3" at argv[1]
Connected
Info: OK 0. Tesla V100-PCIE-16GB [00000000:5E:00.0]
1. Tesla V100-PCIE-16GB [00000000:86:00.0]
Total clients:0 Running:0 Estimated waiting time:0 seconds
GPU reserved:300 seconds granted
GPUID reserved:0 Details: - Device 0. Tesla V100-PCIE-16GB [00000000:5E:00.0] set to compute mode:Exclusive Process
Info: Executing program: python3
This is Tensorflow: 2.6.0
Num GPUs Available: 1
2021-10-27 13:05:23.086416: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-10-27 13:05:23.795174: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1510] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 14647 MB memory: -> device: 0, name: Tesla V100-PCIE-16GB, pci bus id: 0000:5e:00.0, compute capability: 7.0
2021-10-27 13:05:23.948618: I tensorflow/core/common_runtime/eager/execute.cc:1161] Executing op _EagerConst in device /job:localhost/replica:0/task:0/device:GPU:0
2021-10-27 13:05:23.949655: I tensorflow/core/common_runtime/eager/execute.cc:1161] Executing op _EagerConst in device /job:localhost/replica:0/task:0/device:GPU:0
2021-10-27 13:05:23.949874: I tensorflow/core/common_runtime/eager/execute.cc:1161] Executing op Reshape in device /job:localhost/replica:0/task:0/device:GPU:0
2021-10-27 13:05:23.950036: I tensorflow/core/common_runtime/eager/execute.cc:1161] Executing op _EagerConst in device /job:localhost/replica:0/task:0/device:GPU:0
2021-10-27 13:05:23.950149: I tensorflow/core/common_runtime/eager/execute.cc:1161] Executing op _EagerConst in device /job:localhost/replica:0/task:0/device:GPU:0
2021-10-27 13:05:23.950203: I tensorflow/core/common_runtime/eager/execute.cc:1161] Executing op Reshape in device /job:localhost/replica:0/task:0/device:GPU:0
2021-10-27 13:05:23.950376: I tensorflow/core/common_runtime/eager/execute.cc:1161] Executing op MatMul in device /job:localhost/replica:0/task:0/device:GPU:0
2021-10-27 13:05:24.427560: I tensorflow/core/common_runtime/eager/execute.cc:1161] Executing op StringFormat in device /job:localhost/replica:0/task:0/device:CPU:0
2021-10-27 13:05:24.427990: I tensorflow/core/common_runtime/eager/execute.cc:1161] Executing op PrintV2 in device /job:localhost/replica:0/task:0/device:CPU:0
[[22 28]
[49 64]]
If there are no other clients waiting for the local GPU, the program will be executed immediately; otherwise it will wait and print the expected waiting time:
Queued clients:0 Estimated waiting time:0 seconds
Bear in mind that the User Interface GPUs are accessed exclusively (only one program can execute at a time), so user programs will be terminated after a short time limit (5 minutes).
After execution we will see the 2x2 matrix resulting from the multiplication:
[[22 28]
[49 64]]
and the
TensorFlow logs, which show where the execution took place: the operands a and b and the
MatMul operation were all performed on GPU 0, the Tesla V100-PCIE-16GB installed in the User Interface.
Notes
- The GPURUN tool is needed to access the GPU on a User Interface (UI)
- Custom software/environment installations can be done in the user or project directories
CPU only job: Data Preparation use case
Abstract
- Execute CPU only Jobs on the UI (without GPURUN)
- Review a Keras example for Data Augmentation
- Submit a batch job with HTCondor (Job Management System)
- Execute a priority job on a remote Worker Node with CPU only
Introduction
The ARTEMISA cluster has some resources reserved for CPU-only jobs.
Although modern frameworks like
TensorFlow or Keras allow the creation of data pipelines that benefit in parallel from the CPU (data augmentation) and the GPU (training),
some tasks like data preparation and augmentation can be done in parallel using CPU-only resources, in order to speed up the process.
In this example we propose a basic recipe to execute data preparation tasks requesting CPU-only resources. We are going to build an image classifier. Because the MNIST dataset is a bit overused and too easy, we use the more challenging CIFAR-10 dataset. It consists of 32x32 pixel images in 10 classes. The data is split into 50k training and 10k test images.
First we will augment the data and store it locally. Use the assigned project space to test it.
Preparation
Activate the previously created environment:
$ conda activate artemisa-tutorial
Install needed packages
(artemisa-tutorial) $ pip install matplotlib
(artemisa-tutorial) $ pip install tensorflow-datasets
(artemisa-tutorial) $ pip install scipy
Check the provided program:
(artemisa-tutorial) $ python3 tf2_keras_generate_data_CPUonly.py
This program forces execution on the CPU, so it can be run on the UI without the gpurun tool.
It will create new augmented data (images) in a subdirectory.
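The actual program is provided in the tutorial repository. As a minimal sketch of the idea (the parameter choices here are illustrative assumptions, not the tutorial's code), a CPU-only Keras augmentation step could look like this:
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "-1"   # hide the GPUs: force CPU-only execution

import tensorflow as tf
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# CIFAR-10: 50k training / 10k test images of 32x32 pixels, 10 classes
(x_train, y_train), _ = tf.keras.datasets.cifar10.load_data()

# Random rotations, shifts and flips (rotations/shifts require scipy)
datagen = ImageDataGenerator(rotation_range=15,
                             width_shift_range=0.1,
                             height_shift_range=0.1,
                             horizontal_flip=True)

os.makedirs("augmented_data", exist_ok=True)
# Draw one augmented batch and save the images to the subdirectory
flow = datagen.flow(x_train[:32], batch_size=32,
                    save_to_dir="augmented_data",
                    save_prefix="aug", save_format="png")
next(flow)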
Execute the program in a remote WORKER NODE (with HTCondor)
- Check HTCondor submission file:
(artemisa-tutorial) $ cat tf2_keras_generate_data_CPUonly.sub
universe = vanilla
executable = tf2_keras_generate_data_CPUonly.sh
arguments =
log = condor_logs/test.log
output = condor_logs/outfile.$(Cluster).$(Process).out
error = condor_logs/errors.$(Cluster).$(Process).err
# Needed to read .bashrc and conda environment
getenv = True
# TestJob CPU
+testJob = True
queue
Note the '+testJob = True' clause to use the exclusive CPU resources
- Check 'executable' that will be run on the Worker Node:
(artemisa-tutorial) $ cat tf2_keras_generate_data_CPUonly.sh
#!/bin/bash
EXERCISE_ENVIRONMENT="artemisa-tutorial"
eval "$(conda shell.bash hook)"
conda activate $EXERCISE_ENVIRONMENT
python3 tf2_keras_generate_data_CPUonly.py
Since the job will run on a remote Worker Node, we need to set up the same environment (conda) there.
This script does exactly that, prior to invoking our real Python executable.
Submit the job (with HTCondor)
- Create the output directories (condor_logs) and submit the HTCondor job:
$ mkdir condor_logs
(artemisa-tutorial) [alferca@mlui02 02.2_Store_Data]$ condor_submit tf2_keras_generate_data_CPUonly.sub
Submitting job(s).
1 job(s) submitted to cluster 1602
Check results
$ ls -ltr augmented_data/
Notes
- ARTEMISA has high-priority CPU-only resources
- Use '+testJob = True' to select the corresponding CPU-only slots.
- Use 'getenv = true' to keep the user environment when submitting with HTCondor
- When using custom environments, initialize the required environment in your 'executable' (e.g. conda) before running the real payload.
Submit Basic GPU Production Job: Worker Nodes GPUs and Environment info
Abstract
- Submit a batch job with HTCondor (Job Management System)
- Execute on remote Worker Node with a GPU
- Obtain basic Worker Node environment information
Introduction
After development, we can submit production jobs that use 1 or more production GPUs with higher time limits, rather than only the User Interfaces.
In this example we will submit a job requesting a single GPU and retrieve basic environment information, which can help debug many issues.
Preparation
- Check HTCondor submission file:
$ cat 01_test.sub
universe = vanilla
executable = 01_test.sh
arguments = $(Process)
log = condor_logs/test.log
output = condor_logs/outfile.$(Cluster).$(Process).out
error = condor_logs/errors.$(Cluster).$(Process).err
request_gpus = 1
queue
With this file we request a node with 1 GPU, and the execution of the script defined in the 'executable' parameter.
- Check 'executable' that will be run on the Worker Node:
$ cat 01_test.sh
#!/bin/sh
echo ">>>> ENVIRONMENT"
printenv
echo ">>>> HOST"
hostname
echo ">>>> CURRENT DIR"
pwd
echo ">>>> USER"
whoami
echo ">>>> SPACE LEFT"
df -h
echo ">>>> NVIDIA INFO"
set -x #echo on
nvidia-smi
This shell script retrieves the environment variables and some information about the Worker Node where it executes.
With the nvidia-smi command we can obtain information about the GPUs installed on that node.
Submit the job
- Create the output directories (condor_logs) and submit the HTCondor job:
$ mkdir condor_logs
$ condor_submit 01_test.sub
Submitting job(s).
1 job(s) submitted to cluster 548.
Check results
After submitting the job, the output of the script will be recorded in the 'output' and/or 'error' files defined in the parameters. The 'log' file contains information about the HTCondor process executing the jobs.
- 'output' file contains the standard output of your executable:
$ cat condor_logs/outfile.131398.0.out
>>>> ENVIRONMENT
_CONDOR_JOB_PIDS=
TMPDIR=/var/lib/condor/execute/dir_112754
_CONDOR_ANCESTOR_4719=5207:1618416485:3499018107
_CONDOR_SCRATCH_DIR=/var/lib/condor/execute/dir_112754
_CHIRP_DELAYED_UPDATE_PREFIX=Chirp*
_CONDOR_ANCESTOR_5207=112754:1634894211:4224147812
TEMP=/var/lib/condor/execute/dir_112754
BATCH_SYSTEM=HTCondor
_CONDOR_CHIRP_CONFIG=/var/lib/condor/execute/dir_112754/.chirp.config
PWD=/lhome/ific/a/alferca/prep_tutorial_ific_cuda11.3/03_Submit_GPU_job
_CONDOR_AssignedGpus=CUDA0
CUDA_VISIBLE_DEVICES=0
_CONDOR_SLOT=slot2
_CONDOR_ANCESTOR_112754=112755:1634894211:3354999804
SHLVL=1
_CONDOR_MACHINE_AD=/var/lib/condor/execute/dir_112754/.machine.ad
TMP=/var/lib/condor/execute/dir_112754
GPU_DEVICE_ORDINAL=0
OMP_NUM_THREADS=32
_CONDOR_JOB_AD=/var/lib/condor/execute/dir_112754/.job.ad
_CONDOR_JOB_IWD=/lhome/ific/a/alferca/prep_tutorial_ific_cuda11.3/03_Submit_GPU_job
_=/usr/bin/printenv
>>>> HOST
mlwn23.ific.uv.es
>>>> CURRENT DIR
/lhome/ific/a/alferca/prep_tutorial_ific_cuda11.3/03_Submit_GPU_job
>>>> USER
alferca
>>>> SPACE LEFT
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/centos-root 49G 35G 15G 71% /
devtmpfs 189G 0 189G 0% /dev
tmpfs 189G 305M 188G 1% /dev/shm
tmpfs 189G 0 189G 0% /sys/fs/cgroup
tmpfs 189G 187M 189G 1% /run
/dev/md126p1 1014M 207M 808M 21% /boot
/dev/mapper/centos-condor 352G 33M 352G 1% /tmp
147.156.116.235@tcp:/homefs 55T 5.0T 49T 10% /lhome
147.156.116.235@tcp:/ific2fs 26T 218M 25T 1% /lustre/ific.uv.es
147.156.116.235@tcp:/mlfs 181T 41T 139T 23% /lustre/ific.uv.es/ml
147.156.116.235@tcp:/prjfs 127T 93T 33T 74% /lustre/ific.uv.es/prj
147.156.116.235@tcp:/atl2fs 2.9P 2.4P 477T 84% /lustre/ific.uv.es/grid/atlas
147.156.116.235@tcp:/atl3fs 280T 169T 103T 63% /lustre/ific.uv.es/grid/atlas/t3
>>>> NVIDIA INFO
Fri Oct 22 11:16:51 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01 Driver Version: 440.33.01 CUDA Version: 10.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla V100-PCIE... Off | 00000000:18:00.0 Off | 0 |
| N/A 36C P0 26W / 250W | 12MiB / 32510MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
In the previous output of the job we can find some interesting information about our finished job:
- Environment: variables set by HTCondor, including the CUDA devices visible to our process (only 1 GPU, as requested)
- Execution machine: host and slot (our job ran in slot2 of machine mlwn23.ific.uv.es)
- Current dir: the directory from which the job was started
- Storage: mounted filesystems and free space (home and project filesystems are mounted; AFS is not mounted on Worker Nodes)
- NVIDIA GPU info: Tesla V100-PCIE, 32GB, not running a real workload in this example
Selection of particular GPU resources: NVIDIA A100 GPUs
You can select machines with particular GPU resources for execution. For example, the newest NVIDIA A100 GPUs are selected
only if you specify it in your HTCondor submission file with this clause:
+UseNvidiaA100 = True
Try to run the same test job to check the environment using this clause, for example with this submission file:
$ cat 01_test_A100.sub
universe = vanilla
executable = 01_test.sh
arguments = $(Process)
log = condor_logs/test.log
output = condor_logs/outfile.$(Cluster).$(Process).out
error = condor_logs/errors.$(Cluster).$(Process).err
request_gpus = 1
+UseNvidiaA100 = True
queue
Exercise
Try to submit a previous example from this tutorial (e.g. the Python TensorFlow one) to a Worker Node, and check the results.
Notes
- HTCondor jobs can be submitted to a Worker Node with exclusive access to a GPU
- The 'output' file records the standard output of the 'executable' defined in the submission script.
- The 'error' file records the standard error. The 'log' file records information about the HTCondor submission status.
- The clause '+UseNvidiaA100 = True' must be used to select NVIDIA A100 GPUs
Multi-GPU TensorFlow example
Abstract
- Lambda Tensorflow benchmark
- Native / Virtual Environment with Tensorflow 2.3.1
- Usage of multiple GPUs (intra-node)
- Submission of HTCondor jobs requesting a node with the needed GPUs
Introduction
For this test we are going to use an available benchmark:
https://github.com/lambdal/lambda-tensorflow-benchmark
By default this software uses the installed python3 and TensorFlow, performing several benchmark tests on the requested GPUs.
We can use the native
TensorFlow installation, or create a virtual environment if the exact required releases and dependencies are not installed (TensorFlow 2.3.1).
Preparation
- Set up a new environment
virtualenv -p /usr/bin/python3.6 venv
. venv/bin/activate
pip install matplotlib
pip install tensorflow-gpu==2.3.1
- Clone benchmark repo
git clone https://github.com/lambdal/lambda-tensorflow-benchmark.git --recursive
cd lambda-tensorflow-benchmark
- Prepare test config file:
ln -s config/config_all.sh .
Test on USER-INTERFACE
- Run a quick resnet50 test in FP32 (note the usage of the GPURUN tool)
gpurun ./batch_benchmark.sh 1 1 1 100 2 config/config_resnet50_replicated_fp32_train_syn
...
2021-10-25 14:03:24.513360: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudnn.so.7
Done warm up
Step Img/sec total_loss
1 images/sec: 700.7 +/- 0.0 (jitter = 0.0) 7.363 1635163410
10 images/sec: 701.6 +/- 0.9 (jitter = 3.6) 7.270 1635163412
20 images/sec: 700.5 +/- 1.5 (jitter = 2.5) 7.365 1635163415
30 images/sec: 700.8 +/- 1.1 (jitter = 2.3) 7.309 1635163418
40 images/sec: 699.9 +/- 0.9 (jitter = 2.6) 7.353 1635163421
50 images/sec: 699.6 +/- 0.8 (jitter = 2.3) 7.335 1635163423
60 images/sec: 699.3 +/- 0.7 (jitter = 2.0) 7.292 1635163426
70 images/sec: 699.5 +/- 0.6 (jitter = 1.9) 7.279 1635163429
80 images/sec: 699.6 +/- 0.6 (jitter = 1.8) 7.277 1635163432
90 images/sec: 699.6 +/- 0.5 (jitter = 1.7) 7.353 1635163434
100 images/sec: 699.8 +/- 0.5 (jitter = 1.7) 7.321 1635163437
----------------------------------------------------------------
total images/sec: 699.26
----------------------------------------------------------------
Submit MULTI-GPU JOB
- Condor submission file requesting 2 GPUs:
$ cat test-lambda-tensorflow-benchmark_venv.sub
universe = vanilla
executable = 01_test_benchmark_venv.sh
arguments =
log = condor_logs/test_venv.log
output = condor_logs/outfile_venv.$(Cluster).$(Process).out
error = condor_logs/errors_venv.$(Cluster).$(Process).err
request_gpus = 2
queue
- Executable to activate the virtual environment and run the benchmark:
$ cat 01_test_benchmark_venv.sh
#!/bin/bash
echo ">>>>> ACTIVATE VIRTUALENV TensorFlow 2.3.1"
. venv/bin/activate
echo ">>>>> EXECUTE PAYLOAD"
cd lambda-tensorflow-benchmark
/bin/bash benchmark.sh ${CUDA_VISIBLE_DEVICES} 1
- Create needed output directories and submit HTCondor job
$ mkdir condor_logs
$ condor_submit test-lambda-tensorflow-benchmark_venv.sub
- The benchmark will run for some time on the assigned Worker Node. Information about the running program can be checked in the 'output' and 'error' files as defined in the condor submission file (condor_logs directory)
Output
The job will run for several minutes, using the configured python3 and the installed TensorFlow, performing several benchmark tests on the requested GPUs. The output will be produced in a results directory named after the CPU and GPU where it was executed, for example 'logs/Platinum-Tesla_V100-SXM2-32GB.logs'.
They can be analyzed as explained in the original benchmark suite page:
https://github.com/lambdal/lambda-tensorflow-benchmark#step-three-report-results
Notes
- A python VIRTUAL ENVIRONMENT can be created with the exact dependencies needed.
- REQUEST_GPUS is used in the HTCondor submission script to indicate the needed number of GPUs.
- The CUDA_VISIBLE_DEVICES environment variable is set by HTCondor on the Worker Nodes to indicate the assigned GPUs, and can be used by our scripts and programs.
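For instance, a job script can inspect the variable to adapt to the number of assigned GPUs (an illustrative snippet, not part of the tutorial):
#!/bin/bash
# CUDA_VISIBLE_DEVICES is set by HTCondor, e.g. "0,1" when 2 GPUs are assigned
NGPUS=$(echo "$CUDA_VISIBLE_DEVICES" | awk -F',' '{print NF}')
echo "Assigned GPUs: $CUDA_VISIBLE_DEVICES (count: $NGPUS)"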
Containers
Abstract
- Use containers (Singularity, which can also run Docker images) as a software distribution
- They can be run on the UI or on the Worker Nodes (HTCondor submission)
Introduction
There are situations where a container is provided with the required packages and environment already installed. We can also build a container to maintain a particular environment.
In either case, we can download (or build) a Docker/Singularity container that will distribute the needed software to the Worker Nodes. This scenario is supported by HTCondor, following the next steps:
Preparation
Follow the recipe to download the latest TensorFlow container with GPU support, as shown here:
$ mkdir ~/s.images
$ cd ~/s.images
$ singularity build tensorflow-latest-gpu docker://tensorflow/tensorflow:latest-gpu
Run container in the UI
We can run the Singularity container on the UI to check that it runs correctly, with a command like this:
$ gpurun singularity run --nv -c -H $PWD:/home ~/s.images/tensorflow-latest-gpu python3 ./02_tf_tuto1.py
First nonoption argument is "singularity" at argv[1]
Connected
Info: OK 0. Tesla V100-PCIE-16GB [00000000:5E:00.0]
1. Tesla V100-PCIE-16GB [00000000:86:00.0]
Total clients:0 Running:0 Estimated waiting time:0 seconds
GPU reserved:300 seconds granted
GPUID reserved:0 Details: - Device 0. Tesla V100-PCIE-16GB [00000000:5E:00.0] set to compute mode:Exclusive Process
Info: Executing program: singularity
2021-10-22 09:37:48.656353: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-10-22 09:37:51.433740: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1510] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 14647 MB memory: -> device: 0, name: Tesla V100-PCIE-16GB, pci bus id: 0000:5e:00.0, compute capability: 7.0
2021-10-22 09:37:52.058855: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:185] None of the MLIR Optimization Passes are enabled (registered 2)
Epoch 1/5
1875/1875 [==============================] - 5s 2ms/step - loss: 0.2234 - accuracy: 0.9346
Epoch 2/5
1875/1875 [==============================] - 2s 1ms/step - loss: 0.0968 - accuracy: 0.9701
Epoch 3/5
1875/1875 [==============================] - 2s 1ms/step - loss: 0.0703 - accuracy: 0.9782
Epoch 4/5
1875/1875 [==============================] - 2s 1ms/step - loss: 0.0543 - accuracy: 0.9823
Epoch 5/5
1875/1875 [==============================] - 2s 1ms/step - loss: 0.0427 - accuracy: 0.9861
313/313 [==============================] - 1s 1ms/step - loss: 0.0742 - accuracy: 0.9763
Notes
- 'gpurun' is prefixed to access the UI GPU.
- The 'singularity' command accepts several subcommands, like 'run' to execute within the container ('~/s.images/tensorflow-latest-gpu'), followed by the command to run ('python3 ./02_tf_tuto1.py')
- The '--nv' parameter is needed to be able to access the GPU within the container
- By DEFAULT, Singularity run will mount $HOME automatically. This can be a problem if a local python/tensorflow is installed and the local environment is read inside the container; the '-c' parameter clears the environment
- In this case we want to mount the current directory ($PWD) to be able to access the program (the Python script), which we can do with the parameter '-H $PWD:/home'
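For interactive debugging, the same image can also be opened as a shell (standard Singularity usage; prepend gpurun if GPU access is needed):
$ gpurun singularity shell --nv -c -H $PWD:/home ~/s.images/tensorflow-latest-gpu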
Submit container to the WN (with HTCondor)
- Check the HTCondor submit file and submit it. The executable is python3, which runs the script passed in 'arguments':
$ cat 02_tf_tuto1.sub
universe = vanilla
# the executable is python3; the script path is passed as the argument
executable = /usr/bin/python3
arguments = "$ENV(HOME)/prep_tutorial_ific_cuda11.3/05_Containers/02_tf_tuto1.py"
log = condor_logs/test_singularity.log
output = condor_logs/test_singularity.outfile.$(Cluster).$(Process).out
error = condor_logs/test_singularity.errors.$(Cluster).$(Process).err
+SingularityImage = "$ENV(HOME)/s.images/tensorflow-latest-gpu"
+SingularityBind = "/lustre/ific.uv.es/ml"
request_gpus = 1
queue
Notes:
- With the '+SingularityImage' variable we select the image that was previously downloaded, in this case the latest build of TensorFlow with GPU support.
- 'SingularityBind' lets us mount other paths, in this case /lustre/ific.uv.es/ml, which contains the project disk space. The home disk space /lhome is mounted by default.
- The 'executable' and 'arguments' contain full paths, as the Singularity image may start in a different path (probably $HOME) than the current submission directory.
The executable is python and the program executed is a TensorFlow example (02_tf_tuto1.py) from the TensorFlow Tutorial.
import tensorflow as tf
mnist = tf.keras.datasets.mnist
(x_train, y_train),(x_test, y_test) = mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0
model = tf.keras.models.Sequential([
tf.keras.layers.Flatten(input_shape=(28, 28)),
tf.keras.layers.Dense(512, activation=tf.nn.relu),
tf.keras.layers.Dropout(0.2),
tf.keras.layers.Dense(10, activation=tf.nn.softmax)
])
model.compile(optimizer='adam',
loss='sparse_categorical_crossentropy',
metrics=['accuracy'])
model.fit(x_train, y_train, epochs=5)
model.evaluate(x_test, y_test)
Submit the job with HTCondor
Submit the .sub file with condor as in the previous exercises, and check that the output is the same as when running on the User Interface.
Notes
- Use containers (Singularity, which can also run Docker images) as a software distribution
- They can be run on the UI or on the Worker Nodes (HTCondor submission)
- When running on the UI, gpurun should be used, plus some Singularity options (--nv) to access the local GPU
- When submitting with HTCondor, only a few options like '+SingularityImage' are needed to use the container image.
- By DEFAULT, Singularity will mount $HOME automatically.
- 'SingularityBind' lets us mount other paths, in this case /lustre/ific.uv.es/ml, which contains the project disk space. The home disk space /lhome is mounted by default.
- The 'executable' and 'arguments' contain full paths, as the Singularity image may start in a different path (probably $HOME) than the current submission directory.
FAQ: Frequently Asked Questions
Job Submission error "Hold reason: Error ... (errno=13: 'Permission denied')"
This error happens when submitting with HTCondor and the 'executable' defined in the .sub file does not have execution permissions (+x).
In this case the job enters the "Hold" state (as can be seen with condor_q) and stays there. You can remove the job with 'condor_rm'.
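Held jobs and their hold reasons can be listed with the standard HTCondor option:
$ condor_q -hold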
To solve it, simply set the correct file permissions on your executable before submitting jobs. For example, for a script called 'executable.sh':
chmod +x executable.sh
--
AlvaroFernandez - 28 Oct 2021