Job Management (practice session)
Preliminaries
User Interface
First of all, connect to a User Interface (UI). The UIs available for this course via ssh are:
cg01.ific.uv.es (SL4)
cg02.ific.uv.es (SL5)
cg03.ific.uv.es (SL5)
Authentication and authorization
Then create a proxy with the VOMS extension. Remember that this step is comparable to logging in to the Grid:
voms-proxy-init -voms vo.formacion.es-ngi.eu
If everything is OK, you should see something similar to:
voms-proxy-init -voms vo.formacion.es-ngi.eu
Enter GRID pass phrase:
Your identity: /C=ES/O=IFCA/CN=tut29
Creating temporary proxy ................................. Done
Contacting voms01.ifca.es:15004 [/DC=es/DC=irisgrid/O=ifca/CN=host/voms01.ifca.es] "vo.formacion.es-ngi.eu" Done
Creating proxy ......................................................... Done
Your proxy is valid until Tue Jun 15 00:35:30 2010
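Before moving on, it can be handy to confirm that the proxy exists and still has enough lifetime left. The following is a minimal sketch, assuming the standard voms-proxy-info client is installed on the UI; the one-hour threshold and the variable names are arbitrary choices made for this example.

```shell
#!/bin/bash
# Sketch: check that a VOMS proxy exists and is not about to expire.
VO="vo.formacion.es-ngi.eu"
MIN_SECONDS=3600   # arbitrary threshold chosen for this example

if command -v voms-proxy-info >/dev/null 2>&1; then
    # --timeleft prints the remaining lifetime in seconds
    left=$(voms-proxy-info --timeleft 2>/dev/null | tail -n 1)
    # Treat anything that is not a plain number as "no proxy"
    case "$left" in ''|*[!0-9]*) left=0 ;; esac
    if [ "$left" -lt "$MIN_SECONDS" ]; then
        echo "Proxy missing or expiring soon; run: voms-proxy-init -voms $VO"
    else
        echo "Proxy OK, $left seconds left"
        voms-proxy-info --all    # full details, including the VO attributes
    fi
else
    echo "voms-proxy-info not found: are you logged in to a UI?"
fi
```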
Information System: Getting information on CEs
To list the CEs available to the vo.formacion.es-ngi.eu VO and the state of their CPUs, use the command:
lcg-info --list-ce --vo vo.formacion.es-ngi.eu
or:
lcg-infosites --vo vo.formacion.es-ngi.eu ce
#CPU Free Total Jobs Running Waiting ComputingElement
----------------------------------------------------------
1352 722 0 0 0 ce05.pic.es:2119/jobmanager-lcgpbs-ngi
1352 724 0 0 0 ce07.pic.es:2119/jobmanager-lcgpbs-ngi
1352 704 0 0 0 ce06.pic.es:2119/jobmanager-lcgpbs-ngi
8 8 0 0 0 e-ce.iaa.es:2119/jobmanager-lcgpbs-forngi
1 1 0 0 0 ce.egee.cesga.es:2119/jobmanager-lcgsge-GRID_ngifor
164 164 0 0 0 ce2.egee.cesga.es:2119/jobmanager-lcgsge-GRID_ngifor
164 164 0 0 0 ce3.egee.cesga.es:2119/jobmanager-lcgsge-GRID_ngifor
448 446 0 0 0 ce.iaa.csic.es:2119/jobmanager-lcgpbs-forngi
22 22 0 0 0 ce-ieg.bifi.unizar.es:2119/jobmanager-lcgpbs-formangi
136 136 0 0 0 ce01-tic.ciemat.es:2119/jobmanager-lcgpbs-training
248 242 0 0 0 ce-iber.bifi.unizar.es:2119/jobmanager-lcgpbs-formangi
56 56 0 0 0 ce-sge-ngi.ceta-ciemat.es:2119/jobmanager-lcgsge-ngiform
164 164 0 0 0 test03.egee.cesga.es:2119/jobmanager-lcgsge-GRID_ngifor
151 69 1 1 0 ce01.macc.unican.es:2119/jobmanager-lcgpbs-grid
1616 1616 0 0 0 gridce01.ifca.es:2119/jobmanager-sge-ngifor
42 40 2 2 0 ngiesce.i3m.upv.es:2119/jobmanager-pbs-ngies
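The table lends itself to quick filtering from the shell. The following sketch orders the CEs by their number of free CPUs. It assumes the six-column layout shown above (CPU, Free, Total Jobs, Running, Waiting, ComputingElement); the function name ces_by_free_cpus is made up for this example.

```shell
#!/bin/bash
# Sketch: list CEs by number of free CPUs, most free first.
# Reads lcg-infosites "ce" output on stdin. Header and separator lines are
# skipped because their first field is not a number.
ces_by_free_cpus() {
    awk '$1 ~ /^[0-9]+$/ && NF == 6 { print $2, $6 }' | sort -k1,1nr
}

# Usage on a UI:
#   lcg-infosites --vo vo.formacion.es-ngi.eu ce | ces_by_free_cpus
# Self-contained demonstration with three rows copied from the table above:
ces_by_free_cpus <<'EOF'
#CPU Free Total Jobs Running Waiting ComputingElement
----------------------------------------------------------
8 8 0 0 0 e-ce.iaa.es:2119/jobmanager-lcgpbs-forngi
448 446 0 0 0 ce.iaa.csic.es:2119/jobmanager-lcgpbs-forngi
22 22 0 0 0 ce-ieg.bifi.unizar.es:2119/jobmanager-lcgpbs-formangi
EOF
```

Each output line holds the free CPU count followed by the CE identifier, so the first line is the CE with the most free CPUs at the time of the query.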
You may also need to know which operating system (OS) is installed on each CE:
lcg-infosites --vo vo.formacion.es-ngi.eu ce -v 2
RAMMemory Operating System System Version Processor Subcluster name
-------------------------------------------------------------------------------------------------------------------------
16000 ScientificSL Boron Xeon ce05.pic.es
16000 ScientificSL Boron Xeon ce07.pic.es
16000 ScientificSL Boron Xeon ce06.pic.es
4096 ScientificSL Beryllium xeon e-ce.iaa.es
1024 ScientificSL Beryllium Xeon ce.egee.cesga.es
1024 ScientificSL Beryllium PIV ce2.egee.cesga.es
1024 ScientificSL Beryllium Xeon ce3.egee.cesga.es
2048 ScientificCERNSLC SL Intel ce.iaa.csic.es
513 ScientificCERNSLC Beryllium PIV ce-ieg.bifi.unizar.es
1024 ScientificSL SL Xeon5160 ce01-tic.ciemat.es
16384 ScientificCERNSLC Boron Xeon ce-iber.bifi.unizar.es
2048 ScientificSL 4.6 4.6 Xeon ce-sge-ngi.ceta-ciemat.es
0 n.a n.a n.a test03.egee.cesga.es
2048 ScientificSL SLC PD ce01.macc.unican.es
0 n.a n.a n.a gridce01.ifca.es
Unfortunately, some of these CEs have problems. The following CEs have been tested and certified for this tutorial:
e-ce.iaa.es:2119/jobmanager-lcgpbs-forngi
ce.iaa.csic.es:2119/jobmanager-lcgpbs-forngi
ce-ieg.bifi.unizar.es:2119/jobmanager-lcgpbs-formangi
ce-iber.bifi.unizar.es:2119/jobmanager-lcgpbs-formangi
gridce01.ifca.es:2119/jobmanager-sge-ngifor
Job Description Language (JDL)
The basics of the JDL were presented in the job management talk. You can find the same information here:
JDL
Job Management
Be aware that all Grid jobs run in batch mode.
gLite middleware
gLite is the middleware currently used in Grid computing. Developed by an international collaboration within the EGEE project, gLite provides a solid framework for developing applications that benefit from distributed computing and storage.
Accordingly, the Grid commands to submit, check and retrieve jobs (and many others) start with glite-wms-job-*. This section summarizes the most important commands, although more advanced ones are also available. All of them are discussed in the Workload Management section of the gLite User Guide.
The gLite commands are presented in the following twiki: gLite commands.
Practices
Some exercises are proposed so that you can apply what you have learnt so far. You will need to write your own JDL files and use the glite-wms commands to follow them. Be prepared to work...
Exercise 1: Simple JDL
Let's create the simplest JDL file and use the gLite commands (take a look).
- Simple JDL and simple commands:
- Create a JDL file that displays the hostname of the WN where the job will run, using the command /bin/hostname (help: JDL twiki, (example)).
- Delegate your credentials using your username (have you authenticated yourself? If not, see the Authentication and authorization section above).
- Perform a job list matching. Which is the right command?
- Now, submit your job. During the job submission, a job identifier (JobId) will be generated. This identifier is unique for each job.
- Check the status of your job using the right job identifier (provided after the submission). You can increase the amount of status information with the option --verbosity NUMBER, where NUMBER=0,1,2,3 (0 means less information).
- Once the job has finished (i.e. it has changed to the "Done" status), you can retrieve its output. But wait... have you included the OutputSandbox attribute in your JDL file? If not, don't panic: you won't be able to retrieve anything, but you will learn how to use it in the next exercise.
- You can also test how to cancel a job. Submit the same job again and cancel it before it ends. Use the right job id. Check the status of the job once you have cancelled it.
- As a good scientist, you should experiment with the command options (check them yourself!). For example, try the "-o jobidfile" option. What is it useful for?
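One possible way through the steps above, offered as a sketch rather than the official solution. The file names hostname.jdl and jobid.txt are arbitrary choices, and the gLite commands only run when the script is executed on a UI with a valid proxy.

```shell
#!/bin/bash
# Sketch for exercise 1: minimal JDL plus the basic job life cycle.

# 1. Write the JDL. Without an OutputSandbox nothing can be retrieved later.
cat > hostname.jdl <<'EOF'
Executable = "/bin/hostname";
EOF

# 2. Run the gLite commands only where they exist (i.e. on a UI).
if command -v glite-wms-job-submit >/dev/null 2>&1; then
    glite-wms-job-delegate-proxy -d "$USER"            # delegate credentials
    glite-wms-job-list-match -d "$USER" hostname.jdl   # which CEs match?
    glite-wms-job-submit -d "$USER" -o jobid.txt hostname.jdl
    glite-wms-job-status --verbosity 1 -i jobid.txt    # verbosity level 0..3
    # To cancel instead of waiting: glite-wms-job-cancel -i jobid.txt
else
    echo "gLite commands not found; hostname.jdl written only."
fi
```

Storing the JobId in a file with -o means the later commands can read it back with -i instead of requiring the long identifier to be pasted by hand.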
Exercise 2: JDL with input/output sandbox
Let's see how to use the InputSandbox and the OutputSandbox.
First of all, remember (see JDL attributes) that the InputSandbox transfers all the files in its list from the UI, through the WMS, to the WN where the job will finally run. Although these files will then be available at runtime as input files for your job (which is great), it is not a good idea to include large files (i.e. files larger than 50 MB), since they can overload the WMS (especially its storage disk). The same applies to the files listed in the OutputSandbox. For large files, a Storage Element (SE) must be used (see the Data Management session).
- Create a bash script that displays the hostname of the WN, your username and the list of CEs available for vo.formacion.es-ngi.eu.
- A good habit is to test your scripts/programs on the UI before submitting them to the Grid. So, test it, please! (Remember to give your script execution permissions with chmod u+x.)
- Create a JDL file similar to the one in exercise 1, but now include your bash script in the InputSandbox and the two log files in the OutputSandbox.
- If you have started a new ssh session, delegate your proxy before submitting your job. Then check the status until the job has finished. After that, retrieve the results of your job, i.e. the log files you specified in the OutputSandbox. Check the log files to see how your job ran.
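The steps above could be sketched as follows. The file names info.sh, sandbox.jdl, std.out and std.err are arbitrary, and the sketch assumes that the two log files are the standard output and standard error of the job. The lcg-infosites call is guarded because the command may not be installed on every WN.

```shell
#!/bin/bash
# Sketch for exercise 2: a job script shipped through the InputSandbox.

# Job script that will run on the WN.
cat > info.sh <<'EOF'
#!/bin/bash
echo "Host: $(hostname)"
echo "User: $(whoami)"
if command -v lcg-infosites >/dev/null 2>&1; then
    lcg-infosites --vo vo.formacion.es-ngi.eu ce
else
    echo "lcg-infosites not available on this machine" >&2
fi
EOF
chmod u+x info.sh
./info.sh > local.out 2> local.err   # local test on the UI before submitting

# JDL: send the script in, bring both log files back.
cat > sandbox.jdl <<'EOF'
Executable    = "info.sh";
StdOutput     = "std.out";
StdError      = "std.err";
InputSandbox  = {"info.sh"};
OutputSandbox = {"std.out", "std.err"};
EOF
```

Once the job reaches the "Done" status, a command such as glite-wms-job-output --dir ./results -i jobid.txt should fetch std.out and std.err into the chosen directory (jobid.txt being whatever file was passed to -o at submission).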
Exercise 2b: Running your own programs (InputSandbox properties)
Now, let's learn how to run your own program/executable. In order to finish the exercise today, you can download the following program (use this command):
wget --no-check-certificate https://twiki.ific.uv.es/twiki/pub/ECiencia/JobManagement/myprogram
First, test the program (remember that it needs execution permissions: chmod u+x). Once you are sure it works, take the previous bash script and add a line like ./myprogram. Then add the program to the InputSandbox so that it is sent together with your job.
What has happened? I guess your program has failed, right? Why?
It has failed because the executable flag is not preserved for the files included in the InputSandbox when they are transferred to the WN. The execution permissions must be set by the initial script specified as the Executable in the JDL file. Therefore, you should add the line
chmod u+x myprogram
before the line
./myprogram
.
Submit the job again. Has it worked now?
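A minimal sketch of the corrected setup. For clarity it uses a separate wrapper script instead of extending the script from exercise 2; the names run_myprogram.sh and myprogram.jdl are hypothetical.

```shell
#!/bin/bash
# Sketch for exercise 2b: wrapper script that restores the executable bit.
cat > run_myprogram.sh <<'EOF'
#!/bin/bash
# Files from the InputSandbox arrive on the WN without the executable flag,
# so set it again before running the program.
chmod u+x myprogram
./myprogram
EOF
chmod u+x run_myprogram.sh

# Both the wrapper and the program travel in the InputSandbox.
cat > myprogram.jdl <<'EOF'
Executable    = "run_myprogram.sh";
StdOutput     = "std.out";
StdError      = "std.err";
InputSandbox  = {"run_myprogram.sh", "myprogram"};
OutputSandbox = {"std.out", "std.err"};
EOF
```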
Exercise 3: JDL with requirements
The Requirements attribute lets you add constraints on the computing resources; for instance, you can select a particular resource that also satisfies certain software requirements. Let's try it, using the JDL file from exercise 1 as the starting point:
- Modify the JDL to send the job to a specific CE. To do that, you have to know which CEs are available. Use the glite-wms-job-list-match command to list the available CEs that match the requirements of your job. Let's select ce-iber.bifi.unizar.es, for example. Now specify the proper Requirements line in the JDL file (take a look at the following link: JDL comments). The right GLUE attribute to use is GlueCEUniqueID. Remember that you can use the command lcg-info --vo vo.formacion.es-ngi.eu --list-attrs to list all the attributes (see the GLUE Schema twiki). Now, repeat the steps of exercise 1 to send your job.
- Now, modify the JDL to send another job that uses a specific queue manager, for example "torque" (a good extra exercise is to retrieve the list of CEs together with their queue managers. Which command do you need? Clue: lcg-info). Additionally, require that the total number of CPUs is at least 10. Remember that you can combine several requirements with the AND operator (&&). Now, repeat the steps of exercise 1 for this new JDL file.
- Now use the Member operator to modify the JDL file so that it requires the CE to have, for example, the OPENMPI environment installed. Send the job and check on which CE it has run. You can then use the command lcg-info with the appropriate options to find out which CEs fulfil this requirement. Moreover, you can use this command to list all the software installed on each CE. Can you find the proper options to display it?
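The three variants above might look like the following sketch. The file names are arbitrary. The attribute names come from the GLUE 1.x schema, and the exact tag string that a site publishes for OPENMPI is an assumption, so check what lcg-info actually reports before relying on it.

```shell
#!/bin/bash
# Sketch for exercise 3: three variants of the exercise 1 JDL with Requirements.

# (a) Pin the job to one CE through its unique identifier.
cat > req_ce.jdl <<'EOF'
Executable   = "/bin/hostname";
Requirements = other.GlueCEUniqueID == "ce-iber.bifi.unizar.es:2119/jobmanager-lcgpbs-formangi";
EOF

# (b) Queue manager "torque" AND at least 10 CPUs in total.
cat > req_lrms.jdl <<'EOF'
Executable   = "/bin/hostname";
Requirements = other.GlueCEInfoLRMSType == "torque" && other.GlueCEInfoTotalCPUs >= 10;
EOF

# (c) The CE must publish the OPENMPI software tag (tag string assumed).
cat > req_tag.jdl <<'EOF'
Executable   = "/bin/hostname";
Requirements = Member("OPENMPI", other.GlueHostApplicationSoftwareRunTimeEnvironment);
EOF
```

On a UI, running glite-wms-job-list-match -d $USER on each file shows which CEs satisfy it before anything is submitted.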
Exercise 4: JDL to deal with problems
If a job fails when submitted to one site but succeeds on another, one possible reason is that the failing site is not configured correctly. You can force the WMS to ignore that site. Could you add the right RegExp function to the Requirements attribute in the JDL of the first exercise in order to ignore the ce3.egee.cesga.es CE (for example)? Remember that to negate a RegExp expression you can use !RegExp.
It is also generally good practice to ask the WMS to resubmit a job if a failure happens before the job reaches a WN, i.e. because of a Grid problem. This is done with
ShallowRetryCount = 1;
(see JDL attributes).
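Both ideas of this exercise can be combined in one JDL file, as in the following sketch (the file name avoid_ce.jdl is arbitrary):

```shell
#!/bin/bash
# Sketch for exercise 4: exclude one CE and allow one shallow resubmission.
cat > avoid_ce.jdl <<'EOF'
Executable        = "/bin/hostname";
Requirements      = !RegExp("ce3.egee.cesga.es", other.GlueCEUniqueID);
ShallowRetryCount = 1;
EOF
```

Because the first argument is a regular expression, the unescaped dots match any character. That is harmless here, since the pattern is only used to exclude a single host name.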
Exercise 5: JDL with rank
As stated on the JDL twiki, the Rank attribute is a floating-point expression that states how to rank the CEs that have already met the Requirements expression. Essentially, rank expresses your preferences. A higher numeric value means a better rank. By default, the ranking takes into account the number of free CPUs (i.e. GlueCEStateFreeCPUs).
Try this feature by ranking the CEs for the JDL file of exercise 1 by their total number of CPUs. Remember that you can use the command
lcg-info --vo vo.formacion.es-ngi.eu --list-attrs
in order to display all the attributes together with their corresponding GLUE attribute names.
Once you have added the right line, send your job and see the output.
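A possible Rank line, as a sketch: it assumes that GlueCEInfoTotalCPUs is the GLUE attribute holding the total number of CPUs, and the file name rank.jdl is arbitrary.

```shell
#!/bin/bash
# Sketch for exercise 5: prefer CEs with more CPUs in total.
cat > rank.jdl <<'EOF'
Executable = "/bin/hostname";
Rank       = other.GlueCEInfoTotalCPUs;
EOF

# On a UI, list the matching CEs together with their rank values.
if command -v glite-wms-job-list-match >/dev/null 2>&1; then
    glite-wms-job-list-match -d "$USER" --rank rank.jdl
fi
```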
Exercise 6: Submission to a particular CE
The JDL file of exercise 1 can also be sent directly to a particular CE using the options of the glite-wms-job-submit command. In practice, this gives the same result as the Requirements attribute from exercise 3, but the principle is not the same: the middleware does not follow the same path. When a job is submitted with the option -r CE, the WMS does not check the availability of the CE (saving time) and no BrokerInfo is created. This can translate into a speed improvement if you are really sure that the CE you are submitting to is online and working; if it is not, your jobs will fail.
Try to submit the original JDL file using the following command:
glite-wms-job-submit -d $USER -o jobId hostname.jdl
and submit it again using the following command:
glite-wms-job-submit -d $USER -r ce.iaa.csic.es:2119/jobmanager-lcgpbs-forngi -o jobId hostname.jdl
Which of the two jobs started to run earlier? Remember that you have to really trust the CE you are sending your jobs to with the second command.
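To make the comparison easier, the two submissions can write their JobIds to separate files. This is only a sketch: the file names jobId.wms and jobId.direct are made up for this example, and the commands run only on a UI where hostname.jdl exists.

```shell
#!/bin/bash
# Sketch for exercise 6: brokered vs direct submission, one JobId file each.
JDL="hostname.jdl"
CE="ce.iaa.csic.es:2119/jobmanager-lcgpbs-forngi"

if command -v glite-wms-job-submit >/dev/null 2>&1 && [ -f "$JDL" ]; then
    # The WMS chooses the CE
    glite-wms-job-submit -d "$USER" -o jobId.wms "$JDL"
    # The CE is forced with -r
    glite-wms-job-submit -d "$USER" -r "$CE" -o jobId.direct "$JDL"
    # Repeat these two commands every so often and compare the states.
    glite-wms-job-status -i jobId.wms
    glite-wms-job-status -i jobId.direct
else
    echo "Run this on a UI, in the directory that contains $JDL."
fi
```

Using one file per submission keeps the two JobIds apart, so each status query refers to exactly one job.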
--
CarlosEscobar - 14 Jun 2010