r13 - 06 Jul 2010 - 19:31:33 - CarlosEscobar

Job Management (practice session)

Preliminaries

User Interface

First of all, you should connect to a User Interface (UI). The UIs available for this course via ssh are:

cg01.ific.uv.es  (SL4)
cg02.ific.uv.es  (SL5)
cg03.ific.uv.es  (SL5)

Authentication and authorization

Then you should create a proxy with VOMS extensions. Remember that this step is comparable to logging in to the Grid:

voms-proxy-init -voms vo.formacion.es-ngi.eu

If everything is OK, you should see something similar to:

voms-proxy-init -voms vo.formacion.es-ngi.eu
Enter GRID pass phrase:
Your identity: /C=ES/O=IFCA/CN=tut29
Creating temporary proxy ................................. Done
Contacting  voms01.ifca.es:15004 [/DC=es/DC=irisgrid/O=ifca/CN=host/voms01.ifca.es] "vo.formacion.es-ngi.eu" Done
Creating proxy ......................................................... Done
Your proxy is valid until Tue Jun 15 00:35:30 2010
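
Before submitting jobs later on, it can be handy to check how much lifetime the proxy has left. The following is only a sketch: the helper name enough_time and the one-hour threshold are assumptions made for illustration, and the voms-proxy-info call is skipped when the VOMS clients are not installed.

```shell
# enough_time LEFT MIN: succeed when LEFT seconds of proxy lifetime exceed MIN.
# (Hypothetical helper, not part of the VOMS clients.)
enough_time() {
  [ "$1" -gt "$2" ]
}

# Only query the proxy on a machine that has the VOMS clients (a UI).
if command -v voms-proxy-info >/dev/null 2>&1; then
  left=$(voms-proxy-info -timeleft 2>/dev/null || echo 0)
  if enough_time "${left:-0}" 3600; then
    echo "proxy OK: ${left} s left"
  else
    echo "proxy missing or about to expire; run voms-proxy-init again"
  fi
fi
```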

Information System: Getting information on CEs

If you want to know which CEs are available in the vo.formacion.es-ngi.eu VO and the state of their CPUs, remember that you should use the command:

lcg-info --list-ce --vo vo.formacion.es-ngi.eu

or:

lcg-infosites --vo vo.formacion.es-ngi.eu ce

#CPU	Free	Total Jobs	Running	Waiting	ComputingElement
----------------------------------------------------------
1352	 722	   0	          0	   0	ce05.pic.es:2119/jobmanager-lcgpbs-ngi
1352	 724	   0	          0	   0	ce07.pic.es:2119/jobmanager-lcgpbs-ngi
1352	 704	   0	          0	   0	ce06.pic.es:2119/jobmanager-lcgpbs-ngi
   8	   8	   0	          0	   0	e-ce.iaa.es:2119/jobmanager-lcgpbs-forngi
   1	   1	   0	          0	   0	ce.egee.cesga.es:2119/jobmanager-lcgsge-GRID_ngifor
 164	 164	   0	          0	   0	ce2.egee.cesga.es:2119/jobmanager-lcgsge-GRID_ngifor
 164	 164	   0	          0	   0	ce3.egee.cesga.es:2119/jobmanager-lcgsge-GRID_ngifor
 448	 446	   0	          0	   0	ce.iaa.csic.es:2119/jobmanager-lcgpbs-forngi
  22	  22	   0	          0	   0	ce-ieg.bifi.unizar.es:2119/jobmanager-lcgpbs-formangi
 136	 136	   0	          0	   0	ce01-tic.ciemat.es:2119/jobmanager-lcgpbs-training
 248	 242	   0	          0	   0	ce-iber.bifi.unizar.es:2119/jobmanager-lcgpbs-formangi
  56	  56	   0	          0	   0	ce-sge-ngi.ceta-ciemat.es:2119/jobmanager-lcgsge-ngiform
 164	 164	   0	          0	   0	test03.egee.cesga.es:2119/jobmanager-lcgsge-GRID_ngifor
 151	  69	   1	          1	   0	ce01.macc.unican.es:2119/jobmanager-lcgpbs-grid
1616	1616	   0	          0	   0	gridce01.ifca.es:2119/jobmanager-sge-ngifor
  42	  40	   2	          2	   0	ngiesce.i3m.upv.es:2119/jobmanager-pbs-ngies
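
A listing like this one can be filtered with standard tools, for instance to keep only the CEs that report a minimum number of free CPUs. This is a minimal sketch: the function name pick_ces and the threshold of 20 are assumptions, and the sample rows are copied from the output above so that the snippet also runs without the Grid clients.

```shell
# pick_ces MIN: read lcg-infosites "ce" output on stdin and print the CE
# identifiers (last column) that report at least MIN free CPUs (second column).
# Header and separator lines are skipped because their first field is not a number.
pick_ces() {
  awk -v min="$1" '$1 ~ /^[0-9]+$/ && $2 + 0 >= min { print $NF }'
}

# Sample rows taken from the listing above. On a UI one would pipe the real
# command instead:
#   lcg-infosites --vo vo.formacion.es-ngi.eu ce | pick_ces 20
sample='#CPU	Free	Total Jobs	Running	Waiting	ComputingElement
----------------------------------------------------------
   8	   8	   0	          0	   0	e-ce.iaa.es:2119/jobmanager-lcgpbs-forngi
 448	 446	   0	          0	   0	ce.iaa.csic.es:2119/jobmanager-lcgpbs-forngi
  22	  22	   0	          0	   0	ce-ieg.bifi.unizar.es:2119/jobmanager-lcgpbs-formangi'

printf '%s\n' "$sample" | pick_ces 20
```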

Moreover, you may need to know which operating system (OS) is available on each CE:

lcg-infosites --vo vo.formacion.es-ngi.eu ce -v 2

RAMMemory  Operating System   System Version  Processor  Subcluster name
-------------------------------------------------------------------------------------
    16000  ScientificSL       Boron           Xeon       ce05.pic.es
    16000  ScientificSL       Boron           Xeon       ce07.pic.es
    16000  ScientificSL       Boron           Xeon       ce06.pic.es
     4096  ScientificSL       Beryllium       xeon       e-ce.iaa.es
     1024  ScientificSL       Beryllium       Xeon       ce.egee.cesga.es
     1024  ScientificSL       Beryllium       PIV        ce2.egee.cesga.es
     1024  ScientificSL       Beryllium       Xeon       ce3.egee.cesga.es
     2048  ScientificCERNSLC  SL              Intel      ce.iaa.csic.es
      513  ScientificCERNSLC  Beryllium       PIV        ce-ieg.bifi.unizar.es
     1024  ScientificSL       SL              Xeon5160   ce01-tic.ciemat.es
    16384  ScientificCERNSLC  Boron           Xeon       ce-iber.bifi.unizar.es
     2048  ScientificSL 4.6   4.6             Xeon       ce-sge-ngi.ceta-ciemat.es
        0  n.a                n.a             n.a        test03.egee.cesga.es
     2048  ScientificSL       SLC             PD         ce01.macc.unican.es
        0  n.a                n.a             n.a        gridce01.ifca.es

Unfortunately, some of these CEs have problems. The following is the list of certified CEs that have been tested for this tutorial:

e-ce.iaa.es:2119/jobmanager-lcgpbs-forngi
ce.iaa.csic.es:2119/jobmanager-lcgpbs-forngi
ce-ieg.bifi.unizar.es:2119/jobmanager-lcgpbs-formangi
ce-iber.bifi.unizar.es:2119/jobmanager-lcgpbs-formangi
gridce01.ifca.es:2119/jobmanager-sge-ngifor

Job Description Language (JDL)

The basics of JDL were presented in the job management talk. You can also find the same information here: JDL

Job Management

Be aware that all Grid jobs run in batch mode.

gLite middleware

gLite is the middleware currently used in Grid computing. Developed by an international collaboration within the EGEE project, gLite provides a solid framework for developing applications that benefit from distributed computing and storage.

The commands to submit, check and retrieve jobs (and many others) à la Grid start with glite-wms-job-*. This section summarizes the most important commands, although more advanced ones are also available. All of them are discussed in the Workload Management section of the gLite User Guide.

The gLite commands are presented in the following twiki: gLite commands.

Practices

The following exercises let you apply what you have learnt so far. You will need to write your own JDL files and use the glite-wms commands to follow them. Be prepared to work...

Exercise 1: Simple JDL

Let's create the simplest JDL file and use the gLite commands (take a look).
  1. Simple JDL and simple commands:
    1. Create a JDL file that displays the hostname of the WN where the job will run, using the command /bin/hostname (help: JDL twiki, (example)).
    2. Delegate your credentials using your username (have you authenticated yourself? If not, see the Authentication and authorization section above).
    3. Perform a job list match. Which is the right command?
    4. Now submit your job. During submission, a job identifier (JobId) is generated. This identifier is unique for each job.
    5. Check the status of your job using its job identifier (provided after the submission). You can increase the amount of status information with the option --verbosity NUMBER, where NUMBER=0,1,2,3 (0 means the least information).
    6. Once the job has finished (i.e. its status has changed to "Done"), you can retrieve its output. But wait... have you included the OutputSandbox attribute in your JDL file? If not, don't panic: you won't be able to retrieve anything this time, but you will learn how to use it in the next exercise.
  2. You can also test how to cancel a job. Submit the same job again and cancel it before it ends, using the right job id. Check the status of the job once you have cancelled it.
  3. As a good scientist, you should experiment with the command options (check them yourself!). For example, try the "-o jobidfile" option. What is it useful for?
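
The steps of this exercise could be sketched as follows. This is one possible solution, not the reference one: the file name hostname.jdl is borrowed from exercise 6, the log file names std.out and std.err are assumptions, and the gLite calls are skipped when the WMS clients are not installed, so the snippet can also be tried outside a UI.

```shell
# Write a minimal JDL that runs /bin/hostname on the WN.
# std.out and std.err are assumed names; there is no OutputSandbox yet,
# so nothing can be retrieved (see exercise 2).
cat > hostname.jdl <<'EOF'
Executable = "/bin/hostname";
StdOutput  = "std.out";
StdError   = "std.err";
EOF

# Run the gLite steps only on a machine that has the WMS clients (a UI).
if command -v glite-wms-job-submit >/dev/null 2>&1; then
  glite-wms-job-delegate-proxy -d "$USER"                 # delegate credentials
  glite-wms-job-list-match -d "$USER" hostname.jdl        # which CEs match?
  glite-wms-job-submit -d "$USER" -o jobId hostname.jdl   # JobId is saved in ./jobId
  glite-wms-job-status -i jobId                           # JobId is read from the file
fi
```

Saving the identifier with -o means that later commands can read it from the file with -i, instead of requiring you to paste the full JobId each time.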

Exercise 2: JDL with input/output sandbox

Let's see how to use the InputSandbox and the OutputSandbox.

First of all, recall (see JDL attributes) that the InputSandbox transfers all the files in its list from the UI to the WN where the job will finally run, passing through the WMS. These files are then available at runtime as input files for your job, which is great, but it is not a good idea to include large files (i.e. files larger than 50 MB), since they can overload the WMS (especially its storage disk). The same applies to the files listed in the OutputSandbox. For large files, a Storage Element (SE) must be used (see Data Management session).

  1. Create a bash script that displays the hostname of the WN, your username and the list of CEs available for vo.formacion.es-ngi.eu.
  2. It is a good habit to test your scripts/programs on the UI before submitting them to the Grid. So test it, please! (Remember to give your script execution permissions with chmod u+x.)
  3. Create a JDL file similar to the one in exercise 1, but now include your bash script in the InputSandbox and the two log files (standard output and standard error) in the OutputSandbox.
  4. If you have started a new ssh session, delegate your proxy before submitting your job. Then check the status until the job has finished. After that, retrieve the results of your job, i.e. the log files you specified in the OutputSandbox. Check the log files to see how your job ran.
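
A possible starting point for this exercise is sketched below. The file names info.sh, sandbox.jdl, std.out and std.err are assumptions chosen for illustration; only the commands inside the script come from this page.

```shell
# Hypothetical script name: info.sh. It prints the WN hostname, the user
# the job runs as, and the CEs published for the VO.
cat > info.sh <<'EOF'
#!/bin/bash
/bin/hostname
whoami
lcg-infosites --vo vo.formacion.es-ngi.eu ce
EOF
chmod u+x info.sh   # needed to test it on the UI with ./info.sh

# JDL that ships the script to the WN and brings both log files back.
cat > sandbox.jdl <<'EOF'
Executable    = "info.sh";
StdOutput     = "std.out";
StdError      = "std.err";
InputSandbox  = {"info.sh"};
OutputSandbox = {"std.out", "std.err"};
EOF
```

Once the job is Done, glite-wms-job-output -i with your job id file should retrieve std.out and std.err.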

Exercise 2b: Running your own programs (InputSandbox properties)

Now, let's learn how to run your own program/executable. In order to finish the exercise today, you can download the following program (use this command):

wget --no-check-certificate https://twiki.ific.uv.es/twiki/pub/ECiencia/JobManagement/myprogram

First, test the program (remember that it needs execution permissions: chmod u+x). Once you are sure it works, take the previous bash script and add a line like ./myprogram. Then add the program to the InputSandbox so that it is sent together with your job.

What has happened? I guess your program has failed, right? Why? It failed because the executable flag is not preserved when the files in the InputSandbox are transferred to the WN. The execution permissions must be set by the initial script specified as the Executable in the JDL file. Therefore, you should add the line chmod u+x myprogram before the line ./myprogram.

Submit the job again. Has it worked now?
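
The effect can be reproduced locally without the Grid. The sketch below uses a stand-in myprogram (a two-line shell script, not the real downloaded binary) and a hypothetical wrapper run.sh: the execute bit is removed to mimic the sandbox transfer, and the wrapper restores it before running the program.

```shell
demo=$(mktemp -d)

# Stand-in for the downloaded program, stripped of its execute bit as would
# happen after an InputSandbox transfer.
printf '#!/bin/sh\necho "myprogram ran"\n' > "$demo/myprogram"
chmod a-x "$demo/myprogram"

# Running it directly fails at this point.
(cd "$demo" && ./myprogram) 2>/dev/null || echo "fails: no execute permission"

# Wrapper playing the role of the Executable in the JDL file.
cat > "$demo/run.sh" <<'EOF'
#!/bin/sh
chmod u+x myprogram
./myprogram
EOF

result=$(cd "$demo" && sh run.sh)
echo "$result"
```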

Exercise 3: JDL with requirements

The Requirements attribute lets you add constraints on the computing resources; for instance, you can select a particular resource or require that certain software is installed. Let's try it, using the JDL file from exercise 1 as the starting point:

  1. Modify the JDL to send the job to a specific CE. To do that, you have to know which CEs are available. Use the glite-wms-job-list-match command to list the available CEs that match the requirements of your job. Let's select ce-iber.bifi.unizar.es, for example. Now specify the proper requirements line in the JDL file (take a look at the following link: JDL comments). The GLUE attribute to use is GlueCEUniqueID. Remember that you can use the command lcg-info --vo vo.formacion.es-ngi.eu --list-attrs to list all the attributes (see the GLUE Schema twiki). Now repeat the steps of exercise 1 to send your job.
  2. Now modify the JDL to send another job that requires a specific queue manager, for example "torque". (A good extra exercise is to retrieve the list of CEs together with their queue managers. Which command do you need? Clue: lcg-info.) Additionally, require that the total number of CPUs is at least 10. Remember that you can combine several requirements with the AND operator (&&). Now repeat the steps of exercise 1 for this new JDL file.
  3. Now use the Member operator to modify the JDL file so that it requires the CE to have, for example, the OPENMPI environment installed. Send the job and check on which CE it has run. You can also use the command lcg-info with the appropriate options to find out which CEs fulfil this requirement. Moreover, you can use this command to list all the software installed on each CE. Can you find the proper options to display it?
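
The three Requirements expressions could look like the sketch below. Treat it as a hint rather than the official solution: the helper write_jdl and the file names are hypothetical, the attribute names GlueCEInfoLRMSType, GlueCEInfoTotalCPUs and GlueHostApplicationSoftwareRunTimeEnvironment are assumed from the GLUE 1.x schema and should be confirmed with lcg-info --list-attrs, and the exact strings published by each site (for instance "torque" or "OPENMPI") may differ.

```shell
# write_jdl FILE EXPR: write the hostname job with one Requirements expression.
# (Hypothetical helper; std.out and std.err are assumed log file names.)
write_jdl() {
  cat > "$1" <<EOF
Executable    = "/bin/hostname";
StdOutput     = "std.out";
StdError      = "std.err";
OutputSandbox = {"std.out", "std.err"};
Requirements  = $2;
EOF
}

# 1. A specific CE (full identifier taken from the tested CE list).
write_jdl req_ce.jdl \
  'other.GlueCEUniqueID == "ce-iber.bifi.unizar.es:2119/jobmanager-lcgpbs-formangi"'

# 2. A given queue manager AND at least 10 CPUs in total.
write_jdl req_lrms.jdl \
  'other.GlueCEInfoLRMSType == "torque" && other.GlueCEInfoTotalCPUs >= 10'

# 3. A software tag published by the CE.
write_jdl req_sw.jdl \
  'Member("OPENMPI", other.GlueHostApplicationSoftwareRunTimeEnvironment)'
```

Running glite-wms-job-list-match on each file before submitting shows whether any CE actually satisfies the expression.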

Exercise 4: JDL to deal with problems

If a job fails when submitted to one site but succeeds on another, one possible reason is that the failing site is not configured correctly. You can force the WMS to ignore that site. Can you add the right RegExp function to the Requirements attribute of the JDL from the first exercise in order to ignore the ce3.egee.cesga.es CE (for example)? Remember that you can negate RegExp by writing !RegExp.

In general, it is also good practice to ask the WMS to resubmit a job if a failure happens before the job reaches a WN, i.e. because of a Grid problem. This is done with ShallowRetryCount = 1; (see JDL attributes).
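
Both ideas can be combined in one JDL file, as in the following sketch. The file names skip_ce.jdl and kept_ces.txt and the log file names are assumptions. The second part is only a local preview of the same pattern applied with grep to a few CE identifiers from the listing earlier on this page; the real matching is done by the WMS.

```shell
# JDL that excludes one CE and allows one shallow resubmission.
cat > skip_ce.jdl <<'EOF'
Executable        = "/bin/hostname";
StdOutput         = "std.out";
StdError          = "std.err";
OutputSandbox     = {"std.out", "std.err"};
Requirements      = !RegExp("ce3.egee.cesga.es", other.GlueCEUniqueID);
ShallowRetryCount = 1;
EOF

# Local preview: drop every identifier that matches the pattern.
printf '%s\n' \
  'ce2.egee.cesga.es:2119/jobmanager-lcgsge-GRID_ngifor' \
  'ce3.egee.cesga.es:2119/jobmanager-lcgsge-GRID_ngifor' \
  'ce.iaa.csic.es:2119/jobmanager-lcgpbs-forngi' |
  grep -v 'ce3.egee.cesga.es' > kept_ces.txt
cat kept_ces.txt
```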

Exercise 5: JDL with rank

As stated on the JDL twiki, the Rank attribute is a floating-point expression that states how to rank the CEs that have already met the Requirements expression. Essentially, Rank expresses your preferences: a higher numeric value means a better rank. By default, the ranking takes into account the number of free CPUs (i.e. GlueCEStateFreeCPUs).

Try this feature by ranking the CEs for the JDL file of exercise 1 according to their total number of CPUs. Remember that you can use the command lcg-info --vo vo.formacion.es-ngi.eu --list-attrs in order to display all the attributes together with their corresponding GLUE attribute names.

Once you have added the right line, send your job and see the output.
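
One way to write the ranking line is sketched below. The attribute GlueCEInfoTotalCPUs is assumed from the GLUE 1.x schema (check it with lcg-info --list-attrs), the file name rank.jdl is hypothetical, and the --rank option of glite-wms-job-list-match is assumed to be available in your client version. The gLite call is skipped when the WMS clients are not installed.

```shell
# Hostname job that prefers the CEs with the largest total number of CPUs.
cat > rank.jdl <<'EOF'
Executable    = "/bin/hostname";
StdOutput     = "std.out";
StdError      = "std.err";
OutputSandbox = {"std.out", "std.err"};
Rank          = other.GlueCEInfoTotalCPUs;
EOF

# On a UI, show the matching CEs together with their rank values.
if command -v glite-wms-job-list-match >/dev/null 2>&1; then
  glite-wms-job-list-match -d "$USER" --rank rank.jdl
fi
```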

Exercise 6: Submission to a particular CE

The JDL file of exercise 1 can also be sent directly to a particular CE using an option of the glite-wms-job-submit command. In practice this gives the same result as the Requirements attribute from exercise 3, but the principle is different: the middleware does not follow the same path. When a job is submitted with the option -r CE, the WMS does not check the availability of the CE (saving time) and no BrokerInfo is created. This can translate into a speed improvement if you are really sure that the CE you are submitting to is online and working; if it is not, your jobs will definitely fail.

Try to submit the original JDL file using the following command:

glite-wms-job-submit -d $USER -o jobId hostname.jdl

and submit it again using the following command:

glite-wms-job-submit -d $USER -r ce.iaa.csic.es:2119/jobmanager-lcgpbs-forngi -o jobId hostname.jdl

Which of these jobs started running first? Remember that, with the second command, you really have to trust the CE to which you are sending your jobs.

-- CarlosEscobar - 14 Jun 2010
