
IFIC Computing Cluster at CERN

The Valencia Computing Cluster

The TileCal Valencia computing cluster at CERN is located in building 175. It is directly accessible from machines inside the CERN General Purpose Network. Remote access has to go through the lxplus service:
local > ssh -X login@lxplus.cern.ch 
lxplus > ssh -X login@valticalXX

The ATLAS Valencia group also maintains a small cluster at the IFIC Tier-3 computing centre; users' questions can be submitted to the user Questioning Area.

Cluster topology

  • All computers mount /data6 from valticalui01 as data storage for analysis
  • All computers mount /work from valtical07 for collaborative code development (No data).
  • All analysis computers mount /localdisk* locally.
  • All analysis computers use /localdisk/xrootd as xrootd file system cache.
  • Offline developments are located in /work/offline.
  • Online developments are located in /work/TicalOnline.
  • Users can log in to valticalui01 and valtical05 to run small-scale interactive jobs; for large numbers of jobs, please use condor or proof.
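
As a quick sanity check of this layout (not part of the official setup), the shared areas can be verified on any node with standard tools; a minimal sketch, assuming the mounts described above:

# verify that the shared NFS areas are visible on this node
df -h /work /data6
# list the local xrootd cache area on an analysis machine
ls -d /localdisk*/xrootd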

Functionality and Hardware Setup of each host

| Computer | Activity | Cores | Mem | Xrootd Data Disk | OS | System Disk |
| Valtical | Xrootd redirector, proof master, MYSQL querying server | 4 | 6 GB | 0 TB | SLC6 | 300 GB |
| Valtical00 | Xrootd data server, condor worker node, proof worker node | 16 | 24 GB | 14 TB | SLC6 | 500 GB |
| Valtical04 | Xrootd data server, condor worker node, proof worker node | 16 | 24 GB | 6 TB | SLC6 | 300 GB |
| Valtical05 | User Interface, NX server, Xrootd data server, proof submit machine, condor master, condor submit machine, MYSQL querying client | 24 | 48 GB | 17 TB | SLC5 | 500 GB |
| Valtical06 | Xrootd data server, condor worker node, proof worker node | 16 | 24 GB | 2 TB | SLC6 | 300 GB |
| Valtical07 | condor worker node, ganglia server, NFS server for /work | 16 | 24 GB | 0 TB | SLC6 | 2 TB |
| Valtical08 | Xrootd data server, condor worker node, proof worker node | 16 | 24 GB | 2 TB | SLC6 | 2 TB |
| Valtical09 | Xrootd data server, condor worker node, proof worker node | 16 | 24 GB | 8 TB | SLC6 | 2 TB |
| valticalui01 | User Interface, NFS server for /data6, MYSQL querying client | 16 | 24 GB | 0 TB | SLC6 | 300 GB |

Xrootd

xrootd is a high-performance distributed file system that has become popular at many Grid sites. This document discusses its installation, configuration and general problem debugging; further details can be found at the Xrootd:Home Page.

Overview

| Computer | Xrootd Role | Xrootd Daemons | Disks for xrootd | Storage Capacity for Xrootd | Xrootd Version | OS |
| Valtical | redirector | xrootd, cmsd | 0 | 0 TB | 3.3.1 | SLC6 |
| Valtical00 | data server | xrootd, cmsd | 7 | 12.6 TB | 3.3.1 | SLC6 |
| Valtical04 | data server | xrootd, cmsd | 3 | 5.4 TB | 3.3.1 | SLC6 |
| Valtical05 | data server | xrootd, cmsd | 7 | 15.3 TB | 3.3.1 | SLC5 |
| Valtical06 | data server | xrootd, cmsd | 1 | 1.8 TB | 3.3.1 | SLC6 |
| Valtical08 | data server | xrootd, cmsd | 1 | 1.8 TB | 3.3.1 | SLC6 |
| Valtical09 | data server | xrootd, cmsd | 3 | 7.2 TB | 3.3.1 | SLC6 |

The processes run as the xrootd user and are:
/usr/bin/xrootd -l /var/log/xrootd/xrootd.log -c /etc/xrootd/xrootd-clustered.cfg -k 7 -b -s /var/run/xrootd/xrootd-default.pid -n default
/usr/bin/cmsd -l /var/log/xrootd/cmsd.log -c /etc/xrootd/xrootd-clustered.cfg -k 7 -b -s /var/run/xrootd/cmsd-default.pid -n default
/usr/bin/XrdCnsd -d -D 2 -i 90 -b root://valtical.cern.ch:2094

Xrootd Installation and Configuration

1. Install EPEL, remember to choose the right version to match your OS version
  rpm -Uvh http://download.fedoraproject.org/pub/epel/6/i386/epel-release-6-8.noarch.rpm 
2. Install the yum-priorities plugin
  yum install yum-priorities 
3. Install the OSG repositories
  rpm -Uvh http://repo.grid.iu.edu/osg/3.1/osg-3.1-el6-release-latest.rpm 
4. Install xrootd-server and dependent rpms
  yum -y install xrootd.x86_64 
5. Modify /etc/xrootd/xrootd-clustered.cfg with the following contents:
 
all.export /localdisk/xrootd                       # export /localdisk/xrootd as storage path for xrootd
oss.space public /localdisk/xrootd/* xa                            
oss.space public /localdisk2/xrootd xa    # /localdisk2/xrootd is used as an extended space for /localdisk/xrootd   
oss.space public /localdisk3/xrootd xa
oss.space public /localdisk4/xrootd xa
oss.space public /localdisk5/xrootd xa
oss.space public /localdisk6/xrootd xa
oss.space public /localdisk7/xrootd xa
all.export /localdisk/proofbox                   # for proof usage     
set xrdr=valtical.cern.ch                          # set valtical.cern.ch as xrootd redirector
all.manager $(xrdr):1213                        # tells each component the DNS name of the manager.
if $(xrdr) && named cns
      all.export /data/inventory                                           
      xrd.port 1095
else if $(xrdr)
      all.role manager
      xrd.port 1094
else
      all.role server
      oss.localroot /
      ofs.notify closew create mkdir mv rm rmdir trunc | /usr/bin/XrdCnsd -d -D 2 -i 90 -b root://$(xrdr):2094
      cms.space min 2g 5g
fi
 
6. On xrootd redirector (valtical.cern.ch), modify /etc/sysconfig/xrootd with following contents:
XROOTD_USER=xrootd
XROOTD_GROUP=xrootd
XROOTD_DEFAULT_OPTIONS="-l /var/log/xrootd/xrootd.log -c /etc/xrootd/xrootd-clustered.cfg -k 7"
XROOTD_CNS_OPTIONS="-k 7 -l /var/log/xrootd/xrootd.log -c /etc/xrootd/xrootd-clustered.cfg"
CMSD_DEFAULT_OPTIONS="-l /var/log/xrootd/cmsd.log -c /etc/xrootd/xrootd-clustered.cfg -k 7"
FRMD_DEFAULT_OPTIONS="-k 7 -l /var/log/xrootd/frmd.log -c /etc/xrootd/xrootd-clustered.cfg"
XROOTD_INSTANCES="default cns"
PURD_DEFAULT_OPTIONS="-l /var/log/xrootd/purged.log -c /etc/xrootd/xrootd-clustered.cfg -k 7"
XFRD_DEFAULT_OPTIONS="-l /var/log/xrootd/xfrd.log -c /etc/xrootd/xrootd-clustered.cfg -k 7"
CMSD_INSTANCES="default"
FRMD_INSTANCES="default"
XROOTD_INSTANCES="default"
CMSD_INSTANCES="default"
PURD_INSTANCES="default"
XFRD_INSTANCES="default"
 

7. Run xrootd setup, which creates an appropriate directory layout for xrootd, creates the user and group "xrootd" if needed, and sets permissions appropriately.

 service xrootd setup 
8. Start/Restart/stop the xrootd server using the following commands. You will want to start the services on the redirector node before any services on the data node(s).
 service xrootd start/restart/stop 
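
Once the daemons are running, a simple copy through the redirector confirms that the cluster accepts writes. This is only a minimal sketch; the test path below is an illustration, use any area you are allowed to write to:

# copy a small file into the cluster via the redirector, then read it back
xrdcp /etc/motd root://valtical.cern.ch//localdisk/xrootd/test/motd.test
xrdcp root://valtical.cern.ch//localdisk/xrootd/test/motd.test /tmp/motd.test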

Xrootd Troubleshooting

The most efficient debugging method is to check the logs under /var/log/xrootd/, which helps solve most problems. For problems that need further support from the xrootd developers, send an email to xrootd-l@slac.stanford.edu.

Q. Error : Unable to set attr XrdFrm.Pfn from /localdisk2/xrootd/public/1B/2E687E525B000000136%; operation not supported"

A. This means /localdisk2 cannot be written by xrootd. To solve this problem, /localdisk2 needs to be remounted with the option 'user_xattr'.

 
umount /localdisk2
mount -o user_xattr /dev/sdc1 /localdisk2
 

Q. Error : Last server error 3005 ('Unable to create /localdisk/xrootd/users/qing/data12_8TeV/SMDILEP_p1328_p1329/user.qing.data12_8TeV.periodC.physics_Muons.PhysCont.NTUP_SMWZ.grp14_v01_p1328_p1329_2LepSkim_v2/user.qing.001695._00762.skimmed.root; not a directory')"

A. This means that /localdisk/xrootd/users/qing/data12_8TeV/SMDILEP_p1328_p1329/user.qing.data12_8TeV.periodC.physics_Muons.PhysCont.NTUP_SMWZ.grp14_v01_p1328_p1329_2LepSkim_v2/ on one or more data servers was created as a link instead of a directory. To fix the problem you will need to remove such links.

Q. Error : Unable to write to xrootd cluster. Error message :Last server error 3011 ('No servers are available to write the file.')

A. Check whether there is enough disk space available in the xrootd cluster.
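
A rough way to check this is to look at the xrootd partitions on each data server; a sketch, assuming root ssh access to the data servers listed in the overview table:

for h in valtical00 valtical04 valtical05 valtical06 valtical08 valtical09
do
  echo "=== $h ==="
  ssh root@$h "df -h /localdisk*"
done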

Q. Error : Xrootd runs on the redirector and data servers, but there is no communication between the redirector and the data servers.

A. Add rules to iptables to accept incoming tcp connections from xrootd.
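
For example, the ports used in the configuration above (1094 for xrootd, 1095 for the CNS instance, 1213 for cmsd and 2094 for XrdCnsd) could be opened with rules like the following; this is only a sketch, adapt it to the local iptables policy:

for p in 1094 1095 1213 2094
do
  iptables -I INPUT -p tcp --dport $p -j ACCEPT
done
service iptables save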

Xrootd FAQ

How to list files under XROOTD

On valtical05 and valticalui01, you can get such information with '/sbin/xls'
 
[root@valtical05 qing]# xls -h
NAME
    xls - list directory contents in valtical xrootd
SYNOPSIS
    ls [OPTION] [PATH]
DESCRIPTION
    List  information about the files and directories under [PATH]
    -s   show the total size of [PATH]
    -l   show everything directly under [PATH] with their size
    -r   show all files under [PATH] and it's sub-directories
    -a   show the size of  all files under [PATH] and it's sub directories
EXAMPLE
   xls -s root://valtical.cern.ch/localdisk/xrootd/users/

How to delete files under XROOTD

Use the following command to delete a file in xrootd; $filename should be provided in the following format: root://valtical.cern.ch//localdisk/
  /afs/cern.ch/user/l/lfiorini/public/xrdrm.sh $filename 
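
If the helper script above is not available, a removal wrapper can be sketched with the old xrd command-line client shipped with xrootd 3.x; this is a hypothetical replacement, not the actual content of xrdrm.sh:

#!/bin/bash
# xrdrm_sketch.sh -- hypothetical helper
# usage: ./xrdrm_sketch.sh root://valtical.cern.ch//localdisk/xrootd/path/to/file
filename=$1
host=$(echo "$filename" | cut -d/ -f3)      # valtical.cern.ch
path=$(echo "$filename" | cut -d/ -f4-)     # /localdisk/xrootd/path/to/file
xrd "$host" rm "$path"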

Condor

HTCondor is a very stable batch system. This section describes its installation and configuration on the valtical cluster. For more information about HTCondor, please refer to the Condor:Home Page.

Overview

| Computer | Condor Role | Condor Daemons | Condor Version | Number of Cores used by condor | OS |
| Valtical05 | Head Node, Submit Node | COLLECTOR, MASTER, NEGOTIATOR, SCHEDD, STARTD, Procd | 7.6.4 | 1 | SLC5 |
| Valtical00 | Worker Node | MASTER, STARTD, Procd | 7.8.8 | 12 | SLC6 |
| Valtical04 | Worker Node | MASTER, STARTD, Procd | 7.8.8 | 12 | SLC6 |
| Valtical06 | Worker Node | MASTER, STARTD, Procd | 7.8.8 | 12 | SLC6 |
| Valtical07 | Worker Node | MASTER, STARTD, Procd | 7.8.8 | 12 | SLC6 |
| Valtical08 | Worker Node | MASTER, STARTD, Procd | 7.8.8 | 12 | SLC6 |
| Valtical09 | Worker Node | MASTER, STARTD, Procd | 7.8.8 | 12 | SLC6 |

Condor Installation and Configuration

1. Install EPEL, remember to choose the right version to match your OS version
  rpm -Uvh http://download.fedoraproject.org/pub/epel/6/i386/epel-release-6-8.noarch.rpm  
2. Install the yum-priorities plugin
  yum install yum-priorities 
3. Install the OSG repositories
  rpm -Uvh http://repo.grid.iu.edu/osg/3.1/osg-3.1-el6-release-latest.rpm   
4. Install HTCondor from the OSG yum repository
  yum -y install condor.x86_64 
5. Modify /etc/condor/config.d/00personal_condor.config as following:
CONDOR_HOST = valtical05  # set valtical05 as condor central manager
MAX_NUM_CPUS = 12           # maximum number of CPU cores that condor may use
UID_DOMAIN = cern.ch
COLLECTOR_NAME = Personal Condor at $(FULL_HOSTNAME)
LOCK = /tmp/condor-lock.$(FULL_HOSTNAME)0.00134799613534042
START = TRUE
SUSPEND = FALSE
PREEMPT = FALSE
KILL = FALSE
#DAEMON_LIST = MASTER,  STARTD                                                                # uncomment this line on a condor worker node
#DAEMON_LIST = COLLECTOR, MASTER, NEGOTIATOR, SCHEDD, STARTD                                  # uncomment this line on the condor central manager
PREEMPTION_REQUIREMENTS = True
RANK = 0
NEGOTIATOR_CONSIDER_PREEMPTION = True  
SEC_DEFAULT_AUTHENTICATION  = NEVER                                                                                    # comment this line on the condor submit machine
LOCAL_CONFIG_FILE = $(LOCAL_CONFIG_FILE)
HOSTALLOW_READ = *.cern.ch
HOSTALLOW_WRITE = *.cern.ch
HOSTALLOW_NEGOTIATOR = valtical05.cern.ch
6. Start/Restart/Stop condor daemons
 service condor start/restart/stop 

Condor Troubleshooting

The most efficient debugging method is to view the condor logs saved under /var/log/condor/. If the information provided there is not detailed enough, change the corresponding values from their defaults to D_ALL in /etc/condor/condor_config, then recreate the log. Further support can be obtained from htcondor-users@cs.wisc.edu.
ALL_DEBUG               =
COLLECTOR_DEBUG      =
KBDD_DEBUG      =
NEGOTIATOR_DEBUG   = D_MATCH
SCHEDD_DEBUG      = D_ALL
SHADOW_DEBUG      =
STARTD_DEBUG      = 
MASTER_DEBUG      = 
JOB_ROUTER_DEBUG        =
ROOSTER_DEBUG           =
SHARED_PORT_DEBUG       =
HDFS_DEBUG              =
TRIGGERD_DEBUG           =
HAD_DEBUG      =
REPLICATION_DEBUG   =
TRANSFERER_DEBUG   =
GRIDMANAGER_DEBUG   = 
CREDD_DEBUG         = D_FULLDEBUG
STORK_DEBUG = D_FULLDEBUG
LeaseManager_DEBUG      = D_FULLDEBUG
LeaseManager.DEBUG_ADS      = False
TOOL_DEBUG = D_ALL
SUBMIT_DEBUG = D_ALL

Condor FAQ

condor commands

  • condor_submit : submit a condor job
  • condor_rm user : removes jobs submitted by the user;
  • condor_rm cluster.process : removes the specific job;
  • condor_rm -forcex : kill a job forcibly
  • condor_q: To get the status of all queued jobs
  • condor_q -analyze cluster.process : provide information of a single condor job
  • condor_q -better-analyze cluster.process : provide more information than -analyze
  • condor_q -submitter: get condor jobs corresponding to a user.
  • condor_status: To monitor and query the condor pool for resource information, submitter information, checkpoint server information, and daemon master information
  • condor_config_val: can be used to obtain configured values. Use 'condor_config_val -v variable ' to get the paths of the important directories
  • condor_prio : change the priority of a user's job. The priority can be changed only by the job owner or root.
  • condor_userprio : change a user's priority. The priority can be changed only by root.
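
As an illustration of the workflow (the file name and executable below are hypothetical; the job files under /work/users/qing/data5/qing/condor_test are the real examples used on this cluster), a minimal vanilla-universe submit file and the corresponding commands could look like:

# hello.job -- hypothetical minimal submit file
Universe   = vanilla
Executable = /bin/hostname
Output     = logs/hello.out
Error      = logs/hello.err
Log        = logs/hello.log
Queue

condor_submit hello.job        # submit the job
condor_q                       # watch it in the queue
condor_rm cluster.process      # remove it if needed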

Proof

Overview

| Computer | Proof Role | Number of cores for proof | OS |
| Valtical | master | 0 | SLC5 |
| Valtical00 | Worker Node | 12 | SLC6 |
| Valtical04 | Worker Node | 12 | SLC6 |
| Valtical06 | Worker Node | 12 | SLC6 |
| Valtical07 | Worker Node | 12 | SLC6 |
| Valtical08 | Worker Node | 12 | SLC6 |
| Valtical09 | Worker Node | 12 | SLC6 |

Proof Installation and Configuration

1. Install ROOT under /opt
cd /opt
wget ftp://root.cern.ch/root/root_v5.28.00g.source.tar.gz
tar -xvzf root_v5.28.00g.source.tar.gz
cd root
./configure
gmake
2. Modify /etc/init.d/proofd with following contents:
XRDUSER="xrootd"
XRDLOG="/opt/root/var/logs/xproofd.log"
XRDCF="/opt/root/etc/xproofd.cfg"
XRDDEBUG=""
XRDUSERCONFIG=""
XPROOFD=/opt/root/bin/xproofd
XRDLIBS=/opt/root/lib
export ROOTSYS=/opt/root
. /etc/init.d/functions
. /etc/sysconfig/network
[ -f /etc/sysconfig/xproofd ] && . /etc/sysconfig/xproofd
[ ! -z "$XRDUSERCONFIG" ] && [ -f "$XRDUSERCONFIG" ] && . $XRDUSERCONFIG
if [ ${NETWORKING} = "no" ]
then
        exit 0
fi
[ -x $XPROOFD ] || exit 0
RETVAL=0
prog="xproofd"
start() {
        echo -n $"Starting $prog: "
        export LD_LIBRARY_PATH=$XRDLIBS:$LD_LIBRARY_PATH
        daemon $XPROOFD -b -l $XRDLOG -R $XRDUSER -c $XRDCF $XRDDEBUG -k 5
        RETVAL=$?
        echo
        [ $RETVAL -eq 0 ] && touch /var/lock/subsys/xproofd
        return $RETVAL
}
stop() {
        [ ! -f /var/lock/subsys/xproofd ] && return 0 || true
        echo -n $"Stopping $prog: "
        killproc xproofd
        RETVAL=$?
        echo
        [ $RETVAL -eq 0 ] && rm -f /var/lock/subsys/xproofd
        return $RETVAL
}
case "$1" in
  start)
        start
        ;;
  stop)
        stop
        ;;
  status)
        status xproofd
        RETVAL=$?
        ;;
  restart|reload)
        stop
        start
        ;;
  condrestart)
        if [ -f /var/lock/subsys/xproofd ]; then
            stop
            start
        fi
        ;;
  *)
        echo  $"Usage: $0 {start|stop|status|restart|reload|condrestart}"
        exit 1
esac
exit $RETVAL
3. Modify /opt/root/etc/xproofd.cfg with following contents:
set rootlocation = /opt/root
xpd.rootsys ${rootlocation}
xpd.workdir /localdisk/proofbox
xpd.resource static ${rootlocation}/etc/proof/proof.conf
xpd.role worker if valtical*.cern.ch
xpd.role master if valtical.cern.ch
xpd.allow valtical.cern.ch
xpd.maxoldlogs 2
xpd.poolurl root://valtical.cern.ch 

4. Modify /opt/root/etc/proof/proof.conf with following contents:

master valtical.cern.ch workdir=/localdisk/proofbox


worker valtical09.cern.ch workdir=/localdisk/proofbox
worker valtical09.cern.ch workdir=/localdisk/proofbox
worker valtical09.cern.ch workdir=/localdisk/proofbox
worker valtical09.cern.ch workdir=/localdisk/proofbox
worker valtical09.cern.ch workdir=/localdisk/proofbox
worker valtical09.cern.ch workdir=/localdisk/proofbox
worker valtical09.cern.ch workdir=/localdisk/proofbox
worker valtical09.cern.ch workdir=/localdisk/proofbox
worker valtical09.cern.ch workdir=/localdisk/proofbox
worker valtical09.cern.ch workdir=/localdisk/proofbox
worker valtical09.cern.ch workdir=/localdisk/proofbox
worker valtical09.cern.ch workdir=/localdisk/proofbox

worker valtical08.cern.ch workdir=/localdisk/proofbox
worker valtical08.cern.ch workdir=/localdisk/proofbox
worker valtical08.cern.ch workdir=/localdisk/proofbox
worker valtical08.cern.ch workdir=/localdisk/proofbox
worker valtical08.cern.ch workdir=/localdisk/proofbox
worker valtical08.cern.ch workdir=/localdisk/proofbox
worker valtical08.cern.ch workdir=/localdisk/proofbox
worker valtical08.cern.ch workdir=/localdisk/proofbox
worker valtical08.cern.ch workdir=/localdisk/proofbox
worker valtical08.cern.ch workdir=/localdisk/proofbox
worker valtical08.cern.ch workdir=/localdisk/proofbox
worker valtical08.cern.ch workdir=/localdisk/proofbox


worker valtical07.cern.ch workdir=/localdisk/proofbox
worker valtical07.cern.ch workdir=/localdisk/proofbox
worker valtical07.cern.ch workdir=/localdisk/proofbox
worker valtical07.cern.ch workdir=/localdisk/proofbox
worker valtical07.cern.ch workdir=/localdisk/proofbox
worker valtical07.cern.ch workdir=/localdisk/proofbox
worker valtical07.cern.ch workdir=/localdisk/proofbox
worker valtical07.cern.ch workdir=/localdisk/proofbox
worker valtical07.cern.ch workdir=/localdisk/proofbox
worker valtical07.cern.ch workdir=/localdisk/proofbox
worker valtical07.cern.ch workdir=/localdisk/proofbox
worker valtical07.cern.ch workdir=/localdisk/proofbox

worker valtical06.cern.ch workdir=/localdisk/proofbox
worker valtical06.cern.ch workdir=/localdisk/proofbox
worker valtical06.cern.ch workdir=/localdisk/proofbox
worker valtical06.cern.ch workdir=/localdisk/proofbox
worker valtical06.cern.ch workdir=/localdisk/proofbox
worker valtical06.cern.ch workdir=/localdisk/proofbox
worker valtical06.cern.ch workdir=/localdisk/proofbox
worker valtical06.cern.ch workdir=/localdisk/proofbox
worker valtical06.cern.ch workdir=/localdisk/proofbox
worker valtical06.cern.ch workdir=/localdisk/proofbox
worker valtical06.cern.ch workdir=/localdisk/proofbox
worker valtical06.cern.ch workdir=/localdisk/proofbox

worker valtical04.cern.ch workdir=/localdisk/proofbox
worker valtical04.cern.ch workdir=/localdisk/proofbox
worker valtical04.cern.ch workdir=/localdisk/proofbox
worker valtical04.cern.ch workdir=/localdisk/proofbox
worker valtical04.cern.ch workdir=/localdisk/proofbox
worker valtical04.cern.ch workdir=/localdisk/proofbox
worker valtical04.cern.ch workdir=/localdisk/proofbox
worker valtical04.cern.ch workdir=/localdisk/proofbox
worker valtical04.cern.ch workdir=/localdisk/proofbox
worker valtical04.cern.ch workdir=/localdisk/proofbox
worker valtical04.cern.ch workdir=/localdisk/proofbox
worker valtical04.cern.ch workdir=/localdisk/proofbox


worker valtical00.cern.ch workdir=/localdisk/proofbox
worker valtical00.cern.ch workdir=/localdisk/proofbox
worker valtical00.cern.ch workdir=/localdisk/proofbox
worker valtical00.cern.ch workdir=/localdisk/proofbox
worker valtical00.cern.ch workdir=/localdisk/proofbox
worker valtical00.cern.ch workdir=/localdisk/proofbox
worker valtical00.cern.ch workdir=/localdisk/proofbox
worker valtical00.cern.ch workdir=/localdisk/proofbox
worker valtical00.cern.ch workdir=/localdisk/proofbox
worker valtical00.cern.ch workdir=/localdisk/proofbox
worker valtical00.cern.ch workdir=/localdisk/proofbox
worker valtical00.cern.ch workdir=/localdisk/proofbox

Proof Troubleshooting

Checking the logs in /localdisk/proofbox/ can be helpful; for stranger problems, please ask the question at the Proof: Forum.

Proof FAQ

Q: How to clean proof from a killed proof job?

A: Reset valtical.cern.ch in TProof:

TProof::Reset("valtical.cern.ch", kTRUE);

Q: Where is the default location for the proof sandboxes?

A: Usually it is $HOME; if you wish to change it, set it in $HOME/.rootrc like:

ProofLite.Sandbox /data5/qing/SAM_proof/prooflite         

Q: Can proof detect that one cpu core is overloaded and stop sending jobs to it?

A: No, this is not currently possible with PROOF.

Q: Is it possible to send jobs to a dedicated WN?

A: The concept of 'assigning a job' to a subset of workers is not really a PROOF concept.

Ganglia

Ganglia Installation and Configuration

1. On ganglia server install the following packages:
  
yum -y install ganglia
yum -y install ganglia-gmetad
yum -y install ganglia-gmond
yum -y install ganglia-web
yum -y install  httpd

2. On ganglia server, modify the following lines in /etc/ganglia/gmond.conf:

   
/*
 * The cluster attributes specified will be used as part of the <CLUSTER>
 * tag that will wrap all hosts collected by this instance.
 */
cluster {
  name = "VALTICAL"
  owner = "VALTICAL"
  latlong = "VALTICAL"
  url = "VALTICAL"
}

/* The host section describes attributes of the host, like the location */
host {
  location = "VALTICAL"
}

3. Start Ganglia daemons on the server:

service gmetad start
service gmond start

4. On each ganglia worker node install the following packages:

   
yum -y install ganglia
yum -y install ganglia-gmond

5. Modify /etc/ganglia/gmond.conf with the following contents:

   
/*
 * The cluster attributes specified will be used as part of the <CLUSTER>
 * tag that will wrap all hosts collected by this instance.
 */
cluster {
  name = "VALTICAL"
  owner = "VALTICAL"
  latlong = "VALTICAL"
  url = "VALTICAL"
}

/* The host section describes attributes of the host, like the location */
host {
  location = "VALTICAL"
}


6. Start the Ganglia daemon on the worker node:

service gmond start
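
To verify that a node is reporting, the gmond XML output can be read from its TCP accept port (8649 by default, unless changed in gmond.conf); a quick check, assuming netcat is installed:

# dump the gmond XML report and look for the hosts of the cluster
nc localhost 8649 | grep '<HOST NAME'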

Ganglia Monitoring

Currently the ganglia monitoring of the valtical cluster can be viewed at the Ganglia Monitoring Webpage. Due to CERN restrictions, this page can only be opened from within the CERN internal network. If you are not able to connect to it, try to modify /etc/httpd/conf/httpd.conf with the following contents (inside the relevant <Directory> block) and then restart the daemons:
# Controls who can get stuff from this server.
#
    Order allow,deny
    Allow from all
</Directory>

MYSQL

MYSQL Installation and Configuration

1. Install the following rpm packages on the MYSQL server and query machines (valtical,valtical05,valticalui01)
  
yum -y install mysql-server mysql php-mysql  MySQL-python
/sbin/service mysqld start 
/sbin/chkconfig mysqld on 
 

2. On valtical create database xrootd and table T_xrootd

  
[root@valtical ~]# mysql
mysql> create database xrootd;
mysql> use xrootd;
mysql> create table T_xrootd (...);            # column definitions omitted here
mysql> grant select,insert on xrootd.* to 'xrootd'@'localhost';
mysql> grant all on *.* to xrootd@'137.138.40.184';
mysql> grant all on *.* to xrootd@'137.138.40.143';
mysql>  grant all on *.* to xrootd@'137.138.40.190';
mysql> grant all on *.* to xrootd@'137.138.40.186';
mysql> grant all on *.* to xrootd@'137.138.40.165';
mysql> grant all on *.* to xrootd@'137.138.40.181';
mysql> grant all on *.* to xrootd@'137.138.40.166';
mysql> grant all on *.* to xrootd@'137.138.40.140';
mysql> grant all on *.* to xrootd@'137.138.40.173';
mysql> exit
 

3. Add the following line to the crontab on valtical00. scan.sh calls /work/users/qing/data5/qing/scan_cluster/scan.py, which creates /work/users/qing/data5/qing/scan_cluster/all_xrootd_files.txt; this file records all files and directories in the xrootd file system.

  
05 * * * * source  /work/users/qing/data5/qing/scan_cluster/scan.sh

4. Add the following line into crontab on valtical:

  
25 * * * *  source /work/users/qing/data5/qing/mysql/import.sh
 

NX

NX Installation and Configuration

1. Execute the following commands on valtical05:
  
yum -y install libjpeg
yum -y install openssl-devel
yum -y install netcat
yum -y install expect
cd /root
scp root@valtical25.cern.ch:/root/NX.tar.gz ./
tar -xvzf NX.tar.gz
ROOTPATH=/root/NX
cd $ROOTPATH
find . -name "*tar.gz" -exec tar -zxf {} \;
cd $ROOTPATH/libpng-1.4.3; ./configure; make
cd $ROOTPATH/nxcomp  ; ./configure;make
cd $ROOTPATH/nxcompext  ; ./configure;make
cd $ROOTPATH/nxcompshad  ; ./configure;make
cd $ROOTPATH/nxproxy  ; ./configure;make
cd $ROOTPATH/nx-X11  ; make World

cd $ROOTPATH
cp -a nx-X11/lib/X11/libX11.so* /usr/NX/lib/
cp -a nx-X11/lib/Xext/libXext.so* /usr/NX/lib/
cp -a nx-X11/lib/Xrender/libXrender.so* /usr/NX/lib/
cp -a nxcomp/libXcomp.so* /usr/NX/lib/
cp -a nxcompext/libXcompext.so* /usr/NX/lib/
cp -a nxcompshad/libXcompshad.so* /usr/NX/lib/
cp -a nx-X11/programs/Xserver/nxagent /usr/NX/bin/
cp -a nxproxy/nxproxy /usr/NX/bin/
cd $ROOTPATH/freenx-server-0.7.3
patch < gentoo-nomachine.diff
scp root@valtical25.cern.ch:/root/build/freenx-server-0.7.3/nx_3.3.0.patch ./
patch < nx_3.3.0.patch
make ; make install
cd /usr/NX/bin
./nxsetup --install
./nxserver restart
  

2. Install NX client on the client machine:

    wget http://64.34.173.142/download/3.5.0/Linux/nxclient-3.5.0-7.x86_64.rpm
    rpm -Uvh nxclient-3.5.0-7.x86_64.rpm 
  

NX FAQ

How to connect to the NX server on valtical05 from a machine outside the CERN network?

1. Create tunnel.sh with following lines:
qing@tical31:~$ cat tunnel.sh 
#!/usr/bin/env python

import os
import time
import pexpect
import sys
import getpass

user = raw_input("User:")
passw = getpass.unix_getpass("Enter your password:")

if (user,passw) != ("",""):
    print "parent thread"
    print "Connecting to lxplus"
    ssh = pexpect.spawn('ssh  -L 10001:valtical05.cern.ch:22 %s@lxplus.cern.ch'%user)
    ssh.expect('password')
    ssh.sendline(passw)
    ssh.expect('lxplus')
    ssh.interact()

  

2. Run python tunnel.sh and log in to lxplus with your CERN NICE account and password.

3. In the NX client configuration, set Host to localhost and Port to 10001, then log in with your NICE account and password.
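
Alternatively, if pexpect is not available on the client machine, the same tunnel can be opened with a plain ssh command (this is equivalent to what tunnel.sh automates):

# forward local port 10001 to the ssh port of valtical05 through lxplus
ssh -L 10001:valtical05.cern.ch:22 login@lxplus.cern.ch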

NFS

/work

1. on valtical07 modify /etc/exports with the following lines:
[root@valtical07 ~]# cat /etc/exports 
/work                    valtical00(rw,sync,no_root_squash)
/work                    valtical01(rw,sync,no_root_squash)
/work                    valtical02(rw,sync,no_root_squash)
/work                    valtical03(rw,sync,no_root_squash)
/work                    valtical04(rw,sync,no_root_squash)
/work                    valtical05(rw,sync,no_root_squash)
/work                    valtical06(rw,sync,no_root_squash)
/work                    valticalui01(rw,sync,no_root_squash)
/work                    valtical08(rw,sync,no_root_squash)
/work                    valtical09(rw,sync,no_root_squash)
/work                    valtical17(rw,sync,no_root_squash)
/work                    valtical19(rw,sync,no_root_squash)
/work                    valtical20(rw,sync,no_root_squash)
/work                    valtical24(rw,sync,no_root_squash)
/work                    valticalui01(rw,sync,no_root_squash)
/work                    137.138.77.204(rw,sync,no_root_squash)
/work                    sbctil-rod-01(rw,sync,no_root_squash)
/work                    sbctil-ttc-01(rw,sync,no_root_squash)
/work                    sbctil-las-01(rw,sync,no_root_squash)
/work                    sbctil-ces-01(rw,sync,no_root_squash)
/work                    atcacpm01(rw,sync,no_root_squash)

2. start the rpcbind service and nfs service on the NFS server and client machine:

service rpcbind restart
service nfs restart

3. mount /work on the client machine and add it to /etc/fstab

mount -t nfs -o rw,rsize=8192,wsize=8192,hard,intr,addr=137.138.40.190 valtical07:/work /work
for d in "/work"
do
  mkdir $d
  echo "valtical07:$d         $d                  nfs rw,rsize=8192,wsize=8192,hard,intr,addr=137.138.40.190 0 0" >> /etc/fstab
done

/data6

1. on valticalui01 modify /etc/exports with the following lines:
/data6                    valtical00(rw,sync,no_root_squash)
/data6                    valtical01(rw,sync,no_root_squash)
/data6                    valtical02(rw,sync,no_root_squash)
/data6                    valtical03(rw,sync,no_root_squash)
/data6                    valtical04(rw,sync,no_root_squash)
/data6                    valtical05(rw,sync,no_root_squash)
/data6                    valtical06(rw,sync,no_root_squash)
/data6                    valtical07(rw,sync,no_root_squash)
/data6                    valtical08(rw,sync,no_root_squash)
/data6                    valtical09(rw,sync,no_root_squash)

2. start the rpcbind service and nfs service on the NFS server and client machine:

service rpcbind restart
service nfs restart

3. mount /data6 on the client machines and add it to /etc/fstab

mount -t nfs -o rw,rsize=8192,wsize=8192,hard,intr,addr=137.138.40.173 valticalui01:/data6 /data6
for d in "/data6"
do
  mkdir $d
  echo "valticalui01:$d         $d                  nfs rw,rsize=8192,wsize=8192,hard,intr,addr=137.138.40.173 0 0" >> /etc/fstab
done

EOS

EOS Setup at CERN

source /afs/cern.ch/project/eos/installation/atlas/etc/setup.sh

You probably need to move files from/to EOS to your local computer/AFS, or from CASTOR to EOS. Here are a few examples:

# copy a single file
eos cp /eos/atlas/user/t/test/histo.root /tmp/                   
# copy all files within a directory - no subdirectories
eos cp /eos/atlas/user/t/test/histodirectory/ /afs/cern.ch/user/t/test/histodirectory  
# copy recursively the complete hierarchy in a directory
eos cp -r /eos/atlas/user/t/test/histodirectory/ /afs/cern.ch/user/t/test/histodirectory
# copy recursively the complete hierarchy into the directory 'histodirectory' in the current local working directory
eos cp -r /eos/atlas/user/t/test/histodirectory/ histodirectory
# copy recursively the complete hierarchy of a CASTOR directory to an EOS directory (make sure you have the proper CASTOR settings)
eos cp -r root://castorpublic//castor/cern.ch/user/t/test/histodirectory/ /eos/atlas/user/t/test/histodirectory/ 

SAM Monitoring

Service Availability Monitoring of condor jobs

1. The condor job setup scripts are saved in /work/users/qing/data5/qing/condor_test; there you can run condor_submit valtical07.job to send a condor job specifically to valtical07.cern.ch.
# valtical07.job
Universe        = vanilla
Notification    = Error
Executable      = script.bash
Arguments       = HWWrel16 valtical07.txt valtical07 valtical07 0 1 1 1
GetEnv          = True
Initialdir      = /work/users/qing/data5/qing/condor_test/Results/data11_177986_Egamma
Output          = logs/valtical07.out
Error           = logs/valtical07.err
Log             = logs/valtical07.log
requirements    = ((Arch == "INTEL" || ARCH == "X86_64") && regexp("valtical00",Machine)!=TRUE)  && ((machine == "valtical07.cern.ch"))
+IsFastJob      = True
+IsAnaJob       = TRUE
stream_output   = False
stream_error    = False
should_transfer_files   = YES
when_to_transfer_output = ON_EXIT_OR_EVICT
transfer_input_files    = HWWrel16, list/valtical07.txt, data11_7TeV.periodAllYear_DetStatus-v28-pro08-07_CoolRunQuery-00-04-00_WZjets_allchannels.xml, mu_mc10b.root,
Queue

2. /work/users/qing/data5/qing/SAM_condor/sam.py sends a condor job to each condor worker node, retrieves the output status, and saves it in /work/users/qing/data5/qing/SAM_condor/result.txt.

#  sam.py -- developed by Gang Qin on Jan 9 2012;
import commands
import os

def sendmail(wn_name):
        title =  "condor test job failed on " + wn_name 
        content = title + " at " + commands.getoutput("date")
        cmd = " echo '" + content + "' | mail -v -s '" + title + "' gang.qin@cern.ch"
        #cmd = " echo '" + content + "' | mail -v -s '" + title + "' gang.qin@cern.ch;Luca.Fiorini@cern.ch"
        commands.getstatusoutput(cmd)

def kill_oldjobs():
   oldjobs = commands.getoutput("condor_q | grep qing | wc -l")
   list_oldjob_id = []
   if oldjobs != "0":
      tmp = commands.getoutput("condor_q | grep qing").split()
      for i in range(len(tmp)):
         if tmp[i] == 'qing':
            list_oldjob_id.append(tmp[i-1]) 
            os.system("condor_rm  "+ tmp[i-1])
            #os.system("condor_rm -forcex "+ tmp[i-1])
   else:
      print "No old jobs"
      return 0
   print "Old jobs ", list_oldjob_id, "are removed from the queue."
   return 0

def submit_newjobs(list_wn):
   list_submit_time = []
   list_jobid = []
   for i in range(len(list_wn)):
      #os.system("sleep 2")
      start_time = commands.getoutput("date +%s")
      job_file = list_wn[i] + '.job'
      tmp = commands.getoutput("condor_submit " + job_file)
      list_jobid.append(tmp.split()[-1].split(".")[0]) 
      list_submit_time.append(start_time)
   return list_submit_time, list_jobid

def check_jobs(list_wn,list_jobid):
   list = []
   for i in range(len(list_wn)):
      filename = "/work/users/qing/data5/qing/condor_test/Results/data11_177986_Egamma/Data11_0jets_" + list_wn[i]  + ".root"
      tmp = commands.getoutput("ls " + filename) 

      if "No such file or directory" in tmp:      # output root file not created
         tag_outputfile = -1
      else:
         tag_outputfile = 1   # output root file created

      tmp = commands.getoutput("condor_q | grep qing ")
      if list_jobid[i] in tmp:
         tag_job_queued = 1 # job still in the queue
      else:
         tag_job_queued = -1  # job not in the queue

      if (tag_outputfile == 1) and (tag_job_queued == -1):
         list.append(1)   # 1 means good 
      elif (tag_outputfile == 1) and (tag_job_queued == 1):
         list.append(-2)  # -2 means outfile created but job not started 
      elif (tag_outputfile == -1) and (tag_job_queued == 1):
         list.append(0)   # 0 job is queued   
      else:
         list.append(-1) # -1 means job finished but output root file not created
         sendmail(list_wn[i])

   time = commands.getoutput("date +%s")
   line = time + '\t' 
   for i in range(len(list)):
      line = line + str(list[i]) + '\t'
   line = line + '\n'
   return line

def record(line):
   print line
   history_file = open("/work/users/qing/data5/qing/SAM_condor/result.txt","a")
   history_file.writelines(line)
   history_file.close()

def main():
   list_wn = ["valtical04","valtical06","valtical07","valtical08","valtical09",'valtical00']
   #list_wn = ["valtical00","valtical04","valtical05","valtical06","valtical07","valtical08","valtical09"]
   work_dir = "/work/users/qing/data5/qing/condor_test"
   os.chdir(work_dir)
   os.system("rm -rf Results/data11_177986_Egamma/Data11_0jets_valtical0*.root") 
   os.system("rm -rf Results/data11_177986_Egamma/Data11_0jets_valtical0*.txt") 
   kill_oldjobs()
   list_submit_time, list_jobid = submit_newjobs(list_wn)
   print list_submit_time, list_jobid, list_wn
   os.system("sleep 120")
   line = check_jobs(list_wn,list_jobid)
   record(line)

if __name__ == "__main__":
   main()

3. An acrontab entry was set up for user qing to run /work/users/qing/data5/qing/SAM_condor/sam.sh on valtical05 every 30 minutes.

30 * * * * valtical05.cern.ch source  /work/users/qing/data5/qing/SAM_condor/sam.sh >/dev/null 2>&1

# sam.sh
#!/bin/bash  
source /etc/profile
source /afs/cern.ch/sw/lcg/contrib/gcc/4.7/x86_64-slc5-gcc47-opt/setup.sh
export ROOTSYS=/afs/cern.ch/sw/lcg/app/releases/ROOT/5.34.02/x86_64-slc5-gcc43-opt/root
export PATH=/afs/cern.ch/sw/lcg/external/Python/2.5.4p2/x86_64-slc5-gcc43-opt/bin:$ROOTSYS/bin:$PATH
export LD_LIBRARY_PATH=$ROOTSYS/lib:/afs/cern.ch/sw/lcg/external/Python/2.5.4p2/x86_64-slc5-gcc43-opt/lib:$LD_LIBRARY_PATH
export PYTHONPATH=$PYTHONPATH:$ROOTSYS/lib
export rootpath=/work/users/qing/data5/qing/SAM_condor
cd $rootpath
export cmd=`ps eux | grep 'python sam.py' | grep -v grep | grep -v 'ps eux'`
echo $cmd

if [ "$cmd" == "" ] ; then
        echo 'Running sam.py now'
        python sam.py
else
        echo 'python sam.py is already running'
fi

#echo "" >> $rootpath/log

4. On valtical.cern.ch, /home/qing/SAM_condor/week.py converts the result.txt above into plots.

 
# week.py 
from ROOT import *
import commands
import re, array
from array import array

def gr(n,x,y):
        gr = TGraph(n,x,y)
        gr.SetMaximum(9)
        gr.SetMinimum(3)
        gr.SetLineWidth(2);
        gr.SetMarkerSize(1.5)
        return gr

def tex():
   t = TLatex()
   t.SetNDC()
   t.SetTextFont( 62 )
   t.SetTextSize( 0.04 )
   t.SetTextAlign( 12 )
   t.SetTextColor( 1 )
   t.DrawLatex( 0.12, 0.83, 'Valtical00'  )
   t.DrawLatex( 0.12, 0.70, 'Valtical09'  )
   t.DrawLatex( 0.12, 0.56, 'Valtical08'  )
   t.DrawLatex( 0.12, 0.43, 'Valtical07'  )
   t.DrawLatex( 0.12, 0.30, 'Valtical06'  )
   #t.DrawLatex( 0.12, 0.4, 'Valtical05'  )
   t.DrawLatex( 0.12, 0.17, 'Valtical04'  )
   #t.DrawLatex( 0.12, 0.2, 'Valtical00'  )

def tex_per(var04,var06,var07,var08,var09,var00):
#def tex_per(var00,var04,var05,var06,var07,var08,var09):
   #par00 = str((int(var00*10000)/100.))+"%"
   par04 = str((int(var04*10000)/100.))+"%"
   #par05 = str((int(var05*10000)/100.))+"%"
   par06 = str((int(var06*10000)/100.))+"%"
   par07 = str((int(var07*10000)/100.))+"%"
   par08 = str((int(var08*10000)/100.))+"%"
   par09 = str((int(var09*10000)/100.))+"%"
   par00 = str((int(var00*10000)/100.))+"%"
   t = TLatex()
   t.SetNDC()
   t.SetTextFont( 62 )
   t.SetTextSize( 0.04 )
   t.SetTextAlign( 12 )
   t.SetTextColor( 1 )
   #t.DrawLatex( 0.79, 0.2, par00  )
   t.DrawLatex( 0.79, 0.17, par04  )
   #t.DrawLatex( 0.79, 0.37, par05  ) 
   t.DrawLatex( 0.79, 0.30, par06  )  
   t.DrawLatex( 0.79, 0.43, par07  ) 
   t.DrawLatex( 0.79, 0.56, par08  ) 
   t.DrawLatex( 0.79, 0.70, par09  ) 
   t.DrawLatex( 0.79, 0.83, par00  ) 

def convert(list):
        c1 = TCanvas("c1", "c1",100,0,1024,768)
        gStyle.SetOptStat(0)
        gStyle.SetPalette(1)
        gStyle.SetPaperSize(1024,768)
        #c1.SetGrid()
        c1.SetFillColor(0)
        c1.SetFrameFillColor(0)
        c1.SetFrameLineWidth(2)
        c1.SetFrameBorderMode(0)
        c1.SetFrameBorderSize(2)
        c1.Update()
   #myps.NewPage()

   y1,y2,y3,y4,yy = array('d'), array('d'),array('d'), array('d'),array('d')
   x1,x2,x3,x4,xx = array('d'), array('d'),array('d'), array('d'),array('d')
   #current_time = (int(commands.getoutput("date +%s"))-int(list[-25].split()[0]))/60
   now = int(commands.getoutput("date +%s"))
   week = 12*24*7

   #list_00 = []
   list_04 = []
   #list_05 = []
   list_06 = []
   list_07 = []
   list_08 = []
   list_09 = []
   list_00 = []

   for i in range(week):
      j = len(list)-week+i
      tmp = list[j].split()
      time1 = ((int(tmp[0])-now))
      time = float(time1/(60*60*24.))
      if time1 < -1*3600*24*7:
         continue
      #list_00.append(int(tmp[1]))
      list_04.append(int(tmp[1]))
      #list_05.append(int(tmp[3]))
      list_06.append(int(tmp[2]))
      list_07.append(int(tmp[3]))
      list_08.append(int(tmp[4]))
      list_09.append(int(tmp[5]))
      if len(tmp) > 6: 
         list_00.append(int(tmp[6]))
      else:
      
         list_00.append(0)

      #for k in range(7):   
      for k in range(len(tmp)):   
         value = int(tmp[k])
         if value == 1:
            x1.append(time)   
            y1.append(k+2.5)
         elif value == 0:
            x2.append(time)
            y2.append(k+2.5)
            x1.append(time)   
            y1.append(100)
         elif value == -1:
            x3.append(time)
            y3.append(k+2.5)
            x1.append(time)   
            y1.append(100)
         elif value == -2:
            x4.append(time)
            y4.append(k+2.5)
            x1.append(time)   
            y1.append(100)

   x1.append(0)   
   y1.append(100)
   x1.append(-9)   
   y1.append(100)
   x1.append(1)   
   y1.append(100)

   xx.append(0)
   yy.append(11.5)

   #performance_00 = (list_00.count(1)*1.0)/len(list_00)
   performance_04 = (list_04.count(1)*1.0)/len(list_04)
   #performance_05 = (list_05.count(1)*1.0)/len(list_05)
   performance_06 = (list_06.count(1)*1.0)/len(list_06)
   performance_07 = (list_07.count(1)*1.0)/len(list_07)
   performance_08 = (list_08.count(1)*1.0)/len(list_08)
   performance_09 = (list_09.count(1)*1.0)/len(list_09)
   performance_00 = (list_00.count(1)*1.0)/len(list_00)

   n = len(x1) 
   if n!=0:
      gr1 = gr(n,x1,y1)
      gr1.SetMarkerColor(3)
      gr1.SetMarkerStyle(4)
      gr1.SetTitle("Service Availability Monitoring of the valtical cluster in the last 7 days")
      gr1.GetXaxis().SetTitle("Days to " + commands.getoutput("date"))
      gr1.GetYaxis().SetTitle("Condor Work Node List")
      #gr1.GetYaxis().SetTitleOffset(0.3)
      gr1.GetXaxis().SetTickLength(0.03)
      gr1.GetYaxis().SetTickLength(0)
      gr1.GetYaxis().SetLabelColor(0)
      #gr1.GetYaxis().SetLabelOffset(0)
      #gr1.GetXaxis().SetTimeFormat("%d/%m/%Y")
      gr1.Draw("AP")

   n = len(x2) 
   if n!=0:
      gr2 = gr(n,x2,y2)
      gr2.SetMarkerColor(4)
      gr2.SetMarkerStyle(30)
      gr2.Draw("P")

   n = len(x3) 
   if n!=0:   
      gr3 = gr(n,x3,y3)
      gr3.SetMarkerColor(2)
      #gr3.SetMarkerStyle(4)
      gr3.SetMarkerStyle(20)
      gr3.Draw("P")

   n = len(x4) 
   if n!=0:   
      gr4 = gr(n,x4,y4)
      gr4.SetMarkerColor(4)
      gr4.SetMarkerStyle(30)
      gr4.Draw("P")


   n = len(xx) 
   if n!=0:   
      gr5 = gr(n,xx,yy)
      gr5.SetMarkerColor(6)
      gr5.SetMarkerStyle(23)
      gr5.Draw("P")
   tex()
   tex_per(performance_04,performance_06,performance_07,performance_08,performance_09,performance_00)
   #tex_per(performance_00,performance_04,performance_05,performance_06,performance_07,performance_08,performance_09)

   c1.Update()
   filename="week.jpg"
   c1.SaveAs(filename);

def main(): 
   infile = file("result.txt",'r')
   list = infile.readlines()
   infile.close()
   list_wn = ["valtical04","valtical06","valtical07","valtical08","valtical09","valtical00"]
   #list_wn = ["valtical00","valtical04","valtical05","valtical06","valtical07","valtical08","valtical09"]
   print list_wn
   convert(list)

if __name__ == '__main__':
   main()

5. On valtical.cern.ch, /home/qing/SAM_condor/week.sh calls week.py and sends the plot to ticalui02.ific.uv.es.

# week.sh
#!/bin/bash
source /etc/profile
source /afs/cern.ch/sw/lcg/contrib/gcc/4.3/x86_64-slc5-gcc43-opt/setup.sh
export ROOTSYS=/afs/cern.ch/sw/lcg/app/releases/ROOT/5.26.00/x86_64-slc5-gcc43-opt/root
export PATH=/afs/cern.ch/sw/lcg/external/Python/2.5.4p2/x86_64-slc5-gcc43-opt/bin:$ROOTSYS/bin:$PATH
export LD_LIBRARY_PATH=$ROOTSYS/lib:/afs/cern.ch/sw/lcg/external/Python/2.5.4p2/x86_64-slc5-gcc43-opt/lib:$LD_LIBRARY_PATH
export PYTHONPATH=$PYTHONPATH:$ROOTSYS/lib
export rootpath=/home/qing/SAM_condor
cd $rootpath
scp root@valtical00.cern.ch:/work/users/qing/data5/qing/SAM_condor/result.txt ./
export cmd=`ps eux | grep 'python week.py' | grep -v grep | grep -v 'ps eux'`
echo $cmd
if [ "$cmd" == "" ] ; then
        echo 'Running week.py now'
   python week.py
   scp ./week.jpg root@ticalui02.ific.uv.es:/var/www/html/cluster/plots/ 
else
        echo 'python week.py is already running'
fi

6. On valtical.cern.ch, a cron job is set up to run /home/qing/SAM_condor/week.sh every hour.

0 * * * * source /home/qing/SAM_condor/week.sh

7. Check the SAM condor performance at SAM:CONDOR

Service Availability Monitoring of Proof jobs

1. /work/users/qing/data5/qing/SAM_proof/sam_proof.py was designed to send proof jobs to every proof worker node:
# sam_proof.py, developed by Gang Qin on Jan 9 2012;
from ROOT import *
import commands
import os

def sendmail():
        title =  "proof test job failed  "  
        content = title + " at " + commands.getoutput("date")
        cmd = " echo '" + content + "' | mail -v -s '" + title + "' gang.qin@cern.ch"
        #cmd = " echo '" + content + "' | mail -v -s '" + title + "' gang.qin@cern.ch;Luca.Fiorini@cern.ch"
        os.system(cmd)

def check_job():
   filename = "/work/users/qing/data5/qing/SAM_proof/histo.root"
   tmp = commands.getoutput("ls " + filename) 

   if "No such file or directory" in tmp:      # output root file not created
      tag_outputfile = -1
      sendmail()
   else:
      tag_outputfile = 1   # output root file created

   time = commands.getoutput("date +%s")
   line = time + '\t' + str(tag_outputfile) + '\n' 
   return line

def record(line):
   history_file = open("/work/users/qing/data5/qing/SAM_proof/result.txt","a")
   history_file.writelines(line)
   history_file.close()

def submit_newjob():
        gROOT.SetBatch(1);
        gEnv.SetValue("Proof.StatsHist",0);
        gEnv.SetValue("Proof.StatsTrace",0);
        gEnv.SetValue("Proof.SlaveStatsTrace",0);
        gEnv.Print()
        worker = 72 
        #worker = 16*6
        p = TProof.Open("valtical.cern.ch","workers="+str(worker))
        p.SetParameter("PROOF_RateEstimation","average")
        p.SetParallel(worker);

        ch = TChain("wwd3pd")
        for i in range(38):
                filename = "root://valtical.cern.ch//localdisk/xrootd/sam/" + 'sam_proof_' + str(i) + '.root'
                ch.Add(filename)

        ch.SetProof()
        option=commands.getoutput("pwd")
        ch.Process("HistsSel.C++",option,-1)

def main():
   work_dir = "/work/users/qing/data5/qing/SAM_proof"
   os.chdir(work_dir)
   os.system("rm -rf  /work/users/qing/data5/qing/SAM_proof/histo.root") 
   submit_newjob()
   #os.system("sleep 240")
   line = check_job()
   record(line)

if __name__ == "__main__":
   main()

2. /work/users/qing/data5/qing/SAM_proof/sam_proof.sh calls sam_proof.py and records the output status in result.txt.

#!/bin/bash
export HOME=/work/users/qing/data5/qing/SAM_proof
export rootpath=/work/users/qing/data5/qing/SAM_proof
cd $rootpath
echo `date` >>  /work/users/qing/data5/qing/SAM_proof/log 
echo "   " >>  /work/users/qing/data5/qing/SAM_proof/log
source /etc/profile
export ROOTSYS=/work/users/qing/software/root
export PATH=/work/users/qing/software/python2.4/bin:$ROOTSYS/bin:$PATH
export LD_LIBRARY_PATH=$ROOTSYS/lib:/work/users/qing/software/python2.4/lib:$LD_LIBRARY_PATH
export PYTHONPATH=$PYTHONPATH:$ROOTSYS/lib
export HOME=/work/users/qing/data5/qing/SAM_proof
export cmd=`ps eux | grep 'python sam_proof.py' | grep -v grep | grep -v 'ps eux'`
echo $cmd

if [ "$cmd" == "" ] ; then
        echo 'Running sam_proof.py now'
        python sam_proof.py >>  /work/users/qing/data5/qing/SAM_proof/log 2>&1
   export file1=`ls -l histo.root`
   echo "  " >>  /work/users/qing/data5/qing/SAM_proof/log
   echo $file1 >>  /work/users/qing/data5/qing/SAM_proof/log
   echo "   " >>  /work/users/qing/data5/qing/SAM_proof/log
   echo "   " >>  /work/users/qing/data5/qing/SAM_proof/log
   echo "   " >>  /work/users/qing/data5/qing/SAM_proof/log
else
        echo 'python sam_proof.py is already running'
fi

3. An acrontab entry is set up by user qing to call sam_proof.sh every 30 minutes:

30 * * * * valtical05.cern.ch source  /work/users/qing/data5/qing/SAM_proof/sam_proof.sh >/dev/null 2>&1

4. On valtical.cern.ch, /home/qing/SAM_proof/week_proof.py converts the result.txt into plots.

[root@valtical SAM_proof]# cat week_proof.py
from ROOT import *
import commands
import re, array
from array import array

def gr(n,x,y):
        gr = TGraph(n,x,y)
        gr.SetMaximum(1.5)
        gr.SetMinimum(-1.5)
        gr.SetLineWidth(2);
        gr.SetMarkerSize(1.5)
        return gr

def tex():
   t = TLatex()
   t.SetNDC()
   t.SetTextFont( 62 )
   t.SetTextSize( 0.04 )
   t.SetTextAlign( 12 )
   t.SetTextColor( 1 )
   t.DrawLatex( 0.15, 0.8, 'Valtical09'  )
   t.DrawLatex( 0.15, 0.7, 'Valtical08'  )
   t.DrawLatex( 0.15, 0.6, 'Valtical07'  )
   t.DrawLatex( 0.15, 0.5, 'Valtical06'  )
   t.DrawLatex( 0.15, 0.4, 'Valtical05'  )
   t.DrawLatex( 0.15, 0.3, 'Valtical04'  )
   t.DrawLatex( 0.15, 0.2, 'Valtical00'  )

def convert(list):
        c1 = TCanvas("c1", "c1",0,0,1024,768)
        gStyle.SetOptStat(0)
        gStyle.SetPalette(1)
        gStyle.SetPaperSize(1024,768)
        #c1.SetGrid()
        c1.SetFillColor(0)
        c1.SetFrameFillColor(0)
        c1.SetFrameLineWidth(2)
        c1.SetFrameBorderMode(0)
        c1.SetFrameBorderSize(2)
        c1.Update()
   #myps.NewPage()

   y1,y2,y3,y4,yy = array('d'), array('d'),array('d'), array('d'),array('d')
   x1,x2,x3,x4,xx = array('d'), array('d'),array('d'), array('d'),array('d')
   #current_time = (int(commands.getoutput("date +%s"))-int(list[-25].split()[0]))/60
   now = int(commands.getoutput("date +%s"))

   for i in range(24*7):
      j = len(list)-24*7+i
      tmp = list[j].split()
      time = (int(tmp[0])-now)/3600./24.
      #time = time*5
      if time < -7:
         continue
      value = int(tmp[1])
      if value == 1:
         x1.append(time)   
         y1.append(1)
      elif value == -1:
         x2.append(time)
         y2.append(-1)
         x1.append(time)   
         y1.append(100)

   print x1

   xx.append(-0.01)
   yy.append(1.2)
   x1.append(-7)
   y1.append(100)

   n = len(x1) 
   if n!=0:
      gr1 = gr(n,x1,y1)
      gr1.SetMarkerColor(3)
      gr1.SetMarkerStyle(4)
      gr1.SetTitle("Service Availability Monitoring of the valtical cluster in the last 7 days")
      gr1.GetXaxis().SetTitle("Time(days) to " + commands.getoutput("date"))
      gr1.GetYaxis().SetTitle("Proof cluster")
      #gr1.GetYaxis().SetTitleOffset(0.3)
      gr1.GetXaxis().SetTickLength(0.03)
      gr1.GetYaxis().SetTickLength(0)
      #gr1.GetYaxis().SetLabelColor(0)
      #gr1.GetYaxis().SetLabelOffset(0)
      #gr1.GetXaxis().SetTimeFormat("%d/%m/%Y")

      gr1.Draw("AP")

   n = len(x2) 
   if n!=0:
      gr2 = gr(n,x2,y2)
      gr2.SetMarkerColor(2)
      gr2.SetMarkerStyle(4)
      gr2.Draw("P")

   n = len(xx)
   print xx,yy
   if n!=0:
      gr3 = gr(n,xx,yy)
      gr3.SetMarkerColor(6)
      gr3.SetMarkerStyle(23)
      gr3.Draw("P")

   #tex()


   c1.Update()
   filename="week_proof_valtical.jpg"
   c1.SaveAs(filename);

def main(): 
   infile = file("result.txt",'r')
   list = infile.readlines()
   infile.close()
   convert(list)
   
if __name__ == '__main__':
   main()

5. On valtical.cern.ch, /home/qing/SAM_proof/week_proof.sh calls week_proof.py and sends the plot to ticalui02.ific.uv.es.

#!/bin/bash
source /etc/profile
source /afs/cern.ch/sw/lcg/contrib/gcc/4.3/x86_64-slc5-gcc43-opt/setup.sh
export ROOTSYS=/afs/cern.ch/sw/lcg/app/releases/ROOT/5.26.00/x86_64-slc5-gcc43-opt/root
export PATH=/afs/cern.ch/sw/lcg/external/Python/2.5.4p2/x86_64-slc5-gcc43-opt/bin:$ROOTSYS/bin:$PATH
export LD_LIBRARY_PATH=$ROOTSYS/lib:/afs/cern.ch/sw/lcg/external/Python/2.5.4p2/x86_64-slc5-gcc43-opt/lib:$LD_LIBRARY_PATH
export PYTHONPATH=$PYTHONPATH:$ROOTSYS/lib
export rootpath=/home/qing/SAM_proof
cd $rootpath
scp root@valtical00.cern.ch:/work/users/qing/data5/qing/SAM_proof/result.txt ./
export cmd=`ps eux | grep 'python draw_proof.py' | grep -v grep | grep -v 'ps eux'`
echo $cmd
if [ "$cmd" == "" ] ; then
        echo 'Running draw_proof.py now'
   python draw_proof.py
   rm -f /var/www/html/cluster/plots/SAM_proof_valtical.jpg 
   cp ./SAM_proof_valtical.jpg /var/www/html/cluster/plots/ 
else
        echo 'python draw_proof.py is already running'
fi
cd $rootpath

6. On valtical.cern.ch, /home/qing/SAM_proof/week_proof.sh is set to run every hour:

0 * * * * source /home/qing/SAM_proof/week_proof.sh

7. Check the SAM proof performance at SAM:PROOF

High Temperature Alarming

1. lm_sensors installation
    yum -y install lm_sensors
    sensors-detect
    sensors 
2. Add the following line to the crontab on valtical.cern.ch. This will check the temperature of each core; if any core reaches 100 degrees, an alarm email is sent to the system administrator immediately.
*/10 * * * * source /work/users/qing/data5/qing/lm_sensors/scanT.sh
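
The exact content of scanT.sh is not reproduced here; a minimal sketch of such a check, assuming the usual lm_sensors 'Core N: +XX.X°C' output format and a working mail command (the recipient address below is a placeholder), could be:

#!/bin/bash
# scanT_sketch.sh -- hypothetical temperature check, not the actual scanT.sh
LIMIT=100
# collect the temperatures of all cores that are at or above the limit
HOT=$(sensors | awk -v l=$LIMIT '/^Core/ {t=$3; gsub(/[^0-9.]/,"",t); if (t+0 >= l) print t}')
if [ -n "$HOT" ] ; then
    echo "Core temperature reached $HOT C on $(hostname) at $(date)" | mail -s "High temperature alarm on $(hostname)" admin@example.org
fi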

Miscellaneous

SLC6 installation post setup on worker nodes:

 
# unify the public key in the cluster
scp root@valtical00.cern.ch:./.ssh/authorized_keys ~/.ssh/
scp root@valtical00.cern.ch:./.ssh/id_rsa ~/.ssh

# cern setup
/usr/sbin/lcm --configure ntpd afsclt
/sbin/chkconfig --add afs
/sbin/chkconfig afs on
/sbin/chkconfig --add yum-autoupdate
/sbin/service yum-autoupdate start
/usr/sbin/lcm --configure srvtab
/usr/sbin/lcm --configure krb5clt sendmail ntpd chkconfig ocsagent ssh

# mount /work and /data6
for d in "/work"
do
  mkdir $d
  echo "valtical07:$d         $d                  nfs rw,rsize=8192,wsize=8192,hard,intr,addr=137.138.40.190 0 0" >> /etc/fstab
done

for d in  "/data6"
do
  mkdir $d
  echo "valticalui01:$d         $d                  nfs rw,rsize=8192,wsize=8192,hard,intr,addr=137.138.40.173 0 0" >> /etc/fstab
done
service rpcbind start
service nfs start
mount -t nfs -o rw,rsize=8192,wsize=8192,hard,intr,addr=137.138.40.140 valtical00:/work /work
mount -t nfs -o rw,rsize=8192,wsize=8192,hard,intr,addr=137.138.40.173 valticalui01:/data6 /data6

# add users; in /etc/passwd change /bin/bash to /sbin/nologin where appropriate.
useraddcern solans
useraddcern jvalero
useraddcern yesenia
useraddcern yzhu
useraddcern bmellado
useraddcern ypan
useraddcern tiledaq
useraddcern tilerod
useraddcern ruanxf
useraddcern akkruse
useraddcern valls
useraddcern nkauer
useraddcern snair
useraddcern xchen
useraddcern gtorralb
useraddcern shaque
useraddcern lfiorini
useraddcern ferrer
useraddcern tilelas
useraddcern yangw
useraddcern hpeng
useraddcern qing
useraddcern fcarrio
useraddcern tibristo
useraddcern osilbert
useraddcern scole
useraddcern smartine
useraddcern lmarch
useraddcern jhuston
useraddcern jresende
useraddcern montoya
useraddcern daalvare
useraddcern lcolasur
useraddcern lcerdaal
useraddcern roreed
useraddcern jlima
useraddcern yahuang
useraddcern ghamity
useraddcern svonbudd
useraddcern gamar
useraddcern lcerdaal

# useful packages
yum -y install glibc-devel
yum -y install gparted
yum -y install tkinter
yum -y install xemacs xemacs-packages-*
yum -y install compat-gcc-34-g77
yum -y install compat-libf2c-34
yum -y install libxml2*
yum -y install compat-libtermcap.x86_64
yum -y install openssl098e
yum -y install compat-expat1-1.95.8-8.el6.i686
yum -y install expat-devel.x86_64
yum -y install expat-devel.i686
ln -s /lib64/libexpat.so.1.5.2 /usr/lib64/libexpat.so.0
yum install compat-openldap-2.3.43-2.el6.x86_64
yum install libaio.so.1 libcrypto.so.6

# Modify /etc/inittab to set id:3:initdefault:

pyroot setup

    #install python
        cd /work/users/qing/software
        mkdir python2.4
        wget http://www.python.org/ftp/python/2.4.6/Python-2.4.6.tgz
        tar -xvzf Python-2.4.6.tgz
        cd Python-2.4.6
        ./configure --enable-shared --prefix="/work/users/qing/software/python2.4"
        gmake
        gmake install 

    #install pyroot
        cd /work/users/qing/software
        mkdir root5.28
        wget ftp://root.cern.ch/root/root_v5.28.00b.source.tar.gz
        tar -xvzf root_v5.28.00b.source.tar.gz
        cd root
        ./configure --with-python-incdir=/work/users/qing/software/python2.4/include/python2.4 --with-python-libdir=/work/users/qing/software/python2.4/lib --prefix="/work/users/qing/software/root/root5.28" --etcdir="/work/users/qing/software/root/root5.28/etc"
        gmake
        gmake install 

    #Environment setup before using ROOT:
        export ROOTSYS=/work/users/qing/software/root
        export PATH=/work/users/qing/software/python2.4/bin:$ROOTSYS/bin:$PATH
        export LD_LIBRARY_PATH=$ROOTSYS/lib:/work/users/qing/software/python2.4/lib:$LD_LIBRARY_PATH
        export PYTHONPATH=$PYTHONPATH:$ROOTSYS/lib 
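
After sourcing the environment above, a one-line check confirms that PyROOT picks up the right installation:

python -c 'import ROOT; print ROOT.gROOT.GetVersion()'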

Data Management

Data production

To run skimming jobs on data, please enter /work/offline/qing/Skimmer/trunk and use SendToGRID to send the jobs. For MC the working directory is /work/offline/qing/Skimmer/trunk.mc.

Data Transfer

Data transfer from Grid sites to xrootd

Either lcg-cp or dq2-get can be used for this download; you can write your own scripts or use the tool below:
cd /afs/cern.ch/user/q/qing/grid2CERN_lcg/data
put the names of the datasets in list_data.txt
run python lcg_data.py to create all.txt
run python split.py and then run cp1.sh, cp2.sh, cp3.sh, cp4.sh, cp5.sh on 5 different valtical machines
run python lcg_check.py to create download_missing.sh
run source download_missing.sh on one or more machines

Consistency check of downloaded files in xrootd

cd /data6/qing/broken
source setup.sh  (Juan's working environment to test his ntuples)
cat /work/users/qing/data5/qing/ForUsers/all_xrootd_files.txt | grep 'root://valtical.cern.ch//localdisk/xrootd/users/qing/data12_8TeV/SMDILEP_p1328_p1329/user.qing.data12_8TeV.periodH.physics_Muons.PhysCont.NTUP_SMWZ.grp14_v01_p1328_p1329_2LepSkim_v2' >  MuonH.txt
python create.py MuonH.txt > 1.sh
source 1.sh
python find_bad.py

Data transfer from CERN to IFIC

10 TB in AtlasLocalGroupDisk at IFIC can be used to back up some files from CERN xrootd; to use more, please contact sgonzale@ific.uv.es.
    cd /afs/cern.ch/user/q/qing/CERN2IFIC/
    put the xrootd paths into list_data.txt
    python makelist.py to create list_files.txt
    python transfer_list.py to create files_to_transfer.txt
    source dq2_setup.sh to set up the environment
    source files_to_transfer.txt to start the transfer
    scp list_files.txt qing@ticalui01.uv.es:./private/data_transfer/list_files.txt
    open a new terminal, log in to ticalui01, and cd ~qing/private/data_transfer
    python checklist.py
    scp qing@ticalui01.uv.es:./private/data_transfer/lustre_file.txt ./
    python transfer_missing.py
    source files_to_transfer.txt

Data transfer from IFIC to CERN

cd /afs/cern.ch/user/q/qing/cern2ific
# save the list of files into 1.txt
# use make.py to create the codes for transfer
# use split.py to split the transfer into several threads. 

Data Deletion

For large-scale xrootd file deletion, please put the list of directories to be deleted in /data6/qing/file_deletion, then run: python scan.py; python delete.py; source delete.sh

Maintenance tasks

On request

  • Install new packages in all computers
  • Add new users
  • Change user default settings
  • Remove old users
  • Check nfs, xrootd, condor, proofd status

Daily

  • Check for and kill zombie processes.
  • Check CPU and memory consumption.
  • Free cached memory.
  • Check SAM performance and Ganglia status

Weekly

  • Check for package upgrades
  • Check disk space status
  • Warn users using a considerable amount of disk space
  • Help users migrate data from NFS to xrootd
  • Check /var/log/messages for SMART messages indicating disk problems

Monthly

  • Reboot machines in the cluster
  • Check dark files, invalid links and empty directories in xrootd

-- GangQin - Nov 11 2013
