IFIC Computing Cluster at CERN
The Valencia Computing Cluster
The TileCal Valencia computing cluster at CERN is located in building 175. It is directly accessible from machines inside the CERN General Public Network. Remote access has to go through the lxplus service.
local > ssh -X login@lxplus.cern.ch
lxplus > ssh -X login@valticalXX
The ATLAS Valencia group also operates a small cluster at IFIC, the IFIC T3 computing cluster; users' questions can be submitted to the user Questioning Area.
Cluster topology
- All computers mount /data6 from valticalui01 as data storage for analysis
- All computers mount /work from valtical07 for collaborative code development (No data).
- All analysis computers mount /localdisk* locally.
- All analysis computers use /localdisk/xrootd as xrootd file system cache.
- Offline developments are located in /work/offline.
- Online developments are located in /work/TicalOnline.
- Users can log in to valticalui01 and valtical05 to run small-scale interactive jobs; for large numbers of jobs, please use Condor or PROOF. (A quick check of the shared mounts is sketched below.)
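A quick way to verify that the shared areas described above are mounted on the machine you are logged in to (a minimal sketch; the mount points are the ones listed above):
df -h /work /data6                            # both should appear as NFS mounts
mount | grep -E '/work|/data6|/localdisk'     # show the mount options actually in use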
Functionality and Hardware Setup of each host
Computer | Activity | Cores | Mem | Xrootd Data Disk | OS | System Disk |
Valtical | Xrootd redirector, proof master, MYSQL querying server | 4 | 6 GB | 0 TB | SLC6 | 300 GB |
Valtical00 | Xrootd data server, condor worker node, proof worker node | 16 | 24 GB | 14 TB | SLC6 | 500 GB |
Valtical04 | Xrootd data server, condor worker node, proof worker node | 16 | 24 GB | 6 TB | SLC6 | 300 GB |
Valtical05 | User Interface, NX server, Xrootd data server, proof submit machine, condor master, condor submit machine, MYSQL querying client | 24 | 48 GB | 17 TB | SLC5 | 500 GB |
Valtical06 | Xrootd data server, condor worker node, proof worker node | 16 | 24 GB | 2 TB | SLC6 | 300 GB |
Valtical07 | condor worker node, ganglia server, NFS server for /work | 16 | 24 GB | 0 TB | SLC6 | 2 TB |
Valtical08 | Xrootd data server, condor worker node, proof worker node | 16 | 24 GB | 2 TB | SLC6 | 2 TB |
Valtical09 | Xrootd data server, condor worker node, proof worker node | 16 | 24 GB | 8 TB | SLC6 | 2 TB |
valticalui01 | User Interface, NFS server for /data6, MYSQL querying client | 16 | 24 GB | 0 TB | SLC6 | 300 GB |
Xrootd
xrootd is a high-performance distributed file system that has become popular on the Grid and is deployed at many sites. This section covers installation, configuration and general troubleshooting; further details can be found at the
Xrootd:Home Page.
Overview
Computer | Xrootd Role | Xrootd Daemons | Disks for xrootd | Storage Capacity for Xrootd | Xrootd Version | OS |
Valtical | redirector | xrootd, cmsd | 0 | 0 TB | 3.3.1 | SLC6 |
Valtical00 | data server | xrootd, cmsd | 7 | 12.6 TB | 3.3.1 | SLC6 |
Valtical04 | data server | xrootd, cmsd | 3 | 5.4 TB | 3.3.1 | SLC6 |
Valtical05 | data server | xrootd, cmsd | 7 | 15.3 TB | 3.3.1 | SLC5 |
Valtical06 | data server | xrootd, cmsd | 1 | 1.8 TB | 3.3.1 | SLC6 |
Valtical08 | data server | xrootd, cmsd | 1 | 1.8 TB | 3.3.1 | SLC6 |
Valtical09 | data server | xrootd, cmsd | 3 | 7.2 TB | 3.3.1 | SLC6 |
Processes run as the xrootd user and are:
/usr/bin/xrootd -l /var/log/xrootd/xrootd.log -c /etc/xrootd/xrootd-clustered.cfg -k 7 -b -s /var/run/xrootd/xrootd-default.pid -n default
/usr/bin/cmsd -l /var/log/xrootd/cmsd.log -c /etc/xrootd/xrootd-clustered.cfg -k 7 -b -s /var/run/xrootd/cmsd-default.pid -n default
/usr/bin/XrdCnsd -d -D 2 -i 90 -b root://valtical.cern.ch:2094
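A quick way to confirm that these daemons are running on a given node (a sketch using standard tools; the init-script name follows the service commands used later in this section):
ps -u xrootd -o pid,args | egrep 'xrootd|cmsd|XrdCnsd'   # daemons owned by the xrootd user
service xrootd status                                    # init-script status on SLC5/SLC6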
Xrootd Installation and Configuration
1. Install EPEL; remember to choose the version that matches your OS version:
rpm -Uvh http://download.fedoraproject.org/pub/epel/6/i386/epel-release-6-8.noarch.rpm
2. Install the yum-priorities plugin
yum install yum-priorities
3. Install the OSG repositories
rpm -Uvh http://repo.grid.iu.edu/osg/3.1/osg-3.1-el6-release-latest.rpm
4. Install xrootd-server and dependent rpms
yum -y install xrootd.x86_64
5. Modify /etc/xrootd/xrootd-clustered.cfg with the following contents:
all.export /localdisk/xrootd # export /localdisk/xrootd as storage path for xrootd
oss.space public /localdisk/xrootd/* xa
oss.space public /localdisk2/xrootd xa # /localdisk2/xrootd is used as an extended space for /localdisk/xrootd
oss.space public /localdisk3/xrootd xa
oss.space public /localdisk4/xrootd xa
oss.space public /localdisk5/xrootd xa
oss.space public /localdisk6/xrootd xa
oss.space public /localdisk7/xrootd xa
all.export /localdisk/proofbox # for proof usage
set xrdr=valtical.cern.ch # set valtical.cern.ch as xrootd redirector
all.manager $(xrdr):1213 # tells each component the DNS name of the manager.
if $(xrdr) && named cns
all.export /data/inventory
xrd.port 1095
else if $(xrdr)
all.role manager
xrd.port 1094
else
all.role server
oss.localroot /
ofs.notify closew create mkdir mv rm rmdir trunc | /usr/bin/XrdCnsd -d -D 2 -i 90 -b root://$(xrdr):2094
cms.space min 2g 5g
fi
6. On the xrootd redirector (valtical.cern.ch), modify /etc/sysconfig/xrootd with the following contents:
XROOTD_USER=xrootd
XROOTD_GROUP=xrootd
XROOTD_DEFAULT_OPTIONS="-l /var/log/xrootd/xrootd.log -c /etc/xrootd/xrootd-clustered.cfg -k 7"
XROOTD_CNS_OPTIONS="-k 7 -l /var/log/xrootd/xrootd.log -c /etc/xrootd/xrootd-clustered.cfg"
CMSD_DEFAULT_OPTIONS="-l /var/log/xrootd/cmsd.log -c /etc/xrootd/xrootd-clustered.cfg -k 7"
FRMD_DEFAULT_OPTIONS="-k 7 -l /var/log/xrootd/frmd.log -c /etc/xrootd/xrootd-clustered.cfg"
XROOTD_INSTANCES="default cns"
PURD_DEFAULT_OPTIONS="-l /var/log/xrootd/purged.log -c /etc/xrootd/xrootd-clustered.cfg -k 7"
XFRD_DEFAULT_OPTIONS="-l /var/log/xrootd/xfrd.log -c /etc/xrootd/xrootd-clustered.cfg -k 7"
CMSD_INSTANCES="default"
FRMD_INSTANCES="default"
XROOTD_INSTANCES="default"
CMSD_INSTANCES="default"
PURD_INSTANCES="default"
XFRD_INSTANCES="default"
7. Run the xrootd setup, which creates the appropriate directories for xrootd, creates the user and group "xrootd" if needed, and sets permissions appropriately.
service xrootd setup
8. Start/restart/stop the xrootd services using the following commands. Start the services on the redirector node before any services on the data node(s).
service xrootd start/restart/stop
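A simple smoke test once the services are up (a sketch; it assumes the users/$USER convention used elsewhere on this page, and that the destination directory exists, otherwise create it first with 'xrd valtical.cern.ch mkdir ...'):
xrdcp /etc/redhat-release root://valtical.cern.ch//localdisk/xrootd/users/$USER/xrootd_smoke_test.txt
xrd valtical.cern.ch dirlist /localdisk/xrootd/users/$USER   # the test file should be listed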
Xrootd Troubleshooting
The most efficient debugging method is to check the logs under /var/log/xrootd/, which helps solve most problems. For problems that need further support from the xrootd developers, send a message to
xrootd-l@slac.stanford.edu.
Q. Error : Unable to set attr XrdFrm.Pfn from /localdisk2/xrootd/public/1B/2E687E525B000000136%; operation not supported
A. This means /localdisk2 cannot be written by xrootd. To solve this problem, /localdisk2 needs to be remounted with the option 'user_xattr':
umount /localdisk2
mount -o user_xattr /dev/sdc1 /localdisk2
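To make this persistent across reboots, the corresponding /etc/fstab entry should also carry the user_xattr option (a sketch; the device and filesystem type below are assumptions matching the example above):
# in /etc/fstab, extend the options of the /localdisk2 entry, e.g.:
#   /dev/sdc1   /localdisk2   ext4   defaults,user_xattr   0 0
mount | grep localdisk2   # user_xattr should now appear among the active mount options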
Q. Error : Last server error 3005 ('Unable to create /localdisk/xrootd/users/qing/data12_8TeV/SMDILEP_p1328_p1329/user.qing.data12_8TeV.periodC.physics_Muons.PhysCont.NTUP_SMWZ.grp14_v01_p1328_p1329_2LepSkim_v2/user.qing.001695._00762.skimmed.root; not a directory')"
A. This means that /localdisk/xrootd/users/qing/data12_8TeV/SMDILEP_p1328_p1329/user.qing.data12_8TeV.periodC.physics_Muons.PhysCont.NTUP_SMWZ.grp14_v01_p1328_p1329_2LepSkim_v2/ on one or more data servers was created as a link instead of a directory. To fix the problem, remove such links.
Q. Error : Unable to write to xrootd cluster. Error message: Last server error 3011 ('No servers are available to write the file.')
A. Check whether there is enough disk space available in the xrootd cluster.
Q. Error : Xrootd runs on the redirector and data servers, but there is no communication between the redirector and the data servers.
A. Add rules to iptables to accept incoming TCP connections used by xrootd (see the sketch below).
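A minimal sketch of such rules, opening the ports used in the configuration above (1094 and 1095 for xrootd, 1213 for cmsd, 2094 for XrdCnsd); in practice you may want to restrict the source addresses to the cluster subnet:
for p in 1094 1095 1213 2094; do
    iptables -I INPUT -p tcp --dport $p -j ACCEPT
done
service iptables save   # persist the rules on SLC5/SLC6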
Xrootd FAQ
How to list files under XROOTD
On valtical05 and valticalui01, you can get such information with '/sbin/xls'
[root@valtical05 qing]# xls -h
NAME
xls - list directory contents in valtical xrootd
SYNOPSIS
xls [OPTION] [PATH]
DESCRIPTION
List information about the files and directories under [PATH]
-s show the total size of [PATH]
-l show everything directly under [PATH] with their size
-r show all files under [PATH] and its sub-directories
-a show the size of all files under [PATH] and its sub-directories
EXAMPLE
xls -s root://valtical.cern.ch/localdisk/xrootd/users/
How to delete files under XROOTD
Use the following command to delete a file in xrootd; $filename should be provided in the following format: root://valtical.cern.ch//localdisk/
/afs/cern.ch/user/l/lfiorini/public/xrdrm.sh $filename
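For example (the file name below is a placeholder, following the format above):
/afs/cern.ch/user/l/lfiorini/public/xrdrm.sh root://valtical.cern.ch//localdisk/xrootd/users/$USER/obsolete_file.root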
Condor
HTCondor is a very stable batch system; this section describes its installation and configuration on the valtical cluster. For more information about HTCondor, please refer to the
Condor:Home Page.
Overview
Computer | Condor Role | Condor Daemons | Condor Version | Number of Cores used by condor | OS |
Valtical05 | Head Node, Submit Node | COLLECTOR, MASTER, NEGOTIATOR, SCHEDD, STARTD, Procd | 7.6.4 | 1 | SLC5 |
Valtical00 | Worker Node | MASTER, STARTD, Procd | 7.8.8 | 12 | SLC6 |
Valtical04 | Worker Node | MASTER, STARTD, Procd | 7.8.8 | 12 | SLC6 |
Valtical06 | Worker Node | MASTER, STARTD, Procd | 7.8.8 | 12 | SLC6 |
Valtical07 | Worker Node | MASTER, STARTD, Procd | 7.8.8 | 12 | SLC6 |
Valtical08 | Worker Node | MASTER, STARTD, Procd | 7.8.8 | 12 | SLC6 |
Valtical09 | Worker Node | MASTER, STARTD, Procd | 7.8.8 | 12 | SLC6 |
Condor Installation and Configuration
1. Install EPEL; remember to choose the version that matches your OS version:
rpm -Uvh http://download.fedoraproject.org/pub/epel/6/i386/epel-release-6-8.noarch.rpm
2. Install the yum-priorities plugin
yum install yum-priorities
3. Install the OSG repositories
rpm -Uvh http://repo.grid.iu.edu/osg/3.1/osg-3.1-el6-release-latest.rpm
4. Install HTCondor from the OSG yum repository
yum -y install condor.x86_64
5. Modify /etc/condor/config.d/00personal_condor.config as follows:
CONDOR_HOST = valtical05 # set valtical05 as condor central manager
MAX_NUM_CPUS = 12 # maximum number of CPU cores that condor may use
UID_DOMAIN = cern.ch
COLLECTOR_NAME = Personal Condor at $(FULL_HOSTNAME)
LOCK = /tmp/condor-lock.$(FULL_HOSTNAME)0.00134799613534042
START = TRUE
SUSPEND = FALSE
PREEMPT = FALSE
KILL = FALSE
#DAEMON_LIST = MASTER, STARTD # uncomment this line on a condor worker node
#DAEMON_LIST = COLLECTOR, MASTER, NEGOTIATOR, SCHEDD, STARTD # uncomment this line on the condor central manager
PREEMPTION_REQUIREMENTS = True
RANK = 0
NEGOTIATOR_CONSIDER_PREEMPTION = True
SEC_DEFAULT_AUTHENTICATION = NEVER # comment out this line on the condor submit machine
LOCAL_CONFIG_FILE = $(LOCAL_CONFIG_FILE)
HOSTALLOW_READ = *.cern.ch
HOSTALLOW_WRITE = *.cern.ch
HOSTALLOW_NEGOTIATOR = valtical05.cern.ch
6. Start/restart/stop the condor daemons:
service condor start/restart/stop
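After starting the daemons, a quick sanity check (a sketch; the pool layout is the one described in the overview table above):
condor_status                 # on valtical05: list the slots known to the collector
condor_q                      # on the submit machine: list queued jobs
ps aux | egrep 'condor_(master|startd|schedd|collector|negotiator)' | grep -v egrep   # daemons on this node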
Condor Troubleshooting
The most efficient debugging method is to check the condor logs saved under /var/log/condor/. If the information there is not detailed enough, change the corresponding values below from their defaults to D_ALL in /etc/condor/condor_config and recreate the log. Further support is available from
htcondor-users@cs.wisc.edu.
ALL_DEBUG =
COLLECTOR_DEBUG =
KBDD_DEBUG =
NEGOTIATOR_DEBUG = D_MATCH
SCHEDD_DEBUG = D_ALL
SHADOW_DEBUG =
STARTD_DEBUG =
MASTER_DEBUG =
JOB_ROUTER_DEBUG =
ROOSTER_DEBUG =
SHARED_PORT_DEBUG =
HDFS_DEBUG =
TRIGGERD_DEBUG =
HAD_DEBUG =
REPLICATION_DEBUG =
TRANSFERER_DEBUG =
GRIDMANAGER_DEBUG =
CREDD_DEBUG = D_FULLDEBUG
STORK_DEBUG = D_FULLDEBUG
LeaseManager_DEBUG = D_FULLDEBUG
LeaseManager.DEBUG_ADS = False
TOOL_DEBUG = D_ALL
SUBMIT_DEBUG = D_ALL
Condor FAQ
Condor commands (usage examples are sketched after this list):
- condor_submit : submit a condor job
- condor_rm user : remove all jobs submitted by the given user
- condor_rm cluster.process : remove a specific job
- condor_rm -forcex : forcibly kill a job
- condor_q : get the status of all queued jobs
- condor_q -analyze cluster.process : provide information about a single condor job
- condor_q -better-analyze cluster.process : provide more information than -analyze
- condor_q -submitter : list the condor jobs belonging to a user
- condor_status : monitor and query the condor pool for resource, submitter, checkpoint server and daemon master information
- condor_config_val : obtain configured values. Use 'condor_config_val -v variable' to get the paths of the important directories
- condor_prio : change the priority of a user's job. The priority can be changed only by the job owner or root.
- condor_userprio : change a user's priority. The priority can be changed only by root.
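A few usage examples combining the commands above (the job id 1234.0 and the user name are placeholders):
condor_q -submitter qing          # jobs belonging to user qing
condor_q -better-analyze 1234.0   # detailed match analysis for one job
condor_rm 1234.0                  # remove that job
condor_rm -forcex 1234.0          # only if the normal removal does not work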
Proof
Overview
Computer | proof Role | Number of cores for proof | OS |
Valtical | master | 0 | SLC5 |
Valtical00 | Worker Node | 12 | SLC6 |
Valtical04 | Worker Node | 12 | SLC6 |
Valtical06 | Worker Node | 12 | SLC6 |
Valtical07 | Worker Node | 12 | SLC6 |
Valtical08 | Worker Node | 12 | SLC6 |
Valtical09 | Worker Node | 12 | SLC6 |
Proof Installation and Configuration
1. Install ROOT under /opt
cd /opt
wget ftp://root.cern.ch/root/root_v5.28.00g.source.tar.gz
tar -xvzf root_v5.28.00g.source.tar.gz
cd root
./configure
gmake
2. Modify /etc/init.d/proofd with the following contents:
XRDUSER="xrootd"
XRDLOG="/opt/root/var/logs/xproofd.log"
XRDCF="/opt/root/etc/xproofd.cfg"
XRDDEBUG=""
XRDUSERCONFIG=""
XPROOFD=/opt/root/bin/xproofd
XRDLIBS=/opt/root/lib
export ROOTSYS=/opt/root
. /etc/init.d/functions
. /etc/sysconfig/network
[ -f /etc/sysconfig/xproofd ] && . /etc/sysconfig/xproofd
[ ! -z "$XRDUSERCONFIG" ] && [ -f "$XRDUSERCONFIG" ] && . $XRDUSERCONFIG
if [ ${NETWORKING} = "no" ]
then
exit 0
fi
[ -x $XPROOFD ] || exit 0
RETVAL=0
prog="xproofd"
start() {
echo -n $"Starting $prog: "
export LD_LIBRARY_PATH=$XRDLIBS:$LD_LIBRARY_PATH
daemon $XPROOFD -b -l $XRDLOG -R $XRDUSER -c $XRDCF $XRDDEBUG -k 5
RETVAL=$?
echo
[ $RETVAL -eq 0 ] && touch /var/lock/subsys/xproofd
return $RETVAL
}
stop() {
[ ! -f /var/lock/subsys/xproofd ] && return 0 || true
echo -n $"Stopping $prog: "
killproc xproofd
RETVAL=$?
echo
[ $RETVAL -eq 0 ] && rm -f /var/lock/subsys/xproofd
return $RETVAL
}
case "$1" in
start)
start
;;
stop)
stop
;;
status)
status xproofd
RETVAL=$?
;;
restart|reload)
stop
start
;;
condrestart)
if [ -f /var/lock/subsys/xproofd ]; then
stop
start
fi
;;
*)
echo $"Usage: $0 {start|stop|status|restart|reload|condrestart}"
exit 1
esac
exit $RETVAL
3. Modify /opt/root/etc/xproofd.cfg with the following contents:
set rootlocation = /opt/root
xpd.rootsys ${rootlocation}
xpd.workdir /localdisk/proofbox
xpd.resource static ${rootlocation}/etc/proof/proof.conf
xpd.role worker if valtical*.cern.ch
xpd.role master if valtical.cern.ch
xpd.allow valtical.cern.ch
xpd.maxoldlogs 2
xpd.poolurl root://valtical.cern.ch
4. Modify /opt/root/etc/proof/proof.conf with the following contents (a loop to regenerate this file is sketched after the listing):
master valtical.cern.ch workdir=/localdisk/proofbox
worker valtical09.cern.ch workdir=/localdisk/proofbox
worker valtical09.cern.ch workdir=/localdisk/proofbox
worker valtical09.cern.ch workdir=/localdisk/proofbox
worker valtical09.cern.ch workdir=/localdisk/proofbox
worker valtical09.cern.ch workdir=/localdisk/proofbox
worker valtical09.cern.ch workdir=/localdisk/proofbox
worker valtical09.cern.ch workdir=/localdisk/proofbox
worker valtical09.cern.ch workdir=/localdisk/proofbox
worker valtical09.cern.ch workdir=/localdisk/proofbox
worker valtical09.cern.ch workdir=/localdisk/proofbox
worker valtical09.cern.ch workdir=/localdisk/proofbox
worker valtical09.cern.ch workdir=/localdisk/proofbox
worker valtical08.cern.ch workdir=/localdisk/proofbox
worker valtical08.cern.ch workdir=/localdisk/proofbox
worker valtical08.cern.ch workdir=/localdisk/proofbox
worker valtical08.cern.ch workdir=/localdisk/proofbox
worker valtical08.cern.ch workdir=/localdisk/proofbox
worker valtical08.cern.ch workdir=/localdisk/proofbox
worker valtical08.cern.ch workdir=/localdisk/proofbox
worker valtical08.cern.ch workdir=/localdisk/proofbox
worker valtical08.cern.ch workdir=/localdisk/proofbox
worker valtical08.cern.ch workdir=/localdisk/proofbox
worker valtical08.cern.ch workdir=/localdisk/proofbox
worker valtical08.cern.ch workdir=/localdisk/proofbox
worker valtical07.cern.ch workdir=/localdisk/proofbox
worker valtical07.cern.ch workdir=/localdisk/proofbox
worker valtical07.cern.ch workdir=/localdisk/proofbox
worker valtical07.cern.ch workdir=/localdisk/proofbox
worker valtical07.cern.ch workdir=/localdisk/proofbox
worker valtical07.cern.ch workdir=/localdisk/proofbox
worker valtical07.cern.ch workdir=/localdisk/proofbox
worker valtical07.cern.ch workdir=/localdisk/proofbox
worker valtical07.cern.ch workdir=/localdisk/proofbox
worker valtical07.cern.ch workdir=/localdisk/proofbox
worker valtical07.cern.ch workdir=/localdisk/proofbox
worker valtical07.cern.ch workdir=/localdisk/proofbox
worker valtical06.cern.ch workdir=/localdisk/proofbox
worker valtical06.cern.ch workdir=/localdisk/proofbox
worker valtical06.cern.ch workdir=/localdisk/proofbox
worker valtical06.cern.ch workdir=/localdisk/proofbox
worker valtical06.cern.ch workdir=/localdisk/proofbox
worker valtical06.cern.ch workdir=/localdisk/proofbox
worker valtical06.cern.ch workdir=/localdisk/proofbox
worker valtical06.cern.ch workdir=/localdisk/proofbox
worker valtical06.cern.ch workdir=/localdisk/proofbox
worker valtical06.cern.ch workdir=/localdisk/proofbox
worker valtical06.cern.ch workdir=/localdisk/proofbox
worker valtical06.cern.ch workdir=/localdisk/proofbox
worker valtical04.cern.ch workdir=/localdisk/proofbox
worker valtical04.cern.ch workdir=/localdisk/proofbox
worker valtical04.cern.ch workdir=/localdisk/proofbox
worker valtical04.cern.ch workdir=/localdisk/proofbox
worker valtical04.cern.ch workdir=/localdisk/proofbox
worker valtical04.cern.ch workdir=/localdisk/proofbox
worker valtical04.cern.ch workdir=/localdisk/proofbox
worker valtical04.cern.ch workdir=/localdisk/proofbox
worker valtical04.cern.ch workdir=/localdisk/proofbox
worker valtical04.cern.ch workdir=/localdisk/proofbox
worker valtical04.cern.ch workdir=/localdisk/proofbox
worker valtical04.cern.ch workdir=/localdisk/proofbox
worker valtical00.cern.ch workdir=/localdisk/proofbox
worker valtical00.cern.ch workdir=/localdisk/proofbox
worker valtical00.cern.ch workdir=/localdisk/proofbox
worker valtical00.cern.ch workdir=/localdisk/proofbox
worker valtical00.cern.ch workdir=/localdisk/proofbox
worker valtical00.cern.ch workdir=/localdisk/proofbox
worker valtical00.cern.ch workdir=/localdisk/proofbox
worker valtical00.cern.ch workdir=/localdisk/proofbox
worker valtical00.cern.ch workdir=/localdisk/proofbox
worker valtical00.cern.ch workdir=/localdisk/proofbox
worker valtical00.cern.ch workdir=/localdisk/proofbox
worker valtical00.cern.ch workdir=/localdisk/proofbox
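Since the file above simply repeats one worker line per core, it can be regenerated with a small loop (a sketch; node list and 12 slots per node as in the listing above):
{
  echo "master valtical.cern.ch workdir=/localdisk/proofbox"
  for node in valtical09 valtical08 valtical07 valtical06 valtical04 valtical00; do
    for i in $(seq 1 12); do
      echo "worker ${node}.cern.ch workdir=/localdisk/proofbox"
    done
  done
} > /opt/root/etc/proof/proof.conf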
Proof Troubleshooting
Checking the logs under /localdisk/proofbox/ can be helpful; for unusual problems, please ask the question at the
Proof: Forum.
Proof FAQ
Q: How to clean up PROOF after a killed proof job?
A: Reset valtical.cern.ch in TProof:
TProof::Reset("valtical.cern.ch", kTRUE);
Q: Where is the default location for the proof sandboxes?
A: Usually it is $HOME; if you wish to change it, set it in $HOME/.rootrc like:
ProofLite.Sandbox /data5/qing/SAM_proof/prooflite
Q: Can proof detect that one cpu core is overloaded and stop sending jobs to it?
A: No, this is not currently possible with PROOF.
Q: Is it possible to send jobs to a dedicated WN?
A: The concept of 'assigning a job' to a subset of workers is not really a PROOF concept.
Ganglia
Ganglia Installation and Configuration
1. On the ganglia server, install the following packages:
yum -y install ganglia
yum -y install ganglia-gmetad
yum -y install ganglia-gmond
yum -y install ganglia-web
yum -y install httpd
2. On the ganglia server, modify the following lines in /etc/ganglia/gmond.conf:
/*
* The cluster attributes specified will be used as part of the <CLUSTER>
* tag that will wrap all hosts collected by this instance.
*/
cluster {
name = "VALTICAL"
owner = "VALTICAL"
latlong = "VALTICAL"
url = "VALTICAL"
}
/* The host section describes attributes of the host, like the location */
host {
location = "VALTICAL"
}
3. Start Ganglia daemons on the server:
service gmetad start
service gmond start
4. On each ganglia worker node, install the following packages:
yum -y install ganglia
yum -y install ganglia-gmond
5. Modify /etc/ganglia/gmond.conf with the following contents:
/*
* The cluster attributes specified will be used as part of the <CLUSTER>
* tag that will wrap all hosts collected by this instance.
*/
cluster {
name = "VALTICAL"
owner = "VALTICAL"
latlong = "VALTICAL"
url = "VALTICAL"
}
/* The host section describes attributes of the host, like the location */
host {
location = "VALTICAL"
}
6. Start the Ganglia daemon on the worker node:
service gmond start
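A quick check that the monitoring chain is up (a sketch; it assumes ganglia-web's default /ganglia alias on the server):
service gmond status                      # on every node
service gmetad status                     # on the ganglia server only
curl -s http://localhost/ganglia/ | head  # on the server: the web front-end should answer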
Ganglia Monitoring
Currently the ganglia monitoring of the valtical cluster can be viewed at the
Ganglia Monitoring Webpage. Due to CERN restrictions, this page can only be opened from inside the CERN network. If you are not able to connect to it, modify /etc/httpd/conf/httpd.conf with the following contents and then restart the httpd daemon:
# Controls who can get stuff from this server.
#
Order allow,deny
Allow from all
</Directory>
MYSQL
MYSQL Installation and Configuration
1. Install the following rpm packages on the MYSQL server and query machines (valtical, valtical05, valticalui01):
yum -y install mysql-server mysql php-mysql MySQL-python
/sbin/service mysqld start
/sbin/chkconfig mysqld on
2. On valtical, create the database xrootd and the table T_xrootd:
[root@valtical ~]# mysql
mysql> create database xrootd;
mysql> use xrootd;
mysql> create table T_xrootd;
mysql> grant select,insert on xrootd.* to 'xrootd'@'localhost';
mysql> grant all on *.* to xrootd@'137.138.40.184';
mysql> grant all on *.* to xrootd@'137.138.40.143';
mysql> grant all on *.* to xrootd@'137.138.40.190';
mysql> grant all on *.* to xrootd@'137.138.40.186';
mysql> grant all on *.* to xrootd@'137.138.40.165';
mysql> grant all on *.* to xrootd@'137.138.40.181';
mysql> grant all on *.* to xrootd@'137.138.40.166';
mysql> grant all on *.* to xrootd@'137.138.40.140';
mysql> grant all on *.* to xrootd@'137.138.40.173';
mysql> exit
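A quick check from one of the query machines (a sketch; it assumes the grants above allow access without a password from that host, otherwise add -p):
mysql -h valtical -u xrootd -e 'use xrootd; show tables;'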
3. Add the following line to the crontab on valtical00. scan.sh calls /work/users/qing/data5/qing/scan_cluster/scan.py, which creates /work/users/qing/data5/qing/scan_cluster/all_xrootd_files.txt; this file records all files and directories in the xrootd file system.
05 * * * * source /work/users/qing/data5/qing/scan_cluster/scan.sh
4. Add the following line to the crontab on valtical:
25 * * * * source /work/users/qing/data5/qing/mysql/import.sh
NX
NX Installation and Configuration
1. Execute the following commands on valtical05:
yum -y install libjpeg
yum -y install openssl-devel
yum -y install netcat
yum -y install expect
cd /root
scp root@valtical25.cern.ch:/root/NX.tar.gz ./
tar -xvzf NX.tar.gz
ROOTPATH=/root/NX
cd $ROOTPATH
find . -name "*tar.gz" -exec tar -zxf {} \;
cd $ROOTPATH/libpng-1.4.3; ./configure; make
cd $ROOTPATH/nxcomp ; ./configure;make
cd $ROOTPATH/nxcompext ; ./configure;make
cd $ROOTPATH/nxcompshad ; ./configure;make
cd $ROOTPATH/nxproxy ; ./configure;make
cd $ROOTPATH/nx-X11 ; make World
cd $ROOTPATH
cp -a nx-X11/lib/X11/libX11.so* /usr/NX/lib/
cp -a nx-X11/lib/Xext/libXext.so* /usr/NX/lib/
cp -a nx-X11/lib/Xrender/libXrender.so* /usr/NX/lib/
cp -a nxcomp/libXcomp.so* /usr/NX/lib/
cp -a nxcompext/libXcompext.so* /usr/NX/lib/
cp -a nxcompshad/libXcompshad.so* /usr/NX/lib/
cp -a nx-X11/programs/Xserver/nxagent /usr/NX/bin/
cp -a nxproxy/nxproxy /usr/NX/bin/
cd $ROOTPATH/freenx-server-0.7.3
patch < gentoo-nomachine.diff
scp root@valtical25.cern.ch:/root/build/freenx-server-0.7.3/nx_3.3.0.patch ./
patch < nx_3.3.0.patch
make ; make install
cd /usr/NX/bin
./nxsetup --install
./nxserver restart
2. Install NX client on the client machine:
wget http://64.34.173.142/download/3.5.0/Linux/nxclient-3.5.0-7.x86_64.rpm
rpm -Uvh nxclient-3.5.0-7.x86_64.rpm
NX FAQ
How to connect to the NX server on valtical05 from a machine outside the CERN network?
1. Create tunnel.sh with the following lines:
qing@tical31:~$ cat tunnel.sh
#!/usr/bin/env python
import os
import time
import pexpect
import sys
import getpass
user = raw_input("User:")
passw = getpass.unix_getpass("Enter your password:")
if (user,passw) != ("",""):
print "parent thread"
print "Connecting to lxplus"
ssh = pexpect.spawn('ssh -L 10001:valtical05.cern.ch:22 %s@lxplus.cern.ch'%user)
ssh.expect('password')
ssh.sendline(passw)
ssh.expect('lxplus')
ssh.interact()
2. Run python tunnel.sh and log in to lxplus with your CERN NICE account and password.
3. In the NX client configuration, set Host to localhost and Port to 10001, then log in with your NICE account and password. (A plain ssh alternative to tunnel.sh is sketched below.)
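The tunnel created by tunnel.sh is equivalent to a plain ssh port forward, which can be used instead (your_login is a placeholder for your NICE account):
ssh -L 10001:valtical05.cern.ch:22 your_login@lxplus.cern.ch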
NFS
/work
1. On valtical07, modify /etc/exports with the following lines:
[root@valtical07 ~]# cat /etc/exports
/work valtical00(rw,sync,no_root_squash)
/work valtical01(rw,sync,no_root_squash)
/work valtical02(rw,sync,no_root_squash)
/work valtical03(rw,sync,no_root_squash)
/work valtical04(rw,sync,no_root_squash)
/work valtical05(rw,sync,no_root_squash)
/work valtical06(rw,sync,no_root_squash)
/work valticalui01(rw,sync,no_root_squash)
/work valtical08(rw,sync,no_root_squash)
/work valtical09(rw,sync,no_root_squash)
/work valtical17(rw,sync,no_root_squash)
/work valtical19(rw,sync,no_root_squash)
/work valtical20(rw,sync,no_root_squash)
/work valtical24(rw,sync,no_root_squash)
/work 137.138.77.204(rw,sync,no_root_squash)
/work sbctil-rod-01(rw,sync,no_root_squash)
/work sbctil-ttc-01(rw,sync,no_root_squash)
/work sbctil-las-01(rw,sync,no_root_squash)
/work sbctil-ces-01(rw,sync,no_root_squash)
/work atcacpm01(rw,sync,no_root_squash)
2. Start the rpcbind and nfs services on the NFS server and the client machines:
service rpcbind restart
service nfs restart
3. Mount /work on the client machines and add it to /etc/fstab:
mount -t nfs -o rw,rsize=8192,wsize=8192,hard,intr,addr=137.138.40.190 valtical07:/work /work
for d in "/work"
do
mkdir $d
echo "valtical07:$d $d nfs rw,rsize=8192,wsize=8192,hard,intr,addr=137.138.40.190 0 0" >> /etc/fstab
done
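To verify the export and the mount (a sketch using standard nfs-utils tools):
exportfs -v | grep /work     # on valtical07: /work should be listed for the clients above
showmount -e valtical07      # on a client: the export should be visible
df -h /work                  # on a client: the mount should be live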
/data6
1. On valticalui01, modify /etc/exports with the following lines:
/data6 valtical00(rw,sync,no_root_squash)
/data6 valtical01(rw,sync,no_root_squash)
/data6 valtical02(rw,sync,no_root_squash)
/data6 valtical03(rw,sync,no_root_squash)
/data6 valtical04(rw,sync,no_root_squash)
/data6 valtical05(rw,sync,no_root_squash)
/data6 valtical06(rw,sync,no_root_squash)
/data6 valtical07(rw,sync,no_root_squash)
/data6 valtical08(rw,sync,no_root_squash)
/data6 valtical09(rw,sync,no_root_squash)
2. Start the rpcbind and nfs services on the NFS server and the client machines:
service rpcbind restart
service nfs restart
3. Mount /data6 on the client machines and add it to /etc/fstab:
mount -t nfs -o rw,rsize=8192,wsize=8192,hard,intr,addr=137.138.40.173 valticalui01:/data6 /data6
for d in "/data6"
do
mkdir $d
echo "valticalui01:$d $d nfs rw,rsize=8192,wsize=8192,hard,intr,addr=137.138.40.173 0 0" >> /etc/fstab
done
EOS
EOS Setup at CERN
source /afs/cern.ch/project/eos/installation/atlas/etc/setup.sh
You probably need to move files between EOS and your local computer or AFS, or from CASTOR to EOS. Here are a few examples:
# copy a single file
eos cp /eos/atlas/user/t/test/histo.root /tmp/
# copy all files within a directory - no subdirectories
eos cp /eos/atlas/user/t/test/histodirectory/ /afs/cern.ch/user/t/test/histodirectory
# recursively copy the complete hierarchy of a directory
eos cp -r /eos/atlas/user/t/test/histodirectory/ /afs/cern.ch/user/t/test/histodirectory
# recursively copy the complete hierarchy into the directory 'histodirectory' in the current local working directory
eos cp -r /eos/atlas/user/t/test/histodirectory/ histodirectory
# recursively copy the complete hierarchy of a CASTOR directory to an EOS directory (make sure you have the proper CASTOR settings)
eos cp -r root://castorpublic//castor/cern.ch/user/t/test/histordirectory/ /eos/atlas/user/t/test/histodirectory/
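Two more commands that are often useful alongside eos cp (the path is the same example area as above):
eos ls /eos/atlas/user/t/test/histodirectory   # list a directory
eos quota                                      # show your EOS quota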
SAM Monitoring
Service Availability Monitoring of condor jobs
1. The condor job setup scripts are saved in /work/users/qing/data5/qing/condor_test; there you can run condor_submit valtical07.job to send a condor job specifically to valtical07.cern.ch:
# valtical07.job
Universe = vanilla
Notification = Error
Executable = script.bash
Arguments = HWWrel16 valtical07.txt valtical07 valtical07 0 1 1 1
GetEnv = True
Initialdir = /work/users/qing/data5/qing/condor_test/Results/data11_177986_Egamma
Output = logs/valtical07.out
Error = logs/valtical07.err
Log = logs/valtical07.log
requirements = ((Arch == "INTEL" || ARCH == "X86_64") && regexp("valtical00",Machine)!=TRUE) && ((machine == "valtical07.cern.ch"))
+IsFastJob = True
+IsAnaJob = TRUE
stream_output = False
stream_error = False
should_transfer_files = YES
when_to_transfer_output = ON_EXIT_OR_EVICT
transfer_input_files = HWWrel16, list/valtical07.txt, data11_7TeV.periodAllYear_DetStatus-v28-pro08-07_CoolRunQuery-00-04-00_WZjets_allchannels.xml, mu_mc10b.root,
Queue
2. /work/users/qing/data5/qing/SAM_condor/sam.py is developed to send a condor job to each condor worker node, retrieve the output status and save it in /work/users/qing/data5/qing/SAM_condor/result.txt.
# sam.py -- developed by Gang Qin on Jan 9 2012;
import commands
import os
def sendmail(wn_name):
title = "condor test job failed on " + wn_name
content = title + " at " + commands.getoutput("date")
cmd = " echo '" + content + "' | mail -v -s '" + title + "' gang.qin@cern.ch"
#cmd = " echo '" + content + "' | mail -v -s '" + title + "' gang.qin@cern.ch;Luca.Fiorini@cern.ch"
commands.getstatusoutput(cmd)
def kill_oldjobs():
oldjobs = commands.getoutput("condor_q | grep qing | wc -l")
list_oldjob_id = []
if oldjobs != "0":
tmp = commands.getoutput("condor_q | grep qing").split()
for i in range(len(tmp)):
if tmp[i] == 'qing':
list_oldjob_id.append(tmp[i-1])
os.system("condor_rm "+ tmp[i-1])
#os.system("condor_rm -forcex "+ tmp[i-1])
else:
print "No old jobs"
return 0
print "Old jobs ", list_oldjob_id, "are removed from the queue."
return 0
def submit_newjobs(list_wn):
list_submit_time = []
list_jobid = []
for i in range(len(list_wn)):
#os.system("sleep 2")
start_time = commands.getoutput("date +%s")
job_file = list_wn[i] + '.job'
tmp = commands.getoutput("condor_submit " + job_file)
list_jobid.append(tmp.split()[-1].split(".")[0])
list_submit_time.append(start_time)
return list_submit_time, list_jobid
def check_jobs(list_wn,list_jobid):
list = []
for i in range(len(list_wn)):
filename = "/work/users/qing/data5/qing/condor_test/Results/data11_177986_Egamma/Data11_0jets_" + list_wn[i] + ".root"
tmp = commands.getoutput("ls " + filename)
if "No such file or directory" in tmp: # output root file not created
tag_outputfile = -1
else:
tag_outputfile = 1 # output root file created
tmp = commands.getoutput("condor_q | grep qing ")
if list_jobid[i] in tmp:
tag_job_queued = 1 # job still in the queue
else:
tag_job_queued = -1 # job not in the queue
if (tag_outputfile == 1) and (tag_job_queued == -1):
list.append(1) # 1 means good
elif (tag_outputfile == 1) and (tag_job_queued == 1):
list.append(-2) # -2 means outfile created but job not started
elif (tag_outputfile == -1) and (tag_job_queued == 1):
list.append(0) # 0 job is queued
else:
list.append(-1) # -1 means job finished but output root file not created
sendmail(list_wn[i])
time = commands.getoutput("date +%s")
line = time + '\t'
for i in range(len(list)):
line = line + str(list[i]) + '\t'
line = line + '\n'
return line
def record(line):
print line
history_file = open("/work/users/qing/data5/qing/SAM_condor/result.txt","a")
history_file.writelines(line)
history_file.close()
def main():
list_wn = ["valtical04","valtical06","valtical07","valtical08","valtical09",'valtical00']
#list_wn = ["valtical00","valtical04","valtical05","valtical06","valtical07","valtical08","valtical09"]
work_dir = "/work/users/qing/data5/qing/condor_test"
os.chdir(work_dir)
os.system("rm -rf Results/data11_177986_Egamma/Data11_0jets_valtical0*.root")
os.system("rm -rf Results/data11_177986_Egamma/Data11_0jets_valtical0*.txt")
kill_oldjobs()
list_submit_time, list_jobid = submit_newjobs(list_wn)
print list_submit_time, list_jobid, list_wn
os.system("sleep 120")
line = check_jobs(list_wn,list_jobid)
record(line)
if __name__ == "__main__":
main()
3. An acrontab entry was set up for user qing to run /work/users/qing/data5/qing/SAM_condor/sam.sh on valtical05 at minute 30 of every hour.
30 * * * * valtical05.cern.ch source /work/users/qing/data5/qing/SAM_condor/sam.sh >/dev/null 2>&1
# sam.sh
#!/bin/bash
source /etc/profile
source /afs/cern.ch/sw/lcg/contrib/gcc/4.7/x86_64-slc5-gcc47-opt/setup.sh
export ROOTSYS=/afs/cern.ch/sw/lcg/app/releases/ROOT/5.34.02/x86_64-slc5-gcc43-opt/root
export PATH=/afs/cern.ch/sw/lcg/external/Python/2.5.4p2/x86_64-slc5-gcc43-opt/bin:$ROOTSYS/bin:$PATH
export LD_LIBRARY_PATH=$ROOTSYS/lib:/afs/cern.ch/sw/lcg/external/Python/2.5.4p2/x86_64-slc5-gcc43-opt/lib:$LD_LIBRARY_PATH
export PYTHONPATH=$PYTHONPATH:$ROOTSYS/lib
export rootpath=/work/users/qing/data5/qing/SAM_condor
cd $rootpath
export cmd=`ps eux | grep 'python sam.py' | grep -v grep | grep -v 'ps eux'`
echo $cmd
if [ "$cmd" == "" ] ; then
echo 'Running sam.py now'
python sam.py
else
echo 'python sam.py is already running'
fi
#echo "" >> $rootpath/log
4. On valtical.cern.ch, /home/qing/SAM_condor/week.py is developed to convert the result.txt above into plots.
# week.py
from ROOT import *
import commands
import re, array
from array import array
def gr(n,x,y):
gr = TGraph(n,x,y)
gr.SetMaximum(9)
gr.SetMinimum(3)
gr.SetLineWidth(2);
gr.SetMarkerSize(1.5)
return gr
def tex():
t = TLatex()
t.SetNDC()
t.SetTextFont( 62 )
t.SetTextSize( 0.04 )
t.SetTextAlign( 12 )
t.SetTextColor( 1 )
t.DrawLatex( 0.12, 0.83, 'Valtical00' )
t.DrawLatex( 0.12, 0.70, 'Valtical09' )
t.DrawLatex( 0.12, 0.56, 'Valtical08' )
t.DrawLatex( 0.12, 0.43, 'Valtical07' )
t.DrawLatex( 0.12, 0.30, 'Valtical06' )
#t.DrawLatex( 0.12, 0.4, 'Valtical05' )
t.DrawLatex( 0.12, 0.17, 'Valtical04' )
#t.DrawLatex( 0.12, 0.2, 'Valtical00' )
def tex_per(var04,var06,var07,var08,var09,var00):
#def tex_per(var00,var04,var05,var06,var07,var08,var09):
#par00 = str((int(var00*10000)/100.))+"%"
par04 = str((int(var04*10000)/100.))+"%"
#par05 = str((int(var05*10000)/100.))+"%"
par06 = str((int(var06*10000)/100.))+"%"
par07 = str((int(var07*10000)/100.))+"%"
par08 = str((int(var08*10000)/100.))+"%"
par09 = str((int(var09*10000)/100.))+"%"
par00 = str((int(var00*10000)/100.))+"%"
t = TLatex()
t.SetNDC()
t.SetTextFont( 62 )
t.SetTextSize( 0.04 )
t.SetTextAlign( 12 )
t.SetTextColor( 1 )
#t.DrawLatex( 0.79, 0.2, par00 )
t.DrawLatex( 0.79, 0.17, par04 )
#t.DrawLatex( 0.79, 0.37, par05 )
t.DrawLatex( 0.79, 0.30, par06 )
t.DrawLatex( 0.79, 0.43, par07 )
t.DrawLatex( 0.79, 0.56, par08 )
t.DrawLatex( 0.79, 0.70, par09 )
t.DrawLatex( 0.79, 0.83, par00 )
def convert(list):
c1 = TCanvas("c1", "c1",100,0,1024,768)
gStyle.SetOptStat(0)
gStyle.SetPalette(1)
gStyle.SetPaperSize(1024,768)
#c1.SetGrid()
c1.SetFillColor(0)
c1.SetFrameFillColor(0)
c1.SetFrameLineWidth(2)
c1.SetFrameBorderMode(0)
c1.SetFrameBorderSize(2)
c1.Update()
#myps.NewPage()
y1,y2,y3,y4,yy = array('d'), array('d'),array('d'), array('d'),array('d')
x1,x2,x3,x4,xx = array('d'), array('d'),array('d'), array('d'),array('d')
#current_time = (int(commands.getoutput("date +%s"))-int(list[-25].split()[0]))/60
now = int(commands.getoutput("date +%s"))
week = 12*24*7
#list_00 = []
list_04 = []
#list_05 = []
list_06 = []
list_07 = []
list_08 = []
list_09 = []
list_00 = []
for i in range(week):
j = len(list)-week+i
tmp = list[j].split()
time1 = ((int(tmp[0])-now))
time = float(time1/(60*60*24.))
if time1 < -1*3600*24*7:
continue
#list_00.append(int(tmp[1]))
list_04.append(int(tmp[1]))
#list_05.append(int(tmp[3]))
list_06.append(int(tmp[2]))
list_07.append(int(tmp[3]))
list_08.append(int(tmp[4]))
list_09.append(int(tmp[5]))
if len(tmp) > 6:
list_00.append(int(tmp[6]))
else:
list_00.append(0)
#for k in range(7):
for k in range(len(tmp)):
value = int(tmp[k])
if value == 1:
x1.append(time)
y1.append(k+2.5)
elif value == 0:
x2.append(time)
y2.append(k+2.5)
x1.append(time)
y1.append(100)
elif value == -1:
x3.append(time)
y3.append(k+2.5)
x1.append(time)
y1.append(100)
elif value == -2:
x4.append(time)
y4.append(k+2.5)
x1.append(time)
y1.append(100)
x1.append(0)
y1.append(100)
x1.append(-9)
y1.append(100)
x1.append(1)
y1.append(100)
xx.append(0)
yy.append(11.5)
#performance_00 = (list_00.count(1)*1.0)/len(list_00)
performance_04 = (list_04.count(1)*1.0)/len(list_04)
#performance_05 = (list_05.count(1)*1.0)/len(list_05)
performance_06 = (list_06.count(1)*1.0)/len(list_06)
performance_07 = (list_07.count(1)*1.0)/len(list_07)
performance_08 = (list_08.count(1)*1.0)/len(list_08)
performance_09 = (list_09.count(1)*1.0)/len(list_09)
performance_00 = (list_00.count(1)*1.0)/len(list_00)
n = len(x1)
if n!=0:
gr1 = gr(n,x1,y1)
gr1.SetMarkerColor(3)
gr1.SetMarkerStyle(4)
gr1.SetTitle("Service Availability Monitoring of the valtical cluster in the last 7 days")
gr1.GetXaxis().SetTitle("Days to " + commands.getoutput("date"))
gr1.GetYaxis().SetTitle("Condor Work Node List")
#gr1.GetYaxis().SetTitleOffset(0.3)
gr1.GetXaxis().SetTickLength(0.03)
gr1.GetYaxis().SetTickLength(0)
gr1.GetYaxis().SetLabelColor(0)
#gr1.GetYaxis().SetLabelOffset(0)
#gr1.GetXaxis().SetTimeFormat("%d/%m/%Y")
gr1.Draw("AP")
n = len(x2)
if n!=0:
gr2 = gr(n,x2,y2)
gr2.SetMarkerColor(4)
gr2.SetMarkerStyle(30)
gr2.Draw("P")
n = len(x3)
if n!=0:
gr3 = gr(n,x3,y3)
gr3.SetMarkerColor(2)
#gr3.SetMarkerStyle(4)
gr3.SetMarkerStyle(20)
gr3.Draw("P")
n = len(x4)
if n!=0:
gr4 = gr(n,x4,y4)
gr4.SetMarkerColor(4)
gr4.SetMarkerStyle(30)
gr4.Draw("P")
n = len(xx)
if n!=0:
gr5 = gr(n,xx,yy)
gr5.SetMarkerColor(6)
gr5.SetMarkerStyle(23)
gr5.Draw("P")
tex()
tex_per(performance_04,performance_06,performance_07,performance_08,performance_09,performance_00)
#tex_per(performance_00,performance_04,performance_05,performance_06,performance_07,performance_08,performance_09)
c1.Update()
filename="week.jpg"
c1.SaveAs(filename);
def main():
infile = file("result.txt",'r')
list = infile.readlines()
infile.close()
list_wn = ["valtical04","valtical06","valtical07","valtical08","valtical09","valtical00"]
#list_wn = ["valtical00","valtical04","valtical05","valtical06","valtical07","valtical08","valtical09"]
print list_wn
convert(list)
if __name__ == '__main__':
main()
5. On valtical.cern.ch, /home/qing/SAM_condor/week.sh calls week.py and sends the plot to ticalui02.ific.uv.es.
# week.sh
#!/bin/bash
source /etc/profile
source /afs/cern.ch/sw/lcg/contrib/gcc/4.3/x86_64-slc5-gcc43-opt/setup.sh
export ROOTSYS=/afs/cern.ch/sw/lcg/app/releases/ROOT/5.26.00/x86_64-slc5-gcc43-opt/root
export PATH=/afs/cern.ch/sw/lcg/external/Python/2.5.4p2/x86_64-slc5-gcc43-opt/bin:$ROOTSYS/bin:$PATH
export LD_LIBRARY_PATH=$ROOTSYS/lib:/afs/cern.ch/sw/lcg/external/Python/2.5.4p2/x86_64-slc5-gcc43-opt/lib:$LD_LIBRARY_PATH
export PYTHONPATH=$PYTHONPATH:$ROOTSYS/lib
export rootpath=/home/qing/SAM_condor
cd $rootpath
scp root@valtical00.cern.ch:/work/users/qing/data5/qing/SAM_condor/result.txt ./
export cmd=`ps eux | grep 'python week.py' | grep -v grep | grep -v 'ps eux'`
echo $cmd
if [ "$cmd" == "" ] ; then
echo 'Running week.py now'
python week.py
scp ./week.jpg root@ticalui02.ific.uv.es:/var/www/html/cluster/plots/
else
echo 'python week.py is already running'
fi
6. On valtical.cern.ch, a cron job is set up to run /home/qing/SAM_condor/week.sh every hour.
0 * * * * source /home/qing/SAM_condor/week.sh
7. Check the SAM condor performance at
SAM:CONDOR
Service Availability Monitoring of Proof jobs
1. /work/users/qing/data5/qing/SAM_proof/sam_proof.py was designed to send proof jobs to every proof worker node:
# sam_proof.py, developed by Gang Qin on Jan 9 2012;
from ROOT import *
import commands
import os
def sendmail():
title = "proof test job failed "
content = title + " at " + commands.getoutput("date")
cmd = " echo '" + content + "' | mail -v -s '" + title + "' gang.qin@cern.ch"
#cmd = " echo '" + content + "' | mail -v -s '" + title + "' gang.qin@cern.ch;Luca.Fiorini@cern.ch"
os.system(cmd)
def check_job():
filename = "/work/users/qing/data5/qing/SAM_proof/histo.root"
tmp = commands.getoutput("ls " + filename)
if "No such file or directory" in tmp: # output root file not created
tag_outputfile = -1
sendmail()
else:
tag_outputfile = 1 # output root file created
time = commands.getoutput("date +%s")
line = time + '\t' + str(tag_outputfile) + '\n'
return line
def record(line):
history_file = open("/work/users/qing/data5/qing/SAM_proof/result.txt","a")
history_file.writelines(line)
history_file.close()
def submit_newjob():
gROOT.SetBatch(1);
gEnv.SetValue("Proof.StatsHist",0);
gEnv.SetValue("Proof.StatsTrace",0);
gEnv.SetValue("Proof.SlaveStatsTrace",0);
gEnv.Print()
worker = 72
#worker = 16*6
p = TProof.Open("valtical.cern.ch","workers="+str(worker))
p.SetParameter("PROOF_RateEstimation","average")
p.SetParallel(worker);
ch = TChain("wwd3pd")
for i in range(38):
filename = "root://valtical.cern.ch//localdisk/xrootd/sam/" + 'sam_proof_' + str(i) + '.root'
ch.Add(filename)
ch.SetProof()
option=commands.getoutput("pwd")
ch.Process("HistsSel.C++",option,-1)
def main():
work_dir = "/work/users/qing/data5/qing/SAM_proof"
os.chdir(work_dir)
os.system("rm -rf /work/users/qing/data5/qing/SAM_proof/histo.root")
submit_newjob()
#os.system("sleep 240")
line = check_job()
record(line)
if __name__ == "__main__":
main()
2. /work/users/qing/data5/qing/SAM_proof/sam_proof.sh is developed to call sam_proof.py and record the output status in result.txt.
#!/bin/bash
export HOME=/work/users/qing/data5/qing/SAM_proof
export rootpath=/work/users/qing/data5/qing/SAM_proof
cd $rootpath
echo `date` >> /work/users/qing/data5/qing/SAM_proof/log
echo " " >> /work/users/qing/data5/qing/SAM_proof/log
source /etc/profile
export ROOTSYS=/work/users/qing/software/root
export PATH=/work/users/qing/software/python2.4/bin:$ROOTSYS/bin:$PATH
export LD_LIBRARY_PATH=$ROOTSYS/lib:/work/users/qing/software/python2.4/lib:$LD_LIBRARY_PATH
export PYTHONPATH=$PYTHONPATH:$ROOTSYS/lib
export HOME=/work/users/qing/data5/qing/SAM_proof
export cmd=`ps eux | grep 'python sam_proof.py' | grep -v grep | grep -v 'ps eux'`
echo $cmd
if [ "$cmd" == "" ] ; then
echo 'Running sam_proof.py now'
python sam_proof.py >> /work/users/qing/data5/qing/SAM_proof/log 2>&1
export file1=`ls -l histo.root`
echo " " >> /work/users/qing/data5/qing/SAM_proof/log
echo $file1 >> /work/users/qing/data5/qing/SAM_proof/log
echo " " >> /work/users/qing/data5/qing/SAM_proof/log
echo " " >> /work/users/qing/data5/qing/SAM_proof/log
echo " " >> /work/users/qing/data5/qing/SAM_proof/log
else
echo 'python sam_proof.py is already running'
fi
3. An acrontab entry is set up by user qing to call sam_proof.sh at minute 30 of every hour:
30 * * * * valtical05.cern.ch source /work/users/qing/data5/qing/SAM_proof/sam_proof.sh >/dev/null 2>&1
4. On valtical.cern.ch, /home/qing/SAM_proof/week_proof.py converts the result.txt into plots.
[root@valtical SAM_proof]# cat week_proof.py
from ROOT import *
import commands
import re, array
from array import array
def gr(n,x,y):
gr = TGraph(n,x,y)
gr.SetMaximum(1.5)
gr.SetMinimum(-1.5)
gr.SetLineWidth(2);
gr.SetMarkerSize(1.5)
return gr
def tex():
t = TLatex()
t.SetNDC()
t.SetTextFont( 62 )
t.SetTextSize( 0.04 )
t.SetTextAlign( 12 )
t.SetTextColor( 1 )
t.DrawLatex( 0.15, 0.8, 'Valtical09' )
t.DrawLatex( 0.15, 0.7, 'Valtical08' )
t.DrawLatex( 0.15, 0.6, 'Valtical07' )
t.DrawLatex( 0.15, 0.5, 'Valtical06' )
t.DrawLatex( 0.15, 0.4, 'Valtical05' )
t.DrawLatex( 0.15, 0.3, 'Valtical04' )
t.DrawLatex( 0.15, 0.2, 'Valtical00' )
def convert(list):
c1 = TCanvas("c1", "c1",0,0,1024,768)
gStyle.SetOptStat(0)
gStyle.SetPalette(1)
gStyle.SetPaperSize(1024,768)
#c1.SetGrid()
c1.SetFillColor(0)
c1.SetFrameFillColor(0)
c1.SetFrameLineWidth(2)
c1.SetFrameBorderMode(0)
c1.SetFrameBorderSize(2)
c1.Update()
#myps.NewPage()
y1,y2,y3,y4,yy = array('d'), array('d'),array('d'), array('d'),array('d')
x1,x2,x3,x4,xx = array('d'), array('d'),array('d'), array('d'),array('d')
#current_time = (int(commands.getoutput("date +%s"))-int(list[-25].split()[0]))/60
now = int(commands.getoutput("date +%s"))
for i in range(24*7):
j = len(list)-24*7+i
tmp = list[j].split()
time = (int(tmp[0])-now)/3600./24.
#time = time*5
if time < -7:
continue
value = int(tmp[1])
if value == 1:
x1.append(time)
y1.append(1)
elif value == -1:
x2.append(time)
y2.append(-1)
x1.append(time)
y1.append(100)
print x1
xx.append(-0.01)
yy.append(1.2)
x1.append(-7)
y1.append(100)
n = len(x1)
if n!=0:
gr1 = gr(n,x1,y1)
gr1.SetMarkerColor(3)
gr1.SetMarkerStyle(4)
gr1.SetTitle("Service Availability Monitoring of the valtical cluster in the last 7 days")
gr1.GetXaxis().SetTitle("Time(days) to " + commands.getoutput("date"))
gr1.GetYaxis().SetTitle("Proof cluster")
#gr1.GetYaxis().SetTitleOffset(0.3)
gr1.GetXaxis().SetTickLength(0.03)
gr1.GetYaxis().SetTickLength(0)
#gr1.GetYaxis().SetLabelColor(0)
#gr1.GetYaxis().SetLabelOffset(0)
#gr1.GetXaxis().SetTimeFormat("%d/%m/%Y")
gr1.Draw("AP")
n = len(x2)
if n!=0:
gr2 = gr(n,x2,y2)
gr2.SetMarkerColor(2)
gr2.SetMarkerStyle(4)
gr2.Draw("P")
n = len(xx)
print xx,yy
if n!=0:
gr3 = gr(n,xx,yy)
gr3.SetMarkerColor(6)
gr3.SetMarkerStyle(23)
gr3.Draw("P")
#tex()
c1.Update()
filename="week_proof_valtical.jpg"
c1.SaveAs(filename);
def main():
infile = file("result.txt",'r')
list = infile.readlines()
infile.close()
convert(list)
if __name__ == '__main__':
main()
5. On valtical.cern.ch, /home/qing/SAM_proof/week_proof.sh calls the plotting script and sends the plot to ticalui02.uv.es:
#!/bin/bash
source /etc/profile
source /afs/cern.ch/sw/lcg/contrib/gcc/4.3/x86_64-slc5-gcc43-opt/setup.sh
export ROOTSYS=/afs/cern.ch/sw/lcg/app/releases/ROOT/5.26.00/x86_64-slc5-gcc43-opt/root
export PATH=/afs/cern.ch/sw/lcg/external/Python/2.5.4p2/x86_64-slc5-gcc43-opt/bin:$ROOTSYS/bin:$PATH
export LD_LIBRARY_PATH=$ROOTSYS/lib:/afs/cern.ch/sw/lcg/external/Python/2.5.4p2/x86_64-slc5-gcc43-opt/lib:$LD_LIBRARY_PATH
export PYTHONPATH=$PYTHONPATH:$ROOTSYS/lib
export rootpath=/home/qing/SAM_proof
cd $rootpath
scp root@valtical00.cern.ch:/work/users/qing/data5/qing/SAM_proof/result.txt ./
export cmd=`ps eux | grep 'python draw_proof.py' | grep -v grep | grep -v 'ps eux'`
echo $cmd
if [ "$cmd" == "" ] ; then
echo 'Running draw_proof.py now'
python draw_proof.py
rm -f /var/www/html/cluster/plots/SAM_proof_valtical.jpg
cp ./SAM_proof_valtical.jpg /var/www/html/cluster/plots/
else
echo 'python draw_proof.py is already running'
fi
cd $rootpath
6. On valtical.cern.ch, /home/qing/SAM_proof/week_proof.sh is set to run every hour:
0 * * * * source /home/qing/SAM_proof/week_proof.sh
7. Check the SAM proof performance at
SAM:PROOF
High Temperature Alarming
1. lm_sensors installation
yum -y install lm_sensors
sensors-detect
sensors
2. Add the following line to the crontab on valtical.cern.ch. This checks the temperature of each core every 10 minutes; if any core reaches 100 degrees Celsius, an alarm email is sent to the system administrator immediately (a sketch of such a check follows the crontab line).
*/10 * * * * source /work/users/qing/data5/qing/lm_sensors/scanT.sh
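The monitoring script itself is not reproduced on this page; the following is a minimal sketch of the kind of check it performs, assuming 'sensors' prints core temperatures as 'Core 0: +45.0 C ...' and using a placeholder alarm address:
#!/bin/bash
# report any core at or above 100 degrees Celsius
HOT=$(sensors | awk '/^Core/ { t=$3; gsub(/[^0-9.]/,"",t); if (t+0 >= 100) print $0 }')
if [ -n "$HOT" ]; then
    echo "$HOT" | mail -s "High core temperature on $(hostname) at $(date)" admin@example.com
fi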
Miscellaneous
SLC6 installation post setup on worker nodes:
# unify the public key in the cluster
scp root@valtical00.cern.ch:./.ssh/authorized_keys ~/.ssh/
scp root@valtical00.cern.ch:./.ssh/id_rsa ~/.ssh
# cern setup
/usr/sbin/lcm --configure ntpd afsclt
/sbin/chkconfig --add afs
/sbin/chkconfig afs on
/sbin/chkconfig --add yum-autoupdate
/sbin/service yum-autoupdate start
/usr/sbin/lcm --configure srvtab
/usr/sbin/lcm --configure krb5clt sendmail ntpd chkconfig ocsagent ssh
# mount /work and /data6
for d in "/work"
do
mkdir $d
echo "valtical07:$d $d nfs rw,rsize=8192,wsize=8192,hard,intr,addr=137.138.40.190 0 0" >> /etc/fstab
done
for d in "/data6"
do
mkdir $d
echo "valticalui01:$d $d nfs rw,rsize=8192,wsize=8192,hard,intr,addr=137.138.40.173 0 0" >> /etc/fstab
done
service rpcbind start
service nfs start
mount -t nfs -o rw,rsize=8192,wsize=8192,hard,intr,addr=137.138.40.140 valtical00:/work /work
mount -t nfs -o rw,rsize=8192,wsize=8192,hard,intr,addr=137.138.40.173 valticalui01:/data6 /data6
# add users; in /etc/passwd change /bin/bash to /sbin/nologin where needed.
useraddcern solans
useraddcern jvalero
useraddcern yesenia
useraddcern yzhu
useraddcern bmellado
useraddcern ypan
useraddcern tiledaq
useraddcern tilerod
useraddcern ruanxf
useraddcern akkruse
useraddcern valls
useraddcern nkauer
useraddcern snair
useraddcern xchen
useraddcern gtorralb
useraddcern shaque
useraddcern lfiorini
useraddcern ferrer
useraddcern tilelas
useraddcern yangw
useraddcern hpeng
useraddcern qing
useraddcern fcarrio
useraddcern tibristo
useraddcern osilbert
useraddcern scole
useraddcern smartine
useraddcern lmarch
useraddcern jhuston
useraddcern jresende
useraddcern montoya
useraddcern daalvare
useraddcern lcolasur
useraddcern lcerdaal
useraddcern roreed
useraddcern jlima
useraddcern yahuang
useraddcern ghamity
useraddcern svonbudd
useraddcern gamar
# useful packages
yum -y install glibc-devel
yum -y install gparted
yum -y install tkinter
yum -y install xemacs xemacs-packages-*
yum -y install compat-gcc-34-g77
yum -y install compat-libf2c-34
yum -y install libxml2*
yum -y install compat-libtermcap.x86_64
yum -y install openssl098e
yum -y install compat-expat1-1.95.8-8.el6.i686
yum -y install expat-devel.x86_64
yum -y install expat-devel.i686
ln -s /lib64/libexpat.so.1.5.2 /usr/lib64/libexpat.so.0
yum install compat-openldap-2.3.43-2.el6.x86_64
yum install libaio.so.1, libcrypto.so.6
# Modify /etc/inittab to set id:3:initdefault:
PyROOT setup
#install python
cd /work/users/qing/software
mkdir python2.4
wget http://www.python.org/ftp/python/2.4.6/Python-2.4.6.tgz
tar -xvzf Python-2.4.6.tgz
cd Python-2.4.6
./configure --enable-shared --prefix="/work/users/qing/software/python2.4"
gmake
gmake install
#install pyroot
cd /work/users/qing/software
mkdir root5.28
wget ftp://root.cern.ch/root/root_v5.28.00b.source.tar.gz
tar -xvzf root_v5.28.00b.source.tar.gz
cd root
./configure --with-python-incdir=/work/users/qing/software/python2.4/include/python2.4 --with-python-libdir=/work/users/qing/software/python2.4/lib --prefix="/work/users/qing/software/root/root5.28" --etcdir="/work/users/qing/software/root/root5.28/etc"
gmake
gmake install
#Environment setup before using ROOT:
export ROOTSYS=/work/users/qing/software/root
export PATH=/work/users/qing/software/python2.4/bin:$ROOTSYS/bin:$PATH
export LD_LIBRARY_PATH=$ROOTSYS/lib:/work/users/qing/software/python2.4/lib:$LD_LIBRARY_PATH
export PYTHONPATH=$PYTHONPATH:$ROOTSYS/lib
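After exporting the variables above, a quick check that PyROOT picks up this ROOT build (a sketch):
python -c 'import ROOT; print ROOT.gROOT.GetVersion()'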
Data Management
Data production
To run skimming jobs on DATA, please enter /work/offline/qing/Skimmer/trunk and use SendToGRID to send the jobs. For MC, the working directory is /work/offline/qing/Skimmer/trunk.mc.
Data Transfer
Data transfer from Grid sites to xrootd
Either lcg-cp or dq2-get can be used for this download; you can write your own scripts or use the tool below (a minimal manual alternative is sketched after these steps):
cd /afs/cern.ch/user/q/qing/grid2CERN_lcg/data
Put the names of the datasets in list_data.txt.
Run python lcg_data.py to create all.txt.
Run python split.py and then run cp1.sh, cp2.sh, cp3.sh, cp4.sh, cp5.sh on 5 different valtical machines.
Run python lcg_check.py to create download_missing.sh.
Run source download_missing.sh on one or more machines.
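A minimal manual alternative for a single dataset (a sketch; the dataset name is a placeholder and a Grid/DQ2 environment is assumed to be set up already):
dq2-get user.qing.data12_8TeV.periodC.physics_Muons.PhysCont.NTUP_SMWZ.EXAMPLE/
for f in user.qing.*/*.root*; do
    xrdcp "$f" root://valtical.cern.ch//localdisk/xrootd/users/$USER/$(basename "$f")
done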
Consistency check of downloaded files in xrootd
cd /data6/qing/broken
source setup.sh (Juan's working environment to test his ntuples)
cat /work/users/qing/data5/qing/ForUsers/all_xrootd_files.txt | grep 'root://valtical.cern.ch//localdisk/xrootd/users/qing/data12_8TeV/SMDILEP_p1328_p1329/user.qing.data12_8TeV.periodH.physics_Muons.PhysCont.NTUP_SMWZ.grp14_v01_p1328_p1329_2LepSkim_v2' > MuonH.txt
python create.py MuonH.txt > 1.sh
source 1.sh
python find_bad.py
Data transfer from CERN to IFIC
10 TB in AtlasLocalGroupDisk at IFIC can be used to back up some files from the CERN xrootd; to request more space, please contact
sgonzale@ific.uv.es.
cd /afs/cern.ch/user/q/qing/CERN2IFIC/
put the xrootd paths into list_data.txt
python makelist.py to create list_files.txt
python transfer_list.py to create files_to_transfer.txt
source dq2_setup.sh to set up the environment
source files_to_transfer.txt to start the transfer
scp list_files.txt qing@ticalui01.uv.es:./private/data_transfer/list_files.txt
open a new terminal, log in to ticalui01 and cd ~qing/private/data_transfer
python checklist.py
scp qing@ticalui01.uv.es:./private/data_transfer/lustre_file.txt ./
python transfer_missing.py
source files_to_transfer.txt
Data transfer from IFIC to CERN
cd /afs/cern.ch/user/q/qing/cern2ific
# save the list of files into 1.txt
# use make.py to create the codes for transfer
# use split.py to split the transfer into several threads.
Data Deletion
For large-scale xrootd file deletion, please put the list of directories to be deleted in /data6/qing/file_deletion, then run: python scan.py; python delete.py; source delete.sh
Maintenance tasks
On request
- Install new packages in all computers
- Add new users
- Change user default settings
- Remove old users
- Check nfs, xrootd, condor, proofd status
Daily
- Check for and kill zombie processes.
- Check CPU and memory consumption.
- Free cached memory.
- Check SAM performance and Ganglia status
Weekly
- Check for package upgrades
- Check disk space status
- Warn users who use a considerable amount of disk space
- Help users migrate data from NFS to xrootd
- Check /var/log/messages for SMART messages indicating disk problems
Monthly
- Reboot machines in the cluster
- Check for dark files, invalid links and empty directories in xrootd
--
GangQin - Nov 11 2013