r27 - 25 Nov 2013 - 10:09:36 - Main.qing

IFIC Tier3 Computing Cluster

Link to CERN Computing Cluster

https://twiki.ific.uv.es/twiki/bin/view/Atlas/CERNComputerCluster

Link to User Questioning Area

https://twiki.ific.uv.es/twiki/bin/view/Atlas/UserQuestions

The IFIC Tier3 Computing Cluster

The IFIC Tier3 computing cluster is located in the IFIC cluster room. An IFIC AFS account is required to access the two UIs (user interface machines) remotely.

local > ssh -X login@ticalui01.uv.es or login@ticalui01.ific.uv.es
local > ssh -X login@ticalui02.uv.es or login@ticalui02.ific.uv.es

Computers

| Computer | Activity | Cores | Mem | Local Disk | OS | Kernel version | NFS server | User Quota |
| ticalui01 | UI & Condor Submit Machine | 8 | 48GB | 750GB | SLC5.6 | 2.6.18-128.7.1.el5_lustre.1.8.1.1.ific | /work | 20GB |
| ticalui02 | UI & Condor Submit Machine | 8 | 48GB | 750GB | SLC5.6 | 2.6.18-128.7.1.el5_lustre.1.8.1.1.ific | /data2 | 20GB |

User Space

| Directory | Total size | User default quota | NFS server | Supported users |
| /work | 540GB | 20GB | ticalui01 | qing, fiorini, solans, avalero, valls, samarsan, yeherji, leoceral, march |
| /scratch0 | 460GB | 40GB | ticalui01 | fiorini, solans, avalero, valls |
| /scratch1 | 460GB | 40GB | ticalui01 | qing, samarsan, yeherji, leoceral, march |
| /data2 | 540GB | 20GB | ticalui02 | qing, fiorini, solans, avalero, valls, samarsan, yeherji, leoceral, march |

Cluster topology

  • /work is physically mounted on ticalui01 and /data2 on ticalui02. By default each user has a 20GB quota on /work and another 20GB on /data2.
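
To keep an eye on the 20GB default quota, a small helper along the following lines could be used. This is only a sketch: the function name quota_check is hypothetical, it treats 1GB as 1024*1024 KiB (the page does not say how the quota is counted), and it measures usage with du rather than asking any quota system.

```shell
# Hypothetical helper: report the size of a directory and compare it with a
# quota given in GB (20 by default, the per-user default on /work and /data2).
quota_check() {
    dir=$1
    quota_gb=${2:-20}
    # du -sk prints the total size in 1024-byte units.
    used_kb=$(du -sk "$dir" | awk '{print $1}')
    # Assumption: 1GB is counted as 1024 * 1024 KiB.
    limit_kb=$((quota_gb * 1024 * 1024))
    if [ "$used_kb" -gt "$limit_kb" ]; then
        echo "$dir: ${used_kb} KiB used, OVER the ${quota_gb}GB quota"
    else
        echo "$dir: ${used_kb} KiB used, within the ${quota_gb}GB quota"
    fi
}

# Example call (the per-user directory layout under /work is an assumption):
# quota_check "/work/$USER" 20
```

The second argument can be set to 40 for the /scratch0 and /scratch1 areas.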

How to use root

  • source /afs/ific.uv.es/user/q/qing/software/bin/thisroot.sh
  • source /data2/software/root/bin/thisroot.sh

How to use python2.6

  • export PATH=/work/software/python2.6/bin:$PATH
  • export LD_LIBRARY_PATH=/work/software/python2.6/lib:$LD_LIBRARY_PATH
  • export PYTHONPATH=/work/software/python2.6
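
One detail of exports like these: if LD_LIBRARY_PATH was empty beforehand, the result ends with a colon, and an empty entry is treated as the current directory. The sketch below shows one possible way to avoid that and to avoid duplicate entries when the setup is sourced twice. The helper name prepend_path is hypothetical; the directories are the ones listed above.

```shell
# Hypothetical helper: prepend a directory to a colon-separated variable
# without leaving an empty entry and without adding duplicates.
prepend_path() {
    var=$1
    dir=$2
    # Read the current value of the variable whose name is in $var.
    eval "cur=\${$var-}"
    case ":$cur:" in
        *":$dir:"*) ;;   # already present, nothing to do
        *) eval "export $var=\"\$dir\${cur:+:\$cur}\"" ;;
    esac
}

prepend_path PATH            /work/software/python2.6/bin
prepend_path LD_LIBRARY_PATH /work/software/python2.6/lib
export PYTHONPATH=/work/software/python2.6
```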

UI Setup for SLC5

Condor setup

  • ticalui01: COLLECTOR, MASTER, NEGOTIATOR, SCHEDD, STARTD; this is the machine from which users submit jobs
  • ticalui02: MASTER, STARTD; this machine acts as a worker node (WN)

Skimming job Example using condor

  • log on to ticalui01.ific.uv.es, then run: cd /work/offline/condor_test; condor_submit ticalui02.job. The output file is /work/offline/condor_test/Results/data11_177986_Egamma/Data11_0jets_ticalui02.root
    • /work/offline/condor_test/ticalui02.job: condor job configuration file
    • /work/offline/condor_test/Results/data11_177986_Egamma/HWWrel16: Executable file compiled from /work/offline/tile/yesenia/Analysis/compileHWW.sh
    • /work/offline/condor_test/Results/data11_177986_Egamma/list/ticalui02.txt: path of lustre file to be analyzed
  • /work/offline/tile/yesenia/Analysis/submitCondor_cern/run_jobs shows you how to create multiple condor jobs.
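
The contents of ticalui02.job are not reproduced on this page. As a rough illustration of what a condor job configuration file can look like, the sketch below writes a minimal submit description file. Everything in it is an assumption: the function name make_submit_file, the vanilla universe, the output/error/log file names, and the idea that the executable takes the input list as its only argument.

```shell
# Hypothetical generator for a minimal HTCondor submit description file.
make_submit_file() {
    name=$1   # job name, also used for the .job/.out/.err/.log file names
    exe=$2    # path to the executable
    list=$3   # argument passed to the executable (assumed: the input list)
    cat > "${name}.job" <<EOF
universe   = vanilla
executable = ${exe}
arguments  = ${list}
output     = ${name}.out
error      = ${name}.err
log        = ${name}.log
queue
EOF
}

# Example (paths relative to the job directory are illustrative only):
# make_submit_file ticalui02 ./HWWrel16 list/ticalui02.txt
# condor_submit ticalui02.job
```

Calling the function once per input list is one simple way to create many jobs; the run_jobs script mentioned above may do this differently.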

UI Setup for SLC6

  • lustre setup:
 
yum -y update -x kernel -x kernel-devel -x kernel-headers
#
# lustre kernel
#
yum install -y kernel-2.6.32-358.14.1.el6 kernel-devel-2.6.32-358.14.1.el6 kernel-headers-2.6.32-358.14.1.el6 net-snmp-libs > /dev/console 2>&1
/sbin/new-kernel-pkg --make-default --install 2.6.32-358.14.1.el6.x86_64 > /dev/console 2>&1
/sbin/new-kernel-pkg --mkinitrd --dracut --depmod --update 2.6.32-358.14.1.el6.x86_64 > /dev/console 2>&1

RPM_PROXY=""
/bin/rpm -Uhv $RPM_PROXY \
   http://alpha2.ific.uv.es/linux/ific/03/el6/x86_64/RPMS/lustre-client-1.8.9-wc1_2.6.32_358.14.1.el6.x86_64.x86_64.rpm \
   http://alpha2.ific.uv.es/linux/ific/03/el6/x86_64/RPMS/lustre-client-modules-1.8.9-wc1_2.6.32_358.14.1.el6.x86_64.x86_64.rpm \
   > /dev/console 2>&1

echo "mgs01.ific.uv.es@tcp:/ificfs /lustre/ific.uv.es/ lustre ro,_netdev,user_xattr 0 0" >> /etc/fstab
echo "mgs01.ific.uv.es@tcp:/t3fs   /lustre/ific.uv.es/grid/atlas/t3 lustre ro,_netdev,user_xattr 0 0" >> /etc/fstab
mkdir -p /lustre/ific.uv.es/
  • the rest of the setup is similar to SLC5
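
Since the lustre setup appends to /etc/fstab with echo, running it twice would duplicate the mount entries. A possible guard, shown here as a sketch with a hypothetical function name add_fstab_entry, is to append a line only if it is not already there:

```shell
# Hypothetical helper: append an entry to an fstab-style file only if the
# exact same line is not already present.
add_fstab_entry() {
    fstab=$1
    entry=$2
    # -q: quiet, -x: match the whole line, -F: treat the entry as a fixed string
    if ! grep -qxF "$entry" "$fstab" 2>/dev/null; then
        echo "$entry" >> "$fstab"
    fi
}

# Example with one of the entries from the setup above (run as root):
# add_fstab_entry /etc/fstab "mgs01.ific.uv.es@tcp:/ificfs /lustre/ific.uv.es/ lustre ro,_netdev,user_xattr 0 0"
```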

Raid recovery with /sda or /sdb on ticalui01 and ticalui02 under SLC6

  • Prepare a new disk, partition it with the same layout as the system disk, then wipe it as follows (here the new disk is /dev/sdh):
    • dd if=/dev/zero of=/dev/sdh seek=1465148000 count=10000
    • dd if=/dev/zero of=/dev/sdh2 count=100
    • dd if=/dev/zero of=/dev/sdh1
    • dd if=/dev/zero of=/dev/sdh count=100

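  • when a disk fails, pull it out, wait more than 10 seconds, then insert the new disk and copy the start of the healthy disk (boot sector and partition table) onto it. For example, if /dev/sda is the healthy disk and /dev/sdb is the new one (note that with dd's default 512-byte block size, count=512 copies the first 512 sectors, i.e. 256 KiB, not just the first 512 bytes):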
    • dd if=/dev/sda of=/dev/sdb count=512
  • Add the new disk into raid
    • mdadm --manage /dev/md0 --add /dev/sdb1
    • mdadm --manage /dev/md1 --add /dev/sdb2
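
Because dd's count= is given in blocks (512 bytes each by default) rather than in bytes, it helps to know how much the dd command in the disk-copy step actually transfers. The following sketch checks this safely on a temporary file instead of a real disk; the variable names are illustrative.

```shell
# Compare what "count=512" and "bs=512 count=1" write, using a scratch file.
tmp=$(mktemp)

# Default block size is 512 bytes, so count=512 writes 512 blocks.
dd if=/dev/zero of="$tmp" count=512 2>/dev/null
size_default=$(wc -c < "$tmp")

# Exactly one 512-byte block (the size of a classic boot sector).
dd if=/dev/zero of="$tmp" bs=512 count=1 2>/dev/null
size_one=$(wc -c < "$tmp")

echo "count=512 wrote $size_default bytes; bs=512 count=1 wrote $size_one bytes"
rm -f "$tmp"
```

The first figure is 262144 bytes (256 KiB) and the second 512 bytes.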

Raid recovery with /sda or /sdb on ticalui01 and ticalui02 under SLC5

  • make sure that /dev/sda and /dev/sdb show the same results in 'fdisk -l' and in 'parted; print'. If not, fix the partition types and boot flag in fdisk with the command sequence 't;1;fd;t;2;fd;a;1', otherwise a reboot with only one disk could fail;
  • Install grub on both /dev/sda and /dev/sdb so that the system can boot from either disk. First enter the grub shell, then:
    • device (hd0) /dev/sda
    • root (hd0,0)
    • setup (hd0)
    • device (hd0) /dev/sdb
    • root (hd0,0)
    • setup (hd0)
  • remove /dev/sdb1 and /dev/sdb2 from md0 and md1
    • mdadm --manage /dev/md0 --fail /dev/sdb1
    • mdadm --manage /dev/md1 --fail /dev/sdb2
    • mdadm --manage /dev/md0 --remove /dev/sdb1
    • mdadm --manage /dev/md1 --remove /dev/sdb2
  • check the SCSI ID (host, channel, id, lun) of /dev/sdb
    • dmesg | grep Attached or cat /var/log/dmesg | grep Attached
  • remove disk /dev/sdb from the system
    • echo "scsi remove-single-device 1 0 0 0" > /proc/scsi/scsi (the four numbers are host, channel, id and lun; use the values found in the previous step)
  • pull out /dev/sdb and then insert a new disk
  • Add the new disk to the system
    • echo "scsi add-single-device 2 0 0 0" > /proc/scsi/scsi; if the disks end up mapped to sdc or sdd instead of sdb, try rebooting
    • create new partitions /dev/sdb1 and /dev/sdb2 with the same sizes as the previous two partitions (the first one is 251MiB, formatted as ext3)
  • Add the 2 new partitions to md0 and md1
    • mdadm --manage /dev/md0 --add /dev/sdb1 (use the actual device name if the new disk was mapped differently, e.g. /dev/sde1)
    • mdadm --manage /dev/md1 --add /dev/sdb2
  • The rebuild then starts and takes about 6 hours to finish; check its status in /proc/mdstat
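
To check the rebuild status without reading /proc/mdstat by eye, a small filter such as the one below could be used. It is a sketch under the assumption that the file has the usual layout, where each array starts with a line like "md0 : active raid1 ..." and a missing member shows up as an underscore in the status field (for example [U_]). The function name degraded_arrays is hypothetical.

```shell
# Hypothetical helper: print the names of md arrays with a missing or
# still-rebuilding member, i.e. whose status field such as [U_] contains "_".
# Reads /proc/mdstat by default, or the file given as first argument.
degraded_arrays() {
    awk '
        /^md[0-9]+ :/      { name = $1 }     # remember the current array
        /\[[U_]*_[U_]*\]/  { print name }    # status field with a "_"
    ' "${1:-/proc/mdstat}"
}

# Example: list degraded arrays every minute until none are left.
# while [ -n "$(degraded_arrays)" ]; do degraded_arrays; sleep 60; done
```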

Installing printers at IFIC

  • copy the directory /etc/cups from ticalui01 or any machine which has the printers installed
  • service cups start

Permission denied problem in logging on machines via AFS

  • lcm --configure krb5clt afsclt srvtab (How to recreate keytab file at CERN)

Can log in the machine but can't write into AFS:

  • lcm --reconfigure all
  • kdestroy; recreate krb5.keytab.linux krb5.keytab.windows
  • check if /tmp is full
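
The last check can be scripted. The sketch below reads the usage of the filesystem holding /tmp from df; the function name used_percent and the 95% warning threshold are arbitrary choices for illustration.

```shell
# Hypothetical helper: print the used percentage (without the % sign) of the
# filesystem that holds the given directory.
used_percent() {
    # df -P prints one line per filesystem in the portable format;
    # column 5 is the capacity, e.g. "42%".
    df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

pct=$(used_percent /tmp)
if [ "$pct" -ge 95 ]; then
    echo "/tmp is ${pct}% full"
else
    echo "/tmp usage is ${pct}%"
fi
```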

forms for new user registration:

Large file deletion in the ATLASLOCALDISK area:

  • cd /afs/cern.ch/user/q/qing/cern2ific/del_at_ific
  • source dq2_setup.sh
  • fill list.txt with the SRM paths to be deleted
  • python del.py

How to transfer files from CERN cluster to IFIC

  • log in to valtical00 and 'cd /afs/cern.ch/user/q/qing/CERN2IFIC', fill list_data.txt with the dataset names, then run 'python makelist.py' to create list_files.txt
  • scp list_files.txt qing@ticalui02.ific.uv.es:./private/data_transfer/
  • open a new terminal and log on to ticalui02, 'cd /afs/ific.uv.es/user/q/qing/private/data_transfer', then run 'python checklist.py' to create lustre_file.txt
  • go back to valtical00 and run 'scp qing@ticalui02.ific.uv.es:./private/data_transfer/lustre_file.txt ./', then 'python transfer_list.py' to create files_to_transfer.txt
  • python multi_transfer.py

How to set up the free version of NX server, which supports multiple users and multiple logins:

How to format a 3TB disk (http://www.cyberciti.biz/tips/fdisk-unable-to-create-partition-greater-2tb.html):

  • fdisk -l /dev/sdb
  • parted /dev/sdb
  • (parted) mklabel gpt
  • (parted) unit TB
  • (parted) mkpart primary 0.00TB 3.00TB
  • (parted) print
  • (parted) quit
  • mkfs.ext3 /dev/sdb1
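
The reason for using parted with a GPT label is that a classic MBR partition table addresses at most 2^32 sectors. The arithmetic can be checked as below; this is a sketch assuming 512-byte sectors and a nominal 3TB disk of 3 * 10^12 bytes.

```shell
# Why a 3TB disk needs a GPT label: the MBR limit with 512-byte sectors.
sector_bytes=512
max_sectors=4294967296        # 2^32 sectors addressable by an MBR table
mbr_limit=$((sector_bytes * max_sectors))
disk_bytes=3000000000000      # nominal "3TB" disk (assumption: 3 * 10^12 bytes)

echo "MBR limit: $mbr_limit bytes; disk: $disk_bytes bytes"
if [ "$disk_bytes" -gt "$mbr_limit" ]; then
    echo "disk exceeds the MBR limit, a GPT label is required"
fi
```

The limit works out to 2199023255552 bytes (2 TiB), which is below the size of the disk.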

-- Main.avalero - 13 Sep 2010

Topic attachments

| Attachment | Size | Date | Who | Comment |
| ticalui_LVM.ks | 3.8 K | 20 Apr 2012 - 09:38 | Main.qing | Kickstart configuration file to install SLC5.6 on ticalui with LVM |
| ticalui_nonLVM.ks | 4.2 K | 20 Apr 2012 - 09:49 | Main.qing | |
| UI_setup.sh | 5.8 K | 20 Apr 2012 - 09:59 | Main.qing | |
| NXserver_setup.sh | 1.1 K | 08 Sep 2012 - 15:48 | Main.qing | |
| tunnel.sh | 0.4 K | 08 Sep 2012 - 15:50 | Main.qing | |