Scaling out Computing Clusters to EC2

The aim of this use case is to create and configure everything needed to deploy a mini-cluster in a private network, with NIS and NFS, and to let external nodes from EC2 connect to the server via VPN, join this network, and then join the SGE cluster.

Architecture

For all of this we have taken the following technical considerations into account:

  • The private network is only accessible via VPN, so nodes running inside it are completely isolated.
  • The server must provide the NFS, NIS and VPN services.
  • Internal workernodes must automatically start the NFS and NIS client services against the server.
  • External workernodes run on EC2.
  • External workernodes must configure themselves, join the VPN, and then start the NFS and NIS client services against the server (a minimal client-side sketch follows this list).
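As an illustration, the client-side part of these requirements boils down to two small files on each workernode; a minimal sketch, assuming the server exports /home and that the NIS domain is called onecluster (the export path and domain name are assumptions, not taken from the original setup):

# /etc/fstab (fragment) - mount the home directories exported by oneserver
oneserver:/home  /home  nfs  rw,hard,intr  0  0

# /etc/yp.conf - bind the NIS client to oneserver
domain onecluster server oneserver

On the EC2 nodes the same files apply, but they can only take effect once the VPN is up, since oneserver is only reachable through the tunnel.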

With these requirements in mind we have to create 3 images:

  • 2 Xen images: the server and the internal workernode.
  • 1 AMI for EC2 with the same configuration as the internal workernode.

The Server:

private ip eth0 10.1.1.99
public ip eth1 147.96.1.100
hostname oneserver

The Xen workernodes (local)

private ip eth0 10.1.1.55
hostname local01

The EC2 workernodes (IP range from 10.1.1.100 to 10.1.1.254)

vpn ip tap0 10.1.1.100 (assigned by the VPN server)
public ip eth0 automatically assigned by Amazon
hostname workernode0

Configuration

Images Configuration

The procedure to create and configure the images can be quite long, so we recommend that you first create a clean installation of the Linux distribution of your choice (we used Ubuntu). This base image will later be used to create all the other VM images.

Once you have this base image, copy it, mount it, and edit the files /etc/hostname and /etc/network/interfaces; this copy will become the server (we call it oneserver). Copy the original image again and set the proper values in the same files for the workers; we called these nodes workernodeX. Unmount both images and start them with Xen. We recommend the following HOWTOs for NIS and NFS, since they are simple and straightforward: for the server image follow the "server" steps of the HOWTOs, and for the clients (workernodes) the "client" steps.
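As an illustration of this step, the images can be customized by loop-mounting them; a minimal sketch, assuming raw filesystem images and hypothetical file names (ubuntu-base.img, /mnt/img), with the addresses taken from the Architecture section (the netmasks are assumptions):

# Copy the base image and loop-mount it
cp ubuntu-base.img oneserver.img
mkdir -p /mnt/img
mount -o loop oneserver.img /mnt/img

# Set the hostname for this image
echo "oneserver" > /mnt/img/etc/hostname

# Configure the two server interfaces
cat > /mnt/img/etc/network/interfaces <<'EOF'
auto lo
iface lo inet loopback

auto eth0
iface eth0 inet static
    address 10.1.1.99
    netmask 255.255.255.0

auto eth1
iface eth1 inet static
    address 147.96.1.100
    netmask 255.255.255.0
EOF

umount /mnt/img

The same procedure, with the values from the previous section, produces the workernode image.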

When configuring the VPN server it is important to allow clients to use duplicate certificates (all clients share the same certificates). This is because the machines on EC2 are all identical and we do not want to create separate AMIs; to enable it, add the line "duplicate-cn" to the OpenVPN configuration file at /etc/openvpn/server.conf. We used the following HOWTO for the VPN. Note that the VPN was configured ONLY on the server, not on the client images, because the VPN client will be installed on the EC2 images only.
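For reference, the relevant fragment of the server configuration could look like the sketch below; the bridge setup around tap0 is omitted, and everything except the addresses (taken from the Architecture section) and duplicate-cn is an assumption about a standard bridged OpenVPN setup:

# /etc/openvpn/server.conf (fragment)
dev tap0
# Hand out VPN addresses from the range reserved for EC2 nodes
server-bridge 10.1.1.99 255.255.255.0 10.1.1.100 10.1.1.254
# Allow all EC2 instances to connect with the same client certificate
duplicate-cn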

To create an AMI, use the bundle command; in our experience you should get the latest version of the tools, since older ones may have problems bundling the /etc/fstab or /etc/hosts files. Remember that you can use a local workernode to bundle the initial EC2 image. Also install the VPN client and SGE and configure them; the easiest way is to start a copy of a local workernode, configure it, and then bundle it with the command ec2-bundle-image. One minor tweak is still needed: SGE works mainly with the hostnames of the machines, and Amazon automatically assigns names to new EC2 instances. We therefore wrote a script, executed at boot, that generates a hostname from the private VPN IP address; since OpenVPN supports defining address ranges for machines joining the VPN, we reserved the range from 10.1.1.100 to 10.1.1.254 for EC2 instances. All of these names must be listed in the /etc/hosts file of the server (oneserver).
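The naming script can be very simple; a sketch of what it might look like (the script location is our own choice, and the ifconfig parsing assumes the classic Linux output format):

#!/bin/sh
# Hypothetical boot script: derive the node name from the tap0 address
# assigned by the VPN server (10.1.1.100 -> workernode0, 10.1.1.101 -> workernode1, ...).
VPN_IP=$(ifconfig tap0 | sed -n 's/.*inet addr:\([0-9.]*\).*/\1/p')
NODE_ID=$(( ${VPN_IP##*.} - 100 ))
hostname "workernode${NODE_ID}"
echo "workernode${NODE_ID}" > /etc/hostname

The /etc/hosts file on oneserver then needs one entry per address in the reserved range (10.1.1.100 workernode0, 10.1.1.101 workernode1, and so on), so that SGE can resolve every possible EC2 node name.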

OpenNebula EC2 Driver Configuration

Once all of the configuration is finished, we need to create the ONE templates used to launch the machines. You must create one template for each local machine (oneserver, workernode0, workernode1...), but only one template for any number of machines you want to launch on EC2. Check out the documentation on EC2 configuration and template creation here. For example, suppose you have the following images available:

lgonzalez@machine:bin$ ec2-describe-images
IMAGE   ami-e4a94d8d    one-w2/image.manifest.xml       587384515363    available       private         i386    machine
IMAGE   ami-cdb054a4    sge-dolphin/image.manifest.xml  587384515363    available       private         i386    machine
IMAGE   ami-d8b753b1    sge-parrot/image.manifest.xml   587384515363    available       private         i386    machine
IMAGE   ami-dcb054b5    sge-squirrel/image.manifest.xml 587384515363    available       private         i386    machine

Choosing the last image, ami-dcb054b5, you can configure the ONE EC2 template as follows:

CPU    = 1
MEMORY = 1700
EC2    = [ AMI="ami-dcb054b5", KEYPAIR="gsg-keypair", ELASTICIP="75.101.155.97", INSTANCETYPE="m1.small", AUTHORIZED_PORTS="22-25" ]
REQUIREMENTS = 'HOSTNAME = "ec2"'

The ELASTICIP, INSTANCETYPE and AUTHORIZED_PORTS attributes are optional.
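For comparison, a template for one of the local Xen machines might look roughly like the sketch below; the paths and values are purely illustrative, and the exact attribute set depends on your OpenNebula version (check the template documentation linked above):

NAME   = local01
CPU    = 1
MEMORY = 512
OS     = [ kernel = "/boot/vmlinuz-xen", initrd = "/boot/initrd-xen", root = "sda1" ]
DISK   = [ source = "/images/workernode.img", target = "sda1", readonly = "no" ]
NIC    = [ bridge = "br0" ]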

Deploy and Testing

To test all of this, start OpenNebula and add the EC2 host:

lgonzalez@machine:one$ one start
oned and scheduler started
lgonzalez@machine:one$ onehost create ec2 im_ec2 vmm_ec2
lgonzalez@machine:one$ onehost list
 HID NAME                      RVM   TCPU   FCPU   ACPU    TMEM    FMEM STAT
   0 ec2                         0             0    100                   on

Submit the EC2 template you created to launch an EC2 instance:

lgonzalez@machine:one$ onevm create ec2.template
ID: 0

A few moments later the scheduler will deploy the machine on EC2:

lgonzalez@machine:one$ onevm list
  ID     NAME STAT CPU     MEM        HOSTNAME        TIME
   0    one-0 pend   0       0                 00 00:00:05
lgonzalez@machine:one$ onevm list
  ID     NAME STAT CPU     MEM        HOSTNAME        TIME
   0    one-0 boot   0       0             ec2 00 00:00:15

You can then see more detailed information (like the IP address of this machine):

lgonzalez@machine:one$ onevm show 0
VID            : 0
AID            : -1
TID            : -1
UID            : 0
STATE          : ACTIVE
LCM STATE      : RUNNING
DEPLOY ID      : i-1d04d674
MEMORY         : 0
CPU            : 0
PRIORITY       : -2147483648
RESCHEDULE     : 0
LAST RESCHEDULE: 0
LAST POLL      : 1216647834
START TIME     : 07/21 15:42:47
STOP TIME      : 01/01 01:00:00
NET TX         : 0
NET RX         : 0

....: Template :....
    CPU             : 1
    EC2             : AMI=ami-dcb054b5,AUTHORIZED_PORTS=22-25,ELASTICIP=75.101.155.97,INSTANCETYPE=m1.small,KEYPAIR=gsg-keypair
    IP              : ec2-75-101-155-97.compute-1.amazonaws.com
    MEMORY          : 1700
    NAME            : one-0
    REQUIREMENTS    : HOSTNAME = "ec2"

In this case the assigned address is ec2-75-101-155-97.compute-1.amazonaws.com.

Now we check that the machines are running in our cluster: we have one machine running locally on our Xen resources (local01) and one machine running on EC2 (workernode0).

oneserver:~# qstat -f
queuename                      qtype used/tot. load_avg arch          states
----------------------------------------------------------------------------
all.q@local01                  BIP   0/1       0.05     lx24-x86
----------------------------------------------------------------------------
all.q@workernode0              BIP   0/1       0.04     lx24-x86
----------------------------------------------------------------------------

To test the cluster, submit some jobs to SGE via qsub <script.sh>. Before that we need to switch to the nistest account, since that is the user we configured for NIS and SGE.
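The test scripts can be anything SGE can run; the actual contents of test_1.sh and test_2.sh are not shown here, but a minimal sketch could be:

#!/bin/sh
#$ -S /bin/sh
# Trivial SGE job: report the execution host, then keep the slot busy briefly.
echo "Running on $(hostname)"
sleep 30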

oneserver:~# su - nistest

nistest@oneserver:~$ qsub test_1.sh; qsub test_2.sh

Now we can see how the jobs are scheduled and launched on our hybrid cluster.

nistest@oneserver:~$ qstat -f
queuename                      qtype used/tot. load_avg arch          states
----------------------------------------------------------------------------
all.q@local01                  BIP   0/1       0.02     lx24-x86
----------------------------------------------------------------------------
all.q@workernode0              BIP   0/1       0.01     lx24-x86
----------------------------------------------------------------------------
############################################################################
 - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS
############################################################################
   1180 0.00000 test_1.sh  nistest      qw    07/21/2008 15:26:09     1
   1181 0.00000 test_2.sh  nistest      qw    07/21/2008 15:26:09     1

nistest@oneserver:~$ qstat -f
queuename                      qtype used/tot. load_avg arch          states
----------------------------------------------------------------------------
all.q@local01                  BIP   1/1       0.02     lx24-x86
   1181 0.55500 test_2.sh  nistest      r     07/21/2008 15:26:20     1
----------------------------------------------------------------------------
all.q@workernode0              BIP   1/1       0.07     lx24-x86
   1180 0.55500 test_1.sh  nistest      r     07/21/2008 15:26:20     1
----------------------------------------------------------------------------

The interesting feature here is the scalability provided by EC2: you can launch any number of instances on EC2 and add them as worker nodes to your SGE virtual private cluster managed by OpenNebula.