MPI Cluster with Python and Amazon EC2 (part 2 of 3)

Today I posted a public AMI which can be used to run a small Beowulf cluster on Amazon EC2 and do some parallel computations with C, Fortran, or Python. If you prefer another language (Java, Ruby, etc.), just install the appropriate MPI library and rebundle the EC2 image. The following set of Python scripts automates the launch and configuration of an MPI cluster on EC2 (currently limited to 20 nodes while EC2 is in beta):

Update (3-19-08): Code for running a cluster with large or xlarge 64-bit EC2 instances is now hosted on Google Code. The new images include NFS, ganglia, IPython1, and other useful Python packages.

http://code.google.com/p/elasticwulf/

Update (7-24-07): I've made some important bug fixes to the scripts to address issues mentioned in the comments. See the README file for details.

The download contains some quick scripts I threw together using the AWS Python example code. This is the approach I'm using to bootstrap an MPI cluster until one of the major Linux cluster distros is ported to run on EC2. Details on what is included in the public AMI were covered in Part 1 of the tutorial; Part 3 will cover cluster operation on EC2 in more detail and show how to use Python to carry out some neat parallel computations.

The cluster launch process is pretty simple once you have an Amazon EC2 account and keys: just download the Python scripts and you can be running a compute cluster in a few minutes. In a later post I will look at cluster bandwidth and performance in detail. If you only have an occasional need for running large jobs, $2/hour for a 20 node MPI cluster on EC2 (20 small instances at $0.10 per instance-hour) is not a bad deal considering the ~$20K price of building your own comparable system.

Prerequisites:

  1. Get a valid Amazon EC2 account
  2. Complete the most recent "getting started guide" tutorial on Amazon EC2 and create all needed web service accounts, authorizations, and keypairs
  3. Download and install the Amazon EC2 Python library
  4. Download the Amazon EC2 MPI cluster management scripts

Launching the EC2 nodes

First, unzip the cluster management scripts and modify the configuration parameters in '''EC2config.py''', substituting your own EC2 keys and changing the cluster size if desired:

#replace these with your AWS keys
AWS_ACCESS_KEY_ID = 'YOUR_KEY_ID_HERE'
AWS_SECRET_ACCESS_KEY = 'YOUR_KEY_HERE'
#change this to your keypair location (see the EC2 getting started guide tutorial on using ec2-add-keypair)
KEYNAME = "gsg-keypair"
KEY_LOCATION = "/Users/pskomoroch/id_rsa-gsg-keypair"
# remove these next two lines when you've updated your credentials.
print "update %s with your AWS credentials" % sys.argv[0]
sys.exit()

MASTER_IMAGE_ID = "ami-3e836657"
IMAGE_ID = "ami-3e836657"

DEFAULT_CLUSTER_SIZE = 5
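
For the curious, ec2-start-cluster.py is a thin wrapper around Amazon's EC2 Python sample library. Here is a minimal sketch of the underlying calls (the run_instances keyword arguments mirror the call quoted in the comments below, but treat this as an illustration rather than the script itself):

import EC2
from EC2config import *

# authenticate using the keys defined in EC2config.py
conn = EC2.AWSAuthConnection(AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY)

# start one master node, then the remaining worker nodes, from the public AMI
master_response = conn.run_instances(imageId=MASTER_IMAGE_ID, minCount=1, maxCount=1, keyName=KEYNAME)
workers = DEFAULT_CLUSTER_SIZE - 1
worker_response = conn.run_instances(imageId=IMAGE_ID, minCount=workers, maxCount=workers, keyName=KEYNAME)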

Launch the EC2 cluster by running the '''ec2-start-cluster.py''' script from your local machine:

 
peter-skomorochs-computer:~/AmazonEC2_MPI_scripts pskomoroch$ ./ec2-start-cluster.py 

image ami-3e836657
master image ami-3e836657
----- starting master -----
RESERVATION r-275eb84e  027811143419    default
INSTANCE    i-0ed33167  ami-3e836657            pending
----- starting workers -----
RESERVATION r-265eb84f  027811143419    default
INSTANCE    i-01d33168  ami-3e836657            pending
INSTANCE    i-00d33169  ami-3e836657            pending
INSTANCE    i-03d3316a  ami-3e836657            pending
INSTANCE    i-02d3316b  ami-3e836657            pending

Verify the EC2 nodes are running with '''./ec2-check-instances.py''':


peter-skomorochs-computer:~/AmazonEC2_MPI_scripts pskomoroch$ ./ec2-check-instances.py 
----- listing instances -----

RESERVATION     r-aec420c7      027811143419    default
INSTANCE        i-ab41a6c2      ami-3e836657    domU-12-31-33-00-02-5A.usma1.compute.amazonaws.com      running
INSTANCE        i-aa41a6c3      ami-3e836657    domU-12-31-33-00-01-E3.usma1.compute.amazonaws.com      running
INSTANCE        i-ad41a6c4      ami-3e836657    domU-12-31-33-00-03-AA.usma1.compute.amazonaws.com      running
INSTANCE        i-ac41a6c5      ami-3e836657    domU-12-31-33-00-04-19.usma1.compute.amazonaws.com      running
INSTANCE        i-af41a6c6      ami-3e836657    domU-12-31-33-00-03-E3.usma1.compute.amazonaws.com      running
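
Behind the scenes, ec2-check-instances.py just calls the sample library's describe_instances and walks the parsed response. A rough sketch, assuming the nested-list format that the library's parse() method returns (the same ['RESERVATION', ...] / ['INSTANCE', ...] chunks quoted in the comments below):

import EC2
from EC2config import AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY

conn = EC2.AWSAuthConnection(AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY)
# each 'INSTANCE' chunk holds [label, instance id, ami id, public hostname, ..., state]
for chunk in conn.describe_instances().parse():
    if chunk[0] == 'INSTANCE':
        print chunk[1], chunk[2], chunk[3], chunk[-1]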

Cluster Configuration and Booting MPI

Run '''ec2-mpi-config.py''' to configure MPI on the nodes; this will take a minute or two depending on the number of nodes.
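
Roughly speaking, the script collects the hostnames of the running instances, writes them to an mpd.hosts file, and pushes your keypair out to the nodes so they can reach each other over ssh without passwords. Paraphrasing the commands it logs (with a placeholder for the master hostname):

scp -i id_rsa-gsg-keypair -o "StrictHostKeyChecking no" id_rsa-gsg-keypair root@<master>:~/.ssh/
ssh -o "StrictHostKeyChecking no" root@<master> "touch .ssh/authorized_keys"
ssh -o "StrictHostKeyChecking no" root@<master> "cp -r .ssh /home/lamuser/"

A full run looks like this: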


peter-skomorochs-computer:~/AmazonEC2_MPI_scripts pskomoroch$ ./ec2-mpi-config.py 

---- MPI Cluster Details ----
Numer of nodes = 5
Instance= i-ab41a6c2 hostname= domU-12-31-33-00-02-5A.usma1.compute.amazonaws.com state= running
Instance= i-aa41a6c3 hostname= domU-12-31-33-00-01-E3.usma1.compute.amazonaws.com state= running
Instance= i-ad41a6c4 hostname= domU-12-31-33-00-03-AA.usma1.compute.amazonaws.com state= running
Instance= i-ac41a6c5 hostname= domU-12-31-33-00-04-19.usma1.compute.amazonaws.com state= running
Instance= i-af41a6c6 hostname= domU-12-31-33-00-03-E3.usma1.compute.amazonaws.com state= running

The master node is ec2-72-44-46-78.z-2.compute-1.amazonaws.com 


... ...

Configuration complete, ssh into the master node as lamuser and boot the cluster:
$ ssh lamuser@ec2-72-44-46-78.z-2.compute-1.amazonaws.com 
> mpdboot -n 5 -f mpd.hosts 
> mpdtrace

Login to the master node, boot the MPI cluster, and test the connectivity:



peter-skomorochs-computer:~/AmazonEC2_MPI_scripts pskomoroch$ ssh lamuser@ec2-72-44-46-78.z-2.compute-1.amazonaws.com 



Sample Fedora Core 6 + MPICH2 + Numpy/PyMPI compute node image 

http://www.datawrangling.com/on-demand-mpi-cluster-with-python-and-ec2-part-1-of-3

---- Modified From Marcin's Cool Images: Cool Fedora Core 6 Base + Updates Image v1.0 ---

see http://developer.amazonwebservices.com/connect/entry.jspa?externalID=554&categoryID=101


Like Marcin's image, standard disclaimer applies, use as you please...

Amazon EC2 MPI Compute Node Image
Copyright (c) 2006 DataWrangling. All rights reserved.

Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are
met:

    * Redistributions of source code must retain the above copyright
       notice, this list of conditions and the following disclaimer.

    * Redistributions in binary form must reproduce the above
       copyright notice, this list of conditions and the following
       disclaimer in the documentation and/or other materials provided
       with the distribution.

    * Neither the name of the DataWrangling nor the names of any
       contributors may be used to endorse or promote products derived
       from this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
"AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
[lamuser@domU-12-31-33-00-02-5A ~]$ 
[lamuser@domU-12-31-33-00-02-5A ~]$ mpdboot -n 5 -f mpd.hosts 
[lamuser@domU-12-31-33-00-02-5A ~]$ mpdtrace
domU-12-31-33-00-02-5A
domU-12-31-33-00-01-E3
domU-12-31-33-00-03-E3
domU-12-31-33-00-03-AA
domU-12-31-33-00-04-19

The results of the mpdtrace command show that we have an MPI ring running on 5 nodes. In the next section, we will verify that we can run some basic MPI tasks. For more detailed information on these MPD commands (and MPI in general), see the MPICH2 documentation.
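
The mpd.hosts file consumed by mpdboot is simply a list of node hostnames, one per line. ec2-mpi-config.py generates it for you; for the cluster above its contents would look something like:

domU-12-31-33-00-02-5A
domU-12-31-33-00-01-E3
domU-12-31-33-00-03-E3
domU-12-31-33-00-03-AA
domU-12-31-33-00-04-19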

Testing the MPI Cluster

Next, we execute a sample C program bundled with MPICH2 which estimates pi using the cluster:


[lamuser@domU-12-31-33-00-02-5A ~]$  mpiexec -n 5 /usr/local/src/mpich2-1.0.5/examples/cpi
Process 0 of 5 is on domU-12-31-33-00-02-5A
Process 1 of 5 is on domU-12-31-33-00-01-E3
Process 2 of 5 is on domU-12-31-33-00-03-E3
Process 3 of 5 is on domU-12-31-33-00-03-AA
Process 4 of 5 is on domU-12-31-33-00-04-19
pi is approximately 3.1415926544231230, Error is 0.0000000008333298
wall clock time = 0.007539

Test the message travel time for the ring of nodes you just created:


[lamuser@domU-12-31-33-00-02-5A ~]$ mpdringtest 100
time for 100 loops = 0.14577794075 seconds

Verify that the cluster can run a multiprocess job:


[lamuser@domU-12-31-33-00-02-5A ~]$ mpiexec -l -n 5 hostname
3: domU-12-31-33-00-03-AA
0: domU-12-31-33-00-02-5A
1: domU-12-31-33-00-01-E3
4: domU-12-31-33-00-04-19
2: domU-12-31-33-00-03-E3

Testing PyMPI

Let's verify that the PyMPI install is working with our running cluster of 5 nodes. Execute the following on the master node:


[lamuser@domU-12-31-33-00-02-5A ~]$ mpirun -np 5 pyMPI /usr/local/src/pyMPI-2.4b2/examples/fractal.py
Starting computation (groan)

process 1 done with computation!!
process 3 done with computation!!
process 4 done with computation!!
process 2 done with computation!!
process 0 done with computation!!
Header length is  54
BMP size is  (400, 400)
Data length is  480000
[lamuser@domU-12-31-33-00-02-5A ~]$ ls
hosts  id_rsa.pub  mpd.hosts  output.bmp

This produced the following fractal image (output.bmp):

(fractal image: output.bmp)

We will show some more examples using PyMPI in the next post.
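
Until then, here is a flavor of the pyMPI API: a rough Python port of the cpi example above (an illustrative sketch, not one of the bundled demos; it assumes pyMPI's allreduce behaves as in standard MPI, summing a value across all processes):

# pi.py -- run with: mpirun -np 5 pyMPI pi.py
import mpi

n = 10000   # number of integration intervals
h = 1.0 / n
local_sum = 0.0
# process k handles intervals k, k+size, k+2*size, ... of 4/(1+x^2) over [0,1]
for i in range(mpi.rank, n, mpi.size):
    x = h * (i + 0.5)
    local_sum += 4.0 / (1.0 + x * x)

pi = mpi.allreduce(local_sum * h, mpi.SUM)
if mpi.rank == 0:
    print "pi is approximately %.16f" % pi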

Changing the Cluster Size

If we want to change the number of nodes in the cluster, we first need to shut down the MPI ring from the master node as follows:


[lamuser@domU-12-31-33-00-02-5A ~]$ mpdallexit
[lamuser@domU-12-31-33-00-02-5A ~]$ mpdcleanup

Once this is done, you can start additional instances of the public AMI from your local machine, then re-run the ec2-mpi-config.py script and reboot the cluster.
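
For example, to grow the 5 node cluster above to 8 nodes: start 3 more instances of the public AMI, then (roughly, with a placeholder hostname):

peter-skomorochs-computer:~/AmazonEC2_MPI_scripts pskomoroch$ ./ec2-mpi-config.py
$ ssh lamuser@<master-node>
> mpdboot -n 8 -f mpd.hosts
> mpdtrace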

Cluster Shutdown

Run '''ec2-stop-cluster.py''' to stop all EC2 MPI nodes. If you just want to stop the slave nodes, run '''ec2-stop-slaves.py'''.
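
Under the hood, the stop script terminates every instance returned by describe_instances; a minimal sketch with the same sample library (an illustration, not the script verbatim):

import EC2
from EC2config import AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY

conn = EC2.AWSAuthConnection(AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY)
# collect the instance ids of every node and terminate them
instance_ids = [chunk[1] for chunk in conn.describe_instances().parse() if chunk[0] == 'INSTANCE']
conn.terminate_instances(instance_ids)

A full run looks like this: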



peter-skomorochs-computer:~/AmazonEC2_MPI_scripts pskomoroch$ ./ec2-stop-cluster.py
This will stop all your EC2 MPI images, are you sure (yes/no)? yes
----- listing instances -----
RESERVATION     r-aec420c7      027811143419    default
INSTANCE        i-ab41a6c2      ami-3e836657    domU-12-31-33-00-02-5A.usma1.compute.amazonaws.com      running
INSTANCE        i-aa41a6c3      ami-3e836657    domU-12-31-33-00-01-E3.usma1.compute.amazonaws.com      running
INSTANCE        i-ad41a6c4      ami-3e836657    domU-12-31-33-00-03-AA.usma1.compute.amazonaws.com      running
INSTANCE        i-ac41a6c5      ami-3e836657    domU-12-31-33-00-04-19.usma1.compute.amazonaws.com      running
INSTANCE        i-af41a6c6      ami-3e836657    domU-12-31-33-00-03-E3.usma1.compute.amazonaws.com      running

---- Stopping instance Id's ----
Stoping Instance Id = i-ab41a6c2 
Stoping Instance Id = i-aa41a6c3 
Stoping Instance Id = i-ad41a6c4 
Stoping Instance Id = i-ac41a6c5 
Stoping Instance Id = i-af41a6c6 

Waiting for shutdown ....
----- listing new state of instances -----
RESERVATION     r-aec420c7      027811143419    default
INSTANCE        i-ab41a6c2      ami-3e836657    domU-12-31-33-00-02-5A.usma1.compute.amazonaws.com      shutting-down
INSTANCE        i-aa41a6c3      ami-3e836657    domU-12-31-33-00-01-E3.usma1.compute.amazonaws.com      shutting-down
INSTANCE        i-ad41a6c4      ami-3e836657    domU-12-31-33-00-03-AA.usma1.compute.amazonaws.com      shutting-down
INSTANCE        i-ac41a6c5      ami-3e836657    domU-12-31-33-00-04-19.usma1.compute.amazonaws.com      shutting-down
INSTANCE        i-af41a6c6      ami-3e836657    domU-12-31-33-00-03-E3.usma1.compute.amazonaws.com      shutting-down

Comments

  • Aarthi 8288

    Hi,

    I was trying to launch a cluster using the given scripts. I got the following error :

    # ./ec2-start-cluster.py
    Traceback (most recent call last):
    File "./ec2-start-cluster.py", line 23, in <module>
    import EC2
    ImportError: No module named EC2

    Can anyone help me out?

  • Adeel Jan

    I think you need to get the Python library :)

    Adeel Jan.

  • Lukearron

    Did anyone solve this problem?  I had the same problem, and I downloaded and unpacked the libraries, but perhaps the python library needs to be _placed_ somewhere?

  • Easwar

    Dear Sir,

      I am trying to use MPI for
    running data-parallel applications. I read this post and, on following up, I have a few queries. It would be great if I could get answers to them.

    a) As of now, I am choosing AMIs using the browser interface of Amazon. You have uploaded two AMIs, one for the master (ami-e813f681) and another for the slaves (ami-eb13f682).
    When I launch an instance, I can only choose one. Which one should I
    choose? (Or how should I choose both, in case I need to?) Do these
    AMIs have an OpenMPI implementation?

    b) Secondly, if I launch multiple instances of a Large CPU (say 3
    instances), I would get many nodes (say m nodes per instance, hence m*3 nodes). Can I communicate between nodes
    of different instances just as we normally do in MPI?

    c) Are you aware of any MPI image with Ubuntu?

    d) I am using the browser, since I have to connect through a proxy. How/where
    do I set proxy connections, if I need to start the cluster using your Python
    scripts?

    Easwar

  • Johnliu

    Hi,

    Does anyone have this running under Cygwin in Windows?

    If so, can you please post your code for ec2-mpi-config.py? I tried using the current file but get lots of errors.

    Thanks,
    John

  • Supriyamunshaw

    I have successfully connected to 5 nodes but am having trouble with the ec2-mpi-config.py script. When I run it, I repeatedly get the following:

    ---- MPI Cluster Details ----
    Numer of nodes = 5
    Instance= i-8d2949e7 external_name = ec2-184-72-183-113.compute-1.a... hostname= ip-10-212-239-33.ec2.internal state=
    Instance= i-832949e9 external_name = ec2-174-129-138-43.compute-1.a... hostname= domU-12-31-39-10-6C-13.compute-1.internal state=
    Instance= i-812949eb external_name = ec2-184-72-141-241.compute-1.a... hostname= domU-12-31-39-0B-00-F8.compute-1.internal state=
    Instance= i-872949ed external_name = ec2-174-129-131-171.compute-1.... hostname= domU-12-31-39-09-C4-24.compute-1.internal state=
    Instance= i-852949ef external_name = ec2-174-129-61-151.compute-1.a... hostname= domU-12-31-39-0C-D8-57.compute-1.internal state=
    5

    The master node is ec2-184-72-183-113.compute-1.a...

    Writing out mpd.hosts file

    scp -i id_rsa-gsg-keypair -o "StrictHostKeyChecking no" id_rsa-gsg-keypair root@ec2-184-72-183-113.compute-1.amaz...:~/.ssh/id_rsa-gsg-keypair

    ssh: connect to host ec2-184-72-183-113.compute-1.a... port 22: Connection timed out
    lost connection

    ssh -o "StrictHostKeyChecking no" root@ec2-184-72-183-113.compute-1.amaz... "touch .ssh/authorized_keys"

    ssh: connect to host ec2-184-72-183-113.compute-1.a... port 22: Connection timed out

    ssh -o "StrictHostKeyChecking no" root@ec2-184-72-183-113.compute-1.amaz... "cp -r .ssh /home/lamuser/"

    It seems I'm having a connection problem. Does anyone know what I can do about this?

  • Supriyamunshaw

    OK, I figured it out. The connection was being made through my default security group, where port 22 was not open.
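
    (For anyone hitting the same timeout: the EC2 getting started guide opens ssh access to the default security group with the API tools; a one-line sketch, assuming the command line tools the guide describes:)

    ec2-authorize default -p 22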

  • Henrywang41

    Can someone please help? ec2-mpi-config.py is giving the following error:

    ---- MPI Cluster Details ----
    Numer of nodes = 2
    Instance= i-bf90f5d5 external_name = ec2-184-73-36-216.compute-1.am... hostname= ip-10-242-1
    18-239.ec2.internal state= running
    Instance= i-bd90f5d7 external_name = ec2-174-129-77-216.compute-1.a... hostname= ip-10-242-
    117-139.ec2.internal state= running

    The master node is ec2-184-73-36-216.compute-1.am...

    Writing out mpd.hosts file
    Traceback (most recent call last):
    File "ec2-mpi-config.py", line 210, in <module>
    sys.exit(main())
    File "ec2-mpi-config.py", line 65, in main
    configure()
    File "ec2-mpi-config.py", line 151, in configure
    rsakeys = open(homedir + "/.ssh/id_rsa", 'r').read()
    IOError: [Errno 2] No such file or directory: 'C:\\Documents and Settings\\Yunzhi Ma/.ssh/id_rsa'

    Could this possibly be due to anything about chunk and parsed_response? I printed out parsed_response:

    [['RESERVATION', 'r-30733b5b', '219669225938', 'default'], ['INSTANCE', 'i-bf90f5d5', 'ami-e813f681'
    , 'ec2-184-73-36-216.compute-1.am...', 'ip-10-242-118-239.ec2.internal', 'running'], ['RESER
    VATION', 'r-36733b5d', '219669225938', 'default'], ['INSTANCE', 'i-bd90f5d7', 'ami-eb13f682', 'ec2-1
    74-129-77-216.compute-1.amazon...', 'ip-10-242-117-139.ec2.internal', 'running']]

    Thanks so much,
    Henry

  • Henrywang41

    Hi,

    Amazon just recently (last month) released a cloud computing instance (http://developer.amazonwebserv...

    Does the code and what you describe here work for this newly released instance for HPC? (It's a CentOS HVM AMI, ami-7ea24a17, under U.S. East.)

    Thanks,

    Henry

  • Henrywang41

    Hi,

    Are Elasticwulf and MPI essentially the same thing? I'm trying to run some high performance computing using an Amazon EC2 cluster.

    Also, is boto necessary to set up a cluster on Amazon EC2? What's the difference between boto and Elasticwulf?

    Thanks,

    Henry

  • Henrywang41

    Hi,

    What exactly is the difference between Elasticwulf and MPI? Are they the same thing? I'm trying to launch a cluster for HPC; which one is more suitable?

    Also, is boto necessary for launching a cluster?

    Thanks,
    Henry

  • bearrito

    Little late on the thread here but would still like some feedback.

    My issues:

    1. I was also being prompted for my password. I ended up using the solution that Raghav suggested.

    2. I am not seeing /usr/local/src/pyMPI-2.4b2/. That directory doesn't appear to be present. I tried to get around this by copying in fractal.py from my local machine. I end up with the following:

    mpirun -np 2 pyMPI /home/lamuser/fractal.py
    pyMPI: can't open file '/home/lamuser/fractal.py': [Errno 2] No such file or directory

    fractal.py is in that directory.

    Advice?

  • miccloud

    I have a python script that calls a program installed on the master node and slave nodes.
    Can I run it from the master node and get the results with mpirun?

    Thanks a lot.

  • miccloud

    Can I execute a bash script on a node with a job?

    Thanks.

  • Harry

    Hello everybody,

    I had an error with the Fortran 77 library in AMI ami-e813f681 (Fedora Core 6, x86 64-bit) because of two libraries, libf2c and libg2c, in the AMI.
    (....
    checking for f_exit in -lf2c... no
    checking for f_exit in -lg2c... no
    checking for dummy main to link with Fortran 77 libraries... unknown
    configure: error: linking to Fortran libraries from C fails
    )

    Could anybody help me solve the problem? Thanks so much!

    best regards,

    Harry

  • Pete

    Jeff,


    I'm actually working on a new version of Elasticwulf right now. Shoot me an email at pete@datawrangling.com and I'll try to include what you need for Rmpi. If you have some sample Rmpi code you want to test and that you don't mind releasing, we can build that into the AMI to ensure everything you need is installed.


    Here are the MPI installs that were included on that Fedora 64 bit image:


    # mpich2 

    cd /usr/local/src/
    wget http://www.mcs.anl.gov/researc...
    tar -xzvf mpich2-1.0.6p1.tar.gz
    cd mpich2-1.0.6p1
    ./configure --enable-sharedlibs=gcc --prefix=/usr/local/mpich2
    make
    make install

    # openmpi

    cd /usr/local/src/
    wget http://www.open-mpi.org/softwa...
    #wget http://www.open-mpi.de/softwar...
    tar -zxf openmpi-1.2.5.tar.gz
    cd openmpi-1.2.5
    ./configure --prefix=/usr/local/openmpi
    make all
    make install

    #lam
    cd /usr/local/src/
    wget http://www.lam-mpi.org/downloa...
    tar -xzvf lam-7.1.2.tar.gz
    cd lam-7.1.2
    ./configure --enable-shared --prefix=/usr/local/lam
    make
    make install

    # mpich1
    cd /usr/local/src/
    wget http://www-unix.mcs.anl.gov/mp...
    tar -zxf mpich.tar.gz
    cd mpich-1.2.7p1/
    ./configure --enable-sharedlib --prefix=/usr/local/mpich
    make
    make install

  • Andrew Lonie

    Hi - I'd be very interested in a Elasticwulf cluster that supports R, too. Did anything come of this? I'd be happy to be involved.

    Andrew

  • pskomoroch

    Yes, I have a Rails REST web service on github now for spawning MPI clusters that support R. Haven't had time yet to finish the docs or write a blog post about it. Works fine in operation...

    http://github.com/datawranglin...

  • Soren Macbeth, Chief Data Scientist @yieldbot. Co-founder of @StockTwits

    Hey Peter,

    Thanks for the awesome ec2cluster project! I forked it on github so that I could add the ability to install R packages from CRAN across all the nodes in the cluster. Basically I added the following to ubuntu_installs.sh:
    --snip--
    # Custom R packages
    cat <<EOF >> /home/ec2cluster/install_custom_packages.R
    install.packages("DEoptim",repos="http://cran.stat.ucla.edu")
    EOF

    R CMD BATCH /home/ec2cluster/install_custom_packages.R
    --snip--

    A bit crude I know, but something people wanting to do things with R will probably find useful.

    To actually run the R code, I was successful with the following approach:

    1) put your R code in a file and save it as foo.R
    2) add the following line to a shellscript: mpirun -n 1 -hostfile /mnt/ec2cluster/openmpi_hostfile R CMD BATCH foo.R
    3) call the shellscript and grab the produced .Rout file to your S3 bucket!

  • pskomoroch

    Soren,

    Glad ec2cluster helped, are you guys big R users at StockTwits?

    -Pete

  • Soren Macbeth, Chief Data Scientist @yieldbot. Co-founder of @StockTwits

    The very first StockTwits prototype used R to generate some statistics as well as generate charts of the stocks being talked about in tweets :)

  • Andrew Lonie

    Thanks - this is impressive, and a web interface for building clusters would be very nice, but maybe I'm after something slightly closer to your original solution. You might already know that R has an in-language cluster support API built on Rmpi, called SNOW (Simple network of workstations). It allows for various script commands like clusterExport(data) and clusterApply(vector, function) which let you interactively cluster jobs according to the parameter values in a list.
    Would this be compatible with your cluster app? I notice that it's more schedule-focused and the nodes need to talk to the app; is the app acting as the master rather than one of the ec2 nodes? Ideally my architecture would be something like a master Rmpi node running on ec2 talking to arbitrary slave nodes, accessed through maybe something like the Biocep remote R client (http://biocep-distrib.r-forge....), with clustering done in-session.

  • pskomoroch

    Yes, this is compatible with the cluster app. One of the bundled examples runs some calculations with R and SNOW; another uses Rmpi (see the code on Github http://bit.ly/ocKCn ).

    The web interface can be run anywhere, but needs to be https accessible to the EC2 nodes. I usually just run it as a small ec2 instance as shown in the docs. You can start a job from the API or the web console with shutdown_after_complete = false, and the cluster will remain live for interactive work, just ssh into the master node like you would with Elasticwulf. The app is not acting as the MPI master node, but the cluster nodes do talk to the app to handle configuration etc.

  • Andrew Lonie

    OK thanks I understand. I'll try this out properly; it sounds like exactly what I'm after.

  • Jeff Howbert

    Hello Pete -


    Thanks for putting together your ElasticWulf scripts and AMIs. They have saved me a huge amount of time and effort compared with building my own from scratch.


    I am interested in parallelizing some machine learning algorithms written in R. My interest in ElasticWulf comes partly from the fact that R is already bundled with its AMIs. I discovered, however, that Rmpi is not one of the installed packages. What were your intentions/plans with R in the ElasticWulf environment? Did you plan for parallel communication using a mechanism other than MPI?


    It wasn't hard to install Rmpi on top of the ElasticWulf AMI, but despite a couple of days' struggle, I haven't found a combination of Rmpi version and paths to the AMI's existing MPI libraries that fully works. The best I've been able to do is spawn an R cluster where all the nodes are running on the master node.


    Could you tell me what version of R and the various MPI implementations (OpenMPI, MPICH, LAM) went into your 64-bit AMI? That might help me sort things out. A couple of observations, for what they're worth:


    1) Once I have an ElasticWulf cluster up and have run mpdboot, I find that mpiexec works, but orterun (the equivalent in OpenMPI) does not.


    2) There has been at least one report of problems between the latest version of Rmpi and OpenMPI:


    https://stat.ethz.ch/pipermail...


    Much thanks.


    Jeff Howbert

  • Ben Racine

    Hello,


    I get the same problem that Michael Creel was having. I am able to start the instances and get them "running" successfully, by pointing them to my keypair with the KEYNAME variable, but I believe my KEY_LOCATION variable in my EC2config.py file must be causing the prompt for a password.


    This is all per the default block of code in EC2config.py:

    # change this to your keypair location (see the EC2 getting started guide tutorial on using ec2-add-keypair)

    KEYNAME = "my_keypair"
    KEY_LOCATION = "/Users/pskomoroch/id_rsa-gsg-keypair"


    I believe this requires me to go back through the "getting started guide", but I just wanted to update my progress in case others are seeing the same thing.


    Many thanks for sharing your progress Peter!


    Ben Racine

  • ej

    @Pete - Feb08



    Can't get the ec2-mpi-config to work. Says list index out of range for mpi-externalnames[0] on line 108



    You are right - the output from ec2-describe-instances has changed. Do the following..


    Change


    machine_state.append(chunk[-1])


    to


    machine_state.append(chunk[5])


    in "ec2-mpi-config.py"


    Or, if the output changes again - just do an "ec2-describe-instances" and match up the required fields to the index on the chunk[] array

  • Pete

    Joanne,


    Try logging in and running your commands as "lamuser" instead of root. The default configuration assumes lamuser is running all commands.


    $ ssh lamuser@ec2-72-44-46-78.z-2.compute-1.ama...


    See part 1 of the post for details on changing the configuration to run MPI as root.


    -Pete

  • jjiyunlee

    Hi,


    Thanks for your writeup! It's very helpful. I'm running into an error with mpdtrace and was hoping for some of your insight into it. I am running mpd as root, with one node for simplicity.


    I can successfully start up mpd on the instance and "mpd &":
    root@...:/etc# mpdboot -n 1 -f mpd.hosts
    root@...:/etc# mpd &
    [1] 2280


    but "mpdtrace -l" gives me an error:
    root@ip-10-251-143-0:/etc# mpdtrace -l
    mpdtrace: unexpected msg from mpd=:{'error_msg': 'invalid secretword to root mpd'}:


    I have tried all pairwise combinations of having MPD_SECRETWORD= or secretword= in ~/.mpd.conf and /etc/mpd.conf, all of which were set to read/write for root only.


    I also can't do "mpdallexit":
    root@...:~# mpdallexit
    mpdallexit: mpd_uncaught_except_tb handling:
    : 'cmd'
    /usr/local/bin/mpich2-install/bin/mpdallexit 53 mpdallexit
    elif msg['cmd'] != 'mpdallexit_ack':
    /usr/local/bin/mpich2-install/bin/mpdallexit 59
    mpdallexit()


    I can also run mpdcheck as a server and have it listen for mpdcheck as a client from the same instance (in a different window).


    Suggestions/help? I'd greatly appreciate any advice you have on this problem. Thanks --


    - Joanne

  • Tim Salimans

    Got it working using OpenSSH, guess PuTTy was the problem after all.

  • Tim Salimans

    Great project and thanks very much for sharing! I do have some trouble getting it all to work though. Everything works fine until it tries to run the create_hosts.py:


    /////// OUTPUT ///////////////


    Creating hosts file on master node and copying hosts file to compute nodes...


    pscp -scp -i D:\grid\keys\keypair.ppk -q create_hosts.py root@ec2-67-202-19-253.
    compute-1.amazonaws.com:/etc/


    plink -ssh -i D:\grid\keys\keypair.ppk root@ec2-67-202-19-253.compute-1.amazonaw
    s.com "python /etc/create_hosts.py"


    exporting 10.252.31.48:/home/beowulf
    exporting 10.252.31.48:/mnt/data
    @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
    @ WARNING: UNPROTECTED PRIVATE KEY FILE! @
    @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
    Permissions 0644 for '/root/.ssh/id_rsa' are too open.
    It is recommended that your private key files are NOT accessible by others.
    This private key will be ignored.
    bad permissions: ignore key: /root/.ssh/id_rsa
    Permission denied, please try again.
    Permission denied, please try again.
    Permission denied (publickey,gssapi-with-mic,password).
    lost connection
    @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
    @ WARNING: UNPROTECTED PRIVATE KEY FILE! @
    @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
    Permissions 0644 for '/root/.ssh/id_rsa' are too open.
    It is recommended that your private key files are NOT accessible by others.
    This private key will be ignored.
    bad permissions: ignore key: /root/.ssh/id_rsa
    Permission denied, please try again.
    Permission denied, please try again.
    Permission denied (publickey,gssapi-with-mic,password).
    lost connection


    etcetera


    //////////////////////////////////


    As you can see I made some small modifications in order to use PuTTy as my SSH client, but that does not seem to be the problem... Does anyone else have this problem, and does anyone know how to fix it?

  • Pete

    Magg,


    I wouldn't recommend it; the latency would be huge and I'm not sure how MPI would handle that. You would also need to open the MPI ports to the outside world using the EC2 security group authorize commands.


    An alternative is to open an X11 session and connect to the head node or maybe VNC in to the instance. The 64 bit elasticwulf images are set up for X11 sessions and adding a desktop package would allow you to VNC in if you prefer that route.


    -Pete

  • Abhimanyu

    I am facing the same problem. I am able to set up the ring manually but mpdboot complains:

    mpdboot_domU-blah(handle_mpd_output 414): from mpd on domU-blah-,
    invalid port info:
    no_port

    any word on this?

  • Abhimanyu

    I found the problem. Apparently I had forgotten to include the "chown -R user:user /home/user" command. It didn't have access to the id_rsa file. As root, mpdboot would work. Rather silly error message, though.

  • pskomoroch

    I think there are some SGE grid solutions now that allow you to add EC2 nodes to your existing cluster, but again the performance of MPI from a local network to EC2 would be horrible...

    If you are interested in running MPI on EC2, I have a new project on Github I'll be announcing soon:

    http://github.com/datawranglin...

  • magg

    Hi Peter,


    Have you tried to connect EC2 instances with your local desktops? I am trying to do that with mpich2 1.0.7 but I am not successful at all. mpdboot complains about invalid port info (no_port) when I try to do mpdboot -n 2. Even when I tried mpd & on the EC2 machine, then mpdtrace -l, then unblocked the port, and then ran mpd -h ec2-blabla -p ec2-mpdtrace-l-port, I still have no luck. Have you faced similar problems?


    Thanks
    - magg

  • Pete

    @Theo,


    I'm attending MMDS this week at Stanford (http://www.stanford.edu/group/mmds/), and had a chance to ask James Demmel a few questions. He gave a talk titled "Avoiding communication in linear algebra algorithms", which was very relevant. His advice for matrix multiplication in a high latency environment like EC2 was to try dialing up the block size as much as possible in the standard MPI solvers and see how performance was affected.


    -Pete

  • Pete

    I found the secret to avoiding a lot of MPI errors on EC2, but haven't found time to do an additional post...


    The secret seems to be that just because Amazon says that an instance is "running", that doesn't mean the ssh daemons are available. This caused all kinds of intermittent problems setting up the hosts, and my old scripts would fail silently.


    In my current codebase, I do some checks like the following:


        print "Instance is %s" % BOOTING_INSTANCE

    # wait for instance description to return "running" and grab HOSTNAME variable
    print "Polling server status (ec2-describe-instances %s)" % BOOTING_INSTANCE
    while 1:
    print "waiting for instance to boot..."
    HOSTNAME = commands.getoutput("ec2-describe-instances %s | grep running | awk '{print $4}'" % BOOTING_INSTANCE)
    if len(HOSTNAME) > 1:
    print "-------Instance booted, The server is available at %s" % HOSTNAME
    DOM_NAME = commands.getoutput("ec2-describe-instances %s | grep running | awk '{print $5}'" % BOOTING_INSTANCE).split('.')[0]
    break
    time.sleep(1)

    # sometimes it takes a while for the ssh service to start, even when the ec2 api describes an instance as running.
    # A machine in the "running" state may not have finished booting. Try executing a no-op command until a valid response is found
    print "verifying ssh daemon has started..."
    counter=0
    while 1:
    print "Waiting for ssh daemon to start..."
    counter += 1
    REPLY = commands.getoutput('''ssh %s "root@%s" 'echo "hello"' ''' % (SSH_OPTS, HOSTNAME) )
    if REPLY == 'hello':
    print "-------ssh has started, proceeding with AMI build"
    break
    if counter > 24:
    print "Instance not respoding to SSH hails, aborting..."
    ## sshd should not take more than 2 minutes to launch
    terminate_status = commands.getoutput('ec2-terminate-instances %s' % BOOTING_INSTANCE)
    ec2_launch_failed = True
    print "Base Instance terminated"
    break
    time.sleep(5)

    if ec2_launch_failed:
    print "Aborting build"
    return

  • Patrick

    Thanks, Peter. The original EC2.py was the problem. I now have the large AMIs up and running. Thanks again for the article and help!


    Patrick

  • Theo

    Peter:


    I am diving into Hadoop with Map/Reduce as we speak. As you know, Google implemented its environment in C++, so I was a bit disappointed that Hadoop had chosen the Java VM to do its bidding. Java makes interfacing with hardcore numerical operations much harder. The particular problems I am looking at are large scale Lanczos solvers to find eigenvalues/eigenvectors of large systems of equations. These systems are of interest in advertising, quantitative finance, and sensor networks. Problem is that they all are environments in which latency is of the essence. So you have a capacity component in terms of the size of the system and a latency issue in terms of the data rate coming in and the opportunity cost for somebody to get to the answer faster.


    I would be interested in working on this particular benchmark problem: pick a big eigenvalue/eigenvector problem and solve it on a cluster, on EC2, and via Hadoop/MapReduce. Clearly this is going to be a lot of work, so this should be publishing-worthy. I am sure many folks would be interested in this experiment, so let me know if this is something you could invest time in.


    Theo

  • Pete

    Patrick,


    Did you start with a clean install of the 64 bit scripts? I made some changes to EC2.py in the new scripts to handle the new instance types...

  • Patrick

    Peter,


    Very useful tool! I've gotten a cluster up and running using the small instance type but am having difficulty launching the x86_64 AMIs.


    $ ./ec2-start-cluster.py
    m1.large
    image ami-eb13f682
    master image ami-e813f681
    ----- starting master -----
    Traceback (most recent call last):
    File "./ec2-start-cluster.py", line 39, in ?
    master_response = conn.run_instances(imageId=MASTER_IMAGE_ID, minCount=1, maxCount=1, keyName= KEYNAME, instanceType=INSTANCE_TYPE )
    TypeError: run_instances() got an unexpected keyword argument 'instanceType'


    If I try to start the cluster without passing an INSTANCE_TYPE arg I get the following:
    $ ./ec2-start-cluster.py
    m1.large
    image ami-eb13f682
    master image ami-e813f681
    ----- starting master -----
    InvalidParameterValue: The requested instance type's architecture (i386) does not match the architecture in the manifest for ami-e813f681 (x86_64)
    ----- starting workers -----
    InvalidParameterValue: The requested instance type's architecture (i386) does not match the architecture in the manifest for ami-eb13f682 (x86_64)


    Any ideas? Thanks!

  • Pete

    Raghav,


    You can ssh in as root instead of lamuser, or compile the output file into your home directory.


    Check out the new AMI and management code:


    http://www.datawrangling.com/pycon-2008-elasticwulf-slides.html


    The new AMI includes a preconfigured NFS mounted directory /home/beowulf. If you compiled the file there, hellompi would be available on all nodes.


    Note that the new images default to the 'large' instance type, which charges $0.40/hour for each node.


    -Pete

  • raghav

    I am trying to compile a simple C MPI file "hellompi.c" using the command:



    mpicc -o /usr/hellompi /usr/local/src/hellompi.c



    Why does it give me the following error?


    /usr/bin/ld: cannot open output file /usr/hellompi: Permission denied
    collect2: ld returned 1 exit status


    How do I get root privileges?

  • Kurt Grandis

    Thanks Pete. I wish I had made the PyCon session, but these posts have been very helpful. The cluster went up pretty quickly and I have already used it to crunch a few minor data runs.
    In setting everything up I also ran into a similar problem as Raghav and ended up solving it in a similar manner by forcing the -i credentials switch. I imagine it has something to do with the way I configured and placed my certs.

  • raghav

    Thanks, Pete, for your prompt reply!!

  • raghav

    Hey guys,
    Actually I made a change in the ec2-mpi-config.py file. I have no clue about Python and I don't know why it worked, but it worked.


    I modified:


    template = 'ssh -o "StrictHostKeyChecking no" %(user)s@%(host)s "%(cmd)s"'
    to
    template = 'ssh -i "/home/id_rsa-gsg-keypair" %(user)s@%(host)s "%(cmd)s"'


    and


    template = '%(cmd)s %(switches)s -o "StrictHostKeyChecking no" %(src)s %(user)s@%(host)s:%(dest)s'
    to
    template = '%(cmd)s %(switches)s -i "/home/id_rsa-gsg-keypair" %(src)s %(user)s@%(host)s:%(dest)s'


    And it started working perfectly. I was able to log in to the master node and the pi example executed fine.


    Thanks a lot guys


    Cheers,
    Raghav

  • Pete

    raghav,


    Another suggestion is to make sure the instances are running with ./ec2-check-instances.py and then retry the script; sometimes it takes a while for sshd to start up on EC2.


    -Pete

  • Pete

    raghav,


    I assume you were able to start the instances with ec2-start-cluster.py? The text on the terminal is normal, but it shouldn't ask you for a password (I should probably add a verbose option instead of streaming out text by default). There was a path issue on Windows with an earlier version of the scripts, so that may be the problem.


    If you send me the script version number from the README and/or terminal output, I can try to track down what is going on...


    peter.skomoroch@gmail.com


    -Pete

  • raghav

    Why does it ask me for a password when I try to run the ec2-mpi-config.py file?
    It says root@xxx password:
    And I get a lot of text on the terminal when I try running the file.

  • Pete

    No problem, thanks for finding the typos. These were meant to be some quick hacks, but took on a life of their own after a while.


    I found this worked for configuring LAM; I'll send you more details in an email...


    The contents of bash_profile should be as follows:


     
    -bash-3.1# more .bash_profile
    # .bash_profile

    # Get the aliases and functions
    if [ -f ~/.bashrc ]; then
    . ~/.bashrc
    fi

    # User specific environment and startup programs

    LAMRSH="ssh -x"
    export LAMRSH

    LD_LIBRARY_PATH="/usr/local/lam-7.1.2/lib/"
    export LD_LIBRARY_PATH

    MPICH_PORT_RANGE="2000:8000"
    export MPICH_PORT_RANGE

    PATH=$PATH:$HOME/bin

    PATH=/usr/local/lam-7.1.2/bin:$PATH

    MANPATH=/usr/local/lam-7.1.2/man:$MANPATH

    export PATH
    export MANPATH


    Launch the cluster on EC2 and try booting LAM manually:


    [lamuser@domU-12-31-33-00-04-4B ~]$ lamboot /etc/mpd.hosts

    [lamuser@domU-12-31-33-00-04-4B ~]$ lamnodes
    n0 domU-12-31-33-00-04-4B.usma1.c...:1:origin,this_node
    n1 domU-12-31-33-00-03-35.usma1.c...:1:
    n2 domU-12-31-33-00-03-3C.usma1.c...:1:
    n3 domU-12-31-34-00-00-55.usma2.c...:1:

    [lamuser@domU-12-31-33-00-04-4B ~]$ tping N -c3
    1 byte from 3 remote nodes and 1 local node: 0.039 secs
    1 byte from 3 remote nodes and 1 local node: 0.004 secs
    1 byte from 3 remote nodes and 1 local node: 0.002 secs

  • pete

    Found another typo too, OK I'm nitpicking. In the stop-cluster script the message says Stoping as opposed to Stopping. A year ago when you first posted this stuff you mentioned that the reason why the non-root user was called lamuser was that the scripts were used for LAM in some previous incarnation. Since I'm actually trying to use LAM, I'd appreciate any LAM stuff you have around that might help me iron out one or two problems I still have.


    Anyway, thanks again,
    Pete

  • Pete

    pete found the error... the image IDs he entered into the config module inadvertently contained a capital letter. This doesn't cause any problems for starting images, since string case is ignored by Amazon. The corresponding image ID response string from AWS is always lowercase, so the Python script's comparison on the image ID string fails.


    In the next version of the scripts, I will handle upper/lowercase differences in the AMI strings. For now, just make sure to use all lowercase or call the Python .lower() method:


     
    >>> test = 'ami-fE9a7f97'
    >>> test.lower()
    'ami-fe9a7f97'
    >>>

  • pete

    Can't get the ec2-mpi-config to work. Says list index out of range for mpi-externalnames[0] on line 108.
    start cluster and check instances are OK, so I think that Python, EC2, and ElementTree
    are OK.
    Any ideas why? Has AWS changed the format of the response you're parsing? (Yes, I have had a look at the Python code, but since I haven't used Python before I can't see anything obvious to me.)
    BTW you have a typo in mpi config: Numer of nodes as opposed to Number of nodes; it even shows in your example above.
    Otherwise I like what you've done, I'd just like it to work for me.
    Thanks,
    Pete

  • Pete

    Theo,


    Sorry for the delay in posting this and responding. I've been working on a startup for the past 7 months and was in serious crunch mode. Don't read too much into the large gap in posts, it is just me working on this as a side-project. I finished moving the blog to another host and finally have some time to get back to the EC2 work. This experience has taught me to never name a series of blog posts "part 1 of N" :)


    You make some excellent points. One thing that has changed since I wrote the first post is that EC2 now offers larger 64-bit machine images with better I/O (you can provision an entire physical server and not be limited by sharing network resources in the virtual instance). I'd like to see if this improves the network performance. I'm giving a talk on this in March, so I'm on the hook to have some benchmarks by then.


    I also agree on the mapreduce side. For embarrassingly parallel problems, hadoop on ec2 is potentially much more attractive...more robust, easier for most people to program. Ideally, I would like to do some comparisons between the two approaches and run the numbers.


    The performance of an EC2 MPI cluster is definitely going to be worse than your own custom hardware, but it still might fit certain niche situations. In my case, I needed to run some MPI code for a large problem and didn't have access to a large enough cluster. The performance on EC2 was nowhere near what you get on a high-end cluster, but it got the job done for a reasonable price.


    This discussion on the beowulf list goes into more detail on the pros/cons:


    http://www.beowulf.org/pipermail/beowulf/2008-January/020490.html


    -Pete

  • Theo

    Does the 5 month hiatus in this project mean that it was a bad idea and you guys have learnt enough to waste no more time on it?


    Given the virtualization uncertainty, finding the right communication/computation balance for typical MPI programs appears to be very unrewarding. Secondly, MPI development and debug and then QA and scale out are not addressed, which doesn't bode well. It appears most productive to have a local small cluster for development and debug, and then do QA and scale out on EC2, but some benchmarking numbers would really help.


    If EC2 is only robust for embarrassingly parallel problems, then MapReduce style programs are more attractive. There the size of the data set and how well it integrates in a distributed file system appear to be the problems to focus on. Or BOINC like approaches if there is no integrated DFS. Anyone have operational data on these approaches?

  • Patrick Ball


    the first two parts really set the stage ... Part 3?


    :)

  • Soo..

    What about that Part 3? :)

  • Pete

    Update (7-24-07): I’ve made some important bug fixes to the scripts to address issues mentioned in the comments.


    Specific changes made:


    - fixed lamuser home directory permissions bug
    - fixed section of ec2-mpi-config.py which clobbered existing rsa keys on the client machine
    - updated calls to the AWS Python EC2 library to use API version 2007-01-19
      http://developer.amazonwebserv...
    - fixed mpdboot issue by using Amazon internal DNS names in hosts files
    - scripts should now work on windows/cygwin client environments

    After I run some benchmarks, I'm hoping to find some time to add LAM and OpenMPI to the EC2 image along with NFS configuration, C3 cluster tools, Ganglia, and a benchmarking package.

  • Pete

    Ralph,


    More good points. I've been tied up with some other projects, but it sounds like enough feedback is in to make a revised version of the image and scripts. I expect the latency to vary a bit depending on the random EC2 network topology when a cluster is launched (instances on the same box vs. over ethernet); that might explain the ringtest. The mutual ssh access was set up since we do a lot of file/data shuffling between nodes outside of MPI.


    Thanks again, looking forward to hearing how the regression test system works out.


    -Pete

  • Ralph Giles

    Yeah, that would work better. Some more detailed comments:


    - Your image has /home/lamuser/.mpd.conf owned by root. I had to chown it to lamuser before I could start mpd.

    - Your script passes the public dns names for the nodes into mpd.hosts. For that to work, a hole has to be opened in the firewall for the ports the mpi daemon is using. A simpler solution is to just pass the internal dns names. Then all the traffic happens behind the firewall, which probably also improves latency. (Although my ringtest was noticeably slower than yours, averaging 2.2e-3 seconds/loop, so who knows?)

    - I was surprised that when I originally ran ec2-add-keypair in the EC2 tutorial it uploaded the public key (ok) and printed out the private key (ok I guess) but didn't print out the public key locally (weird). Your scripts seem to assume the public key is available as id_rsa.pub on the client machine. Shouldn't this first be copied either from /root/.ssh/authorized_keys on the master node (as installed by amazon) or retrieved through the query interface?

    Is the mutual ssh access required for more than just launching the MPI daemon? If all subsequent traffic goes through the mpi daemons, starting mpd from the client machine, or automatically from the init scripts after pulling mpd.hosts from S3, would save the whole trouble, including uploading the private key at all.

  • Pete

    Ralph,


    Good catch. Thanks for pointing that out. I just lifted those passwordless ssh lines straight from an MPI tutorial.


    This might solve the clobbering as well (from http://www.maclife.com/forums/topic/61520):


    cat id_rsa.pub >> .ssh/authorized_keys


    "The above command will create the "authorized_keys" file in the ".ssh" directory if that file doesn't already exist, and it will append the new id_rsa.pub file to it if it does already exist."


    I'll add that change to the scripts. Good luck with the regression cluster, I heard Oracle developers do something like that using Condor on otherwise idle desktops (see http://www.cs.wisc.edu/condor/doc/nmi-lisa2006-slides.pdf).


    -Pete

  • Ralph Giles

    ===== DO NOT USE THESE SCRIPTS! =====


    This section of ec2-mpi-config.py is a bit problematic:


    os.system('cp %s ~/id_rsa.pub' % KEY_LOCATION )
    os.system('cp ~/id_rsa.pub ~/.ssh/id_rsa')


    This will clobber any existing rsa key on the initiating machine's account, and will break normal auth on the next login if you have a different default rsa key!


    The script should instead copy the private key directly from KEY_LOCATION to the nodes.


    ===== DO NOT USE THESE SCRIPTS! =====


    Otherwise, way cool. Thanks for putting this tutorial together. We're trying EC2 clusters out as a way to get quicker feedback from regression tests after changes to our software. Unfortunately, with the one hour granularity I don't think it will be price competitive. We want 20-100 nodes for about 5 minutes at a time.

  • Pete

    Mike Cariaso modified my scripts to fix some path issues and got it working on a Windows laptop; he might have also fixed some other errors I didn't notice. I haven't had a chance to try them yet, but you can download the modified scripts here:


    http://mpiblast.pbwiki.com/AmazonEC2

  • Michael Creel

    Yep, I run the mpi-config script right after creating the instances, doing just what you suggest. The fact that the instances start up at all seems to me to mean that the keypair information is ok. Do you know if anyone but you has been able to launch a cluster? Very cool stuff. I'm going to be looking into making a Debian AMI that works the same way.

  • Pete

    Michael,


    I haven't had the scripts prompt me for a password before; are you running them from your local machine? The mpi-config script expects the keyname and keypair location to match what was used to start the instance. Take a look at your EC2config.py file and make sure the instances were all started with your own keypair (I used the gsg-keypair I created on my laptop in the Amazon "getting started guide" tutorial):





    AWS_ACCESS_KEY_ID = 'YOUR_KEY_ID_HERE'
    AWS_SECRET_ACCESS_KEY = 'YOUR_KEY_HERE'
    MASTER_IMAGE_ID = "ami-3e836657"
    IMAGE_ID = "ami-3e836657"
    KEYNAME = "gsg-keypair"
    KEY_LOCATION = "~/id_rsa-gsg-keypair"
    DEFAULT_CLUSTER_SIZE = 5





    I'm working on an updated version of the scripts and EC2 image which should make things a bit cleaner. Sorry the code is ugly right now in terms of error handling...I just wanted to toss something together to get people started :)

  • Michael Creel

    A little report on my trial.
    1) ./ec2-start_cluster.py is not always successful in getting the requested number of nodes to come up. The instances sometimes have status "terminated" before anything is done with them.


    2) When the 5 nodes all come up, I still get a problem with ./ec2-mpi-config.py requesting a root password:


    michael@yosemite:~/ec2/AmazonEC2_MPI_scripts$ ./ec2-mpi-config.py


    ---- MPI Cluster Details ----
    Numer of nodes = 5
    Instance= i-e39c7a8a hostname= ec2-72-44-45-138.z-2.compute-1... state= running
    Instance= i-e29c7a8b hostname= ec2-72-44-45-185.z-2.compute-1... state= running
    Instance= i-e59c7a8c hostname= ec2-72-44-45-186.z-2.compute-1... state= running
    Instance= i-e49c7a8d hostname= ec2-72-44-45-122.z-2.compute-1... state= running
    Instance= i-e79c7a8e hostname= ec2-72-44-45-60.z-2.compute-1.... state= running


    The master node is ec2-72-44-45-138.z-2.compute-1...


    Writing out mpd.hosts file
    nslookup ec2-72-44-45-138.z-2.compute-1...
    (0, 'Server:\t\t158.109.0.1\nAddress:\t158.109.0.1#53\n\nNon-authoritative answer:\nName:\tec2-72-44-45-138.z-2.compute-...\nAddress: 72.44.45.138\n')
    nslookup ec2-72-44-45-185.z-2.compute-1...
    (0, 'Server:\t\t158.109.0.1\nAddress:\t158.109.0.1#53\n\nNon-authoritative answer:\nName:\tec2-72-44-45-185.z-2.compute-...\nAddress: 72.44.45.185\n')
    nslookup ec2-72-44-45-186.z-2.compute-1...
    (0, 'Server:\t\t158.109.0.1\nAddress:\t158.109.0.1#53\n\nNon-authoritative answer:\nName:\tec2-72-44-45-186.z-2.compute-...\nAddress: 72.44.45.186\n')
    nslookup ec2-72-44-45-122.z-2.compute-1...
    (0, 'Server:\t\t158.109.0.1\nAddress:\t158.109.0.1#53\n\nNon-authoritative answer:\nName:\tec2-72-44-45-122.z-2.compute-...\nAddress: 72.44.45.122\n')
    nslookup ec2-72-44-45-60.z-2.compute-1....
    (0, 'Server:\t\t158.109.0.1\nAddress:\t158.109.0.1#53\n\nNon-authoritative answer:\nName:\tec2-72-44-45-60.z-2.compute-1...\nAddress: 72.44.45.60\n')
    Warning: Permanently added 'ec2-72-44-45-138.z-2.compute-1...,72.44.45.138' (RSA) to the list of known hosts.
    id_rsa.pub 100% 1675 1.6KB/s 00:00
    root@ec2-72-44-45-138.z-2.compute-1.am...'s password:


    This is as far as I can get at the moment. Looks like a minor problem. Cheers, M.

  • Michael Creel

    One question: do you know if something like an NFS shared home directory is possible? Using S3, possibly?

  • Michael Creel

    Excellent stuff! I've gotten started with EC2 and I'll be trying your images out soon. I doubt that I'll be trying to make ParallelKnoppix work on EC2, because your approach is the right one, I think. PK is designed to use when the hardware is not known ahead of time. With EC2, the hardware is known, so a tailor-made image is the way to go. Your scripts allow an on-demand cluster to be created in minutes, and that's all that PK offers, anyway. PK usually needs some remastering so that users can add their own packages. Re-bundling an EC2 image is completely analogous. I'm planning on doing just that, probably starting with your images, and doing some testing of latency on tasks that require different degrees of internode communication. Thanks for all this, it'll make the rest an easy job.
