Batch Distribution –
A Tool to Run a Batch of Computation Jobs Distributed

Christian Hoene
Telecommunication Networks Group
TU-Berlin
June 8th, 2004

 

Description

Quantitative stochastic simulations are useful tools for studying the performance of stochastic dynamic systems, but they can consume a lot of time and computing resources. Parallel or distributed computation helps to overcome these limits. The tool BatchDistribution distributes a batch of computation jobs (e.g. simulations) across multiple computers.

BatchDistribution has some advantages:

1.      It is robust against failures of computers or jobs.  If a computer fails to compute a job, another computer starts this job. If a computer or a job has failed multiple times, it is skipped.

2.      It has a graphical user interface to monitor the progress and computation performance, and it predicts the time that is needed to finish all jobs.

3.      It is fast. In the final phase, when no unstarted jobs remain, the unfinished jobs are started on several computers at once. Only the result of the computer that finishes a job first is used; the other computers working on the same job are stopped. This avoids waiting a long time for a slow computer.
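The behaviour described in points 1 and 3 can be illustrated with a small scheduling sketch. It is not taken from the BatchDistribution source: the class name SchedulerSketch, its methods, and the failure threshold maxFailures are hypothetical, and the real tool may decide differently (for example, this sketch hands a retried job to the next idle host, which could be the same one).

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

/** Illustrative only: retry failed jobs, skip repeat offenders, duplicate jobs at the end. */
class SchedulerSketch {
    private final int maxFailures; // assumed threshold for "failed multiple times"
    private final Deque<String> pending = new ArrayDeque<>();
    private final Map<String, Set<String>> running = new LinkedHashMap<>(); // job -> hosts
    private final Map<String, Integer> hostFailures = new HashMap<>();
    private final Map<String, Integer> jobFailures = new HashMap<>();

    SchedulerSketch(int maxFailures, Iterable<String> jobs) {
        this.maxFailures = maxFailures;
        for (String job : jobs) {
            pending.add(job);
        }
    }

    /** Returns the job an idle host should start, or null if there is nothing for it to do. */
    String nextJobFor(String host) {
        if (hostFailures.getOrDefault(host, 0) >= maxFailures) {
            return null; // this host has failed too often and is skipped
        }
        String job = pending.poll();
        if (job == null) {
            // Final phase: duplicate a running job that this host is not yet working on.
            for (Map.Entry<String, Set<String>> e : running.entrySet()) {
                if (!e.getValue().contains(host)) {
                    job = e.getKey();
                    break;
                }
            }
        }
        if (job == null) {
            return null;
        }
        running.computeIfAbsent(job, k -> new LinkedHashSet<>()).add(host);
        return job;
    }

    /** The first host to finish wins; returns the other hosts whose copies should be stopped. */
    Set<String> jobFinished(String host, String job) {
        Set<String> hosts = running.remove(job);
        if (hosts == null) {
            return Collections.emptySet(); // another host already delivered this job
        }
        Set<String> others = new LinkedHashSet<>(hosts);
        others.remove(host);
        return others;
    }

    /** Counts the failure and re-queues the job unless it has failed too often. */
    void jobFailed(String host, String job) {
        hostFailures.merge(host, 1, Integer::sum);
        int failures = jobFailures.merge(job, 1, Integer::sum);
        Set<String> hosts = running.get(job);
        if (hosts == null) {
            return;
        }
        hosts.remove(host);
        if (hosts.isEmpty()) {
            running.remove(job);
            if (failures < maxFailures) {
                pending.add(job); // retry on the next idle host
            }
        }
    }

    public static void main(String[] args) {
        SchedulerSketch s = new SchedulerSketch(2, Arrays.asList("job1", "job2"));
        System.out.println(s.nextJobFor("hostA")); // job1
        System.out.println(s.nextJobFor("hostB")); // job2
        System.out.println(s.nextJobFor("hostC")); // job1 again (final phase)
        System.out.println(s.jobFinished("hostC", "job1")); // [hostA]
    }
}
```

In this sketch a duplicate is only started once the queue of unstarted jobs is empty, so no extra work is spent on duplicates while fresh jobs are still waiting.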

Usage

Download the distribution file from http://www2.tkn.tu-berlin.de/equipment/bd/release.zip (version June 14th, 2004) and unzip it in a directory of your choice. If they are not yet installed, install ssh and Java. Create a file that lists all computers you would like to use. Ensure that you can log on to these computers without any password prompt or warning message. Create a second file that lists the command lines describing the jobs. Then start BatchDistribution with:
java -cp <Dir of BatchDistribution/classes> bd.Main <list of hosts> <list of jobs>

A window then opens (Figure 1) and shows status information. After all jobs have finished, the working directory contains log files with the standard output and the error output of the jobs, e.g.

_home_hoene_voip_simus_test2_remote__home_hoene_voip_audio_3_c_m01s42_sw_512309488_on_verleihnix.out.0

_home_hoene_voip_simus_test2_remote__home_hoene_voip_audio_3_c_m01s42_sw_512309488_on_verleihnix.err.0

If no output has been produced, no file is stored.
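For post-processing, the log files can be grouped by host and output stream. The following sketch assumes that every log file name follows the pattern visible in the two examples above, <job>_on_<host>.out.<n> or <job>_on_<host>.err.<n>. This pattern is inferred from those two names only, not from the tool's source, and the meaning of the trailing number is not documented here, so it is kept as an opaque string. The class name LogFileNameSketch is hypothetical.

```java
import java.io.File;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative only: splits a log file name into job part, host, stream and index. */
class LogFileNameSketch {
    // Pattern inferred from the two example file names; an assumption, not a specification.
    private static final Pattern NAME =
            Pattern.compile("^(.*)_on_(.+)\\.(out|err)\\.(\\d+)$");

    final String jobPart;      // text before "_on_"
    final String host;         // text between "_on_" and ".out"/".err"
    final boolean errorOutput; // true for ".err", false for ".out"
    final String index;        // trailing number, meaning not documented

    private LogFileNameSketch(String jobPart, String host, boolean errorOutput, String index) {
        this.jobPart = jobPart;
        this.host = host;
        this.errorOutput = errorOutput;
        this.index = index;
    }

    /** Returns the parsed name, or an empty Optional if the name does not fit the pattern. */
    static Optional<LogFileNameSketch> parse(String fileName) {
        Matcher m = NAME.matcher(fileName);
        if (!m.matches()) {
            return Optional.empty();
        }
        return Optional.of(new LogFileNameSketch(
                m.group(1), m.group(2), "err".equals(m.group(3)), m.group(4)));
    }

    public static void main(String[] args) {
        // List the log files found in the current working directory.
        String[] names = new File(".").list();
        if (names == null) {
            return;
        }
        for (String name : names) {
            parse(name).ifPresent(log -> System.out.println(
                    log.host + " " + (log.errorOutput ? "err" : "out") + " " + name));
        }
    }
}
```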

 

If you want to run a single command on every computer, use the following command:
java -cp <Dir of BatchDistribution/classes> bd.RunEverywhere <list of hosts> "<command including arguments>"

 

Requirements

The tool uses ssh to log on to the other computers. Passwords and passphrases cannot be entered at runtime, so an automatic login is required. To allow ssh to connect to another computer without a password, perform the following steps.

 

1.)    Generate a key pair with ssh-keygen. Normally each user wishing to use SSH with RSA or DSA authentication runs ssh-keygen once to create the authentication key in $HOME/.ssh/identity, $HOME/.ssh/id_dsa or $HOME/.ssh/id_rsa. The public key is stored in a file with the same name but ".pub" appended. The program also asks for a passphrase; the passphrase must be empty.

2.)    Append the public key to the file $HOME/.ssh/authorized_keys on each remote computer.

3.)    Log in once with ssh to every computer that will be used for simulations. On the first login, ssh may ask you to confirm the connection. Subsequent logins then proceed without any questions or password prompts.
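Whether the steps above succeeded can be checked before starting a batch. The following sketch runs ssh with the OpenSSH option BatchMode=yes, which makes ssh fail instead of prompting for a password or a host key confirmation. It is not part of BatchDistribution: the class name PasswordlessLoginCheck, the ten-second timeout, and passing the host names as program arguments are assumptions made for illustration.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;
import java.util.List;

/** Illustrative only: checks that ssh can log in to each host without any prompt. */
class PasswordlessLoginCheck {

    /** Builds the ssh command line that runs the no-op command "true" on the host. */
    static List<String> checkCommand(String host) {
        return Arrays.asList(
                "ssh",
                "-n",                       // do not read from standard input
                "-o", "BatchMode=yes",      // fail instead of asking questions
                "-o", "ConnectTimeout=10",  // assumed timeout in seconds
                host,
                "true");
    }

    /** Returns true if ssh exits with status 0, i.e. the login needed no interaction. */
    static boolean canLogIn(String host) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(checkCommand(host))
                .redirectErrorStream(true)
                .start();
        // Drain the output so that ssh cannot block on a full pipe.
        try (InputStream in = process.getInputStream()) {
            while (in.read() != -1) {
                // discard
            }
        }
        return process.waitFor() == 0;
    }

    public static void main(String[] args) throws Exception {
        // Usage: java PasswordlessLoginCheck host1 host2 ...
        for (String host : args) {
            System.out.println(host + ": " + (canLogIn(host) ? "ok" : "FAILED"));
        }
    }
}
```

A host reported as FAILED usually needs step 2 or step 3 repeated before it is added to the list of hosts.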

Remarks

This software is open source. The software webpage is http://www2.tkn.tu-berlin.de/equipment/bd/. Please send bug reports and remarks to hoene@tkn.tu-berlin.de.

 

 

Figure 1:  Graphical Monitor of BatchDistribution