3. Software choices

After playing with Slackware, Red Hat and other Linux distributions since 1994, I discovered Debian in early 2000, and I don't see any reason to change now or in the near future. My workstations run a heavily customized Debian, and so do my servers (at home and in datacenters). I will remain consistent.

Raspbian (http://www.raspbian.org) is the logical choice. I want to be able to use this mini-cluster as a lab to learn as much as I can about parallel computing. That includes distributed parallel computation using LAM/MPI or PVM, using a batch queue manager such as DrQueue or Torque, or using Mosix. I would also like to test parallel storage at the filesystem level, using HDFS, DRBD, GFS, ... or distributed databases such as MongoDB, Hadoop, BigQuery, ... Finally, I want to test load-balancing and high-availability theory using haproxy, heartbeat, pacemaker, stonith, LVS... Maybe, if I can, I might test OpenStack to create a personal cloud, but given the hardware specs, each node might only be able to execute one single VM image... The Raspberry cluster won't be a good choice for this purpose. I would also like to use distcc to try distributed compilation. Maybe Globus, too...

Cluster boot

Architecture design

As the Raspberry Pi has no embedded firmware and the whole boot process is handled by the GPU using firmware loaded from the SD card, each node needs an SD card inserted with at least the GPU initialization code and a bootloader. I could use an NFS root filesystem: it would use no memory and would not be fully transferred at boot, but it would be slower than a local filesystem and would uselessly overload the network and the CPU (the network interface sits on the USB bus). I could instead load a kernel and an initial ramdisk containing the whole OS: it would be convenient, as all the slaves would always boot on a fresh and up-to-date OS, but this initrd would be big and would eat a lot of each node's precious memory. So, since we have a local SD card in each node anyway, it is more efficient to perform all read/write operations on this card instead of through the network or in memory: I'll store the firmware, the swap and the root filesystem on the local SD.
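To make this concrete, the SD card layout could look like the following sketch; the device names and partition sizes are my assumptions, not final values:

  # Hypothetical SD card layout for a slave node
  #   /dev/mmcblk0p1  vfat   64MB  GPU firmware + network bootloader
  #   /dev/mmcblk0p2  ext4    2GB  cloned root filesystem (mounted read-only, see below)
  #   /dev/mmcblk0p3  ext4    1GB  read-write layer for local changes
  #   /dev/mmcblk0p4  swap  256MB  local swap
  mkfs.vfat /dev/mmcblk0p1
  mkfs.ext4 /dev/mmcblk0p2
  mkfs.ext4 /dev/mmcblk0p3
  mkswap    /dev/mmcblk0p4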

Now, the issue is that the slaves will live their own lives on their own local SD cards, while all the nodes have to be exact clones, at least at the file level. I have to find a way to minimize the changes made on the slaves' OS. Cloning ensures identical nodes, but is heavy. Replicating the same changes on all slaves /should/ work, but divergences can happen, and if they can, they will happen. So the idea is to keep the cloned filesystem clean and to write changes somewhere else: I will mount the cloned filesystem read-only and mount another partition on top of it with unionfs, in read-write mode. When the changes made on the slaves by the configuration manager or by hand diverge too much... I'll just have to revert all changes made on all slaves since the OS cloning (i.e. empty the read-write partition).
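In the initrd, the union mount could look like this minimal sketch (classic unionfs mount syntax; the mount points are assumptions, and unionfs-fuse or aufs would need slightly different options):

  # Mount the pristine clone read-only, the overlay read-write,
  # and union them as the future root filesystem
  mount -o ro /dev/mmcblk0p2 /ro
  mount -o rw /dev/mmcblk0p3 /rw
  mount -t unionfs -o dirs=/rw=rw:/ro=ro unionfs /newroot
  # Reverting a slave to the pristine clone is then just emptying /rw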

Another issue: with 15 nodes, cloning a minimal Raspbian system (300MB) onto each card takes very long. This operation occurs when initializing a new node, when a node gets corrupted, or when I want to change/upgrade the OS on all the slaves. I minimized the risk of desynchronization using unionfs, but it can still happen. I would need to reflash 300MB onto 15 SD cards... No way! I want this slave OS upgrade or restore to be automated through the network, without any physical action. I can use the locally stored kernel and initrd to check on the master whether there is a new boot and/or root partition image to flash onto the SD card, download and install it, and finally boot on it.

This strategy seems OK, but it can be slow: 300MB per image, compressed to 150MB, but sent to 15 nodes simultaneously at 500KB/s means... 1h15 of waiting for a new OS to be deployed... :( For sure, I need the nodes to be exact clones, but clones at the filesystem level, not at the block level. So I don't need to deploy SD or partition images, I only need to synchronize the filesystems. Thus, I can use rsync to deploy only the changes made on the master's copy of the slave OS... And I can save even more time: the longest part of an rsync run is parsing the local and remote filesystems to detect changes, so the initrd can just check the timestamp of a specific file to know whether rsync has to be executed at all... Deploying a brand new whole OS will still take the same time (but it occurs very seldom), and changes or upgrades will be a lot faster (I estimate approximately 10 minutes).
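The initrd logic could then boil down to this sketch (the stamp file, paths and the master's hostname are my assumptions):

  # Compare a single stamp file to decide whether a sync is needed at all
  mount -o remount,rw /ro
  if [ "$(cat /ro/.os-stamp 2>/dev/null)" != \
       "$(ssh root@master cat /srv/slaves/root/.os-stamp)" ]; then
      # Pull only the differences from the master's template, over SSH
      # (no rsync daemon needed on the master)
      rsync -aHx --delete -e ssh root@master:/srv/slaves/root/ /ro/
  fi
  mount -o remount,ro /ro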

There is still an issue: what happens if I break the slaves' initrd stored on the master? All the slave nodes will boot, their old initrd will detect an update, they will download and deploy it and... I'll have 15 bricked slaves! I would have to reflash each node's SD card manually. No way! So, the idea is to fetch the kernel and the initrd (with the potentially broken upgrade logic) dynamically from the master, through the network, at boot time. If it is broken, I will only have to fix it on the master and reboot the slaves (the only physical action). I can still break the slaves' firmware or bootloader on the master; it would be deployed and would brick the slaves. In that case, I would have to reflash all the SD cards, but it would be fast, as the SD image only has to contain the firmware and the bootloader: everything else is automated through the network at boot time. Unfortunately, the default firmware is only able to boot a kernel and an initrd from the local SD card. I have to install a network-enabled bootloader, write it on the SD card and chain it after the default bootloader.

So, I have all the puzzle's pieces: the master hosts the slaves' boot partition with the firmware, a network-enabled bootloader, a kernel and a customized initrd. It also stores the slaves' root filesystem. To create a new node, or to restore a node after a bad firmware/bootloader configuration, I only need to write a few KB to its SD card. When a slave boots, the bootloader loads the kernel and the customized initrd from the master, the initrd ensures that the locally stored partitions are efficiently synced with their template on the master, and the local partitions should never desynchronize, thanks to the changes being externalized to another partition.
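Provisioning a brand new node is then reduced to populating the tiny boot partition, along these lines (the firmware file names are the usual Raspberry Pi ones; the rest is an assumption):

  mount /dev/mmcblk0p1 /mnt
  cp bootcode.bin fixup.dat start.elf /mnt/    # stock GPU firmware
  cp u-boot.bin boot.scr /mnt/                 # network-enabled bootloader
  echo "kernel=u-boot.bin" > /mnt/config.txt   # chain u-boot after the firmware
  umount /mnt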

Software choices

Master

An SSH server is obviously needed to enable inter-node communication and control by the user. An NTP client is needed to get the time from the internet (the Raspberry Pi has no RTC to store the date and time, so it loses them at shutdown), and an NTP server is also obviously needed to share the time between the master and all the slave nodes without overloading the outside NTP servers and the internet connection.
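On the master, the NTP configuration could look like this sketch (the pool servers and the cluster subnet are assumptions):

  # /etc/ntp.conf on the master: sync from the internet...
  server 0.debian.pool.ntp.org iburst
  server 1.debian.pool.ntp.org iburst
  # ...and serve the time to the cluster subnet
  restrict 192.168.1.0 mask 255.255.255.0 nomodify notrap

  # /etc/ntp.conf on the slaves: only query the master
  # server master iburst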

I need a DHCP server capable of handing out custom options, a DNS server and a TFTP server. Fortunately, dnsmasq can handle the DHCP, DNS and TFTP parts with a quite light footprint and with consistency.
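A minimal dnsmasq configuration for these three roles might look like this (the interface, addresses and file names are assumptions):

  # /etc/dnsmasq.conf (sketch)
  interface=eth0
  domain=cluster.lan
  dhcp-range=192.168.1.101,192.168.1.115,12h   # one lease per slave
  dhcp-boot=boot.scr                           # file announced to the network bootloader
  enable-tftp
  tftp-root=/srv/tftp                          # kernel, initrd and boot script live here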

The master node needs to provide the boot and root partition templates to be cloned onto each node's local SD card. Rsync will do the trick without a dedicated rsync server, saving memory with one less daemon (either rsyncd or the portmapper): the nodes' initial ramdisk will initiate the rsync execution on demand through SSH, which is needed anyway.

Nodes

On each node's SD card, I need the Raspberry's GPU firmware and a network-enabled bootloader. Currently, the only one I have found that handles a reliable and flexible network boot is u-boot, in Stephen Warren's GitHub repository. It can use DHCP, BOOTP or PXE to get an IP, can use TFTP to grab the kernel and the initial ramdisk, and everything can be scripted.
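The u-boot script could be along these lines (a sketch only: the load addresses, file names and kernel arguments are untested assumptions):

  # boot script: fetch kernel + initrd from the master and boot them
  usb start                       # the Ethernet chip sits on the USB bus
  dhcp                            # get an IP from the master's dnsmasq
  tftp 0x01000000 kernel.img      # fetch the kernel...
  tftp 0x02100000 initrd.gz       # ...and the customized initrd
  setenv bootargs console=ttyAMA0 rw
  bootz 0x01000000 0x02100000:${filesize} -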

Cluster execution

Architecture design

I will need to connect to a terminal on the master node, and sometimes on the slave nodes, through the network. I don't want to enter credentials again and again, and I'll always need to connect as root. Furthermore, similar automated inter-node connections will also be needed, for example to synchronize processes.

When the nodes install new packages, they would each download the same packages from the Raspbian repository or from the RaspberryPi repository, so it might be interesting to host repository mirrors on the master's hard drive.

The nodes will need to share data in files, so I need a shared, scalable network filesystem. It has to be light, fast, reliable and synchronized, with a small memory footprint... I am naïve and still believe that perfection exists.

The master should be able to distribute post-initialization configuration changes. A configuration manager will be installed.

Software choices

I will need an SSH server on all nodes, with root login by key, both for manual connections and for automated inter-node communications and synchronizations.
From my workstation, I will use PAC (http://sourceforge.net/projects/pacmanager/) as an SSH client to connect to all the nodes. It has wonderful cluster management features.
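Setting up the key-based root login is a one-time operation, along these lines (the key type and host names are assumptions; PermitRootLogin without-password is the standard OpenSSH directive restricting root to key-based logins):

  # On the master: one passphrase-less key for the whole cluster
  ssh-keygen -t rsa -N '' -f /root/.ssh/id_rsa -C cluster-root

  # In /etc/ssh/sshd_config on every node:
  #   PermitRootLogin without-password

  # Push the public key to each slave once
  for n in $(seq 1 15); do ssh-copy-id root@slave$n; done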

As a configuration manager, I will use Puppet to configure the cluster, with its friends augeas, facter, hiera and mcollective. I am not sure about mcollective: it might not be the best tool to manage a cluster of similar nodes, as it seems oversized and duplicates useful features already provided by PAC. Puppet will help to keep the configuration consistent. MCollective will help to execute tasks simultaneously on a defined set of nodes (its advantage over PAC is that it can be scripted from the command line, whereas PAC is interactive). The master node will thus need a PuppetMaster installation with Apache and Passenger, to deal with multiple threads and with concurrent connections from all the nodes.
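As a taste of the node classification, the site manifest could start like this sketch (the module names are hypothetical):

  # /etc/puppet/manifests/site.pp (sketch)
  node /^slave\d+$/ {
    include cluster::common    # hypothetical module: ntp, ssh, apt, vim, ...
  }
  node 'master' {
    include cluster::common
    include cluster::head      # hypothetical module: dnsmasq, NFS, mirrors, ...
  }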

Regarding the Raspbian and RaspberryPi repository mirrors: Apt-Mirror is my favourite tool for this task when rsync is not available. mini-httpd or tiny-httpd would be sufficient to serve these mirrors inside the cluster; these HTTP servers are very light, fast and perfect for this task, but as Apache has to be installed anyway to serve Puppet's clients, I'll use Apache for the mirrors, too.
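The mirror configuration could be as simple as this sketch (the paths are assumptions; Raspbian was wheezy-based at the time of writing):

  # /etc/apt/mirror.list on the master (sketch)
  set base_path /srv/mirror
  deb http://archive.raspbian.org/raspbian wheezy main contrib non-free
  deb http://archive.raspberrypi.org/debian wheezy main

  # /etc/apt/sources.list on the nodes, pointing at the master's Apache:
  #   deb http://master/raspbian wheezy main contrib non-free
  #   deb http://master/raspberrypi wheezy main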

I still hesitate between NFSv4 and GlusterFS. The first is light and can be enhanced with a cache, but might be weak when the whole network writes simultaneously to the same files, because of the locks. The second is stronger, but less light.
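If NFSv4 wins, the whole shared filesystem would boil down to something like this (the export path and subnet are assumptions):

  # /etc/exports on the master (sketch)
  /srv/shared 192.168.1.0/24(rw,sync,no_subtree_check,fsid=0)

  # on each node
  mount -t nfs4 master:/ /mnt/shared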

Cluster possible usages and evolutions

In an ideal world, I would be able to reset all the nodes and choose to deploy a new disk image (Raspbian) on all of them through the network. I might have a look at the Raspberry Pi Foundation's recovery partition and try to change it to achieve this goal: if Shift is pressed, or if there is a newer disk image available on the master server, then the recovery partition would restore the master image on the node and execute a customization script (to update the hostname, the IP, ...). But this is not the main goal of this cluster; I'll try to do it later.

All the nodes have to share a standard base:
  • ntp, to keep the time in sync (the Raspberry Pi doesn't have any RTC)
  • ssh
  • apt.conf, sources.list, preferences
  • /etc/hosts, /etc/hostname, /etc/mailname
  • mc
  • prelink, preload, readahead
  • ssmtp
  • bashrc, inputrc
  • vim, vimrc
  • screen, screenrc
  • puppet
The first node will differ a little bit, as it will act as the head of the cluster:
  • NFS
  • PuppetMaster with Passenger
  • dnsmasq
  • MCollective client
Then, they have to share a common filesystem; I'll have to test several of them in a distributed environment:
  • nfs
  • drbd
  • gfs
  • gluster
  • googlefs
  • hdfs
  • coda
  • afs
Now, I can share data:
  • MongoDB
  • Hadoop/Hive
  • Hadoop/HBase
  • Google BigQuery ?
Once the data are shared, I can begin to do some computations:
  • DrQueue
  • Torque
  • pvm
  • lam/mpi
  • distcc
  • ipython
Maybe full cloud stacks, such as:
  • Globus
  • OpenStack

Of course, all these possibilities have to be Puppetized! I want to be able to reformat and rebuild each node automatically, to reassign it to another task for example.
