Index by title

2. Hardware assembly

The layout

Once I had all the hardware, I had to mount everything together. The very first step was to insert the SD cards into the Raspberry Pi card readers, then to assemble the boards into two stacks of 7 and 8 Raspberry Pis. As I mentioned on the Hardware choices page, I first tried to use wood, but it was not the right solution, so I used PCB-to-PCB standoffs instead. Of course, it will be a pain if I ever need to extract the 4th card, but the stacks are rigid and robust. Then I connected all the network cables to the Raspberry Pis to get an idea of the final layout. All the cables fit, and the layout seemed fine.

Live test

I just had to connect the power supply to test it. It was also the opportunity to make my very first wire-wrapping test: I wrapped two cables on pins P1-02 and P1-06.

I connected the first one to the positive screw terminal of the power supply and the second one to the negative, then briefly powered it on... The power supply's green LED lit up, the Raspberry's red LED did too (half won at this step, as the card might still fail to boot for any number of reasons), and the other Raspberry LEDs began to blink... I had a preview of the final cluster with a single node.

The geek's LEDs

As I wanted LEDs connected to the GPIOs while keeping the power consumption as low as possible, I also wanted to be able to disable all the LEDs with a single shared switch. I first connected the LEDs with all the anodes on the 3.3V rail (P1-01) and the cathodes on GPIOs 22, 23 and 24, with a 100 Ohm resistor between each cathode and its GPIO. It worked, but it was impossible to have one master switch for all the LEDs: the 100 Ohm resistors I ordered only allow 3.3V, not the 5V from the power supply, so I could not simply run a switched 5V wire from the supply... But I could invert the problem: the GPIOs would drive the positive side instead of the negative, and I could then use the negative rail from the power supply, with a LED switch on it... Now that I knew how to connect everything, I had to prepare the LEDs to make them more robust and short-circuit proof. I chose to wire-wrap the resistors on the cathodes, before the common negative, and to secure the connections with heat-shrink tubing (I had ordered other red LEDs and was still waiting for them when I took these pictures) :
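With the GPIOs on the positive side, each LED can be driven from the shell through the legacy sysfs GPIO interface that Raspbian exposed at the time. This is a minimal sketch, not the exact commands I used; GPIO_BASE is a parameter only so the logic can be exercised outside a real Pi.

```shell
#!/bin/sh
# Drive one status LED per GPIO through the legacy sysfs interface.
# GPIO_BASE defaults to the real sysfs path on the Pi.
GPIO_BASE="${GPIO_BASE:-/sys/class/gpio}"

led() {
    # $1: BCM GPIO number (22, 23 or 24 here), $2: 1=on, 0=off
    [ -d "$GPIO_BASE/gpio$1" ] || echo "$1" > "$GPIO_BASE/export"
    echo out > "$GPIO_BASE/gpio$1/direction"
    echo "$2" > "$GPIO_BASE/gpio$1/value"
}

# Example: light a node's three status LEDs
# led 22 1; led 23 1; led 24 1
```

The master LED switch stays purely hardware: cutting the common negative turns everything off regardless of what the GPIOs output.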

The case

Theoretical design

The wood was not a good idea for the node stacks, but it can be used as a skeleton for the case. Here is a Blender3D mockup of this skeleton :

I want the SD cards to be reachable from the front panel. I wanted a compact layout, so I bought 15cm network cables. It means the HDMI connectors won't be accessible, and the case has to be a little larger because of the composite connectors. I have to place stops at the front and the back of the cards to be able to push/pull the SD cards, but I cannot have a stop at the bottom/back because of the network connector. I chose to use MDF panels directly and to screw wood sticks onto them (I had to buy a Dremel kit for this, but I won't count it in the bill !)

There are two horizontal wood sticks on the front panel: one to lock the switch in front of its window, and one to hold the Raspberry Pis 1cm behind the front panel (to leave enough room for the switch and the LEDs, and so that only half of each SD card protrudes from the case). I also used one stick at the very top and one at the very bottom to lock the Raspberry Pis in place. Finally, I'll drill holes in front of each node: the SD card slot, 3 LEDs (maybe 4), and the node switch.

The reality

After several tests, I chose to build the case from MDF (Medium Density Fiberboard) panels with wood sticks screwed/glued inside. It was easier and faster to design the case and to fix design issues that way. For example, after building the floor, roof and front panels, the cluster began to heat up... Adding heat sinks would not have solved the issue: the air was not moving inside the case, even with 3 panels still missing ! The hot air was trapped inside because of... convection ! What a terrible design flaw ! So I first cut escapes for the hot air and waited to see the result. The goal is to keep the 15 CPUs and the switch below 80°C. If needed, I can change the case orientation to ease the air flow and, as a last resort, add the biggest possible fan (bigger means slower for the same air flow, and slower means quieter).
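To check the 80°C goal, the SoC temperature can be read from the kernel, which reports it in millidegrees Celsius. A small sketch, assuming the usual sysfs path on Raspbian:

```shell
#!/bin/sh
# Quick SoC temperature check. The kernel exposes the value in
# millidegrees Celsius under /sys/class/thermal on Raspbian.
read_temp() {
    # $1: file containing the temperature in millidegrees Celsius
    awk '{ printf "%.1f\n", $1 / 1000 }' "$1"
}

SENSOR=/sys/class/thermal/thermal_zone0/temp
if [ -r "$SENSOR" ]; then
    echo "SoC temperature: $(read_temp "$SENSOR") C"
fi
```

Run in a loop (or via cron) on each node, this gives a cheap way to compare case layouts before and after cutting the vents.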


1. Hardware choices

I love parallel computing. In the early 90s, the only way to play with real multitasking was to buy an expensive SMP motherboard with SMP CPUs. Later, in the 2000s, multi-core CPUs appeared and smoothly replaced SMP motherboards, at least for my usage. But my dream of building a cluster was still there. Unfortunately, each node had to be built from a motherboard, a CPU and RAM sticks (with fanless, low-power constraints, ...). Then the cluster needs a network switch and storage. It also needs either a HUGE power supply or one power supply per node (warming the flat, with a lot of fans and noise). It meant at least several thousand euros, as I wanted 7 or 15 nodes to fill the switch and to play with a real cluster. Far too expensive for a toy, and I had higher financial priorities. I played with cheaper toys such as the Soekris 4501 to practice embedded Linux.

Then I heard about the Raspberry Pi. Great: no fan, everything integrated (no motherboard+CPU+RAM to buy), very low power (no fan, no heat, no noise), made for Linux ! I immediately thought of my cluster project... one 16-port switch, one 5V 16A power supply, 16 SD cards... about 500 €, perfect. As I want to try distributed filesystems such as Gluster or GFS, I either need a disk per node (a bad idea: the Raspberry Pi network interface shares its bandwidth with all the USB devices, and it would also explode the power budget and require a fan for the case and the power supply), or one single disk shared over the network (maybe later). I will test distributed filesystems on the SD cards, no matter the resulting size.

I also want the cluster to be easily transportable, so I need a light power supply able to power the nodes and, if possible, the switch, and I need a compact layout (with 15cm network cables, it WILL be compact).

The nodes

To create a cluster, I need at least some nodes connected to a network. Obviously, I want the nodes to be Raspberry Pis. Then I need a switch to connect them together. The cluster has to be reachable from outside, either through one switch port used as an uplink or through one node with a second network interface (wifi ?). The Raspberry Pi has a poor network card that shares its bandwidth with the USB ports and cannot go faster than 100Mbps. So, I need a switch with N 10/100Mbps ports and N-1 Raspberry Pis. This drives the overall cost. I wanted the whole cluster for less than 1000€, so I chose to buy a 16-port switch and 15 Raspberry Pis.

I made an initial order with the core components :

I can now design a very compact layout: 8 nodes stacked below the switch and 7 on top of it. The short network cables are perfect for this layout, and all the SD cards will be accessible from the front panel. As I only have 7 nodes at the top whereas I could fit 8, there is still room for an optional 2.5" hard drive (SSD).

Power supply

Now, I have to find or build a power supply. But I am definitely not an electronics expert and I know my limits... I need to buy one. Each Raspberry Pi needs 5V and at least 400mA when idle with no device connected. It draws more current under heavy load or when overclocked (they will be), but it stays below 700mA as long as no USB device is powered from it (keyboard, mouse, disk, flash key, ...), and the nodes will never have such devices. So I need 15 x 0.7A = 10.5A for the nodes. I want 3 LEDs on each node's GPIOs, each drawing 20mA, so I need 15 x 3 x 0.02 = 0.9A more. And I might add a USB hard drive and a USB wifi dongle to the first node. In the end, I would need a little more than 12A if all nodes were under 100% load, overclocked, with all the LEDs on, with constant writes to the disk and constant wifi transfers... which is impossible in practice. So, I chose to order a 5V 12A embeddable power supply. It is more expensive than a standard PC power supply, but it is lighter, smaller and does not need a fan. I need a standard European C13 inlet with a master switch for the power cord, and I want to be able to switch each node on/off individually. All the currents will stay low and I am not a soldering expert, so I chose to build the whole cluster mainly with wire-wrapping. 30AWG wire can carry 2.6A, perfect. I placed another order for the powering :
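The budget arithmetic above can be scripted so it is easy to re-check when the node count or LED count changes. The 1.0A allowance for the USB disk and wifi dongle is my own rough assumption, not a measured figure:

```shell
#!/bin/sh
# Re-check the power budget: 0.7 A worst case per node, 20 mA per LED.
# The 1.0 A allowance for the USB disk + wifi dongle is an assumption.
NODES=15
LEDS_PER_NODE=3
awk -v n="$NODES" -v l="$LEDS_PER_NODE" 'BEGIN {
    nodes = n * 0.7            # 15 x 0.7 A  = 10.5 A
    leds  = n * l * 0.02       # 15 x 3 x 20 mA = 0.9 A
    extra = 1.0                # assumed: USB disk + wifi dongle
    printf "nodes=%.1fA leds=%.1fA total=%.1fA\n", nodes, leds, nodes + leds + extra
}'
```

This confirms the worst case sits just above the 12A rating of the supply, which is acceptable since all the worst-case conditions can never occur at once.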

Fancy geek lights !

I want some feedback from the nodes while they are working. I won't have any keyboard, mouse or display connected, but I will use the GPIOs with LEDs. I chose small 3mm red, green and yellow LEDs. They have to fit in the front panel, beside each node's SD card slot and mini switch. The GPIOs can only source or sink 3.3V; the Raspberry Pi exposes both a 5V rail and a 3.3V rail on its pins. I want to be able to disable ALL the LEDs with one switch, to lower the power consumption when needed. So, I chose low-voltage LEDs with a series resistor larger than strictly needed (100 Ohms). I will connect the anodes to the GPIOs and all the cathodes to a dedicated common ground wire with a mini switch. The resistors will be placed either on the LEDs' cathodes or anodes, I have not chosen yet. As everything will be connected in a very tight space, I need to insulate the connections; I chose heat-shrink tubing sized for the resistor's diameter. I placed another order for these fancy things :
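The resistor choice can be sanity-checked with Ohm's law: R = (Vsupply - Vf) / I. The forward voltage of 2.0V is an assumed typical value for a small red LED, not a measured one:

```shell
#!/bin/sh
# Sanity-check the series resistor with Ohm's law: R = (V - Vf) / I.
# Vf = 2.0 V is an assumed typical red LED forward voltage.
awk 'BEGIN {
    v_gpio = 3.3   # GPIO high level
    v_f    = 2.0   # assumed LED forward voltage
    i      = 0.02  # 20 mA target current
    printf "minimum R = %.0f Ohms\n", (v_gpio - v_f) / i
}'
```

So anything above roughly 65 Ohms limits the current to 20mA or less; 100 Ohms adds comfortable margin at the cost of a little brightness.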

The case

I need to stack 7 or 8 Raspberry Pis together. The stack has to be rigid, robust and easy to handle. I first tried wood, but it was not a real success.

It can help, but it is not sufficient. I wanted to use 2mm tubes through the mounting holes, but I did not find what I needed. I finally chose PCB-to-PCB 2cm standoffs (I felt stupid for not trying them first, as they are the most obvious choice). I also need to be able to connect a network cable to the switch easily; having the 1m cable coming out of the box would look unfinished, so I chose an RJ45 female/female panel adapter to create an RJ45 plug on the back panel.

I still need the panels themselves. I thought about flat steel panels with an internal wooden skeleton to screw the panels onto. The structure can be made from 9x9mm wood sticks.

The final bill !

Columns : Product, Ref., Seller Ref., Provider, unit price (P.U.), then quantity and amount Bought, and quantity and amount actually Used.
Power Supply Unit 5V 12A 6783773 RS 45,57 € 1 45,57 € 1 45,57 €
Wire to wrap Teflex 359857 RS Tefzel 0,53 € 100 53,45 € 1 0,53 €
Wrapping tool 605239 RS OK 49,57 € 1 49,57 € 0 0,00 €
Raspberry Pis Rev.B512 Kubii 30,99 € 15 464,85 € 15 464,85 €
4GB SD Cards S1620289 LDLC SD4/4G Kingston 6,20 € 15 92,93 € 15 92,93 €
RJ45 Cables 15cm S0610701 LDLC 1,50 € 15 22,43 € 15 22,43 €
16x10/100 ports Switch S0830767 LDLC TE100-S16EG TrendNet 39,90 € 1 39,90 € 1 39,90 €
RJ45 Cable 1m (uplink) S0610435 LDLC 2,75 € 1 2,75 € 1 2,75 €
Micro-switches 7064351 RS 0,86 € 16 13,76 € 15 12,90 €
3mm Red LEDs 7319382 RS 0,21 € 15 3,15 € 15 3,15 €
3mm Green LEDs 7319373 RS 0,21 € 15 3,15 € 15 3,15 €
3mm Yellow/Green LEDs 7319364 RS 0,18 € 15 2,70 € 15 2,70 €
100 Ohms Resistors 7077587 RS 0,02 € 15 0,34 € 15 0,34 €
20mm Standoffs 2808979 RS 0,30 € 100 30,38 € 34 10,33 €
Power C13 connector + switch 208897 RS 13,51 € 1 13,51 € 1 13,51 €
RJ45/RJ45 F/F panel converter 1747053 RS 10,05 € 1 10,05 € 1 10,05 €
Heat Shrinkable tube 2.4-1.2mm 3415893 RS 0,01 € 122 1,41 € 90 1,04 €
Medium Density Fiberboard (MDF) 24x30cm Bricorama 0.50 € 6 3.00 € 6 3.00 €
0.9x0.9x240cm pine wood Bricorama 1.70 € 2 3.40 € 2 3.40 €
Screws 3mm 6.55 € 1 6.55 € 1 6.55 €

Next steps

Now, it is time to choose how to use this beast and to make some Software choices.


Home

Well, my goal was to build a cluster to use as a lab. I did not really care about the raw power of the cluster or about its storage. VMs are good, but not geeky enough, not fun enough, boring. The very first step was to make some Hardware choices and to do the Hardware assembly. Once all the hardware pieces were together, I had to make some Software choices and to automate their installation. I was unable to automate the hardware assembly for the 15 nodes, but I can automate the software deployment !

Hardware choices
Hardware assembly
Software choices
Master Node Installation



5. Master Node Installation

The nodes will be headless. This configuration enables me to :
- install software and data on every node, and reset all of them to a clean state in 2 commands
- install, upgrade or change the root filesystem (clean installation) of every node in 2 commands
- install, upgrade or change the boot filesystem (firmware) of every node in 2 commands

Initial Raspbian installation

On my Linux workstation, I download the latest Raspbian Unattended Installer from https://github.com/debian-pi/raspbian-ua-netinst/releases/latest and write it to the future master's SD card :

xzcat /home/fcerbell/Documents/Raspberry/raspbian-ua-netinst-v1.0.2.img.xz > /dev/mmcblk0

I place the SD card in the master's SD slot, connect the board to the network (with internet access) and to power, and wait for the process to complete (20-30 minutes). At the end, I have Raspbian installed, with the latest kernel and packages. Its hostname is pi, the root password is raspbian. It gets a dynamic DHCP address and runs an SSH server that allows root connections. As I love to copy/paste command lines (from this page), I need both this page and a terminal on the RPi from my workstation, so I need the RPi's IP address. I configured my internal network's DHCP server to give a fixed address to this RPi, so I already know its IP. Otherwise, I would have to connect a display to the HDMI connector and a keyboard, log in as root and ask for the IP address. Anyway, I can now SSH into the RPi from my workstation as root and do the installation by copy/pasting the commands from this page... I'm so lazy...
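Without a fixed DHCP lease, another way to locate a freshly booted board is to look for the Raspberry Pi Foundation's MAC prefix (b8:27:eb, used by all the boards in this build) in the ARP table. A hedged sketch:

```shell
#!/bin/sh
# Find Raspberry Pis on the LAN by their MAC prefix: Raspberry Pi
# Foundation NICs of that era all start with b8:27:eb.
find_pis() {
    # Reads `arp -n`-style output on stdin, prints the matching IPs
    awk 'tolower($0) ~ /b8:27:eb/ { print $1 }'
}

# Populate the ARP cache first (e.g. with a ping sweep), then:
if command -v arp > /dev/null; then
    arp -n | find_pis
fi
```

This only sees hosts already in the ARP cache, hence the ping sweep; it is a convenience, not a replacement for the fixed DHCP lease.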

Master & Slaves : Upgrade the system

First, I install aptitude because I am more used to it than to apt-get, then I make sure the system is up to date (repositories, package lists and installed packages) :

apt-get install -y aptitude && 
aptitude update && 
aptitude safe-upgrade -y && 
aptitude dist-upgrade -y && 
aptitude full-upgrade -y && 
aptitude purge ~c -y && 
aptitude clean

Master & Slaves : Generate root's SSH keypair

Then, I generate a passwordless SSH key pair for root and allow key authentication from root to root. I also relax host key checking (to avoid the question about adding the remote host key on first connection) and disable DNS checks in sshd (which could slow down connections). As I generate the key pair before taking the snapshot that will serve as a template for the nodes' filesystem, all the nodes will share the same root key pair and will all be able to SSH to each other without a password.

ssh-keygen -N "" -f /root/.ssh/id_rsa
cat /root/.ssh/id_rsa.pub > /root/.ssh/authorized_keys
echo >> /etc/ssh/ssh_config
echo "    StrictHostKeyChecking no" >> /etc/ssh/ssh_config
echo "UseDNS no" >> /etc/ssh/sshd_config

Master & Slaves : SSH authorized keys

I want to be able to connect directly as root from my laptops, so I also add my public keys to root's authorized_keys file (useful when opening a terminal on every node with cssh) :

echo "ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQC8QIqdBsSN+X7p0yHbhtz4rYylElQBZQuFvx4nVMAFLW5I1ibuahmOSHsh2FRSaA+W99yGpToyB6w2Tk5dBte6h5QGnPJXwFZMBK9Xyc9L+iQDPyu0oQljpaB4fky9jsCpjtk5i6FldlcXLAeH8mb79eERxXxISZP+3mxA2udbPiffADq20idiVnd7zA0Ekl24283cooPPJXO7EPPNCBVzCxk1OhcYI91Pbjs2fY08lu2zaUsP7Fndx/IwS4k+FOYAnxTwhcvCH0g3qWv4VN+uDHoTG5l69XaCgZoD6W8gukk6yhrP4fd0ZyKcsZ9w7mo9XpLYLk9U8+I/u5vo8PCB francois@redislabs.com" >> /root/.ssh/authorized_keys
echo "ssh-rsa AAAAB3NzaC1yc2EAAAABIwAAAIEAtM8LzekUr46wvVNWoYzxPuKVTv7yFp+Aa/a1vKAendFa3xsMZz6Pp0Xn8U5ZYbTpqqVeM8O+ETqjtpBVk+7+C516DwB+R/cKulTjy061fBPZvTp5pIKm4+NQXNBhwjmQs//nWJ54PlDS5mHuj9NalX07b2OBztrvLjPzf/m4sB0= Francois Cerbelle" >> /root/.ssh/authorized_keys
echo -n '192.168.2.1 ' > /root/.ssh/known_hosts
cat /etc/ssh/ssh_host_ecdsa_key.pub >> /root/.ssh/known_hosts
echo -n ' root@(none)' >> /root/.ssh/known_hosts

Master : External hard drive to store repository mirrors and data

I need storage for the cluster files. I use a small hard drive with one big ext4 partition, connected to the USB port.

echo /dev/sda1 /data auto defaults 0 1 >> /etc/fstab
mkdir /data
mount -a

Master & Slaves : Puppet agent

apt-get -y update && 
apt-get -y upgrade && 
apt-get -y dist-upgrade && 
apt-get -y install puppet facter ruby-hiera augeas-tools &&
apt-get -y autoremove && 
apt-get -y clean && 
apt-get -y autoclean

cat << EOF | augtool
set /files/etc/puppet/puppet.conf/main/server rpinode01.raspicluster.cerbelle.net
set /files/etc/puppet/puppet.conf/main/pluginsync true
set /files/etc/puppet/puppet.conf/main/report true
set /files/etc/puppet/puppet.conf/main/report_server rpinode01.raspicluster.cerbelle.net
set /files/etc/puppet/puppet.conf/main/certname "rpinode01.raspicluster.cerbelle.net" 
set /files/etc/puppet/puppet.conf/user/waitforcert 120
set /files/etc/default/puppet/START no
set /files/etc/hosts/01/ipaddr 192.168.2.1
set /files/etc/hosts/01/canonical rpinode01.raspicluster.cerbelle.net
set /files/etc/hosts/01/alias[1] rpinode01
save
EOF

Master & Slaves : Raspberry Pi CPU tuning

First, I overclock the RPi using the maximum settings allowed without voiding the warranty; other values might set the "warranty void" bit in the chip. These settings (1GHz) are sufficient and should not bring the RPi to its maximum temperature (85°C), even in a compact case. If that happens anyway, the RPi has an internal protection that throttles it until the temperature falls. NEVER change the maximum temperature threshold: it would void the warranty.

echo '#arm_freq=1000' >> /boot/config.txt
echo '#core_freq=500' >> /boot/config.txt
echo '#sdram_freq=600' >> /boot/config.txt
echo '#over_voltage=6' >> /boot/config.txt
echo 'gpu_mem=16' >> /boot/config.txt

I also install a library that is automatically preloaded (via LD_PRELOAD) by the dynamic linker into every binary; it overrides the memcpy and memmove functions with versions optimized for the RPi.

apt-get install -y raspi-copies-and-fills

Master & Slaves : Internationalization

I choose to compile the en_US.UTF-8 locale and use it as the default, but I set the RPi's timezone to Europe/Paris.

#dpkg-reconfigure locales
sed -i 's/.*en_US.UTF-8/en_US.UTF-8/' /etc/locale.gen
echo en_US.UTF-8 > /etc/default/locale
locale-gen
dpkg-reconfigure tzdata

Master & Slaves : Kernel downgrade

I had to install Linux kernel 3.12, the last precompiled Raspbian version including AUFS.

apt-get install -y linux-image-3.12-1-rpi
cp /boot/initrd.img-3.12-1-rpi /boot/initrd.img
cp /boot/vmlinuz-3.12-1-rpi /boot/kernel.img 
sed -i 's/^initramfs.*/initramfs initrd.img/' /boot/config.txt
sed -i 's/^kernel=.*/kernel=kernel.img/' /boot/config.txt
reboot
apt-get purge -y linux-image-3.18.0-trunk-rpi

Master : Snapshot the base system as the future slave's root filesystem

Before taking the snapshot, I clean the downloaded packages from the apt cache: it is useless to replicate them on the slaves, they would become obsolete and would waste bandwidth/time when transferred over the cluster network.

aptitude clean

Then, I copy the root and boot filesystems as templates for the slaves.

rm -Rf /data/slaves/{rootfs,bootfs}
mkdir -p /data/slaves/{rootfs,bootfs}
cp -ax / /data/slaves/rootfs/
cp /boot/*.{bin,dat,elf} /data/slaves/bootfs/

Then I remove the files that were automatically created during the previous steps and that are irrelevant to the slave nodes. The slaves' root account does not need my installation bash history and, more importantly, they will have different Ethernet (MAC) addresses, so I make the system forget the mapping of MAC addresses to NIC numbers (eth0, eth1, ...) :

rm /data/slaves/rootfs/root/.bash_history
rm /data/slaves/rootfs/etc/udev/rules.d/70-persistent-net.rules

Slaves : Disk configuration

The slaves will not have any hard drive connected, so I have to remove the related line in /etc/fstab. They will not need the / and /boot mounts either, because those are mounted during the initrd, so I disable (comment out) all these lines in the file :

sed -i 's~^/.* / .*$~#&~' /data/slaves/rootfs/etc/fstab
sed -i 's~^/.* /boot .*$~#&~' /data/slaves/rootfs/etc/fstab
sed -i 's~^/.* /data .*$~#&~' /data/slaves/rootfs/etc/fstab
rmdir /data/slaves/rootfs/data

Slaves : Networking configuration

I clean up the /etc/hosts file: there is no need for 127.0.1.1 to resolve the node's own name, as each node will get a DHCP-assigned IP address and a DNS server will resolve that external IP to the right name. Furthermore, I add a custom script that updates the files and the running system with the right hostname as soon as the DHCP server assigns an IP address to the node. Basically, a node has no name when it boots, as there is none in the configuration files, but everything is automatically updated on disk and in memory from the DHCP answer.

cat > /data/slaves/rootfs/etc/hosts << EOF
127.0.0.1       localhost
EOF
cat > /data/slaves/rootfs/etc/dhcp/dhclient-exit-hooks.d/hostname << EOF
#!/bin/bash

if [ "\$reason" = "BOUND" ]; then
    oldhostname=\$(hostname -s)
    if [ "\$oldhostname" != "\$new_host_name" ]; then
        echo \$new_host_name > /etc/hostname
        hostname -F /etc/hostname
    fi
fi
EOF

Master : PuppetMaster

All the remaining configuration (common to all nodes, master-specific and slave-specific) will be handled and managed by Puppet. So, I need to install a first standalone puppet master. It will use the embedded WEBrick server to handle incoming connections. Even though WEBrick is single-threaded, that is not an issue, as its first job is only to serve the master's own puppet agent with a "master configuration". This configuration will then take over the puppet master installation and reconfigure it behind Apache and Passenger (mod_rails). So, this very first puppet master installation is only meant to be used once.

apt-get install -y puppetmaster
/etc/init.d/puppetmaster stop
cat << EOF | augtool
set /files/etc/default/puppetmaster/START yes
set /files/etc/puppet/puppet.conf/master/autosign true
set /files/etc/puppet/puppet.conf/master/allow_duplicate_certs true

# Puppet 3.x
#set /files/etc/puppet/puppet.conf/main/modulepath /etc/puppet/modules:/etc/puppet/site-modules:/usr/share/puppet/modules
set /files/etc/puppet/puppet.conf/main/environmentpath /etc/puppet/environments
set /files/etc/puppet/puppet.conf/main/basemodulepath /etc/puppet/modules:/etc/puppet/environments/production/site-modules:/usr/share/puppet/modules
save
EOF
mkdir -p /etc/puppet/environments/production/manifests
apt-get install -y mercurial
hg clone --insecure https://www.cerbelle.net/hg/puppet-modules /tmp/puppet-modules
/etc/init.d/puppetmaster stop
mv /tmp/puppet-modules /tmp/puppet
mkdir -p /etc/puppet/environments/production/manifests
cp -r /tmp/puppet/modules /etc/puppet/environments/production
cp /tmp/puppet/manifests/default.pp /etc/puppet/environments/production/manifests/site.pp
cp -r /tmp/puppet/hieradata /etc/puppet/
cp /tmp/puppet/hiera.yaml /etc/puppet/
ln -s /etc/puppet/hiera.yaml /etc/
rm -Rf /tmp/puppet
/etc/init.d/puppetmaster start

Master : Network configuration

sed -i 's/.*dhcp.*/#&/' /etc/network/interfaces 
cat >> /etc/network/interfaces << EOF
iface eth0 inet static
    address 192.168.2.1
    network 192.168.2.0
    netmask 255.255.255.0
    gateway 192.168.2.254
    broadcast 192.168.2.255
EOF
echo 'rpinode01' > /etc/hostname
hostname -F /etc/hostname
cat > /etc/hosts << EOF
127.0.0.1       localhost
192.168.2.1 rpinode01.raspicluster.cerbelle.net rpinode01
192.168.2.2 rpinode02.raspicluster.cerbelle.net rpinode02
192.168.2.3 rpinode03.raspicluster.cerbelle.net rpinode03
192.168.2.4 rpinode04.raspicluster.cerbelle.net rpinode04
192.168.2.5 rpinode05.raspicluster.cerbelle.net rpinode05
192.168.2.6 rpinode06.raspicluster.cerbelle.net rpinode06
192.168.2.7 rpinode07.raspicluster.cerbelle.net rpinode07
192.168.2.8 rpinode08.raspicluster.cerbelle.net rpinode08
192.168.2.9 rpinode09.raspicluster.cerbelle.net rpinode09
192.168.2.10 rpinode10.raspicluster.cerbelle.net rpinode10
192.168.2.11 rpinode11.raspicluster.cerbelle.net rpinode11
192.168.2.12 rpinode12.raspicluster.cerbelle.net rpinode12
192.168.2.13 rpinode13.raspicluster.cerbelle.net rpinode13
192.168.2.14 rpinode14.raspicluster.cerbelle.net rpinode14
192.168.2.15 rpinode15.raspicluster.cerbelle.net rpinode15
EOF
cat > /etc/resolv.conf << EOF
domain raspicluster.cerbelle.net
search raspicluster.cerbelle.net
nameserver 192.168.2.254
EOF

Master : DHCP, DNS and TFTP server with DNSMasq

apt-get install -y dnsmasq
/etc/init.d/dnsmasq stop
echo 'conf-dir=/etc/dnsmasq.d' >> /etc/dnsmasq.conf
cat > /etc/dnsmasq.d/rpicluster << EOF
dhcp-range=192.168.2.20,192.168.2.250,12h
dhcp-host=b8:27:eb:cb:63:18,,rpinode01,192.168.2.1,1h
dhcp-host=b8:27:eb:c2:9f:19,,rpinode02,192.168.2.2,1h
dhcp-host=b8:27:eb:27:b6:53,,rpinode03,192.168.2.3,1h
dhcp-host=b8:27:eb:b9:61:6d,,rpinode04,192.168.2.4,1h
dhcp-host=b8:27:eb:b6:57:a0,,rpinode05,192.168.2.5,1h
dhcp-host=b8:27:eb:ef:75:38,,rpinode06,192.168.2.6,1h
dhcp-host=b8:27:eb:2b:41:91,,rpinode07,192.168.2.7,1h
dhcp-host=b8:27:eb:0c:25:82,,rpinode08,192.168.2.8,1h
dhcp-host=b8:27:eb:e5:83:f5,,rpinode09,192.168.2.9,1h
dhcp-host=b8:27:eb:c5:5e:e0,,rpinode10,192.168.2.10,1h
dhcp-host=b8:27:eb:f2:55:fb,,rpinode11,192.168.2.11,1h
dhcp-host=b8:27:eb:0c:dc:64,,rpinode12,192.168.2.12,1h
dhcp-host=b8:27:eb:22:97:40,,rpinode13,192.168.2.13,1h
dhcp-host=b8:27:eb:e3:00:59,,rpinode14,192.168.2.14,1h
dhcp-host=b8:27:eb:19:dc:95,,rpinode15,192.168.2.15,1h
dhcp-option=6,192.168.2.1,192.168.2.254 # dns-server
dhcp-option=3,192.168.2.254 # router
dhcp-option=15,raspicluster.cerbelle.net # domain-name
dhcp-option=66,192.168.2.1 # tftp-server
dhcp-option=69,192.168.2.1 # smtp-server
dhcp-option=119,raspicluster.cerbelle.net # domain-search
enable-tftp
tftp-root=/data/slaves/tftp
EOF
mkdir -p /data/slaves/tftp
chmod -R u+rw,g+r,o+r /data/slaves/tftp

Master : Mail server : XMail

echo "xmail xmail/daemonpasswd string postmaster" | debconf-set-selections
echo "xmail xmail/daemonuser string postmaster" | debconf-set-selections
echo "xmail xmail/domainname string /etc/mailname" | debconf-set-selections
apt-get install -y xmail

Master : Create and use repository mirrors with apt-mirror (optional)

Installing a single node from the internet mirrors is fine; the other nodes will not download everything, as they use a copy of the master filesystem as a template, but they will install their additional specific packages themselves. Installing from one single node is fine too, but installing (even a single package without dependencies) from 15 of them at once is slow and unfair to the public mirrors !

apt-get install -y apt-mirror
mv /var/spool/apt-mirror /data/
sed -i 's~set base_path.*~&\nset base_path    /data/apt-mirror~' /etc/apt/mirror.list
sed -i 's~set run_postmirror.*~&\nset run_postmirror 1~' /etc/apt/mirror.list
sed -i 's~^deb~#&~' /etc/apt/mirror.list
sed -i 's~^clean~#&~' /etc/apt/mirror.list
cat >> /etc/apt/mirror.list << EOF

deb-armel http://archive.raspberrypi.org/debian wheezy main untested
deb-armhf http://archive.raspberrypi.org/debian wheezy main untested

deb-armhf http://archive.raspbian.org/mate wheezy main
deb-src http://archive.raspbian.org/mate wheezy main

deb-armhf http://archive.raspbian.org/multiarchcross wheezy main
deb-src http://archive.raspbian.org/multiarchcross wheezy main

deb-armhf http://archive.raspbian.org/raspbian wheezy main contrib non-free rpi firmware
deb-src http://archive.raspbian.org/raspbian wheezy main contrib non-free rpi firmware
deb-armhf http://archive.raspbian.org/raspbian wheezy-staging main contrib non-free rpi firmware
deb-src http://archive.raspbian.org/raspbian wheezy-staging main contrib non-free rpi firmware

deb-armhf http://archive.raspbian.org/raspbian jessie main contrib non-free rpi firmware
deb-src http://archive.raspbian.org/raspbian jessie main contrib non-free rpi firmware
deb-armhf http://archive.raspbian.org/raspbian jessie-staging main contrib non-free rpi firmware
deb-src http://archive.raspbian.org/raspbian jessie-staging main contrib non-free rpi firmware

clean http://archive.raspberrypi.org/debian
clean http://archive.raspbian.org/raspbian
EOF

cat > /data/apt-mirror/var/postmirror.sh << EOF
#!/bin/sh
/data/apt-mirror/var/clean.sh
chmod -R 755 /data/apt-mirror/mirror
chown -R apt-mirror.apt-mirror /data/apt-mirror/
find /data/apt-mirror/mirror -type f -exec chmod 644 {} \;

EOF

I enable its automatic execution every night :

sed -i 's~^#0~0~' /etc/cron.d/apt-mirror
sed -i 's~/var/spool~/data~' /etc/cron.d/apt-mirror

apt-mirror has a bug in a pattern match: it searches first for "arm" and then for "armhf", so the second is never matched because the first one also matches. Here is my fix :

sed  -i 's~arm|armhf~armhf|arm~' /usr/bin/apt-mirror

Then I launch an immediate mirror update. This step can take several days to download the 175GB, so I plugged the hard drive directly into one of my other local servers to seed the cluster mirror from an already existing Raspbian mirror. As this step takes a very long time to execute, it was designed to be non-blocking, so I can continue with the following chapters while the mirror is updating.

At the end of the mirroring process, I update the master's sources.list to use the local repository straight from the filesystem (in case I choose not to install Apache in the next steps, or if Apache crashes) :

su - apt-mirror -c apt-mirror
echo 'deb file:///data/apt-mirror/mirror/archive.raspbian.org/raspbian wheezy main firmware' > /etc/apt/sources.list
echo 'deb file:///data/apt-mirror/mirror/archive.raspberrypi.org/debian wheezy main' >> /etc/apt/sources.list
apt-get update

Master : Share the repository mirrors over the network with Apache (optional)

Given the cluster design, the slaves will run the Puppet agent and the puppet master will be on the master node. WEBrick, the default puppet master HTTP server, is single-threaded and will not be able to answer 15 simultaneous requests; a stronger HTTP server such as Apache with mod_passenger is needed. As I will need Apache for the puppet master anyway, I will also use it to serve the repository mirrors. I install Apache and configure it to serve the files located in /data/apt-mirror/mirror.

apt-get install -y apache2
cat > /etc/apache2/conf.d/mirror << EOF
    <Directory /data/apt-mirror/mirror/>
        Options Indexes FollowSymLinks MultiViews
        AllowOverride None
        Order allow,deny
        allow from all
    </Directory>
    Alias /mirror /data/apt-mirror/mirror
EOF
/etc/init.d/apache2 restart

TODO: optimize the Apache configuration to use less memory and CPU by decreasing the number of processes and threads configured in /etc/apache2/apache2.conf
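As a starting point for that TODO, something like the following could go into /etc/apache2/apache2.conf, assuming the prefork MPM (the Debian default at the time). The numbers are untested guesses sized for 15 agent connections, not measured values:

```apache
# Assumed prefork MPM; values are guesses to be tuned, not measured.
<IfModule mpm_prefork_module>
    StartServers          2
    MinSpareServers       2
    MaxSpareServers       4
    MaxClients           16
    MaxRequestsPerChild 500
</IfModule>
```

Capping MaxClients at 16 bounds the memory footprint while still leaving one worker per node plus one spare.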

Slaves : Update the sources.list

Use the shared mirror repository, then update the package list :

echo 'deb http://192.168.2.1/mirror/archive.raspbian.org/raspbian wheezy main firmware' > /data/slaves/rootfs/etc/apt/sources.list
echo 'deb http://192.168.2.1/mirror/archive.raspberrypi.org/debian wheezy main' >> /data/slaves/rootfs/etc/apt/sources.list
chroot /data/slaves/rootfs/ /usr/bin/apt-get update

Slaves : Network enabled bootloader : Das U-Boot

I enable the frame buffer in the command line (bcm2708_fb.fbwidth=1280 bcm2708_fb.fbheight=800 bcm2708_fb.fbdepth=8) because I sometimes debug a failing node by connecting it to a video projector.

apt-get install -y build-essential python bc git
git clone git://git.denx.de/u-boot.git
cd u-boot/

Apply the following patch to make the boot faster, by keeping DHCP as the only boot target

cat | patch -p1 << EOF
--- u-boot.orig/include/configs/rpi-common.h    2016-01-22 10:25:00.487730581 +0100
+++ u-boot/include/configs/rpi-common.h    2016-01-22 10:31:02.286654115 +0100
@@ -168,7 +168,8 @@
     "scriptaddr=0x02000000\0" \\
     "ramdisk_addr_r=0x02100000\0" \\

-#define BOOT_TARGET_DEVICES(func) \\
+#define BOOT_TARGET_DEVICES(func) func(DHCP, dhcp, na)
+#define DISABLED_BOOT_TARGET_DEVICES(func) \\
     func(MMC, mmc, 0) \\
     func(USB, usb, 0) \\
     func(PXE, pxe, na) \\
EOF

make rpi_defconfig
make
cp u-boot.bin /data/slaves/bootfs/
grep -v initramfs /boot/config.txt > /data/slaves/bootfs/config.txt
echo 'kernel=u-boot.bin' >> /data/slaves/bootfs/config.txt
cat > /data/slaves/tftp/boot.scr << EOF
tftp \${kernel_addr_r} vmlinuz-3.12-1-rpi
tftp \${ramdisk_addr_r} initramfs.igz.uimg
setenv bootargs "bcm2708_fb.fbwidth=1280 bcm2708_fb.fbheight=800 bcm2708_fb.fbdepth=8 dwc_otg.lpm_enable=0 console=ttyAMA0,115200 kgdboc=ttyAMA0,115200 console=tty1 elevator=deadline root=/dev/mmcblk0p2 rootwait quiet smsc95xx.macaddr=\${usbethaddr} aufs=disk masterip=192.168.2.1" 
bootz \${kernel_addr_r} \${ramdisk_addr_r} 
EOF
/root/u-boot/tools/mkimage -A arm -O linux -T script -C none -n "U-Boot script" -d /data/slaves/tftp/boot.scr /data/slaves/tftp/boot.scr.uimg

No uEnv.txt is needed on the local boot partition.

TODO: remove the compilation dependency packages

Slaves : Kernel and customized initial ramdisk

cp /boot/vmlinuz-3.12-1-rpi /data/slaves/tftp/

Ref : [[http://jootamam.net/howto-initramfs-image.htm]]
Ref : [[http://www.raspberrypi.org/forums/viewtopic.php?p=228099]]

apt-get install -y busybox-static e2fsck-static dosfstools
mkdir -p /data/slaves/tftp/initramfs/{bin,sbin,etc/udhcpc,proc,sys,rootfs,bootfs,aufs,rw,usr/bin,usr/sbin,lib/modules,root,dev}
cd /data/slaves/tftp
mknod /data/slaves/tftp/initramfs/dev/null c 1 3 
mknod /data/slaves/tftp/initramfs/dev/tty c 5 0 
touch /data/slaves/tftp/initramfs/etc/mdev.conf
cp /bin/busybox /data/slaves/tftp/initramfs/bin/busybox
chmod +x /data/slaves/tftp/initramfs/bin/busybox
ln -s busybox /data/slaves/tftp/initramfs/bin/sh
cp /usr/share/doc/busybox-static/examples/udhcp/simple.script /data/slaves/tftp/initramfs/etc/udhcpc/
chmod +x /data/slaves/tftp/initramfs/etc/udhcpc/simple.script

mknod -m 0444 /data/slaves/tftp/initramfs/dev/random c 1 8 
mknod -m 0444 /data/slaves/tftp/initramfs/dev/urandom c 1 9 
echo 'root:x:0:0:root:/root:/bin/bash' > /data/slaves/tftp/initramfs/etc/passwd
cp /lib/arm-linux-gnueabihf/libnss_files.so.2 /data/slaves/tftp/initramfs/lib/
PROG="/usr/bin/ssh";  for i in ${PROG} `ldd ${PROG} | cut -d/ -f2- | cut -d\  -f1`; do mkdir -p /data/slaves/tftp/initramfs/$i; rm -Rf /data/slaves/tftp/initramfs/$i; cp /$i /data/slaves/tftp/initramfs/$i; done
#PROG="/usr/bin/ssh-keygen";  for i in ${PROG} `ldd ${PROG} | cut -d/ -f2- | cut -d\  -f1`; do mkdir -p /data/slaves/tftp/initramfs/$i; rm -Rf /data/slaves/tftp/initramfs/$i; cp /$i /data/slaves/tftp/initramfs/$i; done
#chroot initramfs/ ./bin/sh -c 'ssh-keygen -N "" -f /root/.ssh/id_rsa'
#cat /data/slaves/tftp/initramfs/root/.ssh/id_rsa.pub >> /root/.ssh/authorized_keys
cp -r /root/.ssh /data/slaves/tftp/initramfs/root/
chmod 600 /data/slaves/tftp/initramfs/root/.ssh/id_rsa
cp /lib/modules/3.12-1-rpi/kernel/fs/aufs/aufs.ko /data/slaves/tftp/initramfs/lib/modules/
mknod -m 600 /data/slaves/tftp/initramfs/dev/watchdog c 10 130
cp /lib/modules/3.12-1-rpi/kernel/drivers/watchdog/bcm2708_wdog.ko /data/slaves/tftp/initramfs/lib/modules/
cat > /data/slaves/partitions.sfdisk << EOF 
# partition table of /dev/mmcblk0
unit: sectors

/dev/mmcblk0p1 : start=       16, size=    97712, Id= b
/dev/mmcblk0p2 : start=    97728, size=  2000000, Id=83
/dev/mmcblk0p3 : start=  2097728, size=  2000000, Id=83
/dev/mmcblk0p4 : start=  4097728, size=  3529024, Id=83
EOF
PROG="/sbin/sfdisk";  for i in ${PROG} `ldd ${PROG} | cut -d/ -f2- | cut -d\  -f1`; do mkdir -p /data/slaves/tftp/initramfs/$i; rm -Rf /data/slaves/tftp/initramfs/$i; cp /$i /data/slaves/tftp/initramfs/$i; done
PROG="/sbin/mke2fs";  for i in ${PROG} `ldd ${PROG} | cut -d/ -f2- | cut -d\  -f1`; do mkdir -p /data/slaves/tftp/initramfs/$i; rm -Rf /data/slaves/tftp/initramfs/$i; cp /$i /data/slaves/tftp/initramfs/$i; done
cp /sbin/e2fsck.static /data/slaves/tftp/initramfs/sbin/e2fsck.static
PROG="/sbin/mkdosfs";  for i in ${PROG} `ldd ${PROG} | cut -d/ -f2- | cut -d\  -f1`; do mkdir -p /data/slaves/tftp/initramfs/$i; rm -Rf /data/slaves/tftp/initramfs/$i; cp /$i /data/slaves/tftp/initramfs/$i; done
PROG="/sbin/dosfsck";  for i in ${PROG} `ldd ${PROG} | cut -d/ -f2- | cut -d\  -f1`; do mkdir -p /data/slaves/tftp/initramfs/$i; rm -Rf /data/slaves/tftp/initramfs/$i; cp /$i /data/slaves/tftp/initramfs/$i; done
PROG="/usr/bin/rsync";  for i in ${PROG} `ldd ${PROG} | cut -d/ -f2- | cut -d\  -f1`; do mkdir -p /data/slaves/tftp/initramfs/$i; rm -Rf /data/slaves/tftp/initramfs/$i; cp /$i /data/slaves/tftp/initramfs/$i; done
touch /data/slaves/tftp/initramfs/init
wget -O /data/slaves/tftp/initramfs/init --no-check-certificate https://www.cerbelle.net/redmine/attachments/download/50/init
chmod +x /data/slaves/tftp/initramfs/init

Copy the attached init script to "/data/slaves/tftp/initramfs/init" (the wget command above already does this), then rebuild the initramfs file and package it for u-boot

cd /data/slaves/tftp/initramfs && find . | cpio -H newc -o | gzip > /data/slaves/tftp/initramfs.igz && cd -
/root/u-boot/tools/mkimage -A arm -O linux -T ramdisk -C gzip -n "U-Boot Initial RamDisk" -d /data/slaves/tftp/initramfs.igz /data/slaves/tftp/initramfs.igz.uimg
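The repeated `PROG=...; for i in ...` one-liners above copy a binary together with every shared library reported by ldd. The same idea can be written as a small reusable function (a sketch, not part of the original procedure; it assumes the usual ldd output format with absolute library paths) :

```shell
#!/bin/sh
# copy_with_libs BIN DESTROOT : copy BIN and all the shared libraries
# reported by ldd into DESTROOT, preserving their paths.
copy_with_libs() {
    bin="$1"; dest="$2"
    # Extract the absolute paths from ldd's output
    libs=$(ldd "$bin" | grep -o '/[^ ]*' | sort -u)
    for f in "$bin" $libs; do
        mkdir -p "$dest$(dirname "$f")"
        cp "$f" "$dest$f"
    done
}

# Example: copy_with_libs /usr/bin/rsync /data/slaves/tftp/initramfs
```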

Master : Clean the installation (optional)

Clean the downloaded packages stored in apt cache.

apt-get clean

Master : Halt

Now, the master is ready to serve. At the next boot, it will use its static cluster IP (192.168.2.1) and will act as a DHCP, DNS and TFTP server to allow automatic slave installation, update and boot. So, you can halt the master :

halt

Connect it to the cluster network with all the slaves, and begin to prepare the slaves' SD cards.

Slave's SD card

- Create a partition table with only one vfat partition
- Mount the partition
- Copy all the files from /data/slaves/bootfs/ to this partition
- Unmount the partition
- Boot the slave with this SD card

It will then :
- get an IP address (and the Master's IP) from DHCP
- get u-boot's configuration file (boot.scr.uimg) from master via tftp
- get the kernel from the master via tftp
- get the initramfs from master via tftp
- boot the kernel with initramfs

The initramfs will :
- get the slave's partition table from the master
- if the partition table changed, apply it, format all the partitions, update the boot partition (firmware) from the master and reboot

If the partition table is OK, it will :
- check whether the slave's root filesystem (system) changed (a quick check, faster than a full filesystem comparison) and, if so, rsync it from the master (faster than a full copy)
- mount the root filesystem
- if a working partition is requested, remount the root filesystem read-only, create and format a RW partition (in memory if volatile, or on disk), and overlay this RW partition on top of the system filesystem
- switch to this root partition and launch the standard init process
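The decision flow above can be sketched in a few lines of shell (illustrative only : the real logic lives in the attached init script, and the stamp-file names are assumptions) :

```shell
#!/bin/sh
# Sketch of the slave initramfs decision flow (not the actual init
# script; file names are assumptions for illustration).

# Cheap change test: compare a small stamp file fetched from the master
# with the locally stored copy, instead of walking the whole filesystem.
needs_resync() {
    master_stamp="$1"; local_stamp="$2"
    ! cmp -s "$master_stamp" "$local_stamp"
}

# if the partition table stamp changed: sfdisk + mkfs + refresh bootfs + reboot
# if the rootfs stamp changed:          rsync the root filesystem from the master
# then: mount rootfs read-only, overlay an aufs RW branch, switch_root to it
```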


3. Software choices

After playing with Slackware, RedHat and other Linux distributions since 1994, I discovered Debian in early 2000, and I don't see any reason to change now or in the near future. My workstations run heavily customized Debian installations, and so do my servers (at home and in datacenters). I will remain consistent.

Raspbian (http://www.raspbian.org) is the logical choice. I want to be able to use this mini-cluster as a lab to learn as much as I can about parallel computing. That includes distributed parallel computations using MPI/LAM or PVM, using a batch queue manager such as DrQueue or Torque, or using Mosix. I would also like to test some parallel storage, at the filesystem level with HDFS, DRBD, GFS, ... or with distributed databases such as MongoDB, Hadoop, BigQuery, ... Finally, I want to test theories about load-balancing and high-availability using haproxy, heartbeat, pacemaker, stonith, LVS... Maybe, if I can, I might test OpenStack to create a personal cloud, but given the hardware specs, each node might only be able to execute one and only one VM image... The Raspberry cluster won't be a good choice for this purpose. I would also like to use distcc to try distributed compilation. Maybe Globus, too...

Cluster boot

Architecture design

As the Raspberry Pi has no firmware embedded and as the whole boot process is handled by the GPU from a loaded firmware, each node needs an SD card inserted with at least the GPU initialization code and a bootloader. I could use an NFS root filesystem: it would take no memory and would not be fully transferred at boot, but it would be slower than a local filesystem and would uselessly overload the network and the CPU (the network interface is on the USB bus). I could load a kernel and an initial ramdisk containing the whole OS: it would be convenient, as all the slaves would always boot on a fresh and up-to-date OS, but this initrd would be big and would use a lot of the node's precious memory. OK, since we have a local SD card in each node, it will be more efficient to have all the read/write operations made on this card instead of through the network or in memory, so I'll store the firmware, the swap and the root filesystem on the local SD.

Now, the issue is that the slaves will live their own life on their own local SD card, while all the nodes have to be exact clones, at least at the file level. I have to find a way to minimize the changes made on the slave's OS. Cloning ensures identical nodes, but is heavy. Replicating the same changes on all slaves /should/ work, but divergences can happen, and if they can, they will. Here, the idea is to keep the cloned filesystem clean and to write changes somewhere else: I will mount the cloned filesystem read-only and mount another partition on top of it with unionfs, in read-write mode. When the changes made on the slaves (by the configuration manager or manually) diverge too much, I'll just have to revert all changes made on all slaves since the OS cloning (clean the read-write partition).

Another issue: with 15 nodes, cloning a minimal Raspbian system (300MB) onto each card takes a very long time. This operation occurs when initializing a new node, when a node gets corrupted or when I want to change/upgrade the OS on all the slaves. I minimized the risk of desynchronization using unionfs, but it still can happen. I would need to reflash 300MB on 15 SD cards... No way ! I want this slave OS upgrade or restore to be automated through the network, without any physical action. I can use the locally stored kernel and initrd to check on the master if there is a new boot and/or root partition image to flash on the SD card, download and install it, and finally boot on it.

This strategy seems OK, but can be slow : 300MB for the image, compressed to 150MB, but sent to 15 nodes simultaneously at 500KB/s means... 1h15 to wait for a new OS to be deployed... :( For sure, I need the nodes to be exact clones, but clones at the filesystem level, not at the block level. So I don't need to deploy SD or partition images, I can simply synchronize the filesystems. Thus, I can use rsync to deploy only the changes made to the slave's OS on the master node... I can save even more time: the longest part of an rsync run is parsing the local and remote filesystems to detect changes, so the initrd can just check the timestamp of a specific file to know whether rsync has to be executed at all... The time needed to deploy a brand new whole OS will still be the same (but it happens very seldom), and for changes or upgrades, it will be a lot faster (I estimate approximately 10 minutes).
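The 1h15 figure can be checked with quick arithmetic (assuming the 500KB/s is the aggregate throughput shared by the 15 nodes, which is how the estimate works out) :

```shell
# 150MB compressed image, sent to 15 nodes over 500KB/s of total bandwidth
image_kb=$((150 * 1024))
seconds=$(( image_kb * 15 / 500 ))
echo "$seconds s, ~$(( seconds / 60 )) min"   # prints "4608 s, ~76 min"
```

About 77 minutes, i.e. the 1h15 of the text.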

There is still an issue : what happens if I break the slave's initrd stored on the master ? All the slave nodes would boot, their old initrd would detect an update, they would download and deploy it and... I'd have 15 bricked slaves ! I would have to reflash each node's SD card manually. No way ! So, the idea is to fetch the kernel and the initrd (with the potentially broken upgrade logic) dynamically from the master, through the network, at boot time. If it is broken, I will only have to fix it on the master and reboot the slaves (the only physical action). I still can break the slave's firmware or bootloader on the master: it would be deployed and would brick the slaves. In this case, I'd have to reflash all the SD cards, but it would be fast, as the SD image only has to contain the firmware and the bootloader; everything else is automated through the network at boot time. Unfortunately, the default firmware is only able to boot a kernel and an initrd from the local SD card. I have to install a network enabled bootloader, write it on the SD card and chain it after the default bootloader.

So, I have all the puzzle's pieces : the master hosts a slave boot partition with the firmware, a network enabled bootloader, a kernel and a customized initrd. It also stores the slave's root filesystem. To create a new node, or to restore a node after a bad firmware/bootloader configuration, I only need to write a few KB to its SD card. When the slave boots, the bootloader loads the kernel and the customized initrd from the master, the initrd ensures that the locally stored partitions are synchronized with their templates on the master very efficiently, and the local partition should never desynchronize, thanks to the changes being externalized in another partition.

Software choices

Master

An SSH server is obviously needed to enable inter-node communication and control by the user. An NTP client is needed to get the time from the internet (the Raspberry Pi does not have an RTC to store the date and time, so it loses them at shutdown), and an NTP server is also obviously needed to share the time between the master and all the slave nodes without overloading the outside NTP servers and the internet connection.

I need a DHCP server capable of custom options, a DNS server and a TFTP server. Fortunately, DNSMASQ can deal with the DHCP, DNS and TFTP parts with a quite light footprint and with consistency.

The master node needs to provide the boot and root partition templates to clone onto each node's local SD card. Rsync will do the trick, and running it without an rsync server saves memory with one less daemon (either rsyncd or portmapper). The node's initial ramdisk will initiate the rsync execution on demand through SSH, which is why SSH is needed.
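As an illustration, a minimal dnsmasq configuration covering the three roles could look like this (a sketch only : the interface name, DHCP range and lease time are assumptions, consistent with the 192.168.2.0/24 addressing used elsewhere in this document) :

```
# /etc/dnsmasq.conf (sketch)
interface=eth0
dhcp-range=192.168.2.100,192.168.2.250,12h
dhcp-option=option:router,192.168.2.1
enable-tftp
tftp-root=/data/slaves/tftp
```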

Nodes

On each node's SD card, I need the Raspberry's GPU firmware and a network enabled bootloader. Currently, u-boot from Stephen Warren's GitHub repository is the only one I found that provides a reliable and flexible network boot : it can use DHCP, BootP or PXE to get an IP address, can use TFTP to grab the kernel and the initial ramdisk, and everything can be scripted.

Cluster execution

Architecture design

I will need to connect to a terminal on the master node, and sometimes on the slave nodes, through the network. I don't want to enter credentials again and again, and I'll always need to connect as root. Furthermore, similar automated inter-node connections will also be needed, to synchronize processes for example.

When the nodes install new packages, they will each download the same packages from the Raspbian or RaspberryPi repositories, so it might be interesting to host repository mirrors on the master's hard drive.

The nodes will need to share data in files, so I need a shared, scalable network filesystem. It has to be light, fast, reliable and synchronized, with a small memory footprint... I am naïve and still believe that perfection exists.

The master should be able to distribute post-initialization configuration changes. A configuration manager will be installed.

Software choices

I will need an SSH server on all nodes, with root login by key, both for manual connections and for automated inter-node communications and synchronizations.
From my workstation, I will use PAC (http://sourceforge.net/projects/pacmanager/) as an SSH client to connect to all the nodes. It has wonderful cluster management features.

As a configuration manager, I will use Puppet to configure the cluster, with its friends augeas, facter, hiera and mcollective. I am not sure about MCollective: it might not be the best tool to manage a cluster of similar nodes, and it seems oversized, duplicating useful features already provided by PAC. Puppet will help to manage configuration consistency. MCollective will help to execute tasks simultaneously on a defined set of nodes (its advantage over PAC is the ability to script it from the command line, whereas PAC is interactive). So the master node will need a PuppetMaster installation with Apache and Passenger to deal with multiple threads and with concurrent connections from all the nodes.

Regarding the Raspbian and RaspberryPi repository mirrors : apt-mirror is my favourite tool for this task when rsync is not available. mini-httpd or tiny-httpd would be sufficient to serve these mirrors inside the cluster; these HTTP servers are very light, fast and perfect for this task. But as Apache needs to be installed to serve Puppet's clients anyway, I'll use Apache for the mirrors, too.

I still hesitate between NFSv4 and GlusterFS. The first one is light and can be enhanced with a cache, but might be weak when the whole network writes simultaneously to the same files, because of the locks. The second one is stronger, but heavier.

Cluster possible usages and evolutions

In an ideal world, I would need to be able to reset all the nodes and to choose to deploy a new disk image (Raspbian) on all nodes through the network. I might have a look at the Raspberry Pi Foundation's recovery partition and try to adapt it to achieve this goal : if Shift is pressed, or if there is a newer disk image available on the master server, the recovery partition would restore the master image on the node and execute a customization script (to update the hostname, the IP, ...). But it is not the main goal of this cluster. I'll try to do this later.

All the nodes have to share some standard basis, though the first node will differ a little bit as it will act as the head of the cluster. Then, they have to share a common filesystem, and I'll have to test several of them in a distributed environment. Once data can be shared, I can begin to do some calculations, and maybe try full cloud stacks.

Of course, all these possibilities have to be Puppetized ! I want to be able to reformat and rebuild each node automatically, to reassign it to another task for example.


4. Software installation

Creating base node images

First of all, I needed a standard SD card image. I installed a default Raspbian on an SD card without any customization and created an image of it using partclone. The standard Raspbian installation creates 2 partitions : the first one, with a FAT filesystem, contains the Raspberry Pi GPU firmware and the kernel to boot; the second one is the root filesystem. So, I have one image for each partition, one file for the SD card MBR, created with dd, and one file for the partition table, created with sfdisk. Yes, I know that the partition table is part of the MBR, anyway.

I am now able to clone the default node onto every other card, but they would all have the same IP and the same hostname. Regarding the IP addresses, I used my network's DHCP server to assign a fixed IP address to each Raspberry Pi MAC address. So, each node has its own fixed IP address, and this way I know where each node is located in my box. Regarding the hostname, I created a script to clone the SD cards. This script mounts the cloned filesystem and changes the hostname in several files (/etc/hostname, /etc/mailname, /etc/hosts, /etc/ssh/ssh_host_*_key.pub, /etc/exim4/update...). Then, when a node boots, it has its own hostname associated with the SD card and its own IP address associated with its MAC address. It is not yet perfect, but enough for my first try.
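The hostname-rewriting part of that cloning script can be sketched as follows (a simplified illustration, not the actual script; it only covers a few of the files listed above) :

```shell
#!/bin/sh
# set_hostname ROOT OLD NEW : rewrite the hostname inside a mounted
# clone. Simplified sketch of the cloning script's customization step.
set_hostname() {
    root="$1"; old="$2"; new="$3"
    echo "$new" > "$root/etc/hostname"
    echo "$new" > "$root/etc/mailname"
    sed -i "s/\b$old\b/$new/g" "$root/etc/hosts"
}

# Example: set_hostname /mnt/clone raspberrypi node03
```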

The cloning script also adds an initialization script, install-puppet.sh, in /root. It is intended to be executed manually after the first boot. Its goal is to update the system, install Puppet, set the PuppetMaster server and execute Puppet to actually configure the node.

Repositories mirrors

When you update one computer using a remote internet repository, your network has to download every file once. When you have several computers to update, the amount of downloaded data is multiplied, and the download time too, but they seldom update simultaneously. When you have 15 cluster nodes, they ALL update simultaneously !!! They kill my network, use all my internal and external bandwidth, and take more than one hour for their initial update. Well, I do need repository mirrors. I tried to rsync them : even if raspbian.org is rsyncable, unfortunately raspberrypi.org is not... So, I looked for a consistent mechanism and chose apt-mirror to mirror the armel and armhf architectures, for the wheezy and jessie versions of Debian/Raspbian. Then, I published the mirrors using the lightweight mini-httpd, which is already overkill for this simple task !
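For reference, the corresponding apt-mirror configuration could look like this (a sketch; the architecture list and base path are assumptions, consistent with the commands given earlier in this document) :

```
# /etc/apt/mirror.list (sketch)
set base_path /data/apt-mirror
set nthreads  5
deb-armhf http://archive.raspbian.org/raspbian wheezy main firmware
deb-armhf http://archive.raspberrypi.org/debian wheezy main
clean http://archive.raspbian.org/raspbian
```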

Puppetmaster server


TODO

10 Cluster 10mo 10.7 Network FS : HDFS
54 Cluster L 8mo 4.21 Recovery partition
55 Cluster L 8mo 4.21 Automatic recovery if GPIO1 is pressed at boot
56 Cluster L 8mo 4.21 triggerhappy : GPIO1 = reboot / shutdown (long press)
15 Cluster 10mo 2.65 Remove uDev Persistent network rule
48 Cluster 9mo 2.5 Logstash
49 Cluster 9mo 2.5 Graylog2
14 Cluster 10mo 2.68 Puppetize collectd/carbon/graphite
13 Cluster 10mo 3.69 Install/Configure MCollective
2013-08-14 http://www.unixgarden.com/index.php/gnu-linux-magazine/mcollective-l-administration-systeme-massive
2013-08-14 http://docs.puppetlabs.com/mcollective/
2013-08-14 http://wiki.deimos.fr/Puppet_:_Solution_de_gestion_de_fichier_de_configuration
2013-08-14 http://fr.slideshare.net/PuppetLabs/presentation-16281121
2013-08-28 http://www.puppetcookbook.com/
5 Cluster 10mo 2.69 Processing : Torque
6 Cluster 10mo 2.69 Processing : DRQueue
7 Cluster 10mo 2.69 Processing : MPI-LAM
8 Cluster 10mo 2.69 Processing : PVM
9 Cluster 10mo 2.69 Processing : Globus
53 Cluster 8mo 2.46 Processing : JobScheduler
20 Cluster 9mo 2.62 Network FS : Gluster
21 Cluster 9mo 2.62 Network FS : GFS2/NBD GFS2/DRBD
22 Cluster 9mo 2.62 Network FS : Lustre
23 Cluster 9mo 2.62 Network FS : AFS
24 Cluster 9mo 2.62 Network FS : Coda
25 Cluster 9mo 2.62 Network FS : Intermezzo
26 Cluster 9mo 2.62 Network FS : GFarm
27 Cluster 9mo 2.62 Network FS : Ceph
28 Cluster 9mo 2.62 Network FS : PVFS
29 Cluster 9mo 2.62 Network FS : Parallel NFS
30 Cluster 9mo 2.62 Network FS : CXFS
31 Cluster 9mo 2.62 Network FS : pNFS
32 Cluster 9mo 2.62 Network FS : NFS4.1
33 Cluster 9mo 2.62 Network FS : MooseFS
34 Cluster 9mo 2.62 Network FS : GPFS GeneralParallelFS
35 Cluster 9mo 2.62 Network FS : DRBD
36 Cluster 9mo 2.61 DB: Cassandra
37 Cluster 9mo 2.61 DB: Riak
38 Cluster 9mo 2.61 DB: VoltDB
39 Cluster 9mo 2.61 DB: Redis
40 Cluster 9mo 2.61 DB: SolR
41 Cluster 9mo 2.61 DB: Lucene
42 Cluster 9mo 2.61 DB: ElasticSearch
11 Cluster 10mo -2.31 DB: Hadoop/Hive
12 Cluster 10mo -2.31 DB: Hadoop/HBase
43 Cluster 9mo 2.6 Cloud: OwnCloud
44 Cluster 9mo 2.6 Cloud: OpenStack
45 Cluster 9mo 2.6 Cloud: LVC
46 Cluster 9mo 2.6 Cloud: KVM
47 Cluster 9mo 2.6 Cloud: QEMU


Appendix A. Useful links

http://elinux.org/RPi_Hub
http://radiospares-fr.rs-online.com/web/
http://www.rs-particuliers.com/
http://www.ldlc.com/
http://coen.boisestate.edu/ece/raspberry-pi/
http://blog.afkham.org/2013/01/raspberry-pi-control-center.html
http://blog.afkham.org/2013/02/building-raspberry-pi-cluster-part-2.html

Unattended Raspbian install : http://www.raspberrypi.org/forums/viewtopic.php?f=66&t=50310
Read only rootfs : http://www.raspberrypi.org/forums/viewtopic.php?p=228099