
This document describes the different PCBs. They should all be hot-pluggable :

Motherboard/backplane
PSUs PCB
UPS PCB
NIC PCBs
Microcontroller/Microprocessor PCB
Hakeva Core PCBs
RAM PCBs

Table of contents

Backplane/Motherboard
FPGA node
Microcontroller nodes

Hardware Architecture Overview Draft

Backplane/Motherboard¶

The board size has to respect an ATX standard (Micro-ATX) to fit in standard racks.

https://web.archive.org/web/20120725150314/http://www.formfactors.org/developer/specs/atx2_2.pdf

Connectors¶

ATX Power
JTAG
I2C
UART
FTDI/USB to UART
Connection BUS (2 microcontrollers, 1 UPS, 2 PSU, 5 RAM or FPGA)
Front panel (Power, Alert and Reset buttons, Power, Alert, Heartbeat1, Heartbeat2, LEDs)
2x 3pins PWM Global FAN connectors

Connection bus¶

Power : -12V, -5V, GND, +3.3V, +5V, +12V

JTAG : TDI, TDO, TCK, TMS, TRST

I²C : SDA, SCL

RAM bus : Addr lines, Data lines, BankSelect?, WE, OE, CLK?

It has connectors to support module hotplug :

Memory modules with standard DDR2 (DDR3/4/5 if possible)
Storage modules with up to 4 x standard 2.5" HDD/SSD for persistence and RAID 0, 1, 10, 5 support
PSU modules (up to two of them) for HA
UPS to drain the data in case of power failure and to keep the RAM refreshed to avoid data loss
Redis Core (FPGA) modules

BMC/IPMI/Redfish¶

It has to embed an onboard BMC/IPMI for easy management with its own dedicated NIC.
Maybe OpenBMC (https://github.com/openbmc) running on a Raspberry Pi Compute Module 3+ (https://www.raspberrypi.com/products/compute-module-3-plus/) in a SO-DIMM form factor, with a dedicated ENC28J60 NIC (https://www.raspberrypi.com/documentation/computers/compute-module.html).

Power control and Reset¶

A Reset+Brown-out+Watchdog dedicated circuit
A clean reset momentary push button with NE555 debounce.
A momentary push power button (short press soft-toggles power, long press to hard power-off)

I2C BUS¶

RTC¶

With a battery (or supercapacitor); updated by the microcontrollers from the network, using NTP

PWM FAN controllers¶

FPGA node¶

FPGA node hardware design

Microcontroller nodes¶

It has 1 (or 2 for HA) hot-pluggable microcontroller module(s). Each of them has its own dedicated NIC, as a 100M/1G/10G SFP+ module. It has to support backup and restore processes. Each module carries an ATSHA crypto chip, EEPROM, flash, clock, I2C bus and an I2C temperature sensor.

This document describes the server firmware design to be configured in the backplane microcontroller.

Table of contents

Bootloader
Firmware
- FPGA start

Bootloader¶

Waits for 1 second for a USB/UART (FTDI or CH340G) upgrade.
If the boot failure value in the external flash is greater than 2 (configurable threshold in EEPROM) : Switch to the other firmware
1. Enable the other external firmware
2. Reset the boot failure value to zero
If the enabled external flash firmware version is different (not necessarily higher for downgrades) from the internal flash one
1. Check the enabled firmware signature with the ATSHA crypto chip
2. Copy the enabled firmware from the external flash to the internal flash
Increment the boot failure value in the external flash
Continue with the internal flash firmware
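
The boot sequence above can be sketched as a small behavioural model. This is only an illustration in Python, not the firmware itself: the class names, the two-slot layout and the `signature_ok` callback are hypothetical, and skipping the copy when the signature check fails is an assumption, since the steps above do not say what happens in that case. The 1 second upgrade window is left out.

```python
from dataclasses import dataclass


@dataclass
class ExternalFlash:
    versions: dict          # firmware version per slot, e.g. {0: "1.1", 1: "1.0"}
    enabled_slot: int = 0   # which external firmware is currently enabled
    boot_failures: int = 0  # boot failure value kept in the external flash


@dataclass
class BootState:
    flash: ExternalFlash
    internal_version: str
    failure_threshold: int = 2  # configurable threshold (EEPROM in the real design)


def bootloader_step(state, signature_ok=lambda slot: True):
    """Run one boot pass and return the firmware version that will be executed."""
    flash = state.flash
    # Too many failed boots: switch to the other external firmware.
    if flash.boot_failures > state.failure_threshold:
        flash.enabled_slot = 1 - flash.enabled_slot
        flash.boot_failures = 0
    enabled_version = flash.versions[flash.enabled_slot]
    # Any difference triggers a copy, so downgrades are possible.
    if enabled_version != state.internal_version:
        # Assumption: a failed signature check simply skips the copy.
        if signature_ok(flash.enabled_slot):
            state.internal_version = enabled_version
    # Counted as a failure until the firmware resets it after its self-check.
    flash.boot_failures += 1
    return state.internal_version
```

The counter is incremented on every boot and only the firmware resets it, so a firmware that never reaches its sanity self-check eventually triggers the switch to the other slot.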

Firmware¶

Enable the watchdog
Sanity self-check
Reset the boot failure value to zero
Hardware devices check
Hardware devices initialization if needed (already up and running devices should not be reinitialized to allow firmware upgrade without downtime)
Configure the admin network (DHCP or fixed)
Start the enabled services
1. RSyslogd
2. HTTP REST API
3. HTTP web admin interface
  1. SSL certificates management
  2. authentication
  3. permissions
  4. monitoring
  5. DB management
4. Node manager
5. Hardware monitoring service (prom)
6. Backup manager
7. ...
On a regular (scheduled) basis
1. Update the hardware watchdog
2. Query NTP to update the RTC

FPGA start¶

Check and configure the specified FPGA with the gateware specified in the configuration, loaded from the external flash

https://www.youtube.com/watch?v=THLdycw9-Vs

PCIe(PCI Express)¶

Components and blocks¶

FPGA¶

Xilinx documentations and datasheets :

Third party references :

Configuration¶

FPGA configuration schematic
The FPGA can be configured by several means. It can receive a pushed configuration from an external device such as a microcontroller, it can pull its configuration from an external device such as a flash memory, or it can be configured using the JTAG connection. Whatever happens, the JTAG always has priority. Thus, I chose to store the configuration in an external Quad-SPI flash. This FPGA device (Artix7 100T) needs 30Mbit to store the whole configuration and would need several seconds to initialize with standard SPI. I chose to use Quad-SPI flash to keep a reasonable configuration time, at a reasonable cost and complexity. Unfortunately, there is a drawback: it would be too complex to access both the FPGA and the flash from the same connector (either JTAG or PCI). The flash is behind the FPGA and not directly accessible. The programmer needs to configure the FPGA with a temporary bitstream which acts as a flash programmer; this is called indirect programming.

I need to configure the mode pins (M0, M1, M2 in bank 0) to tell the FPGA to fetch its configuration from the SPI flash. I hardwire this value because JTAG programming will still have priority. The FPGA generates the QSPI clock signal to drive the QSPI flash (on pin CCLK_0). Then I need to choose and connect the QSPI flash memory to the relevant pins in bank 14 (data lines and chip select). I do not have to care about the generated clock frequency: the FPGA starts slowly and the very first bits in the downloaded bitstream can increase the generated clock frequency dynamically, which leaves a lot of flexibility in choosing the actual QSPI flash chip.

The QSPI voltage levels have to be consistent across the different pins (CCLK_0 on bank 0, and the other pins on bank 14), so the banks need to have the same voltage, at least during the configuration stage. I chose to power the banks 0, 14 and 15 (side effect) at 3.3V. This voltage range is selected with the CFGBVS pin.

The PROGRAM pin acts as a reset and reacts to a pulse (it can not keep the FPGA in reset state). I connected it to a manual reset push button.
When the FPGA is started or reset, before loading its configuration, it has to clear its current configuration. The INIT_0 pin is driven LOW during this cleanup and reverts to HIGH afterwards to start the actual configuration. It is possible to hold it LOW to keep the FPGA in reset state.

Once the configuration is loaded, the DONE_0 pin is driven HIGH.

I connected LEDs to INIT_0 and DONE_0 as status indicators to follow and debug the reset and configuration stages.

Power needs¶

From Artix7 datasheet summary

Name	Min.	Typ.	Max.	Load	Decoupling	Comments
VCCINT	0.95V	1.00V	1.05V	0.3-6A	1x330uF, 6x4.7uF, 8x0.47uF	Same rail
VCCBRAM	0.95V	1.00V	1.05V	0.1A	1x100uF, 2x0.47uF	Same rail
VCCAUX	1.71V	1.80V	1.89V	0.15-0.35A	1x47uF, 3x4.7uF, 5x0.47uF
VCCO	1.14V		3.465V	0.2-2.5A	Bank0: 1x47uF, Other banks: 1x100uF, 2x4.7uF, 4x0.47uF
VMGTAVCC	0.97V	1.00V	1.03V	0.15-1A		Has to be filtered accordingly to 7 Series FPGAs GTP Transceiver User Guide (UG482)
VMGTAVTT	1.17V	1.20V	1.23V	0.05-0.4A
VCCADC	1.71V	1.80V	1.89V	0.15-0.35A
VREFP	1.20V	1.25V	1.30V
VCCBATT	1.00V		1.89V			Battery required only if encryption, otherwise : connect to GND or VCCAUX
VIN	-0.20V		VCCO+0.20V

(1*6)+(1*.1)+(1.8*.35)+(3.465*2.5)+(1*1)+(1.2*.4)+(1.8*.35) = 17.5W

The FPGA could theoretically consume approximately 17.5W with mixed voltages. The different voltages can not easily be generated directly from the available 12V. We need a first DC/DC conversion from 12V to 5-5.5V and a second from 5-5.5V to the different voltages. Thus, with an efficiency of 80% for each converter, we need approximately :

5-5.5V at 4.38A for the second stage (17.5W/80%/5V = 4.38A)
12V at 2.28A for the first stage. (17.5W/.8/80%/12V = 2.28A)
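
The arithmetic above can be checked with a few lines of Python. This is only a verification sketch: the rail list copies the nominal voltages and maximum loads used in the sum above, and the function names are made up for the example.

```python
# (rail, voltage in V, maximum load in A), as used in the sum above
RAILS = [
    ("VCCINT", 1.0, 6.0),
    ("VCCBRAM", 1.0, 0.1),
    ("VCCAUX", 1.8, 0.35),
    ("VCCO", 3.465, 2.5),
    ("VMGTAVCC", 1.0, 1.0),
    ("VMGTAVTT", 1.2, 0.4),
    ("VCCADC", 1.8, 0.35),
]
EFFICIENCY = 0.8  # assumed efficiency of each DC/DC stage


def total_power(rails):
    """Worst-case power drawn by the FPGA, in watts."""
    return sum(volts * amps for _, volts, amps in rails)


def input_current(power_w, input_volts, stages):
    """Current drawn at a converter input, with `stages` converters downstream."""
    return power_w / (EFFICIENCY ** stages) / input_volts


fpga_w = total_power(RAILS)
second_stage_a = input_current(fpga_w, 5.0, 1)   # 5V input of the second stage
first_stage_a = input_current(fpga_w, 12.0, 2)   # 12V input of the first stage
print(f"{fpga_w:.2f} W, {second_stage_a:.2f} A at 5V, {first_stage_a:.2f} A at 12V")
```

The computed values round to the 17.5W, 4.38A and 2.28A quoted above.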

The card will also carry network, RAM, SPI flash and other components. Although the 12V/2.1A provided by the PCIe bus would be sufficient in most cases, I will add an extra power connector to use the ATX 12V/6.25A if available.

Decoupling capacitor recommendations (Types, ESL, ESR, and suggestions available in UG483)

Value	Package	Volts
330uF	2917	2.5V
100uF	1210	2.5V
47uF	1210	6.3V
4.7uF	0805	6.3V
0.47uF	0603	6.3V

The JTAG connector¶

The card will be tested and used as a development board first. Thus, it needs to be configurable with a JTAG connector. The easiest way is to remain compatible with Xilinx's connector. Their ribbon cable has 14 pins, with an IDC connector. I added a very simple ESD protection.

I might switch later to a smaller JTAG connector with pogo pins, such as Tag-Connect's.

Power supplies¶

On standard (micro)ATX motherboards, there is a limited power available through the 3.3V and 12V rails. The ATX standard also includes extra 12V power connectors for graphics cards.

There are five possible power sources for the PCIe format :

The PCIe connector 3.3V at 3A (9.9W)
The PCIe connector 12V at 0.5A, 2.1A or 5.5A depending on the size and software configuration
An ATX 6-pins 12V/6.25A (75W)
An ATX 6-pins 12V/6.25A (75W)
An ATX 8-pins 12V/12.5A (150W)

The different sources can be combined, with limitations, but I will not combine them to keep the circuit simple.
There are some limitations :

PCIe x1 : limited to 0.5A (6 W)
PCIe x4 : limited to 2.1A (25 W)
PCIe x16: up to 5.5A (66 W), if software-configured as a high-power device. Although the card is in the PCIe format, it has to fit in a backplane, which may not implement this logic

As for graphics cards, it is possible to use

up to 2 x 6-pins connectors to provide additional 12V (75W each)
up to 1 x 8 pins connector to provide additional 12V (150W)

Thus, I choose to use a PCIe x16 connector to be safe. If the motherboard can be configured to deliver 5.5A, fine, if not, it can deliver 2.1A at least. In addition, I add an ATX 6pin connector to use it as a main power source, when connected, to avoid any stress on the motherboard.

Automatic power source switch¶

For simplicity, I do not plan to use combined power supplies. The card can use more than the 25W provided by default by the 12V PCIe connector. It needs to automatically switch to a more powerful power source, when available. The card should use the extra connector 12V if available, then the PCIe provided 12V as a fallback, with automatic switching.

I designed an automated switching circuit, based on two P-channel MOSFETs per power source, used both as ideal diodes (to avoid the voltage drop) and as switches. They are driven by smaller N-channel MOSFETs to implement the priority chain. Extra care was taken during the PCB layout design to dissipate as much heat as possible.

Figures: PowerSwitch schematic, PowerSwitch PCB (top and bottom), PowerSwitching view (top and bottom)
Online simulator
PCB and assembly files sent to the manufacturer: PowerSwitchModule-production.zip

Q2A and Q2B are both blocked. When PWR12V is high enough :

It goes to GND through R8 and R11, acting as a voltage divider. The middle voltage is greater than Q3B's V_GS(th) and unblocks it.
It goes through Q2A using the internal diode and is blocked by Q2B's diode. It goes through R13, acting as a pullup resistor, and goes to GND through Q3B.
Q2A and Q2B gates are low, unblocking them. The current continues to flow through Q2A, bypassing the internal diode and its voltage drop, and then both through the R13 pullup resistor and through Q2B to the 12V output.
Whatever the PCIe12V voltage, the voltage divider R7-R9 is connected to GND through Q3B, and Q3A's gate is too low to unblock the MOSFET.
If PCIe12V is high, the current goes through Q1A's diode and is blocked by Q1B and Q3A. Q1A and Q1B's gates are high, so the MOSFETs are blocked.
If PCIe12V is low, the current can flow from the 12V output through Q1B's internal diode, but is blocked by Q1A's diode and can not flow to the PCIe12V source.
The system is stable with a 12V output from PWR12V
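
The priority chain can be summarised as a boolean model. The sketch below is a simplification in Python, not a circuit simulation: each MOSFET pair is reduced to an on/off flag, and the behaviour when PWR12V is absent (Q3A unblocking Q1A/Q1B) is an assumption inferred from the resistor constraints listed further down, since the walkthrough above only covers the PWR12V case. The names `SwitchState` and `power_switch` are made up for the example.

```python
from typing import NamedTuple, Optional


class SwitchState(NamedTuple):
    q3b_on: bool            # N-channel driver fed by the PWR12V divider
    q2_on: bool             # Q2A/Q2B pair on the PWR12V path
    q3a_on: bool            # N-channel driver fed by the PCIe12V divider
    q1_on: bool             # Q1A/Q1B pair on the PCIe12V path
    output: Optional[str]   # which source feeds the 12V output, if any


def power_switch(pwr12v: bool, pcie12v: bool) -> SwitchState:
    """Return the idealised steady state for the given source availability."""
    q3b_on = pwr12v                    # divider voltage exceeds Q3B's threshold
    q2_on = q3b_on                     # Q3B pulls the Q2A/Q2B gates low
    q3a_on = pcie12v and not q3b_on    # Q3B grounds Q3A's divider when on
    q1_on = q3a_on                     # assumption: Q3A pulls the Q1A/Q1B gates low
    if q2_on:
        output = "PWR12V"
    elif q1_on:
        output = "PCIe12V"
    else:
        output = None
    return SwitchState(q3b_on, q2_on, q3a_on, q1_on, output)
```

With both sources present the model selects PWR12V and keeps the PCIe12V path blocked, which matches the stable state described above.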

The design could be simplified at the PWR12V level, given that this is the highest priority source, it should not be disabled by something else. I chose to keep it, in order to have a scalable design, in case of additional power sources, in case of a soft start, in case of a power switch, ....

Component and values choices : we have to keep in mind the limited availability of some components, currently, and the prices.

Q1A/Q1B need :

V_DSS > 12V,
V_GS > 12V,
V_GS(th) < 12V
I_DS > 2.1A

Q2A/Q2B need:

V_DSS > 12V,
V_GS > 12V,
V_GS(th) < 12V
I_DS > 6.25A

Q3A/Q3B need:

V_DSS > 12V,
V_GS > 12V,
V_GS(th) < 12V
I_DS > current flowing through R12/R13

R1/R3

high values for low current leakage
voltage < Q3A's V_GS(th) when Q3B is unblocked
voltage > Q3A's V_GS(th) when Q3B is blocked (floating)

R2/R4

high values for low current leakage
voltage > Q3B's V_GS(th)

R5/R6

pullup resistors
high enough for small current leakage
R = U/I = 12V / 1mA = 12k. I chose 10k, this is a standard value for pullup/pulldown

I chose the same dual-PMOS chip for both Q1 and Q2, to limit the BOM length at the price of few extra cents and an oversized Q1.

Inspired by :

12V -> 5V5/5A DC/DC step-down¶

TODO
Inductance choice : https://www.youtube.com/watch?v=ki32ZtKWe_Q
https://www.youtube.com/watch?v=FqT_Ofd54fo

Needs:

Input: 12V
Output: 5.5V/5A

Whatever components are used for the rails, they need an input at least a few volts above the highest LDO output. Thus, I need to output 5V-5.5V and the FPGA can consume up to 17.5W, depending on the GPIO voltage and on the configured IP. So, I need a step-down from 12V to 5.5V, supporting up to 5A. Several DC/DC converters are available from

MPS,
TI,
DA,
...

I want to keep the manufacturing simple and I limit the choice between the available components at the chosen manufacturer (JLCPCB) :

The first one is fine enough and 0.20€ cheaper, still in production, available in QFN and SOIC packages, simple to implement with few components, and has a programmable switching frequency between 200kHz and 4MHz. It can sustain 3.5A with a current limiter between 4 and 4.7A. Last but not least, it has excellent documentation. Its only disadvantage is that it is an "extended part" at JLCPCB, meaning manual feeding and extra cost (x5 on average).

The third one has the huge advantage of being highly available at JLCPCB, as a "basic part" (no manual intervention, already in the feeders).

MPQ9633B:

few external components
relatively easy to implement
not too expensive
available

In the worst-case scenario, the output consumes 20A and C_IN needs to provide 10A. At a forced frequency of 1kHz, in the worst case, the input ripple should not be higher than 12V-(5.5V/0.9)=5.88V. Thus, the C_IN capacitors should total at least 850uF. I chose to use

C_IN1 = C_IN2 = 470uF (SMD 2917)
C_IN3 = 47uF (SMD 1206)
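
The text does not spell out how the 850uF figure is obtained. One reading that reproduces it, given here only as an assumption, is C = I * t / dV with the input capacitors supplying the 10A alone for half a switching period. The variable names and the half-period hold time are not from the original notes.

```python
V_IN = 12.0        # input voltage, V
V_OUT = 5.5        # output voltage, V
FACTOR = 0.9       # the 0.9 factor from the ripple formula above
I_CIN = 10.0       # current C_IN has to provide, A
F_SW = 1000.0      # forced switching frequency, Hz

# Largest input ripple that still leaves enough input voltage.
max_ripple = V_IN - V_OUT / FACTOR

# Assumption: the capacitors carry the load alone for half a period.
hold_time = 0.5 / F_SW

# C = I * t / dV
c_min = I_CIN * hold_time / max_ripple

# Capacitors chosen above: 2 x 470uF + 47uF
c_chosen = 470e-6 + 470e-6 + 47e-6

print(f"max ripple {max_ripple:.2f} V, minimum C_IN {c_min * 1e6:.0f} uF, "
      f"chosen {c_chosen * 1e6:.0f} uF")
```

Under this assumption the minimum comes out at roughly 850uF, and the chosen 987uF total leaves some margin.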

Low voltage power supplies¶

We need the following rails to power the FPGA :

VCCINT for internal logic
VCCBRAM for the BRAM cells, which can be consolidated with VCCINT
VCCAUX
VCCOx for each of the banks (0, 13, 14, 15, 34, 35)
VCCBATT for the battery that retains, in the FPGA, the AES key used to decrypt the bitstream
VMGTAVCC for the transceivers
VMGTAVTT for the transceivers
VCCADC for the ADC ?
VREFP for the ADC ?

It is possible to use a complex circuit with a lot of cheap and easy to buy discrete components and simple ICs, but it means a lot of components to order, a complex routing on the PCB, a lot of very small components to solder. On the other hand, I can use few and expensive complex components, which are harder to find, but the PCB routing will be easier, there will be less soldering.

The Reducing System BOM Cost with Xilinx's Cost-Optimized Portfolio whitepaper provides SMPS suggestions (Dialog DA9062, Monolithic Power Systems MP5416, Exar/MaxLinear XRP7714, Texas Instruments TPS65023), furthermore, the Arty A7 and the AX7101 schematics also provide some good inspirations.

The main power source is the PCI express slot. I can not use the permanent 3.3V, I need to use the 12V. Either I can find a suitable component which can accept 12V input or I need to use some kind of step-down from 12V to 5.5V (I added some extra headroom for LDO dropout, in case).

- Exar/MaxLinear XRP7714 is discarded because it has only 4 outputs, and would need extra components to get all the required voltages.
- Texas Instruments TPS65023 is discarded because it can provide less current, probably not enough to make a reasonable use of the FPGA.
- Dialog DA9062 has enough outputs, enough power (up to 8.5A combined), a good set of features (watchdog, RTC, timers, power on/off sequences, ...) and a very comprehensive documentation.
- Monolithic Power Systems MP5416 has one more LDO, very interesting power (approx 15A), with a lot of features (but no RTC).

Questions are :

how many external components needed ?
how easy to find and buy ?
how cheap/expensive ?

MP5416 is nearly impossible to find. Dialog DA9062 is not easy to find, but possible. That's also the PMIC used in Digilent's Arty A7 dev board.

RAM¶

Storage¶

The RAM storage has to be inexpensive and high-density, so it uses standard DDRx sticks, non-ECC.
The FPGA node is too compact and its form factor is not compatible with onboard standard DDRx RAM sticks. The RAM sticks are plugged on the backplane (or on dedicated RAM extension boards) and are accessed through DMA channels over the PCI Express bus.

Local RAM¶

The FPGA node also has a DDR3 chip for temporary and intermediate values.

Network¶

RJ45 10/100/1000 ethernet¶

SFP+ module¶

Crypto¶

ATSHA or newer

Clocks¶

Watchdog, Brown-out, Reset and Power-On-Reset circuits¶

Fans¶

Fans are not connected or mounted on the node, but on the motherboard/backplane, so that noise filtering and airflow are shared.

Sensors¶

Power rail ammeters to measure the consumed power. Temperature sensor (thermistor).

They serve two goals :

measure the power efficiency (nb operations/second/watt)
anticipate temperature rise to drive the fans

Connectors¶

Standard PCIe v3.0 x2 or x4 connector
Standard ATX extra 12V 35W connector
Maybe 3pins PWM FAN connector (FPGA and NIC)

The PCB¶

It has a PCI express connector, at least v3.0 and at least x2 to have a bandwidth compatible with a Gigabit ethernet bandwidth.
The FPGA node has to be compatible, at least in form factor, with a standard (micro)ATX motherboard in a 1U rack. Although it would be theoretically possible to connect it to a standard motherboard, I strongly discourage this. First, you would need a kernel driver to manage and communicate with the card. Second, the card would have full access to the whole hardware, including the northbridge and the RAM or the southbridge and the devices, bypassing the OS kernel.

This document is the Hakeva core design for the Verilog implementation to be configured in an FPGA.

Internal RAM¶

Design subject to changes:
Currently, this module implements asynchronous reads, thus the synthesis can not infer internal BRAM blocks. The module would have to be ported to synchronous reads and the testbench updated, in order to make it compatible with internal BRAM and with higher frequencies.

This module is a DP-SC-SW-AR RAM. It can be preloaded with an image file to be used as a ROM (if WE0+WE1 are forced to low).

The read value is continuously assigned, thus it returns the new value, in case of writes. Its interface has to remain compatible with the external RAM (SDRAM, DDR2, DDR3, ...) design.
If both ports write to the same address concurrently, the actual stored value is undetermined. NEVER do that !
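
The behaviour described above can be illustrated with a small software model. This is a Python sketch, not the Verilog module: the class and parameter names are hypothetical, and raising an error on a same-address double write is a modelling choice, whereas the hardware simply stores an undetermined value.

```python
class DualPortRam:
    """Behavioural model: two write ports, one clock, asynchronous reads."""

    def __init__(self, addr_width, image=None):
        self.mem = [0] * (1 << addr_width)
        if image is not None:
            if len(image) > len(self.mem):
                raise ValueError("image larger than the RAM")
            # Preload, as done when the module is used as a ROM.
            self.mem[:len(image)] = image

    def read(self, addr):
        # Asynchronous read: reflects the current content immediately.
        return self.mem[addr]

    def clock(self, we0=False, addr0=0, data0=0, we1=False, addr1=0, data1=0):
        """Apply one rising clock edge with the given write enables."""
        if we0 and we1 and addr0 == addr1:
            # Undetermined in hardware; the model refuses it instead.
            raise ValueError("both ports write to the same address")
        if we0:
            self.mem[addr0] = data0
        if we1:
            self.mem[addr1] = data1
```

Keeping both write enables low on every `clock()` call corresponds to the ROM use case.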

Inspired by :

FIFO¶

The FIFO uses the internal RAM module design and has the same size limitation: the depth must be a power of 2. Furthermore, only ((2**ADDR_WIDTH)-1) entries can be used. If the FIFO is empty, it can not support a Read and a Write during the same clock cycle: the Read will return an undetermined value.
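
The ((2**ADDR_WIDTH)-1) limit is typical of a ring buffer that tells "full" from "empty" using only its two pointers, so one slot always stays unused. The Python model below illustrates that scheme; it is an assumption about how the FIFO is built, and the class and method names are hypothetical.

```python
class Fifo:
    """Ring-buffer FIFO model with a power-of-2 depth and one unused slot."""

    def __init__(self, addr_width):
        self.depth = 1 << addr_width
        self.mask = self.depth - 1
        self.mem = [None] * self.depth
        self.rd = 0  # read pointer
        self.wr = 0  # write pointer

    def empty(self):
        return self.rd == self.wr

    def full(self):
        # Full when advancing the write pointer would make it equal to rd.
        return ((self.wr + 1) & self.mask) == self.rd

    def push(self, value):
        """Store a value; return False when the FIFO is full."""
        if self.full():
            return False
        self.mem[self.wr] = value
        self.wr = (self.wr + 1) & self.mask
        return True

    def pop(self):
        """Return the oldest value; reading an empty FIFO is an error here."""
        if self.empty():
            raise IndexError("FIFO is empty")
        value = self.mem[self.rd]
        self.rd = (self.rd + 1) & self.mask
        return value
```

With `ADDR_WIDTH = 2` the model accepts three values and rejects the fourth.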

FIFO inspirations :

UART¶

UART does not need to be complete. It only serves as a test interface to send commands to the core and receive replies, until a network layer is available/implemented.
It has to expose received data and to consume data to send in a design compatible with the TCP/IP stack implementation.

A first draft implementation is done with Nandland's sample code (https://nandland.com/uart-serial-port-module/) with two FIFOs

Received data
Data to send

UART Inspirations :

https://www.verilog.pro/micro_uart.html

External RAM¶

SDRAM¶

Inspiration :

https://www.fpga4fun.com/SDRAM.html

DDR¶

Cached external RAM¶

The cache is stored either in logic elements or in BRAM. It is preloaded at reset from an external memory (SDRAM, SRAM, DDR, ...) region. It supports both reads and writes with a write-behind logic. It needs a "drain" signal, to flush the dirty data (when the FPGA has to reset but not the RAM, without losing data).
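
A minimal software model of the preload, write-behind and drain behaviour, as a Python sketch under simple assumptions: the external memory is a plain list, the cache covers one contiguous region, and dirty tracking is per word. None of these names come from the actual design.

```python
class WriteBehindCache:
    """Cache over one region of an external memory, flushed on drain."""

    def __init__(self, backing, base, size):
        self.backing = backing          # external memory (list of words)
        self.base = base                # start of the cached region
        # Preload at reset from the external memory region.
        self.lines = list(backing[base:base + size])
        self.dirty = set()              # offsets modified since the last drain

    def read(self, offset):
        return self.lines[offset]

    def write(self, offset, value):
        # Write-behind: only the cache is updated, the word is marked dirty.
        self.lines[offset] = value
        self.dirty.add(offset)

    def drain(self):
        """Flush dirty words to the external memory; return how many were written."""
        for offset in sorted(self.dirty):
            self.backing[self.base + offset] = self.lines[offset]
        count = len(self.dirty)
        self.dirty.clear()
        return count
```

Calling `drain()` before an FPGA reset leaves the external memory consistent, so the cache can be preloaded again afterwards.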

Memory manager¶

TODO : Rewrite this chapter to have the memory map stored at the beginning of each memory chip and cached in BRAM. This will allow full FPGA reconfiguration (partial reconfiguration is not supported in this design) for zero-downtime/zero-dataloss bitstream upgrade.

The RAM needs to be defragmented from time to time, if not continuously. The defragmentation needs to move data blocks which are currently used by other modules. In order to avoid any kind of locking, the other modules never store the actual RAM block address, but a pointer to the address in a RAM block allocation table.

The data is stored in internal RAM blocks (M9K) but could also be stored in external RAM (DDR2/DDR3....). The RAM block allocation table is stored in internal RAM to minimize the access latency due to the communication between chips.

The RAM block allocation table contains 512 entries x 96 bits (49152 bits, i.e. 6 KiB, well below 64KB) and is stored at the very beginning of the RAM, at address 0. The maximum number of entries is a compilation parameter that can be tuned depending on the actual hardware. Each entry contains the following fixed fields :

RAM block address (64bits to address up to 1TB)
RAM block size (64bits for a maximum size of 1TB)

Unused (free / available) slots reference address 0 (which is an invalid allocated block) and have the available RAM as a size. The table is initialized as follow.

Slot#	Address	Size
0	0000	1TB
1	0000	0000
...	0000	0000
511	0000	0000
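
The initial state of the table can be written down as a short Python sketch. The `Entry` type and helper names are hypothetical, 1TB is taken as 2**40 bytes, and field widths are ignored since the model only illustrates the initial content.

```python
from dataclasses import dataclass

TABLE_ENTRIES = 512      # compilation parameter in the real design
RAM_SIZE = 1 << 40       # 1TB, taken here as 2**40 bytes


@dataclass
class Entry:
    address: int = 0     # 0 is never a valid allocated block address
    size: int = 0


def init_table(entries=TABLE_ENTRIES, ram_size=RAM_SIZE):
    """Build the table as initialised above: slot 0 holds all the available RAM."""
    table = [Entry() for _ in range(entries)]
    table[0].size = ram_size
    return table


def is_free(entry):
    """A slot referencing address 0 is unused."""
    return entry.address == 0


table = init_table()
print(len(table), table[0].size == RAM_SIZE, all(is_free(e) for e in table))
```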

Operation list (to be designed)¶

Allocate a block¶

Write a block¶

Allocate and Write a block¶

Read a block¶

Free a block¶

Read and Free a block¶

Read and Free a block, then Defragment¶

Move a block¶

Resize a block¶

Defragment¶

RESP decode¶

RESP String¶

RESP Errors¶

RESP Integers¶

RESP Bulkstrings¶

RESP Array¶

Internal structures¶

Strings¶

Numbers¶

Dictionaries¶

CPU-like core¶

Inspiration:

Redis command decoding
Commands stored in a table with arity and branch to trigger thru a MUX.
ALU
Bus
Command pipelining
ATSHA-like crypto to encrypt data at rest
ATSHA-like crypto to sign/auth the firmwares/bitstreams
Replace b-trees with n-trees and CAM-like searches, should find any key in less than log_n(x)

No module support (neither preconfigured, nor dynamically configured).
Subset of Redis commands first.
One extra command returning internal statistics in a prometheus-friendly format.

decoder¶

Operations¶

PING¶

INFO¶

SET¶

GET¶

INCR/INCRBY¶

Hardware¶

FPGA node design
Backplane

Version 0.1¶

This is the initial version, with the basics :

RedisCore (Verilog/FPGA)
- RAM blocks
- FIFO
- UART input/output 115200 baud 8N1
- Command existence validation
- Command arity check
- Arbitrary processing
- Reply
RedisService (C/Microcontroller)
- Nothing (yet)
Hardware (Backplane and daughter PCBs)
- Nothing (yet)

see document#2

UART for input/output simulation¶

Accept incoming data and forward them; accept forwarded data and send them. The forward and accept design has to be compatible with a network connection design. This UART design is temporary, for tests, and has to be replaceable by a TCP/IP design.

Memory manager¶

The goal is to manage real-time background garbage collection and defragmentation. Each used memory block address is stored in a memory block pointer table and the IP blocks only use indirect pointers, i.e. addresses in the memory block table. Thus, a physical memory block can be moved transparently, as long as the address in the block table is changed atomically.
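
The indirection can be illustrated with a small Python model. It is a sketch only: the `BlockTable` class, its handle numbering and the caller-chosen addresses are hypothetical, and the single dictionary update stands in for the atomic table write of the hardware.

```python
class BlockTable:
    """Blocks are reached through handles, never through raw addresses."""

    def __init__(self, ram_size):
        self.ram = bytearray(ram_size)
        self.entries = {}        # handle -> (address, size)
        self.next_handle = 1

    def store(self, address, data):
        """Write data at `address` and return a handle for it."""
        self.ram[address:address + len(data)] = data
        handle = self.next_handle
        self.next_handle += 1
        self.entries[handle] = (address, len(data))
        return handle

    def load(self, handle):
        address, size = self.entries[handle]
        return bytes(self.ram[address:address + size])

    def move(self, handle, new_address):
        """Relocate a block; users of the handle see no change."""
        address, size = self.entries[handle]
        # Copy the payload first, then switch the table entry in one step.
        self.ram[new_address:new_address + size] = self.ram[address:address + size]
        self.entries[handle] = (new_address, size)
```

A module holding only the handle keeps reading the same data after `move()`, which is what allows defragmentation to run in the background.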

Memory block table entries¶

RAM Block Address and RAM Block size

Modules¶

Allocate a block
Free a block

Command parsing¶

The goal is to receive a RESP encoded command from UART and reply to UART.

Consume (parse) the arity (RESP array size) and first string (RESP array first string)
Early (before consuming the arguments) return "ERR Command not found" if the command is not in the command table (internal M9K RAM).
Early return "ERR Wrong arity" if the parsed string count is incompatible with the command retrieved in the command table.
Find enough consecutive available RAM block allocation table entries, according to the argument count
Store the arguments in a memory block (internal M9K yet), update the RAM allocation table entries with the address of each argument
Pass the first RAM allocation entry to the appropriate command circuit block module
Wait until RDY
Copy the result string from RAM (design to be specified) to UART
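
The first steps of this sequence (arity and command name, with early errors) can be sketched in Python. The command table, its exact-arity convention (command name included) and the "Protocol error" reply are hypothetical; only the "Command not found" and "Wrong arity" messages come from the steps above.

```python
# Hypothetical command table: name -> expected RESP array size.
COMMANDS = {"PING": 1, "GET": 2, "SET": 3, "INCR": 2, "INCRBY": 3}


def read_line(buf, pos):
    """Return the bytes up to the next CRLF and the position after it."""
    end = buf.index(b"\r\n", pos)
    return buf[pos:end], end + 2


def parse_header(buf):
    """Parse the RESP array size and first bulk string only.

    Returns (command, arity) on success, or a RESP error reply as bytes,
    before any argument has been consumed.
    """
    line, pos = read_line(buf, 0)
    if not line.startswith(b"*"):
        return b"-ERR Protocol error\r\n"      # hypothetical message
    arity = int(line[1:])
    line, pos = read_line(buf, pos)
    if not line.startswith(b"$"):
        return b"-ERR Protocol error\r\n"      # hypothetical message
    length = int(line[1:])
    name = buf[pos:pos + length].decode("ascii").upper()
    if name not in COMMANDS:
        return b"-ERR Command not found\r\n"
    if COMMANDS[name] != arity:
        return b"-ERR Wrong arity\r\n"
    return name, arity
```

For example, `parse_header(b"*2\r\n$3\r\nGET\r\n$3\r\nkey\r\n")` yields the command name and its arity, while an unknown command is rejected without reading its arguments.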

Design¶

Hardware
Firmware
Gateware