Practical Type Configured Hashing Circuits

hash-cores is an experimental library for building type-configured FPGA hash cores. The intention is to provide a collection of RTL-based hashing components that perform close to hand-written Verilog. Crypto mining is not a goal of the project, and the library design does not take memory-hard targets into account.

From a Haskell perspective, I wanted to see how far type-level delay annotations on Clash DSignals could be taken, and hashing circuitry seemed a good fit because its timing can mostly be determined statically. For example, there aren’t the complex value-dependent timing dependencies that are common in branch prediction units and cache hierarchies. Additionally, cryptographic hash functions often have an iterative core that lends itself to different circuit representations, such as being unrolled or not. I was also inspired by the “precisely typed” interface exposed by Grenade, and although hash-cores isn’t quite as beautiful, the resulting types do enable powerful definitions:

λ> :t SimpleCore Pipelined (SHA256 @1 @1)
SimpleCore Pipelined (SHA256 @1 @1)
  :: SimpleCore
       Pipelined       -- Composition type
       (SHA256 1 1)    -- SHA256 with register placement parameters
       'Always         -- Input validity semantics
       (BitVector 512) -- Input shape
       (BitVector 256) -- Output shape
       192             -- Input to output cycling timing

These types can guarantee whether a circuit is combinational or not, input/output widths, input validity semantics, etc. They also allow us to “not worry” and use partial type signatures or GHCi to discover behavior such as timing. In a future project I’d like to go further with more complex dependently typed timing, but I’ll likely make a dedicated deeply embedded DSL and compiler that produces Clash or Verilog. This is partly due to the verbosity and issues of dependent typing in Haskell, the restrictions the Clash compiler imposes, and the SAT solving I intend to do on type-level timing.

Back to hash-cores: we can call data-default’s def on the type above or use its constructors to generate a Clash type:

λ> :t mkCircuitDataOnly (SimpleCore Pipelined (SHA256 @1 @1))
  mkCircuitDataOnly (SimpleCore Pipelined (SHA256 @1 @1))
    :: (?clk::Clock domain gated,
        ?rst::Reset domain synchronous,
        KnownNat reference)
    => DSignal domain reference (BitVector 512)
    -> DSignal domain (reference + 192) (BitVector 256)

If we are in the Clash compiler and have a TopEntity-annotated design (see examples), we can generate the Verilog/VHDL for the design:

λ> :verilog -- Pause for cup of tea

The library is very much still in an early state, but I thought I’d try to spur collaboration or feedback!

Performance

The library is intended to produce performant designs, but development effort has not yet focused on optimizations. Even so, simple test designs seem to perform well in both resource usage and timing. Unfortunately, I haven’t managed to coax Xilinx Vivado into inferring DSPs, which would likely improve performance significantly. At some point I’ll either correct this for candidate hash functions (SHA256 @1+ @2+) or provide explicit DSP configuration types.

SimpleCore InPlace (SHA256 @0 @1) can place and route on an Artix-7 at 100 MHz and uses <1k LUTs. SimpleCore Pipelined (SHA256 @0 @1) places at 150 MHz and uses around 23k LUTs. Increasing the registering significantly increases the achievable maximum clock, with SimpleCore InPlace (SHA256 @1 @1) easily meeting timing at 200 MHz without DSPs, at a modest increase in resource usage. These numbers suggest the project is in the right place, but they are approximate: I wasn’t routing all inputs and outputs, so they don’t represent a finished design (which is cheating, but somewhat representative, as cryptographic hash functions mix input well by definition).

Interestingly, Yosys reports much higher resource utilization than Vivado for the same device. I think this is down to differences in how “packable” LUTs are reported, but I haven’t had the time to investigate further.

Project Notes - Timing and Synchronization

An intent from the outset was to be able to statically verify timing characteristics from the types. This was achieved by using DataKinds and Nats to represent the delay of a component, following the design of Clash DSignal. A first attempt had implementations of the Composition class enforcing delay correctness directly, but now that constraint (d, r and r*d below) is enforced on the constructors of “cores” that consume Composition and Iterable pairs. This and other choices reduce the number of parameters that must be carried around in the types, which was previously even larger. This delay timing constraint can be seen on the constructor of SimpleCore, which represents a hash function that can only accept a single block and is defined as such:

data SimpleCore ... where
    SimpleCore
      :: forall composition iterable inputSemantics i s o r d.
      ( Composition composition inputSemantics s (r*d)
      , Iterable iterable i s o r d)
      => composition
      -> iterable
      -> SimpleCore composition iterable inputSemantics i o (r*d)

The type parameters s r d are not visible from the SimpleCore type but can be reconstructed if needed thanks to functional dependencies on Iterable.
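As a rough illustration of how a constructor can fix the delay index, here is a minimal sketch in plain GHC Haskell, without Clash. The names Rounds, Core, coreDelay and exampleCore are hypothetical stand-ins rather than hash-cores definitions, and the split into 64 rounds of 3 cycles is only an assumed factorization chosen to land on the 192 cycles shown earlier.

```haskell
{-# LANGUAGE DataKinds #-}
{-# LANGUAGE GADTs #-}
{-# LANGUAGE KindSignatures #-}
{-# LANGUAGE NoStarIsType #-}        -- GHC 8.6+: lets (*) mean type-level multiplication
{-# LANGUAGE ScopedTypeVariables #-}
{-# LANGUAGE TypeOperators #-}

import Data.Proxy (Proxy (..))
import GHC.TypeLits

-- Hypothetical stand-in for an Iterable: r rounds, each taking d cycles.
data Rounds (r :: Nat) (d :: Nat) = Rounds

-- Hypothetical stand-in for a core: the only way to build one is through
-- the constructor, so its delay index is always r * d.
data Core (delay :: Nat) where
  Core :: Rounds r d -> Core (r * d)

-- Recover the type-level delay as an ordinary value.
coreDelay :: forall delay. KnownNat delay => Core delay -> Integer
coreDelay _ = natVal (Proxy :: Proxy delay)

-- Assumed parameters: 64 rounds of 3 cycles. GHC checks that 64 * 3 is 192;
-- writing Core 191 here would be rejected at compile time.
exampleCore :: Core 192
exampleCore = Core (Rounds :: Rounds 64 3)
```

The r and d parameters do not appear in the Core type itself, only their product, which mirrors the way SimpleCore hides s, r and d.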

Dragging all this information along in the types allows us to discover timing through type holes, and/or lets the compiler verify it statically. With the following snippet we can construct a Clash circuit and dump out the timing (we could also dump the timing directly from the SimpleCore type):

systemClockSHAPipelined
  :: Clock System 'Source
  -> Reset System 'Asynchronous
  -> DSignal System 0 (BitVector 512)
  -> DSignal System _ (BitVector 256)
systemClockSHAPipelined =
  exposeClockReset $ mkCircuitDataOnly $ SimpleCore Pipelined (SHA256 @1 @1)

TopEntity.hs:28:21: error:
    • Found type wildcard ‘_’ standing for ‘192 :: Nat’
      To use the inferred type, enable PartialTypeSignatures
    • In the type signature: ...
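The same wildcard trick can be tried without Clash. The sketch below is an assumption-laden toy: Delayed, stage, twoStages and delayOf are hypothetical names, and the 3-cycle stage is invented for illustration. With PartialTypeSignatures enabled GHC fills in the wildcard instead of reporting an error.

```haskell
{-# LANGUAGE DataKinds #-}
{-# LANGUAGE KindSignatures #-}
{-# LANGUAGE PartialTypeSignatures #-}
{-# LANGUAGE ScopedTypeVariables #-}
{-# LANGUAGE TypeOperators #-}
{-# OPTIONS_GHC -Wno-partial-type-signatures #-}

import Data.Proxy (Proxy (..))
import GHC.TypeLits

-- Hypothetical delay-indexed wrapper, loosely modelled on Clash's DSignal.
newtype Delayed (d :: Nat) a = Delayed a

-- A hypothetical stage that costs 3 cycles; here it only bumps the index.
stage :: Delayed d a -> Delayed (d + 3) a
stage (Delayed x) = Delayed x

-- The output delay is left as a wildcard for GHC to infer.
twoStages :: Delayed 0 Int -> Delayed _ Int
twoStages = stage . stage

-- Read the inferred delay back as a value.
delayOf :: forall d a. KnownNat d => Delayed d a -> Integer
delayOf _ = natVal (Proxy :: Proxy d)
```

Evaluating delayOf (twoStages (Delayed 7)) in GHCi gives the inferred delay for the two chained stages.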

These static delay annotations are less useful for synchronization. Specifically, synchronization between components appears to be a complex issue, likely best tackled with a DSL as briefly mentioned above. There is nothing in the types provided to state that an output is tied to a valid input. We assume from the timing that an output must be valid at a certain time. I’d like to be able to pass dependent pairs that contain proofs of correct usage, that is, proofs that an output value was produced from valid inputs. This is difficult when signals may or may not be valid, such as in InPlace. You can see an example of this breakdown in type safety in the definition of InPlace. Such advanced typing would be nice to have, but there is nothing wrong with the current method, and correctness can be, and is, tested with property checking and unit tests.

Project Notes - Folds specifying layout

Types implementing the class Composition are folds over the vector [0..rounds] with a combining function specified by an Iterable. How the circuit representing this is constructed is down to the composition type selected. This is unusual, as the output type is determined partially by the choice of fold type, and the fold type also changes the physical circuit (not necessarily visible at the type level). This allows the library to change the timing and signaling semantics of an iterated function. For example, Pipelined will fully unroll an Iterable, but InPlace iterates in place and adds valid/ready signaling. The extra valid/ready signaling is needed because InPlace has a reduced maximum in-flight capacity, determined by the static delay of an Iterable.
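To make the two fold shapes concrete, here is a small software model, not a circuit and not the library's code. It assumes a made-up round function (roundFn), at least one round, and round indices 0 to rounds - 1; pipelined, InPlace, step and inPlace are hypothetical names. The unrolled fold applies every round in one expression, while the in-place version reuses one round unit, takes one step per simulated cycle, and ignores new input while busy.

```haskell
import Data.Bits (rotateL, xor)
import Data.List (foldl')
import Data.Word (Word32)

-- Made-up round function standing in for an Iterable's combining step.
roundFn :: Int -> Word32 -> Word32
roundFn i s = (s `rotateL` 5) `xor` (fromIntegral i + 0x9e3779b9)

-- "Pipelined" shape: the fold is fully unrolled, one stage per round index.
pipelined :: Int -> Word32 -> Word32
pipelined rounds s0 = foldl' (\s i -> roundFn i s) s0 [0 .. rounds - 1]

-- "InPlace" shape: a single round unit plus a small state machine.
data InPlace = Idle | Busy Int Word32

-- One simulated clock cycle. Input is accepted only when Idle (the "ready"
-- side); the output is Just only on the final round (the "valid" side).
step :: Int -> InPlace -> Maybe Word32 -> (InPlace, Maybe Word32)
step _ Idle Nothing = (Idle, Nothing)
step _ Idle (Just x) = (Busy 0 x, Nothing)
step rounds (Busy i s) _
  | i + 1 >= rounds = (Idle, Just s')
  | otherwise = (Busy (i + 1) s', Nothing)
  where
    s' = roundFn i s

-- Feed one input, then clock the machine until it produces an output.
inPlace :: Int -> Word32 -> Word32
inPlace rounds x = go (fst (step rounds Idle (Just x)))
  where
    go st = case step rounds st Nothing of
      (_, Just out) -> out
      (st', Nothing) -> go st'
```

Both shapes compute the same value for a given input; what differs is how many inputs can be in flight and when the result appears.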

While the composition determines the physical circuit shape, it could go further. If Clash added support for physical placement constraints, the compositions could generate these constraints. Even more exotic, the fold type could determine things like dynamic clock adjustment to limit power usage, similar to AVX512 clock throttling.

Project Notes - Known out of reach features

Automatic parameter search on some target metric (for speed or size)

It would be useful to be able to perform a grid search (or something more intelligent) over the configuration space of cores/hashes. This would enable optimizing for speed, size, power, etc. While it’s possible to generate arbitrary instances (see my messy test code), evaluating the cores on some external/IO metric is not easily done inline with the core definitions. This is partly due to the design of Clash: to generate HDL we need to run the Clash compiler, but Clash does not have an IO entry point. It can be done through wrapper code, and testing code within the Clash project contains parts of what is required. Hopefully the Clash project will expose this functionality in the future.
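The search loop itself is simple; the hard part described above is the evaluation step. As a sketch only, the following assumes a hypothetical Config record and leaves the metric as a monadic callback, which in practice would have to wrap an HDL generation and synthesis run. None of these names come from hash-cores.

```haskell
import Data.List (minimumBy)
import Data.Ord (comparing)

-- Hypothetical configuration space: a composition choice plus two
-- register placement parameters.
data Composition = Pipelined | InPlace
  deriving (Show, Eq, Enum, Bounded)

data Config = Config
  { composition :: Composition
  , regA :: Int
  , regB :: Int
  } deriving (Show, Eq)

-- Enumerate every combination in an assumed, small parameter range.
configSpace :: [Config]
configSpace =
  [ Config c a b
  | c <- [minBound .. maxBound]
  , a <- [0 .. 2]
  , b <- [1 .. 2]
  ]

-- Evaluate every configuration with an effectful metric (lower is better)
-- and return the best one, or Nothing for an empty space.
gridSearch
  :: (Monad m, Ord score)
  => (Config -> m score)
  -> [Config]
  -> m (Maybe (Config, score))
gridSearch _ [] = pure Nothing
gridSearch evaluate cfgs = do
  scored <- mapM (\c -> (,) c <$> evaluate c) cfgs
  pure (Just (minimumBy (comparing snd) scored))
```

Keeping the metric in an arbitrary monad means the same loop works with a pure cost model in tests and with an IO action that shells out to a toolchain.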

Auto vectorization

The Composition and Iterable types are agnostic to the types within their signals; the input/output types are defined by their implementations. Iterables could be made to support auto vectorization, with type parameters specifying the vector width of data types. I’m not sure of the use case for such massively parallelized/pipelined hash functions, as a Pipelined core will already have huge throughput, but it would be interesting to see if these “SIMD”-like types would enable extra efficiency with extra registering.

“Exotic” Stateful Hash Implementations

When it comes to enumeration of some weaker hash functions, it is possible to pass message deltas rather than the whole state vector, see http://nsa.unaligned.org/hash.php. As per the author’s definition, this requires that the hash function have a linear key expansion block, which includes MD5 and SHA1, but unfortunately/fortunately not SHA-256. When enumerating MD5 and SHA1 it would give a significant improvement in resource utilization and performance. However, it requires the hash implementation to be stateful, which would not be compatible with the InPlace composition.

Roadmap

Currently a SimpleCore with SHA256 (Haskell tests and tested on device) or SHA1 (untested) can be composed with Pipelined or InPlace. Now that the library is moving from concept to, hopefully, something useful, I’d like to expand the primitives available. My priority outline is to add:

Useful circuits such as HMAC and PBKDF2 (which will work with arbitrary hash functions)
Generator/filter functions
AXI or other interfaces
DSP and other configuration lists, in place of the current per-hash, non-descriptive naturals

Optimistically leading to mkCircuitable definitions such as:

type MyWPAEnumerator
  = AXICore
      CounterGenerator 
      PrefixFilter 
      (SimplePBKDF2 InPlace (SHA256 [DSPAdders, SlackRegisters 1]))
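The roadmap's point that HMAC can work with arbitrary hash functions is easy to see in a software model. The sketch below follows the standard HMAC construction over plain byte lists; it is not a circuit and not part of hash-cores, and the names Hash and hmac are hypothetical. The hash function and its block size are passed in as parameters.

```haskell
import Data.Bits (xor)
import Data.Word (Word8)

-- Any function from bytes to a digest can be plugged in.
type Hash = [Word8] -> [Word8]

-- HMAC(K, m) = H((K' xor opad) ++ H((K' xor ipad) ++ m)), where K' is the
-- key brought to exactly one block: hashed first if too long, then
-- zero-padded on the right.
hmac :: Int -> Hash -> [Word8] -> [Word8] -> [Word8]
hmac blockSize hash key msg = hash (outerPad ++ hash (innerPad ++ msg))
  where
    shortKey = if length key > blockSize then hash key else key
    blockKey = take blockSize (shortKey ++ repeat 0)
    innerPad = map (xor 0x36) blockKey
    outerPad = map (xor 0x5c) blockKey
```

Because the construction only calls the hash as a black box twice, a hardware version could in principle reuse whichever core and composition the hash itself was built with.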

If you have an interest in the project, feedback or collaboration is more than welcome via the GitHub page, or directly at blog@blaxill.org.