Replication
Support from the manufacturer may be required.
About the replication layer
Replication (RL) on DINAMO is an implementation of a synchronous multi-master system, in which every device in a pool accepts requests to change data (such as creating/removing keys or changing user passwords). Any data modified on an HSM is transmitted from the node where the operation took place to all the other participants before the distributed transaction is finally confirmed.
Transmission is carried out using the well-known two-phase commit (2PC) protocol, widely used in the database industry. 2PC has some disadvantages, but it remains popular in practice: it is simple, efficient, and provides a correct solution to the problem of consensus between nodes. Its drawback is that it can leave distributed operations blocked.
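To make the blocking behavior concrete, below is a minimal 2PC sketch in Python. The Participant class and its prepare/commit/abort methods are illustrative assumptions, not the DINAMO API; the point is that an unreachable participant leaves the outcome unknown, so the transaction must stay pending.

```python
from enum import Enum


class Vote(Enum):
    YES = "yes"
    NO = "no"
    UNREACHABLE = "unreachable"  # node down, network failure, etc.


class Participant:
    """Stand-in for a pool node; always votes YES in this sketch."""

    def prepare(self, tx):
        return Vote.YES

    def commit(self, tx):
        pass

    def abort(self, tx):
        pass


def two_phase_commit(participants, tx):
    # Phase 1 (prepare): collect a vote from every participant.
    votes = [p.prepare(tx) for p in participants]

    if any(v is Vote.UNREACHABLE for v in votes):
        # Blocking case: the outcome is unknown, so the transaction
        # stays pending (a PTL, in RL terms) until the missing node
        # returns or is administratively declared down.
        return "pending"

    # Phase 2: unanimous YES commits; anything else aborts everywhere.
    if all(v is Vote.YES for v in votes):
        for p in participants:
            p.commit(tx)
        return "committed"
    for p in participants:
        p.abort(tx)
    return "aborted"
```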
Consistency is an intrinsic feature of the HSM. No client-side logic or infrastructure is required to handle non-deterministic operation results. If RL returns one of the documented return codes, the troubleshooting steps below can be followed; most of the time they are sufficient to reveal the underlying issue, and if manufacturer intervention does prove necessary, the debugging data gathered along the way enables more agile support.
In its effort to operate consistently, RL was designed to detect and avoid split-brain scenarios. To this end, the Sync-Point (SP) was conceived as a way of encoding the entire state of an HSM in a single number. With just one protocol-level validation, DINAMO is able to know whether its peers share the same state.
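A sketch of the validation this enables, assuming each peer's SP has already been collected (the node addresses and the second SP value below are made-up example data, not real output):

```python
def pool_is_consistent(sync_points):
    """sync_points: mapping of node address -> SP hex string.

    A single comparison suffices: more than one distinct SP means
    the pool state has diverged (a potential split-brain scenario).
    """
    return len(set(sync_points.values())) == 1


# Example: node 10.0.0.2 has fallen out of sync with its peers.
pool = {
    "10.0.0.1": "CA110F4B3A0662A2",
    "10.0.0.2": "7D309915CC01EEF0",  # divergent (made-up value)
    "10.0.0.3": "CA110F4B3A0662A2",
}
if not pool_is_consistent(pool):
    print("sync-point mismatch: pool state has diverged")
```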
Initial checks
1. As noted above, all the HSMs in a pool must communicate for proper distributed operation. Check that every device has the TCP/IP addresses of the other HSMs in its node cache. For example, in a pool made up of two HSMs, A and B: A must have B's IP registered and, likewise, B must have A's IP (it does not matter whether the addresses were obtained via auto-discovery or registered manually). The check can be done via the local or remote consoles (a connectivity sketch covering this step and the next appears after the list);
2. Check that the HSMs can communicate via the addresses identified in [1.]. The communication test can be carried out via the local or remote consoles;
3. Check that all the equipment in the same pool shares the same sync-point (SP). The SP is a hexadecimal number (e.g. CA110F4B3A0662A2; a small corresponding check number, such as 5058, will also be available, making it easier to perform this step). The SP check can be done via the local or remote consoles (the sync-point comparison sketch above illustrates the logic). If [3.] is not met, the divergent equipment must be synchronized using an official image for the pool. Live-sync can be run for this purpose without interrupting services; backups/restores can also be carried out, with the disadvantage that they are offline processes that require adjustments to network settings. Live-sync is the recommended method: after synchronization through it, pre-existing nodes have their caches updated with the new IP addresses as part of the procedure and start operating with the new HSMs automatically;
4. Check whether one or more HSMs in the pool have a pending transaction log (PTL). As highlighted in the 2PC discussion, distributed write operations (e.g. key creation) can be blocked if a previous transaction is still in progress. Normally, given the highly asynchronous operational nature of RL, pending transactions are resolved automatically by the replication-manager in a matter of minutes. Each PTL carries all the information needed for commits; if one of the nodes participating in the distributed operation cannot be reached (hardware failure, network, etc.), the pool remains blocked. Node-down administrative messages can be used to inform the pool that one or more devices will no longer be in operation, allowing PTLs dependent on unavailable HSMs to be committed (and the pool finally unblocked); the decision sketch after this list makes this concrete. The node-down operation can only be done via the remote console.
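The first two checks amount to verifying a full mesh: every node must know, and be able to reach, every other node. A minimal sketch of that logic, assuming the node caches have been read into a dictionary; the port below is a placeholder rather than a documented value, and reachability is tested from the machine running the script rather than from the HSMs themselves (which is what the consoles do):

```python
import socket


def check_full_mesh(node_caches, port, timeout=3.0):
    """node_caches: mapping of node IP -> set of peer IPs in its cache.

    Returns a list of human-readable problems (empty when checks pass).
    """
    problems = []
    nodes = set(node_caches)

    # Check [1.]: every node's cache must contain every other node.
    for node, cache in node_caches.items():
        for peer in sorted(nodes - {node} - set(cache)):
            problems.append(f"{node} does not have {peer} in its cache")

    # Check [2.]: every registered address must answer on the
    # replication port (tested from this machine, as an approximation).
    for node, cache in node_caches.items():
        for peer in sorted(cache):
            try:
                with socket.create_connection((peer, port), timeout=timeout):
                    pass
            except OSError:
                problems.append(f"{node} -> {peer}:{port} unreachable")
    return problems


# Two-HSM pool as in the example in [1.]; the port is a placeholder.
caches = {"10.0.0.1": {"10.0.0.2"}, "10.0.0.2": {"10.0.0.1"}}
for problem in check_full_mesh(caches, port=4433):
    print(problem)
```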
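And for step [4.], a hedged sketch of the decision that unblocking involves: a PTL can complete once every participant has either confirmed the prepare phase or been declared down. The data shapes here are assumptions for illustration, not the actual PTL format:

```python
def ptl_can_commit(participants, acknowledged, declared_down):
    """participants: nodes involved in the distributed operation.
    acknowledged: nodes that confirmed the prepare phase.
    declared_down: nodes removed via node-down messages.

    The PTL stays pending while any participant is unaccounted for.
    """
    unresolved = set(participants) - set(acknowledged) - set(declared_down)
    return not unresolved


ptl = {"10.0.0.1", "10.0.0.2"}
acks = {"10.0.0.1"}                             # 10.0.0.2 failed mid-transaction
print(ptl_can_commit(ptl, acks, set()))         # False: pool blocked
print(ptl_can_commit(ptl, acks, {"10.0.0.2"}))  # True: node-down unblocks it
```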