Running a Lightning node in a safe and redundant way is not trivial. Scheduled backups are out of the question since we always need the absolute latest channel states. Restoring from even a slightly outdated backup can lead to channel breaches and loss of funds.
So how do we recover our Lightning node from, say, a disk failure?
One compromise is a static channel backup. However, this assumes a certain level of trust in the channel partners. We also have the problem that the node is offline until we set up a new one and restore from the static channel backup.
This can be overcome with redundancy. We can run a cluster of 3 nodes that act as a single Lightning node, replicating the channel state between the nodes in real time. If one node fails, one of the remaining two will take over.
As I could not find a detailed guide online on how to run a highly available Lightning node, I decided to write one myself. Here is the result of my research and testing:
This repository provides an example of how to set up a highly available LND Lightning node by running it as a 3-node cluster. The state is stored in a replicated etcd database. The active leader node is always accessible via the same floating IP address and a Tor hidden service.
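To make the failover idea concrete, here is a toy Python sketch of lease-based leader election. It is not how LND or etcd are implemented: `LeaseStore`, `simulate`, and the node names are hypothetical, and the "replicated store" is just an in-memory object. It only illustrates the general pattern where one node holds a time-limited lease and another node can take over once that lease expires.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Lease:
    holder: str        # name of the node that currently owns the lease
    expires_at: float  # time at which the lease stops being valid


class LeaseStore:
    """In-memory stand-in for a replicated store (toy model, hypothetical)."""

    def __init__(self, ttl: float) -> None:
        self.ttl = ttl
        self.lease: Optional[Lease] = None

    def try_acquire(self, node: str, now: float) -> bool:
        """Grant or renew the lease if it is free, expired, or already ours."""
        lease = self.lease
        if lease is None or lease.expires_at <= now or lease.holder == node:
            self.lease = Lease(holder=node, expires_at=now + self.ttl)
            return True
        return False

    def leader(self, now: float) -> Optional[str]:
        """Return the holder of a still-valid lease, if there is one."""
        lease = self.lease
        if lease is not None and lease.expires_at > now:
            return lease.holder
        return None


def simulate(ttl: int, fail_at: int, steps: int) -> List[Optional[str]]:
    """Run three nodes for `steps` ticks; node-a crashes at tick `fail_at`."""
    store = LeaseStore(float(ttl))
    nodes = ["node-a", "node-b", "node-c"]
    history: List[Optional[str]] = []
    for t in range(steps):
        for node in nodes:
            if node == "node-a" and t >= fail_at:
                continue  # node-a is down and no longer renews its lease
            store.try_acquire(node, float(t))
        history.append(store.leader(float(t)))
    return history


if __name__ == "__main__":
    # Lease holder per tick; node-a crashes at tick 2.
    print(simulate(ttl=3, fail_at=2, steps=8))
```

In this model the lease still names node-a for two ticks after it has crashed, and node-b only takes over once the lease has expired. That delay is the price of making sure two nodes never hold the lease at the same time.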
Can you wrap that single server as a Docker container? If so, putting it in Kubernetes would solve many issues... I will check OP's repo to get more info, but you have a nice idea there.
With healthcheck.leader.interval set to 60 seconds and cluster.leader-session-ttl set to 100 seconds, I could no longer produce a situation where multiple nodes were active at the same time.
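In lnd.conf terms, those two settings would look roughly like the fragment below. The option names and values are the ones quoted above. The section headers, the `60s` duration format, the plain-integer seconds format, and the comments describing each option are assumptions based on my reading of lnd's sample config, so check them against the sample-lnd.conf shipped with your lnd version.

```ini
[healthcheck]
; Assumed meaning: how often the node checks that it is still the leader.
healthcheck.leader.interval=60s

[cluster]
; Assumed meaning: seconds until the leader session expires
; and another node may be elected.
cluster.leader-session-ttl=100
```

The point of interest is the relationship between the two values: the health check interval is shorter than the session TTL.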