I have a (maybe) wild idea to improve the backup capability of my node, and I want to get some critical eyes on it before I go down a rabbit hole and build it.
Of course, everyone should use a RAID setup for their hard drives to protect against drive failures, but some nodes run on systems without RAID capability. Regardless, a single hard-drive failure is only one way your home lightning node could die. What if you run your node at home and your house catches fire? What if your node fails more catastrophically than a disk failure and corrupts the channel.db (and faithfully writes the corruption to your magic RAID storage)?
Let's assume complete system failure and location destruction of a non-cloud node.
A common pleb recovery method, advertised as the recovery of last resort, is to use an SCB (static channel backup) to force close all your channels. This effectively destroys your node in order to recover the funds: you have to start over by reconnecting to all your peers and rebalancing / reconfiguring fee rates. If you don't want your node to die in this kind of situation you need better architecture. But we aren't trying to run services in AWS/Google; we are trying to be sovereign individuals running things on our own hardware at home, in the office, or in a rented storage unit, and we want this to be doable by the masses, so no complex and expensive setups. We need architectures that support simple hardware configurations, remote recovery, and low cost/maintenance. What we want is something more like the ability to Migrate Safely to a New Device, but with the caveat that the primary node has literally burned down and cannot be used as a copy source at the time of recovery.
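For context, that last-resort SCB path is what the idea below tries to improve on; with lnd's built-in commands it looks roughly like this (file paths are examples):

```bash
# On the live node (or from any off-site copy of channel.backup you've kept):
lncli exportchanbackup --all --output_file /home/admin/channel.backup

# On a fresh node restored from the same seed: force close everything via the SCB.
lncli restorechanbackup --multi_file /home/admin/channel.backup
```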
Enter the Spare Tire
I've set up a Raspberry Pi 5 as a clone of my lightning node. This is the "spare tire": it's lower-spec hardware than my live node, but it's intended to come online and replace the live node in the case of a catastrophic meltdown.
But here's the tricky bit: keeping the clone up to date enough to do better than just providing a channel backup for a force close. The goal is to be able to bring the node back online with all of its active HTLC state, as if the node never died.

On the spare tire:
- full bitcoin node available and live syncing
- lnd node ready to run but turned off
- WireGuard config for the same public-IP VPN (e.g. a Tunnelsats config)
- copy of the primary node's SSH public key in ~/.ssh/authorized_keys (so the primary machine can push updates)
- Telegram bot linked to a command that allows the node to come online by remote trigger from a phone (a minimal sketch follows this list)
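Here's a minimal sketch of that trigger, long-polling the Telegram Bot API's getUpdates and starting lnd when it sees a command. BOT_TOKEN, CHAT_ID, and /startnode are placeholders, and it assumes lnd is installed as a systemd service and curl/jq are available:

```bash
#!/usr/bin/env bash
# spare-tire-trigger.sh: poll Telegram and start lnd when the owner sends /startnode

BOT_TOKEN="123456:replace-me"   # placeholder bot token
CHAT_ID="111111111"             # placeholder: only accept commands from this chat
API="https://api.telegram.org/bot${BOT_TOKEN}"
OFFSET=0

while true; do
  # Long-poll for new messages (30s timeout keeps traffic low)
  UPDATES=$(curl -s "${API}/getUpdates?offset=${OFFSET}&timeout=30")

  # Walk each update and remember the highest update_id so it isn't reprocessed
  while read -r ID FROM TEXT; do
    OFFSET=$((ID + 1))
    [ "$FROM" != "$CHAT_ID" ] && continue   # ignore strangers
    if [ "$TEXT" = "/startnode" ]; then
      curl -s "${API}/sendMessage" -d chat_id="${CHAT_ID}" -d text="Spare tire: starting lnd..." > /dev/null
      sudo systemctl start lnd              # assumes an lnd systemd unit and sudo rights
    fi
  done < <(echo "$UPDATES" | jq -r '.result[] | "\(.update_id) \(.message.chat.id) \(.message.text)"')
done
```

Run it on the spare tire under systemd or in a tmux session, and the only thing needed to bring the node up is a phone.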
Then on the primary machine:
- A file watcher script that scp uploads the entire lnd/data/graph/mainnet/channel.db to the spare tire every time it changes (sketched below), notifying Telegram of the upload status/progress and writing a version file to the spare tire to match (in case the node dies mid upload).
- A Telegram bot link that posts messages like "uploading version x" and "saved x" (if the node dies and the last message you got was "uploading x", you know a channel.db update was missed).
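A minimal sketch of that watcher, assuming inotify-tools on the primary machine, key-based SSH to the spare tire, and the same placeholder bot token/chat id as above (paths and hostnames are examples):

```bash
#!/usr/bin/env bash
# channeldb-push.sh: on every change to channel.db, push a copy to the spare tire
# and write a matching version file so a half-finished upload can be detected.

DB="/home/admin/.lnd/data/graph/mainnet/channel.db"   # adjust to your lnd dir
SPARE="admin@spare-tire"                              # placeholder ssh host for the Pi
DEST="/home/admin/.lnd/data/graph/mainnet"
BOT_TOKEN="123456:replace-me"
CHAT_ID="111111111"
VERSION=0

notify_telegram() {
  curl -s "https://api.telegram.org/bot${BOT_TOKEN}/sendMessage" \
    -d chat_id="${CHAT_ID}" -d text="$1" > /dev/null
}

# Block until channel.db is modified, then ship it and record the version.
while inotifywait -qq -e modify "$DB"; do
  VERSION=$((VERSION + 1))
  notify_telegram "uploading version ${VERSION}"
  if scp -q "$DB" "${SPARE}:${DEST}/channel.db" \
     && echo "$VERSION" | ssh "$SPARE" "cat > ${DEST}/channel.db.version"; then
    notify_telegram "saved ${VERSION}"
  else
    notify_telegram "FAILED upload of version ${VERSION}"
  fi
done
```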
The downside:
If the node is highly active, channel.db is going to change a LOT. This could create file-read and network bottlenecks, so the filewatch + upload script will need to cancel the current upload job in order to accept a new one (after all, the current job is already out of date and could carry a bad HTLC state).
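One way to do that within the shell sketch above: run each upload in a background subshell, remember its PID, and kill it (and its scp child) whenever a newer version shows up. DB, SPARE, DEST, and notify_telegram are the hypothetical names from the earlier script:

```bash
# Drop-in replacement for the straight scp/ssh block in channeldb-push.sh.
UPLOAD_PID=0

start_upload() {
  local version="$1"
  # A newer channel.db makes any running upload stale: kill the scp first, then its wrapper.
  if [ "$UPLOAD_PID" -ne 0 ] && kill -0 "$UPLOAD_PID" 2>/dev/null; then
    pkill -P "$UPLOAD_PID" 2>/dev/null
    kill "$UPLOAD_PID" 2>/dev/null
    notify_telegram "cancelled stale upload, superseded by version ${version}"
  fi
  (
    notify_telegram "uploading version ${version}"
    scp -q "$DB" "${SPARE}:${DEST}/channel.db" \
      && echo "${version}" | ssh "$SPARE" "cat > ${DEST}/channel.db.version" \
      && notify_telegram "saved ${version}"
  ) &
  UPLOAD_PID=$!
}

# In the watcher loop, call:
#   start_upload "$VERSION"
```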
Improvements:
Rather than a file watch, the upload script could hook into lnd and only upload channel.db when the HTLC state changes (rather than on other innocuous updates). This would also make the uploads less frequent and less error prone.
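I haven't built this part, but a crude approximation (polling rather than a true hook): ask lncli listchannels for the fields that move when HTLCs change, and only trigger an upload when they differ from last time. A proper version would subscribe to lnd's HTLC event stream instead. The sketch assumes lncli and jq on the primary machine, plus the hypothetical start_upload helper from above:

```bash
#!/usr/bin/env bash
# htlc-watch.sh: trigger uploads only on HTLC-relevant channel state changes,
# approximated by hashing the pending HTLCs and update counts from lncli.

LAST_HASH=""
VERSION=0

while true; do
  HASH=$(lncli listchannels \
    | jq -c '[.channels[] | {chan_id, pending_htlcs, num_updates}]' \
    | sha256sum | cut -d' ' -f1)

  if [ -n "$HASH" ] && [ "$HASH" != "$LAST_HASH" ]; then
    LAST_HASH="$HASH"
    VERSION=$((VERSION + 1))
    start_upload "$VERSION"   # cancellable upload from the earlier sketch
  fi
  sleep 5
done
```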
Questions
- Has anyone done anything like this?
- Do you have a better idea?
- Is there a reason to think this is a terrible idea?
- What other ways could this be improved to limit impact?
apt install drbd8-utils
DRBD offers just that. You connect your node to a remote server, ideally through a tunnel, and set up DRBD between them. As a result, the channel.db state gets written to both devices at the same time, just like a local RAID.
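For anyone wanting to try that route, a rough sketch of the DRBD side: hostnames, IPs, and the backing partition below are placeholders, the `on` names must match uname -n on each box, the replication traffic should run over the WireGuard tunnel, and the exact syntax may differ slightly between DRBD versions.

```bash
# On BOTH machines: install the tools and describe the mirrored resource.
apt install drbd8-utils

cat > /etc/drbd.d/lnd.res <<'EOF'
resource lnd {
  protocol C;                  # synchronous: a write completes only once both sides have it
  device    /dev/drbd0;
  disk      /dev/sdb1;         # placeholder backing partition dedicated to lnd data
  meta-disk internal;
  on primary-node {
    address 10.0.0.1:7789;     # placeholder tunnel IPs
  }
  on spare-tire {
    address 10.0.0.2:7789;
  }
}
EOF

drbdadm create-md lnd && drbdadm up lnd    # on both machines

# On the primary only: take the primary role, create a filesystem, mount it for lnd.
drbdadm primary --force lnd
mkfs.ext4 /dev/drbd0
mount /dev/drbd0 /home/admin/.lnd
```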