
I have this (maybe) wild idea to improve the backup ability of my node and I want to get some critical eyes on it before I go down a rabbit hole and build it.
Of course, everyone should use a RAID setup for their hard drives to protect against drive failures, but some nodes run on systems without RAID capabilities. Regardless, a single hard-drive failure is only one way your home lightning node could die. What if you run your node at home and your house catches fire? What if your node fails more catastrophically than a disk failure and corrupts the channel.db (and faithfully writes the corrupted file to your magic RAID storage)?
Let's assume complete system failure and location destruction of a non-cloud node.
A common pleb recovery method, advertised as the recovery of last resort, is to use an SCB to force close all your channels. This effectively destroys your node in order to recover funds: you have to start over by reconnecting to all the nodes and rebalancing / reconfiguring rates. If you don't want your node to die in this kind of situation, you need better architecture. But we aren't trying to run services in AWS/Google; we are trying to be sovereign individuals running things on our own hardware at home/office/rented storage facility, and we want this to be doable by the masses, so no complex and expensive setup. We need architectures that support simple hardware configurations, remote recovery, and low cost/maintenance. So what we want is something more like the ability to Migrate Safely to a New Device, but with the caveat that our primary node has literally burned down and cannot be used as a copy source at the time of recovery.

Enter the Spare Tire

I've set up a Raspberry Pi 5 as a clone of my lightning node. This is the "spare tire": it's lower-spec hardware than my live node, but it is intended to come online and replace the live node in the case of a catastrophic meltdown.
But here's the tricky bit: keeping the clone up to date enough to do better than a channel backup that only allows a force close. The goal is to be able to bring the node back online with all the active HTLC state, as if the node didn't die.
On the spare tire:
  • full bitcoin node available and live syncing
  • lnd node ready to run but turned off
  • wireguard config for same public IP VPN (e.g. tunnelsats config)
  • copy of primary node ssh pub key in ~/.ssh/authorized_keys (so primary machine can push updates)
  • telegram bot link to a command that allows the node to come online by remote trigger from a phone
Then on the primary machine:
  • A file watcher script that uploads the entire lnd/data/graph/mainnet/channel.db to the spare tire via scp every time it changes, while notifying telegram of the upload status/progress and writing a matching version file to the spare tire (in case the node dies mid-upload)
  • A telegram bot link that posts messages like "uploading version x" and "saved x" (if the node dies and "uploading x" was the last message, then you know there was a missed channel.db update)
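A minimal sketch of the "upload, then write a matching version file" step, to make the idea concrete. Everything here is an assumption for illustration: the function names, the VERSION file format, and the use of a local directory as a stand-in for the scp target. It also does nothing about the fact that a copy taken while lnd is writing may not be internally consistent.

```python
import hashlib
import os
import shutil
from pathlib import Path


def file_sha256(path: Path) -> str:
    """Hash a file in 1 MiB chunks so a large channel.db is not loaded into memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def push_snapshot(src: Path, dest_dir: Path, version: int, notify=print) -> str:
    """Copy src into dest_dir as channel.db, then record version and hash.

    dest_dir is a local path standing in for the spare tire (hypothetical);
    notify stands in for the telegram bot.
    """
    dest_dir.mkdir(parents=True, exist_ok=True)
    digest = file_sha256(src)
    notify(f"uploading version {version}")

    # Copy under a temporary name so a crash mid-copy never leaves a
    # half-written channel.db in place.
    tmp = dest_dir / "channel.db.part"
    shutil.copyfile(src, tmp)
    if file_sha256(tmp) != digest:
        tmp.unlink()
        raise IOError("copy does not match source (source changed or transfer failed)")
    os.replace(tmp, dest_dir / "channel.db")

    # The version file is written last, so it only ever describes a finished copy.
    version_tmp = dest_dir / "VERSION.part"
    version_tmp.write_text(f"{version} {digest}\n")
    os.replace(version_tmp, dest_dir / "VERSION")
    notify(f"saved {version}")
    return digest


def snapshot_is_complete(dest_dir: Path) -> bool:
    """True only if channel.db on the spare matches the hash in VERSION."""
    version_file = dest_dir / "VERSION"
    db = dest_dir / "channel.db"
    if not (version_file.exists() and db.exists()):
        return False
    _, digest = version_file.read_text().split()
    return file_sha256(db) == digest
```

On the spare tire, snapshot_is_complete would be checked before lnd is allowed to start. A mismatch means the last upload was cut off, which is the same thing the "uploading x" without "saved x" telegram message would tell you.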

The downside:

If the node is highly active, channel.db is going to change a LOT. This could cause file read and network bottlenecks, so the file watch + upload script will need to cancel the current upload job whenever a new one arrives (after all, the current job is already out of date and could carry a bad HTLC state).
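A rough sketch of that "cancel the stale upload, start the newest one" behaviour. The class and function names are hypothetical, and a local chunked copy stands in for the network transfer; a real scp job would need to be killed as a subprocess instead.

```python
import os
import threading
from pathlib import Path


def copy_unless_cancelled(src: Path, dest: Path, cancel: threading.Event,
                          chunk_size: int = 1 << 20) -> bool:
    """Copy src to dest in chunks, checking the cancel flag before each chunk.

    Returns True if the copy finished, False if it was abandoned.
    """
    tmp = dest.with_name(dest.name + ".part")
    finished = False
    with open(src, "rb") as fin, open(tmp, "wb") as fout:
        while not cancel.is_set():
            chunk = fin.read(chunk_size)
            if not chunk:
                finished = True
                break
            fout.write(chunk)
    if finished:
        os.replace(tmp, dest)  # only a complete copy ever replaces dest
    else:
        tmp.unlink()           # discard the stale partial copy
    return finished


class LatestWinsUploader:
    """Runs at most one copy at a time; a new change cancels the running one."""

    def __init__(self, src: Path, dest: Path):
        self.src = src
        self.dest = dest
        self._lock = threading.Lock()
        self._cancel = None
        self._thread = None

    def on_change(self) -> None:
        """Call this from the file watcher whenever channel.db changes."""
        with self._lock:
            if self._cancel is not None:
                self._cancel.set()      # tell the running job it is out of date
            if self._thread is not None:
                self._thread.join()     # wait for it to clean up its .part file
            self._cancel = threading.Event()
            self._thread = threading.Thread(
                target=copy_unless_cancelled,
                args=(self.src, self.dest, self._cancel),
            )
            self._thread.start()

    def wait(self) -> None:
        """Block until the most recent job has finished."""
        with self._lock:
            thread = self._thread
        if thread is not None:
            thread.join()
```

Note the trade-off this makes visible: on a busy node every job may get cancelled before it completes, so the spare tire can fall arbitrarily far behind.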

Improvements:

Rather than a file watch, the upload script could hook into lnd and only upload the channel.db when an HTLC changes (rather than on other innocuous updates). This would also make the uploads less frequent and less error-prone.

Questions

  • Has anyone done anything like this?
  • Do you have a better idea?
  • Is there a reason to think this is a terrible idea?
  • What other ways could this be improved to limit impact?
Definitely a terrible idea to attempt but you're not wrong in wanting something like this. You have to be 100% sure that the backup has the exact same state as the original 100% of the time with no microsecond where it might be different.
All filesystem writes should be done against a network drive or distributed store. You should also be 100% sure that the backup doesn't kick in prematurely and that it is impossible for 2 of them to be running at the same time.
At Mutiny we allow a node to run on multiple devices, and it's decent enough for us to resolve problems when they occur. But in a routing node context, funds lost to a bug that broadcasts a revoked state will be close to impossible to get back.
reply
Update: Now that I have increased my channels and have constant in-flight HTLCs pending, I see that this is a totally unrealistic idea. I even just had a channel peer send me an older tail height (because his channel.db got out of date) and my node triggered a force close... that may turn into a full post for discussion.
reply
So you already figured out that syncing the whole file on each state change isn't going to work. Here are two alternatives:
  • CLN / eclair already have it, and LND is going to put more work into it this year and next: the move from boltdb to postgres (and SQL). It's on the roadmap, and pg allows synchronous replication across clusters, which is ideal for disaster failover.
  • If you want to try something on your own, just to tinker with the idea: remote RAID solutions like drbd 👉 apt install drbd8-utils offer just that. You mount a remote server, ideally through a tunnel, and set up drbd. As a result, the channel.db state gets written to both devices at the same time, just like a local RAID.
Of course, the latter places high demands on your network availability and latency. But it's better than nothing in case of catastrophic failure.
reply
I'd try StaticWire over Tunnelsats. StaticWire will give you a complete IP address all to yourself.
reply
What advantage does having a full IP address provide? For IPv4 routing of other ports? In my case, I don't want to expose any other ports...
reply
Yes, you can route all ports. You can use an RPC, accept incoming peer connections, run a btcpay server, etc., all on standard ports.
reply
I didn't know about this. Interesting offer.
In comparison, you have more port flexibility with StaticWire, vs more straightforward simplicity and better pricing with Tunnelsats. Is that a fair assessment?
reply
With StaticWire you have all UDP and TCP ports.
I don't think Tunnelsats is simpler, because you have to run on non-standard ports. Tunnelsats should be cheaper for sure, because they aren't allocating you as much capability with their service and you have to deal with the complexity of non-standard ports.
reply
There is no complexity in dealing with non-standard ports. If you're somewhat decently connected, gossip needs 2-7 minutes to propagate changes across 95% of the network.
And once you have the custom port, it doesn't change with your subscription. So port 1377 is as good as 9735; it doesn't matter at all.
reply
I tested this with inotify-tools, copying the file on every write. The problem lies in why the primary write fails in the first place, since the backup will break too. A failure boils down to being unable to read the latest state.
Reality is that a production routing node can't be allowed to have storage fail, period... Else it's dead.
Use nvme raid and you'll have bigger problems to worry about.
reply
What I'm trying to anticipate is a failure that isn't based on a disk write fail, but let's say you forward a transaction, write your db to disk, and then a few minutes later (before any new transactions are sent to your node), your node catches fire and dies in a heap of ash? In that case, the last write to the channel.db was great, it's just that you can't get it anymore.
reply
You have no way of being sure that the db got the latest tx, and when shit goes sideways it's often a panic when writing, so no, this isn't a great option... You risk losing your funds if you ever have to use it for recovery, which is why SCB is the way it is.
reply
Maybe Syncthing with file versioning? Use ECC memory if possible to reduce chances of corruption. The cloud can also fail, your concerns are not exclusive to self hosting.
reply
Setting up a spare Raspberry Pi as a backup lightning node with remote syncing and failover capabilities. Any thoughts?