reply on: Highly available Lightning node cluster setup guide \ stacker news ~lightning

155 sats \ 14 replies \ @justin_shocknet 17 Aug 2024 \ parent \ on: Highly available Lightning node cluster setup guide lightning

they've been moving to sql because etcd is too limited for proper HA iiuc... and I don't think lnd has an active/passive mode yet, this would seem to be relying on the loadbalancer to handle active-passive

OP can you break this down a bit? at a glance this seems more perilous than simply running lnd on VM's over ZFS

37 sats \ 6 replies \ @Filiprogrammer OP 17 Aug 2024

OP can you break this down a bit? at a glance this seems more perilous than simply running lnd on VM's over ZFS

ZFS is designed to only run on a single server. So if that server fails, the node will be down. If we are trying to achieve high availability, we need a distributed system where ideally every server has its own uninterrupted power supply.

I explain the setup in more detail in the linked guide.

52 sats \ 5 replies \ @justin_shocknet 17 Aug 2024

Yea I meant something like Ceph over it

117 sats \ 0 replies \ @JesseJames 18 Aug 2024

Ceph and cnpg (Cloud Native postgres) would be a nice fit.
Can you wrap that single server as a Docker container? If so, putting it in Kubernetes would solve many issues... I will check OP's repo to get more info, but you have a nice idea there.

44 sats \ 3 replies \ @Filiprogrammer OP 17 Aug 2024

I did consider trying with bbolt on top of Ceph, but since etcd is already implemented in lnd it seemed like the more native approach to use etcd. But I am planning to compare this to a setup with Ceph and do some benchmarks.

53 sats \ 2 replies \ @justin_shocknet 17 Aug 2024

Cool, I'll be following. It's been too long with LND as the only implementation that's somewhat production ready and not having HA or even an SQL backend... would also like to know more about the cluster awareness so a passive node doesn't broadcast something

475 sats \ 1 reply \ @Filiprogrammer OP 17 Aug 2024

LND has actually had support for leader election for at least 3 years already. Some documentation on it can be found here: https://docs.lightning.engineering/lightning-network-tools/lnd/leader_election

But during my testing I did manage to get two nodes to become active at the same time, which is bad. I described it in this issue: https://github.com/lightningnetwork/lnd/issues/8913

This was an LND bug, where it would not resign from its leader role. etcd was working as it should.

Two weeks later the bug got fixed with this pull request: https://github.com/lightningnetwork/lnd/pull/8938

With the patch applied, healthcheck.leader.interval set to 60 seconds and cluster.leader-session-ttl set to 100 seconds, I could no longer produce a situation where multiple nodes were active at the same time.

With this configuration, each lnd node creates an etcd lease with a time-to-live of 100 seconds. This lease is kept alive at intervals of one third of the initial time-to-live, so in this case roughly every 33 seconds. When a node loses its connection to the rest of the cluster, it takes 27-60 seconds to initiate a shutdown, and it takes 66-100 seconds for another node to take over. The disconnected node has therefore shut down (after at most 60 seconds) before any other node can take over (after at least 66 seconds), so there is no room for overlap and no chance of two nodes being active at the same time.
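To make the arithmetic above easier to check, here is a minimal sketch in Python. The 27-60 second shutdown window is taken as given from the numbers above. The takeover window is my reading of them: the last successful keepalive happened somewhere between 0 and TTL/3 seconds before the connection dropped, so the lease expires between TTL - TTL/3 and TTL seconds after the drop. The names `Window`, `takeover_window` and `overlap_margin` are made up for this illustration and are not part of lnd or etcd.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Window:
    """A time range in seconds, measured from the moment the connection is lost."""
    earliest: float
    latest: float


def takeover_window(session_ttl: float) -> Window:
    """Range in which the old lease expires, assuming keepalives every TTL/3."""
    keepalive_interval = session_ttl / 3
    return Window(session_ttl - keepalive_interval, session_ttl)


def overlap_margin(shutdown: Window, takeover: Window) -> float:
    """Seconds between the latest shutdown and the earliest takeover.

    A positive value means the two windows cannot overlap.
    """
    return takeover.earliest - shutdown.latest


SESSION_TTL = 100.0              # cluster.leader-session-ttl, in seconds
SHUTDOWN = Window(27.0, 60.0)    # shutdown window quoted in the post above

takeover = takeover_window(SESSION_TTL)
margin = overlap_margin(SHUTDOWN, takeover)

print(f"takeover window: {takeover.earliest:.1f}-{takeover.latest:.1f} s")
print(f"safety margin:   {margin:.1f} s")
```

With these inputs the margin comes out at a bit under 7 seconds. Lowering the TTL or raising the health check interval shrinks it, so the same check is worth rerunning before changing either value.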

21 sats \ 0 replies \ @justin_shocknet 17 Aug 2024

Great drop ty

17 sats \ 1 reply \ @k00b 17 Aug 2024

By passive do you mean not participating in state updates?

31 sats \ 0 replies \ @justin_shocknet 17 Aug 2024

Exactly, iirc LND isn't cluster-aware

10 sats \ 4 replies \ @031ef7d322 18 Aug 2024

deleted by author

20 sats \ 3 replies \ @k00b 18 Aug 2024

Is leader election supported for Postgres?

No, leader election is not supported by Postgres itself since it doesn't have a mechanism to reliably determine a leading node. It is, however, possible to use Postgres as the LND database backend while using an etcd cluster purely for the leader election functionality.

This is wrong though. You can construct something like an expiring lock in Postgres.
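A rough sketch of what such an expiring lock could look like, written in Python. This is only an illustration of the idea, not something lnd implements. SQLite from the standard library stands in for Postgres so the example runs without a server. The upsert statement is the kind I would expect to carry over to Postgres, but that is untested here, and on a real server you would want the database clock (`now()`) rather than a client-supplied timestamp. The table name `leader_lease`, the lock name `lnd-leader` and the function `try_acquire` are all hypothetical.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS leader_lease (
    name       TEXT PRIMARY KEY,
    holder     TEXT NOT NULL,
    expires_at REAL NOT NULL
)
"""

# Take the lease if it is free, expired, or already ours (renewal).
# If another holder's lease is still valid, the WHERE clause makes this a no-op.
ACQUIRE = """
INSERT INTO leader_lease (name, holder, expires_at)
VALUES (:name, :holder, :now + :ttl)
ON CONFLICT (name) DO UPDATE
SET holder = excluded.holder, expires_at = excluded.expires_at
WHERE leader_lease.expires_at <= :now
   OR leader_lease.holder = excluded.holder
"""


def try_acquire(conn: sqlite3.Connection, holder: str, now: float,
                ttl: float = 100.0, name: str = "lnd-leader") -> bool:
    """Try to take or renew the lease; return True if `holder` owns it afterwards."""
    with conn:  # one transaction: upsert, then read back the current owner
        conn.execute(ACQUIRE, {"name": name, "holder": holder, "now": now, "ttl": ttl})
        row = conn.execute(
            "SELECT holder FROM leader_lease WHERE name = ?", (name,)
        ).fetchone()
    return row is not None and row[0] == holder


conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)

print(try_acquire(conn, "node-a", now=0.0))    # True: lease was free
print(try_acquire(conn, "node-b", now=50.0))   # False: node-a's lease is still valid
print(try_acquire(conn, "node-b", now=100.0))  # True: node-a's lease has expired
```

The holder has to keep renewing before `expires_at`, much like the etcd lease keepalive, and must stop acting as leader once a renewal fails.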

116 sats \ 2 replies \ @031ef7d322 18 Aug 2024

I would guess it’s more common for application developers to reach for a tool more specific to the use case, like etcd, zookeeper, or consul.

You’re right that LND could potentially use advisory locks, which might make sense to eliminate an entire dependency when postgres is used as the backend.

https://www.postgresql.org/docs/current/explicit-locking.html#ADVISORY-LOCKS

105 sats \ 1 reply \ @031ef7d322 18 Aug 2024

Relevant issue from 2022: https://github.com/lightningnetwork/lnd/issues/6894

0 sats \ 0 replies \ @k00b 18 Aug 2024

Nice find.

If Postgres is already a single point of failure in the cluster, there’s not much sense in also having etcd as a dependency.