I am curious as to what hardware, software, and practices the really big lightning nodes use. I'm still a minnow swimming with whales, but maybe this isn't a bad thing to think about as the big players represent more and more of the transactions. Centralization seems likely to be a problem pretty soon.
Does anyone on SN know? Have any former employees described the setup? Are there any sources online to find this stuff?
I'm talking about nodes like Wallet of Satoshi.
I guess I'm really asking how one would go about setting up a big, industrial-size node and becoming a major player.
Lack of high availability makes it difficult to run either of the major distributions like a real enterprise system, with zero downtime. I suspect even the largest nodes on the network today are not much more sophisticated than what’s available out of the box with LND.
Improvements to LND on the db side have reduced the downtime required between restarts (by enabling online compaction) and also set up the architecture for multiple nodes to share a backend data store. Leader election can provide a primitive form of HA (failover but not load balancing). An external project, lnmux by Bottlepay, will allow load balancing HTLCs through multiple public gateways, which is the holy grail for enterprise LND.
For someone with some LN experience who wants to learn more about the above, I recommend setting up a regtest environment using containers or VMs, and configuring the following:
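To make "failover but not load balancing" concrete, here is a toy sketch in Python. It is not LND's code: `LeaseStore` is a hypothetical in-memory stand-in for a coordination service such as etcd, and the node names and TTL are made up. The idea it shows is that only the lease holder is active, and a standby can take over only after the holder stops renewing.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Lease:
    holder: str
    expires_at: float


class LeaseStore:
    """Hypothetical in-memory stand-in for a coordination service like etcd."""

    def __init__(self) -> None:
        self._lease: Optional[Lease] = None

    def try_acquire(self, node_id: str, now: float, ttl: float) -> bool:
        """Grant or renew the lease if it is free, expired, or already ours."""
        lease = self._lease
        if lease is None or lease.expires_at <= now or lease.holder == node_id:
            self._lease = Lease(node_id, now + ttl)
            return True
        return False

    def leader(self, now: float) -> Optional[str]:
        """Return the current lease holder, or None if the lease has lapsed."""
        lease = self._lease
        if lease is not None and lease.expires_at > now:
            return lease.holder
        return None


store = LeaseStore()
TTL = 10.0  # made-up lease duration in seconds

timeline = [
    ("t=0  lnd-1 acquires", store.try_acquire("lnd-1", now=0.0, ttl=TTL)),
    ("t=0  lnd-2 acquires", store.try_acquire("lnd-2", now=0.0, ttl=TTL)),
    ("t=5  lnd-1 renews", store.try_acquire("lnd-1", now=5.0, ttl=TTL)),
    # lnd-1 crashes after t=5 and stops renewing; its lease runs out at t=15.
    ("t=16 lnd-2 acquires", store.try_acquire("lnd-2", now=16.0, ttl=TTL)),
    ("t=17 lnd-1 acquires", store.try_acquire("lnd-1", now=17.0, ttl=TTL)),
]

for step, granted in timeline:
    print(f"{step}: {'granted' if granted else 'refused'}")
```

At every point in the timeline there is at most one active node, so the standbys add resilience but no extra capacity.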
  • Create a basic postgres instance
  • Create an LND instance w/postgres backend
  • Upgrade your postgres to have high availability
  • Create an etcd cluster
  • Create 2 additional LNDs and configure leader election
  • Add centralized monitoring/logging (Grafana LGTM stack is phenomenal but other choices are available)
  • Add lndmon sidecars to lnd for monitoring LN metrics
If you can confidently configure all of that, you should have the basics to work on any major node on the network. Some will be drastically different in implementation but the concepts should be the same.
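As a rough starting point for the lab, here is a small Python sketch that prints a startup command for each of the three LND instances. The hostnames, credentials, and node IDs are hypothetical. The flag names are as I recall them from LND's postgres and leader election docs, so treat them as assumptions and check them against the docs for your LND version. The bitcoind RPC/ZMQ flags are left out, and older releases may need extra options.

```python
from typing import List

# Hypothetical lab endpoints and credentials; replace with your own.
POSTGRES_DSN = "postgres://lnd:lnd@postgres:5432/lnd?sslmode=disable"
ETCD_HOST = "etcd:2379"
NODE_IDS = ("lnd-1", "lnd-2", "lnd-3")


def lnd_args(node_id: str) -> List[str]:
    """Build the argument list for one LND instance in the lab.

    Flag names are assumptions recalled from LND's docs; verify before use.
    """
    return [
        "lnd",
        "--bitcoin.regtest",
        "--bitcoin.node=bitcoind",
        # Every instance points at the same shared postgres database.
        "--db.backend=postgres",
        f"--db.postgres.dsn={POSTGRES_DSN}",
        # etcd is used here only for leader election, not for storage.
        f"--db.etcd.host={ETCD_HOST}",
        "--cluster.enable-leader-election",
        "--cluster.leader-elector=etcd",
        "--cluster.etcd-election-prefix=cluster-leader",
        # The cluster ID is the only value that differs between instances.
        f"--cluster.id={node_id}",
    ]


for node_id in NODE_IDS:
    print(" \\\n  ".join(lnd_args(node_id)))
    print()
```

Only `--cluster.id` changes from node to node; everything else is shared, which is what lets a standby pick up where the failed leader left off.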
HTLC load balancing is still experimental as far as I know, but if you have some development experience you can add lnmux to your lab and experiment with it today.
An alternative node distribution that was created with more of an enterprise architecture from the start is ACINQ’s Eclair. They probably run the single largest node on the network; however, I have not seen Eclair elsewhere in the wild, which is why I recommend starting with LND, then branching out when the fundamental concepts are well understood.
For something easy to digest and get you motivated to build an enterprise grade LN node, here’s a video from base58, Enterprise Lightning Engineering at CashApp w/Ryan Loomba: https://youtu.be/kbhL5RqL8Aw
It’s also important to keep in mind that building and operating enterprise grade infrastructure is completely orthogonal to managing liquidity. Doing both of these well is likely to require at least two talented full-time individuals (more likely a team for each).
reply
An astute reader may recognize I left out a small but critical piece: secrets management.
There are plenty of solutions there depending on your cloud provider. You definitely don’t need dedicated hardware; most of the top nodes are in AWS.
For the LND lab I describe above, I would also add lndinit to initialize each node. You could set up Hashicorp Vault to use as a backend.
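For a feel of what reading a secret from Vault looks like, here is a minimal Python sketch against Vault's KV v2 HTTP API. This is not how lndinit itself is wired. The address, mount, secret path, token, and the `wallet-password` key are all hypothetical lab values.

```python
import json
import urllib.request
from typing import Dict


def build_request(vault_addr: str, mount: str, path: str, token: str) -> urllib.request.Request:
    """Build a KV v2 read request: GET /v1/<mount>/data/<path>."""
    url = f"{vault_addr.rstrip('/')}/v1/{mount}/data/{path}"
    return urllib.request.Request(url, headers={"X-Vault-Token": token})


def parse_kv2(body: bytes) -> Dict[str, str]:
    """KV v2 nests the secret's key/value pairs under data.data."""
    return json.loads(body)["data"]["data"]


def read_secret(vault_addr: str, mount: str, path: str, token: str) -> Dict[str, str]:
    """Fetch and decode one secret; raises on HTTP or network errors."""
    request = build_request(vault_addr, mount, path, token)
    with urllib.request.urlopen(request, timeout=5) as response:
        return parse_kv2(response.read())


if __name__ == "__main__":
    # Hypothetical dev-mode Vault on localhost; never hardcode real tokens.
    try:
        secret = read_secret("http://127.0.0.1:8200", "secret", "lnd/lnd-1", "dev-token")
        print("keys found:", sorted(secret))
    except OSError as err:
        print("could not reach Vault:", err)
```

In a real setup the token would come from the platform's auth method rather than a literal string, and the secret would be handed to the node at startup instead of being printed.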
My advice is don’t spend any money on hardware until you have been running a virtual environment suitable for production use. Then evaluate your threat model and determine if it’s worth the upfront cost.
reply
Great advice. I'll keep that in mind. I guess AWS is the way to go.
reply
From someone who has wasted thousands on unnecessary cloud bills, and has a mid-size homelab that’s 80% idle…
I would run everything locally (on a laptop/desktop) until that becomes a limitation, which is probably only the case when you’re ready to go live on mainnet.
Seriously, just use docker, or maybe a kubernetes dev cluster (kind or k3d) if you’re comfortable with it.
From someone who has accidentally lost too many testnet coins…
Use regtest as much as possible. When you’re ready to interface with the world, try one of the signets.
Seriously, be as stingy as possible with your sats. I used to think spending money on hardware and elaborate cloud infra would make me “invested” but the reality is all other resources required to learn this stuff are dwarfed by time, the scarcest resource of all.
reply
Thanks for the link to the Ryan Loomba video. It gave me a nice overview. I know I have my work cut out for me, but at least I have a better sense of knowing what I don't know.
I felt a little better realizing that cashapp struggles with routing and liquidity issues just like the rest of us.
reply
Thank you very much. This is exactly the information I was looking for. I guess you're experienced in this area (obviously)?
reply
You’re welcome, thanks for sharing your curiosity, the world needs more of that!
I have production experience using everything I described above, except Eclair.
reply
Also didn’t mean to imply I’ve used lnmux in prod. I experimented with early versions in a lab, and haven’t tried it in a while but there’s been progress based on git activity. They have a great introductory blog post.
It relies on HTLC interception, which has been in production use for a while now. That starts getting into what I would consider protocol territory (more so than infra) and requires deep integration with backend app logic. It’s worth learning about once you have a solid grasp of fundamentals in both distributed systems and LN.
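Here is a toy Python model of the idea, to show why interception enables load balancing. It is not lnmux's actual code or API: the `Mux` class, the gateway names, and the "hold" decision are all made up for illustration. The only protocol fact it leans on is that the payment hash is the SHA-256 of the preimage. A central service owns the invoices, several gateway nodes hand it the HTLCs they intercept, and it settles once enough parts have arrived.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Invoice:
    preimage: bytes
    amount_msat: int
    held: List[Tuple[str, int]] = field(default_factory=list)  # (gateway, msat)
    settled: bool = False


class Mux:
    """Hypothetical central service that owns invoices for all gateways."""

    def __init__(self) -> None:
        self.invoices: Dict[bytes, Invoice] = {}

    def add_invoice(self, preimage: bytes, amount_msat: int) -> bytes:
        payment_hash = hashlib.sha256(preimage).digest()
        self.invoices[payment_hash] = Invoice(preimage, amount_msat)
        return payment_hash

    def on_intercept(self, gateway: str, payment_hash: bytes, amount_msat: int) -> str:
        """Decide what a gateway should do with an HTLC it intercepted."""
        invoice = self.invoices.get(payment_hash)
        if invoice is None:
            return "resume"  # not ours: let the node handle it normally
        if invoice.settled:
            return "fail"  # already paid: reject the extra HTLC
        invoice.held.append((gateway, amount_msat))
        if sum(msat for _, msat in invoice.held) >= invoice.amount_msat:
            invoice.settled = True
            return "settle"  # full amount arrived: release the preimage
        return "hold"  # partial payment: wait for the remaining parts


mux = Mux()
payment_hash = mux.add_invoice(preimage=b"\x01" * 32, amount_msat=1_000)

decisions = [
    mux.on_intercept("lnd-a", payment_hash, 600),  # first part, via gateway A
    mux.on_intercept("lnd-b", payment_hash, 400),  # second part, via gateway B
    mux.on_intercept("lnd-a", b"\x00" * 32, 500),  # unrelated forward
]
print(decisions)
```

The two parts of the payment arrive through different gateways, which is the load balancing that plain leader election cannot give you.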
reply
Don't sleep on Eclair if you are investigating setting up a professional routing node.
ACINQ runs one of the largest nodes on the network and their node software is (of course) open source. They have been running a large node for years, so many of the optimizations and features needed to run one already exist. Eclair has been battle hardened over many years.
reply
It's funny. I just saw your post at the exact moment I was listening to Ryan Loomba describing Eclair in recalling cashapp's node planning. His exact quote was something like Eclair has already proven its scaling ability.
They ultimately went with LDK.
The video was referred to me by @031ef7d322 in his original response.
reply
Big +1 on @remyers
I don't use Eclair, but I've been observing in awe how they are running this power node with basically no downtime and super high speed, despite their size and age.
Also notable is their mobile wallet Phoenix, which doesn't get enough attention these days despite its reliability and robustness.
Don't miss this article from ACINQ on their underlying architecture, it's enlightning (pun intended)
reply
Amazing article! Thanks.
reply
Most of an operation like that is going to be the same for any kind of SaaS business: meeting scaling challenges before users notice them, and developing monitoring, alerting, and redundancy to maximize reliability and uptime. For the Bitcoin and lightning specific parts you will probably want your own hardware instead of anything purchased as a virtual machine, at least for handling keys and cryptographic material. You can watch videos on each of the things I mentioned to learn more and then hire people with that experience. Good luck!
reply
Thanks for your insights. I definitely would want my own hardware. Redundancy seems particularly critical. I'm sort of thinking along the lines of investing some money and running the node with complete disregard for routing fees at the outset. The emphasis would be on routing volume. If we're all correct, bitcoin's dollar value will go up in the years to come, and each channel would be substantially more valuable as time goes by.
The goal would be altruistic to start: benefiting the network and fighting centralization.
As time goes by, the node may become profitable. I know this isn't a unique idea. I believe a few nodes have taken this approach already.
reply
I use raspberry Pi 4 and umbrel. With a small battery you can reach almost 100% uptime without redundancy. Works for me for more than a year.
reply
Careful, my Pi worked great for about 1.5 years and now it randomly goes dark. I have been told it’s due to Umbrel running off an SD card and not the SSD.
reply
Many LND releases contain security fixes, so you should be upgrading at least a few times a year.
The default embedded db (bbolt) can only be compacted while offline, and the longer you go between compactions the longer it takes. This may not be significant for a personal node, but for a busy production node it’s likely to cause peers to temporarily disable their channels (if the outage lasts >20 minutes). Without compacting, your disk usage will grow and performance will suffer.
Restarts can be relatively quick (by bitcoin standards, not by enterprise standards) when using an external db.
The biggest risk to an RPi besides power is SD corruption. Using an onboard UPS (like PiJuice) can help, because SD corruption is more likely when power is flaky, but ultimately you’re just delaying the inevitable.
reply
That's exactly where I'm at now also. I've been up with little to no down time for about a year and a half.
I admit the raspberry pi scares me from a durability standpoint, and things like redundancy are non existent. I also want to take off the umbrel training wheels and learn what's going on at a deeper level.
I will keep my pi node going. I want my next node to be more like what I described in my op along the lines that were detailed by @031ef7d322.
reply
My single biggest upgrade was getting off a raspberry pi and throwing my raspiblitz setup on an old 2012 Macbook Pro I had lying around. Dramatically faster, and no more worrying about the pi4's weird power issues or SD card durability woes.
Running a personal (or even routing) lightning node doesn't have to be expensive, even 10-year-old hardware does a great job!
reply