Aether grows past 500 concurrent nodes

Feb 20th, 2019 2:08:54am

It’s been a pretty fun week so far.

After I posted Aether for Redditors, Aether ended up on the front page of Hacker News again. (Link)

The drill is familiar by now. Just keep everything working, and it’ll pass in a couple of days. This time around, the main site got about 6,600 or so unique visitors, which is normal, since the HN link pointed to the blog and not the main site.

There’s one more benefit in these things: they push the system to the next order of magnitude of scale, and give you a glimpse of where the next set of scaling issues is going to come from. This is exactly what happened when Aether broke through 500 concurrent nodes online. If you’ve been on the network for the past few days, you might have had some trouble seeing new posts from other people. I’m writing to give a little bit of an educated guess on why, and on what I’m doing to improve things. The changes I describe here will come in an update in the next few days or so.

In general, this is a great position to be in: Aether is picking up steam at a pretty steep pace, so when these 10x whale events happen, they’re always insightful. The reason I can be working on these issues right now, as early as possible, is that you guys came in all at the same time and stressed out the algorithm that chooses which nodes to connect to. So thanks for being here and using it. 🙂

What I thought it was

  • I had released an update one day earlier, and I thought that was the cause of the connectivity problems. As far as I can see, it wasn’t: I reverted that update, and the behavior stayed pretty much the same.
  • Some users on Windows have an antivirus or some other external issue that prevents the app from writing its cache files. After some debugging with folks on the meta forum, we’ve ruled that out as the culprit. (Thanks!)

What was happening

  • Aether is a flood network. For this to work effectively, every node has to choose who to connect to at any point in time. These connections happen every minute or two. These are all outbound connections.
  • You also receive inbound connections. There is a limit on how many inbound connections you can receive at any one time, to respect your internet connection and not tax your machine. In essence, your node has a defined number of ‘slots’ that other nodes can connect to. When your node runs out of slots, it starts to tell other nodes: “I’m too busy, connect back later”.
  • An Aether node, to determine which nodes to connect to, does a ‘network scan’. This is, in very simple terms, a process to check some nodes that have the highest likelihood of being online.
  • This network scan uses the first two steps of a full sync. So it’s much lighter than a regular sync. It’s just a ping.
  • Unfortunately, these pings also take a slot. Slots take 60 seconds to clear, to allow for continued requests on the same slot if needed. (There’s a rough sketch of this slot model right after this list.)
  • When we had fewer active nodes, network scans and regular syncs both fit within the available slots at any point in time.
  • When we got close to 500 concurrent nodes, that was no longer true. Since a sync hits one node, but a scan hits multiple nodes, the scan traffic grew much faster than the sync traffic, clogging the slots that were actually meant for syncs. This is why updates slowed down.
  • The sync logic was also a little too soft: when it failed a small number of times, it stopped trying to sync in that cycle and left it for a future cycle. Since the large majority of the nodes in the network were saturated by scans, this meant it spent a lot of time sitting around. This behavior was meant to reduce the number of scans, since every sync attempt triggered a new scan, but combined with the other effects mentioned above, it backed itself into a corner.
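To make the slot mechanics above a little more concrete, here’s a minimal sketch of a single shared slot pool with 60-second leases, which is roughly the pre-fix situation. The names and numbers are illustrative, not Aether’s actual code:

```go
package main

import (
	"sync"
	"time"
)

// slotPool models a fixed number of inbound slots. In the pre-fix design,
// scan pings and full syncs drew from the same pool, and every slot was
// held for 60 seconds before being released.
type slotPool struct {
	mu    sync.Mutex
	inUse int
	max   int
}

// tryAcquire grabs a slot if one is free and schedules its release after
// the hold duration. Returning false is the "I'm too busy, connect back
// later" case.
func (p *slotPool) tryAcquire(hold time.Duration) bool {
	p.mu.Lock()
	defer p.mu.Unlock()
	if p.inUse >= p.max {
		return false
	}
	p.inUse++
	time.AfterFunc(hold, func() {
		p.mu.Lock()
		p.inUse--
		p.mu.Unlock()
	})
	return true
}

func main() {
	// With one shared pool, a burst of cheap scan pings can occupy every
	// slot for a full minute and lock out the syncs that actually move posts.
	pool := &slotPool{max: 10}
	for i := 0; i < 12; i++ {
		pool.tryAcquire(60 * time.Second) // scan pings filling the pool
	}
	println(pool.tryAcquire(60 * time.Second)) // a real sync gets refused: false
}
```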

What’s the fix?

Bunch of things.

  • The scans will be rate-limited. A node can now do one scan every ten minutes, instead of being able to trigger a scan whenever it wants. This should reduce the slots used by scans. (They are a tiny portion of the traffic, since they’re essentially pings, but because they’re indistinguishable from the start of a sync, they each took a full slot.) There’s a rough sketch of this and the separate-slots change right after this list.
  • There’ll be separate slots, only for scans. This makes it so that the syncs will never be blocked by scans.
  • Sync logic is more aggressive, since a retry no longer implies a full scan. It will keep retrying with different nodes, using the existing scan data from up to 10 minutes ago, until it finds one it can sync with or it exhausts the addresses database. The cooldown for each address is increased from 3 to 10 minutes, so at any point in time, a node will hit any other specific node at most once every 10 minutes.
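Here’s a rough sketch of the first two fixes, again with made-up names and numbers rather than the real implementation: a 10-minute rate limit on triggering scans, plus a separate (and much larger) slot pool that only scan pings draw from:

```go
package main

import (
	"sync"
	"time"
)

// scanLimiter allows a node to trigger at most one scan every ten minutes.
type scanLimiter struct {
	mu       sync.Mutex
	lastScan time.Time
}

func (l *scanLimiter) allowScan() bool {
	l.mu.Lock()
	defer l.mu.Unlock()
	if time.Since(l.lastScan) < 10*time.Minute {
		return false
	}
	l.lastScan = time.Now()
	return true
}

// pools keeps sync slots and scan slots fully separate, so scan pings can
// never crowd out syncs. Channels are used as simple counting semaphores.
type pools struct {
	syncSlots chan struct{}
	scanSlots chan struct{}
}

func newPools(syncN, scanN int) *pools {
	return &pools{
		syncSlots: make(chan struct{}, syncN),
		scanSlots: make(chan struct{}, scanN),
	}
}

// tryAcquire takes from the given pool without blocking and releases the
// slot after the hold duration, mirroring the 60-second slot lease.
func tryAcquire(slots chan struct{}, hold time.Duration) bool {
	select {
	case slots <- struct{}{}:
		time.AfterFunc(hold, func() { <-slots })
		return true
	default:
		return false
	}
}

func main() {
	// Scans are near-free to answer, so the scan pool can be ~10x larger.
	p := newPools(10, 100)
	lim := &scanLimiter{}

	if lim.allowScan() {
		tryAcquire(p.scanSlots, 60*time.Second) // a scan ping
	}
	tryAcquire(p.syncSlots, 60*time.Second) // a sync, unaffected by scan load
}
```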

In short, we are separating the scan from the actual sync: it becomes something that just happens every 10 minutes, and is no longer a service other parts of the app can call at any time they want. The goal is to make scans a little less chatty and dominant, since scanning every 10 minutes instead of on every attempt does not appreciably diminish the value of the scans (as far as preliminary testing shows).

This makes it so that the other parts of the app that rely on this addresses table can now be more aggressive, since they are released from having to update the table themselves.
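As a sketch of that shape, the scan can be structured as a background job on a ten-minute ticker, with everything else only reading the addresses table it keeps fresh. The function names here are placeholders, not Aether’s actual API:

```go
package main

import "time"

// runScanLoop runs the network scan on a fixed ten-minute ticker instead of
// letting other parts of the app trigger it on demand. The scan callback is
// assumed to ping the likeliest-online nodes and refresh the addresses table.
func runScanLoop(scan func(), stop <-chan struct{}) {
	ticker := time.NewTicker(10 * time.Minute)
	defer ticker.Stop()
	scan() // one scan at startup so the addresses table isn't empty
	for {
		select {
		case <-ticker.C:
			scan()
		case <-stop:
			return
		}
	}
}

func main() {
	stop := make(chan struct{})
	go runScanLoop(func() { /* ping candidates, update the addresses table */ }, stop)

	// The rest of the app reads the addresses table whenever it needs targets,
	// without triggering a scan of its own.
	time.Sleep(time.Second) // placeholder for the app's actual lifetime
	close(stop)
}
```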

Lastly, having separate, dedicated slots for scans makes it so that we can give 10x the slot count only for scans, since they are effectively close-to-free to respond to.

Why did it work before, and why didn’t it work when the network grew?

Because the traffic used for scans grew faster than the traffic used for syncs: syncs are 1:1, but scans are 1:N. So the scan requests outgrew the capacity the network gained from new nodes joining. Rate-limiting the scans and giving them separate slots brings scan traffic back to growing roughly in line with the network’s overall capacity as new nodes are added.
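To put some made-up, back-of-the-envelope numbers on that (purely illustrative, and assuming the number of candidates a scan checks grows with the size of the network):

```go
package main

import "fmt"

// Illustrative only: each node does one 1:1 sync per cycle, while each 1:N
// scan pings a batch of candidates that gets bigger as the network grows.
// Inbound slot capacity only grows linearly with the node count.
func main() {
	for _, nodes := range []int{50, 500} {
		slotsPerNode := 10               // assumed inbound slots per node
		capacity := nodes * slotsPerNode // total slots: linear in network size
		syncReqs := nodes                // one sync target per node per cycle
		scanFanout := nodes / 10         // assumed: candidates checked per scan
		scanReqs := nodes * scanFanout   // so scan pings grow roughly quadratically
		fmt.Printf("nodes=%d  slots=%d  sync requests=%d  scan pings=%d\n",
			nodes, capacity, syncReqs, scanReqs)
	}
}
// nodes=50   slots=500   sync requests=50   scan pings=250
// nodes=500  slots=5000  sync requests=500  scan pings=25000
```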

These fixes involve some backend changes, so they’re going to take a few days, and I’ll continue to work on and improve stability in the coming weeks, as my work schedule on the business version allows. I’m writing this to shed some light on what I’m doing and what’s happening behind the scenes.

You should see a steady stream of updates coming in - the changelog will carry more details. They’ll be focused on improving the scalability of the system as it gets bigger.

Growing pains y’all. Cheers!