Network architecture and security

Overview

The graph below shows the network architecture, using the querier as a representative service. An almost identical setup is used for the indexer and companion (share) servers, and slightly simplified versions are used for the bots (since they don't need to scale or handle incoming end-user traffic).

graph TD;
  User(Legitimate user)
  Attacker(Attacker)
  User-->Cloudflare(Cloudflare Ingress)
  Attacker-->Cloudflare
  Cloudflare-->CFRules(Does this look like an attack?)
  CFRules-->|Yes|CFMitigation(Cloudflare Attack Mitigation)
  CFMitigation-->Block
  CFMitigation-->Challenge(Managed Challenge)
  CFRules-->|No|CacheCheck(Is the data available in Cloudflare cache?)
  CacheCheck-->|Yes|CFCached(Serve cached data)
  CacheCheck-->|No|ALB(Amazon Application Load Balancer)
  ALB-->ALBIPCheck(Is this from a Cloudflare IP address?)
  ALBIPCheck-->|No|ALBDeny(Drop the connection)
  ALBIPCheck-->|Yes|TargetGroup(Amazon Target Group)
  TargetGroup-->|Choose ECS Task|ECSTask(ECS Task)
  ECSTask-->QuerierConcurrencyLimit(Are we beyond our concurrent request limit?)
  QuerierConcurrencyLimit-->|Yes|QuerierLoadShed(Load shed the additional request)
  QuerierConcurrencyLimit-->|No|QuerierProcess(Process the request)
  QuerierProcess-->Kingnodes(Make a gRPC request to Kingnodes)
  Metrics(Amazon Health Metrics)-->|Check health|ECSTask
  Metrics-->|Check stats|ALB
  Metrics-->|Scale up or down|TargetGroup

Goals

  • Provide high availability, even in the presence of machine failure
  • Cache as much as possible within Cloudflare to reduce traffic to our services and Kingnodes
  • Detect and block as many invalid requests (like DDoS attacks) as possible within Cloudflare
  • Do as little work on invalid requests within our services as possible
  • Make it difficult to send cache-busting requests
  • Scale up our services in response to increases in traffic
  • Avoid overprovisioning (since it costs more), but use it if necessary to handle bursty traffic

Cloudflare protections

TODO: Pull details from https://phobosfinance.atlassian.net/browse/PERP-2737

Amazon setup

We follow a fairly standard load balancer/auto-scaling group/node setup, but with Amazon ECS and Fargate instead of EC2 auto-scaling groups. We use various triggers to scale up aggressively and scale down less aggressively. We should review and document those triggers here. Right now, they include:

  • High CPU utilization
  • High memory usage

In-app protections

  • Any request with invalid query string parameters is rejected with a 400 status code. This helps prevent cache busting.
    • TODO: We'd like to improve our Cloudflare protection to detect high rates of 400 responses and automatically block the offending client as an attacker.
  • Some data is cached in memory within the querier. This is done either to protect against node downtime, to improve performance, or to help mitigate DDoS attacks (by absorbing the traffic cheaply in the querier instead of more expensively by querying nodes).
  • In addition to returning appropriate cache headers for each endpoint (with different cache durations depending on the data requested), all error pages also include cache headers to prevent the same invalid request from flooding our system.
  • All requests have a timeout. This may result in errors for users, but that's usually better than hanging connections. Of all these features, this is the least "protective," but it's still helpful and good for the end user experience.
  • A global concurrency limit prevents more than a certain number of requests from being handled on a single node at a given time.
  • Load shedding returns an error status code when the global concurrency limit has been hit (see the sketch after this list).
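
The last three items map onto standard tower middleware. Below is a minimal sketch assuming the querier is (or closely resembles) an axum + tower HTTP service; the /status route, the limit of 512, and the 30-second timeout are made-up illustrative values, not our production settings.

// A minimal sketch, assuming axum 0.7, tokio, and tower with the "limit",
// "load-shed", and "timeout" features enabled. Values are illustrative only.
use std::time::Duration;

use axum::{error_handling::HandleErrorLayer, http::StatusCode, routing::get, BoxError, Router};
use tower::ServiceBuilder;

// Convert middleware errors (overload, timeout) into HTTP responses, since
// axum requires the final middleware stack to be infallible.
async fn handle_middleware_error(err: BoxError) -> (StatusCode, String) {
    if err.is::<tower::load_shed::error::Overloaded>() {
        (StatusCode::SERVICE_UNAVAILABLE, "service overloaded".to_string())
    } else if err.is::<tower::timeout::error::Elapsed>() {
        (StatusCode::REQUEST_TIMEOUT, "request timed out".to_string())
    } else {
        (StatusCode::INTERNAL_SERVER_ERROR, format!("unhandled error: {err}"))
    }
}

fn app() -> Router {
    Router::new()
        // Hypothetical endpoint standing in for the real querier routes.
        .route("/status", get(|| async { "ok" }))
        .layer(
            ServiceBuilder::new()
                .layer(HandleErrorLayer::new(handle_middleware_error))
                // Load shed: reject requests immediately once the concurrency limit is hit.
                .load_shed()
                // Global concurrency limit: cap in-flight requests per service instance (ECS task).
                .concurrency_limit(512)
                // Request timeout: fail requests that run too long instead of hanging.
                .timeout(Duration::from_secs(30)),
        )
}

#[tokio::main]
async fn main() {
    let listener = tokio::net::TcpListener::bind("0.0.0.0:3000").await.unwrap();
    axum::serve(listener, app()).await.unwrap();
}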

Concerns with the global concurrency limit and load shed

The inspiration for using these comes from the blog post "I won free load testing." As I understand it, the theory behind combining the two is:

  • By having a concurrency limit, we prevent the application from trying to do too much work at once, allowing it to handle a smaller number of requests at a time, finish each one faster, and clear out the backlog (see the rough numbers after this list for a sense of scale).
  • By using load shedding, we prevent the application from being overwhelmed by too many active connections, allow the load balancer to redirect requests to other nodes, and give the auto-scaler a feedback signal to increase the number of nodes.
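
For a rough sense of scale (illustrative numbers only, not measurements of our system): by Little's law, a task serving 200 requests per second at an average of 100 ms per request only has about 200 × 0.1 = 20 requests in flight at any moment. A concurrency limit needs to sit comfortably above that steady-state number to leave room for bursts; a limit close to it turns every burst into rejected requests.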

I'm concerned that our current setup is making things worse, not better. The first issue is setting the concurrency limit too low. If the limit is too low, we essentially eliminate the application's ability to absorb bursts of requests: a chunk of the requests in a sudden spike will be rejected immediately. That's good against DDoS attacks, but bad for normal usage. When that happens, I think the following can occur:

  • The user's browser receives the error response
  • The browser immediately retries, possibly multiple times
  • Instead of a single request sitting in a queue for a short time, we now have multiple requests touching every layer of our system and being rejected repeatedly: a worse experience for the end user and more overall load on our system (see the rough numbers after this list).
  • More theoretically, we prevent the Amazon load balancer from doing its job of choosing which node to send each request to. It's supposed to handle the case of requests taking too long to process on a node, and immediate load shedding prevents that from kicking in. It may also interact badly with the auto-scaling rules.
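
To put rough numbers on the retry amplification (again, purely illustrative): if a burst of 1,000 requests gets shed and each browser retries three times, we handle up to 4,000 requests, each one passing through Cloudflare, the ALB, and an ECS task, instead of 1,000 requests briefly waiting in a queue.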

I think we should instead do the following:

  • Drop load shedding entirely.
  • Keep the global concurrency limit. I'm not sure what the number should be, but I'd err on the higher side.
  • With these two changes, requests can now begin to pile up on an individual node while waiting to be processed. They may end up timing out, but that's a more natural backpressure signal for the load balancer (see the sketch after this list).
  • Try to refine the auto-scaling rules to detect slow response times and a higher number of "request timed out" responses.
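
For comparison, here is a sketch of the same hypothetical axum + tower stack with this proposal applied: load shedding removed, a deliberately higher (still illustrative) concurrency limit, and the request timeout kept as the backstop. This app() is a drop-in replacement for the one in the earlier sketch.

use std::time::Duration;

use axum::{error_handling::HandleErrorLayer, http::StatusCode, routing::get, BoxError, Router};
use tower::ServiceBuilder;

// With load shedding removed, the only middleware error left is the timeout.
async fn handle_timeout(err: BoxError) -> (StatusCode, String) {
    if err.is::<tower::timeout::error::Elapsed>() {
        (StatusCode::REQUEST_TIMEOUT, "request timed out".to_string())
    } else {
        (StatusCode::INTERNAL_SERVER_ERROR, format!("unhandled error: {err}"))
    }
}

fn app() -> Router {
    Router::new()
        .route("/status", get(|| async { "ok" }))
        .layer(
            ServiceBuilder::new()
                .layer(HandleErrorLayer::new(handle_timeout))
                // No .load_shed(): requests beyond the limit wait for a permit
                // instead of being rejected immediately.
                .concurrency_limit(2048)
                // The timeout still bounds how long a request runs once it has a permit.
                .timeout(Duration::from_secs(30)),
        )
}

The design difference is that excess requests now queue at the concurrency limit rather than being rejected, which is exactly the backpressure behavior the proposal relies on.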

Attack vectors

  • Sending so many requests to Cloudflare that it ends up overwhelming our load balancer in Amazon. (Load balancers can scale, but we've seen cases where they don't scale quickly enough.)
  • Similarly, taking down our own services by getting enough requests past Cloudflare's DDoS protection and cache layer.
  • Finally (and similarly), causing our services to send too many requests to the node provider.