nbviewer and GitHub's rate limit

If you've been using nbviewer recently, you may have seen this message more often than you would like:

503 rate limit

It turns out that some overzealous SEO bots were eating up all of our resources. Last week, we discovered this cause of the API rate limit exhaustion and took a few actions to resolve it.

tl;dr: nbviewer shouldn't hit the rate limit so often anymore. This is what our API usage availability has looked like since we deployed the changes I'll describe here:

rate limit status

nbviewer's architecture

I'd like to take this opportunity to give an overview of how we deploy nbviewer, so we can see how each piece comes into play.

To start off, nbviewer is two things:

  1. the web application that finds notebooks on the internet and renders them, and
  2. the specific deployment of nbviewer serving https://nbviewer.jupyter.org

You can even roll your own nbviewer by grabbing the public docker image and running:

docker pull jupyter/nbviewer
docker run -p 5000:5000 -it jupyter/nbviewer

You can then start rendering notebooks left and right. However, there's a limit to how scalable such a basic deployment can be.

Rendering a notebook

Rendering a notebook involves two major steps:

  1. download the notebook from somewhere on the web
  2. render it to HTML with nbconvert

Each step can take a bit of time. Fetching and rendering a notebook start to finish can take as much as a few seconds if the notebook is big and/or the upstream server is busy or slow. We use tornado's async APIs to handle concurrent requests. Fetching notebooks from the web is handled via a subclass of AsyncHTTPClient, and rendering is done in the background via concurrent.futures. A single nbviewer instance can handle quite a few concurrent visitors, depending on how many rendering threads are available.
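The two steps above can be sketched with the standard library alone. This is a minimal sketch, using asyncio in place of tornado's event loop; `fetch_notebook` and `render_to_html` are hypothetical stand-ins for the real HTTP fetch and nbconvert render:

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for the nbconvert render step.  It's CPU-bound,
# so it runs in a thread pool to keep the event loop free for other requests.
def render_to_html(notebook_json: str) -> str:
    return "<html>" + notebook_json + "</html>"

render_pool = ThreadPoolExecutor(max_workers=4)

# Hypothetical stand-in for the async HTTP fetch of the notebook.
async def fetch_notebook(url: str) -> str:
    await asyncio.sleep(0)  # real code awaits a network response here
    return '{"cells": []}'

async def handle_request(url: str) -> str:
    nb = await fetch_notebook(url)  # step 1: download
    loop = asyncio.get_running_loop()
    # step 2: render in a background thread while the loop serves other visitors
    return await loop.run_in_executor(render_pool, render_to_html, nb)

html = asyncio.run(handle_request("https://example.com/notebook.ipynb"))
```

The number of workers in the pool corresponds to the "rendering threads" mentioned above: it bounds how many renders can run at once.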

Caching

Most notebooks on the web don't change very often, so going through the whole process when you just downloaded and rendered the same notebook a minute ago is a bit of a waste. This is where caching comes in.

nbviewer supports caching via a simple in-memory cache or memcache at two levels:

  1. Every time nbviewer completes a render, it caches the result for a period of time (usually 10 minutes to an hour, depending on render time), so that a very popular notebook doesn't consume too many resources. This is unconditional, and doesn't take into account whether the original notebook has been updated. If it's in the cache, it gets reused.
  2. nbviewer separately caches upstream requests. When nbviewer asks an external service for anything, whether it's a regular web server or GitHub's API, the response is saved for a time. The next time the same resource is requested, nbviewer checks for a cached response. Unlike the full render cache, this cache doesn't prevent a new upstream request. Instead, we check the headers of the cached reply (ETag and Last-Modified). nbviewer then takes the content of those headers to populate If-None-Match and If-Modified-Since headers before sending a new request. The server can then check these headers and return an empty reply with status 304, telling nbviewer that it should reuse the cached response instead of running the full process of handling the request again. Critically for nbviewer, when the GitHub API responds with 304, the rate limit is not consumed.
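The conditional-request flow in the second cache level can be sketched roughly as follows. This is a simplified illustration, not nbviewer's actual code; the cache shape and function names are hypothetical:

```python
# Hypothetical upstream-response cache: url -> {"etag", "last_modified", "body"}
cache = {}

def make_request_headers(url):
    """Build conditional headers from a cached response, if we have one."""
    headers = {}
    cached = cache.get(url)
    if cached:
        if cached.get("etag"):
            headers["If-None-Match"] = cached["etag"]
        if cached.get("last_modified"):
            headers["If-Modified-Since"] = cached["last_modified"]
    return headers

def handle_response(url, status, etag=None, last_modified=None, body=None):
    """Reuse the cached body on 304, otherwise cache the fresh response."""
    if status == 304:
        # Not modified: the empty 304 reply tells us the cached copy is
        # still good.  GitHub doesn't count 304s against the rate limit.
        return cache[url]["body"]
    cache[url] = {"etag": etag, "last_modified": last_modified, "body": body}
    return body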

Our nbviewer deployment

To deal with the traffic our public nbviewer instance gets, we have deployed nbviewer with a few extra elements. The moving parts:

  1. we have multiple nbviewer instances deployed across multiple servers. Right now, there are two nbviewer instances (running in docker) on each of two servers on rackspace, making four total instances.
  2. memcache, also running in docker, with separate instances on each server
  3. fastly sits in front of our nbviewer instances and load-balances requests across them, performing health checks, etc. It also has its own cache layer.
  4. Finally, cloudflare sits out front, handling SSL termination and a small amount of caching

Four instances is quite a bit more than we need for the load nbviewer gets, but having it spread out lets us do things like reboot or upgrade servers without any downtime.

What happened?

So, why were we seeing the GitHub rate limit hit so often? Did nbviewer just get too popular? Sadly, no. How did we figure out what happened? To the logs!

Fun fact: until now, we were using docker's default json logging, and nbviewer was producing 50GB of logs each month. docker logs --tail does not behave super well with a 30GB log file. I started digging through the logs to see what was going on, but quickly realized that this wasn't working very well. To try to get a better sense of things, I started by hooking up our logs to Loggly via syslog. Loggly lets me start to extract some metrics, such as status, URL, IP address, and user-agent of failing requests.

What did we learn?

One of the first things I did was group incoming requests by IP address, to see if anything stood out. This is what I saw:

ip addresses (c/o loggly)

So about two thirds of requests were coming from just two ip addresses! That's not right. Checking for user-agent showed something else:

user-agents

Two bots, Ahrefs and DotBot, were accounting for almost three quarters of requests. That's too much.

What we did about it

The very first thing we did was to block the two IP addresses that were driving way too much traffic. This was done in the fastly layer, sending a custom 429 error for any request from these IPs:

429 Too many!

This improved availability a bunch, and would probably have been plenty on its own to resolve the current situation. Of course, it wouldn't be enough to prevent similar situations in the future. Since the problem was caused by robots, the next place to look was robots.txt. Our robots.txt had been the most permissive possible:

User-agent: *  
Disallow:  

explicitly allowing any bot. Turns out we were being too nice. We've now updated our robots.txt to tell all robots to slow way down, and instructed the two overzealous bots that they are not welcome:

User-agent: *  
Crawl-delay: 10

User-agent: dotbot  
Disallow: /  
User-agent: AhrefsBot  
Disallow: /  

Again, this should be enough to resolve the issue if these and all other bots are honorable and respect the robots.txt.

However, we would prefer to not rely on the kindness of strange robots.

Better caching and better use of the cache

As mentioned earlier, we cache the responses we get from upstream sources and use those to reduce the load on the GitHub API and other sources of notebooks. However, we were not making the best use of that cache. First of all, the responses were only being cached for two hours, which means this caching only helped us during relatively short bursts of activity (e.g. a popular notebook being shared on social media). This certainly happens, but does not represent a large fraction of typical traffic. Since checking for updates with If-None-Match has no risk of stale results, we changed this to cache upstream responses forever, until memcache pushes them out when it gets full. Since most public notebooks don't change often, we should get a cache hit almost every time.

The next change we made was an improvement to behavior when the rate limit is hit. Previously, if we had a cached upstream response and GitHub told us there was an update that we couldn't fetch due to the rate limit, we would serve the 503 error page. But we had valid data right there! Now, if we hit the rate limit but happen to have an old version of the data, we work with the stale version rather than serving an error page. Something's better than nothing, right? This applies to any kind of error fetching from an upstream server, not just failures due to hitting the GitHub rate limit.
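A rough sketch of that fallback logic (the function and cache shapes here are hypothetical, for illustration only):

```python
def fetch_with_stale_fallback(url, fetch, cache):
    """Fetch fresh data, but fall back to a stale cached copy on any
    upstream failure (e.g. a GitHub API rate-limit error)."""
    try:
        body = fetch(url)
    except Exception:
        cached = cache.get(url)
        if cached is not None:
            return cached  # stale, but better than an error page
        raise  # nothing cached: let the error page happen
    cache[url] = body
    return body
```

The key point is that the error path only surfaces to the visitor when there is truly nothing cached to serve.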

So we are now much less likely to hit the rate limit in the first place, and much more likely to be able to serve something if and when we hit it.

Our own rate limit

The last thing we did to help limit the consumption of resources by a small number of actors was to implement our own rate limit in nbviewer. This is a pretty simple rate limit, where we identify a source by the combination of ip address and user-agent. Currently, this limit is set at 60 requests in ten minutes. We only count non-cached requests against the limit, so refreshing the same few pages a bunch won't consume the limit.

Like the cache, the rate limit is actually implemented with memcache. When we see a request from a given source, we add a counter to memcache with an expiry equal to our rate limit window:

key = cache_key(request)  # combines ip and user-agent  
new_visitor = memcache.add(key, 1, rate_limit_window)  

If it's not a new visitor, we increment the counter with memcache's atomic increment operation:

if not new_visitor:  
    count = memcache.incr(key)

The last thing to do is to check if the rate limit is exceeded, and return an HTTP 429 error instead of proceeding with the render:

if count > rate_limit:  
    raise HTTPError(429, "Rate limit exceeded...")
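The pieces above can be put together into a self-contained sketch. `FakeMemcache` is a hypothetical stand-in for a real memcache client, implementing just the `add`/`incr` semantics the limiter relies on:

```python
import time

class FakeMemcache:
    """Hypothetical in-process stand-in for a memcache client.
    add() only succeeds if the key is absent; incr() is atomic in real memcache."""
    def __init__(self):
        self.store = {}  # key -> [count, expires_at]

    def add(self, key, value, expiry):
        entry = self.store.get(key)
        if entry and entry[1] > time.time():
            return False  # key already exists: not a new visitor
        self.store[key] = [value, time.time() + expiry]
        return True

    def incr(self, key):
        self.store[key][0] += 1
        return self.store[key][0]

RATE_LIMIT = 60          # requests allowed per window
RATE_LIMIT_WINDOW = 600  # ten minutes, in seconds

def check_rate_limit(mc, ip, user_agent):
    """Return True if this request is allowed, False if the limit is exceeded."""
    key = f"rate-limit:{ip}:{user_agent}"  # hypothetical key scheme
    if mc.add(key, 1, RATE_LIMIT_WINDOW):
        return True  # first request in this window
    return mc.incr(key) <= RATE_LIMIT
```

In the real handler, a `False` result becomes the HTTP 429 error shown above.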

If you hit the rate limit you will see:

rate limit error

From looking at our logs, we can see that this rate limit has blocked ~800 requests since deploying it a few days ago. More than 600 of these were from a bot that apparently doesn't respect robots.txt. A few dozen were made from phantomjs, presumably some manual scraping that made a few too many fetches. That leaves a few probably-real humans who hit the rate limit, but no IP has hit it often or for a sustained period of time.

Let us know if the rate limiting is causing an issue for you. We will keep an eye on the limits, and continue to tune both the limits and limiting conditions to find the best balance.

Monitoring

We run our instances with the New Relic Python agent, which gets us some basic health monitoring. We also have some monitoring and alerts in our Rackspace account to notify us when things go down.

We've added the error rate tracked by New Relic and a new metric that tracks how much of our GitHub API limit is available to our public status page, so you can see how healthy nbviewer is at any given time.

One last thing

After confirming that everything was working well, I did a final deploy on a Friday evening, and moved on. Then this came in:

Turns out there had been an update to the New Relic Python agent since the last deploy, which introduced this tiny off-by-one error:

if len(args) > 0:  
    args = list(args)
    args[1] = wrapped_callback
    # what if len(args) == 1 ?!

(that > 0 should be > 1)

This little bug just so happens to prevent nbviewer from making any outgoing requests :shrug:. This served as a friendly reminder that it is a good idea to pin your dependencies when deploying applications so that updates only happen when you decide they should.