Controlling uncertainty on web applications and APIs

A Django middleware to introduce uncertain behaviours on sites

Posted by Agustín Bartó 6 months, 2 weeks ago

“But it works fine on my machine!”

How many times have you heard that phrase? I’ve been working as a developer for over a decade and I could swear I’ve heard it on every project I’ve worked on. No matter how well you design your software and plan its deployments, almost always something unexpected happens.

Over the years, the industry has developed techniques and technologies to minimize the difference between the development and production environments. Right now we have scripted deployments, several container and virtualization solutions, and continuous integration, all working together to make sure that every step of the process is predictable. But even with all this in our toolbox, things still go wrong from time to time.

Here at Machinalis we provide services and work on projects with small startups and huge corporations (and everything in between), and although the resources and corporate structure might change, the goal remains the same: producing efficient, high-quality products that fulfill their purpose throughout their lifespan. Hence, the policies and techniques for deployment are pretty similar regardless of the size of the client, and it usually works just as we planned it... for a time.

Whether it’s a multi-million dollar storage solution that’s about to give up the ghost or a clueless operator using a mission-critical network to transfer several terabytes of media right in the middle of the launch of a product (both true stories, by the way), there are things that are beyond our control.

How can we make sure that our applications and APIs are robust enough to handle these situations? What can we do to test theories regarding these transient difficulties in infrastructure without access to the real environments?

The first answer is to design robust software. It’s not particularly hard, and several ways of doing it have been thoroughly researched and documented. The next is to set up logging properly: it’s the easiest thing to do, but it’s also frequently overlooked.

The last thing is to test for less-than-ideal situations. Enter chaos.

When we need to introduce simulated network problems into a project, we rely on chaotic proxies. These proxies sit in between components that need to communicate through a network, and they cause predictable problems like delays, disconnections and, in the case of TCP proxies, packet dropouts and other low-level network problems.
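The essence of a chaotic TCP proxy is small enough to sketch. The following is a minimal, hand-rolled illustration of the idea (not code from any of the tools mentioned below): it listens on one port, forwards bytes to a backend, and injects a per-chunk delay and an optional chance of dropping chunks. The ports (8005 for the proxy, 8000 for the backend) match the setup used later in this post.

```python
import asyncio
import random

async def pipe(reader, writer, delay=0.0, drop_rate=0.0):
    """Copy bytes from reader to writer, injecting chaos on the way:
    an optional per-chunk delay and an optional chance of dropping a chunk."""
    while data := await reader.read(4096):
        if random.random() < drop_rate:
            continue  # simulate a dropout: this chunk never arrives
        await asyncio.sleep(delay)
        writer.write(data)
        await writer.drain()
    if writer.can_write_eof():
        writer.write_eof()  # propagate the half-close to the other side

async def handle(client_reader, client_writer, backend_host="127.0.0.1",
                 backend_port=8000, delay=0.05):
    """Proxy one client connection to the backend, in both directions."""
    backend_reader, backend_writer = await asyncio.open_connection(
        backend_host, backend_port)
    await asyncio.gather(
        pipe(client_reader, backend_writer, delay=delay),
        pipe(backend_reader, client_writer, delay=delay),
    )
    backend_writer.close()
    client_writer.close()

async def main():
    # Listen on 8005 and forward to the real service on 8000.
    server = await asyncio.start_server(handle, "127.0.0.1", 8005)
    async with server:
        await server.serve_forever()
```

Real tools do far more than this (protocol awareness, configurable behaviors, statistics), but the shape is the same: sit in the middle and misbehave on purpose.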

So far we’ve mostly used Vaurien due to its flexibility, ease of use, and the fact that it is written in Python (which is always a plus). Others worth mentioning are comcast, toxiproxy, and clumsy.

Let’s illustrate the process with a simple example. We have a very simple RESTful web API that serves a one-page web application. We want to make sure the application can handle transient problems with the API, so we need to set up the proxy between the app and the API.

First we’ll hit the API using ApacheBench to get an idea of how it works under normal circumstances:

Server Software:        nginx/1.4.6
Server Hostname:        localhost
Server Port:            8000

Document Path:          /api/items/
Document Length:        2817 bytes

Concurrency Level:      1
Time taken for tests:   3.909 seconds
Complete requests:      500
Failed requests:        0
Total transferred:      1519000 bytes
HTML transferred:       1408500 bytes
Requests per second:    127.91 [#/sec] (mean)
Time per request:       7.818 [ms] (mean)
Time per request:       7.818 [ms] (mean, across all concurrent requests)
Transfer rate:          379.49 [Kbytes/sec] received

Connection Times (ms)
        min  mean[+/-sd] median   max
Connect:        0    0   0.0      0       0
Processing:     7    8   0.5      8      11
Waiting:        7    8   0.5      7      11
Total:          7    8   0.5      8      11
ERROR: The median and mean for the waiting time are more than twice the standard
       deviation apart. These results are NOT reliable.

Percentage of the requests served within a certain time (ms)
  50%      8
  66%      8
  75%      8
  80%      8
  90%      8
  95%      9
  98%      9
  99%     10
 100%     11 (longest request)

Now we want 20% of the calls to fail, so we set up Vaurien:

$ vaurien --protocol http --proxy localhost:8005 --backend localhost:8000 --behavior 20:error
2016-08-08 11:54:58 [18580] [INFO] Starting the Chaos TCP Server
2016-08-08 11:54:58 [18580] [INFO] Options:
2016-08-08 11:54:58 [18580] [INFO] * proxies from localhost:8005 to localhost:8000
2016-08-08 11:54:58 [18580] [INFO] * timeout: 30
2016-08-08 11:54:58 [18580] [INFO] * stay_connected: 0
2016-08-08 11:54:58 [18580] [INFO] * pool_max_size: 100
2016-08-08 11:54:58 [18580] [INFO] * pool_timeout: 30
2016-08-08 11:54:58 [18580] [INFO] * async_mode: 1

And run ab against the proxy port:

Server Software:
Server Hostname:        localhost
Server Port:            8005

Document Path:          /api/items/
Document Length:        188 bytes

Concurrency Level:      1
Time taken for tests:   3.854 seconds
Complete requests:      500
Failed requests:        471
   (Connect: 0, Receive: 0, Length: 471, Exceptions: 0)
Non-2xx responses:      102
Total transferred:      1241946 bytes
HTML transferred:       1144942 bytes
Requests per second:    129.74 [#/sec] (mean)
Time per request:       7.707 [ms] (mean)
Time per request:       7.707 [ms] (mean, across all concurrent requests)
Transfer rate:          314.72 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    0   0.0      0       0
Processing:     1    8   3.8      9      36
Waiting:        1    8   3.8      9      36
Total:          1    8   3.8      9      36

Percentage of the requests served within a certain time (ms)
  50%      9
  66%      9
  75%      9
  80%     10
  90%     10
  95%     11
  98%     12
  99%     13
 100%     36 (longest request)

So we got 102 out of 500 abnormal responses (close to our 20% target).

Vaurien supports other protocols as well, which makes it ideal for introducing uncertainty into any modern project that integrates multiple back-ends. But it is not without its problems: it only supports Python 2, so you would have to use a separate virtualenv for Python 3 projects; it lacks flexibility for some use cases; and it is another component that you have to integrate into your testing architecture, which adds complexity.

What if all you wanted was a simple way to introduce uncertainty into an API to test a single-page app? If you’re using Django, we’ve got a solution for you.

Last week Django 1.10 was released, and with it, a new middleware system. While reviewing the release notes we realized that this was a good opportunity to fill a gap in our toolbox and learn to use the new features in the process; so we decided to create django_uncertainty.

django_uncertainty is a Django 1.10 middleware that allows developers to introduce the kinds of problems we described above, so that less-than-favorable conditions can be tested in the local development environment.
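Conceptually, this kind of middleware just wraps the request handler, following the callable style of Django 1.10’s new middleware. Here is a framework-agnostic sketch of the idea (not django_uncertainty’s actual code), with responses reduced to (status, body) tuples for brevity:

```python
import random

class UncertaintyMiddleware:
    """With probability `error_rate`, short-circuit the request with a
    simulated 500; otherwise call the real handler.
    A simplified illustration, not django_uncertainty's actual code."""

    def __init__(self, get_response, error_rate=0.2):
        self.get_response = get_response
        self.error_rate = error_rate

    def __call__(self, request):
        if random.random() < self.error_rate:
            return (500, "Internal Server Error")  # simulated failure
        return self.get_response(request)

# Wrap a trivial "view" that always succeeds:
handler = UncertaintyMiddleware(lambda request: (200, "OK"), error_rate=0.2)
```

Because the middleware runs inside the application, it can inspect the request (path, method, headers) before deciding whether to misbehave, which is something a generic TCP proxy cannot do as easily.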

What makes it different from Vaurien or any other proxy?

  • It is dead simple and can be easily extended.
  • It has no external dependencies.
  • You can define behaviors based on knowledge of the internal structure of the application.
  • It only works with Django 1.10 or later.
  • It only supports HTTP.
  • It can only be placed in front of the target application.

As you can see, it has its limitations, but we think it can be useful under certain conditions. Let’s see it in action.

Same as before, we want 20% of the requests to fail, but now we want 10% of the requests to be Internal Server Errors (500), 5% to be Forbidden (403) and 5% to be Not Found (404). Assuming that the middleware has been properly set up (it’s explained in the package documentation), you can declare this particular behavior with the DJANGO_UNCERTAINTY settings variable:

import uncertainty as u
DJANGO_UNCERTAINTY = u.random_choice([
    (u.server_error(), 0.1),
    (u.forbidden(), 0.05),
    (u.not_found(), 0.05)
])
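A weighted choice like this can be implemented by walking cumulative probabilities: draw a random number once, and return the first behavior whose cumulative weight exceeds it, with any leftover probability mass (here 80%) meaning "pass the request through". This is a sketch of the idea; django_uncertainty’s internals may differ.

```python
import random

def weighted_choice(weighted_behaviors):
    """Pick a behavior according to its probability; return None
    (pass through) if the draw falls past all of them."""
    draw = random.random()
    cumulative = 0.0
    for behavior, probability in weighted_behaviors:
        cumulative += probability
        if draw < cumulative:
            return behavior
    return None  # no behavior triggered: serve the request normally

# The same weights as the settings example above, with string stand-ins:
spec = [("server_error", 0.1), ("forbidden", 0.05), ("not_found", 0.05)]
```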

Let us fire up ab once again to see how the middleware performs:

Server Software:        nginx/1.4.6
Server Hostname:        localhost
Server Port:            8000

Document Path:          /api/items/
Document Length:        2817 bytes

Concurrency Level:      1
Time taken for tests:   3.401 seconds
Complete requests:      500
Failed requests:        97
   (Connect: 0, Receive: 0, Length: 97, Exceptions: 0)
Non-2xx responses:      97
Total transferred:      1242483 bytes
HTML transferred:       1135251 bytes
Requests per second:    147.01 [#/sec] (mean)
Time per request:       6.802 [ms] (mean)
Time per request:       6.802 [ms] (mean, across all concurrent requests)
Transfer rate:          356.76 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    0   0.0      0       0
Processing:     1    7   2.7      8      33
Waiting:        1    7   2.7      7      33
Total:          1    7   2.7      8      33

Percentage of the requests served within a certain time (ms)
  50%      8
  66%      8
  75%      8
  80%      8
  90%      8
  95%      9
  98%     10
  99%     10
 100%     33 (longest request)

The count for each specific status code is not shown by ab, but we can extract the values from the log file. Here’s the final tally:

Status Code   Count   Percentage
200             403         0.81
500              51         0.10
403              15         0.03
404              31         0.06

The numbers don’t match our specification exactly because a random number is used to choose which behavior to activate on each request.

Let us change the specification a bit. Now we want to delay requests by half a second, but only POST and PATCH requests under the /api/ path; every other request should go through as normal:

import uncertainty as u
DJANGO_UNCERTAINTY = u.cond(u.path_is('^/api') & (u.is_post | u.is_method('PATCH')), u.delay(u.default(), 0.5))
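The `&` and `|` composition in that snippet can be implemented by overloading operators on a small predicate class. The following is a hedged sketch of the mechanism, with requests reduced to plain dicts for illustration (these are not django_uncertainty’s actual classes):

```python
import re

class Condition:
    """A predicate over a request; & and | build composite predicates,
    mirroring the operator style of the settings example above."""

    def __init__(self, predicate):
        self.predicate = predicate

    def __call__(self, request):
        return self.predicate(request)

    def __and__(self, other):
        return Condition(lambda r: self(r) and other(r))

    def __or__(self, other):
        return Condition(lambda r: self(r) or other(r))

def path_is(pattern):
    # Requests are dicts like {"path": ..., "method": ...} in this sketch.
    return Condition(lambda r: re.match(pattern, r["path"]) is not None)

def is_method(method):
    return Condition(lambda r: r["method"] == method)

matches = path_is(r"^/api") & (is_method("POST") | is_method("PATCH"))
```

Building conditions as composable values like this is what lets the specification live in a single settings variable instead of scattered configuration.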

You can read the documentation to see what other behaviors and conditions can be used to control the outcome of the requests. We tried our best to provide the tools to describe the most common scenarios, but if you think there are others that we can cover, feel free to create a ticket in the GitHub repository with a description of the situation you want to specify.

If you want to try a live demo of the middleware, we’ve created a sample that exposes a RESTful API using Django REST framework. You can get the code on GitHub, and there’s also a Vagrantfile with everything already set up for you.

Drop us a comment with your take on this matter. We’re interested in learning about other situations where the unexpected happened and how the problem was dealt with. Feedback on the blog post and the code is always welcome as well.
