Uses of Automated Tests
Let's start with the most important thing: what can we use automated tests for? There are several things!
- Ensuring complex small utility functions work and handle all edge cases properly.
- Improving code quality in general.
- Ensuring bugs don't reappear.
- Making code easier to change and removing risks.
- Ensuring your consumers get all the features they were promised.
- Testing things you didn't think you could test.
1: Ensuring complex small utility functions work and handle all edge cases properly
A common issue in programming anything is that eventually you need to write a small "utility" function that doesn't deal with APIs but with raw, actual data, perhaps processing strings or doing math.
This could be anything. A simple function to split text by some separator or delimiter and get part of it. Perhaps a function to calculate a number from other numbers. Or tell you whether something is true or false based on math, string, or list manipulation.
These functions tend to be a hassle to test manually because there is simply no practical use case for most of their execution paths.
For example, if I have a tiny function that splits text and returns the first part, and I only ever use it with input I wrote myself, it makes sense to think I'm always going to write valid input, so the function is never going to fail. If all the data I have in practice is valid, I'll never hit the unhappy path that says "Error: input is in invalid format" or something like that.
I would have to keep a bunch of invalid files around just to manually check that this utility function, which could be buried deep under several function calls, actually fails the way it should fail.
These utility functions are often a time bomb.
They look like they work properly and their algorithm seems sound because we only use them with valid inputs in practice. One day in the future we'll want to use one of them with a different input, the utility function will fail, and what first looked like a small addition to the code becomes fixing the utility function to work properly with the new input.
Even making the function work in the first place with valid data can be a hassle.
For example, consider these very real possibilities:
- There is a website: when you press a button, the browser sends a POST request to an endpoint, which triggers a function in the backend, which sometimes calls a utility function if some conditions are met.
- There is a desktop application: when you press a button, it sometimes calls a utility function if the state of the application matches several conditions.
- There is a game: when you press a button, it sometimes calls the utility function if the player is at a certain point in the game, the enemies are in a certain position relative to the player, and random() > 0.97.
Ensuring the utility function works properly the first time can be a huge hassle if you're testing it manually depending on the application.
A good example would be Tetris. Imagine you're programming Tetris. You have a grid where falling blocks are placed. After a block is placed, a function checks whether a line was formed before clearing that line. In order to manually test this, many new programmers HAVE, IN FACT, MANUALLY PLAYED THE BROKEN GAME OVER AND OVER AGAIN just to make sure their isLineCleared function was programmed correctly. Every time they failed to program it correctly, they had to replay the game and try to get a line cleared just to make sure the code was working the way it should.
And this is a simple example. In real life, games far more complex than Tetris have been tested manually, as have applications that are relied upon to do real jobs, some of them critical.
By writing an automated test for this isLineCleared function, we just need to run the test to make sure the function actually works. We wouldn't need to playtest manually until the conditions required to trigger all the execution paths are met, because we can simply construct the required state artificially within the test function.
For example, if we write a script called game.test.py like this:
#!/usr/bin/env python3
from game import isLineCleared

# row 0 is empty, row 1 is completely filled
board = [[0] * 8, [1] * 8]

if isLineCleared(board, 0):
    raise ValueError(
        "isLineCleared returned True when it should have returned False!"
    )
if not isLineCleared(board, 1):
    raise ValueError(
        "isLineCleared returned False when it should have returned True!"
    )
We can simply run ./game.test.py to test our game. A dozen lines of Python save minutes of manual testing.
Note that generally it's a good idea to use a test framework that manages all your test cases in an organized structure, but if you don't want to, you can easily create your own custom tests by just writing a program yourself.
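For reference, here's a sketch of what the same test might look like as a pytest test case (pytest discovers functions whose names start with test_ in files named like test_game.py):

# test_game.py
from game import isLineCleared

def test_is_line_cleared():
    board = [[0] * 8, [1] * 8]  # row 0 is empty, row 1 is full
    assert not isLineCleared(board, 0)
    assert isLineCleared(board, 1)

Running pytest in the project directory then finds and runs it automatically.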
Tests for small functions are unit tests, which are isolated and meant to test only one function of your application or library.
When a function depends on external objects, unit tests are supposed to "mock" external dependencies so that you can ensure the things the function itself is responsible for actually work. This means that if you have 2 teams of developers, team #1 can test their code even if it depends on functions implemented by team #2 that aren't finished yet: they just provide mocks for those dependencies that always return the expected value according to the interface's specification.
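Here's a minimal sketch of that idea using Python's built-in unittest.mock; the functions convert_price and fetch_rate are made up for illustration:

from unittest.mock import Mock

# Hypothetical scenario: convert_price depends on fetch_rate, which
# team #2 hasn't finished yet. The test injects a mock that returns
# the value promised by the interface's specification.
def convert_price(amount, fetch_rate):
    return amount * fetch_rate('EUR', 'USD')

def test_convert_price():
    fake_fetch_rate = Mock(return_value=1.5)
    assert convert_price(100, fake_fetch_rate) == 150.0
    fake_fetch_rate.assert_called_once_with('EUR', 'USD')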
In practice, writing mocks so you can isolate everything is more effort than it's worth in small projects. Focus on just creating automated tests first, and worry about the "proper" way to do things later.
2: Improving code quality in general
Writing tests improves the quality of your code. Because of how tests work, it's often not even possible to test code if the code quality is terrible. This means that in order to write tests, you must first write testable code, and testable code has greater quality even if no tests are written.
Testable code means functions and subprograms that follow the Single Responsibility Principle: each function should do ONE thing, and ONE thing only. Since you're going to have to write tests for all sorts of things, it's going to make your life much easier if you split every piece of complexity into a separate, more testable function.
Single-responsibility functions have self-documenting names, which just improves code readability in general. You pretty much don't need comments when all your functions explain what they do with their very names. But you can add comments to them nevertheless, and that is awesome.
Every non-trivial computation should ideally be separated into its own testable function. For example, consider the following function:
def get_canonical_url(url: str) -> str:
    frag = url.split('#')[-1].split('?')[0]
    if frag.startswith('comments-'):
        return f'/comments/{frag.split("-")[-1]}'
    else:
        return url
Although this is just a few lines of Python, this function does too much. It's clear that it's supposed to transform /post/123#comments-456?view=comments into /comments/456. We could write a test for it as-is, but it's easier to split it into simpler functions instead, because then you only need to ensure the edge cases are covered by the simpler functions instead of testing the super-function fully.
Let's rewrite it:
def get_fragment_from_url(url: str) -> str|None:
    try:
        idx = url.index('#')
    except ValueError:
        # str.index raises ValueError if '#' is not found
        return None
    after_hash = url[idx + 1:]
    return after_hash.split('?')[0]
def get_comment_id_from_fragment(frag: str) -> int|None:
    prefix = 'comments-'
    if not frag.startswith(prefix):
        return None
    after_prefix = frag[len(prefix):]
    try:
        return int(after_prefix)
    except ValueError:
        # int raises ValueError if it can't parse a string
        return None

def get_comment_url_for_id(id: int) -> str:
    return f'/comments/{id}'

def get_canonical_url(url: str) -> str:
    frag = get_fragment_from_url(url)
    if frag is not None:
        comment_id = get_comment_id_from_fragment(frag)
        if comment_id is not None:
            return get_comment_url_for_id(comment_id)
    return url
Now we have a lot more code than we started with, and that's good. When we write everything in a single function, it's easy to ignore edge cases, because if we considered all the edge cases we would have to write too much code in a single function. When we split functions, we have to consider: are we really handling every possible scenario?
What if comments-abc shows up, for example? Because our subprogram that gets the ID from a fragment must now return a numeric ID, it must handle the case where what comes after comments- isn't numeric.
It isn't a bad thing that we have more code, because if that case ever happens in practice, we would have to write this code anyway. We're just writing it beforehand so we aren't caught by surprise later.
Observation: interpreted languages like Python and Javascript are quite slow at handling function calls because they must resolve the function name at runtime. There are optimizations for this, but none of them can really compare to the way compiled languages like Zig, Rust, C++, and C work. In compiled languages, a function call is just the address of the function's machine code, which requires no processing: you just jump to the address and you're done. If the function is simple enough, the compiler can even inline the function's code into the caller. This means, effectively, that the tiny functions disappear from the compiled binary; their code is automatically copy-pasted into the functions that call them. None of this matters generally, but in performance-critical applications all those extra calls can hurt, so readability and testability aren't always the most important thing about code. Still, it's very, very rare for extra function calls to be the cause of bad performance (unless you're doing real-time graphics rendering in Python or Javascript, which I have). In those cases the best solution (as I bitterly learned after much trial and error) isn't to restructure your code for better performance, it's to just use a lower-level language that is faster and import it into the interpreted language somehow. That's what MyPaint, a digital art application made in Python, does with its brush-rendering code, for example.
Now that everything is split, it's very easy to test everything. For the sake of brevity, let's use Python's assert statement to quickly test everything.
def test_get_fragment_from_url():
    assert get_fragment_from_url('') is None
    assert get_fragment_from_url('/') is None
    assert get_fragment_from_url('#') == ''
    assert get_fragment_from_url('/#') == ''
    assert get_fragment_from_url('/foo#bar') == 'bar'
    assert get_fragment_from_url('/foo#bar/baz#fish') == 'bar/baz#fish'
    assert get_fragment_from_url('/foo#bar?baz') == 'bar'

def test_get_comment_id_from_fragment():
    assert get_comment_id_from_fragment('') is None
    assert get_comment_id_from_fragment('foo') is None
    assert get_comment_id_from_fragment('comments') is None
    assert get_comment_id_from_fragment('comments-') is None
    assert get_comment_id_from_fragment('comments-abc') is None
    assert get_comment_id_from_fragment('comments-123abc') is None
    assert get_comment_id_from_fragment('comments-123') == 123
    assert get_comment_id_from_fragment('comments-000') == 0

def test_get_comment_url_for_id():
    assert get_comment_url_for_id(123) == '/comments/123'

def test_get_canonical_url():
    assert get_canonical_url('/foo') == '/foo'
    assert get_canonical_url('/foo#comments-123') == get_comment_url_for_id(123)
As you can see above, the beauty of tests is that they literally tell you—they document—the behavior of your program.
What happens when you pass a URL that has two hashes to get_fragment_from_url? Should #bar/baz#fish result in bar/baz or bar/baz#fish? This edge case really should never happen in practice, but if it does, the test documents what the output should be: bar/baz#fish.
If we ever change the code of the function, e.g. to use a library instead, and that assertion fails because the library's behavior differs from our implementation, then we effectively changed the API of our code: an input that produced one output before no longer produces the same output. Of course, in this case it's an input that really should never be used in practice, but the same applies to any input you actually support.
Any time you change the code of a subprogram, it may change its output, which may change the behavior of any subprogram that depends on it. Even small projects can have an infinite web of dependencies from one function to another, from one class to another. The tests ensure that the consumers of a subprogram's public interface are NEVER affected by any internal changes that are supposed to be implementation details.
Although tests may LOOK repetitive at first, we can easily remove redundant code in several ways. We can move some of a test case's assertions into a separate function that is called by multiple test cases. We can use lists of tuples in Python containing the input and expected output and check them all in a single for loop (or with a map function in some languages). We can use metaprogramming in languages like Zig as well.
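For example, test_get_comment_id_from_fragment above could be rewritten as a table of inputs and expected outputs:

def test_get_comment_id_from_fragment():
    cases = [
        ('', None),
        ('foo', None),
        ('comments', None),
        ('comments-', None),
        ('comments-abc', None),
        ('comments-123abc', None),
        ('comments-123', 123),
        ('comments-000', 0),
    ]
    for frag, expected in cases:
        # the second argument to assert shows which case failed
        assert get_comment_id_from_fragment(frag) == expected, frag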
Although they may look tedious to write at first, in many cases, if you try to program anything even slightly complicated, you're going to make a mistake and it won't work correctly the first time. Writing the tests first (also called Test-Driven Development) means that you don't have to run the main program to ensure you wrote it correctly; you can just run the test program. In fact, in VS Code you can simply click a play button with the test Python script open to run it in a debugger. Testing can require zero setup and no third-party tools. Just write a .test.py and click run.
In any real project, one problem you'll have with unit tests is that many unit test frameworks don't let you declare dependencies between tests, and people on the Internet will be generally unhelpful about it because they actually believe tests should always be isolated. In the code above, test_get_canonical_url uses get_comment_url_for_id. This is a dependency. If get_comment_url_for_id is broken, i.e. if its own test fails, then this test is also broken; in fact, it won't even matter whether it passes or not, because we can't trust that the function it uses gives the correct output. There are many cases where this happens. In any CRUD, for example, it's pointless to test Read, Update, and Delete if you need to Create first and that failed. You'll always have hundreds of tests that depend on functions tested by hundreds of other tests. If a low-level subprogram is broken, you just know it's all going to be broken, but some frameworks insist on running all the tests, in random order, even when you know it's a waste of time.
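If you're writing your own plain test script, one crude workaround is to run the tests in dependency order yourself and stop at the first failure, so a broken low-level function doesn't bury you in meaningless downstream failures:

# low-level tests first; the first uncaught AssertionError stops the run
for test in [
    test_get_fragment_from_url,
    test_get_comment_id_from_fragment,
    test_get_comment_url_for_id,
    test_get_canonical_url,  # depends on everything above
]:
    test()
    print(test.__name__, 'passed')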
3: Ensuring bugs don't reappear
One of the greatest things about tests is that you can always write more tests to test more things, especially things that were previously untested.
Let's say your code appears to work, but then you did something unusual and found a bug. You could simply fix the bug, but that's not very maintainable, because you could end up reintroducing the same bug later when you change the code.
Assuming your brain is as forgetful as mine, we can imagine there are three situations we could have:
- You just fix the bug, and forget all about it in 2 days.
- You fix the bug and add comments to the code that fixes it so you remember why you're doing all the stuff you're doing. One day you might rewrite the whole thing from scratch and reintroduce the bug in a brand new file!
- You write a test so you never encounter the same bug again.
Tests are the superior solution for addressing bugs. Bugs are lapses in logic, and tests are a support structure that ensures the whole thing doesn't fall apart just because you forgot an if again.
For example, let's say that the first time we wrote a function, we didn't consider what would happen if one of the arguments was an empty string. This is a common scenario, and it's surely going to lead to a bug one day. When it eventually happens, instead of just adding the if to the function to check whether the string is empty, also add an automated test that checks that empty strings are properly handled. You can include it in the unit test you already have for that function; just add a comment to explain why the new assertion was included.
Tests that test if bugs reappear are called regression tests, because if bugs reappeared, that would mean your codebase has regressed to a previously broken state.
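A hypothetical sketch of what such a regression test looks like (get_initials and its bug are made up for illustration):

def get_initials(name: str) -> str:
    words = name.split()
    if not words:  # the fix: without this line, '' raised an IndexError
        return ''
    return (words[0][0] + words[-1][0]).upper()

def test_get_initials():
    assert get_initials('Ada Lovelace') == 'AL'
    # regression test: empty and blank names used to crash with an
    # IndexError before the early return above was added
    assert get_initials('') == ''
    assert get_initials('   ') == ''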
Depending on your programming language, testing that code raises an exception can be tricky, because exceptions are hidden control flow; this is another reason we generally use test frameworks, which come with all the functionality we need to handle these cases. In Python, for example, exceptions raised inside a with statement are passed to the with's context manager, so it can finalize properly before the exception propagates. With pytest, we can test that the following code raises an exception:
import pytest

def test_zero_division():
    with pytest.raises(ZeroDivisionError):
        1 / 0
4: Making code easier to change and removing risks
Have you ever seen some code that you have to change, but you hesitate because it's such an epic mess that you have no idea about the consequences? It could be someone else's code. It could be your code that you wrote 1 year ago for one of your dozen unfinished side projects. You have no idea who depends on it. You use your IDE to check the uses, but all you see is a bunch of functions used by OTHER functions, which are used by OTHER functions, and so on.
In order to touch that code first you need COURAGE. Because you're afraid of the consequences.
You're on a shaky bridge. On top of a Jenga tower of code. Moving any piece can make everything fall apart. So before doing anything, first you take a deep breath and start trying to figure out what you can do that wouldn't accidentally make everything fall apart.
This happens because the code doesn't have tests.
People are afraid of trying things because of the risks: because of what happens when they fail. This is the same in life in general as it is in development.
You don't want to touch that database, because you don't have backups, so if something bad happens, it's all over.
You don't want to make backups, because it's a hassle. You forgot how to do it. You have to look it all up again.
You DO have backups, but restoring the backups is a hassle. You forgot how to do it. You have to do it all over again.
You don't want to deploy because deploying is a hassle. You know how to do it, but it takes time to do manually. If there is a bug, you're going to have to go into panic mode to fix it in prod. So you want to make sure the whole thing is well-tested and working before deploying.
Because applying changes, and most importantly, undoing them, is so difficult, developers become afraid of touching things.
You need automation.
You need a script to handle deploying to prod with one click, and undoing that deployment instantly if you need to. You don't need git and CI/CD for this; you just need a script where you wrote down every step you would have done manually, but that does it all automatically, e.g. copying files with scp or sending commands via ssh. Just please don't write it in bash. Write it in Python. It will save you a lot of trouble.
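Here's a minimal sketch of such a script; the host, paths, and service name are made up, so adapt every step to your own setup:

#!/usr/bin/env python3
import subprocess
import sys

HOST = 'user@example.com'  # hypothetical server

def run(*cmd):
    print('+', ' '.join(cmd))
    subprocess.run(cmd, check=True)  # abort immediately if a step fails

def deploy():
    run('scp', '-r', './build', f'{HOST}:/srv/myapp/releases/new')
    # atomically switch the 'current' symlink to the new release
    run('ssh', HOST, 'ln -sfn /srv/myapp/releases/new /srv/myapp/current')
    run('ssh', HOST, 'systemctl restart myapp')

def rollback():
    # point 'current' back at the previous release and restart
    run('ssh', HOST, 'ln -sfn /srv/myapp/releases/old /srv/myapp/current')
    run('ssh', HOST, 'systemctl restart myapp')

if __name__ == '__main__':
    rollback() if 'rollback' in sys.argv else deploy()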
You need a script to handle backing up your data, routinely, automatically, to multiple locations, even if you are hit by a bus and sent to the hospital in a coma for 6 months. You need a test environment to verify that your backups actually work, so you can write a script that automatically restores them with one click if you ever need to. You can use Docker for this.
And you need automated tests to ensure that if you ever break anything, you'll know instantly what is broken and why, so you can just make all the changes you want, break everything, take a look at what you broke, and fix everything you broke afterward before deploying.
Because the tests document every single behavior of your program, so long as the tests pass, there are no bugs. You don't even need to check if things are actually unbroken, because you got all the tests to support you. But check it manually quickly anyway because you could have written the tests wrong.
Automation provides the support you need to make changes without being afraid of all the manual labor you'll need to do to fix things if you break them. Because the fixes are automated.
Most importantly, automated tests provide an assurance: the code actually works.
Untested code is broken code.
If you inherit a codebase today, you have absolutely no idea what parts of the code are working and what parts of the code could be broken. You're going in blind. The same thing applies if you revisit a side-project after a while. What exactly did you get working? What isn't working yet?
The tests tell you: you aren't standing on top of a Jenga tower, nor on top of thin air. You are standing on top of something real. You can jump on it and it isn't going to fall apart. It's solid. You can't trust the code, but you can trust the tests.
5: Ensuring your consumers get all the features they were promised
Let's say you have a GUI application. Like most GUIs, it has buttons everywhere. Every button does a different thing. Sometimes two buttons do the same thing. Sometimes buttons do different things based on the state of the application. Sometimes buttons depend on configuration, user's preferences, etc. You have 100 buttons. 80% of the users use 20 buttons. 20% of the users use the other 80. Ensuring that every single button works and continues working as you keep adding more buttons and fixing buttons that were broken before is literally your job.
Without automated tests, you can probably imagine what a nightmarish scenario it would be to test this application. You would have to click on every single button. Check every single checkbox, then test every single button again. Check some boxes, and test every single button again.
In many cases, GUI applications don't have any tests, which means the developer created the GUI once, added the button once, clicked on it once to make sure it was working, and completely forgot about it for five years, until someone came along who actually tried clicking that button, found out it didn't work, and went through all the effort of reporting the issue instead of just trying a different application.
Having a lot of functionality is great, but it means that some functionality is only used by a small portion of your users, and functionality that is almost never used may have bugs that are almost never encountered. It makes sense to think that functionality used by most users will be more stable, but there is no reason we can't guarantee the stability of the functionality used by fewer. Just add tests to inflate the number of "users" artificially.
With automated tests, it's like everyone is using all features equally.
However, do note that writing tests for these things gets a bit complicated.
For library code, it's easy, because all you need are unit tests. Application code, however, inherently depends on countless system-level or framework-level components. You MUST test with dependencies; there is just no practical way to mock all the functionality of the GUI toolkit or the filesystem. To the end user, what matters is that the button they click actually does what it's supposed to do. They don't care whether the filesystem or the GUI toolkit has a bug; they want the button to work, so the button not working, even if your code is perfect, is still a bug. If there is a bug in a library you depend on, you just work around it.
Most importantly, even projects that DO have automated tests tend to test their user-facing functionality superficially. Or rather, they only test the deep level, and not the surface level the user interacts with.
Consider, for example, a GUI that has a button. If you click it, the toolkit sends an event to be handled by your code, and your on_click event handler just calls a function like foo(). Most developers will ASSUME that everything above foo works and will only test foo. This is a mistake.
What if in a code change you forget to attach the event handler? Then foo works, the tests pass, but the button doesn't work.
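Here's a minimal sketch of testing through the toolkit instead, using tkinter as an example (this needs a display to run; invoke() simulates a click through the toolkit's own dispatch, so it fails if the handler was never attached):

import tkinter as tk

clicks = []

root = tk.Tk()
button = tk.Button(root, text='Save', command=lambda: clicks.append(1))
# invoke() runs whatever command is attached to the button, exactly
# as a click would, so a missing handler makes the assertion fail
button.invoke()
assert clicks == [1], 'the click handler never ran!'
root.destroy()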
The functionality that you want to test is the functionality that the user will use. In fact, you could have a codebase without any unit tests, full of utility functions that can't handle even the simplest use cases, but so long as it's impossible to reach those use cases by clicking the buttons, and you tested that all the buttons work in all possible states, your application is absolutely perfect as far as automated tests are concerned.
Instead of testing every single component and every single level of your application, focus on testing whether the features you support actually work. If you support clicking buttons, then clicking buttons should work.
If your project is a website, you have an extremely complicated situation.
- Your backend receives input over HTTP, so you could test it by literally just writing curl in a shell script a bunch of times.
- The output is HTML, so you really should use an HTML parser to test it, like Python's Beautiful Soup.
- Javascript and CSS can hide or display buttons dynamically, so what you REALLY should test is whether a web browser can trigger the functionality you provide, which means you'll need to use Selenium to run Chrome headless and script your interactions with your webpages like a TAS speedrunner.
Many web developers only test the backend and the Javascript, because those are easy to test, you just need a simple testing framework for that. However, what you ideally should test is whether the user can actually see the button that you believe is visible. Testing with Selenium is far more complicated than just testing HTTP endpoints or Javascript functions, but it can be useful to ensure critical functionality actually works.
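Here's a minimal sketch of such a Selenium test, using the Selenium 4 API; the URL and the element ID are assumptions about a hypothetical page:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

options = Options()
options.add_argument('--headless=new')  # run Chrome without a window
driver = webdriver.Chrome(options=options)
try:
    driver.get('http://localhost:8000/login')
    button = driver.find_element(By.ID, 'login-button')
    # test what the user actually sees, not just what the DOM contains
    assert button.is_displayed(), 'the login button is hidden!'
    button.click()
    assert 'Welcome' in driver.page_source
finally:
    driver.quit()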
Even if you don't test your entire website with Selenium, or your entire GUI application by emulating button clicks, you could have a few tests to ensure some critical parts actually work the way they should.
6: Testing things you didn't think you could test
Once you become familiar with automated testing, you'll start using automated tests to test things you wouldn't imagine you could use automated tests for before. Things that seemed impossible to do because it would be too much work become trivial when we leverage a CPU.
For example, let's imagine that you have a whole website made with X. One beautiful day you realize that you have to upgrade Xv1 to Xv2, which sends you into panic mode: you used too much unusual functionality of Xv1, so any of your pages could break, and you have THOUSANDS of pages. There is no way you could possibly check every single webpage you made manually. How do you ensure that the upgrade won't break your website in ways you'll only notice 2 years later when you accidentally view an old webpage?
You test it.
Create a test environment, load a backup of your website, program a crawler to fetch every single webpage of your website and save it as a static file, upgrade to Xv2, save all pages again in a separate folder, then check if all of the old pages are identical to all of the new pages.
This may sound like a lot of things to do, but it's a single test. Ideally, you want to write a test script that can set up Docker and do all of these things automatically, because you're going to do it wrong the first time anyway. The way you check whether pages are identical will be wrong. Xv2 added some markup that doesn't really matter, and you wish you could ignore it in your test, so you rewrite the test to ignore it, then rerun the test to redo the whole process.
This could take minutes, but it's minutes of CPU time, not of your time. By the way, you can save some of your own time by not programming the crawler yourself if you used a common framework; e.g. Flask has Frozen-Flask, which generates a static website from a Flask app automatically.
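Here's a minimal sketch of the snapshot-and-compare idea, assuming you already have a list of URLs and a test server running locally (a real version would discover links as it crawls):

import pathlib
import requests

BASE = 'http://localhost:8000'  # hypothetical test environment

def snapshot(urls, outdir):
    out = pathlib.Path(outdir)
    out.mkdir(exist_ok=True)
    for url in urls:
        name = url.strip('/').replace('/', '_') or 'index'
        (out / f'{name}.html').write_text(requests.get(BASE + url).text)

def compare(old_dir, new_dir):
    for old in sorted(pathlib.Path(old_dir).iterdir()):
        new = pathlib.Path(new_dir) / old.name
        if old.read_text() != new.read_text():
            print('CHANGED:', old.name)

# snapshot(urls, 'before')  # with Xv1 running
# snapshot(urls, 'after')   # with Xv2 running
# compare('before', 'after')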
How do you make sure your website doesn't have broken links?
You test it.
Write a script that loads the backup, etc., and checks every link. You're done.
How do you make sure you don't have CSS classes that are never used by any single webpage of your entire website?
You test it.
How do you make sure there are no 5XX errors in your website? You don't really have to test this, because Google will crawl your whole website anyway, so you can just check the Search Console they provide to webmasters. But if your website is hidden behind a login page, you could just test it.
Let's say you're programming a video game or some application that loads data in a custom format. There are some complicated cases that can lead to bugs, like circular dependencies in the data you load. But this is data you provide: it's configuration, it's assets. It's not the user who loads this, so not only should it be correct, there is also nothing the user can do if it isn't. Checking for these errors on every load is expensive, so you don't want to do it when the program starts in production because it would be a performance hit. What do you do then?
You test it.
You don't even need test cases in this case. Some programming languages, like Python, support assertions: code in an assertion is only executed when the program is compiled or run in debug mode (the details depend on the language; Python skips assert statements when run with python -O). For compiled languages like C, this means that if an expensive function call only appears inside an assertion, the production binary won't contain the assertion, so the function won't even be part of the output binary.
You can use an assertion, for example, to ensure that two objects weren't registered under the same key or ID. If all objects are always registered when the program runs in debug or development mode, it's not necessary to include this check in production: if it failed, the whole thing wouldn't even run in development. Asserts can be used to automatically test that trusted configuration data is properly formatted.
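A minimal sketch of that duplicate-registration check (the registry here is a stand-in for whatever structure your application uses):

registry = {}

def register(key, obj):
    # this check only runs in debug mode: Python strips assert
    # statements entirely when run with `python -O`
    assert key not in registry, f'{key!r} was registered twice!'
    registry[key] = obj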
Conclusion
I hope you can understand the benefits of automated testing, or, more generally, the benefits of automated processes now.
Many IDEs, including VS Code, have support for integrating test frameworks into the IDE's user interface, which is pretty nice. This means that tests you write in pytest, for example, will appear in a nice list in VS Code (if you have an extension that handles pytest installed), and you can run all of them, or a single test case individually, just by clicking buttons.
Some frameworks provide test functionality that you can use. For example, in Django, a project uses Django's ORM to declare its models, and Django normally handles creating the database and performing migrations automatically. But Django also provides test tooling: in tests, Django will automatically create a blank database, already set up to store your models, for your tests to use.