Unit Testing Principles, Practices, and Patterns, written by Vladimir Khorikov, explains how to test and refactor complex software projects in a way that maximizes both development speed and quality. Don't be misled by its title: this is as much a book about unit testing as it is a book about integration testing and refactoring towards a more testable software architecture.
The first part introduces the philosophy of unit testing: purpose, schools, and method. The second part explains how to refactor an application so you can write more valuable tests. The third part dives into integration testing, mocking, and how to handle dependencies like databases. The final part provides a brief overview of a few testing anti-patterns.
This article summarizes the key lessons from this book. This is by no means a full account of its content, because the author builds up concepts by adding nuance in each new chapter and he covers a lot of ground. Prior exposure to Domain-Driven Design and common architectural patterns can help for the part about refactoring. You will find in the book detailed examples that put these ideas into practice.
- Unit testing
- Valuable tests
- Refactoring towards testability
- Integration testing
- Further notes
Growing software projects become more difficult to maintain over time. As the code becomes more complex, unit tests make sure each component still meets its original business goal.
§Two schools of testing
A unit test:
- Verifies a small piece of code.
- In isolation.
The interpretation of what "a small piece of code" and "isolation" mean gave birth to two schools of testing, with profound implications all the way down to integration tests, mocking, and software architecture.
The classical school advocates for testing a unit of behavior in isolation from other tests. The System Under Test (abbreviated SUT) is defined by its business goal, so it can range from a single function to a group of dependent classes.
The isolation property only implies that a test cannot have shared mutable dependencies, like a shared database. So you could have multiple instances for each test, or have each test affect a separate collection. The only goal is to prevent a test from affecting the outcome of another.
The London school advocates for testing a unit of code in isolation from all other units of code. A unit of code could be a single function, or a single class. If this unit relies on mutable dependencies, including classes of your own, you would have to replace them with mocks.
The author favors the classical school of testing, as the London school can lead to damaging issues further explored in the following sections. Here are a few risks associated with this way of testing:
- Over-specification: imposing stricter rules on the implementation than required by the business goals.
- High coupling: testing communication between units that has no direct link to a business goal.
- Lower test quality: instantiating and configuring mocks reduces the overall readability and maintainability of a test suite.
The important keyword here is business goal. A practical implication of testing according to a business goal is that a unit test shouldn't be named after the method under test. Instead, the author advocates for a flexible naming policy, as if you were trying to describe the test scenario to a domain expert.
For instance, a test named after the method it exercises could be renamed Test_AuthMiddleware_Forbids_Guests. The new name is understandable by anyone, it clearly matches a business goal, and you can rename the underlying method without renaming the test.
That also means you should abandon the idea of achieving 100% test coverage. The notion itself is flawed: is it 100% of the lines of code, 100% of the branches, or 100% of the possible code paths? Even then, not all pieces of code deserve to be tested. In the author's words: code is not an asset, it's a liability. Each additional line of code, whether it is a test or not, increases the project maintenance cost.
§Arrange, Act, Assert
As an important step towards maintainability, your tests should be easily understandable. Unit tests should follow a common structure where the following sections appear in this order:
- An arrange section to instantiate the SUT and the required dependencies. To keep this section short you can extract common initialization logic into reusable methods.
- An act section to call the method under test. Your code should have no more than one method for each important business goal.
- An assert section to do your checks. To keep this section short, you can implement equality methods on important value objects, and refactor common assertions into reusable functions.
By sticking to the AAA structure, anyone can quickly understand what a test does. It also prevents you from testing multiple business goals within a single test, which improves their readability and promotes isolation. For exceptional reasons, like complicated test setup or performance, you could break this rule.
I would personally add a few exceptions to this general structure. Sometimes you need some asserts just after the arrange step to ensure that the call in the act section is really responsible for the observed changes, otherwise you could have a test that passes for the wrong reasons.
Similarly, you may have intermediate steps in the arrange or act section that return errors. Even if they do not correspond exactly to the method under test, I would still assert that they do not fail, to detect improper uses of the API that must be fixed.
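To make the structure concrete, here is a minimal sketch in Python. The `Cart` class and the `make_cart_with` helper are hypothetical names invented for this illustration, not examples from the book. The test is named after the behavior it checks, and it includes a guard assertion right after the arrange step, as suggested above.

```python
from dataclasses import dataclass, field


@dataclass
class Cart:
    """Hypothetical SUT: a shopping cart holding item prices in cents."""
    items: list = field(default_factory=list)

    def add(self, price_cents: int) -> None:
        self.items.append(price_cents)

    def total(self) -> int:
        return sum(self.items)


def make_cart_with(*prices: int) -> Cart:
    """Reusable arrange helper that keeps the arrange section short."""
    cart = Cart()
    for price in prices:
        cart.add(price)
    return cart


def test_adding_an_item_increases_the_total():
    # Arrange
    cart = make_cart_with(500)
    assert cart.total() == 500  # guard: the starting state is the expected one

    # Act: a single call to the method under test
    cart.add(250)

    # Assert
    assert cart.total() == 750
```

The guard assertion is a personal addition to the AAA structure: if it fails, the problem lies in the setup rather than in the behavior under test.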
Writing tests is good, writing valuable tests is better.
According to the book, you can measure the value of a unit test by the following properties:
Protection against regressions evaluates how good a test is at finding bugs. This property depends on the following metrics:
- The amount of code executed during the test: the more code you cover, the higher your chances of detecting regressions.
- The complexity of that code: complex systems tend to have more bugs, whereas trivial utilities are not worth the effort.
- The domain significance: regressions that affect the business logic are more important than regressions affecting minor features.
Resistance to refactoring evaluates whether a test can survive refactoring without breaking. It depends on the coupling between the test and the implementation, as measured by:
- The proportion of tested code paths, regardless of business significance, which leaves less room for refactoring.
- The probability of false positives, defined by test failure in the absence of business or technical issue.
Fast feedback measures how quickly a test terminates so you can get its result. It depends on:
- The dependencies under test: features like hashing or compression make test execution slower.
- The size of the test: involving a large amount of data, or exercising too many features at once.
Maintainability measures how easy the test is to set up and how readable the code is. It depends on:
- The dependencies: how difficult they are to instantiate, use, or mock.
- The size of the test: complicated scenarios that involve many entities are harder to reason about and refactor.
- Convoluted code: tests that do not follow the common AAA pattern are less readable.
The author explains that if you associate a score between 0 and 1 to each of these attributes, you can compute the value of a test by taking the product. You could define a threshold to decide whether you should keep a test in your suite or not. That gives you a complement to code coverage metrics.
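As a rough illustration of this scoring, the sketch below multiplies the four attributes together. The function name `value_of_test` and the threshold value are assumptions made for this example; the book does not prescribe any particular implementation or cut-off.

```python
from math import prod


def value_of_test(regression_protection: float,
                  refactoring_resistance: float,
                  fast_feedback: float,
                  maintainability: float) -> float:
    """Return the product of the four attribute scores, each in [0, 1]."""
    scores = (regression_protection, refactoring_resistance,
              fast_feedback, maintainability)
    if not all(0.0 <= score <= 1.0 for score in scores):
        raise ValueError("each score must be between 0 and 1")
    return prod(scores)


KEEP_THRESHOLD = 0.2  # hypothetical threshold, to be tuned per project


def is_worth_keeping(*scores: float) -> bool:
    return value_of_test(*scores) >= KEEP_THRESHOLD
```

Because the value is a product, a score of 0 on any single attribute brings the whole value down to 0, no matter how well the test does on the other three.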
Clearly, no single test can maximize all of these attributes, so what trade-off should you make? Let's take one attribute off the negotiating table right away: maintainability. Tests must be given the same care you give to production code. If you cannot write simple tests, you have to refactor your code. Likewise, you must write tests that resist refactoring.
§Resistance to refactoring
In the author's view, resistance to refactoring is non-negotiable because it is a binary attribute: a test can resist refactoring, or it cannot. Consequently, a test that does not max out this attribute scores a 0, so it is worthless.
Let me emphasize that resistance to refactoring doesn't mean that you never have to edit your tests when you rename a method, or when you change some type. It only means that a test must not fail in the absence of bugs, or in other words, return a false positive.
In the author's experience, the number of false positives isn't noticeable at the beginning of a project, but they accumulate over time leading to:
- A loss of trust in the test suite: if you doubt their ability to uncover bugs, you may be tempted to ignore or disable failing tests.
- Higher cost to introduce new features: you have to decrease the protection against regression if you want to introduce new features in a timely manner.
In the long run, having too many brittle tests is the same as not having enough tests: the cost of introducing new features and refactoring becomes prohibitive, and the overall project quality falls, hurting its long-term sustainability.
§Styles of test
If your tests are too coupled to the implementation, they leave no room for refactoring. That's why it's important to choose what you test wisely. There are three styles of tests you can write to evaluate the correctness of a piece of code:
Output-based tests call a function with chosen arguments and compare the returned value to the expected value. This test is suitable for pure functions, which do not have side effects.
Programs written in a functional style are easier to reason about, because each function fully encapsulates the behavior with no hidden state. Since the output only depends on the input, they are also easier to test.
State-based tests call a method that mutates some object and compare the final state to the expected state. This style is commonly associated with object-oriented programming, where classes encapsulate both state and behavior, exposing mutating methods.
This style makes it difficult to narrow the scope of a test, and any change in some part of the code can negate the assumptions you can make locally about the final state. The associated risk of over-specification generates false positives, which decreases their resistance to refactoring.
Communication-based tests call a method and inspect how it interacts with its collaborators, replaced by mocks. Most programming languages have powerful libraries that allow you to mock anything you want, but with great power comes great responsibility.
This kind of test leads to the highest degree of coupling to the implementation, because you directly test what the code does. If you decide to replace an internal class with an external library, communication-based tests will break. You should reserve these tests for communication that truly has business significance and that is externally observable.
These styles are presented in increasing degree of coupling to the implementation, which is equivalent to decreasing resistance to refactoring. According to the author, output-based tests generally rank better in all the metrics of a good test, in particular, they are easier to write and maintain.
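The three styles can be compared side by side in a short sketch. All the names below (`apply_discount`, `Order`, `Greeter`, and the gateway's `send` method) are hypothetical and only serve the illustration; the mock comes from Python's standard `unittest.mock` module.

```python
from unittest.mock import Mock


def apply_discount(price_cents: int, percent: int) -> int:
    """Pure function: the output depends only on the input."""
    return price_cents * (100 - percent) // 100


class Order:
    """Mutable object whose state changes when a product is added."""

    def __init__(self) -> None:
        self.products = []

    def add_product(self, name: str) -> None:
        self.products.append(name)


class Greeter:
    """Sends a greeting through a collaborator (an email gateway)."""

    def __init__(self, gateway) -> None:
        self._gateway = gateway

    def greet(self, address: str) -> None:
        self._gateway.send(address, "Welcome!")


def test_discount_is_applied_to_the_price():
    # Output-based: compare the returned value to the expected value.
    assert apply_discount(1000, 10) == 900


def test_added_product_appears_in_the_order():
    # State-based: inspect the state of the object after the call.
    order = Order()
    order.add_product("book")
    assert order.products == ["book"]


def test_new_user_receives_a_welcome_email():
    # Communication-based: inspect the interaction with a mocked collaborator.
    gateway = Mock()
    Greeter(gateway).greet("user@example.com")
    gateway.send.assert_called_once_with("user@example.com", "Welcome!")
```

Note how the last test knows the name and the arguments of the collaborator's method, which is exactly the kind of coupling the first two tests avoid.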
§Refactoring towards testability
Test-Driven Development (TDD) is a software development practice that encourages programmers to write tests before writing the production code that satisfies them, following a red-green-refactor process. Because the refactoring step comes last, the tests directly influence the architecture of the program.
The most important logic of your program, usually the business logic, should be testable in output-based style. You can achieve that by writing the business logic in a functional style with:
- Immutable arguments.
- Returned error values instead of exceptions.
- No reference to mutable state.
- Equality based on value, not on reference.
Eliminating all the hidden state from the most important parts of your program leaves you with two layers:
- The functional core, which produces business decisions through pure functions.
- The mutable shell, which acts upon these decisions and encapsulates all the side effects.
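The split between the two layers can be sketched as follows. The subscription reminder scenario, and every name in it (`Subscription`, `SendReminder`, `decide_reminders`, `ReminderService`), is an assumption chosen for this example rather than something taken from the book.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Subscription:
    """Immutable input with value-based equality."""
    user_id: int
    expires_on: date


@dataclass(frozen=True)
class SendReminder:
    """A business decision expressed as an immutable value."""
    user_id: int


def decide_reminders(subscriptions, today: date, notice_days: int = 7):
    """Functional core: a pure function that only produces decisions."""
    return [
        SendReminder(sub.user_id)
        for sub in subscriptions
        if 0 <= (sub.expires_on - today).days <= notice_days
    ]


class ReminderService:
    """Mutable shell: gathers inputs, calls the core, performs side effects."""

    def __init__(self, load_subscriptions, send_email) -> None:
        self._load = load_subscriptions  # e.g. a database query
        self._send = send_email          # e.g. an SMTP client call

    def run(self, today: date) -> None:
        for decision in decide_reminders(tuple(self._load()), today):
            self._send(decision.user_id)
```

With this layout, `decide_reminders` can be covered by output-based tests without any test double, while `ReminderService` stays thin enough to be exercised by a handful of integration tests.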
The author cites the Fundamental Theorem of Software Engineering, attributed to David Wheeler:
Any problem in computer science can be solved with another level of indirection.
Humorously expanded with:
...except for the problem of too many levels of indirection.
Too many layers of indirection are detrimental because they force you to write tests at multiple levels:
- Either you end up testing the behavior of the upper layers in lower layers, which is a waste of time and effort.
- Or you test each layer independently with low-value tests, decreasing the overall protection against regression and resistance to refactoring.
For all practical purposes, three layers of indirection are enough. This is captured by many software design patterns like Model-View-Controller (MVC) or Model-View-Presenter (MVP). Expressed in terms of Domain-Driven Design, a three-layer architecture comprises:
- The domain layer, which encapsulates the most important business logic, free from external dependencies.
- The application service layer, which provides an entrypoint for the application and wires up the domain to the external dependencies.
- The infrastructure layer, which encapsulates the communication with external systems and contains modules like database repositories and API clients.
The prime candidate for refactoring is overcomplicated code, so let's see how to identify it. Beyond the basic red flags like the number of lines of code or nesting levels, the most recognizable attribute is being hard to test, which should be the number 1 red flag. This is often the consequence of the following issues:
- High cyclomatic complexity, which measures the number of execution paths in a program. It depends on the number of decision (or branching) points. Complicated algorithms mixed with infrastructure code tend to rank high in this metric.
- Dependency cycles with no clear entrypoint to start exploring the program. Not only does that make it hard to reason about the execution flow, but you cannot test a SUT in isolation without involving the entire dependency cycle.
- Implicit dependencies instead of dependency injection, because you have no control over how the internal dependencies are instantiated.
The key transformation consists in segregating important algorithms and business logic from orchestration code that deals with hard-to-test dependencies. The book refers to this transformation as the Humble Object pattern (where the humble object belongs to the application service layer).
After this transformation, you can categorize the code along two axes:
- Complexity or domain significance (the logic that is the core of your business).
- Number of collaborators (where a collaborator is either a shared dependency like a database, or a mutable dependency like a singleton).
Given these two axes, you can identify four types of code:
- Domain model / algorithms, characterized by high complexity or domain significance, and a low number of collaborators. This is the code to unit test, which means it must only depend on in-process dependencies.
- Controllers, which have a low complexity or domain significance, but a high number of collaborators. To test this kind of code, you have to rely on integration tests, covered in the next section.
- Overcomplicated code with both high complexity and high number of collaborators. You should split this code so it fits into the first two types.
- Trivial code, which has low complexity and low number of collaborators. You don't really have to test it because the chance of finding regression is rather low.
The main idea is that the more complex the code is, the fewer collaborators it should have. The consequence of pushing external dependencies to the edge, keeping all infrastructure operations outside the domain, is that some operations become more complex or less performant.
During the evaluation of a complicated business rule, the domain may need to conditionally load data from the database. An example of such an operation is checking uniqueness: you cannot prefetch all the identifiers beforehand, and the domain cannot check whether an identifier already exists by itself, so it has to go through the application service layer.
Refactoring your code towards a more testable architecture is a balancing act between domain model testability, controller simplicity, and performance. Here are the most common solutions for evaluating complex business rules:
- Introduce collaborators into the domain, at the cost of making the domain less testable and pushing it towards overcomplicated code.
- Let some domain logic leak into the application services, making them less testable and pushing them towards overcomplicated code.
- Split the business decisions into more granular steps, at the cost of making both the domain and the application services more complicated.
The author favors the last option, because it is the only one that doesn't truly affect testability. You just have more tests to write, but they are often simpler because they only target pieces of the original business logic.
The book outlines some patterns to make the logic more granular. For instance, the book's CanExecute/Execute pattern (a Can/Do pattern) splits a business decision into two steps: a verification step that returns quickly, and an evaluation step that makes the decision. A controller is expected to call the verification step before fetching the data required for the evaluation, and you can enforce this rule in the domain by asserting it as a precondition of the evaluation step.
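The two-step split described above can be sketched as follows. The `User` entity, its `can_change_email` and `change_email` methods, the `PreconditionError` exception, and the `load_company_domain` callable are all hypothetical names chosen for this illustration.

```python
from typing import Callable, Optional


class PreconditionError(Exception):
    """Raised when the evaluation step runs without a passing verification."""


class User:
    def __init__(self, email: str, email_confirmed: bool) -> None:
        self.email = email
        self.email_confirmed = email_confirmed
        self.is_employee = False

    def can_change_email(self) -> Optional[str]:
        """Verification step: returns an error message, or None if allowed."""
        if self.email_confirmed:
            return "a confirmed email cannot be changed"
        return None

    def change_email(self, new_email: str, company_domain: str) -> None:
        """Evaluation step: the verification is asserted as a precondition."""
        if self.can_change_email() is not None:
            raise PreconditionError("call can_change_email() first")
        self.email = new_email
        self.is_employee = new_email.endswith("@" + company_domain)


def change_email_controller(user: User, new_email: str,
                            load_company_domain: Callable[[], str]) -> str:
    """Controller: verifies first, then fetches the data the decision needs."""
    error = user.can_change_email()
    if error is not None:
        return error  # no database access on the failure path
    user.change_email(new_email, load_company_domain())
    return "ok"
```

The decision stays entirely in the domain, the controller only orchestrates, and the expensive lookup is skipped whenever the verification fails.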
Sometimes the steps that lead to the final decision are as interesting as the decision itself. When the domain wants to notify an external system of these actions, or when you need even more granularity, you can use domain events:
- The domain registers events while it processes business rules. These events should have business significance.
- In addition to the final decision, the controller has access to the domain events, and it can forward them to an event dispatcher.
- Finally, the event dispatcher forwards them to event handlers, which can be in-process or out-of-process.
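The three steps above can be sketched as follows. The `EmailChanged` event, the `User` entity, and the `EventDispatcher` class are hypothetical and deliberately minimal; a real dispatcher would likely deal with failures and out-of-process delivery.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class EmailChanged:
    """Domain event with business significance."""
    user_id: int
    new_email: str


@dataclass
class User:
    user_id: int
    email: str
    events: list = field(default_factory=list)

    def change_email(self, new_email: str) -> None:
        if new_email == self.email:
            return  # nothing happened, so no event is registered
        self.email = new_email
        self.events.append(EmailChanged(self.user_id, new_email))


class EventDispatcher:
    """Forwards each event to the handlers subscribed to its type."""

    def __init__(self) -> None:
        self._handlers = {}

    def subscribe(self, event_type, handler) -> None:
        self._handlers.setdefault(event_type, []).append(handler)

    def dispatch(self, events) -> None:
        for event in events:
            for handler in self._handlers.get(type(event), []):
                handler(event)


def change_email_controller(user: User, new_email: str,
                            dispatcher: EventDispatcher) -> None:
    """Controller: runs the domain logic, then forwards the recorded events."""
    user.change_email(new_email)
    dispatcher.dispatch(user.events)
    user.events.clear()
```

A unit test of the domain can then assert on `user.events` directly, without involving any dispatcher or handler.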
For more details, the book provides a concrete refactoring example that spans multiple chapters, splitting the standard mutable OOP class according to a functional architecture, and then transforming it into a simple domain-driven architecture. Event sourcing is a way to push these ideas even further, by using domain events as the source of truth.
Testing multiple units in isolation doesn't tell how they work together. Bringing these units together is the purpose of the application service layer, whose work is verified through integration testing.
§The testing pyramid
Integration tests exercise more code than unit tests, so they rank better on the metric of protection against regression. Additionally, testing the application at a higher level generally improves the resistance to refactoring.
Unfortunately, integration tests have to set up multiple systems, so they are less maintainable and slower than unit tests. This higher cost implies that you should limit the number of integration tests:
- A few integration tests for the happy path, until they cover all the collaborators involved. A happy path is a successful execution of a business scenario.
- A few more tests for important edge cases, if they cannot be unit-tested (it is not always possible nor practical to encapsulate all the business logic inside the domain). An edge case is an execution of a business scenario that results in an error (but you don't have to test all errors if you follow the fail-fast principle).
You should aim at two integration tests per business scenario. Following this guideline should make your test suite resemble the test pyramid, with a higher proportion of fast feedback and highly maintainable unit tests, and a lower proportion of slower, less maintainable integration tests.
As we've seen previously, extracting the business logic into a domain layer without dependencies is the key to be able to write valuable unit tests. The separation between the domain layer and the application service layer serves as a boundary between the realms of unit testing and integration testing.
It is possible to write integration tests that couple to the implementation in a way that reduces their resistance to refactoring. Like unit tests, they couple to the implementation when they check properties that do not have business significance.
With integration testing, you can test:
- The output of the SUT.
- The state of the dependencies.
- The communication with the dependencies.
This last option is prone to excessive coupling. Valuable integration tests only examine the interactions between the SUT and the dependencies if they are part of the observable behavior, which means:
- The action must have an immediate connection to the client's goals. Which method a controller calls on the domain is an implementation detail. Whether a controller sends an email is an observable behavior.
- The action also has to incur an externally visible side effect in an out-of-process dependency. What a controller saves in a private database is an implementation detail. What a controller writes to a message queue consumed by other applications is an observable behavior.
But what exactly constitutes an observable behavior varies at each level of abstraction:
- If you want to test your application's web API, you have to consider yourself the client of this API. The observable behavior comprises the entire public API (assuming it doesn't leak internal implementation details), and any external system the client may have access to (like an SMTP server if you send emails).
- If you want to test an application service controller, you have to consider yourself as a client of this controller. The observable behavior comprises the controller's public API (again, it shouldn't leak internal implementation details), and any out-of-process dependencies it interacts with that you could have access to (like a message queue).
At both levels, how the controller calls the domain is an implementation detail:
- The domain encapsulates the business logic, so refactoring it shouldn't change the controller's observable behavior, and you shouldn't have to rewrite the integration tests.
- The domain is not an out-of-process dependency observable by external applications, so there is no need to inspect the interactions.
Note that you shouldn't try to test domain properties at this level, because they should already be covered by unit tests. Similarly, you shouldn't test how external libraries or dependencies work internally through a controller. But you should test how the properties of all the dependencies interact.
Mocks are designed to replace an object with a fake implementation that provides the means of instrumentation. You can inspect the calls another object makes to it, or make it return arbitrary values. Most languages have frameworks that can automatically create mocks. They are very powerful, but you should only have one goal in mind: keeping the tests as close as possible to production, which means limiting the use of mocks.
To achieve resistance to refactoring, your integration tests must only target observable behavior, and such behavior can only exist in out-of-process dependencies, which are of two types:
Managed dependencies are only accessible through your application. You should run integration tests with the true managed dependencies, without replacing them with mocks. The only thing you should check is their final state.
A database is considered managed if your application has exclusive access to it (or at least exclusive access to a set of tables). How your application interacts with it is an implementation detail.
Unmanaged dependencies are not exclusively accessible through your application. You should replace unmanaged dependencies with mocks during your integration tests. These tests ensure that you don't introduce regressions in your application's externally observable behavior.
A message bus, or a third-party API, are examples of unmanaged dependencies. When your application interacts with them, it produces externally observable side effects. And these side effects obey a contract, both with the dependency itself, like a protocol, but also with other clients of this dependency.
Therefore, the main purpose of mocks is to help maintain backward compatibility for externally observable behaviors:
- They should only be used for unmanaged dependencies and not for internal or managed dependencies.
- They should only be used in integration tests, because they target unmanaged dependencies that aren't present in unit tests.
- They should only target types that you own, because you cannot ensure the stability of a third-party interface like an SDK client, and as a consequence, the stability of your tests. Putting them behind a minimal interface provides an anti-corruption layer that you can then mock.
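As a sketch of the last rule, the example below hides a third-party client behind a small interface that the application owns, and mocks only that interface. `MessageBus`, `VendorBusAdapter`, `RegistrationController`, and in particular the SDK's `send_message` method are hypothetical; they do not refer to any real library.

```python
from unittest.mock import Mock


class MessageBus:
    """Minimal interface that we own; the only type the tests mock."""

    def publish(self, topic: str, payload: str) -> None:
        raise NotImplementedError


class VendorBusAdapter(MessageBus):
    """Anti-corruption layer around a hypothetical third-party SDK client."""

    def __init__(self, sdk_client) -> None:
        self._client = sdk_client

    def publish(self, topic: str, payload: str) -> None:
        # `send_message` is an assumed method name on the imaginary SDK.
        self._client.send_message(queue=topic, body=payload.encode())


class RegistrationController:
    def __init__(self, bus: MessageBus) -> None:
        self._bus = bus

    def register(self, email: str) -> None:
        self._bus.publish("user.registered", email)


def test_registration_is_announced_on_the_bus():
    bus = Mock(spec=MessageBus)  # mock the type we own, not the SDK
    RegistrationController(bus).register("user@example.com")
    bus.publish.assert_called_once_with("user.registered", "user@example.com")
```

If the vendor changes its client, only the adapter has to follow; the test keeps checking the contract the application exposes to the outside world.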
The author gives further advice on the use of mocks:
Test at the edge to exercise all the intermediate layers. That increases the protection against regression, because you test all the layers in between, but also the resistance to refactoring, because how the intermediate layers work doesn't matter. What matters is the contract with the outside world and this contract materializes itself at the edge.
Use dependency injection to replace infrastructure code by mocks in the intermediate layers. You should avoid interfaces when there is only a single implementation of a managed dependency.
Prefer spies over mocks when testing at the edge. Compared to mocks, spies are handwritten test doubles that record an interaction without being told beforehand what input it must receive (sometimes you don't know what piece of information may be generated).
In the assert section, you can check the details of the interaction. Because spies are handwritten, you can add reusable assertion methods following a fluent style to make the tests more readable.
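A handwritten spy with fluent assertions could look like the sketch below. The `BusSpy` class, its `should_have_sent` and `with_message` helpers, and the `announce_registration` function are names invented for this example.

```python
class BusSpy:
    """Handwritten test double that records messages instead of sending them."""

    def __init__(self) -> None:
        self.sent = []

    def publish(self, topic: str, payload: str) -> None:
        self.sent.append((topic, payload))

    # Fluent assertion helpers: each one returns self so calls can be chained.
    def should_have_sent(self, count: int) -> "BusSpy":
        assert len(self.sent) == count, (
            f"expected {count} messages, got {len(self.sent)}")
        return self

    def with_message(self, topic: str, payload: str) -> "BusSpy":
        assert (topic, payload) in self.sent, (
            f"missing message {(topic, payload)!r}")
        return self


def announce_registration(bus, email: str) -> None:
    """Hypothetical code under test that talks to the message bus."""
    bus.publish("user.registered", email)


def test_registration_is_announced():
    spy = BusSpy()
    announce_registration(spy, "user@example.com")
    spy.should_have_sent(1).with_message("user.registered", "user@example.com")
```

The spy records first and asserts later, so the expectations live in the assert section instead of being configured up front in the arrange section.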
As a summary, valuable integration tests should verify:
- The output of the SUT.
- The state of the dependencies as required by the business goals.
- The communication with the unmanaged dependencies through the use of test doubles.
This section goes over a few more subjects: mocking the time, testing with databases, and what you shouldn't test.
§Mocking the time
When you issue a token for a limited duration based on the current date, it is difficult to test that past some future date the token will actually expire. Here, the time is an implicit dependency, which you do not have control over.
There are three options to make the time explicit:
- Relying on an ambient context with a global function that you change and that returns the current time. The advantage is that it limits code pollution, because it only requires a function call, but it sacrifices isolation, because all your code depends on the same view of the current time.
- Instead, injecting the time as a service allows you to easily mock it only for the SUT. The main drawback is that it adds an additional service that is only necessary for testing, thereby increasing code pollution.
- Finally, if you pass the time as an argument you can define for each function call what the current time should be. It gives you the highest granularity, but it comes at the price of increased code pollution.
The author advises injecting the time as a service in controllers, because it is just another dependency that you can handle in integration tests. However, you should pass the time as an argument when you call a method from the domain, because it keeps this layer free from globally mutable state and still unit-testable.
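This combination can be sketched with the token example from the beginning of this section. `Token`, `SystemClock`, `FixedClock`, and `TokenController` are hypothetical names: the domain method receives the time as an argument, while the controller receives a clock service.

```python
from datetime import datetime, timedelta, timezone


class Token:
    """Domain object: the current time is always passed as an argument."""

    def __init__(self, issued_at: datetime, lifetime: timedelta) -> None:
        self.expires_at = issued_at + lifetime

    def is_expired(self, now: datetime) -> bool:
        return now >= self.expires_at


class SystemClock:
    """Clock service used in production."""

    def now(self) -> datetime:
        return datetime.now(timezone.utc)


class FixedClock:
    """Test double whose current time is set by the test."""

    def __init__(self, instant: datetime) -> None:
        self.instant = instant

    def now(self) -> datetime:
        return self.instant


class TokenController:
    """Controller: the time is injected as a service."""

    def __init__(self, clock) -> None:
        self._clock = clock

    def issue(self, lifetime: timedelta) -> Token:
        return Token(self._clock.now(), lifetime)

    def is_valid(self, token: Token) -> bool:
        return not token.is_expired(self._clock.now())
```

A unit test of `Token` simply passes two chosen instants, and an integration test of `TokenController` moves the `FixedClock` forward to observe the expiration.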
The book dedicates an entire chapter to testing with a database. The first part is dedicated to maintaining the database state, preferably with a migration-based approach. The author gives some advice on handling transactions, especially their interaction with repositories, the unit of work pattern, ORMs, and how to use them in tests.
When the database is a managed dependency, you should run your tests against a real database. The idea is to keep your test suite as close as possible to the production code. Tools like Docker make it very easy to spin up a local database to run your tests.
The issue with a single database is that it is shared between tests, but the book outlines a few solutions to keep your tests isolated:
- Recreate the database for each test. The issue is that it dramatically increases the time it takes to run the test suite, and it makes the test runner more complicated.
- Run the tests inside a transaction that isn't committed. Unfortunately, not all databases support rollbacks and nested transactions, and it doesn't match how the production code works.
- Run the tests against an in-memory database. This also takes your tests away from the code that runs in production.
- Clean up between tests, the solution encouraged by the author.
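The cleanup approach can be sketched with Python's standard `unittest` and `sqlite3` modules. SQLite stored in a temporary file is used here only to keep the example self-contained; in a real project the tests would run against the same kind of database as production. The `users` table and the helper names are assumptions, and running the cleanup before each test is a choice made for this sketch.

```python
import os
import sqlite3
import tempfile
import unittest

# Hypothetical location of the shared test database.
DB_PATH = os.path.join(tempfile.mkdtemp(), "tests.sqlite3")


def connect() -> sqlite3.Connection:
    conn = sqlite3.connect(DB_PATH)
    conn.execute("CREATE TABLE IF NOT EXISTS users (email TEXT PRIMARY KEY)")
    return conn


def clean(conn: sqlite3.Connection) -> None:
    """Remove the data left behind by previous tests."""
    conn.execute("DELETE FROM users")
    conn.commit()


class UserStorageTests(unittest.TestCase):
    def setUp(self) -> None:
        self.conn = connect()
        clean(self.conn)  # every test starts from a known empty state

    def tearDown(self) -> None:
        self.conn.close()

    def count_users(self) -> int:
        return self.conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]

    def test_registered_user_is_persisted(self) -> None:
        self.conn.execute("INSERT INTO users VALUES (?)", ("a@example.com",))
        self.conn.commit()
        self.assertEqual(self.count_users(), 1)

    def test_storage_starts_empty(self) -> None:
        self.assertEqual(self.count_users(), 0)
```

Both tests share the same database file, yet neither depends on the order in which they run, because each one begins by wiping the leftover rows.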
Sometimes you can work around these cleanup operations. Some DBMS support the creation of collections on the fly, so you can associate each test with a single collection. This is also handy if you want to inspect the content after a test failure.
If the number of collections is limited, or if creating them on the fly is not practical, you can just prepare a few at the start so you can run your tests in parallel, and configure the appropriate locking and cleaning steps before reusing them.
§What you shouldn't test
The book is focused on writing valuable tests, but too many tests can be detrimental to the long-term sustainability of a project. So let's approach the problem of testing from the opposite side: what can you leave untested without sacrificing the core testing goals?
If the author has one guiding principle, it is that you shouldn't test implementation details, because they reduce the essential resistance to refactoring that makes a project sustainable. That means you shouldn't test the following items:
Private methods as part of white-box testing. You can make an exception for methods that implement particularly difficult algorithms (many edge cases, security implications). In all other cases, if the method in question is not part of the public API, then it must stay an implementation detail.
Dependency internals, including the domain. You shouldn't test edge cases from third-party libraries, unless they constitute a business goal. Similarly, you shouldn't test what the domain does from a controller, because the controllers' purpose is to integrate the domain, not to implement the business logic.
Testing takes time, and that time is better invested in valuable tests that have a high chance of uncovering regressions. Hence, you shouldn't test trivial code. But what qualifies as trivial code?
Technical preconditions that do not have business significance. With a dynamically typed language, asserting that an argument is of a particular type is usually a technical detail. You could change how the argument is represented without changing the underlying business goal. The assertion of this precondition alone is strong enough to prevent any misuse.
Error paths that fail fast. Controllers have to deal with errors that come from dependencies. Most of these errors are immediately returned to the caller without further processing. Excluding important edge cases or business significant errors, code that fails fast is not a valuable test target (plus it may be difficult to exercise the error code paths without extensive mocking).
§Low value targets
Finally, some pieces of code that are neither trivial nor business insignificant are best tested in integration rather than in isolation. The author doesn't give a lot of details about the following examples, so I supplemented them with my own interpretation.
Repositories give access to a datastore. According to the author, you shouldn't test them because they do not add enough value in light of their maintenance cost. Given the importance of not corrupting the application state, this claim is surprising, so let's try to understand why.
If you follow the author's advice, your integration tests should be using the concrete repositories, so there is an overlap between repository tests and controller tests. The mapping between these layers may be complicated, but if that's the case, you should extract it out of the repository and unit test it separately. Note that if you use an ORM, you do not have this intermediate layer.
Testing a repository involves calling a method on it, and querying the database to inspect the final state. If you add a field to an entity, the updated integration tests should indicate that it wasn't persisted. Despite this incorrect implementation, the tests on the repository will still pass. That justifies why they do not add significant value.
But where does the maintenance burden come from? After you update the repository to persist the new field, you have to update the tests to include it as well. So adding a single field cascades into the modification of the domain, the controller, the repository, and the associated tests. This is the maintenance burden you should avoid.
The author also recommends against testing reads. There aren't many details in the book, so I'm not sure if that applies to views or all of the read side.
In applications that segregate reads from writes, the domain model belongs to the write side. The purpose of the domain model is to encapsulate the domain logic, and this domain logic is only useful for operations that have side effects so you can unit test it alone.
So there is no need for a domain model with reads. As a consequence, controllers on the read side have fewer levels of abstraction, so testing them is less valuable from a protection against regression standpoint. There are exceptions with read operations that aggregate data in a complicated way.
The author further argues that reads are not as sensitive as writes because they cannot corrupt the system. But with microservices, one service's reads may very well feed another service's writes, so they can corrupt other systems, or at least affect the global consistency.
So I wouldn't follow this advice. I think you should test reads at the controller level, focusing on complex aggregations first. Just like repositories, testing views directly will not provide significant benefits.