Nowadays, when iterative and agile software product development has become an industry standard, companies strive to automate the way they deliver new code to users. To keep our code in a release-ready state, Merit uses DevOps, a set of practices combining Development, IT Operations, and Quality Assurance, and relies heavily on continuous integration and continuous delivery. (You can read How Merit Embraced a DevOps Culture in a previous blog post.)
Merit has the distinct mission of creating a verified identity ecosystem to allow entire industries to operate programmatically and reliably with trust, insight, and efficiency. Because of this, it’s crucial for us to safely deliver well-tested, robust code multiple times a week.
Our CI pipeline includes different types of automated testing aligned with the concept of the Test Pyramid. In this blog post, I will discuss end-to-end (E2E) tests, the most fragile, slow, and expensive test type. When we run a test against the same code several times in a row, we expect to see the same result, either "pass" or "fail", every time. A test that produces different results against the same code is flaky. Flaky tests immediately lose their value and become a headache for everyone involved.
Decreasing potential sources of flaky E2E tests at Merit:
Many E2E suites are built on WebDriver; we use Cypress instead. A WebDriver-based test framework sends commands to the browser driver over HTTP, and the browser driver then sends commands to the browser itself. With Cypress, the test runs inside the browser, so commands are executed right away without any "middleware". That in-browser execution not only makes tests less flaky, it also speeds up execution.
We run our tests inside Docker containers on Kubernetes. That gives us a simple way to set up test machines with almost endless potential for scaling our lab for parallel test execution, and it eliminates many sources of flakiness, from an unstandardized test environment to a power outage.
Every test interacts with a number of HTML elements, and there are many ways to select them (id, class name, attribute, tag name, element text, CSS selector, and XPath). To avoid wasting time fixing broken selectors, we only use the `data-testid` attribute to select elements: every single one of the 650 selectors in our test framework includes that attribute. When using component libraries, we don't always have the option to add an attribute to the element itself; in those cases, we make sure to include `data-testid` on a parent of that element (e.g. `[data-testid="confirmation-modal"] > .ant-modal-close-icon`).
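As an illustration, a thin custom command like the hypothetical `getByTestId` below keeps the convention in one place. The command name and selectors are invented for this sketch, not Merit's actual framework code:

```ts
// cypress/support/commands.ts
// A minimal sketch (not our production code) of a custom command that keeps
// the data-testid convention in a single place.
declare global {
  namespace Cypress {
    interface Chainable {
      getByTestId(testId: string): Chainable<JQuery<HTMLElement>>;
    }
  }
}

Cypress.Commands.add('getByTestId', (testId: string) =>
  cy.get(`[data-testid="${testId}"]`)
);

// Usage in a spec: a component-library icon selected through its data-testid parent.
//   cy.getByTestId('confirmation-modal').find('.ant-modal-close-icon').click();

export {};
```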
The app needs to be in a certain state before test execution begins, and if that setup involves the UI, it may introduce an unnecessary cause of flakiness. We have a dedicated helper in our test framework responsible for putting the app into the desired state, and it does not involve any UI interactions, only API calls. We group tests with the same starting state into `describe` or `context` blocks so all the setup code goes into the `beforeEach` hook, which helps us reduce code duplication and increase readability.
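A simplified sketch of that structure, with hypothetical endpoints and payloads standing in for our real state-setup helper:

```ts
describe('Organization dashboard', () => {
  beforeEach(() => {
    // Build the starting state through the API only, with no UI interactions.
    // The endpoints and payloads are illustrative, not Merit's actual API.
    cy.request('POST', '/api/test-support/reset');
    cy.request('POST', '/api/test-support/seed', { organizations: 1, members: 5 });

    // Authenticate via the API as well, then open the page under test
    // with the session injected before the app boots.
    cy.request('POST', '/api/login', {
      email: 'admin@example.com',
      password: 'example-password',
    })
      .its('body.token')
      .then((token: string) => {
        cy.visit('/dashboard', {
          onBeforeLoad(win) {
            win.localStorage.setItem('authToken', token);
          },
        });
      });
  });

  it('shows the seeded members', () => {
    cy.get('[data-testid="member-row"]').should('have.length', 5);
  });
});
```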
Test results should not depend on the order in which the tests are run; otherwise, the order itself introduces additional flakiness. At the beginning of every test, we build the desired app state from scratch, so our tests are completely isolated from each other and the result of a previous test has no effect on a subsequent one. That gives us the ability to keep multiple suites (smoke test, build acceptance test, regression, etc.) in one test framework, because we can run any number of tests in any order.
It is a common mistake to wait for an arbitrary time period to enforce a certain order of test execution. A "wait" command pauses the test for a fixed amount of time while a previously fired asynchronous task completes. Tests that wait for an asynchronous task to finish without being explicitly notified tend to be non-deterministic, so we do not use such fixed waits in our framework. In Cypress, every command is promise-like (not an actual promise, but similar). Every test case forms a chain in which the next command starts only after the current one finishes, which eliminates the need for arbitrary waits.
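For example, instead of pausing for a fixed period, we wait on the network request itself and let the assertion retry until it passes or times out. The route and selectors below are illustrative:

```ts
it('shows a success toast after saving', () => {
  // Fragile alternative: cy.wait(5000) sleeps for a fixed period and hopes
  // the save request has finished by then.

  // Deterministic: the test is notified when the request completes, and the
  // assertion retries until the toast is visible or the command times out.
  cy.intercept('POST', '/api/records').as('saveRecord');
  cy.get('[data-testid="save-button"]').click();
  cy.wait('@saveRecord');
  cy.get('[data-testid="toast-success"]').should('be.visible');
});
```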
One of the recommended patterns for writing unit tests says that a test case should fail for only one reason, and that reason should be unambiguously stated in the test title. Somehow that pattern migrated to many end-to-end and integration frameworks, and although a single failure reason does not necessarily mean one assertion per test case, it is often interpreted that way. We have multiple assertions in our test cases; moreover, every single step in a test case comes with an assertion. Take, for example, a test titled "No errors on a successful login with an existing email", sketched below.
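A condensed sketch of what such a test case can look like, with every step carrying its own assertions (the selectors, routes, and credentials are illustrative, not our actual spec):

```ts
it('No errors on a successful login with an existing email', () => {
  cy.visit('/login');

  // Step 1: the form renders and the submit button is disabled until the form is valid.
  cy.get('[data-testid="login-form"]').should('be.visible');
  cy.get('[data-testid="login-submit"]').should('be.disabled');

  // Step 2: filling the form enables submission.
  cy.get('[data-testid="login-email"]').type('existing.user@example.com');
  cy.get('[data-testid="login-password"]').type('correct-password');
  cy.get('[data-testid="login-submit"]').should('be.enabled').click();

  // Step 3: the user lands on the dashboard and no error banner is shown.
  cy.url().should('include', '/dashboard');
  cy.get('[data-testid="error-banner"]').should('not.exist');
  cy.get('[data-testid="user-menu"]').should('contain', 'existing.user@example.com');
});
```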
The generation of random test data can cause non-deterministic failures when not all possible values, especially edge cases, are accounted for (e.g. generating 0 when only positive numbers are expected, or generating a phone number that does not conform to the North American Numbering Plan). To avoid such flakiness, we use the Chance data generator, which can produce almost anything from a single letter to geographic coordinates, and we define minimum and maximum lengths for every random value to ensure it stays within the expected bounds.
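A sketch of how such bounds can be applied; the exact limits and the shape of the generated object are illustrative, not our actual test-data helper:

```ts
import Chance from 'chance';

const chance = new Chance();

// Every generated value stays within explicit bounds so edge cases
// (empty strings, zero, malformed phone numbers) cannot slip in.
export const randomMember = () => ({
  firstName: chance.first(),                                    // plausible first name
  email: chance.email({ domain: 'example.com' }),               // predictable domain
  orgName: chance.string({ length: 12, pool: 'abcdefghijklmnopqrstuvwxyz' }), // fixed length, letters only
  memberCount: chance.integer({ min: 1, max: 50 }),             // strictly positive
  phone: chance.phone({ country: 'us', formatted: false }),     // NANP-conforming digits
});
```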
Once a flaky test has been debugged and the cause of flakiness revealed, we create a ticket with all the uncovered details. Documenting helps us recognize existing problems that may cause flakiness, so that when we design, implement, or refactor tests we can avoid them (or at least minimize their impact). When we debug a newly discovered flaky test, we keep the known symptoms of existing problems in mind rather than starting from scratch.
In the book Clean Code, Robert Martin writes, "Duplication is the primary enemy of a well-designed system. It represents additional work, additional risk, and additional unnecessary complexity." That is exactly how duplication affects a test framework; it is hard to name another bad practice that introduces the same level of fragility. To reduce duplication, we use a hybrid pattern that combines the Page Object and App Actions patterns. It also helps us increase readability and maintainability and adds a layer of abstraction between the app and the test code.
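A condensed, single-file sketch of how the two patterns can fit together; the class, selectors, and endpoint are hypothetical, and in a real framework the app action, the page object, and the spec would live in separate modules:

```ts
// App Action: drive application state through the API, not the UI.
const seedOrganization = (memberCount: number) =>
  cy.request('POST', '/api/test-support/seed', { members: memberCount });

// Page Object: the only place that knows this page's selectors.
class MembersPage {
  visit() {
    cy.visit('/members');
    return this;
  }

  inviteMember(email: string) {
    cy.get('[data-testid="invite-button"]').click();
    cy.get('[data-testid="invite-email"]').type(email);
    cy.get('[data-testid="invite-submit"]').click();
    return this;
  }

  assertMemberListed(email: string) {
    cy.get('[data-testid="member-row"]').should('contain', email);
    return this;
  }
}

// The spec reads as a sequence of intentions, with no duplicated selectors.
it('invites a new member', () => {
  seedOrganization(5);
  new MembersPage()
    .visit()
    .inviteMember('new.member@example.com')
    .assertMemberListed('new.member@example.com');
});
```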
Conclusion
Even though we use the practices mentioned above, when a test framework is growing and the app under test is constantly evolving, having flaky tests is almost inevitable, especially for E2E tests that involve UI interactions. At Merit, we have a culture of treating every failing test seriously. Even if we suspect a test to be flaky, we do not assume that there is something wrong with the test, the infrastructure, or the network. First, we suspect a bug in the application code, and only then do we move on to the rest. We tirelessly reproduce and debug every failed scenario to make sure a bug is not pretending to be a flaky test.
If you would like to help us in writing high-quality, fast, and robust automated E2E tests—or maybe even lead us in that endeavor—please check out our open positions at merits.com/careers.