9 min read
Apr 15, 2025

Engineering Reliability at Scale - A deep dive into our custom automated E2E testing service

The Enterprise Browser platform continues to rapidly expand, yet methods for ensuring its stability remain largely undefined. Here's how we solved it using our custom E2E testing infrastructure.

Building a functional and stable automation infrastructure for an enterprise browser presents unique challenges. With no established conventions for testing and constructing CI/CD pipelines in this domain, we needed to create custom solutions that could scale with our growing product. As the Island Enterprise Browser expands, so does the complexity of its features. This, in turn, intensifies our testing, deployment, and stability challenges, placing greater demands on our development processes.

To address these challenges, we built a comprehensive testing infrastructure to validate the product at every level. This includes browser tests for stability verification, frontend UI tests for management console validation, API tests to verify service endpoint functionality, and unit testing frameworks to confirm the reliability of isolated logic. At the top of this hierarchy sits our end-to-end (E2E) testing infrastructure, designed to simulate real-world workflows and validate component interactions.

In this article, we’ll focus on our approach to designing, implementing, and scaling E2E tests. We'll explore the strategies that helped us overcome key challenges and align testing with our CI/CD goals, ensuring support for the rapid evolution of our enterprise browser.

Figure 1 Island browser modules and their corresponding test coverage (See area highlighted in red)

The Scale of Testing an Enterprise Browser

Before we dive into the solution, let's define the scale of complexity we need to address. Each new version released to our customers undergoes rigorous testing at every step of its lifecycle. Every pull request (PR) opened by a developer, as well as each canary, beta, or stable browser version released internally or externally, must pass through our comprehensive suite of end-to-end tests.

To grasp the sheer scale of the infrastructure involved, here's what a single end-to-end step entails:

  • Each new GitHub Pull Request (PR) or deployment triggers our CI/CD pipeline to initiate testing
  • Our browser supports six platforms: Windows, Mac, Linux, Android, iOS, and a standalone extension
  • For each platform, we run between 100 and 500 E2E tests, depending on their specific configurations and filters
  • Each E2E test takes an average of 30 seconds to complete

A typical E2E test simulates a full product flow, from the Management Console to the browser. For example, a test might configure a policy that blocks access to websites of a certain category, apply that policy to end-user devices, validate through the browser that access to a website in that category was blocked, and confirm through the Management Console that an audit log was created for the blocked operation.
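To make this concrete, here's a minimal sketch of what such a test could look like in Playwright-style TypeScript. The ManagementConsole and BrowserSession helpers, their methods, and the selector are hypothetical stand-ins for our internal test framework, not its real API.

```typescript
// Hypothetical sketch of a category-blocking E2E test. The helper classes,
// their methods, and the selector are illustrative, not our actual framework.
import { test, expect } from "@playwright/test";
import { ManagementConsole, BrowserSession } from "./helpers"; // hypothetical helpers

test("blocks a gambling site and records an audit log", async () => {
  const mc = await ManagementConsole.login();

  // 1. Configure a policy that blocks a website category...
  const policy = await mc.createPolicy({ blockCategory: "gambling" });
  // 2. ...and apply it to the end-user devices under test.
  await mc.applyToDevices(policy, { group: "e2e-test-devices" });

  // 3. Validate through the browser that access to the site is blocked.
  const session = await BrowserSession.start();
  const page = await session.open("https://gambling.example.test");
  await expect(page.locator("#block-page")).toBeVisible();

  // 4. Confirm the Management Console recorded an audit log for the block.
  const logs = await mc.auditLogs({ policyId: policy.id });
  expect(logs.some((log) => log.action === "blocked")).toBe(true);
});
```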

On the Windows platform alone, with 500+ end-to-end tests running at 30 seconds each, a single PR in a traditional testing environment can result in over four hours of testing—creating a significant bottleneck for developers waiting to merge their changes.

Managing the infrastructure to support testing at this scale introduces substantial complexity. With a growing R&D team of 150+ developers, we needed to ensure multiple virtual machines (VMs) remained highly available across all platforms for seamless testing. Bugs inevitably arose, often amplifying infrastructure issues, causing delays, and increasing resource demands.

To meet these challenges, we maintain approximately 300 physical and virtual machines, requiring proactive and efficient management to quickly address problems and minimize disruptions.

Given the scale and complexity of this project, let’s discuss some of the major challenges we faced and how we overcame them.

An Old Enemy - Flaky Tests

In 2023, we faced a significant challenge: flaky tests. These tests intermittently failed even when the product functionality worked as expected. Such unreliability caused pull requests to fail globally, obstructing developers and slowing down the workflow.

To address this issue, we initially marked the flaky tests as ignored while investigating the root causes and covering any test gaps with manual testing. This temporary solution allowed developers to continue their work without disruption. However, it introduced its own set of challenges:

  • Marking a test as ignored required committing a change to the codebase
  • Developers needed to rebase their branches to sync with the updated state of ignored tests
  • This process necessitated rerunning the entire CI pipeline for the affected branches, effectively doubling the wait time for developers

This process caused a significant reduction in R&D velocity across the organization. As the company continued to grow, it became evident that this approach was not scalable. Addressing flaky tests at scale required a more robust and sustainable solution to maintain the efficiency and productivity of our teams. Because of this, we decided to create our own service.

Introducing Jarvis: Our Automation Service

After thorough deliberation, we concluded that the best way to improve E2E test management and address the flaky test problem was to introduce a dedicated service to manage our test sessions. The concept was that each agent running our tests would communicate with the server and receive instructions on how to behave during execution. Additionally, the service would use a PostgreSQL database for persistent storage and a Redis database for caching. This service, which we named Jarvis, became the central authority for coordinating and managing test behavior.
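For illustration, the agent side of that exchange might look like the sketch below; the endpoint path and payload shape are hypothetical, not Jarvis's actual API.

```typescript
// Minimal sketch of the agent-to-Jarvis exchange described above; the
// endpoint path and payload shape are hypothetical.
const JARVIS_URL = process.env.JARVIS_URL ?? "https://jarvis.internal";

interface Instruction {
  action: "run_test" | "skip" | "idle"; // how the agent should behave
  testId?: string;                      // set when action is "run_test"
}

// Each agent identifies itself and asks the server what to do next.
async function nextInstruction(agentId: string): Promise<Instruction> {
  const res = await fetch(`${JARVIS_URL}/agents/${agentId}/next`, { method: "POST" });
  return (await res.json()) as Instruction;
}
```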

When building Jarvis, we aimed to create a development and testing experience that would stand the test of time. Knowing it would become a crucial part of our infrastructure and play a key role in maintaining the stability of our CI pipelines, we designed it to reflect the structure of our cloud infrastructure. This made it intuitive for our team and other R&D developers to work with while ensuring it could meet the demands of our growing Enterprise Browser. The result was a robust and dependable solution that supports both current needs and future growth.

Figure 2 Decision flowchart outlining the test distribution process across agents

Test Management

With Jarvis, we gained the ability to monitor all of our tests in real time, storing their results and creating a live snapshot of the test environment. This foundation allowed us to develop creative solutions to address various challenges.

Test Distribution

Jarvis changed the way we distribute tests to agents by introducing new entities: Test Suite and Test Case. A Test Suite represents a complete test execution, while each Test Case links runtime metadata to individual tests. These entities significantly enhanced our continuous integration processes.
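As an illustration, the two entities might be shaped something like this; these TypeScript interfaces and field names are our guesses for the sake of the example, not Jarvis's actual schema.

```typescript
// Hypothetical shape of the two entities; field names are illustrative.
interface TestSuite {
  id: string;
  trigger: "pull_request" | "canary" | "beta" | "stable";
  platform: "windows" | "mac" | "linux" | "android" | "ios" | "extension";
  createdAt: Date;
  status: "running" | "passed" | "failed";
}

interface TestCase {
  id: string;
  suiteId: string;   // links the case to its suite
  testName: string;  // the individual test this case tracks
  agentId?: string;  // which agent ran it, assigned at runtime
  attempt: number;   // supports re-execution, possibly on a different agent
  result?: "passed" | "failed" | "ignored";
  durationMs?: number;
}
```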

Key improvements introduced by Jarvis include:

  • Dynamic assignment of tests to agents in real time based on availability and prior allocations, replacing the naive static splitting approach used previously (sketched below)
  • Minimized impact of agent crashes, as other agents can seamlessly take over pending tests, ensuring uninterrupted execution
  • Smarter decision-making processes that significantly reduced execution times and improved system stability
  • Ability for agents to re-execute failed tests, even if initially run on different agents, which was previously impossible with the old allocation method

These enhancements collectively optimized our testing infrastructure, making it more robust and efficient.
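As a concrete illustration of the pull-based distribution referenced in the first bullet, here's one way the claim step could be implemented on top of Jarvis's PostgreSQL store; the table and column names are hypothetical, and this is a sketch rather than the actual implementation.

```typescript
import { Pool } from "pg"; // PostgreSQL client; Jarvis's backing store per above

const pool = new Pool();

// One possible implementation of dynamic assignment: each agent atomically
// claims the next pending test case, so a crashed agent's unclaimed tests
// are simply picked up by whichever agent asks next.
async function claimNextTest(suiteId: string, agentId: string) {
  const { rows } = await pool.query(
    `UPDATE test_cases
        SET agent_id = $2, status = 'running', started_at = now()
      WHERE id = (
        SELECT id FROM test_cases
         WHERE suite_id = $1 AND status = 'pending'
         ORDER BY created_at
         FOR UPDATE SKIP LOCKED
         LIMIT 1
      )
      RETURNING id, test_name`,
    [suiteId, agentId]
  );
  return rows[0] ?? null; // null means the suite has no pending tests left
}
```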

The following diagram outlines a testing workflow where a pull request in GitHub triggers Jenkins to (a) create a test suite via Jarvis, (b) allocate agents, and (c) execute tests, with results merged into an Allure report and reported back to GitHub. Jarvis then manages test metadata, ensures test uniqueness, and finalizes the suite with statistics.

Figure 3 Testing workflow from a GitHub pull request through Jenkins and Jarvis to the final report

Ignore Rules

To streamline the management of flaky tests, we developed a system called Ignore Rules for marking specific tests as ignored before execution.

Each rule links multiple tests to a description and a corresponding Jira ticket. During test execution, agents check with Jarvis whether each test is flagged as ignored.

Jarvis also provides a user-friendly front-end interface where we can easily view, add, update, and remove Ignore Rules.

Figure 4 Front-end interface of Jarvis

Additionally, while an ignore rule is active, the tests linked to it will not run anywhere in CI. This also means they won't run on the very branches that contain their fixes. To resolve this, we adjusted the rules so a rule isn't applied when the branch references the Jira ticket associated with it.
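Putting the two behaviors together, here's a minimal sketch of the check an agent might perform; the IgnoreRule shape and field names are hypothetical, not Jarvis's actual schema.

```typescript
// Hypothetical shape of an Ignore Rule and the check agents perform.
interface IgnoreRule {
  testNames: string[]; // the flaky tests this rule covers
  description: string;
  jiraTicket: string;  // e.g. a ticket key like "PROJ-1234" (illustrative)
  active: boolean;
}

// Returns true when the test should be skipped in this run. A rule is not
// applied when the branch references its Jira ticket, so the branch that
// fixes the flaky test still runs it.
function shouldIgnore(testName: string, branch: string, rules: IgnoreRule[]): boolean {
  return rules.some(
    (rule) =>
      rule.active &&
      rule.testNames.includes(testName) &&
      !branch.includes(rule.jiraTicket)
  );
}
```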

This system streamlined the handling of flaky tests, enabling rapid responses without requiring developers to rebase their branches or rerun full CI pipelines. It significantly reduced disruptions and improved overall productivity.

Automation User Management

A critical aspect of testing the Enterprise Browser involves managing end users: the accounts used for logging into Island as part of our test scenarios. Initially, we used a fixed set of four shared users, which led to issues such as rate limiting, data corruption, and random deletions.

With Jarvis, we implemented a robust user management system:

  • Each machine running CI is assigned four unique users, corresponding to the number of parallel tests it executes
  • Users are refreshed daily to maintain a clean state
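
A minimal sketch of what this assignment and refresh cycle could look like; every name here (the functions, the e-mail scheme, the provisioning calls) is hypothetical.

```typescript
// Sketch of per-machine user assignment and daily refresh; all names are
// hypothetical stand-ins for our user service.
declare function provisionUsers(emails: string[]): Promise<void>; // hypothetical IdP call
declare function deleteUsers(machineId: string): Promise<void>;   // hypothetical cleanup

const USERS_PER_MACHINE = 4; // matches the number of parallel tests per machine

async function assignUsers(machineId: string): Promise<string[]> {
  const users = Array.from(
    { length: USERS_PER_MACHINE },
    (_, i) => `e2e-${machineId}-${i}@example.test` // hypothetical naming scheme
  );
  await provisionUsers(users);
  return users;
}

// Daily refresh: drop yesterday's users and provision fresh ones so each
// machine starts from a clean state.
async function dailyRefresh(machineIds: string[]): Promise<void> {
  for (const machineId of machineIds) {
    await deleteUsers(machineId);
    await assignUsers(machineId);
  }
}
```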

By integrating our user service through Jarvis' API modules, we eliminated user-related issues, enhancing the stability and reliability of our CI environment. An easy fix for a painful issue.

Monitoring and Insights

As Jarvis became a cornerstone of our CI/CD management, it began to process a substantial amount of traffic, generating a wealth of data. These real-time updates enabled us to develop an advanced monitoring and insights system with live alerts to enhance our operational efficiency.

Test Flakiness Overview Dashboard

One of the key tools we developed was the Test Flakiness Overview Dashboard. This dashboard provided us with a comprehensive view of test flakiness over time, allowing us to identify patterns and address issues proactively.

Figure 5 Flakiness dashboard in Grafana, visualizing data sourced from Jarvis

Jenkins Job & Node Overview

To gain deeper insights into our Jenkins environment, we created the Jenkins Job & Node Overview Dashboard. This tool offers an in-depth perspective on our pipelines, enabling us to monitor and optimize our CI/CD processes effectively.

Figure 6 Jenkins pipeline overview dashboard in Grafana, based on data from Jarvis

In addition, this dashboard gave us insights into Jenkins that were not possible before, the most valuable being the history of each Jenkins node. We can now see the run history of every VM in our infrastructure and easily pinpoint the failures that corrupted it.

Figure 7 Jenkins node overview dashboard in Grafana, based on data from Jarvis

Alerts: Proactive Problem Detection

We leverage Grafana’s alerting system to scan the database for specific behaviors that may indicate potential issues. This proactive approach allows us to monitor problematic machines and tests, ensuring we can address concerns before they escalate.

Figure 8 Slack message alerts generated by Grafana using live data from Jarvis

Transformation of Our Workflow

The integration of Jarvis led to a significant transformation in our workflow. By leveraging automatic alerting and enhanced monitoring, we made detecting and diagnosing problems easier, resulting in faster test results and increased developer satisfaction and productivity. This efficiency allowed our team to focus on further infrastructure improvements and respond more swiftly to issues.

Implementing Jarvis was instrumental in boosting our R&D velocity. It not only streamlined our processes but also empowered our growing team to reach new levels of efficiency and innovation, making our CI/CD management more robust than ever and positioning us for continued growth as our R&D organization expands.


The Island Enterprise Browser fundamentally transforms how the world’s leading organizations work by embedding enterprise-grade IT, security, network controls, data protections, app access, and productivity enhancements directly into the browser itself. It enables secure access to any application, protects sensitive data, streamlines IT operations, and delivers a superior end-user experience while actually boosting productivity.

To learn more about how we're reimagining the enterprise workspace from the browser up, start here. If you’re interested in building something that’s changing everything, check out our open positions here.

Igal Kolihman

Igal is a software engineer focused on developer experience tools, with years of experience in one of the IDF's elite intelligence units, where he specialized in optimizing development workflows and infrastructure. Today, Igal focuses on building scalable automation systems, boosting productivity, and enhancing developer experience at Island.
