Engineering Reliability at Scale - A deep dive into our custom automated E2E testing service
The Enterprise Browser platform continues to rapidly expand, yet methods for ensuring its stability remain largely undefined. Here's how we solved it using our custom E2E testing infrastructure.

Building a functional and stable automation infrastructure for an enterprise browser presents unique challenges. With no established conventions for testing and constructing CI/CD pipelines in this domain, we needed to create custom solutions that could scale with our growing product. As the Island Enterprise Browser expands, so does the complexity of its features. This, in turn, intensifies our testing, deployment, and stability challenges, placing greater demands on our development processes.
To address these challenges, we built a comprehensive testing infrastructure to validate the product at every level. This includes browser tests for stability verification, frontend UI tests for management console validation, API tests to verify service endpoint functionality, and unit testing frameworks to confirm the reliability of isolated logic. At the top of this hierarchy sits our end-to-end (E2E) testing infrastructure, designed to simulate real-world workflows and validate component interactions.
In this article, we’ll focus on our approach to designing, implementing, and scaling E2E tests. We'll explore the strategies that helped us overcome key challenges and align testing with our CI/CD goals, ensuring support for the rapid evolution of our enterprise browser.

The Scale of Testing an Enterprise Browser
Before we dive into the solution, let's define the scale of complexity we need to address. Each new version released to our customers undergoes rigorous testing at every step of its lifecycle. Every pull request (PR) opened by a developer, as well as each canary, beta, or stable browser version released internally or externally, must pass through our comprehensive suite of end-to-end tests.
To grasp the sheer scale of the infrastructure involved, here's what a single end-to-end step entails:
- Each new GitHub Pull Request (PR) or deployment triggers our CI/CD pipeline to initiate testing
- Our browser supports six platforms: Windows, Mac, Linux, Android, iOS, and a standalone extension
- For each platform, we run between 100 and 500 E2E tests, depending on their specific configurations and filters
- Each E2E test takes an average of 30 seconds to complete
A typical E2E test simulates a full product flow, from the Management Console to the browser. For example, a test might configure a policy that blocks access to websites of a certain category, apply that policy to end-user devices, validate through the browser that access to a website in that category was blocked, and confirm through the Management Console that an audit log was created for the blocked operation.
On the Windows platform alone, with 500+ end-to-end tests running at 30 seconds each, a single PR run serially in a traditional testing environment can take over four hours (500 × 30 s ≈ 4 h 10 min), creating a significant bottleneck for developers waiting to merge their changes.
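The four-hour figure follows directly from the numbers quoted above. The short sketch below reproduces the arithmetic and shows how the same workload shrinks when spread over parallel slots. The worker count in the usage note is a hypothetical value chosen for illustration, not a figure from our infrastructure, and the parallel estimate is idealized (no scheduling overhead, no retries).

```python
# Back-of-the-envelope runtime estimate using the figures quoted in the text.
TESTS_PER_RUN = 500     # Windows E2E tests per PR (from the article)
SECONDS_PER_TEST = 30   # average E2E test duration (from the article)


def serial_runtime_hours(tests: int, seconds_per_test: float) -> float:
    """Wall-clock hours if every test runs one after another."""
    return tests * seconds_per_test / 3600


def parallel_runtime_minutes(tests: int, seconds_per_test: float, workers: int) -> float:
    """Idealized wall-clock minutes with tests spread evenly over `workers` slots."""
    batches = -(-tests // workers)  # ceiling division
    return batches * seconds_per_test / 60


if __name__ == "__main__":
    hours = serial_runtime_hours(TESTS_PER_RUN, SECONDS_PER_TEST)
    print(f"Serial run: {hours:.2f} hours")
```

For example, with a hypothetical 20 parallel slots, `parallel_runtime_minutes(500, 30, 20)` gives 12.5 minutes in the ideal case, which is why distribution and agent availability matter so much at this scale.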
Managing the infrastructure to support testing at this scale introduces substantial complexity. With a growing R&D team of 150+ developers, we needed to ensure multiple virtual machines (VMs) remained highly available across all platforms for seamless testing. Bugs inevitably arose, often amplifying infrastructure issues, causing delays, and increasing resource demands.
To meet these challenges, we maintain approximately 300 physical and virtual machines, requiring proactive and efficient management to quickly address problems and minimize disruptions.
Given the scale and complexity of this project, let’s discuss some of the major challenges we faced and how we overcame them.
An Old Enemy - Flaky Tests
In 2023, we faced a significant challenge: flaky tests. These tests intermittently failed even when the product functionality worked as expected. Such unreliability caused pull requests to fail globally, obstructing developers and slowing down the workflow.
To address this issue, we initially marked the flaky tests as ignored while investigating the root causes and covering any test gaps with manual testing. This temporary solution allowed developers to continue their work without disruption. However, it introduced its own set of challenges:
- Marking a test as ignored required committing a change to the codebase
- Developers needed to rebase their branches to sync with the updated state of ignored tests
- This process necessitated rerunning the entire CI pipeline for the affected branches, effectively doubling the wait time for developers
This process caused a significant reduction in R&D velocity across the organization. As the company continued to grow, it became evident that this approach was not scalable. Addressing flaky tests at scale required a more robust and sustainable solution to maintain the efficiency and productivity of our teams. Because of this, we decided to create our own service.
Introducing Jarvis: Our Automation Service
After thorough deliberation, we concluded that the best way to improve E2E test management and address the flaky test problem was to introduce a dedicated service to manage our test sessions. The concept was that each agent running our tests would communicate with the server and receive instructions on how to behave during execution. The service would use a PostgreSQL database for persistent storage and a Redis database for caching. This service, which we named Jarvis, became the central authority for coordinating and managing test behavior.
When building Jarvis, we aimed to create a development and testing experience that would stand the test of time. Knowing it would become a crucial part of our infrastructure and play a key role in maintaining the stability of our CI pipelines, we designed it to reflect the structure of our cloud infrastructure. This made it intuitive for our team and other R&D developers to work with while ensuring it could meet the demands of our growing Enterprise Browser. The result was a robust and dependable solution that supports both current needs and future growth.

Test Management
With Jarvis, we gained the ability to monitor all of our tests in real time, storing their results and creating a live snapshot of the test environment. This foundation allowed us to develop creative solutions to address various challenges.
Test Distribution
Jarvis changed the way we distribute tests to agents by introducing new entities: Test Suite and Test Case. A Test Suite represents a complete test execution, while each Test Case links runtime metadata to individual tests. These entities significantly enhanced our continuous integration processes.
Key improvements introduced by Jarvis include:
- Dynamic assignment of tests to agents in real time based on availability and prior allocations, replacing the inefficient naive ("dummy") splitting approach used previously
- Minimized impact of agent crashes, as other agents can seamlessly take over pending tests, ensuring uninterrupted execution
- Smarter decision-making processes that significantly reduced execution times and improved system stability
- Ability for agents to re-execute failed tests, even if initially run on different agents, which was previously impossible with the old allocation method
These enhancements collectively optimized our testing infrastructure, making it more robust and efficient.
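The distribution behavior listed above can be sketched as a shared work queue that agents pull from. This is a minimal in-memory illustration, not Jarvis's actual implementation: the `SuiteState` class, its method names, and the single-retry policy are all assumptions made for the example, and a real service would keep this state in its database rather than in a Python object.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class SuiteState:
    """Hypothetical in-memory stand-in for one test suite tracked by the service."""
    pending: deque = field(default_factory=deque)  # tests waiting for an agent
    running: dict = field(default_factory=dict)    # test name -> agent id
    results: dict = field(default_factory=dict)    # test name -> "passed" / "failed"

    def next_test(self, agent_id):
        """Hand the next pending test to whichever agent asks first."""
        if not self.pending:
            return None
        test = self.pending.popleft()
        self.running[test] = agent_id
        return test

    def report(self, test, passed):
        """Record a result; a first failure is re-queued so any agent can retry it."""
        self.running.pop(test, None)
        if not passed and test not in self.results:
            self.results[test] = "failed"
            self.pending.append(test)  # retry once (assumed policy)
        else:
            self.results[test] = "passed" if passed else "failed"

    def agent_crashed(self, agent_id):
        """Return a crashed agent's in-flight tests to the queue."""
        for test, owner in list(self.running.items()):
            if owner == agent_id:
                del self.running[test]
                self.pending.append(test)
```

Because agents ask for work instead of receiving a fixed slice up front, a slow or crashed agent only delays the tests it currently holds, and a retried test can land on any agent that is free.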
The following diagram outlines a testing workflow where a pull request in GitHub triggers Jenkins to (a) create a test suite via Jarvis, (b) allocate agents, and (c) execute tests, with results merged into an Allure report and reported back to GitHub. Jarvis then manages test metadata, ensures test uniqueness, and finalizes the suite with statistics.

Ignore Rules
To streamline the management of flaky tests, we developed Ignore Rules, a system for marking specific tests as ignored before execution.
Each rule links one or more tests to a description and a corresponding Jira ticket. During test execution, the agents check with Jarvis whether each test is flagged as ignored.
Jarvis also provides a user-friendly front-end interface where we can easily view, add, update, and remove Ignore Rules.

While an Ignore Rule is active, the tests linked to it do not run anywhere in CI. Left as-is, this would also skip those tests on the branch that contains their fix, leaving the fix unverified. To resolve this, we adjusted the logic so that a rule isn't applied when the branch includes the Jira ticket associated with the rule.
This system streamlined the handling of flaky tests, enabling rapid responses without requiring developers to rebase their branches or rerun full CI pipelines. It significantly reduced disruptions and improved overall productivity.
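The skip decision described in this section can be sketched roughly as follows. The `IgnoreRule` fields, the function names, and the ticket key format are hypothetical, and the sketch assumes that "the branch includes the Jira ticket" means the ticket key appears in the branch name. The real check may look at other metadata.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class IgnoreRule:
    """Hypothetical shape of an Ignore Rule: tests, a description, a Jira ticket."""
    tests: frozenset
    description: str
    jira_ticket: str  # e.g. "AUTO-123" (made-up key for illustration)


def branch_mentions_ticket(branch: str, ticket: str) -> bool:
    """True if the ticket key appears in the branch name as a whole token."""
    # The lookarounds keep "AUTO-123" from matching inside "AUTO-1234".
    pattern = rf"(?<![A-Za-z0-9]){re.escape(ticket)}(?![0-9])"
    return re.search(pattern, branch, flags=re.IGNORECASE) is not None


def should_skip(test: str, branch: str, rules) -> bool:
    """Skip a test if any active rule covers it, unless the branch carries the fix."""
    for rule in rules:
        if test in rule.tests and not branch_mentions_ticket(branch, rule.jira_ticket):
            return True
    return False
```

With a rule covering `test_block_category` under ticket `AUTO-123`, the test is skipped on `feature/new-ui` but still runs on `fix/AUTO-123-flaky-audit-log`, so the fix itself gets validated before the rule is removed.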
Automation User Management
A critical aspect of testing the Enterprise Browser involves managing the end-user accounts used for logging into Island as part of our test scenarios. Initially, we used a fixed set of four shared users, leading to issues such as rate limiting, data corruption, and random deletions.
With Jarvis, we implemented a robust user management system:
- Each machine running CI is assigned four unique users, corresponding to the number of parallel tests it executes
- Users are refreshed daily to maintain a clean state
By integrating our user service through Jarvis' API modules, we eliminated user-related issues, enhancing the stability and reliability of our CI environment. An easy fix for a painful issue.
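The allocation scheme above can be sketched as a small pool keyed by machine and day. The `UserPool` class and the `create_user` callback are hypothetical stand-ins for the real user service, and refreshing lazily on the first request of each day is an assumption made for the example; a scheduled job would work equally well.

```python
from datetime import date

USERS_PER_MACHINE = 4  # matches the number of parallel tests per machine


class UserPool:
    """Hands each CI machine its own set of users and refreshes them once a day."""

    def __init__(self, create_user):
        # create_user(machine_id, slot, day) -> user identifier (hypothetical hook)
        self._create_user = create_user
        self._assigned = {}  # machine_id -> (day, [users])

    def users_for(self, machine_id: str, today: date) -> list:
        day, users = self._assigned.get(machine_id, (None, []))
        if day != today:
            # First request of the day: provision a fresh, machine-specific set.
            users = [
                self._create_user(machine_id, slot, today)
                for slot in range(USERS_PER_MACHINE)
            ]
            self._assigned[machine_id] = (today, users)
        return list(users)
```

Since no two machines ever share a user, parallel tests cannot trip over each other's data, and the daily refresh bounds how long any corrupted state can survive.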
Monitoring and Insights
As Jarvis became a cornerstone of our CI/CD management, it began to process a substantial amount of traffic, generating a wealth of data. These real-time updates enabled us to develop an advanced monitoring and insights system with live alerts to enhance our operational efficiency.
Test Flakiness Overview Dashboard
One of the key tools we developed was the Test Flakiness Overview Dashboard. This dashboard provided us with a comprehensive view of test flakiness over time, allowing us to identify patterns and address issues proactively.
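One way to quantify flakiness from stored results is sketched below. The definition used here is an assumption for illustration and may differ from what the dashboard actually computes: a test counts as flaky on a commit if it both passed and failed on that same commit, and its flakiness rate is the share of commits on which that happened. The record layout `(test_name, commit_sha, passed)` is likewise hypothetical.

```python
from collections import defaultdict


def flakiness_by_test(runs):
    """Return {test_name: fraction of commits with mixed pass/fail outcomes}.

    `runs` is an iterable of (test_name, commit_sha, passed) tuples.
    """
    outcomes = defaultdict(lambda: defaultdict(set))
    for test, commit, passed in runs:
        outcomes[test][commit].add(bool(passed))
    return {
        test: sum(1 for seen in commits.values() if seen == {True, False}) / len(commits)
        for test, commits in outcomes.items()
    }
```

Under this definition, a test that fails consistently on a commit is treated as a real failure rather than a flaky one, which keeps genuine regressions out of the flakiness numbers.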

Jenkins Job & Node Overview
To gain deeper insights into our Jenkins environment, we created the Jenkins Job & Node Overview Dashboard. This tool offers an in-depth perspective on our pipelines, enabling us to monitor and optimize our CI/CD processes effectively.

In addition, it gave us insights into Jenkins that were not available before. The most valuable of these is the history of each Node in Jenkins. We can now see the run history of each VM in our infrastructure and easily find the failure points that corrupted a VM.

Alerts: Proactive Problem Detection
We leverage Grafana’s alerting system to scan the database for specific behaviors that may indicate potential issues. This proactive approach allows us to monitor problematic machines and tests, ensuring we can address concerns before they escalate.
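The kind of condition such an alert might check can be sketched in plain Python. This is not Grafana configuration and not our actual alert rule: the function name, the input layout, and the threshold of three consecutive failures are assumptions chosen to illustrate the idea of flagging a problematic machine from its run history.

```python
def nodes_to_alert(run_history, threshold=3):
    """Return the nodes whose most recent `threshold` runs all failed.

    `run_history` maps a node name to a list of booleans, oldest first,
    where True means the run passed.
    """
    flagged = []
    for node, results in run_history.items():
        recent = results[-threshold:]
        # Require a full window so a brand-new node is not flagged too early.
        if len(recent) == threshold and not any(recent):
            flagged.append(node)
    return sorted(flagged)
```

A run of consecutive failures on one node, while the same tests pass elsewhere, usually points at the machine rather than the product, so a rule like this helps separate infrastructure problems from test problems.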

Transformation of Our Workflow
The integration of Jarvis led to a significant transformation in our workflow. By leveraging automatic alerting and enhanced monitoring, we made detecting and diagnosing problems easier, resulting in faster test results and increased developer satisfaction and productivity. This efficiency allowed our team to focus on further infrastructure improvements and respond more swiftly to issues.
Implementing Jarvis was instrumental in boosting our R&D velocity. Jarvis not only streamlined our processes but also empowered our growing team to achieve unprecedented levels of efficiency and innovation, making our CI/CD management more robust than ever. This transformation supported the dynamic needs of our expanding R&D team, enabling us to reach new heights in development velocity and quality, positioning us for continued success and growth.
The Island Enterprise Browser fundamentally transforms how the world’s leading organizations work by embedding enterprise-grade IT, security, network controls, data protections, app access, and productivity enhancements directly into the browser itself, enabling secure access to any application, protecting sensitive data, streamlining IT operations, and delivering a superior end-user experience while actually boosting productivity.
To learn more about how we're reimagining the enterprise workspace from the browser up, start here. If you’re interested in building something that’s changing everything, check out our open positions here.