Testing… Vaccine Supply… 1, 2, 3

Akshay Mungrey
5 min read · Jun 9, 2021

In late 2020, as the US battled the dreaded virus and waited patiently for the arrival of vaccines, a group of Castlighters was working with Boston Children’s Hospital and the CDC to build and test a software system to help the CDC track COVID-19 vaccine supply at vaccination sites and then use that data to help the general public find vaccines nearby. Recently, my colleague Gavin Thompson wrote a great blog post on the development process. This is the Quality Engineering side of the story: the test strategy, test automation, and test monitoring.

Testing Requirements:

  1. COVID-19 vaccine enrolled provider data: Data files are received daily from multiple sources, including a cloud-hosted national data repository and 16 major national pharmacy retail chains. The data pipeline and the ETL’d data had to be tested daily, so automated data tests were necessary at every stage of the pipeline. We also needed to automatically identify bad records and communicate them to upstream sources for correction in their next update.
  2. Inventory site built by Castlight: Code was deployed daily during the initial two weeks before shifting to weekly deploys, so documenting the initial manual tests and converting them to automated tests was key to a successful CI/CD process.
  3. Reporting: Tableau was used as a reporting tool both internally and externally, and the underlying data had to be tested for accuracy and completeness. We supported Castlight customer support with a data report that our Tier 2 team used to triage and debug issues, and we created an automated report identifying any bad data received from the national data repository; these records were triaged daily and fixed in the next file delivery.
  4. Public-facing vaccine search: The front end, developed by a team at Boston Children’s Hospital, interfaced with the back end developed by Castlight. We decided to add front-end tests to monitor the runtime system in addition to our data and backend API tests.

Test Data:

To mock the user scenarios for the provider portal, we created a stored procedure in the MySQL database. We also created mock files to test the ingestion and export processes before testing with the production files.
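As a rough illustration, a test setup can invoke such a seeding procedure over JDBC. The procedure name and parameters below are hypothetical stand-ins; the real procedure and schema are internal to the project:

```java
import java.sql.CallableStatement;
import java.sql.Connection;
import java.sql.DriverManager;

// Minimal sketch: seed mock provider data by calling a stored procedure.
// The procedure (seed_mock_provider) and its parameters are hypothetical.
public class TestDataSeeder {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:mysql://localhost:3306/vaccine_test", "test_user", "test_pass");
             CallableStatement call = conn.prepareCall("{call seed_mock_provider(?, ?)}")) {
            call.setString(1, "MOCK-PROVIDER-001"); // provider id for the mock scenario
            call.setInt(2, 500);                    // hypothetical: doses in stock
            call.execute();
        }
    }
}
```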

Test Types:

Data Tests: The data tests are simple SQL tests with a PASS/FAIL condition. We used our in-house Python 3-based test automation framework, which can connect to multiple relational and non-relational databases and compare the data. The data tests are plugged in as child jobs of the pipeline task jobs, and the results are posted to Slack and email for instant review and triage.
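The actual tests live in that in-house Python framework; as a rough illustration of the SQL-with-PASS/FAIL idea, here is a minimal JDBC sketch. The table and column names (provider_inventory, zip_code) are hypothetical:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// Minimal sketch of a SQL data test with a PASS/FAIL condition:
// count records that violate a rule, and pass only when the count is zero.
public class BadRecordCheck {
    public static void main(String[] args) throws Exception {
        String sql = "SELECT COUNT(*) FROM provider_inventory "
                   + "WHERE zip_code IS NULL OR LENGTH(zip_code) <> 5";
        try (Connection conn = DriverManager.getConnection(
                "jdbc:mysql://localhost:3306/vaccine_test", "test_user", "test_pass");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(sql)) {
            rs.next();
            long badRows = rs.getLong(1);
            // PASS when no bad rows exist; otherwise FAIL and report for triage
            System.out.println(badRows == 0 ? "PASS" : "FAIL: " + badRows + " bad records");
        }
    }
}
```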

API Tests: The tests are written against the API contract using Java 8 + TestNG + REST Assured and run as scheduled jobs on a Jenkins server. The tests are used in the CI/CD pipeline, and the smoke tests are used for monitoring and alerting via PagerDuty.
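For a flavor of what these contract tests look like, here is a minimal REST Assured + TestNG sketch. The base URI, endpoint path, and response field are hypothetical stand-ins, not the actual API contract:

```java
import org.testng.annotations.Test;
import static io.restassured.RestAssured.given;
import static org.hamcrest.Matchers.notNullValue;

// Minimal sketch of a contract test with REST Assured + TestNG.
public class ProviderApiSmokeTest {

    @Test
    public void providerSearchReturnsResults() {
        given()
            .baseUri("https://api.example.org")  // hypothetical base URI
            .queryParam("zip", "02115")
        .when()
            .get("/v1/providers")                // hypothetical endpoint
        .then()
            .statusCode(200)                     // contract: successful response
            .body("providers", notNullValue());  // contract: payload field present
    }
}
```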

UI Tests: The tests are written using Java 8 + TestNG + Selenium and also run on a schedule against production (smoke suite) as well as a test environment.
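A minimal sketch of one such smoke test follows (Selenium 4 style); the element locator is a hypothetical stand-in for the real page’s locators:

```java
import java.time.Duration;
import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.support.ui.ExpectedConditions;
import org.openqa.selenium.support.ui.WebDriverWait;
import org.testng.Assert;
import org.testng.annotations.AfterMethod;
import org.testng.annotations.BeforeMethod;
import org.testng.annotations.Test;

// Minimal sketch of a UI smoke test with Selenium + TestNG.
public class VaccineSearchSmokeTest {
    private WebDriver driver;

    @BeforeMethod
    public void setUp() {
        driver = new ChromeDriver();
    }

    @Test
    public void searchPageLoads() {
        driver.get("https://vaccinefinder.org"); // public search site named in this post
        WebDriverWait wait = new WebDriverWait(driver, Duration.ofSeconds(10));
        // Wait for a key element; "searchInput" is a hypothetical locator
        wait.until(ExpectedConditions.visibilityOfElementLocated(By.id("searchInput")));
        Assert.assertTrue(driver.getTitle().length() > 0, "Page title should be present");
    }

    @AfterMethod
    public void tearDown() {
        driver.quit();
    }
}
```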

Performance Tests: We used JMeter and BlazeMeter to run performance tests (we also explored Gatling and BlazeMeter browser tests). We optimized the performance of the platform by repeatedly running tests against our performance test environment and production with different configuration settings for the Kubernetes deployment and our Akamai caching layer. Using BlazeMeter’s capabilities, we were able to run the tests from six different geographic locations in the US to mimic the expected load.

Security Tests: We used Qualys to scan for vulnerabilities including cross-site scripting (XSS) and SQL injection.

Timeline

Initial Stage: In-sprint test automation

Since the application had to be deployed in a short span of time, testing time was limited, and a careful test strategy was very important. The scrum team was building features and releasing them in the same sprint, with each sprint lasting one week. If we had not automated the tests before each release, we would have created significant risk as well as technical debt.

Some key decisions that helped us mitigate risk:

  1. Write tests in parallel with the development cycle: Since the project used Okta for authentication, we wrote API tests against the off-the-shelf Okta APIs even before the first application release was available to test (see the sketch after this list).
  2. Don’t reinvent the wheel: We used existing in-house test frameworks to automate the tests. This ensured we spent our time on writing tests, rather than new frameworks.
  3. Automate small tests early: This reduced the initial test execution time while also generating the base on which we built our production monitoring. For example, the application used a Sign-In component provided by Okta. The UI locators of this web component do not change, and we had positive and negative tests that we also used to monitor the application even before the first official drop for testing :)
  4. Collaboration: Daily meetings with the Product, Development, and Support teams enabled us to review upcoming changes. We also met with the User Support leadership team to better understand the types of issues being reported by users, which helped us prioritize areas for additional automated tests.
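To make the first decision concrete, here is a minimal sketch of the kind of pre-release test we could run against Okta’s public Authentication API (POST /api/v1/authn). The Okta org URL and credentials are hypothetical; the error code shown is Okta’s standard authentication-failure code:

```java
import org.testng.annotations.Test;
import static io.restassured.RestAssured.given;
import static org.hamcrest.Matchers.equalTo;

// Minimal sketch of a negative test against Okta's Authentication API,
// runnable before the application itself exists.
public class OktaAuthNegativeTest {

    @Test
    public void invalidCredentialsAreRejected() {
        given()
            .baseUri("https://example.okta.com")  // hypothetical Okta org
            .contentType("application/json")
            .body("{\"username\":\"nobody@example.com\",\"password\":\"wrong\"}")
        .when()
            .post("/api/v1/authn")
        .then()
            .statusCode(401)                         // Okta rejects bad credentials
            .body("errorCode", equalTo("E0000004")); // "Authentication failed"
    }
}
```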

Intermediate Stage: Focus on Non-Functional Testing — Performance and Security

Given the critical nature of the application, we started security testing well in advance of the production release. We collaborated with our Security team to scan for vulnerabilities so we could quickly follow up with other engineers and remediate all issues before launch.

Due to the expected high demand for information about COVID-19 vaccine availability, performance testing was a must. We spent a lot of time running performance and endurance tests against every endpoint in the critical user journeys on the public site. The concurrency goal we set was one million users per hour (roughly 280 sessions per second).

Post-launch Activities: Continuous Test and Monitor

A few months after VaccineFinder.org launched, the application front end that serves as the vaccine search tool was split into separate sites for English and Spanish. Since we had already written UI tests against the original site, only a few minor tweaks to the UI locators were needed to get the tests up and running against the new sites, too.

API smoke tests run every five minutes and UI tests run every 10 minutes. Members of the Castlight, USDS, and BCH teams subscribe to the PagerDuty account that receives test failure alerts.

The Team

“Alone we can do so little; together we can do so much.” — Helen Keller

Many thanks to my Castlight test team: Akhilesh Patel @akhilesh1088, Gaurav Bhargava @GauravBhargav, Khadar Sheikh @KhadarSheikh
