Began in August 2024
I collaborated with the Data Engineering & Analytics team to create automated tests that enforce data quality standards. Previously, these tests were performed manually by testers running queries against the development and production databases; this project automated that process to improve reliability and development speed.
We used the PySpark Python library to work with data stored in the Apache Hadoop ecosystem. We tested the data as it moved from the Core layer to the Curation layer of the data lake, enforcing dynamic enumerations (valid values), custom regex validations, and other checks.
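As a rough illustration, a check of this kind might look like the sketch below. The table names, columns, and regex pattern are hypothetical placeholders; the real checks were driven by the team's data quality standards.

```python
# Hypothetical sketch of an automated quality check with PySpark.
# Table names, column names, and the regex are placeholders, not the real standards.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("data-quality-checks").getOrCreate()

# Data under test as it lands in the Curation layer (table name is illustrative).
df = spark.read.table("curation.customer_accounts")

# Dynamic enumeration check: allowed values come from a reference table, not a hard-coded list.
valid_statuses = [row.status for row in spark.read.table("core.valid_statuses").collect()]
enum_violations = df.filter(~F.col("account_status").isin(valid_statuses)).count()

# Custom regex validation: e.g. account IDs must match an expected pattern.
format_violations = df.filter(~F.col("account_id").rlike(r"^[A-Z]{2}\d{8}$")).count()

assert enum_violations == 0, f"{enum_violations} rows with an invalid account_status"
assert format_violations == 0, f"{format_violations} rows with a malformed account_id"
```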
From this project, I learned a lot about working with data at scale: the data lake and its layers, the processes surrounding data ingestion and presentation, and data governance laws.
I worked heavily to ensure the project was thoroughly documented, which helped onboard the next wave of interns. This included orientation material such as the project's purpose and the technologies involved, as well as high-level architecture diagrams and introductions to each of the tools we used.
As part of this process, I developed a small command-line utility to streamline fetching logs from the latest job run; it queried the cluster and extracted the relevant application logs for display.
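A minimal sketch of that kind of utility is below, assuming the jobs ran on YARN. The `yarn application -list` and `yarn logs -applicationId` commands are the standard Hadoop/YARN CLI; the job-name matching and output parsing here are illustrative assumptions rather than the original tool.

```python
# Hypothetical sketch of the log-fetching utility, assuming jobs ran on YARN.
import argparse
import subprocess
import sys


def latest_application_id(name_filter: str) -> str:
    """Find the most recent YARN application whose name contains name_filter."""
    listing = subprocess.run(
        ["yarn", "application", "-list", "-appStates", "ALL"],
        capture_output=True, text=True, check=True,
    ).stdout
    app_ids = [
        line.split()[0]  # first column of each data row is the application ID
        for line in listing.splitlines()
        if line.startswith("application_") and name_filter in line
    ]
    if not app_ids:
        sys.exit(f"No applications found matching '{name_filter}'")
    # Application IDs embed a sequence number, so max() is a simple "newest" heuristic.
    return max(app_ids)


def print_logs(app_id: str, keyword: str | None) -> None:
    """Fetch the aggregated logs for an application and print matching lines."""
    logs = subprocess.run(
        ["yarn", "logs", "-applicationId", app_id],
        capture_output=True, text=True, check=True,
    ).stdout
    for line in logs.splitlines():
        if keyword is None or keyword in line:
            print(line)


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Fetch logs from the latest job run")
    parser.add_argument("job_name", help="substring of the job name to match")
    parser.add_argument("--grep", help="only print log lines containing this keyword")
    args = parser.parse_args()
    print_logs(latest_application_id(args.job_name), args.grep)
```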