Identify the problem before assuming solutions.
Extending a System
Any data transformation application consists of two parts:
- The boundary, which talks with data sources and destinations.
- The core, which transforms the data from the way it is represented in the sources to the way each destination wants it.
The boundary is much more difficult to test than the core, because testing it means interacting with other systems.
From another perspective, we can think of the job of a data transformation application in two parts:
- Load and store data.
- Understand, parse, and clean data.
Parsing, understanding, and cleaning data is much more difficult to test than loading and storing it, because real input is messy and varies widely.
The key to keeping a data transformation application simple is to keep these two hard parts separate. In other words, the boundary code should only load and store opaque data, and all parsing, understanding, and cleaning of data should happen in the core.
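A minimal sketch of this separation, under stated assumptions: the function names (load_raw, store_raw, transform), the comma-separated "name,age" input, and the JSON output are all hypothetical and chosen only for illustration. The boundary functions move bytes without looking inside them, and the core is a pure function that does all the parsing and cleaning.

```python
import json
from pathlib import Path


# Boundary (hypothetical helpers): move opaque bytes in and out.
# These functions never inspect or interpret the payload.
def load_raw(path: Path) -> bytes:
    return path.read_bytes()


def store_raw(path: Path, payload: bytes) -> None:
    path.write_bytes(payload)


# Core (hypothetical transform): a pure function from bytes to bytes.
# All parsing, understanding, and cleaning happens here.
def transform(raw: bytes) -> bytes:
    records = []
    for line in raw.decode("utf-8").splitlines():
        if not line.strip():
            continue  # skip blank lines
        name, _, age = line.partition(",")
        name = name.strip().title()
        age = age.strip()
        # Keep the age only when it is a plain non-negative integer.
        records.append({"name": name, "age": int(age) if age.isdigit() else None})
    return json.dumps(records).encode("utf-8")
```

Because transform touches no files or networks, the messy-input cases can be tested by calling it directly with byte strings, while the boundary helpers stay small enough to need very few tests of their own.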
This module focuses on the core; the next module will focus on the boundary.
Implementing all the parsing, understanding, and data cleaning logic in the core requires two techniques: stick-figure testing and the data pipeline design.
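Neither technique is defined at this point in the module, so the following is only one plausible reading of a data pipeline design: the core is expressed as an ordered list of small, single-purpose transform steps that are applied in sequence. The record shape, the step names (strip_whitespace, normalize_email), and run_pipeline are assumptions made for illustration.

```python
from functools import reduce
from typing import Any, Callable, Dict, Sequence

Record = Dict[str, Any]
Step = Callable[[Record], Record]


# Hypothetical step: trim surrounding whitespace from every string field.
def strip_whitespace(record: Record) -> Record:
    return {k: v.strip() if isinstance(v, str) else v for k, v in record.items()}


# Hypothetical step: lower-case the "email" field, leaving other fields alone.
def normalize_email(record: Record) -> Record:
    return {**record, "email": record["email"].lower()}


# Apply each step in order, feeding the output of one into the next.
def run_pipeline(record: Record, steps: Sequence[Step]) -> Record:
    return reduce(lambda acc, step: step(acc), steps, record)


PIPELINE = [strip_whitespace, normalize_email]
```

In this sketch each step returns a new record rather than mutating its input, so a step can be tested on its own and reordered or replaced without touching the others.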
Create a design now that the code is written.
Use stick-figure testing to modify existing code.
Use a data pipeline to make transforms easy.