As with any project of meaningful utility and scale, we never know all of its needs up front.
First, we build the thing, and then we see where it takes us. We learn as quickly as possible, adapt, and grow. (Who could have anticipated that governments would publish pandemic case data in PDFs or images? Or require cookies and CSRF tokens just to request a page containing basic public health data?)
The purpose of this document is to discuss the future architecture plans¹ for COVID Atlas.
This issue assumes a semi-large scale refactor.
I know, this can make folks feel uncomfortable. It makes me somewhat uncomfortable. It's also where we are.
A quick spoiler: scrapers may need some updating, but they will be preserved! We love our scrapers. We are not tossing out the scrapers!
Why start fresh
The initial analysis I did of the coronadatascraper codebase seemed promising for an in-flight, gradual refactor into production infrastructure.
After spending the last few weeks in the codebase, however, we discovered deep underlying architectural flaws that pose significant barriers to overcoming core issues in our current processes.
For those that may not be aware of the problems downstream of these issues, they include such fan favorites as: Larry has to stay up until 10pm every night manually releasing the latest data set, which only he knows how to do; unexpected errors can fatally break our entire build; and, even minor changes require a large degree of manual verification.
@lazd and I agree these issues are fundamental and must be addressed with seriousness, care, and immediacy.
Second-system syndrome
We must immediately call out a common reason refactors or rewrites may fail: second-system syndrome.
Putting aside the fact that this codebase is only a few weeks old, we still need to be clear about expectations: v1.0 will likely seem like a step back at first; it will do fewer things, and the things it does may be approached differently.
This issue is not a dropbox for every idea we have, or a long-term roadmap for the future. This issue is a plan to get us into robust and stable production infra as soon as possible, and to begin phasing out parts of CDS as quickly as possible.
What we learned from v0 (coronadatascraper) architecture
Over the last few weeks, we learned an enormous amount from coronadatascraper. Below is a summary of a few of those findings that informed this decision, and will continue to inform our architecture moving forward:
Crawling
Crawling once daily is insufficient and a catastrophic single point of failure
We have witnessed frequent failures for a variety of reasons, and need to be crawling many times per day
Crawling many times per day necessitates datetime normalization and understanding source locales
Example: the 2020-04-01T00:00:00.000Z crawl for San Francisco, CA must somewhere, at some point, cast its data to 2020-03-31
If this example is not immediately apparent to you, that's ok! Just take our word for it for the moment (a small sketch of this date casting follows this list)
Datetime normalization is greatly aided by the increased separation of concerns of crawling and scraping (e.g. logic related to URLs belongs outside of logic related to scraping)
No individual crawl failure should ever take down crawling another source; crawls should run independently of other sources
The cache should be read-only; humans should not be responsible for maintaining the cache
We must remove manual steps prone to human error or other external factors (whether someone's internet connection is working) and replace such steps with automation
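To make the San Francisco example above concrete, here is a minimal sketch of casting a UTC crawl timestamp to a source's local calendar date using the built-in Intl API; the helper name is an illustrative assumption, not the actual implementation.

```js
// A minimal sketch of casting a UTC crawl timestamp to the source's local
// calendar date using the built-in Intl API; the helper name is an
// illustrative assumption, not the project's actual implementation.
function castToLocalDate (isoTimestamp, timeZone) {
  const parts = new Intl.DateTimeFormat('en-US', {
    timeZone,
    year: 'numeric',
    month: '2-digit',
    day: '2-digit'
  }).formatToParts(new Date(isoTimestamp))
  const get = type => parts.find(part => part.type === type).value
  return `${get('year')}-${get('month')}-${get('day')}`
}

// The crawl ran at midnight UTC, but it was still the previous day in San Francisco
castToLocalDate('2020-04-01T00:00:00.000Z', 'America/Los_Angeles') // => '2020-03-31'
```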
Scraping
Scrapers will frequently begin to fail due to no fault of our own
Every day a half dozen or more scrapers will start to fail due to changes in their sources
This is a known phenomenon, and should be expected and accounted for
Scrapers often require a very high degree of flexibility due to the absurd variation in how state and local governments publish data
URLs change all the time, and we need to be highly flexible with that
Some URLs are published daily and are thus date-dependent (example: VA)
Some data sources represent multiple timezones (JHU, NYT, TX)
Some need to access headers, cookies, and other non-obvious vectors in order to acquire data (RUS)
Scrapers need many built-in parsers, including HTML (unformatted and tabular data), CSV, JSON, ArcGIS, etc.; a rough source shape is sketched after this list
Some data is reported with cities subtracted from counties (example: Wayne County - Detroit)
Some countries block access to our requests! WTF!
Normalizing location names is very difficult, but we have to be extremely good at it in order for other things to work without issue (see more below)
Some scrapers may only be able to return large datasets from states; these datasets may rely entirely on post-run normalization to be usable
Which is another way of saying: scraper devs should not be solely responsible for adding ISO/FIPS IDs to their own metadata; but they are responsible for ensuring their metadata can be identified with ISO/FIPS IDs
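For illustration, here is a rough, hypothetical shape for a source definition covering a few of the flexibility needs above (date-dependent URLs, a declared parser type, parsed data passed in as params). The property names and the placeholder URL are assumptions, not a finalized schema.

```js
// A hypothetical source definition; property names (country, state, url,
// type, scraper) and the placeholder URL are illustrative assumptions only.
module.exports = {
  country: 'USA',
  state: 'VA',
  // Some sources publish date-dependent URLs, so url may need to be a function
  url: date => `https://example-health-dept.example.gov/data/${date}/cases.csv`,
  type: 'csv',
  // The scraper receives already-parsed data (here, an array of CSV rows)
  // and returns plain objects; it does not fetch anything itself
  scraper (rows) {
    return rows.map(row => ({
      county: row['Locality'],
      cases: parseInt(row['Total Cases'], 10)
    }))
  }
}
```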
Data normalization + tagging
All data being emitted from scrapers needs to be normalized to ISO and (in the US) FIPS codes
This is very important, because location normalization unlocks a large number of other key features
This includes GeoJSON, which enables us to plot distance of cases around a location, or the effects of population density
Data normalization is a key and essential ingredient in the future efficacy of the system
Normalizing our data has a number of ongoing challenges (a toy normalization sketch follows this list), including:
Variations in official casing; see: Dekalb County vs. DeKalb County || Alexandria City vs Alexandria city
Variations in characters; see: LaSalle Parish vs La Salle Parish
Varied classifications of locales; see AK: Yakutat City and Borough, Skagway Municipality, Hoonah-Angoon Census Area
Some sources do not present results cleanly and uniformly, for example:
The state of Utah aggregates the counts of three counties (Uintah, Duchesne, and Daggett) into Tricounty, which requires denormalization
Untaggable data (namely: cities) is a nice-to-have, but may only appear in certain API-driven data sets
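Here is a toy sketch of what name normalization might look like, assuming an in-memory lookup table keyed by a canonicalized name; the table contents and the canonicalize() rules are illustrative only.

```js
// A toy sketch of name-to-FIPS normalization, assuming an in-memory lookup
// table; the table contents and canonicalize() rules are illustrative only.
const FIPS_BY_KEY = {
  'us-ga-dekalb': '13089',
  'us-la-lasalle': '22059',
  'us-va-alexandria': '51510'
}

function canonicalize (country, state, locality) {
  // Strip casing, punctuation, and designators like "County", "Parish", or
  // "city" so that "DeKalb County" and "Dekalb County" collide on one key
  const name = locality
    .toLowerCase()
    .replace(/\b(county|parish|city|borough|municipality|census area)\b/g, '')
    .replace(/[^a-z]/g, '')
  return `${country.toLowerCase()}-${state.toLowerCase()}-${name}`
}

function toFIPS (country, state, locality) {
  const fips = FIPS_BY_KEY[canonicalize(country, state, locality)]
  if (!fips) throw new Error(`Cannot normalize location: ${locality}, ${state}`)
  return fips
}

toFIPS('US', 'GA', 'Dekalb County') // => '13089'
toFIPS('US', 'GA', 'DeKalb County') // => '13089'
toFIPS('US', 'LA', 'La Salle Parish') // => '22059'
```

The real implementation will need far more rules (plus denormalization for cases like Tricounty), but the core idea is that scrapers return whatever the source calls a place, and the pipeline resolves it to a stable ID.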
Local workflows
Local workflows should have clear success / failure vectors
Any backend dev should be able to easily understand and diagnose potential side effects of their changes
Any scraper dev should be able to easily understand and diagnose potential issues with their scraper's returned data
Testing
We need to employ TDD, and key areas of the codebase (such as scraper functions, caching, etc.) should be thoroughly covered by tests
Failures should be loud, and we should hear about them frequently
Scraper testing will be a particular focus
All scrapers will undergo frequent, regular tests running out of the cache (read: no manual mocks) and against live data sources to verify integrity; a test sketch follows this list
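As a sketch of what a cache-driven scraper test could look like, using tape as a stand-in test runner; loadFromCache() and the source path are hypothetical:

```js
// A minimal sketch of a cache-driven scraper test, using tape as a stand-in
// test runner; loadFromCache() and the source path are hypothetical.
const test = require('tape')
const { loadFromCache } = require('./cache')
const source = require('./sources/us/ca/san-francisco')

test('scraper returns usable rows from cached data', async t => {
  // Parse cached payloads instead of hitting the live source or manual mocks
  const parsed = await loadFromCache(source, '2020-04-01')
  const results = source.scraper(parsed)

  t.ok(Array.isArray(results) && results.length > 0, 'scraper returned rows')
  for (const row of results) {
    t.equal(typeof row.cases, 'number', 'cases is a number')
  }
  t.end()
})
```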
Moving towards 1.0 architecture
Prerequisites
Node.js 12 – same as today
For anything in the backend, we will use CommonJS, not ES modules
Node.js continues to have a lot of thrash around ES modules, and it is unclear when it will stabilize
This app has been and will continue to be written for Node.js, not the browser
Therefore, we will use the tried and true, boring, built-in option (a brief contrast is sketched after this list)
Technical decisions will be made favoring:
Separation of concerns
Functional paradigms and determinism
Developer velocity
Production readiness
Changes should be describable in failing tests
Scrapers may need some updating, but they will be preserved!
We love our scrapers.
We are not tossing out our scrapers!
Workloads will run on AWS
The cache will be served out of S3
All data will be delivered via a database (not a batch scrape job)
There will be proper APIs (in addition to or instead of large flat files)
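For clarity on the module format decision above, a brief contrast; the module path and exported helper are placeholders:

```js
// CommonJS — the built-in, stable option we will use in the backend.
// The exported helper here is a placeholder for illustration.
const path = require('path')

function cacheKey (sourceKey, timestamp) {
  return path.posix.join(sourceKey, `${timestamp}.json`)
}

module.exports = { cacheKey }

// The equivalent ES module syntax (which we are avoiding in the backend for now):
//   import path from 'path'
//   export function cacheKey (sourceKey, timestamp) { ... }
```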
Key processes
References to the "core data pipeline" refer to the important, timely information required to publish up-to-date case data to covidatlas.com location views, our API, etc.
Crawling
Crawling will become its own dedicated operation
This represents step 1/2 in our core data pipeline
This operation will have a single responsibility: loading one or more pieces of data (a web page, a CSV file, etc.) from the internet and writing that data through to the cache (a sketch of this operation follows this list)
The cache will be stored in S3, and local workflows will start to copy down or hit the S3 bucket
Incomplete crawls – say 1 of 3 requested URLs fails – should fail completely
Crawling failures should be loud; alert Slack, etc.
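A minimal sketch of the crawl operation, assuming got for HTTP and the aws-sdk v2 S3 client; the bucket name, source shape, and key scheme are placeholders:

```js
// A minimal sketch of the crawl operation, assuming got for HTTP and the
// aws-sdk v2 S3 client; bucket name, source shape, and key scheme are placeholders.
const got = require('got')
const aws = require('aws-sdk')

const s3 = new aws.S3()
const CACHE_BUCKET = 'covidatlas-cache' // hypothetical bucket name

async function crawl (source) {
  const timestamp = new Date().toISOString()
  try {
    // Single responsibility: fetch each declared URL and write the raw bytes to the cache
    for (const { name, url, type } of source.crawl) {
      const response = await got(url)
      const key = `${source.key}/${timestamp}/${name}.${type}`
      await s3.putObject({
        Bucket: CACHE_BUCKET,
        Key: key,
        Body: response.body
      }).promise()
    }
  } catch (err) {
    // Crawl failures should be loud: log here, alert Slack etc. in production
    console.error(`Crawl failed for ${source.key}:`, err)
    throw err
  }
}

module.exports = { crawl }
```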
Scraping
Scraping will become its own dedicated operation
This represents step 2/2 in our core data pipeline
Prior to invocation, the scraper-runner will load the latest, freshest data from the cache, parse it, and pass it to the scraper function (this flow is sketched after this list)
If the data is not fresh enough (say: a successful scrape has not completed in the last n hours or days), the scrape run will fail
If this cannot be accomplished for whatever reason, the scrape run will fail (read: scrape runs do not invoke crawls)
A scraper function will be supplied the parsed object(s) it has specified (e.g. CSV) as params
The scraper function will return data to the scraper runner, which will then normalize (aka "transform") the locations of its results
Non-city-level results (such as counties, states) that cannot be normalized to an ISO and/or FIPS location will fail
When a scrape is complete, its output should be a simple JSON blob that stands completely on its own
Depending on context, this result may be written to disk for local workflows / debugging, written to the database, or fired to invoke another event or events
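A rough sketch of the scraper-runner flow described above; getLatestCacheEntry, parse, and normalizeLocation are hypothetical helpers standing in for real modules, and the freshness threshold is illustrative:

```js
// A rough sketch of the scraper-runner flow; getLatestCacheEntry, parse, and
// normalizeLocation are hypothetical helpers, and the threshold is illustrative.
const MAX_STALENESS_HOURS = 24

async function runScraper (source, { getLatestCacheEntry, parse, normalizeLocation }) {
  // 1. Load the freshest cached crawl; scrape runs never invoke crawls
  const entry = await getLatestCacheEntry(source.key)
  const ageHours = (Date.now() - entry.crawledAt) / (1000 * 60 * 60)
  if (ageHours > MAX_STALENESS_HOURS) {
    throw new Error(`Cache for ${source.key} is stale (${ageHours.toFixed(1)}h old)`)
  }

  // 2. Parse the raw payload into the shape the scraper declared (e.g. CSV rows)
  const parsed = parse(entry.body, source.type)

  // 3. Invoke the scraper with parsed data only
  const results = source.scraper(parsed)

  // 4. Normalize locations; normalizeLocation should throw for non-city results
  //    that cannot be mapped to ISO/FIPS, failing the run
  return results.map(row => ({ ...row, ...normalizeLocation(row) }))
}

module.exports = { runScraper }
```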
Annotator (updating locations' metadata) ← name needs work
What I'm currently calling annotation or tagging (additional name ideas welcome!) is its own dedicated operation
It will run periodically / async, and is not part of our core data pipeline
This operation will loop over all our locations at a higher level and ensure corresponding location metadata is updated (sketched after this list); examples:
Associate a location with its GeoJSON
Associate a location with population density, hospital beds, etc.
(More to come!)
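A rough sketch of what the annotation pass might look like; db.locations, findGeoJSON, and findCensusData are hypothetical interfaces for illustration only:

```js
// A rough sketch of the annotation pass; db.locations, findGeoJSON, and
// findCensusData are hypothetical interfaces for illustration only.
async function annotateLocations (db, { findGeoJSON, findCensusData }) {
  const locations = await db.locations.all()
  for (const location of locations) {
    const updates = {}

    // Associate the location with its GeoJSON feature, keyed by ISO/FIPS ID
    const feature = await findGeoJSON(location.id)
    if (feature) updates.geometry = feature.geometry

    // Attach population, density, hospital beds, etc. from reference datasets
    const census = await findCensusData(location.id)
    if (census) {
      updates.population = census.population
      updates.populationDensity = census.density
      updates.hospitalBeds = census.hospitalBeds
    }

    await db.locations.update(location.id, updates)
  }
}

module.exports = { annotateLocations }
```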
Metadata updater ← name needs work
Updating metadata is its own dedicated operation
It will run periodically / async, and is not part of our core data pipeline
Metadata updates ensure our sources for metadata tagging are up to date
This may include updating and loading various datasets (GeoJSON, population / census data, etc.) into the database for querying during tagging; a loading sketch follows this list
Rating sources
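As a sketch of the dataset-loading half of this operation; the file format assumptions, ID property, and db interface are placeholders:

```js
// A rough sketch of loading a metadata dataset into the database for use
// during tagging; the file path, ID property, and db interface are placeholders.
const fs = require('fs')

async function loadGeoJSONDataset (db, filePath) {
  const collection = JSON.parse(fs.readFileSync(filePath, 'utf8'))
  for (const feature of collection.features) {
    // Key each feature by the same ISO/FIPS ID used elsewhere in the pipeline
    await db.geojson.upsert({
      id: feature.properties.id,
      geometry: feature.geometry,
      properties: feature.properties
    })
  }
}

module.exports = { loadGeoJSONDataset }
```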
Blob publishing (tbd)
Any large published datasets that we don't want to make accessible via a dynamic API will be produced by a blob publishing operation
It will run periodically / async, and is not part of our core data pipeline
I'm looking forward to your thoughts, questions, feedback, concerns, encouragement, apprehension, and giddiness.
Let's discuss – and expect to see a first cut this week!
¹ Previous planning took place in #236 + #295