The purpose of my post here is to share some features and trends, as well as problems, that I’ve seen with public COVID-19 data. It’s not meant as an overall tutorial for anyone wishing to begin using public COVID-19 data. There are plenty of good suggestions in many of the public health policy and data visualization forums. Go there for those.
And hey - let’s work together. After you check this out, please comment, correct me, or tell me something different or new.
PUBLIC SOURCES OF DATA
Every day more and more sources become available. There have been a few helpful aggregations of WHO, JHU, and country-, state-, and region-level data that I’ve used, including the following (a quick loading sketch follows the list):
Starschema’s aggregation, here: https://github.com/starschema/COVID-19-data
The New York Times also is a good aggregation, here: https://github.com/nytimes/covid-19-data
USAFacts runs a comprehensive site at https://usafacts.org/visualizations/coronavirus-covid-19-spread-map/
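For anyone who wants to poke at these directly, here’s a minimal loading sketch in Python with pandas. The raw-file URL below is an assumption based on how the NYT repo has been laid out; check the repo’s README for the current file names and columns before relying on it:

```python
import pandas as pd

# Assumed path: the NYT repo has published state-level data at this raw
# URL; verify against the repo README before relying on it.
URL = "https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-states.csv"

df = pd.read_csv(URL, parse_dates=["date"])
print(df.head())
print(df["state"].nunique(), "geographies,",
      df["date"].min().date(), "to", df["date"].max().date())
```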
FEATURES AND TRENDS
Here are a couple of data features and trends, good and bad, that I’ve seen so far.
Relationship between metric and extent or severity of outbreaks
Early on, I saw lots of references to the number of reported COVID-19 cases as a measure of the extent of the outbreak. We now know that reported cases are largely a function of how many people have been tested, rather than of the true extent of the outbreak. Here in the United States (name your reason) we’ve been unable to quickly deploy reliable and comprehensive testing. And the results that come back are statistically limited in terms of their health, economic, and sociological representative value.
Unfortunately, deaths tell a better story. In most developed countries, when someone dies, the death and its cause are recorded by an authority, who then regularly tabulates this statistic. Everyone pivoted quickly to this - for example, John Burn-Murdoch and the data viz team at the Financial Times recognized it and added deaths as a measure.
However, as time goes on, it’s hard even to agree on how many have died from COVID-19, for reasons I describe below in the Technical Challenges section.
The ‘how much and where’ versus the ‘how bad and when’
Outbreak maps, showing where COVID-19 is happening, are news and social media’s most popular and readable visualizations of the outbreak. These maps, featuring either color-coding or bubble marks, show the relative size of case or death counts. You can easily see where the incidence of outbreaks is greatest.
Maps have a harder time communicating how things are going, and in particular, how they are trending. Colors and arrows can show trends; one of the more sophisticated examples is a trending representation from Mathieu Rajerison.
We can show the growth-rate trend of either cases or deaths. Early on - and again, the Financial Times chart is an excellent example of this - I and others made the knee-jerk mistake of dismissing a chart of case counts that looked thoughtlessly distorted by an arbitrary scale. Smarter people quickly jumped in to explain that the scale was logarithmic, and totally appropriate: epidemics, by nature, tend to grow at an exponential rate, rather than a linear one.
The idea behind the chart (example above provided by Chris Canipe at Reuters) is to show how an individual cohort’s experience (whether a country or a segment of a population) is improving or worsening, as seen in the trajectory and inflection of the curve, but also the rate of the exponential growth, as shown by the slope of the curve. Most of these charts plot a steady exponential growth rate as a benchmark against which each population can be measured. These charts are super valuable for showing the severity of the outbreak, and also for extrapolating the total deaths or cases we can expect over the next few weeks.
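To make the log-scale point concrete, here’s a small sketch in Python with matplotlib. It uses entirely synthetic numbers, and is the general FT-style shape rather than a reproduction of either the FT’s or Reuters’s actual charts:

```python
import numpy as np
import matplotlib.pyplot as plt

days = np.arange(30)

# Two synthetic cumulative-death trajectories, purely illustrative.
cohort_a = 10 * 1.25 ** days                      # steady ~25%/day growth
cohort_b = (10 * 1.35 ** np.minimum(days, 15)
               * 1.05 ** np.maximum(days - 15, 0))  # fast, then flattening

# Benchmark: a constant doubling time of 3 days.
benchmark = 10 * 2 ** (days / 3)

plt.semilogy(days, cohort_a, label="Cohort A (synthetic)")
plt.semilogy(days, cohort_b, label="Cohort B (synthetic, flattening)")
plt.semilogy(days, benchmark, "k--", label="Doubles every 3 days")
plt.xlabel("Days since 10th death")
plt.ylabel("Cumulative deaths (log scale)")
plt.legend()
plt.show()
```

On the log axis, straight segments mean constant exponential growth; Cohort B’s bend at day 15 is the flattening these charts are designed to reveal.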
Decontextualizing and dehumanizing COVID-19 casualties through relativity and probabilities
I often see COVID-19 cases and deaths presented in ways that dehumanize and decontextualize the human condition of falling ill, being hospitalized, or dying. They include:
Incidence relative to the entire population or a cohort/segment of the population. This is typically presented as a way of showing a probability or statistical magnitude. If only 1 out of 20 people (i.e., a 5% rate) have an incidence, we feel more comforted than we would by a higher probability, say 1 out of 2. However, at scale, this completely ignores the human costs. In a city of 8 million people, even 1 out of 100 represents 80,000 people whose lives are disrupted, permanently changed, or ended. The social, psychological, and economic costs of that are devastating, especially when society already operates with a thin safety net under the presumption that people are always going to be fine.
Incidence relative to other typical cases of illness or mortality. For example, COVID-19 cases compared to heart disease, cancer, diabetes, or vehicular accidents.
To anyone who might want to take this fight up: please stop. This second example is particularly dismissive.
The timing of the incidence is much more concentrated than the distribution of other types of illness and mortality, thereby overloading the hospital and health care systems.
The application of care to a cancer patient or an accident victim requires different resources and protocols than for a COVID-19 patient - and the COVID-19 protocols are furthermore novel and changing, worsening the system overload.
TECHNICAL CHALLENGES
Aside from how the information is applied, there are also challenges I’ve seen and had with the data itself - how it is collected, gathered, and reported. These have been getting in the way of credibility and reliability. I’ve put down a few here that I think are causing the most problems - watch out for them:
Common discrepancies and differences
Here in New York City, we’ve had bad news all day, all the time. When I go through Twitter and the news media, I’ll probably see three or four versions of yesterday’s cases and deaths. They probably come from:
Timing differences: some publishing sources may publish several times throughout the day; a version you see may be the 5PM posting versus the midday posting that was used somewhere else.
Version differences: sources often revise their data due to errors, recounts, or after revising methods.
Aggregate totals differing from the sum of individual totals: for example, a country summary may show a different number because it aggregates the timing and version differences I described above (a deduplication sketch follows this list).
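One defensive habit that helps with both the timing and version problems: treat every posting as a version, and deduplicate to the latest posting per report date before aggregating. Here’s a minimal pandas sketch; the column names and figures are my own assumptions, not any particular feed’s schema:

```python
import pandas as pd

# Hypothetical feed: multiple intraday postings of the same report_date,
# each stamped with its publication time. All names/figures are invented.
postings = pd.DataFrame({
    "region":       ["NYC"] * 4,
    "report_date":  ["2020-04-01", "2020-04-01", "2020-04-02", "2020-04-02"],
    "published_at": ["2020-04-01 12:00", "2020-04-01 17:00",
                     "2020-04-02 12:00", "2020-04-02 17:00"],
    "deaths":       [210, 223, 245, 241],  # the 5PM posting can revise midday
})
postings["published_at"] = pd.to_datetime(postings["published_at"])

# Keep only the latest posting per (region, report_date) so timing and
# version differences don't mix into downstream totals.
latest = (postings.sort_values("published_at")
                  .groupby(["region", "report_date"])
                  .tail(1))
print(latest)
```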
Data format and combination/join failures
A lot of the data collection is done by hand, by professional but super-stressed people filling out semi-arbitrary forms. Because many of the processes used to collect, aggregate, and publish the data are manual - and the point of capture itself is almost always manual - we’ll see classifications that result in join failures or misclassifications. Some examples I’ve seen include:
Location name confusion: a good example is New York - does it refer to the city, the county, the state, or the MSA? (See the join sketch after this list.)
Filename confusion: a link or file name contains a date that hasn’t been updated. This one is easy to miss, since publishing is often the last step of an otherwise automated process and requires a person to do it. Similarly, the ‘version’ field of a file sometimes isn’t updated, which will corrupt version-based joins.
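Here’s a small pandas sketch of how that location confusion turns into a silent join failure. Every name and number in it is invented, and the “append County” normalization is deliberately naive:

```python
import pandas as pd

# Two hypothetical sources; figures are invented for illustration.
cases = pd.DataFrame({"location": ["New York City", "Kings", "Westchester"],
                      "cases": [45000, 9000, 12000]})
counties = pd.DataFrame({"county": ["New York County", "Kings County",
                                    "Westchester County"],
                         "fips": ["36061", "36047", "36119"],
                         "pop": [1_600_000, 2_600_000, 970_000]})

# Naive normalization: append "County" and merge. In real work, a FIPS
# crosswalk is far safer than string matching.
cases["county"] = cases["location"].str.strip() + " County"
merged = cases.merge(counties, on="county", how="left")

# Surface the rows that failed to join instead of silently losing them --
# "New York City" spans five counties and matches nothing here.
print(merged[merged["fips"].isna()])
```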
Vague titles and naming for metrics
It’s often difficult to tell whether the figure we’re looking at is new (incremental) or total (cumulative), and over what time frame. Often the documentation is not footnoted or annotated, and the reference material lives in a different location than the published data. The following metrics have been used as measures of the extent and severity of the outbreak (I covered cases and tests already; a quick cumulative-vs-new sanity check follows below):
Cases
Tests
Deaths
Hospitalizations
Deaths: Medical and examiner settings have been totally overwhelmed in the last two months. It’s been challenging even for the officials with the most resources to evaluate, record, and send information under their normal processes and protocols. As a result, numbers will update.
Hospitalizations: It can be difficult to confirm whether the hospitalization metric is:
A net total, representing admitted patients minus discharges, versus a cumulative count of all admissions;
Consistently tied to a COVID-19 diagnosis, as the admission diagnosis may not be the same as the interim or discharge diagnosis.
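A cheap sanity check when the naming is vague: test whether the series is monotonically non-decreasing. Cumulative counts (mostly) are; daily-new counts usually aren’t; and a dip in a supposedly cumulative series often flags a revision. A sketch in pandas, with invented numbers:

```python
import pandas as pd

# Hypothetical series whose column title just says "deaths" -- new or total?
s = pd.Series([10, 14, 19, 27, 25, 31],
              index=pd.date_range("2020-04-01", periods=6))

# A cumulative series should be non-decreasing; daily-new counts usually
# aren't. A dip in a "cumulative" series often signals a revision/recount.
if s.is_monotonic_increasing:
    print("Looks cumulative; derived daily new counts:")
    print(s.diff())
else:
    dips = s[s.diff() < 0]
    print("Not monotonic: either daily-new counts or a revised cumulative")
    print("series. Dips at:", [d.date() for d in dips.index])
```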
Pace, Urgency, and Political Factors
Governments, journalists, NGOs, and other professional bodies have been feverishly trying to make sense of the situation. The information coming out features less validation, peer review, and editorial oversight than normal. There are also concerns about raw political currency and social control that factor into what gets released, or doesn’t.
Just to conclude on this note, and to ask others who might want to participate here to do the same: I’d like to focus, as a professional, on the ways we can improve on these trends and methods, and overcome our technical challenges. Most people are working hard as hell to bring us the truth, often risking their health - and lives - along the way. We owe them a huge amount of gratitude.