3 Signs You’re Working with a Dead-End Dataset
Every dataset that you work with is imperfect. There are always limitations, but if you take the time to understand the limitations, then you’ll be able to quickly move on from a dead-end dataset or provide the appropriate caveats in your analysis and presentations. This will save you time and protect you from misleading others.
To help you identify a dead-end dataset before you spend too much time with it, we’ll share three signs to look for when you hit that download button.
1. OPAQUE DATA: Watch out for opaque data, where it’s not clear who is responsible for the data.
Why? When a dataset gets separated from its creator, it’s difficult to track down the limitations of the data, the way it was collected, and how it was transformed. This means you could be misrepresenting the data when you visualize it or make decisions from it. You also won’t be able to answer important questions about your dataset, like “How was this attribute measured?” or “What does this outlier mean?”
What do you look for? When you download a dataset, look for the name of the person or organization that created it. Is there a link or other identifying information that will help you trace the data back to its original source?
If you don’t see information about where or who the data came from, then this is a dead-end dataset.
2. AMBIGUOUS DATA: Watch out for data with unclear or lacking metadata (data about the data), like definitions, terms and units.
Why? With little or no background information about the dataset, you’re forced to make assumptions about what the attributes mean and how they were collected. You may think it’s safe to assume a degree symbol (°) refers to Fahrenheit, but that assumption may lead you to declare that your state had its coldest month ever when in reality the data collector simply used Celsius. A small assumption can lead to big mistakes.
What do you look for? Every dataset should have all of its terms, units, and definitions clearly spelled out in a place that’s easy for you to access. This is often called a data dictionary, and it can take the form of a separate tab in the spreadsheet or a “Methodology” section of the website. Organizations are getting better at including this metadata, but you may need to ask for it.
If you’re unclear about the terms, units, and definitions in your dataset, then don’t make assumptions. Ask for the information; otherwise, this is a dead-end dataset.
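If your dataset does come with a data dictionary, one way to put this check into practice is to compare the two and list every column that lacks a definition or a unit. The sketch below is only an illustration: the function name `undocumented_columns`, the column names, and the dictionary layout (a `definition` and a `unit` per column) are all assumptions, not a standard format.

```python
def undocumented_columns(columns, data_dictionary):
    """Return the columns that lack a definition or a unit in the data dictionary."""
    missing = []
    for name in columns:
        entry = data_dictionary.get(name, {})
        # A column counts as undocumented if either field is absent or empty.
        if not entry.get("definition") or not entry.get("unit"):
            missing.append(name)
    return missing


# Hypothetical dataset columns and data dictionary, for illustration only.
columns = ["station_id", "month", "avg_temp"]
data_dictionary = {
    "station_id": {"definition": "Weather station identifier", "unit": "none"},
    "month": {"definition": "Calendar month of the reading", "unit": "YYYY-MM"},
    # The unit is blank here: is the temperature in Fahrenheit or Celsius?
    "avg_temp": {"definition": "Mean daily temperature", "unit": ""},
}

print(undocumented_columns(columns, data_dictionary))
```

Any column this reports is one where you would need to ask the data’s creator for clarification rather than guess.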
3. INCOMPLETE or OUTDATED DATA: Watch out for data that has not been updated in some time or data that contains unexplained blank/null values.
Why? Again, when there are blank or null values, you’re forced to make assumptions. Does a blank value mean zero? Did they stop collecting data for a certain time? Why? Never assume a blank value means zero. Check the original data source or data dictionary for explanations and limitations because you may not be able to confidently complete your analysis if the data is incomplete. Outdated information could also result in false conclusions because you’re not using current information in your analysis.
What do you look for? Check the date that the data was collected. Sometimes data collected a year ago is very recent, and sometimes it’s astonishingly out of date. It depends on your situation. Also, look for an explanation of what blank/null values mean for each attribute. You may find it in the data dictionary, in a note somewhere in the spreadsheet, or on the website you downloaded the data from.
If you can’t find the date the data was collected or an explanation for blank/null values, then this is a dead-end dataset.
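Before you go looking for those explanations, it can help to know how many blanks you’re dealing with and how old the data is. Here is a minimal sketch using only Python’s standard library. The sample data, the helper names `blank_counts` and `days_since`, and the list of strings treated as blank are assumptions made for illustration; your dataset may mark missing values differently.

```python
import csv
import io
from datetime import date

# Assumed markers for "no value"; check your data dictionary for the real ones.
NULL_MARKERS = {"", "na", "n/a", "null", "none", "nan"}


def blank_counts(rows):
    """Count blank/null-looking values per column without guessing what they mean."""
    counts = {}
    for row in rows:
        for column, value in row.items():
            is_blank = value is None or value.strip().lower() in NULL_MARKERS
            counts[column] = counts.get(column, 0) + (1 if is_blank else 0)
    return counts


def days_since(collected_on, today):
    """Return how many days have passed since the data was collected."""
    return (today - collected_on).days


# Hypothetical sample standing in for a downloaded CSV file.
sample = io.StringIO(
    "month,avg_temp,rainfall\n"
    "2021-01,31.5,\n"
    "2021-02,NA,2.1\n"
    "2021-03,40.2,\n"
)
rows = list(csv.DictReader(sample))

counts = blank_counts(rows)
age_in_days = days_since(date(2021, 3, 31), date(2022, 3, 31))

print(counts)
print(age_in_days)
```

The counts only tell you where the gaps are. Whether a blank means zero, “not measured,” or something else still has to come from the original source, and whether the age of the data is acceptable depends on your situation.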
These are only a few ways your dataset may be unusable. If you’d like to learn more about incomplete data and what a real-life, tool-agnostic process of transforming raw data into wisdom looks like, then join this NEW training from Data Literacy CEO Ben Jones!
Alli Torban is a contributor to Data Literacy LLC. She’s an Information Design Consultant based in Washington, D.C., and hosts the podcast Data Viz Today. Alli specializes in designing data visualizations for researchers to help get their work understood by a wider audience. She enjoys creating whimsical patterns and spending time with her husband and two daughters.