1.4 Finding, Gathering, or Creating Data

Topic Summary

The next step in the process of converting data into wisdom is to ask whether you have the data you need in order to answer your question and put your hypothesis to the test. Sometimes relevant data has already been turned into a chart or dashboard for you by someone else. Oftentimes though, there’s no ready-made chart at your disposal and you find yourself rolling up your sleeves and getting your hands dirty with the raw data itself.

You may already have the raw data in your hands. Perhaps someone emailed you a spreadsheet, or you downloaded a csv (comma-separated values) file from the internet. If you’re lucky, you have access to raw data that’s stored in a well-designed and maintained database, or via an API (Application Programming Interface). Either way, if the answer to the first “Decision” step in the process is “Yes,” then you can proceed to the Shape phase.

Other times, though, you don’t have the data you need to answer your question and test your hypothesis. Perhaps you have no idea where the data might be found, or if it even exists at all. If so, you’ll proceed to the final step in the WONDER Phase: “Find & gather data or create it yourself.” In order to become proficient in the language of data, you’ll need to learn how to get your hands on raw data so that you can work with it, learn from it, and communicate it to others. If you can’t manage to get ahold of raw data, then you’ll hit a ceiling in your development. 

This topic in the course provides some practical advice on how to go about getting your hands on data in a couple different situations – on the web and on the job. And then it covers how to select and filter a large data set down to just the relevant records and attributes using Microsoft Excel, SQL, and a web portal interface.

Where are You in the Flowchart?

The WISDOM Data-Working Flow, with the fourth and fifth steps highlighted: WONDER - Do you have relevant data? and WONDER - Find & gather data or create it yourself.

Key Points To Remember

  • When gathering data from the web, an off-the-shelf web scraping application like DataMiner.io or WebScraper.io, or a custom web scraping program such as one written using the Python “Beautiful Soup” library can be useful.
  • For simpler situations where you simply need to connect a spreadsheet to a live, online data table, you can use built-in functions, such as:
    • The =IMPORTHTML function in Google Sheets
      • Syntax: =IMPORTHTML(“url”,”query”,index)
      • Example: =IMPORTHTML(“https://en.wikipedia.org/wiki/List_of_tallest_buildings”, “table”, 2)
    • The Get Data from Web function in MS Excel
  • If you’re struggling to find data in a corporate environment, consider turning to either a data curator or a data catalog for help getting access to the data you need.
  • If you want to limit a data set to just what’s relevant to your question, then you’ll need to perform one or both of the following actions:
    • Selecting the attributes/columns that you want to include in your analysis.
    • Filtering the records to exclude the ones you don’t need, and including the ones you do need to test your hypothesis.
  • You can use many different data tools and technologies to select and filter a data set down to just the relevant data, including MS Excel, SQL, web portals, visual analytics softwares, programming languages, and more. The good news is that you don’t have to learn them all. You just have to know how to use at least one of the tools that would work in a given situation.

Definitions

  • data curator – a person within an organization who’s intimately familiar with many data sets and their respective uses. The data curator hand-picks and combines individual assets, displays them for others to find, and introduces them to the meaning and value of what’s on display.
  • data catalog – a data catalog is a platform or service that organizes data sources for its users, making them more visible and accessible. It also provides important metadata that explains where the data came from and what the terms, acronyms, and units of measure actually mean.

Tool Tutorials

To follow along with this tool tutorial, first create a free account at data.world, a cloud-native data catalog that we use to share files and examples. Then, go to the Data Literacy Level 2 On Demand project space. Click “Launch workspace” in the upper right corner to open a workspace where you can see shared data files and SQL queries.

Google Sheets: How to gather data from a website using IMPORTHTML

Excel: How to gather data from a website using Get Data from Web

Excel: How to use the Filter feature to filter a data table

SQL: How to Filter a Data Table using SELECT & WHERE Clauses, and ON & IN Operators

Course Project

Look at your observation, question, and hypothesis, and ask yourself if you have data that will help you put this hypothesis to the test and answer your question.

  • If so, then congratulations! You can skip to the next step – you’re headed to the SHAPE Phase.
  • If not, then you’ll need to get your hands on some good data to proceed.
    • Is there a data set that already exists that you can access? Then do so!
      • Do you need to select or filter records to just what’s relevant? Then use whatever tool you can use to get data that’s relevant to your inquiry.
    • Is it a data set that doesn’t exist that you’ll need to create? Then do so!
      • If it’s sample data, then make sure your samples are representative of the population from which they are being drawn. Eliminate any sources of bias and use a sample size that’s as large as is feasible given the time and budget constraints you’re working with. Shoot for a sample size of no less than 30.

Quotes

“Data! Data! Data!…I can’t make bricks without clay.”

Sherlock Holmes, in Arthur Conan Doyle’s The Adventures of Sherlock Holmes (London: George Newnes Ltd, 1892)

Further Learning