Hands-on Activity: Big Data, What Are You Saying?

Contributed by: Computing Research Experience for Teachers Program, University of Notre Dame

A “Wordle” graphic shows 16 related words scattered about with different orientations. The words are: information, statistics, cloud, learning, excel, CVS, the, models, big, analytics, data, machine, knoema, google, research, technology.
Big data tools.
Copyright © 2016 Wordle (images created with Wordle are permitted to be used in any way) www.wordle.net/create

Summary

Students act as R&D entrepreneurs, learning ways to research variables affecting the market of their proposed (hypothetical) products. They learn how to obtain numeric data using a variety of Internet tools and resources, sort and analyze the data using Excel and other software, and discover patterns and relationships that influence and guide decisions related to launching their products. First, student pairs research and collect pertinent consumer data, importing the data into spreadsheets. Then they clean, organize, chart and analyze the data to inform their product production and marketing plans. They calculate related statistics and gain proficiency in obtaining data and finding relationships between variables, which is important in the work of engineers as well as for general technical literacy and decision-making. They summarize their work by suggesting product launch strategies and reporting their findings and conclusions in class presentations. A handout of tips for finding data, a project/presentation grading rubric and an alternative self-guided activity worksheet are provided. This activity is ideal for a high school statistics class.
This engineering curriculum meets Next Generation Science Standards (NGSS).

Engineering Connection

Engineers obtain and analyze data for myriad applications, including quality control of processes, modeling and simulation design, and optimizing production supply based on consumer demand. They may use data to improve the reliability of existing manufacturing processes, predict emerging equipment failures, design next-generation manufacturing equipment, and invent new technologies based on historical data. Data analysis also brings to light hidden factors that affect demand, including demographic influences, contributions of related variables, and variability over short and long time intervals. Data engineers are specialists in growing demand; they design and build applications to collect data and organize it for data scientist teams. Big data is influencing all sorts of industries (healthcare, entertainment, transportation, government, and even dairy), with uses that range from emissions control to planning for transportation, disaster relief and population migration, to product evolution and productivity optimization. Many of today’s students may end up in "big data" career paths.

Pre-Req Knowledge

  • A familiarity with measures of center (mean, median and mode), measures of spread (range, variance and standard deviation) and other measures (mean absolute deviation), correlation and linear regression (at least a notion of its use; calculation comes later).
  • Basic ability to use Internet browsers and navigate menus on personal computers.
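
As a quick refresher on the measures named above, the following minimal sketch computes each one with Python's standard library. The price list is hypothetical sample data invented for illustration; it is not part of the activity files.

```python
import statistics

# Hypothetical sample: prices (in dollars) of eight competing products.
prices = [40, 55, 55, 60, 70, 85, 90, 105]

# Measures of center
mean = statistics.mean(prices)          # arithmetic average
median = statistics.median(prices)      # middle value of the sorted data
mode = statistics.mode(prices)          # most frequent value

# Measures of spread
data_range = max(prices) - min(prices)  # largest value minus smallest value
variance = statistics.variance(prices)  # sample variance (divides by n - 1)
std_dev = statistics.stdev(prices)      # sample standard deviation

# Mean absolute deviation: average distance of each value from the mean.
mad = sum(abs(p - mean) for p in prices) / len(prices)

print(mean, median, mode, data_range, variance, std_dev, mad)
```

Note that `statistics.variance` and `statistics.stdev` use the sample formulas; `statistics.pvariance` and `statistics.pstdev` give the population versions.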

Learning Objectives

After this activity, students should be able to:

  • Use software tools to locate and import data into Excel.
  • Organize and filter data to compare related factors.
  • Generate graphs and statistics for analysis.
  • Determine factors and influences on variables.
  • Use data analysis to form and support conclusions.
  • Think and argue critically about decisions and reflect on their advantages and shortcomings.

Educational Standards

Each TeachEngineering lesson or activity is correlated to one or more K-12 science, technology, engineering or math (STEM) educational standards.

All 100,000+ K-12 STEM standards covered in TeachEngineering are collected, maintained and packaged by the Achievement Standards Network (ASN), a project of D2L (www.achievementstandards.org).

In the ASN, standards are hierarchically structured: first by source; e.g., by state; within source by type; e.g., science or mathematics; within type by subtype, then by grade, etc.

  • Different patterns may be observed at each of the scales at which a system is studied and can provide evidence for causality in explanations of phenomena. (Grades 9 - 12)
  • Summarize, represent, and interpret data on a single count or measurement variable (Grades 9 - 12)
  • Use statistics appropriate to the shape of the data distribution to compare center (median, mean) and spread (interquartile range, standard deviation) of two or more different data sets. (Grades 9 - 12)
  • Summarize categorical data for two categories in two-way frequency tables. Interpret relative frequencies in the context of the data (including joint, marginal, and conditional relative frequencies). Recognize possible associations and trends in the data. (Grades 9 - 12)
  • Represent data on two quantitative variables on a scatter plot, and describe how the variables are related. (Grades 9 - 12)
  • Evaluate reports based on data. (Grades 9 - 12)
  • Collect information and evaluate its quality. (Grades 9 - 12)
  • Information and communication systems can be used to inform, persuade, entertain, control, manage, and educate. (Grades 9 - 12)
  • Create, compare, and evaluate different graphic displays of the same data, using histograms, frequency polygons, cumulative frequency distribution functions, pie charts, scatterplots, stem-and-leaf plots, and box-and-whisker plots. Draw these with and without technology. (Grades 9 - 12)

Materials List

Each group (student pair) needs:

For the entire class to share:

  • capability to show the class websites, infographics, datasets and online videos

Introduction/Motivation

Who knows what we mean by the phrase “big data”? (Listen to student responses and explore it as a class, as described in the Assessment section.)

In our world, the accumulation of data is massive and rapidly growing. For that reason, we refer to it as “big data.” Here is a summary of some facts and statistics related to big data. According to IBM Big Data & Analytics Hub and Forbes.com, each day about 300 billion emails are sent, more than 230 million Tweets are made, 45 billion Facebook messages are sent, and 3.5 billion Google searches are made. According to TechCrunch.com, by 2020, our world will have 6.1 billion smartphone users sensing, interacting, storing, and retrieving data at a rate never before seen. Currently, we analyze and utilize less than one-half of one percent of the data that exists, but industry and governments are increasing their investment in research to aid in marketing, healthcare, policing, decision-making and entertainment. (In addition, as desired, show students some of the resources, including infographics, listed in the Additional Multimedia Support section, to provide some notion of the scope and applications of big data.)

Who cares about big data? Why should you care about big data? Well, first of all, you should be aware of how much data is being collected right now, all around you and generated by you! And, if you like the project we’re about to embark on, you might get a future job that has something to do with big data.

Today we’re starting on a multi-day project that will conclude with a presentation by you and your team partner. The work you do for this project is intended to improve your ability to find and analyze data—much more than a simple Google keyword search or looking on a Wikipedia page. It will also give you practice in manipulating data and analyzing it to see what it tells you by looking for data relationships and trends. And, you’ll also get the opportunity to be creative in applying your data discoveries to product launch planning and strategizing. All these skills are valuable in navigating our everyday lives and maybe a future career.

How might being a “data master” have real value in the working world? Here are a few examples:

  • Data scientists and engineers compile all the data from wearable fitness trackers to continually revise and revamp their products and their interfaces to be better, plus all that collected health information generates huge databases for analysis by the healthcare industry.
  • A great amount of data is also generated from the media and social media, and the entertainment industry uses what it learns from that data to customize its products. For example, Netflix discovered a correlation between the color of a TV show title’s cover art and customer response. The media mogul also uses its collected data on viewer habits and preferences—all those clicks and choices you make when you’re watching (or stopped watching) shows and deciding on your “watch list”—to tailor its productions, including its hit series House of Cards.
  • The railroad company Union Pacific uses alert systems to measure daily emissions using data composed of 20 million pattern matches. That’s a big dataset!
  • Rio de Janeiro’s government uses big data to improve its regional transportation, natural disaster relief, and population migration.
  • Even non-high-tech industries are collecting data now. For example, engineers who work for the dairy industry compile data from genomics and productivity statistics to improve the industry’s methods and production in coordination with feedback from sensors on each animal.

Can you think of any other examples of data being gathered and then used to make decisions, suggest conclusions and give advice? (See what ideas students may have.)

So that you get a taste of what it’s like to be a “data master,” for this project you’ll work in pairs as if you are an R&D team that wants to bring a new (fictional) product to market. So imagine that you and your partner are opening a new business. You are investing your life savings, so you need to do everything possible to make sure your venture doesn’t go belly up! Over the next few days, you and your partner will decide on an item you are going to produce. Maybe you have an idea for a new kind of specialized shoe, or music, electronic equipment or food. Or maybe a new idea for an app. Then, you will research and collect relevant data to examine the pulse of your target population (interest, demand, cost, trends and potential market for your product) and its relation to factors of age, income, education and other demographics.

You’ll learn about and use many resources and techniques to query and obtain useful information. Then you’ll import data into spreadsheets, sort and organize it before graphing and calculating statistics. You’ll analyze the data to see what it can tell you and apply your results to suggest production and marketing strategies that help you make better decisions with the aim to create a better product and make a higher profit. Then you will present to “the board” (your classmates) your refined product and marketing plans based on your research findings. Let’s get started!

Vocabulary/Definitions

application programming interface: A tool that enables a user with permission to access website data. Abbreviated as API.

big data: Large amounts of information (terabytes or more) that are often in raw or uncleaned form and gathered from repositories or “scraped” from websites or APIs. Big data also refers to a research field focused on extracting value from large datasets, for example, predictive analytics and user behavior analytics.

cloud computing: Using a network of remote servers hosted on the Internet to store, manage, and process data, rather than using a local server or a personal computer.

comma-separated values: A common way to organize, store and read tabular data (numbers, text) in which each data record occupies its own line and the fields within a record are separated by commas. Abbreviated as CSV.

data analytics: The process of analyzing and then reporting on data into or out of a system, with the goal to discover useful information.

data cleaning: (for purposes of this activity) Removing the following from datasets: duplicate results, unwanted data, and extraneous, superfluous information.

exponential growth: When a growth rate becomes ever more rapid in proportion to the growing total number or size.

Procedure

Activity Goals and Overview

Relevant data is seldom pre-packaged and obtainable via a Google keyword search or view of a Wikipedia page. One activity goal is to expand students’ ability to obtain and analyze data beyond keyword searches using search engines. A second goal is to show students how to analyze data to determine correlations and strengths of relationships between variables. A final goal is to encourage students to find creative ways to strategize and maximize positive results based on their research.

In this activity, students act as members of a research and development team for a hypothetical product. They research and collect relevant online data about possible target consumer populations, their interest levels, and relationships to factors of age, income, education level and trends. Student pairs gather and analyze numeric data from applications including Wolfram Alpha Pro, Knoema Data Finder, Scrapy, Datahub and other sources. They import the data into a spreadsheet application, sort and organize it, and choose appropriate variable pairs to analyze by graphing and calculating related statistics. From the research results, students suggest appropriate strategies for production and marketing of their products, and report their findings, conclusions and suggestions in presentations to the rest of the class.

Background Terminology

Data analytics is a process of examining data to discover patterns, correlations and trends in order to find useful information, prompt conclusions, and enhance and fine tune decision making.

Big data refers to the growing accumulation of information from sources such as online shopping, Internet browsing, uploaded images and sound files, and collected information from Internet-connected objects and appliances such as fitness bands and smart appliances.

Collected data may be in comma-separated values (CSV) format, tab-separated values (TSV) format, or raw (unorganized) form. These formats are different ways to organize data items (or fields). Microsoft® Excel® (and other spreadsheet and database programs) can read and interpret these data forms as columns. Raw data must be organized manually. More sophisticated methods exist, which are beyond the scope of this activity. Excel provides an intuitive approach towards sorting raw data.
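
To make the difference between the two delimited formats concrete, here is a minimal sketch that reads the same small table from CSV text and from TSV text using Python's standard `csv` module. The product names and numbers are hypothetical, and `read_table` is an illustrative helper name, not something supplied with the activity.

```python
import csv
import io

# Hypothetical data: the same three records stored in two delimited formats.
csv_text = (
    "product,price,units_sold\n"
    "sneaker,60,1200\n"
    "sandal,25,800\n"
    "boot,110,450\n"
)
tsv_text = csv_text.replace(",", "\t")  # same fields, tab-separated instead


def read_table(text, delimiter):
    """Return a list of dicts, one per record, keyed by the header row."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    return list(reader)


csv_rows = read_table(csv_text, ",")
tsv_rows = read_table(tsv_text, "\t")

# Both formats carry the same table; only the field separator differs.
print(csv_rows == tsv_rows)
print(csv_rows[0])
```

Every value comes back as text, so numeric columns need converting (for example with `float`) before calculating statistics, much as Excel must recognize a column as numeric before charting it.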

Trend lines are best fit linear regression lines. This activity uses Excel’s Add Trend Line feature.

Coefficient of determination is the R² (R-squared) value shown with the trend line that Excel creates. It is a measure of the “goodness of fit” of the trend line to the data: a number between 0 and 1, in which 1 is considered a perfect fit.
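
For reference, the standard definition of the coefficient of determination for a least-squares linear trend line can be written as follows. The symbols (m, b, x_i, y_i, n) are notation chosen here for illustration and do not appear in the activity materials.

```latex
% Fitted trend line for data points (x_i, y_i), i = 1, ..., n:
\hat{y}_i = m x_i + b
% Mean of the observed y-values:
\bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i
% Coefficient of determination:
R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}
```

For a linear trend line with one explanatory variable, R² equals the square of the correlation coefficient r between x and y.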

A graphic image shows a computer monitor and keyboard resting in a few fluffy white and gray clouds.
Where is the data? It’s in the cloud!
Copyright © 2010 Πrate, Wikimedia Commons, CC BY-SA 3.0 https://commons.wikimedia.org/wiki/File:Cloud-computing-1.gif

Before the Activity

  • Gather the software and apps, and prepare the computers.
    • (optional) Consider whether or not to provide and demonstrate on Day 2 the use of Wolfram Alpha via Mathematica 11.0 or Wolfram Alpha Pro. The Wolfram Alpha search engine yields a variety of qualitative and quantitative data, graphs and interactive features on searches. The subscription enables users to download data for exploration and analysis. The free version can be explored with interactive features and download capabilities disabled.
    • For Day 4, you may want to be ready to pull up the ESPN Example, an Excel file, to show students.
  • Make copies of the Student Tips to Finding Data Handout.
  • If your students are well-versed in Excel, consider making copies of the Student Self-Guided Worksheet to Working with Data in Excel to guide them in a more independent version of this activity, or portions of the activity. Essentially, this worksheet provides on paper the guidance that otherwise is provided by the teacher in the Procedure steps below.
  • For Day 1, make copies of the Big Data Presentation Rubric, one for each student group, or else make one copy that is posted in the classroom for everyone to see.
  • For most days, be ready to show the class websites, infographics, videos, databases, apps, tools, Excel, demonstrations and final presentations.

With the Students—Day 1: Project Overview

  1. (15 minutes) Present to the class the Introduction/Motivation section content, which includes an introduction to and discussion about the concept of big data as well as an introduction and explanation to kick off the activity.
  2. (20 minutes) After the introduction, show students the 16-minute TED Talk video, “Big Data is Better Data” by Kenneth Cukier, data editor of The Economist, explaining how big data is used as a tool to deal with social challenges, teach machines to learn through patterns, and ultimately change the way society works. After the video, recap big data sources and uses by asking students to:
    • Name some sources of big data. (Examples: Public records; smart appliances.)
    • Name some uses for big data. (Examples: To learn about and predict trends; for marketing new products and services.)
  3. (10 minutes) Give a brief description of the final project presentation rubric so expectations are clear. Hand out copies of the rubric or post one in the classroom. Assessed categories include:
    • Terms Research: precise product description, variety of import data resources, variety of studied factors
    • Data Analysis: use of scatter plots with R² values, use of other graph types, measures of summary statistics
    • Results Presentation: neat, creative, interesting; methods revealed; analysis discussion; resulting statistics interpreted; graphs interpreted; summary of findings with suggestions
    • Overall Effort, Participation and Enthusiasm: during discussions, data search, analysis and presentation
  4. (10 minutes) Divide the class into student pairs to work together for the duration of the project. Direct the teams to begin to brainstorm their product choices (such as a shoe, video game, or food) and brainstorm variables (such as product cost, price, target demographic, materials) that may relate to production and marketing. Have teams document and present their choices and variables to the teacher. This requires that the pairs each discuss and decide what product they are going to introduce, including a consideration of details such as color, size, use, cost, target demographic, etc. They also must decide which factors may be relevant to production, marketing, distribution and manufacturing of the product. (Later, at activity end, students will summarize these details in their projects’ concluding presentations.)

Day 2: Demonstrate Ways to Find and Import Data into Excel

Today, students locate and import data that is relevant to their fictitious products into Excel spreadsheets for analysis. First, the instructor demonstrates sources and methods for searching and importing data as students follow along. This is a good time (optional) to have students who are advanced in using Excel proceed to explore independently using the self-guided worksheet.

  1. (10 minutes) Pass out the tips handout and demonstrate an example big data source from a database. For example, at Data.gov, search for “shoes” (see Figure 1). Point out that few datasets are available for this search term, and draw attention to the formats in which they are offered. The comma-separated values (CSV) format, a file format for organized data, most easily incorporates into Excel. Download the Columbus Avenue BID Businesses CSV file by clicking on the CSV icon below its description and opening the file with Excel. (Tip: You may have to drag the file to an Excel worksheet or change the format in the Excel open menu from All Excel Files to Text Files. You can also import the data using the menu Data > From Text.) Point out that the data is organized by columns. Columns can be resized to fit the data (see Figure 2).

A screen capture shows the data.gov website search results for “shoe” query. A headline says: 11 datasets found for “shoes.”
Figure 1. Searching data.gov to sample a CSV file.
Copyright © Data.gov (free use without restriction) https://www.data.gov

A screen capture of an Excel spreadsheet shows columns of CSV data.
Figure 2. Opening the CSV data in Excel.
Copyright © Data.gov (free use without restriction) https://www.data.gov

  2. (10 minutes) Briefly introduce alternate search engines and databases such as Data.gov, Konect, Reddit, RefDesk, Internet Public Library, iTools, Encyclopedia.com, Reference.com, Lifewire, and Datahub.io. With the class, preview each of these sources and briefly demonstrate a search to compare with the earlier example search on Data.gov.

As mentioned earlier, it is unlikely that a search will immediately give you formatted, packaged and relevant data. Instead, demonstrate how “smart searching”—by varying your search terms and exploring available data that is related to the search, along with analysis of that data—can yield useful information. For example, searching the term “shoe” does not immediately yield data related to shoe consumption or preferences. However, a search for “disposable income” may yield Census data with population demographics that may lead to pertinent and useful trends and results.

  3. (15 minutes) Demonstrate the use of the Knoema Data Finder plug-in to find data. Figure 3 shows how a direct search of “shoes” yields a table of planned monthly spending on shoes.

How to add Knoema to Excel:

    • Download the app > go to the File tab > click Options > click Add-Ins.
    • In the Manage box > click COM Add-ins > click Go.
    • In the Add-ins available box > select the check box next to Knoema (OR add it first, if it’s downloaded but not listed).

How to use Knoema Data Finder:

    • Click the Add-Ins tab.
    • Click the Find data icon.
    • Type your query in the box in place of Search for data and statistics.

A screen capture shows an Excel spreadsheet with Knoema add-in search results for a “shoes” query.
Figure 3. Using the Knoema add-in for Microsoft Excel.
Copyright © 2016 Author’s results from using Microsoft Excel 2010 and Knoema Data Finder (fair use)

  4. (15 minutes) Demonstrate how to obtain data via Excel’s Data > From Web command.
    • Use an Internet browser to find a website that hosts numeric data in tabular form. One example is ESPN’s MLB Player Batting Stats – 2017 page.
    • Copy the website URL from the browser address bar.
    • In Excel > go to the Data tab > select From Web (left of the menu, shown in Figure 4).
    • Paste the URL into the Address box of the New Web Query form > click Go.
    • Click OK to import the data into the existing Excel worksheet.

A screen capture shows an Excel Web Query window on a blank Excel spreadsheet with data that includes pro baseball player names, win stats and photos pulled from an ESPN website.
Figure 4. Using Microsoft Excel’s Web Query utility to import data.
Copyright © 2016 Author’s results from using Microsoft Excel 2010 and ESPN MLB Player Batting Stats – 2017 data from Disney Services (fair use) http://www.espn.com/mlb/stats/batting/_/type/sabermetric

A screen capture shows an Excel spreadsheet populated with imported MLB 2016 statistics data that includes an assortment of data, some extraneous for the given research purpose.
Figure 5. Importing data often includes extraneous information.
Copyright © 2016 Author’s results from using Microsoft Excel 2010 and ESPN MLB Player Batting Stats – 2017 data from Disney Services (fair use) http://www.espn.com/mlb/stats/batting/_/type/sabermetric

  5. Notice that the data includes lots of extraneous information, as shown in Figure 5. We can “clean” the data by hiding (or deleting, with caution) rows or columns that repeat or are irrelevant.
    • Go to the Data tab > click the Remove Duplicates button > click Select All > click OK.
    • You can also manually hide remaining irrelevant text and information by selecting the rows or columns > right click > select Hide (see Figure 6).
    • Save your progress for the Day 3 instruction (or else use the included Excel file: ESPN Example on Day 3).

A screen capture of an Excel spreadsheet data import with extraneous information.
Figure 6. When importing data, use Excel features to remove data duplicates and hide unwanted data.
Copyright © 2016 Author’s results from using Microsoft Excel 2010 and ESPN MLB Player Batting Stats – 2017 data from Disney Services (fair use) http://www.espn.com/mlb/stats/batting/_/type/sabermetric
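
For classes that want to see the same cleaning idea outside of Excel, the two operations described above (removing duplicate rows and setting aside irrelevant columns) can be sketched in a few lines of Python. The rows below are hypothetical stand-ins for an imported web table, not actual ESPN data, and the helper names `remove_duplicates` and `keep_columns` are illustrative.

```python
# Hypothetical imported rows: the header repeats mid-table and one record
# appears twice, as often happens when a web table is imported.
raw_rows = [
    ["RK", "PLAYER", "TEAM", "AVG", "GB"],
    ["1", "Player A", "AAA", ".340", "210"],
    ["2", "Player B", "BBB", ".325", "185"],
    ["RK", "PLAYER", "TEAM", "AVG", "GB"],
    ["2", "Player B", "BBB", ".325", "185"],
    ["3", "Player C", "CCC", ".310", "240"],
]


def remove_duplicates(rows):
    """Keep the first occurrence of each row, preserving the original order."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row)  # lists are not hashable, so compare rows as tuples
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def keep_columns(rows, wanted):
    """Keep only the named columns; the first row is treated as the header."""
    header = rows[0]
    indexes = [header.index(name) for name in wanted]
    return [[row[i] for i in indexes] for row in rows]


cleaned = keep_columns(remove_duplicates(raw_rows), ["PLAYER", "AVG", "GB"])
for row in cleaned:
    print(row)
```

Unlike Excel's Hide command, `keep_columns` discards the other columns from the result, so the sketch leaves `raw_rows` untouched in case the original import is needed later.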

  6. (Remainder of class period) After the Excel demonstration, direct student pairs to explore their own searches using the just-demonstrated tools.
  7. (optional) Demonstrate the use of Wolfram Alpha via Mathematica 11.0 or Wolfram Alpha Pro.
    • To obtain data via Wolfram Alpha in Mathematica, open a new notebook > double press the equals key > an orange star icon appears:

The Wolfram Alpha icon, which looks like an orange 10-pointed star with a white equals sign (two short, stacked horizontal bars) inside.
Copyright © Wolfram Mathematica, Wolfram Alpha (fair use) https://www.wolframalpha.com/pro/

    • Next, type the term to be searched > press enter > and the results, including graphs and information, appear. Figure 7 shows the results from a search for “Nike Air.”
    • Clicking on the + icon in the upper right corner of each boxed item permits users to import data from the result. Alternatively, users can also manually enter data.
    • The Figure 7 example results show statistics and current market investment values and trends. Students may do further research to compare and contrast similar products.

A screen capture shows results from a Wolfram Alpha search using the term “Nike Air” via Wolfram Mathematica 11. The results include the latest stock exchange trade values (NKE, AIR), other financials like market cap, revenue, number of employees, net income, number of outstanding shares, annual earnings, P/E ratio, annual dividends per share, dividend yield, plus recent returns and an interactive line graph showing relative price history.
Figure 7. Example results from using Wolfram Alpha for data searches.
Copyright © Wolfram Mathematica, Wolfram Alpha (fair use) https://www.wolframalpha.com/pro/

Day 3: Formulate Product Plan and Gather Data

  1. (~45 minutes) Remind students to bring out the tips handout from Day 2. By this time, expect pairs to have each identified a product and begun to determine several (at least five) variables related to their products (production, marketing, sales). Direct students to begin searching for data, importing it into Excel—which means using the tab at the bottom of the Excel workbook to create a new sheet for new data.
    • Work with students to encourage them to use the search resources and not solely Google.
    • Encourage them to vary their search terms, as demonstrated in Day 2.
    • Remind them to immediately save new Excel documents, and periodically save their progress.
    • Remind them to record the URL sources of all data they collect.

Day 4: Analyzing Data with Excel

  1. (10-15 minutes) Using Excel, demonstrate to the class how to insert a scatter plot of related columns of data in order to analyze their relationships by inserting a trend line and viewing the coefficient of determination (R², R-squared). These tools enable a visual analysis and provide a statistic that shows the strength of the relationship between the two variables. (Students with advanced Excel background who are independently completing the self-guided worksheet practice the same steps.)
    • Open the Excel file from Day 3 (or use the ESPN Example Excel file). Select the first column (or row) of data by clicking and highlighting just the data values, or by selecting the row or column header at the top. Then hold down the control key (Ctrl, usually near the space bar on the keyboard) while selecting the second row or column of data. Demonstrate by selecting columns D (AVG) and I (GB) > go to the Insert menu > select Scatter > select the Scatter with only Markers option (see Figure 8) > move the chart to a blank, open area in the Excel sheet. (See sheet 2 of ESPN Example for the resulting scatter plot.)

A screen capture shows an Excel spreadsheet, its cells filled with data. Two columns are highlighted (the cells are blue), D (batting average) and I (number of ground balls), and the “Scatter with only Markers” option is chosen from the top horizontal menu bar under Insert > Scatter.
Figure 8. Inserting a scatter plot of data in Microsoft Excel.
Copyright © 2016 Author’s results from using Microsoft Excel 2010 and ESPN MLB Player Batting Stats – 2017 data from Disney Services (fair use) http://www.espn.com/mlb/stats/batting/_/type/sabermetric

    • The chart auto-scales the axes (see Figure 9). To customize the graph, double click on the values along one of the axes to bring up a menu with options.

A screen capture shows a scatter plot (graph) with the title GB (ground balls). The x-axis is “batting average,” ranging from .28 to .36, and the y-axis is the number of ground balls, from 0 to 350. The ~40 plotted data points are blue diamonds that cluster between 130 and 300, and .29 and .35.
Figure 9. An example scatter plot created in Excel plots the number of ground balls vs. batting average.
Copyright © 2016 Author’s graphing results from using Microsoft Excel 2010 and ESPN MLB Player Batting Stats – 2017 data from Disney Services (fair use) http://www.espn.com/mlb/stats/batting/_/type/sabermetric

    • To customize the output, double click on part of the graph to bring up a menu box with options. Right click on the data points themselves to add a trend line, as shown in Figure 10. In the Format Trendline box that appears, check the boxes Display Equation on chart and Display R-squared value on chart. (The R-squared value, or coefficient of determination, is a value between 0 and 1; essentially, 0 indicates no linear relation and 1 indicates a perfect linear relation.)

A screen capture of the Figure 9 Excel scatter plot shows how to add a trend line to the graph. The “Add Trendline…” option is highlighted.
Figure 10. Adding a trend line to an Excel scatter plot.
Copyright © 2016 Author’s graphing results from using Microsoft Excel 2010 and ESPN MLB Player Batting Stats – 2017 data from Disney Services (fair use) http://www.espn.com/mlb/stats/batting/_/type/sabermetric

    • Note that in this example, the R² value is 0.0089, which tells us that in this dataset, very little linear correlation exists between batting average and number of ground balls. Keep in mind that finding no correlation is useful knowledge! Experiment with other columns to note differences in plots, trends and R² values.
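  1. (10-15 minutes) Load the Analysis ToolPak (included with Excel) by clicking the Developer tab > Add-Ins and checking Analysis ToolPak in the Add-Ins available list. (If the Developer tab is not shown, go to File > Options > Add-Ins, choose Excel Add-ins in the Manage box, and click Go.) After loading, go to the Data tab > click Data Analysis, then: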
    • Explore Histogram, Descriptive Statistics, Correlation and others. See the online Microsoft help for details about using the Analysis ToolPak.
    • Experiment with variable pairs and graph customization, taking note of differences in shapes, trend lines, etc.
  1. Give students the remaining time to continue to search for and import data. Remind them to continually save their progress!
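
For classes that want to see what lies behind the trend line and R-squared value from the scatter plot demonstration, the following is a minimal Python sketch, assuming an ordinary least-squares straight-line fit. The function name linear_fit is hypothetical, and the data values are invented for illustration; they are not taken from the ESPN Example file.

```python
def linear_fit(xs, ys):
    """Return (slope, intercept, r_squared) for a least-squares straight line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of squares and cross-products around the means
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # For a straight-line fit, R-squared is the squared correlation coefficient
    r_squared = (sxy * sxy) / (sxx * syy)
    return slope, intercept, r_squared


# Invented example values (not real player statistics)
batting_avg = [0.290, 0.301, 0.312, 0.318, 0.325, 0.340]
ground_balls = [210, 150, 260, 180, 240, 170]

slope, intercept, r2 = linear_fit(batting_avg, ground_balls)
print(f"y = {slope:.1f}x + {intercept:.1f}, R-squared = {r2:.4f}")
```

An R-squared value near 0 from such a calculation would be read the same way as in the Excel chart: the straight line explains very little of the variation in the data.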

Day 5: Students Research and Analyze Their Data

  1. Oversee student pairs as they work on creating scatter plots, trend lines, and R² values for columns of the data they collected in Days 3 and 4. Expect them to also find descriptive statistics via the Analysis ToolPak. Circulate through groups to assist with suggestions for searches and using Excel.
    • Emphasize the project goal—to find relevant variables’ effect on the data in order to make evidence-driven planning, production and marketing decisions.
    • Remind students that the results may not reinforce their opinions and expectations. Advise them to look for creative responses and solutions. Outlandish ideas may be considered, as long as students weigh (and state) the risk of loss and damage and conduct further research when possible. For example, the low correlation between ground balls and batting average may suggest that batting average is more attributable to fly balls and line drives. Or, perhaps the best hitters hit ground balls as often for outs as they do for hits. These conjectures can lead to more research and discovery; however, each conjecture by itself is weak and might not be good advice without testing it against data. Alert students not to confuse correlation with causation.
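
Student pairs who want to cross-check the descriptive statistics and histogram counts they obtain in Excel could use a short script like the one below. It is a minimal sketch using only the Python standard library; the function names describe and histogram are hypothetical, the data values are invented, and the output is not guaranteed to match the layout of any Excel report.

```python
import statistics
from collections import Counter


def describe(values):
    """Return a few common summary statistics for a list of numbers."""
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),  # sample standard deviation
        "min": min(values),
        "max": max(values),
    }


def histogram(values, bin_width):
    """Count the values in each bin; each key is the lower edge of a bin."""
    counts = Counter((v // bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))


# Invented example values (not real data)
ground_balls = [210, 150, 260, 180, 240, 170, 195, 225]

summary = describe(ground_balls)
bins = histogram(ground_balls, 50)

for name, value in summary.items():
    print(f"{name}: {value}")
for lower_edge, count in bins.items():
    print(f"{lower_edge} to {lower_edge + 49}: {count}")
```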

Day 6: Create Presentations of Analysis

  1. Direct the teams to create electronic presentations that summarize their research. Suggested outline:

Page 1: Introduction with topic; include images, overview, why did you choose this topic?

Page 2: Related variables with explanations; why were they chosen? How did they come up?

Page 3: Graphs and relations with variables; explain patterns and/or lack of patterns, trends, inferences

Page 4: Difficulties; where would you recommend more emphasis? What was lacking? What do you wish you had located or discovered? What goals were not supported?

Page 5: Recommendations based on research; state the degree of risk of the proposal, explain why you recommend it and what benefits may result

Days 7-8: Students Finish Work and Give Presentations

  1. Have student pairs fine tune their presentations to tell a cohesive story that includes pertinent research, graphs, interpretations and conclusions. Team by team, review their progress and give them feedback.
  2. Have student pairs each present a summary of their work to the rest of the class. Provide assessment in accordance with the suggested outline and rubric criteria.

Attachments

Safety Issues

  • Advise students to beware of the potential for false and misleading data from websites that lack credibility and authenticity.
  • Oversee students to ensure safe Internet use, which may require filtering out objectionable websites and domains. 
  • Remind students to obtain permission to use data that is not made publicly available.
  • Monitor students’ computer and resource use so they stay on task with the project and do not get distracted by social media or other diversions.

Troubleshooting Tips

It is helpful if the teacher is well versed in what students will encounter. It helps to:

  • Practice using Excel (or equivalent) to import data from websites or other sources.
  • Practice cleaning, sorting, analyzing and graphing data.
  • Practice searching terms to help students with suggestions when searching for relevant data.

Investigating Questions

  • What part of this activity did you find the most challenging?
  • How might industry research and analyze big data differently?
  • Should (raw) data be free? How does privacy affect your opinion?
  • Do you think your product would be successful on the market? How much money would it cost to start your business?
  • What is the government’s role in the use of big data? (Possibly issues relating to privacy and security.)

Assessment

Pre-Activity Assessment

Discussion: As a class, explore students’ base knowledge of the concept of big data. Talk about exponential growth and its effects on data acquisition and analysis. Mention the existence of many and varied large data sources, such as cell phones, Internet shopping and browsing, Facebook and other social media, virtual artificial intelligence personal/voice assistants (such as Alexa, Siri, Google Assistant), Internet of Things (wireless, networked devices that collect data), electronic medical records, electronic pre-college exam results, census data, election results, income tax information, other government data and indicators, wearable technology activity trackers, etc.

Activity Embedded Assessment

Research: Have students research topics related to terms introduced on Day 1. As part of their overall project scores, include their daily results, discussion, participation and effort. Note students’ progress in searching, obtaining, cleaning, graphing, analyzing and deriving implications from their data gathering and manipulation work. It may help to create a checklist with group names for logging daily progress, which can be recorded and assessed by the teacher in collaboration with the student pairs. Also record and assess students’ research, focus and collaboration as they work toward retrieving data, analyzing it, and packaging it for final presentations.

Post-Activity Assessment

Final Project: Assess the final project presentation using the criteria provided in the Big Data Presentation Rubric. Overall, the rubric’s main components are:

  • Terms Research
    • Precise description
    • Citation of a variety of search sources
    • Analysis of a variety of related factors
  • Data Analysis
    • Comments on scatter plots and correlations
    • Comments on other types of graphs
    • Comments on summary descriptive statistics
  • Results Presentation
    • Neat, creative, interesting, visually appealing
    • Explains how data was obtained, citing sources
    • Explains how data was analyzed, including successes and difficulties
    • Logical graph interpretation
    • Describes findings and possible implications
  • Overall Effort, Participation and Enthusiasm
    • Discussion contributions / attentiveness
    • Focused search, analysis, summary

Activity Extensions

Have students with computer programming ability learn to use Python packages such as NetworkX to analyze relationships among variables.
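
As a starting point for this extension, the sketch below treats each variable as a node and connects two variables when their correlation is strong. It uses only the Python standard library so that it runs without installing anything; the resulting list of (variable, variable, weight) edges is the kind of structure a graph package such as NetworkX could then take as weighted edges. The function names, the 0.8 threshold and all data values are assumptions made for illustration.

```python
from itertools import combinations
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def correlation_edges(columns, threshold):
    """Return (name_a, name_b, r) for each pair of columns with |r| >= threshold."""
    edges = []
    for a, b in combinations(columns, 2):
        r = pearson_r(columns[a], columns[b])
        if abs(r) >= threshold:
            edges.append((a, b, round(r, 3)))
    return edges


# Invented columns for a hypothetical product (not real market data)
columns = {
    "price": [1, 2, 3, 4, 5],
    "units_sold": [10, 8, 6, 4, 2],
    "ad_spend": [2, 4, 6, 8, 10],
    "temperature": [3, 1, 4, 1, 5],
}

edges = correlation_edges(columns, 0.8)
for a, b, r in edges:
    print(f"{a} -- {b}: r = {r}")
```

In this invented dataset, temperature is only weakly correlated with the other columns, so it ends up with no edges. As in the main activity, a connection in the graph indicates correlation only, not causation.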

Have students learn about the features of Eureqa Desktop and its use of “machine learning to unravel the intrinsic relationships in data and explain them as simple math.”

Activity Scaling

  • Depending on student progress, adjust (lengthen or shorten) the time window for each teacher-presented demonstration and practice of new material.
  • Consider assigning student teams each a product that is pre-screened by the instructor in order to eliminate the selection phase and give the instructor more foresight in helping students with searches and analyses.
  • For students with strong backgrounds in Excel, have them conduct the entire activity, or the Day 2 and/or Day 4 teacher-led demonstration portions, on their own using the Student Self-Guided Worksheet to Working with Data in Excel.
  • For more advanced students, direct them to use APIs from websites or search for data from a larger pool of potential resources. This may require them to clean the data before importing it.
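
To give advanced students a sense of what cleaning data before importing it can involve, here is a minimal Python sketch using the standard csv module. The raw text, column names and cleaning rules (strip whitespace, remove thousands separators and dollar signs, drop blank rows and rows with missing numbers) are all invented for illustration; real downloads will need rules suited to their own quirks.

```python
import csv
import io

# Invented messy data, similar to what a web download might contain
RAW = """Region, Units Sold ,Price
North," 1,200 ",$4.50
South,N/A,$5.00
East,950,$4.75
,,
West,"1,010",n/a
"""


def to_number(text):
    """Convert text such as ' 1,200 ' or '$4.50' to a float, or None if unusable."""
    cleaned = text.strip().replace(",", "").replace("$", "")
    try:
        return float(cleaned)
    except ValueError:
        return None


def clean_rows(raw_text):
    """Return (header, rows), keeping only rows with usable numeric values."""
    reader = csv.reader(io.StringIO(raw_text))
    header = [name.strip() for name in next(reader)]
    rows = []
    for record in reader:
        if not any(cell.strip() for cell in record):
            continue  # skip blank rows
        units = to_number(record[1])
        price = to_number(record[2])
        if units is None or price is None:
            continue  # drop rows with missing numeric values
        rows.append({"region": record[0].strip(), "units": units, "price": price})
    return header, rows


header, rows = clean_rows(RAW)
print(header)
for row in rows:
    print(row)
```

Dropping incomplete rows is only one possible choice; students should state in their presentations which rows they removed and why, since that decision can change the resulting statistics.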

Additional Multimedia Support

Resources about big data for background and class presentation:

  • Video: TED Talk: Big Data is Better Data (16 minutes) by Kenneth Cukier, data editor of The Economist: https://www.ted.com/talks/kenneth_cukier_big_data_is_better_data
  • Creative Corporate and Marketing Communication > What is big data and where is it coming from? (good infographic) http://mariajose-ccmc.weebly.com/pbl-vii/march-31st-2015
  • What Happens in an Internet Minute in 2016? (good infographic) http://www.visualcapitalist.com/what-happens-internet-minute-2016/
  • Presentation on Google (21 slides): https://www.slideshare.net/SardarDnay/google-presentation-61595802
  • Is Facebook Becoming the Internet? http://www.trustedreviews.com/opinions/is-facebook-becoming-the-internet
  • Twitter (Probably) Isn’t Dying, But Is It Becoming Less Sociable? http://mappingonlinepublics.net/2015/11/11/twitter-probably-isnt-dying-but-is-it-becoming-less-sociable/
  • Saxon Global, Fast-Growing BI, Big Data, Cloud Service Provider: http://www.kdnuggets.com/2014/05/saxon-global-fast-growing-bi-big-data-cloud-service-provider.html

Additional helpful resources for this activity:

  • Video: Automated Data Scraping from Websites into Excel (12:42 minutes): https://www.youtube.com/watch?v=qbOdUaf4yfI
  • 7-Zip extraction software freeware: http://www.7-zip.org/
  • Data.gov datasets: https://catalog.data.gov/dataset
  • Datahub: https://datahub.io/
  • Encyclopedia.com: http://www.encyclopedia.com/
  • Internet Public Library: http://www.ipl.org/
  • iTools: http://itools.com/
  • Knoema (download links for Excel add-ons at bottom of page) https://knoema.com/datafinder
  • Konect: http://konect.uni-koblenz.de/
  • Lifewire: https://www.lifewire.com
  • Reddit: https://www.reddit.com/
  • Refdesk (fact checker for the internet): http://www.refdesk.com/
  • Reference.com: https://www.reference.com/
  • Wolfram Alpha Pro: http://www.wolframalpha.com/?source=nav
  • Wolfram Mathematica Student Edition (free 15-day trial; perpetual educational licenses < $100 per computer): https://www.wolfram.com/mathematica/trial/

References

Big Data & Analytics Hub. IBM. Accessed April 2017. (Source of big data statistics) http://www.ibmbigdatahub.com

Forbes. Accessed April 2017. (Source of big data statistics) http://www.forbes.com

TechCrunch. Accessed April 2017. (Source of statistics) https://techcrunch.com/

Contributors

Tom Falcone

Copyright

© 2017 by Regents of the University of Colorado; original © 2016 University of Notre Dame

Supporting Program

Computing Research Experience for Teachers Program, University of Notre Dame

Acknowledgements

This work was supported by the Computing RET Program at the University of Notre Dame, which was funded by National Science Foundation grant no. CNS 1609394—RET Site: Physically and Biologically Inspired Computational Models and Systems. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.

A special thanks to Michael Niemier, Timothy Weninger, Corey Pennycuff and Sal Aguinagas.

Last modified: July 20, 2017
