Data Cleaning

Module Summary

Data cleaning (also "data cleansing" or "data scrubbing") was something that used to be in the realm of data scientists and analysts. Back in the day, most data was structured, pre-labeled in particular cells in data tables (or arrays), and of particular predefined types.

Today, much more has been datafied, and the data is now often unstructured: texts and text corpora, imagery, audio files, video files, games, simulations, and all sorts of analog artifacts as well. The work of studying raw data, though, has become much more common with the widespread proliferation of research tools (such as online survey tools), learning management systems (often with built-in data capture and analytics), social media platforms (with data-based application programming interfaces), applications to collect website usage statistics, and other features of the data landscape.


[Image: "Data cleansing" article network on Wikipedia (one degree)]


Takeaways

Learners will...

  • consider what data cleaning is and why it is often necessary for research projects
  • explore some common types of computationally-enabled data cleaning used in research today
  • explore some common methods of manually-enabled data cleaning used in research today
  • review where data cleaning fits into a typical research sequence
  • consider why data cleaning differs among different types of research projects (and whether structured or unstructured data is in use)
  • gain a sense of why it is important to understand the types of data that will be captured in the research and how to handle that data (during the early research design phase)

Module Pretest

1. What is data cleaning? Why is it often necessary for research projects?

2. What are some common types of computationally-enabled data cleaning used in research today? What are some common types of manually-enabled data cleaning methods in research today?

3. Where does data cleaning fit into a typical research sequence? Why?

4. Why does data cleaning differ among different types of research projects (and depending on whether the data used is structured or unstructured)?

5. Why is it important to understand the types of data that will be captured using different research methods and technologies early on, during the research design phase?

Main Contents

1. What is data cleaning? Why is it often necessary for research projects?

Data cleaning involves the various steps or processes taken to prepare data (structured or unstructured) for analysis.

Data cleaning / data preparation is necessary because raw data is not usually in a sufficiently de-noised form for analysis. A formal analysis of research data involves various ways of processing that information, but if noise is present in the original raw data, that noise will carry through to the outcomes and harm the clarity of the findings from the research. (In some cases, data sets may have coded data fields, and these may have to be re-coded in a human-readable way before use.)


2. What are some common types of computationally-enabled data cleaning used in research today? What are some common types of manually-enabled data cleaning methods in research today?

Generally speaking, data cleaning may include the following:

De-duplication: Multiple copies of a record, file, document, or image are removed from the data set. There is no point in counting the same work multiple times. If text analyses are run, repeated information will skew the counts. If there are multiple copies of a file, those will also be an annoyance to the researcher who is coding the data.
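As a minimal sketch (not a prescribed method), de-duplication might look like the following in Python, assuming the records have already been loaded into a pandas DataFrame with hypothetical respondent_id and response fields:

```python
import pandas as pd

# Hypothetical survey export with one duplicated record.
records = pd.DataFrame({
    "respondent_id": [101, 102, 102, 103],
    "response": ["agree", "disagree", "disagree", "neutral"],
})

# Drop exact duplicate rows, keeping the first occurrence,
# so repeated records do not skew later counts.
deduplicated = records.drop_duplicates(keep="first")
print(deduplicated)
```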

Data reformatting / formatting: If data is not in a form that is easily usable, then it has to be further cleaned. For example, if a data set has one data field containing first and last names, then it will be important to split that field into two fields: one for first names and one for last names. Or say that a web-based dataset contains multiple languages; if the full set is to be analyzed, then all the messaging may have to be translated into a base language, so that the text set captures all expressed ideas (regardless of the original language). Or, if there are image sets to be analyzed but the new imagery to be integrated is in an awkward format, then those images may have to be batch-converted so that they can be used with the other image files.
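A minimal sketch of the name-splitting example above, assuming a pandas DataFrame with a single hypothetical full_name field:

```python
import pandas as pd

# Hypothetical field that combines first and last names.
df = pd.DataFrame({"full_name": ["Ada Lovelace", "Alan Turing"]})

# Split on the first space into two separate fields.
df[["first_name", "last_name"]] = df["full_name"].str.split(" ", n=1, expand=True)

# The combined field can then be dropped if it is no longer needed.
df = df.drop(columns=["full_name"])
print(df)
```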

File naming protocols: Another data cleaning step may involve the naming or re-naming of files for easy archival, findability, and curation. Oftentimes, research is iterative and involves multiple processes, which means files are accessed multiple times and in multiple ways for different types of insights. Access to the original dataset of contents is easier when it is clear where the respective files are and how they are labeled. (Metadata may need to be applied to the data objects and data sets, but this is not considered part of "data cleaning" per se.)
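As one illustration, under an assumed (hypothetical) protocol of project code, date, and sequence number, a folder of raw audio files might be batch-renamed like this:

```python
from pathlib import Path

# Hypothetical naming protocol: projectcode_date_sequence.extension
folder = Path("interview_audio")   # assumed folder of raw files

for i, path in enumerate(sorted(folder.glob("*.wav")), start=1):
    new_name = f"PROJ01_2024-01-15_{i:03d}{path.suffix}"
    path.rename(path.with_name(new_name))
    print(f"{path.name} -> {new_name}")   # record the old and new names
```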

Data clarification: In surveys with open-ended questions, textual responses are often "noisy" or unclear. For example, people will misspell words, use incorrect words, refer to inaccurate units of measure, and otherwise offer ambiguous responses. In many cases, it is important to clarify what the respondent may have meant by removing the noise. (Properly designed surveys will use closed-ended answers in order to head off some of the noise of open-ended questions. Or these survey questions will combine closed-ended selections with optional short-text responses.) When "de-noising" data, it is important not to introduce error in the process.
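A very small sketch of this kind of clarification, assuming a hypothetical set of free-text responses where known misspellings and unit variants are mapped to standard forms (each correction should be documented so that no new error is introduced):

```python
import pandas as pd

# Hypothetical free-text survey responses.
responses = pd.Series(["20 mins", "twnety minutes", "20 minutes"])

# Known variants / misspellings mapped to a standard form.
corrections = {
    "mins": "minutes",
    "twnety": "twenty",
}

def clarify(text: str) -> str:
    words = [corrections.get(w, w) for w in text.split()]
    return " ".join(words)

print(responses.apply(clarify))
```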

Data transcoding: Sometimes, researchers may need to ensure that all data of a type is in a particular format. For example, in a text set, all numbers may need to be in numerical (vs. word) format for processing. Or, in a mapping project, all locations may need to be defined as longitudes and latitudes. In those cases, the data has to be sufficiently processed so that when automated (and manual) processes are applied, the data may be accessed in a machine- and human-usable way.
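A minimal sketch of one such transcoding step, assuming a hypothetical text field where spelled-out numbers need to be in numeric form:

```python
# Hypothetical mapping from number words to numerals for a text set.
WORD_TO_DIGIT = {
    "one": "1", "two": "2", "three": "3", "four": "4", "five": "5",
    "six": "6", "seven": "7", "eight": "8", "nine": "9", "ten": "10",
}

def transcode_numbers(text: str) -> str:
    # Replace each spelled-out number word with its numeral form.
    return " ".join(WORD_TO_DIGIT.get(word.lower(), word) for word in text.split())

print(transcode_numbers("three respondents replied within ten days"))
# -> "3 respondents replied within 10 days"
```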

De-identification of respondent information: Oftentimes, when there are ways to identify or re-identify participants in research, their identities are masked by using stand-in codes for their actual names. Because individuals can often be re-identified from only a few data points, researchers have to work hard to effectively mask people's identities and to defend against potential re-identification or any potential harm from having participated. (The de-identification of research participants should not lose relevant demographic data that may inform the research, though.)
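A small sketch of this kind of masking, assuming a hypothetical participant table where names are replaced with stable stand-in codes while demographic fields are retained:

```python
import pandas as pd

# Hypothetical participant records.
participants = pd.DataFrame({
    "name": ["Maria Garcia", "John Smith", "Maria Garcia"],
    "age_band": ["25-34", "35-44", "25-34"],
})

# Assign one stable code per unique name, then drop the name itself.
codes = {name: f"P{idx:03d}"
         for idx, name in enumerate(participants["name"].unique(), start=1)}
participants["participant_code"] = participants["name"].map(codes)
participants = participants.drop(columns=["name"])

print(participants)   # the demographic field (age_band) is retained
```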

Removal of irrelevant information: Another common data cleaning step is to remove information that is irrelevant to the research.
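For instance, assuming a hypothetical survey export with administrative columns that play no part in the analysis, irrelevant fields and empty rows might be filtered out like this:

```python
import pandas as pd

# Hypothetical export: only respondent_id and response matter for the analysis.
df = pd.DataFrame({
    "respondent_id": [1, 2, 3],
    "response": ["agree", "", "disagree"],
    "browser_version": ["99.0", "98.1", "97.5"],   # irrelevant to the research question
})

df = df.drop(columns=["browser_version"])   # drop an irrelevant column
df = df[df["response"] != ""]                # drop empty responses
print(df)
```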

Of course, most researchers keep pristine master versions of the raw files *before* any data is cleaned, in order to head off the potential risks of damaging the original raw files or datasets. The order in which data is cleaned may affect the quality of the cleaned data, so it helps to think through what is going on in each process, so that problems are not introduced to the process.

Segmenting / sub-segmenting data: Another common data cleaning step is to break up a dataset into smaller datasets, which may then be analyzed individually.
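A brief sketch of segmenting, assuming a hypothetical dataset with a grouping variable (here, a region column) used to break the data into smaller sub-datasets:

```python
import pandas as pd

# Hypothetical dataset to be segmented by region.
df = pd.DataFrame({
    "region": ["north", "south", "north", "south"],
    "score": [3, 5, 4, 2],
})

# One smaller DataFrame per region, each of which can be analyzed on its own.
segments = {region: group for region, group in df.groupby("region")}
print(segments["north"])
```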


3. Where does data cleaning fit into a typical research sequence? Why?

One common sequence for more traditional research would be the following:

Inspiration / Conceptualization -> Review of the Literature -> Research Design -> Data Collection -> Data Cleaning -> Data Curation -> Analysis -> Write-up -> Peer Review -> Presentation / Publication

Another type of research involves starting with the data and extracting insights from it:

Data Collection -> Data Exploration -> Data Cleaning -> Data Curation (and storage and management) -> Analysis (Computational, Manual, and others) -> Write-up -> Peer Review -> Presentation / Publication


4. Why does data cleaning differ among different types of research projects (and depending on whether the data used is structured or unstructured)?

Because data cleaning is about de-noising data and preparing it for analysis, what needs to be done depends in part on the type of data collected, the quality of that collected data, and what the researcher wants to know from that data. It is not uncommon for researchers to realize that they need to go back and clean the data further after they have started the analyses. Various methods of data cleaning may be applied at any point before or during the data analysis process.


5. Why is it important to understand the types of data that will be captured using different research methods and technologies early on, during the research design phase?

Generally, it is important to capture information in the most accurate and least noisy way possible. To those ends, it helps to design research and research instruments (like surveys) so that they do not create a "make work" component later on.

Also, there is a need to document the planning, decision-making, and work around research methods, data cleaning, and data analysis, so these may be reported out in the Research Methods section of the paper or presentation.

Examples

The following examples are necessarily general.

An extracted Tweetstream dataset from the Twitter microblogging site. Depending on how Twitter Tweetstream data is collected from the Twitter API (application programming interface), the data will vary. Some software programs enable the capture of Tweetstreams based on individual accounts. Others capture data around certain target (or seeding) keywords or #hashtags. There are also ways to extract messaging based on geolocational information. In most if not all cases, the extracted data will be in cells of a data table. Some types of data cleaning may involve the removal of retweets (de-duplication), the removal of foreign language information, the extraction of thumbnail imagery, the removal of URLs, or other types of data cleaning. (Removed information may be collected and used for other types of analyses.) If there is a need to anonymize individuals, that will be an important step as well.
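A generic sketch of a couple of these steps, assuming the extracted Tweetstream has already been loaded into a table with hypothetical user and text columns (actual field names will depend on the extraction tool used):

```python
import pandas as pd

# Hypothetical extracted Tweetstream table.
tweets = pd.DataFrame({
    "user": ["@a", "@b", "@c"],
    "text": [
        "Data cleaning matters https://example.com/post",
        "RT @a: Data cleaning matters https://example.com/post",
        "Survey design reduces cleanup work",
    ],
})

# Remove retweets (a simple heuristic based on the "RT " prefix).
tweets = tweets[~tweets["text"].str.startswith("RT ")]

# Strip URLs from the remaining messages.
tweets["text"] = tweets["text"].str.replace(r"https?://\S+", "", regex=True).str.strip()

print(tweets)
```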

An unstructured data set from multiple online sources. "Unstructured data" refers to data that is not pre-labeled (pre-identified) in data tables. This data may be digital or analog. Those that are digital may be of varying data types (text, slideshow, audio, video, and others). Cleaning such unstructured data may include file renaming, file transcoding, text / string data transcoding, and other approaches.

A digitized set of analog contents from multiple sources. Digitized files created from analog objects may be of various types. There may be photographs, audio recordings, video recordings, scans, and so on. Essentially, the original object is captured in digital format, so that the related information is easy to handle and analyze through computational and manual means. The data cleaning here may include oversight of the digital curation and of the metadata applied to the objects. It will likely also include the various types of data cleaning applied to unstructured data.

How To

[Image: Data-related tags network on Flickr (1.5 degrees)]


To broadly generalize, it is a good idea to keep a pristine master copy of the data sources before any process is applied to clean that data. This protects the integrity of the original dataset, and it also makes it possible to ask other questions of the data in the future, questions which may call for other ways of cleaning and querying. In every context, there will be a need to protect the original data set in its original format.

Also, it is a good idea to document the processes as clearly as possible. Any data that has been cleaned is changed, which means that the potential findings from that data will be affected.
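As one possible pattern (with hypothetical folder and file names), the raw data can be copied once to a protected master location before any cleaning, and each cleaning step appended to a simple log that can later feed the Research Methods write-up:

```python
import shutil
from datetime import date
from pathlib import Path

raw = Path("data_raw")          # hypothetical folder of original files
master = Path("data_master")    # pristine copy, never edited afterward

# Copy the raw data once, before any cleaning is applied.
if not master.exists():
    shutil.copytree(raw, master)

# Append each cleaning step to a log so it can be reported later.
with open("cleaning_log.txt", "a", encoding="utf-8") as log:
    log.write(f"{date.today()}: removed duplicate records from survey export\n")
```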

Beyond those initial points, how to clean data depends on a range of factors:

  • the research design;
  • the types of data;
  • the available technologies;
  • the training of the researcher;
  • research practices in the domain field;
  • the various research needs, and other factors

Possible Pitfalls

Data cleaning may change the core data if it is not done correctly. There may be a loss of information from the original data. Improper cleaning may introduce errors.

In other words, data cleaning cannot be an unthinking process. The sequence of the cleaning has a clear effect on the output data, so the sequencing has to be thought through as well. Each data cleaning step affects the data that is then passed on to the next step for processing.

Module Post-Test

1. What is data cleaning? Why is it often necessary for research projects?

2. What are some common types of computationally-enabled data cleaning used in research today? What are some common types of manually-enabled data cleaning methods in research today?

3. Where does data cleaning fit into a typical research sequence? Why?

4. Why does data cleaning differ among different types of research projects (and depending on whether the data used is structured or unstructured)?

5. Why is it important to understand the types of data that will be captured using different research methods and technologies early on, during the research design phase?

Extra Resources

Data cleansing. (2016). Wikipedia. https://en.wikipedia.org/wiki/Data_cleansing