Multimodal Data Handling

From E-Learning Faculty Modules


Module Summary

Multimodal digital data—from social media platforms, online research suites (survey tools), digital libraries, digital repositories, text corpora, and others—are commonly used in research. While these multimodal data sets can be highly useful for research insights, it is important to understand the following:

  • some legal considerations in the acquisition and use of multimodal data
  • the role of metadata
  • the importance of maintaining a pristine raw digital data set
  • the importance of digital data preservation and future proofing
  • ways to handle multimodal digital data with finesse to extend knowability and accuracy

This module addresses the issues above and other related ones.


Takeaways

Learners will...

  • consider what multimodal digital data are, where such data may be from, and ways these are used in research today
  • review legal considerations in the acquisition and use of multimodal digital data (in research) and some basic principles of data ownership
  • describe what metadata is in relation to digital multimodal data and why metadata is important in a research context
  • explain the importance of keeping a pristine set of raw digital data and ways to digitally preserve multimodal data
  • review methods for handling multimodal digital data both to extend what is knowable and to preserve the original meaning as accurately as possible


Module Pretest

1. What is multimodal digital data? Where is multimodal data from? Why is the provenance of digital data important in a research context? What are some ways that multimodal data is used in research today?

2. What sorts of legal considerations are there in terms of acquiring and using multimodal data for research? What are some basic principles of data ownership?

3. What is metadata in relation to digital multimodal data? Why is metadata important in a research context?

4. Why is it important to keep a pristine set of raw digital data? What is “future proofing” or “digital preservation” of multimodal data?

5. What is “multimodal data handling”? What are some common ways that digital data is “handled” in a research context? How are digital translations captured from documents? What are automatic voice-to-text transcriptions from videos? What are some ways to handle multimodal data with finesse? What are ways to learn from multimodal digital data? What are some risky ways to handle multimodal data?


Main Contents

1. What is multimodal digital data? Where is multimodal data from? Why is the provenance of digital data important in a research context? What are some ways that multimodal data is used in research today?

Digital data comes in many forms: texts, still imagery, audio, video, simulations, interactive data visualizations, and other objects. Multimodal data combines two or more of these modalities. For example, an image may be overlaid with text or audio, or a video may be combined with a simulation.

Multimodal data may be captured from online surveys and online research suites. They may be captured from social media platforms, many of which enable digital media sharing. The multimodal data may be created “manually” by people using technologies, or generated automatically, with software collating multimedia elements into a coherent whole. Multimodal data may also come from sensors that capture information from an environment, whether virtual or physical.

The provenance of digital data is important to know in a research context because the origins of data must be cited accurately. Also, because so many people repurpose others’ works online, it is important to conduct reverse image searches and other due diligence to try to identify the actual originator of a work.

Multimodal data is used in research in a number of ways: to capture people’s attitudes and opinions, to gauge the effects of branding, marketing, and public health outreach, and to support other endeavors. Social media data, for example, can be used to ask and answer more specific research questions.


2. What sorts of legal considerations are there in terms of acquiring and using multimodal data for research? What are some basic principles of data ownership?

With the handling of data, there are a number of considerations:

First, it is important to respect people’s privacy rights, particularly when working with information from social media platforms, purpose-built data sets, and similar sources. Personally identifiable information (PII) should be protected and not leaked.

If anonymity is promised, the technical tools used to capture the data should actually protect anonymity (no one is identified). If confidentiality is promised, the follow-through on those protections should be achieved to the letter. Data should not be leaked.

It is important to avoid defamation in the presentation of human data. All information about people should be as accurate as possible.

Intellectual property rights should be protected. If a work is copyrighted, it should not be used without the express written permission of whoever owns the copyright. The data should not be used in ways that the rights holder has not authorized.

If people have contributed to a work, they should be given proper and accurate crediting.

If work from a social media platform is used, it is important to read the fine print to understand what data is available and to describe those limits accurately. For example, what is the context of the data collection? What is hidden or encapsulated? If the data is textual data, is it non-consumptive (with no access to the underlying data)? Whatever the context, this should be documented accurately.

There are likely other considerations as well. Basically, researchers have to abide by the laws of the various spaces in which they work, as well as the applicable policies and their professional ethics and values.


3. What is metadata in relation to digital multimodal data? Why is metadata important in a research context?

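Metadata is data about data: for a multimodal file, it may record when, where, how, and by whom the file was created. Metadata is sometimes generated automatically, as with the EXIF data that cameras embed in image files.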

Metadata may be created purposefully by people to record relevant information related to multimodal data.

Metadata is important in a research context because it enriches the main research data: it can inform researchers’ understanding of the data and can help direct the research itself.
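As an illustration of purposefully created metadata, the following minimal sketch writes a small JSON “sidecar” record next to a data file. The function name `write_sidecar`, the `.meta.json` suffix, and the field names are assumptions made for this example, not a standard; a real project would follow whatever metadata schema its discipline or repository requires.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def write_sidecar(data_file, source, collector, notes=""):
    """Write a JSON metadata record next to a data file and return its path."""
    data_path = Path(data_file)
    record = {
        "file_name": data_path.name,
        "size_bytes": data_path.stat().st_size,
        # Time this record was written, in UTC (ISO 8601)
        "recorded_at": datetime.now(timezone.utc).isoformat(),
        "source": source,        # where the data came from (provenance)
        "collector": collector,  # who captured it
        "notes": notes,          # context, known limits, consent terms
    }
    # Hypothetical naming convention: photo.jpg -> photo.jpg.meta.json
    sidecar = data_path.with_name(data_path.name + ".meta.json")
    sidecar.write_text(json.dumps(record, indent=2), encoding="utf-8")
    return sidecar
```

Keeping the record in a separate plain-text file means the original data file is never modified, and the metadata stays readable without special software.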


4. Why is it important to keep a pristine set of raw digital data? What is “future proofing” or “digital preservation” of multimodal data?

Researchers keep a pristine set of raw digital data so they can go back to that set if the data is accidentally mishandled during cleaning or management. If errors are somehow introduced, the reserve set can be copied and the process started again, free of any accidental taint.
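One way to put this into practice is sketched below: record a checksum for every raw file, do all cleaning on a copy, and later confirm that the raw set is unchanged. The function names and folder layout are hypothetical, and the sketch assumes the working folder is not located inside the raw folder.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def make_working_copy(raw_dir, work_dir):
    """Copy raw files to a working folder; return {relative name: checksum}."""
    raw_dir, work_dir = Path(raw_dir), Path(work_dir)
    manifest = {}
    for raw_file in sorted(raw_dir.rglob("*")):
        if raw_file.is_file():
            relative = raw_file.relative_to(raw_dir)
            manifest[relative.as_posix()] = sha256_of(raw_file)
            target = work_dir / relative
            target.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(raw_file, target)  # copy2 also keeps timestamps
    return manifest


def changed_files(raw_dir, manifest):
    """List raw files that are missing or no longer match the manifest."""
    raw_dir = Path(raw_dir)
    changed = []
    for name, expected in manifest.items():
        path = raw_dir / name
        if not path.is_file() or sha256_of(path) != expected:
            changed.append(name)
    return changed
```

If `changed_files` returns an empty list, the raw set still matches the recorded checksums and a fresh working copy can safely be made from it.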

“Future proofing” generally refers to building objects that will remain usable into the future. “Digital preservation” refers to efforts to convert digital data to simple, non-proprietary, open formats, so that unnecessary dependencies (such as proprietary or possibly defunct software) are eliminated. For example, a Microsoft Word file can be saved as a .txt or .rtf file for increased accessibility into the future.
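The Word-to-text example can also be scripted. The sketch below is a rough illustration only: it relies on the fact that a .docx file is a ZIP archive whose main text lives in `word/document.xml`, and it pulls out the paragraph text using only the Python standard library. The function name `docx_to_text` is hypothetical. Formatting, images, headers, footers, and footnotes are not carried over, and legacy .doc files are not supported, so the output should be checked against the original.

```python
import xml.etree.ElementTree as ET
import zipfile

# XML namespace used by WordprocessingML elements such as <w:p> and <w:t>
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_to_text(docx_path):
    """Return the paragraph text of a .docx file, one paragraph per line."""
    with zipfile.ZipFile(docx_path) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for paragraph in root.iter(W_NS + "p"):
        # A paragraph's text is split across one or more <w:t> elements
        pieces = [node.text or "" for node in paragraph.iter(W_NS + "t")]
        paragraphs.append("".join(pieces))
    return "\n".join(paragraphs)
```

The returned string can then be written to a UTF-8 .txt file and stored alongside (not in place of) the original document.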


5. What is “multimodal data handling”? What are some common ways that digital data is “handled” in a research context? How are digital translations captured from documents? What are automatic voice-to-text transcriptions from videos? What are some ways to handle multimodal data with finesse? What are ways to learn from multimodal digital data? What are some risky ways to handle multimodal data?

In general, “data handling” refers to any of the following processes or the full sequence: data capture, data storage, data labeling, data cleaning, data conversions, metadata extraction, data documentation, data transcoding, the handling of missing data, data translation, and other aspects.

As data is handled, errors may be introduced. For example, when video is compressed with a lossy codec, visual and audio information is lost. If digital imagery is transcoded, color information, other visual data, and metadata may be lost. “Big data” datasets require appropriate technologies so that no rows of data are lost, particularly during transfer, import, export, and processing. Also, digital data may become corrupted or inaccessible, so those handling it need backups to support their work.
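A simple safeguard against silent data loss is to compare record counts before and after a transfer or export. The sketch below assumes both files are CSV files with a single header row; the function names are hypothetical. It counts records rather than physical lines, so a quoted field that contains a line break still counts as one record.

```python
import csv


def count_rows(csv_path, encoding="utf-8"):
    """Count data records in a CSV file, excluding the header row."""
    with open(csv_path, newline="", encoding=encoding) as handle:
        reader = csv.reader(handle)
        next(reader, None)  # skip the header row, if present
        return sum(1 for _ in reader)


def rows_lost(source_csv, exported_csv):
    """Return how many records are missing from the export (0 means none)."""
    return count_rows(source_csv) - count_rows(exported_csv)
```

A matching count does not prove that the contents are identical, but a mismatch is a clear signal that something was dropped or duplicated along the way.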

Automatic translation and transcription are imperfect. As rough estimates, automatic digital translations are only about 60 to 70% accurate, and voice-to-text auto-transcriptions (from video or audio to text files) are only about 70 to 80% accurate; actual results vary with the tool, the language, and the recording quality. In other words, handing data to computers in unthinking ways can introduce loss or corruption, so it is important to have human oversight over these transcoding endeavors.
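Human oversight can be made concrete by spot-checking: a person corrects a sample of the automatic transcript, and the two versions are compared. The sketch below computes the word error rate (WER), a common measure defined as the word-level edit distance divided by the number of words in the human-corrected reference. The function name is hypothetical, and the comparison here simply lowercases and splits on whitespace, which is an assumption; punctuation handling would need to be decided per project.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")
    # previous[j]: edits to turn the reference words seen so far
    # into the first j hypothesis words
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)
```

For example, if the reference is “the quick brown fox jumps” and the automatic transcript reads “the quick brown box jumps”, one of five words is wrong and the function returns 0.2. Because inserted words also count as errors, WER can exceed 1.0, so it is not simply the complement of an accuracy percentage.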

If multimodal datasets are shared, it is important to document the limits of the data, so that other users do not draw stronger conclusions than the data can support. Shared datasets should be cleaned of metadata that may reveal personal identities or private information.
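As a minimal sketch of such cleaning, the function below removes identifying keys from a record before sharing, including keys nested inside lists or sub-records. The field names in `IDENTIFYING_FIELDS` are hypothetical and would need to be adapted to the dataset at hand. Removing named fields does not by itself guarantee anonymity, since free text or media content may still identify a person.

```python
# Hypothetical field names; adjust to the dataset being shared
IDENTIFYING_FIELDS = frozenset(
    {"username", "email", "gps", "device_id", "ip_address"}
)


def strip_identifying(record, fields=IDENTIFYING_FIELDS):
    """Return a copy of a (possibly nested) record without identifying keys."""
    if isinstance(record, dict):
        return {
            key: strip_identifying(value, fields)
            for key, value in record.items()
            if str(key).lower() not in fields  # case-insensitive key match
        }
    if isinstance(record, list):
        return [strip_identifying(item, fields) for item in record]
    return record  # plain values (text, numbers) are kept as they are
```

The function returns a new record and leaves its input untouched, which fits the practice of keeping the raw data pristine and sharing only a cleaned copy.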

It makes sense to handle multimodal digital data in thinking and purposeful ways.

Examples

Online, there are many examples of the “fine print” that describes and delimits the accessible data from social media platforms.

There are not that many public examples of multimodal datasets that have been annotated (with metadata), described, managed, and handled—in a research context for exploration. Some that are available are in proprietary software formats, which limit their access.

How To

There are many right ways to handle multimodal digital data. As per the main contents of this module, the following are important precepts:

  • Collect multimodal digital data as thoroughly as possible.
  • Be clear about the provenance of the data.
  • Use collected data legally, with respect for privacy rights and intellectual property, avoidance of defamation, and due diligence in proper crediting.
  • Cite sources accurately.
  • Use metadata accurately.
  • Handle data without introducing research errors or skews.
  • Clean data without introducing research errors or skews.
  • Create metadata and notes to accurately document the data.
  • Transcode files for digital preservation, as needed.

This list will evolve as laws, technologies, and human practices change.

The more specific “how-to’s” will vary based on the research context and the data that is accessed.

Possible Pitfalls

There are pitfalls in engaging digital data without understanding the various technologies and file types.

Module Post-Test

1. What is multimodal digital data? Where is multimodal data from? Why is the provenance of digital data important in a research context? What are some ways that multimodal data is used in research today?

2. What sorts of legal considerations are there in terms of acquiring and using multimodal data for research? What are some basic principles of data ownership?

3. What is metadata in relation to digital multimodal data? Why is metadata important in a research context?

4. Why is it important to keep a pristine set of raw digital data? What is “future proofing” or “digital preservation” of multimodal data?

5. What is “multimodal data handling”? What are some common ways that digital data is “handled” in a research context? How are digital translations captured from documents? What are automatic voice-to-text transcriptions from videos? What are some ways to handle multimodal data with finesse? What are ways to learn from multimodal digital data? What are some risky ways to handle multimodal data?


References

Extra Resources