Automated Theme Extraction

From E-Learning Faculty Modules


Module Summary

In NVivo 11 Plus (released in late September 2015), one of two new autocoding functions is the automated theme extraction tool. Theme extraction tools identify “themes” in text documents and text corpora in different ways. A tool may work from a set of semantic terms to look for (or, conversely, from a list of “stopwords” that are excluded from consideration as themes). It may apply a frequency threshold to indicate the “importance” or “influence” of certain terms in a text set (or across multiple text sets). Through various algorithms, base words are identified, and related words (for example, words that appear nearby within a “sliding window”) are then linked to those base words to give a sense of higher-level themes and related sub-themes. Unless the makers of a tool share how its themes are extracted, it is often unclear what is going on under the hood. Different software tools extract different themes from the same text set.
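The general ideas named above (stopwords, a frequency threshold, and a sliding window of nearby words) can be sketched in a few lines of Python. This is only an illustration of those generic ideas and is not how NVivo 11 Plus works internally, which is not documented here. The stopword list, the function names, and the parameters `min_count` and `window` are all assumptions made up for this sketch.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list; real tools ship much larger ones.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "for", "on", "with"}


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens that are not stopwords."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]


def extract_themes(docs, min_count=2, window=3):
    """Return {base_word: [related words]} using frequency and proximity.

    A base word is any token that appears at least `min_count` times overall.
    A related word is any other token found within `window` positions of it.
    """
    token_lists = [tokenize(d) for d in docs]
    counts = Counter(t for tokens in token_lists for t in tokens)
    base_words = {w for w, c in counts.items() if c >= min_count}

    related = {w: Counter() for w in base_words}
    for tokens in token_lists:
        for i, tok in enumerate(tokens):
            if tok not in base_words:
                continue
            start = max(0, i - window)
            stop = min(len(tokens), i + window + 1)
            for j in range(start, stop):
                if j != i and tokens[j] != tok:
                    related[tok][tokens[j]] += 1

    # Keep the three most frequent neighbors of each base word as "sub-themes".
    return {w: [n for n, _ in related[w].most_common(3)] for w in sorted(base_words)}


# Invented sample sentences, used only to exercise the sketch.
docs = [
    "Online learning supports flexible learning schedules.",
    "Discussion boards support online discussion among learners.",
]
themes = extract_themes(docs)
for base, subs in themes.items():
    print(base, "->", subs)
```

Changing `min_count` or `window` changes which base words and sub-themes come out, which is one reason different tools (and different settings) yield different themes from the same text set.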

Automated theme extraction has come to the fore because of the need to process and understand some of the copious amounts of text created on social media platforms and available from other sources (whether digitized from analog versions or born-digital). With the capability to “speed read” and summarize texts based on statistical measures, people can then apply slower “close reading” to the particular extracted texts that may be of more direct interest.

How is all this relevant to online learning (given the focus of the E-Learning Faculty Modules)? Those who teach massive open online courses (MOOCs) and other large online courses may use automated theme extraction to capture the gist of discussion board threads around certain topics, or of an individual’s communications (assuming that the individual communicates prolifically). Collections of academic readings may be machine skim-read in order to identify articles to assign to learners for close reading. Theme extraction may also be used for research applications. There are a number of other “use cases” for this feature in NVivo 11 Plus.

This module introduces the theme extraction autocoding tool in NVivo 11 Plus.


Takeaways

Learners will...

  • Learn about the automated “theme extraction” tool in NVivo 11 Plus and the types of data that it may be applied to
  • Acquire a sense of the process of autocoded theme extraction in NVivo 11 Plus based on a social media tweetstream
  • Explore ways the text may be explored after the initial theme extraction, including sentiment analysis and manual close reading analysis
  • Consider how to qualify assertions made from the autocoded theme extraction process
  • Conceptualize some ways in which automated theme extraction may be applied to online teaching and learning


Module Pretest

1. What is the automated “theme extraction” tool in NVivo 11 Plus? What sorts of data may this autocoding be applied to?

2. What is the general process of automated theme extraction using the NVivo 11 Plus tool?

3. What further explorations of the text may be done after the initial theme extraction? How does the “sentiment analysis” tool (also in NVivo 11 Plus) apply in enhancing the understanding of themes? What sorts of manual “close reading” analysis may be done?

4. What may be asserted using the automated theme extraction tool? What may not be asserted using the automated theme extraction tool? How may the theme extraction results be qualified?

5. What are some possible applications of automated theme extraction to online teaching and learning?


Main Contents

This section offers some brief insights about the theme extraction autocoding feature in NVivo 11 Plus.


1. What is the automated “theme extraction” tool in NVivo 11 Plus? What sorts of data may this autocoding be applied to?


The “theme extraction” autocoding tool in NVivo 11 Plus enables users to automatically extract concepts (semantic-based terms) from text documents and text sets. This tool captures both top-level themes and the subthemes that are linked semantically to that top-level theme.

Text corpora may be created from a variety of file types, including Portable Document Format files (.pdf), text files (.txt, .rtf), Word files (.doc, .docx), web-based files (.html, .xml), and others. Texts are also a nexus in multimedia analysis, given the alt-texting of imagery and the transcription of audio and video.

The types of themes that may be extracted will be influenced by the following factors:

  • The types of texts included in the text set
  • The ways the texts are synthesized into single files, text subsets, or into stand-alone documents
  • The methods used to clean or process the text data

NVivo 11 Plus enables a user to select the entire project for theme extraction or only subsets of the text within the project.


2. What is the general process of automated theme extraction using the NVivo 11 Plus tool?

Briefly put, the steps to the process are as follows:

1. Collect a set of text [to answer a research question(s) and related subquestions]. Text may be acquired in a number of ways. Texts may be formal or informal; they may be born digital, extracted from multimedia, digitized from analog files, or obtained by other methods.

2. Process and clean the text. (Make a note of how the text was processed, if at all.) Ensure that the text is in one main language (since NVivo 11 Plus works with one base language at a time). Multiple autocoding runs and multiple types of autocoding may be done on a particular text set.

3. Start an NVivo 11 Plus project.

4. Ingest the text(s) into the project in the proper form (as a synthesized individual text set, as small text subsets, as single documents, or in some other form).

5. Run the “theme extraction”.

6. Decide which top-level themes and subthemes to keep.

7. Finalize the theme extraction.

8. Analyze the created nodes with the coded text linked to each node.

9. Create data visualizations from the theme extractions for further insights.

10. Run additional analyses (and related data visualizations). In NVivo 11 Plus, these analyses may include the following: text frequency counts, text searches, matrix queries, sentiment analyses, and others.

11. Use “close reading” to understand the data at a more granular level.

12. Use due care when making assertions from the automated theme extraction.

To see how this all might work, a basic Tweetstream data extraction was conducted on the @Jack user account on Twitter. Jack Dorsey was one of the co-founders of Twitter and is its current CEO. His account was started in March 2006, in the very early days of this microblogging platform. At the time of this data extraction, the account had 18,818 Tweets, 1,818 following, 3.42 million followers, 10,818 likes, and 1 list. The extraction, done with NCapture, did not include retweets, in order to capture as many original messages as possible (and to keep retweet counts from unduly influencing the themes extracted). The NCapture download yielded only 889 Tweets, although captures of accounts of this size often come much closer to NCapture’s upper limit of about 3,200 Tweets.

The account does include a lot of retweeting. The landing page is at the following URL: https://twitter.com/jack.


Image:JackDorseyTweetstreamLandingPage.jpg


Once the set was collected (in a data table), the AutoCode Wizard was launched, and the “Identify themes” selection was chosen.


Image:IdentifyThemesinAutocodeWizard.jpg


In the results, the autocoded subthemes were more important than the top-level themes because only one top-level theme was identified (http), and the rest all fell below that theme. (NVivo 11 Plus’ theme extraction works much more coherently on other text sets. It is possible that this result occurred because @Jack’s Tweetstream is so diffuse. Usually, even with Twitter data, much more coherent themes come to the fore and identify something about the personal interests of the particular account holder.)


Image:AutocodedThemes.jpg
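A top-level theme of “http” suggests that link fragments in the Tweets were being counted as words. One plausible cleaning step (step 2 in the process above) is to strip URLs before the text is ingested. The sketch below is a hedged illustration in Python, working outside NVivo on invented sample messages; the pattern, the function names, and the sample data are assumptions, and it does not claim to reproduce what happened with the @Jack data.

```python
import re
from collections import Counter

# Matches http:// and https:// links up to the next whitespace character.
URL_PATTERN = re.compile(r"https?://\S+")


def strip_urls(text):
    """Remove http/https links and collapse any leftover whitespace."""
    return " ".join(URL_PATTERN.sub(" ", text).split())


def term_counts(texts):
    """Count lowercase alphabetic tokens across all texts."""
    counts = Counter()
    for text in texts:
        counts.update(re.findall(r"[a-z]+", text.lower()))
    return counts


# Invented sample messages, not taken from any real account.
tweets = [
    "Reading about design today http://example.com/a",
    "New post on design https://example.com/b",
    "http://example.com/c",
]

before = term_counts(tweets)
after = term_counts([strip_urls(t) for t in tweets])

print("http before cleaning:", before["http"])
print("http after cleaning:", after["http"])
print("design after cleaning:", after["design"])
```

Whatever cleaning is applied should be noted, since it changes which themes can surface.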


3. What further explorations of the text may be done after the initial theme extraction? How does the “sentiment analysis” tool (also in NVivo 11 Plus) apply in enhancing the understanding of themes? What sorts of manual “close reading” analysis may be done?


Running the theme extraction over the text does not rule out running other types of analyses over the same data. As a matter of fact, for many, the autocoded theme extraction is only a start. It is a way to provide initial leads to explore in depth.

First, it is helpful to fully benefit from the theme extraction. It is possible to look at the node outline built from the extracted themes and subthemes. For an individual who is expert in a particular field, that outline alone may already offer insight. Clicking on the respective themes enables reading the text coded to those particular themes and subthemes, so close reading can inform the work. An important question also is, “What is not seen?” Did the researcher expect to see particular top-level themes that did not get picked up by the machine analysis? Why could that be? Is there research relevance in that absence?

There are other autocoding tools within NVivo 11 Plus. One that complements the theme extraction tool is sentiment analysis. This tool compares the targeted text set against a sentiment dictionary (a collection of words labeled on a positive-to-negative continuum). Text is labeled in one of four categories (five, counting neutral): very negative, moderately negative, moderately positive, and very positive. Users may directly access the text coded to each category (except neutral text, which is whatever is not captured in one of the four sentiment sets). The sentiment analysis may also be collapsed into only two sets, positive and negative, by merging the very negative and moderately negative sets and the moderately positive and very positive sets. Also, the unit of analysis for the coding may be set at one of three general levels: paragraph, sentence, or cell (as in a data table). The level of analysis can change the outcomes of the sentiment analysis.


Image:AutocodedSentiment.jpg
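The dictionary-based approach described above, and the effect of the unit of analysis, can be illustrated with a small Python sketch. The four-word lexicon, its scores, the score cutoffs, and the function names are all hypothetical and chosen only for illustration; the actual dictionary and scoring rules in NVivo 11 Plus are not described here.

```python
import re

# Hypothetical mini-lexicon: scores run from -2 (very negative) to +2 (very positive).
LEXICON = {"terrible": -2, "bad": -1, "good": 1, "excellent": 2}


def label(unit):
    """Label one unit of text by the sum of its lexicon scores (an assumed rule)."""
    words = re.findall(r"[a-z]+", unit.lower())
    score = sum(LEXICON.get(w, 0) for w in words)
    if score <= -2:
        return "very negative"
    if score < 0:
        return "moderately negative"
    if score == 0:
        return "neutral"
    if score < 2:
        return "moderately positive"
    return "very positive"


def collapse(category):
    """Merge the four sentiment categories into positive and negative."""
    if category.endswith("negative"):
        return "negative"
    if category.endswith("positive"):
        return "positive"
    return "neutral"


def split_units(text, level):
    """Split text into units of analysis: 'paragraph' or 'sentence'."""
    if level == "paragraph":
        return [p.strip() for p in text.split("\n\n") if p.strip()]
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


# Invented sample text: one paragraph made of two sentences.
text = "The lecture was excellent. The audio was bad."
by_sentence = [label(u) for u in split_units(text, "sentence")]
by_paragraph = [label(u) for u in split_units(text, "paragraph")]

print("sentence level:", by_sentence)
print("paragraph level:", by_paragraph)
print("collapsed:", [collapse(c) for c in by_sentence])
```

Coded sentence by sentence, the sample yields one very positive and one moderately negative unit; coded as a single paragraph, the same text yields one moderately positive unit, because the scores are summed across the whole paragraph.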


Datasets may be extracted from the sentiment analyses.


Image:AVeryPositiveSentimentDataset.jpg


In NVivo 11 Plus, there are a number of types of data queries which may be run against the text sets and subsets (in this case the resulting nodes). For example, there are text frequency counts, text searches (with resulting text trees), cluster analyses, matrix queries, and compound queries.
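As a rough analogy for what a matrix query reports, the Python sketch below cross-tabulates coded units by theme and by sentiment and counts how many fall in each cell. The coded units are invented, and the function name `matrix` is a hypothetical helper for this sketch; it is not an NVivo function, and NVivo builds its matrices inside the project rather than from a list like this one.

```python
from collections import Counter

# Invented coding results: each pair is one coded unit (theme, sentiment).
coded_units = [
    ("design", "positive"),
    ("design", "negative"),
    ("design", "positive"),
    ("payments", "positive"),
]


def matrix(rows):
    """Cross-tabulate (row_label, column_label) pairs into nested counts."""
    table = {}
    for theme, sentiment in rows:
        table.setdefault(theme, Counter())[sentiment] += 1
    return table


table = matrix(coded_units)
for theme in sorted(table):
    print(theme, dict(table[theme]))
```

Empty or sparse cells in such a table can be as informative as full ones, since they show which combinations of theme and sentiment do not occur in the text set.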


4. What may be asserted using the automated theme extraction tool? What may not be asserted using the automated theme extraction tool? How may the theme extraction results be qualified?


As with any research context, it is important to know what may and may not be asserted from the findings. After all, every form of research has its limits, and every form of research has dependencies. In this context, it may help to answer some basic questions:

  • Where do the texts come from?
  • How were they processed or cleaned?
  • What were the parameters of the theme extraction? What themes or subthemes were removed by the researcher?

Given those factors, what themes are seen? What themes are not seen? What do these mean?

What are the gists of the top-level themes (especially as informed by the related subthemes and by close readings of the underlying texts)?

  • What additional analyses were conducted on the underlying text set?

Based on the various endeavors, what is the researcher seeing? What do these various elements mean?


5. What are some possible applications of automated theme extraction to online teaching and learning?

In the introduction, some possible uses of automated theme extraction for online teaching and learning were mentioned. Particularly, it is possible to machine-read discussion board postings around particular topics to understand the main themes created around particular concepts. Individual communications streams may be analyzed (assuming that the learning management system offers ways to offload the data). This is especially helpful in “big data” situations, such as massive open online courses (MOOCs) and large-scale online courses.

Also, if there is a dynamic literature linked to a particular field, it would be possible to run a theme extraction on a variety of formal articles downloaded from databases to identify which ones contain important concepts for the course…and to identify articles which should be assigned for close reading.

There are other ways to use this tool to collect and analyze information from social media platforms, the World Wide Web, and the Internet. In other words, there is a research angle as well.


Examples

See above.


How To

1. Collect a set of text (to answer a research question). Text may be acquired in a number of ways. Texts may be formal or informal; they may be born digital, extracted from multimedia, digitized from analog files, or obtained by other methods.

2. Process and clean the text. (Make a note of how the text was processed, if at all.) Ensure that the text is in one main language (since NVivo 11 Plus works with one base language at a time). Multiple autocoding runs and multiple types of autocoding may be done on a particular text set.

3. Start an NVivo 11 Plus project.

4. Ingest the text(s) into the project in the proper form (as a synthesized individual text set, as small text subsets, as single documents, or in some other form).

5. Run the “theme extraction”.

6. Decide which top-level themes and subthemes to keep.

7. Finalize the theme extraction.

8. Analyze the created nodes with the coded text linked to each node.

9. Create data visualizations from the theme extractions for further insights.

10. Run additional analyses (and related data visualizations). In NVivo 11 Plus, these analyses may include the following: text frequency counts, text searches, matrix queries, sentiment analyses, and others.

11. Use “close reading” to understand the data at a more granular level.

12. Use due care when making assertions from the automated theme extraction.


Possible Pitfalls

Autocoded theme analysis does not preclude human close reading and human extraction of themes. After all, in qualitative and mixed methods research practices, coding sometimes involves using a priori coding approaches—based on theory(ies), models, and given practices…or emergent coding practices (creating a codebook from the data directly)—based on “grounded theory” approaches.

This feature in NVivo 11 Plus rests on the assumption that designed algorithms are able to identify inherent themes in text sets. That assumption is quite different from the assumptions of qualitative and mixed methods research. It is important to acknowledge the beliefs that underlie the various research and analytical approaches. (Sometimes, theme extraction is used in a complementary way with human coding and close reading, and that is the best way to understand distant reading capabilities.)


Module Post-Test

1. What is the automated “theme extraction” tool in NVivo 11 Plus? What sorts of data may this autocoding be applied to?

2. What is the general process of automated theme extraction using the NVivo 11 Plus tool?

3. What further explorations of the text may be done after the initial theme extraction? How does the “sentiment analysis” tool (also in NVivo 11 Plus) apply in enhancing the understanding of themes? What sorts of manual “close reading” analysis may be done?

4. What may be asserted using the automated theme extraction tool? What may not be asserted using the automated theme extraction tool? How may the theme extraction results be qualified?

5. What are some possible applications of automated theme extraction to online teaching and learning?


References

“Automatically detect and code themes.” (2015) QSR International. Retrieved February 11, 2016, from http://help-nv11.qsrinternational.com/desktop/procedures/automatically_detect_and_code_themes.htm.


Extra Resources

Hai-Jew, S. (2014-2016). “‘Autocoding’ through Theme Extraction.” In Using NVivo: An Unofficial and Unauthorized Primer. Retrieved February 11, 2016, from http://scalar.usc.edu/works/using-nvivo-an-unofficial-and-unauthorized-primer/autocoding-through-theme-extraction.