Using Distant (Machine) Reading on Discussion Boards

From E-Learning Faculty Modules


Module Summary

In online learning courses, discussion boards are often used for several purposes:

(1) to enhance the learning through instructor-directed prompts (questions, cases, imagery, audio, video, and others) and learner responses;

(2) to showcase learner works (for peer analysis and discussion);

(3) to enhance the creation and maintenance of “communities of practice”;

(4) to improve the senses of learner and instructor presence in the online course, and others…

Learners in online classes have long benefited from online interactivity with their peers and instructors through discussion boards, text chats, VOIP, and web conferencing technologies, among others. The benefits include richer insights and improved learner retention.

For larger online courses, the discussion boards themselves may be unwieldy to handle for an instructor, even with the support of graduate teaching assistants (GTAs) or with assigning students to support each other’s interactivity (usually by requiring a certain number of substantive responses to peers’ postings). One newer approach involves the application of “distant” or “machine” reading of student postings in order to more efficiently process large amounts of student interactivity data—to enable instructor awareness, instructor messaging, improved instructional design, and other features.

This module addresses machine learning applications enabled by NVivo 11 Plus as applied to discussion boards. (Some learning management systems enable the download of discussion boards in an automated way. All LMSes enable access to discussion boards through manual download methods.)


Learners will...

  • explore various “distant reading” features of NVivo 11 Plus and see how these may apply to analysis of texts from online discussion boards; the two main features are “sentiment analysis” and “theme / subtheme extraction”;
  • consider a general sequence of distant reading;
  • compare “distant” reading with “close reading” and understand the strengths and weaknesses of each;
  • consider some “askable” questions from discussion board text data (as analyzed by broadly available computer software);
  • explore some ways to represent data using NVivo 11 Plus.

Module Pretest

1. What are some “distant reading” enablements of NVivo 11 Plus? What is sentiment analysis? Theme and subtheme extraction? Other exploratory features? And how may these features be applied to online discussion boards?

2. What is the general sequence of distant reading?

3. How does “distant” reading compare with “close” reading?

4. What are some “askable” questions from discussion board text data analyzed with computer software?

5. What are some ways to represent data visually and otherwise in NVivo 11 Plus?

Main Contents

How are students (en masse) responding to a particular issue brought up in class? What are particular areas of controversy or strife? How are sentiments trending regarding a particular event or individual or phenomenon or technology? What language do they use in expressing themselves? (And what may be seen in the text? The subtext?)

When students respond to a particular prompt, what topics do they bring up? What themes are they mentioning?

The thinking is that, even in large-scale courses, faculty awareness of and engagement with learners are important for the entire learning experience.

1. What are some “distant reading” enablements of NVivo 11 Plus? What is sentiment analysis? Theme and subtheme extraction? Other exploratory features? And how may these features be applied to online discussion boards?

“Distant reading” refers to the use of computers to “mine data” (that is, to identify patterns) from texts. If reading is about decoding texts, then distant reading is about decoding texts at mass scale through computational means. NVivo 11 Plus is a qualitative data analytics tool that offers a range of features that may be applied to distant reading. There is no preset sequence for machine reading texts in this context, but it is helpful to note that there are a variety of ways to extract meanings: word frequency counts, text searches, theme and sub-theme extractions, matrix queries, sentiment analyses, autocoding by existing pattern, geolocational mapping, sociogram creation, and various types of time pattern analyses.

Two of the new features in NVivo 11 Plus (which rolled out in late 2015) include some autocoding tools: sentiment analysis and theme and sub-theme extraction.

Sentiment is conceptualized as a polarity, either positive or negative. In NVivo 11 Plus, sentiment may be a binary (positive or negative) or something expressed on a continuum with the following four sentiment categories: very negative, moderately negative, moderately positive, or very positive (with the uncoded text assumed to be “neutral”). A sentiment analysis may be conducted on a text document or corpus (collection of texts) or corpora (a mix of text collections). How is sentiment computed? Using a built-in sentiment set (“dictionary”) of words, which are coded to a culturally-informed inherent sense of word meanings, the respective text is coded to each of the four categories mentioned earlier. Researchers may re-code text to another category, or they may uncode auto-coded text to be “neutral.”
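
The dictionary-based approach described above can be sketched in a few lines of Python. This is only an illustration of the general idea, not NVivo's actual algorithm or dictionary: the word list, the scores, and the function names (`code_sentence`, `code_posts`) are all hypothetical, and the sentence splitting is deliberately crude.

```python
import re
from collections import Counter

# Hypothetical mini-lexicon (invented for illustration; not NVivo's dictionary).
# Each word maps to one of the four sentiment levels.
LEXICON = {
    "terrible": -2, "hate": -2,
    "confusing": -1, "slow": -1,
    "helpful": 1, "clear": 1,
    "excellent": 2, "love": 2,
}

LABELS = {
    -2: "very negative",
    -1: "moderately negative",
    1: "moderately positive",
    2: "very positive",
}


def code_sentence(sentence):
    """Return the sentiment categories triggered by words in one sentence."""
    words = re.findall(r"[a-z']+", sentence.lower())
    return [LABELS[LEXICON[w]] for w in words if w in LEXICON]


def code_posts(posts):
    """Count sentences per category; sentences with no match count as neutral."""
    counts = Counter()
    for post in posts:
        # Crude sentence split on ., ! or ? followed by whitespace.
        for sentence in re.split(r"(?<=[.!?])\s+", post.strip()):
            if not sentence:
                continue
            labels = code_sentence(sentence)
            if labels:
                counts.update(set(labels))
            else:
                counts["neutral"] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "The lecture was excellent. The quiz was confusing.",
        "I read chapter two.",
    ]
    for category, n in code_posts(sample).items():
        print(category, n)
```

A sketch like this also shows why results need close reading afterward: a word list has no notion of negation or sarcasm, so "not helpful" would still be coded as moderately positive here.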

The theme and subtheme extraction (another autocoding tool) function enables the automated extraction of key words that represent a text document, a text corpus, or text corpora. Most of the words identified to have thematic relevance tend to be semantic terms. Users of the tool may decide which proposed words to include in the final set. The theme-and-subtheme contents are applied to the text, and the coded text may be viewed within the autocoded theme and subtheme structure. The “subthemes” seem to be literal occurrences of the top-level theme words albeit in a mix of other combinations of terms.
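
As a rough illustration of the theme and subtheme structure described above (and not a reconstruction of how NVivo computes it), the following Python sketch treats the most frequent non-stopword terms as "themes" and any adjacent two-word phrase containing a theme word as a "subtheme." The stopword list, the function names, and the two-word phrase rule are all assumptions made for this example.

```python
import re
from collections import Counter

# Small illustrative stopword list (an assumption; real tools use larger ones).
STOPWORDS = {
    "the", "a", "an", "of", "and", "to", "in", "is", "was", "for",
    "on", "my", "i", "we", "it", "that", "this", "with", "are",
}


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())


def extract_themes(posts, top_n=2):
    """Map each top-level theme word to the two-word phrases that contain it."""
    word_counts = Counter()
    phrase_counts = Counter()
    for post in posts:
        tokens = tokenize(post)
        word_counts.update(t for t in tokens if t not in STOPWORDS)
        # Candidate subthemes: adjacent word pairs with no stopword in them.
        for first, second in zip(tokens, tokens[1:]):
            if first not in STOPWORDS and second not in STOPWORDS:
                phrase_counts[(first, second)] += 1
    themes = [word for word, _ in word_counts.most_common(top_n)]
    return {
        theme: sorted(" ".join(pair) for pair in phrase_counts if theme in pair)
        for theme in themes
    }


if __name__ == "__main__":
    sample = [
        "Group work is hard online.",
        "Online group projects need clear deadlines.",
        "Group work helps.",
    ]
    for theme, subthemes in extract_themes(sample, top_n=1).items():
        print(theme, subthemes)
```

As in the tool itself, a human would still review the proposed terms and decide which to keep in the final set.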

As for how distant reading may be applied to online discussion boards, some “askable” questions are offered in Section 4.

2. What is the general sequence of distant reading?

The general sequence of “distant reading” is as follows.

1. Collection of relevant texts (challenges with access)

2. Data formatting and cleaning (data pre-processing, including file formatting, searchable text, de-duplication of files, spell checks, and other types of work)

3. Textual markup via tagging (sometimes, with XML, for example)

4. Running data queries and analytics

5. Outputting relevant data visualizations (can start with exploration using data visualizations)

6. Analysis and interpretation

7. Finalization and presentation

In a context where distant reading may be applied to discussion boards, Step 1 would involve downloading textual and multimedia data from the learning management system’s discussion boards. (If this may be done in an automated way while preserving the poster’s name, the time of the posting, and so on, that would be optimal.) Step 2, the data cleaning, is likely required as well. If the data requires some manual coding, that should be done. Then, autocoding may be important—to capture sentiment, to capture themes and subthemes, from various chunks of the textual data. Then, the various data queries may be conducted on the raw texts as well as the coding. Data visualizations may be created from the autocoding as well as the data queries (run on the raw data, run on the coded data, or run on a mixture of both).
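
Steps 1 and 2 can be sketched outside of any particular tool. The Python example below assumes posts have already been exported from the LMS into simple records; the `Post` fields, the HTML-stripping rule, and the definition of a "duplicate" (same author, thread, and text) are assumptions for illustration, not features of any specific LMS or of NVivo.

```python
import html
import re
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Post:
    """Hypothetical export record that keeps who said what, when, and where."""
    author: str
    posted_at: datetime
    thread: str
    text: str


def clean_text(raw):
    """Strip HTML tags, decode entities, and collapse whitespace."""
    text = re.sub(r"<[^>]+>", " ", raw)
    text = html.unescape(text)
    return re.sub(r"\s+", " ", text).strip()


def prepare(posts):
    """Return cleaned posts in time order, dropping empty and duplicate ones."""
    seen = set()
    cleaned = []
    for post in sorted(posts, key=lambda p: p.posted_at):
        text = clean_text(post.text)
        key = (post.author, post.thread, text.lower())
        if not text or key in seen:
            continue
        seen.add(key)
        cleaned.append(Post(post.author, post.posted_at, post.thread, text))
    return cleaned


if __name__ == "__main__":
    raw_posts = [
        Post("sam", datetime(2016, 4, 4, 9, 0), "week1", "<p>Cats &amp; dogs</p>"),
        Post("sam", datetime(2016, 4, 4, 9, 5), "week1", "Cats &amp;  dogs"),
        Post("lee", datetime(2016, 4, 4, 8, 0), "week1", "<b>Good point</b>"),
    ]
    for post in prepare(raw_posts):
        print(post.posted_at, post.author, post.text)
```

Keeping the author and timestamp attached to each cleaned post is what later makes network and time-based questions possible.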

If there are specific research or applied questions, the sequence may be different.

3. How does “distant” reading compare with “close” reading?

The term “distant reading” was coined by Franco Moretti (2000), in “Conjectures on World Literature,” in which he suggested that computational means should be used to process not only the local literary canon but also works of world literature, digitized texts, and born-digital contents in the Internet era. The abundance of available texts in the world, with ever diminishing human time to read them, means that many works remain in “the great unread”; Moretti (2013), in “Distant Reading,” suggests that distant reading may stand in the gap. Other arguments are that people have lower levels of interest in reading nowadays than in the past. Those who would conduct surveys of the general public generally aim at the 3rd to 5th grade levels; newspapers are written to the 8th grade level. Even when human readers are proficient, they generally read between 200 and 800 words per minute: about 200 wpm for close academic reading and 700 to 800 wpm for skimming and scanning.

Statistical analyses of texts have been ongoing for over 50 years but started initially with manual word counts.

Up front, “distant” and “close” reading often surface different insights. A “distant” reading does not necessarily validate insights from “close” readings, except potentially in superficial ways. (As a superficial example, a quick read-through of a text may give a general focus, which may also be capturable with distant reading.) In the computer case, the algorithms are set generally and so pull general patterns from text. In the human case, the close reading by an expert will mean insights from beyond the text—to culture, to history, to symbolism, to genre, to multiple languages, and other rich background.

4. What are some “askable” questions from discussion board text data analyzed with computer software?

Some “askable” questions were brainstormed for a presentation on this topic (link below). The questions were as follows:

  • What are dominant issues on a certain discussion thread(s)? What are the outlier issues on certain discussion threads (the topics that are part of the “long tail”)?
  • What sorts of language are used in addressing particular topics? Why?
  • Who is interacting with whom? Around what issues? What types of cliques may be observed in the online course social network?
  • What are expressed sentiments based on a certain discussion thread(s)?
  • How are particular terms used in various contexts in the discussion? What are the various gists and word senses?
  • How does activity in discussion (message) boards change over time?
  • What sorts of prompts are the most effective to encourage online discussions?
  • What sorts of instructor interventions are most constructive (based on learner responses)? Which are least constructive? Why?
  • What sorts of messages spark discussion? What sorts of messages shut down discussion? When is each type of message preferred?
  • What are observable word relationships in a discussion board(s)? What do these (inter)relationships indicate?
  • Are there hidden or latent topics in the discussions based on the observed words? What insights are there?
  • What locations are indicated in uploaded images shared on a discussion board(s)? How do these map? Are there spatial patterns?

These are some early questions, but many others may be designed and used in this context.
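
One of the questions above, how discussion board activity changes over time, can be approximated with a simple count of posts per week. The Python sketch below assumes that posting timestamps were preserved during collection; grouping by ISO calendar week is an arbitrary choice made for this example, and the function name is hypothetical.

```python
from collections import Counter
from datetime import datetime


def posts_per_week(timestamps):
    """Count posts per (ISO year, ISO week), returned in chronological order."""
    counts = Counter()
    for ts in timestamps:
        year, week, _ = ts.isocalendar()
        counts[(year, week)] += 1
    return dict(sorted(counts.items()))


if __name__ == "__main__":
    stamps = [
        datetime(2016, 4, 4, 10, 30),
        datetime(2016, 4, 7, 21, 15),
        datetime(2016, 4, 11, 8, 0),
    ]
    for (year, week), n in posts_per_week(stamps).items():
        print(year, week, n)
```

The same grouping could be applied per thread or per learner to see where participation rises or drops off.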

5. What are some ways to represent data visually and otherwise in NVivo 11 Plus?

NVivo 11 Plus represents the results of the various types of “distant reading” in data tables and in various data visualizations. These data visualizations include the following: word clouds, word trees, cluster visualizations (2D, 3D, 4D), dendrograms, tree maps, intensity matrices (matrix factorizations), node-link diagrams, topic histograms, sociograms, geographical maps, and others.

To view some of these data visualizations, please see the link to a related slideshow below, in the Extra Resources section.


There are no direct discussion board examples here because no actual data was used by this author for specific research.

Those who would use student information for published research have to go through “human subjects research” review and gain approval to proceed with the research. Then, they have to write up an “informed consent” document to let students know what sort of research is going on, and they have to follow the required processes to protect student identities, to properly handle (and store) the data, and so on. Students should be able to opt in or opt out of the research without any repercussions for their study or grades.

If faculty are merely trying to learn about how to improve their course and how to better engage their learners, additional permissions may not be necessary.

How To

Earlier, Section 2 covered the general sequence of how a distant reading run may be done generically and then in the case of discussion boards. This section will summarize how discussion boards may be analyzed using machine reading.

First, it may help to determine whether the individual is using this method for discovery (knowing what’s going on) or to answer a research question. What an individual wants to know will affect how he or she approaches the data. Discovery is often more broadly applied and may work with generic processes, but particular research questions may require certain types of text collection, text cleaning, and defined sequences of text exploration / processing / queries.

After the initial goal is set, it is important to collect text from the discussion boards. Generally, the researcher has to collect the proper amount and types of texts. Taking discussion board data off of an LMS is not easy unless that functionality is built into the system. However, some faculty have their graduate students or lab aides extract the data manually by cutting and pasting. The goal is to avoid losing or misidentifying information. For example, commonly “lost” information is who said what and when. That data can be used in the creation of online social networks and in other types of analyses (like “survival analyses” based on certain topics).
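
To show why preserving "who said what when" matters, the Python sketch below turns reply records into a weighted edge list of the kind used to draw a sociogram. The record layout (post ID, author, parent post ID) and the function name are assumptions for this example; actual LMS exports will differ.

```python
from collections import Counter


def reply_edges(records):
    """Build weighted (replier, original poster) edges from reply records.

    Each record is a hypothetical (post_id, author, parent_id) tuple, where
    parent_id is None for a post that starts a thread.
    """
    author_of = {post_id: author for post_id, author, _ in records}
    edges = Counter()
    for _, author, parent_id in records:
        if parent_id is None or parent_id not in author_of:
            continue  # thread starter, or the parent post was lost in export
        target = author_of[parent_id]
        if target != author:  # ignore self-replies
            edges[(author, target)] += 1
    return edges


if __name__ == "__main__":
    records = [
        (1, "ana", None),
        (2, "ben", 1),
        (3, "ana", 2),
        (4, "ben", 1),
        (5, "cy", 99),  # parent missing from the export
    ]
    for (source, target), weight in reply_edges(records).items():
        print(source, "->", target, weight)
```

If author names or reply relationships are dropped during a manual cut and paste, none of these edges can be recovered afterward.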

Once the data is collected, it may have to be cleaned to remove duplicate files…to normalize data…to translate information from one language to another…to transcode multimedia data into textual data, and so on. The data is then ingested into NVivo 11 Plus.

In the next step, researchers may want to mark up the text or to code it in some way to enable the asking of particular questions.

Then, autocoding may be done to extract machine-based insights, particularly in regards to sentiment and to theme and subtheme extraction.

Various data queries may be run against the source text or the coding (or both). Data visualizations may be output as well.

The resulting data may be analyzed and interpreted. Then, the online instructor may make decisions, design changes, interact with the learners, highlight relevant observations, and so on.

Possible Pitfalls

What are some possible pitfalls to applying distant reading to discussion boards in large-scale classes?

One pitfall involves using the tool without fully understanding what is going on. For example, what does a “sentiment analysis” do? How is it created? How should the results be interpreted? Also, what do the respective text sets look like? In a theme and sub-theme extraction, what are the strengths and limits of the method? What level should the coding be applied at, and why? What may be assertable from either or both of these autocoding features? Likewise, all the various data queries (word frequency counts, text searches, matrix queries, and others) have to be understood accurately first and foremost. Otherwise, it is possible to draw incorrect conclusions and use the findings incorrectly. Any findings should be analyzed and explored, not dealt with in a mechanistic or unthinking way.

Just because distant reading is used does not mean that close reading does not have an important place in all this. There should be close reading of target messages and close interpretations of all data outputs.

How the software is used may be a challenge. For example, running queries at too high a level (such as the full set of text) may result in a failure to truly engage relevant issues. For example, if particular threads address particular issues, it may be important to engage at the thread level to understand what is happening in a more close-in or granular way. Also, too much of distant reading involves frequencies…when small frequencies (word counts on the “long tail” of a frequency distribution) may matter. It is important to understand outlier information, particularly since some creative ideas may be expressed there.
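
The point about thread-level analysis and the "long tail" can be illustrated with a short Python sketch that counts words per thread and then lists the rarely used words in a given thread. The input layout, the function names, and the cutoff of one occurrence are assumptions made for this example.

```python
import re
from collections import Counter, defaultdict


def thread_frequencies(posts):
    """Count word frequencies separately for each thread.

    `posts` is an iterable of hypothetical (thread_name, text) pairs.
    """
    freqs = defaultdict(Counter)
    for thread, text in posts:
        freqs[thread].update(re.findall(r"[a-z']+", text.lower()))
    return freqs


def long_tail(counter, max_count=1):
    """Return the words that occur at most `max_count` times, alphabetically."""
    return sorted(word for word, n in counter.items() if n <= max_count)


if __name__ == "__main__":
    posts = [
        ("week1", "Rubrics help. Rubrics help a lot."),
        ("week2", "Maybe use peer grading"),
    ]
    freqs = thread_frequencies(posts)
    for thread, counter in freqs.items():
        print(thread, counter.most_common(2), long_tail(counter))
```

Running the same counts on the full set of text would merge the threads together, which is exactly the loss of granularity described above.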

Finally, another pitfall may involve relying on just one distant reading tool. In most analytical tools, it is possible to export the data for further analysis elsewhere. Different methods and tools enable different types of insights. To that end, it would help to become familiar with a range of software tools and methods and then apply the ones that work most effectively for particular data outcomes.

Module Post-Test

1. What are some “distant reading” enablements of NVivo 11 Plus? What is sentiment analysis? Theme and subtheme extraction? Other exploratory features? And how may these features be applied to online discussion boards?

2. What is the general sequence of distant reading?

3. How does “distant” reading compare with “close” reading?

4. What are some “askable” questions from discussion board text data analyzed with computer software?

5. What are some ways to represent data visually and otherwise in NVivo 11 Plus?


Extra Resources

Hai-Jew, S. (2016, Apr. 7). "Using 'distant reading' to explore discussion threads in online courses." SlideShare. Retrieved Aug. 31, 2016 from