Decision Trees




Module Summary

Decision trees, which show relationships between attributes and target outcomes (expressed as “rules”), are induced from quantitative and categorical data by learning algorithms. Decision trees are expressed visually as “trees,” read from the root node at the top through branches to leaves at the bottom. Decision trees may identify macro-level patterns in data. They are considered highly human-readable and directly interpretable. The required data structures tend to be typical data tables, with little required data pre-processing / data cleaning. Decision tree classification models may be evaluated for internal validity (the correctness of classifications vs. misclassifications) and also for external validity.

This module introduces decision trees as a predictive classifier method for interpreting learning data.

Takeaways

Learners will...

  • explore the uses of decision trees to capture patterns in quantitative and categorical data
  • think about how to pre-process or clean data for decision tree induction; consider how various parameter settings affect the information that may be collected from decision trees
  • consider how decision tree classification models are validated or not and consider what external validity may be
  • explore various types of decision trees that may be created using RapidMiner Studio (with a free educational license)
  • consider some ways that online learning data may be analyzed using decision trees (as a form of machine learning)

Module Pretest

1. What are decision trees, and how are these used in analyzing data?

2. What are some basics to pre-processing or cleaning data for decision tree induction? What are various parameter settings that affect the information that may be collected from decision trees?

3. How are decision tree classification models validated or not? What is external validity for decision tree models?

4. What are various types of decision trees that may be created using RapidMiner Studio (with a free educational license)?

5. What are some ways that online learning data can be analyzed using decision trees (as a form of machine learning)?

Main Contents

[Image: Basic Parts of a Decision Tree]


1. What are decision trees, and how are these used in analyzing data?


Manually drawn decision trees are data visualizations that capture decision-making in particular contexts. These inform how people make decisions to address particular challenges or issues. (In game theory, extensive-form game trees define optimal strategies for decision-making in particular contexts.) Manual decision trees may be based on a number of factors: people’s experiences, applied theories, models, empirical data, and others.

Decision trees may also be machine-drawn or “induced” from quantitative and categorical data. These decision trees are a product of “machine learning” or “data mining,” with learning algorithms applied to data to extract data patterns that would be invisible or latent otherwise (or only identifiable with other machine learning methods). Decision trees may be pre-pruned and post-pruned, with particular parameter settings, to suggest the optimal size of the induced trees and the types of data analytics outcomes desired (based on the underlying data and the issues of interest to researchers).

A decision tree is composed of basic parts: a root (starting) node at the top, branches with particular values, and leaves (which are end states or “terminal nodes”). A decision tree is read from the root node at the top, downwards along the branches, to the leaves at the bottom. One of the major strengths of decision trees is that the resulting visualizations can be human-readable and human-interpretable while also being grounded in statistical regularities in the data, repeatable, workable at machine scale, and insightful.
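To make the anatomy concrete, below is a minimal sketch in Python with scikit-learn (an assumption for illustration; this module itself works in RapidMiner Studio) that induces a small tree and prints it as indented text, so the root node, the branches (attribute-value tests), and the leaves (terminal class labels) are all visible.

  # Induce a small decision tree and print it root-first as text.
  from sklearn.datasets import load_iris
  from sklearn.tree import DecisionTreeClassifier, export_text

  iris = load_iris()
  tree = DecisionTreeClassifier(max_depth=2, random_state=0)
  tree.fit(iris.data, iris.target)

  # Each "|---" level is one branch deeper; "class:" lines are leaves.
  print(export_text(tree, feature_names=list(iris.feature_names)))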

The basic patterns extracted from the data are the attributes (variables) that are most influential in differentiating the (mutually exclusive) classification of a particular individual, object, or phenomenon. At each level of a “split,” from the root node downwards, the best attribute for determining (mutually exclusive) classification is identified, and the values of the respective branches along which the classifications split are noted. The algorithms deal sequentially with the data, so prior levels and prior splits affect later ones. The coarsest-scale splits occur early, and the more nuanced ones occur closest to the terminal nodes at the bottom of the visualization (because decision trees are drawn upside-down, with the root at the top).
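As a worked illustration of how one split might be scored (a hedged sketch; different algorithms use different measures, such as information gain, gain ratio, or Gini impurity), information gain is the parent node’s entropy minus the weighted entropy of its child nodes:

  # Score a candidate split by information gain (entropy-based).
  from math import log2

  def entropy(counts):
      """Shannon entropy of a class-count list, e.g. [9, 5]."""
      total = sum(counts)
      return -sum((c / total) * log2(c / total) for c in counts if c > 0)

  parent = [9, 5]                    # e.g. 9 "pass" vs. 5 "fail" records
  children = [[6, 1], [3, 4]]        # class counts after a candidate split
  n = sum(parent)
  gain = entropy(parent) - sum(sum(ch) / n * entropy(ch) for ch in children)
  print(round(gain, 3))              # ~0.152 bits gained by this split

At each level, the attribute whose split yields the best value of whichever measure the algorithm uses is the one chosen.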

Decision tree algorithms vary in their processes, their efficiencies, their accuracy, and their computational expense. In this short module, these will not be explored in any depth. There are a number of publicly accessible resources that describe these. It should be noted that the data patterns identified by decision trees may be identified with other machine learning “learners” as well because the data patterns are within the data. (Quite a few data analytics software tools and high-level computer languages have decision tree algorithms as part of their respective packages.)

Decision trees are said to have originated in the 1970s (“Who invented the decision tree?” n.d., CrossValidated on Stack Exchange).

Decision trees provide

  • descriptive information (about which attributes most determine classifications to various classes);
  • predictive information (about what the tendencies are with similar datasets, and with inferences that can be made about a single data point and its likely classification); and
  • prescriptive information (such as what changes may be made to attributes or certain contexts to try to modify a classification outcome), among others.

The classifier predictivity comes from the rules that are surfaced or identified from the data. Internal validity of the decision tree classifier model comes from its accuracy rate in predicting the classes of an unseen test data set (as opposed to the training data set from which the tree was induced). External validity relates to how well the induced decision trees reflect the world outside of the data. More on this will follow.
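A minimal holdout-validation sketch follows (again in Python with scikit-learn, as an assumed stand-in for RapidMiner’s validation operators): the tree is induced from a training portion of the labeled data, and internal validity is then estimated as accuracy on a held-out test portion the tree never saw.

  # Holdout validation: induce on training data, score on unseen test data.
  from sklearn.datasets import load_iris
  from sklearn.model_selection import train_test_split
  from sklearn.tree import DecisionTreeClassifier

  X, y = load_iris(return_X_y=True)
  X_train, X_test, y_train, y_test = train_test_split(
      X, y, test_size=0.3, random_state=0)

  model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
  print("test accuracy:", model.score(X_test, y_test))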


2. What are some basics to pre-processing or cleaning data for decision tree induction? What are various parameter settings that affect the information that may be collected from decision trees?

Depending on the types and state of the data in the data table, it may be necessary to pre-process or clean the data before it is used for the induction of decision trees. For some decision tree algorithms, the target classification has to be nominal or categorical data. Other decision tree algorithms enable the output classification column to contain continuous numerical information, which may be split into ranges, with the main attributes then identified based on how relevant they are to determining classification into those ranges. (These are regression-type decision trees.)
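For contrast, here is a hedged sketch of a regression-type tree (scikit-learn assumed; the attribute and target here are invented for illustration), where the target column is continuous and each leaf predicts a numeric value rather than a category.

  # A regression tree on a continuous (numeric) target.
  import numpy as np
  from sklearn.tree import DecisionTreeRegressor

  rng = np.random.default_rng(0)
  hours = rng.uniform(0, 10, size=(200, 1))               # hypothetical study hours
  score = 50 + 4 * hours.ravel() + rng.normal(0, 5, 200)  # hypothetical exam score

  reg = DecisionTreeRegressor(max_depth=3, random_state=0).fit(hours, score)
  print(reg.predict([[2.0], [8.0]]))  # predicted scores for 2 vs. 8 hours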

The basic steps to pre-processing or cleaning data are as follows.

First, store a pristine backup version of the dataset / data table.
Second, identify the column with the classification data, and make sure that this data is in a clear categorical format. If there are only two categorical possibilities in the entire set, this is referred to (in RapidMiner’s terminology) as binominal data, a taxonomy of two; if there are more categories, it is referred to as polynominal data. Generally, this column contains the “labels” for the labeled row data by identifying their type. (It is possible to apply decision trees to unlabeled data, too, but that is beyond the purview of this module.)
Third, some or all of the attribute columns may need to be rendered into one form or another, so that the attributes are comparable using the particular decision tree algorithm. (In many cases, the software that induces decision trees can be highly informative about what changes need to be made.) A small code sketch of these steps follows.
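As a small illustration of these three steps (a sketch in Python with pandas; the column names "clicks," "forum_posts," and "completed" are hypothetical):

  import pandas as pd

  df = pd.DataFrame({
      "clicks": [120, 40, 95, 10],
      "forum_posts": [5, 0, 3, 1],
      "completed": ["yes", "no", "yes", "no"],   # the class-label column
  })

  backup = df.copy()                                    # first: pristine backup
  df["completed"] = df["completed"].astype("category")  # second: categorical label
  df["clicks"] = pd.to_numeric(df["clicks"])            # third: comparable attributes
  print(df.dtypes)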

Parameter settings for basic decision trees relate to how such decision trees are (1) grown and (2) pruned (both pre-pruned and post-pruned). Decision trees may be allowed to grow with no pruning and no inherent limits, and this can produce trees that are highly nuanced and complex, especially if the underlying data has many complex attributes and a wide range of possible classifications. It is thought that humans cannot practically use trees that are more than about 20 splits or levels deep, so some default settings limit decision trees to about 20 levels. (The data itself usually stops the tree from splitting well before such a limit is reached.)
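In code form, a depth cap might look as follows (scikit-learn assumed; RapidMiner exposes an analogous “maximal depth” parameter in its Decision Tree operator):

  from sklearn.datasets import load_iris
  from sklearn.tree import DecisionTreeClassifier

  X, y = load_iris(return_X_y=True)
  unlimited = DecisionTreeClassifier(random_state=0).fit(X, y)
  capped = DecisionTreeClassifier(max_depth=20, random_state=0).fit(X, y)

  # On simple data the tree stops splitting well before the cap,
  # so both depths typically come out the same here.
  print(unlimited.get_depth(), capped.get_depth())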

Pruning refers to the rules that are applied to simplify the induced decision trees and their branches. Pre-pruning refers to settings that limit tree growth during its induction from the data, such as a minimum size for leaf node generation, a maximal tree depth, and a minimal-gain standard for each split. Post-pruning refers to measures such as the confidence level “used for the pessimistic error calculation of pruning” (in other words, how accurate the decision tree model is at labeling the data, with the pruning of nodes that do not meet certain thresholds or tests). This is post-pruning because the pruning is not done until the decision tree has been induced; at that point its performance is assessed, and additional pruning occurs. That is the process in a very small nutshell.
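A hedged code sketch of pruning settings follows. Note the swap: scikit-learn (assumed here) does not expose RapidMiner’s confidence-based pessimistic pruning; its post-pruning analogue is cost-complexity pruning (ccp_alpha), shown alongside two common pre-pruning settings.

  from sklearn.datasets import load_iris
  from sklearn.tree import DecisionTreeClassifier

  X, y = load_iris(return_X_y=True)
  tree = DecisionTreeClassifier(
      min_samples_leaf=5,   # pre-pruning: minimum records per leaf
      max_depth=10,         # pre-pruning: maximal tree depth
      ccp_alpha=0.01,       # post-pruning: remove subtrees with little value
      random_state=0,
  ).fit(X, y)
  print("leaves after pruning:", tree.get_n_leaves())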


[Image: Weighting Dataset View]


[Image: Weighting Decision Tree Design]


[Image: Weighting Decision Tree]


3. How are decision tree classification models validated or not? What is external validity for decision tree models?

Decision tree classification models are validated / invalidated based on an accuracy rate. As may be noted in the table view below, this particular decision tree model, applied with default settings to a data set provided inside RapidMiner Studio, has a very high accuracy rate for identifying both true positive and true negative instances in the respective sets. Class recall is high, as is class precision. Note that the misclassification rates are simple complements: (100% − class precision), (100% − class recall), and so on. A code sketch of these measures follows the figures below.


[Image: 3D Confusion Matrix of Performance]


[Image: Accuracy by Table View]


[Image: Validation of Decision Tree]
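For those working outside RapidMiner, here is a hedged scikit-learn sketch of the same validation step: a confusion matrix of true vs. predicted classes on a held-out test set, plus class precision and class recall (from which the misclassification rates are the simple complements noted above).

  from sklearn.datasets import load_breast_cancer
  from sklearn.metrics import classification_report, confusion_matrix
  from sklearn.model_selection import train_test_split
  from sklearn.tree import DecisionTreeClassifier

  # A binary (two-class) dataset, so true positives / true negatives apply.
  X, y = load_breast_cancer(return_X_y=True)
  X_train, X_test, y_train, y_test = train_test_split(
      X, y, test_size=0.3, random_state=0)

  pred = DecisionTreeClassifier(random_state=0).fit(X_train, y_train).predict(X_test)
  print(confusion_matrix(y_test, pred))       # rows: true class; cols: predicted
  print(classification_report(y_test, pred))  # class precision and class recall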


4. What are various types of decision trees that may be created using RapidMiner Studio (with a free educational license)?

Inside the free (row-limited) educational version of RapidMiner Studio, there are quite a few types of decision trees which may be run, as well as ensemble modelers built on decision trees, such as random forest algorithms. The software is well documented, with context-sensitive help as well as crowd-sourced live help (and active software makers who are friendly in outreach). It is beyond the purview of a short module to go into each of these, but they are fairly easy to deploy, particularly with the active help and the descriptive information within the tool.
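As one point of comparison, below is a hedged sketch of a random forest (scikit-learn assumed): an ensemble that induces many trees on resampled rows and attribute subsets and lets them vote, typically trading the single tree’s human readability for accuracy and stability.

  from sklearn.datasets import load_iris
  from sklearn.ensemble import RandomForestClassifier
  from sklearn.model_selection import cross_val_score

  X, y = load_iris(return_X_y=True)
  forest = RandomForestClassifier(n_estimators=100, random_state=0)
  print(cross_val_score(forest, X, y, cv=5).mean())  # mean cross-validated accuracy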


5. What are some ways that online learning data can be analyzed using decision trees (as a form of machine learning)?

In the abstract, running a decision tree requires a few things: a data table of row records containing attributes, and a column that may serve as the mutually exclusive labeled classifications.

In terms of learning data (and online learning data), the following types of questions may be asked if the data is available:

  • What courses in a learning sequence are the most informative of students who go on to graduate with the particular degree?
  • What technological features of an online course are most informative of high student retention?
  • What behavioral or performance variables / attributes may be used to identify learners who fail to pass a course?

And so on. This is not to say that no light pre-processing, data cleaning, or perhaps aggregating of datasets would be needed, but the questions above are not beyond reasonable and practical possibility. A toy end-to-end sketch follows.
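To make the third question concrete, below is a toy end-to-end sketch (Python with pandas and scikit-learn assumed; every column name and value is invented for illustration) asking which behavioral attributes most inform a pass/fail classification.

  import pandas as pd
  from sklearn.tree import DecisionTreeClassifier, export_text

  df = pd.DataFrame({
      "logins_per_week": [1, 7, 4, 0, 6, 2, 5, 1],
      "forum_posts":     [0, 9, 3, 0, 7, 1, 4, 0],
      "quiz_average":    [55, 92, 78, 40, 88, 60, 81, 50],
      "passed":          ["no", "yes", "yes", "no", "yes", "no", "yes", "no"],
  })

  X, y = df.drop(columns="passed"), df["passed"]
  tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
  print(export_text(tree, feature_names=list(X.columns)))
  print(dict(zip(X.columns, tree.feature_importances_)))  # attribute influence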

Examples

Online, there are a number of resources on decision trees.

How To

There are many nuances about how to set up a decision tree for analytics, dependent in part on the questions of interest and the underlying data. The steps in this module are general ones.

Also, this is a draft by an early user of decision trees, so it is highly possible that there are nuances that are being missed altogether.

Possible Pitfalls

Decision trees, like all machine learning methods, have their strengths and weaknesses. If anything, it would be good to experiment broadly, read broadly, and try other machine learning / data mining methods to see what seems to work best. (For most practical purposes, though, a certain level of accuracy may simply suffice.)

There are any number of things which can affect the efficacy of decision trees. The quality of the data can be a major factor in the induced decision tree.

Presenting decision trees can take finesse...so users can understand what they mean and how much / little confidence to put into these.

Also, decision trees show data patterns. There are contexts where those patterns may not generalize beyond the data, which limits external validity.

Also, depending on the context, anti-patterns may be more relevant than patterns...so analyzing what is not seen...and what is not there...may be more informative than describing what is there.

Module Post-Test

1. What are decision trees, and how are these used in analyzing data?

2. What are some basics to pre-processing or cleaning data for decision tree induction? What are various parameter settings that affect the information that may be collected from decision trees?

3. How are decision tree classification models validated or not? What is external validity for decision tree models?

4. What are various types of decision trees that may be created using RapidMiner Studio (with a free educational license)?

5. What are some ways that online learning data can be analyzed using decision trees (as a form of machine learning)?

References

“Who Invented the Decision Tree?” (n.d.) CrossValidated, StackExchange. Retrieved Aug. 22, 2017, from https://stats.stackexchange.com/questions/257537/who-invented-the-decision-tree.

Extra Resources

"Using Decision Trees to Analyze Online Learning Data." (2017, Aug.) SlideShare. Retrieved Aug. 22, 2017, from https://www.slideshare.net/ShalinHaiJew/using-decision-trees-to-analyze-online-learning-data.