Item Analysis of Objective Online Exams

Module Summary

In the online learning space in higher education, true/false and multiple-choice exams are commonly used to test for “objective” knowledge. After summarizing some informal ways that such assessments are created, this module reviews some item analysis research and models in order to identify ways to improve the value of such assessments. It also reviews how to create an item analysis report in the Canvas learning management system (LMS) and how to understand the resulting data. Finally, this module offers some ways to create objective online exams informed by learner performance (and other streams of data).

Takeaways

Learners will...

  • consider how “objective” online exams may be created informally
  • review some item analysis research and some models of item analysis research
  • create an item analysis report in the Canvas LMS (learning management system)
  • review how to understand the data in an item analysis report in the Canvas LMS
  • consider improved ways to create objective online exams that are informed by learner performance on those exams

Module Pretest

1. How are “objective” online exams created informally?

2. What is item analysis research? What are some models of item analysis research?

3. How are item analysis reports created in the Canvas learning management system (LMS)?

4. How do you understand the item analysis data in the Canvas LMS?

5. What are improved ways to create objective online exams informed by learner performance on those exams?


Main Contents

1. How are “objective” online exams created informally?

If one were to generalize how instructors approach objective assessments—those that assess student knowledge and skills against an assumed base of objective information, usually in reference to true-false assessments and multiple-choice exams—the sequence could go something like the following:

  • reviewing the taught information
  • drafting the true/false and multiple-choice questions
  • deploying the exam

After the exam, if they find questions that a majority of students missed, they may remove those questions, or they may adjust their teaching to reinforce the material around those particular issues.

Randomizing the test questions. Those who have an appetite for more complexity may do the following:

  • categorizing the questions by level of difficulty (or by topic)
  • randomizing the questions in the particular categories

This approach helps ensure that learners experience different versions of the assessment that are roughly equivalent in difficulty and coverage.
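For instructors comfortable with a little scripting, the categorize-and-randomize approach can be sketched outside the LMS. The following Python sketch is illustrative only; the pool names, question identifiers, and blueprint counts are hypothetical placeholders, not Canvas data.

import random

# Hypothetical question pools, keyed by Bloom's-style difficulty category.
question_pools = {
    "level_1_recall":      ["q01", "q02", "q03", "q04", "q05"],
    "level_3_application": ["q06", "q07", "q08", "q09"],
    "level_5_evaluation":  ["q10", "q11", "q12"],
}

# Every learner's version draws the same number of items from each pool,
# so the versions stay roughly comparable in difficulty and coverage.
blueprint = {"level_1_recall": 2, "level_3_application": 2, "level_5_evaluation": 1}

def build_exam(pools, counts, seed=None):
    rng = random.Random(seed)
    exam = []
    for category, count in counts.items():
        exam.extend(rng.sample(pools[category], count))
    rng.shuffle(exam)
    return exam

print(build_exam(question_pools, blueprint, seed=42))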

The typical teaching -> objective assessment time sequence. Expressed visually, the sequence may go something like the following timeline.



[Image: AssessmentPractice.jpg]


Note that the “objective assessment” may be either formative or summative (or a mix of both). A formative assessment is generally designed to promote and support the learner’s learning. An effective formative assessment points back to materials already covered and reinforces the important takeaways that learners should retain. Such assessments help learners study and retain what is relevant in the content they have covered. A summative assessment is generally designed to understand where the learner is in the developmental learning sequence for that topic.

General desirable characteristics of objective assessments. Such assessments should generally have the following characteristics:

  • fair for all learners (through clear writing, through fair design)
  • rigorous for learning (generally non-guessable with appropriate distractors)
  • accessible (usable by people with differing levels of perception)
  • supportive for learning
  • accurate to the contents
  • easy to deploy (such as through an LMS platform)
  • easy to assess (optimally through computational means)
  • defensible in cases where learners or others may want to contest the contents

Question difficulty. The general concept of “question difficulty” is important because it bears directly on the informational equivalency of questions in an objective assessment.

One widely used approach involves Bloom’s Taxonomy, depicted in the image below as six levels of difficulty. The idea is the following:

  • a Level 1 question may only require rote memorization
  • a Level 2 question may only require comprehension
  • a Level 3 question may only require application of a concept or idea
  • a Level 4 question may only require analysis of a concept or idea
  • a Level 5 question may only require evaluation of a concept or idea
  • a Level 6 question may only require creation of a new idea

Note that, in this format, each level has to be expressed as a multiple-choice question. In the current age, multiple-choice questions may include rich prompts: video, audio, interviews, case studies, scenarios, images, data visualizations, graphs, and others. They may include complex multimedia in the answers. (Objective assessments may also include “constructed-response” items: short answer, essay, file upload, performance assessment, and other types of information elicitations…but these go beyond this module.)


[Image: BloomsTaxonomyLevelofQuestionDifficulty.jpg]


Bloom’s Taxonomy relates to the cognitive load of the assumed learning and the recovery of that learning in a test environment. “Intrinsic cognitive load” refers to the inherent complexity of the learning task itself, and “germane cognitive load” refers to the amount of mental processing needed to acquire the learning, to make sense of it (such as by organizing it into structures known as mental “schemas”), and to store the new learning in long-term memory. [“Extraneous cognitive load” refers to the amount of effort needed to learn something based on the presentation of that information—and this is what goes to instructional design, which strives to align the learning with human perception, cognition, and learning.]

The “science of test design” is complex, and the main insight here is that it is far more difficult than subject matter experts (SMEs) might assume. Off-the-cuff test designs have their limits, as does the blind inheritance of others’ designs (whether human- or machine-generated).


2. What is item analysis research? What are some models of item analysis research?

In the educational context, item analysis research refers to the study of learner performance on true/false and multiple-choice questions (in “objective” assessments) in order to better understand the efficacy or inefficacy of the respective questions and of the assessment as a whole. The items in the analysis are the respective questions (and their constituent parts): the stem, the answer, and the distractors (the incorrect options that learners can select). The main point of an assessment, in an educational context, is to accurately measure the respondent’s level of understanding and knowledge. If an assessment item is poorly designed, however, it may bias learner responses and add noise to the assessment data. Noise diminishes the consistency of the assessment’s measurements, a quality known as “reliability.” (This item analysis is not to be confused with item analyses of surveys and questionnaires, which may use some of the same methods but which often rest on very different assumptions.) In terms of actual analyses, there are various approaches.

Various theories have been applied to the analysis of assessments: “measurement theory,” “cognitive structure,” and others. It is beyond the purview of this work to explore the applied theories further, but they are mentioned here in case readers may want to pursue more exploration on their own.

One common approach is a rule-based one. Another, which is the main subject of this module, is statistical (and built into learning management systems). A summary of each follows.

Rules for item creation for objective assessments. The rules for item creation for objective assessments have been studied for many decades and continue to be studied today. There are commonly accepted rules for how test writers should approach the work. The resulting assessments are analyzed in various ways—both rule-based and statistical (based on learner responses). This section summarizes some of these basic rules. An updated taxonomy of multiple-choice item-writing guidelines follows verbatim:

A Revised Taxonomy of Multiple-Choice (MC) Item-Writing Guidelines
Content concerns
1. Every item should reflect specific content and a single specific mental behavior, as called for in test specifications (two-way grid, test blueprint).
2. Base each item on important content to learn; avoid trivial content.
3. Use novel material to test higher level learning. Paraphrase textbook language or language used during instruction when used in a test item to avoid testing for simply recall.
4. Keep the content of each item independent from content of other items on the test.
5. Avoid over specific and over general content when writing MC items.
6. Avoid opinion-based items.
7. Avoid trick items.
8. Keep vocabulary simple for the group of students being tested.
Formatting concerns
9. Use the question, completion, and best answer versions of the conventional MC, the alternate choice, true-false (TF), multiple true-false (MTF), matching, and the context-dependent item and item set formats, but AVOID the complex MC (Type K) format.
10. Format the item vertically instead of horizontally.
Style concerns
11. Edit and proof items.
12. Use correct grammar, punctuation, capitalization, and spelling.
13. Minimize the amount of reading in each item.
Writing the stem
14. Ensure that the directions in the stem are very clear.
15. Include the central idea in the stem instead of the choices.
16. Avoid window dressing (excessive verbiage).
17. Word the stem positively, avoid negatives such as NOT or EXCEPT. If negative words are used, use the word cautiously and always ensure that the word appears capitalized and boldface.
Writing the choices
18. Develop as many effective choices as you can, but research suggests three is adequate.
19. Make sure that only one of these choices is the right answer.
20. Vary the location of the right answer according to the number of choices.
21. Place choices in logical or numerical order.
22. Keep choices independent; choices should not be overlapping.
23. Keep choices homogeneous in content and grammatical structure.
24. Keep the length of choices about equal.
25. None-of-the-above should be used carefully.
26. Avoid All-of-the-above.
27. Phrase choices positively; avoid negatives such as NOT.
28. Avoid giving clues to the right answer, such as
a. Specific determiners including always, never, completely, and absolutely.
b. Clang (sic) associations, choices identical to or resembling words in the stem.
c. Grammatical inconsistencies that cue the test-taker to the correct choice.
d. Conspicuous correct choice.
e. Pairs or triplets of options that clue the test-taker to the correct choice.
f. Blatantly absurd, ridiculous options.
29. Make all distractors plausible.
30. Use typical errors of students to write your distractors.
31. Use humor if it is compatible with the teacher and the learning environment.” (Haladyna, Downing, & Rodriguez, 2002, p. 312)

Violations in item design do have real-world consequences:

”The purpose of this research was to study the effects of violations of standard multiple-choice item writing principles on test characteristics, student scores, and pass–fail outcomes. Four basic science examinations, administered to year-one and year-two medical students, were randomly selected for study. Test items were classified as either standard or flawed by three independent raters, blinded to all item performance data. Flawed test questions violated one or more standard principles of effective item writing. Thirty-six to sixty-five percent of the items on the four tests were flawed. Flawed items were 0–15 percentage points more difficult than standard items measuring the same construct. Over all four examinations, 646 (53%) students passed the standard items while 575 (47%) passed the flawed items. The median passing rate difference between flawed and standard items was 3.5 percentage points, but ranged from −1 to 35 percentage points. Item flaws had little effect on test score reliability or other psychometric quality indices. Results showed that flawed multiple-choice test items, which violate well established and evidence-based principles of effective item writing, disadvantage some medical students. Item flaws introduce the systematic error of construct-irrelevant variance to assessments, thereby reducing the validity evidence for examinations and penalizing some examinees” (Downing, 2005, p. 133).


Behind all of the suggestions are logically arrived-at observations. For example, one author describes why a test writer should not use “all of the above” as an option, in part, because it can give away the answer without requiring actual learning by the student:

”7. Don't use "all of the above." Recognition of one wrong option eliminates "all of the above," and recognition of two right options identifies it as the answer, even if the other options are completely unknown to the student. Probably some instructors use items with "all of the above" as yet another way of extending their teaching into the test (see 2 above). It just seems so good to have the students affirm, say, all of the major causes of some phenomenon. With this approach, "all of the above" is the answer to almost every item containing it, and the students soon figure this out.” (Frary, 1995, pp. 3 - 4)

“None of the above,” though, may be helpful:

”8. Do ask questions with "none of the above" as the final option, especially if the answer requires computation. Its use makes the question harder and more discriminating, because the uncertain student cannot focus on a set of options that must contain the answer. Of course, "none of the above" cannot be used if the question requires selection of the best answer and should not be used following a negative stem. Also, it is important that "none of the above" should be the answer to a reasonable proportion of the questions containing it.” (Frary, 1995, pp. 3 - 4)

Many of the suggestions are supported by empirical research, though there are not “equal degrees of evidence” supporting each of the ideas (Haladyna, Downing, & Rodriguez, 2002, p. 325). Various publishers of academic assessment content have listed their respective guidelines, and the most popular ten are as follows, in descending order: single content and behavior; important, not trivial content; use novel material; keep items independent; avoid over specific/general; avoid opinions; avoid trick items; (use) simple vocabulary; format vertically; and edit and proof (Haladyna, Downing, & Rodriguez, 2002, p. 314).

A common approach in analyzing an objective assessment is to identify any violations of standard item-writing principles.
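For illustration, a few of these rules can even be checked mechanically. The following Python sketch screens an item against a handful of the guidelines above (negatives in the stem, “all of the above,” absolute terms in the choices); it is a rough, partial illustration rather than a substitute for expert review, and the example item is hypothetical.

ABSOLUTE_TERMS = {"always", "never", "completely", "absolutely"}

def check_item(stem, choices):
    """Return a list of guideline concerns found in one multiple-choice item."""
    findings = []
    lowered_stem = stem.lower()
    if " not " in f" {lowered_stem} " or "except" in lowered_stem:
        findings.append("negative wording in stem")
    for choice in choices:
        lowered = choice.lower().strip(".")
        if lowered == "all of the above":
            findings.append("uses 'all of the above'")
        if ABSOLUTE_TERMS & set(lowered.split()):
            findings.append(f"absolute term in choice: {choice!r}")
    return findings

# A deliberately flawed (hypothetical) item to show the checks firing.
print(check_item(
    "Which of the following is not a level in Bloom's Taxonomy?",
    ["Analysis", "Memorization always comes first", "All of the above"],
))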

Some statistical analysis approaches to item analysis. To look at statistical analysis applied to item analysis, it helps to begin with its main underlying theory. One caveat: There are many more advanced statistical methods applied in this space today—including forms of machine learning—so this module should only be understood as an elementary one.

Item response theory. Where classical test theory focuses on the entire assessment, item response theory focuses at a more granular level: the specific item (question) within the larger assessment.

How a person responds to an item in an objective assessment may be revelatory about the test taker, whether about a more long-term characteristic (like a trait) or a level of learning (which may be more transient). One theoretical approach that underlies these analyses is “item response theory” (also known as “latent trait theory,” “strong true score theory,” and “modern mental test theory”); in psychometrics, IRT “is a paradigm for the design, analysis, and scoring of tests, questionnaires, and similar instruments measuring abilities, attitudes, or other variables” (“Item Response Theory,” Sept. 6, 2017). IRT is informed by statistical and logical analysis approaches:

”IRT is based on the idea that the probability of a correct/keyed response to an item is a mathematical function of person and item parameters. The person parameter is construed as (usually) a single latent trait or dimension. Examples include general intelligence or the strength of an attitude. Parameters on which items are characterized include their difficulty (known as "location" for their location on the difficulty range), discrimination (slope or correlation) representing how steeply the rate of success of individuals varies with their ability, and a pseudoguessing parameter, characterising the (lower) asymptote at which even the least able persons will score due to guessing (for instance, 25% for pure chance on a multiple choice item with four possible responses)” (“Item Response Theory,” Sept. 6, 2017)

For more on item response theory, which may apply to a variety of educational and non-educational contexts, there are various publicly available readings.
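As a concrete illustration of the parameters named in the quotation above, the three-parameter logistic (3PL) model is one common IRT formulation: the probability of a correct response is a function of the person’s ability and the item’s difficulty, discrimination, and pseudo-guessing floor. The following Python sketch uses illustrative parameter values only.

import math

def p_correct(theta, a, b, c):
    """3PL model: probability of a correct response for ability theta,
    discrimination a, difficulty b, and pseudo-guessing floor c."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

# A hypothetical four-option item: guessing floor near 0.25, moderate discrimination.
for theta in (-2.0, 0.0, 2.0):
    print(f"ability {theta:+.1f}: P(correct) = {p_correct(theta, a=1.2, b=0.5, c=0.25):.3f}")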

Some research in item analysis focuses on fair scoring. For example, in an assessment with a “don’t know” (or non-applicable) option, how an instructor arrives at a final score matters. To simplify, a final score can be calculated from “number right,” or it can be calculated through “formula scoring” (such as the number of correct answers minus a penalty based on the number of incorrect answers). In a number-right situation, learners do better to answer every question, even when guessing, because a guess offers a chance at credit, whereas a non-attempt is a sure zero. Formula scoring results in higher reliability of the instrument (because it serves as a corrective against guessing and answering on partial knowledge), but “number right” tends to have less bias against learners who are less willing to guess in assessments that include “don’t know.” (Muijtjens, van Mameren, Hoogenboom, Evers, & van der Vleuten, 1999, p. 267) The authors explain further:

“With formula scoring the examinee is stimulated to investigate whether he has enough knowledge of a subject to answer an item properly, i.e. he is made aware of possible gaps in his knowledge. The don’t-know option offers the teacher information about the quality of the item. A relatively large percentage of don’t-know answers may indicate that the item does not belong to the domain of the examination, that it was erroneously not part of the educational programme, or that it was poorly formulated. With formula scoring the examinee is not encouraged to guess when he feels insecure about answering an item. Whether this is wrong or right has been debated in Harden et al. The opposing opinions can be summarized as follows: ‘medical professionals should not be stimulated to react with guessing when they are faced with a lack of knowledge’ vs. ‘in medical practice the doctor frequently has to make a decision on only partially complete information’.” (Muijtjens, van Mameren, Hoogenboom, Evers, & van der Vleuten, 1999, p. 268)
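To make the scoring contrast concrete, the following Python sketch compares number-right scoring with one common correction-for-guessing formula (score = right − wrong / (options − 1)). This is a generic illustration rather than the exact scoring rule used in the cited study, and the two learner profiles are hypothetical.

def number_right(right, wrong, omitted):
    """Score is simply the count of correct answers; omits and wrongs both earn 0."""
    return right

def formula_score(right, wrong, omitted, options_per_item=4):
    """Correction for guessing: wrong answers carry a penalty, omits ('don't know') do not."""
    return right - wrong / (options_per_item - 1)

# Hypothetical 20-item exam with a "don't know" option:
# one learner guesses on every unknown item, the other omits them.
guesser = (14, 6, 0)    # right, wrong, omitted
omitter = (12, 0, 8)
for label, counts in (("guesser", guesser), ("omitter", omitter)):
    print(label, "number-right:", number_right(*counts), "formula:", formula_score(*counts))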

A portion of the research literature in this area examines more domain-specific contexts: healthcare, accounting, and others. There are also analyses of particular high-stakes standardized assessments. One approach examined how to assess how closely learners think like experts in science-based contexts (Adams & Wieman, Oct. 27, 2010, p. 1289). As part of this work, the authors suggest a multi-stream method for collecting information to inform the creation of such an educational assessment (to assess expert thinking):

”These can be done through a combination of the following four methods: (1) reflect on your own goals for the students; (2) have casual conversations or more formal interviews with experienced faculty members on the subject; (3) listen to experienced faculty members discussing students in the course; and (4) interview subject matter experts” (Adams & Wieman, Oct. 27, 2010, p. 1293).

Three, four, or five answer options? One long-debated question is how many answer options a multiple-choice question should have, especially since writing several equally plausible alternatives can be difficult. Common practice has been to use four or five options in order to head off random guessing, but the general research consensus is that three options offer sufficient power to discriminate among levels of respondent learning. From an information-theoretic perspective, three options appear to suffice (Bruno & Dirkzwager, Dec. 1995, p. 964). This finding is supported psychometrically as well:

”More 3-option items can be administered than 4- or 5-option items per testing time while improving content coverage, without detrimental effects on psychometric quality of test scores. Researchers have endorsed 3-option items for over 80 years with empirical evidence—the results of which have been synthesized in an effort to unify this endorsement and encourage its adoption.” (Rodriguez, Summer 2005, p. 3)
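A quick arithmetic check helps explain the tradeoff: the expected success rate from pure random guessing on a k-option item is 1/k, so fewer options raise the guessing floor somewhat, while (per the research above) three well-written options can still discriminate adequately. A tiny Python sketch:

for k in (3, 4, 5):
    print(f"{k} options: chance-level success from blind guessing = {1 / k:.0%}")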

Sequentiality of items. In general, at a high level, an assessment moves from simple to complex, in order to prime learners and to help them build up to the more difficult processes. Other research questions concern the effect of item sequentiality on performance. For example, does having a harder question just prior to a given question affect a person’s performance on the latter question?

One study did not find that placing a harder question immediately before a given item affected performance on that latter item (Huck & Bowers, Summer 1972).

Mix of assessment item types. There are studies of assessments with both multiple-choice options and “constructed response” questions (which refer to open-ended questions, short-answer questions, performance assessments, file upload types, and others).

More detail on the specific variables in a general item analysis follows in Questions 3 and 4 of this module.


3. How are item analysis reports created in the Canvas learning management system (LMS)?

To create an item analysis report, after the online exam has been created and students have taken it, go to Quizzes or Assignments (either directly or through the Gradebook). Select the target quiz or assignment. (Remember that it must contain true/false or multiple-choice questions.) Click on the “Survey Statistics” button at the top right to reach the Quiz Summary page. At the top right are three buttons: Section Filter, Student Analysis, and Item Analysis. Select “Item Analysis.” A report will be generated, and the results will be downloaded as a .csv file. Access this file via the Downloads folder (on a Windows machine), or click on the downloaded file in the “Show all” menu at the bottom of the browser (if available).

The digital breadcrumbs trail may go at least in the following two ways (and possibly others):

Course -> Quizzes (or Assignments) -> Assessment (quiz) -> Statistics -> Item Analysis

Course -> Grades -> Quizzes (or Assignments) -> Assessment (quiz) -> Statistics -> Item Analysis

The .csv (comma-separated values) data table will have the following column headers (as relevant), with the contents of each described below:

Question ID: a unique numerical identifier for each question

Question Title: either a name for the question (if assigned) or the body of the question

Answered Student Count: the number of students who responded to the question (vs. those who did not respond)

Top Student Count: the number of students in the top 27% of scorers

Middle Student Count: the number of students in the middle 46% of scorers

Bottom Student Count: the number of students in the bottom 27% of scorers

Quiz Question Count: the number of quiz questions in the target assessment

Correct Student Count: the number of learners who got the question right

Wrong Student Count: the number of learners who got the question wrong

Correct Student Ratio: the percentage of students who got the question right (expressed as a decimal)

Wrong Student Ratio: the percentage of students who got the question wrong (expressed as a decimal)

Correct Top Student Count: the number of students in the top 27% who got the answer right

Correct Middle Student Count: the number of students in the middle 46% who got the answer right

Correct Bottom Student Count: the number of students in the bottom 27% who got the answer right

Variance: the divergence (spread) of scores for the particular question, indicating whether the scores cluster (low variance) or disperse (high variance)

Standard Deviation: a summary measure to indicate the amount of deviation of the set of total scores

Difficulty Index (or “p-value”): the proportion of those who attempt the question and who arrive at the correct answer (with 0 meaning that none of the learners got the question correct and 1 meaning that all got the question correct, with incremental values in between)

Alpha: Cronbach’s alpha, representing the internal consistency among test items (this only applies when there is more than one item in the analysis; otherwise, the value will be “N/A”)

Point Biserial of Correct: a coefficient or index that represents the degree to which the correct performance on the item is correlated (based on the Pearson Product Moment correlation) with performance on all the other items in the assessment (assuming the assessment itself focuses on a particular construct)

Point Biserial of Distractor 2: a coefficient or index that represents the degree to which the selection of the first incorrect distractor on the item is correlated (based on the Pearson Product Moment correlation) with performance on all the other items in the assessment (assuming the assessment itself focuses on a particular construct)

Point Biserial of Distractor 3: a coefficient or index that represents the degree to which the selection of the second incorrect distractor on the item is correlated (based on the Pearson Product Moment correlation) with performance on all the other items in the assessment (assuming the assessment itself focuses on a particular construct)

(Canvas Quiz Item Analysis, 2017, p. 2)
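As a concrete illustration of what the point-biserial columns represent, the coefficient is the Pearson correlation between a dichotomous (0/1) item score and students’ total scores. The following Python sketch computes it from hypothetical data; note that whether the item itself is excluded from the total score varies by implementation.

import math

def point_biserial(item_scores, total_scores):
    """Pearson correlation between 0/1 item scores and total scores."""
    n = len(item_scores)
    mean_i = sum(item_scores) / n
    mean_t = sum(total_scores) / n
    cov = sum((i - mean_i) * (t - mean_t) for i, t in zip(item_scores, total_scores)) / n
    sd_i = math.sqrt(sum((i - mean_i) ** 2 for i in item_scores) / n)
    sd_t = math.sqrt(sum((t - mean_t) ** 2 for t in total_scores) / n)
    return cov / (sd_i * sd_t)

# Hypothetical data: 1 = answered this item correctly, paired with total quiz scores.
item   = [1, 1, 0, 1, 0, 0, 1, 0]
totals = [18, 16, 9, 17, 11, 8, 15, 10]
print(round(point_biserial(item, totals), 2))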

Each row of the data table represents one item (a true/false or multiple-choice question) in the “objective” assessment.
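For those who prefer to work with the export directly, the following Python sketch loads the downloaded .csv and prints a quick per-item summary. It assumes the column headers appear exactly as listed above (e.g., “Difficulty Index,” “Point Biserial of Correct”) and uses a hypothetical filename; verify both against your own export.

import csv

with open("Quiz Item Analysis.csv", newline="", encoding="utf-8") as f:  # hypothetical filename
    for row in csv.DictReader(f):
        print(
            row["Question Title"][:40],
            "| difficulty:", row["Difficulty Index"],
            "| discrimination:", row["Point Biserial of Correct"],
        )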


4. How do you understand the item analysis data in the Canvas LMS?

Based on the information captured in an item analysis report in the Canvas LMS, in combination with the “Survey Statistics” reports (particularly the Question Breakdown), it is possible to do the following (a small scripted screen applying these checks appears after the list):

  • identify items that have a point-biserial correlation (discrimination index) of at least .15 with the rest of the assessment; items below this threshold should be revised, replaced, or omitted altogether without replacement
  • identify items that have difficulty indexes in the range of 30 – 70% (by general practice); some suggest that the ideal difficulty level should range from 70 – 85% success (“Understanding Item Analysis,” 2017)
  • identify distractors that are not selected by learners; research suggests that unselected distractors should be “replaced or eliminated” because of their lack of discriminative effect (Kehoe, Oct. 1995)
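Continuing the .csv sketch from Question 3, the following Python fragment applies the thresholds from the list above as a rough screen. The cutoffs (a .15 discrimination floor, a 30 – 70% difficulty band) are the general-practice values cited, not fixed rules, and the filename is again hypothetical.

import csv

def flag_items(path):
    """Print items whose statistics fall outside the rough screening thresholds."""
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            difficulty = float(row["Difficulty Index"])            # proportion correct, 0-1
            discrimination = float(row["Point Biserial of Correct"])
            flags = []
            if discrimination < 0.15:
                flags.append("low discrimination (< .15)")
            if not 0.30 <= difficulty <= 0.70:
                flags.append("difficulty outside 30-70%")
            if flags:
                print(row["Question Title"][:40], "->", "; ".join(flags))

flag_items("Quiz Item Analysis.csv")  # hypothetical filename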

More information about assessment “reliability,” “difficulty,” and “discrimination” is available in the Canvas Quiz Item Analysis documentation (Canvas Quiz Item Analysis, 2017, pp. 1 - 2). The Canvas Quiz Item Analysis report also has some limitations:

”To maintain optimum course performance in the Canvas interface, the maximum values for calculation are 1000 submissions or 100 questions. For instance, a quiz with 200 questions will not generate quiz statistics. However, a quiz with 75 questions will generate quiz statistics until the quiz has reached 1000 attempts.”

In other words, it helps to go through this item analysis process in multiple ways to gain a sense of the nuances…before making public and high-stakes assertions.


5. What are improved ways to create objective online exams informed by learner performance on those exams?

“Item-writing is an art. It requires an uncommon combination of special abilities. It is mastered only through extensive and critically supervised practice. It demands, and tends to develop, high standards of quality and a sense of pride in craftsmanship” (Ebel, 1951, p. 185, as cited in Rodriguez, Summer 2005, p. 3).
“If assessments are to provide meaningful information about student ability, then cognition must be incorporated into the test development process much earlier than in data analysis” (Gorin, Dec. 2006, p. 21).

The art and science of test writing is more complex than this particular module might suggest. There are a number of dependencies on domain knowledge and related professional practices that may inform assessment writing.

“All assessments require evidence of the reasonableness of the proposed interpretation, as test data in education have little or no intrinsic meaning.” -- Steven M. Downing in “Validity: On the meaningful interpretation of assessment data” (2003)

How learner performance is assessed, interpreted, and understood has implications for teaching and learning, for the affordances offered to student learning, and for other factors.

What are some general takeaways?

  • It helps to be mindful in creating multiple-choice questions. Assessments should be drafted intelligently and in alignment with “best practices” (based on logic and empirical evidence).
  • It helps to be able to categorize multiple-choice questions into different levels of difficulty based on Bloom’s Taxonomy, particularly if the items will be used in randomized testing (so that the questions delivered to learners in a randomized way are “equivalent”).
  • Assessments, whether they are inherited or self-created, should be assessed and updated as needed.
  • Assessments do not originate or exist in vacuums. They emerge from teaching and learning contexts. A professional instructor has to be aware of how the learning objectives, instructional design, learning experiences, and assessments all come together for the advancement of the learner(s), the program, the college / university, and the field. For every rule, too, there are exceptions, so it is important to avoid being doctrinaire and to approach this work thoughtfully.

Examples

Because of the context-dependence of such formal item analyses and the importance of nuanced and in-depth analytical work, there are no direct examples shared here.

How To

The “how-to” depends on the local context, the objectives of the assessment, the teaching style and signature of the instructor, the learning contents, the available digital resources, the LMS, and other factors. A few general points may be made though.

Those with pre-existing objective assessments can start by running an item analysis report from the Canvas LMS and learning what they can from those findings. They may begin early non-destructive revisions and edits. They may remove the true/false questions and write in more discriminative multiple-choice questions.

Those who have inherited objective assessments may want to run an item analysis inside the LMS after the assessment has been taken by learners in order to identify questions that require revision or removal (non-destructive deletion). A number of book publishers (content creators) offer free test banks, and many of these have not been assessed with live learners. (Even if they were, how an instructor teaches will affect learner performance on an assessment to some degree. Local conditions matter in how objective assessments are deployed.) Also, among the various sources of objective test items, in particular domains there is automatic multiple-choice item generation through computer software (Gierl, Lai, & Turner, 2012), based on initial subject matter expert (SME) inputs. Whether objective assessments are created by humans or computers, multiple research studies suggest that there is room for improvement. An audit of ten publisher-provided test banks, conducted to see how closely the items aligned with consensus-based rules for quality item writing, found that 75% of the questions had guideline violations (Hansen & Lee, Nov./Dec. 1997, pp. 94 - 95). For example, were the stems written in “simple, clear language” (if yes, good), and were “absolute terms” (like “always” and “never”) used in the distractors (if no, good)? (Hansen & Lee, Nov./Dec. 1997, pp. 94 - 95)

Those who are writing new objective assessments may first increase their streams of information about the learners through various means—interviews, tutoring sessions, teaching assistant feedback, and others—and write multiple-choice questions that are thought through…that have appropriate distractors (to avoid “gimme” questions)…and that meet standards of fairness and accessibility. They need to ensure that randomized assessments draw questions from categories at the same level of difficulty and that the various randomized versions cover the material sufficiently, without gaps—for both formative and summative applications. The draft assessments should be tested with colleagues, learners, and others, and then the necessary revisions should be made. Certainly, there should be a comprehensive walk-through of each assessment in its randomized forms to test for full functionality. The walk-throughs should also include checking the outcomes, in terms of scores and the feedback returned to learners.

Possible Pitfalls

This module takes a generalist approach to item analysis, and depending on the learning domain, some of this information may be more or less relevant. The acceptable metrics for item analysis vary by field, and some of those in use may be applied somewhat arbitrarily.

As to revisions of objective assessments, it may be important to use an assessment for several rounds before making major changes…or at least to use non-destructive approaches to revising objective assessments until more permanent decisions can be made.

Module Post-Test

1. How are “objective” online exams created informally?

2. What is item analysis research? What are some models of item analysis research?

3. How are item analysis reports created in the Canvas learning management system (LMS)?

4. How do you understand the item analysis data in the Canvas LMS?

5. What are improved ways to create objective online exams informed by learner performance on those exams?


References

Adams, W.K. & Wieman, C.E. (2010, Oct. 27). Development and validation of instruments to measure learning of expert-like thinking. International Journal of Science Education: 33(9), 1289 – 1312.

Bruno, J.E. & Dirkzwager, A. (1995, Dec.) Determining the optimal number of alternatives to a multiple-choice test item: An information theoretic perspective. Educational and Psychological Measurement: 55(6), 959 – 966.

Canvas Quiz Item Analysis. (2017). Instructure. https://s3.amazonaws.com/tr-learncanvas/docs/CanvasQuizItemAnalysis.pdf.

Downing, S.M. (2003). Validity: On the meaningful interpretation of assessment data. Medical Education: 37, 830 – 837.

Downing, S.M. (2005). The effects of violating standard item writing principles on tests and students: The consequences of using flawed test items on achievement examinations in medical education. Advances in Health Sciences Education: 10, 133 – 143.

Frary, R.B. (1995, Oct.) More multiple-choice item writing do’s and don’ts. ERIC/AE Digest. 1 – 6.

Gierl, M.J., Lai, H., & Turner, S.R. (2012). Using automatic item generation to create multiple-choice test items. Medical Education: 46, 757 – 765.

Gorin, J.S. (2006, Dec.) Test design with cognition in mind. Educational Measurement: Issues and Practice: 25(4), 21 – 35.

Haladyna, T.M., Downing, S.M., & Rodriguez, M.C. (2002). A review of multiple-choice item-writing guidelines for classroom assessment. Applied Measurement in Education: 15(3), 309 – 334.

Hansen, J. & Lee, D. (1997, Nov./Dec.) Quality multiple-choice test questions: Item-writing guidelines and an analysis of auditing testbanks. Journal of Education for Business: 73(2), 94 – 97.

Huck, S.W. & Bowers, N.D. (1972, Summer). Item difficulty level and sequence effects in multiple-choice achievement tests. Journal of Educational Measurement: 9(2), 105 – 111.

Item response theory. (2017, Sept. 6). Wikipedia. https://en.wikipedia.org/wiki/Item_response_theory.

Kehoe, J. (1995, Oct.) Basic item analysis for multiple-choice tests. ERIC/AE Digest. 1 – 5.

Muijtjens, A.M.M., van Mameren, H., Hoogenboom, R.F.I., Evers, F.L.H., & van der Vleuten, C.P.M. (1999). The effect of a ‘don’t know’ option on test scores: Number-right and formula scoring compared. Medical Education: 33(4), 267 – 75.

Rodriguez, M.C. (2005, Summer). Three options are optimal for multiple-choice items: A meta-analysis of 80 years of research. Educational Measurement: Issues and Practice. 3 – 13.

“Understanding Item Analysis.” (2017). Office of Educational Assessment, University of Washington. http://www.washington.edu/assessment/scanning-scoring/scoring/reports/item-analysis/.

Extra Resources

“Item Analysis in Quizzes” https://community.canvaslms.com/videos/1177

“Item Response Theory.” https://en.wikipedia.org/wiki/Item_response_theory