Copyright © 2003-2008  The Center for Exercise Physiology.   All Rights Reserved.

 

 

Journal of Professional Exercise Physiology

Vol 6 No 4 April 2008    ISSN 1550-963X

 


Editor-in-Chief:  Larry Birnbaum, PhD, FASEP, EPC
An Internet Electronic Journal Dedicated to
 Exercise Physiology as a Healthcare Profession


Validity of Multiple-Choice Exam Questions
Larry Birnbaum, PhD, FASEP, EPC
Department of Exercise Physiology
The College of St. Scholastica
Duluth, MN  55811

A previous article listed guidelines for writing multiple-choice questions [1].  Following those guidelines will improve the quality of multiple-choice questions, but will not necessarily ensure their validity.  What makes an exam question valid?  Put simply, a test item (question) can be considered valid if it covers the material it is supposed to cover, and if students who know the material answer it correctly while students who do not know the material answer it incorrectly.  After a test is written and administered, item analysis is commonly performed to help determine which questions might need revision or might even need to be discarded.

Two common item analysis statistics are the difficulty index and the discrimination index.  The difficulty index is the percentage of students who selected the correct response.  For example, if 15 of 20 students chose the correct answer, the difficulty index for that question is 75%.  Because a higher value means that more students answered correctly, some may prefer to call it the easy index.  An obvious question is, “What level of difficulty is ideal?”  The answer varies with the nature and goals of the course and of the exam itself.  Generally, for tests that are intended to differentiate among students, maximum differentiation is achieved with items of moderate difficulty (i.e., a difficulty index of 50-80%) [2].  Difficulty indexes of 20-80% may be considered acceptable [3].  On the other hand, if the purpose of the test is to show students’ levels of content mastery, high difficulty index values (i.e., easy items) should be expected, because most, if not all, students should answer each item correctly.
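
The calculation above can be sketched in a few lines of Python.  This is only an illustration: the function names and the labels returned by describe_difficulty are assumptions made for this example, and the cutoffs simply restate the 50-80% and 20-80% ranges quoted above.  For a mastery test, an index above 80% would be expected rather than a cause for concern.

```python
def difficulty_index(responses, correct_answer):
    """Return the percentage of students who chose the correct answer."""
    if not responses:
        raise ValueError("at least one response is required")
    n_correct = sum(1 for r in responses if r == correct_answer)
    return 100.0 * n_correct / len(responses)


def describe_difficulty(index_pct):
    """Label an item using the ranges quoted in the text (labels are illustrative)."""
    if 50 <= index_pct <= 80:
        return "moderate difficulty"
    if 20 <= index_pct <= 80:
        return "acceptable"
    return "review"


# Example from the text: 15 of 20 students chose the correct answer ("B").
responses = ["B"] * 15 + ["A"] * 3 + ["C"] * 2
index = difficulty_index(responses, "B")
print(index, describe_difficulty(index))  # 75.0 moderate difficulty
```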

If the instructor is not satisfied with the difficulty index (i.e., the item was too easy or too difficult), the instructor should consider why the question failed to perform as desired.  If it is perceived to be too easy, perhaps it only required rote memorization of simple material that was emphasized in class.  Alternatively, it may have contained an item-writing flaw that made selection of the correct answer easy for most students.  For example, the incorrect choices (distractors) may not have seemed plausible to most students.  If the difficulty index is lower than desirable for a question, the instructor should ask whether the material was covered adequately in class.  Is the question written clearly, or is it ambiguous enough that students are not certain what the instructor is really asking?  How similar are the distractors to the correct answer?  Some instructors may feel that a difficulty index of 25% is good because it separates the good students from the poorer-performing students.  This may not be the case.  Students have a 25% chance of guessing the correct answer to a four-response multiple-choice question, so a difficulty index near 25% may reflect little more than guessing.
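
As a rough illustration of the guessing argument, the following Python sketch compares a difficulty index with the chance level for an item.  The function names and the 5-point margin are assumptions chosen for this example, not established cutoffs.

```python
def chance_level_pct(n_options):
    """Percentage of correct answers expected from blind guessing."""
    if n_options < 2:
        raise ValueError("a multiple-choice item needs at least two options")
    return 100.0 / n_options


def is_near_chance(difficulty_pct, n_options, margin_pct=5.0):
    """True if the difficulty index is at or within margin_pct of chance.

    margin_pct is an arbitrary illustrative tolerance, not a published value.
    """
    return difficulty_pct <= chance_level_pct(n_options) + margin_pct


print(chance_level_pct(4))       # 25.0
print(is_near_chance(25.0, 4))   # True
print(is_near_chance(75.0, 4))   # False
```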

The discrimination index will help instructors determine which questions actually do separate the students who perform well on exams from those who do not.  It is calculated by ranking the students according to total test score and then selecting the highest group (e.g., top 27%) and the lowest group (e.g., bottom 27%) in terms of total score [4,5].  For each item, the percentage of students in the upper and lower groups answering correctly is calculated.  The difference is one measure of discrimination.  The formula is:

 Discrimination index = (Upper Group % Correct) – (Lower Group % Correct)

The discrimination index is positive when the response is chosen more frequently by high-scoring students and negative when it is chosen more frequently by low-scoring students.  For example, a value of .70 indicates that the percentage of high-scoring students who marked the response was 70 percentage points higher than the percentage of low-scoring students who did so (e.g., 90% versus 20%).  A value of -.40 indicates that the percentage of low-scoring students who marked the response was 40 percentage points higher than that of high-scoring students.  Guidelines for satisfactory discrimination scores vary somewhat, but negative scores are not tolerable; the question needs to be revised or replaced.  Values of .40 to 1.00 are considered good to excellent, while values of .20 to .39 are acceptable to good but may be evaluated for possible revision [2,4].  A negative value suggests that the students in the upper group were misled by an ambiguity that the students in the lower group, and the item writer, failed to discover [2,3].
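
The upper-group/lower-group procedure can be sketched in Python as follows.  This is a minimal sketch under stated assumptions: the function names are hypothetical, the group size is rounded to the nearest whole student, ties in total score are broken by the order in which students are listed, and the "low" label for values below .20 is an assumption, since the guidelines quoted above do not name that range.

```python
def discrimination_index(total_scores, item_correct, fraction=0.27):
    """Upper-group proportion correct minus lower-group proportion correct.

    total_scores: total test score for each student
    item_correct: True/False for each student on this item (same order)
    fraction: share of students placed in each group (0.27 as in the text)
    """
    n = len(total_scores)
    if n != len(item_correct):
        raise ValueError("inputs must have the same length")
    if n < 2:
        raise ValueError("at least two students are required")
    group_size = max(1, round(n * fraction))
    # Rank student positions from highest to lowest total score.
    ranked = sorted(range(n), key=lambda i: total_scores[i], reverse=True)
    upper = ranked[:group_size]
    lower = ranked[-group_size:]
    upper_correct = sum(1 for i in upper if item_correct[i])
    lower_correct = sum(1 for i in lower if item_correct[i])
    # Both groups are the same size, so the difference in proportions
    # equals the difference in counts divided by the group size.
    return (upper_correct - lower_correct) / group_size


def interpret(d):
    """Label a discrimination index using the ranges quoted in the text."""
    if d < 0:
        return "negative: revise or replace"
    if d >= 0.40:
        return "good to excellent"
    if d >= 0.20:
        return "acceptable to good; consider revision"
    return "low: consider revision"  # assumed label for values below .20


# Hypothetical class of 20 students, listed from highest to lowest total score.
scores = list(range(20, 0, -1))
item_correct = [True] * 4 + [False] + [True] * 5 + [False] * 5 + [True] + [False] * 4
d = discrimination_index(scores, item_correct)
print(d, interpret(d))  # 0.6 good to excellent
```

With 20 students and a 27% cutoff, each group contains 5 students.  In this made-up data set, 4 of the top 5 and 1 of the bottom 5 answered correctly, giving .80 - .20 = .60.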

Suggestions for making exam questions more discriminating include the following [2,6]:  (1) Check the question for unintentional clues that could help the less knowledgeable but test-wise student.  (2) Revise questions that have low discrimination scores.  (3) Revise distractors by using common student errors or misconceptions.  (4) Make distractors more similar to the correct response, which requires finer discrimination by test takers.  (5) Replace or eliminate distractors that are not chosen by any examinees.  They do not contribute to the test's ability to distinguish good students from poor students.  (6) Replace items that virtually everyone gets right with more difficult items, because such items also do not discriminate.
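
Suggestion (5) lends itself to a simple check.  The Python sketch below tallies how often each option was chosen and lists the distractors that no examinee selected.  The function name and the sample responses are hypothetical and serve only as an illustration.

```python
from collections import Counter


def unused_distractors(responses, options, correct_answer):
    """Return the distractors (incorrect options) that no examinee chose."""
    counts = Counter(responses)
    return [opt for opt in options
            if opt != correct_answer and counts[opt] == 0]


# Hypothetical item with options A-D, where "B" is the correct answer.
responses = ["B"] * 15 + ["A"] * 3 + ["C"] * 2
print(unused_distractors(responses, "ABCD", "B"))  # ['D']
```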

Sample size is important when using these item statistics.  The number of test takers should be at least 20 for the statistics to be useful.  When testing small groups, variation can be quite wide among responders.  If an instructor teaches the same course more than once to small classes and uses the same exam questions, the responses can be pooled and item analysis performed once at least 20 students have taken the exam.
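
Pooling responses from several small classes can be sketched as follows.  The names are hypothetical, and the sketch assumes the same question was used, unchanged, in every class.

```python
MIN_TEST_TAKERS = 20  # minimum sample size suggested in the text


def pool_sections(sections):
    """Combine responses to one item from several classes.

    Returns the pooled responses and whether the pooled sample
    is large enough for item analysis.
    """
    pooled = [response for section in sections for response in section]
    return pooled, len(pooled) >= MIN_TEST_TAKERS


# Three hypothetical small classes that answered the same question.
fall = ["B"] * 6 + ["A"] * 2
spring = ["B"] * 5 + ["C"] * 2
summer = ["B"] * 4 + ["D"] * 2
pooled, large_enough = pool_sections([fall, spring, summer])
print(len(pooled), large_enough)  # 21 True
```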

Several assumptions must be met before using these statistics.  The content must be homogeneous and the test must actually reflect the content covered.  It is also assumed that cheating does not occur.  Additionally, several factors should be considered when evaluating an exam (e.g., content delivery method, student ability level as a class and individually, student backgrounds and previous academic preparation, and testing environment).  Reviewing the exam with students after they have taken it may provide useful information about the exam as well as help students understand the material.  As all faculty know, exams themselves are teaching tools.  The statistical tools reviewed here only provide data that help instructors make decisions about their exams.  They should not make decisions for the instructor.

References

1.  Birnbaum, L.  (2008)  Guidelines for Writing Multiple Choice Questions.  Journal of Professional Exercise Physiology, 6(2).  (http://www.exercisephysiologists.com/JPEPFeb2008MCguidelines/index.html)

2.  University Evaluation and Examination Service.  (1981)  Technical Bulletin #17.  Iowa City, IA: The University of Iowa.

3.  http://scoring.msu.edu/itanhand.html

4.  http://www.uwosh.edu/testing/facultyinfo/itemdiscrimone.php

5.  http://www.rasch.org/rmt/rmt121r.htm

6.  http://pareonline.net/getvn.asp?v=4&n=10