Editor-in-Chief: Larry Birnbaum, PhD, FASEP, EPC
An Internet Electronic Journal Dedicated to Exercise Physiology as a Healthcare Profession
Validity of Multiple-Choice Exam Questions
Larry Birnbaum, PhD, FASEP, EPC
Department of Exercise Physiology
The College of St. Scholastica
Duluth, MN 55811
A previous article listed guidelines for writing multiple-choice
questions [1]. Following those guidelines will improve the quality of
multiple-choice questions, but will not necessarily ensure their validity.
What makes an exam question valid? Put simply, a test item (question) can be
considered valid if it covers the material it is supposed to cover, and if
students get it right because they know the material and get it wrong because
they do not. After a test is written and administered, item analysis is
commonly performed to help determine which questions might need revision or
might even need to be discarded.
Two common item analysis statistics are the difficulty index
and the discrimination index. The
difficulty index is the percent of students who selected the correct
response. For example, if 15 of 20
students chose the correct answer, the difficulty index for that question is
75%. Since it actually reflects how many
students get it right, some may prefer to call it the easy index. An obvious question is, “What level of
difficulty is ideal?” That will vary
with the nature and goals of the course and the exam itself. Generally, for tests that are intended to
differentiate among students, maximum differentiation can be achieved in tests
of moderate difficulty (i.e., the difficulty index is 50-80%) [2]. Difficulty scores of 20-80% may be considered
acceptable [3]. On the other hand, if
the purpose of the test is to show students’ levels of content mastery, high
item difficulty values should be observed because it is expected that most, if
not all, students should correctly answer each item.
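
As a minimal sketch of the calculation described above, the difficulty index can be computed from a list of student responses. The function name `difficulty_index` and the sample responses are hypothetical; they are chosen only to reproduce the 15-of-20 example in the text.

```python
def difficulty_index(responses, correct_answer):
    """Return the percent of students who selected the correct response."""
    if not responses:
        raise ValueError("at least one response is required")
    n_correct = sum(1 for r in responses if r == correct_answer)
    return 100.0 * n_correct / len(responses)


# Hypothetical data matching the example in the text:
# 15 of 20 students chose the correct answer, "B".
responses = ["B"] * 15 + ["A"] * 2 + ["C"] * 2 + ["D"] * 1
print(difficulty_index(responses, "B"))  # 75.0
```

A higher value means more students answered correctly, which is why some may prefer to think of it as an easy index.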
If the instructor is not satisfied with the difficulty index (i.e., the item
is too easy or too difficult), the instructor should consider why the question
failed to perform as desired. If it is perceived to be too easy, perhaps it
only required rote memory of simple material that was emphasized in class.
Alternatively, it may have contained an item-writing flaw that made selection
of the correct answer easy for most students. For example, the incorrect
choices (distractors) may not have seemed plausible to most students. If the
difficulty index is lower than desirable for a question, the instructor should
ask whether the material was covered adequately in class. Is the question
written clearly, or is it ambiguous enough that students are not certain what
the instructor is really asking? How similar are the distractors to the
correct answer? Some instructors may feel that a difficulty index of 25% is
good because it separates the good students from the poorer-performing
students. This may not be the case. Students have a 25% chance of guessing the
correct answer in four-response multiple-choice questions, so an index near
25% may reflect little more than guessing.
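
The guessing argument can be sketched as a simple check of an item's difficulty index against the chance level for its number of response options. The function names and the 5-point margin are assumptions made for illustration; the article does not specify how close to chance is too close.

```python
def chance_level(n_options):
    """Percent of students expected to answer correctly by pure guessing."""
    if n_options < 2:
        raise ValueError("a multiple-choice item needs at least two options")
    return 100.0 / n_options


def near_chance(difficulty_pct, n_options, margin_pct=5.0):
    """Flag an item whose difficulty index is at or near the guessing level.

    margin_pct is a hypothetical tolerance, not a published guideline.
    """
    return difficulty_pct <= chance_level(n_options) + margin_pct


print(chance_level(4))        # 25.0
print(near_chance(25.0, 4))   # True
print(near_chance(75.0, 4))   # False
```

An item flagged this way is not necessarily flawed, but it is a reasonable candidate for the review questions raised above.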
The discrimination index will help instructors determine
which questions actually do separate the students who perform well on exams
from those who do not. It is calculated
by ranking the students according to
total test score and then selecting the highest group (e.g., top 27%) and the
lowest group (e.g., bottom 27%) in terms of total score [4,5]. For each item, the percentage of students in
the upper and lower groups answering correctly is calculated. The difference is one measure of discrimination. The
formula is:
Discrimination index = (Upper Group % Correct) – (Lower Group % Correct)
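
The procedure above might be sketched as follows. This is an illustration under stated assumptions rather than a definitive implementation: the function name, the rounding of the group size, and the handling of tied total scores (left in input order) are all choices made here, and the result is returned as a proportion (e.g., 0.60) rather than a percentage.

```python
def discrimination_index(total_scores, item_correct, group_fraction=0.27):
    """Upper-group proportion correct minus lower-group proportion correct.

    total_scores: total test score for each student
    item_correct: True/False for each student on this item, in the same order
    group_fraction: share of students in each group (0.27 follows the text)
    """
    if len(total_scores) != len(item_correct):
        raise ValueError("inputs must have one entry per student")
    n = len(total_scores)
    if n == 0:
        raise ValueError("at least one student is required")
    # Assumption: round the group size to the nearest whole student.
    group_size = max(1, round(n * group_fraction))
    # Rank students by total score, highest first.
    order = sorted(range(n), key=lambda i: total_scores[i], reverse=True)
    upper = order[:group_size]
    lower = order[-group_size:]
    p_upper = sum(item_correct[i] for i in upper) / group_size
    p_lower = sum(item_correct[i] for i in lower) / group_size
    return p_upper - p_lower


# Hypothetical class of 20 students with distinct total scores (20 down to 1).
scores = list(range(20, 0, -1))
correct = [True] * 5 + [True, False] * 5 + [True, True, False, False, False]
d = discrimination_index(scores, correct)
print(f"{d:.2f}")  # 0.60 (5 of 5 in the upper group, 2 of 5 in the lower group)
```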
The discrimination index is positive when the response is chosen more
frequently by high-scoring students and negative when it is chosen more
frequently by low-scoring students. Expressed as a proportion, a value of .70
indicates that the percentage of high-scoring students who marked the response
was 70 percentage points higher than the percentage of low-scoring students
who did so. A value of -.40 indicates that the percentage was 40 percentage
points higher among low-scoring students. Guidelines for satisfactory
discrimination scores vary somewhat, although negative scores are not
tolerable; such a question needs to be revised or replaced. Values of .40 to
1.00 are considered good to excellent, while values from .20 to .39 are
acceptable to good but may be evaluated for possible revision [2,4]. A
negative value suggests that the students in the upper group were misled by an
ambiguity that the students in the lower group, and the item writer, failed to
discover [2,3].
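
The guideline ranges above can be expressed as a small lookup. The function name and label strings are hypothetical, and the treatment of values from 0 to .19 as "low" is an assumption, since the cited ranges do not cover that interval.

```python
def classify_discrimination(d):
    """Map a discrimination index (as a proportion) to the ranges in the text."""
    if not -1.0 <= d <= 1.0:
        raise ValueError("discrimination index must lie between -1 and 1")
    if d < 0:
        return "negative: revise or replace"
    if d >= 0.40:
        return "good to excellent"
    if d >= 0.20:
        return "acceptable to good; may be evaluated for revision"
    # Assumption: the text gives no label for 0 to .19; treated here as low.
    return "low: consider revising"


for value in (0.70, 0.25, 0.10, -0.40):
    print(value, classify_discrimination(value))
```

Because guidelines vary somewhat, the cutoffs are best treated as adjustable defaults rather than fixed rules.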
Suggestions for making exam questions more discriminating include the
following [2,6]:
(1) Check the question for unintentional clues that could help the
less-knowledgeable but test-wise student.
(2) Revise questions that have low discrimination scores.
(3) Revise distractors by using common student errors or misconceptions.
(4) Make distractors more similar to the correct response, which requires
finer discrimination by test takers.
(5) Replace or eliminate distractors that are not chosen by any examinees.
They do not contribute to the test's ability to distinguish good students
from poor students.
(6) Replace items that virtually everyone gets right with more difficult
items, since such items also do not discriminate.
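
Suggestion (5) lends itself to a simple tally of how often each option was selected. The following sketch, with a hypothetical function name and made-up responses, lists the distractors that no examinee chose.

```python
from collections import Counter


def unchosen_distractors(responses, options, correct_answer):
    """Return the incorrect options that no examinee selected."""
    counts = Counter(responses)  # options never chosen count as 0
    return [opt for opt in options
            if opt != correct_answer and counts[opt] == 0]


# Hypothetical item: 20 students, correct answer "B", nobody chose "D".
responses = ["B"] * 15 + ["A"] * 3 + ["C"] * 2
print(unchosen_distractors(responses, ["A", "B", "C", "D"], "B"))  # ['D']
```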
Sample size is important when using these item statistics. The number of
test takers should be at least 20 for these statistics to be useful. When
testing small groups, variation among responders can be quite wide. If an
instructor teaches the same course more than once to small classes and uses
the same exam questions, then item analysis can be performed after at least
20 students have taken the exam.
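
Pooling responses across small classes might look like the sketch below. The function name and the sample sections are hypothetical; the threshold of 20 is the one stated in the text.

```python
MIN_TEST_TAKERS = 20  # minimum number of test takers stated in the text


def pool_responses(sections, minimum=MIN_TEST_TAKERS):
    """Combine responses to one item from several classes.

    Returns the pooled responses and whether there are enough test takers
    for item analysis to be useful.
    """
    pooled = [response for section in sections for response in section]
    return pooled, len(pooled) >= minimum


# Hypothetical responses to the same question from three small classes.
fall, spring, summer = ["B"] * 8, ["B"] * 5 + ["A"] * 2, ["C"] * 6
pooled, enough = pool_responses([fall, spring, summer])
print(len(pooled), enough)  # 21 True
```

Pooling in this way presumes the question and its answer options were identical each time the exam was given.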
Several assumptions must be met before using these statistics. The content
must be homogeneous and the test must actually reflect the content covered.
It is also assumed that cheating does not occur. Additionally, several
factors should be considered when evaluating an exam (e.g., content delivery
method, student ability level as a class and individually, student
backgrounds and previous academic preparation, and testing environment).
Reviewing the exam with students after they have taken it may provide useful
information about the exam as well as help students understand the material.
As all faculty know, exams themselves are teaching tools. The statistical
tools reviewed here only provide data that help instructors make decisions
about their exams. They should not make decisions for the instructor.
References
1. Birnbaum, L. (2008) Guidelines for Writing Multiple Choice Questions. Journal of Professional Exercise Physiology, 6(2). (http://www.exercisephysiologists.com/JPEPFeb2008MCguidelines/index.html)
2. University Evaluation and Examination Service. (1981) Technical Bulletin #17. Iowa City, IA: The University of Iowa.
3. http://scoring.msu.edu/itanhand.html
4. http://www.uwosh.edu/testing/facultyinfo/itemdiscrimone.php
5. http://www.rasch.org/rmt/rmt121r.htm
6. http://pareonline.net/getvn.asp?v=4&n=10