Test Validity and Reliability
Whenever a test or other measuring device is used as part of
the data collection process, the validity and
reliability of that test are important.
Just as we would not use a math test to
assess verbal skills, we would not want to use a
measuring device for research that was not truly
measuring what we purport it to measure.
After all, we are relying on the results to
show support or a lack of support for our theory and
if the data collection methods are erroneous, the
data we analyze will also be erroneous.
Test Validity. Validity
refers to the degree to which our test or other
measuring device is truly measuring what we intended
it to measure.
The test question 1 + 1 = _____ is
certainly a valid basic addition question because it
is truly measuring a student's ability to perform
basic addition.
It becomes less valid as a measurement of
advanced addition because, although it addresses some
required knowledge for addition, it does not
represent all of the knowledge required for an advanced
understanding of addition.
On a test designed to measure knowledge of
American History, this question becomes completely
invalid. The
ability to add two single digits has nothing to do with
history.
For
many constructs, or variables that are artificial or
difficult to measure, the concept of validity
becomes more complex.
Most of us agree that 1 + 1 = _____
would represent basic addition, but does this
question also represent the construct of
intelligence? Other
constructs include motivation, depression, anger,
and practically any human emotion or trait.
If we have a difficult time defining the
construct, we are going to have an even more
difficult time measuring it.
Construct validity is the degree to which a
test measures a construct accurately, and there
are different types of construct validity that we
should be concerned with.
Three of these, concurrent validity, content
validity, and predictive validity, are discussed
below.
Concurrent Validity.
Concurrent validity refers to a measurement
device's ability to vary directly with a measure
of the same construct or inversely with a measure
of an opposite construct. It allows you to show that
your test is valid by comparing it
with an already valid test.
A new test of adult intelligence, for
example, would have concurrent validity if it had a
high positive correlation with the Wechsler Adult
Intelligence Scale since the Wechsler is an accepted
measure of the construct we call intelligence.
An obvious concern relates to the validity of
the test against which you are comparing your test.
Some assumptions must be made because there
are many who argue the Wechsler scales, for example,
are not good measures of intelligence.
Content Validity.
Content validity is concerned with a test's
ability to include or represent all of the content
of a particular construct.
The question 1 + 1 = ___ may be a valid
basic addition question.
Would it represent all of the content that
makes up the study of mathematics?
It may be included on a scale of
intelligence, but does it represent all of
intelligence? The
answer to these questions is obviously no.
To develop a valid test of intelligence, not
only must there be questions on math, but also
questions on verbal reasoning, analytical ability,
and every other aspect of the construct we call
intelligence. There
is no easy way to determine content validity aside
from expert opinion.
Predictive Validity.
In order for a test to be a valid screening
device for some future behavior, it must have
predictive validity.
The SAT is used by college screening
committees as one way to predict college grades. The GMAT is used to predict success in business school.
And the LSAT is used as a means to predict
law school performance.
The main concern with these and many other
predictive measures is predictive validity because,
without it, they would be worthless.
We
determine predictive validity by computing a
correlation coefficient between, for example,
SAT scores and college grades.
If they are directly related, then we can
make a prediction regarding college grades based on
SAT score. We
can show that students who score high on the SAT
tend to receive high grades in college.
Test Reliability.
Reliability is synonymous with the
consistency of a test, survey, observation, or other
measuring device.
Imagine stepping on your bathroom scale and
weighing 140 pounds only to find that your weight on
the same scale changes to 180 pounds an hour later
and 100 pounds an hour after that.
Based on the inconsistency of this scale, any
research relying on it would certainly be
unreliable. Consider
an important study on a new diet program that relies
on your inconsistent or unreliable bathroom scale as
the main way to collect information regarding weight
change. Would
you consider their results accurate?
A
reliability coefficient is often the statistic of
choice in determining the reliability of a test.
This coefficient merely represents a
correlation (discussed in chapter 8), which measures
the strength and direction of a relationship
between two variables.
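To make the idea concrete, here is a minimal Python sketch of a Pearson correlation computed by hand and applied to the bathroom scale example. The function name `pearson_r` and all of the weights are assumptions made up for illustration. Other reliability statistics exist, and this sketch covers only the simple correlation described above.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two lists of equal length with at least two values")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of cross-products of deviations from each mean.
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Sums of squared deviations for each list.
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    if ss_x == 0 or ss_y == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return cross / sqrt(ss_x * ss_y)

# Hypothetical weights (in pounds) of five people, each weighed twice.
first_weighing = [140, 155, 170, 185, 200]
consistent_scale = [141, 154, 171, 184, 201]  # second weighing on a consistent scale
erratic_scale = [180, 120, 210, 150, 165]     # second weighing on an erratic scale

print(f"consistent scale: r = {pearson_r(first_weighing, consistent_scale):.2f}")
print(f"erratic scale:    r = {pearson_r(first_weighing, erratic_scale):.2f}")
```

A coefficient near +1 means the two sets of measurements rank and space people almost identically, while a coefficient near 0 means the second measurement tells us little about the first.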
Test-Retest Reliability.
Test-retest reliability refers to the
test's consistency across different
administrations.
To determine the coefficient for this type of
reliability, the same test is given to a group of
subjects on at least two separate occasions.
If the test is reliable, the scores that each
student receives on the first administration should
be similar to the scores on the second.
We would expect the relationship between the
first and second administration to be a high
positive correlation.
One
major concern with test-retest reliability is what
has been termed the memory effect.
This is especially true when the two
administrations are close together in time. For example, imagine taking a short 10-question test on
vocabulary and then ten minutes later being asked to
complete the same test.
Most of us will remember our responses and
when we begin to answer again, we may just answer
the way we did on the first test rather than reading
through the questions carefully.
This can create an artificially high
reliability coefficient as subjects respond from
their memory rather than the test itself.
When the pre-test and post-test for an
experiment are the same, the memory effect can play a
role in the results.
Parallel Forms Reliability.
One way to ensure that memory effects do not
occur is to use different pre- and post-tests.
In order for these two tests to be used in
this manner, however, they must be parallel or equal
in what they measure. To determine parallel forms reliability, a reliability
coefficient is calculated on the scores of the two
measures taken by the same group of subjects.
Once again, we would expect a high and
positive correlation if we are to say the two forms
are parallel.
Inter-Rater Reliability.
Whenever observations of behavior are used as
data in research, we want to ensure that these
observations are reliable.
One way to determine this is to have two or
more observers rate the same subjects and then
correlate their observations.
If, for example, rater A observed a child act
out aggressively eight times, we would want rater B
to observe the same number of aggressive acts.
If rater B witnessed 16 aggressive acts, then
we know at least one of these two raters is
incorrect. If
their ratings are positively correlated, however, we
can be reasonably sure that they are measuring the
same construct of aggression. A high correlation does not,
however, assure that they are measuring it
correctly, only that they are both measuring it the
same way.