Chapter 7.3 Test Validity & Reliability

Test Validity and Reliability

Whenever a test or other measuring device is used as part of the data collection process, the validity and reliability of that test are important.  Just as we would not use a math test to assess verbal skills, we would not want to use a measuring device for research that was not truly measuring what we purport it to measure.  After all, we are relying on the results to show support or a lack of support for our theory, and if the data collection methods are erroneous, the data we analyze will also be erroneous.

Test Validity.

Validity refers to the degree to which our test or other measuring device is truly measuring what we intended it to measure.  The test question “1 + 1 = _____” is certainly a valid basic addition question because it is truly measuring a student’s ability to perform basic addition.  It becomes less valid as a measurement of advanced addition because, although it addresses some required knowledge for addition, it does not represent all of the knowledge required for an advanced understanding of addition.  On a test designed to measure knowledge of American History, this question becomes completely invalid.  The ability to add two single digits has nothing to do with history.

For many constructs, or variables that are artificial or difficult to measure, the concept of validity becomes more complex.  Most of us agree that “1 + 1 = _____” would represent basic addition, but does this question also represent the construct of intelligence?  Other constructs include motivation, depression, anger, and practically any human emotion or trait.  If we have a difficult time defining the construct, we are going to have an even more difficult time measuring it.  Construct validity is the degree to which a test measures a construct accurately, and there are different types of construct validity that we should be concerned with.  Three of these, concurrent validity, content validity, and predictive validity, are discussed below.

Concurrent Validity.  Concurrent validity refers to a measurement device’s ability to vary directly with a measure of the same construct or inversely with a measure of an opposite construct.  It allows you to show that your test is valid by comparing it with a test that is already accepted as valid.  A new test of adult intelligence, for example, would have concurrent validity if it had a high positive correlation with the Wechsler Adult Intelligence Scale, since the Wechsler is an accepted measure of the construct we call intelligence.  An obvious concern relates to the validity of the test against which you are comparing your test.  Some assumptions must be made because there are many who argue that the Wechsler scales, for example, are not good measures of intelligence.

Content Validity.  Content validity is concerned with a test’s ability to include or represent all of the content of a particular construct.  The question “1 + 1 = ___” may be a valid basic addition question.  Would it represent all of the content that makes up the study of mathematics?  It may be included on a scale of intelligence, but does it represent all of intelligence?  The answer to these questions is obviously no.  To develop a valid test of intelligence, not only must there be questions on math, but also questions on verbal reasoning, analytical ability, and every other aspect of the construct we call intelligence.  There is no easy way to determine content validity aside from expert opinion.

Predictive Validity.  In order for a test to be a valid screening device for some future behavior, it must have predictive validity.  The SAT is used by college screening committees as one way to predict college grades.  The GMAT is used to predict success in business school.  And the LSAT is used as a means to predict law school performance.  The main concern with these, and many other predictive measures, is predictive validity because without it, they would be worthless.

We determine predictive validity by computing a correlation coefficient between, for example, SAT scores and college grades.  If they are directly related, then we can make a prediction regarding college grades based on SAT score.  We can show that students who score high on the SAT tend to receive high grades in college.
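
As a rough illustration of the calculation described above, the following Python sketch computes a Pearson correlation coefficient by hand.  The function name pearson_r and the SAT and GPA figures are hypothetical and invented for this example; they are not real data.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length with at least 2 items")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of cross-products of deviations from each mean
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Sums of squared deviations for each variable
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one variable is constant")
    return cov / math.sqrt(var_x * var_y)

# Hypothetical scores for six students (illustration only, not real data)
sat_scores = [1000, 1100, 1200, 1300, 1400, 1500]
college_gpa = [2.4, 2.9, 2.8, 3.3, 3.5, 3.8]

r = pearson_r(sat_scores, college_gpa)
print(f"predictive validity coefficient r = {r:.2f}")
```

A coefficient close to +1 for data like these would be read as evidence that higher SAT scores go with higher college grades; a coefficient near 0 would suggest the test has little predictive value.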

Test Reliability.

Reliability is synonymous with the consistency of a test, survey, observation, or other measuring device.  Imagine stepping on your bathroom scale and weighing 140 pounds only to find that your weight on the same scale changes to 180 pounds an hour later and 100 pounds an hour after that.  Based on the inconsistency of this scale, any research relying on it would certainly be unreliable.  Consider an important study on a new diet program that relies on your inconsistent or unreliable bathroom scale as the main way to collect information regarding weight change.  Would you consider its results accurate?

A reliability coefficient is often the statistic of choice in determining the reliability of a test.  This coefficient merely represents a correlation (discussed in chapter 8), which measures the strength and direction of a relationship between two or more variables.

Test-Retest Reliability.  Test-retest reliability refers to the test’s consistency across different administrations.  To determine the coefficient for this type of reliability, the same test is given to a group of subjects on at least two separate occasions.  If the test is reliable, the scores that each student receives on the first administration should be similar to the scores on the second.  We would expect the relationship between the first and second administration to be a high positive correlation.

One major concern with test-retest reliability is what has been termed the memory effect.  This is especially true when the two administrations are close together in time.  For example, imagine taking a short 10-question test on vocabulary and then ten minutes later being asked to complete the same test.  Most of us will remember our responses, and when we begin to answer again, we may just answer the way we did on the first test rather than reading through the questions carefully.  This can create an artificially high reliability coefficient as subjects respond from their memory rather than to the test itself.  When the pretest and posttest for an experiment are the same, the memory effect can play a role in the results.

Parallel Forms Reliability.  One way to assure that memory effects do not occur is to use a different pretest and posttest.  In order for these two tests to be used in this manner, however, they must be parallel or equal in what they measure.  To determine parallel forms reliability, a reliability coefficient is calculated on the scores of the two measures taken by the same group of subjects.  Once again, we would expect a high and positive correlation if we are to say the two forms are parallel.

Inter-Rater Reliability.  Whenever observations of behavior are used as data in research, we want to assure that these observations are reliable.  One way to determine this is to have two or more observers rate the same subjects and then correlate their observations.  If, for example, rater A observed a child act out aggressively eight times, we would want rater B to observe the same number of aggressive acts.  If rater B witnessed 16 aggressive acts, then we know at least one of these two raters is incorrect.  If their ratings are positively correlated, however, we can be reasonably sure that they are measuring the same construct of aggression.  It does not, however, assure that they are measuring it correctly, only that they are both measuring it the same.