The term “reliability” refers to the degree to which a measurement procedure is free from unsystematic errors of measurement and the degree to which it gives the same values if the measurement procedure is repeated. An individual responding to a measure is likely to obtain somewhat different results if he or she takes the instrument again. Systematic differences in scores (e.g., improvement on a test taken at two different times because the individual’s knowledge has increased between tests) should not be considered unreliability of a measure. But an individual’s results may change when measured more than once on the same measure because of unsystematic effects (e.g., mis-marking a response to an item; feeling tired one day, but not the next). Such unsystematic differences are considered unreliability. Low reliability limits the confidence one can have in an individual’s results from a single measurement (i.e., the results may lack precision). The higher the reliability of a measure, the more confidence you can have in the information obtained from the measure.
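The relationship between unsystematic error and repeated measurement can be illustrated with a short simulation based on classical test theory, in which each observed score is a stable true score plus independent random error; the correlation between two administrations then approximates the ratio of true-score variance to total observed variance. The sketch below is purely illustrative and is not drawn from the WIL analyses.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Each respondent has a stable "true" score; each administration adds
# independent unsystematic error (the source of unreliability).
true_score = rng.normal(0.0, 1.0, n)
time1 = true_score + rng.normal(0.0, 0.8, n)
time2 = true_score + rng.normal(0.0, 0.8, n)

# The observed correlation approximates the reliability ratio
# var(true) / (var(true) + var(error)) = 1 / (1 + 0.64), about .61.
print(np.corrcoef(time1, time2)[0, 1])
```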
There are several ways to assess the reliability of measurement, depending on the type of consistency with which one is most concerned: test-retest reliability, alternate or parallel forms reliability, and internal consistency. The following subsections present the evidence gathered on the WIL relating to each of these types of reliability.
Test-Retest Reliability. This type of reliability refers to the consistency of results when the same individual is assessed on the same measure at two points in time. This information is obtained by examining the degree of relationship (i.e., correlation) between an examinee’s scores on the measure at different points in time. Estimates of test-retest reliability are particularly useful if the characteristic being measured is not expected to change over the time between the two measurement periods (e.g., a measure of personality characteristics of normal adults at two points in time a month apart, as opposed to a measure of knowledge administered before and after a course on the subject of the measure). Given that the work values of adults are considered to be relatively stable characteristics, individuals’ responses to the WIL would be expected to be stable across time.
Two hundred and thirty vocational/technical and community college students were administered the WIL twice, with a two-month interval between the first and second administrations. The WIL’s ability to reliably identify an individual’s top-ranked work value was moderately high: a person’s top work value was the same across administrations 62 percent of the time. However, the correlations between the first administration’s six work value scores and the second administration’s six work value scores ranged from .35 (Achievement) to .58 (Support), indicating that the WIL has a low-to-moderate ability to reliably measure each of the six work values over a two-month interval. Overall, this evidence reinforced the use of the WIL to help clients discover their highest work value, while also demonstrating that the WIL should not be used by clients to determine the rank order or profile of all six of their work values.
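As a concrete sketch of the two statistics reported above, the code below computes per-scale test-retest correlations and the proportion of respondents whose top-ranked value is identical across the two administrations. The score matrices are simulated stand-ins, and the six scale labels follow the O*NET work values taxonomy; neither is drawn from the actual WIL data.

```python
import numpy as np

scales = ["Achievement", "Independence", "Recognition",
          "Relationships", "Support", "Working Conditions"]

# Hypothetical scores: rows are the 230 respondents, columns are the
# six work value scales, one matrix per administration.
rng = np.random.default_rng(1)
time1 = rng.normal(size=(230, 6))
time2 = 0.6 * time1 + 0.8 * rng.normal(size=(230, 6))

# Per-scale test-retest correlation (.35 to .58 in the WIL study).
for j, name in enumerate(scales):
    r = np.corrcoef(time1[:, j], time2[:, j])[0, 1]
    print(f"{name}: r = {r:.2f}")

# Share of respondents whose top-ranked value is unchanged
# (62 percent in the WIL study).
same_top = (time1.argmax(axis=1) == time2.argmax(axis=1)).mean()
print(f"top-value agreement: {same_top:.0%}")
```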
Alternate or Parallel Forms Reliability. This type of reliability evaluates whether the same individuals respond similarly on forms that have been created as alternative or parallel forms of the same measure. This estimate of reliability was important because the pilot studies of the instrument were conducted with pen-and-paper versions of the test, while the current version is an online one. Similar results for the same individuals across these different mediums would support using the mediums interchangeably. The same sample of 230 vocational/technical and community college students used in the test-retest reliability study described above was also administered the computerized version, allowing the sample to provide data relevant to alternate forms reliability. The scores of the two measures were reformulated in a manner that allowed for direct comparison and corrected for “ipsatization” problems. This correction reduces the adverse effects of forced-choice rank order information on a correlation coefficient (e.g., it accounts for users’ impaired ability to rate related needs in similar ways once they have used up the available spaces at their preferred level of importance). The six work value scores derived from the two measures had correlations ranging from .70 to .80, with a median correlation of .77, indicating relatively high agreement between the two versions of the measure.
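A minimal sketch of the alternate forms computation, assuming each respondent contributes six scale scores from the paper form and six from the computerized form after the scores have been made directly comparable (the ipsatization correction itself is not reproduced here):

```python
import numpy as np

# Hypothetical post-correction scores: rows are the 230 respondents,
# columns the six work value scales, one matrix per medium.
rng = np.random.default_rng(2)
paper = rng.normal(size=(230, 6))
online = 0.77 * paper + np.sqrt(1 - 0.77**2) * rng.normal(size=(230, 6))

# Correlate the paper and computerized scores for each scale, then
# summarize with the median (reported as .77 for the WIL).
rs = [np.corrcoef(paper[:, j], online[:, j])[0, 1] for j in range(6)]
print([round(r, 2) for r in rs], "median =", round(float(np.median(rs)), 2))
```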
Internal Consistency. This type of reliability is used to determine whether different items that measure the same attribute on the same measure yield highly related results. For example, if a test included 10 items on math ability and 10 items on reading ability, one would expect to see higher interrelationships within the set of 10 math ability items and within the set of 10 reading ability items than between items from the two different sets. Thus, internal consistency reliability is another type of reliability analysis that can be applied to the WIL to assess the adequacy of its development. For the WIL, it would be desirable to have high internal consistencies among items within the same scale (i.e., among the needs that are used to measure each of the six work values).
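The math/reading example can be made concrete with a small simulation: when two item sets are driven by different underlying abilities, the average correlation within a set should clearly exceed the average correlation between sets. The data below are illustrative, not WIL data.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 1_000
math_ability = rng.normal(size=(n, 1))
reading_ability = rng.normal(size=(n, 1))

# 10 math items and 10 reading items, each a noisy reflection of the
# corresponding ability.
math_items = math_ability + rng.normal(size=(n, 10))
reading_items = reading_ability + rng.normal(size=(n, 10))

r = np.corrcoef(np.hstack([math_items, reading_items]), rowvar=False)
within = r[:10, :10][np.triu_indices(10, k=1)].mean()   # math with math
between = r[:10, 10:].mean()                            # math with reading
print(f"mean within-set r = {within:.2f}, mean between-set r = {between:.2f}")
```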
The responses of 1,199 employment service clients and junior college students drawn from 23 sites were used to examine the internal consistency of the WIL. While the examination of internal consistency is important, the rank order format of the WIL yields data that, for statistical reasons, limit its ability to demonstrate high internal consistency values. The rank order format leads to the presence of negative inter-item correlations, attenuating estimates of internal consistency reliability. The median coefficient alpha obtained for the sample was .20, indicating a very low level of internal consistency. An examination of coefficient alphas for each of the six scales after the data were “corrected for ipsatization” (i.e., after reducing the adverse effects of the rank order information) yielded an average increase of .38 per scale, indicating that while the rank order format did in fact adversely affect the coefficient alphas, the internal consistency of the six scales was, at best, moderate.
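Coefficient alpha can be computed directly from a respondents-by-items score matrix, and the attenuation described above can be reproduced in miniature. In the sketch below, mean-centering each respondent’s full item pool serves as a simplified stand-in for the WIL’s forced-choice rank order constraint (both force the scores to be ipsative, i.e., to sum to a constant within each person); the data are simulated, not the WIL sample.

```python
import numpy as np

def cronbach_alpha(items):
    """Coefficient alpha for a respondents-by-items score matrix."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1).sum()
    total_var = items.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_vars / total_var)

rng = np.random.default_rng(4)
value = rng.normal(size=(1_199, 1))
items = value + rng.normal(size=(1_199, 5))    # five needs, one common value
pool = np.hstack([items, rng.normal(size=(1_199, 16))])  # plus unrelated needs

print(f"alpha, raw scores:       {cronbach_alpha(items):.2f}")

# Ipsatizing the pool (centering each respondent's scores so they sum
# to zero) pushes the average inter-item correlation in the pool
# negative and attenuates alpha for the same five-item scale.
ipsative = pool - pool.mean(axis=1, keepdims=True)
print(f"alpha, ipsatized scores: {cronbach_alpha(ipsative[:, :5]):.2f}")
```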
Summary. Overall, the WIL demonstrated moderate reliability across the majority of reliability analyses. The test-retest results showed moderate correspondence within individuals administered the WIL at two times roughly two months apart: individuals had the same top value 62 percent of the time. After the effects of ipsatization were adjusted for, the correlations between the pen-and-paper and computerized versions of the test ranged from .70 to .80, with a median of .77, indicating that the two mediums have a degree of interchangeability. Internal consistencies were low, with a median coefficient alpha of .20, due in part to the effects of ipsatization.