Multi-item scales are the key component of scientific questionnaire design, yet the concept is surprisingly little known in the L2 profession. The core of the issue is that when it comes to assessing attitudes, beliefs, opinions, interests, values, aspirations, expectations, and other personal variables, the actual wording of the questions assumes an unexpected importance: minor differences in how a question is formulated and framed can produce radically different levels of agreement or disagreement, or a completely different selection of answers (Gillham, 2000). We do not have such problems with factual questions: if you are interested in the gender of the respondent, you can safely ask about this using a single item, and the chances are that you will get a reliable answer (although the item "Your sex:" might elicit very creative responses in a teenage sample...). However, with non-factual items it is not unusual to find that responses given by the same people to two virtually identical items differ by as much as 20% or more (Oppenheim, 1992). Here is an illustration:
Converse & Presser (1986, p. 41) report on a case in which simply changing "forbid" to "not allow" in the wording produced significantly different responses to the item "Do you think the United States should [forbid/not allow] public speeches against democracy?" Significantly more people were willing to "not allow" speeches against democracy than were willing to "forbid" them. Although it may be true that on an impressionistic level "not allow" somehow does not sound as harsh as "forbid," the fact is that "allow" and "forbid" are exact logical opposites, and therefore it was not unreasonable to assume that the actual content of the two versions of the question was identical. Yet, as the differing response pattern indicated, this was not the case. Given that in this example only one word was changed and that the alternative version had an almost identical meaning, this is a good illustration that item wording in general has a substantial impact on the responses. However, there does not seem to be a reliable way of knowing exactly what kind of effect to expect.
So what is the solution? Do we have to conclude that questionnaires simply cannot achieve the kind of accuracy that is needed for scientific measurement purposes? We would have to if measurement theoreticians, and particularly Rensis Likert in the 1930s, had not discovered an ingenious way of getting around the problem: by using multi-item scales. A multi-item scale is a cluster of several differently worded items that focus on the same target (e.g., five items targeting attitudes toward language labs). The item scores for the similar questions are summed, resulting in a total scale score (which is why these scales are sometimes referred to as summative scales), and the underlying assumption is that any idiosyncratic interpretation of an item will be averaged out during the summation of the item scores. In other words, if we use multi-item scales, "no individual item carries an excessive load, and an inconsistent response to one item would cause limited damage" (Skehan, 1989, p. 11). For example, the question "Do you learn vocabulary items easily?" is bound to be interpreted differently by different people, depending on how easy they consider "easily," but if we include several more items asking about how good the respondents' memorization skills are, the overall score is likely to reflect the actual level of development of this skill. Thus, multi-item scales maximize the stable component that the items share and reduce the extraneous influences unique to the individual items.
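The averaging-out assumption can be illustrated with a small simulation. This is only a sketch under invented assumptions: every item response is modeled as a shared "true" level plus item-specific random noise on a continuous scale (real questionnaire items would usually be discrete), and the names `scale_score` and `simulate_responses`, the noise level, and the sample size are all hypothetical choices made for illustration.

```python
import random
import statistics


def scale_score(item_scores):
    """Sum one respondent's item scores into a total scale score."""
    return sum(item_scores)


def simulate_responses(true_level, n_items, noise_sd, rng):
    """Hypothetical response model: shared true level + item-specific noise."""
    return [true_level + rng.gauss(0, noise_sd) for _ in range(n_items)]


rng = random.Random(0)   # fixed seed so the run is repeatable
true_level = 3.0         # assumed underlying attitude level
n_items = 5              # e.g., five items targeting the same attitude
n_respondents = 2000

single_errors = []       # error when relying on one item only
scale_errors = []        # error of the scale score, rescaled to item units

for _ in range(n_respondents):
    items = simulate_responses(true_level, n_items, noise_sd=1.0, rng=rng)
    single_errors.append(items[0] - true_level)
    scale_errors.append(scale_score(items) / n_items - true_level)

# Spread of the measurement error around the true level
single_spread = statistics.pstdev(single_errors)
scale_spread = statistics.pstdev(scale_errors)

print(f"single-item error spread: {single_spread:.2f}")
print(f"five-item scale error spread: {scale_spread:.2f}")
```

Under this simplified model the scale score scatters less around the true level than any single item does, which is the sense in which item-specific influences are reduced by summation.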
"When we sometimes despair about the use of language as a tool for measuring or at least uncovering awareness, attitude, percepts and belief systems, it is mainly because we do not yet know why questions that look so similar actually produce such very different sets of results, or how we can predict contextual effects on a question, or in what ways we can ensure that respondents will all use the same frame of reference in answering an attitude question."
Because of the fallibility of single items, there is a general consensus among survey specialists that more than one item is needed to address each identified content area, all aimed at the same target but drawing upon slightly different aspects of it. How many is "more than one"? The best-known standardized questionnaire in the L2 field, Robert Gardner's (1985) Attitude/Motivation Test Battery (AMTB), contains 4-10 items to measure each scale. It is rather risky to go below 4 items per subarea because if the post hoc item analysis (cf. Section 2.9.3) reveals that certain items did not work in the particular sample, their exclusion will result in scales that are too short (or that consist of a single item). The technicalities of how to produce reliable and valid multi-item scales will be discussed in the section on "rating scales" (Section 2.4.1).
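To make the risk of short scales concrete, here is a minimal sketch of one common kind of item analysis, the corrected item-total correlation, in which each item is correlated with the sum of the remaining items. It is an assumption that this resembles the procedure described in Section 2.9.3; the response data are invented, and the 0.3 cutoff and the function names are hypothetical, illustrative choices.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def weak_items(responses, threshold=0.3):
    """Return indices of items whose corrected item-total correlation
    falls below the (hypothetical) threshold."""
    n_items = len(responses[0])
    flagged = []
    for i in range(n_items):
        item = [r[i] for r in responses]
        rest = [sum(r) - r[i] for r in responses]  # total without item i
        if pearson(item, rest) < threshold:
            flagged.append(i)
    return flagged


# Invented data: six respondents, one four-item scale (scores 1-5).
responses = [
    [1, 2, 1, 4],
    [2, 2, 3, 1],
    [3, 3, 3, 5],
    [4, 4, 3, 2],
    [5, 4, 5, 3],
    [5, 5, 4, 2],
]

flagged = weak_items(responses)
remaining = len(responses[0]) - len(flagged)
print("items flagged for exclusion:", flagged)
print("items left in the scale:", remaining)
```

In this made-up sample the last item does not pattern with the others, so a scale that started with four items is left with three; had it started with fewer, excluding even one item would leave very little to sum.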
Of course, nothing is perfect. While multi-item scales do a good job in terms of psychometric reliability, they may not necessarily appeal to the respondents. Ellard and Rogers (1993) report that respondents sometimes react negatively to items that appear to be asking the same question because this gives them the impression that we are trying to "trick them or check their honesty" (p. 19). This problem, however, can be greatly reduced by using effective item-writing strategies (see Section 2.6 for a summary).