A common methodology in behavioural science is to use self-report questionnaires to gather data. Data from these questionnaires can be used to identify relationships between scores on the variable(s) that the questionnaire is assumed to measure and either performance on behavioural tasks, physiological measures taken during an experiment, or even scores obtained from other questionnaires (some studies just report on the correlations between batches of self-report measures!). Self-report measures are popular for a number of reasons. Firstly, they represent a ‘cheap’ way (in terms of both time and cost) of obtaining data. Secondly, they can be easily administered to large samples, especially with the advent of online questionnaire distribution sites such as Survey Monkey. Finally, they can be used to measure constructs that would be difficult to capture with behavioural or physiological measures (for example facets of personality such as introversion). The issue of self-report methodology is important because studies that use this method are regularly reported in the media (see http://www.bbc.co.uk/news/health-17209448 for a recent example) and therefore have a significant impact on how the general public perceives scientific research. I therefore think it is important to discuss potential problems with self-report measures.
Most (but certainly not all) questionnaires that are used in behavioural research undergo testing for reliability, to check that they produce consistent results when applied to the same population over time. More importantly, they are normally also tested for validity, to check that the questionnaire measures what it claims to measure. Such tests are done following the logic that the questionnaire should be able to discriminate participants in a similar way to relevant non-self-report measures. For example, scores on a questionnaire measuring depression should be able to discriminate between depressed patients and controls, while scores on a questionnaire measuring diet should be able to predict the ‘Body Fat Percentage’ of respondents with reasonable accuracy. While such tests can act to increase confidence that a questionnaire is measuring what it claims to measure, they are not foolproof. For example, just because a depression questionnaire can discriminate between patients and controls does not mean that it measures depression well, as the two groups will likely vary in several different ways. Likewise, a questionnaire that distinguishes between patients and controls may not be able to identify the (presumably) more subtle differences between depressed and non-depressed healthy individuals, or the range of depressive tendencies within the healthy population. In fact there are a large number of reasons why a questionnaire may not be entirely valid, including the following:
Honesty/Image management – Researchers who use self-report questionnaires are relying on the honesty of their participants. The degree to which this is a problem will undoubtedly vary with the topic of the questionnaire. For example, participants are less likely to be honest about measures relating to sexual behaviour or drug use than they are about caffeine consumption. It is nevertheless unwise to assume, even when you are measuring something relatively benign, that participants will always be truthful. Worse, the degree to which participants want to manage how they appear will no doubt vary depending on personality, which means that the level of dishonesty may vary significantly between the different groups that a study is trying to compare.
Introspective ability – Even if a participant is trying to be honest, they may lack the introspective ability to provide an accurate response to a question. We are probably all aware of people who appear to view themselves in a completely different light to how others see them. Undoubtedly we are all to some extent unable to introspectively assess ourselves completely accurately. Therefore any self-report information we provide may be incorrect despite our best efforts to be honest and accurate.
Understanding – Participants may also vary in their understanding or interpretation of particular questions. This is less of a problem with questionnaires measuring concrete things like alcohol consumption, but is a very big problem when measuring more abstract concepts such as personality. From personal experience, I have participated in an experiment where I was asked at regular intervals to report how ‘dominant’ I felt. As I can honestly say I don’t monitor my feelings of ‘dominance’ and how they change over time, I know that my responses to the question were pretty random. Even if I could conjure an understanding of what the question was getting at, it would be impossible to ensure that everyone who completed the questionnaire interpreted that question in the same way that I did.
Rating scales – Many questionnaires use rating scales to allow respondents to provide more nuanced responses than just yes/no. While yes/no questions do often appear restrictive in terms of how you can respond, rating scales can bring their own problems. People interpret and use scales differently: what I might rate as ‘8’ on a 10 point scale, someone with the same opinion might only rate as a ‘6’ because they interpret the meanings of the scale points differently. There is research which suggests that people have different ways of filling out rating scales (1). Some people are ‘extreme responders’ who like to use the edges of the scales, whereas others like to hug the midpoints and rarely use the outermost points. This naturally produces differences in scores between participants that reflect something other than what the questionnaire was designed to measure. A related problem is that of producing nonsense distinctions. For example, studies sometimes appear where participants are given a huge rating scale to choose from, such as a scale of 1-100 to rate their confidence in a decision as to whether two lines are the same length (2). Is anyone really capable of segmenting their certainty over such a decision into 100 different units? Is there really any meaningful difference, even within the same individual, between a certainty of 86 and a certainty of 72 in such a paradigm? Any differences found in such experiments therefore run the risk of being spurious.
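To make the ‘8’ versus ‘6’ point concrete, here is a minimal Python sketch. The function rate and its spread parameter are my own illustrative assumptions, not a model taken from the cited research: a spread of 1.0 stands for someone who uses the whole scale, while a small spread stands for someone who hugs the midpoint. The opinion values are invented.

```python
def rate(opinion, spread, points=10):
    """Map a latent opinion in [-1, 1] onto a 1..points rating scale.

    `spread` is a hypothetical response-style parameter: 1.0 uses the
    full scale, smaller values compress answers towards the midpoint.
    """
    midpoint = (1 + points) / 2      # 5.5 on a 10 point scale
    half_range = (points - 1) / 2    # 4.5 on a 10 point scale
    raw = midpoint + spread * opinion * half_range
    return max(1, min(points, round(raw)))


# The same five underlying opinions, held by two different respondents.
opinions = [-0.8, -0.3, 0.2, 0.6, 0.9]

extreme_responder = [rate(o, spread=1.0) for o in opinions]
midpoint_hugger = [rate(o, spread=0.3) for o in opinions]

print(extreme_responder)  # [2, 4, 6, 8, 10]
print(midpoint_hugger)    # [4, 5, 6, 6, 7]
```

With these made-up numbers, the opinion of 0.6 comes out as 8 for the first respondent and 6 for the second, even though the underlying opinion is identical. The difference in scores reflects response style alone.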
Response bias – This refers to an individual’s tendency to respond a certain way, regardless of the actual evidence they are assessing. For example, on a yes/no questionnaire asking about personal experiences, some participants might be biased towards responding yes (i.e. they may only require minimal evidence to decide on a yes response, so if an experience has happened only once they may still respond ‘yes’ to a question relating to whether they have had that experience). Alternatively, other participants may have a conservative response bias and only respond positively to such questions if the experience being inquired about has happened regularly. This is a particular problem when the relationship between different questionnaires is assessed, as a correlation between two different questionnaires may simply reflect the response bias of the participants being consistent across questionnaires, rather than any genuine relationship between the variables the questionnaires are measuring.
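A small simulation can illustrate how a shared response bias might manufacture a correlation between two questionnaires. This is only a sketch under assumptions of my own: the two traits are drawn independently, each simulated participant has a single ‘yes-bias’ that carries over to both questionnaires, and the response rule in yes_count is a deliberately crude invention rather than an established model.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def yes_count(trait, bias, rng, n_items=20):
    """Number of 'yes' answers on a yes/no questionnaire (assumed rule:
    answer 'yes' when trait + personal bias + item noise exceeds zero)."""
    return sum(trait + bias + rng.gauss(0, 1) > 0 for _ in range(n_items))


rng = random.Random(0)  # fixed seed so the sketch is repeatable
traits_a, traits_b, scores_a, scores_b = [], [], [], []

for _ in range(2000):
    trait_a = rng.gauss(0, 1)  # what questionnaire A is meant to measure
    trait_b = rng.gauss(0, 1)  # what questionnaire B is meant to measure (independent of A)
    bias = rng.gauss(0, 1)     # the participant's general tendency to say 'yes'
    traits_a.append(trait_a)
    traits_b.append(trait_b)
    scores_a.append(yes_count(trait_a, bias, rng))
    scores_b.append(yes_count(trait_b, bias, rng))

r_traits = pearson(traits_a, traits_b)
r_scores = pearson(scores_a, scores_b)
print(f"correlation between the underlying traits:    {r_traits:.2f}")
print(f"correlation between the questionnaire scores: {r_scores:.2f}")
```

Under these assumptions the underlying traits are essentially uncorrelated, while the yes-counts correlate positively, purely because each simulated participant brings the same bias to both questionnaires.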
Ordinal Measures – Almost all self-report measures produce ordinal data. Ordinal data is that which only tells you the order that units can be ranked in, not the distances between them. It is contrasted with interval data, which tells you the exact distances between different units. This distinction is easiest to define by thinking of a race. The position in which each runner finishes is an ordinal measure. It tells you who is fastest and slowest, but not the relative differences between the different runners. In contrast, the finishing time is an interval measure, as it provides information relating to the relative differences between the runners. Even when the questionnaire measures something that could be measured in SI units, and is therefore theoretically an interval scale (e.g. alcohol consumption), it is doubtful whether the responses can really be treated as interval because of the problems relating to response accuracy raised above. More pertinently, most self-report measures in behavioural science relate to constructs, such as personality measures, that can’t be measured in interval units and are therefore always ordinal. The problem with ordinal data is not the data itself, but the common practice of using parametric statistical techniques with such data, because these tests make assumptions about the distribution of the data that cannot be met when said data is ordinal. Deviations from such assumptions can lead to incorrect inferences being made (3), bringing the conclusions of such studies into question.
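The race example can be written out in a few lines of Python. The runners and their finishing times are invented for illustration.

```python
# Hypothetical finishing times in seconds (interval data).
finish_times = {"Asha": 10.1, "Ben": 10.2, "Carla": 12.9}

# Finishing positions (ordinal data): rank runners from fastest to slowest.
ranked = sorted(finish_times, key=finish_times.get)
positions = {runner: place for place, runner in enumerate(ranked, start=1)}

# Gaps between consecutive finishers on each kind of measure.
pairs = list(zip(ranked, ranked[1:]))
position_gaps = [positions[b] - positions[a] for a, b in pairs]
time_gaps = [round(finish_times[b] - finish_times[a], 1) for a, b in pairs]

print(positions)      # {'Asha': 1, 'Ben': 2, 'Carla': 3}
print(position_gaps)  # [1, 1]
print(time_gaps)      # [0.1, 2.7]
```

The positions make the two gaps look identical, whereas the times show that the second gap is far larger than the first. Any analysis that treats the positions as if they were evenly spaced is therefore working with information the ordinal measure does not contain.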
Control of sample – This has become more of an issue with the advent of online questionnaire distribution sites like Survey Monkey. Previously a researcher had to be present when a participant completed a questionnaire; now, with these tools, the researcher need never meet any of their participants. While this allows much bigger samples to be collected much more quickly, it does cause several concerns over the sample make-up. For example, there are few controls to stop the same person filling in the same questionnaire multiple times. There is also little disincentive for participants to give spurious responses, and there is little control over how much attention the participant pays to various parts of the questionnaire. Conversely, from personal experience, I know that sometimes it is hard to complete these questionnaires because there is no way of asking the researcher for clarification as to the meaning of various questions. Finally, as the researcher has lost control over the make-up of their sample, they may end up with a sample which is vastly skewed towards a certain type of person, as only certain types of people are likely to fill in such questionnaires. These issues existed even before the advent of online data collection (e.g. (4)), but collecting data ‘in absentia’ exacerbates such problems.
Although there are many problems with using self-report questionnaires, they will continue to be a popular methodology in behavioural science because of their utility. While it might be preferable for every variable a researcher wants to investigate to be manipulated systematically using behavioural techniques, this is in practice impossible, as it would severely restrict what each individual research design could achieve and would make certain topics effectively impossible to research. Self-report measures are therefore a necessary tool for behavioural research. Furthermore, some of the problems listed above can be countered through the careful design and application of self-report measures. For example, response bias can be reduced by ‘reversing’ half the questions on a questionnaire so that the variable is scored by positive responses on half the questions and negative responses on the other half, so that a general tendency to respond one way largely cancels out. Likewise, statistical techniques are being devised to attempt to pick out dishonest reporting, a problem that can also be attenuated by ensuring anonymity and confidentiality of responses (e.g. the researcher leaving the room when the participant is completing the questionnaire). Given this, it would be wrong to dismiss any findings that are reliant on self-report measures. However, whenever you read about research where self-report measures have been used to draw conclusions about human behaviour, it is always worth bearing in mind the multitude of problems associated with such measures, and how they might impact on the validity of the conclusions that have been drawn.
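The ‘reversing’ idea mentioned above can be sketched in Python as follows. The six-item questionnaire, the 1 to 5 agreement scale and the choice of which items are reverse-keyed are all assumptions made purely for illustration.

```python
SCALE_MIN, SCALE_MAX = 1, 5  # assumed scale: 1 = strongly disagree, 5 = strongly agree


def total_score(responses, reversed_items):
    """Sum the item responses, flipping the reverse-keyed items first."""
    total = 0
    for index, response in enumerate(responses):
        if index in reversed_items:
            # Reverse-keyed item: 5 becomes 1, 4 becomes 2, and so on.
            response = SCALE_MIN + SCALE_MAX - response
        total += response
    return total


reversed_items = {1, 3, 5}        # hypothetical: every second item is reverse-keyed

yea_sayer = [5, 5, 5, 5, 5, 5]    # agrees with everything, whatever is asked
genuine_high = [5, 1, 5, 1, 5, 1]  # consistently endorses the trait itself

print(total_score(yea_sayer, set()))             # 30: no reversed items, looks maximal
print(total_score(yea_sayer, reversed_items))    # 18: lands on the scale midpoint
print(total_score(genuine_high, reversed_items))  # 30: genuinely high scorer
```

In this toy case the respondent who agrees with everything ends up at the midpoint once half the items are reversed, while the respondent who genuinely endorses the trait still scores highly. Real response styles are unlikely to be this tidy, so reverse-keying should be expected to reduce such bias rather than remove it entirely.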
(1) Austin, E. J., Gibson, G. J., Deary, I. J., McGregor, M. J., & Dent, J. B. (1998). Individual response spread in self-report scales: personality correlations and consequences. Personality and Individual Differences, 24, 421–438. http://www.sciencedirect.com/science/article/pii/S019188699700175X
(2) Balakrishnan, J. D. (1999). Decision processes in discrimination: Fundamental misrepresentations of signal detection theory. Journal of Experimental Psychology: Human Perception & Performance, 25, 1189–1206. http://psycnet.apa.org/psycinfo/1999-11444-002
(3) Wilcox, R. R. (2005). Introduction to robust estimation and hypothesis testing. Academic Press. ISBN: 0127515429
(4) Fan, X., Miller, B. C., Park, K., Winward, B. W., Christensen, M., Grotevant, H. D., et al. (2006). An exploratory study about inaccuracy and invalidity in adolescent self-report surveys. Field Methods, 18, 223–244. http://fmx.sagepub.com/content/18/3/223.short