7400.685-080 - Research Methods in FCS
School of Family and Consumer Sciences
Spring Semesters - Tuesday Evenings 5:20-7:55pm in 209 Schrank Hall South
Instructor: David D. Witt, Ph.D.
Hypothesis Testing and Operationalization of Variables
There is a lot of information in this chapter, so pay close attention!

After thoroughly reviewing the literature, charting out key advances on your research topic, making a determination regarding the next logical steps in the research process, and identifying hypotheses or research questions for purposes of theory testing, the researcher relies on methodologies designed to operationalize hypothesis concepts into variables. We've spent some time discussing the logical connection between a theoretical hypothesis and an operational hypothesis (recall the connection between the ideal and the real).

A theoretical hypothesis asserts a relationship between two concepts.
An operational hypothesis asserts a measurable relationship between two concrete measures we call variables.
Those concrete measures (variables) are operationalized theoretical concepts.

Here are some examples of Research and Operational Hypotheses from actual research in all the areas of FCS.
  • Child and Family Development
    Research Hypothesis: There is a positive relationship between job satisfaction and marital adjustment.
    Operational Hypothesis: There is a positive relationship between respondent scores on the Minnesota Job Satisfaction Questionnaire and their scores on the Marital Adjustment Test.
  • Clothing & Textiles
    Research Hypothesis: Do cooled women select soft or hard clothing?
    Operational Hypothesis: Women shoppers will select softer clothing as the temperature in the shopping environment decreases.
  • Interior Design
    Research Hypothesis: Natural lighting is the most valued office amenity among those in the work force.
    Operational Hypothesis: Office workers will choose a window for their workspace over more office space, convenient location of office, or more privacy.
  • Nutrition & Dietetics
    Research Hypothesis: Nutrition Education Programs increase children's nutritional knowledge and dietary habits.
    Operational Hypothesis: Children's Scores on the Nutritional Knowledge Scale will be higher after participation in a Nutrition Education Program.
  • FCS Teacher Education
    Research Hypothesis: Students will be more likely to comply with school regulations when they have limited responsibility for the creation and implementation of school disciplinary procedures.
    Operational Hypothesis: The number of student disciplinary actions will decline as the implementation of student generated disciplinary procedures increases.
  • Child Life Specialist
    Research Hypothesis: Treatment compliance of patients with cystic fibrosis will increase when the patient is on a daily activity schedule.
    Operational Hypothesis: Measures on a treatment compliance scale will increase for cystic fibrosis patients after a personalized daily schedule of activities is created and maintained for two weeks.
In each case above, the Research Hypothesis is deductively derived from the previous research on the topic and is presented in conceptual terms.  Each concept has to be operationally defined - that is, each concept in the hypothesis must be practically, concretely measured. Taking the Family Development example above, the Research Hypothesis states:
There is a positive relationship between job satisfaction and marital adjustment.

This means we have to measure job satisfaction and marital adjustment in real, objective terms.  Our literature review uncovered two measures that seem to be what we need.  The Minnesota Job Satisfaction Questionnaire (MJSQ) is a series of summable questions designed to measure how satisfied a respondent is with their employment situation. By adding the respondent's answers together, the researcher arrives at an overall measure of job satisfaction.  Similarly, the Marital Adjustment Test, or MAT, is another set of summable questions that seems to measure the concept of marital adjustment.
By asking a sample of respondents to complete both sets of questions, the researcher can later correlate the two measures to see if, in fact, a positive relationship exists.  Thus the Operational Hypothesis is:
There is a statistically significant positive correlation between scores on the MJSQ and the MAT.
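
To make the step from summed answers to a correlation concrete, here is a minimal sketch in Python. The item responses are invented for illustration (they are not real MJSQ or MAT data), and the function name pearson_r is our own. The sketch only computes the correlation coefficient; testing it for statistical significance is a separate step not shown here.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

# Hypothetical answers from five respondents (one inner list per respondent).
job_items = [[4, 5, 4], [2, 1, 2], [3, 3, 4], [5, 5, 5], [1, 2, 2]]
marital_items = [[3, 4], [2, 3], [4, 4], [5, 4], [1, 3]]

# Each scale score is simply the sum of that respondent's answers.
job_scores = [sum(answers) for answers in job_items]
marital_scores = [sum(answers) for answers in marital_items]

# Correlate the two summed scores across respondents.
r = pearson_r(job_scores, marital_scores)
print(round(r, 2))
```

A positive value of r is what the operational hypothesis predicts; with only five made-up respondents, the number itself means nothing beyond showing the mechanics.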

Once again, we cannot really measure a concept - we can only select a concrete measure that represents the concept.  Operationalizing a concept into a measurable variable seems simple enough, but there are some issues to consider. Following the Platonic notion of the ideal and the real, the researcher must bridge the gap between the hypothetical ideal (the Concept) and empirical measurable reality (the Variable). Since it is virtually impossible to completely and accurately measure most concepts, the researcher is forced to resort to estimates.

Think about a concept such as love. Everybody has a pretty good idea (actually several competing ideas) about what love is. Once we attempt to operationalize love - to measure love - it becomes immediately apparent that we don't have very accurate information about love's magnitude, depth, breadth, weight, or influence on other aspects of life.  This means we have to circumnavigate the concept by measuring parts of it.
This is the idea behind Scales and Indexes. Since many of the concepts used in hypothesis testing are broad and complex, researchers are constantly constructing measurements that serve the dual purposes of illustrating a concept's dimension without becoming unwieldy or cumbersome.

Cognitive/behavioral concepts such as religiosity, affection, purchasing habits, learning styles, or self-preservation are much more complex than physical measures of heart rate, body mass index, or the physical attributes of various fabrics.   Both straightforward measures and very complex ones are important.  In the case of highly complex or controversial measures, it is important for the researcher to have courage and to try to construct measures, even if the measures prove to be imperfect or somehow deeply flawed.  When in doubt about variable construction of difficult concepts, try to remember this: the initial measure for intelligence was response time between seeing a light come on and hitting a button. Had psychology stayed with that measure, instead of further developing finer and more precise measures for intelligence, then the brightest people on the planet would be defined by their facility at the arcade game "Bop the Weasel".

Researchers could rely on simple, one item measures of complex concepts, but the gains in simplicity would not outweigh the losses in reliability and validity - both of which are crucial to accurate and precise operationalization.  Here's an example of a scale that purports to measure family violence / domestic conflict. The author of the scale was not convinced simplistic questions regarding the topic were fully describing the reality of violence in families, so he set to work redefining the concept.

 The Conflict Tactics Scale measure consists of 80 items developed by Straus (1979) to explore intrafamily conflict and violence, focusing particularly on the adults in the family. Of these 80 items, 20 are administered to the parent about his/her relationship with the child. The next 20 questions are directed to the parent about the partner and his/her interactions with the child. If there is no partner, these questions are not asked. The last 40 questions of the measure address the interactions between the parent and the parent's partner using the same questions. The measure assesses how the parent reacts in a conflict with the child, such as trying to discuss an issue calmly, yelling at or insulting the child, stomping out of the room or house, threatening to spank the child, and hitting or trying to hit the child. The items gradually become more coercive and aggressive as they progress. The items are rated on a seven-point scale, ranging from 0=never to 6=almost every day.

This instrument has four scales: Parent-Child (Scale 1), Partner-Child (Scale 2), Parent-Partner (Scale 3), and Partner-Parent (Scale 4). The parent-child and partner-child conflict scales each have five subscales and the two parent-partner scales have four subscales each. The five subscales (Strassberg, Dodge, Bates, and Pettit, 1992; Strassberg, Dodge, Pettit, and Bates, 1994) are: verbal discussion, verbal aggression, hostile-indirect withdrawal, physical aggression, and spanking. The parent-partner and partner-parent scales do not include the spanking subscale. Subscale scores are created by taking the mean for each set of variables for a given subscale by observation and then by finding the subscale means across all observations.
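
The two-step scoring just described (a mean per observation, then a mean across observations) can be sketched as follows. The item names and their assignment to subscales are hypothetical, since the actual item lists are not given here, and the ratings are invented values on the 0 to 6 scale.

```python
from statistics import mean

# Hypothetical assignment of items to two of the subscales (an assumption).
SUBSCALES = {
    "verbal_discussion": ["q1", "q2"],
    "verbal_aggression": ["q3", "q4", "q5"],
}

# Invented ratings (0 = never ... 6 = almost every day) for two observations.
observations = [
    {"q1": 4, "q2": 6, "q3": 1, "q4": 0, "q5": 2},
    {"q1": 2, "q2": 4, "q3": 3, "q4": 3, "q5": 0},
]

def subscale_scores(obs):
    """Step 1: mean of the items in each subscale for one observation."""
    return {name: mean(obs[item] for item in items)
            for name, items in SUBSCALES.items()}

per_observation = [subscale_scores(obs) for obs in observations]

# Step 2: mean of each subscale score across all observations.
overall = {name: mean(scores[name] for scores in per_observation)
           for name in SUBSCALES}

print(per_observation)
print(overall)
```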

Analysts should be aware of possible distributional issues; subscales for physical aggression and hostile-indirect withdrawal were highly skewed in a positive direction. Two other subscales were also skewed: verbal aggression and spanking. The verbal discussion subscale was almost normally distributed.
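
One rough way to check a subscale for this kind of skew is to compute a skewness coefficient. The sketch below uses the simple moment-based formula and invented scores; the function name and the data are assumptions for illustration only. A value well above zero indicates a positive skew (most scores bunched near zero with a few high ones), and a value near zero indicates a roughly symmetric distribution.

```python
def skewness(values):
    """Moment-based skewness: third central moment over variance ** 1.5."""
    n = len(values)
    m = sum(values) / n
    m2 = sum((v - m) ** 2 for v in values) / n  # variance (population form)
    m3 = sum((v - m) ** 3 for v in values) / n  # third central moment
    return m3 / m2 ** 1.5

# Invented subscale scores: most respondents report "never" (0), a few do not.
physical_aggression = [0, 0, 0, 0, 0, 0, 1, 1, 2, 6]
# Invented scores spread evenly around the middle of the range.
verbal_discussion = [1, 2, 3, 4, 5]

print(skewness(physical_aggression) > 0)   # positively skewed
print(abs(skewness(verbal_discussion)) < 1e-9)  # symmetric
```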

This scale and literally thousands of others on various topics have been designed, tested in pilot studies and used widely in research around the world. Try to remember to check with the Resource Librarians when your hypotheses are in need of better measurement.  Resources such as these are often kept in the library reference section and entitled "Handbook of Statistical Measures".   Here are just a few titles to illustrate:
  • Assessments A to Z: a collection of 50 questionnaires, instruments, and inventories Burn B and Payment M. San Francisco, CA: Jossey-Bass/Pfeiffer, c2000 Ref HF 5549.5 .T7B795 2000 (50 complete instruments)
  • Assessing Alcohol Problems: a guide for clinicians and researchers Allen JP and Columbus M. Bethesda, MD: NIAAA, c1995 Ref HV 5279 .A8 1995 (72 complete instruments)
  • Communication Research Measures: a sourcebook Rubin RB, Palmgreen P and Sypher HE. New York: Guilford Press, c1994 Ref P 91.3 .C62 1994 (68 complete instruments)
  • Handbook of Assessment Methods for Eating Behaviors and Weight Related Problems: measures, theory and research Allison DB. Thousand Oaks, CA: Sage Publications, c1995 Ref RC 552 .E18H357 1995 (45 complete instruments)
  • Handbook of Family Measurement Techniques Touliatos J, Perlmutter BF and Straus MA. Thousand Oaks, CA: Sage Publications, c2001 Ref HQ 728 .T68 2001 (189 complete instruments)
  • Handbook of Geriatric Assessment, 3rd Ed. Gallo JJ, Fulmer T, Paveza GJ and Reichl W. Gaithersburg, MD: Aspen Publishers, Inc., c2000 Ref RC 953 .G34 2000 (30 complete instruments)
  • Handbook of Marketing Scales: multi-item measures for marketing and consumer behavior research Bearden WO, Netemeyer RG and Mobley MF. Newbury Park, CA: Sage Publishing, c1993 Ref HF 5415.3 .B323 1993 and 1999 (197 complete instruments)
  • Handbook of Organizational Measurement Price JL and Mueller CW. Marshfield, MA: Pitman, c1986 Ref HM 131 .P728 1986 (26 complete instruments)
  • Handbook of Research Design and Social Measurement Miller DC. 5th Ed. Newbury Park, CA: Sage Publications, c1991 Ref H 62 .M44 1991 (51 complete instruments)
  • Handbook of Scales for Research in Crime and Delinquency Stanley BD and O'Neal S. New York: Plenum Press, c1983 Ref HV 9274 .B76 1983 (99 complete instruments)
  • Handbook of Sexuality-Related Measures. Thousand Oaks, CA: Sage Publications, c1998 Ref HQ 60 .H36 1998 (196 complete instruments)
  • Handbook of Tests and Measurements for Black Populations (2 vols.) Jones RL. Hampton, VA: Cobb & Henry Publishers, c1996 Ref BF 176 .H37 1996 (82 complete instruments)
  • Handbook of Tests and Measurement in Education and the Social Sciences Lester PE and Bishop LK. Lancaster, PA: Technomic Publishing Co., c1997 Ref LB 3051 .L4543 1997 (125 complete instruments)
  • Marketing Scales Handbook: a compilation of multi-item measures Bruner II GC and Hensel PJ. Chicago, IL: American Marketing Association, c1992 Ref HF 5415.3 .B785 1992 (588 complete instruments)
  • Measurement of Attitudes toward People with Disabilities Antonak RF and Livneh H. Springfield, IL: Charles C. Thomas Publishers, c1988 Ref HV 1553 .A62 1988 (24 complete instruments)
  • Measurement of Love and Intimate Relations: theories, scales, and applications for love development, maintenance, and dissolution Tzeng O. Westport, CT: Praeger, c1993 Ref HM 132 .T84 1993 (28 complete instruments)
  • Measurement Tools in Patient Education Redman BK. New York: Springer Pub. Co., c1998 Ref R 727.4 .M4 1998 (50 complete instruments)
  • Measures for Clinical Practice (2 vols.): A Sourcebook. Fischer J and Corcoran K. 3rd Ed. New York: Maxwell Macmillan International, c2000 Vol.1 - Couples, families, children; Vol.2 - Adults Ref BF 176 .C66 2000 (423 complete instruments)
  • Measures of Personality and Social Psychological Attitudes Robinson JP, Athanasiou R and Wrightsman LS. San Diego, CA: Academic Press, c1991 Ref BF 698.4 .M38 1990 (112 complete instruments)
  • Measures of Political Attitudes: measures of social psychological attitudes, v.2 Robinson JP and Shaver PR. New York: Academic Press, c1999 Ref JA 74.5 .M43 1999 (149 complete instruments)
  • Measures of Religiosity Hill PC and Hood RW. Birmingham, AL: Education Press, c1999 Ref BR 110 .M43 1999 (124 complete instruments)
  • Measuring Health: a guide to rating scales and questionnaires McDowell I and Newell C. 2nd Ed. New York: Oxford University Press, c1996 Ref RA 408.5 .M38 1996 (46 complete instruments)
  • Outcome Measures for Child Welfare Services: theory and applications Magura S and Moses BS. Washington, D.C.: Child Welfare League of America, c1986 Ref HV 741 .M335 1986 (2 instruments)
  • Scales for the Measurement of Attitudes Shaw ME and Wright JM. New York: McGraw Hill, c1967 Ref BF 378 .A75S45 (186 complete instruments)
  • Sociological Measurement: an inventory of scales and indices Bonjean CM, Hill RJ and McLemore SD. San Francisco, CA: Chandler Publishers, c1967 Ref Z 7164 .S68B6 (25 complete instruments)
It does little good to operationalize concepts without a) knowing that we are measuring what we think we are measuring (validity), and b) choosing measures that result in the same measurement every time they are used (reliability). 
  • Validity is the critical step in measurement. It is the extent to which a measure actually measures the concept it is intended to measure. In medical terms, a lack of validity would be a misdiagnosis. Something is being observed, but we don't know what that something actually is - we just think we do.  For validity, the analogy that works for me is that of a shooter who identifies the target as that which is intended to be hit.
  • Reliability is almost as important to measurement. It is the demonstrated ability of a measure to consistently perform, yielding similar results every time it is used.  For reliability, the analogy that works for me is that of the rifle in the hands of the shooter that will consistently place bullets exactly where it is aimed, regardless of whether the target is the right one.
There are four types of Validity:
  • Content Validity indicates how well a measure "covers", or encapsulates, the concept.  Content validity is determined by nonstatistical means - a measure is defined and given to experts in the field for their judgment about the measure's representativeness for the concept in question.  If the researcher is alone in the judgment of this type of validity, the measure may wind up a source of criticism after publication.  Content validity is closely related to Face Validity, in which an untrained person is asked to evaluate the measure (i.e., if it walks like a duck).
  • Criterion Validity (also called Empirical Validity) is established statistically, when a new measure is shown to be related to an old standard measure of the same concept.  Researchers are constantly attempting to create new, more elegant, and simple-to-use measures of important concepts in their respective fields.  To establish that the new E-Z measure is as good as the old Complex measure, both measures would be used together in a research project.  The object is to collect data with both measures included and correlate them. The result is a validity coefficient. The higher the validity coefficient, the higher the likelihood that the new measure is a good one.
  • Predictive Validity is a form of criterion validity, with the substitution of a known correlate of an old measure instead of the old measure itself.  If the new measure results in similar statistical predictive power on the known correlate, then the argument could be made that the new measure is as good as the old one.
  • Construct Validity is a technique used by researchers developing a scale or index - a set of several questions designed to circumnavigate a concept.  Such a scale or index might include several dozen questions or smaller measures designed to "cover" the entirety of a theoretical concept.  Since the question set is logically deduced by the researcher, there is always the possibility that some of the item questions will be erroneous.   Here all questions are checked for correlation with a couple of theoretically determined correlates of the concept under scrutiny.  Questions that do not correlate as hypothesized are removed from the scale, leaving only those questions that survive this test.  The result is the start of a new scale or index.
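
The item-screening step described under Construct Validity can be sketched in a few lines. Everything specific here is an assumption made for illustration: the responses are invented, a single external correlate is used, the hypothesized direction is positive, and the cutoff of 0.3 is arbitrary rather than any kind of standard. The same correlation function, applied to a new measure and an old standard measure, would give the validity coefficient mentioned under Criterion Validity.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def screen_items(responses, correlate, cutoff):
    """Return the indexes of items whose correlation with the external
    correlate is at least `cutoff` (positive direction hypothesized)."""
    kept = []
    for j in range(len(responses[0])):
        item = [row[j] for row in responses]  # one column = one item
        if pearson_r(item, correlate) >= cutoff:
            kept.append(j)
    return kept

# Invented data: rows are respondents, columns are three candidate items.
responses = [
    [1, 2, 5],
    [2, 2, 1],
    [3, 4, 4],
    [4, 4, 2],
    [5, 5, 3],
]
# Invented scores on a theoretically related measure, one per respondent.
correlate = [10, 12, 15, 17, 20]

kept_items = screen_items(responses, correlate, cutoff=0.3)
print(kept_items)
```

In this made-up data set the third item does not move with the correlate, so it is the one dropped from the scale.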
Reliability requires no ad hoc judgment on the part of the researcher. Every reliability test is a statistical one. There are four methods for estimating reliability (since we are using inferential statistical methods, we are always estimating rather than concretely determining).
  • Test-retest reliability - a measure is administered to the same sample of respondents under similar circumstances on more than one occasion. This is actually a measure of stability over time.  If time one responses correlate highly with other responses at other times, the researcher can claim test-retest reliability.
  • Equivalent forms reliability - a measure is written two different ways and administered to the same respondents at the same time.  This technique is particularly helpful to determine whether or not simple differences in wording have an unintended, and thus error inducing, effect on results.  The two measures are then correlated as before.
  • Internal consistency - uses a single administration of a single measure. Here the items of the measure are artificially and randomly divided in half. Scores from the first half of the items are correlated with scores from the second half. Again, the reliability coefficient will have to be very high to claim internal consistency.
  • Interrater reliability - used primarily when observations are being made by researchers, as opposed to the self-report method where respondents make subjective self-observations.   Raters, or observers, are trained together on the methods they are expected to use in data collection - with part of the data being a coding of the rater's identity. The data are separated by rater code, as if they were separate samples, and compared for disagreements and agreements.
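
As an illustration of the interrater comparison, the sketch below counts agreements between two raters who coded the same eight observations. The codes and data are invented. Percent agreement is the direct count of agreements; Cohen's kappa is an addition not discussed above, included as one common way to adjust that count for the agreement two raters would reach by chance.

```python
from collections import Counter

def percent_agreement(rater_a, rater_b):
    """Share of observations that both raters coded identically."""
    matches = sum(a == b for a, b in zip(rater_a, rater_b))
    return matches / len(rater_a)

def cohens_kappa(rater_a, rater_b):
    """Agreement corrected for chance: (p_o - p_e) / (1 - p_e)."""
    n = len(rater_a)
    p_o = percent_agreement(rater_a, rater_b)
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    categories = set(rater_a) | set(rater_b)
    # Chance agreement: product of each rater's share for every category.
    p_e = sum(counts_a[c] * counts_b[c] for c in categories) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Invented codes assigned by two trained raters to the same eight observations.
rater_1 = ["calm", "yell", "calm", "hit", "yell", "calm", "calm", "yell"]
rater_2 = ["calm", "yell", "calm", "yell", "yell", "calm", "yell", "yell"]

agreement = percent_agreement(rater_1, rater_2)
kappa = cohens_kappa(rater_1, rater_2)
print(agreement, round(kappa, 2))
```

Kappa comes out lower than the raw agreement because some matches would be expected even if the raters were coding at random.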
Take this opportunity to practice writing good questions in the Key Words section for this week.
