Discussion: Estimating Models Using Dummy Variables

You have had plenty of opportunity to interpret coefficients for metric variables in regression models. Using and interpreting categorical variables takes just a little bit of extra practice. In this Discussion, you will have the opportunity to practice how to recode categorical variables so they can be used in a regression model and how to properly interpret the coefficients. Additionally, you will gain some practice in running diagnostics and identifying any potential problems with the model.

To prepare for this Discussion:

Review Warner’s Chapter 12 and Chapter 2 of the Wagner course text and the media program found in this week’s Learning Resources and consider the use of dummy variables.

Create a research question using the General Social Survey dataset that can be answered by multiple regression. Using the SPSS software, choose a categorical variable to dummy code as one of your predictor variables.
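As a hedged illustration of what dummy coding does (the assignment itself uses SPSS), the sketch below builds dummy regressors by hand in Python with NumPy. The variable names (region, educ, income) and all values are made up for illustration and are not General Social Survey variables.

```python
import numpy as np

# Hypothetical data: a three-category predictor, a metric predictor, an outcome.
region = np.array(["north", "south", "west", "north", "south", "west", "north", "west"])
educ = np.array([12.0, 16.0, 14.0, 10.0, 12.0, 18.0, 16.0, 12.0])
income = np.array([30.0, 42.0, 41.0, 25.0, 33.0, 52.0, 40.0, 36.0])

def dummy_code(values, reference):
    """Return (levels, columns): one 0/1 column per category except the reference."""
    levels = [lv for lv in sorted(set(values.tolist())) if lv != reference]
    columns = np.column_stack([(values == lv).astype(float) for lv in levels])
    return levels, columns

levels, dummies = dummy_code(region, reference="north")

# Design matrix: intercept, metric predictor, then q - 1 dummy regressors.
X = np.column_stack([np.ones(len(income)), educ, dummies])
coef, *_ = np.linalg.lstsq(X, income, rcond=None)

for name, b in zip(["intercept", "educ"] + levels, coef):
    print(f"{name:>9}: {b: .3f}")
```

Each dummy coefficient is the difference in the predicted outcome between that category and the reference category ("north" in this made-up example), holding the other predictors constant.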

By Day 3

Estimate a multiple regression model that answers your research question. Post your response to the following:

What is your research question?

Interpret the coefficients for the model, specifically commenting on the dummy variable.

Run diagnostics for the regression model. Does the model meet all of the assumptions? Be sure to comment on which assumptions were not met and the possible implications. Is there any possible remedy for one of the assumption violations?

Be sure to support your Main Post and Response Post with reference to the week’s Learning Resources and other scholarly evidence in APA Style.

By Day 5

Respond to at least one of your colleagues’ posts and provide a constructive comment on their assessment of diagnostics.

Were all assumptions tested for?

Are there some violations that the model might be robust against? Why or why not?

Explain and provide any additional resources (e.g., web links, articles) to help your colleague address diagnostic issues.

Discrete Data

Discrete independent and dependent variables often lead to plots that are difficult to interpret. A simple example of this phenomenon appears in Figure 8.1, the data for which are drawn from the 1989 General Social Survey conducted by the National Opinion Research Center. The independent variable, years of education completed, is coded from 0 to 20. The dependent variable is the number of correct answers to a 10-item vocabulary test; note that this variable is a disguised proportion—literally, the proportion correct × 10.

Figure 8.1. Scatterplot (a) and residual plot (b) for vocabulary score by year of education. The least-squares regression line is shown on the scatterplot.

The scatterplot in Figure 8.1a conveys the general impression that vocabulary increases with education. The plot is difficult to read, however, because most of the 968 data points fall on top of one another. The least-squares regression line, also shown on the plot, has the equation

\[ \hat{V} = 1.13 + 0.374E \]

where V and E are, respectively, the vocabulary score and education.

Figure 8.1b plots residuals from the fitted regression against education. The diagonal lines running from upper left to lower right in this plot are typical of residuals for a discrete dependent variable: For any one of the 11 distinct y values, e.g., y = 5, the residual is e = 5 – b0 – b1x = 3.87 – 0.374x, which is a linear function of x. I noted a similar phenomenon in Chapter 6 for the plot of residuals against fitted values when y has a fixed minimum score. The diagonals from lower left to upper right are due to the discreteness of x.
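The diagonal pattern can be checked with a few lines of arithmetic. The sketch below is only an illustration in Python; it takes b1 = 0.374 from the text and b0 = 1.13, which is the intercept implied by 5 – b0 = 3.87.

```python
# Coefficients implied by the text: 5 - b0 = 3.87, so b0 = 1.13; b1 = 0.374.
b0, b1 = 1.13, 0.374

def residual(y, x):
    """Residual of an observation (x, y) from the fitted line b0 + b1 * x."""
    return y - (b0 + b1 * x)

# For a fixed vocabulary score, residuals fall on a line with slope -b1 in x.
for x in (8, 12, 16, 20):
    print(x, round(residual(5, x), 3))
```

Holding y fixed at any of its distinct values gives a parallel line, which is why the residual plot shows a family of downward-sloping diagonals.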

It also appears that the variation of the residuals in Figure 8.1b is lower for the largest and smallest values of education than for intermediate values. This pattern is consistent with the observation that the dependent variable is a disguised proportion: As the average number of correct answers approaches 0 or 10, the potential variation in vocabulary scores decreases. It is possible, however, that at least part of the apparent decrease in residual variation is due to the relative sparseness of data at the extremes of the education scale. Our eye is drawn to the range of residual values, especially because we cannot see most of the data points, and even when variance is constant, the range tends to increase with the amount of data.

These issues are addressed in Figure 8.2, where each data point has been randomly “jittered” both vertically and horizontally: Specifically, a uniform random variable on the interval [-1/2, 1/2] was added to each education and vocabulary score. This approach to plotting discrete data was suggested by Chambers, Cleveland, Kleiner, and Tukey (1983). The plot also shows the fitted regression line for the original data, along with lines tracing the median and first and third quartiles of the distribution of jittered vocabulary scores for each value of education; I excluded education values below six from the median and quartile traces because of the sparseness of data in this region.
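The jittering step described above can be sketched as follows. This is a minimal Python illustration on made-up scores, not the code used to draw Figure 8.2; the function name and seeds are assumptions.

```python
import random

def jitter(values, half_width=0.5, seed=0):
    """Add an independent uniform draw on [-half_width, half_width] to each value."""
    rng = random.Random(seed)
    return [v + rng.uniform(-half_width, half_width) for v in values]

education = [12, 12, 12, 16, 16, 8]   # made-up scores with heavy overlap
vocabulary = [6, 6, 6, 8, 8, 4]

education_j = jitter(education, seed=1)
vocabulary_j = jitter(vocabulary, seed=2)
```

Because the noise never exceeds half a unit, each jittered point stays within the cell belonging to its original integer coordinates, so the plot separates overlapping points without mixing adjacent categories.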

Several features of Figure 8.2 are worth highlighting: (a) It is clear from the jittered data that the observations are particularly dense at 12 years of education, corresponding to high-school graduation; (b) the median trace is quite close to the linear least-squares regression line; and (c) the quartile traces indicate that the spread of y does not decrease appreciably at high values of education.

A discrete dependent variable violates the assumption that the error in the regression model is normally distributed with constant variance. This problem, like that of a limited dependent variable, is only serious in extreme cases—for example, when there are very few response categories, or where a large proportion of observations is in a small number of categories, conditional on the values of the independent variables.

In contrast, discrete independent variables are perfectly consistent with the regression model, which makes no distributional assumptions about the xs other than that they are uncorrelated with the error. Indeed, a discrete x makes possible a straightforward hypothesis test of nonlinearity, sometimes called a test for “lack of fit.” Likewise, it is relatively simple to test for nonconstant error variance across categories of a discrete independent variable (see below).

Figure 8.2. “Jittered” scatterplot for vocabulary score by education. A small random quantity is added to each horizontal and vertical coordinate. The dashed line is the least-squares regression line for the unjittered data. The solid lines are median and quartile traces for the jittered vocabulary scores.

Testing for Nonlinearity

Suppose, for example, that we model education with a set of dummy regressors rather than specify a linear relationship between vocabulary score and education. Although there are 21 conceivable education scores, ranging from 0 through 20, none of the individuals in the sample has 2 years of education, yielding 20 categories and 19 dummy regressors. The model becomes

\[ y_i = \alpha + \gamma_1 d_{i1} + \cdots + \gamma_{19} d_{i,19} + \varepsilon_i \tag{8.1} \]

Contrasting this model with

\[ y_i = \alpha + \beta x_i + \varepsilon_i \tag{8.2} \]

produces a test for nonlinearity, because Equation 8.2, specifying a linear relationship, is a special case of Equation 8.1, which captures any pattern of relationship between E(y) and x. The resulting incremental F test for nonlinearity appears in the analysis-of-variance of Table 8.1. There is, therefore, very strong evidence of a linear relationship between vocabulary and education, but little evidence of nonlinearity.

TABLE 8.1 Analysis of Variance for Vocabulary-Test Score, Showing the Incremental F Test for Nonlinearity of the Relationship Between Vocabulary and Education
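The incremental F test described above can be sketched in Python with NumPy. This is an illustration under stated assumptions, not the computation behind Table 8.1: the function names are hypothetical and the data are simulated rather than taken from the General Social Survey.

```python
import numpy as np

def rss(X, y):
    """Residual sum of squares from the least-squares fit of y on X."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    return float(resid @ resid)

def nonlinearity_f(x, y):
    """Incremental F statistic comparing a dummy-regressor model with a linear one."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(y)
    levels = np.unique(x)
    q = len(levels)

    # Linear model: intercept plus x (2 parameters).
    X_lin = np.column_stack([np.ones(n), x])
    # General model: intercept plus q - 1 dummies (q parameters in all).
    X_dum = np.column_stack([np.ones(n)] + [(x == lv).astype(float) for lv in levels[1:]])

    rss_lin, rss_dum = rss(X_lin, y), rss(X_dum, y)
    df_num, df_den = q - 2, n - q
    f_stat = ((rss_lin - rss_dum) / df_num) / (rss_dum / df_den)
    return f_stat, df_num, df_den

# Simulated data: a linear trend plus noise, so F should be unremarkable.
rng = np.random.default_rng(0)
x = rng.integers(8, 17, size=200)
y = 1.0 + 0.4 * x + rng.normal(0.0, 2.0, size=200)
print(nonlinearity_f(x, y))
```

The numerator degrees of freedom equal the number of extra parameters in the dummy-regressor model, q – 2; with q = 20 categories and n = 968 observations this formula gives 18 and 948 degrees of freedom.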

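The F test for nonlinearity can easily be extended to a discrete independent variable, say x1, in a multiple-regression model. Here, we contrast the more general model

\[ y_i = \alpha + \gamma_1 d_{i1} + \cdots + \gamma_{q-1} d_{i,q-1} + \beta_2 x_{i2} + \cdots + \beta_k x_{ik} + \varepsilon_i \]

with a model specifying a linear effect of x1,

\[ y_i = \alpha + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_k x_{ik} + \varepsilon_i \]

where d1, …, dq-1 are dummy regressors constructed to represent the q categories of x1.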

Testing for Nonconstant Error Variance

A discrete x (or combination of xs) partitions the data into q groups. Let yij denote the jth of ni dependent-variable scores in the ith group. If the error variance is constant, then the within-group variance estimates

\[ s_i^2 = \frac{\sum_{j=1}^{n_i} (y_{ij} - \bar{y}_i)^2}{n_i - 1} \]

should be similar. Here, ȳi is the mean in the ith group. Tests that examine the si² directly, such as Bartlett’s (1937) commonly employed test, do not maintain their validity well when the errors are non-normal.

Many alternative tests have been proposed. In a large-scale simulation study, Conover, Johnson, and Johnson (1981) demonstrate that the following simple F test is both robust and powerful: Calculate the values zij = |yij – yi∗|, where yi∗ is the median y within the ith group. Then perform a one-way analysis-of-variance of the variable z over the q groups. If the error variance is not constant across the groups, then the group means z̄i will tend to differ, producing a large value of the F test statistic. For the vocabulary data, for example, where education partitions the 968 observations into q = 20 groups, this test gives F = 1.48 with 19 and 948 degrees of freedom, p = .08, providing weak evidence of nonconstant spread.
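A minimal sketch of this test, using only the Python standard library, is given below. The function name and the three small groups are made up for illustration; the p-value is left out because it would require an F distribution function from outside the standard library.

```python
from statistics import mean, median

def median_deviation_f(groups):
    """One-way ANOVA F statistic on z_ij = |y_ij - median_i|; returns (F, df1, df2)."""
    z_groups = [[abs(y - median(g)) for y in g] for g in groups]
    n = sum(len(zg) for zg in z_groups)
    q = len(z_groups)
    grand = mean(z for zg in z_groups for z in zg)
    between = sum(len(zg) * (mean(zg) - grand) ** 2 for zg in z_groups)
    within = sum((z - mean(zg)) ** 2 for zg in z_groups for z in zg)
    return (between / (q - 1)) / (within / (n - q)), q - 1, n - q

# Made-up groups whose spread grows from one group to the next.
groups = [[1, 2, 3, 4, 5], [1, 3, 5, 7, 9], [0, 5, 10, 15, 20]]
f_stat, df1, df2 = median_deviation_f(groups)
print(f_stat, df1, df2)
```

Centering on the median rather than the mean is what gives the procedure its robustness to non-normal errors; this median-centered form is often referred to as the Brown-Forsythe variant of Levene's test.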


Non-Normally Distributed Errors

The assumption of normally distributed errors is almost always arbitrary. Nevertheless, the central-limit theorem assures that under very broad conditions inference based on the least-squares estimators is approximately valid in all but small samples. Why, then, should we be concerned about non-normal errors?

First, although the validity of least-squares estimation is robust—as stated, the levels of tests and confidence intervals are approximately correct in large samples even when the assumption of normality is violated—the method is not robust in efficiency: The least-squares estimator is maximally efficient among unbiased estimators when the errors are normal. For some types of error distributions, however, particularly those with heavy tails, the efficiency of least-squares estimation decreases markedly. In these cases, the least-squares estimator becomes much less efficient than alternatives (e.g., so-called robust estimators, or least-squares augmented by diagnostics). To a substantial extent, heavy-tailed error distributions are problematic because they give rise to outliers, a problem that I addressed in the previous chapter.

A commonly quoted justification of least-squares estimation, called the Gauss-Markov theorem, states that the least-squares coefficients are the most efficient unbiased estimators that are linear functions of the observations yi. This result depends on the assumptions of linearity, constant error variance, and independence, but does not require normality (see, e.g., Fox, 1984, pp. 42–43). Although the restriction to linear estimators produces simple sampling properties, it is not compelling in light of the vulnerability of least squares to heavy-tailed error distributions.

Second, highly skewed error distributions, aside from their propensity to generate outliers in the direction of the skew, compromise the interpretation of the least-squares fit. This fit is, after all, a conditional mean (of y given the xs), and the mean is not a good measure of the center of a highly skewed distribution. Consequently, we may prefer to transform the data to produce a symmetric error distribution.

Finally, a multimodal error distribution suggests the omission of one or more qualitative variables that divide the data naturally into groups. An examination of the distribution of residuals may therefore motivate respecification of the model.

Although there are tests for non-normal errors, I shall describe here instead graphical methods for examining the distribution of the residuals (but see Chapter 9). These methods are more useful for pinpointing the character of a problem and for suggesting solutions.

Normal Quantile-Comparison Plot of Residuals

One such graphical display is the quantile-comparison plot, which permits us to compare visually the cumulative distribution of an independent random sample—here of studentized residuals—to a cumulative reference distribution—the unit-normal distribution. Note that approximations are implied, because the studentized residuals are t distributed and dependent, but generally the distortion is negligible, at least for moderate-sized to large samples.

To construct the quantile-comparison plot:


1. Arrange the studentized residuals in ascending order: t(1), t(2), …, t(n). By convention, the ith smallest studentized residual, t(i), has gi = (i – 1/2)/n proportion of the data below it. This convention avoids cumulative proportions of zero and one by (in effect) counting half of each observation below and half above its recorded value. Cumulative proportions of zero and one would be problematic because the normal distribution, to which we wish to compare the distribution of the residuals, never quite reaches cumulative probabilities of zero or one.

2. Find the quantile of the unit-normal distribution that corresponds to a cumulative probability of gi; that is, the value zi from Z ∼ N(0, 1) for which Pr(Z < zi) = gi.

3. Plot the t(i) against the zi.

If the ti were drawn from a unit-normal distribution, then, within the bounds of sampling error, t(i) = zi. Consequently, we expect to find an approximately linear plot with zero intercept and unit slope, a line that can be placed on the plot for comparison. Nonlinearity in the plot, in contrast, is symptomatic of non-normality.
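The construction steps above can be sketched in Python using the standard library's NormalDist class. The function name and the five example residuals are made up for illustration.

```python
from statistics import NormalDist

def normal_quantile_pairs(residuals):
    """Return (z_i, t_(i)) pairs for a normal quantile-comparison plot."""
    ordered = sorted(residuals)
    n = len(ordered)
    std_normal = NormalDist()  # unit-normal reference distribution
    pairs = []
    for i, t in enumerate(ordered, start=1):
        g = (i - 0.5) / n             # cumulative proportion below t_(i)
        z = std_normal.inv_cdf(g)     # unit-normal quantile with Pr(Z < z) = g
        pairs.append((z, t))
    return pairs

example = [0.3, -1.2, 2.1, -0.4, 0.9]   # made-up studentized residuals
pairs = normal_quantile_pairs(example)
for z, t in pairs:
    print(f"{z: .3f}  {t: .1f}")
```

Plotting the second element of each pair against the first gives the quantile-comparison plot; because gi never equals 0 or 1, the inverse normal function is always defined.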

It is sometimes advantageous to adjust the fitted line for the observed center and spread of the residuals. To understand how the adjustment may be accomplished, suppose more generally that a variable X is normally distributed with mean μ and variance σ². Then, for an ordered sample of values, approximately x(i) = μ + σzi, where zi is defined as before. In applications, we need to estimate μ and σ, preferably robustly, because the usual estimators (the sample mean and standard deviation) are markedly affected by extreme values. Generally effective choices are the median of x to estimate μ and (Q3 – Q1)/1.349 to estimate σ, where Q1 and Q3 are, respectively, the first and third quartiles of x: The median and quartiles are not sensitive to outliers. Note that 1.349 is the number of standard deviations separating the quartiles of a normal distribution. Applied to the studentized residuals, we have the fitted line t̂(i) = median(t) + {[Q3(t) – Q1(t)]/1.349} × zi. The normal quantile-comparison plots in this monograph employ the more general procedure.
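A hedged sketch of the robust comparison line follows, again in standard-library Python. The function name is hypothetical, and the "inclusive" quartile rule is an assumption: quartile definitions differ slightly across software, so values may not match other packages exactly.

```python
from statistics import NormalDist, median, quantiles

def robust_line(t_values):
    """Intercept and slope of the comparison line: median and (Q3 - Q1) / 1.349."""
    q1, _, q3 = quantiles(t_values, n=4, method="inclusive")
    return median(t_values), (q3 - q1) / 1.349

# 1.349 is the interquartile range of the unit-normal distribution.
iqr_unit_normal = NormalDist().inv_cdf(0.75) - NormalDist().inv_cdf(0.25)

intercept, slope = robust_line([-2.0, -1.0, 0.0, 1.0, 2.0])
print(round(iqr_unit_normal, 3), intercept, slope)
```

The fitted value for the ith ordered residual is then intercept + slope × zi.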

Several illustrative normal quantile-comparison plots for simulated data are shown in Figure 5.1. In parts a and b of the figure, independent samples of size n = 25 and n = 100, respectively, were drawn from a unit-normal distribution. In parts c and d, samples of size n = 100 were drawn from the highly positively skewed χ² distribution with 4 degrees of freedom and the heavy-tailed t distribution with 2 degrees of freedom, respectively. Note how the skew and heavy tails show up as departures from linearity in the normal quantile-comparison plots. Outliers are discernible as unusually large or small values in comparison with corresponding normal quantiles.

Judging departures from normality can be assisted by plotting information about sampling variation. If the studentized residuals were drawn independently from a unit-normal distribution, then

\[ \mathrm{SE}(t_{(i)}) = \frac{1}{\phi(z_i)} \sqrt{\frac{g_i (1 - g_i)}{n}} \]

where ϕ(zi) is the probability density (i.e., the “height”) of the unit-normal distribution at Z = zi. Thus, zi ± 2 × SE(t(i)) gives a rough 95% confidence interval around the fitted line t̂(i) = zi in the quantile-comparison plot. If the slope of the fitted line is taken as σ̂ = (Q3 – Q1)/1.349 rather than 1, then the estimated standard error may be multiplied by σ̂. As an alternative to computing standard errors, Atkinson (1985) has suggested a computationally intensive simulation procedure that does not treat the studentized residuals as independent and normally distributed.
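The rough confidence band can be sketched as follows. This illustration assumes the usual large-sample standard error of an order statistic, sqrt(gi(1 – gi)/n)/ϕ(zi), optionally scaled by the slope of the fitted line; the function name and the choice n = 25 are made up for the example.

```python
from math import sqrt
from statistics import NormalDist

def quantile_se(i, n, slope=1.0):
    """Approximate SE of the ith order statistic for normal data, scaled by the slope."""
    std_normal = NormalDist()
    g = (i - 0.5) / n
    z = std_normal.inv_cdf(g)
    return slope * sqrt(g * (1.0 - g) / n) / std_normal.pdf(z)

n = 25
for i in (1, 13, 25):
    z = NormalDist().inv_cdf((i - 0.5) / n)
    se = quantile_se(i, n)
    print(f"i={i:2d}  z={z: .3f}  band=({z - 2 * se: .3f}, {z + 2 * se: .3f})")
```

The band is narrowest near the center of the distribution and widens in the tails, where the normal density is small.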

Figure 5.1. Illustrative normal quantile-comparison plots. (a) For a sample of n = 25 from N(0, 1). (b) For a sample of n = 100 from N(0, 1). (c) For a sample of n = 100 from the positively skewed χ² distribution with 4 degrees of freedom. (d) For a sample of n = 100 from the heavy-tailed t distribution with 2 degrees of freedom.

Figure 5.2 shows a normal quantile-comparison plot for the studentized residuals from Duncan’s regression of rated prestige on occupational income and education levels. The plot includes a fitted line with two-standard-error limits. Note that the residual distribution is reasonably well behaved.

Figure 5.2. Normal quantile-comparison plot for the studentized residuals from the regression of occupational prestige on income and education. The plot shows a fitted line, based on the median and quartiles of the ts, and approximate ±2SE limits around the line.

Histograms of Residuals

A strength of the normal quantile-comparison plot is that it retains high resolution in the tails of the distribution, where problems often manifest themselves. A weakness of the display, however, is that it does not convey a good overall sense of the shape of the distribution of the residuals. For example, multiple modes are difficult to discern in a quantile-comparison plot.

Histograms (frequency bar graphs), in contrast, have poor resolution in the tails or wherever data are sparse, but do a good job of conveying general distributional information. The arbitrary class boundaries, arbitrary intervals, and roughness of histograms sometimes produce misleading impressions of the data, however. These problems can partly be addressed by smoothing the histogram (see Silverman, 1986, or Fox, 1990). Generally, I prefer to employ stem-and-leaf displays, a type of histogram (Tukey, 1977) that records the numerical data values directly in the bars of the graph, for small samples (say n < 100), smoothed histograms for moderate-sized samples (say 100 ≤ n ≤ 1,000), and histograms with relatively narrow bars for large samples (say n > 1,000).

Figure 5.3. Stem-and-leaf display of studentized residuals from the regression of occupational prestige on income and education.

A stem-and-leaf display of studentized residuals from the Duncan regression is shown in Figure 5.3. The display reveals nothing of note: There is a single mode, the distribution appears reasonably symmetric, and there are no obvious outliers, although the largest value (3.1) is somewhat separated from the next-largest value (2.0).

Each data value in the stem-and-leaf display is broken into two parts: The leading digits comprise the stem; the first trailing digit forms the leaf; and the remaining trailing digits are discarded, thus truncating rather than rounding the data value. (Truncation makes it simpler to locate values in a list or table.) For studentized residuals, it is usually sensible to make this break at the decimal point. For example, for the residuals shown in Figure 5.3: 0.3039 → 0 |3; 3.1345 → 3 |1; and -0.4981 → -0 |4. Note that each stem digit appears twice, implicitly producing bins of width 0.5. Stems marked with asterisks (e.g., 1*) take leaves 0 through 4; stems marked with periods (e.g., 1.) take leaves 5 through 9. (For more information about stem-and-leaf displays, see, e.g., Velleman and Hoaglin [1981] or Fox [1990].)
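The splitting rule described above can be sketched in Python. The function name is hypothetical; the three values checked are the examples given in the text.

```python
from math import trunc

def stem_and_leaf(value):
    """Split a value at the decimal point: (stem label, leaf digit), truncating."""
    sign = "-" if value < 0 else ""
    tenths = trunc(abs(value) * 10)      # drop everything past the first decimal
    stem, leaf = divmod(tenths, 10)
    marker = "*" if leaf <= 4 else "."   # two lines per stem: leaves 0-4 and 5-9
    return f"{sign}{stem}{marker}", leaf

for value in (0.3039, 3.1345, -0.4981):
    stem, leaf = stem_and_leaf(value)
    print(f"{value:>8} -> {stem} | {leaf}")
```

Working with the absolute value and restoring the sign afterwards keeps the truncation toward zero, so that -0.4981 is filed under the stem -0 with leaf 4 rather than being rounded to -0.5.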

Figure 5.4. The family of powers and roots. The transformation labeled “p” is actually y’ = (y^p – 1)/p; for p = 0, y’ = log_e y.

SOURCE: Adapted with permission from Figure 4-1 from Hoaglin, Mosteller, and Tukey (eds.). Understanding Robust and Exploratory Data Analysis, © 1983 by John Wiley and Sons, Inc.

Correcting Asymmetry by Transformation

A frequently effective approach to a variety of problems in regression analysis is to transform the data so that they conform more closely to the assumptions of the linear model. In this and later chapters I shall introduce transformations to produce symmetry in the error distribution, to stabilize error variance, and to make the relationship between y and the xs linear.

In each of these cases, we shall employ the family of powers and roots, replacing a variable y (used here generically, because later we shall want to transform xs as well) by y’ = y^p. Typically, p = -2, -1, -1/2, 1/2, 2, or 3, although sometimes other powers and roots are considered. Note that p = 1 represents no transformation. In place of the 0th power, which would be useless because y^0 = 1 regardless of the value of y, we take y’ = log y, usually using base 2 or 10 for the log function. Because logs to different bases differ only by a constant factor, we can select the base for convenience of interpretation. Using the log transformation as a “zeroth power” is reasonable, because the closer p gets to zero, the more y^p looks like the log function (formally, the limit of (y^p – 1)/p as p → 0 is log_e y, where the log to the base e ≈ 2.718 is the so-called “natural” logarithm). Finally, for negative powers, we take y’ = -y^p, preserving the order of the y values, which would otherwise be reversed.
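The family of powers and roots can be sketched as a small Python function. The function names and the default base of 10 are assumptions made for illustration; the second function shows numerically why the log serves as the "zeroth power."

```python
from math import log

def power_transform(y, p, base=10.0):
    """Ladder-of-powers transformation for y > 0, as described above."""
    if y <= 0:
        raise ValueError("y must be positive; add a start first")
    if p == 0:
        return log(y, base)    # log takes the place of the zeroth power
    if p < 0:
        return -(y ** p)       # negate so that the order of y values is preserved
    return y ** p

def scaled_power(y, p):
    """(y**p - 1) / p, which approaches the natural log of y as p approaches 0."""
    return log(y) if p == 0 else (y ** p - 1.0) / p

# The scaled power moves toward log_e(3) as p shrinks toward zero.
for p in (1.0, 0.1, 0.001):
    print(p, scaled_power(3.0, p))
```

Without the sign change for negative powers, larger values of y would map to smaller transformed values, which would reverse the direction of every relationship involving the transformed variable.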

As we move away from p = 1 in either direction, the transformations get stronger, as illustrated in Figure 5.4. The effect of some of these transformations is shown in Table 5.1a. Transformations “up the ladder” of powers and roots (a term borrowed from Tukey, 1977), that is, toward y², serve differentially to spread out large values of y relative to small ones; transformations “down the ladder,” toward log y, have the opposite effect. To correct a positive skew (as in Table 5.1b), it is therefore necessary to move down the ladder; to correct a negative skew (Table 5.1c), which is less common in applications, move up the ladder.

I have implicitly assumed that all data values are positive, a condition that must hold for power transformations to maintain order. In practice, negative values can be eliminated prior to transformation by adding a small constant, sometimes called a “start,” to the data. Likewise, for power transformations to be effective, the ratio of the largest to the smallest data value must be sufficiently large; otherwise the transformation will be too nearly linear. A small ratio can be dealt with by using a negative start.

In the specific context of regression analysis, a skewed error distribution, revealed by examining the distribution of the residuals, can often be corrected by transforming the dependent variable. Although more sophisticated approaches are available (see, e.g., Chapter 9), a good transformation can be located by trial and error.

Dependent variables that are bounded below, and hence that tend to be positively skewed, often respond well to transformations down the ladder of powers. Power transformations usually do not work well, however, when many values stack up against the boundary, a situation termed truncation or censoring (see, e.g., Tobin [1958] for a treatment of “limited” dependent variables in regression). As well, data that are bounded both above and below, such as proportions and percentages, generally require another approach. For example, the logit or “log odds” transformation, given by y’ = log[y/(1 – y)], often works well for proportions.
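As a brief illustration of the logit transformation, the Python sketch below uses the natural logarithm. The inverse function is an addition for convenience and is not discussed in the text.

```python
from math import exp, log

def logit(y):
    """Log odds of a proportion strictly between 0 and 1."""
    if not 0.0 < y < 1.0:
        raise ValueError("logit is undefined at 0 and 1")
    return log(y / (1.0 - y))

def inverse_logit(v):
    """Map a log-odds value back to a proportion (added here for convenience)."""
    return 1.0 / (1.0 + exp(-v))

for y in (0.1, 0.5, 0.9):
    print(y, logit(y))
```

The transformation is symmetric about y = 0.5 and stretches values near 0 and 1, which removes the floor and ceiling that bound the original proportions.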

TABLE 5.1 Correcting Skews by Power Transformations


Walden University, LLC. (Producer). (2016m). Regression diagnostics and model evaluation [Video file]. Baltimore, MD: Author.