# Factors

A factor is R's representation of a categorical variable. Each value is stored as an integer code, and each code maps to a character string called a level, so the levels behave much like a set of permitted values. A simple example of a factor might be a variable called gender with two levels: ‘female’ and ‘male’. If you had three females and two males, you could create the factor like this:

`gender <- factor(c("female", "male", "female", "male", "female")) `
`class(gender)`
`[1] "factor"`
`mode(gender)`
`[1] "numeric"`
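As a minimal sketch of the storage scheme described above, the levels and the integer codes of the same gender factor can be inspected separately. The variable names `gender_levels`, `gender_codes` and `gender_counts` are introduced here only for illustration.

```r
# Inspect how a factor is stored internally.
gender <- factor(c("female", "male", "female", "male", "female"))

gender_levels <- levels(gender)      # the distinct labels, sorted alphabetically
gender_codes  <- as.integer(gender)  # the integer code of each observation
gender_counts <- table(gender)       # number of observations per level

print(gender_levels)
print(gender_codes)
print(gender_counts)
```

The codes are positions in the levels vector, which is why mode(gender) reports "numeric" even though the factor prints as text.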

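More often, you will create a dataframe by reading your data from a file using read.table. In versions of R before 4.0.0, all variables containing one or more character strings were converted automatically into factors. From R 4.0.0 onward the default is stringsAsFactors=FALSE, so you need to request the conversion explicitly. Here is an example:

`data <- read.table("c:\\temp\\daphnia.txt",header=TRUE,stringsAsFactors=TRUE)`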
`attach(data)`
`head(data)`

This dataframe contains a continuous response variable (Growth.rate) and three categorical explanatory variables (Water, Detergent and Daphnia), all of which are factors.
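The effect of the stringsAsFactors argument can be checked without a file on disk. The sketch below passes a small block of text to read.table through its `text` argument. The four rows are invented placeholders that only borrow the column names from the example above; they are not the real daphnia data.

```r
# Hypothetical stand-in for a data file: four made-up rows, same column names.
txt <- "Growth.rate Water Detergent Daphnia
2.9 River1 BrandA Clone1
3.1 River1 BrandB Clone2
4.2 River2 BrandC Clone3
3.7 River2 BrandD Clone1"

# Read the same text twice, with and without conversion to factors.
as_factors <- read.table(text = txt, header = TRUE, stringsAsFactors = TRUE)
as_strings <- read.table(text = txt, header = TRUE, stringsAsFactors = FALSE)

# Which columns ended up as factors in each case?
factor_cols_on  <- sapply(as_factors, is.factor)
factor_cols_off <- sapply(as_strings, is.factor)

print(factor_cols_on)
print(factor_cols_off)
```

With conversion switched on, the three text columns become factors and the numeric column is left alone. With it switched off, the text columns stay as character vectors.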

There are five major functions for dealing with factors: is.factor, levels, nlevels, as.factor and factor. You will often want to check that a variable is a factor (especially if the factor levels are numbers rather than characters):

`is.factor(Water)`
` [1] TRUE`

To discover the names of the factor levels, we use the levels function:

`levels(Detergent)`
`[1] "BrandA" "BrandB" "BrandC" "BrandD"`

To discover the number of levels of a factor, we use the nlevels function:

`nlevels(Detergent)`
`[1] 4`

The same result is achieved by applying the length function to the levels of a factor:

`length(levels(Detergent))`
`[1] 4`
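The as.factor function from the list above is most useful when group membership is coded with numbers. Here is a small sketch, assuming a made-up vector `group` of numeric treatment codes that is not part of the daphnia data:

```r
# Hypothetical numeric codes for three groups.
group <- c(1, 3, 2, 3, 1, 2)

before <- is.factor(group)      # a plain numeric vector is not a factor

group_f <- as.factor(group)     # convert; the levels become character labels
after   <- is.factor(group_f)

group_levels <- levels(group_f)   # labels are the sorted unique values as text
group_n      <- nlevels(group_f)  # number of distinct levels

print(c(before = before, after = after))
print(group_levels)
print(group_n)
```

Until the conversion is made, R would treat `group` as a continuous variable, which is why the is.factor check matters for numerically coded groups.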

By default, factor levels are created in alphabetical order. If you want to change this (as you might, for instance, in ordering the bars of a bar chart) then this is straightforward: just type the factor levels in the order that you want them to be used, and provide this vector as the second argument to the factor function.

Suppose we have an experiment with three factor levels in a variable called treatment, and we want them to appear in this order: ‘nothing’, ‘single’ dose and ‘double’ dose. We shall need to override R's natural tendency to order them ‘double’, ‘nothing’, ‘single’:

`frame <- read.table("c:\\temp\\trial.txt",header=TRUE)`
`attach(frame)`
`tapply(response,treatment,mean)`
` double nothing single`
` 25 60 34`

This is achieved using the factor function like this:

`treatment <- factor(treatment,levels=c("nothing","single","double"))`

Now we get the order we want:

`tapply(response,treatment,mean)`
`nothing single double`
`60      34       25`
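The reordering can be reproduced without the trial.txt file. In the sketch below the `response` values are invented so that the group means match the output shown above. Only the treatment labels come from the text.

```r
# Made-up responses, chosen so the group means are 60, 34 and 25.
response  <- c(58, 62, 30, 38, 20, 30)
treatment <- c("nothing", "nothing", "single", "single", "double", "double")

# Default behaviour: groups appear in alphabetical order.
default_means <- tapply(response, treatment, mean)

# Supply the levels explicitly, in the order wanted.
treatment <- factor(treatment, levels = c("nothing", "single", "double"))
ordered_means <- tapply(response, treatment, mean)

print(default_means)
print(ordered_means)
```

Only the order of the groups changes. The means themselves are identical in both tables.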

For unordered factors, only == and != are meaningful. Note, also, that a factor can only be compared to another factor with an identical set of levels (not necessarily in the same ordering) or to a character vector. You cannot ask quantitative questions about the levels of an unordered factor, such as > or <=, even if these levels are numeric. Ordered factors, created with ordered=TRUE, do support <, <=, > and >=.
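A short sketch of these comparison rules, using a made-up factor `size` with the levels ‘low’ and ‘high’ (not part of the examples above):

```r
# Hypothetical unordered factor.
size <- factor(c("low", "high", "low"))

# Equality against a character string works element by element.
eq <- size == "low"

# An inequality on an unordered factor is not meaningful:
# R issues a warning and returns NA for every element.
cmp <- suppressWarnings(size < "low")

# The same values as an ordered factor, with low < high.
size_ord <- factor(c("low", "high", "low"),
                   levels = c("low", "high"), ordered = TRUE)
lt <- size_ord < "high"

print(eq)
print(cmp)
print(lt)
```

The warning is suppressed here only to keep the output tidy. In interactive work it is the signal that the comparison was not valid.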

To turn a factor into its underlying integer codes, use the unclass function like this:

`as.vector(unclass(Daphnia))`
` [1] 1 1 1 2 2 2 3 3 3 1 1 1 2 2 2 3 3 3 1 1 1 2 2 2 3 3 3 1 1 1 2 2 2 3 3 3 1 1`
`[39] 1 2 2 2 3 3 3 1 1 1 2 2 2 3 3 3 1 1 1 2 2 2 3 3 3 1 1 1 2 2 2 3 3 3`
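Note that the integers are positions in the levels vector, not the values of the labels. When the labels themselves are numbers, this distinction matters. The sketch below uses a made-up factor `dose` whose labels are numeric strings; converting through as.character is a common way to recover the original numbers.

```r
# Hypothetical factor whose labels happen to be numbers.
dose <- factor(c("10", "5", "20", "5"))

dose_levels <- levels(dose)                   # sorted as text: "10" "20" "5"
dose_codes  <- as.vector(unclass(dose))       # positions in the levels vector
dose_values <- as.numeric(as.character(dose)) # the numbers the labels spell out

print(dose_levels)
print(dose_codes)
print(dose_values)
```

Here the code for "5" is 3, because "5" is the third level once the labels are sorted as character strings.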

### factor() function

The factor() function can be used to create value labels for categorical variables. Continuing our example, say that you have a variable named gender, which is coded 1 for male and 2 for female. You could create value labels with the code

```
patientdata$gender <- factor(patientdata$gender,
                             levels = c(1,2),
                             labels = c("male", "female"))
```

Here levels indicate the actual values of the variable, and labels refer to a character vector containing the desired labels.
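To try the levels and labels arguments on their own, a small data frame is needed first. The sketch below builds a hypothetical `patientdata` with four made-up gender codes and then applies the same call.

```r
# Hypothetical data frame: gender coded 1 for male and 2 for female.
patientdata <- data.frame(gender = c(1, 2, 2, 1))

# Attach value labels to the numeric codes.
patientdata$gender <- factor(patientdata$gender,
                             levels = c(1, 2),
                             labels = c("male", "female"))

gender_labels <- as.character(patientdata$gender)  # labels seen by the user
gender_codes  <- as.integer(patientdata$gender)    # codes stored internally

print(gender_labels)
print(gender_codes)
```

The labels are matched to the levels by position, so the first label goes with the first level.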

As you’ve seen, variables can be described as nominal, ordinal, or continuous. Nominal variables are categorical, without an implied order. Diabetes (Type1, Type2) is an example of a nominal variable. Even if Type1 is coded as a 1 and Type2 is coded as a 2 in the data, no order is implied. Ordinal variables imply order but not amount. Status (poor, improved, excellent) is a good example of an ordinal variable. You know that a patient with a poor status isn’t doing as well as a patient with an improved status, but not by how much. Continuous variables can take on any value within some range, and both order and amount are implied. Age in years is a continuous variable and can take on values such as 14.5 or 22.8 and any value in between. You know that someone who is 15 is one year older than someone who is 14.

Categorical (nominal) and ordered categorical (ordinal) variables in R are called factors. Factors are crucial in R because they determine how data will be analyzed and presented visually. You’ll see examples of this throughout the book.

The function factor() stores the categorical values as a vector of integers in the range [1... k] (where k is the number of unique values in the nominal variable), and an internal vector of character strings (the original values) mapped to these integers.

For example, assume that you have the vector

`diabetes <- c("Type1", "Type2", "Type1", "Type1")`

The statement diabetes <- factor(diabetes) stores this vector as (1, 2, 1, 1) and associates it with 1=Type1 and 2=Type2 internally (the assignment is alphabetical). Any analyses performed on the vector diabetes will treat the variable as nominal and select the statistical methods appropriate for this level of measurement.

For vectors representing ordinal variables, you add the parameter ordered=TRUE to the factor() function. Given the vector

`status <- c("Poor", "Improved", "Excellent", "Poor")`

the statement status <- factor(status, ordered=TRUE) will encode the vector as (3, 2, 1, 3) and associate these values internally as 1=Excellent, 2=Improved, and 3=Poor. Additionally, any analyses performed on this vector will treat the variable as ordinal and select the statistical methods appropriately.
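The two encodings just described can be checked directly. This is a minimal sketch using the diabetes and status vectors from the text. The `_codes` names are introduced only for illustration.

```r
# Nominal variable: unordered factor.
diabetes <- factor(c("Type1", "Type2", "Type1", "Type1"))

# Ordinal variable: ordered factor with the default (alphabetical) level order.
status <- factor(c("Poor", "Improved", "Excellent", "Poor"), ordered = TRUE)

diabetes_codes <- as.integer(diabetes)  # 1 = Type1, 2 = Type2
status_codes   <- as.integer(status)    # 1 = Excellent, 2 = Improved, 3 = Poor

print(diabetes_codes)
print(status_codes)
print(levels(status))
print(c(is.ordered(diabetes), is.ordered(status)))
```

Printing status itself also shows the ordering, with the levels listed as Excellent < Improved < Poor.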

By default, factor levels for character vectors are created in alphabetical order. This worked for the status factor, because the order “Excellent,” “Improved,” “Poor” made sense. There would have been a problem if “Poor” had been coded as “Ailing” instead, because the order would be “Ailing,” “Excellent,” “Improved.” A similar problem exists if the desired order was “Poor,” “Improved,” “Excellent.” For ordered factors, the alphabetical default is rarely sufficient.

You can override the default by specifying a levels option. For example,

```
status <- factor(status, ordered=TRUE,
                 levels=c("Poor", "Improved", "Excellent"))
```

would assign the levels as 1=Poor, 2=Improved, 3=Excellent. Be sure that the specified levels match your actual data values. Any data values not in the list will be set to missing.
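The warning about unmatched values is easy to demonstrate. In the sketch below a fifth value, "Ailing", is added to the status vector purely for illustration. It is not in the levels list, so it becomes NA.

```r
# The status vector from the text plus one value that is not a listed level.
status <- c("Poor", "Improved", "Excellent", "Poor", "Ailing")

status <- factor(status, ordered = TRUE,
                 levels = c("Poor", "Improved", "Excellent"))

status_codes <- as.integer(status)  # unmatched values have no code
n_missing    <- sum(is.na(status))  # how many values were dropped to NA

print(status)
print(status_codes)
print(n_missing)
```

No warning is issued when this happens, so comparing sum(is.na(...)) before and after the conversion is a cheap safeguard against misspelled levels.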

The following listing demonstrates how specifying factors and ordered factors impacts data analyses.

First, you enter the data as vectors. Then you specify that diabetes is a factor and status is an ordered factor. Finally, you combine the data into a data frame. The function str(object) provides information on an object in R (the data frame in this case). It clearly shows that diabetes is a factor and status is an ordered factor, along with how each is coded internally. Note that the summary() function treats the variables differently. It provides the minimum, maximum, mean, and quartiles for the continuous variable age, and frequency counts for the categorical variables diabetes and status.
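The steps just described might look like the sketch below. The diabetes and status vectors are the ones used earlier in this section. The `patientID` and `age` values are invented for illustration, and the code is a reconstruction under those assumptions rather than the original listing.

```r
# 1. Enter the data as vectors (patientID and age are made-up values).
patientID <- c(1, 2, 3, 4)
age       <- c(25, 34, 28, 52)
diabetes  <- c("Type1", "Type2", "Type1", "Type1")
status    <- c("Poor", "Improved", "Excellent", "Poor")

# 2. Declare diabetes a factor and status an ordered factor.
diabetes <- factor(diabetes)
status   <- factor(status, ordered = TRUE)

# 3. Combine the vectors into a data frame.
patientdata <- data.frame(patientID, age, diabetes, status)

# Show the structure of the data frame, including the factor coding.
str(patientdata)

# Summarize: quartiles for numeric columns, counts for factors.
summary(patientdata)
```

In the str() output, a plain factor is reported as "Factor" and an ordered one as "Ord.factor", each followed by its levels and the first few integer codes.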

