Extracting information

R can be used to visualise and analyse many different types of data, so we first need to understand how to handle data in R.

We will start with the data set called women. This data set can be loaded using the function data as follows:

data("women")

To view the whole data set, we can use the R function View(women). In RStudio, this opens a new tab in the Source panel. Alternatively, we can print the first 6 rows of the data set in the console using the head function:

head(women)

  height weight
1     58    115
2     59    117
3     60    120
4     61    123
5     62    126
6     63    129
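The head function also accepts an argument n that sets how many rows are shown, and the tail function does the same from the end of the data set. A brief sketch (the object names first_three and last_three are only illustrative):

```r
data("women")

# Show the first three rows instead of the default six
first_three <- head(women, n = 3)
print(first_three)

# Show the last three rows of the data set
last_three <- tail(women, n = 3)
print(last_three)
```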

The data set consists of two columns: height and weight. If you want to know more about the data set, you can visit the help page using ?women.

This data set is in the form of a data.frame. A data.frame stores variables in named columns, and each column can hold values of a different class. Each row holds one observation.
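One way to check this for ourselves is to ask R for the class of the object and of each of its columns. The sketch below uses the base functions class, sapply and str; the object names women_class and column_classes are only illustrative.

```r
data("women")

# The class of the whole object
women_class <- class(women)
print(women_class)

# The class of each column, returned as a named character vector
column_classes <- sapply(women, class)
print(column_classes)

# A compact overview: number of rows, columns and the first few values
str(women)
```

In this data set both columns happen to be numeric, but a data.frame could just as well mix numeric, character and logical columns.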

We will learn how to extract particular columns, rows and values from our data.frame.

Firstly, to find the dimensions (the size) of our data.frame, we will use the function dim:

dim(women)

[1] 15  2

Our data consists of 15 rows and 2 columns.
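If only one of the two numbers is needed, the functions nrow and ncol return them separately, and names lists the column names. A short sketch, with illustrative object names:

```r
data("women")

# Number of rows and number of columns, one at a time
n_rows <- nrow(women)
n_cols <- ncol(women)
print(c(n_rows, n_cols))

# The names of the columns
column_names <- names(women)
print(column_names)
```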

If we want to extract the value at a particular row and column, we can give the row number and the column number inside square brackets, separated by a comma. For example, say we want the value in the 1st row and the 1st column:

women[1, 1]

[1] 58

Or the 7th value in the 2nd column,

women[7, 2]

[1] 132
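The same square-bracket notation also accepts several row or column numbers at once. As a small sketch (the object names selected_weights and first_rows are only illustrative), we can pass a vector of row numbers built with c() or the : operator:

```r
data("women")

# The weights (2nd column) in rows 1 and 7
selected_weights <- women[c(1, 7), 2]
print(selected_weights)

# Rows 1 to 3, keeping both columns
first_rows <- women[1:3, ]
print(first_rows)
```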

It’s more likely that we will want to perform calculations on, or plot, just one column of the data. We can extract columns from our data in different ways.

Note how the following three methods all give the same result.

Using the column number,

women[, 1]

 [1] 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72

Using the column name in place of the number,

women[, "height"]

 [1] 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72

Using the $ symbol and the column name,

women$height

 [1] 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72
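Rather than comparing the printed output by eye, we can ask R whether the results are the same using the base function identical. The sketch below also shows the drop = FALSE argument, which keeps a single extracted column as a data.frame instead of simplifying it to a vector. The object names are only illustrative.

```r
data("women")

by_number <- women[, 1]
by_name   <- women[, "height"]
by_dollar <- women$height

# Check that all three approaches return exactly the same vector
print(identical(by_number, by_name))
print(identical(by_name, by_dollar))

# drop = FALSE keeps the result as a one-column data.frame
height_df <- women[, "height", drop = FALSE]
print(class(height_df))
```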

We can use the first two of these methods to extract rows. Note that here the row names are the same as the row numbers, but this might not always be the case.

women[1, ]

  height weight
1     58    115

women["1", ]

  height weight
1     58    115
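To see a case where row names and row numbers differ, we can take a subset of the data: the subset keeps the original row names, so its first row is no longer named "1". A minimal sketch, where heavier, by_position and by_row_name are illustrative object names:

```r
data("women")

# Keep only rows 9 and 10; their row names stay "9" and "10"
heavier <- women[c(9, 10), ]
print(rownames(heavier))

# By position: the first row of the subset
by_position <- heavier[1, ]
print(by_position)

# By name: the row whose name is "9"
by_row_name <- heavier["9", ]
print(by_row_name)
```

Here heavier[1, ] and heavier["9", ] pick out the same row, whereas heavier["1", ] would not find a matching row name.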

We can extract statistical summary information from our data set using the summary function. This function gives measures such as the mean and median for each column in the data set.

summary(women)

     height         weight     
 Min.   :58.0   Min.   :115.0  
 1st Qu.:61.5   1st Qu.:124.5  
 Median :65.0   Median :135.0  
 Mean   :65.0   Mean   :136.7  
 3rd Qu.:68.5   3rd Qu.:148.0  
 Max.   :72.0   Max.   :164.0
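The individual measures in this summary can also be computed one at a time with base functions such as mean, median and quantile. A short sketch with illustrative object names:

```r
data("women")

# Single measures for one column
mean_height   <- mean(women$height)
median_weight <- median(women$weight)
print(mean_height)
print(median_weight)

# The mean of every column at once
column_means <- sapply(women, mean)
print(column_means)

# Minimum, quartiles and maximum of the weights
weight_quartiles <- quantile(women$weight)
print(weight_quartiles)
```

Note that summary rounds its output for display, so a value such as the mean weight is printed there with fewer digits than mean returns.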

Let’s say we want to look at sections of our data. More specifically, we want to look at the heights of women who weighed more than the average.

We can extract the weight column using the $ symbol, and then use the which function to find the row numbers for which the weight is more than the average, 136.7.

which(women$weight > 136.7)

[1]  9 10 11 12 13 14 15

We can improve this code by using the R function mean to find the average, and assign this value to mean_weight:

mean_weight <- mean(women$weight)
which(women$weight > mean_weight)

[1]  9 10 11 12 13 14 15

Our final step is to assign the row numbers to which_rows and use them to extract the matching rows, which contain the heights we are after:

mean_weight <- mean(women$weight)
which_rows <- which(women$weight > mean_weight)

women[which_rows, ]

   height weight
9      66    139
10     67    142
11     68    146
12     69    150
13     70    154
14     71    159
15     72    164
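The comparison women$weight > mean_weight produces a vector of TRUE and FALSE values, and this logical vector can itself be used for indexing, as an alternative to which. The sketch below uses that to return only the heights, and also shows the base function subset as another way to get the same rows. The names is_above_mean, heights_above_mean and same_rows are illustrative.

```r
data("women")

mean_weight <- mean(women$weight)

# TRUE for each row where the weight is above the mean, FALSE otherwise
is_above_mean <- women$weight > mean_weight

# Index the height column directly to get just the heights
heights_above_mean <- women$height[is_above_mean]
print(heights_above_mean)

# subset() returns the matching rows with both columns
same_rows <- subset(women, weight > mean_weight)
print(same_rows)
```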

Exercise

Find the weights of women with heights less than or equal to the median height of the sample. The subset of the data set you should obtain is:

  height weight
1     58    115
2     59    117
3     60    120
4     61    123
5     62    126
6     63    129
7     64    132
8     65    135

Solution

Firstly, we find the median height of women in the sample and name this object median_height. Next, we find which rows contain a height less than or equal to the median height. Finally, we extract the corresponding rows of the data frame.

median_height <- median(women$height)
which_rows <- which(women$height <= median_height)
women[which_rows, ]

  height weight
1     58    115
2     59    117
3     60    120
4     61    123
5     62    126
6     63    129
7     64    132
8     65    135
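If only the weights are wanted, rather than both columns, one possible variation is to index the weight column directly with the same condition. This is a sketch of an alternative, and the name weights_up_to_median is only illustrative.

```r
data("women")

median_height <- median(women$height)

# Keep the weights for rows whose height is at most the median height
weights_up_to_median <- women$weight[women$height <= median_height]
print(weights_up_to_median)
```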