Extracting information
R can be used to visualise and analyse different types of data. Therefore, we need to understand how to handle data in R.
We will start with the data set called women. This data set can be loaded using the function data as follows:
data("women")
To view the data set, we can use the R function View(women). This opens a new tab in our Source panel. Alternatively, we can view the first 6 rows of the data set in the console using the head function:
head(women)
height weight
1 58 115
2 59 117
3 60 120
4 61 123
5 62 126
6 63 129
The data set consists of two columns: height and weight. If you want to know more about the data set, you can visit the help page using ?women.
This data set is in the form of a data.frame. A data.frame stores variables of different classes in named columns and rows.
We will learn how to extract particular columns, rows and values from our data.frame.
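As a quick check of the description above, the following sketch inspects the class of women and of each of its columns. The object toy is a hypothetical data frame made up purely for illustration (it is not part of the women data set) to show columns of different classes sitting side by side.

```r
data("women")

# The object itself is a data.frame
class(women)

# Class of each column in women (both columns hold numbers)
sapply(women, class)

# Hypothetical toy example with columns of different classes
toy <- data.frame(
  id     = c("a", "b", "c"),      # text
  height = c(58, 59, 60),         # numbers
  tall   = c(FALSE, FALSE, TRUE)  # logical values
)
sapply(toy, class)
```

The function str(women) gives a similar overview in a single call.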
Firstly, to find the dimensions (the size) of our data.frame, we will use the function dim:
dim(women)
[1] 15 2
Our data consists of 15 rows and 2 columns.
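If we only need one of the two dimensions, base R also provides nrow and ncol. A minimal sketch, where the names n_rows and n_cols are our own choice:

```r
data("women")

n_rows <- nrow(women)  # number of rows
n_cols <- ncol(women)  # number of columns

n_rows
n_cols
```

Storing these values in objects is handy when later code depends on the size of the data, since we then avoid typing the numbers by hand.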
If we want to extract particular values at certain rows and columns, we can type in the row and column number. For example, say we want the 1st row and the 1st column:
women[1, 1]
[1] 58
Or the 7th value in the 2nd column,
women[7, 2]
[1] 132
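The same bracket notation accepts more than one row or column number at a time. The sketch below is an illustrative extension of the examples above; the object names are our own.

```r
data("women")

# Weights (2nd column) in the 1st and 7th rows
two_weights <- women[c(1, 7), 2]
two_weights

# The first three rows, keeping both columns
first_three <- women[1:3, ]
first_three
```

Here c(1, 7) picks out individual rows, while 1:3 is shorthand for the consecutive rows 1, 2 and 3.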
It’s more likely that we will want to perform calculations, or plot just one column of the data. We can extract columns from our data in different ways.
Note how the following three methods all give the same result.
Use the column number,
women[, 1]
[1] 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72
Use the column name in place of the number,
women[, "height"]
[1] 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72
Use the $ symbol and the column name,
women$height
[1] 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72
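Rather than comparing the printed values by eye, we can ask R to confirm that the three methods agree. This is a small sketch using the base function identical; the object names are our own.

```r
data("women")

by_number <- women[, 1]         # column number
by_name   <- women[, "height"]  # column name
by_dollar <- women$height       # $ symbol

# Both comparisons return TRUE when the results match exactly
identical(by_number, by_name)
identical(by_name, by_dollar)

# All three return a plain vector. To keep a one-column data.frame
# instead, add drop = FALSE inside the brackets.
height_df <- women[, "height", drop = FALSE]
class(height_df)
```

Using the column name or the $ symbol is usually easier to read than the column number, and it keeps working if the order of the columns changes.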
We can use the first two of these methods to extract rows. Note that here the row names are the same as the row numbers, but this might not always be the case.
women[1, ]
height weight
1 58 115
women["1", ]
height weight
1 58 115
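To see a case where row names and row numbers differ, we can take a subset of the data: the rows that are kept hold on to their original row names. In this sketch the cut-off of 150 and the object name heavier are arbitrary choices made for illustration.

```r
data("women")

# Keep only the rows where weight is above 150
heavier <- women[women$weight > 150, ]

# The row names still refer to the original positions
rownames(heavier)

# First row of the subset, selected by position
heavier[1, ]

# The same row, selected by its row name
heavier["13", ]
```

In heavier, the row number 1 and the row name "13" point to the same row, so it matters whether we index with a number or with a quoted name.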
We can extract statistical summary information from our data set using the summary function. This function gives measures such as the mean and median for each column in the data set.
summary(women)
     height         weight
 Min.   :58.0   Min.   :115.0
 1st Qu.:61.5   1st Qu.:124.5
 Median :65.0   Median :135.0
 Mean   :65.0   Mean   :136.7
 3rd Qu.:68.5   3rd Qu.:148.0
 Max.   :72.0   Max.   :164.0
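Each of the measures reported by summary can also be computed on its own, which is useful when we want to store a single value and reuse it. A minimal sketch using base R functions on individual columns:

```r
data("women")

mean(women$weight)    # average weight
median(women$weight)  # middle value of the weights
range(women$height)   # smallest and largest height

# First and third quartiles of height
quantile(women$height, probs = c(0.25, 0.75))

# Apply the same function to every column at once
sapply(women, mean)
```

Note that summary prints rounded values, so the mean weight appears as 136.7 there, whereas mean returns the value with more decimal places.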
Let’s say we want to look at a section of our data. More specifically, we want to look at the heights of women who weighed more than the average.
We can extract the correct column using the $ symbol, and we use the which function to find the row numbers for which the weight is more than the average, 136.7.
which(women$weight > 136.7)
[1] 9 10 11 12 13 14 15
We can improve this code by using the R function mean to find the average, and assigning this value to mean_weight:
mean_weight <- mean(women$weight)
which(women$weight > mean_weight)
[1] 9 10 11 12 13 14 15
Our final step to find the heights is to assign the row numbers to which_rows and use them to extract the matching rows of the data.frame:
mean_weight <- mean(women$weight)
which_rows <- which(women$weight > mean_weight)
women[which_rows, ]
height weight
9 66 139
10 67 142
11 68 146
12 69 150
13 70 154
14 71 159
15 72 164
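If we want only the heights rather than whole rows, we can apply the same condition directly to the height column. The sketch below is an alternative to using which: it indexes with the vector of TRUE and FALSE values itself. The names above_average and heights_above are our own.

```r
data("women")

mean_weight <- mean(women$weight)

# TRUE for women heavier than the average, FALSE otherwise
above_average <- women$weight > mean_weight

# Heights of the women for whom the condition is TRUE
heights_above <- women$height[above_average]
heights_above
```

For this data set both approaches select the same rows. One difference to keep in mind is that which leaves out missing values (NA), whereas indexing with TRUE and FALSE values returns NA for them.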
Find the weights of women with heights less than or equal to the median height of the sample. The subset of the data set you should obtain is:
height weight
1 58 115
2 59 117
3 60 120
4 61 123
5 62 126
6 63 129
7 64 132
8 65 135
Firstly, we find the median height of women in the sample and name this object median_height. Next, we find which rows contain a height less than or equal to the median height. Finally, we extract the corresponding rows of the data frame.
median_height <- median(women$height)
which_rows <- which(women$height <= median_height)
women[which_rows, ]
height weight
1 58 115
2 59 117
3 60 120
4 61 123
5 62 126
6 63 129
7 64 132
8 65 135
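Since the question asks for the weights, we could also return just the weight column for these rows. The sketch below shows this, together with the base R function subset as an alternative way of selecting the same rows; the object names are our own.

```r
data("women")

median_height <- median(women$height)

# Weights only, for heights at or below the median
weights_below <- women$weight[women$height <= median_height]
weights_below

# The same rows of the data.frame, selected with subset()
below_median <- subset(women, height <= median_height)
below_median
```

Inside subset we can refer to the column as height without the women$ prefix, which can make longer conditions easier to read.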