Merging and Sorting Dataframes


Note: This topic has a lot to it, so I only cover the basics. There are plenty of online resources to do this on your own. There won't be any homework questions on this section.

Say you have two datasets and you want to merge them based on an ID number. A simple example of merging these by variable ID follows:

> dataset.1<-matrix(c(1,13,12,1, 2,12,10,2, 3,13,9,1, 4,9,8,2, 5,3,7,3, 6,5,6,1, 7,6,5,2, 8,5,5,3), ncol=4, byrow=T)

> dataset.1<-data.frame(dataset.1)

> names(dataset.1)<-c("ID","read 1","read 2","read 3")

> dataset.1

  ID read 1 read 2 read 3

1  1     13     12      1

2  2     12     10      2

3  3     13      9      1

4  4      9      8      2

5  5      3      7      3

6  6      5      6      1

7  7      6      5      2

8  8      5      5      3

 

> dataset.2<-matrix(c(1,12, 2,13, 3,3, 4,15, 5,31, 6,15, 7,4, 8,6), ncol=2, byrow=T)

> dataset.2<-data.frame(dataset.2)

> names(dataset.2)<-c("ID","read 4")

 

> dataset.2

  ID read 4

1  1     12

2  2     13

3  3      3

4  4     15

5  5     31

6  6     15

7  7      4

8  8      6

 

> dataset.merged <- merge(dataset.1,dataset.2,by="ID")

 

> dataset.merged

ID read 1 read 2 read 3 read 4

1  1     13     12      1     12

2  2     12     10      2     13

3  3     13      9      1      3

4  4      9      8      2     15

5  5      3      7      3     31

6  6      5      6      1     15

7  7      6      5      2      4

8  8      5      5      3      6

Many to One Merging 

Now, a nice simple example of many-to-one merging:

> library(reshape)

> my.test.2<-matrix(c(1,13,12,1, 1,12,10,2, 2,13,9,1, 2,9,8,2, 2,3,7,3, 3,5,6,1, 3,6,5,2, 3,5,5,3), ncol=4, byrow=T)

#create sample data

 

# convert to a dataframe, a more robust format

> my.test.2<-as.data.frame(my.test.2)

 

# add column names

> names(my.test.2)<-c("ID","read A","read B","visit")

 

# print it to get a look. This is in "long" format

> my.test.2

  ID read A read B visit

1  1     13     12     1

2  1     12     10     2

3  2     13      9     1

4  2      9      8     2

5  2      3      7     3

6  3      5      6     1

7  3      6      5     2

8  3      5      5     3

 

# above is a matrix of repeated measures on the same patient seen during multiple visits. The function below coerces the data frame into a 'wide' format, with one row per individual, renaming variables to account for missing values.

 

> wide_mytest <- reshape(my.test.2, direction="wide",idvar="ID",timevar="visit")

> wide_mytest

  ID read A.1 read B.1 read A.2 read B.2 read A.3 read B.3

1  1       13       12       12       10       NA       NA

3  2       13        9        9        8        3        7

6  3        5        6        6        5        5        5

 

Finally, you have a matrix you want to sort from smallest to largest of a particular column, but you want to keep each row intact. Do the following:

> setwd("/Users/Avery/Documents/classes/Graduate/720 Intro to R/spring 2015/classes/wk3/sorting matrices")

> read.table("ZZ")

          p result x1         x2

1 0.8417549      1  2  0.4213440

2 0.9136235      1  3 -0.6412975

3 0.8361460      0  2  0.3798271

4 0.9850423      1  4 -0.5625415

5 0.3114491      1  0  1.4566465

#note each row here starts with an assigned number; this is NOT an actual column of (1,2,3,…), it's just a row designation.

#now sort it via column named "p"

 

> ZZ[order(ZZ[,1]),]

 

# or:

 

> attach(ZZ)

> ZZ[order(p),]

               p             result x1         x2

5        0.3114491      1  0  1.4566465

3        0.8361460      0  2  0.3798271

1        0.8417549      1  2  0.4213440

2        0.9136235      1  3 -0.6412975

4        0.9850423      1  4 -0.5625415

Now the matrix is sorted by column "p" while keeping the structure of the dataframe.