Excel is a great tool for processing data. Whether it’s calculating a quick average, standard deviation, or t-test, Excel is fast and simple to learn and use. In fact, most of my colleagues use Excel to preprocess their data from experiments programmed in E-Prime, a software tool for running psychological experiments. However, preprocessing your E-Prime data in Excel will:

  1. limit the statistical methods used to analyze your data
  2. become burdensome with large samples (> 20 participants)
  3. create large workbooks that are slow and susceptible to crashes
  4. create roadblocks down the line when reviewer #2 asks for a new analysis (I swear, it’s always reviewer #2)

I’ll demonstrate a different way to preprocess your E-Prime data with R and the R package dplyr (Wickham & Francois, 2016), using a small data set from a real experiment of mine. This method is much faster, more efficient, and saves soooooo much time when it comes to reviewer #2’s requests.

Just a few notes about the experiment: participants completed a task that required them to reason about items on the screen and press the 1 or 2 key on the keyboard. There were 3 conditions randomly distributed over 4 blocks, with 24 trials per block (8 trials per condition per block). I was interested in the participants’ accuracy and reaction time for correct solutions across the conditions and blocks. The whole task ran in a single E-Prime session, so each participant has just 1 E-Prime file.

Step 1 - Merging E-DataAid Files

E-Prime spat out an E-DataAid file (*.edat2) for every participant upon completion of the experiment. Let’s first concatenate these files row-wise (on top of each other) so that we end up with one big file that has each participant’s data. This is done using E-Prime’s E-Merge software.

Concatenating with E-Merge

  1. Open E-Merge and navigate to the directory where all the E-DataAid files are stored using the Folder Tree
  2. Select all the E-DataAid files and click Merge…
  3. Choose the appropriate option and click Next >:
    1. Standard Merge if all your files are in one directory (option chosen in this tutorial)
    2. Recursive Merge if all your files are stored in folders within folders
  4. Name your file and save it as a *.emrg2 file (the default)

As long as your E-DataAid files are consistent with each other, they should merge seamlessly. Next, we have to use E-DataAid to convert the merged file into a format that R and other programs can read:

Converting Merged E-DataAid

  1. Double click on the *.emrg2 file that you just created
  2. Go to File > Export
  3. Ensure the “Export to:” option is set to StatView and SPSS
  4. Ensure that the Unicode box at the bottom is unchecked
  5. Click OK and name/save the file as a *.txt (the default)

Now these data are in one central file, ready for R. Next, let’s import them:

Step 2 - Importing into R

  1. Open R or RStudio and **ensure your working directory is set to where you saved your text file (*.txt) from above** (see the snippet after this list if you need to check or set it)
  2. Import the file into R and save it as a variable:
eData <- read.table('data.txt', # The name of your *.txt file from above
                    sep = '\t', # These data are tab separated
                    header = T) # Appends variables as column names 
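
If you need to check or set your working directory, base R has you covered (the path below is just a placeholder for wherever your file actually lives):

getwd()                      # Prints the current working directory
setwd('~/path/to/your/data') # Placeholder path: point this at your data folder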

For the purposes of this tutorial, these data are available on this website’s github repository and can be downloaded into R like this:

rm(list = ls())                 # Clears workspace
library(RCurl)                  # To retrieve data from github repo
eData <- read.table(text = getURL("https://raw.githubusercontent.com/mkmiecik14/mkmiecik14.github.io/master/data/data.txt"),
                    sep = '\t',
                    header = T)
dim(eData)
> [1] 480  47

As we can see, this dataframe has 480 rows and 47 columns. Each row is a trial from a participant, and each column is a measure or piece of information from your E-Prime experiment. E-Prime gives us way more than we need (names(eData) will list all 47 column names), so I like to clean this up and keep only the essentials (note: these variable names will vary based on your experiment):

# Variables that I want to keep
vars <- c('Subject','probType','stimulus.ACC','stimulus.RT')

# Subsetting these variables
eDataSimple <- subset(eData, select = vars)

# Let's take a look
head(eDataSimple); tail(eDataSimple)
>   Subject probType stimulus.ACC stimulus.RT
> 1     413        a            1        5578
> 2     413        s            0        9889
> 3     413        a            0        4218
> 4     413        p            1        1376
> 5     413        s            0        6169
> 6     413        p            1        2663
>     Subject probType stimulus.ACC stimulus.RT
> 475     416        s            1        4671
> 476     416        a            1        2765
> 477     416        s            1       13400
> 478     416        p            1         957
> 479     416        a            1        2112
> 480     416        s            1        8193

I’ve printed the top and bottom 6 rows of this dataframe. As you can see, the first participant’s ID is 413, while the last participant’s ID is 416. Each row is a trial with a problem type (p, a, or s), an accuracy score (1 for correct, 0 for incorrect), and an associated reaction time (RT) in milliseconds. I forgot to log the block each trial appeared in when I programmed my E-Prime experiment, but I can add it like this:

# Adding block
eDataSimple$Block <- rep(c(1,2,3,4), each = 24) # 24 trials/block, recycled across participants

# Let's take another look
head(eDataSimple); tail(eDataSimple)
>   Subject probType stimulus.ACC stimulus.RT Block
> 1     413        a            1        5578     1
> 2     413        s            0        9889     1
> 3     413        a            0        4218     1
> 4     413        p            1        1376     1
> 5     413        s            0        6169     1
> 6     413        p            1        2663     1
>     Subject probType stimulus.ACC stimulus.RT Block
> 475     416        s            1        4671     4
> 476     416        a            1        2765     4
> 477     416        s            1       13400     4
> 478     416        p            1         957     4
> 479     416        a            1        2112     4
> 480     416        s            1        8193     4
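
One caveat about that rep() call: it only produces 96 values (4 blocks x 24 trials), which R silently recycles down all 480 rows. That works here because each participant’s trials are stored in order, but if you would rather be explicit than lean on recycling, a sketch like this does the same thing (assuming rows are sorted by participant):

# Builds one 96-trial block sequence per participant, no recycling
nSubs <- length(unique(eDataSimple$Subject))
eDataSimple$Block <- rep(rep(1:4, each = 24), times = nSubs)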

Now these data are in a perfect format to summarize using the R package dplyr.

Step 3 - Summarize with dplyr

Let’s calculate the mean accuracy for each condition x block cell (a 3 x 4 repeated-measures factorial design). But first, let’s group based on our factors: 1) the subjects, 2) the conditions, and 3) the blocks. dplyr also lets you rename the columns easily as you group:

library(dplyr) # Loads dplyr

acc <- group_by(eDataSimple,              # Dataframe   
                ss = Subject,             # Subjects
                cond = probType,          # Conditions 
                block = as.factor(Block)) # Blocks

Next, let’s calculate the average based on these factors. In other words, for each subject, what was his or her accuracy for each condition crossed with block?

acc <- summarise(acc, meanAcc = mean(stimulus.ACC))

head(acc); tail(acc) 
> Source: local data frame [6 x 4]
> Groups: ss, cond [2]
> 
>      ss   cond  block meanAcc
>   <int> <fctr> <fctr>   <dbl>
> 1   413      a      1   0.875
> 2   413      a      2   1.000
> 3   413      a      3   1.000
> 4   413      a      4   0.750
> 5   413      p      1   0.750
> 6   413      p      2   1.000
> Source: local data frame [6 x 4]
> Groups: ss, cond [2]
> 
>      ss   cond  block meanAcc
>   <int> <fctr> <fctr>   <dbl>
> 1   418      p      3   1.000
> 2   418      p      4   1.000
> 3   418      s      1   0.875
> 4   418      s      2   0.625
> 5   418      s      3   0.625
> 6   418      s      4   0.875
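
Before trusting these means, a quick sanity check helps: each subject x condition x block cell should contain exactly 8 trials. dplyr’s n() counts the rows per group (this is just a check and doesn’t alter acc):

# Each cell should hold 8 trials (24 trials/block across 3 conditions)
cellCounts <- summarise(group_by(eDataSimple,
                                 ss = Subject,
                                 cond = probType,
                                 block = as.factor(Block)),
                        nTrials = n())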

dplyr can also chain these steps into a pipeline with the %>% operator, which streamlines the code. Here is an example of a pipeline that calculates both accuracy and RT:

results <-  eDataSimple %>% 
              group_by(ss = Subject,
                       cond = probType,
                       block = as.factor(Block)) %>%
              summarise_each('mean', c(stimulus.ACC, stimulus.RT))

head(results); tail(results)
> Source: local data frame [6 x 5]
> Groups: ss, cond [2]
> 
>      ss   cond  block stimulus.ACC stimulus.RT
>   <int> <fctr> <fctr>        <dbl>       <dbl>
> 1   413      a      1        0.875    5776.375
> 2   413      a      2        1.000    2393.125
> 3   413      a      3        1.000    2516.875
> 4   413      a      4        0.750    1940.000
> 5   413      p      1        0.750    5911.625
> 6   413      p      2        1.000    3227.625
> Source: local data frame [6 x 5]
> Groups: ss, cond [2]
> 
>      ss   cond  block stimulus.ACC stimulus.RT
>   <int> <fctr> <fctr>        <dbl>       <dbl>
> 1   418      p      3        1.000    2223.000
> 2   418      p      4        1.000    2332.750
> 3   418      s      1        0.875    6749.000
> 4   418      s      2        0.625    5387.250
> 5   418      s      3        0.625    7239.000
> 6   418      s      4        0.875    5255.125
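
A quick hedge for anyone reading this on a newer version of dplyr: summarise_each() has since been superseded. If the call above throws a deprecation warning or error, the equivalent pipeline in dplyr 1.0+ uses across():

# dplyr 1.0+ equivalent of the summarise_each() call above
results <- eDataSimple %>%
             group_by(ss = Subject,
                      cond = probType,
                      block = as.factor(Block)) %>%
             summarise(across(c(stimulus.ACC, stimulus.RT), mean),
                       .groups = 'drop') # Returns an ungrouped dataframe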

The dataframe ‘results’ is ready for stats in R.
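
And if you want these summaries outside of R (say, for a collaborator who insists on spreadsheets), base R’s write.csv() exports them in one line (the filename is just a placeholder):

write.csv(results, 'results.csv', row.names = FALSE) # Writes the summary to a CSV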

Now what about that pesky reviewer #2? Let’s say reviewer #2 asks for a new analysis: instead of reaction time across all trials, they want reaction time for correct solutions only. If you had preprocessed your data in Excel, you would probably have to re-compute all of these values in each sheet and then re-do the analyses. In R, it’s only one additional line of code:

corRT <-  eDataSimple %>%
          filter(stimulus.ACC == 1) %>%  # Keeps only correct trials
          group_by(ss = Subject,
                   cond = probType,
                   block = as.factor(Block)) %>%
          summarise_each('mean', c(stimulus.ACC, stimulus.RT))

head(corRT); tail(corRT)
> Source: local data frame [6 x 5]
> Groups: ss, cond [2]
> 
>      ss   cond  block stimulus.ACC stimulus.RT
>   <int> <fctr> <fctr>        <dbl>       <dbl>
> 1   413      a      1            1    5999.000
> 2   413      a      2            1    2393.125
> 3   413      a      3            1    2516.875
> 4   413      a      4            1    2069.000
> 5   413      p      1            1    5791.833
> 6   413      p      2            1    3227.625
> Source: local data frame [6 x 5]
> Groups: ss, cond [2]
> 
>      ss   cond  block stimulus.ACC stimulus.RT
>   <int> <fctr> <fctr>        <dbl>       <dbl>
> 1   418      p      3            1    2223.000
> 2   418      p      4            1    2332.750
> 3   418      s      1            1    6574.000
> 4   418      s      2            1    5326.000
> 5   418      s      3            1    6407.000
> 6   418      s      4            1    4871.143
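
This flexibility scales to other requests too. If reviewer #2 later wants, say, condition means collapsed across blocks, it’s just a matter of dropping block from group_by (a sketch using the same summarise_each() call):

# Hypothetical follow-up: correct-RT means per condition, collapsed across blocks
condMeans <- eDataSimple %>%
             filter(stimulus.ACC == 1) %>%
             group_by(ss = Subject,
                      cond = probType) %>%
             summarise_each('mean', c(stimulus.ACC, stimulus.RT))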

To summarize, the entire R script to process these data is quite concise and can accommodate many more participants with ease:

rm(list = ls()) # Clears workspace

eData <- read.table('data.txt', # The name of your *.txt file from above
                    sep = '\t', # These data are tab separated
                    header = T) # Appends variables as column names

# Variables that I want to keep
vars <- c('Subject','probType','stimulus.ACC','stimulus.RT')

# Subsetting these variables
eDataSimple <- subset(eData, select = vars)

# Adding block
eDataSimple$Block <- rep(c(1,2,3,4), each = 24) # 24 trials/block, recycled across participants

library(dplyr) # Loads dplyr

# Results with regular RT
results <-  eDataSimple %>% 
              group_by(ss = Subject,
                       cond = probType,
                       block = as.factor(Block)) %>%
              summarise_each('mean', c(stimulus.ACC, stimulus.RT))

# Correct RT results
corRT <-  eDataSimple %>%
          filter(stimulus.ACC == 1) %>%  # Keeps only correct trials
          group_by(ss = Subject,
                   cond = probType,
                   block = as.factor(Block)) %>%
          summarise_each('mean', c(stimulus.ACC, stimulus.RT))

After programming your experiment with the E-Prime beast, dragging undergraduate participants through your study, and wrangling the data into one place, why not make your life easier? Ditch the Excel templates. You’ll thank me when reviewer #2 comes around!

Acknowledgments

This tutorial was inspired by Dr. Jahn’s amazing blog, which has helped me, and I’m sure hundreds of other graduate students, stumble through the crazy world that is fMRI analysis. Andy’s Brain Blog is the best!

References

Wickham, H., & Francois, R. (2016). dplyr: A grammar of data manipulation. https://CRAN.R-project.org/package=dplyr