Are you curious about the timing accuracy of your online experiments?
Online stimulus and response timing is good enough for a wide range of behavioural research. However, there are some quirky combinations to watch out for. We also make recommendations for how to reduce measurement error when timing sensitivity is particularly important.
We have had a paper published that looks at this exact question. It's out in Behavior Research Methods, and you can read it here. The paper looks at the timing accuracy of experiments run online, not only in Gorilla but in several other popular platforms for behavioural research. It also includes parts of the participant analysis we presented in a previous blog post. Your participants could be using any number of devices and web browsers, so we wanted to let you know what they are using, and also to test several of these browsers and devices for their impact on stimulus and response timing. We also wanted to assess the performance of a range of experiment-building libraries and toolboxes (we'll put these under the umbrella term "tools").
We tested:
- Browsers: Chrome, Edge, Safari, Firefox
- Devices: Desktop and Laptop
- Operating Systems: Mac and PC
- Tools: Gorilla, jsPsych, PsychoPy & Lab.js
TL;DR: the major finding was that, for most users, timing accuracy is pretty good across all the validated tools. It is definitely good enough for the vast majority of studies with within-subject designs. However, there are some quirky combinations you can see in the graphs below.
Accuracy vs Precision
Before heading into our results, it's probably helpful to provide the reader with an overview of two concepts: accuracy and precision.
Accuracy in this context is the average difference between the ideal timing (e.g. the reaction time recorded is the exact moment the participant presses a key) and the observed timing. The closer the difference is to zero, the better.
Precision is a related metric, but probably more important in this case (we'll explain why below). It refers to the variability around our average accuracy.
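To make these definitions concrete, here is a toy sketch (hypothetical numbers, not data from the paper) of how accuracy and precision would be computed from requested versus observed stimulus durations:

```python
# Toy illustration with made-up measurements (not from the paper):
# accuracy = average deviation from the requested timing,
# precision = variability around that average.
from statistics import mean, stdev

requested = [16.66] * 6  # we asked for one frame (~16.66ms) each time
observed = [33.2, 33.5, 33.1, 33.4, 33.3, 33.6]  # hypothetical photo-diode readings (ms)

errors = [o - r for o, r in zip(observed, requested)]

accuracy = mean(errors)    # closer to zero is better
precision = stdev(errors)  # lower is better

print(f"accuracy:  {accuracy:.2f} ms")   # ~16.69 ms (about one frame late on average)
print(f"precision: {precision:.2f} ms")  # ~0.19 ms (very consistent)
```

Note how this hypothetical tool is off by about a frame on average (poor accuracy) yet very consistent (high precision), which is exactly the pattern the next section argues is the more forgivable one.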
These two things can vary independently. To illustrate this, we have used a toy example of arrows shot into a target:
In this example, accuracy is the distance from the bullseye, and precision is the spread of the arrows. In our paper, lower values are better, so a value of zero for accuracy means the arrows are spot on the bullseye.
It's clear that the top right target is the best: the archer has shown both high accuracy and precision.
But which is the second best? Is it high accuracy with low precision, even though each arrow lands quite a distance from the centre, as seen on the top left? Or low accuracy with high precision, a tight cluster of arrows that misses the mark, as seen on the bottom right?
In the case of experiment timing, it's the latter, low accuracy with high precision, that is arguably preferable.
This is because we are often comparing conditions, so a consistent delay in both conditions still leads to the same difference between them. With high accuracy but low precision, by contrast, that difference might be obscured by noise. In other words, high precision delivers less noisy data, thereby increasing your chance of detecting a true effect.
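A quick numeric sketch (hypothetical reaction times, not data from the paper) shows why a consistent delay drops out of a within-subject comparison:

```python
# Hypothetical RTs (ms): a constant measurement delay added to both
# conditions leaves the condition difference untouched.
from statistics import mean

true_condition_a = [350, 360, 355]
true_condition_b = [400, 410, 405]
delay = 80  # constant delay in ms: low accuracy, but perfect precision

observed_a = [rt + delay for rt in true_condition_a]
observed_b = [rt + delay for rt in true_condition_b]

true_diff = mean(true_condition_b) - mean(true_condition_a)
observed_diff = mean(observed_b) - mean(observed_a)

print(true_diff, observed_diff)  # identical: the constant delay cancels out
```

Had the delay instead varied randomly from trial to trial (low precision), the observed difference would only approximate the true one, and more trials would be needed to detect it.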
Visual Display
We first assessed the timing of the visual elements in the experiments: that is, how accurately and precisely a stimulus in a trial is rendered on the screen.
We assessed the timing of stimuli (a white square) at a range of durations. This was done by presenting the stimulus at 29 different durations, from a single frame (16ms) to just under half a second (29 frames). That's 4350 trials for each OS-browser-tool combination. This means we displayed over 100,000 trials and recorded the accuracy of each of them, making this section of the paper the largest assessment of display timing published to date.
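The trial counts above follow from the design. As a quick sanity check (assuming a 60 Hz display, so roughly 16.66ms per frame, and 150 repetitions of each duration, as implied by the totals):

```python
# Trial-count arithmetic for the display-timing test (60 Hz display assumed).
frame_ms = 1000 / 60      # ~16.66 ms per frame on a 60 Hz monitor
durations = range(1, 30)  # 1 to 29 frames
reps_per_duration = 150   # repetitions of each duration (implied by 4350 / 29)

trials_per_combo = len(durations) * reps_per_duration
print(trials_per_combo)  # 4350 trials per OS-browser-tool combination
print(29 * frame_ms)     # longest requested duration: just under half a second
```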
Depicted above is the photo-diode sensor attached to a computer screen. We used this to record the true onset and offset of stimuli and compared these to the requested duration put into the program.
The results of this measurement are broken down in the graphs below:
The dot's distance from the red line represents accuracy, and the spread of the line represents precision. You can see that there is variance between browsers, operating systems, and tools: each tool/experiment builder has a combination it performed particularly well on. Almost all of the tools over-presented frames rather than under-presenting them, except for PsychoJS and Lab.js on macOS-Firefox, where it appears some frames were presented for shorter than requested. Under-presenting by one or two frames would cause very short presentations to not be rendered at all. At Gorilla, we believe that presenting stimuli is essential, and that each stimulus being shown slightly longer is therefore the better outcome.
We discovered that the average timing accuracy is likely to be a single frame (16.66ms), with a precision of 1-2 frames (75% of trials on Windows 10 had delays of 19.75ms or less, and 75% of trials on macOS had delays of 34.5ms or less).
Reassuringly, the vast majority of the user sample was using the best-performing combination, Chrome on Windows, a combination on which Gorilla performed highly consistently.
Visual Duration
Where our paper goes beyond any previous research is that we tested a range of frame durations. This is important because you want the accuracy and precision to remain constant across frame durations. That way, you can be sure that the visual delay (error) at one duration (say 20 frames) is the same as the delay at another duration (say 40 frames). So when you calculate the difference in RT at different stimulus durations, the error cancels out.
As you can see, there is some variation across display durations. Smooth lines (averages and variance) are better than spiky ones.
Reaction Time
The other key factor in timing online experiments is how accurately and precisely your responses are logged. You want your reaction time measures to be as close as possible to the moment your participant pressed the key. This introduces the factor of the keyboard, which could be a built-in keyboard on a laptop or an external keyboard on a desktop. Therefore, we widened the range of devices we assessed to include laptops and desktops.
To be sure of the specific reaction time used, we programmed a robotic finger (called an actuator) to press the keyboard at a specific time after a stimulus had been displayed. This replicates the process a human would go through, but with much less variability. We programmed the finger to respond at 4 different reaction times: 100ms, 200ms, 300ms & 500ms, representing a range of different possible human reaction times. These were repeated 150 times for each software-hardware combination.
You can see a picture of the set-up above on a laptop: the photo-diode senses the white square, and then the actuator presses the space bar.
You can see the results below. The reaction time latencies we report, averaging around 80ms and sometimes extending to 100ms, are longer than those reported in other recent research (Bridges et al., 2020).
We believe this is because previous research has used equipment such as a button box (a specialist low-latency device for lab-based RT testing), and because device set-ups can vary pretty drastically, particularly in terms of keyboards. The actuator set-up allowed us to report the variance contributed by the keyboard, something that is important for users to know about, as their participants will be using a variety of keyboards.
In any case, in terms of precision, Gorilla performs fairly consistently here across all platforms and browsers. It had the lowest standard deviation (i.e. highest precision) of all the platforms, averaging 8.25ms. In general, there was higher variability on laptops relative to desktops.
Again, we find all platforms have some variability across devices and browsers, which seems to be pretty idiosyncratic. For instance, PsychoJS shows relatively high variability with Firefox on the macOS desktop, but fairly average performance on the same browser and OS on a laptop.
Conclusions
Our results show that the timing of web-based research for most popular platforms is within acceptable limits. This is the case for both display and response logging, although response logging shows higher delays. Encouragingly, the most consistent combination for all tools, Chrome on Windows, was also the most used in our sample survey.
However, when timing sensitivity is crucial, we recommend employing within-participant designs where possible, to avoid having to make comparisons between participants with different devices, operating systems, and browsers. Additionally, limiting participants to one browser could remove further noise. Limiting participants' devices and browsers can be done programmatically in all tested platforms, and via a graphical user interface in Gorilla.
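As a rough illustration of what a programmatic restriction might involve (a hypothetical user-agent check, not Gorilla's or any other platform's actual API), the idea is simply to screen the browser string before admitting a participant:

```python
# Hypothetical sketch: admitting only Chrome participants based on the
# browser's user-agent string. Real platforms expose their own mechanisms;
# this is illustrative only.
def is_chrome(user_agent: str) -> bool:
    ua = user_agent.lower()
    # Chromium Edge ("Edg") and Opera ("OPR") also advertise "Chrome",
    # so exclude them explicitly.
    return "chrome" in ua and "edg" not in ua and "opr" not in ua

ua_chrome = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
             "(KHTML, like Gecko) Chrome/91.0 Safari/537.36")
ua_firefox = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:89.0) "
              "Gecko/20100101 Firefox/89.0")

print(is_chrome(ua_chrome))   # True
print(is_chrome(ua_firefox))  # False
```

User-agent strings can be spoofed and their formats change over time, which is one reason a platform-provided restriction (such as Gorilla's GUI option) is preferable to rolling your own.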
Further Reading
For more in-depth reporting, please read our paper: https://doi.org/10.3758/s13428-020-01501-5
Note: all platforms are continually under development, so to be as fair as possible we tested all platforms between May and August 2019.