Timing per­for­mance of online experiments

 Are you curious about the timing accu­ra­cy of your online experiments?

Online stimulus and response timing is good enough for a wide range of behavioural research. However, there are some quirky combinations to watch out for. We also make recommendations for how to reduce measurement error when timing sensitivity is particularly important.


We have published a paper that looks at this exact question. It is out in Behaviour Research Methods, and you can read it via the link under Further Reading below. The paper looks at the timing accuracy of experiments run online, not only in Gorilla but in several other popular platforms for behavioural research. It also includes parts of the participant analysis we showed in a previous blog post. Your participants could be using any number of devices and web browsers, so we wanted to let you know what they are using, and to test several of these browsers and devices for their impact on stimulus and response timing. We also wanted to assess the performance of a range of experiment building libraries and toolboxes (we’ll put these under the umbrella term ‘tools’).

We tested:

  • Browsers: Chrome, Edge, Safari, Firefox
  • Devices: Desktop and Laptop
  • Oper­at­ing Systems: Mac and PC
  • Tools: Gorilla, jsPsych, Psy­choPy & Lab.js


TL;DR: The major finding was that, for most users, the timing accuracy is pretty good across all the validated tools. It is definitely good enough for the vast majority of studies with within-subject designs. However, there are some quirky combinations, which you can see in the graphs below.

Accu­ra­cy vs Precision

Before heading into our results, it’s prob­a­bly helpful to provide the reader with an overview of two con­cepts: Accu­ra­cy and Precision.

Accu­ra­cy in this context is the average dif­fer­ence between the ideal timing (e.g. reac­tion time record­ed is the exact moment the par­tic­i­pant presses a key) and observed timing. The closer the dif­fer­ence is to zero, the better.

Pre­ci­sion is a related metric, but prob­a­bly more impor­tant in this case (we’ll explain why below). It refers to the vari­abil­i­ty around our average accuracy.

These two things can vary inde­pen­dent­ly. To illus­trate this, we have used a toy example of arrows shot into a target:

In this example, accuracy is the distance from the bullseye, and precision is the spread of the arrows. In our paper, lower values are better for both, so a value of zero for accuracy means the arrows are spot on the bullseye.

It’s clear that the top right target is the best – the archer has shown both high accu­ra­cy and precision.

But which is the second best? Is it high accuracy with low precision, even with the arrows individually landing quite a distance from the centre, as seen on the top left? Or is it low accuracy with high precision, a tight cluster of arrows that misses the mark, as seen on the bottom right?

In the case of experiment timing it’s the latter, lower accuracy with higher precision, that is arguably preferable.

This is because we are often comparing conditions, so a consistent delay on both conditions still leads to the same difference. With high accuracy but low precision, by contrast, the difference might be obscured by noise. In other words, high precision delivers less noisy data, thereby increasing your chance of detecting a true effect.
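The argument above can be checked with a small simulation. This is only a sketch under stated assumptions: the 500ms base time, the 20ms true effect, the delay and jitter values, and the function names are all made up for illustration and are not taken from the paper.

```python
import random
import statistics


def estimated_difference(bias_ms, jitter_sd_ms, seed, true_effect_ms=20.0, n_trials=200):
    """Estimate the difference between two conditions from noisy timings.

    Every recorded time is the ideal time plus a constant delay (bias_ms,
    the accuracy error) plus random jitter (jitter_sd_ms, the precision error).
    """
    rng = random.Random(seed)
    base_ms = 500.0  # hypothetical ideal time in condition A
    cond_a = [base_ms + bias_ms + rng.gauss(0.0, jitter_sd_ms) for _ in range(n_trials)]
    cond_b = [base_ms + true_effect_ms + bias_ms + rng.gauss(0.0, jitter_sd_ms)
              for _ in range(n_trials)]
    return statistics.mean(cond_b) - statistics.mean(cond_a)


def spread_of_estimates(bias_ms, jitter_sd_ms, n_experiments=200):
    """Standard deviation of the estimated difference over repeated simulated experiments."""
    estimates = [estimated_difference(bias_ms, jitter_sd_ms, seed)
                 for seed in range(n_experiments)]
    return statistics.stdev(estimates)


# Low accuracy, perfect precision: the constant 80 ms delay cancels out.
biased_but_precise = estimated_difference(bias_ms=80.0, jitter_sd_ms=0.0, seed=1)

# Compare how much the estimate wobbles with tight versus loose timing.
spread_precise = spread_of_estimates(bias_ms=80.0, jitter_sd_ms=5.0)
spread_noisy = spread_of_estimates(bias_ms=0.0, jitter_sd_ms=50.0)

print(f"Biased but precise estimate: {biased_but_precise:.1f} ms")
print(f"Spread of estimates, precise timing: {spread_precise:.2f} ms")
print(f"Spread of estimates, noisy timing:   {spread_noisy:.2f} ms")
```

The constant delay never shows up in the estimated difference, whereas larger jitter makes the estimate vary more from one simulated experiment to the next.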

Visual Display

We first assessed the timing of the visual elements in the experiments, that is, how accurately and precisely something in a trial is rendered on the screen.

We assessed the timing of a stimulus (a white square) at a range of durations. This was done by presenting the stimulus at 29 different durations, from a single frame (roughly 16ms) to just under half a second (29 frames). That’s 4350 trials for each OS-browser-tool combination. This means we displayed over 100,000 trials and recorded the accuracy of each of them, which makes this section of the paper the largest assessment of display timing published to date.

Depicted above is the photo-diode sensor attached to a computer screen. We used this to record the true onset and offset of stimuli, and compared the measured duration to the requested duration put into the program.
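As a rough illustration of the numbers above, the requested durations and the per-trial error can be worked out as follows. This sketch assumes a 60 Hz display (which matches the frame length of about 16.66ms quoted in this post); the even split of the 4350 trials across the 29 durations and the helper name `presentation_error_ms` are also assumptions made for the example.

```python
REFRESH_HZ = 60  # assumption: 60 Hz display, about 16.67 ms per frame
FRAME_MS = 1000 / REFRESH_HZ

# The 29 requested durations, from 1 frame up to 29 frames.
duration_frames = list(range(1, 30))
requested_ms = [n * FRAME_MS for n in duration_frames]

# 4350 trials per OS-browser-tool combination, assumed evenly spread over the durations.
TRIALS_PER_COMBINATION = 4350
trials_per_duration = TRIALS_PER_COMBINATION / len(duration_frames)


def presentation_error_ms(requested_frames, measured_ms):
    """Measured on-screen duration minus the requested duration, in ms.

    Positive values mean the stimulus stayed on screen longer than requested.
    """
    return measured_ms - requested_frames * FRAME_MS


print(f"Shortest request: {requested_ms[0]:.2f} ms")
print(f"Longest request:  {requested_ms[-1]:.2f} ms")
print(f"Trials per duration: {trials_per_duration:.0f}")
# Hypothetical photo-diode reading: a 3-frame request that stayed up for 4 frames.
print(f"Example error: {presentation_error_ms(3, 4 * FRAME_MS):.2f} ms")
```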

The results of this mea­sure­ment are broken up in the graphs below:

The dot’s distance from the red line represents accuracy, and the spread of the line represents precision. You can see that there is variation between browsers, operating systems and tools. Each tool/experiment builder has a combination it performed particularly well on. Almost all of the tools over-presented frames rather than under-presented them. The exception is PsychoJS and Lab.js on macOS-Firefox, where it appears some frames were presented for a shorter time than requested. Under-presenting by one or two frames would lead to very short presentations not being rendered at all. At Gorilla, we believe that presenting stimuli is essential, and therefore each stimulus being shown slightly longer is a better outcome.

We discovered that the average timing accuracy is likely to be a single frame (16.66ms), with a precision of 1–2 frames beyond that (75% of trials on Windows 10 had a delay of 19.75ms or less, and 75% of trials on macOS had a delay of 34.5ms or less).

Reassuringly, the vast majority of the user sample was using the best performing combination, Chrome on Windows, on which Gorilla performed highly consistently.
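To put the 75% figures above in terms of frames, a small conversion helps. The sketch below again assumes a 60 Hz display; the list of example delays is made up purely to show how a "75% of trials had this delay or less" value can be computed, and is not data from the paper.

```python
import statistics

FRAME_MS = 1000 / 60  # assumption: 60 Hz display, about 16.67 ms per frame


def delay_in_frames(delay_ms):
    """Express a delay in ms as a number of 60 Hz frames."""
    return delay_ms / FRAME_MS


def third_quartile(delays_ms):
    """Delay that 75% of trials stay at or below (inclusive quartile method)."""
    return statistics.quantiles(delays_ms, n=4, method="inclusive")[2]


# The two 75% figures quoted above, converted to frames.
print(f"Windows 10: {delay_in_frames(19.75):.2f} frames")
print(f"macOS:      {delay_in_frames(34.5):.2f} frames")

# Made-up delays, only to show the quartile calculation.
example_delays_ms = [0.0, 10.0, 20.0, 30.0, 40.0]
print(f"Example third quartile: {third_quartile(example_delays_ms):.1f} ms")
```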

Visual Dura­tion

Where our paper has gone beyond any previous research is that we tested a range of frame durations. This is important because you want the accuracy and precision to remain constant across frame durations. That way, you can be sure that the visual delay (error) at one duration (say 20 frames) is the same as the delay at another duration (say 40 frames). So when you calculate the difference in RT at different stimulus durations, the error cancels out.

As you can see, there is some vari­a­tion across display dura­tions. Smooth lines (aver­ages and vari­ance) are better than spiky ones.

Reaction Time

The other key factor in the timing of online experiments is how accurately and precisely your responses are logged. You want to be sure your reaction time measures are as close as possible to when your participant pressed the key. This introduces the factor of the keyboard, which could be a built-in keyboard on a laptop or an external keyboard on a desktop. Therefore, we widened the range of devices we assessed to include both laptops and desktops.

To be sure of the spe­cif­ic reac­tion time used, we pro­grammed a robotic finger (called an actu­a­tor) to press the key­board at a spe­cif­ic time after a stim­u­lus had been dis­played. This repli­cates the process a human would go through, but with much less vari­abil­i­ty. We pro­grammed the finger to respond at 4 dif­fer­ent reac­tion times: 100ms, 200ms, 300ms & 500ms – rep­re­sent­ing a range of dif­fer­ent pos­si­ble human reac­tion times. These were repeat­ed for each soft­ware-hard­ware com­bi­na­tion 150 times.

You can see a picture of the set-up on a laptop above: the photo-diode senses the white square and then the actuator presses the space bar.

You can see the results below. The Reaction Time latencies we report, averaging around 80ms and sometimes extending to 100ms, are longer than those reported in other recent research (Bridges et al., 2020).

We believe that this is because previous research has used equipment such as a button box (a specialist low-latency piece of equipment for lab testing of RT), and because of differences in how devices were set up (keyboards in particular can vary pretty drastically). The actuator set-up allowed us to report the variance contributed by the keyboard, which is important for users to know about, as their participants will be using a variety of keyboards.

In any case, in terms of pre­ci­sion Gorilla here per­forms fairly con­sis­tent­ly across all plat­forms and browsers. It had the lowest stan­dard devi­a­tion (i.e. highest pre­ci­sion) out of all the plat­forms – aver­ag­ing 8.25ms. In general, there was higher vari­abil­i­ty on laptops rel­a­tive to desktops.

Again, we find that all platforms have some variability across devices and browsers, which seems to be pretty idiosyncratic. For instance, PsychoJS shows relatively high variability with Firefox on the macOS desktop, but fairly average performance with the same browser and OS on a laptop.
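The latency and variability figures in this section boil down to comparing what was recorded with what the actuator was programmed to do. A minimal sketch of that calculation is shown here; the function name `latency_summary` and the example recordings are hypothetical and only illustrate the arithmetic, not the analysis code used for the paper.

```python
import statistics
from collections import defaultdict


def latency_summary(trials):
    """Summarise response-logging latency per programmed reaction time.

    trials: iterable of (programmed_rt_ms, recorded_rt_ms) pairs.
    Returns {programmed_rt_ms: (mean_latency_ms, sd_latency_ms)}.
    Each programmed reaction time needs at least two trials for the SD.
    """
    latencies = defaultdict(list)
    for programmed_ms, recorded_ms in trials:
        latencies[programmed_ms].append(recorded_ms - programmed_ms)
    return {
        programmed_ms: (statistics.mean(values), statistics.stdev(values))
        for programmed_ms, values in latencies.items()
    }


# Made-up recordings for two of the four programmed reaction times.
example_trials = [
    (100, 180), (100, 185), (100, 175),
    (200, 282), (200, 278), (200, 280),
]
for programmed_ms, (mean_ms, sd_ms) in latency_summary(example_trials).items():
    print(f"{programmed_ms} ms press: latency {mean_ms:.1f} ms, SD {sd_ms:.1f} ms")
```

The mean latency corresponds to accuracy and the standard deviation to precision, in the sense used earlier in this post.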


Our results show that the timing of web-based research for most popular platforms is within acceptable limits. This is the case for both display and response logging, although response logging shows higher delays. Encouragingly, the most consistent combination for all tools, Chrome on Windows, was also the most used in our sample survey.

However, when timing sen­si­tiv­i­ty is crucial, we rec­om­mend employ­ing within-par­tic­i­pant designs where pos­si­ble to avoid having to make com­par­isons between par­tic­i­pants with dif­fer­ent devices, oper­at­ing systems, and browsers. Addi­tion­al­ly, lim­it­ing par­tic­i­pants to one browser could remove further noise. Lim­it­ing par­tic­i­pants’ devices and browsers can be done pro­gram­mat­i­cal­ly in all tested plat­forms, and via a graph­i­cal user inter­face in Gorilla.

Further Reading

For more in-depth reporting, please read our paper: https://doi.org/10.3758/s13428-020-01501-5

Notes: All plat­forms are con­tin­u­al­ly under devel­op­ment, so we tested all plat­forms between May and August 2019 in order to be as fair as possible.