Timing performance of online experiments

Are you curious about the timing accuracy of your online experiments?

Online stimulus and response timing is good enough for a wide range of behavioural research. However, there are some quirky combinations to watch out for. We also make recommendations for how to reduce measurement error when timing sensitivity is particularly important.


We have had a paper published that looks at this exact question. It's out in Behavior Research Methods, and you can read it here. This paper looks at the timing accuracy of experiments run online, not only in Gorilla but in several other popular platforms for behavioural research. It also includes parts of the participant analysis we had shown in a previous blog post. Your participants could be using any number of devices and web browsers, so we wanted to let you know what they are using, and also to test a few of these browsers and devices for their impact on stimulus and response timing. We also wanted to assess the performance of a range of experiment-building libraries and toolboxes (we'll put these under the umbrella term 'tools').

We test­ed:

  • Browsers: Chrome, Edge, Safari, Firefox
  • Devices: Desktop and Laptop
  • Operating Systems: Mac and PC
  • Tools: Gorilla, jsPsych, PsychoPy & Lab.js


TL;DR: the major finding was that – for most users – timing accuracy is pretty good across all the validated tools. Definitely good enough for the vast majority of studies with within-subject designs. However, there are some quirky combinations, which you can see in the graphs below.

Accuracy vs Precision

Before heading into our results, it's probably helpful to provide the reader with an overview of two concepts: Accuracy and Precision.

Accuracy in this context is the average difference between the ideal timing (e.g. the reaction time recorded is the exact moment the participant presses a key) and the observed timing. The closer the difference is to zero, the better.

Precision is a related metric, but probably more important in this case (we'll explain why below). It refers to the variability around our average accuracy.
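In code, these two metrics are simply the mean and the spread (standard deviation) of the timing error. A minimal sketch, using hypothetical measurements rather than any real data from the paper:

```python
# Timing error = measured duration - requested duration, in ms.
# Hypothetical measurements for a stimulus requested at 200 ms.
measured = [216.6, 216.7, 233.3, 216.5, 216.8]
requested = 200.0

errors = [m - requested for m in measured]

# Accuracy: the mean error (closer to zero is better).
accuracy = sum(errors) / len(errors)

# Precision: the standard deviation of the error (smaller is better).
precision = (sum((e - accuracy) ** 2 for e in errors) / len(errors)) ** 0.5

print(f"accuracy: {accuracy:.1f} ms, precision: {precision:.1f} ms")
# → accuracy: 20.0 ms, precision: 6.7 ms
```

Note that the one outlier trial (233.3 ms) barely moves the accuracy but dominates the precision figure – which is exactly why the two can vary independently.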

These two things can vary independently. To illustrate this, we have used a toy example of arrows shot into a target:

In this example, accuracy is the distance from the bullseye, and precision is the spread of the arrows. In our paper, lower values are better for both – so a value of zero for accuracy means the arrows are spot on the bullseye.

It's clear that the top right target is the best – the archer has shown both high accuracy and precision.

But which is the second best? Is it high accuracy with low precision, even with the arrows individually being quite a distance from the centre – as seen on the top left? Or low accuracy with high precision, a tight cluster of precise arrows that misses the mark – as seen on the bottom right?

In the case of experiment timing, it's the latter – lower accuracy with higher precision – that is arguably preferable.

This is because we are often comparing conditions, so a consistent delay on both conditions still leads to the same difference between them. With high accuracy but low precision, by contrast, that difference might be obscured by noise. In other words, high precision delivers less noisy data, thereby increasing your chance of detecting a true effect.
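A quick simulation makes this concrete. All the numbers below are hypothetical: a constant 80 ms delay (low accuracy, high precision) cancels out exactly in a within-subject condition difference, whereas zero-mean jitter (high accuracy, low precision) leaves that difference noisy:

```python
import random

random.seed(1)

# Hypothetical true reaction times (ms) for two within-subject conditions.
cond_a = [300.0] * 1000
cond_b = [320.0] * 1000  # true effect: 20 ms


def mean(xs):
    return sum(xs) / len(xs)


# Low accuracy, high precision: a constant 80 ms delay on every trial.
delayed_a = [rt + 80 for rt in cond_a]
delayed_b = [rt + 80 for rt in cond_b]

# High accuracy, low precision: zero mean delay, but 30 ms of jitter.
noisy_a = [rt + random.gauss(0, 30) for rt in cond_a]
noisy_b = [rt + random.gauss(0, 30) for rt in cond_b]

# The constant delay cancels exactly in the condition difference...
print(mean(delayed_b) - mean(delayed_a))  # → 20.0

# ...while the jitter only recovers the 20 ms effect approximately.
print(mean(noisy_b) - mean(noisy_a))
```

With fewer trials per condition, the jittered estimate wanders further from the true 20 ms – this is the "obscured difference" the paragraph above describes.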

Visual Display

We first assessed the timing of the visual elements in the experiments – that is, how accurately and precisely something in a trial would be rendered on the screen.

We assessed the timing of a stimulus (a white square) at a range of durations. This was done by presenting the stimulus at 29 different durations, from a single frame (16ms) to just under half a second (29 frames), with 150 repetitions of each duration. That's 4350 trials for each OS-browser-tool combination. In total, we displayed over 100,000 trials and recorded the accuracy of each of them. This makes this section of the paper the largest assessment of display timing published to date.

Depicted above is the photodiode sensor attached to a computer screen. We used this to record the true onset and offset of stimuli and compared these to the requested duration entered into the program.

The results of this measurement are broken up in the graphs below:

The dot's distance from the red line represents accuracy, and the spread of the line represents precision. You can see that there is variance between browsers, operating systems and tools. Each tool/experiment builder has a combination it performed particularly well on. Almost all of the tools over-presented frames rather than under-presented them – except for PsychoJS and Lab.js on macOS-Firefox, where it appears some frames were presented for less time than requested. Under-presenting by one or two frames could lead to very short presentations not being rendered at all. At Gorilla, we believe that ensuring a stimulus actually appears is essential, and therefore each stimulus being shown slightly longer is the better outcome.

We discovered that the average timing inaccuracy is likely to be a single frame (16.66ms), with a precision of 1–2 frames (75% of trials on Windows 10 had a delay of 19.75ms or less, and 75% of trials on macOS had delays of 34.5ms or less).

Reassuringly, the vast majority of the user sample was using the best performing combination – Chrome on Windows – a combination on which Gorilla performed highly consistently.

Visual Duration

Where our paper goes beyond any previous research is that we tested a range of frame durations. This is important because you want accuracy and precision to remain constant across frame durations. That way, you can be sure that the visual delay (error) at one duration (say 20 frames) is the same as the delay at another duration (say 40 frames). So when you calculate the difference in RT at different stimulus durations, the error cancels out.

As you can see, there is some variation across display durations. Smooth lines (averages and variance) are better than spiky ones.

Reaction Time

The other key factor in the timing of online experiments is how accurately and precisely your responses are logged. You want your reaction time measures to be as close as possible to the moment your participant pressed the key. This introduces the factor of the keyboard – which could be a built-in keyboard on a laptop or an external keyboard on a desktop. We therefore widened the range of devices we assessed to include laptops and desktops.

To know the exact reaction time being produced, we programmed a robotic finger (called an actuator) to press the keyboard at a specific time after a stimulus had been displayed. This replicates the process a human would go through, but with much less variability. We programmed the finger to respond at 4 different reaction times: 100ms, 200ms, 300ms & 500ms – representing a range of possible human reaction times. These were repeated 150 times for each software-hardware combination.
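Because the actuator's press time is known exactly, the response latency is just the difference between the RT the software logged and the RT the actuator produced. A minimal sketch of that calculation, with hypothetical logged values (not figures from the paper):

```python
# The actuator was programmed to press the key exactly 200 ms
# after stimulus onset (hypothetical run).
programmed_rt = 200.0

# Hypothetical RTs logged by the experiment software for that run.
logged_rts = [278.1, 285.4, 271.9, 290.2, 280.6]

# Latency: how much later the software logged the press than it occurred.
latencies = [rt - programmed_rt for rt in logged_rts]

mean_latency = sum(latencies) / len(latencies)
sd_latency = (sum((l - mean_latency) ** 2 for l in latencies)
              / len(latencies)) ** 0.5

print(f"latency: {mean_latency:.1f} ms (SD {sd_latency:.1f} ms)")
# → latency: 81.2 ms (SD 6.2 ms)
```

The mean here is the accuracy of response logging, and the SD is its precision – the same two metrics as for the visual display, now applied to the keyboard pipeline.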

You can see a picture of the set-up above on a laptop: the photodiode senses the white square, and then the actuator presses the space bar.

You can see the results below. The reaction time latencies we report – averaging around 80ms and sometimes extending to 100ms – are longer than other recent research has reported (Bridges et al., 2020).

We believe this is because previous research has used equipment like a button box (a specialist low-latency device for lab testing of RT), and because of differences in how devices were set up (which can vary pretty drastically in terms of keyboards). The actuator set-up allowed us to report the variance contributed by the keyboard – something that is important for users to know about, as their participants will be using a variety of keyboards.

In any case, in terms of precision, Gorilla performed fairly consistently across all platforms and browsers. It had the lowest standard deviation (i.e. highest precision) of all the platforms – averaging 8.25ms. In general, there was higher variability on laptops relative to desktops.

Again, we find all platforms have some variability across devices and browsers – which seems to be pretty idiosyncratic. For instance, PsychoJS shows relatively high variability with Firefox on the macOS desktop, but fairly average performance on the same browser and OS on a laptop.


Our results show that the timing of web-based research on the most popular platforms is within acceptable limits. This is the case for both display and response logging, although response logging shows higher delays. Encouragingly, the most consistent combination across all tools – Chrome on Windows – was also the most used in our sample survey.

However, when timing sensitivity is crucial, we recommend employing within-participant designs where possible, to avoid making comparisons between participants with different devices, operating systems, and browsers. Additionally, limiting participants to one browser could remove further noise. Limiting participants' devices and browsers can be done programmatically in all tested platforms, and via a graphical user interface in Gorilla.

Further Reading

For more in-depth reporting, please read our paper: https://doi.org/10.3758/s13428-020-01501-5

Notes: All platforms are continually under development, so we tested all platforms between May and August 2019 in order to be as fair as possible.