Are you curious about the timing accuracy of your online experiments?

Online stimulus and response timing is good enough for a wide range of behavioural research. However, there are some quirky combinations to watch out for. We also make recommendations on how to reduce measurement error when timing sensitivity is particularly important.


We have had a paper published that looks at this exact question. It's out in Behavior Research Methods, and you can read it here. The paper examines the timing accuracy of experiments run online, not only in Gorilla but in several other popular platforms for behavioural research. It also includes parts of the participant analysis we showed in a previous blog post. Your participants could be using any number of devices and web browsers, so we wanted to let you know what they are using, and to test a selection of these browsers and devices for their impact on stimulus and response timing. We also wanted to assess the performance of a range of experiment-building libraries and toolboxes (we'll put these under the umbrella term 'tools').

We tested:

  • Browsers: Chrome, Edge, Safari, Firefox
  • Devices: Desktop and Laptop
  • Operating Systems: Mac and PC
  • Tools: Gorilla, jsPsych, PsychoPy & Lab.js


TL;DR: the major finding was that, for most users, timing accuracy is pretty good across all of the validated tools. Definitely good enough for the vast majority of studies with within-subject designs. However, there are some quirky combinations, which you can see in the graphs below.

Accuracy vs Precision

Before heading into our results, it's probably helpful to give an overview of two concepts: accuracy and precision.

Accuracy in this context is the average difference between the ideal timing (e.g. the recorded reaction time is the exact moment the participant presses a key) and the observed timing. The closer the difference is to zero, the better.

Precision is a related metric, but probably the more important one in this case (we'll explain why below). It refers to the variability around our average accuracy.
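To make these two definitions concrete, here is a minimal sketch (with made-up numbers, not data from our paper) that computes accuracy as the mean timing error and precision as the standard deviation of that error:

```python
import statistics

def accuracy_and_precision(requested_ms, observed_ms):
    """Accuracy = mean timing error; precision = spread (SD) of that error."""
    errors = [obs - req for req, obs in zip(requested_ms, observed_ms)]
    return statistics.mean(errors), statistics.stdev(errors)

# Hypothetical trials: a 200ms stimulus consistently shown about one frame late
requested = [200.0] * 5
observed = [216.7, 216.6, 216.8, 216.7, 216.6]
acc, prec = accuracy_and_precision(requested, observed)
# acc is ~16.7ms (one frame late on average); prec is tiny (very consistent)
```

A constant one-frame lag gives a non-zero accuracy value but near-perfect precision: a tight cluster that misses the mark.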

These two things can vary independently. To illustrate this, we have used a toy example of arrows shot into a target:

In this example, accuracy is the distance from the bullseye, and precision is the spread of the arrows. In our paper, lower values are better for both, so an accuracy of zero means the arrows are spot on the bullseye.

It's clear that the top-right target is the best: the archer has shown both high accuracy and precision.

But which is second best? Is it high accuracy with low precision, where each individual arrow lands quite a distance from the centre, as seen on the top left? Or low accuracy with high precision, a tight cluster of arrows that misses the mark, as seen on the bottom right?

In the case of experiment timing, it's the latter, lower accuracy with higher precision, that is arguably preferable.

This is because we are often comparing conditions, so a consistent delay in both conditions still yields the same difference between them. With high accuracy but low precision, by contrast, that difference might be obscured by noise. In other words, high precision delivers less noisy data, thereby increasing your chance of detecting a true effect.
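The point that a consistent delay cancels out when comparing conditions can be sketched numerically (illustrative numbers, not data from the paper):

```python
import statistics

# Hypothetical true reaction times (ms) in two conditions
cond_a_true = [300, 310, 305, 295]
cond_b_true = [350, 355, 345, 360]

DELAY_MS = 80  # a constant measurement delay: low accuracy, but perfect precision

cond_a_measured = [rt + DELAY_MS for rt in cond_a_true]
cond_b_measured = [rt + DELAY_MS for rt in cond_b_true]

true_effect = statistics.mean(cond_b_true) - statistics.mean(cond_a_true)
measured_effect = statistics.mean(cond_b_measured) - statistics.mean(cond_a_measured)
# The constant delay appears in both means, so the condition difference is unchanged
```

A noisy (low-precision) delay, by contrast, would leave the expected difference intact but inflate its variance, making a real effect harder to detect.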

Visual Display

We first assessed the timing of the visual elements in the experiments: how accurately and precisely something in a trial is rendered on the screen.

We assessed the timing of stimuli (a white square) at a range of durations. This was done by presenting the stimulus at 29 different durations, from a single frame (16.66ms) to just under half a second (29 frames). That's 4,350 trials for each OS-browser-tool combination, meaning we displayed over 100,000 trials and recorded the accuracy of each one. This makes this section of the paper the largest assessment of display timing published to date.
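The trial counts above follow from simple arithmetic; a sketch, assuming a 60Hz display (so one frame is about 16.7ms) and an equal number of repeats per duration:

```python
FRAME_MS = 1000 / 60                   # one frame at 60Hz, ~16.67ms
n_durations = 29                       # durations from 1 to 29 frames
durations_ms = [n * FRAME_MS for n in range(1, n_durations + 1)]

trials_per_combination = 4350
repeats_per_duration = trials_per_combination // n_durations  # 150

longest_ms = durations_ms[-1]          # 29 frames: just under half a second
```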

Depicted above is the photodiode sensor attached to a computer screen. We used this to record the true onset and offset of stimuli, and compared these to the requested durations entered into each program.

The results of this measurement are broken down in the graphs below:

The dots' distance from the red line represents accuracy, and the spread of the line represents precision. You can see that there is variance between browsers, operating systems and tools, and each tool/experiment builder has a combination it performed particularly well on. Almost all of the tools over-presented frames rather than under-presented them, except for PsychoJS and Lab.js on macOS-Firefox, where it appears some frames were presented for a shorter time than requested. Under-presenting by one or two frames could lead to very short presentations not being rendered at all. At Gorilla, we believe that actually presenting the stimulus is essential, and that each stimulus being shown slightly longer is therefore the better outcome.

We discovered that the average timing delay is likely to be a single frame (16.66ms), with a precision of 1–2 frames (75% of trials on Windows 10 had a delay of 19.75ms or less, and 75% of trials on macOS had delays of 34.5ms or less).
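Those 75%-of-trials figures are third quartiles of the per-trial delay distribution. With Python's standard library, that kind of summary could be computed like so (the delays here are hypothetical, not the paper's data):

```python
import statistics

# Hypothetical per-trial display delays (ms) for one OS-browser-tool combination
delays_ms = [16.7, 16.6, 16.7, 19.8, 16.7, 33.3, 16.7, 19.7]
q1, median, q3 = statistics.quantiles(delays_ms, n=4)
# q3 is the delay that 75% of trials fall at or below
```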

Reassuringly, the vast majority of our user sample was using the best-performing combination: Chrome on Windows. This is a platform combination on which Gorilla performed highly consistently.

Visual Duration

Where our paper goes beyond any previous research is that we tested a range of frame durations. This is important because you want accuracy and precision to remain constant across frame durations. That way, you can be sure that the visual delay (error) at one duration (say 20 frames) is the same as the delay at another duration (say 40 frames). So when you calculate the difference in RT at different stimulus durations, the error cancels out.

As you can see, there is some variation across display durations. Smooth lines (averages and variance) are better than spiky ones.

Reaction Time

The other key factor in online experiment timing is how accurately and precisely your responses are logged. You want your reaction time measures to be as close as possible to the moment your participant pressed the key. This introduces the factor of the keyboard, which could be a built-in keyboard on a laptop or an external keyboard on a desktop. Therefore, we widened the range of devices we assessed to include laptops and desktops.

To know the exact reaction time produced, we programmed a robotic finger (called an actuator) to press the keyboard at a specific time after a stimulus had been displayed. This replicates the process a human would go through, but with much less variability. We programmed the finger to respond at 4 different reaction times: 100ms, 200ms, 300ms and 500ms, representing a range of possible human reaction times. These were repeated 150 times for each software-hardware combination.
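As a sketch of how such an automated RT run could be simulated and summarised: the 80ms latency and 8ms jitter below are illustrative assumptions (not the paper's measurements), and we assume the repeats apply to each target RT:

```python
import random

random.seed(42)

RT_TARGETS_MS = [100, 200, 300, 500]   # programmed actuator reaction times
REPEATS = 150                          # repeats per target RT (assumption)
LATENCY_MS = 80.0                      # assumed mean response-logging latency
JITTER_SD_MS = 8.0                     # assumed trial-to-trial keyboard jitter

def logged_rt(target_ms):
    # The platform logs the press later than it physically happened
    return target_ms + random.gauss(LATENCY_MS, JITTER_SD_MS)

trials = [(t, logged_rt(t)) for t in RT_TARGETS_MS for _ in range(REPEATS)]
mean_latency = sum(logged - t for t, logged in trials) / len(trials)
```

Because the actuator's press time is known exactly, the difference between logged and programmed RT isolates the latency added by the keyboard and software stack.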

You can see a picture of the set-up above on a laptop: the photodiode senses the white square, and the actuator then presses the space bar.

You can see the results below. The reaction time latencies we report, averaging around 80ms and sometimes extending to 100ms, are longer than those reported in other recent research (Bridges et al., 2020).

We believe this is because previous research has used equipment such as a button box (a specialist low-latency piece of equipment for lab testing of RT), and because device set-ups differ (keyboards can vary quite drastically). The actuator set-up allowed us to report the variance contributed by the keyboard, something that is important for users to know about, as their participants will be using a variety of keyboards.

In any case, in terms of precision, Gorilla performs fairly consistently here across all platforms and browsers. It had the lowest standard deviation (i.e. the highest precision) of all the platforms, averaging 8.25ms. In general, there was higher variability on laptops relative to desktops.

Again, we find that all platforms have some variability across devices and browsers, which seems to be pretty idiosyncratic. For instance, PsychoJS shows relatively high variability with Firefox on a macOS desktop, but fairly average performance with the same browser and OS on a laptop.

Conclusions

Our results show that, for the most popular platforms, the timing of web-based research is within acceptable limits. This is the case for both display and response logging, although response logging shows higher delays. Encouragingly, the most consistent combination across all tools, Chrome on Windows, was also the most used in our sample survey.

However, when timing sensitivity is crucial, we recommend employing within-participant designs where possible, to avoid making comparisons between participants with different devices, operating systems, and browsers. Additionally, limiting participants to one browser could remove further noise. Limiting participants' devices and browsers can be done programmatically in all the platforms we tested, and via a graphical user interface in Gorilla.
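As an illustration of the idea of restricting browsers, here is a rough sketch using naive user-agent substring checks (this is not how any of the tested platforms actually implements it, and real user-agent detection is considerably more involved):

```python
def detect_browser(user_agent: str) -> str:
    # Order matters: Chromium Edge UAs contain "Chrome", and Chrome UAs contain "Safari"
    if "Edg/" in user_agent:
        return "Edge"
    if "Chrome/" in user_agent:
        return "Chrome"
    if "Firefox/" in user_agent:
        return "Firefox"
    if "Safari/" in user_agent:
        return "Safari"
    return "Unknown"

def allowed(user_agent: str, allowed_browsers=("Chrome",)) -> bool:
    return detect_browser(user_agent) in allowed_browsers

chrome_ua = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
             "(KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36")
firefox_ua = "Mozilla/5.0 (Windows NT 10.0; rv:89.0) Gecko/20100101 Firefox/89.0"
```

Screening participants onto one browser-OS combination this way trades a smaller eligible pool for lower between-participant timing noise.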

Further Reading

For more in-depth reporting, please read our paper: https://doi.org/10.3758/s13428-020-01501-5

Notes: All platforms are continually under development, so we tested all platforms between May and August 2019 in order to be as fair as possible.