Last fall, I conducted my first user study to evaluate my first programming language. Looking back, the study was more adventurous than I had thought. I am grateful that it ended with interesting results.
The language design involved a new language construct, and I wanted evidence of how this construct affects developers. As a young student entering the field, I was more excited about watching people use the language directly than about indirect measures such as descriptions of code shape. Why not try controlled experiments with human developers, the way people evaluate tools? I did not think much about it at first.
Conducting a user study is risky. There are generally four possible outcomes: (a) success by detecting evidence, (b) borderline success by detecting counter-evidence, (c) failure by not drawing a conclusion due to lack of participants, and (d) failure by observing more noise than useful information. As a computer systems student, I was used to the first two possibilities. The third possibility is easy to minimize. The last possibility made me nervous, since it was hard to eliminate when dealing with human beings.
Adding to the risk, we may not learn much when a user study does fail. For my study, a failed experiment taught me lessons like “I should have included a table of contents for the language manual” and “I should have supported syntax highlighting in gedit so that participants can easily notice syntax errors.” However, these lessons were not that inspiring, and each cost me at least one participant.
Conducting a user study is pricey, especially for controlled experiments. The first cost is time. Before starting any experiment, there is a long planning stage. To minimize the risks of failure, proceed progressively: first analyze small examples by hand, next design the study protocol, then implement the experimental materials, sanity check with one or two friends, adjust accordingly, then perform small-scale pilot studies, adjust accordingly, and finally start the “real” experiments. The experiments themselves also take time. For controlled experiments, we need all measurements before seeing the final results. Overall, this feedback loop is much slower than that of typical tasks in computer science.
Besides time, participants are also pricey. Each participant can be used only once. If a participant accidentally becomes an outlier, that participant is gone forever. Retrying with the same participant is unacceptable, since humans have memories. Stricter still, for controlled experiments, changing the number of participants or the experimental protocol halfway through is statistically wrong. Restarting all experiments would require a whole new set of participants. Nightmare.
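To see why changing the number of participants halfway through is a problem, here is a small simulation sketch. It is not from my study. It assumes two groups with no real difference between them, normally distributed measurements with known unit variance, and a simple two-sided z-test. The names `p_value` and `false_positive_rate`, the group sizes, and the 0.05 threshold are all made up for illustration. One experimenter fixes the group size in advance. The other checks the result after every added pair of participants and stops as soon as it looks significant.

```python
import random
from statistics import NormalDist, mean


def p_value(a, b):
    """Two-sided z-test p-value for equal means (assumes known unit variance)."""
    n = len(a)
    # The difference of two sample means has variance 1/n + 1/n = 2/n.
    z = (mean(a) - mean(b)) / (2 / n) ** 0.5
    return 2 * (1 - NormalDist().cdf(abs(z)))


def false_positive_rate(peek, trials=2000, start=10, stop=30, alpha=0.05, seed=0):
    """Fraction of simulated studies that report an effect that does not exist."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # Both groups come from the same distribution: there is no real effect.
        a = [rng.gauss(0, 1) for _ in range(stop)]
        b = [rng.gauss(0, 1) for _ in range(stop)]
        # Peeking: test at every group size from start to stop.
        # Fixed: test once, at the planned group size.
        sizes = range(start, stop + 1) if peek else [stop]
        if any(p_value(a[:n], b[:n]) < alpha for n in sizes):
            hits += 1
    return hits / trials


fixed = false_positive_rate(peek=False)
peeking = false_positive_rate(peek=True)
print(f"fixed group size:     {fixed:.3f}")
print(f"stop when it 'works': {peeking:.3f}")
```

Both runs use the same seed and therefore the same simulated data, so the only difference is how often the experimenter looks. Under these assumptions, the fixed design should report a false effect at roughly the nominal rate, while the peeking design reports one noticeably more often.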
The stress gets worse if eligible participants are rare. For my study, the participants needed to be capable programmers who satisfied some requirements. They also had to be clueless about what my experiments were about. Most demanding of all, they had to be willing to spend two hours writing nontrivial programs in a new language for a trivial compensation. There may not be a ton of such people.
Other concerns are less problematic: It can be challenging to make a user study effective at evaluating the research project. It takes some training to make a study scientifically meaningful. Analyzing raw data can be repetitive manual work.
This user study became an interesting experience for me as a language designer, both technically and emotionally.
Now, what would I change if I had a chance to rewrite history? First, think more about the project’s evaluation early on. Next, embrace the risks of user studies and be prepared. Then, ask my advisor for a higher compensation to motivate participants, because I want a thousand times more volunteers.
(Thank you, all the anonymous participants! You helped a lot.)
Claes Wohlin, Per Runeson, Martin Höst, Magnus C. Ohlsson, Björn Regnell, and Anders Wesslén. 2012. Experimentation in Software Engineering. Springer Publishing Company, Incorporated.