Guidelines for Setting Up a Crowdsourcing Annotation Pipeline
01 September, 2022
At Initium, we often require annotations or other forms of data processing that are best done with the help of crowdsourcing. In this blog post, we share the main steps we follow when setting up an annotation task on a crowdsourcing platform, with the hope that these general guidelines will be helpful to others who are building data-driven tools.
The ultimate goal is to define and run the annotation task such that the outcome consists of: (1) high-quality annotations; (2) enough annotated items for the target algorithm; and (3) reasonable cost.
1. Task definition
First, define the task. This will entail at least two substeps:
1a. Look closely at several annotated examples. If such examples already exist, study the existing annotations. Otherwise, we may need to annotate multiple examples ourselves. Consider looking both at (i) typical examples for the task; and (ii) outlier examples.
- For action item detection, we can look at examples from other benchmarks.
- For summarization, we can look at examples of manually created summaries in English classes.
1b. Identify existing annotation guidelines. These guidelines may be available in papers or GitHub repositories from research groups that have already explored related tasks. Guidelines can also be found in other domains that have explored the task.
- For keyword extraction, a survey paper or a paper introducing a new benchmark may include explicit annotation guidelines.
- For summarization, literature from English classes may include directions on how to summarize text.
2. Create annotation guidelines
Once you have a clear understanding of the task, and what needs to be considered during annotation, the next step is to develop annotation guidelines. The guidelines should include:
- Brief description of the task goal.
- Brief description of the main characteristics of a good annotation.
- (Optional) Brief description of the main characteristics of a wrong annotation (what should not be done).
- (Optional) Examples of good annotations. Keep in mind that for certain tasks the examples may bias the annotator (eg, when asking for one's values, listing example values will bias them).
- (Optional) Examples of bad annotations, to highlight situations that should be avoided.
- (Optional) Field for comments and suggestions. This can be especially useful during trials.
- Example for a task to create one-sentence summaries for brief conversations:
- Please read **carefully** the conversation shown.
- Then write a **one sentence summary** of the conversation. Please write the sentence using third person. The sentence should be:
- comprehensive, i.e., it should capture the main points of the conversation;
- concise, i.e., it should only capture the main points and avoid repetitions;
- coherent, i.e., it should make sense on its own.
Each HIT will be reviewed manually, and incorrect or incomplete answers will not be paid.
3. Consider spam filtering alternatives
Crowdsourcing platforms have the benefit of collecting large amounts of annotations in a short amount of time. However, their biggest downside is the amount of spam that exists on these platforms. Unless one pays close attention to ensuring high annotation quality, the resulting data may end up being useless. As they say, "garbage in, garbage out." It is therefore very important to think upfront about strategies to reduce spam and increase annotation quality. Here are some commonly used methods:
3a. Annotation redundancy. Collect multiple annotations per data item, and use this redundancy to (1) identify low quality annotations (those in disagreement with the majority) and filter them out / reject them; and (2) create the final annotation, eg, through majority voting.
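As a minimal sketch of this redundancy strategy, the helper below (a hypothetical function, not part of any platform API) aggregates labels by majority vote, drops items whose majority is too weak, and surfaces the disagreeing labels as candidates for rejection:

```python
from collections import Counter

def aggregate_labels(annotations, min_agreement=2):
    """Aggregate redundant annotations by majority vote.

    annotations: dict mapping item_id -> list of labels from different workers.
    Returns (final, outliers): `final` keeps the majority label for items
    whose winning label has at least `min_agreement` votes; `outliers` lists,
    per item, the labels that disagree with the majority (candidates for
    filtering out / rejection).
    """
    final, outliers = {}, {}
    for item_id, labels in annotations.items():
        label, count = Counter(labels).most_common(1)[0]
        if count >= min_agreement:
            final[item_id] = label
        outliers[item_id] = [l for l in labels if l != label]
    return final, outliers
```

For example, with three workers per item, `{"s1": ["pos", "pos", "neg"]}` yields the final label `pos` for `s1`, while the lone `neg` vote is flagged as an outlier. The `min_agreement` threshold is an illustrative choice; in practice it depends on how many annotators you assign per item.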
3b. Checkpoints in the annotation. For certain tasks, you can add 'checkpoints' with already known answers. Once you collect the annotations, those HITs that give the wrong answer for a checkpoint can be removed (and possibly also rejected).
- When multiple sentences are annotated numerically, one sentence could read "For this sentence, enter 7."
- For a survey, you could insert a question "What is the color of the sky?"
- You could insert previously annotated items, for which the annotation is unambiguous
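The checkpoint filtering described above can be sketched as follows; the data shapes (one answer dict per HIT, keyed by question id) are assumptions for illustration, not a platform format:

```python
def filter_by_checkpoints(hits, checkpoints):
    """Split submitted HITs by whether they pass all embedded checkpoints.

    hits: list of dicts mapping question_id -> answer, one dict per HIT.
    checkpoints: dict mapping checkpoint question_id -> known correct answer.
    Returns (passed, failed); failed HITs can be removed from the dataset
    and possibly rejected.
    """
    passed, failed = [], []
    for hit in hits:
        ok = all(hit.get(q) == answer for q, answer in checkpoints.items())
        (passed if ok else failed).append(hit)
    return passed, failed
```

With the "enter 7" checkpoint from the first bullet, a HIT answering `7` on that question lands in `passed`, and any other answer lands in `failed`.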
3c. Qualifications. Amazon Mechanical Turk allows you to set up a qualification task: a brief task related to the main annotation task, which you verify once workers complete it. Only workers who successfully complete the qualification step are eligible to work on your annotation task.
3d. Masters. Amazon Mechanical Turk offers the Masters qualification, granted to annotators who have consistently done high-quality work. Using Masters for your annotations reduces the risk of spam or low-quality work. The downside is that there are far fewer Masters available than regular workers, and they also require higher incentives.
4. Set up tasks, run trials and adjust guidelines
Now that the guidelines are complete and you have a strategy for avoiding spam and low-quality annotations, you may be ready to set up the task and run a trial. Setting up a task entails different processes depending on the platform; there are multiple crowdsourcing platforms, including Amazon Mechanical Turk, Figure Eight (formerly Crowdflower), Prolific, and others. If you haven't done this before, it is ideal to use the sandbox whenever possible, or to start from existing templates (eg, Mechanical Turk has multiple templates available). Pay attention to the task settings as well, eg, how many annotators per HIT, pay per HIT, number of days until automatic approval, etc.
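When choosing those settings, it helps to estimate the total cost before launching. A rough sketch, assuming the platform adds a fee as a fraction of worker pay (the 20% default below is an illustrative assumption; check your platform's current fee schedule):

```python
def estimate_cost(num_items, annotators_per_item, pay_per_hit, platform_fee=0.20):
    """Rough budget estimate for an annotation task.

    num_items: number of data items to annotate.
    annotators_per_item: redundancy factor (annotators per HIT).
    pay_per_hit: worker pay per HIT, in your currency.
    platform_fee: assumed fee the platform charges on top of worker pay.
    """
    worker_pay = num_items * annotators_per_item * pay_per_hit
    return round(worker_pay * (1 + platform_fee), 2)
```

For example, 1,000 items with 3 annotators each at 0.10 per HIT comes to 300 in worker pay, or 360 including the assumed 20% fee. Running this arithmetic early makes the cost/redundancy trade-off explicit before the trial.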
The first trial should only aim at a handful of annotations. The goal is to make sure that:
(1) the task is set up correctly; (2) the workers see it the way you see it; (3) you can collect their annotations correctly. After the first trial, carefully look at the annotation outcome, also read any comments provided by the workers, and adjust the guidelines as needed.
The second trial is often larger (eg, 2-5 times the size of the first), and can be helpful for further adjustments. Depending on its size, you can also use the data for your algorithm development, which can provide further feedback on the annotations and their format.
5. Run the task at scale
This is the final step, when you run the task at scale. Keep a close eye on the incoming annotations to make sure:
- The annotations are coming at the pace you want. Often the pace slows down significantly over time, which may require adjustments such as increasing the pay and/or restarting the task (so that it gets to the top of the list of available tasks)
- The quality continues to be reasonable. When scaling, if you notice a significant decrease in annotation quality, stop the task and adjust as needed (this may require going back to step 3 or 4).
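One way to watch for such a quality drop, assuming you embedded checkpoints (step 3b), is to track checkpoint accuracy over a sliding window of recent submissions. A minimal sketch, with illustrative window and threshold values:

```python
def quality_alert(checkpoint_results, window=50, threshold=0.8):
    """Flag a quality drop while the task runs at scale.

    checkpoint_results: chronological list of booleans, one per submitted HIT,
    True if the HIT answered its embedded checkpoint(s) correctly.
    Returns True if checkpoint accuracy over the most recent `window` HITs
    falls below `threshold`, signalling the task should be paused and adjusted.
    """
    recent = checkpoint_results[-window:]
    if not recent:
        return False
    return sum(recent) / len(recent) < threshold
```

A steady stream of correct checkpoints keeps the alert quiet, while a run of failed checkpoints in the most recent window trips it, prompting the pause-and-adjust loop described above.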
We make a distinction between low-quality work and spam. When it comes to spam, we want to discourage such behavior on crowdsourcing platforms, and consequently we do not want to pay for it. There are, however, also annotations that are not spam but are low quality. We typically pay for these contributions (possibly with some feedback to the annotators) but do not include them in the final dataset. In other words, be kind and generous with annotators, but detach that decision from what is included in the dataset.
Crowdworker qualifications may be available. Depending on the task you are running and the qualifications needed, certain platforms may have qualifications available, sometimes at an extra cost. For instance, Mechanical Turk allows for the selection of certain user groups (gender, profession, etc.) for an extra cost, while Prolific allows such selection at no extra cost. You can also define qualifications yourself, specific to the task (eg, have workers complete a test you set up, which qualifies them for the main task).
Initium.AI leverages recent advances in Natural Language Processing and Machine Learning to transform natural language into actionable insights.