Soon after Microsoft released Copilot+ PCs, Apple debuted Apple Intelligence at WWDC24, powered by multimodal large models that integrate AI into most of Apple's native apps. The emergence of consumer AI that combines hardware and software heralds the arrival of large models in the lives of "the rest of us".
No data, no AI. Data annotation is seen as a key link in solving AI problems. In the era of large models, data annotation itself is being transformed.
In this blog, I'll discuss how data annotation has changed and how it makes large models a reality.
Without further ado, let's get started!
Data Annotation is Transforming in the Era of Large Models
Traditionally, smaller-scale machine learning relied heavily on supervised learning, which requires vast amounts of human-annotated data. This was a manual, labor-intensive process. Take the field of autonomous driving: annotators had to label pedestrians, vehicles, roads, and other objects in camera and radar data to train perception models.
Some data labeling and annotation platforms, such as BasicAI Cloud*, have since added AI-assisted features, but human annotators still need to carefully review and correct the results.
As models grow larger, the time and cost of data labeling and annotation become unmanageable, struggling to keep pace with the demands of large models. Unsupervised learning and reinforcement learning have therefore gained traction, as they require less labeled data.
So is data annotation becoming obsolete in the era of large models?
Far from it. The role of human input is simply evolving, becoming more specialized and integral to the end-to-end model development process. To understand how, let's look at a typical Large Language Model (LLM) training pipeline.
Data Annotation in the Training Pipeline of Large Models
How Large Language Models (LLMs) Are Trained
Based on the available information, training a model like ChatGPT typically involves three key stages:
Pretraining >
Supervised Fine-Tuning (SFT) >
Reinforcement Learning from Human Feedback (RLHF).
The latter two stages are critical for aligning the model's outputs with human preferences and values.
Human data labelers play different roles in each stage, quite unlike the simpler data labeling tasks of the past. They must be highly skilled professionals who can provide high-quality answers that align with human logic and expression, or subjectively judge which model responses best match human preferences.
Pretraining: Laying the Foundation
LLM training begins with collecting a large set of documents, known as a textual corpus, typically from publicly available sources like Common Crawl, The Pile, and RefinedWeb. However, much of this raw web data is noisy and unstructured. Left uncleaned, this dirty data creates performance bottlenecks for large models.
Data annotators need to preprocess (or clean) the data — tokenizing text, filtering out irrelevant content, and correcting errors and inconsistencies. It's estimated that only around 1% of the raw data makes it through this process to be used for pretraining.
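To make this concrete, here is a minimal sketch of what one cleaning pass over raw web text might look like. The heuristics (minimum length, alphabetic ratio) and the function name are illustrative assumptions, not any particular provider's actual pipeline:

```python
import re

def clean_document(text):
    """Illustrative cleaning pass: strip leftover markup, normalize whitespace,
    and drop documents that are too short or mostly non-alphabetic.
    Thresholds are hypothetical, for demonstration only."""
    text = re.sub(r"<[^>]+>", " ", text)      # remove residual HTML tags
    text = re.sub(r"\s+", " ", text).strip()  # collapse whitespace
    if len(text) < 200:                       # assumed minimum document length
        return None
    alpha_ratio = sum(c.isalpha() for c in text) / len(text)
    if alpha_ratio < 0.6:                     # assumed quality threshold
        return None
    return text

raw_docs = ["<p>Example   web page text ...</p>", "1234 5678 %%% ###"]
cleaned = [doc for doc in map(clean_document, raw_docs) if doc is not None]
```

In practice, deduplication, language identification, and toxicity filtering are layered on top of simple rules like these.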
Subsequently, algorithm engineers pre-train the model on the preprocessed text data. Pretraining uses an autoregressive objective: the model learns the conditional probabilities between tokens from word-frequency statistics in the training data and performs next-token prediction. The goal of pretraining is for the model to understand the structure and semantics of natural language, so it performs better in subsequent supervised learning tasks.
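Conceptually, next-token prediction means estimating a probability distribution over the vocabulary given the preceding context. The toy sketch below uses a frequency-based bigram model, far simpler than a real transformer, but it illustrates the same idea of learning conditional probabilities from word-frequency information:

```python
from collections import Counter, defaultdict

# Toy corpus; real pretraining corpora contain trillions of tokens.
corpus = "i am going to the beach to swim . i am going to the gym to train .".split()

# Count how often each token follows each preceding token (a bigram model).
next_counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    next_counts[prev][nxt] += 1

def next_token_distribution(prev_token):
    """Conditional probability of the next token given the previous one."""
    counts = next_counts[prev_token]
    total = sum(counts.values())
    return {token: count / total for token, count in counts.items()}

print(next_token_distribution("to"))
# {'the': 0.5, 'swim': 0.25, 'train': 0.25}
```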
The pre-trained model has some general knowledge-answering capabilities, but it is not yet accurate enough.
Supervised Fine Tuning (SFT): Refining the Model
We want the large model to "speak our language", so we need to show it more examples of how we talk to maximize the accuracy of its next-token predictions.
In real-world scenarios, there are many kinds of conversation, so we also need different "Prompt & Response" examples. For example, OpenAI designed InstructGPT's training data to reflect actual usage, with different proportions of prompt types: Generation (45.6%), Open QA (12.4%), Brainstorming (11.2%), Chat (8.4%), Rewrite (6.6%), Summarization (4.2%), etc.
In this stage, annotation experts take on a more creative role. Given a prompt, they compose high-quality responses that exemplify desired model behaviors, which are then validated by expert reviewers.
These Prompt & Response pairs (typically 10K+) make up the annotated dataset for SFT. Fine-tuned on this data, the new model encodes this knowledge in its parameters. Given a new question, the model predicts the probability of each candidate next word to compose an answer. For example:
"I'm going to the beach to ["swim" (0.35), "relax" (0.25), "surf" (0.18), etc.]".
However, the limited scale of expert-crafted data constrains the model's performance. Such models still need improvement.
OpenAI took a different approach, collecting comparison data and training a reward model, a process known as Reinforcement Learning from Human Feedback (RLHF).
Reinforcement Learning from Human Feedback (RLHF): Enhancing Performance
RLHF aims to further refine the model's outputs to be more reliable, truthful, and safe. Compared to the SFT stage, the task design for RLHF is simpler.
In this stage, the model generates multiple candidate responses for each prompt. Annotation experts judge which response better aligns with human preferences, producing a large volume of labels such as answer rankings and tags. Note that this is a subjective process, influenced by how humans perceive the language model's generated results.
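For illustration only, a single comparison record produced by an annotator might look like the dictionary below; the schema is an assumption, since every platform defines its own fields:

```python
# Hypothetical comparison record; the field names are illustrative.
comparison_example = {
    "prompt": "Explain why the sky is blue to a 10-year-old.",
    "responses": {
        "A": "Sunlight is made of many colors. Blue light bounces around ...",
        "B": "Rayleigh scattering causes shorter wavelengths to scatter more ...",
    },
    "ranking": ["A", "B"],  # the annotator prefers response A over B
    "flags": {"harmful": False, "hallucination": False},
}
```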
With enough "human feedback" data (typically 100K+ comparisons), a reward model can be trained to score responses by how well they align with human preferences. The LLM then optimizes its parameters against this reward signal via reinforcement learning. Through iteration between the LLM and the reward model, the system continuously improves.
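A common way to train the reward model on such comparisons, used in InstructGPT-style RLHF, is a pairwise ranking loss: the score of the response the annotator preferred should exceed the score of the rejected one. The sketch below uses PyTorch and hypothetical score values:

```python
import torch
import torch.nn.functional as F

def pairwise_ranking_loss(reward_chosen, reward_rejected):
    """Push the reward of the annotator-preferred response above
    the reward of the rejected response for each comparison pair."""
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Hypothetical reward-model scores for a small batch of comparison pairs.
reward_chosen = torch.tensor([1.2, 0.3, 2.1])
reward_rejected = torch.tensor([0.4, 0.5, 1.0])

loss = pairwise_ranking_loss(reward_chosen, reward_rejected)
print(loss.item())  # lower loss means preferred responses already score higher
```

The trained reward model then scores the LLM's new outputs during reinforcement learning, typically with an algorithm such as PPO.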
Compared to traditional reinforcement learning, this approach accelerates training by incorporating human signals. Even after deployment, user reactions to the model's outputs can provide ongoing human feedback for future iterations, allowing the model to consistently produce relevant results across different real-world environments.
RLHF significantly improves the performance of LLMs. Even without complex prompts, models become better at following instructions, with fewer dangerous behaviors and hallucinations. As OpenAI observed, with RLHF, a 1.3-billion-parameter model can produce outputs that humans prefer over those of a 175-billion-parameter base model.
The annotation work does not end here.
When users receive responses from an LLM-powered chatbot, they may "👍 like" or "👎 dislike" the answers, which provides new "human feedback" for subsequent model iterations.
Importance of Data Annotation in Large Model Training
From the explanations above, we can see that LLMs can be pretrained without labeled data, using next-token prediction to learn how to complete sentences in the most contextually coherent way.
However, this is often unsuitable for real-world cases.
Such LLMs might provide irrelevant or inappropriate responses for business-specific tasks. For example, if a model is not fine-tuned for mobile phone usage scenarios, a user searching for a nearby restaurant suitable for a date might get a digression on nutrition facts or dietary restrictions instead of helpful suggestions.
LLMs are also susceptible to biases stemming from the training data itself (historical data reflecting biases in past human decisions) or unintentionally amplified by the model during the learning process. This can impact the accuracy and appropriateness of their responses.
If deployed without additional oversight, they may be misused, or they may simply get things wrong. For example, a municipal department may decide to use a pre-trained model to auto-respond to inquiries or complaints, only to find that the model confuses key concepts, cites an outdated tax incentive policy, and provides an incorrect application deadline in its replies.
The gap between expected outputs and actual performance makes human touch indispensable. Human annotators play a big role in optimizing models for practical applications.
Unique Features of Data Annotation for Large Language Models
Subjectivity in Annotation
Unlike the objective "point-and-click" tasks of the previous generation, annotating for large models is more like an open-ended reading comprehension test. Even with guidelines, annotators must subjectively evaluate the quality of model-generated content and select or create the best responses. This shift from objectivity to subjectivity makes annotation work more challenging and reliant on annotators' intelligent participation.
Knowledge-Intensive Task
The training data for large models spans a wide range of domains and topics, from science and technology to culture and art, requiring annotators to have a solid knowledge base and broad expertise.
For example, when annotating a popular science article, annotators need some understanding of the relevant scientific knowledge. Data annotation for fine-tuning LLMs is no longer simple manual labor but rather knowledge work demanding higher educational levels and learning abilities.
Domain Expertise
As LLMs are increasingly applied in vertical domains, there is a growing need for specialized industry knowledge for data annotation. Annotators must have deep expertise in relevant fields.
For instance, when annotating data for fine-tuning a large model for a healthcare application, faced with a question like "Can diabetic patients engage in intense exercise? What precautions should they take?", annotators must understand the pathological characteristics of diabetes, the associated exercise risks, and the relevant precautions to provide an accurate and responsible answer. This often requires professionals with medical or nursing backgrounds.
Pick up the Right Tool and Prepare for Future Challenges
In this post, we explored the tasks, importance, and unique features of data annotation in the era of large models. The industry is clearly facing new challenges.
For annotators, job opportunities will continue to grow, but the roles will become more specialized and segmented. Annotators will need to acquire advanced domain-specific knowledge to excel in roles such as model evaluators, prompt engineers, or domain-specific annotators.
As for traditional annotation businesses that rely on channel and labor-cost advantages, the demand side will increasingly focus on data quality, scenario diversity, and scalability to maximize model potential. This shift presents a unique opportunity for professional annotation service providers, like BasicAI, that differentiate themselves through cross-domain expertise. In addition to upskilling the annotation team, choosing an easy-to-use, collaboration-optimized annotation tool is equally essential.
* To further enhance data security, we discontinued the Cloud version of our data annotation platform on October 31st, 2024. Please contact us for a customized private deployment plan that meets your data annotation goals while prioritizing data security.
Paper List
GPT-1: Improving Language Understanding by Generative Pre-Training
WebGPT: Browser-assisted question-answering with human feedback
InstructGPT: Training language models to follow instructions with human feedback
ChatGPT: Introducing ChatGPT
GPT-4: GPT-4 Technical Report