Data Annotation AI
March 6, 2020 • 5 minute read

Five reasons why AI teams should in-source data labeling

This blog is written courtesy of the Interactions R&D team.

Annotating data is hard work. Depending on the speech and natural language technology you’re developing, few innovative projects can reuse the same type of data, or even the same annotation scheme. To develop a high-performing system, both the R&D team and the media annotators need to be adaptable.

Recently I’ve been working on a challenging task that requires real-time entity capture in customer care calls. For this task we need transcribed audio, entity labels, and form-friendly values for complicated sequences of characters, such as account numbers and email addresses. This kind of data is hard to find, so we needed to invest in data annotation. Since some of the data contains sensitive personal information, outsourcing the annotation work is not an option; we value our customers’ privacy too much to risk a breach. So we’ve “in-sourced” much of our entity annotation by staffing media annotation specialists on-site, and in the process we’ve reaped many ancillary benefits.
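To make the shape of that data concrete, here is a minimal sketch of what one annotated utterance might look like: a transcript, an entity label over a span of the transcript, and the normalized, form-friendly value an agent desktop or CRM would actually need. The AnnotatedUtterance and EntitySpan structures and all field names are hypothetical illustrations, not our production schema.

```python
from dataclasses import dataclass, field

@dataclass
class EntitySpan:
    """A labeled span in the transcript plus a normalized, form-friendly value."""
    label: str        # e.g. "EMAIL" or "ACCOUNT_NUMBER" (illustrative label set)
    start: int        # character offset into the transcript
    end: int
    raw_text: str     # what the caller actually said
    normalized: str   # the value a downstream form or CRM expects

@dataclass
class AnnotatedUtterance:
    """One transcribed utterance from a customer care call with its entity annotations."""
    call_id: str
    transcript: str
    entities: list = field(default_factory=list)

# Example: the caller spells out an email address; the annotator supplies both
# the verbatim span and the normalized value.
utterance = AnnotatedUtterance(
    call_id="call-0042",
    transcript="my email is j smith forty two at example dot com",
    entities=[
        EntitySpan(
            label="EMAIL",
            start=12,
            end=48,
            raw_text="j smith forty two at example dot com",
            normalized="jsmith42@example.com",
        )
    ],
)
```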

In-sourcing data annotation has many advantages over outsourcing. Here’s why:

1. Data security

The first obvious place to start is data security. With stringent regulations like HIPAA, GDPR, and PCI DSS, as well as intellectual property concerns, companies need to take extra steps to protect their data and their customers’ data. While mature annotation firms have measures in place to demonstrate compliance in many of these areas, the risk of data breaches in transit and unknowns about who has access to the data often cause companies to restrict or severely limit data sharing. This creates a frustrating situation for data scientists and machine learning practitioners, who need annotated data to build models.

Bringing annotation in-house gives companies the opportunity to create the environment they require to ensure regulatory compliance. And having an on-site annotation staff creates an opportunity to build trusted relationships with the people closest to your data.

2. Optimizing workflows

What should the annotation team work on next? Most AI companies have multiple diverse machine learning tasks that require data annotation. Resource constraints mean that transcription and annotation teams may juggle multiple tasks with different rules. I’d argue that this is a good thing. Transcribing and labeling data is an exhausting task. With an on-site team, I can periodically monitor for fatigue and plan a “change in scenery”. After negotiating with other teams, we can co-develop an annotation schedule that mitigates fatigue and continues to provide the annotated data each AI team needs to improve their models.

3. Agile annotation

Annotation guidelines should not be created in a vacuum. Linguistic experts can design beautiful annotation rules that seem perfect for AI model building. But can the annotation team label new data exactly to spec? If so, how quickly? And will their peers agree with their judgment?

Like agile planning, annotations fall into prioritization categories, such as must have now, must have later, nice to have, and not needed. With active discussion and feedback, it’s easy to make minor modifications to the annotation guidelines to keep a steady stream of accurately annotated data coming and to ensure, from one week to the next, that your models are improving in quality.

4. Shared vision and mutual appreciation

The highest-performing teams are those aligned in mission. The on-site annotators we hire empathize with our mission. Each week we can check in with them and remind them of the value they provide to our project. Just as data scientists love data, our scientists look for ways to show appreciation to the annotators for filling a critical yet often overlooked gap in AI model development. And our annotators are encouraged that there is a purpose beyond the sometimes repetitive tasks they are doing. Also, our annotators have some of the most interesting personalities! They contribute to our work culture as well as our data quality.

5. Production support?

Yes, production support. Thanks to active learning, my team is collecting and annotating last night’s production data to focus on the areas where our AI models most need to improve. In the process, we encourage our annotation team to follow the NYC Subway motto: “If you see something, say something.” I can confidently say that in the past month, our labeling team detected three significant production issues related to the last model deployment. Customer service agents were having increased problems servicing customer calls, but the problem wasn’t discovered until our annotators observed significant changes in the call transcripts they were annotating. If the labeling team had been off-site, they wouldn’t have been able to come to my desk and let me know.
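For readers curious what the selection step of that active-learning loop could look like, here is a minimal sketch based on plain uncertainty sampling: pick the production utterances the model was least confident about and send those to the annotation team first. The select_for_annotation function, the annotation budget, and the confidence scores are illustrative assumptions, not a description of our production pipeline.

```python
import heapq

def select_for_annotation(predictions, budget=200):
    """Pick the production utterances the model was least confident about.

    `predictions` is an iterable of (utterance_id, confidence) pairs, where
    confidence is the model's probability for its top prediction. The least
    confident examples are the ones most worth routing to the annotators.
    """
    # Lowest confidence first: these are the examples the model is most unsure about.
    return heapq.nsmallest(budget, predictions, key=lambda pair: pair[1])

# Example: scores from last night's calls (illustrative values only).
last_night = [("utt-001", 0.97), ("utt-002", 0.41), ("utt-003", 0.88), ("utt-004", 0.52)]
queue = select_for_annotation(last_night, budget=2)
# -> [("utt-002", 0.41), ("utt-004", 0.52)] go to the annotation team first
```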

Should I go 100% on-site?

One challenge of going all-in on on-site annotation is that it can be more expensive than outsourcing. To respect budget constraints, I recommend building a core team of dependable annotators who serve as your sounding board. You can experiment with the annotation guidelines and reap some of the benefits I’ve described above. Once you’ve stabilized the guidelines, feel free to grow your annotator pool by outsourcing the safer tasks.


Data annotation is certainly a challenging process. But having an on-site annotation team helps us quickly find the right balance between detail and speed, and allows us to avoid compromises in data security. Few people understand your data better than the people who annotate it, so as a data scientist, make sure to take advantage of their insights!

Want to learn more? Let’s talk.