Review and Annotate Conversations

UX expert Jakob Nielsen is often quoted as saying that the first rule of UX is to base your product on what users do, not what they say. That is, you should consider what users self-report to be unreliable, and instead base your product decisions on user behaviors you observe.

With AI assistants, what users say is actually what they do. Software teams often have to infer user behavior from heat maps or telemetry, but an assistant automatically keeps a detailed record of every user interaction in its conversation data. Conversation data leaves nothing to the imagination: you can see exactly how every user interaction went.

Unlocking the power of this data is one of the most important things you can do to build better assistants, and it’s one of the central ideas behind conversation-driven development. 

Rasa X collects the conversations users have with your assistant and makes them accessible through a user interface. You can filter conversations by whether a fallback action occurred, by the channel the conversation came through, by the length of the conversation, and more. You can also tag conversations to keep track of important trends.

But conversation-driven development goes deeper than just uncovering insights about users. Conversations can be converted to training data for your NLU model, creating a virtuous cycle: users talk to your assistant, and your assistant becomes better and better at understanding what they say. With Rasa X, you can see which user messages don’t yet exist in your training data. Then, you can label the intent and entities for each message and add it to your training data file. You can also use Rasa X to convert whole or partial conversations into training stories.

In this play, we’ll walk through a few workflows you should know in order to analyze your conversation data and create training examples. Run this play with the members of your team who are responsible for adding new training data to the assistant and steering the direction of your assistant’s design.


Play 3: Filter Conversations and Annotate Messages

What you’ll need:

  • A Rasa X instance, running locally or on a server
  • A few conversations, collected in Rasa X. You can collect this data by running Play 2, or, if you’re already in production, by connecting your live assistant to Rasa X.

Time: 1-1.5 hours

Participants: 1-3 team members

Step 1: Filter conversations

Open Rasa X and navigate to the Conversations screen. First, we’ll apply a filter to omit very short conversations. Open the filter menu, select Other as the filter type, and set the Message length filter to 5. This helps us weed out conversations where the user opened the chat session and immediately closed it. (Note that filtering for very short conversations can also be useful for diagnosing low engagement.)
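Rasa X applies this filter for you in the UI, but the logic is easy to picture in code. Here is a minimal sketch in plain Python, assuming the filter acts as a minimum count of user messages. The Conversation class and its field names are hypothetical and are not part of Rasa X.

```python
from dataclasses import dataclass, field


@dataclass
class Conversation:
    # Hypothetical record for illustration; not the Rasa X data model.
    conversation_id: str
    user_messages: list = field(default_factory=list)


def at_least_n_messages(conversations, minimum=5):
    """Keep only conversations with at least `minimum` user messages."""
    return [c for c in conversations if len(c.user_messages) >= minimum]


conversations = [
    # The user opened the chat and left right away.
    Conversation("a1", ["hi"]),
    # A conversation long enough to be worth reading.
    Conversation("b2", ["hi", "check my balance", "savings", "thanks", "bye"]),
]

long_enough = at_least_n_messages(conversations, minimum=5)
print([c.conversation_id for c in long_enough])
```

Inverting the comparison would give the opposite view mentioned above: only the very short conversations, which can help when you are diagnosing low engagement.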

Read through 3-5 conversations. Ask yourself:

  • Did the assistant make any mistakes? What type of mistakes?
  • Did the user’s sentiment seem positive or negative?
  • Was there anything surprising about the way the user phrased their requests?
  • Was the user able to complete their goal?

Step 2: Use tags to apply labels

In the right-hand panel, click the Tags gear and select the option to create a new tag. For each conversation you read, consider which labels might help you categorize and evaluate the conversation. Here are a few ideas:

  • Positive or negative sentiment
  • NLU error
  • Out-of-scope request (or feature request)
  • Goal completed

Apply one or more tags to the conversation. If you find individual messages that require further discussion, use the message flag feature to mark and share them.
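Once a handful of conversations are tagged, counting the tags gives a rough picture of trends. The sketch below is illustrative only: it assumes you have copied the tags into a plain Python dictionary by hand, and the conversation IDs and tag names are made up.

```python
from collections import Counter

# Hypothetical data: conversation ID -> set of tags applied during review.
tagged_conversations = {
    "conv-1": {"negative sentiment", "NLU error"},
    "conv-2": {"positive sentiment", "goal completed"},
    "conv-3": {"NLU error", "out-of-scope request"},
}

# Count how many conversations carry each tag.
tag_counts = Counter(
    tag for tags in tagged_conversations.values() for tag in tags
)

for tag, count in tag_counts.most_common():
    print(f"{tag}: {count}")
```

A tag that keeps rising to the top of this list, such as an NLU error tag, is a good candidate for the team's next round of fixes.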

Step 3: Search for significant conversations

Return to the filters on the Conversations screen and note that you can now filter by the tags you’ve created. Use the filters to search your conversations and consider which combinations produce an interesting sample set. For example, you might want to filter only by the Tester channel, to exclude conversations created by the development team, or you could filter for conversations where a fallback action was invoked or the NLU confidence was low.

As you read conversations, use the Mark as reviewed button to indicate you’ve read them (you can use this label to exclude conversations that have already been reviewed from your filtered view). You can also use the Save for later button to set conversations aside if they seem notable but need further review.
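The filters in this step combine with one another: each one narrows the set further. The following sketch shows that idea in plain Python. It is not how Rasa X is implemented, and the ConversationSummary class, its fields, and the channel names are assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class ConversationSummary:
    # Hypothetical fields for illustration; not the Rasa X data model.
    conversation_id: str
    channel: str
    tags: set = field(default_factory=set)
    fallback_triggered: bool = False
    min_nlu_confidence: float = 1.0
    reviewed: bool = False


def select_for_review(conversations, channel=None, tag=None,
                      only_fallback=False, max_confidence=None):
    """Return unreviewed conversations that pass every filter that is set."""
    selected = []
    for c in conversations:
        if c.reviewed:
            continue  # already marked as reviewed
        if channel is not None and c.channel != channel:
            continue
        if tag is not None and tag not in c.tags:
            continue
        if only_fallback and not c.fallback_triggered:
            continue
        if max_confidence is not None and c.min_nlu_confidence > max_confidence:
            continue
        selected.append(c)
    return selected


conversations = [
    ConversationSummary("c1", "tester", {"NLU error"}, True, 0.31),
    ConversationSummary("c2", "rasa", {"Goal completed"}),
    ConversationSummary("c3", "tester", {"NLU error"}, True, 0.42, reviewed=True),
    ConversationSummary("c4", "tester", {"Goal completed"}, False, 0.95),
]

picked = select_for_review(conversations, channel="tester", only_fallback=True)
print([c.conversation_id for c in picked])
```

Leaving a parameter at its default skips that filter, which mirrors how you can add or remove filters one at a time while looking for a useful sample.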

Step 4: Annotate user messages

Navigate to the NLU Inbox. The NLU Inbox displays user messages that don’t match any training examples you already have in your NLU data. 
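To make the idea concrete, here is a rough sketch of finding messages that are missing from your training examples. How Rasa X matches messages is not described here, so the sketch assumes an exact match after lowercasing and collapsing whitespace. The function names are hypothetical.

```python
def normalize(text):
    """Lowercase the text and collapse runs of whitespace."""
    return " ".join(text.lower().split())


def unseen_messages(user_messages, training_examples):
    """Return user messages with no match in the training examples.

    Duplicate messages are reported once, in first-seen order.
    """
    known = {normalize(example) for example in training_examples}
    already_listed = set()
    unseen = []
    for message in user_messages:
        key = normalize(message)
        if key not in known and key not in already_listed:
            already_listed.add(key)
            unseen.append(message)
    return unseen


training_examples = ["hello", "check my balance"]
user_messages = [
    "Check my balance",
    "whats my balance??",
    "hello",
    "Whats my  balance??",
]

inbox = unseen_messages(user_messages, training_examples)
print(inbox)
```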

Use the sort dropdown to bring messages with low confidence to the top of the inbox. Working your way from lowest to highest confidence, correct the intent and entity labels if needed and mark the message Correct to add it to your training data. If you find messages you don’t want to add to your training data, e.g. spam, delete them.
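The sketch below walks through the same workflow on plain Python data: sort messages from lowest to highest confidence, then write each corrected message as a training example using Rasa's inline entity markup, [value](entity_type). The message fields, intent names, and entity names are hypothetical, and Rasa X handles all of this for you when you mark a message Correct.

```python
def to_training_example(text, entities):
    """Render a message with Rasa-style inline entity markup.

    `entities` is a list of dicts with "start", "end" (character offsets)
    and "entity" (the entity type). Spans are assumed not to overlap.
    """
    parts = []
    cursor = 0
    for ent in sorted(entities, key=lambda e: e["start"]):
        value = text[ent["start"]:ent["end"]]
        parts.append(text[cursor:ent["start"]])
        parts.append(f"[{value}]({ent['entity']})")
        cursor = ent["end"]
    parts.append(text[cursor:])
    return "".join(parts)


# Hypothetical inbox contents, as predicted by the current model.
inbox = [
    {"text": "hello there", "intent": "greet", "confidence": 0.93},
    {"text": "what is my savings balance", "intent": "greet", "confidence": 0.28},
    {"text": "bye for now", "intent": "goodbye", "confidence": 0.61},
]

# Review the least confident predictions first.
by_confidence = sorted(inbox, key=lambda m: m["confidence"])

# Correct the first message by hand: fix the intent and label the entity.
message = by_confidence[0]
start = message["text"].index("savings")
corrected = {
    "intent": "check_balance",
    "entities": [
        {"start": start, "end": start + len("savings"), "entity": "account_type"}
    ],
}

example_line = to_training_example(message["text"], corrected["entities"])
print(corrected["intent"], "->", example_line)
```

Starting from the lowest confidence means the first messages you correct are the ones the model is most likely to have gotten wrong.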

After annotating data, retrain and test your new model.

Step 5: Document what needs to be fixed

Create tickets for issues that need to be fixed, using your organization’s issue-tracking or project management tool. If you’re running Rasa X in server mode, you can include a deep link to the message or conversation where the issue was observed, to provide additional context.

Discussion Questions

  1. Did anything surprise you about the way users formed their requests or interacted with your assistant?
  2. Did users ask your assistant to do anything it hadn’t yet been programmed to do? How did your assistant respond, and how could it respond better?
  3. Which filters yielded the most interesting results?
  4. Were there synonyms in user messages that your assistant hadn’t seen before?  
  5. How do team members keep track of which messages have been reviewed?
  6. Which roles on your team are responsible for annotating data? Which tools have they used for this task?
  7. Did you find any user messages that weren’t easily categorized into an existing intent label? How do you handle these messages?
  8. Are there any patterns you see in the conversations that weren’t successful?
  9. How frequently are conversations reviewed by the team - daily, weekly, or monthly?

Next: Fix and Test

Back: Share your Assistant