
Research Updates from Rasa

Transformers in NLU and Dialogue

Originally aired: March 11, 2020

In this recorded webinar, Alan Nichol presents a 1-hour talk on the role transformer-based architectures play in state-of-the-art models for dialogue and language understanding.

Alan covers the dialogue transformer (aka the TED policy), as well as an early look at results on joint intent and entity prediction with a new neural architecture.

Transcript:

All right, fantastic, so thanks everyone for phoning in. We're going to talk about some recent research projects we've done here at Rasa and doing this as an online webinar is kind of an experiment.

So it's cool that so many people were interested in it, and if you have other ideas for things you'd like to have us talk about, things that you're interested in, please post them in the chat here in the Zoom call and if you don't get a chance and you think of something great later on, just post something in the Rasa forum.

The way we'll do questions is the same, some of you have already pre-submitted questions. Thanks very much for that, we'll make sure to get to them at the end. And if questions come up while I'm talking, just post them in the Zoom chat and we'll answer as many as we can in the last minutes of the call.

Cool, so just a quick outline. What we'll talk about today, I'll start with just a couple of words about why we actually do research in the first place, why is it something that we even bother with at Rasa.

Then there's two topics that I want to talk about, one is DIET, which is our new NLU architecture, and I won't talk so much about how it works, I'll mostly talk about the results that we've seen so far, and what we need from you as a community.

And then I'll talk about TED, which is our new dialogue policy. A common thread between these two bits of research is the use of transformer architectures.

We'll talk about the role that the transformer plays in each of those pieces, then we'll take some time for questions that come up throughout the conversation today. And then I'll leave you with some more resources for further reading, in case you want to get into more depth on any of these topics.

Okay, first of all, so why do we do research? Right? Why do we have such a significant investment into research for the size of the company that we are?

Well, like everything we do, we do our research to help us achieve our mission. And our mission as a company is to empower all makers to create AI assistants that work for everyone. And we do that because I believe it's a very important piece of technology that's becoming more mature every year. And I think it would be a real shame if we only get to use conversational AI in the context of being an Apple customer or an Amazon customer or anything like that.

And so we want to make sure that everyone has the tools to build great AI assistants that help them, that help people in their lives, and not just the things that big tech companies want to build. And to make that happen, to achieve that mission, we do three things. We build open-source software; we're building Rasa to be the standard infrastructure for conversational AI. We invest very heavily in the community around the open source, celebrating the great things that everybody's building. We have a showcase on our website, and if you build something cool with Rasa, please reach out; we'd love to feature you on the showcase there as well. And applied research is the third piece that we do. So what we aim to be very good at is doing applied research and bringing those ideas into a production-level code base, so that people can actually use them and benefit from them. And that's mostly, of course, the piece that we'll talk about today.

And it's great to see the community growing so quickly. We've got people really all over the world.

And I think that's reflected in the decisions that we make about how to evolve the code base, that we have such a global footprint. And thanks everyone who's contributed, been on the forum and helped each other out. It's just amazing to see how many people are building really cool things with Rasa. And if you're interested in more of the recent projects that we're working on, we have a page on our website now, rasa.com/research.

So we'll only pick off two of these today, but there's a whole bunch of different things. And if any of these are things that you'd like us to go into more detail on in a future webinar, just, again, comment in the Zoom, and we'll get to it.

And in case anybody asks, by the way, we'll make this webinar available online at some point in the future.

That brings us to today's topics. And there are two pieces that we'll cover today. And they are respectively about NLU or language understanding and dialogue management.

And so this is not the only way you can build conversational AI, but this is the framework or the paradigm that Rasa works in, that language understanding and dialogue are two separate pieces. So the first task when you get a new message in from a user is the language understanding piece.

So in this example, the person has said, 'Yes', and we classify that as the affirm intent. So this is just turning those freeform messages into some structured data: that's typically an intent, and then some entities.
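
To make that concrete, the structured output of the NLU step looks roughly like this (the field names follow Rasa's parse format; the confidence number is just illustrative):

    {
      "text": "yes",
      "intent": {"name": "affirm", "confidence": 0.97},
      "entities": []
    }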

And then the second piece is to say, Well, given this new information and everything I already know, what's the next best thing that I should do? Should I ask the user for more information? Should I make an API call? Should I complete something? That's the dialogue manager piece.

And we'll talk about each of those in turn and the role that transformers play in each of those types of models.

Cool. So the first thing we'll talk about is DIET, which is the dual intent and entity transformer.

So DIET is our new state-of-the-art neural network architecture for performing this task of taking in a message and predicting the intent and the entities.

And the way DIET works is that it consumes one message and actually spits out the intent and the entities at the same time, so it's one model which predicts both. And a really key feature of the way that DIET works is that it can use any pre-trained language model to help make the model understand more about language. So you can use word vectors like GloVe, you can use big language models like BERT, and you can swap them in in a very plug and play fashion.

So what I won't talk about today is how DIET works in detail. I've got this big gray box, just indicating that DIET exists. And there's a great video that Vincent just created on YouTube. There's a link here. Check it out on our YouTube channel. It goes into a lot of detail about the algorithm, the neural network architecture, what all the pieces do, how they fit together, and what sort of dials you have to play with when you're working with DIET yourself. So I'm not going to talk about how it works in detail. We're going to talk about some of the results that we've seen, and some of the things I'd love you all to try and give us feedback on.

So first and foremost, how do you use DIET in your Rasa project? And so if you're already a Rasa user, you'll know about the config YAML file. And so you define a pipeline, which contains a bunch of components to process your messages and perform the language understanding piece. And so the example pipeline I've given here uses a ConveRT featurizer. You need a specific tokenizer for that, because a lot of these language models, they don't split on whole words or white space. They have embeddings for sub words and pieces of words. And so you need a specific tokenizer to go with that featurizer, but then the features get passed into the DIET classifier, which is the DIET model.

And in the way that you're familiar with from working with Rasa NLU, you can swap out this featurizer for anything else you like. So you can use pre-trained language models like BERT, you can use word vectors like GloVe. In this case, I've used ConveRT, and whatever you put before there will get passed to the DIET classifier and used for predicting the intent and entities. And you have a bunch of hyperparameters that you can tweak on the DIET classifier. So by default, it does this multitask learning, predicting both the intent and the entities.

There's also an extra task, which is a bit more advanced, which is whether you want to use the masked language model. And there are a whole bunch of parameters you can play with here. We tried to provide sensible defaults, of course, but you can just swap in the featurizer pieces that you want, and DIET will happily use them to perform the downstream classification. So this means you're not restricted to English. You can use anything you like.
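
For reference, here's a minimal sketch of what that config can look like (component names as of Rasa 1.8; the epochs value and the masked language model flag are just two of the available hyperparameters, with illustrative settings):

    language: en
    pipeline:
      - name: ConveRTTokenizer        # sub-word tokenizer matching the featurizer
      - name: ConveRTFeaturizer       # swap in another featurizer here if you like
      - name: DIETClassifier
        epochs: 100
        use_masked_language_model: false   # the optional extra training task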

And in the experiments I'm about to show, we tried four different types of features to pass into DIET.

The first one is sparse features. And that just means that we're counting character n-grams, so sequences of characters, and we just record a bunch of ones and zeros for which ones show up in a word. So we don't have any kind of pre-trained word embedding or language model.

And the second type we tried is GloVe, which is a variant of word2vec. And it's a model which provides pre-trained word vectors for English. We tried BERT, which is a very well known, very large language model. And we also tried ConveRT, which is a smaller language model, but it's specifically trained for conversations and on conversational data. So we'll look at those four different pieces in our experiment.
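
For the sparse-features-only setup, the pipeline might look something like this (the CountVectorsFeaturizer counts character n-grams; the exact n-gram range here is illustrative):

    pipeline:
      - name: WhitespaceTokenizer
      - name: CountVectorsFeaturizer
        analyzer: char_wb          # count character n-grams within word boundaries
        min_ngram: 1
        max_ngram: 4
      - name: DIETClassifier
        epochs: 100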

The benchmark, the data set we use for doing the experiments, is called the NLU benchmark data set. There's a link here to the GitHub repo, where you can check it out yourself. The domain is human-robot interaction. It's kind of like a smart robot, smart home assistant type of scenario. It's quite large: 64 different intents, 54 different entity types, and about 26,000 labeled examples.

And the previous state of the art on this data set is a model called HERMIT by Vanzo, Bastianelli and Lemon from Heriot-Watt University, and that was at SIGdial last year and that was using ELMo embeddings. ELMo is another well known language model.

Cool. So I'm going to show some things that we found playing with DIET on this large data set. And some of them are maybe counterintuitive, but we think they're interesting enough to share. And we think it adds a lot of nuance to the story of what are big language models good for, and when you really need them.

So the first thing that we found that's maybe surprising is that even with just the sparse features, DIET outperforms the previous state of the art. So the HERMIT model that I mentioned previously makes use of ELMo embeddings, so there's a lot of pre-training that went into that model. And actually, DIET outperforms it by close to a point on the intent and entities without any access to any pre-trained model, so just counting character n-grams.

So it's maybe not something we would have expected. I think it's probably partially related to the size of the data set. There's a fair amount of data in there, but we thought that was a really interesting thing to see. And that's especially encouraging because, like I showed, Rasa is a very global community, so we're not only interested in English, we're not only interested in European languages.

And so it's obviously great if you can get really strong performance without saying to everyone, Oh, you need to have a massive pre-trained language model.

The second piece, which is also maybe counterintuitive, is that GloVe, which is a word vector model, actually outperforms BERT. So the GloVe word vectors are not contextual; the representation of each word doesn't depend on the order of words or anything like that. BERT, on the other hand, is a much slower, much larger, heavier model, trained on far more data with far more computational power, so you would think it would give you lots of benefit. But actually, in our setup, we found that GloVe features cause DIET to perform better than using BERT as a featurizer, which is really interesting.

Okay, and the third is that actually the ConveRT embeddings perform the best on the NLU benchmark data set that we work with. So the best model that we can get for the intent and for the entities actually uses this ConveRT model, which is from Henderson and collaborators. And those features have been available in Rasa for a couple of months already. But now, with DIET, we can get even more out of them with this new architecture.

Result four is that actually DIET outperforms not just using BERT as a featurizer, but it outperforms fine tuning the whole BERT model.

And fine-tuning a large language model is something people are keen to jump to these days and try, but it's very computationally intensive, and typically quite sensitive to the learning rate and other hyperparameters. And we show that actually, using DIET with sparse features and ConveRT as a featurizer, without any fine-tuning, outperforms fine-tuning the whole BERT model. So that comparison is against using the BERT model inside of DIET, backpropagating all the way through, and fine-tuning all of the weights. And the model at the bottom is not only more accurate, it's also six times faster to train than fine-tuning the whole thing.

And I think probably the most interesting thing that we found for everyone here is that when we look at different data sets, and we look at different featurizers that we can use, there isn't one featurizer or one pre-trained model that uniformly outperforms everything else.

So it's definitely worth playing around, if you're using DIET in your model, with different types of pre-trained language models, so if you have word vectors for your language, try those. I mean, certainly try just using sparse features, because that's obviously very fast, and you can do that for any language.

And play around with these different things and let us know how you get on, because we're very interested in helping everyone who uses Rasa and giving them good advice and instructions and suggestions on what kind of models are likely to work well for their use case.

And so if you're trying this out, and you're getting better results with one featurizer than with another or you found a particularly good set of hyper parameters that works for you, I would love it if you came and joined us in the forum. There are over 8,000 other Rasa developers in there and it'd be great if we have everyone sort of sharing their knowledge, sharing their best practice and what they find.

So, of course, we always try to provide the best defaults we can. But I'm not personally a believer in one size fits all machine learning. So we really believe in empowering everyone who's building AI assistants to come in and tweak the models, tweak hyper parameters, play around, decide for themselves, what's best for them.

Cool. That brings us to the second piece, which is on the dialogue model. So I'm going to talk about the transformer embedding dialogue policy, which is a model that we published a few months ago.

So like I said, there are two different pieces. And so what we're talking about now is once the message has come in, what's the next thing that we should do? How should we respond? Should we make an API call? Should we ask another question? Should we answer the user's question? Whatever that might be.

Now, the place to start, typically, if you're building an AI assistant, is to start with your business logic.

So if we are performing a checkout for the user, we probably need a few pieces of information. We need to know the address that we ship to, we need to know their card number, so we can charge them. And we need to know the shipping method.

And so what we need to do is collect this information from the user. And that's not something you need to learn from data. That's something you know a priori. That's something that you determine.

And so the things that are your business logic, you should define in code, because they're very easy to describe as a set of rules.

And so that's the place to cover your happy paths. And so the way we do that in Rasa is with something called a form, and so I've got an example class here of a checkout form.

And a form, at its heart, is actually very simple. It's just a while loop. What it will do is ask the user for each of these slots until it's completed, until it has all the information that it needs.

So that's a great way to build in that kind of business logic in a very concise way. And if you have branching logic, like maybe certain shipping methods are only available in certain countries or something like that, you can override this required_slots method and encode that in a couple of if statements, as in the sketch below. And that keeps it separate from the actual training data, the things that you need to learn from real conversations.
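
Here's a minimal sketch of such a form using the Rasa SDK (the slot names, the country-based branching, and the response template are illustrative, not from the talk):

    from typing import Any, Dict, List, Text

    from rasa_sdk import Tracker
    from rasa_sdk.executor import CollectingDispatcher
    from rasa_sdk.forms import FormAction


    class CheckoutForm(FormAction):
        """Asks for each required slot in turn until checkout info is complete."""

        def name(self) -> Text:
            return "checkout_form"

        @staticmethod
        def required_slots(tracker: Tracker) -> List[Text]:
            # Branching business logic lives here, e.g. (hypothetically)
            # some shipping methods are only available in some countries.
            slots = ["address", "card_number"]
            if tracker.get_slot("country") == "US":
                slots.append("shipping_method")
            else:
                slots.append("international_shipping_method")
            return slots

        def submit(
            self,
            dispatcher: CollectingDispatcher,
            tracker: Tracker,
            domain: Dict[Text, Any],
        ) -> List[Dict[Text, Any]]:
            # Runs once every required slot has been filled.
            dispatcher.utter_message(template="utter_order_confirmed")
            return []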

Because, of course, the reality is that real conversations never actually follow the happy path, or well, a vanishingly small fraction of them actually do. And in what ways do they deviate?

I can't stress enough how strongly we believe that the only way to build a good AI system is to build a bad AI system and give it to people to try out and get their feedback.

It's definitely not possible to go off into the mountains for two months and polish something without any user testing, and then give it out to the world because it will always fall on its face. It will always fail.

Users will always surprise you with what they say, how they say it, what questions they have. And so I'm a strong believer in just building the absolute minimal amount of stuff you can build and then giving it to people to see how they interact with it.

So one of the ways that people might deviate in our checkout form scenario is, okay, so we asked them a yes or no question, shall I charge the card that you used last time? And instead of the user saying yes or no, they have a follow up question. They are curious if they actually still have credit in their account, and we have to deal with that before we return to the main task of checking out the user.

Now you have a little sub-dialogue here, which is highlighted in orange. And you think, Okay, well, I have this concept of a sub-dialogue. And so maybe a way to do that is to keep track of these topics. And then when the user introduces a new topic, I can put it on a stack. I can push this refund topic onto the stack, and then when it's resolved, I can pop it off the stack, and then we can return to whatever was most recently in there. That seems like a very natural model for this.

That isn't necessarily the best way to think about things, because users can revisit previous topics, and they will do, and they revisit them in no particular order, and with no particular set of strategies.

And so thinking of topics as things that you enumerate exhaustively, all the topics you cover, and pushing and popping them off a stack gives you something, but it's not quite the level of flexibility that we want.

And you just cannot anticipate all the different ways that users will interleave and intermingle different topics. So if you read through this conversation, we have the different topics highlighted in different colors.

This isn't even a particularly contrived example. It's a perfectly sensible, coherent conversation. But there are a whole bunch of different pieces being juggled at the same time. And users, humans, don't have any trouble with this. But the question is, can we build a model that can handle this kind of complexity?

Because if we're trying to build a dialogue system, and we're going to try and add rules for switching between each of these possible topics, and how to handle intents in all these different cases and go back and forth, we're going to end up with a bunch of spaghetti that's completely impossible to maintain.

So the question is, can we build a model that can handle this kind of complexity? And that was the research question that set us off on the direction of what became TED.

So when we looked at the literature, what people have typically done in the past is to model dialogue using a recurrent neural network. And a recurrent neural network is mostly like a regular neural network. So in this diagram here, I've got a series of outputs, Y1, Y2, Y3, and these are the predictions the model has to make. So in this case, one prediction is asking this question, the next is that it has to listen and wait for user input, and the final prediction is this confirmation at the end.

And I've been very sloppy with the notation because I don't want to clutter the picture. I've said that all the weights of all the neural networks are W; they should probably have different letters or subscripts or something.

But just for simplicity's sake, the point that we want to make is that we have this user input, and it gets multiplied by a set of weights in the neural network to produce this middle state, and then we apply another set of weights before we get to the output. And this middle state, we keep track of. So when we make the second prediction, we take the middle state from the previous step, we transform it again and add it here. So we evolve this memory throughout the sequence.

And so that means that the prediction we make here can depend not just on this input, but on the previous inputs. And so this means you get a model, which can account for context, can account for history. And so that's the key idea behind a recurrent network.
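
Here's a toy sketch of that recurrence in code (the shapes, and using a single weight matrix per arrow, are illustrative, matching the deliberately sloppy notation above):

    import numpy as np

    rng = np.random.default_rng(0)
    W_in = rng.normal(size=(16, 8))    # input -> hidden
    W_h = rng.normal(size=(16, 16))    # previous hidden -> hidden
    W_out = rng.normal(size=(4, 16))   # hidden -> output

    def rnn_step(x, h_prev):
        # The new hidden state mixes the current input with the old memory.
        h = np.tanh(W_in @ x + W_h @ h_prev)
        return W_out @ h, h            # prediction for this turn, updated memory

    h = np.zeros(16)
    for x in rng.normal(size=(5, 8)):  # five dialogue turns of dummy features
        y, h = rnn_step(x, h)          # every turn rewrites the same memory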

Now, the potential downfall of this model is that this line here is a very strict bottleneck. What happens is that every time there's a new step, we have to update this hidden state. And that's not necessarily the right way to do things. It's not necessarily quite the intuition that we have about how conversations work.

RNNs have been used very successfully to model sentences. So the direction of the RNN is over the tokens in the sentence. And the starting assumption, which is a pretty good starting assumption for sentences, is that all of the words in the sentence are important and contribute to the meaning. And that's reflected here in the fact that this hidden state always gets updated with every single new thing which goes in, which for modeling dialogue history is not necessarily ideal, because we have these different topics.

Now, of course, there are other variants of the plain RNN: you can use an LSTM, you can use a GRU, all these things. And in theory, given enough data, this model could learn anything that we want. The point is just: can it learn with a small amount of training data, and can it learn efficiently? And because it has the starting assumption that every turn in the input is relevant, that biases it very heavily towards getting confused. We'll dive into that in a second.

So there's a very important idea from machine translation, which is called attention. And the basic premise is that not all input should be regarded equally every time we make a prediction.

And so I've stolen this image from the original post on the Google blog about the transformer model. And what we're looking at is the attention of a model that's translating from French into English. And we're looking at the attention, the connection, between the word 'it' and the previous words in the sentence. And so we have two different sentences: The animal didn't cross the street because it was too tired, or because it was too wide.

And the way we should interpret the word 'it', in this case, obviously depends on which adjective is coming at the end: in one sentence it refers to the street, because it's the street that has the property of being wide, whereas in the other it's the animal being referred to, because it's the animal that gets tired.

And there's a very nice visualization of this idea, which I'll show. Let's take a look. It's a great blog post on Distill.pub, if you haven't checked it out yet. And it describes very nicely this idea of attention. We'll just have a look at it here in a second.

Right, so we have here a translation model going between English and French. And we can look at how much attention the model is paying to the different words in the input as it's producing each of these tokens.

And so we can see here that some words have a very simple, direct one-to-one correspondence, whereas a noun phrase like 'the European Economic Area' has multiple contributions. And if we look for something difficult to produce, like a full stop, which is the end of the sentence, it has contributions from all the previous pieces in the source sentence.

And so the idea of a transformer is to work with self-attention, where you look at the sequence itself to produce the next token.
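
Here's a bare-bones sketch of self-attention over a sequence (single head, no learned projections; the causal mask reflects the fact that a dialogue policy can only look backwards in the conversation):

    import numpy as np

    def self_attention(X):
        """X: (turns, features). Each output row is a weighted mix of earlier turns."""
        scores = X @ X.T / np.sqrt(X.shape[-1])      # relevance of each turn to each other
        mask = np.tril(np.ones_like(scores))         # only attend to the past
        scores = np.where(mask == 1, scores, -np.inf)
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)   # rows are like the attention plots below
        return weights @ X

    out = self_attention(np.random.randn(6, 16))     # six turns of dummy features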

And so what we do with the transformer embedding dialogue policy with TED is to see, can we use this idea of self attention to build a better dialogue model? And so, there's a lot going on in this picture, so I'll try and explain it piece by piece.

So what this is, is a conversation. It's not showing the actual messages that were sent; it's showing what we call a story. So it's a conversation, but represented at a slightly more abstract level.

So we don't say exactly what the user said, I'm looking for a hotel; we just show the intent here, request hotel. And we don't show exactly how the bot responded; we just say utter_askdetails. So we're asking, Okay, can you tell us about where you're going and when. And the user starts by cooperating, telling the bot when they're checking in, when they're checking out, where they want to go.

And then they have a bunch of chitchat where they're talking about their holiday plans or whatever else, asking some questions, checking if they're talking to a bot, whatever that might be, a bunch of things which are off topic, and then they return to the topic at hand. So finally they go, Okay, well, actually, yeah, I'm looking in an expensive price range and there are four of us, and you complete the task.

And what we're looking at here is the plot of the attention of the model. So how much attention is the model paying to its own history? As we predict each of these system actions, how much are we paying attention to the past? And the key piece is these white sections here, which are telling us that things are getting ignored. And what this model learns to do is it learns to completely ignore all of the chitchat once you get here, to the point where the user is cooperating again.

And that makes sense, because you're back onto the task that's important. And you don't need to know about all this chitchat. It doesn't help you. And actually, the interesting thing is that you can take this trained model and poke at it. You can put in a whole bunch more chitchat utterances here than it's ever seen, and it will still recover nicely, because it just goes back and looks: Oh, what was the last actually relevant piece of information that the user provided? And it continues from there.

And so because, at each step in the dialogue, the transformer goes back and looks at the history and decides which parts are relevant, it's much less likely to get confused. Whereas if you have a recurrent neural network here, then every single time the user goes off the rails, if something unexpected happens, you're now away from where you wanted to be, and you're going to get a little further away each time, and it becomes impossible to recover. Whereas with the transformer model, every time you have to make a new prediction, you can look at the dialogue history and say: Which parts are relevant? Which are going to contribute to how I need to progress in this conversation?

And if you're interested in all the experiments, the code, of course, is open source. The data is open source too, so if you want, check out the paper; there's a link here. We'll share it in the Zoom as well.

And so if you want to try it out and use it, your configuration should look like this. I've got an example here using TED, which is the embedding policy. And it is better at handling unseen edge cases; it's less likely to get confused when your users are behaving in ways that you hadn't anticipated. And of course, you would use it, as with everything else, in combination with other policies.

So in this case, I've also got the mapping policy active and the form policy, so the assistant can use forms and handle simple FAQs. But whenever I need to make a prediction and I'm away from things that were in my training data, the TED policy is going to come in and make a sensible suggestion of how to proceed.
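
As a sketch, that policy configuration can look like this (names as of Rasa 1.8, where the embedding policy was renamed TEDPolicy; the max_history and epochs values are illustrative):

    policies:
      - name: TEDPolicy           # the transformer embedding dialogue policy
        max_history: 8
        epochs: 100
      - name: FormPolicy          # drives the business logic inside forms
      - name: MappingPolicy       # maps simple intents straight to responses
      - name: MemoizationPolicy   # replays training stories it has seen verbatim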

And I'm anticipating one of the questions that was written in before the webinar started, which is how do you actually test that all of this works?

I am very happy to receive that question. It's extremely important to write tests. So there's a blog post linked here about how to do continuous integration and deployment with Rasa, and it's a really, really key piece. And it doesn't really matter if you're using Rasa or not: please, please, please, always write tests for your AI assistant. You need to have tests. Anything that's not tested, you can assume will fail.

And it doesn't matter if your dialogue policy is a mess of thousands of if statements, or if you're using a model trained with Rasa, or just the memoization policy and a few other pieces. You should assume that anything you haven't tested probably doesn't work.

And so please, please, please, write tests. Before you change anything, make sure you don't break anything. Just because you're building something that uses machine learning doesn't mean that you should give up good software engineering practices. We really encourage people to follow all the good habits that they have from other kinds of software that they've built, and bring those into building AI assistants.
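
As an example, an end-to-end test conversation in the Rasa 1.x format looks roughly like this (the intent and action names are illustrative); running rasa test will replay it against your trained model and report where the predictions diverge:

    ## checkout happy path
    * request_hotel: I'm looking for a hotel
      - utter_ask_details
    * inform: checking in Friday for two nights
      - utter_ask_price_range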

And of course, there's a whole bunch of hyperparameters you can play with here in the embedding policy. If you go to the evaluating models section of the Rasa docs, you can also see some scripts for doing ablation studies with the training data, cutting out different pieces of your training data. You can do cross-validation, you can do all sorts of things and get a sense of how well your model is generalizing to things that it hasn't seen before.

So I think with TED, we have one of the pieces that we need, or potentially one of the pieces that we need, for handling these kinds of complex conversations, where users interweave multiple topics, picking them up and leaving them as they like. But of course, it's not all you need. We can build all the algorithms we ever want; people still need training data.

And so that's why we built Rasa X, so that everybody has the tools that they need to collect a really good data set that's in domain, that's for your AI assistant. And so we just made deployment much easier as well. So go ahead and spin up a Rasa X server, and you can see all the conversations that people are having with your assistant, and you can turn those into new training data.

And like I said, the best thing you can do really is build the simplest possible minimal viable assistant that you can and then give it to people to test.

First test it by just talking to it yourself, give it to some friends to try out, and then hook it up to where real users are going to see it and get them to talk to it. And learning from real conversations is really key to get something that performs well. You can't do that on your own in a lab.

And so I'm very excited to see that people all over the world are downloading Rasa X, and actually collecting great data sets. So thanks for that and keep the feature requests coming.

And so with that, I'll switch over to questions. I've got a link here, so I'll have a look.

And if questions come up, please just post them in the chat and we'll handle as many as we possibly can.

Okay, so the first one I've got is from Andre. Oh, actually, we have some questions here, the pre-submitted questions. I'll take those first.

So how can the transitions be effectively tested in a large dialogue tree to ensure that the policy works as expected?

I hope I already answered that. The way to do it is to write tests, and to do some cross-validation, and to do the model evaluation. So please check out those sections of the docs.

And the other question is, will Rasa provide a way to select the best policy based on my use case and training data?

It's a very sensible question and a very good one. So like I said, we have some tools for doing cross-validation, and you can obviously do a typical train-test split. You can evaluate your model in different ways. And we're very keen to provide better guidance, especially on DIET, which is brand new, and we'll have to see what the best configuration is on everybody's different data sets, so we're very keen to get everyone to try it out and share their experiences.

And the same is of course true for the dialog policies as well. So the more you share with each other, the more collective wisdom we can bake into the product and put into the documentation.

The next one says: does Rasa support multi-label classification for intents and entities? We have the multi-intent feature. So you can have multiple intents predicted in a single message; that's well documented. A common feature request we get is: do we ever predict multi-intents that you haven't seen before in your training data?

The answer is no. So we'll only predict ones that you actually have at least one example for, because if we wanted to predict every possible multi-intent, that would get very messy, very noisy, very quickly. And so, like everything else, make sure you've covered the multi-intents that you actually see. And don't try and dream up ones that you think you might get, but never actually get from real users. It's most important to learn from real data.
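
For reference, here's a sketch of how multi-intents are typically enabled in the pipeline (the + split symbol is the usual convention; a training example would then carry a label like check_balance+make_payment, which is illustrative):

    pipeline:
      - name: WhitespaceTokenizer
        intent_tokenization_flag: true   # treat multi-intents as combinations of single intents
        intent_split_symbol: "+"
      - name: CountVectorsFeaturizer
      - name: DIETClassifier
        epochs: 100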

With regards to entities, we're literally working on it right now to build a multi-label entity system, but it's early days, so I can't make any promises about when it will be ready.

But if you keep your eye on the repo, we're doing all the development in the open source branches, so you can read the tea leaves and see how well things are progressing there.

Is there a way to do cross domain transfer learning using Rasa?

Great question. So the data set that we used in the TED paper is exactly that: it's a hotel and restaurant domain transfer learning task. So we set out to test it there explicitly. And we're currently working on collecting a better data set that's going to be even more challenging and has even more domains, so stay tuned. We'll do a proper evaluation on a big data set as soon as we have it ready.

And cool, then I'll go through some more questions that have been asked since we started the webinar today.

And so the first question, from Andrei, is: what's the future for retrieval actions? As of now, they're still experimental.

That's a very good question. So retrieval actions are still experimental, but I would recommend using them. I use them in Carbon Bot, and they're very effective, and they drastically simplify what your dialogue model has to learn, so we definitely recommend using them. We have still marked them as experimental, and probably the major reason for that is that they're not really particularly well supported by Rasa X yet. So you can work with them with Rasa X, but the one thing that doesn't work is the train button in Rasa X. That's fine. I mean, of course, a better system than hitting the train button is to train your model elsewhere, in CI, do some quality assurance on the model, and then upload it to Rasa X that way. But there's not full support yet, and so we still mark them as experimental. But we're not getting rid of them anytime soon. If anything, we're going to invest more into them.

Saurabh asked the question, can we use DIET to extract our custom entities?

Yes, absolutely. So DIET is not a pre-trained NER model, so it doesn't come out of the box predicting people's names and locations and that kind of thing. It's specifically for you to train your custom entities.

The question from Arnav: any results for multilingual performance of DIET, or any attention visualization exploration to understand why it works better?

So we have a paper, which we plan to make public very soon about DIET and our results so far. We haven't done any multilingual experiments yet, but I think it's a very promising thing to pursue, especially because we can plug and play different language models, so we can plug in a multilingual BERT and see how well that does. I think that's a really interesting idea and definitely one we'll pursue, but we haven't done yet.

Vlad asked, can you share a link to the data set for the intent and entity recognitions that you've benchmarked?

Yes, I hope by now it's already in the Zoom. And we'll also share these slides, so all the links will be in there for you to find.

Then Harsha asked, will multi-domain support be added to Rasa? What are some of the challenges of multi-domain support?

So we have the multi-project importer feature. What that does is, if you have a project created by rasa init, and maybe you have five different projects created by rasa init, then you specify the multi-project importer, and it will pull the domains and training data from each of those projects at training time, combine them, and train a single model.

So that means you can do independent development of different parts of your assistant, and you can still have all the nice features of testing everything together in CI, with integration tests that check everything still works together. So that's the level of support we have now, but it's definitely a relevant topic that people ask about all the time, and we'll keep investing in making it better.
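
For reference, a sketch of the config for that feature (the project paths are illustrative):

    importers:
      - name: MultiProjectImporter

    imports:
      - projects/hotel_bot
      - projects/restaurant_bot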

Then, Al Mahdi Marhou asked, I wonder if there's any benchmark for TED in multi-domain contexts? For example, in contexts like DSTC8. We haven't done experiments on DSTC8 yet, but that's a very good suggestion and an experiment we should definitely run. If anyone else is keen to try it out and see what kind of performance you get, we'd be very, very curious to see how it does.

Misha asked the question, do we need to change the way we write training data when deciding to use DIET or TED policy?

Thankfully, no. So that's great. A lot of engineering went into making sure that all of these pieces are very plug and play. So the only thing you have to change is your config YAML. You can put in any of these models that you like, and you don't have to change your training data at all.

Raj Kumal asked, please provide some insight on how to use the REST API in Rasa.

I think there are two ways I could interpret that. One is to have a Rasa assistant make calls to a REST API. The way to do that is to build a custom action which sends that request, so I would suggest you check out the documentation on custom actions. If the question is about how to use the Rasa server's REST API, there's a page on that: if you go to the Rasa open source section of the Rasa docs and scroll on the left, down to the bottom, you'll see there's a section on the HTTP API. And there's a Swagger file with full documentation of all the endpoints. It's a little hard to find, sorry, but you'll find it in there if you scroll to the bottom.
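
For the second interpretation, here's a minimal sketch of talking to a locally running Rasa server through its standard REST connector (the sender id and message are illustrative):

    import requests

    # Assumes a Rasa server on localhost:5005 with the REST channel enabled.
    response = requests.post(
        "http://localhost:5005/webhooks/rest/webhook",
        json={"sender": "test_user", "message": "hello"},
    )
    for message in response.json():   # one entry per bot response
        print(message.get("text"))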

And then Ganesh asked, how do I handle conditional intents and actions? For example: check my balance, and if I have enough money, then make the payment.

So I think if you mean that the user provides this whole thing in one message, so check my balance, if I have enough money, then make the payment.

That's a good question. I mean, you could hack it in, of course. You could use a multi-intent, or, just off the top of my head, you could treat the conditional as an entity. You could look for words like 'if' and 'or' and 'either', but you would have to do a bit of hacking.

So it's a cool suggestion. Again, I would be cautious of not over engineering something before checking that users really say this kind of thing. Because I think if a user knows that they're talking to a bot, they're probably less likely to give such complicated instructions. But it's definitely an interesting challenge to work on.

Mahesh asks, any resources for Spanish?

I know there's a publication online called Planeta Chatbot that has translated a few of our blog posts into Spanish, so check that out. Otherwise, the Rasa models will of course work with Spanish, and I know that we have a big community in Brazil who obviously work on Portuguese. I don't know if there's a similarly sized community anywhere that's Spanish-speaking.

But let me look into that and hopefully we can hook you up. The top tip I would give is go hang out in the forum. And maybe post a question and ask if anyone's working with Spanish, and if they have any recommendations of things you could do.

Yang asked, do you believe more in end to end solutions or a modular pipeline? Which one is TED?

So if you check out the paper, actually, TED can be used both as an end-to-end model or as a modular model. Both settings work with TED. I think going fully end-to-end, including generating responses, is not something that's ready for production yet. But we're doing lots of experiments with partial end-to-end, where we skip the step of predicting the intent and go straight from the user message to the next action.

I have a blog post from the end of last year where I argue that we should get rid of intents, and that's where I lay out my thinking about why I think that's the future. But fully end-to-end, we're a long way away from that being ready for primetime. Partial end-to-end is something we're experimenting with actively, though, and I'm sure we'll have some interesting results to share this year.

And then the next one, the question is: have you run on the SGD or MultiWOZ data sets?

Yes, we have. I don't know what SGD refers to, but we have run on MultiWOZ. We have two papers with results on MultiWOZ, which aren't public yet, but I'll share them as soon as they are.

Is there any ongoing experiment on a spellchecker on Rasa?

I know multiple people have included a spell checker as a custom component. I think, generally, that's the best way to do it, because everyone has domain-specific jargon and everyone has different languages. I think there's no universal spell checker that's really best for everyone. And ultimately, if you're using these sparse features with character n-grams, you can have some typos in your text and Rasa shouldn't get confused. It should be fine.

Then the question from Mehdi is, when using DIET with a transformer architecture, do you think it's best to multitask fine-tune the architecture, or extract features and then use separate models for NER and classification?

I would suggest that you check out Vincent's video on YouTube with the DIET explainer. That's more than I can explain just now when I'm answering questions. But it's a really great question. If the video doesn't clear that up, then please leave a comment on YouTube or post in our forum.

Then any tip on which pipeline would be more suitable with DIET classifier for Indian languages?

I would always suggest starting with just the sparse features. You don't need any pre-trained model, and that's a very good baseline. And then I think the next step up to test from that is, if you have some kind of word vectors, whether it's GloVe or fastText or something, whether it's for Bengali or Gujarati or Hindi, you can plug those in and see if you get a performance boost. But I would always start with just the sparse features in a purely supervised model.

Any support for Hinglish?

Yeah, very common request, obviously, from Hindi speakers who mix in a lot of English. Yes, I've seen multiple people in the forum build something like that. If you just put Hinglish in your training data, it should work. Again, if you're using something like GloVe, then you won't have embeddings for any of the Hindi words, or if you're using Hindi ones, you won't have them for the English words. But maybe using something like a multilingual language model will work there.

Vikas asks, is telephony integration possible with Rasa?

It's a good question. We have a post on our blog from Josh Converse, who's a community member, about how to build your own Google Duplex style automated voice assistant using Rasa and Twilio. So check that out on the blog. I'll also share the link. It's a really cool project that Josh built, and shows how you can integrate all the ways so that you can actually have a phone call with your Rasa assistant.

Then Arnav asks, how little data would you say gets decent performance with DIET?

I would say it's too early to really say conclusively. If you knock out all the fancy pieces of DIET, there's actually a configuration that we mention in the docs that will reproduce the old EmbeddingIntentClassifier. So when you strip out all the fancy things, it just reduces back down to that, and that, we know, does very well on small amounts of data. But this is exactly why I'm excited to get everybody's feedback in the forum: to see where DIET is working well for you, where it is not, where it's taking too long to train, where you're not getting the performance that you want. So please try it out and give us that feedback so that we can give better advice to the next group of people.

Serdar asked, how can we deal with negation? So if a user says, I don't want to cancel my policy, how do you differentiate this from, I want to cancel my policy?

It depends on the language. Negation is like everything else in conversational AI: the easy things are really easy, and the hard things get hard very quickly. So if you're working with English, you have a pretty good heuristic, which is to look for the word 'not', and then look for what comes after it. And that piece is negated.

Unfortunately, you can't really do particularly much better than that. I'm not super up to date on the literature, but the last thing I saw was that large models don't tend to outperform that kind of simple heuristic of looking for the word 'not'. So I would say start with that. And then if there are negations that you still can't handle, please share them with us, because Thomas Kober, who's on our research team, is looking into this currently, and he's writing up a proposal for a research project on negation, so any data that you have there is super helpful to us.
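
That heuristic is simple enough to sketch in a few lines (English-only and deliberately naive; the negation word list is illustrative):

    NEGATIONS = {"not", "don't", "dont", "never", "no"}

    def is_negated(text: str, keyword: str) -> bool:
        """Naive heuristic: the keyword counts as negated if a negation
        word appears anywhere before it in the message."""
        tokens = text.lower().split()
        if keyword not in tokens:
            return False
        return any(t in NEGATIONS for t in tokens[: tokens.index(keyword)])

    print(is_negated("i don't want to cancel my policy", "cancel"))  # True
    print(is_negated("i want to cancel my policy", "cancel"))        # False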

Then DC asks, can we get any extra metadata on the NLU state when using the Rasa server, beyond just the reply? Does the HTTP API give full insight into all conversational state, for example, to build a graph view, like an interactive visualization debugger?

So the Rasa server returns quite a few things when you ask it for a prediction. It will tell you the predicted intent, the ranking of all the intents, a suggested response if you're using a retrieval model, the confidences, the entities, et cetera. So it already gives you a fair bit of metadata. But also, the message objects inside of Rasa Core have a metadata attribute, so you can attach arbitrary metadata to messages. When you're sending messages into Rasa Core, you can attach metadata there, and that will get carried all the way through and end up in the tracker. And the tracker is the object that contains all of the state, so the tracker endpoints are the ones that you want.
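
Here's a sketch of pulling that full state over HTTP (the conversation id is illustrative; pass your auth token as well if the server has one configured):

    import requests

    # The tracker endpoint returns the complete conversation state:
    # latest intent ranking, entities, slots, and the full event history.
    tracker = requests.get(
        "http://localhost:5005/conversations/test_user/tracker"
    ).json()

    print(tracker["latest_message"]["intent"])   # name + confidence of the last intent
    print(tracker["slots"])                      # current slot values
    print(tracker["events"][-3:])                # most recent tracker events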

So I would suggest checking out the HTTP API docs on the Rasa docs; you have to scroll to the bottom, it's on the left. And if there's anything there that you want that we don't provide yet, please file an issue on GitHub and we'll happily take a look.

Cool. So that brings me to the end of the questions, which is great, because we're almost out of time. So I'll just leave you with a few more resources if you want to dive deeper into any of this.

So if you're not a Rasa user yet, and you're just getting started, or you want to get a refresher on some of this stuff, I can highly recommend the Rasa Masterclass on YouTube. It's been very popular since we first released it and it goes in quite a bit of detail into all the different pieces of using Rasa and how all the pieces fit together.

And if you want to dive into any of these topics in more depth, here are some links that are useful. So there's a post on the Rasa blog about the TED policy, what it does, how it works.

There's also a blog post about DIET and what the main results are and why we think it's interesting. And then there are two more videos on YouTube specifically about DIET, about the architecture, about how the algorithm works, and why it's designed the way that it is. And so keep your eyes on that. I can definitely recommend looking into all the details there.

And then if you still have more questions, please join us on the forum. Like I said, there are thousands of people in there. So please jump in, ask any questions that you have and connect with other people who are building awesome stuff with Rasa.

So thanks, everyone, for tuning in. That's the end of the webinar. Thanks to everyone who submitted questions beforehand, and thanks to everyone who posted questions while we were talking. Have a wonderful day and stay healthy.

Speakers
Alan Nichol

Co-Founder & CTO

Rasa
