Building A Question Answering System - Part 3
Welcome back to our blog series on building your own Question Answering system! In the previous two posts, we walked our readers through implementing Question Understanding and Document Retrieval — two out of three essential steps of a Q&A system — step-by-step using Forte. In this final blog post, you will learn to build the Answer Extraction step and have your own Q&A system working end to end!
Recall that in the last two blog posts, we implemented Query Understanding and Document Retrieval to help filter a large corpus of more than 190k articles down to just the 10 most relevant documents. Reducing the input size is vital because the immediate next step, Answer Extraction, is computationally expensive and feasible only for small inputs.
With the reduced input size, Answer Extraction applies the following steps to determine the sentences containing the answers to the user's question. Some of these steps involve applying expensive NLP operations to each sentence in the documents.
1. Select DataPacks containing retrieved documents.
2. Use SpacyProcessor to identify named entities in the documents.
3. Reuse AllenNLPProcessor to extract semantic roles from all sentences.*
4. Reuse NLTK processors to extract more language features.
5. Generate the final answer using a ResponseCreator.
*: This step takes most of the execution time and can be accelerated by using a GPU.
Following the above steps, this is how Answer Extraction works at a high level: it loops through all sentences in the 10 retrieved documents and uses the language features extracted in steps (3) and (4) to determine whether a sentence has the same semantic structure as the user question. All matching sentences are considered answers to the question and are returned to the user in step (5), together with the entity definitions found in step (2).
Environment Setup
Re-activate the Python environment from last time
Option 1: Using Conda
Option 2: Using Python venv
Recap: Question Understanding and Document Retrieval as a Forte pipeline
In the previous two blog posts, we introduced the following code for the Question Understanding and Document Retrieval steps:
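Since the full code lives in the earlier posts, here is a condensed sketch of that pipeline to set the stage. The module paths, the reader choice, and the placeholder config dictionaries are assumptions; they may differ from the exact code in parts 1 and 2 and across Forte / forte-wrappers versions.

```python
from forte.data.caster import MultiPackBoxer
from forte.data.readers import StringReader
from forte.pipeline import Pipeline
from fortex.elastic import ElasticSearchProcessor, ElasticSearchQueryCreator
from fortex.nltk import NLTKPOSTagger, NLTKSentenceSegmenter, NLTKWordTokenizer

pipe = Pipeline()
pipe.set_reader(StringReader())  # the user question comes in as a plain string

# Question Understanding: segment, tokenize, and POS-tag the query.
pipe.add(NLTKSentenceSegmenter())
pipe.add(NLTKWordTokenizer())
pipe.add(NLTKPOSTagger())

# Document Retrieval: wrap the query DataPack into a MultiPack named "query",
# then fetch the 10 most relevant articles and attach them as DataPacks
# named "passage_0" ... "passage_9".
pipe.add(MultiPackBoxer(), config={"pack_name": "query"})

# Fill these in with the Elasticsearch settings (host, index name, number of
# hits) configured in part 2; the empty dicts here are only placeholders.
query_creator_config = {}
retrieval_config = {}
pipe.add(ElasticSearchQueryCreator(), config=query_creator_config)
pipe.add(ElasticSearchProcessor(), config=retrieval_config)
```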
Extending the Forte pipeline with Answer Extraction
1. Select DataPacks containing retrieved documents
The current input is a MultiPack that contains one DataPack for the user query and a few other DataPacks for the retrieved documents. To exclude the query DataPack from the input, we can use the RegexNameMatchSelector to select DataPacks by their names.
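A minimal sketch of creating such a selector follows. The way the name pattern is supplied (constructor argument versus selector config) differs between Forte versions, and the "passage_" prefix is the document naming assumed in the recap above.

```python
from forte.data.selector import RegexNameMatchSelector

# Match only the retrieved-document DataPacks ("passage_0", "passage_1", ...)
# and skip the "query" DataPack. Depending on the Forte version, the pattern
# may instead be passed through a selector config when adding a processor.
doc_selector = RegexNameMatchSelector(select_name=r"^passage_")
```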
As shown in the next steps, this selector can then be applied to processors which only need document DataPacks.
2. Use SpacyProcessor to identify named entities in the documents
SpacyProcessor is a wrapper processor provided by Forte that can be configured to process documents using SpaCy and SciSpaCy models. Our Q&A pipeline relies on two SciSpaCy entity linkers to disambiguate the medical terms in the final answer and provide web URLs to the term definitions.
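Below is a sketch of how the wrapper might be attached to the pipeline. The config keys and the SciSpaCy model and linker names are assumptions; check the forte-wrappers documentation and the repo for the exact values.

```python
from fortex.spacy import SpacyProcessor

# Run SpaCy/SciSpaCy only on the document DataPacks selected above.
# "lang" picks the SciSpaCy model and "processors" enables NER plus entity
# linking; both key names and values here are assumptions.
pipe.add(
    SpacyProcessor(),
    config={
        "lang": "en_core_sci_sm",                    # SciSpaCy biomedical model (assumed)
        "processors": ["sentence", "ner", "umls_link"],
    },
    selector=doc_selector,
)
```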
Note: the selector is applied here to make sure that SpacyProcessor only receives documents as inputs, not the user query.
The following code displays the named entities extracted from “passage_0”, the first document returned by Elasticsearch.
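The snippet below is a sketch of that idea, assuming `m_pack` is the MultiPack produced by the pipeline for one question; EntityMention comes from Forte's base ontology.

```python
from ft.onto.base_ontology import EntityMention

# Print the named entities found in the first retrieved document.
passage = m_pack.get_pack("passage_0")
for entity in passage.get(EntityMention):
    print(entity.text, "->", entity.ner_type)
```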
3. Reuse AllenNLPProcessor to extract semantic roles from all sentences
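We attach the same AllenNLP wrapper used in part 1, this time over the document DataPacks, to add semantic role labels (SRL) to every sentence. The config below is a sketch: the "processors" key and its value are assumptions, while cuda_devices is the entry mentioned in the note that follows.

```python
from fortex.allennlp import AllenNLPProcessor

# Semantic role labeling over the retrieved documents only.
pipe.add(
    AllenNLPProcessor(),
    config={
        "processors": "tokenize,pos,srl",  # assumed value; SRL is the part we need
        "cuda_devices": [0, 1],            # two GPUs; remove this entry to use CPU
    },
    selector=doc_selector,
)
```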
Note: This processor is configured to use two GPUs for better performance. Remove the cuda_devices entry in the config to use CPU instead.
4. Use NLTK processors to extract more language features
As in the Query Understanding step of our pipeline, additional language features are needed for the pipeline to better understand the sentences in each document and make more accurate comparisons with the question.
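As a sketch, this reuses the same NLTK wrappers from part 1, restricted to the document DataPacks (which processors are needed beyond these two is an assumption; see the repo for the exact set).

```python
# Token- and POS-level features for the document sentences, so the
# ResponseCreator can compare them against the question.
pipe.add(NLTKWordTokenizer(), selector=doc_selector)
pipe.add(NLTKPOSTagger(), selector=doc_selector)
```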
Refer to the first blog post on how to test these two steps.
5. Generate the final answer using a ResponseCreator
Finally, we have all the information ready to determine the exact sentences in the documents that contain answers to the user question, and this can be accomplished by a custom processor we call ResponseCreator.
The complete code for ResponseCreator is too long to show here, since much of it deals with formatting the console output. Instead, we will walk you through the algorithm it uses to find the answer sentences.
In this simplified semantic role matching algorithm, we assume that the user question contains the following three semantic roles: ARG-0, ARG-1, and a predicate/verb that connects them. These are PropBank annotations, where ARG-0 is defined as the agent and ARG-1 as the patient. For example, in the question “What does covid-19 cause?”, “covid-19” is the ARG-0, “what” is the ARG-1, and “cause” is the verb. We also assume that the expected answer looks something like “Covid-19 causes A”, where “A” is an ARG-1. This is exactly what the algorithm does next: it parses all the sentences in the documents and locates the ones that contain the same three semantic roles, with “covid-19” as the ARG-0. The algorithm also works the other way around: when “covid-19” happens to be the ARG-1 of the question, we only need to find the matching sentences where “covid-19” is also the ARG-1.
The following code snippet is a simple implementation of the above-mentioned algorithm which finds answers from one of the documents found earlier (its DataPack name is “passage_0”).
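The snippet below is a minimal sketch of that idea rather than the repo's actual code. It assumes the question's verb and ARG-0 (for example "cause" and "covid-19") have already been pulled out of the query DataPack, that SRL results are stored as PredicateLink entries from Forte's base ontology, and that the argument labels are spelled "ARG0"/"ARG1" (the exact strings depend on the SRL wrapper's configuration).

```python
from ft.onto.base_ontology import PredicateLink, Sentence


def arguments_by_verb(pack, sentence):
    """Map each predicate (verb) in a sentence to its {arg_label: text} dict."""
    args = {}
    for link in pack.get(PredicateLink, sentence):
        verb = link.get_parent().text.lower()
        args.setdefault(verb, {})[link.arg_type] = link.get_child().text
    return args


def find_answers(pack, question_verb, question_arg0):
    """Return sentences where `question_arg0` fills ARG0 of `question_verb`."""
    answers = []
    for sentence in pack.get(Sentence):
        for verb, args in arguments_by_verb(pack, sentence).items():
            # Crude verb match; a real implementation would compare lemmas.
            if question_verb not in verb:
                continue
            # The sentence matches if the question's agent fills the same role
            # and an ARG1 is present to serve as the answer.
            if question_arg0 in args.get("ARG0", "").lower() and "ARG1" in args:
                answers.append(sentence.text)
    return answers


# Example: answer "What does covid-19 cause?" using the first retrieved
# document (m_pack is the MultiPack produced by the pipeline).
passage = m_pack.get_pack("passage_0")
for answer in find_answers(passage, "cause", "covid-19"):
    print(answer)
```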
The complete implementation of ResponseCreator is provided in the repo and can be attached to the pipeline in the same way as other processors.
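As a sketch, attaching it looks like this (the import path is an assumption; the class is defined in the accompanying repo):

```python
from response_creator import ResponseCreator  # assumed module name; see the repo

# Like the other document-level processors, ResponseCreator only needs the
# retrieved-document DataPacks, so we reuse the same selector.
pipe.add(ResponseCreator(), selector=doc_selector)
```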
The Q&A pipeline is finally complete! Let’s spin it up and see what it can do!
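A minimal way to run it, assuming the StringReader sketched in the recap above (the earlier posts may use an interactive reader instead):

```python
# Initialize the assembled pipeline and ask a question; ResponseCreator
# prints the matching sentences along with links to the entity definitions.
pipe.initialize()
pipe.process("What does covid-19 cause?")
```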
Looking forward
Over the course of three blog posts, we showed how to build a Question Answering system in Forte by stitching together existing third-party tools. Another powerful feature of Forte is that it enables effortless component switching and pipeline repurposing. In the next posts in this series, you will learn how to build a training pipeline in Forte and how easy it is to replace a component in this pipeline with an in-house trained model (regardless of whether it was trained using Forte) for better performance. We will also show you how to repurpose this pipeline for a different domain, finance for example, with only configuration changes.
Read parts 1 & 2 of Building a Question Answering System
Part 1 | Query Understanding in 18 lines of Code
Why use Forte?
We hope you liked this guide to building a Q&A pipeline in Forte! We created Forte because data processing is the most expensive step in AI pipelines, and a big part is writing data conversion code to “harmonize” inputs and outputs across different AI models and open-source tools (NLTK and AllenNLP being just two examples out of many). But conversion code needs to be rewritten every time you switch tools and models, which really slows down experimentation! That is why Forte has DataPacks, which are essentially dataframes for NLP — think Pandas but for unstructured text. DataPacks are great because they allow you to easily swap open-source tools and AI models, with minimal code changes. To learn more about DataPacks and other time-saving Forte features, please read our blog post!
About CASL
CASL provides a unified toolkit for composable, automatic, and scalable machine learning systems, including distributed training, resource-adaptive scheduling, hyperparameter tuning, and compositional model construction. CASL consists of many powerful open-source components that were built to work in unison or be leveraged as individual components for specific tasks, providing flexibility and ease of use.
Thanks for reading! Please visit the CASL website to stay up to date on upcoming CASL and Forte announcements: https://www.casl-project.ai. If you’re interested in working professionally on CASL, visit our careers page at Petuum!