Data is the new oil—I’m sure you’ve heard this phrase a gazillion times as we witness an unprecedented explosion of data. From emails to social media posts and research papers, the sheer amount of data is a blessing and a curse.
This information overload is difficult for individuals and organizations alike, as digging through it and finding what’s relevant to you is tedious. This is where the advent of Artificial Intelligence and large language models (LLMs) is a blessing. These help us distill massive amounts of data into bite-sized information using different document summarization techniques.
In this blog post, we’ll dive into document summarization to understand what it is and its challenges. We’ll also see how to use GPTScript for document summarization with a demo.
As the name suggests, document summarization is a technique that extracts the most important information from a large text document and presents a concise summary. This makes it easy for us to comprehend and make sense of large documents, which otherwise would be complex or take a lot of time.
For example, imagine reading a 200-page research paper. It will take a few days, if not weeks, to fully understand it. Instead, you can understand all the information in minutes by sending it to an LLM using document summarization.
There are two ways document summarization works:

- **Extractive summarization** selects the most important sentences from the original text and stitches them together verbatim.
- **Abstractive summarization** generates new sentences that paraphrase the source, the way a human-written summary would.
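The simpler of the common approaches, extractive summarization, scores sentences and keeps the top-scoring ones verbatim. Here is a toy sketch that scores each sentence by the corpus frequency of its words (a naive illustration only; production systems use far better scoring):

```python
import re
from collections import Counter

def extractive_summary(text, num_sentences=2):
    """Naive extractive summarization: keep the highest-scoring sentences verbatim."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    freq = Counter(re.findall(r"\w+", text.lower()))

    # Score a sentence by the total corpus frequency of its words
    def score(s):
        return sum(freq[w] for w in re.findall(r"\w+", s.lower()))

    top = set(sorted(sentences, key=score, reverse=True)[:num_sentences])
    # Emit the kept sentences in their original order
    return " ".join(s for s in sentences if s in top)
```

Abstractive summarization, by contrast, is what an LLM does when you prompt it to summarize: it writes new text rather than copying sentences.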
Document summarization techniques, as discussed above, have many advantages. They reduce the time required to analyze documents and enable faster decision-making. They also highlight important text, enabling faster document retrieval. Combined, all of this helps with better knowledge retention.
While document summarization sounds like an easy thing to do, for computers, it is still a complex task. Some common challenges associated with document summarization are:
Tokens are units of text that an LLM can process. Most LLMs and summarization models have a fixed number of tokens that they can process at once, commonly referred to as the maximum token limit or context window.
This depends on many things, from the availability of the raw processing power to the algorithm's efficiency. The size of the context window directly impacts the ability of models to summarize large documents effectively, as crucial information may be spread across multiple parts of the document.
Due to this limitation, they may be unable to analyze the document effectively, leading to a poor summary.
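A quick way to see this limitation is to count tokens before sending a document to a model. The sketch below uses whitespace splitting as a crude stand-in for a real tokenizer; with tiktoken installed you would pass `tiktoken.encoding_for_model("gpt-4-turbo-preview").encode` as the encoder instead:

```python
def count_tokens(text, encode=str.split):
    """Count tokens with a pluggable encoder.

    str.split is a crude stand-in for illustration; a real application would
    pass an actual tokenizer's encode function here.
    """
    return len(encode(text))

def fits_context(text, max_tokens, encode=str.split):
    """True if the text fits inside the model's context window."""
    return count_tokens(text, encode) <= max_tokens
```

If a document fails this check, it must be shortened or split before the model can summarize it.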
Another challenge is tailoring the summary to meet specific user needs. Every user may have a different requirement and need a different kind of summary based on their objectives or interests.
For example, a CFO might analyze a market research report to understand the trends in numerical data like sales, revenues, EPS, etc. to make informed decisions. For the same report, a product manager would want to know customer feedback and category insights to better understand customer expectations.
Thus, providing a one-size-fits-all summary is difficult as diverse users have diverse needs.
Our focus for this blog post is context limitation. One way to overcome this challenge is to have a dynamic context. You can build models that dynamically alter their context window size to fit a document.
Alternatively, you can employ document segmentation or chunking techniques to split the documents into chunks that fit within the context window and treat each chunk as an independent document and part of a whole.
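A chunking pass with a small overlap (so content cut at a boundary still appears in full in one chunk) can be sketched like this, operating on a token sequence for illustration:

```python
def chunk_tokens(tokens, chunk_size, overlap):
    """Split a token sequence into overlapping chunks that each fit the context window."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    # Each new chunk starts (chunk_size - overlap) tokens after the previous one
    step = chunk_size - overlap
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), step)]
```

Each chunk can then be summarized independently, with the running summary providing continuity between parts.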
In the next section, we will see how we can achieve that using GPTScript.
Having understood document summarization and the challenges pertaining to context windows, we’ll look at document summarization using GPTScript.
GPTScript is a scripting language that allows you to automate your interactions with an LLM using natural language. With syntax in natural language, it is easy to understand and implement. One of the highlights of GPTScript is its ability to mix natural language prompts with traditional scripts, making it extremely flexible and versatile.
To understand how document summarization using GPTScript works, we’ll build a web app called Legal Simplifier. As the name suggests, this application will simplify large legal documents that are otherwise difficult to understand.
Here’s a high-level overview of how it works:

1. The user uploads a legal document as a PDF.
2. A Python script splits the document into chunks that fit the model’s context window.
3. The script summarizes each chunk in turn and appends the result to a `summary.md` file.
4. Once every chunk is processed, the complete summary in `summary.md` is shown to the user.
I’ve made the application available in the examples section of the GPTScript examples repo, but I’ll also walk through how I built it so you can build your own document summarizer.
Before you can start, make sure you have all of the following:

- GPTScript installed and configured on your machine
- An OpenAI API key
- Python 3 and pip
- Flask, for the web UI
The first step to building the Legal Simplifier is to create the GPTScript. Since GPTScript is written primarily in natural language, it’s very easy to write the script for summarizing documents. Below is the `legalsimplifier.gpt` script:
```
tools: legal-simplifier, sys.read, sys.write

You are a program that is tasked with analyzing a legal document and creating a summary of it.

Create a new file "summary.md" if it doesn't already exist. Call the legal-simplifier tool to get each part of the document. Begin with index 0. Do not proceed until the tool has responded to you. Once you get "No more content" from the legal-simplifier stop calling it. Then, print the contents of the summary.md file.

---
name: legal-simplifier
tools: doc-retriever, sys.read, sys.append
description: Summarizes a legal document

As a legal expert and practicing lawyer, you are tasked with analyzing the provided legal document. Your deep understanding of the law equips you to simplify and summarize the document effectively. Your goal is to make the document more accessible for non-experts, leveraging your expertise to provide clear and concise insights.

Get the part of the legal document at index $index. Read the existing summary of the legal document up to this point in summary.md.

Do not leave out any important points, focusing on key points, implications, and any notable clauses or provisions. Do not introduce the summary with "In this part of the document", "In this segment", or any similar language. Give a list of all the terms and explain each in a one-liner before writing the summary in the document. Write each summary in smaller chunks or add bullet points if required to make it easy to understand. Use headings and section breaks as required, but use the heading "Summary" only once in the entire document. Explain terms in simple language and avoid legal terminology unless absolutely necessary.

Add two newlines to the end of your summary and append it to summary.md. If you got "No more content" just say "No more content". Otherwise, say "Continue".

---
name: doc-retriever
description: Returns a part of the text of the legal document. Returns "No more content" if the index is greater than the number of parts.
args: index: (unsigned int) the index of the part to return, beginning at 0

#!python3 main.py "$index"
```
Let’s understand what we’re doing in this gpt script:

- The entry tool creates the `summary.md` file if it doesn’t already exist and repeatedly calls the `legal-simplifier` tool, starting at index 0, until it gets "No more content".
- The `legal-simplifier` tool calls `doc-retriever` to fetch one part of the document, reads the summary so far from `summary.md`, summarizes the new part, and appends the result to `summary.md`.
- The `doc-retriever` tool runs `main.py`, a Python script that splits the PDF into chunks and returns the chunk at the requested index.
The Python script reads the PDF file uploaded by the user and breaks it into chunks. The crux of the script lies in the `TokenTextSplitter` object, which splits the text into chunks based on the token encoding used by the `gpt-4-turbo-preview` model.

Below is a part of the main.py script:

```python
…
# Initializing a TokenTextSplitter object with specified parameters
splitter = TokenTextSplitter(
    chunk_size=10000,
    chunk_overlap=10,
    tokenizer=tiktoken.encoding_for_model("gpt-4-turbo-preview").encode)
…
```
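The rest of main.py only needs to honor the contract the `doc-retriever` tool declares: return the part at the requested index, or the sentinel string once the index runs past the end. A minimal sketch of that lookup (where `chunks` stands for the list produced by the splitter):

```python
def get_chunk(chunks, index):
    """Return the document part at index, or the sentinel the gpt script expects."""
    if index >= len(chunks):
        return "No more content"
    return chunks[index]
```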
To execute this script, you must first configure the `OPENAI_API_KEY` environment variable. The script also expects the document you want to summarize to be present as `legal.pdf` in the same directory, so save your PDF as `legal.pdf` before running it.
Execute the script using the following command:
```bash
gptscript legalsimplifier.gpt
```
When you run it, the `doc-retriever` tool executes `main.py` to split the document into parts, the `legal-simplifier` tool summarizes each part, and the summaries are appended to `summary.md`. The whole process takes a few minutes to complete based on the size of the document, and you can see the output in the terminal or in the `summary.md` file.
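The control flow the gpt script encodes can be sketched in plain Python; `get_part` and `summarize_part` here are hypothetical stand-ins for the `doc-retriever` tool and the LLM call:

```python
def summarize_document(get_part, summarize_part):
    """Fetch parts by increasing index and stop at the sentinel, as the gpt script does."""
    index, summaries = 0, []
    while True:
        part = get_part(index)
        if part == "No more content":
            break
        summaries.append(summarize_part(part))
        index += 1
    # The joined summaries mirror the appended contents of summary.md
    return "\n\n".join(summaries)
```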
With the scripts running as expected, let us add a UI. We’ll be using Flask to build one, but you can use any other framework.
The web app has a simple UI that allows the user to upload a file to the directory and then executes the `legalsimplifier.gpt` script.
Below are the steps that were involved in building this:
After the environment setup, the first step was to integrate the `legalsimplifier.gpt` script with the Flask app. In `app.py`, I used `subprocess.run` to invoke the `gptscript` CLI:

```python
# Execute the script to summarize the legal document
subprocess.run(f"gptscript {SCRIPT_PATH}", shell=True, check=True)
```
Once I got the `legalsimplifier.gpt` execution working, I added two routes: `/` to render the home page and `/upload` to handle the uploaded document.
Finally, I added the logic to handle the file upload. This function takes the file uploaded by the user and saves it as `legal.pdf` so the `legalsimplifier.gpt` script can find it. The HTML for the UI lives in the `templates` directory.
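Inside the upload handler, the save step can be as small as this (a minimal sketch; the real app.py also validates the upload and triggers the gpt script afterwards):

```python
import os

def save_as_legal_pdf(file_bytes, upload_dir="."):
    """Persist the uploaded document as legal.pdf, where legalsimplifier.gpt expects it."""
    path = os.path.join(upload_dir, "legal.pdf")
    with open(path, "wb") as f:
        f.write(file_bytes)
    return path
```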
Pro Tip: You can even use ChatGPT to build the UI for this.
Once both these parts are ready, execute the application. If you’re using an IDE like VS Code, simply hit F5, or run `flask run` or `python app.py` from the terminal.
The application will be available at http://127.0.0.1:5000/. Go ahead and upload a PDF file. The GPTScript runs in the background, segments the document, analyzes it, and shows a concise summary.
In this blog post, we saw how document summarization plays a crucial role in analyzing large documents. The technique helps us navigate the context limitations of LLMs while still summarizing documents precisely.
Beyond the theory, we also saw how to implement document summarization using GPTScript, combining a custom Python script with natural language prompts.
Give GPTScript a try to see how simple it is to build AI applications. If you’re new here or need guidance, join the GPTScript Discord server to get all your queries answered.