Data is the new oil—I’m sure you’ve heard this phrase a gazillion times as we witness an unprecedented explosion of data. From emails to social media posts and research papers, the sheer amount of data is a blessing and a curse.
This information overload is difficult for individuals and organizations alike, as digging through it and finding what’s relevant to you is tedious. This is where the advent of Artificial Intelligence and large language models (LLMs) is a blessing. These help us distill massive amounts of data into bite-sized information using different document summarization techniques.
In this blog post, we’ll dive into document summarization to understand what it is and its challenges. We’ll also see how to use GPTScript for document summarization with a demo.
As the name suggests, document summarization is a technique that extracts the most important information from a large text document and presents a concise summary. This makes it easy for us to comprehend and make sense of large documents, which otherwise would be complex or take a lot of time.
For example, imagine reading a 200-page research paper. It will take a few days, if not weeks, to fully understand it. Instead, you can understand all the information in minutes by sending it to an LLM using document summarization.
There are two ways document summarization works:

- **Extractive summarization** selects the most important sentences from the original text and stitches them together verbatim.
- **Abstractive summarization** generates new sentences that paraphrase the source, the way a human-written summary would.
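The simpler of the common approaches, extractive summarization, scores sentences and keeps the top-scoring ones verbatim. Here is a toy sketch that scores each sentence by the corpus frequency of its words (a naive illustration only; production systems use far better scoring):

```python
import re
from collections import Counter

def extractive_summary(text, num_sentences=2):
    """Naive extractive summarization: keep the highest-scoring sentences verbatim."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    freq = Counter(re.findall(r"\w+", text.lower()))

    # Score a sentence by the total corpus frequency of its words
    def score(s):
        return sum(freq[w] for w in re.findall(r"\w+", s.lower()))

    top = set(sorted(sentences, key=score, reverse=True)[:num_sentences])
    # Emit the kept sentences in their original order
    return " ".join(s for s in sentences if s in top)
```

Abstractive summarization, by contrast, is what an LLM does when you prompt it to summarize: it writes new text rather than copying sentences.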
Document summarization techniques, as discussed above, have many advantages. They reduce the time required to analyze documents and enable faster decision-making. They also highlight important text, enabling faster document retrieval. Combined, all of this helps with better knowledge retention.
While document summarization sounds like an easy thing to do, for computers, it is still a complex task. Some common challenges associated with document summarization are:
Tokens are units of text that an LLM can process. Most LLMs and summarization models have a fixed number of tokens that they can process at once, commonly referred to as the maximum token limit or context window.
This depends on many things, from the availability of the raw processing power to the algorithm's efficiency. The size of the context window directly impacts the ability of models to summarize large documents effectively, as crucial information may be spread across multiple parts of the document.
Due to this limitation, they may be unable to analyze the document effectively, leading to a poor summary.
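A quick way to see this limitation is to count tokens before sending a document to a model. The sketch below uses whitespace splitting as a crude stand-in for a real tokenizer; with tiktoken installed you would pass `tiktoken.encoding_for_model("gpt-4-turbo-preview").encode` as the encoder instead:

```python
def count_tokens(text, encode=str.split):
    """Count tokens with a pluggable encoder.

    str.split is a crude stand-in for illustration; a real application would
    pass an actual tokenizer's encode function here.
    """
    return len(encode(text))

def fits_context(text, max_tokens, encode=str.split):
    """True if the text fits inside the model's context window."""
    return count_tokens(text, encode) <= max_tokens
```

If a document fails this check, it must be shortened or split before the model can summarize it.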
Another challenge is tailoring the summary to meet specific user needs. Every user may have a different requirement and need a different kind of summary based on their objectives or interests.
For example, a CFO might analyze a market research report to understand the trends in numerical data like sales, revenues, EPS, etc. to make informed decisions. For the same report, a product manager would want to know customer feedback and category insights to better understand customer expectations.
Thus, providing a one-size-fits-all summary is difficult as diverse users have diverse needs.
Our focus for this blog post is context limitation. One way to overcome this challenge is to have a dynamic context. You can build models that dynamically alter their context window size to fit a document.
Alternatively, you can employ document segmentation or chunking techniques to split the documents into chunks that fit within the context window and treat each chunk as an independent document and part of a whole.
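A chunking pass with a small overlap (so content cut at a boundary still appears in full in one chunk) can be sketched like this, operating on a token sequence for illustration:

```python
def chunk_tokens(tokens, chunk_size, overlap):
    """Split a token sequence into overlapping chunks that each fit the context window."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    # Each new chunk starts (chunk_size - overlap) tokens after the previous one
    step = chunk_size - overlap
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), step)]
```

Each chunk can then be summarized independently, with the running summary providing continuity between parts.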
In the next section, we will see how we can achieve that using GPTScript.
Having understood document summarization and the challenges pertaining to context windows, we’ll look at document summarization using GPTScript.
GPTScript is a scripting language that allows you to automate your interactions with an LLM using natural language. With syntax in natural language, it is easy to understand and implement. One of the highlights of GPTScript is its ability to mix natural language prompts with traditional scripts, making it extremely flexible and versatile.
To understand how document summarization using GPTScript works, we’ll build a web app called Legal Simplifier. As the name suggests, this application will simplify large legal documents that are otherwise difficult to understand.
Here’s a high-level overview of how it works:

1. The user uploads a legal document as a PDF.
2. A Python script splits the document into chunks that fit the model’s context window.
3. The script summarizes each chunk in turn and appends the result to a `summary.md` file.
4. Once every chunk is processed, the complete summary in `summary.md` is shown to the user.
I’ve made the application available in the examples section of the GPTScript examples repo, but I’ll also walk through how I built it so you can build your own document summarizer.
Before you can start, make sure you have all of the following:

- GPTScript installed and configured on your machine
- An OpenAI API key
- Python 3 and pip
- Flask, for the web UI
The first step to building the Legal Simplifier is to create the GPTScript. Since GPTScript is written primarily in natural language, it’s very easy to write the script for summarizing documents. Below is the `legalsimplifier.gpt` script:
```
tools: legal-simplifier, sys.read, sys.write

You are a program that is tasked with analyzing a legal document and creating a summary of it.

Create a new file "summary.md" if it doesn't already exist. Call the legal-simplifier tool to get each part of the document. Begin with index 0. Do not proceed until the tool has responded to you. Once you get "No more content" from the legal-simplifier stop calling it. Then, print the contents of the summary.md file.

---
name: legal-simplifier
tools: doc-retriever, sys.read, sys.append
description: Summarizes a legal document

As a legal expert and practicing lawyer, you are tasked with analyzing the provided legal document. Your deep understanding of the law equips you to simplify and summarize the document effectively. Your goal is to make the document more accessible for non-experts, leveraging your expertise to provide clear and concise insights.

Get the part of the legal document at index $index. Read the existing summary of the legal document up to this point in summary.md.

Do not leave out any important points, focusing on key points, implications, and any notable clauses or provisions. Do not introduce the summary with "In this part of the document", "In this segment", or any similar language. Give a list of all the terms and explain each in a one-liner before writing the summary in the document. Write each summary in smaller chunks or add bullet points if required to make it easy to understand. Use headings and section breaks as required, but use the heading "Summary" only once in the entire document. Explain terms in simple language and avoid legal terminology unless absolutely necessary.

Add two newlines to the end of your summary and append it to summary.md. If you got "No more content" just say "No more content". Otherwise, say "Continue".

---
name: doc-retriever
description: Returns a part of the text of the legal document. Returns "No more content" if the index is greater than the number of parts.
args: index: (unsigned int) the index of the part to return, beginning at 0

#!python3 main.py "$index"
```
Let’s understand what we’re doing in this gpt script:

- The entry tool creates the `summary.md` file if it doesn’t already exist and repeatedly calls the `legal-simplifier` tool, starting at index 0, until it gets "No more content".
- The `legal-simplifier` tool calls `doc-retriever` to fetch one part of the document, reads the summary so far from `summary.md`, summarizes the new part, and appends the result to `summary.md`.
- The `doc-retriever` tool runs `main.py`, a Python script that splits the PDF into chunks and returns the chunk at the requested index.
The Python script reads the PDF file uploaded by the user and breaks it into chunks. The crux of the script lies in the `TokenTextSplitter` object, which splits the text into chunks based on the token encoding used by the `gpt-4-turbo-preview` model.

Below is a part of the main.py script:

```python
…
# Initializing a TokenTextSplitter object with specified parameters
splitter = TokenTextSplitter(
    chunk_size=10000,
    chunk_overlap=10,
    tokenizer=tiktoken.encoding_for_model("gpt-4-turbo-preview").encode)
…
```
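The rest of main.py only needs to honor the contract the `doc-retriever` tool declares: return the part at the requested index, or the sentinel string once the index runs past the end. A minimal sketch of that lookup (where `chunks` stands for the list produced by the splitter):

```python
def get_chunk(chunks, index):
    """Return the document part at index, or the sentinel the gpt script expects."""
    if index >= len(chunks):
        return "No more content"
    return chunks[index]
```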
To execute this script, you must first configure the `OPENAI_API_KEY` environment variable. The script also expects the document you want to summarize to be present as `legal.pdf` in the same directory, so save your PDF as `legal.pdf` before running it.
Execute the script using the following command:
```bash
gptscript legalsimplifier.gpt
```
When you run it, the `doc-retriever` tool executes `main.py` to split the document into parts, the `legal-simplifier` tool summarizes each part, and the summaries are appended to `summary.md`. The whole process takes a few minutes to complete based on the size of the document, and you can see the output in the terminal or in the `summary.md` file.
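The control flow the gpt script encodes can be sketched in plain Python; `get_part` and `summarize_part` here are hypothetical stand-ins for the `doc-retriever` tool and the LLM call:

```python
def summarize_document(get_part, summarize_part):
    """Fetch parts by increasing index and stop at the sentinel, as the gpt script does."""
    index, summaries = 0, []
    while True:
        part = get_part(index)
        if part == "No more content":
            break
        summaries.append(summarize_part(part))
        index += 1
    # The joined summaries mirror the appended contents of summary.md
    return "\n\n".join(summaries)
```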
With the scripts running as expected, let us add a UI. We’ll be using Flask to build one, but you can use any other framework.
The web app has a simple UI that allows the user to upload a file to the directory and then executes the `legalsimplifier.gpt` script.
Below are the steps that were involved in building this:
After the environment setup, the first step was to integrate the `legalsimplifier.gpt` script with the Flask app. In `app.py`, I used `subprocess.run` to invoke the `gptscript` CLI:

```python
# Execute the script to summarize the legal document
subprocess.run(f"gptscript {SCRIPT_PATH}", shell=True, check=True)
```
Once I got the `legalsimplifier.gpt` execution working, I added two routes: `/` to render the home page and `/upload` to handle the uploaded document.
Finally, I added the logic to handle the file upload. This function takes the file uploaded by the user and saves it as `legal.pdf` so the `legalsimplifier.gpt` script can find it. The HTML for the UI lives in the `templates` directory.
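Inside the upload handler, the save step can be as small as this (a minimal sketch; the real app.py also validates the upload and triggers the gpt script afterwards):

```python
import os

def save_as_legal_pdf(file_bytes, upload_dir="."):
    """Persist the uploaded document as legal.pdf, where legalsimplifier.gpt expects it."""
    path = os.path.join(upload_dir, "legal.pdf")
    with open(path, "wb") as f:
        f.write(file_bytes)
    return path
```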
Pro Tip: You can even use ChatGPT to build the UI for this.
Once both these parts are ready, execute the application. If you’re using an IDE like VS Code, simply hit F5, or run `flask run` or `python app.py` from the terminal.
The application will be available at http://127.0.0.1:5000/. Go ahead and upload a PDF file. The GPTScript runs in the background, segments the document, analyzes it, and shows a concise summary.
In this blog post, we saw how document summarization plays a crucial role in analyzing large documents. The technique helps us navigate the context limitations of LLMs while still summarizing documents precisely.
Beyond the theory, we also saw how to implement document summarization using GPTScript, combining a custom Python script with natural language prompts.
Give GPTScript a try to see how simple it is to build AI applications. If you’re new here or need guidance, join the GPTScript Discord server to get all your queries answered.