Query PDFs Flexibly: Ask Questions or Extract JSON with UniPDF & AI

Need to extract text from a PDF and convert it into structured data? Whether you’re automating data entry or preparing documents for analysis, extracting text from PDFs is a common challenge.
With the right tools, you can turn unstructured PDF text into structured data for analysis and automation.

Here’s how you can use UniPDF and our AI tools to extract text from PDFs and convert it into structured data. Because our AI tools are compatible with UniPDF, you can pass a PDF file directly to the AI and ask questions about its content.

Why Extract Text from PDFs?

  • PDFs are everywhere. From invoices and reports to legal documents and research papers, PDFs are a common format for sharing information.
  • Extracting text from PDFs can be a pain. PDFs are designed to preserve the layout of a document, which can make it difficult to extract text accurately.
  • Extracting text from PDFs can save time and effort. By converting PDF text into structured data, you can automate data entry, analysis, and other tasks.

The Tools of the Trade: UniPDF and AI

  • UniPDF is a lightweight and powerful PDF library for Go that lets you extract text from PDFs, split PDF pages, and more.
  • UniDoc’s AI tools, available at https://ai.unidoc.io, can convert the extracted text into structured data and answer questions based on the document content.

How to Extract Text from PDFs with UniPDF and Process with AI

  • Step 1: Extract Text from PDFs with UniPDF
    • Use UniPDF to extract text from PDFs. UniPDF can extract text from PDFs with high accuracy, preserving the layout and formatting of the original document.
  • Step 2: Process Text with AI
    • Use AI tools to convert the extracted text into structured data. AI tools can help you extract key information from the text, such as names, dates, and numbers.
  • Step 3: Analyze and Automate
    • Use the structured data to analyze the text, automate data entry, or perform other tasks. By converting PDF text into structured data, you can streamline your workflows and save time.

Example: Ask a Question Based on PDF Content

  • Scenario: We have a PDF document containing a candidate’s CV, and we want to ask about the candidate’s most recent experience.
  • Step 1: Get the page content from the PDF using UniPDF.
  • Step 2: Send the page content to the AI so that it can process the document.
  • Step 3: Extract the PDF text and print it as JSON.
  • Step 4: Ask the AI about the candidate’s most recent experience.
  • Step 5: Get the AI’s answer, based on the candidate’s CV.

Example Document

You can use the following example PDF document to test the code: Page 1 Page 2

Code Example: Get PDF page with UniPDF

Import the required packages.

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"mime/multipart"
	"net/http"
	"os"
	"strings"
	"github.com/unidoc/unipdf/v3/common/license"
	"github.com/unidoc/unipdf/v3/model"
)

Initialize a UniCLOUD or UniPDF offline license; one of them is required to run UniPDF. This example uses a UniCLOUD API key license.

func init() {
	// Make sure to load your metered License API key prior to using the library.
	// If you need a key, you can sign up and create a free one at https://cloud.unidoc.io
	err := license.SetMeteredKey(os.Getenv(`UNIDOC_LICENSE_API_KEY`))
	if err != nil {
		panic(err)
	}
}

The main function processes the PDF file. It reads the path to the PDF file from the command-line argument, sets the question that will be sent to the AI, and prints the results to the terminal. The question is “What is the latest experience?”, so we expect the AI to answer with the candidate’s latest experience based on the CV. Before printing the answer, the program prints the full AI response for each page, which contains the extracted PDF content as structured JSON.

func main() {
	if len(os.Args) < 2 {
		fmt.Printf("Usage: go run main.go input.pdf\n")
		os.Exit(1)
	}
	inputPath := os.Args[1]
	question := "What is the latest experience?"
	pages, err := splitPdfPages(inputPath)
	if err != nil {
		fmt.Printf("Error: %v\n", err)
		os.Exit(1)
	}
	answers := make([][]string, len(pages))
	for i, page := range pages {
		resp, err := processPageToAI(page, question)
		if err != nil {
			fmt.Printf("Error: %v\n", err)
			os.Exit(1)
		}
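		// Read the body, then close it right away. A defer inside the loop
		// would keep every response open until main returns.
		responseBody, err := io.ReadAll(resp.Body)
		resp.Body.Close()
		if err != nil {
			panic(err)
		}
		if resp.StatusCode != http.StatusOK {
			fmt.Printf("failed to upload file: %s\n", resp.Status)
			os.Exit(1)
		}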
		var jsonResponse map[string]interface{}
		err = json.Unmarshal(responseBody, &jsonResponse)
		if err != nil {
			panic(err)
		}
		jsonFormattedResponse, err := json.MarshalIndent(jsonResponse, "", "  ")
		if err != nil {
			panic(err)
		}

		// We print the JSON response for each page.
		fmt.Printf("Page %d:\n", i+1)
		fmt.Println(string(jsonFormattedResponse))
		var aiRes []string
		// The second value of a type assertion is a bool, so name it ok.
		if files, ok := jsonResponse["files"].([]interface{}); ok {
			for _, file := range files {
				if fileMap, ok := file.(map[string]interface{}); ok {
					if cvContent, ok := fileMap["cv_content"].(map[string]interface{}); ok {
						if fullName, ok := cvContent["full_name"].(string); ok {
							aiRes = append(aiRes, fullName)
						}
						if answer, ok := cvContent["answer"].(string); ok {
							aiRes = append(aiRes, question, answer)
						}
					}
				}
			}
		}
		answers[i] = aiRes
	}
	// Print out the full name, question and answer for each page.
	for i, answer := range answers {
		fmt.Printf("%d. %s\n", i+1, strings.Join(answer, ";"))
	}
}

The splitPdfPages function reads the PDF file and splits it into pages using UniPDF. It uses UniPDF’s model.PdfWriter to write each page into a bytes.Buffer and returns a []bytes.Buffer and an error.

// splitPdfPages splits the input PDF file into single-page PDFs, one buffer per page.
func splitPdfPages(inputPath string) ([]bytes.Buffer, error) {
	pdfReader, f, err := model.NewPdfReaderFromFile(inputPath, nil)
	if err != nil {
		return nil, err
	}
	defer f.Close()
	numPages, err := pdfReader.GetNumPages()
	if err != nil {
		return nil, err
	}
	pdfPages := make([]bytes.Buffer, numPages)
	for i := 1; i <= numPages; i++ {
		pageNum := i
		page, err := pdfReader.GetPage(pageNum)
		if err != nil {
			return nil, err
		}
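		// Write the page into its own single-page PDF held in memory.
		var buf bytes.Buffer
		pdfWriter := model.NewPdfWriter()
		if err := pdfWriter.AddPage(page); err != nil {
			return nil, err
		}
		if err := pdfWriter.Write(&buf); err != nil {
			return nil, err
		}
		pdfPages[i-1] = buf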
	}
	return pdfPages, nil
}

Once we have each PDF page as a bytes.Buffer, we pass it to the processPageToAI function along with the question. This function returns an *http.Response and an error; the main function above then turns the response into readable output.

// processPageToAI processes the page and sends it to the AI.
func processPageToAI(pageBuf bytes.Buffer, question string) (*http.Response, error) {
	// Build a multipart/form-data request containing the page and the question.
	var requestBody bytes.Buffer
	writer := multipart.NewWriter(&requestBody)
	part, err := writer.CreateFormFile("files", "cv.pdf")
	if err != nil {
		return nil, err
	}
	_, err = io.Copy(part, &pageBuf)
	if err != nil {
		return nil, err
	}
	err = writer.WriteField("question", question)
	if err != nil {
		return nil, err
	}
	err = writer.Close()
	if err != nil {
		return nil, err
	}
	req, err := http.NewRequest("POST", "https://ai.unidoc.io/extract_text", &requestBody)
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", writer.FormDataContentType())
	client := &http.Client{}
	return client.Do(req)
}
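
The form layout used by processPageToAI can be checked without calling the real service. The following is a minimal sketch, not part of the original example: buildUploadRequest, echoHandler, and checkUpload are hypothetical helper names, the URL is a placeholder, and the handler only stands in for the AI service. It assumes nothing about the service beyond the two field names, files and question, that appear in the code above.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"mime/multipart"
	"net/http"
	"net/http/httptest"
)

// buildUploadRequest assembles the same multipart body as processPageToAI,
// but takes the URL as a parameter so the request can target a test handler.
func buildUploadRequest(url string, pdf []byte, question string) (*http.Request, error) {
	var body bytes.Buffer
	w := multipart.NewWriter(&body)
	part, err := w.CreateFormFile("files", "cv.pdf")
	if err != nil {
		return nil, err
	}
	if _, err := part.Write(pdf); err != nil {
		return nil, err
	}
	if err := w.WriteField("question", question); err != nil {
		return nil, err
	}
	// Close writes the final boundary; the body is incomplete without it.
	if err := w.Close(); err != nil {
		return nil, err
	}
	req, err := http.NewRequest(http.MethodPost, url, &body)
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", w.FormDataContentType())
	return req, nil
}

// echoHandler is a hypothetical stand-in for the AI service. It reports the
// question, the uploaded file name, and the file size.
func echoHandler(w http.ResponseWriter, r *http.Request) {
	file, header, err := r.FormFile("files")
	if err != nil {
		http.Error(w, err.Error(), http.StatusBadRequest)
		return
	}
	defer file.Close()
	data, err := io.ReadAll(file)
	if err != nil {
		http.Error(w, err.Error(), http.StatusInternalServerError)
		return
	}
	fmt.Fprintf(w, "%s|%s|%d", r.FormValue("question"), header.Filename, len(data))
}

// checkUpload runs the request through echoHandler in memory, with no network access.
func checkUpload(pdf []byte, question string) (int, string, error) {
	req, err := buildUploadRequest("http://example.invalid/extract_text", pdf, question)
	if err != nil {
		return 0, "", err
	}
	rec := httptest.NewRecorder()
	echoHandler(rec, req)
	return rec.Code, rec.Body.String(), nil
}

func main() {
	code, body, err := checkUpload([]byte("%PDF-1.4 placeholder"), "What is the latest experience?")
	if err != nil {
		panic(err)
	}
	fmt.Println(code, body)
}
```

Taking the URL as a parameter is the only change from processPageToAI; it keeps the request-building logic testable while the production call can still pass the real endpoint.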

Output

We expect the AI response for each page of the PDF to be printed as JSON. The output will look like this:

Page 1:
{
  "files": [
    {
      "cv_content": {
        "achievements": "2001 Became CEO of Wayne Enterprises; 2005 Established the Wayne Foundation; 2010 Launched Wayne Tech's R\u0026D division; 2015 Introduced Wayne Enterprises' renewable energy initiative; 2020 Awarded Gotham Humanitarian of the Year",
        "answer": "CEO, Wayne Enterprises (2010 - Present) - Spearheaded corporate expansions into renewable energy, biotech, and security technologies. Led R\u0026D in advanced transportation and defense technologies.",
        "certifications": "",
        "education": "Bachelor's in Business Administration \u0026 Criminology from Gotham University",
        "email": "[email protected]",
        "experience": "CEO, Wayne Enterprises (2010 - Present) - Spearheaded corporate expansions into renewable energy, biotech, and security technologies. Led R\u0026D in advanced transportation and defense technologies. Founder, Wayne Foundation (2005 - 2010) - Funded social welfare programs focusing on education, healthcare, and crime prevention. Developed the Gotham Outreach Program.",
        "full_name": "Bruce Wayne",
        "languages": [],
        "phone": "+1 555-1940",
        "skills": [],
        "summary": "Visionary businessman, philanthropist, and CEO of Wayne Enterprises with expertise in corporate management, technological innovation, and security solutions.",
        "type": "CV"
      },
      "filename": "cv.pdf",
      "type": "CV"
    }
  ]
}
Page 2:
{
  "files": [
    {
      "cv_content": {
        "achievements": "1992 Graduated from MIT at age 17, 1999 Revolutionized the defense industry, 2008 Created first Arc Reactor-powered exosuit, 2010 Transitioned Stark Industries to clean energy, 2015 Developed AI-driven robotics innovations",
        "answer": "Founder, Stark Industries (2011 - Present) - Led groundbreaking research in clean energy, AI, and defense. Developed advanced exosuits and military-grade technology.",
        "certifications": "",
        "education": "Bachelor's in Electrical Engineering \u0026 Physics from MIT",
        "email": "[email protected]",
        "experience": "Founder, Stark Industries (2011 - Present) - Led groundbreaking research in clean energy, AI, and defense. Developed advanced exosuits and military-grade technology.",
        "full_name": "Tony Stark",
        "languages": [],
        "phone": "+1 555-3000",
        "skills": [],
        "summary": "Genius entrepreneur, inventor, and former CEO of Stark Industries. Innovator in robotics, AI, and defense technology with expertise in high-tech engineering and business strategy.",
        "type": "CV"
      },
      "filename": "cv.pdf",
      "type": "CV"
    }
  ]
}
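
The \u0026 sequences in the JSON above come from Go’s json.MarshalIndent, which escapes &, <, and > by default; they decode back to a plain & when the JSON is parsed. If you prefer the raw characters in the printed output, one option is a json.Encoder with HTML escaping turned off. The sketch below compares both; marshalEscaped and marshalPlain are hypothetical helper names, not part of the original example.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"strings"
)

// marshalEscaped formats v with json.MarshalIndent, as the main function does.
func marshalEscaped(v interface{}) (string, error) {
	b, err := json.MarshalIndent(v, "", "  ")
	return string(b), err
}

// marshalPlain formats v with HTML escaping turned off, so "&" stays "&".
func marshalPlain(v interface{}) (string, error) {
	var buf bytes.Buffer
	enc := json.NewEncoder(&buf)
	enc.SetEscapeHTML(false)
	enc.SetIndent("", "  ")
	if err := enc.Encode(v); err != nil {
		return "", err
	}
	// Encode appends a trailing newline; trim it to match MarshalIndent.
	return strings.TrimSuffix(buf.String(), "\n"), nil
}

func main() {
	v := map[string]string{"education": "Business Administration & Criminology"}

	escaped, err := marshalEscaped(v)
	if err != nil {
		panic(err)
	}
	plain, err := marshalPlain(v)
	if err != nil {
		panic(err)
	}
	fmt.Println(escaped)
	fmt.Println(plain)
}
```

Both forms are valid JSON and decode to the same string, so this only affects how the output reads in the terminal.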

At the end of the output, we get the full name, question, and answer for each page. The output will look like this:

1. Bruce Wayne;What is the latest experience?;CEO, Wayne Enterprises (2010 - Present) - Spearheaded corporate expansions into renewable energy, biotech, and security technologies. Led R&D in advanced transportation and defense technologies.
2. Tony Stark;What is the latest experience?;Founder, Stark Industries (2011 - Present) - Led groundbreaking research in clean energy, AI, and defense. Developed advanced exosuits and military-grade technology.
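
The nested type assertions in main can also be replaced by decoding into typed structs. This is a minimal sketch under assumptions: the struct names (extractResponse, fileResult, cvContent) and the helpers are hypothetical, the field names are taken only from the sample output shown above, and only fields that appear there as strings are mapped. The actual response schema may contain other fields or types.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// cvContent mirrors a subset of the "cv_content" object in the sample output.
type cvContent struct {
	FullName   string `json:"full_name"`
	Answer     string `json:"answer"`
	Experience string `json:"experience"`
	Summary    string `json:"summary"`
}

// fileResult mirrors one entry of the "files" array.
type fileResult struct {
	Filename  string    `json:"filename"`
	Type      string    `json:"type"`
	CVContent cvContent `json:"cv_content"`
}

// extractResponse mirrors the top-level response object.
type extractResponse struct {
	Files []fileResult `json:"files"`
}

// decodeResponse parses a response body; fields not listed in the structs are ignored.
func decodeResponse(body []byte) (extractResponse, error) {
	var r extractResponse
	err := json.Unmarshal(body, &r)
	return r, err
}

// summarize builds one "name;question;answer" line per file, in the same
// format that main prints at the end.
func summarize(r extractResponse, question string) []string {
	var lines []string
	for _, f := range r.Files {
		fields := []string{f.CVContent.FullName, question, f.CVContent.Answer}
		lines = append(lines, strings.Join(fields, ";"))
	}
	return lines
}

func main() {
	// A shortened body in the shape of the sample output.
	body := []byte(`{"files":[{"cv_content":{"full_name":"Bruce Wayne",
		"answer":"CEO, Wayne Enterprises (2010 - Present)","skills":[]},
		"filename":"cv.pdf","type":"CV"}]}`)

	r, err := decodeResponse(body)
	if err != nil {
		panic(err)
	}
	for i, line := range summarize(r, "What is the latest experience?") {
		fmt.Printf("%d. %s\n", i+1, line)
	}
}
```

With typed structs, a missing field simply decodes to an empty string, while a field whose type differs from the struct definition produces an error from json.Unmarshal instead of being silently skipped.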


Complete Source Code Example

You can copy the complete source code for this example from our gist. After copying the code, run it with the following command:

go run main.go <your_cv_pdf_file.pdf>

Conclusion

By combining the power of UniPDF and AI, you can extract text from PDFs and convert it into structured data for analysis and automation.
This approach simplifies data retrieval, enhances workflow automation, and unlocks valuable insights from documents such as invoices, CVs, reports, and legal papers.
AI-powered document processing continues to evolve. Whether for business automation, research, or legal document analysis, this technology offers a scalable way to handle complex PDF data with minimal manual effort.

Our AI tools are at an early stage and currently focus on CV (Curriculum Vitae) documents. We are continuing to enhance them to provide more advanced document processing capabilities and to support other types of documents.