How to Extract Content from a PDF in Java

The Apache Tika library is used from Java for this task. To recognize document types and extract content from various file formats, Tika uses a set of document parsers together with document type detection techniques. It offers a common API for parsing these formats: all of the underlying parsing libraries are encapsulated behind a single interface, the Parser interface.
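One common building block of document type recognition is checking a file's leading "magic" bytes. PDF files begin with the ASCII marker %PDF-, so a minimal sketch of the idea might look like the following. The class MagicBytesSketch and the method looksLikePdf are hypothetical names used only for illustration; this is not how Tika implements detection, which is considerably more thorough.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical illustration of magic-byte detection; not Tika's detector.
class MagicBytesSketch {

    // PDF files start with the ASCII bytes "%PDF-"
    private static final byte[] PDF_MAGIC =
        "%PDF-".getBytes(StandardCharsets.US_ASCII);

    // Returns true if the given bytes begin with the PDF marker
    static boolean looksLikePdf(byte[] data) {
        if (data == null || data.length < PDF_MAGIC.length) {
            return false;
        }
        byte[] prefix = Arrays.copyOf(data, PDF_MAGIC.length);
        return Arrays.equals(prefix, PDF_MAGIC);
    }

    public static void main(String[] args) {
        byte[] pdfHeader = "%PDF-1.7\n".getBytes(StandardCharsets.US_ASCII);
        byte[] plainText = "Hello".getBytes(StandardCharsets.US_ASCII);
        System.out.println(looksLikePdf(pdfHeader));
        System.out.println(looksLikePdf(plainText));
    }
}
```

A check like this only inspects the first few bytes, so it says nothing about whether the rest of the file is a valid PDF.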

The Tika library provides several classes for accessing PDF document content. The following classes are used when extracting content:

 

BodyContentHandler: This class creates a content handler for the text. It writes the XHTML body character events to an internal string buffer. It inherits from the parent class ContentHandlerDecorator. The extracted text can be retrieved using the toString() method provided by the parent class.
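To make the idea of buffering body character events concrete, here is a minimal sketch built only on the JDK's SAX classes. It is not Tika's BodyContentHandler; the names BodyTextSketch, BodyTextHandler, and extractBodyText are hypothetical, and the XHTML input is a hand-written stand-in for the events a parser would emit.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical illustration only; this is not Tika's BodyContentHandler.
class BodyTextSketch {

    // Collects character events that occur inside the <body> element
    static class BodyTextHandler extends DefaultHandler {
        private final StringBuilder buffer = new StringBuilder();
        private int bodyDepth = 0; // > 0 while inside <body>

        @Override
        public void startElement(String uri, String localName,
                                 String qName, Attributes atts) {
            if (bodyDepth > 0 || "body".equals(qName)) {
                bodyDepth++;
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if (bodyDepth > 0) {
                bodyDepth--;
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            if (bodyDepth > 0) {
                buffer.append(ch, start, length);
            }
        }

        // The buffered text is exposed through toString()
        @Override
        public String toString() {
            return buffer.toString();
        }
    }

    // Parses an XHTML string and returns only the text inside <body>
    static String extractBodyText(String xhtml) {
        try {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            BodyTextHandler handler = new BodyTextHandler();
            parser.parse(new InputSource(new StringReader(xhtml)), handler);
            return handler.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse input", e);
        }
    }

    public static void main(String[] args) {
        String xhtml = "<html><head><title>Ignored</title></head>"
                     + "<body><p>Hello PDF</p></body></html>";
        System.out.println(extractBodyText(xhtml)); // prints "Hello PDF"
    }
}
```

The handler ignores everything outside the body element, which is why the title text does not appear in the result.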

PDFParser: This class, found in the package org.apache.tika.parser.pdf, parses the contents of PDF documents. It extracts the text of a PDF document stored within paragraphs, strings, and tables (without preserving table boundaries). It can also be used to parse encrypted documents if the password is supplied.

ParseContext: This class is part of the package org.apache.tika.parser and is used to pass context information on to the Tika parsers.
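The general idea of a parse context, a container that carries optional settings to a parser and is keyed by the type of each setting, can be sketched with plain JDK classes. The sketch below is an assumption-laden illustration, not Tika's ParseContext: SimpleContext, ExtractionOptions, and describe are hypothetical names invented for this example.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration of a type-keyed context; not Tika's ParseContext.
class TypeKeyedContextSketch {

    // Stores at most one value per class and returns it in a type-safe way
    static class SimpleContext {
        private final Map<Class<?>, Object> entries = new HashMap<>();

        <T> void set(Class<T> key, T value) {
            entries.put(key, value);
        }

        <T> T get(Class<T> key) {
            // cast(null) returns null, so a missing entry yields null
            return key.cast(entries.get(key));
        }
    }

    // Hypothetical setting that a parser might look up in the context
    static class ExtractionOptions {
        final boolean sortByPosition;

        ExtractionOptions(boolean sortByPosition) {
            this.sortByPosition = sortByPosition;
        }
    }

    // Stand-in for a parser that reads its options from the context
    static String describe(SimpleContext context) {
        ExtractionOptions options = context.get(ExtractionOptions.class);
        if (options == null) {
            return "defaults";
        }
        return "sortByPosition=" + options.sortByPosition;
    }

    public static void main(String[] args) {
        SimpleContext context = new SimpleContext();
        System.out.println(describe(context)); // prints "defaults"

        context.set(ExtractionOptions.class, new ExtractionOptions(true));
        System.out.println(describe(context)); // prints "sortByPosition=true"
    }
}
```

Passing an empty context, as the program in this article does, corresponds to the "defaults" case in the sketch.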

Procedure:

  1. Create a content handler.
  2. Place a PDF file in a local directory on the system.
  3. Create a FileInputStream for the path of that PDF file.
  4. Create a Metadata object and a ParseContext object for the PDF document.
  5. Parse the PDF document using the PDFParser class.
  6. Print the content collected by the content handler to show what was extracted from the PDF.

Implementation: The following Java program demonstrates content extraction from a PDF document.

 

// Java Program to Extract Content from a PDF

// Importing java input/output classes
import java.io.File;
import java.io.FileInputStream;
// Importing Apache Tika classes
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.pdf.PDFParser;
import org.apache.tika.sax.BodyContentHandler;

// Class
public class GFG {

	// Main driver method
	public static void main(String[] args) throws Exception
	{

		// Create a content handler that collects the body text
		BodyContentHandler contenthandler
			= new BodyContentHandler();

		// Reference the existing PDF file in the local directory
		File f = new File("C:/extractcontent.pdf");

		// Open a file input stream on that file;
		// try-with-resources closes it even if parsing fails
		try (FileInputStream fstream = new FileInputStream(f)) {

			// Create a Metadata object to receive document metadata
			Metadata data = new Metadata();

			// Create a parse context to pass to the parser
			ParseContext context = new ParseContext();

			// Create the PDF parser
			PDFParser pdfparser = new PDFParser();

			// Parse the PDF; the extracted text goes to the handler
			pdfparser.parse(fstream, contenthandler, data,
							context);
		}

		// Print the extracted text using the handler's
		// toString() method
		System.out.println("Extracting contents :"
						+ contenthandler.toString());
	}
}

 
