PDFBox: Get metadata of PDF document

Metadata is information that describes the document itself, such as its creation date, title, and author.

Accessing basic metadata using PDDocumentInformation
The PDDocumentInformation class provides getter methods that extract information about the PDF document.

Method                                  Description
public String getTitle()                Returns the title of the document, or null if no title exists.
public String getAuthor()               Returns the author of the document, or null if no author exists.
public String getSubject()              Returns the subject of the document, or null if no subject exists.
public String getKeywords()             Returns the keywords of the document, or null if no keywords exist.
public String getCreator()              Returns the creator of the document, or null if no creator exists.
public String getProducer()             Returns the producer of the document, or null if no producer exists.
public Calendar getCreationDate()       Returns the creation date of the document, or null if no creation date exists.
public Calendar getModificationDate()   Returns the modification date of the document, or null if no modification date exists.
public String getTrapped()              Returns the trapped value of the document, or null if no trapped value exists.


The following statements print the metadata of a PDF document.
// PDDocument.load is the PDFBox 2.x API; PDFBox 3.x uses Loader.loadPDF(File) instead.
// try-with-resources closes the document once the metadata has been printed.
try (PDDocument pdDoc = PDDocument.load(new File("/Users/harikrishna_gurram/Downloads/Saurabh.pdf"))) {
    PDDocumentInformation info = pdDoc.getDocumentInformation();
    System.out.println("Title=" + info.getTitle());
    System.out.println("Author=" + info.getAuthor());
    System.out.println("Subject=" + info.getSubject());
    System.out.println("Keywords=" + info.getKeywords());
    System.out.println("Creator=" + info.getCreator());
    System.out.println("Producer=" + info.getProducer());
    System.out.println("Creation Date=" + info.getCreationDate());
    System.out.println("Modification Date=" + info.getModificationDate());
    System.out.println("Trapped=" + info.getTrapped());
}
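Printing a Calendar directly produces a long dump of its internal fields, which is rarely what you want for the creation and modification dates. The following is a minimal sketch of one way to render such a value as an ISO-8601 string using only the JDK. The class name MetadataDates, the helper format, and the "(not set)" placeholder are assumptions made for this illustration and are not part of PDFBox.

```java
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;
import java.util.Calendar;
import java.util.GregorianCalendar;
import java.util.TimeZone;

final class MetadataDates {

    // Hypothetical helper: renders a Calendar (for example the value returned by
    // getCreationDate()) as ISO-8601 text, or a placeholder when the field is absent.
    static String format(Calendar cal) {
        if (cal == null) {
            return "(not set)";
        }
        ZonedDateTime zdt = ZonedDateTime.ofInstant(cal.toInstant(), cal.getTimeZone().toZoneId());
        return DateTimeFormatter.ISO_OFFSET_DATE_TIME.format(zdt);
    }

    public static void main(String[] args) {
        // Stand-in for a date read from a PDF; no PDF is opened here.
        Calendar sample = new GregorianCalendar(TimeZone.getTimeZone("UTC"));
        sample.clear();
        sample.set(2017, Calendar.MARCH, 5, 10, 30, 0);

        System.out.println("Creation Date=" + format(sample));
        System.out.println("Modification Date=" + format(null));
    }
}
```

With PDFBox on the classpath, the same helper could be called as format(info.getCreationDate()) in place of printing the Calendar itself.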

In addition to the methods above, the PDDocumentInformation class provides the getMetadataKeys method, which returns the keys of all metadata fields present in the document.

public Set<String> getMetadataKeys()
The following method uses it to collect every metadata field of a document into a map.
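import java.io.File;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;
import java.util.Optional;
import java.util.Set;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDDocumentInformation;

public static Optional<Map<String, Object>> getDocumentBasicMetaData(final String fileName) {
    // Throws NullPointerException with this message when fileName is null.
    Objects.requireNonNull(fileName, "fileName shouldn't be null");

    // PDDocument.load is the PDFBox 2.x API; the document is closed automatically.
    try (final PDDocument pdDoc = PDDocument.load(new File(fileName))) {
        PDDocumentInformation docInfo = pdDoc.getDocumentInformation();
        Set<String> keys = docInfo.getMetadataKeys();

        Map<String, Object> map = new HashMap<>();

        // Copy every field; the value is null when it is not stored as a string.
        for (String key : keys) {
            map.put(key, docInfo.getPropertyStringValue(key));
        }

        return Optional.of(map);
    } catch (IOException e) {
        // The file could not be read or parsed as a PDF.
        return Optional.empty();
    }
}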
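The map built this way mixes the standard fields (the ones covered by the getters listed earlier) with any custom fields the producing tool added. As a rough sketch, the custom ones can be separated by filtering out the standard key names of the PDF information dictionary. The class InfoKeyFilter, the method customEntries, and the "Department" field are hypothetical names used only for this example; the input map stands in for the result of the method above, so the sketch runs without PDFBox.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

final class InfoKeyFilter {

    // Standard entries of the PDF document information dictionary.
    private static final Set<String> STANDARD_KEYS = new HashSet<>(Arrays.asList(
            "Title", "Author", "Subject", "Keywords", "Creator",
            "Producer", "CreationDate", "ModDate", "Trapped"));

    // Returns only the entries whose key is not a standard one, sorted by key.
    static Map<String, Object> customEntries(Map<String, Object> all) {
        Map<String, Object> custom = new TreeMap<>();
        for (Map.Entry<String, Object> entry : all.entrySet()) {
            if (!STANDARD_KEYS.contains(entry.getKey())) {
                custom.put(entry.getKey(), entry.getValue());
            }
        }
        return custom;
    }

    public static void main(String[] args) {
        // Stand-in for a map of metadata fields read from a PDF.
        Map<String, Object> sample = new LinkedHashMap<>();
        sample.put("Title", "Quarterly report");
        sample.put("Producer", "Some PDF writer");
        sample.put("Department", "Finance"); // hypothetical custom field

        System.out.println(customEntries(sample)); // prints {Department=Finance}
    }
}
```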


PDF documents can have XML metadata (XMP) associated with them. PDFBox gives access to this metadata through the following classes.

PDDocumentCatalog
PDPage
PDXObject
PDICCBased
PDStream
The following snippet reads the catalog metadata from PDDocumentCatalog and returns it as a list of lines.

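import java.io.BufferedReader;
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Objects;
import java.util.Optional;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDDocumentCatalog;
import org.apache.pdfbox.pdmodel.common.PDMetadata;

public static Optional<List<String>> getCatalogMetaData(final String fileName) {
    Objects.requireNonNull(fileName, "fileName shouldn't be null");

    // PDDocument.load is the PDFBox 2.x API; the document is closed automatically.
    try (final PDDocument pdDoc = PDDocument.load(new File(fileName))) {
        PDDocumentCatalog catalog = pdDoc.getDocumentCatalog();
        PDMetadata metadata = catalog.getMetadata();
        return getMetaData(metadata);
    } catch (IOException e) {
        System.out.println(e.getMessage());
        return Optional.empty();
    }
}

// Reads the stream line by line. UTF-8 is assumed here, the usual XMP encoding,
// instead of relying on the platform default charset.
private static Optional<List<String>> getDataFromStream(InputStream in) {
    try (BufferedReader br = new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
        List<String> data = new ArrayList<>();
        String str;

        while ((str = br.readLine()) != null) {
            data.add(str);
        }
        return Optional.of(data);
    } catch (IOException e) {
        System.out.println(e.getMessage());
        return Optional.empty();
    }
}

// Returns the lines of the metadata stream, or empty if there is no metadata.
private static Optional<List<String>> getMetaData(PDMetadata metadata) {
    if (metadata == null) {
        System.out.println("There is no metadata associated");
        return Optional.empty();
    }

    try (InputStream in = metadata.createInputStream()) {
        return getDataFromStream(in);
    } catch (IOException e) {
        System.out.println(e.getMessage());
        return Optional.empty();
    }
}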
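The lines returned for the catalog are raw XMP, which is XML. As an illustration of what can be done with them, the sketch below joins the lines and uses the JDK's DOM parser to pull out the Dublin Core title (dc:title), a property that XMP packets commonly carry. The class XmpTitleReader, the method findTitle, and the sample packet are made up for this example; only the two namespace URIs come from the XMP and RDF specifications. PDFBox itself is not needed to run it.

```java
import java.io.StringReader;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class XmpTitleReader {

    private static final String DC_NS = "http://purl.org/dc/elements/1.1/";
    private static final String RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";

    // Hypothetical helper: returns the first dc:title value found in the XMP lines,
    // or empty if the text is not well-formed XML or carries no title.
    static Optional<String> findTitle(List<String> xmpLines) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            Document doc = builder.parse(
                    new InputSource(new StringReader(String.join("\n", xmpLines))));

            NodeList titles = doc.getElementsByTagNameNS(DC_NS, "title");
            if (titles.getLength() == 0) {
                return Optional.empty();
            }
            // dc:title is a language alternative: rdf:Alt containing rdf:li items.
            NodeList items = ((Element) titles.item(0)).getElementsByTagNameNS(RDF_NS, "li");
            if (items.getLength() == 0) {
                return Optional.empty();
            }
            return Optional.of(items.item(0).getTextContent().trim());
        } catch (Exception e) {
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        // Made-up XMP packet standing in for the lines read from a PDF catalog.
        List<String> sample = Arrays.asList(
                "<x:xmpmeta xmlns:x=\"adobe:ns:meta/\">",
                " <rdf:RDF xmlns:rdf=\"http://www.w3.org/1999/02/22-rdf-syntax-ns#\">",
                "  <rdf:Description rdf:about=\"\" xmlns:dc=\"http://purl.org/dc/elements/1.1/\">",
                "   <dc:title><rdf:Alt><rdf:li xml:lang=\"x-default\">Sample</rdf:li></rdf:Alt></dc:title>",
                "  </rdf:Description>",
                " </rdf:RDF>",
                "</x:xmpmeta>");

        System.out.println(findTitle(sample).orElse("(no title)")); // prints Sample
    }
}
```

Returning Optional.empty() on any parse failure mirrors the error handling of the snippets above; a stricter caller might prefer to propagate the exception instead.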


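The metadata of a PDF page is read the same way: call getMetadata() on a PDPage and process the returned PDMetadata, which is null if the page has none, just as for the catalog above.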


This post first appeared on Java Tutorial : Blog To Learn Java Programming.
