Friday, September 23, 2011

Eric Blue’s Blog » Learning Faster – Automatically Extract Highlighted Text from PDF Documents

Overview

Image courtesy http://www.flickr.com/photos/liveandrock/

I never really considered myself a “highlighter” until a couple of years ago. Back in school I would, on occasion, highlight some interesting passages while doing homework or reading books and jot them down later. More often than not, though, many of those highlights would go to waste. After all, what good is highlighting interesting bits of text if you don’t use them later? My highlight compulsion increased about 6 years ago when I dove head first into mindmapping and started experimenting with a technique called MMOST (Mind Map Organic Study Technique). In a nutshell, MMOST is a strategy for quickly digesting books and summarizing what you’ve learned into a mindmap so you can recall or reference it at a later date. For a great intro to the MMOST technique, check out the post on How to Understand a Business Book in Four Hours. What does highlighting have to do with MMOST? While I’m reading a book I’ll highlight the passages that stick out to me and use those as the basis for creating the mindmap summary. It can take a lot of time, but the process of highlighting, reviewing, and creating the mindmap can significantly improve your recall and what you get out of a book (or any research project).

Another big change happened earlier this year when I started using an iPad. I’ve been gradually accumulating more digital books (using PDFs and purchasing books through Amazon using Kindle). After using Kindle for a short time I was blown away by the feature that lets you highlight book passages and get summaries of the highlighted text and page number (the direct URL is http://kindle.amazon.com/your_highlights). This is REALLY useful for accelerating the summarizing process and the beauty of it is that it’s automatic – the extraction just works! Around the time I started using Kindle for iPad I discovered a fantastic PDF document reader called GoodReader.

GoodReader is a full-featured document reader with some powerful features. Not only can you take all of your documents on the go, you can access them remotely using WebDAV, Google Docs, Dropbox, email, and other online services. Starting a couple of months ago it got even better by supporting PDF highlighting and annotations. I thought to myself, “Hey, it would be great if I could somehow extract all my highlighted text just like Kindle. I could TRIPLE the number of books I read and create summaries for almost all of them!” It turns out this IS possible, but it is nowhere near as simple as I initially hoped. I dove down the deep rabbit hole of reviewing the ~1,000 page Adobe PDF specification, hacked and tinkered with Perl and Java code, reviewed numerous open source and commercial offerings, and have emerged (slightly scathed but wiser) with some good solutions.

The Challenge

I won’t get into the nitty-gritty details here, but what would seem a simple operation of extracting highlighted text from a PDF turns out to be exceedingly difficult depending on what strategy you use. In fact, as near as I can tell, there is no existing open source or commercial solution that can reliably and accurately extract 100% of the highlighted text from all documents. The main challenge with PDF is that it isn’t a markup language like HTML that will explicitly tell you how text should be rendered. For example:

This is an example sentence that I would like to highlight.

The PDF format, while parsable, uses concepts like dictionaries, objects, streams and coordinate systems that tell PDF readers how to correctly render the doc. What this means is that things like annotations (notes) and highlights are rendered separately from the text itself. The best way to visualize this is to think of the highlighted PDF as having 2 distinct layers: the top layer is the highlight itself and the bottom layer is the text. The straightforward strategy is to simply say: “Find the X,Y coordinates of the highlighted region, then find the X,Y coordinates of all text in that same region and simply copy it.” Well, the unfortunate complexity is that in order to find the coordinates of the text you also have to take the font type and font size into consideration. After many hours of hacking with only minimal success, I’ve concluded that this method is not currently possible without a lot of additional coding. And, unless somebody can point me in the right direction, I haven’t found any open source or commercial offerings that do this. OK, so you’re probably wondering why I’ve made you read this much of the post only to tell you it’s not technically possible. It is possible, just using a slightly different method.
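To make the two-layer idea concrete, here is a minimal sketch of the naive coordinate strategy in plain Java. It assumes that some parser has already produced a bounding box for every text fragment, which is exactly the hard part described above. The TextFragment class, the method names, the page height, and all the coordinates are hypothetical and made up for illustration; none of them come from a PDF library.

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class HighlightOverlap {

    /** Hypothetical holder: a piece of text plus its bounding box on the page. */
    static final class TextFragment {
        final String text;
        final Rectangle2D.Float box;

        TextFragment(String text, float x, float y, float w, float h) {
            this.text = text;
            this.box = new Rectangle2D.Float(x, y, w, h);
        }
    }

    /**
     * PDF page coordinates start at the bottom-left corner. This sketch keeps
     * its text boxes in top-left based coordinates, so the highlight rectangle
     * (lower-left and upper-right corners) has its y axis flipped here.
     */
    static Rectangle2D.Float toTopLeftOrigin(float llx, float lly, float urx, float ury,
                                             float pageHeight) {
        return new Rectangle2D.Float(llx, pageHeight - ury, urx - llx, ury - lly);
    }

    /** Joins the text of every fragment whose box intersects the highlight. */
    static String selectHighlighted(List<TextFragment> fragments, Rectangle2D highlight) {
        List<String> hits = new ArrayList<>();
        for (TextFragment f : fragments) {
            if (f.box.intersects(highlight)) {
                hits.add(f.text);
            }
        }
        return String.join(" ", hits);
    }

    /** Runs the sketch on made-up data: two lines of text, one partial highlight. */
    static String demo() {
        float pageHeight = 792f; // assumed page height in points
        Rectangle2D.Float highlight = toTopLeftOrigin(100f, 700f, 300f, 712f, pageHeight);
        List<TextFragment> fragments = Arrays.asList(
                new TextFragment("This is an", 40f, 80f, 55f, 12f),
                new TextFragment("example sentence", 105f, 80f, 90f, 12f),
                new TextFragment("that I would", 200f, 80f, 70f, 12f),
                new TextFragment("like to highlight.", 40f, 94f, 100f, 12f));
        return selectHighlighted(fragments, highlight);
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

With tidy, made-up boxes like these the overlap test is trivial. In a real document the fragment boxes have to be derived from font metrics, spacing, and orientation, and that is where this approach gets difficult.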

The Solutions

It turns out that you can automatically extract the highlight with 100% accuracy, but there is a caveat that requires a little more manual work. It sounds much more painful than it really is. The trick is to not only highlight the passage of text, but also copy the text and paste it as an annotation (note) on top of the highlight. For GoodReader it’s simply a matter of a couple of extra clicks. And for people who use Adobe Acrobat or Acrobat Reader, there is an option in most versions to automatically copy/paste text into a note whenever you select text to highlight (go to Settings -> Commenting Preferences -> “Copy selected text into Highlight, Cross-Out, and Underline comment pop-ups.”). Here’s how you accomplish this using GoodReader as of v3.2.0:

  1. Select the text you would like to highlight and select Copy. As soon as you click Copy, the menu option above the text will remain.
  2. Next select the Highlight option. At this point the text will now be highlighted.
  3. Tap the highlighted text and select the Open option. A note dialogue will appear.
  4. Hold down for 2 seconds on the note until the Paste option appears and select it. Click Save.

Basically 6 quick clicks/taps and you’re done. It’s not ideal, but certainly a good trade-off if it means you get to extract automatically and have 100% reliability. Now, there are a couple options for easily extracting your highlights.

Option 1 – Use a PDF Reader to create highlight summaries

If you have the money, Adobe Acrobat has many features that let you view and print all of your annotations (notes, highlights, etc.). Although it isn’t prohibitively expensive, most people (myself included) don’t really want to spend money when a comparable free or open source solution exists. Adobe Acrobat Reader (the free version most people use) does allow you to view the highlights in a summary pane, but doesn’t allow you to extract and print them. (You’ll notice that if you don’t create the annotated note with your highlight, the entry will show blank.) The best free PDF viewer that I experimented with is Foxit Reader, and it allows you to easily create a PDF summary of your highlights. Simply go to Comments -> Summary Comments and you’ll be prompted to save a new PDF file that only contains the highlighted text along with the page number.

Option 2 – Programmatically extract highlights

For those inclined to hack, there are a couple of open source options for parsing PDF files. I first started experimenting with a great Perl module called CAM::PDF. After a few weekends of tinkering around and subsequently needing to dig into the official Adobe PDF specification I realized how complicated PDF parsing, rendering, and text extraction can be. CAM::PDF does make it easy to parse the overall structure of the document and extract text for an entire page, but it is very difficult to extract text for exact coordinates (for a number of technical reasons). At this point I was still trying to solve the problem with the original strategy of extracting text by x,y coordinates, and after researching for countless hours I realized my open source options were limited.

My next step was to experiment with PDFBox, an Apache open source Java PDF library. After some searching I was very excited to at least scratch the surface and get preliminary results of text extraction based on the highlight x,y coordinates. I soon discovered that needing to take the font style, orientation, and spacing into consideration to grab the exact text would prove to be time consuming. I haven’t yet found other examples, or reached out on the mailing list, but I’m sure with sufficient determination and time this could be done.

Not wanting to devote this amount of time right now to solve this problem, I opted to go for the pragmatic solution of saving the note and extracting that. For those interested, I’ve attached some very simple test code that will extract the annotated comment and I’ve included commented out code for doing very basic (and not yet accurate) extraction based on region/coordinates. When I have more time I may make this a standalone executable so you can run it from the command line and bulk extract highlights from multiple documents:

import java.awt.geom.Rectangle2D;
import java.io.File;
import java.util.List;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.common.PDRectangle;
import org.apache.pdfbox.pdmodel.interactive.annotation.PDAnnotation;
import org.apache.pdfbox.util.PDFTextStripperByArea;

public class ExtractHighlights {

    public static void main(String args[]) {
        try {
            PDDocument pddDocument = PDDocument.load(new File("sample.pdf"));
            List allPages = pddDocument.getDocumentCatalog().getAllPages();
            for (int i = 0; i < allPages.size(); i++) {
                int pageNum = i + 1;
                PDPage page = (PDPage) allPages.get(i);
                List<PDAnnotation> la = page.getAnnotations();
                if (la.size() < 1) {
                    continue;
                }
                System.out.println("Total annotations = " + la.size());
                System.out.println("\nProcess Page " + pageNum + "...");

                // Just get the first annotation for testing
                PDAnnotation pdfAnnot = la.get(0);
                System.out.println("Annot type = " + pdfAnnot.getSubtype());
                System.out.println("Modified date = " + pdfAnnot.getModifiedDate());
                System.out.println("Rectangle = " + pdfAnnot.getRectangle());

                // Sample code taken from Canoo unit test - extractAnnotations
                // See https://svn.canoo.com/trunk/webtest/src/main/java/com/canoo/webtest/plugins/pdftest/htmlunit/pdfbox/PdfBoxPDFPage.java
                // Experimental - Not completely working since rectangle doesn't take font size/spacing into account
                // PDFTextStripperByArea stripper = new PDFTextStripperByArea();
                // stripper.setSortByPosition(true);
                //
                // PDRectangle rect = pdfAnnot.getRectangle();
                // float x = rect.getLowerLeftX() - 1;
                // float y = rect.getUpperRightY() - 1;
                // float width = rect.getWidth() + 2;
                // float height = rect.getHeight() + rect.getHeight() / 4;
                // int rotation = page.findRotation();
                // if (rotation == 0) {
                //     PDRectangle pageSize = page.findMediaBox();
                //     y = pageSize.getHeight() - y;
                // }
                //
                // Rectangle2D.Float awtRect = new Rectangle2D.Float(x, y, width, height);
                // stripper.addRegion(Integer.toString(0), awtRect);
                // stripper.extractRegions(page);
                //
                // System.out.println("Getting text from region = " + awtRect + "\n");
                // System.out.println(stripper.getTextForRegion(Integer.toString(0)));

                System.out.println("Getting text from comment = " + pdfAnnot.getContents());
            }
            pddDocument.close();
        } catch (Exception ex) {
            ex.printStackTrace();
        }
    }
}

Of all the APIs I reviewed, PDFBox appears to be one of the best: enumerating through the annotations is easy, extracting the note is just as simple, and the basic API is there to extract highlights with no need for the note (just be prepared to dig in and do some work). I also spent some time researching Adobe’s JavaScript API and saw some forum posts where a person mentioned having written a JavaScript plugin for Adobe Acrobat Reader that extracted the highlight without the need for the notes. However, I could not find a working example. With further research I’m sure this could be another option.

For the short term, my practical solution is going to be using Foxit Reader to create the highlight summaries. Foxit works under Wine (Linux) and I’ve been able to share my GoodReader docs over WiFi and mount that GoodReader share as a WebDAV folder. This means that once I’m done reading and highlighting a PDF I can easily open it in Foxit Reader without needing to copy anything, generate the highlight summary, and save it back to my Documents folder. Longer term I’ll probably elaborate on the PDFBox code and write a program to automatically extract the highlights and save them as text, XML, or HTML.
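As a rough idea of what the "save as text" step of such a program might look like, here is a small sketch in plain Java. It assumes the page number and note text have already been pulled out of each annotation by some other code; the Note class and the toText method are hypothetical names invented for this example, and the output format is just one possible choice.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class HighlightSummary {

    /** Hypothetical holder for one extracted note: page number plus the pasted text. */
    static final class Note {
        final int page;
        final String text;

        Note(int page, String text) {
            this.page = page;
            this.text = text;
        }
    }

    /** Builds a plain-text summary, one line per note, ordered by page number. */
    static String toText(List<Note> notes) {
        List<Note> sorted = new ArrayList<>(notes);
        sorted.sort(Comparator.comparingInt(n -> n.page));
        StringBuilder sb = new StringBuilder();
        for (Note n : sorted) {
            // Skip highlights that never had the text pasted into a note.
            if (n.text == null || n.text.trim().isEmpty()) {
                continue;
            }
            sb.append("Page ").append(n.page).append(": ")
              .append(n.text.trim()).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<Note> notes = Arrays.asList(
                new Note(12, " second passage "),
                new Note(3, "first passage"),
                new Note(5, ""));
        System.out.print(toText(notes));
    }
}
```

Blank notes are skipped on purpose: a highlight without a pasted note has no text to report, which matches the blank entries that show up in a reader's summary pane.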

Happy Highlighting

2 comments:

Unknown said...

Dear Holden,

I am trying to reach out to SumNotes to inquire about their API for this, but I haven't gotten any response. Any help in getting me in touch with them would be highly appreciated.

Venkata Dikshit
