In the previous guide, we describe several methods for turning PDFs into data usable for spreadsheets. However, those only handle PDFs that have actual text embedded within them. When a PDF contains just images of text, as they do in scanned documents, then the problem isn't just how to convert them into neat tabular data, but how to extract any text, period.
In this tutorial, we'll explain how to write a program to extract the data into tabular format. Here's an overview of the basic steps:
1. Determine the positions of the lines that divide the rows and columns on a page.
2. Break the image apart along those lines to create (hundreds of) individual image files, one for each cell.
3. Perform optical character recognition on each cell to translate the image into a textfile.
4. Reassemble these (hundreds of) text files in the same order that you divided the main image, creating a (text) spreadsheet of the data.
A caveat: The code examples provided here are specific to the FlashPaper version of Eli Lilly's doctor payment disclosures, such as the black outlines of its table cells.
Background
Eli Lilly became the first major drug company to post its physician payments online in July 2009. However, Lilly was criticized in an April 2010 New York Times article for using a proprietary and discontinued format -- Adobe's "FlashPaper" -- which made the data virtually impossible to download or copy. In fact, PharmaShine, a company that maintains a commercial database of physician payments, said it had to manually retype the entire list.
Eli Lilly disagrees with the Times' story's characterization that it had "purposely made its report impossible to download." In an e-mail to ProPublica, Lilly spokesman J. Scott MacGregor called the characterization "misleading, as it was never our intention to make it difficult for people to access information" and said that preserving integrity of the data was the reason for not making it downloadable initially.
In any event, Lilly now provides their data as a PDF, with copyable text. However, when we began work on Dollars for Docs, Lilly was only providing data in the FlashPaper format. We came up a way to download the file and programmatically to extract tabular data from it.
Software to Get
- Ruby
- RMagick - a library that provides Ruby methods for image processing. Warning: installing it can be laborious.
- Tesseract - An optical character recognition engine maintained by Google. It's used to do the computerized translation of images to text.
- Firefox and the Firebug plugin - read my tutorial on Flash data-scraping for a primer on how to use Firebug to discover the files sent to your web browser.
- Adobe Acrobat Pro (or just Distiller?) -- we used this to convert FlashPaper to a regular PDF
- PDF-Splitter - My colleague Jeff Larson's command-line utility to split a PDF into the TIFF image format (requires Mac OS X Snow Leopard). This can be done (though not as easily) with Adobe Acrobat Pro.
Downloading the FlashPaper Document
Visit the Lilly faculty registry and observe your browser's traffic through Firebug (see my previous tutorial). You should see a file called data.swf with a size of more than 1MB.
Download the file and open it in your web browser. There's a printer icon in the top-right. Click on it and choose PDF as the output. You'll end up with a PDF weighing over 200MB. When you open it up, you'll see that it appears to be a normal table of text. But you won't be able to highlight-and-copy (which is also the case for secure PDFs), and saving as text will create an empty file.
So let's convert it to the TIFF image format, the only format Tesseract can read. You can do this either with Adobe Acrobat Pro's Export function, or my colleague Jeff Larson's pdf-splitter:
pdf-splitter data.pdf.pdf '%d.tif' 2500
This will result in .tif files for every page in the original PDF.
Step 1: Reading Lines With RMagick
RMagick is a library for Ruby that allows us do a variety of graphics operations in our program, such as changing an image's color. In fact, that's the first thing we want to do because Tesseract works best with black-and-white images. This is done using RMagick's quantize method. We also can reduce the gray boosting the image's contrast.
We had limited success doing this programmatically and so just used a Photoshop batch operation to get the desired black-and-white contrast. For the purposes of this tutorial, you can use this sample black-and-white excerpt: table-to-ocr.tif
require 'rubygems' require 'rmagick' img = Magick::Image::read('1.tif').first # Your mileage will vary with this following method; we ended up just using a Photoshop batch job to correctly produce a # black-and-white version img = img.sigmoidal_contrast_channel(5,40,true).quantize(16, Magick::GRAYColorspace).posterize(5) # write out the result just to see what it looks like img.write('bw-1.tif'){|f| f.depth = 8} # the block sets the saved image to a depth of 8-bits
Detecting the lines in the .tif file simply involves finding the lines in which all the pixels are non-white (some may be gray, which you'll see if you *****m in at the pixel level). This can be done with using RMagick's get_pixels method on every horizontal and vertical line; get_pixels returns an array of pixels within the boundaries we specify.
RMagick's Pixel class has red, blue, and green attributes. Examining the red, blue, and green values of white pixel should give you 65535 for each; a black pixel will return the value 0. Gray pixels are anywhere in between.
So first we crop the image to remove the white space surrounding the table (using bounding_box). Then we examine each pixel of a line and record the positions where every pixel in that line had color values less than a dark gray (63000 seems to be enough tolerance):
box = img.bounding_box img.crop!(box.x, box.y, box.width, box.height) line_rows = (0..img.rows-1).inject([]){ |arr, line_index| test_length = (img.columns*0.10).floor arr << line_index if img.get_pixels(0, line_index, (test_length), 1).select{|pixel| pixel.red < 63000 }.length >= test_length*0.95 arr } # same for the column lines # line_cols = ... # Some lines are wider than one pixel. The following call on line_cols and line_rows compresses it: # do the following code for rows, i.e., rows==line_rows columns = line_cols[1..-1].inject( [[ (line_cols[0]), (line_cols[0] )]]){|arr, line| if line == arr.last[1]+1 arr.last[1] = line else arr.push << [line,line] end arr } # Which returns this array of columns, each column defined by a two-element array with the start and end, width-wise, of each line. Example: # => [[0, 1], [211, 212], [420, 421], [630, 631], [747, 747], [900, 901], [942, 942], [1115, 1116], [1156, 1157], [1301, 1301], [1341, 1342], [1500, 1500]]
Step 2: Breaking It Into Pieces
Now that we have two arrays defining rows and columns, we iterate through each one and call RMagick's constitute method, which creates new images based on the dimensions we provided it. We then write each image to a file named column_numberxrow_number.tif:
Dir.mkdir('cell-files') if !File.exists?('cell-files') rows[0..-2].each_with_index do |row, i| new_row = [] columns[0..-2].each_with_index do |column, j| x,y= column[1], row[1] w,h= columns[j+1][0]-x, rows[i+1][0]-y Magick::Image.constitute(w, h, "RGB", img.get_pixels(x,y,w,h).map{ |pixel| [pixel.red, pixel.green, pixel.blue]}.flatten).write("cell-files/#{j}x#{i}.tif"){|out| out.depth=8} `tesseract cell-files/#{j}x#{i}.tif cell-files/#{j}x#{i}.txt ` end end
You should end up with a directory called cell-files with nearly 500 TIFF files in it.
Step 3: Tesseract Each Image
Tesseract is a free optical character recognition program, first developed by HP and now maintained as open-source software by Google. Its operation is simple: point it to an image file, and it produces a text file with what it interprets as text from that image. So, in the above code, we simply run Tesseract on each TIFF as it is created. Add this to the above code, after the constitute call:
`tesseract /cell-files/#{j}x#{i}.tif /cell-files/#{j}x#{i}.txt `
Now you should have nearly 500 text files in cell-files.
Step 4: All Together Now
In the previous code, we're essentially stepping through the image column by column, line by line. While we're in this loop, we might as well record each of the text files' content into one master text file. If we add a delimiting character each time, we end up with tabular data.
So, in the previous block of code, open a text file called "1-table.txt" and after the tesseract call, write the contents of that tiny text file into "1-table.txt." The combined code for Steps 2-4 is:
require 'rubygems' require 'rmagick' img = Magick::Image::read('1.tif').first box = img.bounding_box img.crop!(box.x, box.y, box.width, box.height) line_rows = (0..img.rows-1).inject([]){ |arr, line_index| test_length = (img.columns*0.10).floor arr << line_index if img.get_pixels(0, line_index, (test_length), 1).select{|pixel| pixel.red < 63000 }.length >= test_length*0.95 arr } # same for the columns line_cols = (0..img.columns-1).inject([]){ |arr, line_index| test_length = (img.rows*0.10).floor arr << line_index if img.get_pixels(line_index, 0, 1, (test_length)).select{|pixel| pixel.red < 63000 }.length >= test_length*0.95 arr } columns = line_cols[1..-1].inject( [[ (line_cols[0]), (line_cols[0] )]]){|arr, line| if line == arr.last[1]+1 arr.last[1] = line else arr.push << [line,line] end arr } rows = line_rows[1..-1].inject( [[ (line_rows[0]), (line_rows[0] )]]){|arr, line| if line == arr.last[1]+1 arr.last[1] = line else arr.push << [line,line] end arr } Dir.mkdir('cell-files') if !File.exists?('cell-files') output_file = File.open('1-table.txt', 'w') rows[0..-2].each_with_index do |row, i| text_row = [] columns[0..-2].each_with_index do |column, j| x,y= column[1], row[1] w,h= columns[j+1][0]-x, rows[i+1][0]-y Magick::Image.constitute(w, h, "RGB", img.get_pixels(x,y,w,h).map{ |pixel| [pixel.red, pixel.green, pixel.blue]}.flatten).write("cell-files/#{j}x#{i}.tif"){|out| out.depth=8 } `tesseract cell-files/#{j}x#{i}.tif cell-files/#{j}x#{i} ` # open the file Tesseract just created and store the text into an array text_row << File.open("cell-files/#{j}x#{i}.txt", 'r').readlines.map{|line| line.strip}.join(" ") end # join the array with tab characters and then output it as a line output_file.puts( text_row.join("\t")) end output_file.close
Cleanup Time
If that seemed too easy, it was. Open the resulting text file in a spreadsheet:
Tesseract isn't perfect, and on the first pass it may mistranslate many characters, especially ones that look similar to another, such as 'O' and '0' (zero). You can train the Tesseract engine, though it's tedious and involves giving it a character-by-character correction of a sample test.
Even then it won't be perfect. Your next step will be to determine the best way to clean this data. You could start by validating each column against what you know it should contain (such as a currency format). Or you could design a Mechanical Turk task in which you send an individual text file and TIFF for each cell and ask workers to perform a simple verification. You could even write your own Rails application to display the images and text-values side by side, so that your co-workers can collaboratively do the verification (this is what we did until Lilly released their data as PDFs).
Again, the code here is specific to Lilly's format and may not be as successful on a scanned document where, for example, the lines aren't as easy to determine.
We hope to craft a more generalized version of this guide in the near future. We'll call the project "Tableract" for now.
The Dollars for Docs Data Guides
Introduction: The Coder's Cause – Public records gathering as a programming challenge.
- Using Google Refine to Clean Messy Data – Google Refine, which is downloadable software, can quickly sort and reconcile the imperfections in real-world data.
- Reading Data from Flash Sites – Use Firefox's Firebug plugin to discover and capture raw data sent to your browser.
- Parsing PDFs – Convert made-for-printer documents into usable spreadsheets with third-party sites or command-line utilities and some Ruby scripting.
- Scraping HTML – Write Ruby code to traverse a website and copy the data you need.
- Getting Text Out of an Image-only PDF – Use a specialized graphics library to break apart and analyze each piece of a spreadsheet contained in an image file (such as a scanned document).