Wednesday, February 25, 2015

Coldfusion 10 Solr Indexing Zip File that Contains PDF files

Seems like I should be surprised, if I don't find some surprises in Coldfusion every week. :) Here is another one that took me a few hours to find a solution. And hopefully will save a few hours for someone else.

Coldfusion 10 Update 15
Windows Server 2012

When indexing a bunch of files, Coldfusion stopped indexing without any exception or getting into any error state. It just stopped in the middle of indexing. If I was not looking at it closely, I would not have noticed that it has failed.

Again, Coldfusion stopped  the execution without throwing a fuss is a big surprise for me. If I run it in Brower, there is no usual 500 server error. Everything is just hunky-dory as far as Coldfusion is concerned!?

After some digging, I found out the following
  • It stopped on a zip file
  • The zip file has some PDF files in it
  • Coldfusion-error.log has the following message
Feb 25, 2015 8:30:28 AM org.apache.catalina.core.StandardWrapperValve invoke
SEVERE: Servlet.service() for servlet [CfmServlet] in context with path [/] threw exception [ROOT CAUSE: 
java.lang.NoClassDefFoundError: org/apache/pdfbox/pdmodel/PDDocument
 at org.apache.tika.parser.pdf.PDFParser.parse(
 at org.apache.tika.parser.CompositeParser.parse(
 at org.apache.tika.parser.AutoDetectParser.parse(
 at org.apache.tika.parser.DelegatingParser.parse(
 at org.apache.tika.parser.pkg.PackageParser.parseArchive(
 at org.apache.tika.parser.pkg.ZipParser.parse(
 at org.apache.tika.parser.CompositeParser.parse(
 at org.apache.tika.parser.AutoDetectParser.parse(

So, obviously, our Coldfusion distribution is missing some libraries.

Short answer: find jar file for PDFBox, throw them under Coldfusion lib folder and restart Coldfusion. And I got the jar file from here: pdfbox-0.8.0-incubating.jar

Long Answer: However, as with all Open Source projects, there is not much consideration of backward compatibility or official supported bundled distribution. I tried to download latest version of PDFBox, and it just does not work. So, I will need to find the original bundled version, and here is the journey (without detours I took :( ) to the right jar file


  • Google "pdfbox 0.8.0 incubating"
  • voila

Another Challenge (unsolved)
There is still some unsolved challenge for Solr. For example, Verity can index our PDF files correctly, but Solr's PDF reader seem to be sub-par. It only got some fragmented text from our PDF file, and it's missing a lot of keywords in our PDF files.

No comments: