Wednesday, February 25, 2015

ColdFusion 10 Solr Indexing of Zip Files That Contain PDF Files

It seems I should be surprised if I don't find a surprise in ColdFusion every week. :) Here is another one that took me a few hours to solve, and hopefully this post will save someone else a few hours.

Environment
ColdFusion 10 Update 15
Windows Server 2012

Symptom
While indexing a batch of files, ColdFusion stopped without throwing an exception or entering any error state. It simply stopped partway through the indexing. Had I not been watching closely, I would not have noticed that it had failed.

Again, ColdFusion stopping execution without making a fuss was a big surprise to me. When I ran the page in a browser, there was no usual 500 server error. Everything was just hunky-dory as far as ColdFusion was concerned!?
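
One possible explanation for the silence, and this is a guess rather than something confirmed against ColdFusion's internals: the failure that eventually turned up in the log (shown under Cause) is a java.lang.NoClassDefFoundError, which in Java is an Error and not an Exception, so a handler that only catches Exception never sees it. A minimal Java sketch of that distinction follows. The class name ErrorVsException and the method classify are made up for illustration, and the error is thrown by hand instead of coming from a real missing jar.

```java
public class ErrorVsException {

    /** Reports which catch clause ends up handling whatever the task throws. */
    public static String classify(Runnable task) {
        try {
            task.run();
            return "completed";
        } catch (Exception e) {
            // Ordinary failures (checked and runtime exceptions) land here.
            return "caught as Exception";
        } catch (Error e) {
            // NoClassDefFoundError is a LinkageError, which is an Error,
            // so it skips the Exception handler above and lands here.
            return "caught as Error";
        }
    }

    public static void main(String[] args) {
        // Thrown by hand to imitate the missing-class failure.
        System.out.println(classify(() -> {
            throw new NoClassDefFoundError("org/apache/pdfbox/pdmodel/PDDocument");
        }));
        // An ordinary exception, for comparison.
        System.out.println(classify(() -> {
            throw new IllegalStateException("an ordinary exception");
        }));
    }
}
```

If the second catch clause were absent, the error would pass straight through code that believes it handles "everything", which would at least be consistent with an indexing run that just stops.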

Cause
After some digging, I found the following:
  • It stopped on a zip file
  • The zip file has some PDF files in it
  • coldfusion-error.log has the following message:
Feb 25, 2015 8:30:28 AM org.apache.catalina.core.StandardWrapperValve invoke
SEVERE: Servlet.service() for servlet [CfmServlet] in context with path [/] threw exception [ROOT CAUSE: 
java.lang.NoClassDefFoundError: org/apache/pdfbox/pdmodel/PDDocument
 at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:53)
 at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:120)
 at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:101)
 at org.apache.tika.parser.DelegatingParser.parse(DelegatingParser.java:52)
 at org.apache.tika.parser.pkg.PackageParser.parseArchive(PackageParser.java:78)
 at org.apache.tika.parser.pkg.ZipParser.parse(ZipParser.java:49)
 at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:120)
 at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:101)
 at coldfusion.tagext.search.SolrUtils.getMetadata(SolrUtils.java:599)
 at coldfusion.tagext.search.SolrUtils.getSolrDocument(SolrUtils.java:753)
 at coldfusion.tagext.search.SolrUtils.addDocument(SolrUtils.java:1339)
 at coldfusion.tagext.search.IndexTag.doUpdate(IndexTag.java:651)
 at coldfusion.tagext.search.IndexTag.doStartTag(IndexTag.java:340)

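So our ColdFusion distribution is evidently missing a library: Tika's PDFParser needs the PDFBox class org.apache.pdfbox.pdmodel.PDDocument, and that class cannot be found.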
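
To confirm this kind of diagnosis outside of the indexing run, a small Java check can ask the class loader whether the class named in the stack trace can be found. This is only a sketch: PdfBoxClassCheck and isLoadable are names I made up, and the result is only meaningful when the program runs with the same classpath ColdFusion uses (the jars in its lib folder, wherever that is on your install).

```java
public class PdfBoxClassCheck {

    /** Returns true if the named class can be located by this class's class loader. */
    public static boolean isLoadable(String className) {
        try {
            // initialize=false: only locate the class, do not run its static initializers
            Class.forName(className, false, PdfBoxClassCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            // Not on the classpath, or present but not linkable
            return false;
        }
    }

    public static void main(String[] args) {
        // Default to the class named in the stack trace; another name can be passed as an argument
        String name = args.length > 0 ? args[0] : "org.apache.pdfbox.pdmodel.PDDocument";
        System.out.println(name + (isLoadable(name) ? " is available" : " is MISSING"));
    }
}
```

Running it before and after adding a jar gives a quick yes/no answer without having to re-run the whole index.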

Solution
Short answer: find the jar file for PDFBox, drop it into the ColdFusion lib folder, and restart ColdFusion. The jar file I used: pdfbox-0.8.0-incubating.jar
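
Before and after copying the jar, it can help to check which jars in the lib folder, if any, actually contain the missing class. The following Java sketch does that by looking for the matching entry inside each jar. JarClassFinder and findJarsContaining are hypothetical names, and the lib folder path is something you pass in, since it depends on the installation.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.jar.JarFile;

public class JarClassFinder {

    /** Lists the jars directly under libDir that contain the given class. */
    public static List<Path> findJarsContaining(Path libDir, String className) throws IOException {
        // A class maps to a jar entry: dots become slashes, plus ".class"
        String entry = className.replace('.', '/') + ".class";
        List<Path> hits = new ArrayList<>();
        try (DirectoryStream<Path> jars = Files.newDirectoryStream(libDir, "*.jar")) {
            for (Path jar : jars) {
                try (JarFile jarFile = new JarFile(jar.toFile())) {
                    if (jarFile.getEntry(entry) != null) {
                        hits.add(jar);
                    }
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 1) {
            System.err.println("usage: java JarClassFinder <lib-folder> [class-name]");
            return;
        }
        String name = args.length > 1 ? args[1] : "org.apache.pdfbox.pdmodel.PDDocument";
        List<Path> hits = findJarsContaining(Paths.get(args[0]), name);
        System.out.println(hits.isEmpty() ? "No jar contains " + name : name + " found in " + hits);
    }
}
```

An empty result before the copy matches the NoClassDefFoundError in the log. More than one hit afterwards would suggest conflicting versions of PDFBox in the same folder.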

Long answer: as with many open source projects, there is not much consideration for backward compatibility or for an officially supported bundled distribution. I tried downloading the latest version of PDFBox, and it simply did not work. So I needed to find the originally bundled version, and here is the journey (minus the detours I took :( ) to the right jar file:

<dependency>
  <groupId>org.apache.pdfbox</groupId>
  <artifactId>pdfbox</artifactId>
  <version>0.8.0-incubating</version>
</dependency>

  • Google "pdfbox 0.8.0 incubating"
  • voila


Another Challenge (unsolved)
There is still an unsolved challenge with Solr. Verity indexes our PDF files correctly, but Solr's PDF reader seems to be sub-par: it extracts only fragmented text from our PDF files and misses a lot of the keywords in them.
