Tips for software engineer: Coldfusion 10 Solr Indexing Zip File that Contains PDF files

It seems I should be surprised if I don't find some surprise in Coldfusion every week. :) Here is another one that took me a few hours to solve. Hopefully this write-up will save someone else a few hours.

Environment
Coldfusion 10 Update 15
Windows Server 2012

Symptom
While indexing a batch of files, Coldfusion stopped without throwing an exception or entering any error state. It simply stopped partway through. If I had not been watching closely, I would not have noticed that it had failed.

Coldfusion halting execution without making a fuss was a big surprise to me. When I ran the page in a browser, there was none of the usual 500 server error. Everything was just hunky-dory as far as Coldfusion was concerned!?

Cause
After some digging, I found the following:

It stopped on a zip file
The zip file has some PDF files in it
coldfusion-error.log has the following message:

Feb 25, 2015 8:30:28 AM org.apache.catalina.core.StandardWrapperValve invoke
SEVERE: Servlet.service() for servlet [CfmServlet] in context with path [/] threw exception [ROOT CAUSE: 
java.lang.NoClassDefFoundError: org/apache/pdfbox/pdmodel/PDDocument
 at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:53)
 at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:120)
 at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:101)
 at org.apache.tika.parser.DelegatingParser.parse(DelegatingParser.java:52)
 at org.apache.tika.parser.pkg.PackageParser.parseArchive(PackageParser.java:78)
 at org.apache.tika.parser.pkg.ZipParser.parse(ZipParser.java:49)
 at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:120)
 at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:101)
 at coldfusion.tagext.search.SolrUtils.getMetadata(SolrUtils.java:599)
 at coldfusion.tagext.search.SolrUtils.getSolrDocument(SolrUtils.java:753)
 at coldfusion.tagext.search.SolrUtils.addDocument(SolrUtils.java:1339)
 at coldfusion.tagext.search.IndexTag.doUpdate(IndexTag.java:651)
 at coldfusion.tagext.search.IndexTag.doStartTag(IndexTag.java:340)

So, evidently, our Coldfusion distribution is missing a library: Tika's PDFParser cannot load the PDFBox class org.apache.pdfbox.pdmodel.PDDocument.
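
Since the failure only shows up for zip files with PDF files inside, it can help to know which archives in the folder being indexed would go down the same zip-then-PDF parsing path. Here is a minimal Python sketch of that check, assuming the files sit on a local disk. The function name zips_containing_pdfs and the folder you pass in are my own placeholders, not anything Coldfusion provides.

```python
import zipfile
from pathlib import Path


def zips_containing_pdfs(root):
    """Map each .zip under root to the PDF entries found inside it."""
    hits = {}
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file() or path.suffix.lower() != ".zip":
            continue
        try:
            with zipfile.ZipFile(path) as archive:
                # Match on the entry name only; the content is not inspected.
                pdfs = [name for name in archive.namelist()
                        if name.lower().endswith(".pdf")]
        except zipfile.BadZipFile:
            continue  # has a .zip extension but is not a readable archive
        if pdfs:
            hits[str(path)] = pdfs
    return hits
```

Calling zips_containing_pdfs on the folder you index returns the archives worth re-testing after any fix. It only looks at file extensions, so a PDF stored under another extension would be missed.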

Solution
Short answer: find the PDFBox jar file, drop it into the Coldfusion lib folder, and restart Coldfusion. The jar file I used: pdfbox-0.8.0-incubating.jar
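
Before restarting, it is cheap to confirm that some jar in the lib folder really provides the class named in the stack trace. A jar is just a zip file, so a rough Python sketch can look for the class entry directly. The function name jars_providing is hypothetical, and the lib folder path is whatever your installation uses.

```python
import zipfile
from pathlib import Path

# Class from the NoClassDefFoundError, written as a jar entry path.
MISSING_CLASS = "org/apache/pdfbox/pdmodel/PDDocument.class"


def jars_providing(lib_dir, class_entry=MISSING_CLASS):
    """Return the names of jars in lib_dir that contain class_entry."""
    providers = []
    for jar in sorted(Path(lib_dir).glob("*.jar")):
        try:
            with zipfile.ZipFile(jar) as archive:
                if class_entry in archive.namelist():
                    providers.append(jar.name)
        except zipfile.BadZipFile:
            continue  # skip corrupt or partially copied jars
    return providers
```

An empty list means the class is still missing. This only shows the class is present in a jar; it does not prove the jar's version is compatible with the bundled Tika.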

Long answer: as with all Open Source projects, there is not much consideration for backward compatibility or for an officially supported bundled distribution. I tried the latest version of PDFBox, and it simply did not work. So I needed to find the version that the bundled Tika was built against, and here is the journey (minus the detours I took :( ) to the right jar file:

Search the CF10 folder for the Tika jar file; found tika-parsers-0.6.jar
Google the Tika source; found its SVN root
Find the Maven POM file under the 0.6 tag: http://svn.apache.org/repos/asf/tika/tags/0.6/tika-parsers/pom.xml
Locate the reference to PDFBox:

<dependency>
  <groupId>org.apache.pdfbox</groupId>
  <artifactId>pdfbox</artifactId>
  <version>0.8.0-incubating</version>
</dependency>

Google "pdfbox 0.8.0 incubating"
voila
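
The POM lookup above can also be scripted. This is only a sketch, assuming the POM declares the standard Maven namespace and spells the version out literally, as in the fragment shown. The function name dependency_version and the trimmed SAMPLE_POM are mine, for illustration.

```python
import xml.etree.ElementTree as ET

# Standard Maven POM namespace (assumed to be declared by the POM).
POM_NS = {"m": "http://maven.apache.org/POM/4.0.0"}


def dependency_version(pom_xml, artifact_id):
    """Return the <version> of the first dependency with this artifactId."""
    root = ET.fromstring(pom_xml)
    for dep in root.iterfind(".//m:dependency", POM_NS):
        if dep.findtext("m:artifactId", namespaces=POM_NS) == artifact_id:
            return dep.findtext("m:version", namespaces=POM_NS)
    return None  # no such dependency in this POM


# Trimmed stand-in for the real POM, holding only the dependency quoted above.
SAMPLE_POM = """<project xmlns="http://maven.apache.org/POM/4.0.0">
  <dependencies>
    <dependency>
      <groupId>org.apache.pdfbox</groupId>
      <artifactId>pdfbox</artifactId>
      <version>0.8.0-incubating</version>
    </dependency>
  </dependencies>
</project>"""

print(dependency_version(SAMPLE_POM, "pdfbox"))  # prints 0.8.0-incubating
```

If a POM stores the version in a property placeholder instead, this returns the placeholder text, and you would have to resolve it by hand.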

Another Challenge (unsolved)
There is still an unsolved challenge with Solr. Verity indexes our PDF files correctly, but Solr's PDF reader seems to be sub-par: it extracts only fragmented text from our PDF files and misses a lot of the keywords in them.

Wednesday, February 25, 2015
