Environment
ColdFusion 10 Update 15
Windows Server 2012
Symptom
When indexing a batch of files, ColdFusion stopped partway through without throwing an exception or entering any error state. It just stopped in the middle of indexing. If I had not been watching closely, I would not have noticed that it had failed.
That ColdFusion halted execution without making any fuss was a big surprise to me. Even when I ran it in a browser, there was no usual 500 server error. As far as ColdFusion was concerned, everything was just hunky-dory!?
Cause
After some digging, I found out the following:
- It stopped on a zip file
- The zip file has some PDF files in it
- coldfusion-error.log has the following message:
Feb 25, 2015 8:30:28 AM org.apache.catalina.core.StandardWrapperValve invoke
SEVERE: Servlet.service() for servlet [CfmServlet] in context with path [/] threw exception [ROOT CAUSE:
java.lang.NoClassDefFoundError: org/apache/pdfbox/pdmodel/PDDocument
    at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:53)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:120)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:101)
    at org.apache.tika.parser.DelegatingParser.parse(DelegatingParser.java:52)
    at org.apache.tika.parser.pkg.PackageParser.parseArchive(PackageParser.java:78)
    at org.apache.tika.parser.pkg.ZipParser.parse(ZipParser.java:49)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:120)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:101)
    at coldfusion.tagext.search.SolrUtils.getMetadata(SolrUtils.java:599)
    at coldfusion.tagext.search.SolrUtils.getSolrDocument(SolrUtils.java:753)
    at coldfusion.tagext.search.SolrUtils.addDocument(SolrUtils.java:1339)
    at coldfusion.tagext.search.IndexTag.doUpdate(IndexTag.java:651)
    at coldfusion.tagext.search.IndexTag.doStartTag(IndexTag.java:340)
So, evidently, our ColdFusion installation is missing the PDFBox library that Tika's PDF parser depends on.
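One way to double-check this conclusion is to look inside every jar in the ColdFusion lib folder for the class named in the error. The following is a minimal Python sketch of that check, not something from the original investigation; the function name jars_providing is made up for illustration, and it only looks at jars directly inside the folder you give it.

```python
import zipfile
from pathlib import Path


def jars_providing(lib_dir, class_name):
    """Return the names of jars directly under lib_dir that contain class_name.

    class_name may use the slash form from the NoClassDefFoundError message
    (org/apache/pdfbox/pdmodel/PDDocument) or the dotted form.
    """
    # A jar is a zip archive; classes are stored as path/to/Name.class entries.
    entry = class_name.replace(".", "/") + ".class"
    matches = []
    for jar in sorted(Path(lib_dir).glob("*.jar")):
        try:
            with zipfile.ZipFile(jar) as archive:
                if entry in archive.namelist():
                    matches.append(jar.name)
        except zipfile.BadZipFile:
            # Skip files that are not readable zip archives.
            continue
    return matches
```

Calling jars_providing with your own lib folder (the exact path depends on your installation) and "org/apache/pdfbox/pdmodel/PDDocument" should return an empty list before the fix and the PDFBox jar name after it.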
Solution
Short answer: find the jar file for PDFBox, drop it into the ColdFusion lib folder, and restart ColdFusion. I got the jar file from here: pdfbox-0.8.0-incubating.jar
Long answer: As is common with open source projects, backward compatibility gets little consideration and there is no officially supported bundled distribution. I tried downloading the latest version of PDFBox, and it just did not work. So I needed to find the version that was originally bundled, and here is the journey (without the detours I took :( ) to the right jar file:
- Search the CF10 folder for the Tika jar file; found tika-parsers-0.6.jar
- Google the Tika source; found its SVN root
- Find the Maven POM file under the 0.6 tag: http://svn.apache.org/repos/asf/tika/tags/0.6/tika-parsers/pom.xml
- Locate the reference to PDFBox:
<dependency>
  <groupId>org.apache.pdfbox</groupId>
  <artifactId>pdfbox</artifactId>
  <version>0.8.0-incubating</version>
</dependency>
- Google "pdfbox 0.8.0 incubating"
- Voilà
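The POM lookup in the steps above can also be scripted once the file has been saved locally. This is a rough Python sketch under a few assumptions: the helper names local_name and dependency_version are invented for illustration, and the version is assumed to be written literally in the POM (a Maven property placeholder such as ${...} would be returned unresolved).

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip an XML namespace prefix such as '{http://...}' from a tag."""
    return tag.rsplit("}", 1)[-1]


def dependency_version(pom_text, group_id, artifact_id):
    """Return the <version> text of the matching <dependency>, or None."""
    root = ET.fromstring(pom_text)
    for element in root.iter():
        if local_name(element.tag) != "dependency":
            continue
        # Collect the direct children (groupId, artifactId, version, ...).
        fields = {
            local_name(child.tag): (child.text or "").strip()
            for child in element
        }
        if (fields.get("groupId") == group_id
                and fields.get("artifactId") == artifact_id):
            return fields.get("version")
    return None
```

Given the text of the tika-parsers POM, dependency_version(pom_text, "org.apache.pdfbox", "pdfbox") returns the version string to search for. Matching on local tag names keeps the sketch working whether or not the POM declares the Maven namespace.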
Another Challenge (unsolved)
There is still an unsolved challenge with Solr. For example, Verity indexes our PDF files correctly, but Solr's PDF reader seems to be sub-par: it extracts only fragmented text from our PDF files and misses a lot of the keywords in them.