Wednesday, February 25, 2015

Coldfusion 10 Solr Indexing Zip File that Contains PDF files

Seems like I should be surprised, if I don't find some surprises in Coldfusion every week. :) Here is another one that took me a few hours to find a solution. And hopefully will save a few hours for someone else.

Environment
Coldfusion 10 Update 15
Windows Server 2012

Symptom
When indexing a bunch of files, Coldfusion stopped indexing without any exception or getting into any error state. It just stopped in the middle of indexing. If I was not looking at it closely, I would not have noticed that it has failed.

Again, Coldfusion stopped  the execution without throwing a fuss is a big surprise for me. If I run it in Brower, there is no usual 500 server error. Everything is just hunky-dory as far as Coldfusion is concerned!?

Cause
After some digging, I found out the following
  • It stopped on a zip file
  • The zip file has some PDF files in it
  • Coldfusion-error.log has the following message
Feb 25, 2015 8:30:28 AM org.apache.catalina.core.StandardWrapperValve invoke
SEVERE: Servlet.service() for servlet [CfmServlet] in context with path [/] threw exception [ROOT CAUSE: 
java.lang.NoClassDefFoundError: org/apache/pdfbox/pdmodel/PDDocument
 at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:53)
 at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:120)
 at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:101)
 at org.apache.tika.parser.DelegatingParser.parse(DelegatingParser.java:52)
 at org.apache.tika.parser.pkg.PackageParser.parseArchive(PackageParser.java:78)
 at org.apache.tika.parser.pkg.ZipParser.parse(ZipParser.java:49)
 at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:120)
 at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:101)
 at coldfusion.tagext.search.SolrUtils.getMetadata(SolrUtils.java:599)
 at coldfusion.tagext.search.SolrUtils.getSolrDocument(SolrUtils.java:753)
 at coldfusion.tagext.search.SolrUtils.addDocument(SolrUtils.java:1339)
 at coldfusion.tagext.search.IndexTag.doUpdate(IndexTag.java:651)
 at coldfusion.tagext.search.IndexTag.doStartTag(IndexTag.java:340)

So, obviously, our Coldfusion distribution is missing some libraries.

Solution
Short answer: find jar file for PDFBox, throw them under Coldfusion lib folder and restart Coldfusion. And I got the jar file from here: pdfbox-0.8.0-incubating.jar

Long Answer: However, as with all Open Source projects, there is not much consideration of backward compatibility or official supported bundled distribution. I tried to download latest version of PDFBox, and it just does not work. So, I will need to find the original bundled version, and here is the journey (without detours I took :( ) to the right jar file

<dependency>
  <groupId>org.apache.pdfbox</groupId>
  <artifactId>pdfbox</artifactId>
  <version>0.8.0-incubating</version>
</dependency>

  • Google "pdfbox 0.8.0 incubating"
  • voila


Another Challenge (unsolved)
There is still some unsolved challenge for Solr. For example, Verity can index our PDF files correctly, but Solr's PDF reader seem to be sub-par. It only got some fragmented text from our PDF file, and it's missing a lot of keywords in our PDF files.

Monday, February 23, 2015

CFScript bug?

An extra semicolon at the end of "if" block is causing a lot of head scratching for me recently. Please see the code snippet below. Expected output should be:
Figure 1. Expected Output
However, below is the actual output:
Figure 2. Actual Output
Now, please notice this extra ";" at the end of line 6. If I remove it, everything will just work as expected.
<cfscript>
    private struct function test() {
        if(1 == 1) {
            if(1 == 0) {
                writeOutput("1==0");
            }; //<- look at here
            writeOutput("true");
            return {data = 1};
        }
        writeOutput("false");
        return {data = 2};
    }
    writeDump(test());
</cfscript>
Due to the lack of specifications for CFScript language, I cannot tell if this grammar is even allowed. But I can tell you this: the journey of discovering and finding the root cause of this problem is not fun at all!

Symptom

CFScript fall through the "if-else" statement, and did not return to caller from the expected branch.

Cause

An extra semicolon at the end of a block is the root cause.