Tips for software engineer: February 2015

Wednesday, February 25, 2015

Coldfusion 10 Solr Indexing Zip File that Contains PDF files

Seems like I should be surprised, if I don't find some surprises in Coldfusion every week. :) Here is another one that took me a few hours to find a solution. And hopefully will save a few hours for someone else.

Environment
Coldfusion 10 Update 15
Windows Server 2012

Symptom
When indexing a bunch of files, Coldfusion stopped indexing without any exception or getting into any error state. It just stopped in the middle of indexing. If I was not looking at it closely, I would not have noticed that it has failed.

Again, Coldfusion stopped the execution without throwing a fuss is a big surprise for me. If I run it in Brower, there is no usual 500 server error. Everything is just hunky-dory as far as Coldfusion is concerned!?

Cause
After some digging, I found out the following

It stopped on a zip file
The zip file has some PDF files in it
Coldfusion-error.log has the following message

Feb 25, 2015 8:30:28 AM org.apache.catalina.core.StandardWrapperValve invoke
SEVERE: Servlet.service() for servlet [CfmServlet] in context with path [/] threw exception [ROOT CAUSE: 
java.lang.NoClassDefFoundError: org/apache/pdfbox/pdmodel/PDDocument
 at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:53)
 at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:120)
 at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:101)
 at org.apache.tika.parser.DelegatingParser.parse(DelegatingParser.java:52)
 at org.apache.tika.parser.pkg.PackageParser.parseArchive(PackageParser.java:78)
 at org.apache.tika.parser.pkg.ZipParser.parse(ZipParser.java:49)
 at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:120)
 at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:101)
 at coldfusion.tagext.search.SolrUtils.getMetadata(SolrUtils.java:599)
 at coldfusion.tagext.search.SolrUtils.getSolrDocument(SolrUtils.java:753)
 at coldfusion.tagext.search.SolrUtils.addDocument(SolrUtils.java:1339)
 at coldfusion.tagext.search.IndexTag.doUpdate(IndexTag.java:651)
 at coldfusion.tagext.search.IndexTag.doStartTag(IndexTag.java:340)

So, obviously, our Coldfusion distribution is missing some libraries.

Solution
Short answer: find jar file for PDFBox, throw them under Coldfusion lib folder and restart Coldfusion. And I got the jar file from here: pdfbox-0.8.0-incubating.jar

Long Answer: However, as with all Open Source projects, there is not much consideration of backward compatibility or official supported bundled distribution. I tried to download latest version of PDFBox, and it just does not work. So, I will need to find the original bundled version, and here is the journey (without detours I took :( ) to the right jar file

Search CF10 folder for tika jar file, found tika-parsers-0.6.jar
Google Tika source, found it SVN root
Find maven POM file under 0.6 tag: http://svn.apache.org/repos/asf/tika/tags/0.6/tika-parsers/pom.xml
Located reference of PDFBox

<dependency>
  <groupId>org.apache.pdfbox</groupId>
  <artifactId>pdfbox</artifactId>
  <version>0.8.0-incubating</version>
</dependency>

Google "pdfbox 0.8.0 incubating"
voila

Another Challenge (unsolved)
There is still some unsolved challenge for Solr. For example, Verity can index our PDF files correctly, but Solr's PDF reader seem to be sub-par. It only got some fragmented text from our PDF file, and it's missing a lot of keywords in our PDF files.

Monday, February 23, 2015

CFScript bug?

An extra semicolon at the end of "if" block is causing a lot of head scratching for me recently. Please see the code snippet below. Expected output should be:

Figure 1. Expected Output

However, below is the actual output:

Figure 2. Actual Output

Now, please notice this extra ";" at the end of line 6. If I remove it, everything will just work as expected.

<cfscript>
    private struct function test() {
        if(1 == 1) {
            if(1 == 0) {
                writeOutput("1==0");
            }; //<- look at here
            writeOutput("true");
            return {data = 1};
        }
        writeOutput("false");
        return {data = 2};
    }
    writeDump(test());
</cfscript>

Due to the lack of specifications for CFScript language, I cannot tell if this grammar is even allowed. But I can tell you this: the journey of discovering and finding the root cause of this problem is not fun at all!

Symptom

CFScript fall through the "if-else" statement, and did not return to caller from the expected branch.

Cause

An extra semicolon at the end of a block is the root cause.