Archive for April, 2007

Solr revisited

April 23, 2007

Pretty much everything I wrote in my previous post about Solr is now obsolete. Up until last Sunday evening I had Solr running with Cocoon. However, I had all sorts of problems with Cocoon, some stemming from my complete inability to go back to using XSLT 1.0 (which I needed to do in order to take advantage of daisy-chaining), and some stemming from bad (non-HTML) characters in our metadata, most likely pasted in from Word documents.

While I was struggling with Cocoon, this conversation was happening on the eXist listserv, which reminded me that I could use the eXist doc() function to send Solr requests and transform the resulting response. I blame overwork for the time I wasted with Cocoon, given that I already use this function to retrieve XSL stylesheets for transformations in nearly every XQuery that I write.

So now my requests to Solr are sent via an xquery that looks like this:

xquery version "1.0";

declare namespace util="http://exist-db.org/xquery/util";
declare namespace request="http://exist-db.org/xquery/request";
declare namespace x="http://exist.sourceforge.net/dc-ext";
declare namespace xlink="http://www.w3.org/1999/xlink";
declare namespace xslt="http://exist-db.org/xquery/transform";
declare namespace bh="http://cdi.uvm.edu/cdi/ns";

(: Fields for limiting search :)
declare variable $field1 {request:request-parameter('field1', 'ft')};
declare variable $field2 {request:request-parameter('field2', 'ft')};
declare variable $field3 {request:request-parameter('field3', 'ft')};

(: Search terms :)
declare variable $term1 {replace(request:request-parameter('term1', ''), "'", '"')};
declare variable $term2 {replace(request:request-parameter('term2', ''), "'", '"')};
declare variable $term3 {replace(request:request-parameter('term3', ''), "'", '"')};

(: Boolean operators :)
declare variable $bool1 {request:request-parameter('bool1', 'and')};
declare variable $bool2 {request:request-parameter('bool2', 'and')};

(: Variables for paging through results :)
declare variable $start {request:request-parameter('start', 0) cast as xs:integer};
declare variable $rows {request:request-parameter('rows', 25) cast as xs:integer};

(: Filters applied to search results :)
declare variable $filter {request:request-parameter('filter', '')};

(: Applies correct Solr field for fielded searching :)
declare function bh:field($field as xs:string) as xs:string {
   if ($field = "au") then
      "creator:"
   else if ($field = "ti") then
      "title:"
   else if ($field = "ab") then
      "abstract_text:"
   else if ($field = "su") then
      "topic_text:"
    else ''
};
(: Builds query parameters as a string :)
declare function bh:build-query() as xs:string {
let $queryString :=
   concat(
        if ($term1 != '') then
          concat(bh:field($field1), $term1)
        else '',
        if ($term2 != '') then
           concat(
            if ($bool1 = 'and' and $term1 != '') then ' AND '
            else if ($bool1 = 'or' and $term1 != '') then ' OR '
            else if ($bool1 = 'not' and $term1 != '') then ' NOT '
            else ' ', bh:field($field2), $term2)
        else '',
        (: The "or" tests are parenthesized because "and" binds more tightly than "or" :)
        if ($term3 != '') then
           concat(
             if ($bool2 = 'and' and ($term1 != '' or $term2 != '')) then ' AND '
             else if ($bool2 = 'or' and ($term1 != '' or $term2 != '')) then ' OR '
             else if ($bool2 = 'not' and ($term1 != '' or $term2 != '')) then ' NOT '
             else ' ', bh:field($field3), $term3)
        else '',
        if ($term1 = '' and $term2 = '' and $term3 = '') then
           concat('/no-search-terms', ' ')
        else '' )
  return encode-for-uri($queryString)
};
declare function bh:filter(){
 if($filter != '') then
    encode-for-uri(concat(' ',translate($filter,';',' ')))
 else ''
};
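(: Builds the full Solr request URL. Each piece is its own string literal so that
   no line breaks end up in the URL, and "&" is written as "&amp;" as XQuery
   string literals require. :)
declare function bh:fullQuery() {
let $searchPath :=
    concat('http://pathtoSolr/solr/select/?q=', bh:build-query(), bh:filter(),
    '&amp;version=2.2&amp;start=', $start, '&amp;rows=', $rows,
    '&amp;facet=true&amp;facet.limit=-1',
    '&amp;facet.sort=true&amp;facet.zeros=false&amp;facet.field=parent_facet&amp;facet.mincount=1',
    '&amp;facet.field=creator_facet&amp;facet.mincount=1&amp;facet.field=coverage_facet&amp;facet.mincount=1',
    '&amp;facet.field=genre_facet&amp;facet.mincount=1&amp;facet.field=topic_facet&amp;facet.mincount=2')
return $searchPath
};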
(: Stylesheet used for display :)
let $xsl := doc('/path/search.xsl')
(: Format results :)
let $results :=
<query-results term1="{$term1}" field1="{$field1}" bool1="{$bool1}"
   term2="{$term2}" field2="{$field2}" bool2="{$bool2}"
   term3="{$term3}" field3="{$field3}" filter="{bh:filter()}">
      {
         if ((exists($term1) and $term1 = '') and (exists($term2) and $term2 = '')
           and (exists($term3) and $term3 = '')) then
             <response hits="0">Your search returned 0 results</response>
         else doc(bh:fullQuery())/child::*
      }
</query-results>
return xslt:stream-transform($results, $xsl, ())

The results are transformed using XSLT (2.0).
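
The branching in bh:build-query() repeats the same operator test for each term. Here is a minimal sketch of how that logic could be factored into one helper. The function names local:operator and local:clause are hypothetical and not part of the query above, URI encoding is left out, and unlike the original no leading space is added when there is no earlier clause:

```xquery
xquery version "1.0";

(: Hypothetical helper: maps a form value to a Solr boolean operator.
   Any unrecognised value falls back to a plain space. :)
declare function local:operator($bool as xs:string) as xs:string {
    if ($bool = 'and') then ' AND '
    else if ($bool = 'or') then ' OR '
    else if ($bool = 'not') then ' NOT '
    else ' '
};

(: Hypothetical helper: builds one "field:term" clause and prefixes it
   with an operator only when an earlier clause exists. :)
declare function local:clause(
    $prefix as xs:string,
    $term as xs:string,
    $bool as xs:string,
    $hasPrevious as xs:boolean
) as xs:string {
    if ($term = '') then ''
    else concat(
        (if ($hasPrevious) then local:operator($bool) else ''),
        $prefix, $term)
};

(: Made-up sample input: the second term is empty, so it contributes nothing. :)
concat(
    local:clause('title:', 'maple', '', false()),
    local:clause('creator:', '', 'and', true()),
    local:clause('topic_text:', 'sugar', 'not', true())
)
```

With the sample values shown, this evaluates to the string title:maple NOT topic_text:sugar.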

I find this works pretty well, but I’m also very interested in exploring this new HTTP extension model which is pretty much what I was hoping for back when I started exploring the Solr/eXist combination. (Which just demonstrates once again what a great community of developers eXist has.)

Documents are still added to the index using a combination of XQuery and XForms. Next week I’ll be refining our editor to make submitting completed records to the index a one (maybe two) button process. I’m pretty pleased with Solr and have gotten a very positive response to the browsing and limiting features. I still have some features to work on. For example, while my users can add filters to search results, they cannot remove them. This seems like a pretty easy JavaScript fix, but I haven’t had the time to implement it yet.
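
As an alternative to doing it in JavaScript, removing a filter could also be handled on the server. This is only a sketch: it assumes the filter parameter is a ';'-separated list (which is how bh:filter() treats it), the function name local:remove-filter is hypothetical, and the filter values are made up:

```xquery
xquery version "1.0";

(: Hypothetical helper: drops one entry from a ';'-separated filter string. :)
declare function local:remove-filter(
    $filters as xs:string,
    $remove as xs:string
) as xs:string {
    string-join(
        (for $f in tokenize($filters, ';')
         where $f != '' and $f != $remove
         return $f),
        ';')
};

(: Made-up filter values using two of the facet fields from the query above. :)
local:remove-filter('creator_facet:"A";genre_facet:"B"', 'creator_facet:"A"')
```

For the sample input this returns genre_facet:"B". The result could then be passed back as the filter parameter of a "remove" link next to each active filter.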


XForms for Code4lib update

April 20, 2007

Check out the XForms wiki. I posted an example of a DC XForm, and Parmit has posted an example of the MODS editor that they are working on. I have the DC form in a few flavors (XHTML, XSL, and XQuery) but have only posted the XHTML version. The XSL and XQuery versions are pretty easy to derive from the XHTML version, but I could post those as well if there is interest. The Princeton MODS editor uses Orbeon and is still a work in progress. It looks great though, so check it out.

I’m hoping to have some time in the next few weeks to get back to working on XForms. I’ll try to post my work to the wiki, so check back every once in a while.

Center for Digital Initiatives: Virtual Tour

April 19, 2007

Check out our new office space (to see how far the space has come, here are some earlier pictures), visit our website [http://cdi.uvm.edu/], and sign our virtual guest book.

The new home for the Center for Digital Initiatives at UVM. This is room 313 in the Bailey/Howe Library, three floors up and tucked away in the stacks. I believe we are in the horticulture section. We are still waiting for a sign for the door, and hopefully a few signs elsewhere in the library to help people find us. But if you manage to make it to the third floor, just go all the way to the corner farthest from the front door and you should find this:

The seating area includes a data port for visitors with laptops.

CDI seating area

Looking down the hallway you can see the doorway to the scanning room on the right, my office on the left and the new conference room all the way at the back. Right above the door to the conference room is a wireless router, which means I have a really great (strong, and consistent) signal in my office.

Looking into the scanning room, you can also see some of the photographs we are using. The one on the right is from the Tennie Toussaint collection, and is available on the website.

The scanning room is a light-controlled environment designed to maximize color accuracy. Color-neutral, daylight-balanced lights are on dimmer switches, giving the technicians the low-light environment needed for evaluating color accuracy.

The conference room:

Some collection highlights:

We have six collections, most of them related to congressional papers, with items ranging in date from 1818 to 2004 and covering topics such as milk, slavery, and the maple sugar industry. In addition to the congressional papers, we have a collection of Vermont historical photographs that has some real gems.

Also of interest are some of the new features, such as browsing within a collection and faceted searching (using the “Narrow your search” options).

Don’t forget to sign our virtual guest book.

Launched

April 17, 2007

UVM Libraries Center for Digital Initiatives: http://cdi.uvm.edu/

We are officially live. There was a press conference yesterday to launch the site (you can read the press release here), and there will be additional events during the week, including open houses on Thursday, April 19th and Friday, April 20th from 1 to 3 PM in Bailey/Howe’s Room 313. We will be providing refreshments and raffling off an iPod nano at the open house, so stop by.

I will also be hosting a virtual open house here on Thursday with pictures of the center, and some highlights from the online collections.

Solr, finally

April 4, 2007

It took me about 3 weeks from the Solr preconference event at code4lib, but I finally have Solr running semi-smoothly with my web application using Cocoon. I didn’t expect it to take so long, but most of that time was spent learning how to use Cocoon (and trying to learn Java). Ideally I would like to have my XQueries send POST and GET requests to Solr, which can be done using Java. However, the Java solution has a much steeper learning curve than the Cocoon solution that I currently have in place. Because the release is only two weeks away, I’m sticking with Cocoon for now, with an eventual move to a Java/XQuery solution. Here is what my setup currently looks like:

1) A Solr instance on port 8983, with my website running on port 80 on the same machine. Port 8983 is firewalled so no one can come along and wipe out my index with a delete request.

2) An xquery that pulls data from my METS records for indexing, either a single record or multiple records, depending on the parameters. Using an XSL stylesheet I generate an XForm (with the xquery results as the instance data section of the form). This form then uses POST to send the data to the Solr index. A second button on the form sends a commit command to Solr.

3) A Cocoon pipeline that sends GET requests to Solr and transforms the response using XSL. This feature took me a depressingly long time to figure out, in spite of the fact that I found this thread pretty early on.
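
To illustrate step 2, here is a minimal sketch of the kind of XML update message that Solr's update handler (/solr/update) accepts. The inline record and the id field are assumptions made for the example; in the real setup the data comes from METS records and the field names must match the Solr schema:

```xquery
xquery version "1.0";

(: Hypothetical record standing in for data pulled from a METS document. :)
let $record :=
    <record>
        <id>example-001</id>
        <title>Example title</title>
        <creator>Example creator</creator>
    </record>
return
    (: Solr update message: one <doc> per record, one <field> per indexed value. :)
    <add>
        <doc>
            <field name="id">{ string($record/id) }</field>
            <field name="title">{ string($record/title) }</field>
            <field name="creator">{ string($record/creator) }</field>
        </doc>
    </add>
```

The second button on the form then only needs to POST a message consisting of an empty commit element to the same handler, which makes the added documents searchable.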

One of the problems that I was running into was that I had changed my XSLT transformer from Xalan to Saxon (so I could use XSL 2.0). Saxon does not allow daisy-chaining (pulling results from one pipeline through another pipeline, or applying multiple transformations). I adjusted my cocoon.xconf and sitemap.xmap to use Xalan as an additional transformer and only call it when using the pipeline below.

The pipeline for handling search requests looks like this:

<map:match pattern="search">
   <map:generate type="request">
      <map:parameter name="generate-attributes" value="true"/>
   </map:generate>
   <map:transform type="xslt-xsltc" src="solr.xsl">
      <map:parameter name="use-request-parameters" value="true"/>
   </map:transform>
   <map:transform type="cinclude" />
   <map:transform type="xslt-xsltc" src="searchResults.xsl" />
   <map:serialize type="xml"/>
</map:match>

solr.xsl transforms the parameters sent from the search form into Solr-style parameters. The cinclude is passed from solr.xsl to Solr as a GET request (you can also use cincludes to POST data, but I found it more difficult than posting from the XForm). The final XSL stylesheet transforms the results into something attractive for the user.

Here is what my solr.xsl looks like:

<xsl:stylesheet xmlns:h="http:cocoon.apache.org/h"
   xmlns:cinclude="http://cocoon.apach.org/"
   xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
   xmlns:xs="http://www.w3.org/2001/XMLSchema"
   xmlns="http://www.w3.org/1999/xhtml" version="1.0">

<xsl:strip-space elements="*"/>
<xsl:output media-type="text/xml" method="xml"/>
<xsl:param name="term1"/>
<xsl:param name="field1"/>
<xsl:param name="term2"/>
<xsl:param name="field2"/>
<xsl:param name="term3"/>
<xsl:param name="field3"/>
<xsl:param name="bool1"/>
<xsl:param name="bool2"/>
<xsl:param name="start"/>
<xsl:param name="rows"/>
<xsl:param name="indent"/>
<xsl:template match="/">
   <xsl:variable name="param1">
      <xsl:choose>
         <xsl:when test="string-length(normalize-space($term1)) > 1">
            <xsl:choose>
               <xsl:when test="$field1 = 'kw'">
                  <xsl:value-of select="$term1"/></xsl:when>
               <xsl:when test="$field1 = 'ti'">
                  <xsl:value-of select="concat('title:','(',$term1,')')"/></xsl:when>
               <xsl:when test="$field1 = 'au'">
                  <xsl:value-of select="concat('creator:','(',$term1,')')"/></xsl:when>
               <xsl:when test="$field1 = 'su'">
                  <xsl:value-of select="concat('subject:','(',$term1,')')"/></xsl:when>
               <xsl:when test="$field1 = 'ab'">
                  <xsl:value-of select="concat('text:','(',$term1,')')"/></xsl:when>
               <xsl:otherwise><xsl:value-of select="$term1"/></xsl:otherwise>
            </xsl:choose>
         </xsl:when>
      </xsl:choose>
   </xsl:variable>
   <xsl:variable name="param2">
	<!-- same as param 1 using field2 and term2 -->
   </xsl:variable>
   <xsl:variable name="param3">
      <!-- same as param 1 using field3 and term3 -->
   </xsl:variable>
   <xsl:variable name="boolean1">
      <xsl:choose>
        <xsl:when test="string-length(normalize-space($term2)) > 1">
         <xsl:choose>
          <xsl:when test="$bool1 = 'and'"> AND </xsl:when>
          <xsl:when test="$bool1 = 'or'"> OR </xsl:when>
          <xsl:when test="$bool1 = 'not'"> NOT </xsl:when>
          <xsl:otherwise> AND </xsl:otherwise>
         </xsl:choose>
       </xsl:when>
       <xsl:otherwise> </xsl:otherwise>
     </xsl:choose>
   </xsl:variable>
<xsl:variable name="boolean2">
 <!-- same as boolean1 -->
</xsl:variable>
<!-- pulling all the params together-->
<xsl:variable name="params">
<xsl:value-of select="concat($param1,' ',$boolean1,' ',$param2,' ',$boolean2,' ',$param3)"/>
</xsl:variable>
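   <!-- Variables are inserted with attribute value templates ({$name});
        ampersands are escaped as &amp;amp; so the stylesheet stays well-formed -->
   <ci:include
      xmlns:ci="http://apache.org/cocoon/include/1.0"
      src="http://localhost:8983/solr/select/?q={$params}&amp;version=2.2&amp;start={$start}&amp;rows={$rows}&amp;indent={$indent}"/>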
</xsl:template>
</xsl:stylesheet>

For other approaches using cocoon check out SolrForrest, flowscripts, or try using the webdav module to talk to REST interfaces.

Resources:

Solr

Cocoon

The XForms vs. Ruby Debate

April 3, 2007

I had read Adriaan de Jonge’s post on XForms vs. Ruby on Rails a few months ago, and found it interesting but without much impact on what I was doing with XForms. For me XForms were solving the solution of the creation and management of complicated xml records. Curt Kagel’s article from last week enticed me to go back and read both articles again. While I am interested in exploring Ruby, I’m doubtful that I will be giving up on XForms any time soon. In spite of the difficulties in using XForms I find them very well suited for the kind of data management that I’m doing and they tie in very well with my XML based system. However, I think the two articles are worth a read for anyone looking at either Ruby or XForms.