Archive for the ‘Progress reports’ Category

Quick update

January 8, 2008

I have a steadily growing list of things I need to get finished, but I finally managed to get around to adding RSS (actually Atom) feeds for tracking search results and for tracking collections. This will be particularly handy for a collection like the McAllister Photographs, which will be a work in progress for quite some time. (It is currently growing at a rate of about 50-100 photos a week, but I imagine this will slow down once the cataloging staff has caught up to our part-time scanning tech.)
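
Under the hood there isn’t much to a feed. Here is a minimal sketch of a collection feed as an XQuery, not our production code: the collection path, URLs, and entry count are placeholders, and a real feed would use each record’s own timestamp rather than current-dateTime().

declare namespace dc="http://purl.org/dc/elements/1.1/";

<feed xmlns="http://www.w3.org/2005/Atom">
   <title>McAllister Photographs: recent additions</title>
   <id>http://cdi.uvm.edu/collections/mcallister/feed</id>
   <updated>{current-dateTime()}</updated>
   {
   (: first ten records; a real feed would sort by date added :)
   for $dc in (collection('/db/mets')//dc:dc)[position() le 10]
   return
      <entry>
         <title>{string($dc/dc:title[1])}</title>
         <id>http://cdi.uvm.edu/item/{string($dc/dc:identifier[1])}</id>
         <updated>{current-dateTime()}</updated>
         <summary>{string($dc/dc:description[1])}</summary>
      </entry>
   }
</feed>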

In the process of adding the feeds I also took the opportunity to upgrade to Solr 1.2, which was a pretty painless upgrade with some nice additional functionality. I hope to get a chance to install 1.3 on my development machine next week to explore the MoreLikeThis functionality. I’d like to use this feature on the item pages, giving users some immediate related items, in addition to using the subject and geographic headings to find related items.
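
The plan is a cheap one: hit the MoreLikeThis handler from an XQuery when the item page is built and fold the results into the page. A rough sketch, assuming the handler is enabled at /solr/mlt, that the schema has subject and text fields, and that the XQuery engine can dereference http URLs with doc() (eXist can); the item id is made up:

let $id := 'mcallister-0042'  (: hypothetical item id :)
let $url := concat('http://localhost:8983/solr/mlt?q=id:', $id,
                   '&amp;mlt.fl=subject,text&amp;mlt.mintf=1&amp;mlt.mindf=1&amp;rows=5')
(: each similar document comes back as a <doc> in the Solr XML response :)
for $doc in doc($url)//doc
return
   <related id="{$doc/str[@name='id']}" title="{$doc/str[@name='title']}"/>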


Miscellany that has been keeping me busy

June 25, 2007

We are starting to creep up on the end of my initial grant period (in September) and are in the process of spending the remaining money. The library has offered to purchase the CDI’s development server, and a machine to be used as storage for our master TIFF files. These will be shared resources, at least in that the library will benefit from CDI expertise, and server space, for other library digital content (this came up because the library is starting to manage a collection of digital theses). This leaves us with enough money to purchase a book scanner, and to buy some additional workstations, a laptop for the conference room, or other equipment.

I now have a development machine (just a co-opted desktop running Linux) on my desk, and I spent part of the past week learning how to install Linux and setting up the development environment. I work closely with someone in systems who takes care of our production machine, but he let me do the Linux installation on my own (while watching over my shoulder). It was pretty easy, and I’ve now moved on to more interesting problems, such as solving my URL issues (involving some convoluted problem with mod_jk) and reading up on Subversion. I would like to keep the development environment on a branch in Subversion that I can merge to the trunk when it is ready to go live. I assume this is possible, but so far I have been making all of my edits to the trunk, so I will need to read up on this.

In addition, I have been working on leftover details for the web site. I now have the “remove filters” option working (go ahead, try it), and will be looking into zooming for images, continually improving the interface, and adding new features.

I’m also still working on the new finding aids site (found here, but still a work in progress). I have a Solr instance set up for the EADs, but am having trouble indexing documents: namely, outputting all the text nodes in the document only once and with the correct spacing (without writing a hugely complex stylesheet). My brain insists that there is a simple way to do it, but I haven’t managed it yet. Everything I try either outputs some elements multiple times (i.e. a parent and all its children, and then the children again, as it works through the document tree) and/or does not insert spaces between elements, which makes the output fairly useless for searching.
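
For what it’s worth, the simplest approach I’ve since sketched out skips the stylesheet entirely and flattens the EAD in XQuery: walk the text nodes once and join them with spaces. A sketch, with the document path and Solr field name made up:

(: flatten one EAD into a single field for Solr:
   every text node exactly once, with a space between each :)
let $ead := doc('/db/ead/sample-finding-aid.xml')
return
   <field name="fulltext">
   {
      string-join(
         for $t in $ead//text()
         where normalize-space($t) != ''
         return normalize-space($t),
         ' ')
   }
   </field>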

The CDI is also involved in our first collaborative project. We are collaborating with the Landscape Change project here at UVM to digitize several hundred lantern slides from the Long Trail. I have only seen a few of the images, but it looks like a great collection. For this project the CDI will keep the master tiff files, and then each website will host jpg copies and copies of the metadata. I hate the duplication but am not sure how best to coordinate shared metadata at this point, and I didn’t want to stall the project while we figure that out.

I also have a paper to write and a presentation to come up with. So far I have a title for the presentation (“Innovative Interfaces: making the most of the data we have”) and nothing for the paper.

Solr, finally

April 4, 2007

It took me about three weeks from the Solr preconference event at code4lib, but I finally have Solr running semi-smoothly with my web application using Cocoon. I didn’t expect it to take so long, but most of that time was spent learning how to use Cocoon (and trying to learn Java). Ideally I would like to have my XQueries send POST and GET requests to Solr directly, which can be done using Java. However, the Java solution has a much larger learning curve than the Cocoon solution that I currently have in place. Because the release is only two weeks away, I’m sticking with Cocoon for now, with an eventual move to a Java/XQuery solution.

Here’s what my setup currently looks like:

1) A Solr instance on port 8983, with my website running on port 80 on the same machine. Port 8983 is firewalled so no one can come along and wipe out my index with a delete request.

2) An XQuery that pulls data from my METS records for indexing, either a single record or multiple records, depending on the parameters. Using an XSL stylesheet I generate an XForm (with the XQuery results as the instance data section of the form). This form then uses POST to send the data to the Solr index. A second button on the form sends a commit command to Solr.
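
Stripped of details, the indexing query boils down to something like this sketch; the METS/DC paths mirror the ones in my search queries, but the Solr field names here are just illustrative:

declare namespace mets="http://www.loc.gov/METS/";
declare namespace dc="http://purl.org/dc/elements/1.1/";

(: one Solr <doc> per METS record; the XForm POSTs the resulting
   <add> document to Solr's update handler, then sends <commit/> :)
<add>
{
   for $rec in collection('/db/mets')/mets:mets
   let $dc := $rec/mets:dmdSec[@ID='dmdDC']//dc:dc
   return
      <doc>
         <field name="id">{string($dc/dc:identifier[1])}</field>
         <field name="title">{string($dc/dc:title[1])}</field>
         <field name="creator">{string($dc/dc:creator[1])}</field>
         {
         for $s in $dc/dc:subject
         return <field name="subject">{string($s)}</field>
         }
      </doc>
}
</add>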

3) A Cocoon pipeline that sends GET requests to Solr and transforms the response using XSL. This feature took me a depressingly long time to figure out, in spite of the fact that I found this thread pretty early on.

One of the problems I ran into was that I had changed my XSLT transformer from Xalan to Saxon (so I could use XSLT 2.0). Saxon does not allow daisy chaining (pulling results from one pipeline through another pipeline, or applying multiple transformations). I adjusted my cocoon.xconf and sitemap.xmap to use Xalan as an additional transformer and only call it when using the pipeline below.

The pipeline for handling search requests looks like this:

<map:match pattern="search">
   <map:generate type="request">
      <map:parameter name="generate-attributes" value="true"/>
   </map:generate>
   <map:transform type="xslt-xsltc" src="solr.xsl">
      <map:parameter name="use-request-parameters" value="true"/>
   </map:transform>
   <map:transform type="cinclude" />
   <map:transform type="xslt-xsltc" src="searchResults.xsl" />
   <map:serialize type="xml"/>
</map:match>

solr.xsl transforms the parameters sent from the search form into Solr-style parameters. The cinclude element that solr.xsl generates is then executed as a GET request against Solr (you can also use cincludes to POST data, but I found it more difficult than posting from the XForm). The final XSL stylesheet transforms the results into something attractive for the user.

Here is what my solr.xsl looks like:

<xsl:stylesheet
   xmlns:cinclude="http://apache.org/cocoon/include/1.0"
   xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
   xmlns:xs="http://www.w3.org/2001/XMLSchema"
   xmlns="http://www.w3.org/1999/xhtml" version="1.0">

<xsl:strip-space elements="*"/>
<xsl:output media-type="text/xml" method="xml"/>
<xsl:param name="term1"/>
<xsl:param name="field1"/>
<xsl:param name="term2"/>
<xsl:param name="field2"/>
<xsl:param name="term3"/>
<xsl:param name="field3"/>
<xsl:param name="bool1"/>
<xsl:param name="bool2"/>
<xsl:param name="start"/>
<xsl:param name="rows"/>
<xsl:param name="indent"/>
<xsl:template match="/">
   <xsl:variable name="param1">
      <xsl:choose>
         <xsl:when test="string-length(normalize-space($term1)) > 1">
            <xsl:choose>
               <xsl:when test="$field1 = 'kw'">
		 <xsl:value-of select="$term1"/></xsl:when>
  	       <xsl:when test="$field1 = 'ti'">
		 <xsl:value-of select="concat('title:','(',$term1,')')"/></xsl:when>
	       <xsl:when test="$field1 = 'au'">
		 <xsl:value-of select="concat('creator:','(',$term1,')')"/></xsl:when>
	       <xsl:when test="$field1 = 'su'">
		 <xsl:value-of select="concat('subject:','(',$term1,')')"/></xsl:when>
	       <xsl:when test="$field1 = 'ab'">
		 <xsl:value-of select="concat('text:','(',$term1,')')"/></xsl:when>
	       <xsl:otherwise><xsl:value-of select="$term1"/></xsl:otherwise>
	   </xsl:choose>
         </xsl:when>
      </xsl:choose>
   </xsl:variable>
   <xsl:variable name="param2">
	<!-- same as param 1 using field2 and term2 -->
   </xsl:variable>
   <xsl:variable name="param3">
 	<!-- same as param 1 using field2 and term2 -->
   </xsl:variable>
   <xsl:variable name="boolean1">
      <xsl:choose>
        <xsl:when test="string-length(normalize-space($term2)) > 1">
         <xsl:choose>
          <xsl:when test="$bool1 = 'and'"> AND </xsl:when>
          <xsl:when test="$bool1 = 'or'"> OR </xsl:when>
          <xsl:when test="$bool1 = 'not'"> NOT </xsl:when>
          <xsl:otherwise> AND </xsl:otherwise>
         </xsl:choose>
       </xsl:when>
       <xsl:otherwise> </xsl:otherwise>
     </xsl:choose>
   </xsl:variable>
<xsl:variable name="boolean2">
 <!-- same as boolean1 -->
</xsl:variable>
   <!-- pulling all the params together -->
   <xsl:variable name="params">
      <xsl:value-of select="concat($param1,' ',$boolean1,' ',$param2,' ',$boolean2,' ',$param3)"/>
   </xsl:variable>
   <ci:include xmlns:ci="http://apache.org/cocoon/include/1.0"
      src="http://localhost:8983/solr/select/?q={$params}&amp;version=2.2&amp;start={$start}&amp;rows={$rows}&amp;indent={$indent}"/>
</xsl:template>
</xsl:stylesheet>

For other approaches using Cocoon, check out SolrForrest, Flowscripts, or try using the WebDAV module to talk to REST interfaces.

Resources:

Solr

Cocoon

Loose Ends

January 26, 2007

Loose ends that I’m still working on in no order whatsoever:

  1. OAI data provider: it looks like I’ll be doing this in XQuery. Thanks to Kevin and Mike for the jumping-off point; a rough sketch follows this list.
  2. Lucene search: since the release date has been pushed back (to match the grand opening of our non-virtual office), I think I’ll hold off on this until after the code4lib pre-conference. I’m also waiting until after the conference to start exploring some faceted browse and/or search options.
  3. Collection level pages need a little redesigning.
  4. Adding a save-as-PDF option to items and EADs.
  5. Adding a sandbox for development.
  6. Those other XForms, METS and MODS.
  7. Creating a better browse page. Currently it is a browse by title; I’d like to add options to browse by subject, date, people, and places. Still working on getting the metadata to support these features.
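
On the OAI front, the plan is one small XQuery per verb, each wrapping our Dublin Core in OAI-PMH markup. A minimal, hypothetical Identify response to give the flavor; every value below is a placeholder, and responseDate should really be normalized to UTC:

<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
   <responseDate>{current-dateTime()}</responseDate>
   <request verb="Identify">http://cdi.uvm.edu/oai</request>
   <Identify>
      <repositoryName>UVM Center for Digital Initiatives</repositoryName>
      <baseURL>http://cdi.uvm.edu/oai</baseURL>
      <protocolVersion>2.0</protocolVersion>
      <adminEmail>cdi@uvm.edu</adminEmail>
      <earliestDatestamp>2006-01-01</earliestDatestamp>
      <deletedRecord>no</deletedRecord>
      <granularity>YYYY-MM-DD</granularity>
   </Identify>
</OAI-PMH>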

Breaking ground

January 5, 2007

The CDI (Center for Digital Initiatives) is getting a home. So far I have been peacefully coexisting with the cataloging department while the scanning has been going on in an unused office in Special Collections. There have been plans in the works for a while now to renovate a room in the library for the CDI. This new space includes a dedicated scanning room, an office, and a conference room. After months of research, discussion, and reviewing and revising architectural drawings, construction has begun. I went up this afternoon and the framing for the walls is in, which means I can really visualize what the space is going to look like.

Responding to users

December 15, 2006

My one-on-one focus groups were very productive. I chose this format to begin site evaluations because it is a handy way to assess user needs, and it was helpful in giving me a clearer picture of the types of tasks that users would bring to the website. I divided my results into three groups: suggestions, observations (that I made during the sessions), and bugs.

Suggestions

Citations – All of the users suggested that we provide some sort of citation or citation help. Suggestions included a citation on each item page, or a link to a “how to” guide on citing online resources.

I added this feature as soon as testing concluded. It is a dynamic citation created by pulling information from the descriptive metadata. We are using Chicago style as our default citation style. It would probably be fairly easy to allow people to select a preferred citation style (out of three or so) and dynamically deliver the requested format, perhaps as a future feature.
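
Since the citation is just recombined metadata, the query behind it is short. A simplified sketch of the idea; the element names follow our Dublin Core fields, while the document path and site URL are made up:

declare namespace dc="http://purl.org/dc/elements/1.1/";

(: build a rough Chicago-style citation string from an item's DC record :)
let $dc := doc('/db/mets/sample-item.xml')//dc:dc
return
   concat(string($dc/dc:creator[1]), '. "',
          string($dc/dc:title[1]), '." ',
          string($dc/dc:date[1]), '. ',
          'Center for Digital Initiatives, University of Vermont Libraries. ',
          'http://cdi.uvm.edu/item/', string($dc/dc:identifier[1]),
          ' (accessed ', string(current-date()), ').')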

Download/Save records – All of the users also wanted a way to print and save pages/items. Several people suggested that it would be nice to have the ability to choose to print/save a single page, or the entire item, with citation information included.

I knew this was a feature that we wanted to add, but was unsure how people would want to download items and pages, so this feedback was pretty helpful in clarifying that. I’m still working out the format the items will be downloaded in (maybe PDF) and how the options will look in the interface.

Browse by people – Two users suggested that it would be useful to have some way of finding out who the people listed in the subject headings are. We don’t currently have this information. The names are from LCSH authority records, and I don’t know that we have the staff to create a solution to this, but it is an interesting idea that I would like to keep in mind, in case I can come up with a creative solution.

Full text – Users asked about full text searching and were disappointed when they were told we do not currently search full text. Users also expressed a desire for full text to help with reading some of the handwritten materials.

This is something we are currently working on. We are doing dirty OCR and using it for searching. The handwritten material is more difficult because we will need someone to transcribe it.

Search/browse results – Two users suggested that search and browse results should be sortable by date, and that a limit by document type should be added to search and search results. One user also requested adding checkboxes to the search results so that users could save and download selected result sets.

I’m holding off on these until I investigate Lucene, and have some idea how long it will take me to get it integrated with eXist. I don’t want to invest a lot of time into code that will be subsequently abandoned, but they are high on my list of improvements.

Faceted browsing – One user suggested taking the browsing options that I have provided a step further and provide some sort of faceted browsing.

I would like to investigate this idea, because I think it would not be too hard to do, and could be very useful. I also had several comments about how users would like to be able to “poke around” to find materials, for non-research projects and faceted browsing might be a step towards this kind of browsing.

Observations

All but one of the users used the browse boxes on the homepage, collection pages, and item level pages. However, some of the collections have pretty limited metadata, making browsing difficult. When faced with these collections, most users ended up either searching or, if the collection was very big, scrolling through the list of items. I think the collection level pages would benefit from more metadata, either more data in the record, or aggregated metadata from the items in the collection. (I’m a little hesitant about aggregating the metadata every time someone calls the collection page; it seems like this would be unnecessary work and would slow the page down.)

Also it might be useful to make the browsing options more prominent (provided we can commit to the necessary metadata to make this feasible). I’m also wondering if providing the results on the first page of the collection is too much information. Currently the page includes an introduction to the collection, browsing/searching options and then lists the items in the collection, with titles, authors, and descriptions (and thumbnails if available). I may try just having an introductory page that users have to click through, using either a search or a browse option to get to the items in the collection. I’m torn because it adds another click, but it might make the page more approachable.

In addition, one of the users found it confusing that the search results returned both collection and item level pages. Although the results specify document type, this distinction was not clear to the user. I think one way of handling this would be to allow users to limit the results by type: collection, image, text, etc. I like the way this project uses tabs to accomplish this.

Bugs

  • Search results display terms smooshed together. This was an XSL stylesheet issue and is already fixed.
  • Search within this collection was broken (fixed after the first session).
  • Searches do not pay attention to stop words. Something is wrong with my setup in eXist; I haven’t solved this one yet.
  • If you put a number in the keyword search, like a date, you will get no results (I think this is a data-typing issue).

Next steps

Once these adjustments and additions are made I’ll be ready to do some more formal user testing, including task-based tests and heuristic reviews of the site.

In the works

December 11, 2006

I haven’t done a progress report in a while, so here goes:

User testing
I finished up step one of usability testing this morning (actually I have one more user, but she had to postpone because she is sick). For step one my goal was to get an evaluation of the site and the site functionality. The sessions were arranged as one-on-one interviews, structured around a research question. The research question served as a way to frame the discussion and to make sure all the functions of the site were explored. There are some obvious trends emerging, and I expect the latter half of this week and early next week to be taken up by making changes to the interface in response to the tests.

Developing additional XForms

I’m currently working on a METS XForm, and will probably start working on a MODS form as well. I need to have at least a rudimentary MODS XForm in place before I can make the switch from Dublin Core to MODS.

I have successfully used xf:bind to calculate all the values in the mets:file element from two data entry fields. This was a bit difficult for me because the mets:file element is in an xf:repeat, and I had trouble getting the index in the bind statements to work correctly. I’m now working, with some success, on getting the mets:file data to generate a mets:structMap (without doing post-processing).
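
For anyone fighting the same fight, the binds end up looking roughly like this sketch. This is not my actual form: the instance names, paths, and URL scheme are invented for illustration (xf, mets, and xlink prefixes assumed bound):

<!-- derive each mets:file's ID and location from two entry fields
     (an item id and a starting page number) kept in a separate instance;
     inside calculate, the context node is the bound node itself -->
<xf:bind nodeset="instance('mets')//mets:file/@ID"
   calculate="concat('FID', count(../preceding-sibling::mets:file) + 1)"/>
<xf:bind nodeset="instance('mets')//mets:file/mets:FLocat/@xlink:href"
   calculate="concat('http://cdi.uvm.edu/images/', instance('entry')/item-id,
      '/', instance('entry')/start-page
      + count(../../preceding-sibling::mets:file), '.tif')"/>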

On the horizon

  1. I need to start exploring using Lucene with eXist. There are several people on the eXist mailing list who have implemented this, and there is also the code4lib pre-conference workshop, which I imagine will be helpful, though I hope to have this working before then.
  2. I also need to look into OAI harvesting for our data.
  3. I have some cleanup to do on the EAD finding aid part of the website, mostly just fixing bugs found in the XSL stylesheets.
  4. Add an XSL-FO stylesheet for the finding aids.

Loose ends

  1. The about section of the website is still completely bare. The mission statement is in progress (not by me) but the rest of the content needs to be added. In addition to generic about pages, we will be including guidelines for project proposals and perhaps an online project proposal form that will be submitted to a committee for review. I would also like to include documentation about our metadata standards, some discussion of eXist and also scanning guidelines and resources.
  2. Enable sessions in searching.
  3. Also enable sessions for saving search results/or items to a “favorites list”
  4. Optimize eXist indexes

Deadline looming

November 2, 2006

We have a demo of the web site on Wednesday the 8th and I’m feeling the need for a list or two to keep me organized.

Here’s what I have:

  1. No more links going nowhere, although my news feed is not actually acting like a news feed yet, and my about pages are simply placeholders at this point.
  2. Simple and advanced searches
  3. Browse collections by title
  4. Collection pages
    1. Collection overview, with a list of all items in the collection.
    2. Search within a collection
    3. Browse collections by
      1. Format
      2. People
      3. Topics
      4. Place
  5. Item level pages
    1. Tabbed view to switch between page images and descriptive metadata
    2. Paging for multi page items
    3. Links to related items by:
      1. Format
      2. Topics
      3. People etc.
    4. Links back to parent collection(s)
  6. Data processing
    1. Password protected data entry interface
    2. Queue of items in progress
    3. XForms for editing descriptive metadata

Here’s what’s left (for the demo):

  1. Add options for limiting the search results by text, images, or collections.
  2. Implement a more elaborate browse for the browse collections option, but this relies partially on metadata that we just don’t have yet.
  3. Format the search results pages so that they are more cohesive and easy to understand.
  4. Format the browse within a collection so that it is more organized looking.
  5. I need collection level metadata records for at least the two completed collections that we have on the site.

It isn’t a big list and I think I will be able to get it all done, except for #2. The big time sink at this point is messing with the CSS and creating icons to pretty up the pages, which I may have to skip in favor of getting all my code working. As a designer I cringe at having poorly designed pages up, but I only have so much time between now and Wednesday (10 am).

One step forward, one step back

October 13, 2006

eXist crashed on Wednesday. Actually, crash is probably the wrong word: it seemed to be running fine, but then failed to restart when I restarted Tomcat. We have backups, run early every morning, but that doesn’t help for the data that was entered during the day on Wednesday. More worrisome is that it is still unclear to me what caused the corruption in the database. I found a few discussions on the eXist mailing list that seemed to be about similar problems, but without any satisfactory answers as to why the corruption occurred.

http://thread.gmane.org/gmane.text.xml.exist/5254/focus=5254

http://thread.gmane.org/gmane.text.xml.exist/7161/focus=7248

After a day and a half of trying to figure out what went wrong, I caved and wrote to the list. I try to put that off for as long as possible because, while the list is very active and generally helpful, I hate asking a question and then figuring out the answer myself later (or, I’ll be honest, getting an answer back that makes me feel stupid). I’ve been unable to reproduce the error after replacing the corrupt instance with the one from the backup. I have a feeling it was something I was working on during the morning on Wednesday, which means either my search XQuery (which was outputting some Java exceptions), or perhaps some of the XUpdates I was using to add new elements to a few hundred documents at once.

Now that we are back up and running, I’m returning to my search XQueries, which I think need to be a little more sophisticated. The heart of the main query looks like this:

let $results :=
   for $hits in collection('/db/mets')/mets:mets/mets:dmdSec[@ID='dmdDC']
         //descendant::dc:dc/child::*[self::* |= $_query]
   let $type := $hits/ancestor::mets:mets/@TYPE
   let $title := $hits/parent::*/dc:title[1]
   let $id := $hits/parent::*/dc:identifier
   return
      <item id="{string($id)}" type="{string($type)}">
         <title>{string($title)}</title>
         {$hits}
      </item>

It is problematic because it returns multiple hits for a single document. This is a pretty easy fix to make, but I also ran into a problem with this query when I had over 1000 hits: I encountered a Java error (as noted above), so I will need to rework this. I can limit the number of results returned, or I could use this search to only search collection level records, not item level records; most likely the first option. I have also had a request to include the author/creator field in the results, which is a minor fix.
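
For the multiple-hits problem, the fix I have in mind is to flip the iteration around: loop over documents rather than matching elements, so each record produces exactly one item, and pick up the requested creator field while I’m at it. A sketch along the lines of the query above, not yet tested against the full database:

let $results :=
   (: one <item> per matching METS record instead of one per hit :)
   for $rec in collection('/db/mets')/mets:mets
         [mets:dmdSec[@ID='dmdDC']//dc:dc/child::* |= $_query]
   let $dc := $rec/mets:dmdSec[@ID='dmdDC']//dc:dc
   return
      <item id="{string($dc/dc:identifier[1])}" type="{string($rec/@TYPE)}">
         <title>{string($dc/dc:title[1])}</title>
         <creator>{string($dc/dc:creator[1])}</creator>
      </item>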

Update: My answer from the eXist list about the database corruption:

I fixed your issue. It wasn’t a “real” corruption, just removing the .lck files would have helped. As the exception shows, the lock files were damaged.

> org.exist.storage.lock.FileLock.read(FileLock.java:208)
> at
> org.exist.storage.lock.FileLock.tryLock(FileLock.java:108)
> at
> org.exist.storage.BrokerPool.canReadDataDir(BrokerPool.java:596)

Anyway, the startup process should handle this. After a database crash, the file locks might be incomplete. eXist will now check this.

So, that is good to know. Also, eXist 1.0 and 1.1 final have just been released; I may take some time this week to upgrade to 1.1 final.

Interface design, progress notes

October 10, 2006

We have a demo of the prototype scheduled for mid-November; I’m hoping that I will have full functionality by that time, and I’m getting a lot closer. I spend almost as much time designing the interface as I do implementing the design. My original design (the one that got approval) was only for the home page, so I have been designing the internal pages (browse, collection, and item level pages) and writing the code for them at the same time.

Here is what I have so far:

  • Home page – The home page is populated with images from (and links to) the four most recently added collections. Beneath these “featured collections” is a large browse box with several different avenues for browsing the site. This is currently static information but will be dynamic in the future. There is then the obligatory “about” blurb and the latest news from our non-existent news feed.
  • Collection pages – The collection pages were a challenge. I wanted the pages to contain a brief overview of the collection, and then the full list of items in the collection. I also wanted to make different “filters” for browsing the collection available. So I designed them with a smaller version of the browse box from the homepage that allows the user to browse all items in the collection or limit by genre, topic, people, place, or time. These categories are dynamically generated from the items in each collection. There is also a search within the collection option.
  • Item pages – I was originally planning on using the METS Navigator from Indiana University as a page-turning application, but it is a bit difficult to use with databases, because the navigator caches the pages and does not refresh when changes are made to the item. However, the way our METS records are formatted has made it very easy to implement my own version of this application using XQuery (a sketch follows this list). The item pages have two parts, the page-turning side and the description/metadata. I have also included a “find related materials” box that links to other items tagged with similar places, genres, people, and topics, as well as links back to all of the parent collections.
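
The page turner itself boils down to a lookup in the structMap. A simplified sketch of the idea, assuming eXist’s request module and a METS profile where each page is a mets:div pointing at a file; the parameter names and paths are invented:

declare namespace mets="http://www.loc.gov/METS/";
declare namespace xlink="http://www.w3.org/1999/xlink";
declare namespace request="http://exist-db.org/xquery/request";

(: given an item id and a page number, return the image URL for that page :)
let $id := request:get-parameter('id', ())
let $n := xs:integer(request:get-parameter('page', '1'))
let $rec := collection('/db/mets')/mets:mets[@OBJID = $id]
let $pages := $rec/mets:structMap//mets:div[@TYPE = 'page']
let $file := $rec/mets:fileSec//mets:file[@ID = $pages[$n]/mets:fptr[1]/@FILEID]
return
   <page n="{$n}" of="{count($pages)}"
      src="{string($file/mets:FLocat/@xlink:href)}"/>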

I’m still working on:

  • The advanced search
  • The browse collections page – I’m still contemplating the XQuery I will need to write for this, and haven’t quite figured it out yet.
  • About – This is mostly a content issue. Some of this content will need to be written by committee (our mission statement) and some of it I just haven’t had a chance to write yet.
  • News – There is no news (but I still need to put together the feed, and the query that will call it).