Cheshire3 Tutorials - Configuring Workflows

Introduction

Workflows are first-class objects in the Cheshire3 system - they are configured at the same time and in the same way as other objects. Their function is to provide an easy way to define a series of common steps that can be reused by different Cheshire3 databases or systems, instead of writing customised code to achieve the same end result for each.

Build Workflows are the most common type, as the data must generally pass through many different functions on different objects. However, as explained previously, the differences between Databases are often confined to one section. By using Workflows, we can define just the changed section rather than writing code to do the same task over and over again.

The current disadvantage of Workflows is that it is difficult to find out what went wrong when something fails. If your data is very clean, a Workflow is probably the right solution. If, however, the data is likely to contain XML parse errors, or has to pass through many different PreParsers and you want to verify each step, hand-written code may be a better choice.

The distribution comes with a generic build Workflow object called buildIndexWorkflow. It calls buildIndexSingleWorkflow, also supplied, to handle each individual Document. This second Workflow in turn calls PreParserWorkflow. A trivial PreParserWorkflow is supplied, but it is very unlikely to suit your particular needs and should be customised as required. For example, if you were building a Database of legacy SGML documents, your PreParserWorkflow would probably need to call an SgmlPreParser configured to deal with the non-XML-conformant parts of that particular SGML DTD.

For a full explanation of the different tags used in Workflow configuration, and what they do, see the Configuration section dealing with workflows.

Example 1

Simple workflow configuration:

<subConfig type="workflow" id="PreParserWorkflow">
    <objectType>workflow.SimpleWorkflow</objectType>
    <workflow>
        <!-- input type:  document -->
        <object type="preParser" ref="SgmlPreParser"/>
        <object type="preParser" ref="CharacterEntityPreParser"/>
    </workflow>
</subConfig>

Example 2

Slightly more complex workflow configurations:

<subConfig type="workflow" id="buildIndexWorkflow">
    <objectType>workflow.SimpleWorkflow</objectType>
    <workflow>
        <!-- input type:  documentFactory -->
        <log>Loading records</log>
        <object type="recordStore" function="begin_storing"/>
        <object type="database" function="begin_indexing"/>
        <for-each>
            <object type="workflow" ref="buildIndexSingleWorkflow"/>
        </for-each>
        <object type="recordStore" function="commit_storing"/>
        <object type="database" function="commit_metadata"/>
        <object type="database" function="commit_indexing"/>
    </workflow>
</subConfig>

<subConfig type="workflow" id="buildIndexSingleWorkflow">
    <objectType>workflow.SimpleWorkflow</objectType>
    <workflow>
        <!-- input type:  document -->
        <object type="workflow" ref="PreParserWorkflow"/>
        <try>
            <object type="parser" ref="LxmlParser"/>
        </try>
        <except>
            <log>Unparsable Record</log>
        </except>
        <object type="recordStore" function="create_record"/>
        <object type="database" function="add_record"/>
        <object type="database" function="index_record"/>
        <log>Loaded Record</log>
    </workflow>
</subConfig>

Explanation

The first two lines of each configuration example follow exactly the same pattern as for all previously configured objects. Then there is one new section - <workflow>. This contains a series of instructions for what to do, primarily by listing the objects that should handle the data. Line numbers below are counted from the first line of each example.

The workflow in Example 1 shows how to override the PreParserWorkflow for a specific database. In this case we start by giving the input Document to the SgmlPreParser in line 5, and the result of that is given to the CharacterEntityPreParser in line 6. Note that line 4 (like lines 4 and 20 of Example 2) is just a comment and is not required.

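The workflows in Example 2 are slightly more complex, with some additional constructions. Lines 5, 26 and 31 use the log instruction to have the Workflow write a message to its log: that it is starting to load Records (line 5), that a Record could not be parsed (line 26), and that a Record has been loaded (line 31).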
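
The log instruction can also ease the debugging difficulty mentioned in the introduction. The following is a minimal sketch, assuming that log steps may be placed between object steps in a PreParserWorkflow just as they are in Example 2, and that they pass the data through unchanged. The message texts are illustrative only.

```xml
<subConfig type="workflow" id="PreParserWorkflow">
    <objectType>workflow.SimpleWorkflow</objectType>
    <workflow>
        <!-- input type:  document -->
        <!-- Assumption: a log step does not alter the data passed to the next step -->
        <log>Starting SGML pre-parsing</log>
        <object type="preParser" ref="SgmlPreParser"/>
        <log>SGML pre-parsing done, resolving character entities</log>
        <object type="preParser" ref="CharacterEntityPreParser"/>
    </workflow>
</subConfig>
```

With messages like these, the last entry written before a failure gives a rough indication of which step was running at the time.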

In lines 6 and 7 the object tags have a second attribute called function. This gives the name of the function to call when it cannot be determined from the object type alone. For example, a PreParser is always called through process_document(), whereas you need to specify the function to call on a Database because there are many available. Note also that these tags have no ‘ref’ attribute to reference a specific object identifier. In that case the Workflow uses the current session to determine which Server, Database, RecordStore and so forth should be used. This allows the Workflow to be used in multiple contexts (e.g. if configured at the server level it can be used by several Databases).
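
For contrast, the fragment below places a session-resolved step next to one that names a specific object. It is a sketch only: it assumes that ref and function may be combined on a single object tag, which the examples above do not show, and the identifier myRecordStore is hypothetical.

```xml
<!-- As in Example 2: uses whichever RecordStore the current session resolves to -->
<object type="recordStore" function="create_record"/>

<!-- Assumption: ref and function combined; "myRecordStore" is a hypothetical id -->
<object type="recordStore" ref="myRecordStore" function="create_record"/>
```

Pinning a step to one object in this way would trade away the reusability described above, since the Workflow could then no longer be shared by Databases that use different RecordStores.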

The for-each block (lines 8-10) then iterates through the Documents in the supplied DocumentFactory, calling another Workflow, buildIndexSingleWorkflow (configured in lines 17-33), on each of them. Like the PreParser objects mentioned earlier, Workflow objects don’t need to be told which function to call - the system will always call their process() function. Finally, the RecordStore and Database have their commit functions called (lines 11-13) to ensure that everything is written out to disk.
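
The loop body need not be a separate Workflow. As a sketch, assuming that a for-each block accepts any sequence of steps, the same build could be written with the per-Document steps inline. The identifier buildIndexInlineWorkflow is hypothetical, and the error handling of Example 2 is left out for brevity.

```xml
<subConfig type="workflow" id="buildIndexInlineWorkflow">
    <objectType>workflow.SimpleWorkflow</objectType>
    <workflow>
        <!-- input type:  documentFactory -->
        <object type="recordStore" function="begin_storing"/>
        <object type="database" function="begin_indexing"/>
        <for-each>
            <!-- Assumption: each Document passes through these steps in order -->
            <object type="workflow" ref="PreParserWorkflow"/>
            <object type="parser" ref="LxmlParser"/>
            <object type="recordStore" function="create_record"/>
            <object type="database" function="add_record"/>
            <object type="database" function="index_record"/>
        </for-each>
        <object type="recordStore" function="commit_storing"/>
        <object type="database" function="commit_metadata"/>
        <object type="database" function="commit_indexing"/>
    </workflow>
</subConfig>
```

Keeping the per-Document steps in their own Workflow, as the distribution does, has the advantage that a Database can override just that part and reuse the rest.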

The second workflow in Example 2 is called by the first, and in turn calls the PreParserWorkflow configured in Example 1. It then calls a Parser, with some error handling around the call (lines 22-27), and makes further calls to the RecordStore (line 28) and Database (lines 29-30) objects to store and index the Record produced.