Cheshire3 Object Model - Index

API

class cheshire3.baseObjects.Index(session, config, parent=None)[source]

An Index defines an access point into the Records.

An Index is an object which defines an access point into Records and is responsible for extracting that information from them. It can then store the information extracted in an IndexStore.

The entry point can be defined using one or more Selectors (e.g. an XPath expression), and the extraction process can be defined using a Workflow chain of standard objects. These chains must start with an Extractor, but from there might include Tokenizers, PreParsers, Parsers, Transformers, Normalizers, and even other Indexes. A processing chain usually finishes with a TokenMerger to merge identical tokens into the appropriate data structure (a dictionary/hash/associative array).

An Index can also be the last object in a regular Workflow, so long as a Selector object is used to find the data in the Record immediately before an Extractor.
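The shape of such a processing chain can be sketched in plain Python. This is only an illustration, not Cheshire3 code: the functions below are hypothetical stand-ins for a Tokenizer, a Normalizer and a TokenMerger, and a real chain is assembled from configured Cheshire3 objects rather than bare functions.

```python
import re


def tokenize(text):
    """Stand-in for a Tokenizer: split the extracted text into words."""
    return re.findall(r"\w+", text)


def normalize(tokens):
    """Stand-in for a Normalizer: fold every token to lower case."""
    return [token.lower() for token in tokens]


def merge_tokens(tokens):
    """Stand-in for a TokenMerger: merge identical tokens into a
    dictionary mapping each term to its number of occurrences."""
    merged = {}
    for token in tokens:
        merged[token] = merged.get(token, 0) + 1
    return merged


def run_chain(selected_text):
    """Run the hypothetical chain over text found by a Selector."""
    return merge_tokens(normalize(tokenize(selected_text)))


terms = run_chain("The cat saw the other cat")
```

The final dictionary is the kind of structure the description above refers to: one entry per distinct term, ready to be handed to a store.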

begin_indexing(session)[source]

Prepare to index Records.

Perform tasks before indexing any Records.

clear(session)[source]

Clear all data from Index.

commit_indexing(session)[source]

Finalize indexing.

Perform tasks after Records have been indexed.

construct_resultSet(session, terms, queryHash={})[source]

Create and return a ResultSet.

Take a list of the internal representations of terms, as stored in this Index, then create and return an appropriate ResultSet object.

construct_resultSetItem(session, term, rsiType='')[source]

Create and return a ResultSetItem.

Take the internal representation of a term, as stored in this Index, then create and return a ResultSetItem from it.

delete_record(session, rec)[source]

Delete a Record from the Index.

Identify terms from the Record and delete them from the IndexStore. Depending on the configuration of the Index, it may be necessary to do this by extracting the terms from the Record again, then finding and removing them. Hence the Record must be the same as the one that was indexed.

deserialize_term(session, data, nRecs=-1, prox=1)[source]

Deserialize and return the internal representation of a term.

Return the internal representation of a term as recreated from a string serialization from storage. Used as a callback from IndexStore to take serialized data and produce list of terms and document references.

data := string (usually retrieved from indexStore)
nRecs := number of Records to deserialize (all by default)
prox := boolean flag to include proximity information

extract_data(session, rec)[source]

Extract data from the Record.

Deprecated?

fetch_proxVector(session, rec, elemId=-1)[source]

Fetch and return a proximity vector for the given Record.

fetch_summary(session)[source]

Fetch and return summary data for all terms in the Index.

Useful e.g. for sorting and then iterating over the terms. USE WITH CAUTION! Everything here is done for speed.

fetch_term(session, term, summary, prox)[source]

Fetch and return the data for the given term.

fetch_termById(session, termId)[source]

Fetch and return the data for the given term id.

fetch_termFrequencies(session, mType, start, nTerms, direction)[source]

Fetch and return a list of term frequency tuples.

fetch_termList(session, term, nTerms, relation, end, summary)[source]

Fetch and return a list of terms from the index.

fetch_vector(session, rec, summary)[source]

Fetch and return a vector for the given Record.

index_record(session, rec)[source]

Index and return a Record.

Accept a Record to index. If begin_indexing has been called, the Index might not commit any data until commit_indexing is called. If it is not in batch mode, then index_record will also commit the terms to the IndexStore.
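The difference between batch and non-batch indexing can be illustrated with a toy class. ToyIndex is hypothetical and is not part of Cheshire3; its methods drop the session argument and take a record id and a string instead of a Record, purely to keep the sketch self-contained.

```python
class ToyIndex:
    """Hypothetical stand-in for an Index; not a Cheshire3 class."""

    def __init__(self):
        self.store = {}      # committed data: term -> list of record ids
        self.pending = {}    # data buffered while in batch mode
        self.batch = False

    def begin_indexing(self):
        # Enter batch mode: later index_record calls only buffer terms.
        self.batch = True

    def index_record(self, rec_id, text):
        # Outside batch mode the terms go straight to the store.
        target = self.pending if self.batch else self.store
        for term in set(text.lower().split()):
            target.setdefault(term, []).append(rec_id)

    def commit_indexing(self):
        # Flush everything buffered since begin_indexing to the store.
        for term, rec_ids in self.pending.items():
            self.store.setdefault(term, []).extend(rec_ids)
        self.pending = {}
        self.batch = False


idx = ToyIndex()
idx.begin_indexing()
idx.index_record(1, "red fox")
idx.index_record(2, "red hen")
visible_before_commit = dict(idx.store)   # nothing committed yet
idx.commit_indexing()
```

In this sketch nothing is visible in the store until commit_indexing runs, which mirrors the batch behaviour described above.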

merge_term(session, currentData, newData, op='replace', nRecs=0, nOccs=0)[source]

Merge newData into currentData and return the result.

Merging takes the currentData and can add, replace or delete the data found in newData, and then returns the result. Used as a callback from IndexStore to take two sets of terms and merge them together.

currentData := output of deserialize_term
newData := flat list
op := replace | add | delete
nRecs := total records in newData
nOccs := total occurrences in newData
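A minimal sketch of the three merge operations follows. It assumes that both inputs are flat lists laid out as repeating (record id, store id, occurrences) triples; that layout, the function name merge_flat and the omission of any header values are assumptions made for illustration, and an actual Index implementation may differ.

```python
def merge_flat(current, new, op="replace"):
    """Merge flat [recId, storeId, occs, ...] lists (assumed layout)."""
    cur = [tuple(current[i:i + 3]) for i in range(0, len(current), 3)]
    inc = [tuple(new[i:i + 3]) for i in range(0, len(new), 3)]
    if op == "add":
        # Append the new entries after the existing ones.
        cur.extend(inc)
    elif op == "replace":
        # Overwrite the entry for a matching record, else append it.
        for rec, store, occs in inc:
            for pos, (r, s, _) in enumerate(cur):
                if (r, s) == (rec, store):
                    cur[pos] = (rec, store, occs)
                    break
            else:
                cur.append((rec, store, occs))
    elif op == "delete":
        # Drop every entry whose record appears in the new data.
        gone = {(r, s) for r, s, _ in inc}
        cur = [t for t in cur if (t[0], t[1]) not in gone]
    else:
        raise ValueError("unknown op: %r" % (op,))
    # Flatten the triples back into a single list.
    return [value for triple in cur for value in triple]


current = [1, 0, 2, 4, 0, 1]
added = merge_flat(current, [7, 0, 3], op="add")
replaced = merge_flat(current, [4, 0, 5], op="replace")
deleted = merge_flat(current, [1, 0, 2], op="delete")
```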

scan(session, clause, nTerms, direction='>=')[source]

Scan (browse) through an Index to return a list of terms.

Given a single clause CQL query, return an ordered term list with document frequencies and total occurrences with a maximum of nTerms items. Direction specifies whether to move backwards or forwards from the term given in clause.
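The browsing behaviour can be sketched over an in-memory sorted term list. The function scan_terms, the (term, nRecs, nOccs) tuple shape and the set of accepted direction strings are assumptions for this example; it does not parse CQL and only shows how a direction and a maximum number of terms select a window of the list.

```python
import bisect


def scan_terms(summary, term, n_terms, direction=">="):
    """Return up to n_terms entries starting from term.

    summary is assumed to be a list of (term, nRecs, nOccs) tuples
    sorted by term.
    """
    keys = [entry[0] for entry in summary]
    if direction in (">=", ">"):
        if direction == ">=":
            start = bisect.bisect_left(keys, term)
        else:
            start = bisect.bisect_right(keys, term)
        return summary[start:start + n_terms]
    if direction in ("<=", "<"):
        if direction == "<=":
            end = bisect.bisect_right(keys, term)
        else:
            end = bisect.bisect_left(keys, term)
        # Walk backwards, so the closest term comes first.
        return summary[max(0, end - n_terms):end][::-1]
    raise ValueError("unknown direction: %r" % (direction,))


summary = [("apple", 3, 5), ("banana", 1, 1), ("cherry", 2, 4), ("date", 1, 2)]
forward = scan_terms(summary, "banana", 2)
backward = scan_terms(summary, "cherry", 2, direction="<=")
```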

search(session, clause, db)[source]

Search this Index, return a ResultSet.

Given a CQL query, execute the query and return a ResultSet object.

serialize_term(session, termId, data, nRecs=0, nOccs=0)[source]

Return a string serialization representing the term.

Return a string serialization representing the term for storage purposes. Used as a callback from IndexStore to serialize a list of terms and document references to be stored.

termId := numeric ID of term being serialized
data := list of longs
nRecs := number of Records containing the term, if known
nOccs := total occurrences of the term, if known
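One plausible way to serialize a list of longs is with the standard struct module. The sketch below assumes a three-value header (term id, number of Records, total occurrences) followed by flat (record id, store id, occurrences) triples; the function names and that layout are illustrative assumptions rather than a statement of how any particular Index stores its data. Note that in Python 3 struct.pack returns bytes rather than str.

```python
import struct


def pack_term(term_id, data, n_recs=0, n_occs=0):
    """Pack a header plus a flat list of integers into bytes."""
    if not n_recs:
        # Not supplied: one Record per triple in the assumed layout.
        n_recs = len(data) // 3
    if not n_occs:
        # Not supplied: sum the third value of every triple.
        n_occs = sum(data[2::3])
    values = [term_id, n_recs, n_occs] + list(data)
    return struct.pack("<%dl" % len(values), *values)


def unpack_term(blob):
    """Reverse pack_term, returning the header followed by the data."""
    count = len(blob) // struct.calcsize("<l")
    return list(struct.unpack("<%dl" % count, blob))


blob = pack_term(42, [1, 0, 2, 4, 0, 1])
round_trip = unpack_term(blob)
```

A matching deserialization step, as in unpack_term here, is what lets a store hand the serialized data back and recover the list of terms and document references.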

sort(session, rset)[source]

Sort and return a ResultSet object.

Sort and return a ResultSet object based on the values extracted according to this index.

store_terms(session, data, rec)[source]

Store the indexed Terms in the configured IndexStore.

Implementations

The following implementations are included in the distribution by default:

class cheshire3.index.SimpleIndex(session, config, parent)[source]

class cheshire3.index.ProximityIndex(session, config, parent)[source]

Index that can store term locations to enable proximity search.

An Index that can store element, word and character offset location information for term entries, enabling phrase, adjacency searches etc.

Requires an Extractor with the prox setting and a ProximityTokenMerger.
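To show why stored locations enable phrase and adjacency searches, the following sketch checks whether one term directly follows another within the same element of a Record. The (element id, word number) pair shape and the function name are assumptions for illustration only.

```python
def is_adjacent(positions_a, positions_b):
    """Return True if some occurrence of term B directly follows term A.

    Each argument is a list of (elementId, wordNumber) pairs for one
    Record; this shape is assumed for the sketch.
    """
    following = {(elem, word + 1) for elem, word in positions_a}
    return any(pos in following for pos in positions_b)


# "information" at word 3 of element 0 and word 7 of element 1;
# "retrieval" at word 8 of element 1, so the phrase matches there.
phrase_found = is_adjacent([(0, 3), (1, 7)], [(1, 8)])
# Same word offsets but different elements: not a phrase match.
phrase_missing = is_adjacent([(0, 3)], [(1, 4)])
```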

class cheshire3.index.XmlIndex(session, config, parent)[source]

Index to store terms as XML structure.

e.g.:

<rs tid="" recs="" occs="">
    <r i="DOCID" s="STORE" o="OCCS"/>
</rs>
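A structure of this shape can be produced with the standard library's ElementTree module. The function term_to_xml and the (docId, store, occs) tuple input are hypothetical, chosen only to show how the attributes in the example above might be filled in.

```python
import xml.etree.ElementTree as ET


def term_to_xml(term_id, entries):
    """Build an <rs> element from (docId, store, occs) tuples."""
    rs = ET.Element("rs", {
        "tid": str(term_id),
        "recs": str(len(entries)),
        "occs": str(sum(occs for _, _, occs in entries)),
    })
    for doc_id, store, occs in entries:
        # One <r> child per Record containing the term.
        ET.SubElement(rs, "r", {
            "i": str(doc_id),
            "s": str(store),
            "o": str(occs),
        })
    return ET.tostring(rs, encoding="unicode")


xml_text = term_to_xml(7, [(1, 0, 2), (4, 0, 1)])
parsed = ET.fromstring(xml_text)
```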

class cheshire3.index.XmlProximityIndex(session, config, parent)[source]

ProximityIndex to store terms as XML structure.

e.g.:

<rs tid="" recs="" occs="">
  <r i="DOCID" s="STORE" o="OCCS">
    <p e="ELEM" w="WORDNUM" c="CHAROFFSET"/>
  </r>
</rs>

class cheshire3.index.RangeIndex(session, config, parent)[source]

Index to enable searching over one-dimensional range (e.g. time).

Requires a RangeTokenMerger.

class cheshire3.index.BitmapIndex(session, config, parent)[source]

class cheshire3.index.RecordIdentifierIndex(session, config, parent=None)[source]

class cheshire3.index.PassThroughIndex(session, config, parent)[source]

Special Index to pull in search terms from another Database.