Cheshire3 Object Model - DocumentFactory

API

class cheshire3.baseObjects.DocumentFactory(session, config, parent=None)

A DocumentFactory takes raw data and returns one or more Documents.

A DocumentFactory can be used to return Documents from, e.g., a file, a directory containing many files, archive files, a URL, or a web-based API.

load(session, data, cache=None, format=None, tagName=None, codec='')

Load documents into the document factory from data.

Returns the DocumentFactory itself, which acts as an iterator.

DocumentFactory's load function takes session, plus:

data := the data to load. Could be a filename, a directory name, the data as a string, a URL to the data, etc.

cache := setting for how to cache documents in memory when reading them in.

format := format of the data parameter. Many options, most common:

tagName := name of the tag which starts (and ends!) a Record.

codec := name of the codec in which the data is encoded.
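The "returns itself and acts as an iterator" contract described above can be illustrated with a small stand-in. This is only a sketch: ToyDocumentFactory is a hypothetical class, not part of Cheshire3, and it assumes that data is a string holding one document per line. The extra keyword parameters are accepted to mirror the load() signature but are ignored here.

```python
class ToyDocumentFactory:
    """Hypothetical stand-in that mimics the load()/iterate pattern.

    Not the Cheshire3 implementation; it only shows the shape of the API.
    """

    def __init__(self):
        # Nothing is loaded yet, so iteration would stop immediately.
        self._pending = iter(())

    def load(self, session, data, cache=None, format=None,
             tagName=None, codec=''):
        # Assumption: `data` is a string with one document per line.
        # cache, format, tagName and codec are ignored in this sketch.
        self._pending = iter(data.splitlines())
        # Return the factory itself so the caller can iterate over it.
        return self

    def __iter__(self):
        return self

    def __next__(self):
        # Raises StopIteration once every document has been handed out.
        return next(self._pending)


factory = ToyDocumentFactory()
docs = list(factory.load(None, "first\nsecond\nthird"))
print(docs)
```

Because load() returns the factory, a caller can write a single loop over the result of load() without holding a separate handle to the loaded documents.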

classmethod register_stream(session, format, cls)

Register a new format, handled by given DocumentStream (cls).

Class method that registers a DocumentStream implementation (cls) under a name (format). That name can then be passed as the format parameter in future calls to load().
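One way to picture this registration step is a class-level mapping from format names to stream classes, which load() consults later. The sketch below is an assumption about the general pattern, not Cheshire3's code: RegistryFactory, LineStream, the streams attribute, and the "lines" format name are all hypothetical.

```python
class LineStream:
    """Hypothetical stream that yields one document per line of input."""

    def __init__(self, data):
        self.data = data

    def __iter__(self):
        return iter(self.data.splitlines())


class RegistryFactory:
    """Hypothetical factory with a class-level format registry."""

    # Shared by all instances: format name -> stream class.
    streams = {}

    @classmethod
    def register_stream(cls, session, format, stream_cls):
        # Later calls to load(format=...) will look the class up here.
        cls.streams[format] = stream_cls

    def load(self, session, data, format=None):
        # Pick the stream class registered for this format name.
        stream_cls = self.streams[format]
        self._stream = iter(stream_cls(data))
        return self

    def __iter__(self):
        return self

    def __next__(self):
        return next(self._stream)


RegistryFactory.register_stream(None, "lines", LineStream)
loaded = list(RegistryFactory().load(None, "a\nb", format="lines"))
print(loaded)
```

Keeping the registry on the class rather than on an instance means a format registered once is visible to every factory created afterwards, which matches the "future calls to load()" wording above.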

The following implementations are included in the distribution by default:

class cheshire3.documentFactory.SimpleDocumentFactory(session, config, parent)

class cheshire3.documentFactory.ClusterExtractionDocumentFactory(session, config, parent)

Load lots of records, cluster and return the cluster documents.