Cheshire3 Object Model - PreParser

API

class cheshire3.baseObjects.PreParser(session, config, parent=None)

A PreParser takes a Document and returns a modified Document.

For example, the input document might consist of SGML data. The output would be a Document containing XML data.

This allows Workflow chains to be strung together in many ways, perhaps in ways which the original implementation had not foreseen.

process_document(session, doc)

Take a Document, transform it and return a new Document object.
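
The contract above can be sketched with stand-in classes so that it runs without Cheshire3 installed. `SimpleDocument`, `UppercasePreParser` and `run_chain` are hypothetical names used only for illustration; a real implementation would presumably subclass `cheshire3.baseObjects.PreParser` and work with the library's own Document class.

```python
class SimpleDocument:
    """Hypothetical stand-in for a Cheshire3 Document (not the real class)."""

    def __init__(self, text):
        self.text = text


class UppercasePreParser:
    """Hypothetical pre-parser that follows the process_document contract."""

    def process_document(self, session, doc):
        # Leave the input untouched and return a new Document object.
        return SimpleDocument(doc.text.upper())


def run_chain(session, doc, pre_parsers):
    """Pass a document through several pre-parsers in order."""
    for pre_parser in pre_parsers:
        doc = pre_parser.process_document(session, doc)
    return doc


original = SimpleDocument("<p>hello</p>")
result = run_chain(None, original, [UppercasePreParser()])
print(result.text)
```

Because every step takes a Document and returns a Document, steps can be reordered or combined freely, which is what makes the chaining described above possible.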

Implementations

The following implementations are included in the distribution by default:

class cheshire3.preParser.NormalizerPreParser(session, config, parent)

Calls a named Normalizer to do the conversion.

class cheshire3.preParser.UnicodeDecodePreParser(session, config, parent)

PreParser to turn non-Unicode Documents into Unicode Documents.

A UnicodeDecodePreParser should accept a Document with content encoded in a non-Unicode character encoding scheme and return a Document with the same content decoded to Python’s Unicode implementation.
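
The decoding step can be illustrated in plain Python 3, where the Unicode type is `str`. This is only a sketch of the idea: `decode_content` and its `codec` parameter are hypothetical and are not taken from the Cheshire3 source.

```python
def decode_content(raw, codec="utf-8"):
    """Decode encoded bytes into a Python str (Unicode text).

    Raises UnicodeDecodeError if the bytes are not valid for the codec.
    """
    return raw.decode(codec)


# Bytes as they might arrive from a Latin-1 encoded file.
latin1_bytes = "café".encode("latin-1")
text = decode_content(latin1_bytes, codec="latin-1")
print(text)
```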

class cheshire3.preParser.CmdLinePreParser(session, config, parent)

class cheshire3.preParser.FileUtilPreParser(session, config, parent)

Call the ‘file’ utility to find out the current type of the file.

class cheshire3.preParser.MagicRedirectPreParser(session, config, parent)

Map to the appropriate PreParser based on the incoming MIME type.
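
Dispatching on MIME type can be sketched as a dictionary lookup. The function `redirect_by_mime_type`, the `handlers` mapping and the pass-through behaviour for unknown types are all assumptions made for illustration; the class itself may be configured quite differently.

```python
def redirect_by_mime_type(content, mime_type, handlers):
    """Apply the handler registered for mime_type, if there is one."""
    handler = handlers.get(mime_type)
    if handler is None:
        # Assumption: content with no registered handler passes through.
        return content
    return handler(content)


# Hypothetical mapping from MIME type to a processing function.
handlers = {"text/plain": lambda text: "<data>" + text + "</data>"}

wrapped = redirect_by_mime_type("hello", "text/plain", handlers)
untouched = redirect_by_mime_type("hello", "application/pdf", handlers)
```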

class cheshire3.preParser.HtmlSmashPreParser(session, config, parent)

Attempts to reduce HTML to its raw text.
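
One way to reduce HTML to text is to collect only the text nodes with the standard library's `html.parser`. This sketch is not the approach taken by HtmlSmashPreParser itself, which the source does not describe; `TextCollector` and `html_to_text` are hypothetical names.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collect text nodes and ignore all tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def html_to_text(html):
    """Return the text content of an HTML string with the tags removed."""
    collector = TextCollector()
    collector.feed(html)
    collector.close()
    return "".join(collector.parts)


text = html_to_text("<p>Hello <b>world</b></p>")
print(text)
```

Note that this simple collector also keeps the contents of `<script>` and `<style>` elements, so a fuller version would need to skip those.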

class cheshire3.preParser.RegexpSmashPreParser(session, config, parent)

Strip, replace or keep only data which matches a given regex.
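
The three modes named above (strip, replace, keep) can be sketched with Python's `re` module. The function `regexp_smash` and its `replacement` and `keep` parameters are hypothetical, as is the choice to join kept matches without a separator.

```python
import re


def regexp_smash(text, pattern, replacement="", keep=False):
    """Strip, replace, or keep only the parts of text matching pattern."""
    regex = re.compile(pattern)
    if keep:
        # Keep mode: discard everything except the matches.
        return "".join(match.group(0) for match in regex.finditer(text))
    # Strip mode when replacement is empty, replace mode otherwise.
    return regex.sub(replacement, text)


stripped = regexp_smash("a1b22c", r"\d+")
replaced = regexp_smash("a1b22c", r"\d+", replacement="#")
kept = regexp_smash("a1b22c", r"\d+", keep=True)
```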

class cheshire3.preParser.HtmlTidyPreParser(session, config, parent)

class cheshire3.preParser.SgmlPreParser(session, config, parent)

Convert SGML into XML.

class cheshire3.preParser.AmpPreParser(session, config, parent)

Escape lone ampersands in otherwise XML text.
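
Escaping lone ampersands can be sketched with a negative lookahead that leaves existing entity references alone. The pattern below is an illustrative assumption about what counts as a "lone" ampersand and is not taken from the Cheshire3 source.

```python
import re

# An ampersand that does not start a named, decimal or hexadecimal
# entity reference such as &amp; &#38; or &#x26;
LONE_AMPERSAND = re.compile(
    r"&(?!(?:#\d+|#x[0-9A-Fa-f]+|[A-Za-z][A-Za-z0-9]*);)"
)


def escape_lone_ampersands(text):
    """Replace lone ampersands with &amp; and keep existing entities."""
    return LONE_AMPERSAND.sub("&amp;", text)


fixed = escape_lone_ampersands("Tom & Jerry &amp; co &#38; AT&T")
print(fixed)
```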

class cheshire3.preParser.MarcToXmlPreParser(session, config, parent=None)

Convert MARC into MARCXML.

class cheshire3.preParser.MarcToSgmlPreParser(session, config, parent=None)

Convert MARC into Cheshire2’s MarcSgml.

class cheshire3.preParser.TxtToXmlPreParser(session, config, parent=None)

Minimally wrap text in <data> XML tags.
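
Wrapping text in a `<data>` element might look like the sketch below. Escaping the text first with `xml.sax.saxutils.escape` is an assumption made so that the result is well-formed XML; `wrap_text` is a hypothetical name.

```python
from xml.sax.saxutils import escape


def wrap_text(text):
    """Wrap plain text in a <data> element, escaping &, < and >."""
    return "<data>" + escape(text) + "</data>"


xml = wrap_text("a < b & c")
print(xml)
```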

class cheshire3.preParser.PicklePreParser(session, config, parent=None)

Serialize Document content using Python pickle.

class cheshire3.preParser.UnpicklePreParser(session, config, parent=None)

Deserialize Document content using Python pickle.
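
The pickle round trip behind this pair of classes can be shown with the standard library alone. `pickle.dumps` turns a Python object into bytes and `pickle.loads` reverses it; the sample `content` value is invented for illustration.

```python
import pickle

# Hypothetical document content; any picklable object works.
content = {"title": "Example", "pages": 3}

pickled = pickle.dumps(content)   # object -> bytes
restored = pickle.loads(pickled)  # bytes -> equal object
```

Unpickling can execute arbitrary code, so `pickle.loads` should only be applied to data from a trusted source.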

class cheshire3.preParser.GzipPreParser(session, config, parent)

Gzip a document that is not already gzipped.

class cheshire3.preParser.GunzipPreParser(session, config, parent=None)

Gunzip a gzipped document.
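
The gzip and gunzip steps mirror each other, as this standard-library sketch shows. The helper names `gzip_content` and `gunzip_content` are hypothetical; only `gzip.compress` and `gzip.decompress` come from Python itself.

```python
import gzip


def gzip_content(data):
    """Compress bytes into gzip format."""
    return gzip.compress(data)


def gunzip_content(data):
    """Decompress gzip-format bytes."""
    return gzip.decompress(data)


original = b"<record>example</record>" * 10
compressed = gzip_content(original)
restored = gunzip_content(compressed)
```

Gzip data starts with the two magic bytes `1f 8b`, which is one way to check whether content is already gzipped before compressing it again.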

class cheshire3.preParser.B64EncodePreParser(session, config, parent=None)

Encode document in Base64.

class cheshire3.preParser.B64DecodePreParser(session, config, parent=None)

Decode document from Base64.
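
The Base64 encode and decode steps can be illustrated with Python's `base64` module. This is a sketch of the round trip only, not of how the two classes are implemented.

```python
import base64

raw = b"hello"
encoded = base64.b64encode(raw)      # bytes -> Base64 bytes
decoded = base64.b64decode(encoded)  # Base64 bytes -> original bytes
print(encoded)
```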

class cheshire3.preParser.PrintableOnlyPreParser(session, config, parent)

Replace or strip non-printable characters.
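
A minimal sketch of the strip and replace behaviour, assuming that "printable" means the ASCII characters in Python's `string.printable`. The real class may draw the line elsewhere (this version also treats all non-ASCII characters as non-printable), and `printable_only` is a hypothetical name.

```python
import string

# ASCII letters, digits, punctuation and whitespace.
PRINTABLE = set(string.printable)


def printable_only(text, replacement=""):
    """Replace each non-printable character, or strip it by default."""
    return "".join(ch if ch in PRINTABLE else replacement for ch in text)


stripped = printable_only("ab\x00c\x07")
replaced = printable_only("ab\x00c", replacement="?")
```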

class cheshire3.preParser.CharacterEntityPreParser(session, config, parent)

Change named and broken entities to numbered.

Transform Latin-1 and broken character entities into numeric character entities, e.g. &something; -> &#123;
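
Converting named entities to numeric ones can be sketched with the `html.entities.name2codepoint` table from the standard library. This covers only well-formed named entities; the handling of broken entities is not described in the source and is left out. `named_to_numeric` and `XML_PREDEFINED` are hypothetical names.

```python
import re
from html.entities import name2codepoint

# Entities that XML defines itself, so they can stay as they are.
XML_PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}

NAMED_ENTITY = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def named_to_numeric(text):
    """Rewrite known named entities as numeric character references."""

    def replace(match):
        name = match.group(1)
        if name in XML_PREDEFINED or name not in name2codepoint:
            # Leave XML's own entities and unknown names untouched.
            return match.group(0)
        return "&#%d;" % name2codepoint[name]

    return NAMED_ENTITY.sub(replace, text)


converted = named_to_numeric("caf&eacute; &amp; &nbsp;")
print(converted)
```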