Cheshire3 Object Model - Tokenizer

API

class cheshire3.baseObjects.Tokenizer(session, config, parent=None)[source]

A Tokenizer takes a string and returns an ordered list of tokens.

A Tokenizer takes a string of text and processes it to produce an ordered list of tokens.

Example Tokenizers might extract keywords by splitting on whitespace, or by identifying common word forms using a regular expression.

The incoming string is often wrapped in a data structure (dictionary / hash / associative array), as output by an Extractor.

process_hash(session, data)[source]

Process and return tokens found in the keys of a hash.

process_string(session, data)[source]

Process and return tokens found in a raw string.
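The two methods can be illustrated with a minimal stand-in class. This is a hypothetical sketch of the interface, not Cheshire3's actual implementation (the real base class lives in cheshire3.baseObjects and its process_hash merging logic may differ):

```python
# Hypothetical minimal tokenizer illustrating the Tokenizer interface.
class WhitespaceTokenizer:

    def process_string(self, session, data):
        # Split a raw string into an ordered list of tokens.
        return data.split()

    def process_hash(self, session, data):
        # Tokenize each key of the hash, carrying the associated
        # value over to every token found in that key.
        result = {}
        for key, value in data.items():
            for tok in self.process_string(session, key):
                result[tok] = value
        return result
```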

Implementations

The following implementations are included in the distribution by default:

class cheshire3.tokenizer.SimpleTokenizer(session, config, parent)[source]
class cheshire3.tokenizer.RegexpSubTokenizer(session, config, parent)[source]

Substitute regex matches with a character, then split on whitespace.

A Tokenizer that replaces regular expression matches in the data with a configurable character (defaults to whitespace), then splits the result at whitespace.
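The substitute-then-split behaviour can be sketched as follows; the function name, default pattern, and replacement character here are illustrative assumptions, not Cheshire3's actual configuration:

```python
import re

# Illustrative sketch of RegexpSubTokenizer's behaviour: replace every
# regex match with a character (default: a space), then split the
# result on whitespace. The default pattern is an assumption.
def regexp_sub_tokenize(data, pattern=r"[.,;:!?]", char=" "):
    return re.sub(pattern, char, data).split()
```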

class cheshire3.tokenizer.RegexpSplitTokenizer(session, config, parent)[source]

A Tokenizer that simply splits at the regex matches.
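In plain Python the split-at-matches behaviour corresponds to re.split; the pattern shown is illustrative only, since the real class takes its regex from configuration:

```python
import re

# Sketch of splitting the data at regex matches; empty strings that
# re.split can produce at the edges are discarded.
def regexp_split_tokenize(data, pattern=r"\s+"):
    return [t for t in re.split(pattern, data) if t]
```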

class cheshire3.tokenizer.RegexpFindTokenizer(session, config, parent)[source]

A Tokenizer that returns all words that match the regex.
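This corresponds to re.findall over the data; the word pattern below is an assumption, as the real class has a configurable regex:

```python
import re

# Sketch: return every substring of the data that matches the regex.
def regexp_find_tokenize(data, pattern=r"\w+"):
    return re.findall(pattern, data)
```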

class cheshire3.tokenizer.RegexpFindOffsetTokenizer(session, config, parent)[source]

Find tokens that match regex with character offsets.

A Tokenizer that returns all words that match the regex, and also the character offset at which each word occurs.
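A hedged sketch of the find-with-offsets behaviour, using re.finditer; the exact return shape of the real class is an assumption:

```python
import re

# Sketch: collect each regex match together with the character offset
# at which it starts, returning parallel lists of tokens and offsets.
def regexp_find_offset_tokenize(data, pattern=r"\w+"):
    tokens, offsets = [], []
    for m in re.finditer(pattern, data):
        tokens.append(m.group(0))
        offsets.append(m.start())
    return tokens, offsets
```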

class cheshire3.tokenizer.RegexpFindPunctuationOffsetTokenizer(session, config, parent)[source]
class cheshire3.tokenizer.SentenceTokenizer(session, config, parent)[source]
class cheshire3.tokenizer.LineTokenizer(session, config, parent)[source]

Trivial but potentially useful Tokenizer to split data at line breaks.

class cheshire3.tokenizer.DateTokenizer(session, config, parent)[source]

Tokenizer to identify date tokens, and return only these.

Capable of extracting multiple dates, though it does so more slowly and less reliably than extracting a single date.
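The idea of identifying date tokens and discarding everything else can be sketched with a simple regex; this is a hypothetical simplification, as the real DateTokenizer's date parsing is considerably more flexible:

```python
import re

# Hypothetical sketch: keep only ISO-style YYYY-MM-DD dates found in
# the data, discarding all other tokens.
def date_tokenize(data):
    return re.findall(r"\d{4}-\d{2}-\d{2}", data)
```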

class cheshire3.tokenizer.DateRangeTokenizer(session, config, parent)[source]

Tokenizer to identify ranges of date tokens, and return only these.

e.g.

>>> self.process_string(session, '2003/2004')
['2003-01-01T00:00:00', '2004-12-31T23:59:59.999999']
>>> self.process_string(session, '2003-2004')
['2003-01-01T00:00:00', '2004-12-31T23:59:59.999999']
>>> self.process_string(session, '2003 2004')
['2003-01-01T00:00:00', '2004-12-31T23:59:59.999999']
>>> self.process_string(session, '2003 to 2004')
['2003-01-01T00:00:00', '2004-12-31T23:59:59.999999']

For a single date, attempts to expand it into the largest possible range that the data could specify, e.g. 1902-04 means the whole of April 1902.

>>> self.process_string(session, "1902-04")
['1902-04-01T00:00:00', '1902-04-30T23:59:59.999999']
class cheshire3.tokenizer.PythonTokenizer(session, config, parent)[source]

Tokenize Python source code into token/TYPE pairs, with offsets.
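The same idea can be sketched with the standard library's tokenize module; the token/TYPE output format and the set of token types kept here are assumptions, not necessarily what Cheshire3's PythonTokenizer emits:

```python
import io
import token
import tokenize

# Sketch: tokenize Python source into "text/TYPE" strings, each paired
# with the (row, column) offset at which the token starts.
def python_tokenize(source):
    results = []
    readline = io.StringIO(source).readline
    for tok in tokenize.generate_tokens(readline):
        if tok.type in (token.NAME, token.OP, token.NUMBER, token.STRING):
            results.append(
                ("%s/%s" % (tok.string, token.tok_name[tok.type]), tok.start)
            )
    return results
```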