Class Ferret::Analysis::LetterTokenizer
In: ext/r_analysis.c
Parent: Ferret::Analysis::TokenStream

Summary

A LetterTokenizer divides text at non-letters. In other words, it defines tokens as maximal strings of adjacent letters, as matched by the regular expression _/[[:alpha:]]+/_, where [:alpha:] matches all alphabetic characters in your current locale.

Example

  "Dave's résumé, at http://www.davebalmain.com/ 1234"
    => ["Dave", "s", "résumé", "at", "http", "www", "davebalmain", "com"]
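
The splitting rule can be approximated in plain Ruby without Ferret. This is a minimal sketch, not the library's implementation: it assumes Ruby's own regexp engine, whose [[:alpha:]] class works on Unicode letters in UTF-8 strings rather than on the C locale that the tokenizer consults. It returns plain strings, not token objects.

```ruby
# Plain-Ruby approximation of the letter-splitting rule; this is not Ferret code.
# \u00e9 is "é", written as an escape so the literal is unambiguous.
text = "Dave's r\u00e9sum\u00e9, at http://www.davebalmain.com/ 1234"

# String#scan returns every maximal run of letters, in order of appearance.
# Apostrophes, punctuation, whitespace and digits act as separators.
tokens = text.scan(/[[:alpha:]]+/)

p tokens
```

Note that the trailing "1234" produces no token at all, because digits are not letters.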

Methods

new  

Public Class methods

Create a new LetterTokenizer which optionally downcases tokens. Downcasing is done according to the current locale.

lower: set to false if you don't wish to downcase tokens
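
The effect of the lower option can be modelled in plain Ruby. The helper below, letter_tokens, is hypothetical and is not part of Ferret; it only illustrates what downcasing does to the resulting tokens. It uses Ruby's String#downcase, which is an assumption standing in for the locale-based downcasing described above.

```ruby
# Hypothetical helper (not part of Ferret) that models the lower option.
# text  - the String to split into letter runs
# lower - when true, each token is downcased; when false, case is preserved
def letter_tokens(text, lower)
  tokens = text.scan(/[[:alpha:]]+/)
  lower ? tokens.map(&:downcase) : tokens
end

p letter_tokens("Dave's Ferret INDEX", true)
p letter_tokens("Dave's Ferret INDEX", false)
```

With lower set to true, "INDEX" and "index" yield the same token, which is usually what you want when the same rule is applied to both documents and queries.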
