com.google.appengine.api.search.dev
Class WordSeparatorAnalyzer
- java.lang.Object
  - Analyzer
    - com.google.appengine.api.search.dev.WordSeparatorAnalyzer
public class WordSeparatorAnalyzer extends Analyzer

A custom analyzer that tokenizes text the way the Search API backend does. It detects whether the provided text is in a CJK language and, if so, tokenizes it with CJKTokenizer. CJKTokenizer tokenizes based on bigrams, so a string like "ABCD" is tokenized to ["A", "AB", "BC", "CD", "D"]. If the text is not CJK, it is assumed to use standard Latin word separators. For Latin text, the analyzer uses a slightly customized LetterTokenizer and passes the tokens through StandardFilter and LowerCaseFilter. The LetterTokenizer is customized to use the same word separators as ST-BTI.
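The bigram behavior described above can be illustrated with a small standalone sketch. It is not the CJKTokenizer implementation; the class BigramSketch and its bigrams method are hypothetical names, and the handling of empty and single-character input is an assumption. It only reproduces the token shape given in the "ABCD" example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the App Engine SDK.
final class BigramSketch {

    // Emits the first code point, every overlapping pair, then the last code point,
    // matching the ["A", "AB", "BC", "CD", "D"] shape from the class description.
    static List<String> bigrams(String text) {
        int[] cps = text.codePoints().toArray();
        List<String> tokens = new ArrayList<>();
        if (cps.length == 0) {
            return tokens; // assumption: empty input yields no tokens
        }
        tokens.add(new String(cps, 0, 1));
        if (cps.length == 1) {
            return tokens; // assumption: a single character yields one token
        }
        for (int i = 0; i + 1 < cps.length; i++) {
            tokens.add(new String(cps, i, 2));
        }
        tokens.add(new String(cps, cps.length - 1, 1));
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(bigrams("ABCD")); // prints [A, AB, BC, CD, D]
    }
}
```

Working on code points rather than chars keeps supplementary-plane CJK characters intact in this sketch.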
-
-
Constructor Summary
Constructors
Constructor and Description
- WordSeparatorAnalyzer(): Create a new WordSeparatorAnalyzer that always tries to detect CJK.
- WordSeparatorAnalyzer(boolean detectCjk): Create a new WordSeparatorAnalyzer.
-
Method Summary
Modifier and Type, Method and Description
- static java.lang.String normalize(java.lang.String tokenizeString): Transforms to lowercase and replaces all word separators with spaces.
- static java.lang.String removeDiacriticals(java.lang.String input): Removes all diacritical marks from the input.
- static java.util.List<java.lang.String> tokenList(java.lang.String tokenizeString): Returns a list of tokens for a string.
- TokenStream tokenStream(java.lang.String fieldName, java.io.Reader reader): Constructs a tokenizer that can tokenize CJK or Latin text.
-
-
-
Constructor Detail
-
WordSeparatorAnalyzer
public WordSeparatorAnalyzer(boolean detectCjk)
Create a new WordSeparatorAnalyzer.
Parameters:
detectCjk - If true, will attempt to detect and segment CJK. If false, assumes all text can be segmented using word separators.
-
WordSeparatorAnalyzer
public WordSeparatorAnalyzer()
Create a new WordSeparatorAnalyzer that always tries to detect CJK.
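How the analyzer decides that text is CJK is not documented here. As a rough illustration of what a detectCjk switch could gate, the sketch below checks Unicode scripts with the standard Character.UnicodeScript API. The class CjkDetectionSketch, both method names, and the choice of scripts (Han, Hiragana, Katakana, Hangul) are assumptions, not the SDK's logic.

```java
// Hypothetical helper, not part of the App Engine SDK.
final class CjkDetectionSketch {

    // Assumption: text counts as CJK if any code point belongs to one of these scripts.
    static boolean containsCjk(String text) {
        return text.codePoints().anyMatch(cp -> {
            Character.UnicodeScript script = Character.UnicodeScript.of(cp);
            return script == Character.UnicodeScript.HAN
                    || script == Character.UnicodeScript.HIRAGANA
                    || script == Character.UnicodeScript.KATAKANA
                    || script == Character.UnicodeScript.HANGUL;
        });
    }

    // Mirrors the constructor flag: with detectCjk false, detection is skipped
    // and all text is treated as separable by word separators.
    static boolean useCjkTokenizer(String text, boolean detectCjk) {
        return detectCjk && containsCjk(text);
    }

    public static void main(String[] args) {
        System.out.println(useCjkTokenizer("\u65e5\u672c\u8a9e", true));  // prints true
        System.out.println(useCjkTokenizer("\u65e5\u672c\u8a9e", false)); // prints false
    }
}
```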
-
-
Method Detail
-
tokenStream
public TokenStream tokenStream(java.lang.String fieldName, java.io.Reader reader)
Constructs a tokenizer that can tokenize CJK or Latin text.
Parameters:
fieldName - Ignored.
reader - A stream to tokenize. mark() and reset() support is not needed.
Returns:
A TokenStream that represents the tokenization of the data in reader.
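To show why mark() and reset() are unnecessary for this kind of tokenization, here is a minimal single-pass sketch that reads a java.io.Reader once, splits on non-letters, and lowercases each token. It does not use Lucene and is not the analyzer's code; the class and method names are hypothetical, and treating every non-letter as a separator is an assumption (the real separator set follows ST-BTI).

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the App Engine SDK.
final class SinglePassLetterTokenizerSketch {

    // Reads the stream front to back exactly once; only Reader.read() is needed.
    // Simplification: characters are handled one char at a time (BMP only).
    static List<String> tokenize(Reader reader) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        try {
            int c;
            while ((c = reader.read()) != -1) {
                if (Character.isLetter(c)) {
                    current.append(Character.toLowerCase((char) c));
                } else if (current.length() > 0) {
                    tokens.add(current.toString());
                    current.setLength(0);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [hello, world, wide, web]
        System.out.println(tokenize(new StringReader("Hello, World-wide web")));
    }
}
```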
-
tokenList
public static java.util.List<java.lang.String> tokenList(java.lang.String tokenizeString)
Returns a list of tokens for a string.
-
normalize
public static java.lang.String normalize(java.lang.String tokenizeString)
Transforms to lowercase and replaces all word separators with spaces.
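A rough standalone equivalent of this transformation might look like the following. The exact separator set is not listed in this documentation, so the sketch assumes that any character that is neither a letter nor a digit is a word separator; NormalizeSketch is a hypothetical name.

```java
import java.util.Locale;

// Hypothetical helper, not part of the App Engine SDK.
final class NormalizeSketch {

    // Lowercases the input, then replaces each assumed separator with one space.
    static String normalize(String tokenizeString) {
        return tokenizeString
                .toLowerCase(Locale.ROOT)
                .replaceAll("[^\\p{L}\\p{N}]", " ");
    }

    public static void main(String[] args) {
        System.out.println(normalize("Foo-Bar_baz")); // prints foo bar baz
    }
}
```

Each separator maps to its own space, so runs of separators are not collapsed in this sketch.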
-
removeDiacriticals
public static java.lang.String removeDiacriticals(java.lang.String input)
Removes all diacritical marks from the input. This has the effect of transforming marked glyphs into their "equivalent" non-marked form. For example, "éøç" becomes "eoc".
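One common way to approximate this with the standard library is to decompose the string with java.text.Normalizer and drop the combining marks. The sketch below is an assumption about the approach, not the SDK's implementation. Note that "ø" has no canonical decomposition, so the explicit mapping for it is added here only to reproduce the documented "eoc" result.

```java
import java.text.Normalizer;

// Hypothetical helper, not part of the App Engine SDK.
final class DiacriticsSketch {

    static String strip(String input) {
        // NFD splits marked letters into a base letter plus combining marks.
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        // \p{M} matches the combining marks left behind by the decomposition.
        String stripped = decomposed.replaceAll("\\p{M}+", "");
        // Assumption: map the stroked o explicitly, since NFD leaves it unchanged.
        return stripped.replace('\u00f8', 'o').replace('\u00d8', 'O');
    }

    public static void main(String[] args) {
        // Input is e-acute, o-with-stroke, c-cedilla.
        System.out.println(strip("\u00e9\u00f8\u00e7")); // prints eoc
    }
}
```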
-
-