Class Ferret::Search::Searcher
In: ext/r_search.c
Parent: Object

Summary

The Searcher class performs the task that Ferret was built for: it searches the index. To do so, the Searcher class wraps an IndexReader, so many of the tasks that you can perform on an IndexReader are also available on a Searcher, including, most importantly, accessing stored documents.

The main methods that you need to know about when using a Searcher are the search methods. Searcher#search_each iterates through the results, yielding each document id and score, and Searcher#search returns a TopDocs object. Another important difference is that Searcher#search_each normalizes the score to a value in the range 0.0..1.0 if the max_score is greater than 1.0; Searcher#search does not. Apart from that, they take the same parameters and work the same way.
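
The normalization rule described above can be sketched in plain Ruby. This is only an illustration of the stated behaviour, not Ferret's implementation: the helper name normalize_scores, the [doc_id, score] pair layout and the choice to divide by the maximum score are all assumptions made for the example.

```ruby
# Hypothetical helper: scale scores into 0.0..1.0 only when the highest
# score exceeds 1.0. Dividing by the maximum is an assumption made for
# this sketch; hits is an array of [doc_id, score] pairs.
def normalize_scores(hits)
  max_score = hits.map { |_doc_id, score| score }.max
  return hits if max_score.nil? || max_score <= 1.0

  hits.map { |doc_id, score| [doc_id, score / max_score] }
end

raw = [[3, 2.0], [7, 1.0], [9, 0.5]]
normalize_scores(raw).each do |doc_id, score|
  puts "#{doc_id} scored #{score}"
end
```

Result sets whose highest score is already 1.0 or lower pass through unchanged, which mirrors the "if the max_score is greater than 1.0" condition.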

Example

  searcher = Searcher.new("/path/to/index")

  searcher.search_each(TermQuery.new(:content, "ferret"),
                       :filter => RangeFilter.new(:date, :< => "2006"),
                       :sort => "date DESC, title") do |doc_id, score|
    puts "#{searcher[doc_id][:title]} scored #{score}"
  end

Methods

[]   close   doc_freq   explain   get_document   highlight   max_doc   new   reader   scan   search   search_each  

Public Class methods

Create a new Searcher object. dir can be a string path to an index directory on the file system, a Ferret::Store::Directory object or a Ferret::Index::IndexReader. To search multiple indexes, use an IndexReader: just open the IndexReader on multiple directories.

Public Instance methods

Retrieve a document from the index. See LazyDoc for more details on the document returned. Documents are referenced internally by document ids, which are returned by the Searcher's search methods.

Close the searcher. The garbage collector will do this for you or you can call this method explicitly.

Return the number of documents in which the term +term+ appears in the field +field+.

Create an explanation object to explain the score returned for the document at +doc_id+ in the index for the query +query+.

Usually used like this:

  puts searcher.explain(query, doc_id).to_s

Retrieve a document from the index. See LazyDoc for more details on the document returned. Documents are referenced internally by document ids, which are returned by the Searcher's search methods.

Returns an array of strings with the matches highlighted.

Options

:excerpt_length: Default: 150. Length of excerpt to show. Highlighted terms will be in the centre of the excerpt. Set to :all to highlight the entire field.
:num_excerpts: Default: 2. Number of excerpts to return.
:pre_tag: Default: "<b>". Tag to place to the left of the match. You'll probably want to change this to a "<span>" tag with a class. Try "\033[7m" for use in a terminal.
:post_tag: Default: "</b>". This tag should close the +:pre_tag+. Try "\033[m" in the terminal.
:ellipsis: Default: "...". This is the string that is added at the beginning and end of excerpts (unless the excerpt hits the start or end of the field). You'll probably want to change this to a Unicode ellipsis character.

Returns 1 + the maximum document id in the index. It is the document_id that will be used by the next document added to the index. If there are no deletions, this number also refers to the number of documents in the index.

Return the IndexReader wrapped by this searcher.

Run a query through the Searcher on the index, ignoring scoring and starting at +:start_doc+ and stopping when +:limit+ matches have been found. It returns an array of the matching document numbers.

There is a big performance advantage when using this search method on a very large index when there are potentially thousands of matching documents and you only want, say, 50 of them. The other search methods need to look at every single match to decide which one has the highest score. This search method just needs to find +:limit+ matches before it returns.

Options

:start_doc: Default: 0. The document to start the search from. NOTE very carefully that this is not the same as the +:offset+ parameter used in the other search methods, which refers to the offset in the result set. This is the document to start the scan from. So if you are scanning through the index in increments of 50 documents at a time, you need to use the last matched doc in the previous search to start your next search. See the example below.
:limit: Default: 50. This is the number of results you want returned, also called the page size. Set +:limit+ to +:all+ to return all results.

TODO: add option to return loaded documents instead

Example

  start_doc = 0
  begin
    results = @searcher.scan(query, :start_doc => start_doc)
    yield results # or do something with them
    start_doc = results.last
    # start_doc will be nil now if results is empty, i.e. no more matches
  end while start_doc

Run a query through the Searcher on the index. A TopDocs object is returned with the relevant results. The query is a built-in Query object. Here are the options:

Options

:offset: Default: 0. The offset of the start of the section of the result set to return. This is used for paging through results. Let's say you have a page size of 10. If you don't find the result you want among the first 10 results, set +:offset+ to 10 and look at the next 10 results, then 20 and so on.
:limit: Default: 10. This is the number of results you want returned, also called the page size. Set +:limit+ to +:all+ to return all results.
:sort: A Sort object or sort string describing how the field should be sorted. A sort string is made up of field names, which cannot contain spaces, and the word "DESC" if you want the field reversed, all separated by commas. For example: "rating DESC, author, title". Note that Ferret will try to determine a field's type by looking at the first term in the index and seeing if it can be parsed as an integer or a float. Keep this in mind, as you may need to specify a field's type to sort it correctly. For more on this, see the documentation for SortField.
:filter: A Filter object to filter the search results with.
:filter_proc: A Proc which takes the doc_id, the score and the Searcher object as its parameters and returns either a Boolean value specifying whether the result should be included in the result set, or a Float between 0 and 1.0 to be used as a factor to scale the score of the object. This can be used, for example, to weight the score of a matched document by its age.
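
As an illustration of the age-weighting idea mentioned for +:filter_proc+, here is a minimal sketch in plain Ruby. It assumes each document has a :date field stored as "YYYY-MM-DD"; the helper name age_weight_proc, the linear decay and the 0.1 floor are all hypothetical choices. A plain Hash stands in for the Searcher so that the sketch runs without an index.

```ruby
require "date"

# Hypothetical helper: build a proc with the (doc_id, score, searcher)
# parameters described above. It returns a Float between 0 and 1.0:
# 1.0 for a document dated today, falling linearly to a floor of 0.1
# at max_age_days and beyond. The :date field is an assumption.
def age_weight_proc(today, max_age_days = 365)
  lambda do |doc_id, _score, searcher|
    age_days = (today - Date.parse(searcher[doc_id][:date])).to_i
    age_days = [[age_days, 0].max, max_age_days].min
    1.0 - 0.9 * age_days / max_age_days
  end
end

# Stand-in for a Searcher: anything that maps doc_id to a document.
fake_searcher = { 0 => { :date => "2006-06-01" },
                  1 => { :date => "2005-06-01" } }

weight = age_weight_proc(Date.new(2006, 6, 1))
fake_searcher.each_key do |doc_id|
  puts "doc #{doc_id} factor #{weight.call(doc_id, 1.5, fake_searcher)}"
end
```

With a real index, a proc like this would be passed as the +:filter_proc+ option, and the returned factor would scale each document's score.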

Run a query through the Searcher on the index. A TopDocs object is returned with the relevant results. The query is a Query object. The Searcher#search_each method yields the internal document id (used to reference documents in the Searcher object like this: +searcher[doc_id]+) and the search score for that document. For some queries, and when boosts are taken into account, the score can be greater than 1.0. This method normalizes scores to the range 0.0..1.0 when the max score is greater than 1.0. Here are the options:

Options

:offset: Default: 0. The offset of the start of the section of the result set to return. This is used for paging through results. Let's say you have a page size of 10. If you don't find the result you want among the first 10 results, set +:offset+ to 10 and look at the next 10 results, then 20 and so on.
:limit: Default: 10. This is the number of results you want returned, also called the page size. Set +:limit+ to +:all+ to return all results.
:sort: A Sort object or sort string describing how the field should be sorted. A sort string is made up of field names, which cannot contain spaces, and the word "DESC" if you want the field reversed, all separated by commas. For example: "rating DESC, author, title". Note that Ferret will try to determine a field's type by looking at the first term in the index and seeing if it can be parsed as an integer or a float. Keep this in mind, as you may need to specify a field's type to sort it correctly. For more on this, see the documentation for SortField.
:filter: A Filter object to filter the search results with.
:filter_proc: A Proc which takes the doc_id, the score and the Searcher object as its parameters and returns a Boolean value specifying whether the result should be included in the result set.
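
The paging scheme described for +:offset+ and +:limit+ amounts to simple arithmetic. The sketch below shows one way to derive both options from a 1-based page number; the helper name page_options is hypothetical and not part of Ferret.

```ruby
# Hypothetical helper: build the :offset and :limit options for a
# 1-based page number, following the paging scheme described above.
def page_options(page, page_size = 10)
  raise ArgumentError, "page must be >= 1" if page < 1

  { :offset => (page - 1) * page_size, :limit => page_size }
end

(1..3).each { |page| puts page_options(page).inspect }
```

The resulting hash could be merged with other options such as +:sort+ or +:filter+ before being passed to a search method.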
