PyFTS - Python interface to the OpenFTS full-text search engine
Version 0.3

Python modules are copyright (C) 2003 by Brandon Craig Rhodes; see
LICENSING below, and the COPYING file in this directory.


WELCOME

These Python modules are designed to replace the Perl modules that
form the current programming interface to the OpenFTS project.  At
the moment they are rudimentary, but they do implement the same basic
functions as the standard Perl interface.

This distribution includes the `fts' Python module itself, as well as
example commands with which you can index your own documents and then
search them - from either the command line or a web browser!

Note that these routines use different database table names than the
ones used by the Perl interface, so you cannot use these modules to
access a database you have already set up with the Perl routines.
Other differences, some simple and others technical, are listed below
under DIFFERENCES WITH PERL INTERFACE.


INSTALLATION

You should already have installed PostgreSQL and the contrib modules
that come with OpenFTS; you can obtain both source code and
installation instructions for these at:

    http://openfts.sourceforge.net/

Next, untar the pyfts-0.3.tar.gz archive (which you have probably
already done, since you are reading its README file) and make
pyfts-0.3/ your working directory.  At this point you have two
choices.  If you have write permission on the site-packages/
directory (which on many systems is restricted to root) and want to
make the `fts' module available to all users on your system, then
simply enter:

    python setup.py install

If you instead just want to play with the `fts' module yourself, you
can build it locally and point PYTHONPATH to its directory:

    mkdir pylib
    python setup.py install --home .
    export PYTHONPATH=$PWD/lib/python/

Remember that to use the `fts' module when running Python from
another shell, you will need to either set PYTHONPATH manually in
that instance of the shell, or set it in your profile.

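As a quick check that the installation worked, the following minimal
sketch tries to import the `fts' module and reports whether Python
can find it.  The helper name module_available is hypothetical and
not part of PyFTS; it relies only on Python's built-in __import__.

```python
def module_available(name):
    """Return True if the named module can be imported, False otherwise."""
    try:
        __import__(name)
    except ImportError:
        return False
    return True


# `fts' is the module installed by setup.py above; if this check fails,
# PYTHONPATH probably does not point at the installation directory.
if module_available("fts"):
    print("fts is importable")
else:
    print("fts not found; check PYTHONPATH")
```
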
Programmers will want to `import fts' in their Python programs and
read the documentation under FTS MODULE below; those wanting a quick
demonstration can run the sample code under the example/ directory by
following the next section.


EXAMPLE DATABASE AND WEB SEARCH ENGINE

The commands under example/ require that PostgreSQL is up and
running; they create an `example_texts' database in which they index
the documents you provide, and they let you search those documents
from either the command line or your web browser.

While the commands under the example/ directory are almost ready to
run, they do not know where you have installed the openfts.sql and
tsearch.sql scripts with which a database is prepared for use with
OpenFTS.  You must edit the example/init.sh file and change
PG_CONTRIB to the location of the postgresql/contrib directory on
your own system.

Having corrected the value of PG_CONTRIB, change your working
directory to the example/ directory and initialize the database:

    ./init.sh

Then you can index an entire directory of documents at once:

    ./load.sh <directory>

The script calls find(1) on the target directory and submits the
resulting list of filenames to the ./load.py program for indexing.
Its one clever feature is that it creates a file named .index_time
inside the target directory, whose modification time it uses to avoid
re-indexing unchanged documents the next time you run it on the same
directory.  (Actually it has one other clever feature, since I use it
to index my email: lines that look like base64-encoded binary are
discarded from the input, so that random strings of binary data do
not bloat your document indexes.)  So by running ./load.sh on the
same directory every day, you can keep your index up to date with the
file contents without having to rebuild the entire database.

To search, you can either run ./search.sh with one or more search
terms as arguments:

    ./search.sh <term> ...

or run the web server provided and connect your browser to its port.
For example, while running:

    python web_server.py 2080

you will find a simple search engine at:

    http://localhost:2080/


FTS MODULE

The `fts' module provides a create_index(...) function, which you
should call exactly once when you want to create a document index; it
creates the necessary tables and stores the preferences you give it.
Then, each time your application runs, you can load your index
configuration back in with the load_index(...) function.

fts.create_index(db [, ...] )

    Creates a full-text search index, including all necessary tables
    and indexes, in the database to which you provide an open
    PyGreSQL connection (here named `db').  Many parameters can be
    specified, but all are optional:

    prefix = <string>

        A prefix for the names of all full-text index tables.  By
        default this is `fts', meaning that all of the objects
        created in your database will start with `fts_', which can
        help you keep them separate both from other indexes and from
        other tables your application might require.

    parser = <function>

        The parse function should accept a single string as its
        argument, and return a list of (type, word) pairs giving an
        integer type and a string word for each token found in the
        input.  If no parser is specified, the default is
        fts.ftsParser().parse.

    ignore_types = [ <type>, ... ]

        A list of word types to ignore; words returned by the parser
        whose type appears in this list are discarded and not
        indexed.  By default types 7, 12, 13, 14, and 23 are ignored
        (see the OpenFTS documentation for what the default parser
        means by each type).

    longest_word = <integer>

        Words returned by the parser which are longer than this
        maximum length are discarded and not indexed.

    dictionaries = [ <function>, ... ]

        Each function listed should accept a string containing a
        single word, and return a list of stems for that word.  A
        dictionary can return None for words whose form it does not
        recognize, in which case the next dictionary in the list
        will be tried.
        The default dictionary fts.PorterEnglish.lexeme is used if
        you do not specify a `dictionaries' option.

    type_dictionaries = { <type>: [ <function>, ... ], ... }

        Sometimes your parser can identify words that belong to
        specific languages, or to other word categories, for which
        you have special stemming dictionaries.  In this case, have
        the parser mark those words with a particular integer token
        type, and then list the applicable dictionaries here under
        that type; dictionaries listed here for a type are used
        instead of the normal list of dictionaries.

    stopper = <function>

        This function should accept a lexeme and return true if it
        is a stop word, or false otherwise.  Stop words are
        discarded and not indexed.  By default the system uses the
        fts.PorterEnglish.is_stoplexeme function.

    All of these options are saved in the database, so you do not
    have to specify them again the next time you run your
    application; instead you just call the following function.

load_index(db [, prefix=<string> ])

    Retrieves an index that was previously created in the database
    by the create_index(...) function.  The database should be
    provided as an open PyGreSQL connection.  The prefix must be
    specified if you created the index with anything other than the
    default prefix value of `fts'.

Both create_index(...) and load_index(...) return an ftsIndex object,
which you will probably want to assign to a variable for further use,
like so:

    index = load_index(db)

Each ftsIndex object supports the basic interface of a Python
dictionary, but indexes documents instead of storing them.  After the
above assignment, the `index' object would support:

index[key] = body
index[key] = (title, body)

    This indexes a document whose content is provided as a string;
    if a title string is also provided, its text will be weighted
    more heavily when ranking search results.  The key identifies
    the document in the other operations listed below, and can be an
    arbitrary text string.
    The example/ routines use it to store the path of the document
    file.

del index[key]

    Removes from the index the document that was inserted under the
    given key.

index.search(query)

    Returns a list of (key, relevance) pairs, one for each document
    that matches the query, sorted in decreasing order of relevance.
    The keys are the ones you provided when indexing the documents,
    and each relevance measure is a floating-point number.  The
    query should be a whitespace-separated list of words that you
    want each of the returned documents to contain.

index.headline(query, document, [ limit=None, ] [ ignore_types=[16] ] )

    Returns quotes from the document that contain the search terms
    from the query, with the search terms highlighted with `' and `'
    delimiters.  The document content should be given as a text
    string.  The limit should be an integer specifying how many
    tokens the parser should read, so that the rest of a very large
    file is ignored, while ignore_types lists token types to ignore
    when building the headline.

Other dictionary operations behave as though the ftsIndex were a
normal Python dictionary:

len(index)

    Returns how many documents are in the index.

index.keys()

    Returns a list of the keys of all documents currently in the
    index.

key in index

    Returns true if a document with the given key is in the index,
    and false otherwise.


TO DO

- Add Python docstrings to each Python module, class, and method.
- Add a snowball parser interface.
- Allow the user to specify the ranking function.
- Make the headline code more sophisticated, like the Perl version.
- Improve example/load.py so that it deletes the indexes for files
  that have been removed.


DIFFERENCES WITH PERL INTERFACE

- Currently lacks snowball parser module.
- Creates a single lexeme position table rather than several.
- Creates its own tables rather than making you create them.

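To illustrate the parser, dictionary, and stopper interfaces
described under FTS MODULE, here is a minimal, self-contained sketch
that does not need PyFTS or a database.  Every name in it
(simple_parse, plural_dictionary, identity_dictionary, is_stop,
lexemes) is hypothetical, the token type numbers are invented for the
example, and the order in which lexemes() applies the hooks is only
one plausible reading of the option descriptions, not the module's
actual implementation.

```python
import re

# Hypothetical token types for this sketch only; they are not the
# numbering used by the default OpenFTS parser.
WORD, NUMBER = 1, 2


def simple_parse(text):
    """Parser contract: one string in, a list of (type, word) pairs out."""
    tokens = []
    for match in re.finditer(r"[A-Za-z]+|[0-9]+", text):
        word = match.group(0)
        tokens.append((NUMBER if word.isdigit() else WORD, word.lower()))
    return tokens


def plural_dictionary(word):
    """Dictionary contract: return a list of stems, or None if unrecognized."""
    if word.endswith("s") and len(word) > 3:
        return [word[:-1]]
    return None


def identity_dictionary(word):
    """Fallback dictionary that accepts every word unchanged."""
    return [word]


def is_stop(lexeme):
    """Stopper contract: true for stop words, false otherwise."""
    return lexeme in ("the", "a", "an", "of", "and")


def lexemes(text, ignore_types=(NUMBER,), longest_word=32):
    """Apply the hooks in one plausible order (an assumption of this sketch)."""
    result = []
    for token_type, word in simple_parse(text):
        if token_type in ignore_types or len(word) > longest_word:
            continue
        # Try each dictionary in turn until one recognizes the word;
        # the identity fallback always does.
        for dictionary in (plural_dictionary, identity_dictionary):
            stems = dictionary(word)
            if stems is not None:
                break
        result.extend(stem for stem in stems if not is_stop(stem))
    return result


print(lexemes("The 2 parsers and 3 documents"))

# With PyFTS installed, these hooks would presumably be passed as
# keyword arguments (untested here):
#   index = fts.create_index(db, parser=simple_parse,
#                            dictionaries=[plural_dictionary,
#                                          identity_dictionary],
#                            stopper=is_stop)
```

Listing the identity dictionary last means every word that survives
the type and length filters yields at least one stem, which mirrors
the "try the next dictionary" behavior described for the
`dictionaries' option.
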

LICENSING

This collection of Python modules is free software; you can
redistribute them and/or modify them under the terms of the GNU
General Public License as published by the Free Software Foundation;
either version 2 of the License, or (at your option) any later
version.

They are distributed in the hope that they will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
General Public License for more details.

You should have received a copy of the GNU General Public License
along with this program; if not, write to the Free Software
Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307
USA