Remembrance Agent, version 1.40
Jan-Christian Nelson and Bradley Rhodes, MIT Media Lab
March 17, 1998

Contents
--------
0.  Introduction
1.  Building the Remembrance Agent
2.  Creating a database from a collection of text
3.  Using the Remembrance Agent in Emacs
4.  Savant's document template system and other customizations
5.  What's new?

0.  Introduction
----------------

The Remembrance Agent is one of the projects being developed by the MIT
Media Lab's software agents group.  Given a collection of the user's
accumulated email, usenet news articles, papers, saved HTML files and other
text notes, it attempts to find those documents which are most relevant to
the user's current context.  That is, it searches this collection of text
for the documents which bear the highest word-for-word similarity to the
text the user is currently editing, in the hope that they will also bear
high conceptual similarity and thus be useful to the user's current work.
These suggestions are continuously displayed in a small buffer at the
bottom of the user's emacs window.  If a suggestion looks useful, the full
text can be retrieved with a single command.

The Remembrance Agent works in two stages.  First, the user's collection of
text documents is indexed into a database saved in a vector format.
Briefly, the concept behind the vector format is that any given document
may be represented by a vector, each entry of which corresponds to a single
word and is equal to the number of times that word appears in the document.
The advantages gained are that retrieval from disk is relatively speedy,
and similarity between two documents is simple to compute: you take the dot
product of their (normalized) vectors, giving more weight to words that
occur infrequently.  After the database is created, the second stage of the
Remembrance Agent is run from emacs, where it periodically takes a sample
of text from the working buffer, converts it to this vector format, and
finds those documents from the collection whose dot products with it are
highest.  It summarizes the top documents in a small emacs window and
allows you to retrieve the entire text of any one with a keystroke.
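
As a rough illustration of the vector idea described above, the Python
sketch below counts words, gives more weight to words that appear in few
documents, and compares normalized vectors with a dot product.  It is not
Savant's code: the function names and the logarithmic weighting formula are
assumptions chosen only to make the idea concrete.

```python
import math
from collections import Counter


def word_vector(text):
    """Count how many times each lowercase word appears in the text."""
    return Counter(text.lower().split())


def rarity_weights(vectors):
    """Weight each word by log(N / number of documents containing it).

    This particular formula is an assumption made for illustration.
    """
    n = len(vectors)
    doc_freq = Counter()
    for vec in vectors:
        doc_freq.update(vec.keys())
    return {word: math.log(n / df) for word, df in doc_freq.items()}


def similarity(a, b, weights):
    """Dot product of the two weighted vectors after normalizing each."""
    def weighted(vec):
        return {w: c * weights.get(w, 0.0) for w, c in vec.items()}

    def norm(vec):
        return math.sqrt(sum(v * v for v in vec.values()))

    wa, wb = weighted(a), weighted(b)
    na, nb = norm(wa), norm(wb)
    if na == 0 or nb == 0:
        return 0.0
    return sum(v * wb.get(w, 0.0) for w, v in wa.items()) / (na * nb)


def best_matches(query_text, documents, top=3):
    """Return (score, index) pairs for the documents most like the query."""
    vectors = [word_vector(d) for d in documents]
    weights = rarity_weights(vectors)
    query = word_vector(query_text)
    scored = [(similarity(query, v, weights), i) for i, v in enumerate(vectors)]
    return sorted(scored, reverse=True)[:top]
```

Calling best_matches with a sample of the text being edited and a list of
document texts returns the highest-scoring documents first, each with a
score between 0 and 1.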

The Remembrance Agent was the idea of Thad Starner and Brad Rhodes,
graduate students at the Media Lab, and is the product of long hours of
coding by MIT undergrad Jan Nelson.  They can be reached with questions and
comments about the Remembrance Agent via email at ra-bugs@media.mit.edu,
and new versions of the software can be found at
http://www.media.mit.edu/~rhodes/RA/

Good luck, and thanks for using the Remembrance Agent.

1.  Building the Remembrance Agent
----------------------------------
The important files to a user are:
README:                         this file
other/remem-display-mode.el:    emacs interface to savant
other/remem-display-mode.elc:   byte-compiled form of remem-display-mode.el
other/savantrc:                 configuration file for savant (move to
                                ~/.savantrc) 

For hackers and the curious, these are the directories included in the RA
distribution:

main/                   top-level sources for Savant
other/                  front end elisp code, savantrc, and some old
                        (sometimes out of date) documentation
savutil/                utilities used by Savant
template/               code for parsing documents to index
wordvec/                code for handling word vectors

The RA back-end has two executables, "ra-index" and "ra-retrieve", which
together make up the system called Savant.  To build savant, cd to the
RA directory and type:
./configure; make

This should analyze your system and then build ra-index, ra-retrieve, and
remem-display-mode.elc.  Once the compilation is finished, the files you
need will be:
        main/ra-index
        main/ra-retrieve
        other/remem-display-mode.el
        other/remem-display-mode.elc

Move the first two files to wherever you normally keep executables.  You
should also copy the two remem-display-mode files to wherever you like to
keep emacs-lisp files.  Finally, rename savantrc to .savantrc and move it
to your home directory (it is called savantrc in the distribution so that
it won't accidentally go unnoticed).  You can then delete the source files
if you like.


2.  Creating a database from a collection of text
-------------------------------------------------
Savant has two executables: ra-index, which indexes documents into
databases, and ra-retrieve, which performs interactive retrievals from
these databases.  To use ra-index, you must have a set of source text
files and a directory savant can put database files into.
Usage:
    ra-index [-v] [-c <config-file>] <base-dir> <source1> [<source2>] ... 
             [-e <excludee1> [<excludee2>] ...]

The <source> arguments may be files or directories.  If a directory is in
the list, savant will use all its contents, recursing into all
subdirectories.  Non-text files and backup files (those appended with ~ or
prepended with #) are ignored.  Any files or directories specified after
the optional -e flag will be excluded.  Savant will use any files it finds
to create a database in the specified base directory, which must already
exist.  The optional -v argument (verbose) will direct savant to keep you
updated on its progress.  So for example,
	ra-index -v ~/Database/src-base ~/src -e ~/src/*.h
will build a database in my Database/src-base directory, made up of
files from my src directory, excluding files ending in '.h', and it will
do this verbosely.
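
The file-selection rules above can be sketched in Python as follows.  This
is only an illustration of the rules as this README states them, not the
code ra-index uses: the function names are made up, and the sketch makes no
attempt to detect non-text files, since this README does not say how
ra-index recognizes them.

```python
import os


def is_backup(name):
    """Backup files end with '~' or start with '#'."""
    return name.endswith("~") or name.startswith("#")


def collect_sources(sources, excluded=()):
    """Expand files and directories into the list of files to index."""
    skip = {os.path.abspath(p) for p in excluded}
    found = []

    def visit(path):
        path = os.path.abspath(path)
        if path in skip or is_backup(os.path.basename(path)):
            return
        if os.path.isdir(path):
            # Recurse into every entry of the directory.
            for entry in sorted(os.listdir(path)):
                visit(os.path.join(path, entry))
        elif os.path.isfile(path):
            found.append(path)

    for source in sources:
        visit(source)
    return found
```

Note that in the ra-index example above the shell expands ~/src/*.h before
ra-index runs, so the excluded list arrives as individual file names, which
is how this sketch expects it.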

There is also an optional [-c <config-file>] argument, which tells 
savant to use <config-file> instead of the .savantrc in your homedir.
Section 4 of this file details how .savantrc can be modified or how
a new one can be written.

***IMPORTANT***: Savant can build databases in any directory you like, but
the emacs interface for the Remembrance Agent expects a particular
structure.  For each database you want to make, you should create a
directory, and all these directories should live in the same parent
directory.  For example, for my own use I have a directory
~cabbage/Database/, and within that directories ~cabbage/Database/mail/,
~cabbage/Database/papers/, etc. which actually contain the database files.

To see how savant interacts with emacs while the remembrance agent is
running, try running ra-retrieve with the command 'ra-retrieve -v <base-dir>'
after creating a database using ra-index.

3.  Using the Remembrance Agent in Emacs
----------------------------------------
You can load the Remembrance Agent automatically every time you run emacs
by placing the line (load "remem-display-mode") in your .emacs file in your 
homedir.  This assumes that one of remem-display-mode.el or 
remem-display-mode.elc exists in your emacs load-path.  

Before the Remembrance Agent can be used, several variables must be 
configured.  They are set by placing these lines in your .emacs file:
  (setq remem-prog-dir <prog-dir-string>)
  (setq remem-database-dir <database-dir-string>)
  (setq remem-display-scope-databases <3-element list of strings>)

<prog-dir-string> should be the full name of the directory where you put
the ra-retrieve executable, enclosed in double quotes.  In my own use, for
example, I set this to "/home/cabbage/bin".

<database-dir-string> should be the full name of the directory
which holds your database directories, enclosed in double quotes.  I use,
for example "/home/cabbage/Database".  Note that this is the name of a
directory containing directories, not the directory containing the database
files themselves.

remem-display-scope-databases: the emacs version of the remembrance agent
has three "scopes," separate processes performing retrievals
simultaneously.  remem-display-scope-databases should be set to a
three-element list of names of sub-directories of remem-database-dir from
which these scopes should retrieve.  Names may be repeated.  In my own
.emacs, for example, I have this set to '("mail" "mail" "papers") (note the
single quote to indicate a list), corresponding to
/home/cabbage/Database/mail and /home/cabbage/Database/papers.  NOTE: You
can specify nil instead of "dirname" for any element of this list, and that
scope will not start its process.  This will save some memory and processor
time.
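
To make the relationship between the database directory and the scope list
concrete, here is a small Python sketch of how the three scope entries map
onto directories.  The function name is hypothetical and Python's None
stands in for the elisp nil; the emacs interface itself is written in
elisp and may do this differently.

```python
import os


def scope_directories(database_dir, scope_databases):
    """Map each of the three scopes to its database directory.

    A disabled scope (None here, nil in .emacs) maps to None.
    """
    if len(scope_databases) != 3:
        raise ValueError("expected exactly three scope entries")
    return [
        None if name is None else os.path.join(database_dir, name)
        for name in scope_databases
    ]
```

With the settings quoted above, scope_directories("/home/cabbage/Database",
["mail", "mail", "papers"]) gives the mail directory twice and the papers
directory once.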

Okay!  After all these customizations are made, you can start the
Remembrance Agent by typing C-c r r.  It will create its window and
after a moment or two begin to display suggestions like:
   1: 0.26 | Golan Levin       | 07 Feb 96 | this mess  
   2: 0.08 | cabbage           | 04 May 95 | 502journal.tex
   3: 0.07 | cabbage           | 06 Dec 94 | timing.tex
This can be summarized as 
 ID#: rating | author or file owner | date | subject or filename

The rating is a measure from 0 to 1 of how relevant the document is to a 
sample of your current buffer.  To see a suggested document, type
C-c <ID# of document>.
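
If you want to process these summary lines in a program of your own, the
layout shown above is easy to split apart.  The Python sketch below is a
hypothetical helper, not part of the distribution, and assumes the lines
look exactly like the three sample lines above.

```python
def parse_summary_line(line):
    """Split 'ID#: rating | source | date | subject' into a dictionary."""
    id_part, rest = line.split(":", 1)
    # Split on at most three bars so a subject containing '|' stays whole.
    rating, source, date, subject = [f.strip() for f in rest.split("|", 3)]
    return {
        "id": int(id_part),
        "rating": float(rating),
        "source": source,
        "date": date,
        "subject": subject,
    }
```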

further customizations
----------------------
There are three other variables the Remembrance Agent allows you to
customize, and they can be set in the same way as the three above.

remem-display-scope-number-lines is a three-element list, like
remem-display-scope-databases.  Instead of controlling which database
a scope searches from, an element controls how many lines a scope gets 
in the suggestion summary shown above.  The default is '(1 1 1), one line 
for each scope, but you have a total of nine to parcel out as you see fit.
Specifying one of these to be 0 has the same effect as specifying an
element of remem-display-scope-databases to be nil, that is, the scope 
is disabled.

remem-display-scope-update-times is a three-element list controlling
how often, in seconds, each scope updates its suggestions.  The
default is '(10 10 10), ten seconds for each scope.

remem-display-scope-range is another three-element list controlling the
size in words of the samples each scope takes when it goes to find new
suggestions.  The default is '(100 100 100), one hundred words (roughly)
for each scope.  (A word is assumed to be 5 characters long, so in
actuality we're just looking at 5 * 100 characters by default).
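
The arithmetic behind these settings can be checked with a few lines of
Python.  This is a sketch of the rules as described in this section only;
the function names are made up, and the real checks (if any) live in
remem-display-mode.el.

```python
CHARS_PER_WORD = 5   # rule of thumb from above: one word is 5 characters
MAX_TOTAL_LINES = 9  # suggestion lines shared by the three scopes


def sample_size_in_chars(range_in_words):
    """Number of characters sampled for a scope range given in words."""
    return CHARS_PER_WORD * range_in_words


def enabled_scopes(number_lines):
    """Return one True/False per scope; a scope with 0 lines is disabled."""
    if len(number_lines) != 3:
        raise ValueError("expected one entry per scope (three in total)")
    if any(n < 0 for n in number_lines) or sum(number_lines) > MAX_TOTAL_LINES:
        raise ValueError("line counts must be non-negative and total at most 9")
    return [n > 0 for n in number_lines]
```

For the defaults, sample_size_in_chars(100) is 500 characters, and
enabled_scopes([1, 1, 1]) reports all three scopes as enabled.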


4.  Savant's document template system and other customizations
-------------------------------------------------------------- 
Two types of things appear in .savantrc or another customization file.
There are variables and templates, and there is no distinct section
for either (i.e., variable assignments may be mixed in with templates).
The variables are exceedingly simple: to set one, include in .savantrc
<variable_name> followed by <value>.  At present, there are only
four variables in use:
  source_field_width - specifies the number of characters allotted to the
    source field of savant's summary lines, the leftmost text-field,
    specifying who authored the document being suggested.  This is the
    only such variable because the date field is a fixed width, and the
    subject field will take up the remaining available space.
  ellipses - may be true or false.  Specifies whether the name in the
    source field should be abbreviated with ellipses ("...") or simply
    truncated.
  document_windowing - may be true or false.  If this is true, each document
    is broken up into overlapping "windows", each lines_per_window lines long
    (see lines_per_window).  The motivation for this is that since very large
    documents may only have a small section that is relevant to the 
    current context, we want to have only that section suggested to us.
  lines_per_window - length in lines of each "window" (section) to break
    documents into.
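
To picture what document_windowing does, here is a Python sketch that
breaks a list of lines into overlapping windows.  How far consecutive
windows overlap is not stated in this file, so the half-window step used
below is purely an assumption, as are the function and parameter names.

```python
def split_into_windows(lines, lines_per_window, step=None):
    """Break a list of lines into overlapping windows.

    step is how far each window starts after the previous one.  The
    half-window default is an assumption, not Savant's documented behavior.
    """
    if lines_per_window < 1:
        raise ValueError("lines_per_window must be at least 1")
    if step is None:
        step = max(1, lines_per_window // 2)
    if step < 1:
        raise ValueError("step must be at least 1")
    if not lines:
        return []
    windows = []
    start = 0
    while True:
        windows.append(lines[start:start + lines_per_window])
        if start + lines_per_window >= len(lines):
            break  # this window already reaches the end of the document
        start += step
    return windows
```

Each window can then be indexed and suggested as if it were a separate
document, so only the relevant part of a long file is brought up.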

If you can think of more variables you'd like to see, please let us know
(ra-bugs@media.mit.edu).

Savant uses a system of templates to decide where to mark off the separate
documents in a file and what to use in the summary line emacs displays
to you.  The current version of the package provides templates that
will correctly dissect RMAIL, plain email, usenet articles, LaTeX
documents, and HTML documents.  Those files not recognized are treated 
as one large document.  You need not read this section unless you have
another file format you wish savant to recognize.  If you do create one,
let us know! (ra-bugs@media.mit.edu).

A template has this format:
Template <template-name>
{
  Recognize
    <pattern>
  Delimiter
    <pattern>
  Format
    <pattern>
}
Patterns consist of strings surrounded by double quotes (which may contain
C escape sequences like \n and control sequences like ^L), the variables
SOURCE, DATE, SUBJECT and BODY, and the commands startline, ignore, optional,
anyorder, anyof and icase.  Any pattern may also contain subpatterns
delimited with curly braces.  Savant will use the Recognize pattern
to decide what kind of file it is dealing with, for each file.  Whichever
template gets the earliest match in the file wins.  Then that template's
Format pattern is used to start parsing off documents; when savant comes
to one of the variable names, text from the current location in the
file will be used to fill that variable.  If it comes across the delimiter
pattern, it starts a new "document."  Here are the definitions of the
keywords - don't worry, an example is coming! 

startline: this indicates that the next string or subpattern should
	appear at the start of a line of text (like ^ in regular expressions)
ignore: ignore text until the next specified string or subpattern
optional: the next string or subpattern is optional
anyorder: the remaining strings and subpatterns in the pattern may
	appear in any order
anyof: stop when any of the remaining strings or subpatterns in this
	pattern is found
icase: the next string should be matched case-insensitively
Variables SOURCE, DATE, SUBJECT and BODY:  When savant finds one of
	these, it starts filling the variable with text from the
	file.  It stops at the next pattern item it finds.  During 
	retrieval, the first three variables are used in the 
	emacs summary line like so:
	#: 0.xx | SOURCE | DATE | SUBJECT
	and BODY is the text returned when you type C-c #.
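
Before going on to the example, here is a Python sketch of the "earliest
match wins" rule savant uses to choose a template.  It is a simplification
under stated assumptions: each Recognize pattern is reduced to a single
literal string, the template names are invented, and the 500-character
limit is the one described for version 1.40 in section 5.

```python
RECOGNIZE_LIMIT = 500  # version 1.40 only examines the start of the file


def pick_template(text, recognizers):
    """Return the name of the template whose marker appears earliest.

    recognizers maps a (hypothetical) template name to a literal marker
    string; real Recognize patterns can be richer than one literal.
    Returns None when no marker is found, in which case the file would be
    treated as one large document.
    """
    head = text[:RECOGNIZE_LIMIT]
    best_name, best_pos = None, None
    for name, marker in recognizers.items():
        pos = head.find(marker)
        if pos != -1 and (best_pos is None or pos < best_pos):
            best_name, best_pos = name, pos
    return best_name
```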

So let's look at RMAIL's format pattern from .savantrc.
  Format 
    {"^_^L\n", ignore, startline, "*** EOOH ***\n",
     {anyorder, {startline, "From: ", SOURCE, "\n"},
               {startline, "Date: ", DATE, "\n"},
               optional {startline, "Subject: ", SUBJECT, "\n"}},
     "\n\n", BODY}

What savant starts looking for is "^_^L\n"; when this is found it ignores
whatever text comes next until "*** EOOH ***\n" is found at the head of
a line.  Next is a subpattern, which is declared to be orderless by the
anyorder command.  It contains three subpatterns, the third of which
is optional, each of which looks for a string at the head of a line,
and upon finding it, fills a variable with the text following the
string, up until the next newline character.  This will result in 
SOURCE getting the From: line, DATE getting the Date: line, and SUBJECT
getting the Subject: line.  Some email has no subject, so this last one
is optional.  If an optional variable is not filled, savant falls back to
the file owner for SOURCE, the file creation time for DATE, or the filename
for SUBJECT.  Finally, after everything in the anyorder subpattern
has been dealt with, it goes on to look for a blank line (\n\n), and
the remainder of the document (up until the next document) is the BODY.
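
For readers who find code easier to follow than prose, the Python sketch
below mimics what the header part of this pattern does to a single message.
It is not how savant implements templates: it uses regular expressions
instead of the template language, assumes the text it is given starts just
after the "*** EOOH ***" line, and treats all three header fields as
optional for simplicity.  The function and parameter names are invented.

```python
import re


def parse_message(text, file_owner, file_date, filename):
    """Fill SOURCE, DATE, SUBJECT and BODY from one message.

    Header lines may appear in any order.  A field that is not found falls
    back to the file owner, file date or filename, as described above.
    """
    # The first blank line separates the headers from the body.
    header, _, body = text.partition("\n\n")

    def field(label, fallback):
        pattern = r"^" + re.escape(label) + r": (.*)$"
        match = re.search(pattern, header, re.MULTILINE)
        return match.group(1) if match else fallback

    return {
        "SOURCE": field("From", file_owner),
        "DATE": field("Date", file_date),
        "SUBJECT": field("Subject", filename),
        "BODY": body,
    }
```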

By the way, all commands and variable names are case insensitive.
If you experiment with creating new templates, be forewarned that 
there is very little if any syntax checking built in right now,
so try to observe the style used in the five templates provided.
Feel free to email us for tips and pointers, at
remembrance-maintainer@media.mit.edu

5.  What's New?
---------------
Version 1.0b   -  June 1996.  First stable public release.
Version 1.01b  -  September 1996.  Fixed a documentation bug.
Version 1.1b   -  October 1996.  Fixed a bug causing common words not to be 
		  discarded.  Databases will now be smaller by a significant
		  factor, and faster to search by an even greater factor.  
                  Also fixed a bug where (stupidly) I was reading index -1 of
                  an array, often causing core dumps.  Lastly, we added a 
                  hook to emacs rmail-mode which will help the mail interface
                  ignore mail headers.  This should be easily extendable to any
	          emacs mail package, so send me mail about your favorite one.
Version 1.11b  -  November 1996.  Here I fixed the -1 bug mentioned above, this 
		  time for real.  Also made savant strip document titles of
		  any control codes, and implemented a new system of indexing
		  a document's subject along with its body.
Version 1.2b   -  March 1997.  Separated the "savant" program into "index"
                  and "retrieve" to make them both more lightweight.  Also
                  cleaned up a lot of the internals, making it easier to
                  integrate savant into other programs.
Version 1.22b  -  June 1997.  Fixed several memory leaks, and added some
                  hooks for future development.
Version 1.38   -  March 1998.  The next public release, though there have
                  been several internal releases.  Fixed many more bugs,
                  and it's now installable under GNU autoconf.  Also
                  upgraded the relevance algorithm to TFiDF.  Finally,
                  there are lots of additional features for using new
                  fields in a search (like the location of where a
                  note-file has been taken, the time of day and date,
                  person present, etc) though these aren't currently
                  documented.
Version 1.40   -  April 1998.  A minor upgrade.  This fixes date & time
                  searches, which somewhere along the line got completely
                  busted.  It also adds the "fullquery" command, which
                  gives lots of information as a query result, instead of
                  just the nicely formatted, human readable stuff.  Also
                  fixed ra-index so not having a "-v" still works.
                  Finally, the template system now only recognizes a file
                  as a certain type if the "recognize" field appears in the
                  first 500 characters (#define'ed in template.h).  This
                  keeps ra-index from thinking a text file is plain email,
                  just because it has "From " at the start of a line somewhere.

Jan-Christian Nelson and Bradley Rhodes

