
This article has been reproduced from The Delphi Magazine: the best magazine for Delphi and Kylix developers! Published monthly, each issue is packed with ideas to help you get the most out of Delphi and Kylix: the development tools from Borland which have taken the world by storm. We cater especially for professional developers and our team of authors includes some of the foremost Delphi and Kylix experts in the world.


The source code for this article can be downloaded from the Companion Disk area of our website.


This article first appeared in The Delphi Magazine Issue 78 (February 2002)

Algorithms Alfresco

Ask A Thousand Times

How does Google do it?

Well. There are times when you can argue about something and there are times when you just have to accept the job you're given. I stared at the screen, at the email from Our Esteemed Editor. It started off with 'Can't remember if I've ever suggested this before...' and went on to ask about a particular algorithm.

Ha! A 'suggestion', no less! Right! I can read between the lines as well as anyone, I didn't get where I am today, etc. I dashed off a quick reply to the effect that my grandmother learned that algorithm whilst a teacher at Malory Towers (or was it The Chalet School?) and then secretly perused some algorithm books, sweat beading on the old brow. I brought up Google, which only made matters worse. How did they do it?

Seriously, OEE had raised a thorny problem, about whose solution I had only the vaguest idea. The question was that of text searching. How do Google and all the other web search engines enable you to find relevant web pages using only a couple of words? They can't just download a billion pages and then search them all at your request. They must preprocess them and index them somehow. This article is about how they do this.

By Default By Design

Let's get some assumptions and definitions out of the way first. We shall assume that we have numerous documents in some kind of database that we wish to search. By document, I mean some text as a separate object. It could be an HTML page, a newsgroup posting, a Word document, or whatever. By database, I don't necessarily mean an actual database, like Oracle or SQL Server, but some kind of persistent 'container' holding the documents. This could be something as simple as a folder on your machine or it could actually be a database, with the documents stored as BLOBs. The text is assumed to be an ASCII stream: in other words, to consist of words strung together with spaces and the usual punctuation. I don't want to get into the problems of indexing Word or PDF documents, thank you very much.

When we search, we want to be able to specify a series of words separated by the reserved words AND, OR, and NOT. These will be our logical operators. In other words, to search the documents for those with both 'text' and 'indexing' we would specify text AND indexing as our search criterion. To find the documents that contain either the word text or indexing (or both) we would use text OR indexing. If we specify two or more words without any ANDs or ORs, the search engine will assume that the words are separated by AND: hence text indexing as a search phrase will be read as text AND indexing. NOT is obviously the logical negation operation. We'll assume that NOT has the highest precedence, that AND and OR have the same precedence, and that they're evaluated from left to right. So NOT text AND indexing is read as (NOT text) AND (indexing).

From these definitions, I'm sure that you can see that this month's article will be split up into several subtopics. First, we need to parse each document into separate words. Second, we need to build the word index using these words. Third, we need to be able to parse the search phrase to understand it. Last of all, we need to apply the parsed search phrase to the word index to be able to find the documents we want. A tall order, but we'll manage without taking over the entire magazine. I think.

A To Z

Let's take things from easy to hard. The third operation is possibly the easiest to do, so let's start there.

Whenever we write a parser, it's generally easier to start by writing down the grammar we're going to be parsing in BNF. (For details on BNF, see Algorithms Alfresco for January 2001.) Table 1 shows the BNF for our particular search phrase grammar. (The ::= defines a production and means 'is defined as' and the | denotes an alternative. The first production therefore states that an expr is either a factor, or is a factor immediately followed by another expr, or is a factor followed by AND followed by another expr, or is a factor followed by OR followed by another expr. The other productions are read in a similar manner.)

➤ Table 1: BNF for the search grammar.

<expr> ::= <factor> | <factor> <expr> | <factor> AND <expr> | <factor> OR <expr>

<factor> ::= <term> | NOT <factor>

<term> ::= <word> | ( <expr> )

<word> ::= a series of characters delimited by white space

Having written the BNF, it is a fairly simple matter to convert it to code. We write a parser class, and for each production, we create a method called ParseXxx where the Xxx is the name of the object on the left-hand side of the production. The code in each of these methods implements the right-hand side of the production. This is known as recursive descent parsing.

Let's take the easy one first: ParseFactor. The production states that a factor is a term or is the word NOT followed by another factor. So ParseFactor should start off with a check for the current word to be NOT. If it is, we should recursively call ParseFactor again. If it isn't, we should call ParseTerm. The other methods can be written in a similar manner.

However, before we do so, we have to consider something else. It's all very well to be able to parse the search phrase, but we still need to store the results of our parsing in a data structure so that we can apply it to a word index. We'll store the parsed phrase in a Reverse Polish Notation (RPN) string where the operator occurs after its operands. The query word1 AND word2 would therefore be stored as word1 word2 AND; NOT (word1 OR word2) would be stored as word1 word2 OR NOT, and so on. (For more information on RPN expressions, see Algorithms Alfresco from June 1999.) You'll see why I chose this layout when it's time to apply the parsed string.

Listing 1 shows the TaaSearchParser class as well as the classes that go to make up the RPN string. (If you follow the code, you'll see that the RPN string gets stored as a linked list of TaaRPNNodes and not as a real 'string'.) If you look at the method that parses a factor, spParseFactor, you'll see that it first checks whether the current word is NOT. If it isn't, the method returns the result of calling the spParseTerm method. If it is, the method calls spParseFactor again and appends a newly created NOT node to the result of that call. The Append method of the TaaRPNNode class walks the linked list and adds the node to the final item in the list. The other parsing methods follow the same scheme.

➤ Listing 1: Parsing a search phrase.

type
  // a base node that helps form an RPN string
  TaaRPNNode = class
    private
      FNext : TaaRPNNode;
    public
      destructor Destroy; override;
      procedure Append(aNode : TaaRPNNode);
      property Next : TaaRPNNode read FNext;
  end;
  TaaRPNWord = class(TaaRPNNode)  // an RPN node for a word
    private
      FWord : string;
    public
      constructor Create(const aPhraseWord : string);
      property PhraseWord : string read FWord;
  end;
  // an RPN node for the AND operator
  TaaRPN_AND = class(TaaRPNNode);
  // an RPN node for the OR operator
  TaaRPN_OR = class(TaaRPNNode);
  // an RPN node for the NOT operator
  TaaRPN_NOT = class(TaaRPNNode);
  TaaSearchParser = class  // a parser for search phrases
    private
      FCurWord : string;
      FPhrase  : string;
      FPosn    : integer;
      FRPN     : TaaRPNNode;
    protected
      function spGetRPN : TaaRPNNode;
      procedure spSetPhrase(const aPhrase : string);
      function spParseExpr : TaaRPNNode;
      function spParseFactor : TaaRPNNode;
      function spParseTerm : TaaRPNNode;
      procedure spParsePhrase;
      procedure spGetNextWord;
    public
      constructor Create(const aPhrase : string);
      destructor Destroy; override;
      property Phrase : string read FPhrase
        write spSetPhrase;
      property RPN : TaaRPNNode read spGetRPN;
  end;
destructor TaaRPNNode.Destroy;
begin
  Next.Free; // recursively destroys linked list of nodes
  inherited Destroy;
end;
procedure TaaRPNNode.Append(aNode : TaaRPNNode);
var
  Walker : TaaRPNNode;
begin
  Walker := Self;
  while (Walker.Next <> nil) do
    Walker := Walker.Next;
  Walker.FNext := aNode;
end;
constructor TaaRPNWord.Create(const aPhraseWord : string);
begin
  inherited Create;
  FWord := aPhraseWord;
end;
constructor TaaSearchParser.Create(const aPhrase : string);
begin
  inherited Create;
  Phrase := aPhrase;
end;
destructor TaaSearchParser.Destroy;
begin
  FRPN.Free;
  inherited Destroy;
end;
procedure TaaSearchParser.spGetNextWord;
var
  Walker    : PAnsiChar;
  WordStart : PAnsiChar;
begin
  {advance past the current word}
  inc(FPosn, length(FCurWord));
  FCurWord := '';
  {skip past the white space}
  Walker := @FPhrase[FPosn];
  while (Walker^ = ' ') do begin
    inc(FPosn);
    inc(Walker);
  end;
  {check for the parentheses first; if found, copy them to
    the internal field}
  if (Walker^ = '(') then
    FCurWord := '('
  else if (Walker^ = ')') then
    FCurWord := ')'
  {otherwise search for the end of the current word (it'll
    end with the end of the string, with white space, or
    with parentheses); copy the word to the internal field}
  else begin
    WordStart := Walker;
    while (Walker^ <> #0) and (Walker^ <> ' ') and
          (Walker^ <> '(') and (Walker^ <> ')') do
      inc(Walker);
    FCurWord := Copy(FPhrase, FPosn, Walker - WordStart);
  end;
end;
function TaaSearchParser.spGetRPN : TaaRPNNode;
begin
  if (FRPN = nil) then
    spParsePhrase;
  Result := FRPN;
end;
function TaaSearchParser.spParseExpr : TaaRPNNode;
begin
  Result := spParseFactor;
  spGetNextWord;
  if (FCurWord = 'and') then begin
    spGetNextWord;
    Result.Append(spParseExpr);
    Result.Append(TaaRPN_AND.Create);
  end else if (FCurWord = 'or') then begin
    spGetNextWord;
    Result.Append(spParseExpr);
    Result.Append(TaaRPN_OR.Create);
  end else if (FCurWord <> '') and (FCurWord <> ')') then
    begin
    Result.Append(spParseExpr);
    Result.Append(TaaRPN_AND.Create);
  end;
end;
function TaaSearchParser.spParseFactor : TaaRPNNode;
begin
  if (FCurWord <> 'not') then
    Result := spParseTerm
  else begin
    spGetNextWord;
    Result := spParseFactor;
    Result.Append(TaaRPN_NOT.Create);
  end;
end;
procedure TaaSearchParser.spParsePhrase;
begin
  if (FPhrase <> '') then begin
    FPosn := 1;
    spGetNextWord;
    if (FCurWord <> '') then
      FRPN := spParseExpr;
  end;
end;
function TaaSearchParser.spParseTerm : TaaRPNNode;
begin
  if (FCurWord = '(') then begin
    spGetNextWord;
    Result := spParseExpr;
    if (FCurWord <> ')') then
      raise Exception.Create(
        'TaaSearchParser: missing close parenthesis in phrase');
  end
  else begin
    if (FCurWord = '') then
      raise Exception.Create(
        'TaaSearchParser: missing final search word');
    if (FCurWord = 'and') or (FCurWord = 'or') or
       (FCurWord = 'not') then
      raise Exception.Create(
        'TaaSearchParser: operator used as search word');
    Result := TaaRPNWord.Create(FCurWord);
  end;
end;
procedure TaaSearchParser.spSetPhrase(const aPhrase : string);
begin
  FPhrase := LowerCase(aPhrase);
  FRPN.Free;
  FRPN := nil;
end;

Errors in the search phrase are all caught by the spParseTerm method (it is at the bottom of the BNF grammar, after all). Errors include: a missing right parenthesis; an operator (AND, OR, or NOT) being used as a search word (example: 'this AND and'); and a missing final word when the phrase has been completely read (example: 'this AND').

The most interesting method is perhaps spGetNextWord. It is this method that scans the search phrase for the individual words. The search phrase is assumed to be pretty simple: words are separated by white space. There's a small extra wrinkle, however: we must take account of parentheses (otherwise 'NOT (this AND that)' would be parsed as 'NOT', '(this', 'AND', and 'that)'). So we must use parentheses as word delimiters as well as white space.

This Must Be Magic

Well, that was the easy bit. Now on to the real meat: generating the index. The first step is parsing out all the words from the text.

In the search phrase parser, we had to read the phrase word by word. It should come as no surprise that we have to do the same for every one of our documents; however, the definition of what constitutes a word becomes more complex. Indeed, my research on Google led me into all sorts of blind alleys and to some remarkably nasty looking code.

Let's encapsulate my discoveries. There's the school of thought that says 'the space is the delimiter'. In other words, to find the words, we drop an anchor and parse through the text character by character until a space is reached. The text from the anchor to the space is a word. Continue like this (drop, parse, cut) until you have parsed all the text and found all the words.

This simplistic algorithm is a little too simple for our needs. Punctuation is not stripped, and 'hangs' on to the beginnings or the ends of words. For instance, in that last sentence, we'd have [stripped,] ['hangs'] and [words.] as some of the 'words' in the sentence. Another problem is that if we follow the algorithm strictly we'd get some false positives when there are two or more spaces in a row. Hence, in general, there's some scanning through spaces after finding a word.

The next school of thought is the 'words just consist of letters and digits' brigade. Here we have two 'modes': scanning an alphanumeric word or scanning all the punctuation or spaces in between (that is, the non-alphanumeric characters). It's a simple enough state machine that a non-beginner programmer could write in 15 minutes. However, in that last sentence, it would convert [It's] into two words [It] and [s] and [non-beginner] into [non] and [beginner]: perhaps not what we intended.
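For concreteness, here is a minimal sketch of that second, letters-and-digits approach. It is not code from the companion disk, and it collects the words into a TStrings (from the Classes unit) purely to keep the example self-contained; you can see how it would split [It's] and [non-beginner] exactly as described.

// a minimal sketch (not from the companion disk) of the
// letters-and-digits word scanner
procedure NaiveAlphaNumParse(const aText : string; aWords : TStrings);
var
  i       : integer;
  CurWord : string;
begin
  CurWord := '';
  for i := 1 to length(aText) do begin
    {letters and digits belong to a word; anything else ends one}
    if aText[i] in ['A'..'Z', 'a'..'z', '0'..'9'] then
      CurWord := CurWord + aText[i]
    else if (CurWord <> '') then begin
      aWords.Add(CurWord);
      CurWord := '';
    end;
  end;
  {flush a final word that runs to the end of the text}
  if (CurWord <> '') then
    aWords.Add(CurWord);
end;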

What we will do instead is to follow a middle path. We'll make the delimiting characters a parameter passed to the routine (a string would be best). The routine would then be coded as a state machine, as in the second example, but this time it would determine state switches based on whether the current character is, or isn't, in the delimiter parameter.

This is all very well, but what do we want to do with the words that we discover? Before I answer that question, let's take a detour.

Welcome To The Real World

What I've done so far is describe a particular algorithm with certain well-defined features: it takes a bunch of text and extracts the words, the words being delimited by a set of delimiter characters. I'm now just about to go into details about how I want to use those words. Surely it makes sense to implement the word parser now, before we go on to the next step? I should put my programmer's hat on firmly and code up the parser in a generic way. Once I've done that, I can then think about how I want to use the words that the parser is going to spit out at me.

In other words, we should ideally decouple the code that processes the words from the routine that discovers them. Decoupling (or separating) code is a Good Thing in programming: it enables better reuse of code that you've already written.

Let's continue with this argument. By doing this, writing a generic implementation of a word parser, I can use it in other situations: implementing a word count routine, finding the distinct words used in the text, or performing some analysis of the writer's vocabulary. By not doing it, I end up with a word parser that can only be used for my particular need today: tomorrow be damned.

Of course, there is another very important reason: by writing decoupled code, you get to test the various algorithms separately and thoroughly. If you had written tightly coupled code, you'd run into the problem of not knowing where a bug might be. (Is it in the parser? Or is it in the code that processes each word? Maybe in the coupling?)

So, by separating out the 'word-processor' part from the 'word-parser', I could easily write the parser and then, for testing, write a simple 'word-processor' routine that wrote the words as separate lines in a text file, for example. Later on, I could write the real routine to process the words, secure in the knowledge that the word parsing routine works and has been thoroughly tested.

So, let's do that. I'll make the word parsing routine take in a stream containing the text to be parsed. It'll also, as I've said, take in a string containing the word delimiter characters. There are a couple of choices for how to present the words parsed out of the text. First, put them all into a container (a string list or a text file, or similar) to be processed en masse. Or, second, fire off an event for each word and let the word-processor decide what to do with them (whether to save them, whether to count them, whether to process them all individually when they arrive, and so on). To me the latter choice makes the most sense: it uses less memory and it defers any decision about saving the words to the code making the call.

So, the final parameter is an Action routine. For every word discovered, this routine will be called to process the word. Our initial Action routine will just write the words out to a text file, but, later on, we'll be writing the heavy-duty routine that helps us in our quest. Normally, I would write the Action routine as a simple procedure. However, I've been taken to task in the past for this simplistic eighties style of design and so I shall make it a method of an object instead.
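Listing 2 calls this Action routine through a TaaWordParseAction type whose declaration doesn't appear in the listing; judging from the call aAction(CharQ.AsString, StopNow), it is presumably something like the method-pointer type below. The little TWordDumper class is my own illustration of the simple text-file 'word-processor' just described, not the code from the companion disk.

type
  {assumed declaration, inferred from the call in Listing 2}
  TaaWordParseAction = procedure(const aWord : string;
    var aStopNow : boolean) of object;
  {a trivial test 'word-processor': collects each word and can
    save them, one per line, to a text file}
  TWordDumper = class
    private
      FWords : TStringList;
    public
      constructor Create;
      destructor Destroy; override;
      procedure SaveToFile(const aFileName : string);
      procedure WordFound(const aWord : string;
        var aStopNow : boolean);
  end;
constructor TWordDumper.Create;
begin
  inherited Create;
  FWords := TStringList.Create;
end;
destructor TWordDumper.Destroy;
begin
  FWords.Free;
  inherited Destroy;
end;
procedure TWordDumper.SaveToFile(const aFileName : string);
begin
  FWords.SaveToFile(aFileName);
end;
procedure TWordDumper.WordFound(const aWord : string;
  var aStopNow : boolean);
begin
  FWords.Add(aWord);
  aStopNow := false; {never ask the parser to stop early}
end;

A call such as AAParseWords(MyStream, '.,;:!?()-', Dumper.WordFound) would then dump every word in the stream for checking.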

Phew! After all that, Listing 2 shows the parser. If you look at it, you will see that I've made a designer's decision that any character less than or equal to a space is also a delimiter (characters like carriage return, line feed, tab, and space: the so-called non-visible characters). The delimiter string therefore just needs to hold visible punctuation.

➤ Listing 2: Parsing a stream of text into words.

function IsDelimiter(aCh : char; const aDelims : string) :
  boolean;
// returns true if aCh is in aDelims or is <= space
var
  i : integer;
begin
  Result := true;
  if (aCh > ' ') then begin
    for i := 1 to length(aDelims) do
      if (aDelims[i] = aCh) then
        Exit;
    Result := false;
  end;
end;
procedure AAParseWords(aText : TStream; const aDelims : string;
  aAction : TaaWordParseAction);
const
  BufSize = 16 * 1024;
var
  State   : (ScanWord, ScanOther);
  Buffer  : PChar;
  BufPos  : PChar;
  BufLeft : integer;
  Ch      : char;
  StopNow : boolean;
  CharQ   : TaaCharQueue;
begin
  // prepare for the memory allocations
  Buffer := nil;
  CharQ := nil;
  try
    // create a character queue with which to build words
    CharQ := TaaCharQueue.Create;
    // allocate a buffer to hold data from the stream
    GetMem(Buffer, BufSize);
    // force the stream to be read first time through
    BufLeft := 1;
    BufPos := Buffer;
    // we'll start in the scanning delimiter state
    State := ScanOther;
    StopNow := false;
    // continue until there is no more data in the stream...
    while (not StopNow) and (BufLeft <> 0) do begin
      // advance the buffer variables (reading more data if
      // needed)
      dec(BufLeft);
      if (BufLeft <> 0) then
        inc(BufPos)
      else begin
        BufLeft := aText.Read(Buffer^, BufSize);
        BufPos := Buffer;
      end;
      // get the next character
      Ch := BufPos^;
      // process the character according to the state
      case State of
        ScanWord :
          begin
            if IsDelimiter(Ch, aDelims) then begin
              aAction(CharQ.AsString, StopNow);
              State := ScanOther;
            end
            else
              CharQ.Append(Ch);
          end;
        ScanOther :
          begin
            if not IsDelimiter(Ch, aDelims) then begin
              CharQ.Clear;
              CharQ.Append(Ch);
              State := ScanWord;
            end;
          end;
      end;{case}
    end;
  finally
    if (Buffer <> nil) then
      FreeMem(Buffer);
    CharQ.Free;
  end;
end;

Skyscraping

What next? Well, we have to decide what to do with the words we parse out of the text. Remember that our goal is to create a word index that we can search against. So it certainly makes sense that we have this huge structure containing items keyed by word. What must each item contain? The word for a start: it's the key into the index. And then it will contain a list of documents that contain the word. I'm sure you can see that, in a loose sense anyhow, given a word, we can easily find the documents that contain that word. This structure is known as an inverted file.

Of course, there is a group of words that we really shouldn't bother to index since every document can be expected to contain them, or, more precisely, so many documents will contain them it just doesn't make sense to index them. In English text, these include a, the, of, you, and, or, and so on. These words are known as stop words, since when we get one we might as well stop processing it. There is no single list of stop words, I'm afraid; it's more of a subjective thing. For our purposes, we'll use a simple list I found on the internet.

At first blush, when I say huge for the inverted file, it seems I really mean gigantic. Suppose the documents were URLs for a web search engine. Then, for example, the word text would have a list of possibly millions of URLs. Each item in the index could then be tens (hundreds) of megabytes in size and there would be thousands of them (possibly millions). I don't think so.

The first obvious thing we notice to reduce the space requirements is that the list of documents for each word doesn't have to contain the actual names of the documents. We could have an array of document names somewhere and then for each item in the index we would have a list of integer values, each value being the number of the document in the external array (a simple lookup, in other words, to find the document name for a given value).

That's still pretty steep memory-wise for each item in the index, but manageable. However, it doesn't make the next stage particularly easy to do: applying a search phrase to the index. For a search phrase consisting of a single word, it would be pretty easy to get a list of documents containing the word. But remember that the search phrase can contain the logical operators: AND, OR, and NOT. How would we implement these?

Suppose that we had the search phrase text AND indexing. We can look up both words and get the two lists. Now, to perform an efficient AND on these two lists of integers would require the lists to be sorted (note to self: must maintain the document lists sorted). Imagine that each list stands vertically, side by side. Look at the two top items. If they are different, remove the smallest item and throw it away, and then go back to look at the two top items again. If they are the same, remove both items from both lists and store the number in a result list, and then go back to look at the two top items again. I'm sure that you can see that the resulting list would only contain the numbers that originally appeared in both lists; in other words, we've done an AND operation.

For OR, we would do the same, except that every distinct number gets put in the result list. For NOT, it's a little more complicated, but I'm sure you get the idea. (I've deliberately glossed over the fact that one of the two lists could be empty during all this: an easy exercise for the reader.)
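To make the merge concrete, here is a minimal sketch of the AND case over two sorted lists of document numbers. It's my own illustration rather than code we'll actually need, because we're about to replace the integer lists altogether.

type
  TDocList = array of integer; {sorted ascending, no duplicates}
{keep only the document numbers that appear in both lists}
function AndDocLists(const A, B : TDocList) : TDocList;
var
  ia, ib, ir : integer;
begin
  {over-allocate the result, then trim it at the end}
  SetLength(Result, length(A));
  ia := 0; ib := 0; ir := 0;
  while (ia < length(A)) and (ib < length(B)) do begin
    if (A[ia] < B[ib]) then
      inc(ia)                 {only in A: throw it away}
    else if (A[ia] > B[ib]) then
      inc(ib)                 {only in B: throw it away}
    else begin
      Result[ir] := A[ia];    {in both: keep it}
      inc(ir);
      inc(ia);
      inc(ib);
    end;
  end;
  SetLength(Result, ir);
end;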

However, it's all a bit complicated. We must make sure the lists are maintained in ascending order. We must code up the merge-type operation for each logical operator, and make sure that we cater for the running-out-of-list problem. If we're not careful we'd run into efficiency problems should the output list not be large enough and have to be grown. Etc, etc.

Oh, if only we could use some nice logical arithmetic instead of all this messing around with merging integer lists. Well, actually, we can. Back in December 1999, I presented a bitset class, an array of Boolean values, all either on or off, true or false. Methods were given to perform AND, OR and NOT operations on bitset objects. What if we had a bitset for each word in our index, instead of a list of integers? Each bit would indicate a particular document; if the nth bit were set, the nth document contained that word; if clear, it did not. The operations we'd perform in our search would then be simple indeed: logical arithmetic on bitsets.

All That Matters

Consider this idea in more detail, though. Each bitset corresponds to a particular word. For unusual words the majority of the bits in the bitset are likely to be off, indicating that most documents will not contain the word. Since the bits are gathered eight to a byte in this bitset class, that means that the majority of bytes in the internal buffer are zero. And that means, in general, that the bitsets are highly compressible, most easily by run-length encoding (RLE).

RLE is a simple compression scheme where essentially runs of equal bytes are replaced by a length value byte (or perhaps a word) followed by the byte itself (known as the literal byte). The fun part comes when the run length is 1: how do you encode this? A length byte of 1 followed by the byte itself? That would mean that a buffer of bytes with no runs greater than 1 would double in size when we compress it. We can't just output the literal byte and hope for the best: there is no way to know whether a byte is a literal byte or the length byte for an RLE pair. Unfortunately, there is no standard method of solving this conundrum.

I decided to use a single bit, a state bit, to indicate whether the next byte was literal or was a length byte for the byte value that immediately followed. On its face, however, this is nasty: outputting single bits is quite simply a pain in the neck. I didn't want to output a Boolean either: too big. In the end, to avoid having to write individual bits to the output buffer, I gathered the state bits eight at a time and output them as a byte before the data that corresponds to the eight state bits. If you remember that far back, this is like the scheme I used in the LZ77 compression from the May 1999 issue. In our worst-case scenario (no run length is greater than one), this RLE compression algorithm would grow the data by about an eighth in size.

So, I altered the bitset class to store and load its data from a stream using RLE compression. To avoid spending too much time decompressing the data, I changed the bitset class to defer unpacking the bit data until it was actually needed. Listing 3 shows the bitset's RLE compression and decompression.

➤ Listing 3: Bitset RLE (de)compression.

type
  TRLEData = packed record
    rdLen : byte;
    rdVal : byte;
  end;
  TRLEPacker = class
    private
      FRLEData : array [0..7] of TRLEData;
      FRLEInx  : integer;
      FBuffer  : PByteArray;
      FBufInx  : integer;
    protected
      procedure rpWriteEncoding(aCount : integer);
    public
      constructor Create(aBufSize : integer);
      destructor Destroy; override;
      procedure Add(aLen : byte; aValue : byte);
      procedure MarkComplete;
      procedure CopyBuffer(var aDest);
      property BufferLen : integer read FBufInx;
  end;
constructor TRLEPacker.Create(aBufSize : integer);
begin
  inherited Create;
  GetMem(FBuffer, aBufSize);
end;
destructor TRLEPacker.Destroy;
begin
  if (FBuffer <> nil) then
    FreeMem(FBuffer);
  inherited Destroy;
end;
procedure TRLEPacker.Add(aLen : byte; aValue : byte);
begin
  FRLEData[FRLEInx].rdLen := aLen;
  FRLEData[FRLEInx].rdVal := aValue;
  inc(FRLEInx);
  if (FRLEInx = 8) then
    rpWriteEncoding(8);
end;
procedure TRLEPacker.CopyBuffer(var aDest);
begin
  Move(FBuffer^, aDest, FBufInx);
end;
procedure TRLEPacker.MarkComplete;
begin
  {add the sentinel to indicate end-of-compressed-data (a
    code for a length of zero does this)}
  Add(0, 0);
  {write out any final encoding}
  if (FRLEInx <> 0) then
    rpWriteEncoding(FRLEInx);
end;
procedure TRLEPacker.rpWriteEncoding(aCount : integer);
var
  i : integer;
  ControlByte : byte;
begin
  {initialize the control byte}
  ControlByte := 0;
  {for all the encodings, set the relevant bit of the
    control byte if a run, leave it clear otherwise (note:
    the end-of-data sentinel has a length of zero and this
    code treats it as an actual length)}
  for i := 0 to pred(aCount) do begin
    ControlByte := ControlByte shl 1;
    if (FRLEData[i].rdLen <> 1) then
      inc(ControlByte);
  end;
  {if the number of encodings is less than 8, set the rest
    of the bits as clear}
  if (aCount <> 8) then
    for i := aCount to 7 do
      ControlByte := ControlByte shl 1;
  {write out the control byte}
  FBuffer^[FBufInx] := ControlByte;
  inc(FBufInx);
  {write out the encodings, either as run length followed by
    the byte or as the byte itself if the run length were 1}
  for i := 0 to pred(aCount) do begin
    case FRLEData[i].rdLen  of
      0 : begin
            FBuffer^[FBufInx] := 0;
            inc(FBufInx);
          end;
      1 : begin
            FBuffer^[FBufInx] := FRLEData[i].rdVal;
            inc(FBufInx);
          end;
    else {any other value: 2..255}
      FBuffer^[FBufInx] := FRLEData[i].rdLen;
      inc(FBufInx);
      FBuffer^[FBufInx] := FRLEData[i].rdVal;
      inc(FBufInx);
    end;
  end;
  {start over}
  FRLEInx := 0;
end;
procedure TaaBitSet.bsPack;
var
  i      : integer;
  B      : byte;
  PrevB  : byte;
  RunLen : byte;
  Packer : TRLEPacker;
begin
  {allocate a packer object with a buffer big enough for the
    worst case, in which all runs are of length one--that
    is, packing will grow the data by 1 byte for each 8
    unpacked bytes, plus one byte over for the sentinel}
  Packer := TRLEPacker.Create(FBitArraySize +
    ((FBitArraySize + 8) div 8));
  try
    {set it up so previous byte is the first byte and
     current run length is zero: makes loop code easier}
    PrevB := FBitArray^[0];
    RunLen := 0;
    {process the rest of the bytes}
    for i := 0 to pred(FBitArraySize) do begin
      {get the next byte}
      B := FBitArray^[i];
      {if it is different from the previous byte, close off
        the previous run and start a new one}
      if (B <> PrevB) then begin
        Packer.Add(RunLen, PrevB);
        PrevB := B;
        RunLen := 1;
      end
      {otherwise, continue this run}
      else begin
        {if we've already reached 255 bytes in this run
          before adding this byte, close it off and start a
          new one}
        if (RunLen = 255) then begin
          Packer.Add(RunLen, PrevB);
          RunLen := 0;
        end;
        inc(RunLen);
      end;
    end;
    {close off the final run}
    Packer.Add(RunLen, PrevB);
    {mark the packer object as being complete (this adds the
      sentinel and calculates the compressed buffer size)}
    Packer.MarkComplete;
    {reallocate our buffer and copy over the compressed
      data}
    FBitArraySize := Packer.BufferLen;
    ReallocMem(FBitArray, FBitArraySize);
    Packer.CopyBuffer(FBitArray^);
    FPacked := true;
  finally
    Packer.Free;
  end;
end;
procedure TaaBitSet.bsUnpack;
var
  i : integer;
  Buf    : PByteArray;
  RunLen : integer;
  InInx  : integer;
  OutInx : integer;
  Done   : boolean;
  Value  : byte;
  Mask   : byte;
  ControlByte : byte;
begin
  {allocate output buffer large enough to hold all the bits}
  GetMem(Buf, (FBitCount + 7) div 8);
  try
    {initialize for the loop}
    Done := false;
    InInx := 0;
    OutInx := 0;
    {continue unpacking until the end of compressed data is
      found}
    repeat
      {set the mask for the control byte}
      Mask := $80;
      {read the control byte}
      ControlByte := FBitArray^[InInx];
      inc(InInx);
      {repeat until all the bits in the control byte have
        been used}
      while (Mask <> 0) do begin
        {if the control bit says that the next byte is
          literal, copy it over to the output buffer}
        if ((ControlByte and Mask) = 0) then begin
          Buf^[OutInx] := FBitArray^[InInx];
          inc(OutInx);
          inc(InInx);
        end
        {otherwise it's an RLE instruction; get the run
          length and the value to copy and duplicate it
          (note: a runlength of zero indicates the end of
          the compressed data)}
        else begin
          RunLen := FBitArray^[InInx];
          inc(InInx);
          if (RunLen = 0) then begin
            Done := true;
            Break; // out of Mask loop
          end
          else begin
            Value := FBitArray^[InInx];
            inc(InInx);
            for i := 1 to RunLen do begin
              Buf^[OutInx] := Value;
              inc(OutInx);
            end;
          end;
        end;
        {set mask to get the next bit}
        Mask := Mask shr 1;
      end;
    until Done;
    {throw away the old packed buffer, and set it (and other
      fields) for the new unpacked one}
    FreeMem(FBitArray);
    FBitArray := Buf;
    FBitArraySize := (FBitCount + 7) div 8;
    FPacked := false;
  except
    FreeMem(Buf);
    raise;
  end;
end;

Unlock The Secrets

Let's recap a little here. Our word index will consist of a set of items, keyed by word. Each item contains the word and a bitset. Each bit in the bitset will be clear if the corresponding document does not contain the word, and set if it does. There is an array of document names, the index of each one corresponding to a bit in each bitset.

I've so far skated over the word index part. What structure should we use here? It all depends on the number of words we're tracking. If the number is large, say in the tens or hundreds of thousands, we could use a B-tree or the database engine du jour. For a smaller number of words (up to 10,000 or so), a hash table would be ideal. For simplicity's sake in this article (it's getting big enough already!), I decided to use a hash table, one that I'd written already, of course. All that was needed was a way of storing its data to, and loading it from, a stream. To do this I added an Iterate method to the hash table class to step through its entries, and with this I was able to save the entire word index (words and bitsets) to a stream. Reading it back from a stream was easy enough. Listing 4 shows the (rather simple) word index class that I implemented. This class provides a simple wrapper around the hash table, together with facilities to read it from and write it to a stream or file.

➤ Listing 4: A simple word index class.

type
  TaaWordIndex = class
    private
      FTable : TaaHashTableLinear;
      FSaveStream : TStream;
    protected
      procedure wiSaveAction(const S : string;
        aObject : pointer; var aStopNow : boolean);
    public
      constructor Create(aWordCount : integer);
      destructor Destroy; override;
      procedure Add(const aWord : string;
        aBitSet : TaaBitSet);
      function Find(const aWord : string) : TaaBitSet;
      procedure LoadFromStream(aStream : TStream);
      procedure LoadFromFile(const aFileName : string);
      procedure StoreToStream(aStream : TStream);
      procedure StoreToFile(const aFileName : string);
  end;
procedure DestroyTableEntry(const S : string;
  aObject : pointer);
begin
  TaaBitSet(aObject).Free;
end;
constructor TaaWordIndex.Create(aWordCount : integer);
begin
  inherited Create;
  FTable := TaaHashTableLinear.Create(aWordCount,
    AAELFHash);
  FTable.OnDeleteString := DestroyTableEntry;
end;
destructor TaaWordIndex.Destroy;
begin
  FTable.Free;
  inherited Destroy;
end;
procedure TaaWordIndex.Add(const aWord : string;
  aBitSet : TaaBitSet);
begin
  FTable.Insert(aWord, aBitSet);
end;
function TaaWordIndex.Find(const aWord : string) : TaaBitSet;
var
  Obj : pointer;
begin
  if not FTable.Find(aWord, Obj) then
    Result := nil
  else
    Result := TaaBitSet(Obj);
end;
procedure TaaWordIndex.LoadFromStream(aStream : TStream);
var
  i : integer;
  WordCount : integer;
  Len : integer;
  S : string;
  BitSet : TaaBitSet;
begin
  aStream.ReadBuffer(WordCount, sizeof(WordCount));
  S := '';
  BitSet := nil;
  try
    for i := 1 to WordCount do begin
      aStream.ReadBuffer(Len, sizeof(Len));
      SetLength(S, Len);
      aStream.ReadBuffer(S[1], Len);
      BitSet := TaaBitSet.Create(1);
      BitSet.LoadFromStream(aStream);
      Add(S, BitSet);
      S := '';
      BitSet := nil;
    end;
  except
    S := '';
    BitSet.Free;
    raise;
  end;
end;
procedure TaaWordIndex.StoreToStream(aStream : TStream);
var
  WordCount : integer;
begin
  WordCount := FTable.Count;
  aStream.WriteBuffer(WordCount, sizeof(WordCount));
  FSaveStream := aStream;
  FTable.Iterate(wiSaveAction);
  FSaveStream := nil;
end;
procedure TaaWordIndex.wiSaveAction(const S : string; aObject : pointer;
  var aStopNow : boolean);
var
  Len : integer;
begin
  Len := length(S);
  FSaveStream.WriteBuffer(Len, sizeof(Len));
  FSaveStream.WriteBuffer(S[1], Len);
  TaaBitSet(aObject).StoreToStream(FSaveStream);
end;

At which point we can do some real work. The first part of any word indexing is of course building the index. I took a baker's dozen of Shakespeare's plays in text file format (from Project Gutenberg) and created a file with their names. This served as my document list. We can easily read this list into a string list and from this get the document numbers for each text file.

We must create a hash table to contain the stop words. For this we can use a text file containing a suitable list and parse it using the word parser we developed. For each word we encounter we add it to the hash table of stop words.

Once we have a list of documents, we can create a word index object. I created it to have at least 10,000 entries (the hash table is self-growing, but it's best to estimate a count of words to avoid too much reallocation and growth).

The next step is to parse each document in turn and extract the words. For each word, we see if it is in the stop word hash table. If it is, we ignore it. If not, we check to see if it is in the word index. If it is not present in the word index, we create a new bitset, set the bit corresponding to our current document, and then add the word plus its bitset to the word index. If it is present in the word index, we'll get back the word's bitset object and we just set the bit corresponding to our document.
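Put in code, the word-processor for the indexing step might look roughly like the sketch below. TIndexBuilder and its field names are my own, the stop words live in a sorted TStringList rather than the hash table used on the companion disk, and I'm assuming the December 1999 bitset exposes a writable default array property; the real code will differ in detail.

type
  {a sketch (not the companion disk code) of the Action routine
    that builds the word index}
  TIndexBuilder = class
    private
      FDocNum    : integer;      {number of the document being parsed}
      FDocCount  : integer;      {total number of documents}
      FStopWords : TStringList;  {sorted list of stop words}
      FWordIndex : TaaWordIndex; {the index class from Listing 4}
    public
      constructor Create(aDocCount : integer;
        aStopWords : TStringList; aWordIndex : TaaWordIndex);
      procedure WordFound(const aWord : string;
        var aStopNow : boolean);
      property DocNum : integer read FDocNum write FDocNum;
  end;
constructor TIndexBuilder.Create(aDocCount : integer;
  aStopWords : TStringList; aWordIndex : TaaWordIndex);
begin
  inherited Create;
  FDocCount := aDocCount;
  FStopWords := aStopWords;
  FWordIndex := aWordIndex;
end;
procedure TIndexBuilder.WordFound(const aWord : string;
  var aStopNow : boolean);
var
  Dummy  : integer;
  BitSet : TaaBitSet;
begin
  aStopNow := false;
  {ignore stop words}
  if FStopWords.Find(aWord, Dummy) then
    Exit;
  {find the word's bitset, creating it if the word is new}
  BitSet := FWordIndex.Find(aWord);
  if (BitSet = nil) then begin
    BitSet := TaaBitSet.Create(FDocCount);
    FWordIndex.Add(aWord, BitSet);
  end;
  {mark this document as containing the word; assumes the
    bitset's default array property is writable}
  BitSet[FDocNum] := true;
end;

The driver then simply sets DocNum and calls AAParseWords once for each document in the list.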

So Hip It Hurts

Once we've done all this, we have our word index object set up and it's time to use it. Remember way back at the beginning of the article we wrote a search parser? Well, it's now time to put it to use.

The parser generates an RPN expression. To evaluate it, we shall use a standard stack-based algorithm. First we create a stack that holds bitset objects. We now read through the RPN expression. For each word in the expression we look it up in the index. We create a new bitset, assign it the bitset value we get from the word index (if the word wasn't found, we leave the newly created bitset as is), and then we push it onto the stack. For an AND or an OR operator, we pop off the two top bitsets from the stack, perform the operation and push the resulting bitset back onto the stack (the second bitset is freed, we don't need it any more). For a NOT operation, we pop off the top bitset, NOT it, and then push it back. At the end of the RPN expression, we'll have a single bitset at the top of the stack with the answer.

Ah, but what's the answer? Each bit that's set in the bitset indicates that the relevant document satisfies the search. If no bits are set (all are clear) then no document matched the search phrase. After all the preparation, the execution seems a little anticlimactic, don't you think? Listing 5 shows the code that performs the search; this month's disk contains all the source.

➤ Listing 5: A document search.

procedure TfrmSearch.btnSearchClick(Sender: TObject);
var
  SearchPhrase : TaaSearchParser;
  RPNWalker    : TaaRPNNode;
  Stack        : TBitSetStack;
  BS1          : TaaBitSet;
  BS2          : TaaBitSet;
  i            : integer;
begin
  {prepare for the evaluation}
  lbxResults.Items.Clear;
  SearchPhrase := nil;
  Stack := nil;
  BS1 := nil;
  try
    {parse the search phrase}
    SearchPhrase :=
      TaaSearchParser.Create(edtSearchPhrase.Text);
    RPNWalker := SearchPhrase.RPN;
    {create the stack for evaluating the RPN expression}
    Stack := TBitSetStack.Create;
    while (RPNWalker <> nil) do begin
      if RPNWalker is TaaRPNWord then begin
        BS1 := TaaBitSet.Create(DocList.Count);
        BS2 := WordIndex.Find(TaaRPNWord(
          RPNWalker).PhraseWord);
        if (BS2 <> nil) then
          BS1.Assign(BS2);
        Stack.Push(BS1);
      end else if RPNWalker is TaaRPN_AND then begin
        BS1 := Stack.Pop;
        BS2 := Stack.Pop;
        BS1.AndBitSet(BS2);
        Stack.Push(BS1);
        BS2.Free;
      end else if RPNWalker is TaaRPN_OR then begin
        BS1 := Stack.Pop;
        BS2 := Stack.Pop;
        BS1.OrBitSet(BS2);
        Stack.Push(BS1);
        BS2.Free;
      end else {RPNWalker is TaaRPN_NOT} begin
        BS1 := Stack.Pop;
        BS1.NotBitSet;
        Stack.Push(BS1);
      end;
      BS1 := nil;
      RPNWalker := RPNWalker.Next;
    end;
    {display the results}
    BS1 := Stack.Pop;
    for i := 0 to pred(DocList.Count) do begin
      if BS1[i] then
        lbxResults.Items.Add(DocList[i]);
    end;
  finally
    BS1.Free;
    Stack.Free;
    SearchPhrase.Free;
  end;
end;

Stranger Thoughts

Having seen the basics of word searching, we should discuss some of its limitations. The interested reader may want to implement some of these improvements.

The first thing is that the search phrase does not filter out stop words. It's a bit daft if the user searches for a stop word (one that, by its nature, is likely to appear in just about every document) and gets the result that no documents contain it.

The second thing is that the word parser just takes words as they are found in the text. Hence index, indexes, indexing, etc, are all different words and must be searched for individually. For some implementations it makes sense to apply a word stemming algorithm. What this does is to reduce each word to its root form (or stem). Then we can search for index, say, and get references to indexes, indexing, indexed, and so on. In other words, in searching for the singular form we'll match the plural form, or different tenses if the word were a verb. Essentially this is done using a special word list, each line of which consists of the root word followed by its derivations. This list must be read in and parsed and stored in a data structure prior to doing the word indexing.
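As an illustration only (the file format and routine names here are my own assumptions, not a published stemming algorithm), loading such a word list into a lookup table and reducing a word to its stem might look like this:

{load a stem list: each line is assumed to hold a root word followed
  by its derivations, separated by spaces; the result maps every
  derivation (and the root itself) back to the root}
function LoadStemTable(const aFileName : string) : TStringList;
var
  Lines : TStringList;
  Words : TStringList;
  i, j  : integer;
begin
  Result := TStringList.Create;
  Result.Sorted := true;
  Lines := TStringList.Create;
  Words := TStringList.Create;
  try
    Lines.LoadFromFile(aFileName);
    for i := 0 to pred(Lines.Count) do begin
      Words.CommaText := Lines[i]; {splits on spaces as well as commas}
      for j := 0 to pred(Words.Count) do
        Result.Add(Words[j] + '=' + Words[0]);
    end;
  finally
    Words.Free;
    Lines.Free;
  end;
end;
{reduce a word to its stem before indexing or searching}
function StemOf(const aWord : string; aStems : TStringList) : string;
begin
  Result := aStems.Values[aWord];
  if (Result = '') then
    Result := aWord; {no entry: the word is its own stem}
end;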

The next thing is that the bitset needs a change. To find the matches, we currently read through all the bits in the bitset, looking for ones that are set. A much faster method, for thousands of documents, would be to add a method to the bitset class that scanned through the bytes in the internal buffer looking for non-zero values. Once it found one, it could scan the bits in this byte looking for a set bit. This would be much faster than the current method that I show in Listing 5.
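A sketch of such a method is shown below. It assumes the bitset has already been unpacked and uses the FBitArray, FBitArraySize and FBitCount fields visible in Listing 3; the bit ordering (bit 0 of byte 0 being document 0) is my assumption and would have to match whatever the class's existing bit access uses.

{return the index of the first set bit at or after aFrom,
  or -1 if there are no more set bits}
function TaaBitSet.NextSetBit(aFrom : integer) : integer;
var
  ByteInx : integer;
  BitInx  : integer;
begin
  ByteInx := aFrom div 8;
  while (ByteInx < FBitArraySize) do begin
    {skip whole bytes that are zero; only non-zero bytes are
      examined bit by bit}
    if (FBitArray^[ByteInx] <> 0) then
      for BitInx := 0 to 7 do begin
        Result := (ByteInx * 8) + BitInx;
        if (Result >= aFrom) and (Result < FBitCount) and
           ((FBitArray^[ByteInx] and (1 shl BitInx)) <> 0) then
          Exit;
      end;
    inc(ByteInx);
  end;
  Result := -1;
end;

The display loop in Listing 5 would then become a walk of the form i := BS1.NextSetBit(0); while i >= 0 do ... i := BS1.NextSetBit(i + 1).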

The final enhancement I would like to talk about is the one that's the most drastic. At present, the word index we built merely indicates that such-and-such documents have a particular word. An improvement might be to give a list of positions in the document as well. This would be a drastic change, indeed: we could no longer simply use bitsets since each word and document combination would have to have a list of positions (line and column, for example). Probably better is to leave well enough alone and, should the user wish to see the occurrences of the words in the document, we could search the document at the time we display it. Mind you, it's hard to highlight words that match the search phrase NOT (this OR that)!

And with that, we come to the end of a rather involved article. It's been an interesting one in the sense that we've reused an awful lot of previously written code and previously seen algorithms. It only goes to show that, if you have a good grounding in algorithms and data structures, you'll more easily recognize situations where you can use them. And it's even better: I answered Our Esteemed Editor's question. I can breathe easy again and hope he picks Mr Jewell next time!



Julian Bucknall is the author of Tomes of Delphi: Algorithms and Data Structures. You can reach him at julianb@turbopower.com. He also has some very rare alphabets.

The code that accompanies this article is freeware and can be used as-is in your own applications.
© Julian M Bucknall, 2002

The Delphi Magazine is published by iTec, who also provide TDMWeb Web Hosting.
Copyright © iTec 2001. The Delphi Magazine is an independent publication. All trademarks are acknowledged.