This article first appeared in The Delphi Magazine Issue 78 (February 2002)
http://www.thedelphimagazine.com/samples/1374/1374.htm
Ask A Thousand Times

How does Google do it? Well. There are times when you can argue about something and there are times when you just have to accept the
job you’re given. I stared at the screen at the email from Our Esteemed Editor. It started off with ‘Can’t
remember if I’ve ever suggested this before...’ and went on to ask about a particular algorithm. Ha! A ‘suggestion’, no less! Right! I can read between the lines as well as anyone, I didn’t get where I am today,
etc. I dashed off a quick reply to the effect that my grandmother learned that algorithm whilst a teacher at
Malory Towers (or was it The Chalet School?) and then secretly perused some algorithm books, sweat beading
on the old brow. I brought up Google, which only made matters worse. How did they do it? Seriously, OEE had raised a thorny problem, of whose solution I had only the vaguest idea. The question was
that of text searching. How do Google and all the other web search engines enable you to find relevant web
pages using only a couple of words? They can’t just download a billion pages and then search them all at your
request. They must preprocess them and index them somehow. This article is about how they do this.

By Default By Design

Let's get some assumptions and definitions out of the way first. We shall assume that we have numerous
documents in some kind of database that we wish to search. By document, I mean some text as a separate object.
It could be an HTML page, a newsgroup posting, a Word document, or whatever. By database, I don’t
necessarily mean an actual database, like Oracle or SQL Server, but some kind of persistent ‘container’ holding
the documents. This could be something as simple as a folder on your machine or it could actually be a
database, with the documents stored as BLOBs. The text is assumed to be an ASCII stream: in other words, to
consist of words strung together with spaces and the usual punctuation. I don’t want to get into the problems of
indexing Word or PDF documents, thank you very much. When we search, we want to be able to specify a series of words separated by the reserved words AND, OR,
and NOT. These will be our logical operators. In other words, to search the documents for those with both ‘text’
and ‘indexing’ we would specify text AND indexing as our search criterion. To find the documents that
contain either the word text or indexing (or both) we would use text OR indexing. If we specify two or more
words without any ANDs or ORs, the search engine will assume that the words are separated by AND: hence
text indexing as a search phrase will be read as text AND indexing. NOT is obviously the logical negation
operation. We’ll assume that NOT has the highest precedence, that AND and OR have the same precedence,
and that they're evaluated from left to right. So NOT text AND indexing is read as (NOT text) AND (indexing).

From these definitions, I'm sure that you can see that this month's article will be split up into several subtopics.
First, we need to parse each document into separate words. Second, we need to build the word index using
these words. Third, we need to be able to parse the search phrase to understand it. Last of all, we need to apply
the parsed search phrase to the word index to be able to find the documents we want. A tall order, but we’ll
manage without taking over the entire magazine. I think.

A To Z

Let's take things from easy to hard. The third operation is possibly the easiest to do, so let's start there. Whenever we write a parser, it's generally easier to start by writing down the grammar we're going to be
parsing in BNF. (For details on BNF, see Algorithms Alfresco for January 2001.) Table 1 shows the BNF for
our particular search phrase grammar. (The ::= defines a production and means ‘is defined as’ and the | denotes
an alternative. The first production therefore states that an expr is either a factor, or is a factor immediately
followed by another expr, or is a factor followed by AND followed by another expr, or is a factor followed by
OR followed by another expr. The other productions are read in a similar manner.)

Table 1: BNF for the search grammar.

<expr>   ::= <factor> | <factor> <expr> | <factor> AND <expr> | <factor> OR <expr>
<factor> ::= <term> | NOT <factor>
<term>   ::= <word> | ( <expr> )
<word>   ::= a series of characters delimited by white space
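As a quick cross-check of these productions (and jumping ahead a little to the RPN output discussed below), here is a throwaway sketch, in Python rather than the article's Delphi, of a recursive descent parser read straight off the grammar, one function per production:

```python
# A toy recursive-descent parser for the search grammar above,
# producing the RPN (postfix) form discussed later in the article.

def tokenize(phrase):
    # parentheses act as word delimiters as well as white space
    return phrase.lower().replace('(', ' ( ').replace(')', ' ) ').split()

def parse(phrase):
    tokens = tokenize(phrase)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        pos += 1

    def expr():
        # <expr> ::= <factor> | <factor> <expr>
        #          | <factor> AND <expr> | <factor> OR <expr>
        out = factor()
        if peek() in ('and', 'or'):
            op = peek()
            advance()
            out += expr() + [op.upper()]
        elif peek() not in (None, ')'):
            out += expr() + ['AND']   # implicit AND between bare words
        return out

    def factor():
        # <factor> ::= <term> | NOT <factor>
        if peek() == 'not':
            advance()
            return factor() + ['NOT']
        return term()

    def term():
        # <term> ::= <word> | ( <expr> )
        if peek() == '(':
            advance()
            out = expr()
            if peek() != ')':
                raise ValueError('missing close parenthesis')
            advance()
            return out
        if peek() in (None, 'and', 'or', 'not'):
            raise ValueError('missing or invalid search word')
        word = peek()
        advance()
        return [word]

    return expr()

print(parse('NOT text AND indexing'))   # ['text', 'NOT', 'indexing', 'AND']
print(parse('text indexing'))           # ['text', 'indexing', 'AND']
```

This is only a sketch to show the shape of the technique; the real parser, with proper tokenizing and error reporting, follows in Listing 1.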
Having written the BNF, it is a fairly simple matter to convert it to code. We write a parser class, and for each
production, we create a method called ParseXxx where the Xxx is the name of the object on the left-hand side
of the production. The code in each of these methods implements the right-hand side of the production. This is
known as recursive descent parsing. Let’s take the easy one first: ParseFactor. The production states that a factor is a term or is the word NOT
followed by another factor. So ParseFactor should start off with a check for the current word to be NOT. If it is
, we should recursively call ParseFactor again. If it isn't, we should call ParseTerm. The other methods can be
written in a similar manner. However, before we do so, we have to consider something else. It’s all very well to be able to parse the search
phrase, but we still need to store the results of our parsing in a data structure so that we can apply it to a word
index. We’ll store the parsed phrase in a Reverse Polish Notation (RPN) string where the operator occurs after
its operands. The query word1 AND word2 would therefore be stored as word1 word2 AND; NOT (word1 OR
word2) would be stored as word1 word2 OR NOT, and so on. (For more information on RPN expressions, see
Algorithms Alfresco from June 1999.) You'll see why I chose this layout when it's time to apply the parsed
string. Listing 1 shows the TaaSearchParser class as well as the classes that go to make up the RPN string. (If you
follow the code, you’ll see that the RPN string gets stored as a linked list of TaaRPNNodes and not as a real
‘string’.) If you look at the method that parses a factor, spParseFactor, you’ll see that it first checks if the
current word is NOT. If it isn’t, the method returns the result of calling the spParseTerm method. If it is, the
method calls spParseFactor again. To the result of this call, a new NOT node is created and appended. The
Append method of the TaaRPNNode class walks the linked list and adds the node to the final item in the list.
Other parsing methods follow the same scheme.

Listing 1: Parsing a search phrase.

type
// a base node that helps form an RPN string
TaaRPNNode = class
private
FNext : TaaRPNNode;
public
destructor Destroy; override;
procedure Append(aNode : TaaRPNNode);
property Next : TaaRPNNode read FNext;
end;
TaaRPNWord = class(TaaRPNNode) // an RPN node for a word
private
FWord : string;
public
constructor Create(const aPhraseWord : string);
property PhraseWord : string read FWord;
end;
// an RPN node for the AND operator
TaaRPN_AND = class(TaaRPNNode);
// an RPN node for the OR operator
TaaRPN_OR = class(TaaRPNNode);
// an RPN node for the NOT operator
TaaRPN_NOT = class(TaaRPNNode);
TaaSearchParser = class // a parser for search phrases
private
FCurWord : string;
FPhrase : string;
FPosn : integer;
FRPN : TaaRPNNode;
protected
function spGetRPN : TaaRPNNode;
procedure spSetPhrase(const aPhrase : string);
function spParseExpr : TaaRPNNode;
function spParseFactor : TaaRPNNode;
function spParseTerm : TaaRPNNode;
procedure spParsePhrase;
procedure spGetNextWord;
public
constructor Create(const aPhrase : string);
destructor Destroy; override;
property Phrase : string read FPhrase
write spSetPhrase;
property RPN : TaaRPNNode read spGetRPN;
end;
destructor TaaRPNNode.Destroy;
begin
Next.Free; // recursively destroys linked list of nodes
inherited Destroy;
end;
procedure TaaRPNNode.Append(aNode : TaaRPNNode);
var
Walker : TaaRPNNode;
begin
Walker := Self;
while (Walker.Next <> nil) do
Walker := Walker.Next;
Walker.FNext := aNode;
end;
constructor TaaRPNWord.Create(const aPhraseWord : string);
begin
inherited Create;
FWord := aPhraseWord;
end;
constructor TaaSearchParser.Create(const aPhrase : string);
begin
inherited Create;
Phrase := aPhrase;
end;
destructor TaaSearchParser.Destroy;
begin
FRPN.Free;
inherited Destroy;
end;
procedure TaaSearchParser.spGetNextWord;
var
Walker : PAnsiChar;
WordStart : PAnsiChar;
begin
{advance past the current word}
inc(FPosn, length(FCurWord));
FCurWord := '';
{skip past the white space}
Walker := @FPhrase[FPosn];
while (Walker^ = ' ') do begin
inc(FPosn);
inc(Walker);
end;
{check for the parentheses first; if found, copy them to
the internal field}
if (Walker^ = '(') then
FCurWord := '('
else if (Walker^ = ')') then
FCurWord := ')'
{otherwise search for the end of the current word (it'll
end with the end of the string, with white space, or
with parentheses); copy the word to the internal field}
else begin
WordStart := Walker;
while (Walker^ <> #0) and (Walker^ <> ' ') and
(Walker^ <> '(') and (Walker^ <> ')') do
inc(Walker);
FCurWord := Copy(FPhrase, FPosn, Walker - WordStart);
end;
end;
function TaaSearchParser.spGetRPN : TaaRPNNode;
begin
if (FRPN = nil) then
spParsePhrase;
Result := FRPN;
end;
function TaaSearchParser.spParseExpr : TaaRPNNode;
begin
Result := spParseFactor;
spGetNextWord;
if (FCurWord = 'and') then begin
spGetNextWord;
Result.Append(spParseExpr);
Result.Append(TaaRPN_AND.Create);
end else if (FCurWord = 'or') then begin
spGetNextWord;
Result.Append(spParseExpr);
Result.Append(TaaRPN_OR.Create);
end else if (FCurWord <> '') and (FCurWord <> ')') then
begin
Result.Append(spParseExpr);
Result.Append(TaaRPN_AND.Create);
end;
end;
function TaaSearchParser.spParseFactor : TaaRPNNode;
begin
if (FCurWord <> 'not') then
Result := spParseTerm
else begin
spGetNextWord;
Result := spParseFactor;
Result.Append(TaaRPN_NOT.Create);
end;
end;
procedure TaaSearchParser.spParsePhrase;
begin
if (FPhrase <> '') then begin
FPosn := 1;
spGetNextWord;
if (FCurWord <> '') then
FRPN := spParseExpr;
end;
end;
function TaaSearchParser.spParseTerm : TaaRPNNode;
begin
if (FCurWord = '(') then begin
spGetNextWord;
Result := spParseExpr;
if (FCurWord <> ')') then
raise Exception.Create(
'TaaSearchParser: missing close parenthesis in phrase');
end
else begin
if (FCurWord = '') then
raise Exception.Create(
'TaaSearchParser: missing final search word');
if (FCurWord = 'and') or (FCurWord = 'or') or
(FCurWord = 'not') then
raise Exception.Create(
'TaaSearchParser: operator used as search word');
Result := TaaRPNWord.Create(FCurWord);
end;
end;
procedure TaaSearchParser.spSetPhrase(const aPhrase : string);
begin
FPhrase := LowerCase(aPhrase);
FRPN.Free;
FRPN := nil;
end;

Errors in the search phrase are all caught by the spParseTerm method (it is at the bottom of the BNF grammar,
after all). Errors include: a missing right parenthesis; an operator (AND, OR, or NOT) being used as a search
word (example: “this AND and”); and a missing word after the phrase is completely read (example: “this
AND”). The most interesting method is perhaps spGetNextWord. It is this method that scans the search phrase for the
individual words. The search phrase is assumed to be pretty simple: words are separated by white space.
There’s a small extra wrinkle, however: we must take account of parentheses (otherwise “NOT (this AND that)”
would be parsed as “NOT”, “(this”, “AND”, and “that)”). So we must use parentheses as word delimiters as
well as white space.

This Must Be Magic

Well, that was the easy bit. Now onto the real meat: generating the index. The first step is parsing out all the
words from the text. In the search phrase parser, we had to read the phrase word by word. It should come as no surprise that we have
to do the same for every one of our documents; however, the definition of what constitutes a word becomes
more complex. Indeed, my research on Google led into all sorts of blind alleys and to some remarkably nasty
looking code. Let’s encapsulate my discoveries. There’s the school of thought that says ‘the space is the delimiter’. In other
words, to find the words, we drop an anchor and parse through the text character by character until a space is
reached. The text from the anchor to the space is a word. Continue like this (drop, parse, cut) until you have
parsed all the text and found all the words. This simplistic algorithm is a little too simple for our needs. Punctuation is not stripped, and ‘hangs’ on to the
beginnings or the ends of words. For instance, in that last sentence, we’d have [stripped,] [‘hangs’] and
[words.] as some of the ‘words’ in the sentence. Another problem is that if we follow the algorithm strictly we’d
get some false positives when there are two or more spaces in a row. Hence, in general, there’s some scanning
through spaces after finding a word. The next school of thought is the ‘words just consist of letters and digits’ brigade. Here we have two ‘modes’:
scanning an alphanumeric word or scanning all the punctuation or spaces in between (that is, the
non-alphanumeric characters). It’s a simple enough state machine that a non-beginner programmer could write
in 15 minutes. However, in that last sentence, it would convert [It’s] into two words [It] and [s] and
[non-beginner] into [non] and [beginner]: perhaps not what we intended. What we will do instead is to follow a middle path. We’ll make the delimiting characters a parameter passed to
the routine (a string would be best). The routine would then be coded as a state machine, as in the second example, but this time it would determine state switches based on whether the current character is, or isn't, in the
delimiter parameter. This is all very well, but what do we want to do with the words that we discover? Before I answer that question,
let's take a detour.

Welcome To The Real World

What I've done so far is describe a particular algorithm with certain well-defined features: it takes a bunch of
text and extracts the words, the words being delimited by a set of delimiter characters. I’m now just about to
go into details about how I want to use those words. Surely, it makes sense to implement the word parser now,
before we go on to the next step? I should put my programmer's hat on firmly and code up the parser in a
generic way. Once I’ve done that, I can then think about how I want to use the words that the parser is going to
spit out at me. In other words, we should ideally decouple the code that processes the words from the routine that discovers
them. Decoupling (or separating) code is a Good Thing in programming: it enables better reuse of code that
you’ve already written. Let’s continue with this argument. By doing this, writing a generic implementation of a word parser, I can use it
in other situations: implementing a word count routine, finding the distinct words used in the text, or
performing some analysis of the writer’s vocabulary. By not doing it, I end up with a word parser that can
only be used for my particular need today: tomorrow be damned. Of course, there is another very important reason: by writing decoupled code, you get to test the various
algorithms separately and thoroughly. If you had written tightly coupled code, you’d run into the problem of not
knowing where a bug might be. (Is it in the parser? Or is it in the code that processes each word? Maybe in the
coupling?) So, by separating out the ‘word-processor’ part from the ‘word-parser’, I could easily write the parser and then,
for testing, write a simple ‘word-processor’ routine that wrote the words as separate lines in a text file, for
example. Later on, I could write the real routine to process the words, secure in the knowledge that the word
parsing routine works and has been thoroughly tested. So, let’s do that. I’ll make the word parsing routine take in a stream containing the text to be parsed. It’ll also,
as I’ve said, take in a string containing the word delimiter characters. There are a couple of choices for how to
present the words parsed out of the text. First, put them all into a container (a string list or a text file, or similar)
to be processed en masse. Or, second, fire off an event for each word and let the word-processor decide what to
do with them (whether to save them, whether to count them, whether to process them all individually when they
arrive, and so on). To me the latter choice makes the most sense: it uses less memory and it defers any decision
about saving the words to the code making the call. So, the final parameter is an Action routine. For every word discovered, this routine will be called to process the
word. Our initial Action routine will just write the words out to a text file, but, later on, we’ll be writing the
heavy-duty routine that helps us in our quest. Normally, I would write the Action routine as a simple procedure.
However, I’ve been brought to task in the past for this simplistic eighties type of design and so I shall make it a
method of an object instead. Phew! After all that, Listing 2 shows the parser. If you look at it, you will see that I’ve made a designer’s
decision that any character less than or equal to a space is also a delimiter (characters like carriage return, line
feed, tab, and space: the so-called non-visible characters). The delimiter string therefore just needs to hold
visible punctuation.

Listing 2: Parsing a stream of text into words.

function IsDelimiter(aCh : char; const aDelims : string) : boolean;
// returns true if aCh is a delimiter in aDelims or is <= space
var
i : integer;
begin
Result := true;
if (aCh > ' ') then begin
for i := 1 to length(aDelims) do
if (aDelims[i] = aCh) then
Exit;
Result := false;
end;
end;
procedure AAParseWords(aText : TStream;
const aDelims : string; aAction : TaaWordParseAction);
const
BufSize = 16 * 1024;
var
State : (ScanWord, ScanOther);
Buffer : PChar;
BufPos : PChar;
BufLeft : integer;
Ch : char;
StopNow : boolean;
CharQ : TaaCharQueue;
begin
// prepare for the memory allocations
Buffer := nil;
CharQ := nil;
try
// create a character queue with which to build words
CharQ := TaaCharQueue.Create;
// allocate a buffer to hold data from the stream
GetMem(Buffer, BufSize);
// force the stream to be read first time through
BufLeft := 1;
BufPos := Buffer;
// we'll start in the scanning delimiter state
State := ScanOther;
StopNow := false;
// continue until there is no more data in the stream...
while (not StopNow) and (BufLeft <> 0) do begin
// advance the buffer variables (reading more data if
// needed)
dec(BufLeft);
if (BufLeft <> 0) then
inc(BufPos)
else begin
BufLeft := aText.Read(Buffer^, BufSize);
BufPos := Buffer;
// if the stream is exhausted, stop: there is
// no fresh character left to process
if (BufLeft = 0) then
Break;
end;
// get the next character
Ch := BufPos^;
// process the character according to the state
case State of
ScanWord :
begin
if IsDelimiter(Ch, aDelims) then begin
aAction(CharQ.AsString, StopNow);
State := ScanOther;
end
else
CharQ.Append(Ch);
end;
ScanOther :
begin
if not IsDelimiter(Ch, aDelims) then begin
CharQ.Clear;
CharQ.Append(Ch);
State := ScanWord;
end;
end;
end;{case}
end;
// flush any word still in progress when the stream ends
if (not StopNow) and (State = ScanWord) then
aAction(CharQ.AsString, StopNow);
finally
if (Buffer <> nil) then
FreeMem(Buffer);
CharQ.Free;
end;
end;

Skyscraping

What next? Well, we have to decide what to do with the words we parse out of the text. Remember that our goal
is to create a word index that we can search against. So it certainly makes sense that we have this huge structure
containing items keyed by word. What must each item contain? The word for a start: it’s the key into the index.
And then it will contain a list of documents that contain the word. I’m sure you can see that, in a loose sense
anyhow, given a word, we can easily find the documents that contain that word. This structure is known as an
inverted file. Of course, there is a group of words that we really shouldn’t bother to index since every document can be
expected to contain them, or, more precisely, so many documents will contain them it just doesn’t make sense
to index them. In English text, these include a, the, of, you, and, or, and so on. These words are known as stop
words, since when we get one we might as well stop processing it. There is no single list of stop words, I’m
afraid; it’s more of a subjective thing. For our purposes, we’ll use a simple list I found on the internet. At first blush, when I say huge for the inverted file, it seems I really mean gigantic. Suppose the documents
were URLs for a web search engine. Then, for example, the word text would have a list of possibly millions of
URLs. Each item in the index could then be tens (hundreds) of megabytes in size and there would be
thousands of them (possibly millions). I don’t think so. The first obvious thing we notice to reduce the space requirements is that the list of documents for each word
doesn’t have to contain the actual names of the documents. We could have an array of document names
somewhere and then for each item in the index we would have a list of integer values, each value being the
number of the document in the external array (a simple lookup, in other words, to find the document name for a
given value). That’s still pretty steep memory- wise for each item in the index, but manageable. However, it doesn’t make the
next stage particularly easy to do: applying a search phrase to the index. For a search phrase consisting of a
single word, it would be pretty easy to get a list of documents containing the word. But remember that the
search phrase can contain the logical operators: AND, OR, and NOT. How would we implement these? Suppose that we had the search phrase text AND indexing. We can look up both words and get the two lists.
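To see what such a combination involves, here is a miniature sketch, in Python rather than Delphi and with invented document numbers, of the AND and OR merges on sorted lists as described next:

```python
def and_merge(a, b):
    # intersection of two sorted lists of document numbers:
    # keep only the numbers that appear in both lists
    result, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1              # discard the smaller item
        else:
            j += 1
    return result

def or_merge(a, b):
    # union: every distinct document number from either list
    result, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i]); i += 1; j += 1
        elif a[i] < b[j]:
            result.append(a[i]); i += 1
        else:
            result.append(b[j]); j += 1
    return result + a[i:] + b[j:]   # one list may run out first

# hypothetical document lists for 'text' and 'indexing'
text_docs     = [1, 3, 4, 7, 9]
indexing_docs = [2, 3, 7, 8]
print(and_merge(text_docs, indexing_docs))  # [3, 7]
print(or_merge(text_docs, indexing_docs))   # [1, 2, 3, 4, 7, 8, 9]
```

Note how both merges depend on the lists being kept in ascending order, which is exactly the bookkeeping burden discussed below.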
Now, to perform an efficient AND on these two lists of integers would require the lists to be sorted (note to self:
must maintain the document lists sorted). Imagine that each list stands vertically, side by side. Look at the two
top items. If they are different, remove the smallest item and throw it away, and then go back to look at the two
top items again. If they are the same, remove both items from both lists and store the number in a result list, and
then go back to look at the two top items again. I’m sure that you can see that the resulting list would only
contain the numbers that originally appeared in both lists; in other words, we’ve done an AND operation. For OR, we would do the same, except that every distinct number gets put in the result list. For NOT, it’s a little
more complicated, but I’m sure you get the idea. (I’ve deliberately glossed over the fact that one of the two lists
could be empty during all this: an easy exercise for the reader.) However, it’s all a bit complicated. We must make sure the lists are maintained in ascending order. We must
code up the merge-type operation for each logical operator, and make sure that we cater for the
running-out-of-list problem. If we’re not careful we’d run into efficiency problems should the output list not
be large enough and has to be grown. Etc, etc. Oh, if only we could use some nice logical arithmetic instead of all this messing around with merging integer
lists. Well, actually, we can. Back in December 1999, I presented a bitset class, an array of Boolean values, all
either on or off, true or false. Methods were given to perform AND, OR and NOT operations on bitset objects.
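For illustration (a Python sketch, with unbounded integers standing in for the Delphi bitset class from that December 1999 column), the three operations behave like this:

```python
# Bitset logical arithmetic: each bit position is one Boolean
# value in the set. Python ints stand in for the bitset class.
NUM_BITS = 8
ALL_ONES = (1 << NUM_BITS) - 1      # mask needed to bound NOT

a = 0b00101101                      # bits 0, 2, 3, 5 set
b = 0b00100110                      # bits 1, 2, 5 set

def bits_set(x):
    # list which bit positions are true
    return [n for n in range(NUM_BITS) if x & (1 << n)]

print(bits_set(a & b))              # AND -> [2, 5]
print(bits_set(a | b))              # OR  -> [0, 1, 2, 3, 5]
print(bits_set(~a & ALL_ONES))      # NOT -> [1, 4, 6, 7]
```

Each operation is a single pass of machine-level bitwise arithmetic: no merging, no sorting, no growing output lists.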
What if we had a bitset for each word in our index, instead of a list of integers? Each bit would indicate a
particular document; if the nth bit were set, the nth document contained that word, if clear it would not. The
operations we'd perform in our search would then be simple indeed: logical arithmetic on bitsets.

All That Matters

Consider this idea in more detail, though. Each bitset corresponds to a particular word. For unusual words the
majority of the bits in the bitset are likely to be off, indicating that most documents will not contain the word.
Since the bits are gathered eight to a byte in this bitset class, that means that the majority of bytes in the
internal buffer are zero. And that means, in general, that the bitsets are highly compressible, most easily by
run-length encoding (RLE). RLE is a simple compression scheme where essentially runs of equal bytes are replaced by a length value byte
(or perhaps a word) followed by the byte itself (known as the literal byte). The fun part comes when the run
length is 1: how do you encode this? A length byte of 1 followed by the byte itself? That would mean that a
buffer of bytes with no runs greater than 1 would double in size when we compress it. We can’t just output the
literal byte and hope for the best: there is no way to know whether a byte is a literal byte or the length byte for
an RLE pair. Unfortunately, there is no standard method of solving this conundrum. I decided to use a single bit, a state bit, to indicate whether the next byte was literal or was a length byte for the
byte value that immediately followed. On its face, however, this is nasty: outputting single bits is quite simply a
pain in the neck. I didn’t want to output a Boolean either: too big. In the end, to avoid having to write individual
bits to the output buffer, I gathered the state bits eight at a time and output them as a byte before the data that
corresponds to the eight state bits. If you remember that far back, this is like the scheme I used in the LZ77
compression from the May 1999 issue. In our worst-case scenario (no run length is greater than one), this RLE
compression algorithm would grow the data by about an eighth in size. So, I altered the bitset class to store and load its data from a stream using RLE compression. To avoid spending
too much time decompressing the data, I changed the bitset class to defer unpacking the bit data until it was
actually needed. Listing 3 shows the bitset's RLE compression and decompression.

Listing 3: Bitset RLE (de)compression.

type
TRLEData = packed record
rdLen : byte;
rdVal : byte;
end;
TRLEPacker = class
private
FRLEData : array [0..7] of TRLEData;
FRLEInx : integer;
FBuffer : PByteArray;
FBufInx : integer;
protected
procedure rpWriteEncoding(aCount : integer);
public
constructor Create(aBufSize : integer);
destructor Destroy; override;
procedure Add(aLen : byte; aValue : byte);
procedure MarkComplete;
procedure CopyBuffer(var aDest);
property BufferLen : integer read FBufInx;
end;
constructor TRLEPacker.Create(aBufSize : integer);
begin
inherited Create;
GetMem(FBuffer, aBufSize);
end;
destructor TRLEPacker.Destroy;
begin
if (FBuffer <> nil) then
FreeMem(FBuffer);
inherited Destroy;
end;
procedure TRLEPacker.Add(aLen : byte; aValue : byte);
begin
FRLEData[FRLEInx].rdLen := aLen;
FRLEData[FRLEInx].rdVal := aValue;
inc(FRLEInx);
if (FRLEInx = 8) then
rpWriteEncoding(8);
end;
procedure TRLEPacker.CopyBuffer(var aDest);
begin
Move(FBuffer^, aDest, FBufInx);
end;
procedure TRLEPacker.MarkComplete;
begin
{add the sentinel to indicate end-of-compressed-data (a
code for a length of zero does this)}
Add(0, 0);
{write out any final encoding}
if (FRLEInx <> 0) then
rpWriteEncoding(FRLEInx);
end;
procedure TRLEPacker.rpWriteEncoding(aCount : integer);
var
i : integer;
ControlByte : byte;
begin
{initialize the control byte}
ControlByte := 0;
{for all the encodings, set the relevant bit of the
control byte if a run, leave it clear otherwise (note:
the end-of-data sentinel has a length of zero and this
code treats it as an actual length)}
for i := 0 to pred(aCount) do begin
ControlByte := ControlByte shl 1;
if (FRLEData[i].rdLen <> 1) then
inc(ControlByte);
end;
{if the number of encodings is less than 8, set the rest
of the bits as clear}
if (aCount <> 8) then
for i := aCount to 7 do
ControlByte := ControlByte shl 1;
{write out the control byte}
FBuffer^[FBufInx] := ControlByte;
inc(FBufInx);
{write out the encodings, either as run length followed by
the byte or as the byte itself if the run length were 1}
for i := 0 to pred(aCount) do begin
case FRLEData[i].rdLen of
0 : begin
FBuffer^[FBufInx] := 0;
inc(FBufInx);
end;
1 : begin
FBuffer^[FBufInx] := FRLEData[i].rdVal;
inc(FBufInx);
end;
else {any other value: 2..255}
FBuffer^[FBufInx] := FRLEData[i].rdLen;
inc(FBufInx);
FBuffer^[FBufInx] := FRLEData[i].rdVal;
inc(FBufInx);
end;
end;
{start over}
FRLEInx := 0;
end;
procedure TaaBitSet.bsPack;
var
i : integer;
B : byte;
PrevB : byte;
RunLen : byte;
Packer : TRLEPacker;
begin
{allocate a packer object with a buffer big enough for the
worst case, in which all runs are of length one--that
is, packing will grow the data by 1 byte for each 8
unpacked bytes, plus one byte over for the sentinel}
Packer := TRLEPacker.Create(FBitArraySize +
((FBitArraySize + 8) div 8));
try
{set it up so previous byte is the first byte and
current run length is zero: makes loop code easier}
PrevB := FBitArray^[0];
RunLen := 0;
{process all the bytes (the zero run length above makes
the first byte safe to revisit)}
for i := 0 to pred(FBitArraySize) do begin
{get the next byte}
B := FBitArray^[i];
{if it is different from the previous byte, close off
the previous run and start a new one}
if (B <> PrevB) then begin
Packer.Add(RunLen, PrevB);
PrevB := B;
RunLen := 1;
end
{otherwise, continue this run}
else begin
{if we've already reached 255 bytes in this run
before adding this byte, close it off and start a
new one}
if (RunLen = 255) then begin
Packer.Add(RunLen, PrevB);
RunLen := 0;
end;
inc(RunLen);
end;
end;
{close off the final run}
Packer.Add(RunLen, PrevB);
{mark the packer object as being complete (this adds the
sentinel and calculates the compressed buffer size)}
Packer.MarkComplete;
{reallocate our buffer and copy over the compressed
data}
FBitArraySize := Packer.BufferLen;
ReallocMem(FBitArray, FBitArraySize);