This article first appeared in The Delphi Magazine Issue 78 (February 2002)
http://www.thedelphimagazine.com/samples/1374/1374.htm
Ask A Thousand Times

How does Google do it? Well. There are times when you can argue about something and there are times when you just have to accept the
job you’re given. I stared at the screen at the email from Our Esteemed Editor. It started off with ‘Can’t
remember if I’ve ever suggested this before...’ and went on to ask about a particular algorithm. Ha! A ‘suggestion’, no less! Right! I can read between the lines as well as anyone, I didn’t get where I am today,
etc. I dashed off a quick reply to the effect that my grandmother learned that algorithm whilst a teacher at
Malory Towers (or was it The Chalet School?) and then secretly perused some algorithm books, sweat beading
on the old brow. I brought up Google, which only made matters worse. How did they do it? Seriously, OEE had raised a thorny problem, of whose solution I had only the vaguest idea. The question was
that of text searching. How do Google and all the other web search engines enable you to find relevant web
pages using only a couple of words? They can’t just download a billion pages and then search them all at your
request. They must preprocess them and index them somehow. This article is about how they do this.

By Default By Design

Let's get some assumptions and definitions out of the way first. We shall assume that we have numerous
documents in some kind of database that we wish to search. By document, I mean some text as a separate object.
It could be an HTML page, a newsgroup posting, a Word document, or whatever. By database, I don’t
necessarily mean an actual database, like Oracle or SQL Server, but some kind of persistent ‘container’ holding
the documents. This could be something as simple as a folder on your machine or it could actually be a
database, with the documents stored as BLOBs. The text is assumed to be an ASCII stream: in other words, to
consist of words strung together with spaces and the usual punctuation. I don’t want to get into the problems of
indexing Word or PDF documents, thank you very much. When we search, we want to be able to specify a series of words separated by the reserved words AND, OR,
and NOT. These will be our logical operators. In other words, to search the documents for those with both ‘text’
and ‘indexing’ we would specify text AND indexing as our search criterion. To find the documents that
contain either the word text or indexing (or both) we would use text OR indexing. If we specify two or more
words without any ANDs or ORs, the search engine will assume that the words are separated by AND: hence
text indexing as a search phrase will be read as text AND indexing. NOT is obviously the logical negation
operation. We’ll assume that NOT has the highest precedence, that AND and OR have the same precedence,
and that they're evaluated from left to right. So NOT text AND indexing is read as (NOT text) AND (indexing).

From these definitions, I'm sure that you can see that this month's article will be split up into several subtopics.
First, we need to parse each document into separate words. Second, we need to build the word index using
these words. Third, we need to be able to parse the search phrase to understand it. Last of all, we need to apply
the parsed search phrase to the word index to be able to find the documents we want. A tall order, but we’ll
manage without taking over the entire magazine. I think.

A To Z

Let's take things from easy to hard. The third operation is possibly the easiest to do, so let's start there. Whenever we write a parser, it's generally easier to start by writing down the grammar we're going to be
parsing in BNF. (For details on BNF, see Algorithms Alfresco for January 2001.) Table 1 shows the BNF for
our particular search phrase grammar. (The ::= defines a production and means ‘is defined as’ and the | denotes
an alternative. The first production therefore states that an expr is either a factor, or is a factor immediately
followed by another expr, or is a factor followed by AND followed by another expr, or is a factor followed by
OR followed by another expr. The other productions are read in a similar manner.)

Table 1: BNF for the search grammar.

<expr>   ::= <factor> | <factor> <expr> | <factor> AND <expr> | <factor> OR <expr>
<factor> ::= <term> | NOT <factor>
<term>   ::= <word> | ( <expr> )
<word>   ::= a series of characters delimited by white space
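As a quick cross-check of these productions (and jumping ahead a little to the RPN output discussed below), here is a throwaway sketch, in Python rather than the article's Delphi, of a recursive descent parser read straight off the grammar, one function per production:

```python
# A toy recursive-descent parser for the search grammar above,
# producing the RPN (postfix) form discussed later in the article.

def tokenize(phrase):
    # parentheses act as word delimiters as well as white space
    return phrase.lower().replace('(', ' ( ').replace(')', ' ) ').split()

def parse(phrase):
    tokens = tokenize(phrase)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        pos += 1

    def expr():
        # <expr> ::= <factor> | <factor> <expr>
        #          | <factor> AND <expr> | <factor> OR <expr>
        out = factor()
        if peek() in ('and', 'or'):
            op = peek()
            advance()
            out += expr() + [op.upper()]
        elif peek() not in (None, ')'):
            out += expr() + ['AND']   # implicit AND between bare words
        return out

    def factor():
        # <factor> ::= <term> | NOT <factor>
        if peek() == 'not':
            advance()
            return factor() + ['NOT']
        return term()

    def term():
        # <term> ::= <word> | ( <expr> )
        if peek() == '(':
            advance()
            out = expr()
            if peek() != ')':
                raise ValueError('missing close parenthesis')
            advance()
            return out
        if peek() in (None, 'and', 'or', 'not'):
            raise ValueError('missing or invalid search word')
        word = peek()
        advance()
        return [word]

    return expr()

print(parse('NOT text AND indexing'))   # ['text', 'NOT', 'indexing', 'AND']
print(parse('text indexing'))           # ['text', 'indexing', 'AND']
```

This is only a sketch to show the shape of the technique; the real parser, with proper tokenizing and error reporting, follows in Listing 1.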
Having written the BNF, it is a fairly simple matter to convert it to code. We write a parser class, and for each
production, we create a method called ParseXxx where the Xxx is the name of the object on the left-hand side
of the production. The code in each of these methods implements the right-hand side of the production. This is
known as recursive descent parsing. Let’s take the easy one first: ParseFactor. The production states that a factor is a term or is the word NOT
followed by another factor. So ParseFactor should start off with a check for the current word to be NOT. If it is
, we should recursively call ParseFactor again. If it isn't, we should call ParseTerm. The other methods can be
written in a similar manner. However, before we do so, we have to consider something else. It’s all very well to be able to parse the search
phrase, but we still need to store the results of our parsing in a data structure so that we can apply it to a word
index. We’ll store the parsed phrase in a Reverse Polish Notation (RPN) string where the operator occurs after
its operands. The query word1 AND word2 would therefore be stored as word1 word2 AND; NOT (word1 OR
word2) would be stored as word1 word2 OR NOT, and so on. (For more information on RPN expressions, see
Algorithms Alfresco from June 1999.) You'll see why I chose this layout when it's time to apply the parsed
string. Listing 1 shows the TaaSearchParser class as well as the classes that go to make up the RPN string. (If you
follow the code, you’ll see that the RPN string gets stored as a linked list of TaaRPNNodes and not as a real
‘string’.) If you look at the method that parses a factor, spParseFactor, you’ll see that it first checks if the
current word is NOT. If it isn’t, the method returns the result of calling the spParseTerm method. If it is, the
method calls spParseFactor again. To the result of this call, a new NOT node is created and appended. The
Append method of the TaaRPNNode class walks the linked list and adds the node to the final item in the list.
Other parsing methods follow the same scheme.

Listing 1: Parsing a search phrase.

type
// a base node that helps form an RPN string
TaaRPNNode = class
private
FNext : TaaRPNNode;
public
destructor Destroy; override;
procedure Append(aNode : TaaRPNNode);
property Next : TaaRPNNode read FNext;
end;
TaaRPNWord = class(TaaRPNNode) // an RPN node for a word
private
FWord : string;
public
constructor Create(const aPhraseWord : string);
property PhraseWord : string read FWord;
end;
// an RPN node for the AND operator
TaaRPN_AND = class(TaaRPNNode);
// an RPN node for the OR operator
TaaRPN_OR = class(TaaRPNNode);
// an RPN node for the NOT operator
TaaRPN_NOT = class(TaaRPNNode);
TaaSearchParser = class // a parser for search phrases
private
FCurWord : string;
FPhrase : string;
FPosn : integer;
FRPN : TaaRPNNode;
protected
function spGetRPN : TaaRPNNode;
procedure spSetPhrase(const aPhrase : string);
function spParseExpr : TaaRPNNode;
function spParseFactor : TaaRPNNode;
function spParseTerm : TaaRPNNode;
procedure spParsePhrase;
procedure spGetNextWord;
public
constructor Create(const aPhrase : string);
destructor Destroy; override;
property Phrase : string read FPhrase
write spSetPhrase;
property RPN : TaaRPNNode read spGetRPN;
end;
destructor TaaRPNNode.Destroy;
begin
Next.Free; // recursively destroys linked list of nodes
inherited Destroy;
end;
procedure TaaRPNNode.Append(aNode : TaaRPNNode);
var
Walker : TaaRPNNode;
begin
Walker := Self;
while (Walker.Next <> nil) do
Walker := Walker.Next;
Walker.FNext := aNode;
end;
constructor TaaRPNWord.Create(const aPhraseWord : string);
begin
inherited Create;
FWord := aPhraseWord;
end;
constructor TaaSearchParser.Create(const aPhrase : string);
begin
inherited Create;
Phrase := aPhrase;
end;
destructor TaaSearchParser.Destroy;
begin
FRPN.Free;
inherited Destroy;
end;
procedure TaaSearchParser.spGetNextWord;
var
Walker : PAnsiChar;
WordStart : PAnsiChar;
begin
{advance past the current word}
inc(FPosn, length(FCurWord));
FCurWord := '';
{skip past the white space}
Walker := @FPhrase[FPosn];
while (Walker^ = ' ') do begin
inc(FPosn);
inc(Walker);
end;
{check for the parentheses first; if found, copy them to
the internal field}
if (Walker^ = '(') then
FCurWord := '('
else if (Walker^ = ')') then
FCurWord := ')'
{otherwise search for the end of the current word (it'll
end with the end of the string, with white space, or
with parentheses); copy the word to the internal field}
else begin
WordStart := Walker;
while (Walker^ <> #0) and (Walker^ <> ' ') and
(Walker^ <> '(') and (Walker^ <> ')') do
inc(Walker);
FCurWord := Copy(FPhrase, FPosn, Walker - WordStart);
end;
end;
function TaaSearchParser.spGetRPN : TaaRPNNode;
begin
if (FRPN = nil) then
spParsePhrase;
Result := FRPN;
end;
function TaaSearchParser.spParseExpr : TaaRPNNode;
begin
Result := spParseFactor;
spGetNextWord;
if (FCurWord = 'and') then begin
spGetNextWord;
Result.Append(spParseExpr);
Result.Append(TaaRPN_AND.Create);
end else if (FCurWord = 'or') then begin
spGetNextWord;
Result.Append(spParseExpr);
Result.Append(TaaRPN_OR.Create);
end else if (FCurWord <> '') and (FCurWord <> ')') then
begin
Result.Append(spParseExpr);
Result.Append(TaaRPN_AND.Create);
end;
end;
function TaaSearchParser.spParseFactor : TaaRPNNode;
begin
if (FCurWord <> 'not') then
Result := spParseTerm
else begin
spGetNextWord;
Result := spParseFactor;
Result.Append(TaaRPN_NOT.Create);
end;
end;
procedure TaaSearchParser.spParsePhrase;
begin
if (FPhrase <> '') then begin
FPosn := 1;
spGetNextWord;
if (FCurWord <> '') then
FRPN := spParseExpr;
end;
end;
function TaaSearchParser.spParseTerm : TaaRPNNode;
begin
if (FCurWord = '(') then begin
spGetNextWord;
Result := spParseExpr;
if (FCurWord <> ')') then
raise Exception.Create(
'TaaSearchParser: missing close parenthesis in phrase');
end
else begin
if (FCurWord = '') then
raise Exception.Create(
'TaaSearchParser: missing final search word');
if (FCurWord = 'and') or (FCurWord = 'or') or
(FCurWord = 'not') then
raise Exception.Create(
'TaaSearchParser: operator used as search word');
Result := TaaRPNWord.Create(FCurWord);
end;
end;
procedure TaaSearchParser.spSetPhrase(const aPhrase : string);
begin
FPhrase := LowerCase(aPhrase);
FRPN.Free;
FRPN := nil;
end;

Errors in the search phrase are all caught by the spParseTerm method (it is at the bottom of the BNF grammar,
after all). Errors include: a missing right parenthesis; an operator (AND, OR, or NOT) being used as a search
word (example: “this AND and”); and a missing word after the phrase is completely read (example: “this
AND”). The most interesting method is perhaps spGetNextWord. It is this method that scans the search phrase for the
individual words. The search phrase is assumed to be pretty simple: words are separated by white space.
There’s a small extra wrinkle, however: we must take account of parentheses (otherwise “NOT (this AND that)”
would be parsed as “NOT”, “(this”, “AND”, and “that)”). So we must use parentheses as word delimiters as
well as white space.

This Must Be Magic

Well, that was the easy bit. Now onto the real meat: generating the index. The first step is parsing out all the
words from the text. In the search phrase parser, we had to read the phrase word by word. It should come as no surprise that we have
to do the same for every one of our documents; however, the definition of what constitutes a word becomes
more complex. Indeed, my research on Google led into all sorts of blind alleys and to some remarkably nasty
looking code. Let’s encapsulate my discoveries. There’s the school of thought that says ‘the space is the delimiter’. In other
words, to find the words, we drop an anchor and parse through the text character by character until a space is
reached. The text from the anchor to the space is a word. Continue like this (drop, parse, cut) until you have
parsed all the text and found all the words. This simplistic algorithm is a little too simple for our needs. Punctuation is not stripped, and ‘hangs’ on to the
beginnings or the ends of words. For instance, in that last sentence, we’d have [stripped,] [‘hangs’] and
[words.] as some of the ‘words’ in the sentence. Another problem is that if we follow the algorithm strictly we’d
get some false positives when there are two or more spaces in a row. Hence, in general, there’s some scanning
through spaces after finding a word. The next school of thought is the ‘words just consist of letters and digits’ brigade. Here we have two ‘modes’:
scanning an alphanumeric word or scanning all the punctuation or spaces in between (that is, the
non-alphanumeric characters). It’s a simple enough state machine that a non-beginner programmer could write
in 15 minutes. However, in that last sentence, it would convert [It’s] into two words [It] and [s] and
[non-beginner] into [non] and [beginner]: perhaps not what we intended. What we will do instead is to follow a middle path. We’ll make the delimiting characters a parameter passed to
the routine (a string would be best). The routine would then be coded as a state machine, as in the second example, but this time it would determine state switches based on whether the current character is, or isn't, in the
delimiter parameter. This is all very well, but what do we want to do with the words that we discover? Before I answer that question,
let's take a detour.

Welcome To The Real World

What I've done so far is describe a particular algorithm with certain well-defined features: it takes a bunch of
text and extracts the words, the words being delimited by a set of delimiter characters. I’m now just about to
go into details about how I want to use those words. Surely, it makes sense to implement the word parser now,
before we go on to the next step? I should put my programmer's hat on firmly and code up the parser in a
generic way. Once I’ve done that, I can then think about how I want to use the words that the parser is going to
spit out at me. In other words, we should ideally decouple the code that processes the words from the routine that discovers
them. Decoupling (or separating) code is a Good Thing in programming: it enables better reuse of code that
you’ve already written. Let’s continue with this argument. By doing this, writing a generic implementation of a word parser, I can use it
in other situations: implementing a word count routine, finding the distinct words used in the text, or
performing some analysis of the writer’s vocabulary. By not doing it, I end up with a word parser that can
only be used for my particular need today: tomorrow be damned. Of course, there is another very important reason: by writing decoupled code, you get to test the various
algorithms separately and thoroughly. If you had written tightly coupled code, you’d run into the problem of not
knowing where a bug might be. (Is it in the parser? Or is it in the code that processes each word? Maybe in the
coupling?) So, by separating out the ‘word-processor’ part from the ‘word-parser’, I could easily write the parser and then,
for testing, write a simple ‘word-processor’ routine that wrote the words as separate lines in a text file, for
example. Later on, I could write the real routine to process the words, secure in the knowledge that the word
parsing routine works and has been thoroughly tested. So, let’s do that. I’ll make the word parsing routine take in a stream containing the text to be parsed. It’ll also,
as I’ve said, take in a string containing the word delimiter characters. There are a couple of choices for how to
present the words parsed out of the text. First, put them all into a container (a string list or a text file, or similar)
to be processed en masse. Or, second, fire off an event for each word and let the word-processor decide what to
do with them (whether to save them, whether to count them, whether to process them all individually when they
arrive, and so on). To me the latter choice makes the most sense: it uses less memory and it defers any decision
about saving the words to the code making the call. So, the final parameter is an Action routine. For every word discovered, this routine will be called to process the
word. Our initial Action routine will just write the words out to a text file, but, later on, we’ll be writing the
heavy-duty routine that helps us in our quest. Normally, I would write the Action routine as a simple procedure.
However, I’ve been brought to task in the past for this simplistic eighties type of design and so I shall make it a
method of an object instead. Phew! After all that, Listing 2 shows the parser. If you look at it, you will see that I’ve made a designer’s
decision that any character less than or equal to a space is also a delimiter (characters like carriage return, line
feed, tab, and space: the so-called non-visible characters). The delimiter string therefore just needs to hold
visible punctuation.

Listing 2: Parsing a stream of text into words.

function IsDelimiter(aCh : char; const aDelims : string) : boolean;
// returns true if aCh is a delimiter in aDelims or is <= space
var
i : integer;
begin
Result := true;
if (aCh > ' ') then begin
for i := 1 to length(aDelims) do
if (aDelims[i] = aCh) then
Exit;
Result := false;
end;
end;
procedure AAParseWords(aText : TStream;
const aDelims : string; aAction : TaaWordParseAction);
const
BufSize = 16 * 1024;
var
State : (ScanWord, ScanOther);
Buffer : PChar;
BufPos : PChar;
BufLeft : integer;
Ch : char;
StopNow : boolean;
CharQ : TaaCharQueue;
begin
// prepare for the memory allocations
Buffer := nil;
CharQ := nil;
try
// create a character queue with which to build words
CharQ := TaaCharQueue.Create;
// allocate a buffer to hold data from the stream
GetMem(Buffer, BufSize);
// force the stream to be read first time through
BufLeft := 1;
BufPos := Buffer;
// we'll start in the scanning delimiter state
State := ScanOther;
StopNow := false;
// continue until there is no more data in the stream...
while (not StopNow) and (BufLeft <> 0) do begin
// advance the buffer variables (reading more data if
// needed)
dec(BufLeft);
if (BufLeft <> 0) then
inc(BufPos)
else begin
BufLeft := aText.Read(Buffer^, BufSize);
BufPos := Buffer;
// if the stream is exhausted, stop: there is
// no fresh character left to process
if (BufLeft = 0) then
Break;
end;
// get the next character
Ch := BufPos^;
// process the character according to the state
case State of
ScanWord :
begin
if IsDelimiter(Ch, aDelims) then begin
aAction(CharQ.AsString, StopNow);
State := ScanOther;
end
else
CharQ.Append(Ch);
end;
ScanOther :
begin
if not IsDelimiter(Ch, aDelims) then begin
CharQ.Clear;
CharQ.Append(Ch);
State := ScanWord;
end;
end;
end;{case}
end;
// flush any word still in progress when the stream ends
if (not StopNow) and (State = ScanWord) then
aAction(CharQ.AsString, StopNow);
finally
if (Buffer <> nil) then
FreeMem(Buffer);
CharQ.Free;
end;
end;

Skyscraping

What next? Well, we have to decide what to do with the words we parse out of the text. Remember that our goal
is to create a word index that we can search against. So it certainly makes sense that we have this huge structure
containing items keyed by word. What must each item contain? The word for a start: it’s the key into the index.
And then it will contain a list of documents that contain the word. I’m sure you can see that, in a loose sense
anyhow, given a word, we can easily find the documents that contain that word. This structure is known as an
inverted file. Of course, there is a group of words that we really shouldn’t bother to index since every document can be
expected to contain them, or, more precisely, so many documents will contain them it just doesn’t make sense
to index them. In English text, these include a, the, of, you, and, or, and so on. These words are known as stop
words, since when we get one we might as well stop processing it. There is no single list of stop words, I’m
afraid; it’s more of a subjective thing. For our purposes, we’ll use a simple list I found on the internet. At first blush, when I say huge for the inverted file, it seems I really mean gigantic. Suppose the documents
were URLs for a web search engine. Then, for example, the word text would have a list of possibly millions of
URLs. Each item in the index could then be tens (hundreds) of megabytes in size and there would be
thousands of them (possibly millions). I don’t think so. The first obvious thing we notice to reduce the space requirements is that the list of documents for each word
doesn’t have to contain the actual names of the documents. We could have an array of document names
somewhere and then for each item in the index we would have a list of integer values, each value being the
number of the document in the external array (a simple lookup, in other words, to find the document name for a
given value). That’s still pretty steep memory- wise for each item in the index, but manageable. However, it doesn’t make the
next stage particularly easy to do: applying a search phrase to the index. For a search phrase consisting of a
single word, it would be pretty easy to get a list of documents containing the word. But remember that the
search phrase can contain the logical operators: AND, OR, and NOT. How would we implement these? Suppose that we had the search phrase text AND indexing. We can look up both words and get the two lists.
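To see what such a combination involves, here is a miniature sketch, in Python rather than Delphi and with invented document numbers, of the AND and OR merges on sorted lists as described next:

```python
def and_merge(a, b):
    # intersection of two sorted lists of document numbers:
    # keep only the numbers that appear in both lists
    result, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1              # discard the smaller item
        else:
            j += 1
    return result

def or_merge(a, b):
    # union: every distinct document number from either list
    result, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i]); i += 1; j += 1
        elif a[i] < b[j]:
            result.append(a[i]); i += 1
        else:
            result.append(b[j]); j += 1
    return result + a[i:] + b[j:]   # one list may run out first

# hypothetical document lists for 'text' and 'indexing'
text_docs     = [1, 3, 4, 7, 9]
indexing_docs = [2, 3, 7, 8]
print(and_merge(text_docs, indexing_docs))  # [3, 7]
print(or_merge(text_docs, indexing_docs))   # [1, 2, 3, 4, 7, 8, 9]
```

Note how both merges depend on the lists being kept in ascending order, which is exactly the bookkeeping burden discussed below.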
Now, to perform an efficient AND on these two lists of integers would require the lists to be sorted (note to self:
must maintain the document lists sorted). Imagine that each list stands vertically, side by side. Look at the two
top items. If they are different, remove the smallest item and throw it away, and then go back to look at the two
top items again. If they are the same, remove both items from both lists and store the number in a result list, and
then go back to look at the two top items again. I’m sure that you can see that the resulting list would only
contain the numbers that originally appeared in both lists; in other words, we’ve done an AND operation. For OR, we would do the same, except that every distinct number gets put in the result list. For NOT, it’s a little
more complicated, but I’m sure you get the idea. (I’ve deliberately glossed over the fact that one of the two lists
could be empty during all this: an easy exercise for the reader.) However, it’s all a bit complicated. We must make sure the lists are maintained in ascending order. We must
code up the merge-type operation for each logical operator, and make sure that we cater for the
running-out-of-list problem. If we’re not careful we’d run into efficiency problems should the output list not
be large enough and has to be grown. Etc, etc. Oh, if only we could use some nice logical arithmetic instead of all this messing around with merging integer
lists. Well, actually, we can. Back in December 1999, I presented a bitset class, an array of Boolean values, all
either on or off, true or false. Methods were given to perform AND, OR and NOT operations on bitset objects.
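For illustration (a Python sketch, with unbounded integers standing in for the Delphi bitset class from that December 1999 column), the three operations behave like this:

```python
# Bitset logical arithmetic: each bit position is one Boolean
# value in the set. Python ints stand in for the bitset class.
NUM_BITS = 8
ALL_ONES = (1 << NUM_BITS) - 1      # mask needed to bound NOT

a = 0b00101101                      # bits 0, 2, 3, 5 set
b = 0b00100110                      # bits 1, 2, 5 set

def bits_set(x):
    # list which bit positions are true
    return [n for n in range(NUM_BITS) if x & (1 << n)]

print(bits_set(a & b))              # AND -> [2, 5]
print(bits_set(a | b))              # OR  -> [0, 1, 2, 3, 5]
print(bits_set(~a & ALL_ONES))      # NOT -> [1, 4, 6, 7]
```

Each operation is a single pass of machine-level bitwise arithmetic: no merging, no sorting, no growing output lists.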
What if we had a bitset for each word in our index, instead of a list of integers? Each bit would indicate a
particular document; if the nth bit were set, the nth document contained that word, if clear it would not. The
operations we'd perform in our search would then be simple indeed: logical arithmetic on bitsets.

All That Matters

Consider this idea in more detail, though. Each bitset corresponds to a particular word. For unusual words the
majority of the bits in the bitset are likely to be off, indicating that most documents will not contain the word.
Since the bits are gathered eight to a byte in this bitset class, that means that the majority of bytes in the
internal buffer are zero. And that means, in general, that the bitsets are highly compressible, most easily by
run-length encoding (RLE). RLE is a simple compression scheme where essentially runs of equal bytes are replaced by a length value byte
(or perhaps a word) followed by the byte itself (known as the literal byte). The fun part comes when the run
length is 1: how do you encode this? A length byte of 1 followed by the byte itself? That would mean that a
buffer of bytes with no runs greater than 1 would double in size when we compress it. We can’t just output the
literal byte and hope for the best: there is no way to know whether a byte is a literal byte or the length byte for
an RLE pair. Unfortunately, there is no standard method of solving this conundrum. I decided to use a single bit, a state bit, to indicate whether the next byte was literal or was a length byte for the
byte value that immediately followed. On its face, however, this is nasty: outputting single bits is quite simply a
pain in the neck. I didn’t want to output a Boolean either: too big. In the end, to avoid having to write individual
bits to the output buffer, I gathered the state bits eight at a time and output them as a byte before the data that
corresponds to the eight state bits. If you remember that far back, this is like the scheme I used in the LZ77
compression from the May 1999 issue. In our worst-case scenario (no run length is greater than one), this RLE
compression algorithm would grow the data by about an eighth in size. So, I altered the bitset class to store and load its data from a stream using RLE compression. To avoid spending
too much time decompressing the data, I changed the bitset class to defer unpacking the bit data until it was
actually needed. Listing 3 shows the bitset's RLE compression and decompression.

Listing 3: Bitset RLE (de)compression.

type
TRLEData = packed record
rdLen : byte;
rdVal : byte;
end;
TRLEPacker = class
private
FRLEData : array [0..7] of TRLEData;
FRLEInx : integer;
FBuffer : PByteArray;
FBufInx : integer;
protected
procedure rpWriteEncoding(aCount : integer);
public
constructor Create(aBufSize : integer);
destructor Destroy; override;
procedure Add(aLen : byte; aValue : byte);
procedure MarkComplete;
procedure CopyBuffer(var aDest);
property BufferLen : integer read FBufInx;
end;
constructor TRLEPacker.Create(aBufSize : integer);
begin
inherited Create;
GetMem(FBuffer, aBufSize);
end;
destructor TRLEPacker.Destroy;
begin
if (FBuffer <> nil) then
FreeMem(FBuffer);
inherited Destroy;
end;
procedure TRLEPacker.Add(aLen : byte; aValue : byte);
begin
FRLEData[FRLEInx].rdLen := aLen;
FRLEData[FRLEInx].rdVal := aValue;
inc(FRLEInx);
if (FRLEInx = 8) then
rpWriteEncoding(8);
end;
procedure TRLEPacker.CopyBuffer(var aDest);
begin
Move(FBuffer^, aDest, FBufInx);
end;
procedure TRLEPacker.MarkComplete;
begin
{add the sentinel to indicate end-of-compressed-data (a
code for a length of zero does this)}
Add(0, 0);
{write out any final encoding}
if (FRLEInx <> 0) then
rpWriteEncoding(FRLEInx);
end;
procedure TRLEPacker.rpWriteEncoding(aCount : integer);
var
i : integer;
ControlByte : byte;
begin
{initialize the control byte}
ControlByte := 0;
{for all the encodings, set the relevant bit of the
control byte if a run, leave it clear otherwise (note:
the end-of-data sentinel has a length of zero and this
code treats it as an actual length)}
for i := 0 to pred(aCount) do begin
ControlByte := ControlByte shl 1;
if (FRLEData[i].rdLen <> 1) then
inc(ControlByte);
end;
{if the number of encodings is less than 8, set the rest
of the bits as clear}
if (aCount <> 8) then
for i := aCount to 7 do
ControlByte := ControlByte shl 1;
{write out the control byte}
FBuffer^[FBufInx] := ControlByte;
inc(FBufInx);
{write out the encodings, either as run length followed by
the byte or as the byte itself if the run length were 1}
for i := 0 to pred(aCount) do begin
case FRLEData[i].rdLen of
0 : begin
FBuffer^[FBufInx] := 0;
inc(FBufInx);
end;
1 : begin
FBuffer^[FBufInx] := FRLEData[i].rdVal;
inc(FBufInx);
end;
else {any other value: 2..255}
FBuffer^[FBufInx] := FRLEData[i].rdLen;
inc(FBufInx);
FBuffer^[FBufInx] := FRLEData[i].rdVal;
inc(FBufInx);
end;
end;
{start over}
FRLEInx := 0;
end;
procedure TaaBitSet.bsPack;
var
i : integer;
B : byte;
PrevB : byte;
RunLen : byte;
Packer : TRLEPacker;
begin
{allocate a packer object with a buffer big enough for the
worst case, in which all runs are of length one--that
is, packing will grow the data by 1 byte for each 8
unpacked bytes, plus one byte over for the sentinel}
Packer := TRLEPacker.Create(FBitArraySize +
((FBitArraySize + 8) div 8));
try
{set it up so previous byte is the first byte and
current run length is zero: makes loop code easier}
PrevB := FBitArray^[0];
RunLen := 0;
{process all the bytes (the zero run length above makes
the first byte safe to revisit)}
for i := 0 to pred(FBitArraySize) do begin
{get the next byte}
B := FBitArray^[i];
{if it is different from the previous byte, close off
the previous run and start a new one}
if (B <> PrevB) then begin
Packer.Add(RunLen, PrevB);
PrevB := B;
RunLen := 1;
end
{otherwise, continue this run}
else begin
{if we've already reached 255 bytes in this run
before adding this byte, close it off and start a
new one}
if (RunLen = 255) then begin
Packer.Add(RunLen, PrevB);
RunLen := 0;
end;
inc(RunLen);
end;
end;
{close off the final run}
Packer.Add(RunLen, PrevB);
{mark the packer object as being complete (this adds the
sentinel and calculates the compressed buffer size)}
Packer.MarkComplete;
{reallocate our buffer and copy over the compressed
data}
FBitArraySize := Packer.BufferLen;
ReallocMem(FBitArray, FBitArraySize);