THttpScan
Version 4.01  -  September 3, 2001

Home page http://www.delphicity.com


PACKAGE INSTALL

Note to C++Builder users:
------------------------
    Add a 
    #pragma link "inet.lib" 
    statement at the top of your main cpp file.

1. If a previous THttpScan package is already installed, 
remove it first:

- Component | Install Packages,

- click on "DelphiCity THttpScan",

- click "Remove",

- click "Yes",

- click "Ok",

- search for "HttpScan.*" and "Th*.*" files in your Borland 
directories and delete them, to be certain that old units do 
not remain in the search paths (which would cause obscure errors later). 


2. Install the current package:

- unzip the archive in a folder of your choice,

- depending on your Delphi or C++Builder version, copy all 
the Delphi\*.* or CBuilder\*.* files from the archive to 
the \Borland\Delphi\Imports or \Borland\CBuilder\Imports directory,

- run Delphi or C++Builder,

- select Component | Install Packages,

- press the "Add" button,

- locate the HttpScan.bpl file in the Imports directory and select it,

- select Open,

- select Ok,

- check the DelphiCity tab at the right of the component palette. 
The THttpScan component should have been added.


 

methods and functions

function Start: Boolean    (1st syntax)
starts downloading and processing the URL set in the StartingUrl property, which must have been set beforehand.

function Start (StartingUrl_: string): Boolean   (2nd syntax)
starts downloading and processing the URL passed in the StartingUrl_ parameter.

procedure Stop
kills all HttpScan processes currently running. Must be called before closing the form. The form can be closed once the OnWorking event occurs with False, or once the Working property returns False.
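
A minimal usage sketch, assuming a THttpScan component named HttpScan1 dropped on Form1 (the names are illustrative, and the handler declarations generated by your IDE may differ slightly):

procedure TForm1.StartButtonClick(Sender: TObject);
begin
  HttpScan1.StartingUrl := 'http://www.delphicity.com';
  HttpScan1.Start;                                      { 1st syntax }
  { or: HttpScan1.Start('http://www.delphicity.com');    2nd syntax }
end;

procedure TForm1.FormCloseQuery(Sender: TObject; var CanClose: Boolean);
begin
  HttpScan1.Stop;                       { kill all running processes }
  CanClose := not HttpScan1.Working;    { close only once HttpScan is idle }
end;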


 

properties

Agent: string = ' '
OBSOLETE.

ConcurrentDownloads: integer = 6
number of HTML page downloads running simultaneously (a value between 4 and 20 is a good range, depending on your connection speed and processor).

DepthSearchLevel: integer = 3
represents the depth of the tree of followed pages, starting from the first URL. In other words: "each time I find a link, I click on it", repeated n times. When the scan stays on the host of the starting URL (StayOnSite set to true), a high value allows you to grab an entire web site. 
Together with StayOnSite, this is the most important parameter.

HttpPort: integer = 80
http port of the starting Url

FileOfResults: string = ' '
complete path of the file in which to store the results of the processing.

HtmlExtensions: string of extensions separated by char(13)+char(10), not visible in the Object Inspector.
Set of extensions that should be recognized as HTML pages (e.g. "htm", "html", "php", "asp", etc.). THttpScan doesn't know whether a link is a text file, an image, a video, a sound or a binary file because, for speed reasons, it doesn't get the head of each link. Don't change this string except to add an HTML-type extension that THttpScan should follow. To do that, simply append char(13) + char(10) + 'your ext' to the string. DO NOT OVERWRITE THE EXISTING STRING, OTHERWISE THTTPSCAN WILL NOT WORK.
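
For example, to make THttpScan also follow "jsp" pages (a sketch; assumes a component named HttpScan1 and that the property is set at run time):

HttpScan1.HtmlExtensions := HttpScan1.HtmlExtensions + char(13) + char(10) + 'jsp';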

KeywordsFilter: string of keywords separated by char(13)+char(10), not visible in the Object Inspector.
Set of keywords used to filter out URLs. One keyword per line. Very short keywords will eliminate a lot of URLs (e.g. the keyword "th" eliminates every URL containing "th"). Activated by KeywordsFilterEnabled = true.

KeywordsFilterEnabled: boolean = false
if set to true, the KeywordsFilter string list is used to determine whether a URL contains one of the keywords and must be ignored.

KeywordsLimiter: string of keywords separated by char(13)+char(10), not visible in the Object Inspector.
Set of keywords URLs MUST CONTAIN. One keyword per line. Very short keywords will report a lot of URLs (e.g. the keyword "th" allows every URL containing "th"). Activated by KeywordsLimiterEnabled = true.

KeywordsLimiterEnabled: boolean = false
if set to true, the KeywordsLimiter string list is used to determine whether a URL contains one of the keywords and must be reported.
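
For example, to report only URLs containing "delphi" or "pascal" (a sketch, same HttpScan1 assumption as above):

HttpScan1.KeywordsLimiter := 'delphi' + char(13) + char(10) + 'pascal';
HttpScan1.KeywordsLimiterEnabled := True;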

LeavesFirst: boolean = false
if we think of the pages scanned (starting from the initial URL) as a tree with branches and leaves, THttpScan scans the leaves before the branches.

LinkScan: TLinkScan = (scanAllLinks, scanInitialSite, scanInitialPath)
Sets the global way to surf through links. 
scanAllLinks: for each HTML page found, all the links are downloaded and scanned, and so on.
scanInitialSite: scans only links owned by the site of the starting URL.
scanInitialPath: scans only links with the same subpath as the starting URL (links of the same tree level and below).

LinkReport: TLinkReport = (reportAllLinks, reportCurrentSiteLinks, reportCurrentPathLinks)
Sets the global way links are reported.
reportAllLinks: reports all links found in the current HTML page,
reportCurrentSiteLinks: reports only links owned by the same site as the current HTML page,
reportCurrentPathLinks: reports only links with the same subpath as the current HTML page (links of the same tree level and below).
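
For example, to crawl only the starting site while still reporting the external links found on its pages (sketch):

HttpScan1.LinkScan := scanInitialSite;    { follow links inside the starting site only }
HttpScan1.LinkReport := reportAllLinks;   { but report external links too }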

MaxQueueSize: integer = 5000
maximum size of the HTML pages queue. The queue grows much faster than the set of analyzed pages: after a few minutes you can have 50 pages analyzed and 10,000 pages in the queue. This size limitation helps avoid memory problems with huge queues. New links found are ignored if adding them would make the queue larger than MaxQueueSize.

Password: string = ' '
needed if the starting Url is username/password protected.

ProxyAddress: string = ''
IP address of the proxy server

ProxyPassword: string = ''
password to authenticate to the proxy server

ProxyPort: integer
Port of the proxy server

ProxyType: tProxyType = (PROXY_DIRECT, PROXY_USE_PROXY, PROXY_DEFAULT)
PROXY_DIRECT: direct connection to the Internet; all the Proxy... parameters are ignored
PROXY_USE_PROXY: the Proxy... parameters are used to connect and authenticate to the proxy server
PROXY_DEFAULT: the Control Panel settings are used

ProxyUser: string
username to authenticate to the proxy server
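
For example, to connect through an authenticating proxy (sketch; all values are placeholders to replace with your own):

HttpScan1.ProxyType := PROXY_USE_PROXY;
HttpScan1.ProxyAddress := '192.168.0.1';
HttpScan1.ProxyPort := 8080;
HttpScan1.ProxyUser := 'myuser';
HttpScan1.ProxyPassword := 'mypassword';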

Referrer: string = ' '
OBSOLETE.

Retries: integer = 3
number of download retries when a connect or GET error occurs.

SeekRobotsTxt: boolean = false
if set to true, THttpScan searches for a robots.txt file at the root of each site (http://www.hostname.foo/robots.txt). If the file is found, its body content is returned by the OnPageReceived event.

StartingUrl: string = ' '
the URL from which the scanning is performed. 
Must be set before calling the Start function if Start is called without a URL parameter.

TimeOut: integer = 300
time allowed to the HTTP thread to connect to a URL (in seconds) before the process is aborted. The thread tries to connect Retries times before the OnError event occurs.

TypeFilter: string of file types separated by char(13)+char(10), not visible in the Object Inspector.
Set of file types; only URLs of these types are reported. One file type per line (e.g. jpg, gif, mp3). Lowercase only. For jpeg use "jpg" and for mpeg use "mpg" (THttpScan converts jpeg to jpg and mpeg to mpg). Activated by TypeFilterEnabled = true.

TypeFilterEnabled: boolean = false
if set to true, the TypeFilter string list is used to report only URLs whose file type is found in the list.
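
For example, to report only image and MP3 links (sketch, same HttpScan1 assumption):

HttpScan1.TypeFilter := 'jpg' + char(13) + char(10) + 'gif' + char(13) + char(10) + 'mp3';
HttpScan1.TypeFilterEnabled := True;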

UserName: string = ' '
needed if the starting Url is username/password protected.

Working: boolean = false. Read-only, not visible in the Object Inspector.
Indicates the state of HttpScan: "waiting" or "working". Can be tested before closing the form to know whether downloads are currently running. See also the OnWorking event.


 

events

OnError (Url: String; ErrorCode: Cardinal; ErrorMsg: String);
occurs when a "GET" request fails. Returns the URL that failed, with the error code and the error message if available.

OnLinkFound (UrlFound, TypeLink, FromUrl, HostName, UrlPath, UrlPathWithFile, ExtraInfos: String; var WriteToFile: String);
This event occurs each time a link is found, and returns the following parameters:
UrlFound: the full address of the link found
TypeLink: the type of the link (htm, jpg, mpg, cgi, php, etc.)
FromUrl: the referring URL (.htm) from which the link comes
Hostname: the host name of the UrlFound address
UrlPath: the URL path (without host name and without file name)
UrlPathWithFile: the URL path (without host name but with file name)
ExtraInfos: the extra info passed to the URL (e.g. ?param1=v)
WriteToFile: the line to be written to FileOfResults. See the comments further below.
HrefOrSrc: returns 'S' if the link is an object loaded on the page (a thumbnail for example) and 'H' if the link is a destination URL.
CountArea: each area found receives a sequential number. When an Href or Src link is found, it receives the number of its area, so the Href/Src link couples can be associated.
FollowIfHtmlLink: if false, THttpScan doesn't continue searching in the direction of the current link.
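
A handler sketch following the parameter list documented above (the declaration generated by the IDE for your version may differ; Memo1 is an illustrative TMemo):

procedure TForm1.HttpScan1LinkFound(UrlFound, TypeLink, FromUrl, HostName,
  UrlPath, UrlPathWithFile, ExtraInfos: String; var WriteToFile: String);
begin
  if TypeLink = 'jpg' then
    Memo1.Lines.Add(UrlFound)   { collect image links on screen }
  else
    WriteToFile := '';          { write nothing to FileOfResults for other types }
end;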

OnLog (LogMessage: string);
returns a string explaining the internal processing (for debugging purposes)

OnMetaTag (Url, ReferringUrl, TagType, Tag1stAttrib, Tag1stValue, Tag2ndAttrib, Tag2ndValue, Tag3rdAttrib, Tag3rdValue: String);
Returns the tag type and attributes of the tags in the current HTML page. If there are 5 tags on a page, the event occurs 5 times for that page. The number of attributes differs according to the tag type, so the attribute parameters are simply called "1st", "2nd" and "3rd".
Url: the URL from which the meta tag is returned
ReferringUrl: the parent URL
TagType: TITLE, META, LINK, BASE, etc.
Tag1stAttrib: tag attribute, according to the TagType. E.g. if TagType = META, returns "NAME", "HTTP-EQUIV", etc.
Tag1stValue: value of Tag1stAttrib, e.g. if Tag1stAttrib = "NAME", returns "keywords", "description", etc.
Tag2ndAttrib: e.g. if Tag1stAttrib = "NAME" and Tag1stValue = "keywords", returns "CONTENT";
Tag2ndValue: e.g. if Tag1stAttrib = "NAME", Tag1stValue = "keywords" and Tag2ndAttrib = "CONTENT", returns the content string.
Tag3rdAttrib: e.g. if TagType = "LINK", Tag1stAttrib = "REL", Tag1stValue = "STYLESHEET", Tag2ndAttrib = "HREF", Tag2ndValue = "/style/??.css", returns "TYPE".
Tag3rdValue: e.g. "text/css" for the sample above.
If this seems complicated, take a look at the demo: you will find it is actually very simple!
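
A handler sketch (same assumptions as above) that keeps only the page descriptions:

procedure TForm1.HttpScan1MetaTag(Url, ReferringUrl, TagType, Tag1stAttrib,
  Tag1stValue, Tag2ndAttrib, Tag2ndValue, Tag3rdAttrib, Tag3rdValue: String);
begin
  { META NAME="description" CONTENT="..." arrives as
    Tag1stAttrib/Tag1stValue/Tag2ndAttrib/Tag2ndValue }
  if (TagType = 'META') and (Tag1stValue = 'description') then
    Memo1.Lines.Add(Url + ' : ' + Tag2ndValue);
end;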

OnPageReceived (Hostname, Url, Head, Body: string);
this event occurs each time an HTML page is downloaded, and returns the following parameters:
Hostname: host name of the page received
Url: URL of the text page received
Head: header returned by the HTTP request for the page received
Body: body text of the page received.
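
A handler sketch (same assumptions) that lists the pages whose source mentions a given word:

procedure TForm1.HttpScan1PageReceived(Hostname, Url, Head, Body: string);
begin
  if Pos('Delphi', Body) > 0 then
    Memo1.Lines.Add(Url);   { the page source contains the word }
end;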

OnUpdatedStats (InQueue, Downloading, ToAnalyze, Done, Retries, Errors: Integer);
occurs each time something changes in the HttpScan state. Returns the number of pages in queue (waiting for download), the number of pages currently downloading, the number of pages waiting to be analyzed, the number of pages analyzed (done), the number of retries, and the number of page downloads in error.
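
A sketch showing the counters in a status bar (StatusBar1 is an illustrative TStatusBar; Format comes from SysUtils):

procedure TForm1.HttpScan1UpdatedStats(InQueue, Downloading, ToAnalyze,
  Done, Retries, Errors: Integer);
begin
  StatusBar1.SimpleText := Format('queue: %d  downloading: %d  to analyze: %d  done: %d  retries: %d  errors: %d',
    [InQueue, Downloading, ToAnalyze, Done, Retries, Errors]);
end;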

OnWorking (working_: boolean);
occurs when HttpScan passes from the "waiting" state to the "working" state and back. Can be used to detect when HttpScan has finished its job. You can also use the Working property.
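
A sketch that re-enables the user interface when the scan ends (StartButton is an illustrative TButton):

procedure TForm1.HttpScan1Working(working_: boolean);
begin
  StartButton.Enabled := not working_;   { working_ is True while scanning }
end;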


 

Comments about the WriteToFile parameter used in the OnLinkFound event:

WriteToFile contains the string to be written to FileOfResults. If you leave it untouched, for each link found a line is written to the file like this: "TypeLink";"UrlFound";"HostName".

WriteToFile is useful to write links to the FileOfResults file only for some kinds of links (e.g. "jpg"), or to choose the information written to the file. For example:

If you want to write your own data to the file, e.g. TypeLink, UrlFound and FromUrl, then add the following line in the event:
WriteToFile := '"' + TypeLink + '";"' + UrlFound + '";"' + FromUrl + '"';

If you want to skip the current link and write nothing to the file for it, simply add the following lines in the event:
if TypeLink = 'exe' then begin   { example condition: skip executables }
   WriteToFile := '';
end;

 


 

DISCLAIMER

The author of this program accepts no responsibility for damages resulting from the use of this product and makes no warranty or representation, either express or implied, including but not limited to any implied warranty of merchantability or fitness for a particular purpose.

This software package is provided "AS IS", and you, the user, assume all risks when using it.


 

DESCRIPTION

With THTTPSCAN you access web sites as a collection of links to files and data, instead of as graphics and text.

THTTPSCAN recursively analyzes HTML pages and reports all the links it finds (html, mail, jpg, mpeg, mp3, etc.) to a text file.

THttpScan surfs along the links through the HTML pages in the neighborhood of the initial URL. Links appearing several times are processed only once.

The LinkScan property allows you to limit the scanning to the initial site or the initial URL path. 

The LinkReport property allows you to report only links owned by the current site, or even only links with the same path name.

DepthSearchLevel allows you to limit the depth of pages scanned, starting from the initial page; this is especially useful when the scanning is not limited to a single site. 

Using the LinkScan and LinkReport properties with a high DepthSearchLevel value, you can easily scan a whole web site or only a subdirectory of it.

Events are generated for each link found and each page read, returning the URL, meta tags, document type, referrer, host name, etc.

Depending on your line speed, you can grab thousands of links from a starting URL in a few minutes.

THTTPSCAN saves you from tangling with HTML parsing. The most common parameters can be set directly from the Object Inspector. The component can be placed on any form; it is only visible at design time.

Register and you will get the full source code with 24 months of upgrades.

 

FEATURES

  • creates a txt file containing all the links found.
  • reports all kinds of links: html, mail, jpg, gif, mp3, mpeg...
  • reports meta tags: title, description, keywords, ...

  • search depth level from 1 (same page) to n
    • a high level with "stay on site" enabled reports a whole site
    • a high level without "stay on site" reports all the links on all the pages reachable from the starting URL, down to the chosen search depth level.
  • events generated on each link found with the following parameters:
    • URL
    • extension
    • referring Url
    • hostname
    • path
    • extra info
    • href or src type
  • discovers links in frames, PHP and JavaScript,
  • events generated for each HTML page read:
    • page URL
    • full query status
    • full page content

  • real concurrent downloads (overcoming the WinInet limitations),

  • asynchronous, non-blocking transactions,

  • full proxy support.

 

WHAT IS NEW

v3.07:
- returns meta tags in the "OnMetaTag" event
- the retry count now actually retries
- minor bugs fixed.

 

SYSTEM REQUIREMENTS

  • Windows 95/98/NT/2000,
  • Delphi 4 or 5.

 

REGISTRATION

To register, visit the home page: http://www.delphicity.com