How do I write a search engine?

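Accepted answer by: ddy

kiss999
Member
Posts: 19
Replies: 10
Points: 13
Registered: 2002-10-12

#1 Posted: 2002-11-29 00:19:34 IP: 218.187.xxx.xxx
I want to link to Google so that entering a keyword runs the search automatically. Is there any material on how to do this?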
ddy
Deputy site administrator
Posts: 262
Replies: 2105
Points: 1169
Registered: 2002-07-13

#2 Posted: 2002-11-29 00:55:47 IP: 61.59.xxx.xxx
A search engine roughly consists of two parts. One part accepts the user's query and looks up the data in a database. The other part uses algorithms to collect, analyze, and organize links from the web. Which part are you asking about?

If it is the first part, I think the full-text search that databases provide can already do a good job.

For the second part, the algorithms themselves are harder to come by, since most of them are trade secrets, but here are a few links on the underlying principles:

A study of Google's page-ranking ability and the quality of returned entries, using retrieval error rate as the example
http://dimes.lins.fju.edu.tw/pub/書藝-38/rer-Google.htm
Web page fetching and analysis: introduction
http://neural.cs.nthu.edu.tw/jang/books/webprog/08perl/13.asp?SessionCount=15
Web page fetching and analysis: advanced
http://neural.cs.nthu.edu.tw/jang/books/webprog/08perl/14.asp?SessionCount=16

If you cannot write one yourself, then call the Google Search Engine from Delphi:
http://delphi.ktop.com.tw/topic.php?TOPIC_ID=18344

Edited by ddy on 2002/11/29 01:11:03 and 01:12:06
flyup
Senior member
Posts: 280
Replies: 508
Points: 385
Registered: 2002-04-15

#3 Posted: 2002-12-23 15:05:37 IP: 61.217.xxx.xxx
Question/Problem/Abstract:
Describe the principles of an indexed search engine (like Google): how a large amount of text is condensed into an index so that a search can be done in a very short time.

Answer:
Introductory Principles of Indexed Searching
by Jim McKeeth

Introduction

There are really two main ways to search a large collection of text documents. The simplest method is to load each document and scan through it for the search terms; this is referred to as a full text scan. The second, much faster method is to create an index and then search the index. An index is a list of terms found in a document or set of documents. Each word appears only once per document in the index, so the index is much shorter than the original document.

Creating an index
--------------------------------------------------------------------------------

Finding the words

In order to create an index you must first parse the document. Parsing is the process of picking out the individual tokens (terms or words) in a piece of text. A parser is a type of state machine. There are many existing parsing routines available. "The Tomes of Delphi Algorithms and Data Structures" by Julian Bucknall contains many very good parsers.

An example: a simple parser would scan through a string of text, starting at the beginning, looking at each character. If the character is a letter or number then it is part of a word; if it is white space or punctuation then it is a separator. Each word is added to a list (e.g. a TStringList) in the order it is found in the document. Typically each word is converted to the same case (upper or lower).

It is really important to consider what you are indexing and how your index will be used when you design the parser and the index. For example, if you are parsing HTML then you want to exclude most tags (with the obvious exception of META tags, which are handled specially). Other times you might only want to index summary information about each document.
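The simple parser described above can be sketched as a two-state scanner. This is only an illustration in Python rather than Delphi; the function name tokenize is hypothetical and not from the article, and a plain list stands in for the TStringList. Note that str.isalnum also accepts non-ASCII letters and digits, which may or may not be what you want.

```python
def tokenize(text):
    """Split text into lower-cased words, in document order.

    Two states: inside a word (letters/digits) or between words
    (white space, punctuation). Hypothetical helper for illustration.
    """
    tokens = []
    current = []  # characters of the word being built; empty = "between words"
    for ch in text:
        if ch.isalnum():
            # Letter or number: it is part of the current word.
            current.append(ch.lower())
        elif current:
            # Separator right after a word: the word is complete.
            tokens.append("".join(current))
            current = []
    if current:
        # The text ended while still inside a word.
        tokens.append("".join(current))
    return tokens


if __name__ == "__main__":
    print(tokenize("Hello, world! Delphi 7 rocks."))
```

Because punctuation is always a separator here, a word such as "don't" comes out as two tokens; a real parser would need a rule for apostrophes and for HTML tags.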
Indexing the Words

Now that we have a parsed token list we need to index it. The simplest index is just a list of each word found in a document, and a reference to the document. This reference may be a URL, a document name or any other unique identifier (a GUID or a foreign key to another table describing the document). A more complex index may include the number of times the word is found in the document or a ranking for where it is in the document (in the title, keyword section, first paragraph, middle, last, etc.). This additional information stored with each word is part of what differentiates one search engine's performance from another.

Many times certain words are left out. These are called stop words. Stop words are common words, words that will not be searched on, or words that will not enhance the meaning of a search. Examples of stop words include "THE, A, AN, AND, IF, BUT", words with numbers in them, or anything else you want to filter out. Selecting stop words is another point where search engines differ in performance. Some web search engines used to leave out words like "HTML" or "WEB" because they were so common, while other search engines would include every word. Other search engines start with a dictionary list and only index words found in that list. This leads to trouble when you are indexing names, technical terms or anything else not found in your original dictionary.

One time I was creating a search engine for a collection of newsgroup articles. I discovered that there were UUEncoded (similar to MIME or Base64) binaries in the articles. This resulted in my parser finding words that were hundreds of characters long and total gibberish. I decided to omit any word longer than 50 characters or shorter than 4. Deciding what to include and what to omit is important, and the right choice will vary with the content you are indexing.
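As a rough sketch of the indexing step, the following Python builds one (document, word, count) entry per distinct word. The stop words and the 4-to-50 character limits are the ones the article mentions; the names STOP_WORDS, keep and index_document are hypothetical, and the regular expression is just a compact stand-in for a character-by-character parser.

```python
import re
from collections import Counter

# Stop words and length limits taken from the article's examples.
STOP_WORDS = {"the", "a", "an", "and", "if", "but"}
MIN_LEN, MAX_LEN = 4, 50


def keep(word):
    """Return True if the word should go into the index."""
    if word in STOP_WORDS:
        return False
    if any(ch.isdigit() for ch in word):
        return False  # "words with numbers in them" are filtered out
    return MIN_LEN <= len(word) <= MAX_LEN


def index_document(doc_id, text):
    """Return sorted (doc_id, word, count) rows, one per distinct kept word."""
    words = re.findall(r"[A-Za-z0-9]+", text.lower())
    counts = Counter(w for w in words if keep(w))
    return [(doc_id, word, n) for word, n in sorted(counts.items())]


if __name__ == "__main__":
    for row in index_document(1, "The search engine and the search index 2002 x86"):
        print(row)
```

With a minimum length of 4 the short stop words would be dropped anyway; the explicit list matters once you add longer common words such as "HTML" or "WEB".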
So here is an example table structure for your index:

Table: WordList
---------------
Document : Number (foreign key to Documents table)
Word     : String[20] (if our longest word is 20 characters)
Count    : Number (how many times the word is found)

The primary key would be a compound of Document and Word, since each word is listed once per document.

Table: Documents
----------------
Document : Number (primary key index)
Title    : String (title of document)
Location : String (URL or full filename and path)

Optionally you could include the entire document as a blob in this table. You could also have other tables that list terms (from the meta section of the document) or authors. Again, this design choice depends on the type of documents you are indexing and the purpose of your search engine.

Searching Your Index
--------------------------------------------------------------------------------

Once all the indexes are stored in a database you need to be able to search the index for a document. A simple SQL statement to search for documents that contain a single word could look like this:

    SELECT *
    FROM WordList
    WHERE Word = :v_search_term
    ORDER BY Count DESC

This returns all documents containing your single search term, ordered by the number of times the word is found. If you want to stay in SQL, searching on multiple terms involves a join for each term. Instead you could retrieve a list for each term and then merge the lists manually. This is where you would support the AND, OR and NOT key words.

If you want to allow phrase searching then you could search for each word in the phrase and then search those documents for the phrase. The same technique could be used for the NEAR key word. There are other, more advanced techniques that are much quicker, but they are beyond the scope of this document.
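The table structure and the "retrieve a list per term, then merge" approach can be tried out with Python's built-in sqlite3 module. This is a minimal sketch under several assumptions: SQLite stands in for whatever database you actually use, the sample rows are made up, the function names are hypothetical, and the Count column is quoted to keep it distinct from the SQL count() function.

```python
import sqlite3


def build_demo_db():
    """Create the two tables from the article in memory, with made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE Documents (
            Document INTEGER PRIMARY KEY,
            Title    TEXT,
            Location TEXT
        );
        CREATE TABLE WordList (
            Document INTEGER REFERENCES Documents(Document),
            Word     TEXT,
            "Count"  INTEGER,
            PRIMARY KEY (Document, Word)
        );
    """)
    conn.executemany("INSERT INTO Documents VALUES (?, ?, ?)", [
        (1, "Delphi search tips", "doc1.txt"),
        (2, "Index design", "doc2.txt"),
        (3, "Delphi components", "doc3.txt"),
    ])
    conn.executemany("INSERT INTO WordList VALUES (?, ?, ?)", [
        (1, "delphi", 3), (1, "search", 5),
        (2, "search", 2), (2, "index", 7),
        (3, "delphi", 1), (3, "index", 1),
    ])
    return conn


def docs_for(conn, term):
    """Single-term search: document ids, most occurrences first."""
    rows = conn.execute(
        'SELECT Document, "Count" FROM WordList '
        'WHERE Word = :term ORDER BY "Count" DESC, Document',
        {"term": term.lower()},
    )
    return [doc for doc, _count in rows]


def search_all(conn, terms):
    """AND: documents that contain every term."""
    sets = [set(docs_for(conn, t)) for t in terms]
    return sorted(set.intersection(*sets)) if sets else []


def search_any(conn, terms):
    """OR: documents that contain at least one term."""
    return sorted(set().union(*(docs_for(conn, t) for t in terms)))


def search_without(conn, include, exclude):
    """NOT: documents that contain `include` but not `exclude`."""
    return sorted(set(docs_for(conn, include)) - set(docs_for(conn, exclude)))


if __name__ == "__main__":
    db = build_demo_db()
    print(docs_for(db, "search"))
    print(search_all(db, ["delphi", "search"]))
    print(search_any(db, ["delphi", "index"]))
    print(search_without(db, "search", "delphi"))
```

Merging with sets discards the per-term ranking, so a real implementation would carry the counts along and combine them into a score before sorting.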
Once the hits are found and ranked, display the title of each document, possibly with a summary or the context of the hits, and provide a way for your user to reach the document.

Variations
--------------------------------------------------------------------------------

One thing Google does a little differently is that it looks at how pages are linked. This works really well with the hyperlinked nature of the web. For example, if you search for Borland, most pages that mention Borland link to www.borland.com. This is assumed to indicate that www.borland.com is a very important site about Borland. Google also limits the number of hits you get on each domain.

Many search engines also rank pages higher if the search term appears in the URL or title of the page. They also look at the description and keywords meta tags for ranking. Some search engines will actually ignore a word if it appears too often in a page. This weeds out sites that try to artificially inflate their rankings.

Phonetics or Soundex is another technique that can be used. This could be done with an additional table similar to the word table that stores the Soundex value of each word instead of the actual word.

Conclusion
--------------------------------------------------------------------------------

Searching a shorter and well organized index is much quicker than searching an entire document. Creating and searching an index takes a lot more planning and effort up front, but quickly pays off if the text is searched very often. Typically the larger and more complex the index, the more effective the search. If your index gets too large or complex, though, the search speed will degrade.

There are off the shelf searching tools available to end users and developers alike. dtSearch ( http://www.dtsearch.com/ ) makes a wide range of searching tools for both end users and all types of developers.
Tamarack Associates' Rubicon ( http://www.tamaracka.com/ ) is a search component for Delphi that provides a lot of flexibility, especially in storage options. Both are extremely fast and useful, but don't let that stop you from designing your own, especially if you have a specialized need.

See also: http://www.dtsearch.com/dtsoftware.html#Art_of_The_Text_Query
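For the Soundex variation mentioned in the article, the value stored in the extra table could be computed roughly as follows. This is a sketch of the commonly described American Soundex rules (keep the first letter, code the remaining consonants as digits, collapse adjacent letters with the same code, let H and W pass through without separating them, pad to four characters). The article does not say which variant it has in mind, so treat the choice of rules and the function name soundex as assumptions.

```python
# Consonant groups and their Soundex digits; vowels, H, W and Y have no code.
CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for letter in letters:
        CODES[letter] = digit


def soundex(word):
    """Return the four-character Soundex code, or "" if there are no letters."""
    letters = [c for c in word.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    result = letters[0]
    prev = CODES.get(letters[0], "")  # code of the previous significant letter
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with the same code
        code = CODES.get(ch, "")
        if code and code != prev:
            result += code
        prev = code  # a vowel resets prev, so a repeated code counts again
    return (result + "000")[:4]


if __name__ == "__main__":
    for name in ("Robert", "Rupert", "Ashcraft", "Tymczak"):
        print(name, soundex(name))
```

Words that sound alike, such as "Robert" and "Rupert", end up with the same code, so a search on the Soundex table finds both.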