Distinguishing Traditional from Simplified Chinese
Status: not yet resolved
|
lin11112
Junior Member Posts: 42 Replies: 83 Points: 25 Registered: 2003-02-17
|
pceyes
Honored Member Posts: 70 Replies: 657 Points: 1140 Registered: 2003-03-13
|
aftcast
Deputy Site Administrator Posts: 81 Replies: 1485 Points: 1763 Registered: 2002-11-21
You seem quite determined about this, so let me weigh in. First, here is an FAQ entry from the Unicode Consortium:

Q: How can I recognize from the 32 bit value of a Unicode character if this is a Chinese, Korean or Japanese character?

A: It's basically impossible and largely meaningless. It's the equivalent of asking if "a" is an English letter or a French one. There are some characters where one can guess based on the source information in Unihan.txt that it's traditional Chinese, simplified Chinese, Japanese, Korean, or Vietnamese, but there are too many exceptions to make this really reliable. (For example, one particularly nasty obscenity in Cantonese would probably have never been encoded for Cantonese, but has made it in for the sake of Korean, where one hopes it isn't nearly as obscene.) The phonetic data in Unihan.txt should not be used for this purpose. A blank in the phonetic data means that nobody's supplied a reading, not that a reading doesn't exist. Because updating the Unihan database is an ongoing process, these fields will be increasingly filled out as time goes on, but they should never be taken as absolutely complete. In particular, there are obscure characters where it is known that there is a reading, but since the character does not occur in standard dictionaries, we are unable to supply it (e.g., 䃟 U+40DF in Cantonese). A better solution is to look at the text as a whole: if there's a fair amount of kana, it's probably Japanese, and if there's a fair amount of hangul, it's probably Korean. The only proper mechanism is, as for determining whether "chat" is spelled correctly in English or French, is to use a higher-level protocol.

I hope you can follow that. That said, I thought it over and came up with an idea that might work. It is a fairly big job and I don't have time to test it, so I'll just share the idea.

I have a Big5-to-Unicode character mapping table. Load it into a database, Access for example.

Before reading the string, you need to know whether the file is Unicode or ANSI. There are a few ways to tell, such as checking for a BOM at the start of the file, but none of them is always accurate, and the details are hard to explain briefly. If you already know up front whether the file is Unicode or plain ANSI, you can skip this step.
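As a rough illustration of the BOM check mentioned above, here is a minimal Python sketch (Python is used only for brevity; the thread itself concerns C++ Builder). The function name `sniff_bom` is hypothetical. As the post warns, the result is not conclusive: a missing BOM does not prove the file is ANSI, because Unicode files are often written without one.

```python
import codecs

# Known BOM signatures. UTF-32 LE must be tested before UTF-16 LE,
# because its BOM (FF FE 00 00) begins with the UTF-16 LE BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data: bytes):
    """Return the encoding implied by a leading BOM, or None if no BOM is present.

    None only means "no BOM found"; the data may still be Unicode.
    """
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None
```

In practice one would read only the first four bytes of the file and pass them to `sniff_bom`, falling back to the ANSI path when it returns `None`.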
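Next, there are two cases:

1/ If the file is ANSI, take a random chunk of the text into an AnsiString and use IsLeadByte to tell whether a byte starts a Chinese (double-byte) character rather than an English one. If it is Chinese, convert that character to its hex code (its internal code; this takes a separate trick). Then use multibytetounicode (MultiByteToWideChar in the Win32 API) with Big5 specified as the source code page to convert it to Unicode, and convert that Unicode value to a hex code as well. Now look up the pair (internal-code hex, Unicode hex) in the database above: if it matches, the text is Big5; if it does not, it is Simplified Chinese.

2/ If the file is Unicode, the process runs the other way: compute the Unicode hex code first, use unicodetomultibyte (WideCharToMultiByte in the Win32 API) with Big5 as the target to get the Big5 code, convert that to a hex code, and look up the pair in the database.

In principle this should be quite feasible, but it really does take a fair amount of technique.

PS I know character encodings well and find them interesting, but I'm short on time right now and can't implement this for you yet; maybe in a little while. Or, if one of the veterans here follows the algorithm I'm describing, perhaps they can implement it or post part of the code for you.

=================== Quoting lin11112 ===================
What I want is, when reading in a text file, to be able to tell whether its content is Traditional or Simplified Chinese.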
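The second case above (Unicode input) can be sketched roughly as follows. This is only an illustration under assumptions, not the poster's implementation: Python's built-in `big5` codec stands in for both the mapping database and the WideCharToMultiByte call, and it may not match the poster's table exactly. The function name `big5_roundtrip_ratio` is hypothetical. The sketch reports what fraction of the CJK ideographs in a passage survive a Unicode to Big5 to Unicode round trip.

```python
def big5_roundtrip_ratio(text: str):
    """Fraction of CJK ideographs in `text` that round-trip through Big5.

    Returns None when the text contains no ideographs in U+4E00..U+9FFF.
    Assumption: Python's "big5" codec approximates the Big5/Unicode table
    described in the post.
    """
    han = [ch for ch in text if "\u4e00" <= ch <= "\u9fff"]
    if not han:
        return None

    survived = 0
    for ch in han:
        try:
            # Unicode -> Big5 -> Unicode; a character missing from Big5
            # raises UnicodeEncodeError (a subclass of UnicodeError).
            if ch.encode("big5").decode("big5") == ch:
                survived += 1
        except UnicodeError:
            pass
    return survived / len(han)
```

For example, `big5_roundtrip_ratio("漢語")` gives 1.0 and `big5_roundtrip_ratio("汉语")` gives 0.0. Many characters, such as 中, are shared by both scripts and always round-trip, so the ratio is only meaningful over a whole passage, which is in line with the FAQ's advice to look at the text as a whole.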
------
蕭沖 --All ideas are worthless unless implemented-- C++ Builder Delphi Taiwan G+ community http://bit.ly/cbtaiwan