cht > Computers & IT > gcin

Character Frequency (II)

Tetralet

joined: 2007-11-27
posted: 255
promoted: 35
bookmarked: 13
1 subject: Character Frequency (II), 2008-01-25

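Here are character frequencies obtained by certain means (which I can't disclose, part II). This time the results seem more trustworthy. Feel free to take a look!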

eliu

joined: 2007-08-09
posted: 11474
promoted: 617
bookmarked: 187
Hsinchu, Taiwan
2 subject: (none), 2008-01-25

That depends on whether it is Big5 or UTF-8.

Tetralet

joined: 2007-11-27
posted: 255
promoted: 35
bookmarked: 13
3 subject: (none), 2008-01-25

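Big5.txt → Traditional Chinese characters in Big5.

GB2312.txt → Simplified Chinese characters in GB2312.

Shift_JIS.txt → Japanese characters.

UTF-8.txt → characters in none of the three above.

How to sort (example):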

	awk '{print $2, $1}' Big5.txt | sort -r -g > Big5.count
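
As a rough illustration of what the command above does, here is a minimal sketch run on a made-up sample. It assumes each line of the input has the form "<character> <count>", which is inferred from the `$1`/`$2` swap and not confirmed in the thread; the file names `sample.txt` and `sample.count` and the ASCII stand-in characters are hypothetical.

```shell
# Hypothetical sample standing in for Big5.txt: "<character> <count>" per line.
# ASCII letters are used so the sketch runs the same in any locale.
workdir=$(mktemp -d)
printf '%s\n' 'a 12' 'b 305' 'c 7' > "$workdir/sample.txt"

# Put the count first, then sort numerically with the largest count on top.
awk '{print $2, $1}' "$workdir/sample.txt" | LC_ALL=C sort -r -g > "$workdir/sample.count"

cat "$workdir/sample.count"
# 305 b
# 12 a
# 7 c
```

For plain integer counts, `sort -rn` should give the same ordering; `-g` is an extension that also accepts floating-point notation.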
	

 

edited: 1
本人已不在此站活動 ("I am no longer active on this site")

joined: 2007-09-19
posted: 4946
promoted: 325
bookmarked: 206
Retired to the mountains
4 subject: (none), 2008-01-25
Thanks. 
cut -c 1-2 Big5.txt GB2312.txt Shift_JIS.txt UTF-8.txt | sort -u | wc -l
27850
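
A possible variant of the count above, offered as a sketch only: whether `cut -c 1-2` selects two bytes or two characters can depend on the `cut` implementation and the locale, so taking the first whitespace-separated field may be a more portable way to pull out the character column. This assumes the "<character> <count>" line layout suggested by the earlier `awk` command; the files `one.txt` and `two.txt` are hypothetical stand-ins for the four real lists.

```shell
# Two made-up lists that share one entry ("b"), standing in for the real files.
workdir=$(mktemp -d)
printf '%s\n' 'a 12' 'b 305' > "$workdir/one.txt"
printf '%s\n' 'b 40' 'c 7' > "$workdir/two.txt"

# Take the first field of every line, drop duplicates byte-wise, count the rest.
# tr strips the padding that some wc implementations add around the number.
distinct=$(awk '{print $1}' "$workdir/one.txt" "$workdir/two.txt" | LC_ALL=C sort -u | wc -l | tr -d ' ')

echo "$distinct"
# 3
```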

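May I ask which character set this is based on? In other words, how were the twenty-thousand-odd characters of the original base set chosen?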


edited: 1
Tetralet

joined: 2007-11-27
posted: 255
promoted: 35
bookmarked: 13
5 subject: (none), 2008-01-25

Taken from GCIN's cj5.cin and pho-huge.tab.src.

本人已不在此站活動 ("I am no longer active on this site")

joined: 2007-09-19
posted: 4946
promoted: 325
bookmarked: 206
Retired to the mountains
6 subject: (none), 2008-01-25
Tetralet wrote:

Taken from GCIN's cj5.cin and pho-huge.tab.src.

I see. Keep that tool around; later it can be used on non-BMP Han characters. :)

