Updated: 10-06. Original article provided by 学大教育 (Xueda Education).
Abstract: As the Internet has become an indispensable medium for disseminating information in daily life, people have fundamentally changed how they obtain information, shifting from passively receiving it to actively seeking out what interests them. Faced with the massive volume of information online, however, being able to quickly find information that is valuable or of interest becomes extremely important.
Keyword indexing of content is currently one of the highly effective approaches to information retrieval, and the technique is widely used in search engines and other Internet applications. Compared with earlier approaches that search the full text, searching only keywords is substantially faster and places much lower demands on system performance, so keyword-based retrieval delivers greater benefit at lower cost. However, information does not come with its keywords explicitly marked, and indexing keywords by hand is expensive. To improve the speed and quality of keyword indexing, having machines perform it automatically has therefore become a meaningful research topic, and automated keyword indexing is one direction for future research on Internet information processing.
This paper introduces the research background of keyword indexing and the state of research in China and abroad. Taking into account the differences between Chinese and English words, it segments Chinese text into words and extracts features of Chinese keywords. It then designs a keyword indexing algorithm that combines statistical information, semantic analysis, and machine learning methods, which achieved good results in experiments.
Keywords: keyword, index, extraction, keyword indexing, information retrieval
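The main aim of this work is to have machines automatically extract keywords from text. Because Chinese words differ from English words, some methods developed abroad are not suitable for indexing Chinese text, so keyword extraction aimed specifically at Chinese text is an important research direction of this work. Moreover, given the hierarchical relationships within a text's content, keywords need not be limited to words extracted from the text itself: any word that summarizes the content of the whole article can serve as a keyword, even if it never appears in the text. This requires the machine to have some capacity for learning and association when processing text.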
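To make the segmentation-then-extraction idea concrete, the following is a minimal sketch of a purely statistical baseline, not the algorithm designed in this paper. It assumes a tiny hypothetical dictionary (`TOY_DICT`), uses forward maximum matching as a stand-in segmenter, and ranks words by a smoothed TF-IDF score; the function names `segment` and `tfidf_keywords` are illustrative only. It can only return words that appear in the text, so it does not address keywords that summarize an article without occurring in it.

```python
import math
from collections import Counter

# Hypothetical toy dictionary (an assumption); a real system needs a large lexicon.
TOY_DICT = {"关键词", "索引", "信息", "检索", "搜索", "引擎", "文本", "抽取"}
MAX_WORD_LEN = max(len(w) for w in TOY_DICT)


def segment(text, dictionary=TOY_DICT, max_len=MAX_WORD_LEN):
    """Forward maximum matching: at each position take the longest dictionary word,
    falling back to a single character when nothing matches."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    return words


def tfidf_keywords(doc_words, corpus, top_k=3):
    """Rank the multi-character words of one segmented document by TF-IDF,
    where `corpus` is a list of segmented documents (lists of words)."""
    tf = Counter(w for w in doc_words if len(w) > 1)
    total = sum(tf.values())
    n_docs = len(corpus)
    scores = {}
    for word, count in tf.items():
        df = sum(1 for d in corpus if word in d)  # document frequency
        idf = math.log((1 + n_docs) / (1 + df)) + 1.0  # smoothed IDF
        scores[word] = (count / total) * idf
    # Sort by descending score, breaking ties alphabetically for determinism.
    return sorted(scores, key=lambda w: (-scores[w], w))[:top_k]


if __name__ == "__main__":
    texts = ["关键词索引用于信息检索", "搜索引擎检索信息", "文本关键词抽取关键词"]
    corpus = [segment(t) for t in texts]
    print(tfidf_keywords(corpus[2], corpus, top_k=2))
```

The segmentation step is what distinguishes the Chinese case: English text can be split on whitespace, whereas here word boundaries must be recovered before any frequency statistics are meaningful.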