Updated: 10-07. Original article provided by Xueda Education (学大教育).
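Summary: As technology advances, people increasingly need to find useful information in large volumes of data. Automatic recognition of spatial relations is an important problem that has arisen against this background, and it is also an important task in natural language processing. It applies relation extraction techniques to predict the categories of unseen instances from a set of annotated instances. Unlike traditional single-label classification, a pair of geographic entities may belong to several categories at once, which makes this a multi-label classification problem.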
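After studying multi-label classification algorithms, this paper selects the k-nearest-neighbor-based multi-label classification algorithm (ML-KNN) for spatial relation extraction. ML-KNN extends the KNN algorithm to multi-label classification. It first finds the K nearest neighbors of each instance to be classified in the training set, then uses the categories of those neighbors and the maximum a posteriori principle to decide, for each possible label, whether the instance has it. To compute the similarity between spatial instances, this paper uses a method based on the extended subsequence kernel.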
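The two ML-KNN steps described above (neighbor search, then a per-label maximum a posteriori decision) can be sketched as follows. This is a minimal illustration and not the paper's implementation: the function names, the Laplace smoothing parameter `s`, the precomputed similarity rows, and the one-dimensional toy data are all assumptions made for the example.

```python
def neighbor_label_counts(sim_row, Y, k, skip=None):
    """Per label, count how many of the k most similar training instances carry it."""
    order = sorted((j for j in range(len(Y)) if j != skip),
                   key=lambda j: -sim_row[j])
    nearest = order[:k]
    return [sum(Y[j][l] for j in nearest) for l in range(len(Y[0]))]


def mlknn_fit(sim, Y, k, s=1.0):
    """Estimate label priors and count likelihoods from the training set.

    sim[i][j] is the similarity between training instances i and j,
    Y[i][l] is 1 if instance i has label l, and s is a smoothing
    constant (an assumption of this sketch).
    """
    n, L = len(Y), len(Y[0])
    prior1 = [(s + sum(Y[i][l] for i in range(n))) / (2 * s + n) for l in range(L)]
    c1 = [[0] * (k + 1) for _ in range(L)]  # histograms for instances WITH label l
    c0 = [[0] * (k + 1) for _ in range(L)]  # histograms for instances WITHOUT label l
    for i in range(n):
        counts = neighbor_label_counts(sim[i], Y, k, skip=i)
        for l in range(L):
            (c1 if Y[i][l] else c0)[l][counts[l]] += 1
    cond1 = [[(s + c) / (s * (k + 1) + sum(row)) for c in row] for row in c1]
    cond0 = [[(s + c) / (s * (k + 1) + sum(row)) for c in row] for row in c0]
    return prior1, cond1, cond0


def mlknn_predict(sim_row, Y, model, k):
    """Assign label l when P(has l) * P(count | has l) exceeds the opposite case."""
    prior1, cond1, cond0 = model
    counts = neighbor_label_counts(sim_row, Y, k)
    return [int(prior1[l] * cond1[l][c] > (1 - prior1[l]) * cond0[l][c])
            for l, c in enumerate(counts)]


# Hypothetical toy data: three clusters on a line, the last one carrying both labels.
xs = [0, 1, 2, 10, 11, 12, 20, 21, 22]
Y = [[1, 0]] * 3 + [[0, 1]] * 3 + [[1, 1]] * 3
similarity = lambda a, b: -abs(a - b)  # stand-in for a kernel-based similarity
sim = [[similarity(a, b) for b in xs] for a in xs]
model = mlknn_fit(sim, Y, k=2)
predictions = [mlknn_predict([similarity(q, b) for b in xs], Y, model, k=2)
               for q in (0.1, 9.9, 20.5)]
print(predictions)
```

Working from similarity rows rather than raw feature vectors keeps the sketch compatible with a kernel-based similarity, which is the kind of measure the paper describes.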
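This paper uses 188 Chinese documents collected from the Encyclopedia (《百科全书》) as experimental data. The 188 documents are split at random, with 3/4 used as training documents and the remaining 1/4 as test documents. Spatial relations are extracted with the ML-KNN algorithm, and the experimental results are analyzed using evaluation metrics for multi-label classification.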
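The random 3/4 to 1/4 split and one multi-label evaluation metric might look like the sketch below. The text does not say which metrics were used, so Hamming loss appears here only as one common choice; the function names, the fixed seed, and the integer document IDs are likewise assumptions.

```python
import random


def split_documents(doc_ids, train_fraction=0.75, seed=0):
    """Shuffle the document IDs and cut them into training and test parts."""
    ids = list(doc_ids)
    random.Random(seed).shuffle(ids)  # seeded only to make the sketch repeatable
    cut = int(len(ids) * train_fraction)
    return ids[:cut], ids[cut:]


def hamming_loss(Y_true, Y_pred):
    """Fraction of (instance, label) pairs on which prediction and truth disagree."""
    total = sum(len(row) for row in Y_true)
    wrong = sum(t != p
                for true_row, pred_row in zip(Y_true, Y_pred)
                for t, p in zip(true_row, pred_row))
    return wrong / total


# 188 documents, identified here by hypothetical integer IDs.
train_docs, test_docs = split_documents(range(188))
print(len(train_docs), len(test_docs))
```

With 188 documents this yields 141 training and 47 test documents. Splitting by document rather than by instance keeps all entity pairs from one document on the same side of the split.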
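[Keywords] Machine learning; spatial relations; relation extraction; multi-label classification; K nearest neighbors; extended subsequence kernel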
Abstract: With the development of technology, people increasingly need to extract useful information from large amounts of data. Automatic recognition of spatial relations is an important problem arising against this background and is also an important task in natural language processing. It uses relation extraction to classify unseen instances on the basis of labeled instances. Unlike traditional single-label classification, a pair of geographic entities may belong to multiple categories at the same time, so automatic recognition of spatial relations is a multi-label classification problem.
After studying multi-label classification algorithms, we choose the multi-label K nearest neighbors (ML-KNN) algorithm for spatial relation extraction. ML-KNN is a multi-label classification algorithm derived from the traditional K nearest neighbors (KNN) algorithm. First, for each unseen instance, its K nearest neighbors in the training set are identified. Then, based on the number of neighboring instances belonging to each possible class, the maximum a posteriori principle is used to determine the label set of the unseen instance. To compute the similarity between instances, we use an extended subsequence kernel method.
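To make the similarity computation more concrete, here is a minimal sketch of the standard gap-weighted subsequence kernel over token sequences. It is not the extended variant used in this work, whose details the abstract does not give. The function names, the decay factor `lam`, and the subsequence length `n` are assumptions made for illustration.

```python
import math


def subsequence_kernel(s, t, n, lam=0.5):
    """Gap-weighted count of common subsequences of length n in s and t.

    Each common subsequence contributes lam raised to the total span it
    covers in both sequences, so matches with gaps are penalized.
    """
    ls, lt = len(s), len(t)
    # kp[i][a][b] is the auxiliary kernel K'_i on the prefixes s[:a] and t[:b].
    kp = [[[0.0] * (lt + 1) for _ in range(ls + 1)] for _ in range(n)]
    for a in range(ls + 1):
        for b in range(lt + 1):
            kp[0][a][b] = 1.0
    for i in range(1, n):
        for a in range(i, ls + 1):
            kpp = 0.0  # running value of K''_i(s[:a], t[:b]) as b grows
            for b in range(i, lt + 1):
                match = lam * lam * kp[i - 1][a - 1][b - 1] if s[a - 1] == t[b - 1] else 0.0
                kpp = lam * kpp + match
                kp[i][a][b] = lam * kp[i][a - 1][b] + kpp
    total = 0.0
    for a in range(n, ls + 1):
        for b in range(n, lt + 1):
            if s[a - 1] == t[b - 1]:
                total += lam * lam * kp[n - 1][a - 1][b - 1]
    return total


def normalized_kernel(s, t, n, lam=0.5):
    """Scale the kernel into [0, 1] so that sequence length does not dominate."""
    denom = math.sqrt(subsequence_kernel(s, s, n, lam) * subsequence_kernel(t, t, n, lam))
    return subsequence_kernel(s, t, n, lam) / denom if denom else 0.0


# Hypothetical token sequences standing in for the context of two entity pairs.
left = ["the", "river", "flows", "through", "the", "city"]
right = ["the", "road", "runs", "through", "the", "town"]
print(normalized_kernel(left, right, n=2))
```

A normalized kernel value of this kind can serve directly as the similarity used to rank nearest neighbors in a KNN-style classifier.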
We use 188 Chinese documents collected from the Encyclopedia as our experimental data. We randomly select 3/4 of them as training documents and use the remaining 1/4 as test documents. We extract spatial relations with the ML-KNN method and analyze the experimental results.
Keywords: Machine Learning; Spatial Relations; Relation Extraction; Multi-Label Classification; K Nearest Neighbors; Subsequence Kernel