基于改进的TF-IDF算法的微博话题检测

陈朔鹰; 金镇晟

doi:10.3981/j.issn.1000-7857.2016.2.048

科技导报 >

2016 , Vol. 34 >Issue 2: 282 - 286

DOI: https://doi.org/10.3981/j.issn.1000-7857.2016.2.048

综述文章

基于改进的TF-IDF算法的微博话题检测

陈朔鹰 ,
金镇晟

展开

1. 北京理工大学网络信息中心, 北京 100081;
2. 北京理工大学计算机学院, 北京 100081

陈朔鹰,副教授,研究方向为计算机网络、数据挖掘、物联网,电子信箱:chensy@bit.edu.cn

收稿日期: 2015-04-23

修回日期: 2015-06-08

网络出版日期: 2016-02-04

收起

Weibo topic detection based on improved TF-IDF algorithm

CHEN Shuoying ,
JIN Zhensheng

Expand

1. Department of Network Information Center, Beijing Institute of Technology, Beijing 100081, China;
2. School of Computer Science & Technology, Beijing Institute of Technology, Beijing 100081, China

Received date: 2015-04-23

Revised date: 2015-06-08

Online published: 2016-02-04

Fold

摘要

中文微博具有更新快、时效性强等特点,产生的热点话题均具有一定的突发性,与此同时文本中有代表性的特征词也会随之激增。利用这一特性,在传统的TF-IDF(term frequency-inverse document frequency)基础上提出一种改进的特征权重算法,称之为TF-IDF-KE(term frequency-inverse document frequency-kinetic energy),用以解决突发性热点话题在聚类时特征不明显的问题。该算法结合物体的动能原理,将特征项的突发值用动能的概念进行描述,加入权值计算,提高突发性特征项的权重,最后使用CURE(clustering using representatives)算法,实现微博的话题检测。该方法描述了文本和特征项所具有的动态属性,实验结果表明,该方法能够有效地提高话题检测的效果。

关键词： 微博; TF-IDF; 话题检测; TDT; 文本聚类

本文引用格式

陈朔鹰 , 金镇晟 . 基于改进的TF-IDF算法的微博话题检测[J]. 科技导报, 2016 , 34(2) : 282 -286 . DOI: 10.3981/j.issn.1000-7857.2016.2.048

Abstract

The topic detection and tracking (TDT) is an issue of natural language processing, which concerns with solving the problem of information explosion. The Weibo TDT is a central issue in recent years. A bad performance is usually achieved for Weibo with a short text, while the topic detection of a long text is widely used in the industry with better results. Weibo's features of short text and not very clear meaning make the clustering algorithms' effect not ideal in topic detection. So this paper focuses on finding a new way to improve the effect of clustering for Weibo. Weibo features fast renewal and strong timeliness. Hot topics produced by Weibo show burstiness, and their representative words increase in a great extent. With this feature in mind, improving the representative word's weight to a certain degree is a good way to give a prominence to the feature of short text. The burstiness of the words is a thing to consider, similar to the kinetic theory of the object. The formula of the kinetic energy theorem is used in this paper. Then an improved feature extraction algorithm named the TFIDF-KE (term frequency-inverse document frequency-kinetic energy) is proposed. The new algorithm consists of the kinetic energy and the TF-IDF (term frequency-inverse document frequency). The formula of the kinetic energy theorem is used to evaluate the burstiness of the words and add the value to the formula. Then, the weight of some important words can be improved when extracting features. Finally, the implementation of the CURE (clustering using representatives) algorithm completes the Weibo topic detection task. The method presented in this paper describes burstiness of text and feature and solves the problem that the feature of bursty hot topics is not obvious, when clustering in a certain extent. The experimental results show that the method can effectively improve the effect of topic detection in some degree and a better accuracy rate P can be achieved, as well as the R and F values of the recall rate. So TF-IDF-KE is an effective optimization method and can well be used for the task of the TDT.

Key words： Weibo; TF-IDF; topic detection; TDT; text clustering

参考文献

[1] 洪宇, 张宇, 刘挺, 等. 话题检测与跟踪的评测及研究综述[J]. 中文信息学报, 2007, 21(6): 71-87. Hong Yu, Zhang Yu, Liu Ting, et al. Topic detection and tracking re-view[J]. Journal of Chinese Information Processing, 2007, 21(6): 71-87.
[2] Li H, Wei J. Netnews bursty hot topic detection based on bursty fea-tures[C]. Proceedings of the 2010 International Conference on E-Busi-ness and E-Government.Guangzhou, China, 2010.
[3] Holz F, Teresniak S. Towards automatic detection and tracking of topic change[M]//Computational linguistics and Intelligent Text Processing. Berlin Heidelberg: Springer, 2010: 327-339.
[4] Qiu J, Liao L J, Dong X J. Topic detection and tracking for Chinese news web pages[C]. Advanced Language Processing and Web Information Tech-nology, 2008. ALPIT'08. International Conference on Dalian, China, 2008.
[5] Allan J, Lavrenko V, Jin H. First story detection in TDT is hard[C]//Pro-ceedings of the Ninth International Conference on Information and Knowledge Management. Atlanta, Georgia ACM, 2000: 374-381.
[6] Salton G, Wong A, Yang C S. A vector space model for automatic index-ing[J]. Communications of the ACM, 1975, 18(11): 613-620.
[7] 贾自艳, 何清, 张海俊, 等. 一种基于动态进化模型的事件探测和追踪算法[J]. 计算机研究与发展, 2004, 41(7): 1273-1280. Jia Ziyan, He Qing, Zhang Haijun, et al. A news event detection and tracking algorithm based on dynamic evolution model[J]. Journal of Computer Research and Development, 2004, 41(7): 1273-1280.
[8] 闵可锐, 赵迎宾, 刘昕, 等. 互联网话题识别与跟踪系统设计及实现[J]. 计算机工程, 2008, 34(19): 212-214. Min Kerui, Zhao Yingbin, Liu Xin, et al. Design and implementation of topic detection and tracking system on web[J]. Computer Engineering, 2008, 34(19): 212-214.
[9] Fung G P C, Yu J X, Yu P S, et al. Parameter free bursty events detection in text streams[C]//Proceedings of the 31st International Conference on Very Large Data Bases. Trondheim, VLDB Endowment, 2005: 181-192.
[10] Fung G P C, Yu J X, Liu H, et al. Time-dependent event hierarchy construction[C]//Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. San Jose: ACM, 2007: 300-309.
[11] Wang X, Zhai C X, Hu X, et al. Mining correlated bursty topic pat-terns from coordinated text streams[C]//Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining. San Jose, USA: ACM, 2007: 784-793.
[12] Subašic I, Berendt B. From bursty patterns to bursty facts: the effec-tiveness of temporal text mining for news[J]. Aug-2010, 2010.
[13] 薛峰, 周亚东, 高峰, 等. 一种突发性热点话题在线发现与跟踪方法[J]. 西安交通大学学报, 2012, 45(12): 64-69. Xue Feng, Zhou Yadong, Gao Feng, et al. An online detection and tracking method for burstytopics[J]. Journal of Xi'an Jiaotong Universi-ty, 2012, 45(12): 64-69.
[14] 翟东海, 聂洪玉, 崔静静, 等. 基于改进的χ2检验的热点词突发性度量研究[J]. 计算机与数字工程, 2013, 41(11): 1788-1790. Zhai Donghai, Nie Hongyu, Cui Jingjing, et al. Busty measurement of hot term based on improvement χ2 test combined with TF[J]. Computer & Digital Engineering, 2013, 41(11): 1788-1790.
[15] 翟东海, 王佳君, 聂洪玉, 等. 基于互信息的热点词发现和突发性话题检测研究[J]. 西藏大学学报: 自然科学版, 2013(1): 17. Zhai Donghai, Wang Jiajun, Nie Hongyu, et al. Research on bursty topic detection and hot word extraction based on mutual information[J]. Journal of Tibet University, 2013(1): 17.
[16] Salton G, McGill M J. Introduction to modern information retrieval[M]. New York: McGraw-Hill, 1983.
[17] Salton G, Buckley C. Term-weighting approaches in automatic text re-trieval[J]. Information Processing & Management, 1988, 24(5): 513-523.
[18] Nigram K, Lafferty J, Mccallum A. Using maximum entropy for text classification[J]. In Proceedings of IJCAI-1999, 1999.
[19] Pang-Ning T, Steinbach M, Kumar V. Introduction to data mining[M]. Boston: Addison-Wesley, 2006.
[20] Guha S, Rastogi R. Cure: An efficient clustering algorithm for large da-tabases[J]. Information Systems, 2001, 26(1): 35-58.

Options

文章导航

摘要

本文引用格式

Abstract

参考文献

联系我们

访问统计

模态框（Modal）标题

摘要

本文引用格式

Abstract

参考文献

联系我们

访问统计