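• Chinese Association for Science of Science and S&T Policy Research
  • Institute of Policy and Management, Chinese Academy of Sciences
  • Center for Science, Technology and Society, Tsinghua University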
ISSN 1003-2053 CN 11-1805/G3

2013, Issue (11): 1615-1622.

• Theory and Methods of Science of Science •

A Study of the Term Clumping Method

Zhang Yi (张嶷)1, Wang Xuefeng (汪雪锋)2, Zhu Donghua (朱东华)1, Zhou Xiao (周潇)2

  1. Beijing Institute of Technology
  2. School of Management and Economics, Beijing Institute of Technology
  • Received: 2013-03-03 Revised: 2013-06-13 Publication date: 2013-11-15 Release date: 2013-11-18
  • Corresponding author: Zhang Yi (张嶷)

Term Clumping

  • Received: 2013-03-03 Revised: 2013-06-13 Online: 2013-11-15 Published: 2013-11-18

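Summary: How to extract useful information from scientific and technical literature data and strengthen the capacity for knowledge discovery is a prominent topic in current science-of-science research. Many of the relevant analytical techniques and methods are built around the “terms” obtained through natural language processing. In general, however, the number of terms extracted from scientific and technical literature is enormous: manual cleaning is nearly impossible, and cleaning by software alone lacks credibility. Building on bibliometric methods, this paper constructs a semi-automatic “term clumping” framework comprising stopword lists, fuzzy semantic processing, association rules, term frequency and document frequency transformation, and cluster analysis. The framework cleans, consolidates, and clusters terms using primarily quantitative methods supplemented by qualitative ones, with the aim of providing more accurate term lists for competitive technical intelligence analysis. Taking scientific and technical literature on China's “photovoltaic cell” field from the Derwent patent database as an example, the paper conducts an empirical study that verifies the soundness and effectiveness of the method.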

Keywords: Text Analysis, Bibliometrics, Text Mining, Term Clumping, Photovoltaic Cell

Abstract: How to retrieve useful information and improve the capability for knowledge discovery is one of the most active topics in science-of-science research. Most related approaches are based on the “terms” derived through Natural Language Processing. However, the phrases and terms retrieved in this way are very numerous and noisy: cleaning them by hand is nearly impossible, and cleaning them by software alone lacks credibility. Based on bibliometric and text mining techniques, this paper constructs a set of semi-automatic “term clumping” steps involving stopword thesauri, fuzzy matching, association rules, term frequency-inverse document frequency (TF-IDF) analysis, and cluster analysis. These steps combine quantitative and qualitative methodologies for term cleaning, consolidation, and clustering, and generate a better term list for competitive technical intelligence. The paper applies the term clumping steps to a dataset on China's photovoltaic cells drawn from the Derwent patent database and demonstrates their effectiveness.
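To make the listed steps more concrete, the following is a minimal sketch of three of them (stopword filtering, fuzzy-matching consolidation, and TF-IDF scoring) using only the Python standard library. It is not the authors' implementation: the stopword list, the similarity threshold of 0.85, the function names, and the toy documents are all assumptions chosen for illustration, and the TF-IDF weighting is the common textbook form rather than anything specified in the abstract. Association rules and cluster analysis are not covered.

```python
import math
from collections import Counter
from difflib import SequenceMatcher

# Hypothetical stopword list; the paper's actual thesauri are not given here.
STOPWORDS = {"method", "invention", "said", "thereof"}


def remove_stopwords(docs, stopwords=STOPWORDS):
    """Drop every term that appears in the stopword list (case-insensitive)."""
    return [[t for t in doc if t.lower() not in stopwords] for doc in docs]


def consolidate(docs, threshold=0.85):
    """Map near-identical spellings onto the first variant seen.

    Similarity is difflib's ratio on lowercased strings; the threshold
    is an assumed value, not one taken from the paper.
    """
    canonical = []  # representative terms, in order of first appearance
    mapping = {}    # original term -> representative term
    for doc in docs:
        for term in doc:
            if term in mapping:
                continue
            key = term.lower()
            match = next(
                (c for c in canonical
                 if SequenceMatcher(None, key, c.lower()).ratio() >= threshold),
                None,
            )
            if match is None:
                canonical.append(term)
                match = term
            mapping[term] = match
    return [[mapping[t] for t in doc] for doc in docs]


def tfidf(docs):
    """Score each term per document as (relative term frequency) * ln(N / df)."""
    n_docs = len(docs)
    doc_freq = Counter(t for doc in docs for t in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        scores.append({
            t: (c / len(doc)) * math.log(n_docs / doc_freq[t])
            for t, c in counts.items()
        })
    return scores


# Toy term lists standing in for NLP-extracted terms from three patent records.
raw_docs = [
    ["solar cell", "solar cells", "method", "silicon wafer"],
    ["Solar cell", "thin film", "said"],
    ["thin films", "silicon wafer", "electrode"],
]

cleaned = consolidate(remove_stopwords(raw_docs))
scores = tfidf(cleaned)
```

In this sketch, spelling variants such as "solar cells" and "Solar cell" collapse onto one representative before scoring, so document frequencies are counted per consolidated term rather than per surface form. A qualitative review step, as the abstract suggests, would still be needed to check merges that a string-similarity threshold gets wrong.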