我一直在寻找有效关键字提取任务算法。 目标是找到一种算法,能够以有效的方式提取关键字,并且能够平衡提取质量和执行时间,因为我的数据语料库迅速增加已经达到了数百万行。 我对于算法一个主要的要求是提取关键字本身总是要有意义的,即使脱离了上下文的语境也能够表达一定的含义。
本篇文章使用 2000 个文档的语料库对几种著名的关键字提取算法进行测试和试验。
使用的库列表
我使用了以下python库进行研究
NLTK,以帮助我在预处理阶段和一些辅助函数
- RAKE
- YAKE
- PKE
- KeyBERT
- Spacy
Pandas 和Matplotlib还有其他通用库
实验流程
基准测试的工作方式如下
我们将首先导入包含我们的文本数据的数据集。 然后,我们将为每个算法创建提取逻辑的单独函数
algorithm_name(str: text) → [keyword1, keyword2, ..., keywordn]
然后,我们创建的一个函数用于提取整个语料库的关键词。
extract_keywords_from_corpus(algorithm, corpus) → {algorithm, corpus_keywords, elapsed_time}
下一步,使用Spacy帮助我们定义一个匹配器对象,用来判断关键字是否对我们的任务有意义,该对象将返回 true 或 false。
最后,我们会将所有内容打包到一个输出最终报告的函数中。
数据集
我使用的是来自互联网的小文本数数据集。这是一个样本
- ['To follow up from my previous questions. . Here is the result!\n',
- 'European mead competitions?\nI’d love some feedback on my mead, but entering the Mazer Cup isn’t an option for me, since shipping alcohol to the USA from Europe is illegal. (I know I probably wouldn’t get caught/prosecuted, but any kind of official record of an issue could screw up my upcoming citizenship application and I’m not willing to risk that).\n\nAre there any European mead comps out there? Or at least large beer comps that accept entries in the mead categories and are likely to have experienced mead judges?', 'Orange Rosemary Booch\n', 'Well folks, finally happened. Went on vacation and came home to mold.\n', 'I’m opening a gelato shop in London on Friday so we’ve been up non-stop practicing flavors - here’s one of our most recent attempts!\n', "Does anyone have resources for creating shelf stable hot sauce? Ferment and then water or pressure can?\nI have dozens of fresh peppers I want to use to make hot sauce, but the eventual goal is to customize a recipe and send it to my buddies across the States. I believe canning would be the best way to do this, but I'm not finding a lot of details on it. Any advice?", 'what is the practical difference between a wine filter and a water filter?\nwondering if you could use either', 'What is the best custard base?\nDoes someone have a recipe that tastes similar to Culver’s frozen custard?', 'Mold?\n'
大部分是与食物相关的。我们将使用2000个文档的样本来测试我们的算法。
我们现在还没有对文本进行预处理,因为有一些算法的结果是基于stopwords和标点符号的。
算法
让我们定义关键字提取函数。
- # initiate BERT outside of functions
- bert = KeyBERT()
- # 1. RAKE
- def rake_extractor(text):
- """
- Uses Rake to extract the top 5 keywords from a text
- Arguments: text (str)
- Returns: list of keywords (list)
- """
- r = Rake()
- r.extract_keywords_from_text(text)
- return r.get_ranked_phrases()[:5]
- # 2. YAKE
- def yake_extractor(text):
- """
- Uses YAKE to extract the top 5 keywords from a text
- Arguments: text (str)
- Returns: list of keywords (list)
- """
- keywords = yake.KeywordExtractor(lan="en", n=3, windowsSize=3, top=5).extract_keywords(text)
- results = []
- for scored_keywords in keywords:
- for keyword in scored_keywords:
- if isinstance(keyword, str):
- results.append(keyword)
- return results
- # 3. PositionRank
- def position_rank_extractor(text):
- """
- Uses PositionRank to extract the top 5 keywords from a text
- Arguments: text (str)
- Returns: list of keywords (list)
- """
- # define the valid Part-of-Speeches to occur in the graph
- pos = {'NOUN', 'PROPN', 'ADJ', 'ADV'}
- extractor = pke.unsupervised.PositionRank()
- extractor.load_document(text, language='en')
- extractor.candidate_selection(pos=pos, maximum_word_number=5)
- # 4. weight the candidates using the sum of their word's scores that are
- # computed using random walk biaised with the position of the words
- # in the document. In the graph, nodes are words (nouns and
- # adjectives only) that are connected if they occur in a window of
- # 3 words.
- extractor.candidate_weighting(window=3, pos=pos)
- # 5. get the 5-highest scored candidates as keyphrases
- keyphrases = extractor.get_n_best(n=5)
- results = []
- for scored_keywords in keyphrases:
- for keyword in scored_keywords:
- if isinstance(keyword, str):
- results.append(keyword)
- return results
- # 4. SingleRank
- def single_rank_extractor(text):
- """
- Uses SingleRank to extract the top 5 keywords from a text
- Arguments: text (str)
- Returns: list of keywords (list)
- """
- pos = {'NOUN', 'PROPN', 'ADJ', 'ADV'}
- extractor = pke.unsupervised.SingleRank()
- extractor.load_document(text, language='en')
- extractor.candidate_selection(pos=pos)
- extractor.candidate_weighting(window=3, pos=pos)
- keyphrases = extractor.get_n_best(n=5)
- results = []
- for scored_keywords in keyphrases:
- for keyword in scored_keywords:
- if isinstance(keyword, str):
- results.append(keyword)
- return results
- # 5. MultipartiteRank
- def multipartite_rank_extractor(text):
- """
- Uses MultipartiteRank to extract the top 5 keywords from a text
- Arguments: text (str)
- Returns: list of keywords (list)
- """
- extractor = pke.unsupervised.MultipartiteRank()
- extractor.load_document(text, language='en')
- pos = {'NOUN', 'PROPN', 'ADJ', 'ADV'}
- extractor.candidate_selection(pos=pos)
- # 4. build the Multipartite graph and rank candidates using random walk,
- # alpha controls the weight adjustment mechanism, see TopicRank for
- # threshold/method parameters.
- extractor.candidate_weighting(alpha=1.1, threshold=0.74, method='average')
- keyphrases = extractor.get_n_best(n=5)
- results = []
- for scored_keywords in keyphrases:
- for keyword in scored_keywords:
- if isinstance(keyword, str):
- results.append(keyword)
- return results
- # 6. TopicRank
- def topic_rank_extractor(text):
- """
- Uses TopicRank to extract the top 5 keywords from a text
- Arguments: text (str)
- Returns: list of keywords (list)
- """
- extractor = pke.unsupervised.TopicRank()
- extractor.load_document(text, language='en')
- pos = {'NOUN', 'PROPN', 'ADJ', 'ADV'}
- extractor.candidate_selection(pos=pos)
- extractor.candidate_weighting()
- keyphrases = extractor.get_n_best(n=5)
- results = []
- for scored_keywords in keyphrases:
- for keyword in scored_keywords:
- if isinstance(keyword, str):
- results.append(keyword)
- return results
- # 7. KeyBERT
- def keybert_extractor(text):
- """
- Uses KeyBERT to extract the top 5 keywords from a text
- Arguments: text (str)
- Returns: list of keywords (list)
- """
- keywords = bert.extract_keywords(text, keyphrase_ngram_range=(3, 5), stop_words="english", top_n=5)
- results = []
- for scored_keywords in keywords:
- for keyword in scored_keywords:
- if isinstance(keyword, str):
- results.append(keyword)
- return results
每个提取器将文本作为参数输入并返回一个关键字列表。对于使用来讲非常简单。
注意:由于某些原因,我不能在函数之外初始化所有提取器对象。每当我这样做时,TopicRank和MultiPartiteRank都会抛出错误。就性能而言,这并不完美,但基准测试仍然可以完成。
我们已经通过传递 pos = {'NOUN', 'PROPN', 'ADJ', 'ADV'} 来限制一些可接受的语法模式——这与 Spacy 一起将确保几乎所有的关键字都是从人类语言视角来选择的。 我们还希望关键字包含三个单词,只是为了有更具体的关键字并避免过于笼统。
从整个语料库中提取关键字
现在让我们定义一个函数,该函数将在输出一些信息的同时将单个提取器应用于整个语料库。
- def extract_keywords_from_corpus(extractor, corpus):
- """This function uses an extractor to retrieve keywords from a list of documents"""
- extractor_name = extractor.__name__.replace("_extractor", "")
- logging.info(f"Starting keyword extraction with {extractor_name}")
- corpus_kws = {}
- start = time.time()
- # logging.info(f"Timer initiated.") <-- uncomment this if you want to output start of timer
- for idx, text in tqdm(enumerate(corpus), desc="Extracting keywords from corpus..."):
- corpus_kws[idx] = extractor(text)
- end = time.time()
- # logging.info(f"Timer stopped.") <-- uncomment this if you want to output end of timer
- elapsed = time.strftime("%H:%M:%S", time.gmtime(end - start))
- logging.info(f"Time elapsed: {elapsed}")
- return {"algorithm": extractor.__name__,
- "corpus_kws": corpus_kws,
- "elapsed_time": elapsed}
这个函数所做的就是将传入的提取器数据和一系列有用的信息组合成一个字典(比如执行任务花费了多少时间)来方便我们后续生成报告。
语法匹配函数
这个函数确保提取器返回的关键字始终(几乎?)意义。 例如,
我们可以清楚地了解到,前三个关键字可以独立存在,它们完全是有意义的。我们不需要更多信息来理解关键词的含义,但是第四个就毫无任何意义,所以需要尽量避免这种情况。
Spacy 与 Matcher 对象可以帮助我们做到这一点。 我们将定义一个匹配函数,它接受一个关键字,如果定义的模式匹配,则返回 True 或 False。
- def match(keyword):
- """This function checks if a list of keywords match a certain POS pattern"""
- patterns = [
- [{'POS': 'PROPN'}, {'POS': 'VERB'}, {'POS': 'VERB'}],
- [{'POS': 'NOUN'}, {'POS': 'VERB'}, {'POS': 'NOUN'}],
- [{'POS': 'VERB'}, {'POS': 'NOUN'}],
- [{'POS': 'ADJ'}, {'POS': 'ADJ'}, {'POS': 'NOUN'}],
- [{'POS': 'NOUN'}, {'POS': 'VERB'}],
- [{'POS': 'PROPN'}, {'POS': 'PROPN'}, {'POS': 'PROPN'}],
- [{'POS': 'PROPN'}, {'POS': 'PROPN'}, {'POS': 'NOUN'}],
- [{'POS': 'ADJ'}, {'POS': 'NOUN'}],
- [{'POS': 'ADJ'}, {'POS': 'NOUN'}, {'POS': 'NOUN'}, {'POS': 'NOUN'}],
- [{'POS': 'PROPN'}, {'POS': 'PROPN'}, {'POS': 'PROPN'}, {'POS': 'ADV'}, {'POS': 'PROPN'}],
- [{'POS': 'PROPN'}, {'POS': 'PROPN'}, {'POS': 'PROPN'}, {'POS': 'VERB'}],
- [{'POS': 'PROPN'}, {'POS': 'PROPN'}],
- [{'POS': 'NOUN'}, {'POS': 'NOUN'}],
- [{'POS': 'ADJ'}, {'POS': 'PROPN'}],
- [{'POS': 'PROPN'}, {'POS': 'ADP'}, {'POS': 'PROPN'}],
- [{'POS': 'PROPN'}, {'POS': 'ADJ'}, {'POS': 'NOUN'}],
- [{'POS': 'PROPN'}, {'POS': 'VERB'}, {'POS': 'NOUN'}],
- [{'POS': 'NOUN'}, {'POS': 'ADP'}, {'POS': 'NOUN'}],
- [{'POS': 'PROPN'}, {'POS': 'NOUN'}, {'POS': 'PROPN'}],
- [{'POS': 'VERB'}, {'POS': 'ADV'}],
- [{'POS': 'PROPN'}, {'POS': 'NOUN'}],
- ]
- matcher = Matcher(nlp.vocab)
- matcher.add("pos-matcher", patterns)
- # create spacy object
- doc = nlp(keyword)
- # iterate through the matches
- matches = matcher(doc)
- # if matches is not empty, it means that it has found at least a match
- if len(matches) > 0:
- return True
- return False
基准测试函数
我们马上就要完成了。 这是启动脚本和收集结果之前的最后一步。
我们将定义一个基准测试函数,它接收我们的语料库和一个布尔值,用于对我们的数据进行打乱。 对于每个提取器,它调用
extract_keywords_from_corpus 函数返回一个包含该提取器结果的字典。 我们将该值存储在列表中。
对于列表中的每个算法,我们计算
- 平均提取关键词数
- 匹配关键字的平均数量
- 计算一个分数表示找到的平均匹配数除以执行操作所花费的时间
我们将所有数据存储在 Pandas DataFrame 中,然后将其导出为 .csv。
- def get_sec(time_str):
- """Get seconds from time."""
- h, m, s = time_str.split(':')
- return int(h) * 3600 + int(m) * 60 + int(s)
- def benchmark(corpus, shuffle=True):
- """This function runs the benchmark for the keyword extraction algorithms"""
- logging.info("Starting benchmark...\n")
- # Shuffle the corpus
- if shuffle:
- random.shuffle(corpus)
- # extract keywords from corpus
- results = []
- extractors = [
- rake_extractor,
- yake_extractor,
- topic_rank_extractor,
- position_rank_extractor,
- single_rank_extractor,
- multipartite_rank_extractor,
- keybert_extractor,
- ]
- for extractor in extractors:
- result = extract_keywords_from_corpus(extractor, corpus)
- results.append(result)
- # compute average number of extracted keywords
- for result in results:
- len_of_kw_list = []
- for kws in result["corpus_kws"].values():
- len_of_kw_list.append(len(kws))
- result["avg_keywords_per_document"] = np.mean(len_of_kw_list)
- # match keywords
- for result in results:
- for idx, kws in result["corpus_kws"].items():
- match_results = []
- for kw in kws:
- match_results.append(match(kw))
- result["corpus_kws"][idx] = match_results
- # compute average number of matched keywords
- for result in results:
- len_of_matching_kws_list = []
- for idx, kws in result["corpus_kws"].items():
- len_of_matching_kws_list.append(len([kw for kw in kws if kw]))
- result["avg_matched_keywords_per_document"] = np.mean(len_of_matching_kws_list)
- # compute average percentange of matching keywords, round 2 decimals
- result["avg_percentage_matched_keywords"] = round(result["avg_matched_keywords_per_document"] / result["avg_keywords_per_document"], 2)
- # create score based on the avg percentage of matched keywords divided by time elapsed (in seconds)
- for result in results:
- elapsed_seconds = get_sec(result["elapsed_time"]) + 0.1
- # weigh the score based on the time elapsed
- result["performance_score"] = round(result["avg_matched_keywords_per_document"] / elapsed_seconds, 2)
- # delete corpus_kw
- for result in results:
- del result["corpus_kws"]
- # create results dataframe
- df = pd.DataFrame(results)
- df.to_csv("results.csv", index=False)
- logging.info("Benchmark finished. Results saved to results.csv")
- return df
结果
- results = benchmark(texts[:2000], shuffle=True)
下面是产生的报告
我们可视化一下:
根据我们定义的得分公式(
avg_matched_keywords_per_document/time_elapsed_in_seconds), Rake 在 2 秒内处理 2000 个文档,尽管准确度不如 KeyBERT,但时间因素使其获胜。
如果我们只考虑准确性,计算为
avg_matched_keywords_per_document 和 avg_keywords_per_document 之间的比率,我们得到这些结果
从准确性的角度来看,Rake 的表现也相当不错。如果我们不考虑时间的话,KeyBERT 肯定会成为最准确、最有意义关键字提取的算法。Rake 虽然在准确度上排第二,但是差了一大截。
如果需要准确性,KeyBERT 肯定是首选,如果要求速度的话Rake肯定是首选,因为他的速度块,准确率也算能接受吧。