Not all data comes in tabular form. As we enter the era of big data, data arrives in many formats, including images, text, graphs, and more.
Because the format varies so much from one dataset to another, preprocessing the data into a machine-readable form is essential.
In this article, I will show how to preprocess text data with Python, using the NLTK library and the built-in `re` module.
The Process
1. Lowercase the text
Before we start processing the text, it is best to lowercase every character. We do this to avoid problems with case sensitivity later on.
Suppose we want to remove the stop words from a string; the usual operation is to join the remaining non-stop words back into a sentence. If the text is not lowercased, a stop word written with a capital letter will not be detected, and the string will come back unchanged. That is why lowercasing the text is so important.
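For a quick illustration (a minimal sketch, assuming NLTK's stop word corpus has already been downloaded), note that NLTK's stop word list contains only lowercase entries:

```python
from nltk.corpus import stopwords

stop_words = stopwords.words("english")  # every entry is lowercase

# A capitalized stop word is not detected until we lowercase it
print("This" in stop_words)           # False
print("This".lower() in stop_words)   # True
```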
Lowercasing itself is easy to implement in Python. The code looks like this:
```python
# Example
x = "Watch This Airport Get Swallowed Up By A Sandstorm In Under A Minute http://t.co/TvYQczGJdy"

# Lowercase the text
x = x.lower()
print(x)
>>> watch this airport get swallowed up by a sandstorm in under a minute http://t.co/tvyqczgjdy
```
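As a side note, for caseless matching beyond plain English text, Python also offers `str.casefold()`, which is slightly more aggressive than `lower()`. A minimal sketch:

```python
# casefold() handles cases that lower() leaves unchanged, e.g. the German ß
print("Straße".lower())     # straße
print("Straße".casefold())  # strasse
```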
2. Remove Unicode characters
Some texts may contain Unicode characters that become unreadable when rendered in ASCII. Mostly these come from emoji and other non-ASCII characters. To remove them, we can use code like this:
```python
# Example
x = "Reddit Will Now QuarantineÛ_ http://t.co/pkUAMXw6pm #onlinecommunities #reddit #amageddon #freespeech #Business http://t.co/PAWvNJ4sAP"

# Remove Unicode characters
x = x.encode('ascii', 'ignore').decode()
print(x)
>>> Reddit Will Now Quarantine_ http://t.co/pkUAMXw6pm #onlinecommunities #reddit #amageddon #freespeech #Business http://t.co/PAWvNJ4sAP
```
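The trick here is the `'ignore'` error handler: encoding to ASCII throws away every character that has no ASCII equivalent, and decoding turns the bytes back into a `str`. A minimal sketch with a hypothetical emoji example:

```python
# The snake emoji has no ASCII equivalent, so encode('ascii', 'ignore') drops it
x = "I love Python \U0001F40D"
x = x.encode('ascii', 'ignore').decode()
print(x)  # prints "I love Python " (the emoji is gone)
```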
3. Remove stop words
Stop words are words that do not contribute significant meaning to a text, so we can remove them. To retrieve the stop words, we can download a corpus from the NLTK library. Here is the code:
```python
import nltk
from nltk.corpus import stopwords

# Just download all of the NLTK data (the stopwords corpus is the part we need)
nltk.download()

stop_words = stopwords.words("english")

# Example
x = "America like South Africa is a traumatised sick country - in different ways of course - but still messed up."

# Remove stop words
x = ' '.join([word for word in x.split(' ') if word not in stop_words])
print(x)
>>> America like South Africa traumatised sick country - different ways course - still messed up.
```
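One caveat: the output above still ends with "up." even though "up" is in the stop word list, because splitting on spaces leaves punctuation attached to the token. This is one reason the punctuation handling in the next step also matters. A minimal sketch of the mismatch:

```python
import string
from nltk.corpus import stopwords

stop_words = stopwords.words("english")

token = "up."
print(token in stop_words)                            # False: the period is attached
print(token.strip(string.punctuation) in stop_words)  # True once punctuation is stripped
```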
4. Remove terms such as mentions, hashtags, and links
Besides Unicode characters and stop words, there are several other terms that need to be removed, including mentions, hashtags, links, and punctuation.
These are hard to remove if we rely only on fixed, predefined characters. Instead, we need regular expressions (regex) to match the patterns of the terms we want.
A regex is a special string that describes a pattern and matches the words associated with that pattern. We can search for or remove these patterns with Python's built-in `re` module. Here is the code:
```python
import re
import string

# Remove mentions
x = "@DDNewsLive @NitishKumar and @ArvindKejriwal can't survive without referring @@narendramodi . Without Mr Modi they are BIG ZEROS"
x = re.sub(r"@\S+", " ", x)
print(x)
>>> and can't survive without referring . Without Mr Modi they are BIG ZEROS

# Remove URL links
x = "Severe Thunderstorm pictures from across the Mid-South http://t.co/UZWLgJQzNS"
x = re.sub(r"https*\S+", " ", x)
print(x)
>>> Severe Thunderstorm pictures from across the Mid-South

# Remove hashtags
x = "Are people not concerned that after #SLAB's obliteration in Scotland #Labour UK is ripping itself apart over #Labourleadership contest?"
x = re.sub(r"#\S+", " ", x)
print(x)
>>> Are people not concerned that after obliteration in Scotland UK is ripping itself apart over contest?

# Remove apostrophes together with the characters that follow them
x = "Notley's tactful yet very direct response to Harper's attack on Alberta's gov't. Hell YEAH Premier! http://t.co/rzSUlzMOkX #ableg #cdnpoli"
x = re.sub(r"\'\w+", '', x)
print(x)
>>> Notley tactful yet very direct response to Harper attack on Alberta gov. Hell YEAH Premier! http://t.co/rzSUlzMOkX #ableg #cdnpoli

# Remove punctuation (each punctuation mark becomes a space)
x = "In 2014 I will only smoke crqck if I becyme a mayor. This includes Foursquare."
x = re.sub('[%s]' % re.escape(string.punctuation), ' ', x)
print(x)
>>> In 2014 I will only smoke crqck if I becyme a mayor  This includes Foursquare

# Remove numbers and any words containing digits
x = "C-130 specially modified to land in a stadium and rescue hostages in Iran in 1980... http://t.co/tNI92fea3u http://t.co/czBaMzq3gL"
x = re.sub(r'\w*\d+\w*', '', x)
print(x)
>>> C- specially modified to land in a stadium and rescue hostages in Iran in ... http://t.co/ http://t.co/

# Collapse two or more whitespace characters into a single space
x = " and can't survive without referring . Without Mr Modi they are BIG ZEROS"
x = re.sub(r'\s{2,}', " ", x)
print(x)
>>> and can't survive without referring . Without Mr Modi they are BIG ZEROS
```
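When these substitutions run over many documents, the patterns can be precompiled once with `re.compile`. The `re` module already caches compiled patterns internally, but precompiling makes the reuse explicit and skips the cache lookup on every call. A sketch along those lines (the `strip_terms` helper is just for illustration):

```python
import re

# Compile the patterns from above once, then reuse them for every document
URL     = re.compile(r"https*\S+")
MENTION = re.compile(r"@\S+")
HASHTAG = re.compile(r"#\S+")

def strip_terms(text):
    for pattern in (URL, MENTION, HASHTAG):
        text = pattern.sub(" ", text)
    return text

print(strip_terms("@user check https://t.co/abc #tag now"))
# mentions, URLs, and hashtags are replaced by spaces
```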
5. Combine all of the steps into one function
Now that we have seen each preprocessing step, let's apply them to a list of texts. If you look closely at the steps above, you will notice that the methods build on one another. We should therefore combine them into a single function so that all the steps run in sequence on every text. Before we apply the preprocessing, here are the example texts:
- Our Deeds are the Reason of this #earthquake May ALLAH Forgive us all
- Forest fire near La Ronge Sask. Canada
- All residents asked to 'shelter in place' are being notified by officers. No other evacuation or shelter in place orders are expected
- 13,000 people receive #wildfires evacuation orders in California
- Just got sent this photo from Ruby #Alaska as smoke from #wildfires pours into a school
To preprocess the list of texts, we take two steps:
- Create a function that contains all of the preprocessing steps and returns a preprocessed string.
- Apply the function to every entry of the list with the pandas method called `apply`.
The code looks like this:
```python
# In case the imports fail
# ! pip install nltk
# ! pip install textblob

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import re
import nltk
import string
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# If any corpus is missing, download all of the NLTK data
nltk.download()

df = pd.read_csv('train.csv')

stop_words = stopwords.words("english")
wordnet = WordNetLemmatizer()

def text_preproc(x):
    x = x.lower()
    x = ' '.join([word for word in x.split(' ') if word not in stop_words])
    x = x.encode('ascii', 'ignore').decode()
    x = re.sub(r'https*\S+', ' ', x)
    x = re.sub(r'@\S+', ' ', x)
    x = re.sub(r'#\S+', ' ', x)
    x = re.sub(r'\'\w+', '', x)
    x = re.sub('[%s]' % re.escape(string.punctuation), ' ', x)
    x = re.sub(r'\w*\d+\w*', '', x)
    x = re.sub(r'\s{2,}', ' ', x)
    return x

df['clean_text'] = df.text.apply(text_preproc)
```
The preprocessing results on the example texts above look like this:
- deeds reason may allah forgive us
- forest fire near la ronge sask canada
- residents asked place notified officers evacuation shelter place orders expected
- people receive evacuation orders california
- got sent photo ruby smoke pours school
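To spot-check the result, we can compare the raw and cleaned columns side by side (a minimal sketch using the `df` built above):

```python
# Show the first few raw tweets next to their cleaned versions
print(df[['text', 'clean_text']].head())
```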
Conclusion
Those are the steps for preprocessing text data with Python. I hope they help you solve problems involving text data, making your text data more consistent and your models more accurate.