Python3: Counting How Many Times Each Word Occurs in a String

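Requirement: count how many times each word occurs in a file or a string. Because sentences contain punctuation, splitting the string directly on spaces leaves the punctuation attached to the words. For example: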

We met at the wrong time, but separated at the right time. The most urgent is to take the most beautiful scenery!!! the deepest wound was the most real emotions.

Splitting it directly on spaces gives:

['We', 'met', 'at', 'the', 'wrong', 'time,', 'but', 'separated', 'at', 'the', 'right', 'time.', 'The', 'most', 'urgent', 'is', 'to', 'take', 'the', 'most', 'beautiful', 'scenery!!!', 'the', 'deepest', 'wound', 'was', 'the', 'most', 'real', 'emotions.']
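For reference, a list like the one above can be produced with a plain call to str.split. This is only a minimal sketch; the variable names text and words are chosen here for illustration and do not appear elsewhere in the post.

```python
# The example sentence, split across two literals for readability.
text = ('We met at the wrong time, but separated at the right time. '
        'The most urgent is to take the most beautiful scenery!!! '
        'the deepest wound was the most real emotions.')

# Splitting on a single space keeps punctuation glued to the words.
words = text.split(' ')
print(words)
```

Entries such as 'time,' and 'scenery!!!' show why the punctuation has to be dealt with before counting.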

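Idea: what if we remove the punctuation first (replace each punctuation character with an empty string) and only then split?

How do we remove punctuation? Or rather, how do we tell whether a character is a punctuation mark?

The standard-library string module provides a constant for this.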

import string

print(dir(string))

Output:

['Formatter', 'Template', '_ChainMap', '_TemplateMetaclass', '__all__', '__builtins__', '__cached__', '__doc__', '__file__', '__loader__', '__name__', '__package__', '__spec__', '_re', '_string', 'ascii_letters', 'ascii_lowercase', 'ascii_uppercase', 'capwords', 'digits', 'hexdigits', 'octdigits', 'printable', 'punctuation', 'whitespace']

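Among these, string.punctuation is a string containing all ASCII punctuation characters, so a membership test (ch in string.punctuation) tells us whether a character is a punctuation mark.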
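As a quick check of how that membership test behaves, the sketch below tries a few sample characters. The sample characters and the name results are illustrative choices, not part of the original post.

```python
import string

# Map each sample character to whether it is ASCII punctuation.
results = {ch: ch in string.punctuation for ch in [',', '!', 'a', ' ']}

for ch, is_punct in results.items():
    print(repr(ch), is_punct)
```

The comma and exclamation mark are reported as punctuation, while the letter and the space are not, so spaces survive the filtering and can still be used for splitting.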

import string

str1 = 'We met at the wrong time, but separated at the right time. The most urgent is to take the most beautiful scenery!!! the deepest wound was the most real emotions.'
# Build a copy of the sentence that skips every ASCII punctuation character
str2 = ''
for i in str1:
    if i not in string.punctuation:
        str2 = str2 + i
print(str2)

# Split the cleaned sentence on single spaces
lst2 = str2.split(' ')
print(lst2)

Output:

We met at the wrong time but separated at the right time The most urgent is to take the most beautiful scenery the deepest wound was the most real emotions
['We', 'met', 'at', 'the', 'wrong', 'time', 'but', 'separated', 'at', 'the', 'right', 'time', 'The', 'most', 'urgent', 'is', 'to', 'take', 'the', 'most', 'beautiful', 'scenery', 'the', 'deepest', 'wound', 'was', 'the', 'most', 'real', 'emotions']
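The same cleanup can also be written without a character-by-character loop. The sketch below is one possible alternative, not the approach used in this post: it uses str.maketrans and str.translate to delete punctuation, and split() with no argument so that runs of whitespace do not produce empty strings. The names table, cleaned and words are illustrative.

```python
import string

text = 'We met at the wrong time, but separated at the right time.'

# The third argument of str.maketrans lists characters to delete.
table = str.maketrans('', '', string.punctuation)
cleaned = text.translate(table)

# split() with no argument splits on any run of whitespace.
words = cleaned.split()
print(words)
```

One caveat applies to both versions: deleting all punctuation also removes apostrophes and hyphens, so a word like "don't" would become "dont".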

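This gives us a list of words.
The next step is to count how many times each word occurs.
Idea: first build a deduplicated list of the words, then loop over it and count each word's occurrences in the original list.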

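import string

str1 = 'We met at the wrong time, but separated at the right time. The most urgent is to take the most beautiful scenery!!! the deepest wound was the most real emotions.'
# Build a copy of the sentence that skips every ASCII punctuation character
str2 = ''
for i in str1:
    if i not in string.punctuation:
        str2 = str2 + i
print(str2)

# Split the cleaned sentence on single spaces
lst2 = str2.split(' ')
print(lst2)

# Deduplicate the words (a set has no guaranteed order)
lst3 = list(set(lst2))
print(lst3)

# Prepare an empty list to store the count of each word
lst4 = []
for i in lst3:
    lst4.append(lst2.count(i))

print(lst4)

# Pair each word with its count to build the result
dict1 = dict(zip(lst3,lst4))
print(dict1)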

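Output (the order of the deduplicated list may differ between runs because sets are unordered; the count is case-sensitive, so 'The' and 'the' are counted separately):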

We met at the wrong time but separated at the right time The most urgent is to take the most beautiful scenery the deepest wound was the most real emotions
['We', 'met', 'at', 'the', 'wrong', 'time', 'but', 'separated', 'at', 'the', 'right', 'time', 'The', 'most', 'urgent', 'is', 'to', 'take', 'the', 'most', 'beautiful', 'scenery', 'the', 'deepest', 'wound', 'was', 'the', 'most', 'real', 'emotions']
['wrong', 'scenery', 'emotions', 'to', 'We', 'deepest', 'met', 'but', 'time', 'the', 'take', 'real', 'The', 'beautiful', 'most', 'separated', 'right', 'was', 'wound', 'at', 'is', 'urgent']
[1, 1, 1, 1, 1, 1, 1, 1, 2, 5, 1, 1, 1, 1, 3, 1, 1, 1, 1, 2, 1, 1]
{'wrong': 1, 'scenery': 1, 'emotions': 1, 'to': 1, 'We': 1, 'deepest': 1, 'met': 1, 'but': 1, 'time': 2, 'the': 5, 'take': 1, 'real': 1, 'The': 1, 'beautiful': 1, 'most': 3, 'separated': 1, 'right': 1, 'was': 1, 'wound': 1, 'at': 2, 'is': 1, 'urgent': 1}
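The standard library also offers collections.Counter, which does the counting in a single pass. The sketch below is an alternative to the set-and-count approach above, not a replacement the original post prescribes. It additionally lowercases the text, which is an assumption made here so that 'The' and 'the' are merged; drop the .lower() call to keep the case-sensitive behavior shown above. The names text, table, words and counts are illustrative.

```python
import string
from collections import Counter

text = ('We met at the wrong time, but separated at the right time. '
        'The most urgent is to take the most beautiful scenery!!! '
        'the deepest wound was the most real emotions.')

# Delete ASCII punctuation, lowercase, then split on whitespace.
table = str.maketrans('', '', string.punctuation)
words = text.translate(table).lower().split()

# Counter maps each word to its number of occurrences.
counts = Counter(words)
print(counts.most_common(2))  # [('the', 6), ('most', 3)]
```

Besides being shorter, this avoids calling list.count once per distinct word, which rescans the whole list each time.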

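Supplement:

print(string.ascii_lowercase)  # all lowercase ASCII letters

print(string.ascii_uppercase)  # all uppercase ASCII letters

print(string.hexdigits)        # all hexadecimal digit characters

print(string.punctuation)      # all ASCII punctuation characters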

Output:

abcdefghijklmnopqrstuvwxyz
ABCDEFGHIJKLMNOPQRSTUVWXYZ
0123456789abcdefABCDEF
!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~