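Requirement: count how many times each word occurs in a file or a string. Because sentences contain punctuation, splitting the string directly leaves punctuation attached to the words. For example: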
We met at the wrong time, but separated at the right time. The most urgent is to take the most beautiful scenery!!! the deepest wound was the most real emotions.
Splitting it directly on spaces gives:
['We', 'met', 'at', 'the', 'wrong', 'time,', 'but', 'separated', 'at', 'the', 'right', 'time.', 'The', 'most', 'urgent', 'is', 'to', 'take', 'the', 'most', 'beautiful', 'scenery!!!', 'the', 'deepest', 'wound', 'was', 'the', 'most', 'real', 'emotions.']
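As a quick check of the problem described above, here is a minimal sketch that splits a shortened version of the sentence and lists the tokens that still carry punctuation. The variable names (`sentence`, `tokens`, `stuck`) are illustrative only.

```python
# A shortened version of the example sentence.
sentence = "We met at the wrong time, but separated at the right time."

# Split on whitespace without removing punctuation first.
tokens = sentence.split()

# Keep only the tokens that are not purely alphabetic,
# i.e. the ones with punctuation still attached.
stuck = [t for t in tokens if not t.isalpha()]
print(stuck)  # ['time,', 'time.']
```

Counted as-is, 'time,' and 'time.' would be treated as two different words, which is why the punctuation has to go first.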
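Idea: could we first remove the punctuation (replace it with an empty string) and then split?
How do we remove punctuation? In other words, how do we tell whether a character is a punctuation mark?
The string module provides what we need: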
import string
print(dir(string))
Output:
['Formatter', 'Template', '_ChainMap', '_TemplateMetaclass', '__all__', '__builtins__', '__cached__', '__doc__', '__file__', '__loader__', '__name__', '__package__', '__spec__', '_re', '_string', 'ascii_letters', 'ascii_lowercase', 'ascii_uppercase', 'capwords', 'digits', 'hexdigits', 'octdigits', 'printable', 'punctuation', 'whitespace']
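Here, punctuation is a string constant holding the ASCII punctuation characters, so we can tell whether a character is punctuation by checking whether it is in that string.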
import string

str1 = 'We met at the wrong time, but separated at the right time. The most urgent is to take the most beautiful scenery!!! the deepest wound was the most real emotions.'
str2 = ''
# Copy every character that is not punctuation into str2
for i in str1:
    if i not in string.punctuation:
        str2 = str2 + i
print(str2)
# Split the cleaned string on spaces
lst2 = str2.split(' ')
print(lst2)
Output:
We met at the wrong time but separated at the right time The most urgent is to take the most beautiful scenery the deepest wound was the most real emotions
['We', 'met', 'at', 'the', 'wrong', 'time', 'but', 'separated', 'at', 'the', 'right', 'time', 'The', 'most', 'urgent', 'is', 'to', 'take', 'the', 'most', 'beautiful', 'scenery', 'the', 'deepest', 'wound', 'was', 'the', 'most', 'real', 'emotions']
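The same punctuation removal can also be written without an explicit loop. The following is a minimal sketch of an alternative, not the approach used in this post: it uses `str.maketrans` and `str.translate` from the standard library, on a shortened sentence. The names `text`, `table` and `words` are illustrative.

```python
import string

text = "We met at the wrong time, but separated at the right time."

# The third argument of str.maketrans lists characters to delete.
table = str.maketrans("", "", string.punctuation)

# translate() drops the punctuation; split() with no argument splits on
# any run of whitespace, so repeated spaces do not produce empty strings.
words = text.translate(table).split()
print(words)
```

One difference worth noting: `split(' ')` yields empty strings when two spaces appear in a row, while `split()` does not.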
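This gives us a list of words.
Next, how do we count the occurrences of each word?
Idea: first build a deduplicated list, then loop over it and count how many times each word occurs in the original list.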
import string

str1 = 'We met at the wrong time, but separated at the right time. The most urgent is to take the most beautiful scenery!!! the deepest wound was the most real emotions.'
str2 = ''
# Copy every character that is not punctuation into str2
for i in str1:
    if i not in string.punctuation:
        str2 = str2 + i
print(str2)
lst2 = str2.split(' ')
print(lst2)
# Deduplicate the words with a set
lst3 = list(set(lst2))
print(lst3)
# Prepare an empty list to store the count of each word
lst4 = []
for i in lst3:
    lst4.append(lst2.count(i))
print(lst4)
# Combine the words and their counts into the result
dict1 = dict(zip(lst3, lst4))
print(dict1)
Output (the order of the deduplicated list may differ between runs, because sets are unordered):
We met at the wrong time but separated at the right time The most urgent is to take the most beautiful scenery the deepest wound was the most real emotions
['We', 'met', 'at', 'the', 'wrong', 'time', 'but', 'separated', 'at', 'the', 'right', 'time', 'The', 'most', 'urgent', 'is', 'to', 'take', 'the', 'most', 'beautiful', 'scenery', 'the', 'deepest', 'wound', 'was', 'the', 'most', 'real', 'emotions']
['wrong', 'scenery', 'emotions', 'to', 'We', 'deepest', 'met', 'but', 'time', 'the', 'take', 'real', 'The', 'beautiful', 'most', 'separated', 'right', 'was', 'wound', 'at', 'is', 'urgent']
[1, 1, 1, 1, 1, 1, 1, 1, 2, 5, 1, 1, 1, 1, 3, 1, 1, 1, 1, 2, 1, 1]
{'wrong': 1, 'scenery': 1, 'emotions': 1, 'to': 1, 'We': 1, 'deepest': 1, 'met': 1, 'but': 1, 'time': 2, 'the': 5, 'take': 1, 'real': 1, 'The': 1, 'beautiful': 1, 'most': 3, 'separated': 1, 'right': 1, 'was': 1, 'wound': 1, 'at': 2, 'is': 1, 'urgent': 1}
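Note that the result above treats 'The' and 'the' as different words, and that calling `count` once per unique word rescans the whole list each time. As a hedged alternative (not the method used in this post), the counting can be sketched with `collections.Counter`, which tallies in a single pass. The function `count_words` and its `ignore_case` parameter are hypothetical names introduced only for this illustration.

```python
import string
from collections import Counter

def count_words(text, ignore_case=False):
    """Return a Counter mapping each word in text to its number of occurrences."""
    # Delete ASCII punctuation, optionally lower-case, then split on whitespace.
    cleaned = text.translate(str.maketrans("", "", string.punctuation))
    if ignore_case:
        cleaned = cleaned.lower()
    return Counter(cleaned.split())

str1 = ('We met at the wrong time, but separated at the right time. '
        'The most urgent is to take the most beautiful scenery!!! '
        'the deepest wound was the most real emotions.')

counts = count_words(str1)
print(counts.most_common(2))  # [('the', 5), ('most', 3)]

# With case folding, 'The' is merged into 'the'.
merged = count_words(str1, ignore_case=True)
print(merged['the'])  # 6
```

For a file, the same function could be given the text read from that file instead of a string literal.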
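Supplement:
import string

print(string.ascii_lowercase)  # all lowercase letters
print(string.ascii_uppercase)  # all uppercase letters
print(string.hexdigits)  # all hexadecimal digit characters
print(string.punctuation)  # all ASCII punctuation characters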
Output:
abcdefghijklmnopqrstuvwxyz
ABCDEFGHIJKLMNOPQRSTUVWXYZ
0123456789abcdefABCDEF
!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
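As the last line of the output shows, string.punctuation contains only ASCII punctuation. The short sketch below illustrates one consequence: full-width punctuation, such as the comma used in Chinese text, is not in the constant and would not be removed by the membership test used earlier. The variable names are illustrative.

```python
import string

ascii_comma = ","
fullwidth_comma = "\uff0c"  # the full-width comma used in Chinese text

print(ascii_comma in string.punctuation)      # True
print(fullwidth_comma in string.punctuation)  # False
print(len(string.punctuation))                # 32
```

Text that mixes in non-ASCII punctuation would therefore need those characters added to the set being filtered.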