<!DOCTYPE html>
<html lang="cn">
<head>
<meta charset="utf-8" />
<title>Python Natural Language Toolkit (NLTK) Quick Start Tutorial 1: Introduction</title>
<link rel="stylesheet" href="/theme/css/main.css" />
</head>
<body id="index" class="home">
<header id="banner" class="body">
<h1><a href="/">Python Automated Testing and Artificial Intelligence</a></h1>
<nav><ul>
<li><a href="/category/ba-zi.html">BaZi</a></li>
<li><a href="/category/ce-shi.html">Testing</a></li>
<li><a href="/category/ce-shi-kuang-jia.html">Test Frameworks</a></li>
<li><a href="/category/common.html">common</a></li>
<li><a href="/category/da-shu-ju.html">Big Data</a></li>
<li><a href="/category/feng-shui.html">Feng Shui</a></li>
<li><a href="/category/ji-qi-xue-xi.html">Machine Learning</a></li>
<li><a href="/category/jie-meng.html">Dream Interpretation</a></li>
<li><a href="/category/linux.html">linux</a></li>
<li class="active"><a href="/category/python.html">python</a></li>
<li><a href="/category/shu-ji.html">Books</a></li>
<li><a href="/category/shu-ju-fen-xi.html">Data Analysis</a></li>
<li><a href="/category/zhong-cao-yao.html">Chinese Herbal Medicine</a></li>
<li><a href="/category/zhong-yi.html">Traditional Chinese Medicine</a></li>
</ul></nav>
</header><!-- /#banner -->
<section id="content" class="body">
<article>
<header>
<h1 class="entry-title">
<a href="/nltk1.html" rel="bookmark"
title="Permalink to Python Natural Language Toolkit (NLTK) Quick Start Tutorial 1: Introduction">Python Natural Language Toolkit (NLTK) Quick Start Tutorial 1: Introduction</a></h1>
</header>
<div class="entry-content">
<footer class="post-info">
<abbr class="published" title="2018-12-19T18:25:00+08:00">
Published: Wed 19 December 2018
</abbr>
<address class="vcard author">
By <a class="url fn" href="/author/andrew.html">andrew</a>
</address>
<p>In <a href="/category/python.html">python</a>.</p>
</footer><!-- /.post-info --> <ul>
<li><a href="https://china-testing.github.io/practices.html">Python test development project practice: table of contents</a></li>
<li><a href="https://china-testing.github.io/python_books.html">Python reference book downloads (continuously updated)</a></li>
<li><a href="https://china-testing.github.io/python3_quick.html">Python 3.7 quick-start tutorial: table of contents</a></li>
</ul>
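<h3 id="_1">What is natural language processing?</h3>
<p>Natural language processing (NLP) means using software or machines to understand and manipulate text or speech. People interact, understand each other's points of view, and respond with appropriate answers. In NLP, that interaction, understanding, and response is carried out by a computer instead of a human.</p>
<h3 id="nltk">What is NLTK?</h3>
<p>NLTK stands for Natural Language Toolkit. It is a package that helps computers understand human language and reply with an appropriate response. This tutorial series covers tokenization, stemming, lemmatization, punctuation handling, character counts, word counts, and more.</p>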
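<h3 id="_2">Overview of natural language libraries</h3>
<ul>
<li>NLTK: one of the most useful NLP libraries, and the forerunner of the Python NLP libraries listed here.</li>
<li>spaCy: a performance-optimized, highly accurate library that is widely used together with deep learning.</li>
<li>Stanford CoreNLP Python: built on a client-server architecture. CoreNLP itself is written in Java, but it provides a Python API.</li>
<li>TextBlob: processes text data and exposes its operations mainly through a simple API.</li>
<li>Gensim: robust, very efficient, and scalable.</li>
<li>Pattern: a lightweight NLP module, typically used for web mining and crawling.</li>
<li>Polyglot: handles multilingual applications with ease, with feature extraction based on language identification and entity recognition.</li>
<li>PyNLPl: also known as Pineapple. It provides parsers for many data formats, such as FoLiA/Giza/Moses/ARPA/Timbl/CQL.</li>
<li>Vocabulary: retrieves semantic information about a given text.</li>
</ul>
<p>There are also Chinese-language libraries such as jieba, SnowNLP, and thulac; see: https://github.com/china-testing/python-api-tesing</p>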
<h3 id="nltk_1">Installing NLTK</h3>
<div class="highlight"><pre><span></span><span class="n">pip3</span> <span class="n">install</span> <span class="n">nltk</span>
</pre></div>
<p>Download the datasets</p>
<div class="highlight"><pre><span></span><span class="kn">import</span> <span class="nn">nltk</span>
<span class="n">nltk</span><span class="o">.</span><span class="n">download</span><span class="p">()</span>
</pre></div>
<p><img alt="image.png" src="https://upload-images.jianshu.io/upload_images/12713060-d18c95e52ac197c7.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240"></p>
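<p>Calling <code>nltk.download()</code> with no arguments opens the interactive downloader shown above. On a server or in a script it is often more convenient to fetch a single package by name. The sketch below is one way to do that, not the only one: <code>ensure_nltk_resource</code> is a hypothetical helper name introduced here for illustration, and it assumes the Brown corpus is registered under the package id <code>brown</code> and the path <code>corpora/brown</code>.</p>

```python
import nltk


def ensure_nltk_resource(resource_path, package_id):
    """Return True if the resource is available, downloading it if needed.

    ensure_nltk_resource is a hypothetical helper for this tutorial,
    not part of NLTK itself.
    """
    try:
        # nltk.data.find raises LookupError when the resource is missing.
        nltk.data.find(resource_path)
        return True
    except LookupError:
        # Needs network access; quiet=True suppresses the progress output.
        return bool(nltk.download(package_id, quiet=True))
```

<p>For the corpus used in the next step, the call would be <code>ensure_nltk_resource("corpora/brown", "brown")</code>.</p>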
<p>Verify the datasets</p>
<div class="highlight"><pre><span></span><span class="o">>>></span> <span class="kn">from</span> <span class="nn">nltk.corpus</span> <span class="kn">import</span> <span class="n">brown</span>
<span class="o">>>></span> <span class="n">brown</span><span class="o">.</span><span class="n">words</span><span class="p">()</span>
<span class="p">[</span><span class="s1">'The'</span><span class="p">,</span> <span class="s1">'Fulton'</span><span class="p">,</span> <span class="s1">'County'</span><span class="p">,</span> <span class="s1">'Grand'</span><span class="p">,</span> <span class="s1">'Jury'</span><span class="p">,</span> <span class="s1">'said'</span><span class="p">,</span> <span class="o">...</span><span class="p">]</span>
</pre></div>
<p>Tokenization quick start</p>
<div class="highlight"><pre><span></span><span class="o">>>></span> <span class="kn">from</span> <span class="nn">nltk.tokenize</span> <span class="kn">import</span> <span class="n">RegexpTokenizer</span>
<span class="o">>>></span> <span class="n">tokenizer</span> <span class="o">=</span> <span class="n">RegexpTokenizer</span><span class="p">(</span><span class="sa">r</span><span class="s1">'\w+'</span><span class="p">)</span>
<span class="o">>>></span> <span class="n">filtered_text</span> <span class="o">=</span> <span class="n">tokenizer</span><span class="o">.</span><span class="n">tokenize</span><span class="p">(</span><span class="s1">'Hello https://china-testing.github.io/, You have build a very good site and I love visiting your site.'</span><span class="p">)</span>
<span class="o">>>></span> <span class="k">print</span><span class="p">(</span><span class="n">filtered_text</span><span class="p">)</span>
<span class="p">[</span><span class="s1">'Hello'</span><span class="p">,</span> <span class="s1">'https'</span><span class="p">,</span> <span class="s1">'china'</span><span class="p">,</span> <span class="s1">'testing'</span><span class="p">,</span> <span class="s1">'github'</span><span class="p">,</span> <span class="s1">'io'</span><span class="p">,</span> <span class="s1">'You'</span><span class="p">,</span> <span class="s1">'have'</span><span class="p">,</span> <span class="s1">'build'</span><span class="p">,</span> <span class="s1">'a'</span><span class="p">,</span> <span class="s1">'very'</span><span class="p">,</span> <span class="s1">'good'</span><span class="p">,</span> <span class="s1">'site'</span><span class="p">,</span> <span class="s1">'and'</span><span class="p">,</span> <span class="s1">'I'</span><span class="p">,</span> <span class="s1">'love'</span><span class="p">,</span> <span class="s1">'visiting'</span><span class="p">,</span> <span class="s1">'your'</span><span class="p">,</span> <span class="s1">'site'</span><span class="p">]</span>
</pre></div>
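<p>RegexpTokenizer keeps only the text that matches the regular expression you give it and discards everything else. With the pattern <code>\w+</code> it keeps runs of letters, digits, and underscores and drops punctuation and other symbols, which is why the URL above is split into separate pieces.</p>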
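<p>To see why the URL in the example is broken apart, it can help to reproduce the matching step with only the standard library. The sketch below assumes that a pattern-matching tokenizer behaves like <code>re.findall</code> with the same pattern; <code>regex_tokens</code> and <code>whitespace_tokens</code> are hypothetical helper names introduced for illustration and are not part of NLTK.</p>

```python
import re

# Sample sentence copied from the tokenization example above.
TEXT = ("Hello https://china-testing.github.io/, You have build a very good "
        "site and I love visiting your site.")


def regex_tokens(text, pattern=r"\w+"):
    """Return every non-overlapping match of pattern, in order."""
    return re.findall(pattern, text)


def whitespace_tokens(text):
    """Split on runs of whitespace, so URLs and punctuation stay attached."""
    return text.split()


word_tokens = regex_tokens(TEXT)
space_tokens = whitespace_tokens(TEXT)
print(word_tokens[:6])   # ['Hello', 'https', 'china', 'testing', 'github', 'io']
print(space_tokens[:2])  # ['Hello', 'https://china-testing.github.io/,']
```

<p>The <code>\w+</code> pattern is convenient for word counts, while whitespace splitting keeps URLs in one piece at the cost of leaving punctuation attached to the neighbouring word.</p>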
<h3 id="_3">References</h3>
<ul>
<li><a href="https://china-testing.github.io/practices.html">Latest version of this article</a></li>
<li><a href="https://github.com/china-testing/python-api-tesing">Python test development library covered in this article</a> Thanks for starring!</li>
<li><a href="https://github.com/china-testing/python-api-tesing/blob/master/books.md">Large collection of book downloads related to this article</a> </li>
<li><a href="https://china-testing.github.io/python3_quick9.html" title="Permalink to Python 3.7 quick-start tutorial 9: downloads of the best Chinese-language Python reference books">Python 3.7 quick-start tutorial 9: downloads of the best Chinese-language Python reference books</a></li>
<li><a href="https://github.com/china-testing/python-api-tesing/blob/master/practices/TTS.py">Latest code</a></li>
<li><a href="https://china-testing.github.io/testing_training.html">Syllabus for online training in API automation and performance testing</a></li>
</ul>
</div><!-- /.entry-content -->
</article>
</section>
<section id="extras" class="body">
<div class="blogroll">
<h2>links</h2>
<ul>
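<li><a href="https://china-testing.github.io/testing_training.html">Automation, performance, and API testing: online and Shenzhen training with project practice. QQ groups: 144081101 591302926</a></li>
<li><a href="http://blog.sciencenet.cn/blog-2604609-1112306.html">pandas data analysis and scrapy crawling: 521070358; Python AI with pandas and OpenCV: 6089740</a></li>
<li><a href="http://blog.sciencenet.cn/blog-2604609-1112306.html">Traditional Chinese medicine, dream interpretation, face reading, and BaZi fortune-telling QQ group: 391441566; CSDN book downloads and Python crawling: 437355848</a></li>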
</ul>
</div><!-- /.blogroll -->
</section><!-- /#extras -->
<footer id="contentinfo" class="body">
<address id="about" class="vcard body">
Proudly powered by <a href="http://getpelican.com/">Pelican</a>, which takes great advantage of <a href="http://python.org">Python</a>.
</address><!-- /#about -->
<p>The theme is by <a href="http://coding.smashingmagazine.com/2009/08/04/designing-a-html-5-layout-from-scratch/">Smashing Magazine</a>, thanks!</p>
</footer><!-- /#contentinfo -->
</body>
</html>