Commit 32824f8

committed
starting to port things over to twarc2
1 parent 729b110 commit 32824f8

12 files changed

+243
-23
lines changed

.gitignore

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -10,3 +10,4 @@ twarc.egg-info
1010
.pytest_cache
1111
.vscode
1212
.env
13+
site

docs/README.md

Lines changed: 6 additions & 4 deletions
@@ -1,8 +1,10 @@
1-
# twarc Project
1+
# twarc
22

3-
twarc is a command line tool and Python library for archiving Twitter JSON data. It works with the older v1.1 API and the newer v2 API and Academic Access.
3+
twarc is a command line tool and Python library for archiving Twitter JSON
4+
data. It has separate commands (twarc and twarc2) for working with the older
5+
v1.1 API and the newer v2 API and Academic Access (respectively). It also has an ecosystem of [plugins](plugins) for doing things with the collected data.
46

5-
See `twarc` documentation for running commands using the v1.1 API: [twarc](twarc_en_us.md) and [twarc2](twarc2_en_us.md) for using the v2 API.
7+
See the documentation for running commands: [twarc2](twarc2_en_us.md) for the v2 API and [twarc1](twarc1_en_us.md) for the v1.1 API. If you aren't sure which one to use, start with twarc2, since the v1.1 API is scheduled to be retired.
68

79
## Install
810

@@ -12,4 +14,4 @@ If you have python installed, you can install twarc using:
1214
pip3 install twarc
1315
```
1416

15-
Once installed, you should be able to use the twarc command line, or use it as a Python library.
17+
Once installed, you should be able to use the twarc and twarc2 command line utilities, or use twarc as a Python library.

docs/plugins.md

Lines changed: 5 additions & 4 deletions
@@ -13,12 +13,13 @@ published separately from twarc on [PyPI] and are installed with [pip]. Here is
1313
a list of some known plugins (if you write one please [let us know] so we can
1414
add it to this list):
1515

16-
* [twarc-ids](https://pypi.org/project/twarc-ids/): extract tweet ids from tweets
17-
* [twarc-videos](https://pypi.org/project/twarc-videos): extract videos from tweets
18-
* [twarc-csv](https://pypi.org/project/twarc-csv/): export tweets to CSV
19-
* [twarc-network](https://pypi.org/project/twarc-network): visualize your tweets as a network graph
16+
* [twarc-ids](https://pypi.org/project/twarc-ids/): a simple example of printing the ids for tweets to use as a reference for creating plugins
17+
* [twarc-csv](https://pypi.org/project/twarc-csv/): export tweets to CSV, which is probably the first thing a researcher will want to do
18+
* [twarc-videos](https://pypi.org/project/twarc-videos): extract videos from tweets
19+
* [twarc-network](https://pypi.org/project/twarc-network): visualize tweets and users as a network graph
2020
* [twarc-timeline-archive](https://pypi.org/project/twarc-timeline-archive): routinely download tweet timelines for a list of users
2121
* [twarc-hashtags](https://pypi.org/project/twarc-hashtags): create a report of hashtags that are used in collected tweet data
22+
* Write your own, and [let us know] so we can add it here!
2223

2324
## Writing a Plugin
2425

docs/twarc_en_us.md renamed to docs/twarc1_en_us.md

Lines changed: 1 addition & 1 deletion
@@ -1,4 +1,4 @@
1-
twarc
1+
twarc1
22
=====
33

44
***For information about working with the Twitter V2 API please see the [twarc2](https://twarc-project.readthedocs.io/en/latest/twarc2/) page.***

docs/twarc_es_mx.md renamed to docs/twarc1_es_mx.md

Lines changed: 1 addition & 1 deletion
@@ -1,4 +1,4 @@
1-
# twarc
1+
# twarc1
22

33
twarc es una recurso de línea de commando y catálogo de Python para archivar JSON dato de Twitter. Cada tweet se representa como
44
un artículo de JSON que es [exactamente](https://developer.twitter.com/en/docs/tweets/data-dictionary/overview/tweet-object) lo que fue capturado del API de Twitter. Los Tweets se archivan como [JSON de línea orientado](https://en.wikipedia.org/wiki/JSON_streaming#Line-delimited_JSON). twarc se encarga del [límite de tarifa](https://developer.twitter.com/en/docs/basics/rate-limiting) del API de Twitter. twarc también puede facilitar la colección de usuarios, tendencias y detallar las identificaciones de los tweets.

docs/twarc_ja_jp.md renamed to docs/twarc1_ja_jp.md

Lines changed: 1 addition & 1 deletion
@@ -1,4 +1,4 @@
1-
twarc
1+
twarc1
22
=====
33

44
twarcは、TwitterのJSONデータをアーカイブするためのコマンドラインツールおよびPythonライブラリーのプログラムです。

docs/twarc_pt_br.md renamed to docs/twarc1_pt_br.md

Lines changed: 1 addition & 1 deletion
@@ -1,4 +1,4 @@
1-
twarc
1+
twarc1
22
=====
33

44
twarc é uma ferramenta de linha de comando e usa a biblioteca Python para arquivamento de dados do Twitter com JSON.

docs/twarc_sv_se.md renamed to docs/twarc1_sv_se.md

Lines changed: 1 addition & 1 deletion
@@ -1,4 +1,4 @@
1-
twarc
1+
twarc1
22
=====
33

44
twarc är ett kommandoradsverktyg twarc och ett Pythonbibliotek för arkivering av Twitter JSON data.

docs/twarc_sw_ke.md renamed to docs/twarc1_sw_ke.md

Lines changed: 2 additions & 1 deletion
@@ -1,4 +1,5 @@
1-
twarc
1+
twarc1
2+
23
=====
34

45
twarc ni chombo ya command-line na Python Library ya kuhifadhi Twitter JSON

docs/twarc_zw_zh.md renamed to docs/twarc1_zw_zh.md

Lines changed: 1 addition & 1 deletion
@@ -1,4 +1,4 @@
1-
twarc
1+
twarc1
22
=====
33

44
twarc 是一个用来处理并存档推特 JSON 数据的命令行工具和 Python 包。

docs/twarc2_en_us.md

Lines changed: 215 additions & 0 deletions
@@ -1,3 +1,218 @@
1+
2+
# twarc2
3+
4+
twarc2 is a command line tool and Python library for archiving Twitter JSON
5+
data. Each tweet is represented as a JSON object that was returned from the
6+
Twitter API. Since Twitter's introduction of their [v2
7+
API](https://developer.twitter.com/en/docs/twitter-api/api-reference-index#v2)
8+
the JSON representation of a tweet is conditional on the types of fields and
9+
expansions that are requested. twarc2 does the work of requesting the highest
10+
fidelity representation of a tweet by requesting all the available data for
11+
tweets.
12+
13+
Tweets are streamed or stored as [line-oriented
14+
JSON](https://en.wikipedia.org/wiki/JSON_Streaming#Line-delimited_JSON). twarc2
15+
will handle Twitter API's [rate
16+
limits](https://dev.twitter.com/rest/public/rate-limiting) for you. In addition
17+
to letting you collect tweets twarc can also help you collect users and hydrate
18+
tweet ids. It also has a collection of [plugins](plugins) you can use to do
19+
things with the collected JSON data (such as converting it to CSV).
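
As a rough illustration of what line-oriented JSON means in practice, the sketch below parses one JSON object per line using only the Python standard library. The helper name `read_jsonl` and the two sample records are made up for this example; they are not part of twarc.

```python
import io
import json


def read_jsonl(handle):
    """Yield one parsed JSON object for each non-empty line in handle."""
    for line in handle:
        line = line.strip()
        if line:
            yield json.loads(line)


# Two invented records standing in for a collected file such as tweets.jsonl.
sample = io.StringIO(
    '{"id": "1", "text": "first"}\n'
    '{"id": "2", "text": "second"}\n'
)
records = list(read_jsonl(sample))
print(len(records), "records read")
```

Because every line is a complete JSON document, a file can be processed as a stream without loading it all into memory.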
20+
21+
twarc2 was developed as part of the [Documenting the Now](http://www.docnow.io)
22+
project which was funded by the [Mellon Foundation](https://mellon.org/).
23+
24+
## Install
25+
26+
Before using twarc you will need to register an application at
27+
[apps.twitter.com](http://apps.twitter.com). Once you've created your
28+
application, note down the consumer key, consumer secret and then click to
29+
generate an access token and access token secret. With these four variables
30+
in hand you are ready to start using twarc.
31+
32+
1. install [Python 3](http://python.org/download)
33+
2. [pip](https://pip.pypa.io/en/stable/installing/) install twarc:
34+
35+
```
36+
pip install --upgrade twarc
37+
```
38+
39+
### Homebrew (macOS only)
40+
41+
For macOS users, you can also install `twarc` via [Homebrew](https://brew.sh/):
42+
43+
```bash
44+
$ brew install twarc
45+
```
46+
47+
### Windows
48+
49+
If you installed with pip and see a "failed to create process" when running twarc try reinstalling like this:
50+
51+
python -m pip install --upgrade --force-reinstall twarc
52+
53+
## Quickstart:
54+
55+
First you're going to need to tell twarc about your application API keys and
56+
grant access to one or more Twitter accounts:
57+
58+
twarc2 configure
59+
60+
Then try out a search:
61+
62+
twarc2 search blacklivesmatter search.jsonl
63+
64+
Or maybe you'd like to collect tweets as they happen?
65+
66+
twarc2 filter blacklivesmatter stream.jsonl
67+
68+
See below for the details about these commands and more.
69+
70+
## Configure
71+
72+
Once you've got your Twitter developer access set up you can tell twarc about your credentials with the `configure` command.
73+
74+
twarc2 configure
75+
76+
This will store your credentials in your home directory so you don't have to
77+
keep entering them in. You can use most of twarc's functionality by simply
78+
configuring the *bearer token*, but if you want it to be complete you can enter
79+
in the *API key* and *API secret*.
80+
81+
You can also set the keys in the system environment (`CONSUMER_KEY`,
82+
`CONSUMER_SECRET`, `ACCESS_TOKEN`, `ACCESS_TOKEN_SECRET`) or using command line
83+
options (`--consumer-key`, `--consumer-secret`, `--access-token`,
84+
`--access-token-secret`).
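
If you set the keys in the environment, a small check like the one below can confirm they are all present before a long collection run. This is only a sketch: the variable names are the ones listed above, while the helper `missing_credentials` is hypothetical and not something twarc provides.

```python
import os

# Environment variable names as listed in the text above.
CREDENTIAL_VARS = (
    "CONSUMER_KEY",
    "CONSUMER_SECRET",
    "ACCESS_TOKEN",
    "ACCESS_TOKEN_SECRET",
)


def missing_credentials(environ=os.environ):
    """Return the credential variable names that are unset or empty."""
    return [name for name in CREDENTIAL_VARS if not environ.get(name)]


missing = missing_credentials()
if missing:
    print("Not set:", ", ".join(missing))
else:
    print("All four credential variables are set.")
```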
85+
86+
## Search
87+
88+
This uses Twitter's [tweets/search/recent](https://developer.twitter.com/en/docs/twitter-api/tweets/search/api-reference/get-tweets-search-recent) and [tweets/search/all](https://developer.twitter.com/en/docs/twitter-api/tweets/search/api-reference/get-tweets-search-all) endpoints to download *pre-existing* tweets matching a given query. This command will search for any tweets mentioning *blacklivesmatter* from the last 7 days.
89+
90+
twarc2 search blacklivesmatter tweets.jsonl
91+
92+
If you have access to the [Academic Research Product Track](https://developer.twitter.com/en/products/twitter-api/academic-research) you can search the full archive of tweets by using the `--archive` option.
93+
94+
twarc2 search --archive blacklivesmatter tweets.jsonl
95+
96+
The queries can be a lot more expressive than matching a single term. For
97+
example this query will search for tweets containing either `blacklivesmatter`
98+
or `blm` that were sent to the user \@deray.
99+
100+
twarc2 search 'blacklivesmatter OR blm to:deray' tweets.jsonl
101+
102+
The best way to get familiar with Twitter's search syntax is to consult Twitter's [Building queries for Search Tweets](https://developer.twitter.com/en/docs/twitter-api/tweets/search/integrate/build-a-query) documentation.
103+
104+
You also should definitely check out Igor Brigadir's *excellent* reference guide
105+
to the Twitter Search syntax:
106+
[Advanced Search on Twitter](https://github.com/igorbrigadir/twitter-advanced-search/blob/master/README.md).
107+
There are lots of hidden gems in there that the advanced search form doesn't
108+
make readily apparent.
109+
110+
### Limit
111+
112+
Because there is a 500,000 tweet limit (5 million for Academic Research Track)
113+
you may want to limit the number of tweets you retrieve by using `--limit`:
114+
115+
twarc2 search --limit 5000 blacklivesmatter tweets.jsonl
116+
117+
### Time
118+
119+
You can also limit to a particular time range using `--start-time` and/or
120+
`--end-time`, which can be especially useful in conjunction with `--archive`
121+
when you are searching for historical tweets.
122+
123+
twarc2 search --start-time 2014-07-17 --end-time 2014-07-24 '"eric garner"' tweets.jsonl
124+
125+
If you leave off --start-time or --end-time it will be open on that side. So
126+
for example to get all "eric garner" tweets before 2014-07-24 you would just
127+
leave off the `--start-time`:
128+
129+
twarc2 search --end-time 2014-07-24 '"eric garner"' tweets.jsonl
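
When searches are scripted, it can help to assemble the command line in one place. The sketch below builds an argument list using only the options shown above (`--archive`, `--limit`, `--start-time`, `--end-time`). The function `build_search_args` is a hypothetical helper, not part of twarc, and it only builds the list; it does not run anything.

```python
def build_search_args(query, outfile, limit=None, start_time=None,
                      end_time=None, archive=False):
    """Build an argv list for a `twarc2 search` call from optional settings."""
    args = ["twarc2", "search"]
    if archive:
        args.append("--archive")
    if limit is not None:
        args += ["--limit", str(limit)]
    if start_time:
        args += ["--start-time", start_time]
    if end_time:
        args += ["--end-time", end_time]
    # Options go first, then the query and the output file, as in the examples.
    return args + [query, outfile]


# Mirrors the open-ended example: everything before 2014-07-24.
print(build_search_args('"eric garner"', "tweets.jsonl", end_time="2014-07-24"))
```

The resulting list could be passed to `subprocess.run` from the standard library, which avoids shell quoting problems with queries that contain quotes or spaces.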
130+
131+
## Stream
132+
133+
The `stream` command will use Twitter's API
134+
[tweets/search/stream](https://developer.twitter.com/en/docs/twitter-api/tweets/filtered-stream/api-reference/get-tweets-search-stream)
135+
endpoint to collect tweets as they happen. In order to use it you first need to
136+
create one or more [rules]. For example:
137+
138+
twarc2 stream-rules add blacklivesmatter
139+
140+
You can list your active stream rules:
141+
142+
twarc2 stream-rules list
143+
144+
And you can collect the data from the stream, which will bring down any tweets that match your rules:
145+
146+
twarc2 stream stream.jsonl
147+
148+
When you want to stop, press `ctrl-c`. This stops the stream but doesn't delete your stream rule. To remove a rule, run:
149+
150+
twarc2 stream-rules delete blacklivesmatter
151+
152+
### Sample
153+
154+
Use the `sample` command to listen to Twitter's [tweets/sample/stream](https://developer.twitter.com/en/docs/twitter-api/tweets/sampled-stream/api-reference/get-tweets-sample-stream) API for a "random" sample of recent public statuses.
155+
156+
twarc2 sample sample.jsonl
157+
158+
### Users
159+
160+
If you have a file of user ids you can fetch the user metadata for them with
161+
the `users` command:
162+
163+
twarc2 users users.txt users.jsonl
164+
165+
If the file contains usernames instead of user ids you can use the `--usernames` option:
166+
167+
twarc2 users --usernames users.txt users.jsonl
168+
169+
### Followers
170+
171+
You can fetch the followers of an account using the `followers` command:
172+
173+
twarc2 followers deray users.jsonl
174+
175+
### Following
176+
177+
To get the users that a user is following you can use `following`:
178+
179+
twarc2 following deray users.jsonl
180+
181+
The result will include exactly one user id per line. The response order is
182+
reverse chronological, or most recent followers first.
183+
184+
### Timeline
185+
186+
The `timeline` command will use Twitter's [user timeline API](https://developer.twitter.com/en/docs/twitter-api/tweets/timelines/api-reference/get-users-id-tweets) to collect the most recent tweets posted by the user indicated by screen_name.
187+
188+
twarc2 timeline deray tweets.jsonl
189+
190+
### Conversation
191+
192+
You can retrieve a conversation thread using the tweet ID at the head of the
193+
conversation:
194+
195+
twarc2 conversation 266031293945503744 > conversation.jsonl
196+
197+
## Dehydrate
198+
199+
The `dehydrate` command generates an id list from a file of tweets:
200+
201+
twarc2 dehydrate tweets.jsonl tweet-ids.txt
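
To make the idea concrete, here is a minimal sketch of pulling ids out of line-oriented JSON. It is not how `dehydrate` is implemented. It assumes each line is either a single tweet object with an `id` field or an API response whose `data` field holds one tweet or a list of tweets; the helper name `tweet_ids` and the sample data are invented.

```python
import io
import json


def tweet_ids(handle):
    """Yield tweet ids found on each line of a line-oriented JSON file."""
    for line in handle:
        line = line.strip()
        if not line:
            continue
        obj = json.loads(line)
        # Assumption: tweets sit under "data" in a response, else the line is a tweet.
        data = obj.get("data", obj)
        tweets = data if isinstance(data, list) else [data]
        for tweet in tweets:
            if "id" in tweet:
                yield tweet["id"]


# Invented sample: one response-style line and one flat tweet line.
sample = io.StringIO('{"data": [{"id": "1"}, {"id": "2"}]}\n{"id": "3"}\n')
for tweet_id in tweet_ids(sample):
    print(tweet_id)
```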
202+
203+
## Hydrate
204+
205+
twarc's `hydrate` command will read a file of tweet identifiers and write out the tweet JSON for them using Twitter's [tweets](https://developer.twitter.com/en/docs/twitter-api/tweets/lookup/api-reference/get-tweets)
206+
API endpoint:
207+
208+
twarc2 hydrate ids.txt tweets.jsonl
209+
210+
Twitter API's [Terms of Service](https://dev.twitter.com/overview/terms/policy#6._Be_a_Good_Partner_to_Twitter) discourage people from making large amounts of raw Twitter data available on the Web. The data can be used for research and archived for local use, but not shared with the world. Twitter does allow files of tweet identifiers to be shared, which can be useful when you would like to make a dataset of tweets available. You can then use Twitter's API to *hydrate* the data, or to retrieve the full JSON for each identifier. This is particularly important for [verification](https://en.wikipedia.org/wiki/Reproducibility) of social media research.
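
Hydration is done in batches, since the v2 tweets lookup endpoint accepts at most 100 ids per request. The sketch below shows only the batching step, as an illustration of the idea rather than twarc's own code; `batches` is a hypothetical helper, and no network request is made.

```python
def batches(ids, size=100):
    """Group an iterable of ids into lists of at most `size` items."""
    batch = []
    for tweet_id in ids:
        batch.append(tweet_id)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        # Emit the final, possibly shorter, batch.
        yield batch


# 250 placeholder ids split into request-sized groups.
placeholder_ids = [str(n) for n in range(250)]
print([len(group) for group in batches(placeholder_ids)])
```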
211+
212+
# Command Line Usage
213+
214+
Below is what you see when you run `twarc2 --help`.
215+
1216
::: mkdocs-click:
2217
:module: twarc.command2
3218
:command: twarc2

mkdocs.yml

Lines changed: 8 additions & 8 deletions
@@ -14,16 +14,16 @@ theme:
1414

1515
nav:
1616
- Home: README.md
17-
- twarc:
18-
- twarc (en): twarc_en_us.md
19-
- twarc (es): twarc_es_mx.md
20-
- twarc (ja): twarc_ja_jp.md
21-
- twarc (pt): twarc_pt_br.md
22-
- twarc (sv): twarc_sv_se.md
23-
- twarc (sw): twarc_sw_ke.md
24-
- twarc (zw): twarc_zw_zh.md
2517
- twarc2:
2618
- twarc2 (en): twarc2_en_us.md
19+
- twarc1:
20+
- twarc1 (en): twarc1_en_us.md
21+
- twarc1 (es): twarc1_es_mx.md
22+
- twarc1 (ja): twarc1_ja_jp.md
23+
- twarc1 (pt): twarc1_pt_br.md
24+
- twarc1 (sv): twarc1_sv_se.md
25+
- twarc1 (sw): twarc1_sw_ke.md
26+
- twarc1 (zw): twarc1_zw_zh.md
2727
- Plugins: plugins.md
2828
- Tutorials: tutorials.md
2929
- Twitter Developer Access: twitter-developer-access.md
