Quick start¶
Setup¶
Install the requirements.
Configure Terraform.
$ cd terraform $ cp example.terraform.tfvars terraform.tfvars
and then update
terraform.tfvars
with your root or administrator AWS access keys and also select a new name for your S3 bucket.Set up your AWS environment using Terraform.
$ terraform init $ terraform apply
Terraform will output some values that are needed in the next step.
Note that the root or administrator AWS access keys are no longer required by Twarc-Cloud so you can remove them.
Configure Twarc-Cloud.
$ cd .. $ cp example.twarc_cloud.ini twarc_cloud.ini
and then update
twarc_cloud.ini
with the values output by Terraform from the previous step. You can also optionally provide a Honeybadger API key.Acquire a Twitter API keys using Twarc.
$ twarc configure
and then provide your consumer keys. Twarc will then ask you paste a url into a browser, where you will be asked to log into your Twitter account and authorize Twarc-Cloud to access your account.
Make sure everything is working:
$ python3 twarc_cloud.py usage: twarc_cloud.py [-h] [-V] [--debug] {collection-config,collection,harvest} ... Manage AWS resources for Twarc Cloud. positional arguments: {collection-config,collection,harvest} command help collection-config Collection configuration-related commands. collection Collection-related commands. harvest Harvest-related commands. optional arguments: -h, --help show this help message and exit -V, --version Show version and exit --debug $ python twarc_cloud.py harvest list No running harvests.
Create a user timeline collection¶
Create a collection configuration file.
$ python3 twarc_cloud.py collection-config template user_timeline --id=test_collection Template written to collection.json. Add the collection before adding users to collect. $ cat collection.json { "id": "test_collection", "credentials": { "consumer_key": "<Your Twitter API consumer key>", "consumer_secret": "<Your Twitter API consumer secret>", "access_token": "<Your Twitter API access token>", "access_token_secret": "<Your Twitter API access token secret>" }, "type": "user_timeline", "users": {}, "delete_users_for": [ "protected", "suspended", "not_found" ] }
Add credentials to the collection configuration.
$ python3 twarc_cloud.py collection-config credentials Added credentials to collection.json.
This adds the Twitter API keys that you acquired earlier with Twarc.
Add the collection.
$ python3 twarc_cloud.py collection add Collection added. Don't forget to start or schedule the collection.
This copies the collection configuration file to your S3 bucket.
Add users to the collection.
$ python3 twarc_cloud.py collection-config screennames @justin_littman @not_justin_littman Getting users ids for screen names. This may take some time ... Added screen names to collection.json. Following screen names where not found: not_justin_littman
Twarc-cloud will notify you if any of the users cannot be found. You can also add users by user id and load them from files.
Update the collection.
$ python3 twarc_cloud.py collection-config update Collection configuration updated.
Schedule the collection.
$ python3 twarc_cloud.py collection schedule test_collection "rate(7 days)" Scheduled
That’s it! A harvest will be performed immediately and then again every 7 days.
Download the collection¶
$ python3 twarc_cloud.py collection download test_collection
Collection downloaded to download/twarc-cloud/collections/test_collection
$ find download/twarc-cloud2/collections/test_collection -type f
download/twarc-cloud2/collections/test_collection/harvests/2019/03/09/15/35/07/tweets-20190309153508.jsonl.gz
download/twarc-cloud2/collections/test_collection/harvests/2019/03/09/15/35/07/users.jsonl
download/twarc-cloud2/collections/test_collection/harvests/2019/03/09/15/35/07/manifest-sha1.txt
download/twarc-cloud2/collections/test_collection/harvests/2019/03/09/15/35/07/user_changes.json
download/twarc-cloud2/collections/test_collection/harvests/2019/03/09/15/35/07/collection.json
download/twarc-cloud2/collections/test_collection/harvests/2019/03/09/15/35/07/harvest.json
download/twarc-cloud2/collections/test_collection/changesets/change-20190309153326.json
download/twarc-cloud2/collections/test_collection/changesets/change-20190309153507.json
download/twarc-cloud2/collections/test_collection/changesets/change-20190309153304.json
download/twarc-cloud2/collections/test_collection/collection.json
download/twarc-cloud2/collections/test_collection/last_harvest.json
Some explanation:
download/twarc-cloud2/collections/test_collection/harvests/2019/03/09/15/35/07/
contains the files created by the harvest.tweets-20190309153508.jsonl.gz
contains the tweets as in a newline-delimited, gzip compressed JSON format as retrieved from Twitter’s API. In this case there is only one file; depending on the number of tweets and how long a harvest takes, there may be multiple files.users.jsonl
contains the users in a newline-delimited JSON format as retrieved from Twitter’s API.manifest-sha1.txt
contains a SHA1 checksum for each tweet file in the harvest.user_changes.json
describes any changes that were found for users, e.g., changed screen names.collection.json
is the collection configuration file used to perform this harvest.harvest.json
contains information about the harvest such as the number of tweets collected.
download/twarc-cloud2/collections/test_collection/changesets/
contains changeset files that record every change made to the collection configuration.
Stop the collection¶
$ python twarc_cloud.py collection stop test_collection
Stopped