The home for the spider that supports search.gov. The spider uses the open-source Scrapy framework. The spiders can be found in `search_gov_crawler/search_gov_spiders/spiders/`.
*Note: Other files and directories exist within the repository, but the folders and files below are the ones needed for the Scrapy framework.
├── search_gov_crawler ( scrapy root )
│ ├── search_gov_spiders ( scrapy project *Note: multiple projects can exist within a project root )
│ │ ├── extensions ( includes custom scrapy extensions )
│ │ ├── helpers ( includes common functions )
│ │ ├── spiders
│ │ │ ├── domain_spider.py ( spider for html pages )
│ │ │ ├── domain_spider_js.py ( spider for js pages )
│ │ ├── utility_files ( includes json files with default domains to scrape )
│ │ ├── items.py
│ │ ├── middlewares.py
│ │ ├── pipelines.py
│ │ ├── settings.py
│ ├── scrapy.cfg
The spider can either scrape for URLs from the list of required domains or take in a domain and starting URL to scrape a single site/domain. Running the spider produces a list of URLs found, written to `search_gov_crawler/search_gov_spiders/spiders/scrapy_urls/{spider_name}/{spider_name}_{date}-{UTC_time}.txt`, as specified by the `FEEDS` setting in `settings.py`.
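For reference, `FEEDS` is Scrapy's built-in feed-export setting: a dict that maps an output URI (with `%(name)s` and `%(time)s` placeholders) to export options. The sketch below shows what such a setting might look like; the real value lives in `search_gov_crawler/search_gov_spiders/settings.py`, so treat the path template and format options here as illustrative only.

```python
# settings.py (illustrative sketch -- the real values live in
# search_gov_crawler/search_gov_spiders/settings.py)
FEEDS = {
    # %(name)s expands to the spider name, %(time)s to a UTC timestamp
    "spiders/scrapy_urls/%(name)s/%(name)s_%(time)s.txt": {
        "format": "csv",  # assumed format; one URL per line
        "item_export_kwargs": {"include_headers_line": False},
        "overwrite": False,
    },
}
```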
Make sure to run `pip install -r requirements.txt` and `playwright install` before running any spiders.
Navigate to `search_gov_crawler/search_gov_spiders`, then enter the command below:

`scrapy crawl domain_spider`

to run for all URLs/domains that do not require JavaScript handling. To run for all sites that do require JavaScript, run:

`scrapy crawl domain_spider_js`

Note: both of these runs take a long time.
In the same directory specified above, enter the command below, adding the domain and starting URL for the crawler:

`scrapy crawl domain_spider -a allowed_domains=example.com -a start_urls=www.example.com`

or, for sites that require JavaScript:

`scrapy crawl domain_spider_js -a allowed_domains=example.com -a start_urls=www.example.com`
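Scrapy passes each `-a name=value` pair to the spider's constructor as a string keyword argument. The sketch below shows the general pattern a spider can use to accept the two arguments from the commands above; it illustrates the mechanism only and is not the project's actual `domain_spider.py`.

```python
import scrapy


class DomainSpider(scrapy.Spider):
    """Illustrative only: how -a allowed_domains=... and -a start_urls=...
    can be consumed. Scrapy hands every -a pair to __init__ as a string."""

    name = "domain_spider"

    def __init__(self, allowed_domains=None, start_urls=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        if allowed_domains:
            self.allowed_domains = allowed_domains.split(",")
        if start_urls:
            # add a scheme if a bare host like www.example.com was passed
            self.start_urls = [
                url if url.startswith("http") else f"https://{url}"
                for url in start_urls.split(",")
            ]

    def parse(self, response):
        # yield the visited URL and keep following in-page links
        yield {"url": response.url}
        yield from response.follow_all(response.css("a::attr(href)").getall())
```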
Alternatively, to run a single spider file directly with `scrapy runspider`:

- Make sure to run `pip install -r requirements.txt` and `playwright install` before running any spiders.
- Navigate to the spiders directory.
- Enter one of the two following commands:
  - This command will output the yielded URLs to the destination (relative to the spiders directory) and in the file format specified in `search_gov_crawler/search_gov_spiders/pipelines.py`:

    `$ scrapy runspider <spider_file.py>`

  - This command will output the yielded URLs to the destination (relative to the spiders directory) and in the file format specified by the user:

    `$ scrapy runspider <spider_file.py> -o <filepath_to_output_folder/spider_output_filename.csv>`
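As context for the first command, a Scrapy item pipeline controls where yielded items end up via three hook methods. The sketch below shows the general shape of such a pipeline; the project's real logic and output location are defined in `search_gov_crawler/search_gov_spiders/pipelines.py`, so the class name, file name, and behavior here are assumptions for illustration.

```python
class UrlExportPipeline:
    """Illustrative pipeline: write each yielded URL to a per-spider text file.
    The actual pipeline in pipelines.py may differ."""

    def open_spider(self, spider):
        # hypothetical output path, relative to the directory the spider runs from
        self.file = open(f"{spider.name}_urls.txt", "w", encoding="utf-8")

    def process_item(self, item, spider):
        self.file.write(item["url"] + "\n")
        return item  # pass the item on to later pipelines / feed exports

    def close_spider(self, spider):
        self.file.close()
```

A pipeline like this is enabled through the `ITEM_PIPELINES` setting in `settings.py`.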
The benchmark script is primarily intended for timing and testing scrapy runs. There are two ways to run it. In either case, you will likely want to redirect the output to a log file using something like `<benchmark command> > scrapy.log 2>&1`.

- To run a single domain, specifying the starting URL (`-u`) and allowed domain (`-d`):

  `$ python search_gov_spiders/benchmark.py -u https://www.example.com -d example.com`

- To run multiple spiders simultaneously, provide a JSON file in the format of the crawl-sites.json file as an argument:

  `$ python search_gov_spiders/benchmark.py -f </path/to/crawl-sites-like-file.json>`

There are other options available. Run `python search_gov_spiders/benchmark.py -h` for more info.
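Conceptually, a benchmark run like the single-domain command above amounts to timing a programmatic crawl. The sketch below illustrates that idea with Scrapy's `CrawlerProcess`; it is not the project's `benchmark.py`, and the spider name and arguments are simply taken from the examples above.

```python
import time

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

# Illustrative only: time one crawl of a single domain, roughly what a
# benchmark run does. Run from inside the Scrapy project so that
# get_project_settings() can find scrapy.cfg / settings.py.
process = CrawlerProcess(get_project_settings())
process.crawl(
    "domain_spider",  # spider name, resolved via the project's spider loader
    allowed_domains="example.com",
    start_urls="https://www.example.com",
)

start = time.perf_counter()
process.start()  # blocks until the crawl finishes
print(f"crawl finished in {time.perf_counter() - start:.1f}s")
```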
To run jobs on a schedule, as defined in the crawl-sites.json file:

`$ python search_gov_spiders/scrapy_scheduler.py`
To run spiders through a Scrapyd server:

- Navigate to the Scrapy project root directory and start the server:

  `$ scrapyd`

  Note: the directory where you start the server is arbitrary; it is simply where the logs and the Scrapy project FEED destination (relative to the server directory) will live.
- Run this command to eggify the Scrapy project and deploy it to the Scrapyd server:

  `$ scrapyd-deploy default`

  Note: this simply deploys to a local Scrapyd server. To add custom deployment endpoints, edit the `scrapy.cfg` file and add or customize endpoints. For instance, if you wanted local and production endpoints:

  ```cfg
  [settings]
  default = search_gov_spiders.settings

  [deploy: local]
  url = http://localhost:6800/
  project = search_gov_spiders

  [deploy: production]
  url = <IP_ADDRESS>
  project = search_gov_spiders
  ```

  To deploy:

  ```bash
  # deploy locally
  scrapyd-deploy local

  # deploy production
  scrapyd-deploy production
  ```
For an interface to view jobs (pending, running, finished) and logs, go to http://localhost:6800/. However, to actually manipulate the spiders deployed to the Scrapyd server, you'll need to use the Scrapyd JSON API. Some of the most-used commands:

- Schedule a job:

  `$ curl http://localhost:6800/schedule.json -d project=search_gov_spiders -d spider=<spider_name>`

- Check the load status of the service:

  `$ curl http://localhost:6800/daemonstatus.json`
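The same two calls can be made from Python with the `requests` library; the endpoints and parameters are the standard Scrapyd JSON API ones shown above, while the spider name is just an example.

```python
import requests

SCRAPYD = "http://localhost:6800"

# Schedule a job for a spider deployed in the search_gov_spiders project.
response = requests.post(
    f"{SCRAPYD}/schedule.json",
    data={"project": "search_gov_spiders", "spider": "domain_spider"},
)
print(response.json())  # e.g. {"status": "ok", "jobid": "..."}

# Check the load status of the Scrapyd service.
print(requests.get(f"{SCRAPYD}/daemonstatus.json").json())
```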
To create a new spider:

- Navigate to anywhere within the Scrapy project root directory and run this command:

  `$ scrapy genspider -t crawl <spider_name> "<spider_starting_domain>"`

- Open the `/search_gov_spiders/search_gov_spiders/spiders/boilerplate.py` file and replace the lines of the generated spider with the lines of the boilerplate spider, as dictated in the boilerplate file.
- Modify the `rules` in the new spider as needed; see the Scrapy rules documentation for the specifics (a hedged sketch follows this list).
- To update the Scrapyd server with the new spider, run:

  `$ scrapyd-deploy <default or endpoint_name>`
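For orientation, a spider generated with the `crawl` template is a `CrawlSpider` whose `rules` tuple controls which links are followed and which callback handles the matched pages. The sketch below is a generic example of that structure, not the project's boilerplate; the domains, patterns, and callback body are placeholders.

```python
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class NewSpider(CrawlSpider):
    """Generic crawl-template spider; the values below are placeholders."""

    name = "new_spider"
    allowed_domains = ["example.com"]
    start_urls = ["https://www.example.com"]

    rules = (
        # follow in-domain links, skip a few binary file types,
        # and send every matched page to parse_item
        Rule(
            LinkExtractor(deny=(r"\.pdf$", r"\.zip$")),
            callback="parse_item",
            follow=True,
        ),
    )

    def parse_item(self, response):
        yield {"url": response.url}
```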
## Running Against All Listed Search.gov Domains

This process allows scrapy to be run directly using an in-memory scheduler. The schedule is based on the initial schedule set up in the utility files readme. The process will run until killed.
- Source the virtual environment and update dependencies.
- Start the scheduler (a sketch of the general idea follows this list):

  `$ python search_gov_crawler/scrapy_scheduler.py`
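As a rough illustration, an in-memory scheduler boils down to a long-running process that kicks off crawls on a timer. The sketch below uses APScheduler and a subprocess call, which are assumptions for illustration only; the project's `scrapy_scheduler.py` builds its jobs from the crawl-sites JSON file instead.

```python
import subprocess

from apscheduler.schedulers.blocking import BlockingScheduler


def run_domain_spider() -> None:
    # launch the crawl in a child process so each run gets a fresh Twisted reactor
    subprocess.run(
        ["scrapy", "crawl", "domain_spider"],
        cwd="search_gov_crawler/search_gov_spiders",
        check=False,
    )


scheduler = BlockingScheduler()
scheduler.add_job(run_domain_spider, "interval", hours=24)  # illustrative interval
scheduler.start()  # blocks until the process is killed
```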
To run using scrapyd and scrapydweb instead:

- Source the virtual environment, update dependencies, and change the working directory to `search_gov_crawler`.
- Start scrapyd:

  `$ scrapyd`

- Build the latest version of the scrapy project (if any changes have been made since the last run):

  `$ scrapyd-deploy local -p search_gov_spiders`

- Start logparser:

  `$ python -m search_gov_logparser`

- Start scrapydweb:

  `$ python -m search_gov_scrapydweb`