Automation of Data Acquisition and Crawling
This wiki entry provides a guide for setting up a daemon to monitor the data/staging
folder for incoming data. Upon detection of new data, the crawler module kicks in and runs a metadata extraction task to generate client-side metadata before sending all of that to the File Manager to be ingested.
Please ensure you have all of the prerequisite software installed. In particular, you should now have an unpacked coal-sds deployment available on your filesystem at /usr/local/coal-sds-deploy. The following documentation assumes you have executed a cd into that directory. The unpacked coal-sds contents will look as follows:
[ec2-user@ip-172-31-28-45 coal-sds-deploy]$ ls -al
total 56
drwxr-xr-x 14 root root 4096 Apr 16 02:12 .
drwxr-xr-x 14 root root 4096 Apr 16 02:12 ..
drwxrwxrwx 2 root root 4096 Apr 16 01:51 bin
drwxr-xr-x 7 root root 4096 Apr 16 02:12 crawler
drwxrwxrwx 7 root root 4096 Apr 16 02:07 data
drwxr-xr-x 4 root root 4096 Apr 16 02:12 extensions
drwxr-xr-x 8 root root 4096 Apr 16 02:12 filemgr
drwxrwxrwx 2 root root 4096 Apr 16 02:07 logs
drwxr-xr-x 8 root root 4096 Apr 16 02:12 pcs
drwxr-xr-x 5 root root 4096 Apr 16 02:12 pge
drwxr-xr-x 8 root root 4096 Apr 16 02:12 resmgr
drwxr-xr-x 3 root root 4096 Apr 16 02:12 solr
drwxrwxrwx 11 root root 4096 Apr 16 02:06 tomcat
drwxr-xr-x 8 root root 4096 Apr 16 02:12 workflow
Instructions for starting the OODT processes can be found here.
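For reference, RADiX-style OODT deployments typically ship a control script in the top-level bin directory; assuming coal-sds follows that convention, starting the stack looks roughly like this (the script name and supported sub-commands may differ in your deployment):

$ cd /usr/local/coal-sds-deploy/bin
$ ./oodt start   # assumed RADiX-style control script; adjust to your deployment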
Once OODT has been started, navigate to the bin folder in the crawler directory:
$ cd /usr/local/coal-sds-deploy/crawler/bin
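Before running anything, you can confirm the launcher scripts are in place; at a minimum this directory contains the crawlctl wrapper and the crawler_launcher script referenced below (your deployment may include additional helpers):

$ ls -l crawlctl crawler_launcher   # both scripts should exist and be executable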
Then, run the following command to execute the daemon (running on port 9003) that will monitor /usr/local/coal-sds-deploy/data/staging every 2 seconds and do the following upon detection of new staging data:
- Execute the TikaCmdLineMetExtractor, which uses Apache Tika to extract metadata from whatever it finds.
- Ingest the file into the File Manager running on http://localhost:9000.
- Upon successful ingestion, move the staging file to data/archive and delete the original data file from data/staging.
- Upon unsuccessful ingestion, move the staging file out of data/staging into data/failure.
./crawlctl
That's it...
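To convince yourself the daemon is working, drop a test file into the staging area and watch it disappear from data/staging and show up under data/archive once ingestion succeeds (the filename below is only illustrative, and the exact location under data/archive depends on the File Manager's archiving policy):

$ cp /tmp/sample-dataset.csv /usr/local/coal-sds-deploy/data/staging/
$ sleep 5   # give the daemon, polling every 2 seconds, a moment to pick the file up
$ find /usr/local/coal-sds-deploy/data/archive -name 'sample-dataset.csv'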
Actually, crawlctl is merely a wrapper around the following execution:
./crawler_launcher --filemgrUrl http://localhost:9000 \
--operation --launchMetCrawler \
--clientTransferer org.apache.oodt.cas.filemgr.datatransfer.LocalDataTransferFactory \
--productPath /usr/local/coal-sds-deploy/data/staging \
--metExtractor org.apache.oodt.cas.metadata.extractors.TikaCmdLineMetExtractor \
--metExtractorConfig /usr/local/coal-sds-deploy/data/met/tika.conf \
--failureDir /usr/local/coal-sds-deploy/data/failure \
--daemonPort 9003 \
--daemonWait 2 \
--successDir /usr/local/coal-sds-deploy/data/archive \
--actionIds DeleteDataFile
In order to see the above workflow in action, visit http://localhost:8080/opsui/status
to view the ingested metadata.
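Alternatively, the File Manager can be queried directly from the command line. A minimal sketch, assuming the deployment ships the stock OODT filemgr-client script and that products are ingested under the default GenericFile product type:

$ cd /usr/local/coal-sds-deploy/filemgr/bin
$ ./filemgr-client --url http://localhost:9000 --operation --getNumProducts --productTypeName GenericFile   # prints the number of ingested GenericFile products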