To compile from the command line, assuming that your PATH variable points to the
JDK bin folder and your JAVA_HOME is also set, run from the src folder

javac -cp ".;../external/lib/jsoup-1.7.3.jar" *.java

Then run from the same folder with

java -cp ".;../external/lib/jsoup-1.7.3.jar" WebCrawlerMain

(jsoup is needed at runtime as well, so keep it on the classpath; on Linux/macOS
use : instead of ; as the classpath separator).
Even better, import the project into Eclipse and run it from there.
The crawling algorithm is breadth-first search when single-threaded.
If the user enters more than one base URL to crawl, a thread will be spawned for
each. Threads can access any of the links found by any other thread.
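
A minimal sketch of that one-thread-per-seed scheme (class and method names here,
such as LauncherSketch and crawlFrom, are placeholders, not the project's actual
API):

import java.util.Arrays;
import java.util.List;

public class LauncherSketch {
    public static void main(String[] args) throws InterruptedException {
        // one crawler thread per base url entered by the user
        List<String> seeds = Arrays.asList("http://example.com", "http://example.org");
        Thread[] threads = new Thread[seeds.size()];
        for (int i = 0; i < seeds.size(); i++) {
            final String seed = seeds.get(i);
            threads[i] = new Thread(() -> crawlFrom(seed));
            threads[i].start();
        }
        for (Thread t : threads) {
            t.join(); // wait until every crawler has finished
        }
    }

    // placeholder for the per-thread BFS crawl loop; because the frontier sets
    // are static and shared, each thread sees links found by the others
    private static void crawlFrom(String seed) { }
}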
Adjustable params:

In WebCrawlerMain.java

private static final int NUM_SEARCH_RESULTS = 3;

controls how many Google search results to use to seed the topical crawling
(an equal number of threads will be spawned).

In WebCrawlThread.java

final int MAX_NUM_LINKS_TO_CRAWL = 1000;

controls the maximum number of URLs to crawl, and

final int MAX_DEPTH = 1000;

controls how many links to retrieve from each page. A sketch of how these two
limits could bound the crawl loop follows.
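
A hedged sketch of how these two limits might gate a per-thread crawl loop (the
loop body is an assumption, not the project's code; only the constant names and
the jsoup dependency come from this readme):

import java.io.IOException;
import java.util.HashSet;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class CrawlLoopSketch {
    static final int MAX_NUM_LINKS_TO_CRAWL = 1000; // cap on urls crawled in total
    static final int MAX_DEPTH = 1000;              // cap on links taken per page

    static HashSet<String> crawledUrls = new HashSet<String>();

    static void crawl() {
        while (crawledUrls.size() < MAX_NUM_LINKS_TO_CRAWL) {
            String url = getFromUrlsToCrawlSet();
            if (url == null) break;                  // frontier exhausted
            try {
                Document doc = Jsoup.connect(url).get();
                int taken = 0;
                for (Element link : doc.select("a[href]")) {
                    if (taken++ >= MAX_DEPTH) break; // stop after MAX_DEPTH links
                    addToUrlsToCrawlSet(link.attr("abs:href"));
                }
            } catch (IOException e) {
                // skip pages that fail to download or parse
            }
        }
    }

    // stand-ins only; the synchronized versions are sketched in the
    // data-structures section below
    static String getFromUrlsToCrawlSet() { return null; }
    static void addToUrlsToCrawlSet(String url) { }
}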
The data structures

private static HashSet<String> crawledUrls = new HashSet<String>();
private static LinkedHashSet<String> urlsToCrawl = new LinkedHashSet<String>();

in WebCrawlThread.java are shared by all threads, and access to them goes through
synchronized methods to prevent race conditions among the threads. In particular,
the method getFromUrlsToCrawlSet retrieves the next URL from the LinkedHashSet
urlsToCrawl (this set keeps insertion order), removes it, and adds it to the set
crawledUrls. Since the whole method is executed by one thread at a time, this
guarantees that no two threads crawl the same URL.
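
A minimal sketch of how that synchronized hand-off could look (the method bodies
are assumptions; the two sets and the name getFromUrlsToCrawlSet come from the
project, while addToUrlsToCrawlSet is a hypothetical helper):

import java.util.HashSet;
import java.util.Iterator;
import java.util.LinkedHashSet;

public class FrontierSketch {
    private static HashSet<String> crawledUrls = new HashSet<String>();
    private static LinkedHashSet<String> urlsToCrawl = new LinkedHashSet<String>();

    // Atomically take the oldest queued url (a LinkedHashSet iterates in
    // insertion order), remove it from the frontier and record it as crawled.
    // Since the method is synchronized, only one thread runs it at a time,
    // so no two threads can ever receive the same url.
    public static synchronized String getFromUrlsToCrawlSet() {
        Iterator<String> it = urlsToCrawl.iterator();
        if (!it.hasNext()) return null; // frontier is empty
        String next = it.next();
        it.remove();
        crawledUrls.add(next);
        return next;
    }

    // Hypothetical helper: queue a url only if it was never crawled or queued
    // before, so the frontier stays free of duplicates.
    public static synchronized void addToUrlsToCrawlSet(String url) {
        if (!crawledUrls.contains(url)) {
            urlsToCrawl.add(url);
        }
    }
}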