Commit 7befb4f

Adding the HITS Algorithm paper
1 parent 15189a9 commit 7befb4f

3 files changed: +56 -0 lines changed

information_retrieval/README.md

+5
@@ -44,3 +44,8 @@ The included documents are
 used in BM25. BM25 has been shown to be one of the best probabilistic
 weighting schemes. While the paper was in PostScript form, the committer has
 changed the format to PDF as per the guidelines of Papers We Love via ps2pdf.
+
+* [:scroll:](hits.pdf) [HITS Algorithm](https://www.cs.cornell.edu/home/kleinber/auth.pdf) - Jon M. Kleinberg
+
+This paper introduces HITS, a link analysis algorithm that rates webpages. Unlike the more famous PageRank algorithm, HITS distinguishes two roles a page can play: hub and authority. A page is authoritative if it has a large number of incoming links, and acts as a hub (a directory of sorts) if it has a large number of outgoing links. HITS assigns each page two scores, an authority score and a hub score, and computes them iteratively: a page's hub score is the sum of the authority scores of the pages it links to, and its authority score is the sum of the hub scores of the pages linking to it. The updates repeat until the scores converge, and the resulting scores can then be used to rank documents. While the algorithm is well known in academia, it is not widely used in industry (a variant was used by Teoma, a search company later acquired by Ask Jeeves).
+
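The iterative update described in the HITS entry can be sketched as follows. This is a minimal illustration, not the paper's reference implementation: the `hits` and `normalize` names, the dictionary-based graph format, the L2 normalization after each round, and the iteration cap and tolerance are all assumptions made for this example.

```python
from math import sqrt


def normalize(scores):
    """Scale scores to unit L2 norm (assumed normalization; keeps values bounded)."""
    norm = sqrt(sum(v * v for v in scores.values()))
    if norm == 0:
        return dict(scores)
    return {page: v / norm for page, v in scores.items()}


def hits(links, iterations=50, tol=1e-9):
    """Return (authority, hub) score dicts.

    `links` maps each page to the list of pages it links to (hypothetical format).
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)

    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}

    for _ in range(iterations):
        # Authority score: sum of hub scores of the pages linking in.
        new_auth = {p: 0.0 for p in pages}
        for src, targets in links.items():
            for dst in targets:
                new_auth[dst] += hub[src]

        # Hub score: sum of authority scores of the pages linked to.
        new_hub = {p: sum(new_auth[d] for d in links.get(p, ())) for p in pages}

        new_auth = normalize(new_auth)
        new_hub = normalize(new_hub)

        # Stop once neither score vector changes noticeably.
        delta = max(
            abs(new_auth[p] - auth[p]) + abs(new_hub[p] - hub[p]) for p in pages
        )
        auth, hub = new_auth, new_hub
        if delta < tol:
            break

    return auth, hub


if __name__ == "__main__":
    # Tiny made-up graph: "a" and "b" act as hubs pointing at "c" and "d".
    graph = {"a": ["c", "d"], "b": ["c", "d"], "c": [], "d": ["c"]}
    authority, hub_scores = hits(graph)
    for page in sorted(authority):
        print(page, round(authority[page], 3), round(hub_scores[page], 3))
```

In this toy graph, `c` ends up with the highest authority score because every other page links to it, while `a` and `b` share the highest hub score because they link to both authorities.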

information_retrieval/README.md~

+51
@@ -0,0 +1,51 @@
+## Information Retrieval
+
+Information retrieval is the activity of obtaining information resources relevant to an information need from a collection of information resources (according to Wikipedia).
+
+The included documents are
+
+* [:scroll:](graph_of_word_and_tw_idf.pdf) [Graph of Word and TW-IDF](http://www.lix.polytechnique.fr/~rousseau/papers/rousseau-cikm2013.pdf) - Francois Rousseau & Michalis Vazirgiannis
+
+The traditional IR system stores term-specific statistics (typically
+a term's frequency in each document, which we call TF) in an index.
+Such a model ignores dependencies between terms and considers a
+document's terms to occur independently of each other (and is aptly
+called the bag-of-words model). In this paper the authors use a
+statistic that uses a graph representation of a document to encode
+dependencies between terms, replace the TF statistic with a new
+TW statistic based on the constructed graph, and achieve
+significantly better results than popular existing models. This
+paper received an honorable mention at CIKM 2013.
+
+* [:scroll:](pagerank.pdf) [PageRank Algorithm](http://ilpubs.stanford.edu:8090/422/1/1999-66.pdf) - Lawrence Page, Sergey Brin, Rajeev Motwani and Terry Winograd
+
+This paper introduces the PageRank algorithm, which forms the backbone of
+the present-day Google search engine. PageRank ranks web pages by the link
+structure of the web: a page's score depends on how many pages link to it
+and on the scores of those linking pages. The authors also implemented
+PageRank in the BackRub system (now called the Google Search Engine),
+described in [The Anatomy of a Large-Scale Hypertextual Web Search Engine](http://infolab.stanford.edu/~backrub/google.html),
+which assigned page ranks to the webpages in its crawl of the World Wide
+Web. Google is currently the most commercially successful general-purpose
+search engine in the world.
+
+* [:scroll:](ocapi-trec3.pdf) [Okapi System](http://trec.nist.gov/pubs/trec3/papers/city.ps.gz) - Stephen E. Robertson, Steve Walker, Susan Jones, Micheline Hancock-Beaulieu, and Mike Gatford
+
+This paper introduces the now famous Okapi information retrieval
+framework, which introduces the BM25 ranking function for ranked
+retrieval. It is one of the first implementations of a probabilistic
+retrieval framework in the literature. BM25 is a bag-of-words retrieval
+function. The IDF (inverse document frequency) term can be interpreted
+via information theory. If a query term q appears in n(q) documents, the
+probability that a randomly picked document contains it is p(q) = n(q) / D,
+where D is the number of documents. The information content of that event,
+in Shannon's sense, is -log(p(q)) = log(D / n(q)). Smoothing by adding a
+constant to both numerator and denominator leads to the IDF term
+used in BM25. BM25 has been shown to be one of the best probabilistic
+weighting schemes. While the paper was in PostScript form, the committer has
+changed the format to PDF as per the guidelines of Papers We Love via ps2pdf.
+
+* [:scroll:](hits.pdf) [HITS Algorithm](https://www.cs.cornell.edu/home/kleinber/auth.pdf) - Jon M. Kleinberg
+
+This paper introduces HITS, a link analysis algorithm that rates webpages. Unlike the more famous PageRank algorithm, HITS distinguishes two roles a page can play: hub and authority. A page is authoritative if it has a large number of incoming links, and acts as a hub (a directory of sorts) if it has a large number of outgoing links. HITS assigns each page two scores, an authority score and a hub score, and computes them iteratively: a page's hub score is the sum of the authority scores of the pages it links to, and its authority score is the sum of the hub scores of the pages linking to it. The updates repeat until the scores converge, and the resulting scores can then be used to rank documents. While the algorithm is well known in academia, it is not widely used in industry (a variant was used by Teoma, a search company later acquired by Ask Jeeves).
+
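The information-theoretic reading of IDF in the Okapi entry can be checked numerically with a short sketch. The function names and the example counts are hypothetical. The smoothed variant uses the commonly cited BM25 form, which applies a constant of 0.5 to the odds (D - n(q)) / n(q) rather than to D / n(q); both that form and the 0.5 are assumptions here, since the entry only says that a constant is added.

```python
from math import log


def idf_plain(num_docs, doc_freq):
    """Information content of 'a random document contains the term':
    -log(p(q)) = log(D / n(q)), with p(q) = n(q) / D."""
    return log(num_docs / doc_freq)


def idf_smoothed(num_docs, doc_freq, k=0.5):
    """Smoothed IDF in the commonly cited BM25 form (assumed, see lead-in):
    log((D - n(q) + k) / (n(q) + k))."""
    return log((num_docs - doc_freq + k) / (doc_freq + k))


if __name__ == "__main__":
    D = 1000  # hypothetical collection size
    for n in (1, 10, 100, 600):
        print(n, round(idf_plain(D, n), 3), round(idf_smoothed(D, n), 3))
```

Rare terms receive a high weight under both variants. The smoothed form turns negative once a term appears in more than half of the documents, which is one reason practical implementations often clamp or shift it.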

information_retrieval/hits.pdf

256 KB
Binary file not shown.
