
<html>
<!-- THIS FILE WAS GENERATED BY A SCRIPT: DO NOT EDIT IT! -->
    <head>
        <link href="style.css" rel="stylesheet" type="text/css"/>
        <title>
            Design and Analysis of Algorithms: Hash Tables
        </title>
    </head>

    <body>
    <div id="header">
        <div id="logo">
            <img src="graphics/Julia.png">
        </div>
        <div id="user-tools">
            <a href="index.html">Home</a>
            &nbsp; &nbsp; 
            <a href="about.html">About</a>
            &nbsp; &nbsp;
            <a href="feedback.html">Feedback</a>
        </div>
    </div>

        <h1>
            Design and Analysis of Algorithms: Hash Tables
            <a href="#note1">*</a>
        </h1>
        
        <details>
            <summary class="sum1">
            Dictionaries
            </summary>

            <p>
            <img
            src="https://upload.wikimedia.org/wikipedia/commons/thumb/e/e2/English-English_and_English-Persian_dictionaries.JPG/350px-English-English_and_English-Persian_dictionaries.JPG">
            </p>
    
            <details>
                <summary class="sum2">
                Dictionary ADT.
                </summary>
                <p>
                A dictionary stores a collection of (key, value) pairs.
                Operations associated with this data type allow:
                </p>
                <ul>
                    <li>
                        the addition of a pair to the collection
                    </li>
                    <li>
                        the removal of a pair from the collection
                    </li>
                    <li>
                        the modification of an existing pair
                    </li>
                    <li>
                        the lookup of a value associated with a particular key
                    </li>
                </ul>
                    <p>
                    (<a href="https://en.wikipedia.org/wiki/Associative_array">Source</a>)
                    </p>
                <p>
                Typical uses:
                </p>
                <ul>
                    <li> Symbol lookup in a programming language
                    </li>
                    <li> Counting words in a book
                    </li>
                    <li> Storing colors with the name as the key and the numeric
                    equivalent as the value. Then we can write <b>set_text(colors["red"])</b>.
                    </li>
                </ul>
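                <p>
                As a rough illustration of the four operations and of the
                word-counting use case, here is a minimal sketch that uses
                Python's built-in <b>dict</b>. The color values and the
                helper name <b>count_words</b> are made up for this example.
                </p>

```python
# A Python dict used as a dictionary ADT (illustrative values only).
colors = {}

# Addition of a pair.
colors["red"] = 0xFF0000
colors["green"] = 0x00FF00

# Modification of an existing pair.
colors["green"] = 0x008000

# Lookup of the value associated with a particular key.
red_value = colors["red"]

# Removal of a pair.
del colors["green"]


def count_words(text):
    """Count how often each whitespace-separated word occurs in text."""
    counts = {}
    for word in text.lower().split():
        counts[word] = counts.get(word, 0) + 1
    return counts
```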
        
        
                <p>
                <em>Direct addressing</em> and <em>Hashing</em>
                are two ways of implementing a
                dictionary. Are there others?
                </p>
            </details>
        </details>

        <details>
            <summary class="sum1">
            11.1 Direct-address tables
            </summary>
            <ul>
                <li> <em>O(1)</em> <em>worst</em> case time for lookup.
                </li>
                <li> Uses:
                <ul class="nested">
                    <li> Memoization
                    </li>
                    <li> Bingo
                    </li>
                    <li> Sieve of Eratosthenes
                    </li>
                    <li> Mark zipcodes seen
                    </li>
                </ul>
                </li>
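                <li> Downside: wastes space, because the table needs one slot for
                every <em>possible</em> key, whether or not it is ever stored. If the
                universe of possible keys is huge or unbounded, direct addressing
                is not a good choice.
                <br>For instance, if your key is an arbitrary string!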
                </li>
                <li><a
                        href="https://github.com/gcallah/algorithms/blob/master/python/hash.py">
                    Example code here.
                    </a>
                </li>
            </ul>

            <details>
                <summary class="sum2">
                    Direct-address operations
                </summary>

                <p>
                <code>
                    <pre>
                    Direct-Address-Search(T, k)
                        return T[k]

                    Direct-Address-Insert(T, x)
                        T[x.key] = x

                    Direct-Address-Delete(T, x)
                        T[x.key] = NIL
                    </pre>
                </code>
                </p>
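                <p>
                The pseudocode above might be sketched in Python as follows.
                This is only an illustration: the class name
                <b>DirectAddressTable</b> is hypothetical, the sketch stores a
                separate key and value rather than an object with a
                <b>key</b> attribute, and <b>None</b> stands in for NIL.
                </p>

```python
class DirectAddressTable:
    """Direct-address table for integer keys in range(universe_size)."""

    def __init__(self, universe_size):
        # One slot per possible key; None plays the role of NIL.
        self.slots = [None] * universe_size

    def search(self, key):
        return self.slots[key]

    def insert(self, key, value):
        self.slots[key] = value

    def delete(self, key):
        self.slots[key] = None


# Usage: mark zip codes seen (5-digit codes give a universe of 100000 keys).
seen = DirectAddressTable(100000)
seen.insert(10012, True)
```

                <p>
                Every operation is a single array access, which is where the
                O(1) worst-case time comes from. The cost is the 100000-slot
                array, allocated even if only one zip code is ever marked.
                </p>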

            </details>

    <details>
        <summary class="sum2">
            Quiz
        </summary>
        <ol>
            <li>
                A good use for a direct-address table might be:
            </li>
            <ol type="a" class="nested">
                <li>
                <input type="radio" name="q1" value="a">
                All answers are fine
                </li>
                <li>
                <input type="radio" name="q1" value="b">
                Memoization
                </li>
                <li>
                <input type="radio" name="q1" value="c">
                Bingo
                </li>
                <li>
                <input type="radio" name="q1" value="d">
                Marking members of a set as present
                </li>
            </ol>
            <li>
                We can't use direct-address tables when
            </li>
            <ol type="a" class="nested">
                <li>
                <input type="radio" name="q2" value="a">
                there are a large number of (potential) entries
                </li>
                <li>
                <input type="radio" name="q2" value="b">
                all answers are fine
                </li>
                <li>
                <input type="radio" name="q2" value="c">
                we are programming the Sieve of Eratosthenes
                </li>
                <li>
                <input type="radio" name="q2" value="d">
                we are dealing with zipcodes
                </li>
            </ol>
            <li>
                What is direct addressing?
            </li>
            <ol type="a" class="nested">
                <li>
                <input type="radio" name="q3" value="a">
                Fewer keys than array positions
                </li>
                <li>
                <input type="radio" name="q3" value="b">
                Every key specifies a distinct array position
                </li>
                <li>
                <input type="radio" name="q3" value="c">
                Fewer array positions than keys
                </li>
                <li>
                <input type="radio" name="q3" value="d">
                None of the mentioned
                </li>
            </ol>
        </ol>
        <details>
            <summary class="sum3">
                Answers
            </summary>
            <p>
                1. a; 2. a; 3. b; 
            </p>
        </details>
    </details>

        </details>

        <details>
            <summary class="sum1">
            11.2 Hash tables
            </summary>
            <figure>
            <img
            src="https://upload.wikimedia.org/wikipedia/commons/thumb/5/58/Hash_table_4_1_1_0_0_1_0_LL.svg/300px-Hash_table_4_1_1_0_0_1_0_LL.svg.png">
            </figure>
    
            <details>
            <summary class="sum2">
            Basic Hashing
            </summary>
            <ul>
                <li> <em>O(1)</em> <em>average</em> case time for lookup.
                </li>
                <li> Universe of keys <em>U</em> mapped into slots of a <em>hash table</em>
                of size <em>m</em> by hash function <em>h</em>.
                </li>
                <li> Because <em>size(U) &gt; m</em>,
                    collisions are always possible.
                <br>Imagine we hash by word length:
                'mark' and 'beam' both hash to
                4. (Stupid hash function, but it
                illustrates the idea.) We must
                resolve this collision somehow.
                </li>
                <li> Resolve collisions by chaining:
                <br> Each slot holds a linked list of values.
                </li>
                <li> <a
                    href="https://en.wikipedia.org/wiki/Cryptographic_hash_function">
                Cryptographic hashing
                </a>
                <br> Uses large hash values (digests):
                    <a href="https://en.wikipedia.org/wiki/SHA-1">
                        SHA-1</a> produces 160-bit digests. <a
                        href="https://en.wikipedia.org/wiki/SHA-2">SHA-2</a> produces
                        digests of up to 512 bits.
                        <br>
                        <img
                        src="https://upload.wikimedia.org/wikipedia/commons/thumb/7/7d/SHA-2.svg/400px-SHA-2.svg.png">
                </li>
                <li> <a href="http://www.phash.org">
                        Perceptual hashing
                    </a>
                </li>
            </ul>
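            <p>
            Collision resolution by chaining could look roughly like the
            sketch below. The class <b>ChainedHashTable</b> is a hypothetical
            name, and a Python list stands in for each linked list. The
            example reuses the deliberately poor "hash by word length"
            function from the list above.
            </p>

```python
class ChainedHashTable:
    """Hash table that resolves collisions by chaining."""

    def __init__(self, num_slots, hash_function):
        self.num_slots = num_slots
        self.hash_function = hash_function
        # Each slot holds a Python list standing in for a linked list.
        self.slots = [[] for _ in range(num_slots)]

    def _chain(self, key):
        return self.slots[self.hash_function(key) % self.num_slots]

    def insert(self, key, value):
        chain = self._chain(key)
        for i, (existing_key, _) in enumerate(chain):
            if existing_key == key:
                chain[i] = (key, value)  # overwrite an existing pair
                return
        chain.append((key, value))

    def search(self, key):
        for existing_key, value in self._chain(key):
            if existing_key == key:
                return value
        return None

    def delete(self, key):
        chain = self._chain(key)
        chain[:] = [pair for pair in chain if pair[0] != key]


# The deliberately poor hash function from the text: hash by word length.
table = ChainedHashTable(8, len)
table.insert("mark", 1)
table.insert("beam", 2)
```

            <p>
            Both 'mark' and 'beam' land in slot 4, so that slot's chain holds
            two pairs, and a search has to walk the chain to tell them apart.
            </p>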
            </details>

            <details>
                <summary class="sum2">
                Introducing probability into an algorithm.
                </summary>
                <p>
                What happens to the usual assumptions?
                <br>
                <b>Correctness</b>: always, or only most of the time?
                <br>
                <b>Termination</b>: always, or almost always?
                <br>
                What does "performance" mean if the running
                time, the answer, or even termination can change from one run to the next?
                </p>
            </details>

            <details>
                <summary class="sum2">
                Probability Basics
                </summary>
                <p>
                <a href="https://gcallah.github.io/algorithms/Probability.html">
                    Reviewed in this document.
                </a>
                </p>
            </details>

            <details>
                <summary class="sum2">
                Simple uniform hashing 
                </summary>
                <p>
                This analysis assumes <b>chaining</b>. Furthermore, we assume that
                any given element is equally likely to hash into any of the
                <em>m</em> slots, independently of where the other elements hash.
                <br>
                <img
                src="https://upload.wikimedia.org/wikipedia/commons/3/3b/Hasq_hash_chains.png"
                height="210" width="240">
                </p>
                <ul>
                    <li> Hash table <em>T</em> with <em>m</em>
                        slots storing <em>n</em> elements.
                    </li>
                    <li> <b>Load factor</b>: <em>&alpha; = n / m</em>
                    <br> <em>&alpha;</em> is the average number of
                    elements stored in a
                    chain.
                    </li>
                    <li> Our analysis is in terms of <em>&alpha;</em>, which can be
                    less than, equal to, or greater than one.
                    </li>
                    <li><b>Worst case</b> is very bad:
                    <br>All <em>n</em> keys hash to the same slot.
                    <br>Worst case for searching becomes <em>&Theta;(n)</em> plus
                    time to compute hash function.
                    <br>We could have just used a linked list directly!
                    </li>
                    <li><b>Average case</b>:
                    <br>Assuming any given element is equally likely to hash into
                    any slot...
                    <br>We get average case <em>&Theta;(1 + &alpha;)</em> time.
                    <br>
                    <b>Unsuccessful search</b>: the average chain length
                    is <i>&alpha;</i>. Thus, after finding the right
                    slot with a hash function that runs in O(1) time, we
                    expect to examine <i>&alpha;</i> elements before giving up,
                    giving us the above run time.
                    <br />
                    <b>Successful search</b>: 
                    The probability that a list will be searched is
                    proportional to the number of items it contains, so
                    longer chains are searched more often.
                    Nevertheless, the expected number of elements examined is
                    1 + <i>&alpha;</i>/2 &minus; <i>&alpha;</i>/(2<i>n</i>),
                    which is still &Theta;(1 + <i>&alpha;</i>).
                    </li>
                    <li>
                        This means that if the number of slots is at least
                        proportional to the number of elements, then
                        <i>n</i> = O(<i>m</i>), so
                        &alpha; = <i>n</i> / <i>m</i> = O(<i>m</i>) / <i>m</i> = O(1),
                        and thus the whole search takes O(1) time on average.
                    </li>
                </ul>
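                <p>
                To get a feel for the average-case claim, one can simulate
                simple uniform hashing. The sketch below is an assumption-laden
                experiment, not part of the analysis: it throws <i>n</i>
                elements into <i>m</i> slots uniformly at random (the function
                names and the seed are arbitrary) and measures the average
                chain length and the average cost of a successful search.
                </p>

```python
import random


def chain_lengths(n, m, seed=0):
    """Throw n elements into m slots uniformly at random; return chain lengths."""
    rng = random.Random(seed)
    lengths = [0] * m
    for _ in range(n):
        lengths[rng.randrange(m)] += 1
    return lengths


def average_successful_cost(lengths):
    """Average number of elements examined when searching for a stored element.

    The element at position i of a chain (counting from 1) costs i
    comparisons, so a chain of length L contributes 1 + 2 + ... + L.
    """
    n = sum(lengths)
    total = sum(length * (length + 1) // 2 for length in lengths)
    return total / n


lengths = chain_lengths(n=2000, m=1000)
alpha = sum(lengths) / len(lengths)      # load factor n / m
cost = average_successful_cost(lengths)  # stays close to a small constant
```

                <p>
                The average chain length is exactly <i>&alpha;</i> by
                construction, which is the cost of an unsuccessful search;
                the measured successful-search cost should also stay within
                1 + <i>&alpha;</i>.
                </p>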
            </details>
        </details>

        <details>
            <summary class="sum1">
            11.3 Hash functions
            </summary>
            <ul>
                <li> First, convert key to an integer.
                <br> E.g., we can interpret characters in a string
                by their ASCII values.
                <br> Then treat each value as a digit in a radix-128 integer.
                <br>
                </li>

                <li> Keys could be many other things besides ordinary strings.
                <br> E.g., genomes:
                <br>
                <img
                src="https://upload.wikimedia.org/wikipedia/commons/6/63/Part_of_DNA_sequence_prototypification_of_complete_genome_of_virus_5418_nucleotides.gif"
                height="320" width="340">
                </li>

                <li> Multiplication method:
                <br>
                <i>h</i>(<i>k</i>) = &lfloor;<i>m</i> (<i>k</i> <i>A</i> mod 1)&rfloor;,
                where 0 &lt; <i>A</i> &lt; 1 and "<i>k</i> <i>A</i> mod 1" means the
                fractional part of <i>k</i> <i>A</i>.
                <br />
                The method works with any value of <i>A</i>, but some values work
                better than others: Knuth suggests that <i>A</i> should be about
                (5<sup>1/2</sup> &minus; 1) / 2, or 0.6180339887...
                </li>

                <li>
                    Division method:
                <br><i>h</i>(<i>k</i>) = <i>k</i> mod <i>m</i>,
                where the table size <em>m</em> is a suitably chosen prime number.
                </li>
            </ul>
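            <p>
            The steps in the list above can be sketched in Python roughly as
            follows. The function names are hypothetical, and the
            multiplication method here uses floating-point arithmetic for
            simplicity, which is only an approximation and would fail for
            very large keys; a production version would use fixed-width
            integer arithmetic instead.
            </p>

```python
import math


def string_to_int(s):
    """Interpret an ASCII string as a radix-128 integer."""
    k = 0
    for ch in s:
        k = k * 128 + ord(ch)
    return k


def division_hash(k, m):
    """Division method: h(k) = k mod m."""
    return k % m


A = (math.sqrt(5) - 1) / 2  # about 0.6180339887


def multiplication_hash(k, m, a=A):
    """Multiplication method: h(k) = floor(m * (k*a mod 1)).

    Uses floats, so it is only suitable for moderately sized keys.
    """
    fractional_part = (k * a) % 1.0
    return int(m * fractional_part)


key = string_to_int("pt")        # 112 * 128 + 116
slot = division_hash(key, 701)   # 701 is prime
```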

            <details>
                <summary class="sum2">
                    Choosing the right <i>m</i> for the division method
                </summary>

                    <p>
                    Consider the following hashing scheme:
                        <br>
                        <i>h</i>(<i>k</i>) = <i>k</i> mod <i>m</i>
                        <br>
                        <em>m</em> = 7
                        <br>We convert a string into a hashable key by treating it as a
                        base-8 number.
                        <br>So 'abc', where a = 1, b = 2, and c = 3, is converted to a
                        key as follows: 1 * 8<sup>2</sup> + 2 * 8 + 3 = 83.
                        <br>In this hashing scheme, what do the strings 'cba' and 'bac'
                        hash to?
                        <br>Can you write a more general statement about a pattern we
                        can detect here? Something along the lines of, "If the table
                        size is 2<sup><i>P</i></sup> - 1, and strings
                        are interpreted in radix 2<sup><i>P</i></sup>..."
                        <br>
                        <br>
                        <b>Answer:</b>
                        <br>
                        If <i>h</i>(<i>k</i>) = <i>k</i> mod <i>m</i>,
                        where <i>m</i> = 2<sup><i>P</i></sup> 
                         &minus; 1, and <em>k</em> is a
                        character string interpreted in
                        radix 2<sup><i>P</i></sup>, then all
                        permutations of a given string
                        will hash to the same value. So in the example above, 'abc',
                        'cba', and 'bac' all hash to the same value, namely 6
                        (for instance, 83 mod 7 = 6).
                        <br>
                        <br>
                        <b>Proof</b>:
                        <br>
                        Assumed (could be proven, but we won't do it here):
                    </p>
                        <ol>
                            <li>
                                (<i>x</i> + <i>y</i>) mod <i>z</i> == 
                                    (<i>x</i> mod <i>z</i>
                                    + <i>y</i> mod <i>z</i>) mod <i>z</i>
                                <br><b>Example:</b> (10 + 12) mod 7 ==
                                (10 mod 7 + 12 mod 7) mod 7
                            </li>
                            <li>
                                (<i>x</i> * <i>y</i>) mod <i>z</i> == 
                                    ((<i>x</i> mod <i>z</i>) * (<i>y</i> mod <i>z</i>))
                                mod <i>z</i>
                                <br><b>Example:</b> (10 * 12) mod 7 ==
                                ((10 mod 7) * (12 mod 7)) mod 7
                                <br>(Both sides are 1: 120 = 7 * 17 + 1, and
                                3 * 5 = 15 = 7 * 2 + 1.)
                            </li>
                            <li>
                                if <i>x</i> mod <i>z</i> == 1,
                                then <i>x</i><sup><i>n</i></sup> mod <i>z</i> == 1
                                <br><b>Example:</b> 8 mod 7 == 1, and
                                8<sup>2</sup> mod 7 == 1
                                <br>This follows from repeated application of 2!
                            </li>
                        </ol>
                        <p>
                        So, we have:
                        <br>
                        <img src="https://raw.githubusercontent.com/gcallah/algorithms/master/graphics/H3Eq1.gif">
                        <br>
                        <br>
                        <img src="https://raw.githubusercontent.com/gcallah/algorithms/master/graphics/H3Eq2.gif">
                        <br>
                        <br>
                        <img src="https://raw.githubusercontent.com/gcallah/algorithms/master/graphics/H3Eq3.gif">
                        <br>
                        <br>
                        <img
                        src="https://raw.githubusercontent.com/gcallah/algorithms/master/graphics/H3Eq4.gif">
                        &nbsp;&nbsp;(By 1)
                        <br>
                        <br>
                        <img
                        src="https://raw.githubusercontent.com/gcallah/algorithms/master/graphics/H3Eq5.gif">
                        &nbsp;&nbsp;(By 2)
                        <br>
                        <br>
                        <img
                        src="https://raw.githubusercontent.com/gcallah/algorithms/master/graphics/H3Eq6.gif">
                        &nbsp;&nbsp;(By 3)
                        <br>
                        <br>
                        <img
                        src="https://raw.githubusercontent.com/gcallah/algorithms/master/graphics/H3Eq7.gif">
                    </p>
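                    <p>
                    The claim is easy to check by brute force. The sketch
                    below (helper name <b>radix_key</b> is hypothetical) hashes
                    every permutation of 'abc' with radix 8 and table size 7,
                    using the digit values a = 1, b = 2, c = 3 from the
                    example, and compares the result with the digit sum
                    mod <i>m</i>, which is what the proof reduces the hash to.
                    </p>

```python
from itertools import permutations


def radix_key(s, radix, digit_of):
    """Interpret string s as an integer in the given radix."""
    k = 0
    for ch in s:
        k = k * radix + digit_of[ch]
    return k


digit_of = {"a": 1, "b": 2, "c": 3}   # digit values from the example
radix = 8                              # 2**P with P = 3
m = radix - 1                          # table size 2**P - 1 = 7

# Hash value of every permutation of 'abc'.
hashes = {"".join(p): radix_key("".join(p), radix, digit_of) % m
          for p in permutations("abc")}

# The proof shows each hash equals the sum of the digits, mod m.
digit_sum_mod_m = sum(digit_of.values()) % m
```

                    <p>
                    All six permutations map to the same slot, so such a
                    table size throws away the ordering information in the
                    key entirely.
                    </p>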
            </details>

            <details>
                <summary class="sum2">
                Universal hashing
                </summary>
                <ul>
                    <li>
                        Establish a <em>family</em> of hash functions.
                    </li>

                    <li>
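                        Choose the family so that, for any two distinct keys
                        <i>x</i> and <i>y</i> and a function <i>h</i> picked at
                        random from the family,
                        Prob[<i>h</i>(<i>x</i>) = <i>h</i>(<i>y</i>)] &le; 1/<i>m</i>,
                        where <i>m</i> is the size of our hash table.
                    <br>In other words, the hash functions have no more chance of
                    collision than simply randomly choosing
                    two slots between 1 and <i>m</i>.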
                    </li>

                    <li>
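                        Choose one function at random for each execution.
                    <br>Tricky: what if we store hash values? Values saved
                    during one run are useless in the next run unless we also
                    save which function produced them.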
                    </li>

                    <li>
                        Good average case behavior
                    <br>If a "bad" function handles some
                    data once, a "good" one will likely handle it another time.
                    <br>So a set of programming variable names that collides
                    badly in one run will probably hash well in the next run.
                    </li>
                </ul>
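                <p>
                One standard universal family, not named in these notes and
                used here only as an illustration, is
                <i>h</i>(<i>k</i>) = ((<i>a</i><i>k</i> + <i>b</i>) mod <i>p</i>) mod <i>m</i>,
                with <i>p</i> a prime larger than every key and <i>a</i>,
                <i>b</i> chosen at random. The sketch below assumes
                non-negative integer keys smaller than
                <i>p</i> = 2<sup>31</sup> &minus; 1; the function names are
                hypothetical.
                </p>

```python
import random


def make_universal_hash(m, p=2_147_483_647, rng=random):
    """Pick h(k) = ((a*k + b) mod p) mod m at random from the family.

    Assumes integer keys in range(p), where p is prime
    (the default p = 2**31 - 1 is a Mersenne prime).
    """
    a = rng.randrange(1, p)
    b = rng.randrange(0, p)

    def h(k):
        return ((a * k + b) % p) % m

    return h


def collision_rate(x, y, m, trials=20000, seed=1):
    """Estimate Prob[h(x) = h(y)] over random choices of h."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        h = make_universal_hash(m, rng=rng)
        if h(x) == h(y):
            hits += 1
    return hits / trials


rate = collision_rate(12, 345, m=10)  # should come out near or below 1/10
```

                <p>
                Drawing a fresh function per execution is what protects
                against a fixed "bad" input: no single set of keys is bad for
                most members of the family.
                </p>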
            </details>
        </details>

        <details>
            <summary class="sum1">
            11.4 Open addressing
            </summary>
            <ul>
                <li>
                All elements are stored directly in the table; no chaining.
                </li>

                <li>
                Linear probing
                <br>Easy: just move along array indices!
                <br>Prone to clustering.
                <br />
                Why: once an area of the table <i>begins</i> to fill up,
                we are more likely to get collisions there.
                </li>

                <li>
                Quadratic probing
                <br />
                Uses a hash function of the form:
                <br />
                <i>h</i>(<i>k</i>, <i>i</i>) =
                (<i>h</i>'(<i>k</i>) + <i>c</i><sub>1</sub><i>i</i> +
                <i>c</i><sub>2</sub><i>i</i><sup>2</sup>)
                mod <i>m</i>
                <br />
                Prone to a milder form of clustering.
                </li>

                <li>
                    <a href="https://en.wikipedia.org/wiki/Double_hashing">
                    Double hashing
                </a>
                <br>Uses two hash functions: the first picks the starting slot,
                and the second picks the step size between probes.
                </li>

                <li>
                    Unsuccessful search: at most 1 / (1 &minus; <i>&alpha;</i>)
                    expected probes, assuming uniform hashing and
                    <i>&alpha;</i> &lt; 1.
                    <br />
                    (Since at most one element can be in a slot, <i>&alpha;</i>
                    &le; 1.)
                    <br />
                    Our expected number of probes is at most 1 + <i>&alpha;</i>
                    + <i>&alpha;</i><sup>2</sup> + <i>&alpha;</i><sup>3</sup>
                    + <i>&alpha;</i><sup>4</sup> + ..., a geometric series that
                    sums to 1 / (1 &minus; <i>&alpha;</i>).
                </li>

                <li>
                    Successful search: at most (1 / <i>&alpha;</i>) ln (1 / 
                    (1 &minus; <i>&alpha;</i>)) expected probes.
                </li>
                <li>
                    <a
                        href="https://github.com/gcallah/algorithms/blob/master/hash_tables.py">
                        Source code here</a>.
                </li>
            </ul>
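            <p>
            A minimal sketch of open addressing with linear probing follows.
            It is an illustration under simplifying assumptions: the class
            name is hypothetical, Python's built-in <b>hash</b> plays the
            role of <i>h</i>', and deletion is left out because it needs
            special "deleted" markers. The two helper functions simply
            evaluate the probe bounds quoted in the list above, which assume
            uniform hashing rather than linear probing specifically.
            </p>

```python
import math


class LinearProbingTable:
    """Open-addressing hash table using linear probing (no deletion)."""

    def __init__(self, m):
        self.m = m
        self.slots = [None] * m   # each slot holds one (key, value) or None

    def _probe(self, key):
        # h(k, i) = (h'(k) + i) mod m, for i = 0, 1, ..., m - 1
        start = hash(key) % self.m
        for i in range(self.m):
            yield (start + i) % self.m

    def insert(self, key, value):
        for j in self._probe(key):
            if self.slots[j] is None or self.slots[j][0] == key:
                self.slots[j] = (key, value)
                return j
        raise OverflowError("hash table is full")

    def search(self, key):
        for j in self._probe(key):
            if self.slots[j] is None:
                return None          # hit an empty slot: key is absent
            if self.slots[j][0] == key:
                return self.slots[j][1]
        return None


def unsuccessful_probe_bound(alpha):
    """1 / (1 - alpha), the bound quoted above (needs alpha below 1)."""
    return 1 / (1 - alpha)


def successful_probe_bound(alpha):
    """(1 / alpha) * ln(1 / (1 - alpha)), for alpha strictly between 0 and 1."""
    return (1 / alpha) * math.log(1 / (1 - alpha))
```

            <p>
            With a table of size 7, the integer keys 10, 17, and 24 all start
            at slot 3, so they end up in the adjacent slots 3, 4, and 5: a
            small instance of the clustering described above.
            </p>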
        </details>
        <details>
            <summary class="sum1">
            11.5 Perfect hashing
            </summary>

            <p>
                We can get even better performance with a fixed set of keys --
                think of reserved words in a programming language, or the
                index of a CD -- by <i>perfect hashing</i>.
                <br />
                We proceed as in hashing with chaining, but then, instead of a
                linked list, each hash slot <i>j</i> gets its own secondary
                hash table of size
                <i>m</i><sub><i>j</i></sub> = <i>n</i><sub><i>j</i></sub><sup>2</sup>, where
                <i>n</i><sub><i>j</i></sub> is the number of elements that hash to slot
                <i>j</i>.
                <br />
                The probability of getting a collision is much like the
                birthday problem: when the table size is the square of the 
                number of entries, the probability that a hash function chosen
                at random from a universal family produces any collision
                is &lt; 1/2. So we can just try hash functions until we find
                one that produces no collisions.
            </p>
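            <p>
            The two-level scheme might be sketched as below. This is a
            simplified illustration, not a tuned implementation: it assumes
            a non-empty, fixed set of distinct non-negative integer keys
            smaller than 2<sup>31</sup> &minus; 1, draws its hash functions
            from the family ((<i>a</i><i>k</i> + <i>b</i>) mod <i>p</i>) mod <i>m</i>
            (an assumption; the notes do not name a family), and does not
            retry the first-level function to keep the total space small.
            All function names are hypothetical.
            </p>

```python
import random

P = 2_147_483_647  # prime larger than every key (assumed: keys in range(P))


def random_hash(m, rng):
    """Random member of the family h(k) = ((a*k + b) mod P) mod m."""
    a = rng.randrange(1, P)
    b = rng.randrange(0, P)
    return lambda k: ((a * k + b) % P) % m


def build_perfect_table(keys, seed=0):
    """Build a static two-level table for a fixed set of distinct integer keys."""
    rng = random.Random(seed)
    n = len(keys)
    primary = random_hash(n, rng)
    buckets = [[] for _ in range(n)]
    for k in keys:
        buckets[primary(k)].append(k)

    secondary = []
    for bucket in buckets:
        size = len(bucket) ** 2          # m_j = n_j squared
        while True:                      # retry until collision-free
            h = random_hash(size, rng) if size else None
            table = [None] * size
            ok = True
            for k in bucket:
                j = h(k)
                if table[j] is not None:
                    ok = False           # collision: try another function
                    break
                table[j] = k
            if ok:
                break
        secondary.append((h, table))
    return primary, secondary


def contains(perfect_table, key):
    """Membership test with two hash evaluations and no chains."""
    primary, secondary = perfect_table
    h, table = secondary[primary(key)]
    return bool(table) and table[h(key)] == key


reserved = [10, 22, 37, 40, 52, 60, 70, 72, 75]
perfect = build_perfect_table(reserved)
```

            <p>
            Because each retry succeeds with probability greater than 1/2,
            the inner loop is expected to run fewer than two times per slot,
            and lookups afterwards never have to resolve a collision.
            </p>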

        </details>

        <details>
            <summary class="sum1">
                Source Code
            </summary>
            <p>
            
<a href="https://github.com/gcallah/algorithms/tree/master/Java/HashTables">Java</a><br>
<a href="https://github.com/gcallah/algorithms/tree/master/Ruby/HashTables">Ruby</a><br>
<a href="https://github.com/gcallah/algorithms/tree/master/C++/HashTables">C++</a><br>
<a href="https://github.com/gcallah/algorithms/tree/master/Python/HashTables">Python</a><br>
            </p>
        </details>

    </body>
    <script>
        (function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){
        (i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),
        m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)
        })(window,document,'script','https://www.google-analytics.com/analytics.js','ga');
        ga('create', 'UA-97026578-2', 'auto');
        ga('send', 'pageview');
    </script>
</html>