6  Data structures

6.1 Data structures

The topic of data structures focuses on how we represent, access and store data, ideally in ways efficient for our purpose. This chapter is a low level teaser of data structures geared towards students who haven’t been exposed to the ideas much at all.

6.2 Hash tables

A common starting point for data structures is a hash table. Suppose we want to store a set of values associated with a set of keys. Consider a list of some students and their PhD theses.

key value
Leena Modelling biomedical data and the foundations of bioequivalence
Xianbin Modeling composite outcomes and their component parts
Shu-Chih Structure/function relationships in the analysis of anatomical and functional neuroimaging data
Haley Statistical methods for inter-subject analysis of neuroscience data
Bruce From individuals to populations: application and insights concerning the generalized linear mixed model

If we stored these in an text array, say, and we wanted to look up Bruce’s thesis title, we’d have to check each key in turn until we arived at Bruce and then looked up his thesis. This has a worst case scenario of n operations. (Question, if we looked in the order of a random permutation, what is the expected number of lookups?)

Hash tables map the keys to a specific lookup number. Thus, when trying to find Bruce’s value, the has function would perform hash("Bruce") to get the hash value, then to straight to that point in the array. Sounds great!

There are some details, of course. Most (all) programming languages have hash tables, or libraries that add on hash tables. For example, the dict structure in python. Since that exists, let’s work in R and create our own hash table.

We need a hash function. Let’s create one as the sum of its utf8 values

hash = function(string, mod) sum(utf8ToInt(string)) %% mod
hash("Bruce", 5)
[1] 2
hash("Bruce2", 5)
[1] 2
hash("Haley", 5)
[1] 4

Here the mod is used to truncate our integer value so that our it fits in our list. In our case, let’s assume the list is of length no larger than 5. Ideally, you want you hash functions to have as few collisions, instances where different key texts give the same hash value, as possible. For our simple example, we’re not going to stress out over this much and our hash function is going to give lots of collisions. Let’s create an empty hash table

hash_table = vector(mode = "list", length = 5)

Now, let’s add an element

##Note this operates on hash_table outside of the function
add_pair = function(key, value, hash_table){
    n = length(hash_table)
    new_entry = list(c(key, value))
    hash_value = hash(key, n)
    hash_entry = hash_table[[hash_value]]
    if (is.null(hash_entry)){
        hash_table[[hash_value]] = new_entry
    }
    else {
        hash_table[[hash_value]] = c(hash_entry, new_entry)
    }
    return(hash_table)
}
hash_table = add_pair("Bruce", "From individuals to populations", hash_table)
hash_table = add_pair("Bruce2", "From individuals to populations2", hash_table)
hash_table = add_pair("Haley", "Statistical methods", hash_table)
hash_table
[[1]]
NULL

[[2]]
[[2]][[1]]
[1] "Bruce"                           "From individuals to populations"

[[2]][[2]]
[1] "Bruce2"                           "From individuals to populations2"


[[3]]
NULL

[[4]]
[[4]][[1]]
[1] "Haley"               "Statistical methods"


[[5]]
NULL

A nefarious character named Bruce2 submitted an incremental thesis. But, alas, must go into the hash table. The keys Bruce and Bruce2 result in collisions from our hash function, keys that have the same hash value. We’re adopting a collision strategy where we just add collision keys in turn. We should also write some code that prevents us from adding the same exact key twice. As it stands we could add Bruce twice when keys need to be unique.

Let’s write a retrieval function. It needs to find the appropriate hash value, then search within that list for the appropriate key.

retrieve = function(key, hash_table){
    n = length(hash_table)
    hash_value = hash(key, n)
    hash_entry = hash_table[[hash_value]]
    ## If there's nothing there return null
    if (is.null(hash_entry)){
        return(NULL)
    }
    else {
        keys = sapply(hash_entry, function(x) x[1])
        key_test = key == keys
        if (any(key_test)){
            key_index = which(key_test) 
            return(hash_entry[[key_index]][2])
        }
        ## If the key isn't there return null
        else return(NULL)
    }
}
retrieve("Bruce", hash_table)
[1] "From individuals to populations"
retrieve("Bruce2", hash_table)
[1] "From individuals to populations2"
retrieve("bruce", hash_table)
NULL