= function(string, mod) sum(utf8ToInt(string)) %% mod
hash hash("Bruce", 5)
[1] 2
hash("Bruce2", 5)
[1] 2
hash("Haley", 5)
[1] 4
The topic of data structures focuses on how we represent, access and store data, ideally in ways efficient for our purpose. This chapter is a low level teaser of data structures geared towards students who haven’t been exposed to the ideas much at all.
A common starting point for data structures is a hash table. Suppose we want to store a set of values
associated with a set of keys
. Consider a list of some students and their PhD theses.
key | value |
---|---|
Leena | Modelling biomedical data and the foundations of bioequivalence |
Xianbin | Modeling composite outcomes and their component parts |
Shu-Chih | Structure/function relationships in the analysis of anatomical and functional neuroimaging data |
Haley | Statistical methods for inter-subject analysis of neuroscience data |
Bruce | From individuals to populations: application and insights concerning the generalized linear mixed model |
If we stored these in an text array, say, and we wanted to look up Bruce’s thesis title, we’d have to check each key in turn until we arived at Bruce and then looked up his thesis. This has a worst case scenario of n operations. (Question, if we looked in the order of a random permutation, what is the expected number of lookups?)
Hash tables map the keys to a specific lookup number. Thus, when trying to find Bruce’s value, the has function would perform hash("Bruce")
to get the hash value, then to straight to that point in the array. Sounds great!
There are some details, of course. Most (all) programming languages have hash tables, or libraries that add on hash tables. For example, the dict
structure in python. Since that exists, let’s work in R and create our own hash table.
We need a hash function. Let’s create one as the sum of its utf8 values
= function(string, mod) sum(utf8ToInt(string)) %% mod
hash hash("Bruce", 5)
[1] 2
hash("Bruce2", 5)
[1] 2
hash("Haley", 5)
[1] 4
Here the mod is used to truncate our integer value so that our it fits in our list. In our case, let’s assume the list is of length no larger than 5. Ideally, you want you hash functions to have as few collisions
, instances where different key texts give the same hash value, as possible. For our simple example, we’re not going to stress out over this much and our hash function is going to give lots of collisions. Let’s create an empty hash table
= vector(mode = "list", length = 5) hash_table
Now, let’s add an element
##Note this operates on hash_table outside of the function
= function(key, value, hash_table){
add_pair = length(hash_table)
n = list(c(key, value))
new_entry = hash(key, n)
hash_value = hash_table[[hash_value]]
hash_entry if (is.null(hash_entry)){
= new_entry
hash_table[[hash_value]]
}else {
= c(hash_entry, new_entry)
hash_table[[hash_value]]
}return(hash_table)
}= add_pair("Bruce", "From individuals to populations", hash_table)
hash_table = add_pair("Bruce2", "From individuals to populations2", hash_table)
hash_table = add_pair("Haley", "Statistical methods", hash_table)
hash_table hash_table
[[1]]
NULL
[[2]]
[[2]][[1]]
[1] "Bruce" "From individuals to populations"
[[2]][[2]]
[1] "Bruce2" "From individuals to populations2"
[[3]]
NULL
[[4]]
[[4]][[1]]
[1] "Haley" "Statistical methods"
[[5]]
NULL
A nefarious character named Bruce2
submitted an incremental thesis. But, alas, must go into the hash table. The keys Bruce
and Bruce2
result in collisions from our hash function, keys that have the same hash value. We’re adopting a collision strategy where we just add collision keys in turn. We should also write some code that prevents us from adding the same exact key twice. As it stands we could add Bruce twice when keys need to be unique.
Let’s write a retrieval function. It needs to find the appropriate hash value, then search within that list for the appropriate key.
= function(key, hash_table){
retrieve = length(hash_table)
n = hash(key, n)
hash_value = hash_table[[hash_value]]
hash_entry ## If there's nothing there return null
if (is.null(hash_entry)){
return(NULL)
}else {
= sapply(hash_entry, function(x) x[1])
keys = key == keys
key_test if (any(key_test)){
= which(key_test)
key_index return(hash_entry[[key_index]][2])
}## If the key isn't there return null
else return(NULL)
}
}retrieve("Bruce", hash_table)
[1] "From individuals to populations"
retrieve("Bruce2", hash_table)
[1] "From individuals to populations2"
retrieve("bruce", hash_table)
NULL