4. Hashing
Summary
TL;DR: This lecture discusses hashing in the context of set data structures, covering the limitations of comparison-based search and the need for hash functions that keep collisions rare. Jason Ku details the comparison model, in which items can only be compared by key, and introduces universal hashing as a way to improve hash table performance by bounding collision probabilities. He covers handling collisions through chaining, the expected performance of hash tables under universal hashing, and strategies for dynamically resizing hash tables as the number of items grows. The emphasis is on supporting search, insertion, and deletion efficiently while maintaining performance guarantees.
Key Takeaways
- 🔑 Understanding the limitations of traditional search methods.
- 📊 The role of the comparison model in data structure operations.
- ⚖️ Importance of universal hashing in reducing collisions.
- 🔄 Strategies for handling collisions using chaining.
- 📏 Expected chain lengths in universal hash tables should be constant.
- 📈 Dynamic resizing of hash tables helps manage increased data loads.
- 🗝️ Direct access arrays provide constant time access based on keys.
- ✨ Efficient hashing can greatly improve performance.
- 📝 Choosing good hash functions is crucial for success.
- 💡 The need for careful memory allocation in large data structures.
Timeline
- 00:00:00 - 00:05:00
Lecture 4 of 6.006 focuses on hashing, contrasting with the previous lecture on set and sequence data structures, emphasizing the need for efficient item retrieval via keys.
- 00:05:00 - 00:10:00
Explored two methods for implementing the set interface - using unsorted arrays with linear scans and sorted arrays with log(n) retrieval time, introducing the potential for faster data structure construction.
- 00:10:00 - 00:15:00
Introduced a comparison model to prove that finding an item cannot be faster than log(n) time, explaining the implications of comparisons in determining item positions in data structures.
- 00:15:00 - 00:20:00
Described the concept of a decision tree, emphasizing that the number of comparisons relates to the tree's structure and is influenced by the required outputs from search operations.
- 00:20:00 - 00:25:00
Discussed the importance of finding keys in a data structure, illustrating how the number of outputs determines the minimum height of binary trees in a search algorithm context, stressing that at least (n + 1) leaves are necessary.
- 00:25:00 - 00:30:00
Entered into the specifics of the comparison model, detailing how the number of comparisons directly influences algorithm efficiency and establishing a log(n) baseline for searching in this context.
- 00:30:00 - 00:35:00
Elaborated on using direct access arrays for quick key retrieval through index mapping, which theoretically allows for constant time retrieval and insertion but poses challenges related to memory usage.
- 00:35:00 - 00:40:00
Addressed the limitations of direct access arrays concerning large key spaces, proposing a hashing method to map large keys into a more manageable space to optimize data storage and performance.
- 00:40:00 - 00:45:00
Outlined potential collision problems stemming from hashing where multiple keys map to the same location, introducing the concept of chaining to handle collisions and ensure efficient retrieval of items.
- 00:45:00 - 00:52:54
Concluded with the concept of universal hash functions as a solution to collision issues in hashing, discussing how choosing hash functions randomly from a family of functions can mitigate performance issues and lead to expected constant chain lengths.
Video Q&A
What is the main topic of this lecture?
The lecture focuses on hashing and its applications in set data structures.
What is a comparison model?
A comparison model treats stored items as black boxes: the algorithm may only compare two keys (equal, less than, greater than) to distinguish them.
What is the significance of universal hashing?
Universal hashing helps minimize the probability of collisions in hash tables, improving performance.
What do you do when a hash table reaches its size limit?
When a hash table reaches its size limit, it can be resized and rebuilt to accommodate more items.
What are the two main operations discussed in relation to set data structures?
The main operations discussed are finding items by key and dynamically inserting or deleting items.
What is the expected length of chains in a universal hash table?
The expected length of chains can be constant if the hash table is properly sized.
How can you handle collisions in hash tables?
Collisions can be handled using chaining, where multiple items are stored at the same index.
What is the significance of log n time in data structure operations?
Log n time indicates efficient searching and improves performance over linear scanning.
What is a direct access array?
A direct access array allows for constant time access based on keys, but may require large memory allocation.
What is the limitation of using a simple modulus for hashing?
Using a simple modulus may lead to many collisions if the distribution of keys is not uniform.
Transcript
- 00:00:00
- 00:00:12JASON KU: Welcome to the fourth lecture of 6.006.
- 00:00:17Today we are going to be talking about hashing.
- 00:00:20Last lecture, on Tuesday, Professor Solomon
- 00:00:24was talking about set data structures,
- 00:00:29storing things so that you can query items
- 00:00:33by their key right, by what they intrinsically are--
- 00:00:37versus what Professor Demaine was talking
- 00:00:39about last week, which was sequence data structures, where
- 00:00:42we impose an external order on these items
- 00:00:46and we want you to maintain those.
- 00:00:49I'm not supporting operations where I'm looking stuff up
- 00:00:52based on what they are.
- 00:00:54That's what the set interface is for.
- 00:00:56So we're going to be talking a little bit more about the set
- 00:00:59interface today.
- 00:01:01On Tuesday, you saw two ways of implementing the set
- 00:01:05interface--
- 00:01:07one using just a unsorted array-- just,
- 00:01:09I threw these things in an array and I
- 00:01:12could do a linear scan of my items
- 00:01:14to support basically any of these operations.
- 00:01:17It's a little exercise you can go through.
- 00:01:19I think they show it to you in the recitation notes,
- 00:01:21but if you'd like to implement it for yourself, that's fine.
- 00:01:26And then we saw a slightly better data structure, at least
- 00:01:30for the find operations.
- 00:01:31Can I look something up, whether this key
- 00:01:34is in my set interface?
- 00:01:38We can do that faster.
- 00:01:39We can do that in log n time with a build overhead
- 00:01:43that's about n log n, because we showed you three ways to sort.
- 00:01:49Two of them were n squared.
- 00:01:51One of them was n log n, which is as good as we showed you
- 00:01:55how to do yesterday.
- 00:01:57So the question then becomes, can I build that data structure
- 00:02:00faster?
- 00:02:01That'll be a subject of next week's Thursday lecture.
- 00:02:04But this week we're going to concentrate on this static
- 00:02:08find.
- 00:02:09We got log n, which is an exponential improvement
- 00:02:11over linear, right, but the question now becomes,
- 00:02:17can I do faster than log n time?
- 00:02:21And what we're going to do at the first part of this lecture
- 00:02:24is show you that, no, you--
- 00:02:26AUDIENCE: [INAUDIBLE]
- 00:02:27JASON KU: What's up?
- 00:02:28No?
- 00:02:29OK-- that you can't do faster than log n time,
- 00:02:35in the caveat that we are in a slightly more restricted model
- 00:02:38of computation that we were-- than what we introduced
- 00:02:43to you a couple of weeks ago.
- 00:02:46And then so if we're not in that more constrained model
- 00:02:50of computation, we can actually do faster.
- 00:02:52
- 00:02:55Log n's already pretty good.
- 00:02:57Log n is not going to be larger than like 30 for any problem
- 00:03:03that you're going to be talking about in the real world
- 00:03:08on real computers, but a factor of 30 is still bad.
- 00:03:13I would prefer to do faster with those constant factors, when
- 00:03:17I can.
- 00:03:18It's not a constant factor.
- 00:03:19It's a logarithmic factor, but you get what I'm saying.
- 00:03:22OK, so what we're going to do is first
- 00:03:24prove that you can't do faster for--
- 00:03:27does everyone understand-- remember what find key meant?
- 00:03:32I have a key, I have a bunch of items that have keys associated
- 00:03:36with them, and I want to see if one of the items that I'm
- 00:03:39storing contains a key that is the same as the one
- 00:03:42that I searched for.
- 00:03:43The item might contain other things,
- 00:03:46but in particular, it has a search key
- 00:03:49that I'm maintaining the set on so that it supports
- 00:03:52find operations, search operations based on that key
- 00:03:56quickly.
- 00:03:56Does that make sense?
- 00:03:58So there's the find one that we want to improve,
- 00:04:00and we also want to improve this insert delete.
- 00:04:03We want to be-- make this data structure dynamic, because we
- 00:04:08might do those operations quite a bit.
- 00:04:11And so this lecture's about optimizing those three things.
- 00:04:15OK, so first, I'm going to show you
- 00:04:17that we can't do faster than log n for find, which
- 00:04:22is a little weird.
- 00:04:23OK, the model of computation I'm going
- 00:04:26to be proving this lower bound on--
- 00:04:28
- 00:04:31how I'm going to approach this is I'm going to say that
- 00:04:33any way that I store these--
- 00:04:37the items that I'm storing in this data structure--
- 00:04:45for any way I store these things, any algorithm
- 00:04:45of this certain type is going to require
- 00:04:48at least logarithmic time.
- 00:04:50That's what we're going to try to prove.
- 00:04:52And the model of computation that's
- 00:04:55weaker than what we've been talking about previously
- 00:04:58is what I'm going to call the comparison model.
- 00:05:00
- 00:05:04And a comparison model means-- is that the items,
- 00:05:07the objects I'm storing--
- 00:05:10I can kind of think of them as black boxes.
- 00:05:12I don't get to touch these things, except the only way
- 00:05:15that I can distinguish between them is to say,
- 00:05:20given a key and an item, or two items, I can do a comparison
- 00:05:27on those keys.
- 00:05:28Are these keys the same?
- 00:05:31Is this key bigger than this one?
- 00:05:34Is it smaller than this one?
- 00:05:35Those are the only operations I get to do with them.
- 00:05:40Say, if the keys are numbers, I don't get
- 00:05:42to look at what number that is.
- 00:05:44I just get to take two keys and compare them.
- 00:05:46And actually, all of the sorting algorithms
- 00:05:49that we saw on Tuesday were comparison sort algorithms.
- 00:05:53What you did was stepped through the program.
- 00:05:56At some point, you came to a branch
- 00:05:59and you looked at two keys, and you
- 00:06:01branched based on whether one key was bigger than another.
- 00:06:06That was a comparison.
- 00:06:07And then you move some stuff around,
- 00:06:09but that was the general paradigm.
- 00:06:11Those three sorting operations lived in this comparison model.
- 00:06:17You've got a comparison operations,
- 00:06:20like are they equal, less than, greater than,
- 00:06:25maybe greater than or equal, less than or equal?
- 00:06:28Generally, you have all these operations
- 00:06:30that you could do-- maybe not equal.
- 00:06:32
- 00:06:35But the key thing here is that there are only
- 00:06:38two possible outputs to each of these comparators.
- 00:06:40
- 00:06:44There's only one thing that I can branch on.
- 00:06:46It's going to branch into two different lines.
- 00:06:49It's either true and I do some other computation,
- 00:06:52or it's false and I'll do a different set of computation.
- 00:06:56That makes sense?
- 00:06:58So what I'm going to do is I'm going
- 00:06:59to give you a comparison--
- 00:07:02an algorithm in the comparison model
- 00:07:05as what I like to call a decision tree.
- 00:07:08So if I specify an algorithm to you,
- 00:07:10the first thing it's going to do-- if I don't compare items
- 00:07:13at all, I'm kind of screwed, because I'll never
- 00:07:15be able to tell if my keys in there or not.
- 00:07:17So I have to do some comparisons.
- 00:07:21So I'll do some computation.
- 00:07:23Maybe I find out the length of the array
- 00:07:25and I do some constant time stuff, but at some point,
- 00:07:28I'll do a comparison, and I'll branch.
- 00:07:31I'll come to this node, and if the comparison--
- 00:07:35maybe a less than--
- 00:07:37if it's true, I'm going to go this way in my computation,
- 00:07:41and if it's false, I'm going to go this way in my computation.
- 00:07:45And I'm going to keep doing that with various comparisons--
- 00:07:51sure-- until I get down here to some leaf at which
- 00:08:02I'm not branching.
- 00:08:04The internal nodes here are representing comparisons,
- 00:08:07but the leaves are representing--
- 00:08:09I stopped my computation.
- 00:08:11I'm outputting something.
- 00:08:13Does that make sense, what I'm trying to do?
- 00:08:16I'm changing my algorithm to be put
- 00:08:20in this kind of graphical way, where I'm branching what
- 00:08:24my program could possibly do based on the comparisons
- 00:08:28that I do.
- 00:08:30I'm not actually counting the rest of the work
- 00:08:33that the program does.
- 00:08:35I'm really only looking at the comparisons,
- 00:08:37because I know that I need to compare some things eventually
- 00:08:41to figure out what my items are.
- 00:08:44And if that's the only way I can distinguish items,
- 00:08:47then I have to do those comparisons to find out.
- 00:08:49Does that make sense?
- 00:08:51All right, so what I have is a binary tree
- 00:08:56that's representing the comparisons done
- 00:08:58by the algorithm.
- 00:08:59OK.
- 00:09:01So it starts at one comparison and then it branches.
- 00:09:04How many leaves must I have in my tree?
- 00:09:07
- 00:09:10What does that question mean, in terms of the program?
- 00:09:15AUDIENCE: [INAUDIBLE]
- 00:09:16JASON KU: What's up?
- 00:09:17AUDIENCE: The number of comparisons--
- 00:09:18JASON KU: The number of comparisons-- no,
- 00:09:20that's the number of internal nodes
- 00:09:21that I have in the algorithm.
- 00:09:23And actually, the number of comparisons
- 00:09:25that I do in an execution of the algorithm
- 00:09:27is just along a path from here to the-- to a leaf.
- 00:09:32So what do the leaves actually represent?
- 00:09:34Those represent outputs.
- 00:09:36I'm going to output something here.
- 00:09:39Yep?
- 00:09:40AUDIENCE: [INAUDIBLE]
- 00:09:41JASON KU: The number of--
- 00:09:42OK.
- 00:09:42
- 00:09:45So what is the output to my search algorithm?
- 00:09:47Maybe it's the-- an index of an item that contains this key.
- 00:09:52Or maybe I return the item is the output--
- 00:09:58the item of the thing I'm storing.
- 00:09:59And I'm storing n things, so I need at least n outputs,
- 00:10:04because I need to be able to return any of the items
- 00:10:07that I'm storing based on a different search parameter,
- 00:10:11if it's going to be correct.
- 00:10:12I actually need one more output.
- 00:10:13Why do I need one more output?
- 00:10:15
- 00:10:17If it's not in there--
- 00:10:20so any correct comparison searching algorithm--
- 00:10:26I'm doing some comparisons to find this thing--
- 00:10:30needs to have at least n plus 1 leaves.
- 00:10:34
- 00:10:38Otherwise, it can't be correct, because I could look up
- 00:10:41the one that I'm not returning in that set
- 00:10:44and it would never be able to return that value.
- 00:10:47Does that make sense?
- 00:10:50Yeah?
- 00:10:50AUDIENCE: [INAUDIBLE]
- 00:10:51JASON KU: What's n?
- 00:10:53For a data structure, n is the number
- 00:10:55of things stored in that data structure at that time--
- 00:10:58so the number of items in the data structure.
- 00:11:00That's what it means in all of these tables.
- 00:11:03Any other questions?
- 00:11:05OK, so now we get to the fun part.
- 00:11:09How many comparisons does this algorithm have to do?
- 00:11:13
- 00:11:16Yeah, up there--
- 00:11:17AUDIENCE: [INAUDIBLE]
- 00:11:19JASON KU: What's up?
- 00:11:22All right, your colleague is jumping ahead for a second,
- 00:11:25but really, I have to do as many comparisons in the worst case
- 00:11:30as the longest root-to-leaf path in this tree--
- 00:11:35because as I'm executing this algorithm,
- 00:11:37I'll go down this thing, always branching down,
- 00:11:42and at some point, I'll get to a leaf.
- 00:11:44And in the worst case, if I happen
- 00:11:47to need to return this particular output,
- 00:11:51then I'll have to walk down the longest thing, just the longest
- 00:11:55path.
- 00:11:57So then the longest path is the same as the height of the tree,
- 00:12:01so the question then becomes, what
- 00:12:04is the minimum height of any binary tree that has at least n
- 00:12:10plus 1 leaves?
- 00:12:13Does everyone understand why we're asking that question?
- 00:12:18Yeah?
- 00:12:19AUDIENCE: Could you go over again why it needs n plus 1 leaves?
- 00:12:22JASON KU: Why it needs n plus 1 leaves--
- 00:12:24if it's a correct algorithm, it needs to return--
- 00:12:27it needs to be able to return any of the n items
- 00:12:30that I'm storing or say that the key that I'm looking for
- 00:12:33is not there--
- 00:12:35great question.
- 00:12:37OK, so what is the minimum height
- 00:12:40of any binary tree that has n plus 1--
- 00:12:44at least n plus 1 leaves?
- 00:12:48You can actually state a recurrence for that
- 00:12:50and solve that.
- 00:12:50You're going to do that in your recitation.
- 00:12:52But it's log n.
- 00:12:53The best you can do is if this is a balanced binary tree.
- 00:12:57So the min height is going to be at least log n height.
- 00:13:10
- 00:13:14Or the min height is logarithmic,
- 00:13:17so it's actually theta right here.
- 00:13:19But if I just said height here, I
- 00:13:21would be lower bounding the height.
- 00:13:24I could have a linear height, if I just chained comparisons
- 00:13:28down one by one, if I was doing a linear search, for example.
- 00:13:34All right, so this is saying that, if I'm just restricting
- 00:13:36to comparisons, I have to spend at least logarithmic time
- 00:13:40to be able to find whether this key is in my set.
- 00:13:43
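To restate the board's counting argument compactly, here is the bound in LaTeX (a direct transcription of what was just said, nothing new):

```latex
% A correct comparison-based find is a binary decision tree with at
% least n+1 leaves (one per stored item, plus one for "not found").
% A binary tree of height h has at most 2^h leaves, so:
2^{h} \ge n + 1
\quad\Longrightarrow\quad
h \ge \lceil \log_2 (n + 1) \rceil = \Omega(\log n)
```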
- 00:13:46But I don't want logarithmic time.
- 00:13:48I want faster.
- 00:13:49So how can I do that?
- 00:13:51AUDIENCE: [INAUDIBLE]
- 00:13:51JASON KU: I have one operation in my model of computation
- 00:13:54I presented a couple of weeks ago
- 00:13:56that allows me to do faster, which allows me to do something
- 00:14:00stronger than comparisons.
- 00:14:03Comparisons have a constant branching factor.
- 00:14:06In particular, I can--
- 00:14:08if I do this operation-- this constant time operation--
- 00:14:11I can branch to two different locations.
- 00:14:17It's like an if kind of situation-- if, or else.
- 00:14:21And in fact, if I had constant branching factor
- 00:14:24for any constant here--
- 00:14:28if I had three or four, if it was bounded by a constant,
- 00:14:31the height of this tree would still
- 00:14:32be bounded by a log base the constant
- 00:14:36of that number of leaves.
- 00:14:39So I need, in some sense, to be able to branch
- 00:14:42a non-constant amount.
- 00:14:45So how can I branch a non-constant amount?
- 00:14:49This is a little tricky.
- 00:14:51We had this really neat operation in the random access
- 00:14:57machine that we could randomly go
- 00:15:01to any place in memory in constant time
- 00:15:03based on a number.
- 00:15:04
- 00:15:08That was a super powerful thing, because
- 00:15:10within a single constant time operation,
- 00:15:12I could go to any space in memory.
- 00:15:15That's potentially much larger than linear branching factor,
- 00:15:19depending on the size of my model
- 00:15:20and the size of my machine.
- 00:15:22So that's a very powerful operation.
- 00:15:24Can we use that to find quicker?
- 00:15:27Anyone have any ideas?
- 00:15:28
- 00:15:31Sure.
- 00:15:32AUDIENCE: [INAUDIBLE]
- 00:15:33JASON KU: We're going to get to hashing in a second,
- 00:15:35but this is a simpler concept than hashing--
- 00:15:40something you probably are familiar with already.
- 00:15:44We've kind of been using it implicitly
- 00:15:46in some of our sequence data structure things.
- 00:15:50What we're going to do is, if I have an item that has key 10,
- 00:15:57I'm going to keep an array and store that item 10 spaces away
- 00:16:04from the front of the array, right at index 9,
- 00:16:07or the 10th index.
- 00:16:09Does that make sense?
- 00:16:11If I store that item at that location in memory,
- 00:16:14I can use this random access to that location
- 00:16:19and see if there's something there.
- 00:16:21If there's something there, I return that item.
- 00:16:23Does that make sense?
- 00:16:24This is what I call a direct access array.
- 00:16:26
- 00:16:29It's really no different than the arrays
- 00:16:32that we've been talking about earlier in the class.
- 00:16:38We got an array, and if I have an item here
- 00:16:43with key equals 10, I'll stick it here in the 10th place.
- 00:16:50Now, I can only now store one item with the key 10
- 00:16:56in my thing, and that's one of the stipulations we
- 00:16:58had on our set data structures.
- 00:17:00If we tried to insert something with the same key
- 00:17:03as something already stored there,
- 00:17:04we're going to replace the item.
- 00:17:06That's what the semantics of our set interface was.
- 00:17:09But that's OK.
- 00:17:10That's satisfying the conditions of our set interface.
- 00:17:14So if we store it there, that's fantastic.
- 00:17:17How long does it take to find, if we
- 00:17:19have an item with the key 10?
- 00:17:23It takes constant time, worst case--
- 00:17:25great.
- 00:17:27How about inserting or deleting something?
- 00:17:29AUDIENCE: [INAUDIBLE]
- 00:17:30JASON KU: What's that?
- 00:17:31AUDIENCE: [INAUDIBLE]
- 00:17:32JASON KU: Again, constant time--
- 00:17:34we've solved all our problems.
- 00:17:36This is amazing.
- 00:17:36
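A minimal sketch of a direct access array in Python (the class and Item names here are illustrative, not from the lecture); find, insert, and delete are each a single array access, O(1) worst case:

```python
from collections import namedtuple

Item = namedtuple('Item', ['key', 'value'])  # an item carries its search key

class DirectAccessArray:
    """Set interface over integer keys 0 <= key < u; one slot per possible key."""
    def __init__(self, u):
        self.A = [None] * u      # Theta(u) space: the cost of this scheme

    def find(self, key):         # O(1) worst case
        return self.A[key]

    def insert(self, x):         # O(1); replaces any item with the same key
        self.A[x.key] = x

    def delete(self, key):       # O(1)
        self.A[key] = None
```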
- 00:17:39OK.
- 00:17:40What's not amazing about this?
- 00:17:42Why don't we just do this all the time?
- 00:17:43
- 00:17:47Yeah?
- 00:17:50AUDIENCE: You don't know how high the numbers go.
- 00:17:53JASON KU: I don't know how high the numbers go.
- 00:17:56So let's say I'm storing, I don't know,
- 00:17:59a number associated with that the 300 or 400 of you
- 00:18:03that are in this classroom.
- 00:18:05
- 00:18:08But I'm storing your MIT IDs.
- 00:18:10How big are those numbers?
- 00:18:12Those are like nine-digit numbers--
- 00:18:15pretty long numbers.
- 00:18:17So what I would need to do-- and if I was storing your keys
- 00:18:21as MIT IDs, I would need an array
- 00:18:25that has indices that span the entire
- 00:18:28space of nine-digit numbers.
- 00:18:33That's like 10 to the--
- 00:18:3710 to the 9.
- 00:18:37Thank you.
- 00:18:3810 to the 9 is the size of the direct access array I'd have
- 00:18:43to build to be able to use this technique
- 00:18:50to create a direct access array to search on your MIT IDs,
- 00:18:54when there's only really 300 of you in here.
- 00:18:57So 300 or 400 is an n that's much
- 00:19:00smaller than the size of the numbers
- 00:19:03that I'm trying to store.
- 00:19:04What I'm going to use as a variable
- 00:19:06to talk about the size of the numbers I'm storing--
- 00:19:09I'm going to say u is the maximum size of any number
- 00:19:12that I'm storing.
- 00:19:13It's the size of the universe of space of keys that I'm storing.
- 00:19:17Does that make sense?
- 00:19:19OK, so to instantiate a direct access array of that size,
- 00:19:24I have to allocate that amount of space.
- 00:19:26And so if that is much bigger than n,
- 00:19:31then I'm kind of screwed, because I'm
- 00:19:34using much more space.
- 00:19:36And these order operations are bad also, because essentially,
- 00:19:40if I am storing these things non-continuously,
- 00:19:46I kind of just have to scan down the thing
- 00:19:48to find the next element, for example.
- 00:19:52OK, what's your question?
- 00:19:53AUDIENCE: Is a direct access array
- 00:19:55a sequence data structure?
- 00:19:56JASON KU: A direct access array is a set data structure.
- 00:19:59That's why it's a set interface up there.
- 00:20:01
- 00:20:05Your colleague is asking whether you can use a direct access array
- 00:20:09to implement a set--
- 00:20:10I mean a sequence.
- 00:20:11And actually, I think you'll see in your recitation notes,
- 00:20:14you have code that can take a set data structure
- 00:20:19and implement sequence data structure,
- 00:20:20and take sequence data structure and implement a set data
- 00:20:23structure.
- 00:20:24They just won't necessarily have very good run time.
- 00:20:26So this direct access array semantics
- 00:20:29is really just good for these specific set operations.
- 00:20:34Does that make sense?
- 00:20:35Yeah?
- 00:20:35AUDIENCE: What is u?
- 00:20:36JASON KU: u is this the size of the largest key
- 00:20:39that I'm allowed to store.
- 00:20:40That makes sense?
- 00:20:42The direct access array is supporting up to u size keys.
- 00:20:47Does that make sense?
- 00:20:48OK, we're going to move on for a second.
- 00:20:51That's the problem, right?
- 00:20:52When u-- the largest key--
- 00:20:59
- 00:21:01we're assuming integers here--
- 00:21:04integer keys-- so in the comparison model,
- 00:21:10we could store any arbitrary objects
- 00:21:12that supported a comparison.
- 00:21:14Here we really need to have integer keys,
- 00:21:17or else we're not going to be able to use those as addresses.
- 00:21:21So we're making an assumption on the inputs
- 00:21:25that I can only store integers now.
- 00:21:27I can't store arbitrary objects--
- 00:21:29items with keys.
- 00:21:31And in particular, I also need to-- this is a subtlety
- 00:21:34that's in the word RAM model--
- 00:21:36how can I be assured that these keys can
- 00:21:39be looked up in constant time?
- 00:21:41
- 00:21:44I have this little CPU.
- 00:21:46It's got some number of registers it can act upon.
- 00:21:49How big are those registers?
- 00:21:52AUDIENCE: [INAUDIBLE]
- 00:21:53JASON KU: What?
- 00:21:54
- 00:21:56Right now, they're 64 bits, but in general, they're w.
- 00:21:59They're the size of your word on your machine.
- 00:22:042 to the w is the number of addresses I can access.
- 00:22:09If I'm going to be able to use this direct accessory,
- 00:22:11I need to make sure that the u is less than 2 to the w,
- 00:22:19if I want these operations to run in constant time.
- 00:22:22If I have keys that are much larger than this,
- 00:22:25I'm going to need to do something else,
- 00:22:28but this is kind of the assumption.
- 00:22:30In this class, when we give you an array of integers,
- 00:22:34or an array of strings, or something
- 00:22:35like that on your problem set or on an exam,
- 00:22:38the assumption is, unless we give you bounds
- 00:22:41on the size of those things--
- 00:22:45like the number of characters in your string
- 00:22:47or the size of the number in the--
- 00:22:49you can assume that those things will fit in one word of memory.
- 00:22:53
- 00:22:58w is the word size of your machine, the number of bits
- 00:23:04that your machine can do operations on in constant time.
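Stated as a constraint (w = 64 is just today's common case, not a requirement of the model):

```latex
% Direct access in O(1) needs every key to fit in one machine word:
u \le 2^{w} \qquad (\text{e.g., } w = 64 \text{ on current machines})
```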
- 00:23:08Any other questions?
- 00:23:10OK, so we have this problem.
- 00:23:12We're using way too much space, when we
- 00:23:15have a large universe of keys.
- 00:23:18So how do we get around that problem? Any ideas?
- 00:23:24
- 00:23:28Sure.
- 00:23:29AUDIENCE: Instead of [INAUDIBLE]..
- 00:23:31
- 00:23:36JASON KU: OK, so what your colleague is saying--
- 00:23:39instead of just storing one value at each place,
- 00:23:43maybe store more than one value.
- 00:23:47If we're using this idea, where I
- 00:23:50am storing my key at the index of the key,
- 00:23:53that's getting around us having
- 00:23:55to have unique keys in our data structure.
- 00:23:58It's not getting around this space usage problem.
- 00:24:02Does that make sense?
- 00:24:04We will end up storing multiple things at indices,
- 00:24:09but there's another trick that I'm looking for right now.
- 00:24:13We have a lot of space that we would
- 00:24:16need to allocate for this data structure.
- 00:24:19What's an alternative?
- 00:24:22Instead of allocating a lot of space, we allocate--
- 00:24:25
- 00:24:28less space.
- 00:24:30Let's allocate less space.
- 00:24:31All right.
- 00:24:32
- 00:24:36This is our space of keys, u.
- 00:24:38
- 00:24:40But instead, I want to store those things in a direct access
- 00:24:47array of maybe size n, something like the order of the things
- 00:24:53that I'm going to be storing.
- 00:24:55I'm going to relax that and say we're
- 00:24:57going to make this a length m that's
- 00:25:00around the size of the things I'm storing.
- 00:25:04
- 00:25:07And what I'm going to do is I'm going to try
- 00:25:09to map this space of keys--
- 00:25:12this large space of keys, from 0 to u minus 1
- 00:25:16or something like that--
- 00:25:18down to a range that's 0 to m minus 1.
- 00:25:21
- 00:25:24I'm going to want a function--
- 00:25:26this is what I'm going to call h--
- 00:25:29which maps this range down to a smaller range.
- 00:25:37
- 00:25:40Does that make sense?
- 00:25:41I'm going to have some function that
- 00:25:43takes that large space of keys--
- 00:25:44sticks them down here.
- 00:25:46
- 00:25:48And instead of storing it at an index of the key,
- 00:25:55I'm going to put the key through this function, the key space,
- 00:25:58into a compressed space and store it
- 00:26:02at that index location.
- 00:26:05Does that make sense?
- 00:26:06Sure.
- 00:26:07AUDIENCE: [INAUDIBLE]
- 00:26:10JASON KU: Your colleague is--
- 00:26:12comes up with the question I was going to ask right away,
- 00:26:15which was, what's the problem here?
- 00:26:17The problem is the potential that we might
- 00:26:21have to store more than one thing at the same index
- 00:26:26location.
- 00:26:27If I have a function that maps this big space down
- 00:26:31to this small space, I got to have
- 00:26:36multiple of these things going to the same places here, right?
- 00:26:40It can't be injective.
- 00:26:44But just based on pigeonhole principle,
- 00:26:45I have more of these things.
- 00:26:47At least two of them have to go to something over here.
- 00:26:50In fact, if I have, say, u is bigger than n squared,
- 00:26:54for example, there--
- 00:26:58for any function I give you that maps
- 00:27:00this large space down to the small space, n of these things
- 00:27:05will map to the same place.
- 00:27:08So if I choose a bad function here,
- 00:27:11then I'll have to store n things at the same index location.
- 00:27:16And if I go there, I have to check
- 00:27:19to see whether any of those are the things
- 00:27:21that I'm looking for.
- 00:27:22I haven't gained anything.
- 00:27:23I really want a hash function that will evenly distribute
- 00:27:27keys over this space.
- 00:27:29
- 00:27:32Does that make sense?
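The pigeonhole argument in symbols (u possible keys squeezed into m slots; the u ≥ n·m condition below is one way to make precise the "u bigger than n squared" case mentioned above):

```latex
% Some slot receives at least ceil(u/m) of the u possible keys, so for
% any fixed h an adversary can choose n stored keys that all hash to
% one slot whenever u >= n * m:
\exists\, t \in \{0, \dots, m-1\} :
  \bigl|\{\, k : h(k) = t \,\}\bigr| \;\ge\; \lceil u / m \rceil \;\ge\; n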
- 00:27:34But we have a problem here.
- 00:27:35If we need to store multiple things
- 00:27:37at a given location in memory--
- 00:27:41can't do that.
- 00:27:42I have one thing I can put there.
- 00:27:44So I have two options on how to deal--
- 00:27:46what I call collisions.
- 00:27:49If I have two items here, like a and b,
- 00:27:52these are different keys in my universe of space.
- 00:27:58But it's possible that they both map down
- 00:28:02to some hash that has the same value.
- 00:28:07
- 00:28:10If I first hash a, and a is--
- 00:28:14I put a there, where do I put b?
- 00:28:17
- 00:28:22There are two options.
- 00:28:25AUDIENCE: Is the second data structure [INAUDIBLE]
- 00:28:28so that it can store [INAUDIBLE]??
- 00:28:31JASON KU: OK, so what your colleague is saying--
- 00:28:34can I store this one is a linked list,
- 00:28:36and then I can just insert a guy right next to where it was?
- 00:28:40What's the problem there?
- 00:28:43Are linked lists good with direct accessing by an index?
- 00:28:48No, they're terrible with get_at and set_at.
- 00:28:51They take linear time there.
- 00:28:53So really, the whole point of a direct access array
- 00:28:55is that there is an array underneath,
- 00:28:57and I can do this index arithmetic
- 00:28:59and go down to the next thing.
- 00:29:01So I really don't want to replace a linked
- 00:29:03list as this data structure.
- 00:29:07Yeah?
- 00:29:07
- 00:29:10What's up?
- 00:29:11AUDIENCE: [INAUDIBLE]
- 00:29:13JASON KU: We can make it really unlikely.
- 00:29:15Sure.
- 00:29:17I don't know what likely means, because I'm
- 00:29:19giving you a hash function-- one hash function.
- 00:29:22And I don't know what the inputs are.
- 00:29:23Yeah?
- 00:29:26Go ahead.
- 00:29:26AUDIENCE: [INAUDIBLE]
- 00:29:31JASON KU: OK, right.
- 00:29:32So there are actually two solutions here.
- 00:29:36One is I-- maybe, if I choose m to be larger than n,
- 00:29:42there's going to be extra space in here.
- 00:29:45I'll just stick it somewhere else in the existing array.
- 00:29:49How I find an open space is a little complicated,
- 00:29:52but this is a technique called open addressing, which
- 00:29:57is much more common than the technique
- 00:30:00we're going to be talking about today in implementations.
- 00:30:04Python uses an open addressing scheme, which is essentially,
- 00:30:07find another place in the array to put this collision.
- 00:30:12Open addressing is notoriously difficult to analyze,
- 00:30:15so we're not going to do that in this class.
- 00:30:17There's a much easier technique that-- we
- 00:30:19have an implementation for you in the recitation handouts.
- 00:30:23It's what your colleague up here--
- 00:30:26I can't find him--
- 00:30:27over there was saying--
- 00:30:29was, instead of storing it somewhere else
- 00:30:31in the existing direct access array down here,
- 00:30:35which we usually call the hash table--
- 00:30:37
- 00:30:41instead of storing it somewhere else in that hash table,
- 00:30:43we'll instead, at that key, store a pointer
- 00:30:47to another data structure, some other data structure that
- 00:30:51can store a bunch of things-- just like any sequence data
- 00:30:54structure, like a dynamic array, or linked list,
- 00:30:56or anything right.
- 00:30:57All I need to do is be able to stick a bunch of things
- 00:30:59on there when there are collisions,
- 00:31:03and then, when I go up to look for that thing,
- 00:31:05I'll just look through all of the things in that data
- 00:31:09structure and see if my key exists.
- 00:31:11Does that make sense?
- 00:31:13Now, we want to make sure that those additional data
- 00:31:16structures, which I'll call chains--
- 00:31:19we want to make sure that those chains are short.
- 00:31:24I don't want them to be long.
- 00:31:27So what I'm going to do is, when I have this collision here,
- 00:31:29instead I'll have a pointer to some--
- 00:31:31I don't know-- maybe make it a dynamic array, or a linked
- 00:31:33list, or something like that.
- 00:31:35And I'll put a here and I'll b here.
- 00:31:38And then later, when I look up key K, or look up a or b--
- 00:31:46let's look up b--
- 00:31:48I'll go to this hash value here.
- 00:31:51I'll put it through the hash function.
- 00:31:52I'll go to this index.
- 00:31:54I'll go to the data structure, the chain associated
- 00:31:56to that index, and I'll look at all of these items.
- 00:31:59I'm just going to do a linear find.
- 00:32:01I'm going to look.
- 00:32:01
- 00:32:04I could put any data structure here,
- 00:32:06but I'm going to look at this one, see if it's b.
- 00:32:08It's not b.
- 00:32:09Look at this one-- it is b.
- 00:32:11I return yes.
- 00:32:12Does that make sense?
- 00:32:13So this is an idea called chaining.
- 00:32:15I can put anything I want there.
- 00:32:16Commonly, we talk about putting a linked list there,
- 00:32:20but you can put a dynamic array there.
- 00:32:24You can put a sorted array there to make it easier
- 00:32:27to check whether the key is there.
- 00:32:29You can put anything you want there.
- 00:32:30The point of this lecture is going
- 00:32:32to try to show that there's a choice of hash function
- 00:32:35I can make that makes sure that these chains are small so
- 00:32:42that it really doesn't matter how I store them there,
- 00:32:45because I can just--
- 00:32:46if there's a constant number of things stored there,
- 00:32:49I can just look at all of them and do whatever I want,
- 00:32:52and still get constant time.
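A sketch of chaining in Python (names are illustrative; items are anything with a .key, like the Item tuple above, and the simple key % m hash stands in for whatever h you choose):

```python
class ChainedHashTable:
    """Hash table with chaining: each of the m slots holds a list (the chain)."""
    def __init__(self, m):
        self.m = m
        self.table = [[] for _ in range(m)]    # m empty chains

    def _slot(self, key):
        return key % self.m                    # division method, for illustration

    def find(self, key):                       # O(length of one chain)
        for x in self.table[self._slot(key)]:  # linear scan of that chain
            if x.key == key:
                return x
        return None                            # the "not in there" outcome

    def insert(self, x):
        chain = self.table[self._slot(x.key)]
        for i, y in enumerate(chain):
            if y.key == x.key:                 # set semantics: replace same key
                chain[i] = x
                return
        chain.append(x)

    def delete(self, key):
        chain = self.table[self._slot(key)]
        for i, y in enumerate(chain):
            if y.key == key:
                chain.pop(i)
                return
```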
- 00:32:53Yeah?
- 00:32:54AUDIENCE: So does that means that, when you have [INAUDIBLE]
- 00:33:01let's just say, for some reason, the number of things
- 00:33:05[INAUDIBLE] is that most of them get multiple [INAUDIBLE]..
- 00:33:10Is it just a data structure that only holds one thing?
- 00:33:13JASON KU: Yeah.
- 00:33:13So what your colleague is saying is,
- 00:33:16at initialization, what is stored here?
- 00:33:19Initially, it points to an empty data structure.
- 00:33:22I'm just going to initialize all of these things to have--
- 00:33:25now, you get some overhead here.
- 00:33:27We're paying something for this-- some extra space
- 00:33:29and having pointer and another data structure
- 00:33:31at all of these things.
- 00:33:32Or you could have the semantics where,
- 00:33:34if I only have one thing here, I'm
- 00:33:36going to store that thing at this location,
- 00:33:38but if I have multiple, it points to a data structure.
- 00:33:41These are kind of complicated implementation details,
- 00:33:44but you get the basic idea.
- 00:33:46If I just have a 0 size data structure
- 00:33:49at all of these things, I'm still
- 00:33:50going to have a constant factor overhead.
- 00:33:54It's still going to be a linear size data structure,
- 00:33:57as long as m is linear in n.
- 00:33:59Does that make sense?
- 00:34:01OK.
- 00:34:02So how do we pick a good hash function?
- 00:34:05I already told you that any fixed hash
- 00:34:08function I give you is going to experience collisions.
- 00:34:12And if u is large, then there's the possibility that I--
- 00:34:20for some input, all of the things in my set
- 00:34:23go directly to the same hashed index value.
- 00:34:27So that ain't great.
- 00:34:29Let's ignore that for a second.
- 00:34:30What's the easiest way to get down
- 00:34:33from this large space of keys down to a small one?
- 00:34:36What's the easiest thing you could do?
- 00:34:38Yeah?
- 00:34:38AUDIENCE: [INAUDIBLE]
- 00:34:38JASON KU: Modulus-- great.
- 00:34:40This is called the division method.
- 00:34:41
- 00:34:51And what its function is, essentially,
- 00:34:54is it's going to take a key, and it's
- 00:34:56going to say h of k is equal to k mod m.
- 00:35:04I'm going to take something of a large space,
- 00:35:06and I'm going to mod it so that it just wraps around--
- 00:35:09
- 00:35:13perfectly valid thing to do.
- 00:35:15It satisfies what we're doing in a hash table.
- 00:35:18And if my keys are completely uniformly distributed--
- 00:35:24if, when I use my hash function, all of the keys
- 00:35:28here are uniformly distributed over this larger space, then
- 00:35:35actually, this isn't such a bad thing.
- 00:35:38But that's imposing some kind of distribution requirements
- 00:35:42on the type of inputs I'm allowed
- 00:35:43to use with this hash function for it
- 00:35:45to have good performance.
- 00:35:48But this plus a little bit of extra mixing and bit
- 00:35:53manipulation is essentially what Python does.
- 00:35:58Essentially, all it does is jumbles up
- 00:36:00that key for some fixed amount of jumbling,
- 00:36:05and then mods it by m, and sticks it there.
- 00:36:11It's hard coded in the Python library, what this hash
- 00:36:15function is, and so there exist some sequences of inserts
- 00:36:21into a hash table in Python which
- 00:36:24will be really bad in terms of performance,
- 00:36:26because these chain lengths-- the number of collisions
- 00:36:30that I'll get at a single hash-- are going to be large.
- 00:36:35But they do that for other reasons.
- 00:36:36They want a deterministic hash function.
- 00:36:38They want something that I do the program again--
- 00:36:41it's going to do the same thing underneath.
- 00:36:45But sometimes Python gets it wrong.
- 00:36:47But if your data that you're storing
- 00:36:50is sufficiently uncorrelated to the hash function
- 00:36:53that they've chosen--
- 00:36:54which, usually, it is--
- 00:36:56this is a pretty good performance.
- 00:36:58But this is not a practical class.
- 00:37:03Well, it is a practical class, but one of the things
- 00:37:05that we are--
- 00:37:07that's the emphasis of this class
- 00:37:09is making sure we can prove that this is good in theory as well.
- 00:37:13I don't want to know that sometimes this will be good.
- 00:37:17I really want to know that, if I choose--
- 00:37:21if I make this data structure and I put some inputs on it,
- 00:37:26I want a running time that is independent on what
- 00:37:30inputs I decided to use, independent of what keys
- 00:37:34I decided to store.
- 00:37:35Does that make sense?
- 00:37:36
- 00:37:40But it's impossible for me to pick a fixed hash function that
- 00:37:44will achieve this, because I just
- 00:37:45told you that, if u is large--
- 00:37:48this is u-- if u is large, then there
- 00:37:52exists inputs that map everything to one place.
- 00:37:55
- 00:37:57I'm screwed, right?
- 00:37:58There's no way to solve this problem.
- 00:38:00
- 00:38:03That's true if I want a deterministic hash function--
- 00:38:06I want the thing to be repeatable,
- 00:38:07to do the same thing over and over again
- 00:38:09for any set of inputs.
- 00:38:12What can I do instead?
- 00:38:14Weaken my notion of what constant time is to do better--
- 00:38:18
- 00:38:22OK, use a non-deterministic--
- 00:38:24what does non-deterministic mean?
- 00:38:26It means don't choose a hash function up front--
- 00:38:31choose one randomly later.
- 00:38:34So have the user--
- 00:38:35they pick whatever inputs they're going to do,
- 00:38:38and then I'm going to pick a hash function randomly.
- 00:38:40They don't know which hash function I'm going to pick,
- 00:38:42so it's hard for them to give me an input that's bad.
- 00:38:45
- 00:38:49I'm going to choose a random hash function.
- 00:38:52Can I choose a hash function from the space
- 00:38:55of all hash functions?
- 00:38:58What is the space of all hash functions of this form?
- 00:39:00
- 00:39:03For every one of these values, I give a value in here.
- 00:39:06
- 00:39:10For each one of these, I choose an independently random number
- 00:39:12in this range-- how many such hash functions are there?
- 00:39:15
- 00:39:19m to the u-- that's a lot of things.
- 00:39:25So I can't do that.
- 00:39:26What I can do is fix a family of hash functions
- 00:39:29where, if I choose one from-- randomly,
- 00:39:32I get good performance.
- 00:39:33And so the hash function I'm going to use,
- 00:39:36and we're going to spend the rest of the time on,
- 00:39:39is what I call a universal hash function.
- 00:39:43It satisfies what we call a universal hash property--
- 00:39:47so universal hash function.
- 00:39:53And this is a little bit of a weird nomenclature,
- 00:39:56because I'm defining this to you as the universal hash function,
- 00:40:01but actually, universal is a descriptor.
- 00:40:05There exist many universal hash functions.
- 00:40:09This just happens to be an example of one of them.
- 00:40:12OK?
- 00:40:12
- 00:40:23So here's the hash function--
- 00:40:27doesn't look actually all that different.
- 00:40:32Goodness gracious-- how many parentheses are there--
- 00:40:36mod p, mod m.
- 00:40:41OK.
- 00:40:41So it's kind of doing the same thing as what's happening up
- 00:40:44here, but before modding by m, I'm multiplying it by a number,
- 00:40:52I'm adding a number, I'm taking it mod another number,
- 00:40:55and then I'm modding by m.
- 00:40:57This is a little weird.
- 00:40:58And not only that-- this is still a fixed hash function.
- 00:41:02I don't want that.
- 00:41:03I want to generalize this to be a family of hash functions,
- 00:41:10which are this h_ab of k for some random choice of a,
- 00:41:21b in this larger range.
- 00:41:26
- 00:41:29All right, this is a lot of notation here.
- 00:41:34Essentially what this is saying is, I have a hash family.
- 00:41:40It's parameterized by the length of my hash table
- 00:41:43and some fixed large random prime that's bigger than u.
- 00:41:48I'm going to pick some large prime number,
- 00:41:52and that's going to be fixed when I make the hash table.
- 00:41:55
- 00:41:58And then, when I instantiate the hash table,
- 00:42:02I'm going to choose randomly one of these things
- 00:42:06by choosing a random a and a random b from this range.
- 00:42:10Does that make sense?
- 00:42:12AUDIENCE: [INAUDIBLE]
- 00:42:16JASON KU: This is a not equal to 0.
- 00:42:19If I had 0 here, I lose the key information,
- 00:42:22and that's no good.
- 00:42:23
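A sketch of drawing one function from this family in Python (the specific prime is an assumption standing in for any fixed prime p larger than the key universe u):

```python
import random

def random_universal_hash(m, p=2**31 - 1):
    """Draw h_ab(k) = ((a*k + b) mod p) mod m with a, b chosen at random.
    p is a fixed prime assumed to be larger than the key universe u."""
    a = random.randrange(1, p)   # a != 0, or the key information is lost
    b = random.randrange(0, p)
    return lambda k: ((a * k + b) % p) % m

# Example: the table draws its hash at startup, unknown to the user.
h = random_universal_hash(100)
print(h(123456789))              # some slot in 0..99
```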
- 00:42:26Does this make sense?
- 00:42:27So what this is doing is multiplying this key
- 00:42:30by some random number, adding some random number,
- 00:42:34modding by this prime, and then modding
- 00:42:37by the size of my thing.
- 00:42:39So it's doing a bunch of jumbling,
- 00:42:41and there's some randomness involved here.
- 00:42:43I'm choosing the hash function by choosing an a,
- 00:42:46b randomly from this thing.
- 00:42:47So when I start up my program, I'm
- 00:42:53going to instantiate this thing with some random a and b,
- 00:42:56not deterministically.
- 00:42:58The user, when they're using this thing,
- 00:43:01doesn't know which a and b I picked,
- 00:43:04so it's really hard for them to give me a bad example.
- 00:43:07And this universal hash function--
- 00:43:11this universal hash family, shall we say-- really,
- 00:43:13this is a family of functions, and I'm choosing one randomly
- 00:43:17within that family--
- 00:43:20is universal.
- 00:43:21And universality says that--
- 00:43:26what is the property of universality?
- 00:43:30It means that the probability, by choosing a hash function
- 00:43:34from this hash family, that a certain key collides
- 00:43:43with another key is less than or equal to 1/m for all--
- 00:43:52any two different keys in my universe.
- 00:43:57
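The property written out (just the definition being stated here, in LaTeX):

```latex
% Universality: over the random choice of h from the family H,
% any two fixed distinct keys collide with probability at most 1/m.
\Pr_{h \in H}\bigl[\, h(k_i) = h(k_j) \,\bigr] \;\le\; \frac{1}{m}
\qquad \text{for all } k_i \ne k_j \in \mathcal{U}
```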
- 00:44:02Does that make sense?
- 00:44:03
- 00:44:05Basically, this thing has the property that, if I randomly--
- 00:44:10for any two keys that I pick in my universe space,
- 00:44:16if I randomly choose a hash function,
- 00:44:19the probability that these things collide
- 00:44:22is less than 1/m.
- 00:44:23Why is that good?
- 00:44:25This is, in some sense, a measure
- 00:44:26of how well distributed these things are.
- 00:44:30I want these things to collide with 1/m probability
- 00:44:35so that these things don't collide very--
- 00:44:39it's not very likely for these things to collide.
- 00:44:41Does that make sense?
- 00:44:43So we want proof that this hash family satisfies
- 00:44:46this universality property.
- 00:44:48You'll do that in 046.
- 00:44:50But we can use this result to show that,
- 00:44:54if we use a universal-- this universal hash family,
- 00:44:58that the length of our change--
- 00:45:01chains is expected to be constant length.
- 00:45:06So we're going to use this property to prove that.
- 00:45:10How do we prove that?
- 00:45:11We're going to do a little probability.
- 00:45:15So how are we going to prove that?
- 00:45:16I'm going to define a random variable, an indicator
- 00:45:20random variable.
- 00:45:20Does anyone remember what an indicator in a variable is?
- 00:45:23Yeah, it's a variable that, with some amount of probability,
- 00:45:28is 1, and 1 minus that probability is 0.
- 00:45:33So I'm going to define this indicator
- 00:45:35random variable xij is a random variable over my choice--
- 00:45:44over choice of a hash function in my hash family.
- 00:45:50And what does this mean?
- 00:45:52It means xij equals 1, if h of Ki equals h of Kj--
- 00:46:04these things collide-- and 0 otherwise.
- 00:46:09
- 00:46:13So I'm choosing randomly over this hash family.
- 00:46:18If, for two keys--
- 00:46:22keys i and j--
- 00:46:24if these things collide, that's going to be 1.
- 00:46:27If they don't, then it's 0.
- 00:46:29OK?
- 00:46:30Then, how can we write a formula for the length
- 00:46:34of a chain in this model?
- 00:46:37So the size of a chain--
- 00:46:39
- 00:46:43or let's put it here--
- 00:46:46the size of the chain at i--
- 00:46:55at i in my hash table--
- 00:46:58is going to equal--
- 00:47:00I'm going to call that the random variable xi--
- 00:47:03that's going to equal the sum over j equals 0 to--
- 00:47:07
- 00:47:10what is it-- over, I think, u minus 1 of summation--
- 00:47:17or sorry-- of xij.
- 00:47:20So basically, if I fix this location i,
- 00:47:33this is where this key goes.
- 00:47:35
- 00:47:38Sorry.
- 00:47:38This is the size of chain at h of Ki.
- 00:47:44Sorry.
- 00:47:45So I look at wherever Ki goes is hashed,
- 00:47:49and I see how many things collide with it.
- 00:47:52I'm just summing over all of these things,
- 00:47:55because this is 1 if there's a collision and 0 if there's not.
- 00:47:58Does that make sense?
- 00:48:00So this is the size of the chain at the index location mapped
- 00:48:04to by Ki.
- 00:48:06
- 00:48:09So here's where your probability comes in.
- 00:48:13What's the expected value of this chain
- 00:48:15length over my random choice?
- 00:48:18Expected value of choosing a hash function
- 00:48:22from this universal hash family of this chain length--
- 00:48:25
- 00:48:29I can put in my definition here.
- 00:48:31That's the expected value of the summation over j of xij.
- 00:48:38
- 00:48:45What do I know about expectations and summations?
- 00:48:49
- 00:48:53If these variables are independent from each other--
- 00:48:56AUDIENCE: [INAUDIBLE]
- 00:48:58JASON KU: Say what?
- 00:49:00AUDIENCE: [INAUDIBLE]
- 00:49:02JASON KU: Linearity of expectation--
- 00:49:05basically, the expectation sum of these independent random
- 00:49:08variables is the same as the summation
- 00:49:10of their expectations.
- 00:49:12So this is equal to the summation
- 00:49:14over j of the expectations of these individual ones.
- 00:49:18
- 00:49:26One of these j's is the same as i.
- 00:49:32j loops over all of the things from 0 to u minus 1.
- 00:49:37One of them is i, so when j equals i, what is the expected value
- 00:49:47that they collide?
- 00:49:491-- so I'm going to refactor this
- 00:49:52as being this, where j does not equal i, plus 1.
- 00:49:59Are people OK with that?
- 00:50:00Because if i equals--
- 00:50:04if j and i are equal, they definitely collide.
- 00:50:08They're the same key.
- 00:50:10So I'm expected to have one guy there, which
- 00:50:13was the original key, xi.
- 00:50:16But otherwise, we can use this universal property
- 00:50:22that says, if they're not equal and they collide--
- 00:50:27which is exactly this case--
- 00:50:30the probability that that happens is 1/m.
- 00:50:35And since it's an indicator random variable,
- 00:50:38the expectation is there are outcomes
- 00:50:41times their probabilities-- so 1 times that probability
- 00:50:45plus 0 times 1 minus that probability, which is just 1/m.
- 00:50:51So now we get the summation of 1/m for j
- 00:50:58not equal to i plus 1.
- 00:51:02
- 00:51:08Oh, and this-- sorry.
- 00:51:10I did this wrong.
- 00:51:11This isn't u.
- 00:51:12This is n.
- 00:51:13We're storing n keys.
- 00:51:17OK, so now I'm looping over j--
- 00:51:20this over all of those things.
- 00:51:22How many things are there?
- 00:51:23n minus 1 things, right?
- 00:51:26So this should equal 1 plus n minus 1 over m.
- 00:51:32So that's what universality gives us.
- 00:51:35So as long as we choose m to be larger than n,
- 00:51:41or at least linear in n, then we're
- 00:51:44expected to have our chain lengths be constant,
- 00:51:49because this thing becomes a constant if m is at least order
- 00:51:54n.
- 00:51:55Does that make sense?
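The whole computation in one chain of (in)equalities (a clean write-up of the board work above, with the sum over the n stored keys):

```latex
% Expected length of the chain holding k_i, by linearity of expectation
% and universality (the j = i term contributes exactly 1):
\mathbb{E}\bigl[X_i\bigr]
  = \sum_{j=0}^{n-1} \mathbb{E}\bigl[X_{ij}\bigr]
  = 1 + \sum_{j \ne i} \Pr\bigl[h(k_i) = h(k_j)\bigr]
  \le 1 + \frac{n-1}{m} = O(1) \quad \text{when } m = \Omega(n)
```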
- 00:51:57OK.
- 00:51:58The last thing I'm going to leave you with
- 00:52:00is, how do we make this thing dynamic?
- 00:52:02If we're growing the number of things
- 00:52:05we're storing in this thing, it's
- 00:52:07possible that, as we grow n for a fixed m,
- 00:52:10this thing will stop being--
- 00:52:13m will stop being linear in n, right?
- 00:52:15Well, then all we have to do is, if we get too far,
- 00:52:20we rebuild the entire thing--
- 00:52:22the entire hash table with the new m,
- 00:52:24just like we did with a dynamic array.
- 00:52:27And you can prove--
- 00:52:28we're not going to do that here, but you
- 00:52:31can prove that you won't do that operation too often, if you're
- 00:52:35resizing in the right way.
- 00:52:37And so you just rebuild completely
- 00:52:40after a certain number of operations.
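A sketch of that resize step, extending the ChainedHashTable sketch from earlier (the doubling policy mirrors the dynamic array; the exact threshold is a design choice, not fixed by the lecture, and a full version would also redraw a fresh universal hash at each rebuild):

```python
class ResizingChainedHashTable(ChainedHashTable):
    """Grows the table when n exceeds m, keeping chains expected O(1)."""
    def __init__(self, m=8):
        super().__init__(m)
        self.n = 0                          # number of stored items

    def insert(self, x):
        if self.find(x.key) is None:        # new key; replacement keeps n fixed
            self.n += 1
        super().insert(x)
        if self.n > self.m:                 # load factor > 1: rebuild, doubled
            self._rebuild(2 * self.m)

    def _rebuild(self, new_m):              # O(n) now, amortized O(1) per insert
        old = [x for chain in self.table for x in chain]
        self.m = new_m
        self.table = [[] for _ in range(new_m)]
        for x in old:                       # re-insert everything at the new size
            self.table[self._slot(x.key)].append(x)
```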
- 00:52:42OK, so that's hashing.
- 00:52:44Next week, we're going to be talking
- 00:52:45about doing a faster sort.
- 00:52:48
- hashing
- set data structures
- comparison model
- universal hashing
- collisions
- performance
- direct access array
- hash tables
- dynamically resizing
- searching efficiency