4. Hashing
Summary
TL;DR: This lecture discusses hashing in the context of set data structures, covering the limitations of comparison-based search and the need for hash functions that keep collisions rare. Jason Ku details the comparison model, in which items can only be compared by key, and introduces universal hashing as a way to improve hash table performance by bounding collision probabilities. He covers handling collisions through chaining, the expected performance of hash tables under universal hashing, and strategies for dynamically resizing hash tables as the number of items grows. The emphasis is on supporting search, insertion, and deletion efficiently while maintaining performance guarantees.
Key Takeaways
- 🔑 Understanding the limitations of traditional search methods.
- 📊 The role of the comparison model in data structure operations.
- ⚖️ Importance of universal hashing in reducing collisions.
- 🔄 Strategies for handling collisions using chaining.
- 📏 Expected chain lengths in universal hash tables should be constant.
- 📈 Dynamic resizing of hash tables helps manage increased data loads.
- 🗝️ Direct access arrays provide constant time access based on keys.
- ✨ Efficient hashing can greatly improve performance.
- 📝 Choosing good hash functions is crucial for success.
- 💡 The need for careful memory allocation in large data structures.
Timeline
- 00:00:00 - 00:05:00
Lecture 4 of 6.006 focuses on hashing, contrasting with the previous lecture on set and sequence data structures, emphasizing the need for efficient item retrieval via keys.
- 00:05:00 - 00:10:00
Explored two methods for implementing the set interface - using unsorted arrays with linear scans and sorted arrays with log(n) retrieval time, introducing the potential for faster data structure construction.
- 00:10:00 - 00:15:00
Introduced a comparison model to prove that finding an item cannot be faster than log(n) time, explaining the implications of comparisons in determining item positions in data structures.
- 00:15:00 - 00:20:00
Described the concept of a decision tree, emphasizing that the number of comparisons relates to the tree's structure and is influenced by the required outputs from search operations.
- 00:20:00 - 00:25:00
Discussed the importance of finding keys in a data structure, illustrating how the number of outputs determines the minimum height of binary trees in a search algorithm context, stressing that at least (n + 1) leaves are necessary.
- 00:25:00 - 00:30:00
Entered into the specifics of the comparison model, detailing how the number of comparisons directly influences algorithm efficiency and establishing a log(n) baseline for searching in this context.
- 00:30:00 - 00:35:00
Elaborated on using direct access arrays for quick key retrieval through index mapping, which theoretically allows for constant time retrieval and insertion but poses challenges related to memory usage.
- 00:35:00 - 00:40:00
Addressed the limitations of direct access arrays concerning large key spaces, proposing a hashing method to map large keys into a more manageable space to optimize data storage and performance.
- 00:40:00 - 00:45:00
Outlined potential collision problems stemming from hashing where multiple keys map to the same location, introducing the concept of chaining to handle collisions and ensure efficient retrieval of items.
- 00:45:00 - 00:52:54
Concluded with the concept of universal hash functions as a solution to collision issues in hashing, discussing how choosing hash functions randomly from a family of functions can mitigate performance issues and lead to expected constant chain lengths.
Video Q&A
What is the main topic of this lecture?
The lecture focuses on hashing and its applications in set data structures.
What is a comparison model?
A comparison model treats stored items as black boxes: the algorithm may only compare two keys (equal, less than, greater than) to distinguish them.
What is the significance of universal hashing?
Universal hashing helps minimize the probability of collisions in hash tables, improving performance.
What do you do when a hash table reaches its size limit?
When a hash table reaches its size limit, it can be resized and rebuilt to accommodate more items.
What are the two main operations discussed in relation to set data structures?
The main operations discussed are finding items by key and dynamically inserting or deleting items.
What is the expected length of chains in a universal hash table?
The expected length of chains can be constant if the hash table is properly sized.
How can you handle collisions in hash tables?
Collisions can be handled using chaining, where multiple items are stored at the same index.
What is the significance of log n time in data structure operations?
Log n time indicates efficient searching and improves performance over linear scanning.
What is a direct access array?
A direct access array allows for constant time access based on keys, but may require large memory allocation.
What is the limitation of using a simple modulus for hashing?
Using a simple modulus may lead to many collisions if the distribution of keys is not uniform.
Transcript
- 00:00:00
- 00:00:12JASON KU: Welcome to the fourth lecture of 6.006.
- 00:00:17Today we are going to be talking about hashing.
- 00:00:20Last lecture, on Tuesday, Professor Solomon
- 00:00:24was talking about set data structures,
- 00:00:29storing things so that you can query items
- 00:00:33by their key right, by what they intrinsically are--
- 00:00:37versus what Professor Demaine was talking
- 00:00:39about last week, which was sequence data structures, where
- 00:00:42we impose an external order on these items
- 00:00:46and we want you to maintain those.
- 00:00:49I'm not supporting operations where I'm looking stuff up
- 00:00:52based on what they are.
- 00:00:54That's what the set interface is for.
- 00:00:56So we're going to be talking a little bit more about the set
- 00:00:59interface today.
- 00:01:01On Tuesday, you saw two ways of implementing the set
- 00:01:05interface--
- 00:01:07one using just a unsorted array-- just,
- 00:01:09I threw these things in an array and I
- 00:01:12could do a linear scan of my items
- 00:01:14to support basically any of these operations.
- 00:01:17It's a little exercise you can go through.
- 00:01:19I think they show it to you in the recitation notes,
- 00:01:21but if you'd like to implement it for yourself, that's fine.
- 00:01:26And then we saw a slightly better data structure, at least
- 00:01:30for the find operations.
- 00:01:31Can I look something up, whether this key
- 00:01:34is in my set interface?
- 00:01:38We can do that faster.
- 00:01:39We can do that in log n time with a build overhead
- 00:01:43that's about n log n, because we showed you three ways to sort.
- 00:01:49Two of them were n squared.
- 00:01:51One of them was n log n, which is as good as we showed you
- 00:01:55how to do yesterday.
- 00:01:57So the question then becomes, can I build that data structure
- 00:02:00faster?
- 00:02:01That'll be a subject of next week's Thursday lecture.
- 00:02:04But this week we're going to concentrate on this static
- 00:02:08find.
- 00:02:09We got log n, which is an exponential improvement
- 00:02:11over linear, right, but the question now becomes,
- 00:02:17can I do faster than log n time?
- 00:02:21And what we're going to do at the first part of this lecture
- 00:02:24is show you that, no, you--
- 00:02:26AUDIENCE: [INAUDIBLE]
- 00:02:27JASON KU: What's up?
- 00:02:28No?
- 00:02:29OK-- that you can't do faster than log n time,
- 00:02:35in the caveat that we are in a slightly more restricted model
- 00:02:38of computation that we were-- than what we introduced
- 00:02:43to you a couple of weeks ago.
- 00:02:46And then so if we're not in that more constrained model
- 00:02:50of computation, we can actually do faster.
- 00:02:52
- 00:02:55Log n's already pretty good.
- 00:02:57Log n is not going to be larger than like 30 for any problem
- 00:03:03that you're going to be talking about in the real world
- 00:03:08on real computers, but a factor of 30 is still bad.
- 00:03:13I would prefer to do faster with those constant factors, when
- 00:03:17I can.
- 00:03:18It's not a constant factor.
- 00:03:19It's a logarithmic factor, but you get what I'm saying.
- 00:03:22OK, so what we're going to do is first
- 00:03:24prove that you can't do faster for--
- 00:03:27does everyone understand-- remember what find key meant?
- 00:03:32I have a key, I have a bunch of items that have keys associated
- 00:03:36with them, and I want to see if one of the items that I'm
- 00:03:39storing contains a key that is the same as the one
- 00:03:42that I searched for.
- 00:03:43The item might contain other things,
- 00:03:46but in particular, it has a search key
- 00:03:49that I'm maintaining the set on so that it supports
- 00:03:52find operations, search operations based on that key
- 00:03:56quickly.
- 00:03:56Does that make sense?
- 00:03:58So there's the find one that we want to improve,
- 00:04:00and we also want to improve this insert delete.
- 00:04:03We want to be-- make this data structure dynamic, because we
- 00:04:08might do those operations quite a bit.
- 00:04:11And so this lecture's about optimizing those three things.
- 00:04:15OK, so first, I'm going to show you
- 00:04:17that we can't do faster than log n for find, which
- 00:04:22is a little weird.
- 00:04:23OK, the model of computation I'm going
- 00:04:26to be proving this lower bound on--
- 00:04:28
- 00:04:31how I'm going to approach this is I'm going to say that
- 00:04:33any way that I store these--
- 00:04:37the items that I'm storing in this data structure--
- 00:04:45for any way I store these things, any algorithm
- 00:04:45of this certain type is going to require
- 00:04:48at least logarithmic time.
- 00:04:50That's what we're going to try to prove.
- 00:04:52And the model of computation that's
- 00:04:55weaker than what we've been talking about previously
- 00:04:58is what I'm going to call the comparison model.
- 00:05:00
- 00:05:04And a comparison model means-- is that the items,
- 00:05:07the objects I'm storing--
- 00:05:10I can kind of think of them as black boxes.
- 00:05:12I don't get to touch these things, except the only way
- 00:05:15that I can distinguish between them is to say,
- 00:05:20given a key and an item, or two items, I can do a comparison
- 00:05:27on those keys.
- 00:05:28Are these keys the same?
- 00:05:31Is this key bigger than this one?
- 00:05:34Is it smaller than this one?
- 00:05:35Those are the only operations I get to do with them.
- 00:05:40Say, if the keys are numbers, I don't get
- 00:05:42to look at what number that is.
- 00:05:44I just get to take two keys and compare them.
- 00:05:46And actually, all of the sorting algorithms
- 00:05:49that we saw on Tuesday were comparison sort algorithms.
- 00:05:53What you did was stepped through the program.
- 00:05:56At some point, you came to a branch
- 00:05:59and you looked at two keys, and you
- 00:06:01branched based on whether one key was bigger than another.
- 00:06:06That was a comparison.
- 00:06:07And then you move some stuff around,
- 00:06:09but that was the general paradigm.
- 00:06:11Those three sorting operations lived in this comparison model.
- 00:06:17You've got a comparison operations,
- 00:06:20like are they equal, less than, greater than,
- 00:06:25maybe greater than or equal, less than or equal?
- 00:06:28Generally, you have all these operations
- 00:06:30that you could do-- maybe not equal.
- 00:06:32
- 00:06:35But the key thing here is that there are only
- 00:06:38two possible outputs to each of these comparators.
- 00:06:40
- 00:06:44There's only one thing that I can branch on.
- 00:06:46It's going to branch into two different lines.
- 00:06:49It's either true and I do some other computation,
- 00:06:52or it's false and I'll do a different set of computation.
- 00:06:56That makes sense?
- 00:06:58So what I'm going to do is I'm going
- 00:06:59to give you a comparison--
- 00:07:02an algorithm in the comparison model
- 00:07:05as what I like to call a decision tree.
- 00:07:08So if I specify an algorithm to you,
- 00:07:10the first thing it's going to do-- if I don't compare items
- 00:07:13at all, I'm kind of screwed, because I'll never
- 00:07:15be able to tell if my keys in there or not.
- 00:07:17So I have to do some comparisons.
- 00:07:21So I'll do some computation.
- 00:07:23Maybe I find out the length of the array
- 00:07:25and I do some constant time stuff, but at some point,
- 00:07:28I'll do a comparison, and I'll branch.
- 00:07:31I'll come to this node, and if the comparison--
- 00:07:35maybe a less than--
- 00:07:37if it's true, I'm going to go this way in my computation,
- 00:07:41and if it's false, I'm going to go this way in my computation.
- 00:07:45And I'm going to keep doing that with various comparisons--
- 00:07:51sure-- until I get down here to some leaf at which
- 00:08:02I'm not branching.
- 00:08:04The internal nodes here are representing comparisons,
- 00:08:07but the leaves are representing--
- 00:08:09I stopped my computation.
- 00:08:11I'm outputting something.
- 00:08:13Does that make sense, what I'm trying to do?
- 00:08:16I'm changing my algorithm to be put
- 00:08:20in this kind of graphical way, where I'm branching what
- 00:08:24my program could possibly do based on the comparisons
- 00:08:28that I do.
- 00:08:30I'm not actually counting the rest of the work
- 00:08:33that the program does.
- 00:08:35I'm really only looking at the comparisons,
- 00:08:37because I know that I need to compare some things eventually
- 00:08:41to figure out what my items are.
- 00:08:44And if that's the only way I can distinguish items,
- 00:08:47then I have to do those comparisons to find out.
- 00:08:49Does that make sense?
- 00:08:51All right, so what I have is a binary tree
- 00:08:56that's representing the comparisons done
- 00:08:58by the algorithm.
- 00:08:59OK.
- 00:09:01So it starts at one comparison and then it branches.
- 00:09:04How many leaves must I have in my tree?
- 00:09:07
- 00:09:10What does that question mean, in terms of the program?
- 00:09:15AUDIENCE: [INAUDIBLE]
- 00:09:16JASON KU: What's up?
- 00:09:17AUDIENCE: The number of comparisons--
- 00:09:18JASON KU: The number of comparisons-- no,
- 00:09:20that's the number of internal nodes
- 00:09:21that I have in the algorithm.
- 00:09:23And actually, the number of comparisons
- 00:09:25that I do in an execution of the algorithm
- 00:09:27is just along a path from here to the-- to a leaf.
- 00:09:32So what do the leaves actually represent?
- 00:09:34Those represent outputs.
- 00:09:36I'm going to output something here.
- 00:09:39Yep?
- 00:09:40AUDIENCE: [INAUDIBLE]
- 00:09:41JASON KU: The number of--
- 00:09:42OK.
- 00:09:42
- 00:09:45So what is the output to my search algorithm?
- 00:09:47Maybe it's the-- an index of an item that contains this key.
- 00:09:52Or maybe I return the item is the output--
- 00:09:58the item of the thing I'm storing.
- 00:09:59And I'm storing n things, so I need at least n outputs,
- 00:10:04because I need to be able to return any of the items
- 00:10:07that I'm storing based on a different search parameter,
- 00:10:11if it's going to be correct.
- 00:10:12I actually need one more output.
- 00:10:13Why do I need one more output?
- 00:10:15
- 00:10:17If it's not in there--
- 00:10:20so any correct comparison searching algorithm--
- 00:10:26I'm doing some comparisons to find this thing--
- 00:10:30needs to have at least n plus 1 leaves.
- 00:10:34
- 00:10:38Otherwise, it can't be correct, because I could look up
- 00:10:41the one that I'm not returning in that set
- 00:10:44and it would never be able to return that value.
- 00:10:47Does that make sense?
- 00:10:50Yeah?
- 00:10:50AUDIENCE: [INAUDIBLE]
- 00:10:51JASON KU: What's n?
- 00:10:53For a data structure, n is the number
- 00:10:55of things stored in that data structure at that time--
- 00:10:58so the number of items in the data structure.
- 00:11:00That's what it means in all of these tables.
- 00:11:03Any other questions?
- 00:11:05OK, so now we get to the fun part.
- 00:11:09How many comparisons does this algorithm have to do?
- 00:11:13
- 00:11:16Yeah, up there--
- 00:11:17AUDIENCE: [INAUDIBLE]
- 00:11:19JASON KU: What's up?
- 00:11:22All right, your colleague is jumping ahead for a second,
- 00:11:25but really, I have to do as many comparisons in the worst case
- 00:11:30as the longest root-to-leaf path in this tree--
- 00:11:35because as I'm executing this algorithm,
- 00:11:37I'll go down this thing, always branching down,
- 00:11:42and at some point, I'll get to a leaf.
- 00:11:44And in the worst case, if I happen
- 00:11:47to need to return this particular output,
- 00:11:51then I'll have to walk down the longest thing, just the longest
- 00:11:55path.
- 00:11:57So then the longest path is the same as the height of the tree,
- 00:12:01so the question then becomes, what
- 00:12:04is the minimum height of any binary tree that has at least n
- 00:12:10plus 1 leaves?
- 00:12:13Does everyone understand why we're asking that question?
- 00:12:18Yeah?
- 00:12:19AUDIENCE: Could you go over again why it needs n plus 1 leaves?
- 00:12:22JASON KU: Why it needs n plus 1 leaves--
- 00:12:24if it's a correct algorithm, it needs to return--
- 00:12:27it needs to be able to return any of the n items
- 00:12:30that I'm storing or say that the key that I'm looking for
- 00:12:33is not there--
- 00:12:35great question.
- 00:12:37OK, so what is the minimum height
- 00:12:40of any binary tree that has n plus 1--
- 00:12:44at least n plus 1 leaves?
- 00:12:48You can actually state a recurrence for that
- 00:12:50and solve that.
- 00:12:50You're going to do that in your recitation.
- 00:12:52But it's log n.
- 00:12:53The best you can do is if this is a balanced binary tree.
- 00:12:57So the min height is going to be at least log n height.
- 00:13:10
- 00:13:14Or the min height is logarithmic,
- 00:13:17so it's actually theta right here.
- 00:13:19But if I just said height here, I
- 00:13:21would be lower bounding the height.
- 00:13:24I could have a linear height, if I just chained comparisons
- 00:13:28down one by one, if I was doing a linear search, for example.
- 00:13:34All right, so this is saying that, if I'm just restricting
- 00:13:36to comparisons, I have to spend at least logarithmic time
- 00:13:40to be able to find whether this key is in my set.
- 00:13:43
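To restate the board's counting argument compactly, here is the bound in LaTeX (a direct transcription of what was just said, nothing new):

```latex
% A correct comparison-based find is a binary decision tree with at
% least n+1 leaves (one per stored item, plus one for "not found").
% A binary tree of height h has at most 2^h leaves, so:
2^{h} \ge n + 1
\quad\Longrightarrow\quad
h \ge \lceil \log_2 (n + 1) \rceil = \Omega(\log n)
```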
- 00:13:46But I don't want logarithmic time.
- 00:13:48I want faster.
- 00:13:49So how can I do that?
- 00:13:51AUDIENCE: [INAUDIBLE]
- 00:13:51JASON KU: I have one operation in my model of computation
- 00:13:54I presented a couple of weeks ago
- 00:13:56that allows me to do faster, which allows me to do something
- 00:14:00stronger than comparisons.
- 00:14:03Comparisons have a constant branching factor.
- 00:14:06In particular, I can--
- 00:14:08if I do this operation-- this constant time operation--
- 00:14:11I can branch to two different locations.
- 00:14:17It's like an if kind of situation-- if, or else.
- 00:14:21And in fact, if I had constant branching factor
- 00:14:24for any constant here--
- 00:14:28if I had three or four, if it was bounded by a constant,
- 00:14:31the height of this tree would still
- 00:14:32be bounded by a log base the constant
- 00:14:36of that number of leaves.
- 00:14:39So I need, in some sense, to be able to branch
- 00:14:42a non-constant amount.
- 00:14:45So how can I branch a non-constant amount?
- 00:14:49This is a little tricky.
- 00:14:51We had this really neat operation in the random access
- 00:14:57machine that we could randomly go
- 00:15:01to any place in memory in constant time
- 00:15:03based on a number.
- 00:15:04
- 00:15:08That was a super powerful thing, because
- 00:15:10within a single constant time operation,
- 00:15:12I could go to any space in memory.
- 00:15:15That's potentially much larger than linear branching factor,
- 00:15:19depending on the size of my model
- 00:15:20and the size of my machine.
- 00:15:22So that's a very powerful operation.
- 00:15:24Can we use that to find quicker?
- 00:15:27Anyone have any ideas?
- 00:15:28
- 00:15:31Sure.
- 00:15:32AUDIENCE: [INAUDIBLE]
- 00:15:33JASON KU: We're going to get to hashing in a second,
- 00:15:35but this is a simpler concept than hashing--
- 00:15:40something you probably are familiar with already.
- 00:15:44We've kind of been using it implicitly
- 00:15:46in some of our sequence data structure things.
- 00:15:50What we're going to do is, if I have an item that has key 10,
- 00:15:57I'm going to keep an array and store that item 10 spaces away
- 00:16:04from the front of the array, right at index 9,
- 00:16:07or the 10th index.
- 00:16:09Does that make sense?
- 00:16:11If I store that item at that location in memory,
- 00:16:14I can use this random access to that location
- 00:16:19and see if there's something there.
- 00:16:21If there's something there, I return that item.
- 00:16:23Does that make sense?
- 00:16:24This is what I call a direct access array.
- 00:16:26
- 00:16:29It's really no different than the arrays
- 00:16:32that we've been talking about earlier in the class.
- 00:16:38We got an array, and if I have an item here
- 00:16:43with key equals 10, I'll stick it here in the 10th place.
- 00:16:50Now, I can only now store one item with the key 10
- 00:16:56in my thing, and that's one of the stipulations we
- 00:16:58had on our set data structures.
- 00:17:00If we tried to insert something with the same key
- 00:17:03as something already stored there,
- 00:17:04we're going to replace the item.
- 00:17:06That's what the semantics of our set interface was.
- 00:17:09But that's OK.
- 00:17:10That's satisfying the conditions of our set interface.
- 00:17:14So if we store it there, that's fantastic.
- 00:17:17How long does it take to find, if we
- 00:17:19have an item with the key 10?
- 00:17:23It takes constant time, worst case--
- 00:17:25great.
- 00:17:27How about inserting or deleting something?
- 00:17:29AUDIENCE: [INAUDIBLE]
- 00:17:30JASON KU: What's that?
- 00:17:31AUDIENCE: [INAUDIBLE]
- 00:17:32JASON KU: Again, constant time--
- 00:17:34we've solved all our problems.
- 00:17:36This is amazing.
- 00:17:36
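A minimal sketch of a direct access array in Python (the class and Item names here are illustrative, not from the lecture); find, insert, and delete are each a single array access, O(1) worst case:

```python
from collections import namedtuple

Item = namedtuple('Item', ['key', 'value'])  # an item carries its search key

class DirectAccessArray:
    """Set interface over integer keys 0 <= key < u; one slot per possible key."""
    def __init__(self, u):
        self.A = [None] * u      # Theta(u) space: the cost of this scheme

    def find(self, key):         # O(1) worst case
        return self.A[key]

    def insert(self, x):         # O(1); replaces any item with the same key
        self.A[x.key] = x

    def delete(self, key):       # O(1)
        self.A[key] = None
```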
- 00:17:39OK.
- 00:17:40What's not amazing about this?
- 00:17:42Why don't we just do this all the time?
- 00:17:43
- 00:17:47Yeah?
- 00:17:50AUDIENCE: You don't know how high the numbers go.
- 00:17:53JASON KU: I don't know how high the numbers go.
- 00:17:56So let's say I'm storing, I don't know,
- 00:17:59a number associated with that the 300 or 400 of you
- 00:18:03that are in this classroom.
- 00:18:05
- 00:18:08But I'm storing your MIT IDs.
- 00:18:10How big are those numbers?
- 00:18:12Those are like nine-digit numbers--
- 00:18:15pretty long numbers.
- 00:18:17So what I would need to do-- and if I was storing your keys
- 00:18:21as MIT IDs, I would need an array
- 00:18:25that has indices that span the entire
- 00:18:28space of nine-digit numbers.
- 00:18:33That's like 10 to the--
- 00:18:3710 to the 9.
- 00:18:37Thank you.
- 00:18:3810 to the 9 is the size of the direct access array I'd have
- 00:18:43to build to be able to use this technique
- 00:18:50to create a direct access array to search on your MIT IDs,
- 00:18:54when there's only really 300 of you in here.
- 00:18:57So 300 or 400 is an n that's much
- 00:19:00smaller than the size of the numbers
- 00:19:03that I'm trying to store.
- 00:19:04What I'm going to use as a variable
- 00:19:06to talk about the size of the numbers I'm storing--
- 00:19:09I'm going to say u is the maximum size of any number
- 00:19:12that I'm storing.
- 00:19:13It's the size of the universe of space of keys that I'm storing.
- 00:19:17Does that make sense?
- 00:19:19OK, so to instantiate a direct access array of that size,
- 00:19:24I have to allocate that amount of space.
- 00:19:26And so if that is much bigger than n,
- 00:19:31then I'm kind of screwed, because I'm
- 00:19:34using much more space.
- 00:19:36And these order operations are bad also, because essentially,
- 00:19:40if I am storing these things non-continuously,
- 00:19:46I kind of just have to scan down the thing
- 00:19:48to find the next element, for example.
- 00:19:52OK, what's your question?
- 00:19:53AUDIENCE: Is a direct access array
- 00:19:55a sequence data structure?
- 00:19:56JASON KU: A direct access array is a set data structure.
- 00:19:59That's why it's a set interface up there.
- 00:20:01
- 00:20:05Your colleague is asking whether you can use a direct access array
- 00:20:09to implement a set--
- 00:20:10I mean a sequence.
- 00:20:11And actually, I think you'll see in your recitation notes,
- 00:20:14you have code that can take a set data structure
- 00:20:19and implement sequence data structure,
- 00:20:20and take sequence data structure and implement a set data
- 00:20:23structure.
- 00:20:24They just won't necessarily have very good run time.
- 00:20:26So this direct access array semantics
- 00:20:29is really just good for these specific set operations.
- 00:20:34Does that make sense?
- 00:20:35Yeah?
- 00:20:35AUDIENCE: What is u?
- 00:20:36JASON KU: u is this the size of the largest key
- 00:20:39that I'm allowed to store.
- 00:20:40That makes sense?
- 00:20:42The direct access array is supporting up to u size keys.
- 00:20:47Does that make sense?
- 00:20:48OK, we're going to move on for a second.
- 00:20:51That's the problem, right?
- 00:20:52When u-- the largest key--
- 00:20:59
- 00:21:01we're assuming integers here--
- 00:21:04integer keys-- so in the comparison model,
- 00:21:10we could store any arbitrary objects
- 00:21:12that supported a comparison.
- 00:21:14Here we really need to have integer keys,
- 00:21:17or else we're not going to be able to use those as addresses.
- 00:21:21So we're making an assumption on the inputs
- 00:21:25that I can only store integers now.
- 00:21:27I can't store arbitrary objects--
- 00:21:29items with keys.
- 00:21:31And in particular, I also need to-- this is a subtlety
- 00:21:34that's in the word RAM model--
- 00:21:36how can I be assured that these keys can
- 00:21:39be looked up in constant time?
- 00:21:41
- 00:21:44I have this little CPU.
- 00:21:46It's got some number of registers it can act upon.
- 00:21:49How big are those registers?
- 00:21:52AUDIENCE: [INAUDIBLE]
- 00:21:53JASON KU: What?
- 00:21:54
- 00:21:56Right now, they're 64 bits, but in general, they're w.
- 00:21:59They're the size of your word on your machine.
- 00:22:042 to the w is the number of addresses I can access.
- 00:22:09If I'm going to be able to use this direct accessory,
- 00:22:11I need to make sure that the u is less than 2 to the w,
- 00:22:19if I want these operations to run in constant time.
- 00:22:22If I have keys that are much larger than this,
- 00:22:25I'm going to need to do something else,
- 00:22:28but this is kind of the assumption.
- 00:22:30In this class, when we give you an array of integers,
- 00:22:34or an array of strings, or something
- 00:22:35like that on your problem set or on an exam,
- 00:22:38the assumption is, unless we give you bounds
- 00:22:41on the size of those things--
- 00:22:45like the number of characters in your string
- 00:22:47or the size of the number in the--
- 00:22:49you can assume that those things will fit in one word of memory.
- 00:22:53
- 00:22:58w is the word size of your machine, the number of bits
- 00:23:04that your machine can do operations on in constant time.
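Stated as a constraint (w = 64 is just today's common case, not a requirement of the model):

```latex
% Direct access in O(1) needs every key to fit in one machine word:
u \le 2^{w} \qquad (\text{e.g., } w = 64 \text{ on current machines})
```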
- 00:23:08Any other questions?
- 00:23:10OK, so we have this problem.
- 00:23:12We're using way too much space, when we
- 00:23:15have a large universe of keys.
- 00:23:18So how do we get around that problem? Any ideas?
- 00:23:24
- 00:23:28Sure.
- 00:23:29AUDIENCE: Instead of [INAUDIBLE]..
- 00:23:31
- 00:23:36JASON KU: OK, so what your colleague is saying--
- 00:23:39instead of just storing one value at each place,
- 00:23:43maybe store more than one value.
- 00:23:47If we're using this idea, where I
- 00:23:50am storing my key at the index of the key,
- 00:23:53that's getting around us having
- 00:23:55to have unique keys in our data structure.
- 00:23:58It's not getting around this space usage problem.
- 00:24:02Does that make sense?
- 00:24:04We will end up storing multiple things at indices,
- 00:24:09but there's another trick that I'm looking for right now.
- 00:24:13We have a lot of space that we would
- 00:24:16need to allocate for this data structure.
- 00:24:19What's an alternative?
- 00:24:22Instead of allocating a lot of space, we allocate--
- 00:24:25
- 00:24:28less space.
- 00:24:30Let's allocate less space.
- 00:24:31All right.
- 00:24:32
- 00:24:36This is our space of keys, u.
- 00:24:38
- 00:24:40But instead, I want to store those things in a direct access
- 00:24:47array of maybe size n, something like the order of the things
- 00:24:53that I'm going to be storing.
- 00:24:55I'm going to relax that and say we're
- 00:24:57going to make this a length m that's
- 00:25:00around the size of the things I'm storing.
- 00:25:04
- 00:25:07And what I'm going to do is I'm going to try
- 00:25:09to map this space of keys--
- 00:25:12this large space of keys, from 0 to u minus 1
- 00:25:16or something like that--
- 00:25:18down to a range that's 0 to m minus 1.
- 00:25:21
- 00:25:24I'm going to want a function--
- 00:25:26this is what I'm going to call h--
- 00:25:29which maps this range down to a smaller range.
- 00:25:37
- 00:25:40Does that make sense?
- 00:25:41I'm going to have some function that
- 00:25:43takes that large space of keys--
- 00:25:44sticks them down here.
- 00:25:46
- 00:25:48And instead of storing it at an index of the key,
- 00:25:55I'm going to put the key through this function, the key space,
- 00:25:58into a compressed space and store it
- 00:26:02at that index location.
- 00:26:05Does that make sense?
- 00:26:06Sure.
- 00:26:07AUDIENCE: [INAUDIBLE]
- 00:26:10JASON KU: Your colleague is--
- 00:26:12comes up with the question I was going to ask right away,
- 00:26:15which was, what's the problem here?
- 00:26:17The problem is the potential that we might
- 00:26:21have to store more than one thing at the same index
- 00:26:26location.
- 00:26:27If I have a function that maps this big space down
- 00:26:31to this small space, I got to have
- 00:26:36multiple of these things going to the same places here, right?
- 00:26:40It can't be injective.
- 00:26:44But just based on pigeonhole principle,
- 00:26:45I have more of these things.
- 00:26:47At least two of them have to go to something over here.
- 00:26:50In fact, if I have, say, u is bigger than n squared,
- 00:26:54for example, there--
- 00:26:58for any function I give you that maps
- 00:27:00this large space down to the small space, n of these things
- 00:27:05will map to the same place.
- 00:27:08So if I choose a bad function here,
- 00:27:11then I'll have to store n things at the same index location.
- 00:27:16And if I go there, I have to check
- 00:27:19to see whether any of those are the things
- 00:27:21that I'm looking for.
- 00:27:22I haven't gained anything.
- 00:27:23I really want a hash function that will evenly distribute
- 00:27:27keys over this space.
- 00:27:29
- 00:27:32Does that make sense?
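The pigeonhole argument in symbols (u possible keys squeezed into m slots; the u ≥ n·m condition below is one way to make precise the "u bigger than n squared" case mentioned above):

```latex
% Some slot receives at least ceil(u/m) of the u possible keys, so for
% any fixed h an adversary can choose n stored keys that all hash to
% one slot whenever u >= n * m:
\exists\, t \in \{0, \dots, m-1\} :
  \bigl|\{\, k : h(k) = t \,\}\bigr| \;\ge\; \lceil u / m \rceil \;\ge\; n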
- 00:27:34But we have a problem here.
- 00:27:35If we need to store multiple things
- 00:27:37at a given location in memory--
- 00:27:41can't do that.
- 00:27:42I have one thing I can put there.
- 00:27:44So I have two options on how to deal--
- 00:27:46what I call collisions.
- 00:27:49If I have two items here, like a and b,
- 00:27:52these are different keys in my universe of space.
- 00:27:58But it's possible that they both map down
- 00:28:02to some hash that has the same value.
- 00:28:07
- 00:28:10If I first hash a, and a is--
- 00:28:14I put a there, where do I put b?
- 00:28:17
- 00:28:22There are two options.
- 00:28:25AUDIENCE: Is the second data structure [INAUDIBLE]
- 00:28:28so that it can store [INAUDIBLE]??
- 00:28:31JASON KU: OK, so what your colleague is saying--
- 00:28:34can I store this one is a linked list,
- 00:28:36and then I can just insert a guy right next to where it was?
- 00:28:40What's the problem there?
- 00:28:43Are linked lists good with direct accessing by an index?
- 00:28:48No, they're terrible with get_at and set_at.
- 00:28:51They take linear time there.
- 00:28:53So really, the whole point of a direct access array
- 00:28:55is that there is an array underneath,
- 00:28:57and I can do this index arithmetic
- 00:28:59and go down to the next thing.
- 00:29:01So I really don't want to replace a linked
- 00:29:03list as this data structure.
- 00:29:07Yeah?
- 00:29:07
- 00:29:10What's up?
- 00:29:11AUDIENCE: [INAUDIBLE]
- 00:29:13JASON KU: We can make it really unlikely.
- 00:29:15Sure.
- 00:29:17I don't know what likely means, because I'm
- 00:29:19giving you a hash function-- one hash function.
- 00:29:22And I don't know what the inputs are.
- 00:29:23Yeah?
- 00:29:26Go ahead.
- 00:29:26AUDIENCE: [INAUDIBLE]
- 00:29:31JASON KU: OK, right.
- 00:29:32So there are actually two solutions here.
- 00:29:36One is I-- maybe, if I choose m to be larger than n,
- 00:29:42there's going to be extra space in here.
- 00:29:45I'll just stick it somewhere else in the existing array.
- 00:29:49How I find an open space is a little complicated,
- 00:29:52but this is a technique called open addressing, which
- 00:29:57is much more common than the technique
- 00:30:00we're going to be talking about today in implementations.
- 00:30:04Python uses an open addressing scheme, which is essentially,
- 00:30:07find another place in the array to put this collision.
- 00:30:12Open addressing is notoriously difficult to analyze,
- 00:30:15so we're not going to do that in this class.
- 00:30:17There's a much easier technique that-- we
- 00:30:19have an implementation for you in the recitation handouts.
- 00:30:23It's what your colleague up here--
- 00:30:26I can't find him--
- 00:30:27over there was saying--
- 00:30:29was, instead of storing it somewhere else
- 00:30:31in the existing direct access array down here,
- 00:30:35which we usually call the hash table--
- 00:30:37
- 00:30:41instead of storing it somewhere else in that hash table,
- 00:30:43we'll instead, at that key, store a pointer
- 00:30:47to another data structure, some other data structure that
- 00:30:51can store a bunch of things-- just like any sequence data
- 00:30:54structure, like a dynamic array, or linked list,
- 00:30:56or anything right.
- 00:30:57All I need to do is be able to stick a bunch of things
- 00:30:59on there when there are collisions,
- 00:31:03and then, when I go up to look for that thing,
- 00:31:05I'll just look through all of the things in that data
- 00:31:09structure and see if my key exists.
- 00:31:11Does that make sense?
- 00:31:13Now, we want to make sure that those additional data
- 00:31:16structures, which I'll call chains--
- 00:31:19we want to make sure that those chains are short.
- 00:31:24I don't want them to be long.
- 00:31:27So what I'm going to do is, when I have this collision here,
- 00:31:29instead I'll have a pointer to some--
- 00:31:31I don't know-- maybe make it a dynamic array, or a linked
- 00:31:33list, or something like that.
- 00:31:35And I'll put a here and I'll b here.
- 00:31:38And then later, when I look up key K, or look up a or b--
- 00:31:46let's look up b--
- 00:31:48I'll go to this hash value here.
- 00:31:51I'll put it through the hash function.
- 00:31:52I'll go to this index.
- 00:31:54I'll go to the data structure, the chain associated
- 00:31:56to that index, and I'll look at all of these items.
- 00:31:59I'm just going to do a linear find.
- 00:32:01I'm going to look.
- 00:32:01
- 00:32:04I could put any data structure here,
- 00:32:06but I'm going to look at this one, see if it's b.
- 00:32:08It's not b.
- 00:32:09Look at this one-- it is b.
- 00:32:11I return yes.
- 00:32:12Does that make sense?
- 00:32:13So this is an idea called chaining.
- 00:32:15I can put anything I want there.
- 00:32:16Commonly, we talk about putting a linked list there,
- 00:32:20but you can put a dynamic array there.
- 00:32:24You can put a sorted array there to make it easier
- 00:32:27to check whether the key is there.
- 00:32:29You can put anything you want there.
- 00:32:30The point of this lecture is going
- 00:32:32to try to show that there's a choice of hash function
- 00:32:35I can make that makes sure that these chains are small so
- 00:32:42that it really doesn't matter how I store them there,
- 00:32:45because I can just--
- 00:32:46if there's a constant number of things stored there,
- 00:32:49I can just look at all of them and do whatever I want,
- 00:32:52and still get constant time.
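A sketch of chaining in Python (names are illustrative; items are anything with a .key, like the Item tuple above, and the simple key % m hash stands in for whatever h you choose):

```python
class ChainedHashTable:
    """Hash table with chaining: each of the m slots holds a list (the chain)."""
    def __init__(self, m):
        self.m = m
        self.table = [[] for _ in range(m)]    # m empty chains

    def _slot(self, key):
        return key % self.m                    # division method, for illustration

    def find(self, key):                       # O(length of one chain)
        for x in self.table[self._slot(key)]:  # linear scan of that chain
            if x.key == key:
                return x
        return None                            # the "not in there" outcome

    def insert(self, x):
        chain = self.table[self._slot(x.key)]
        for i, y in enumerate(chain):
            if y.key == x.key:                 # set semantics: replace same key
                chain[i] = x
                return
        chain.append(x)

    def delete(self, key):
        chain = self.table[self._slot(key)]
        for i, y in enumerate(chain):
            if y.key == key:
                chain.pop(i)
                return
```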
- 00:32:53Yeah?
- 00:32:54AUDIENCE: So does that means that, when you have [INAUDIBLE]
- 00:33:01let's just say, for some reason, the number of things
- 00:33:05[INAUDIBLE] is that most of them get multiple [INAUDIBLE]..
- 00:33:10Is it just a data structure that only holds one thing?
- 00:33:13JASON KU: Yeah.
- 00:33:13So what your colleague is saying is,
- 00:33:16at initialization, what is stored here?
- 00:33:19Initially, it points to an empty data structure.
- 00:33:22I'm just going to initialize all of these things to have--
- 00:33:25now, you get some overhead here.
- 00:33:27We're paying something for this-- some extra space
- 00:33:29and having pointer and another data structure
- 00:33:31at all of these things.
- 00:33:32Or you could have the semantics where,
- 00:33:34if I only have one thing here, I'm
- 00:33:36going to store that thing at this location,
- 00:33:38but if I have multiple, it points to a data structure.
- 00:33:41These are kind of complicated implementation details,
- 00:33:44but you get the basic idea.
- 00:33:46If I just have a 0 size data structure
- 00:33:49at all of these things, I'm still
- 00:33:50going to have a constant factor overhead.
- 00:33:54It's still going to be a linear size data structure,
- 00:33:57as long as m is linear in n.
- 00:33:59Does that make sense?
- 00:34:01OK.
- 00:34:02So how do we pick a good hash function?
- 00:34:05I already told you that any fixed hash
- 00:34:08function I give you is going to experience collisions.
- 00:34:12And if u is large, then there's the possibility that I--
- 00:34:20for some input, all of the things in my set
- 00:34:23go directly to the same hashed index value.
- 00:34:27So that ain't great.
- 00:34:29Let's ignore that for a second.
- 00:34:30What's the easiest way to get down
- 00:34:33from this large space of keys down to a small one?
- 00:34:36What's the easiest thing you could do?
- 00:34:38Yeah?
- 00:34:38AUDIENCE: [INAUDIBLE]
- 00:34:38JASON KU: Modulus-- great.
- 00:34:40This is called the division method.
- 00:34:41
- 00:34:51And what its function is, essentially,
- 00:34:54is it's going to take a key, and it's
- 00:34:56going to say h of k is equal to k mod m.
- 00:35:04I'm going to take something of a large space,
- 00:35:06and I'm going to mod it so that it just wraps around--
- 00:35:09
- 00:35:13perfectly valid thing to do.
- 00:35:15It satisfies what we're doing in a hash table.
- 00:35:18And if my keys are completely uniformly distributed--
- 00:35:24if, when I use my hash function, all of the keys
- 00:35:28here are uniformly distributed over this larger space, then
- 00:35:35actually, this isn't such a bad thing.
- 00:35:38But that's imposing some kind of distribution requirements
- 00:35:42on the type of inputs I'm allowed
- 00:35:43to use with this hash function for it
- 00:35:45to have good performance.
- 00:35:48But this plus a little bit of extra mixing and bit
- 00:35:53manipulation is essentially what Python does.
- 00:35:58Essentially, all it does is jumbles up
- 00:36:00that key for some fixed amount of jumbling,
- 00:36:05and then mods it by m, and sticks it there.
- 00:36:11It's hard coded in the Python library, what this hash
- 00:36:15function is, and so there exist some sequences of inserts
- 00:36:21into a hash table in Python which
- 00:36:24will be really bad in terms of performance,
- 00:36:26because these chain lengths-- the number of collisions
- 00:36:30that I'll get at a single hash-- are going to be large.
- 00:36:35But they do that for other reasons.
- 00:36:36They want a deterministic hash function.
- 00:36:38They want something that I do the program again--
- 00:36:41it's going to do the same thing underneath.
- 00:36:45But sometimes Python gets it wrong.
- 00:36:47But if your data that you're storing
- 00:36:50is sufficiently uncorrelated to the hash function
- 00:36:53that they've chosen--
- 00:36:54which, usually, it is--
- 00:36:56this is a pretty good performance.
- 00:36:58But this is not a practical class.
- 00:37:03Well, it is a practical class, but one of the things
- 00:37:05that we are--
- 00:37:07that's the emphasis of this class
- 00:37:09is making sure we can prove that this is good in theory as well.
- 00:37:13I don't want to know that sometimes this will be good.
- 00:37:17I really want to know that, if I choose--
- 00:37:21if I make this data structure and I put some inputs on it,
- 00:37:26I want a running time that is independent on what
- 00:37:30inputs I decided to use, independent of what keys
- 00:37:34I decided to store.
- 00:37:35Does that make sense?
- 00:37:36
- 00:37:40But it's impossible for me to pick a fixed hash function that
- 00:37:44will achieve this, because I just
- 00:37:45told you that, if u is large--
- 00:37:48this is u-- if u is large, then there
- 00:37:52exists inputs that map everything to one place.
- 00:37:55
- 00:37:57I'm screwed, right?
- 00:37:58There's no way to solve this problem.
- 00:38:00
- 00:38:03That's true if I want a deterministic hash function--
- 00:38:06I want the thing to be repeatable,
- 00:38:07to do the same thing over and over again
- 00:38:09for any set of inputs.
- 00:38:12What can I do instead?
- 00:38:14Weaken my notion of what constant time is to do better--
- 00:38:18
- 00:38:22OK, use a non-deterministic--
- 00:38:24what does non-deterministic mean?
- 00:38:26It means don't choose a hash function up front--
- 00:38:31choose one randomly later.
- 00:38:34So have the user--
- 00:38:35they pick whatever inputs they're going to do,
- 00:38:38and then I'm going to pick a hash function randomly.
- 00:38:40They don't know which hash function I'm going to pick,
- 00:38:42so it's hard for them to give me an input that's bad.
- 00:38:45
- 00:38:49I'm going to choose a random hash function.
- 00:38:52Can I choose a hash function from the space
- 00:38:55of all hash functions?
- 00:38:58What is the space of all hash functions of this form?
- 00:39:00
- 00:39:03For every one of these values, I give a value in here.
- 00:39:06
- 00:39:10For each one of these, I choose an independently random number
- 00:39:12in this range-- how many such hash functions are there?
- 00:39:15
- 00:39:19m to the u-- that's a lot of things.
- 00:39:25So I can't do that.
- 00:39:26What I can do is fix a family of hash functions
- 00:39:29where, if I choose one from-- randomly,
- 00:39:32I get good performance.
- 00:39:33And so the hash function I'm going to use,
- 00:39:36and we're going to spend the rest of the time on,
- 00:39:39is what I call a universal hash function.
- 00:39:43It satisfies what we call a universal hash property--
- 00:39:47so universal hash function.
- 00:39:53And this is a little bit of a weird nomenclature,
- 00:39:56because I'm defining this to you as the universal hash function,
- 00:40:01but actually, universal is a descriptor.
- 00:40:05There exist many universal hash functions.
- 00:40:09This just happens to be an example of one of them.
- 00:40:12OK?
- 00:40:12
- 00:40:23So here's the hash function--
- 00:40:27doesn't look actually all that different.
- 00:40:32Goodness gracious-- how many parentheses are there--
- 00:40:36mod p, mod m.
- 00:40:41OK.
- 00:40:41So it's kind of doing the same thing as what's happening up
- 00:40:44here, but before modding by m, I'm multiplying it by a number,
- 00:40:52I'm adding a number, I'm taking it mod another number,
- 00:40:55and then I'm modding by m.
- 00:40:57This is a little weird.
- 00:40:58And not only that-- this is still a fixed hash function.
- 00:41:02I don't want that.
- 00:41:03I want to generalize this to be a family of hash functions,
- 00:41:10which are this h_ab of k for some random choice of a,
- 00:41:21b in this larger range.
- 00:41:26
- 00:41:29All right, this is a lot of notation here.
- 00:41:34Essentially what this is saying is, I have a hash family.
- 00:41:40It's parameterized by the length of my hash table
- 00:41:43and some fixed large random prime that's bigger than u.
- 00:41:48I'm going to pick some large prime number,
- 00:41:52and that's going to be fixed when I make the hash table.
- 00:41:55
- 00:41:58And then, when I instantiate the hash table,
- 00:42:02I'm going to choose randomly one of these things
- 00:42:06by choosing a random a and a random b from this range.
- 00:42:10Does that make sense?
- 00:42:12AUDIENCE: [INAUDIBLE]
- 00:42:16JASON KU: This is a not equal to 0.
- 00:42:19If I had 0 here, I lose the key information,
- 00:42:22and that's no good.
- 00:42:23
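A sketch of drawing one function from this family in Python (the specific prime is an assumption standing in for any fixed prime p larger than the key universe u):

```python
import random

def random_universal_hash(m, p=2**31 - 1):
    """Draw h_ab(k) = ((a*k + b) mod p) mod m with a, b chosen at random.
    p is a fixed prime assumed to be larger than the key universe u."""
    a = random.randrange(1, p)   # a != 0, or the key information is lost
    b = random.randrange(0, p)
    return lambda k: ((a * k + b) % p) % m

# Example: the table draws its hash at startup, unknown to the user.
h = random_universal_hash(100)
print(h(123456789))              # some slot in 0..99
```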
- 00:42:26Does this make sense?
- 00:42:27So what this is doing is multiplying this key
- 00:42:30by some random number, adding some random number,
- 00:42:34modding by this prime, and then modding
- 00:42:37by the size of my thing.
- 00:42:39So it's doing a bunch of jumbling,
- 00:42:41and there's some randomness involved here.
- 00:42:43I'm choosing the hash function by choosing an a,
- 00:42:46b randomly from this thing.
- 00:42:47So when I start up my program, I'm
- 00:42:53going to instantiate this thing with some random a and b,
- 00:42:56not deterministically.
- 00:42:58The user, when they're using this thing,
- 00:43:01doesn't know which a and b I picked,
- 00:43:04so it's really hard for them to give me a bad example.
- 00:43:07And this universal hash function--
- 00:43:11this universal hash family, shall we say-- really,
- 00:43:13this is a family of functions, and I'm choosing one randomly
- 00:43:17within that family--
- 00:43:20is universal.
- 00:43:21And universality says that--
- 00:43:26what is the property of universality?
- 00:43:30It means that the probability, by choosing a hash function
- 00:43:34from this hash family, that a certain key collides
- 00:43:43with another key is less than or equal to 1/m for all--
- 00:43:52any two different keys in my universe.
- 00:43:57
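The property written out (just the definition being stated here, in LaTeX):

```latex
% Universality: over the random choice of h from the family H,
% any two fixed distinct keys collide with probability at most 1/m.
\Pr_{h \in H}\bigl[\, h(k_i) = h(k_j) \,\bigr] \;\le\; \frac{1}{m}
\qquad \text{for all } k_i \ne k_j \in \mathcal{U}
```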
- 00:44:02Does that make sense?
- 00:44:03
- 00:44:05Basically, this thing has the property that, if I randomly--
- 00:44:10for any two keys that I pick in my universe space,
- 00:44:16if I randomly choose a hash function,
- 00:44:19the probability that these things collide
- 00:44:22is less than 1/m.
- 00:44:23Why is that good?
- 00:44:25This is, in some sense, a measure
- 00:44:26of how well distributed these things are.
- 00:44:30I want these things to collide with 1/m probability
- 00:44:35so that these things don't collide very--
- 00:44:39it's not very likely for these things to collide.
- 00:44:41Does that make sense?
- 00:44:43So we want proof that this hash family satisfies
- 00:44:46this universality property.
- 00:44:48You'll do that in 046.
- 00:44:50But we can use this result to show that,
- 00:44:54if we use a universal-- this universal hash family,
- 00:44:58that the length of our change--
- 00:45:01chains is expected to be constant length.
- 00:45:06So we're going to use this property to prove that.
- 00:45:10How do we prove that?
- 00:45:11We're going to do a little probability.
- 00:45:15So how are we going to prove that?
- 00:45:16I'm going to define a random variable, an indicator
- 00:45:20random variable.
- 00:45:20Does anyone remember what an indicator in a variable is?
- 00:45:23Yeah, it's a variable that, with some amount of probability,
- 00:45:28is 1, and 1 minus that probability is 0.
- 00:45:33So I'm going to define this indicator
- 00:45:35random variable xij is a random variable over my choice--
- 00:45:44over choice of a hash function in my hash family.
- 00:45:50And what does this mean?
- 00:45:52It means xij equals 1, if h of Ki equals h of Kj--
- 00:46:04these things collide-- and 0 otherwise.
- 00:46:09
- 00:46:13So I'm choosing randomly over this hash family.
- 00:46:18If, for two keys--
- 00:46:22keys i and j--
- 00:46:24if these things collide, that's going to be 1.
- 00:46:27If they don't, then it's 0.
- 00:46:29OK?
- 00:46:30Then, how can we write a formula for the length
- 00:46:34of a chain in this model?
- 00:46:37So the size of a chain--
- 00:46:39
- 00:46:43or let's put it here--
- 00:46:46the size of the chain at i--
- 00:46:55at i in my hash table--
- 00:46:58is going to equal--
- 00:47:00I'm going to call that the random variable xi--
- 00:47:03that's going to equal the sum over j equals 0 to--
- 00:47:07
- 00:47:10what is it-- over, I think, u minus 1 of summation--
- 00:47:17or sorry-- of xij.
- 00:47:20So basically, if I fix this location i,
- 00:47:33this is where this key goes.
- 00:47:35
- 00:47:38Sorry.
- 00:47:38This is the size of chain at h of Ki.
- 00:47:44Sorry.
- 00:47:45So I look at wherever Ki goes is hashed,
- 00:47:49and I see how many things collide with it.
- 00:47:52I'm just summing over all of these things,
- 00:47:55because this is 1 if there's a collision and 0 if there's not.
- 00:47:58Does that make sense?
- 00:48:00So this is the size of the chain at the index location mapped
- 00:48:04to by Ki.
- 00:48:06
- 00:48:09So here's where your probability comes in.
- 00:48:13What's the expected value of this chain
- 00:48:15length over my random choice?
- 00:48:18Expected value of choosing a hash function
- 00:48:22from this universal hash family of this chain length--
- 00:48:25
- 00:48:29I can put in my definition here.
- 00:48:31That's the expected value of the summation over j of xij.
- 00:48:38
- 00:48:45What do I know about expectations and summations?
- 00:48:49
- 00:48:53If these variables are independent from each other--
- 00:48:56AUDIENCE: [INAUDIBLE]
- 00:48:58JASON KU: Say what?
- 00:49:00AUDIENCE: [INAUDIBLE]
- 00:49:02JASON KU: Linearity of expectation--
- 00:49:05basically, the expectation sum of these independent random
- 00:49:08variables is the same as the summation
- 00:49:10of their expectations.
- 00:49:12So this is equal to the summation
- 00:49:14over j of the expectations of these individual ones.
- 00:49:18
- 00:49:26One of these j's is the same as i.
- 00:49:32j loops over all of the things from 0 to u minus 1.
- 00:49:37One of them is i, so when j equals i, what is the expected value
- 00:49:47that they collide?
- 00:49:491-- so I'm going to refactor this
- 00:49:52as being this, where j does not equal i, plus 1.
- 00:49:59Are people OK with that?
- 00:50:00Because if i equals--
- 00:50:04if j and i are equal, they definitely collide.
- 00:50:08They're the same key.
- 00:50:10So I'm expected to have one guy there, which
- 00:50:13was the original key, xi.
- 00:50:16But otherwise, we can use this universal property
- 00:50:22that says, if they're not equal and they collide--
- 00:50:27which is exactly this case--
- 00:50:30the probability that that happens is 1/m.
- 00:50:35And since it's an indicator random variable,
- 00:50:38the expectation is there are outcomes
- 00:50:41times their probabilities-- so 1 times that probability
- 00:50:45plus 0 times 1 minus that probability, which is just 1/m.
- 00:50:51So now we get the summation of 1/m for j
- 00:50:58not equal to i plus 1.
- 00:51:02
- 00:51:08Oh, and this-- sorry.
- 00:51:10I did this wrong.
- 00:51:11This isn't u.
- 00:51:12This is n.
- 00:51:13We're storing n keys.
- 00:51:17OK, so now I'm looping over j--
- 00:51:20this over all of those things.
- 00:51:22How many things are there?
- 00:51:23n minus 1 things, right?
- 00:51:26So this should equal 1 plus n minus 1 over m.
- 00:51:32So that's what universality gives us.
- 00:51:35So as long as we choose m to be larger than n,
- 00:51:41or at least linear in n, then we're
- 00:51:44expected to have our chain lengths be constant,
- 00:51:49because this thing becomes a constant if m is at least order
- 00:51:54n.
- 00:51:55Does that make sense?
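The whole computation in one chain of (in)equalities (a clean write-up of the board work above, with the sum over the n stored keys):

```latex
% Expected length of the chain holding k_i, by linearity of expectation
% and universality (the j = i term contributes exactly 1):
\mathbb{E}\bigl[X_i\bigr]
  = \sum_{j=0}^{n-1} \mathbb{E}\bigl[X_{ij}\bigr]
  = 1 + \sum_{j \ne i} \Pr\bigl[h(k_i) = h(k_j)\bigr]
  \le 1 + \frac{n-1}{m} = O(1) \quad \text{when } m = \Omega(n)
```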
- 00:51:57OK.
- 00:51:58The last thing I'm going to leave you with
- 00:52:00is, how do we make this thing dynamic?
- 00:52:02If we're growing the number of things
- 00:52:05we're storing in this thing, it's
- 00:52:07possible that, as we grow n for a fixed m,
- 00:52:10this thing will stop being--
- 00:52:13m will stop being linear in n, right?
- 00:52:15Well, then all we have to do is, if we get too far,
- 00:52:20we rebuild the entire thing--
- 00:52:22the entire hash table with the new m,
- 00:52:24just like we did with a dynamic array.
- 00:52:27And you can prove--
- 00:52:28we're not going to do that here, but you
- 00:52:31can prove that you won't do that operation too often, if you're
- 00:52:35resizing in the right way.
- 00:52:37And so you just rebuild completely
- 00:52:40after a certain number of operations.
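A sketch of that resize step, extending the ChainedHashTable sketch from earlier (the doubling policy mirrors the dynamic array; the exact threshold is a design choice, not fixed by the lecture, and a full version would also redraw a fresh universal hash at each rebuild):

```python
class ResizingChainedHashTable(ChainedHashTable):
    """Grows the table when n exceeds m, keeping chains expected O(1)."""
    def __init__(self, m=8):
        super().__init__(m)
        self.n = 0                          # number of stored items

    def insert(self, x):
        if self.find(x.key) is None:        # new key; replacement keeps n fixed
            self.n += 1
        super().insert(x)
        if self.n > self.m:                 # load factor > 1: rebuild, doubled
            self._rebuild(2 * self.m)

    def _rebuild(self, new_m):              # O(n) now, amortized O(1) per insert
        old = [x for chain in self.table for x in chain]
        self.m = new_m
        self.table = [[] for _ in range(new_m)]
        for x in old:                       # re-insert everything at the new size
            self.table[self._slot(x.key)].append(x)
```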
- 00:52:42OK, so that's hashing.
- 00:52:44Next week, we're going to be talking
- 00:52:45about doing a faster sort.
- 00:52:48
- hashing
- set data structures
- comparison model
- universal hashing
- collisions
- performance
- direct access array
- hash tables
- dynamically resizing
- searching efficiency