[music playing] >> Doug Lloyd: By now you know a lot about arrays, and you know a lot about linked lists. And we've discussed the pros and cons, we've discussed that linked lists can get bigger and smaller, but they take up more size. Arrays are much more straightforward to use, but they're restrictive inasmuch as we have to set the size of the array at the very beginning and then we're stuck with it.

>> But we've pretty much exhausted all of our topics about linked lists and arrays. Or have we? Maybe we can do something even more creative. And that sort of lends itself to the idea of a hash table. >> So in a hash table we're going to try to combine an array with a linked list. We're going to take the advantages of the array, like random access, being able to just go to array element 4 or array element 8 without having to iterate across.

That's pretty fast, right? >> But we also want to have our data structure be able to grow and shrink. We don't want to be restricted. And we want to be able to add and remove things very easily, which if you recall, is very complex with an array. And we can call this new thing a hash table. >> And if implemented correctly, we're sort of taking the advantages of both data structures you've already seen, arrays and linked lists.

Insertion can start to tend toward theta of 1. Theta we haven't really discussed, but theta is just the average case, what's actually going to happen. You're not always going to have the worst case scenario, and you're not always going to have the best case scenario, so what's the average scenario? >> Well an average insertion into a hash table can start to get close to constant time. And deletion can get close to constant time.

And lookup can get close to constant time. That's-- we don't have a data structure yet that can do that, and so this already sounds like a pretty great thing. We've really mitigated the disadvantages of each on its own. >> To get this performance upgrade though, we need to rethink how we add data into the structure. Specifically we want the data itself to tell us where it should go in the structure. And if we then need to see if it's in the structure, if we need to find it,

we want to look at the data again and be able to effectively, using the data, randomly access it. Just by looking at the data we should have an idea of where exactly we're going to find it in the hash table. >> Now the downside of a hash table is that they're really pretty bad at ordering or sorting data. And in fact, if you start to use them to order or sort data you lose all of the advantages you previously had in terms of insertion and deletion.

The time becomes closer to theta of n, and we've basically regressed into a linked list. And so we only want to use hash tables if we don't care about whether data is sorted. For the context in which you'll use them in CS50 you probably don't care that the data is sorted. >> So a hash table is a combination of two distinct pieces with which we're familiar. The first is a function, which we usually call a hash function.

And that hash function is going to return some non-negative integer, which we usually call a hashcode, OK? The second piece is an array, which is capable of storing data of the type we want to place into the data structure. We'll hold off on the linked list element for now and just start with the basics of a hash table to get your head around it, and then we'll maybe blow your mind a little bit when we combine arrays and linked lists together. >> The basic idea though is we take some data.

We run that data through the hash function. And so the data is processed and it spits out a number, OK? And then with that number we just store the data we want to store in the array at that location. So for example we have maybe this hash table of strings. It's got 10 elements in it, so we can fit 10 strings in it. >> Let's say we want to hash John. So John is the data we want to insert into this hash table somewhere. Where do we put it?

Well typically with an array so far we probably would put it in array location 0. But now we have this new hash function. >> And let's say that we run John through this hash function and it spits out 4. Well that's where we're going to want to put John. We want to put John in array location 4, because if we hash John again-- let's say later we want to search and see if John exists in this hash table-- all we need to do

is run it through the same hash function, get the number 4 out, and be able to find John immediately in our data structure. That's pretty good. >> Let's say we now do this again, we want to hash Paul. We want to add Paul into this hash table. Let's say that this time we run Paul through the hash function, the hashcode that is generated is 6. Well now we can put Paul in the array location 6. And if we need to look up whether Paul is in this hash table,

all we need to do is run Paul through the hash function again and we're going to get 6 out again. >> And then we just look at array location 6. Is Paul there? If so, he's in the hash table. Is Paul not there? He's not in the hash table. It's pretty straightforward. >> Now how do you define a hash function?
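A minimal sketch in C of the store-and-look-up idea just described. The names `TABLE_SIZE`, `store`, `contains`, and `demo_hash` are hypothetical, and `demo_hash` is only a stand-in that hard-codes the hashcodes from the walkthrough (John gives 4, Paul gives 6); it is not a real hash function. Collisions are deliberately ignored at this stage.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define TABLE_SIZE 10 // the example table has room for 10 strings

// Stand-in for a hash function (assumption): it returns the hashcodes used
// in the walkthrough instead of computing anything from the characters.
unsigned int demo_hash(const char *name)
{
    if (strcmp(name, "John") == 0)
    {
        return 4;
    }
    if (strcmp(name, "Paul") == 0)
    {
        return 6;
    }
    return 0;
}

// Store name at the array location chosen by its hashcode.
// This sketch overwrites whatever is already there (no collision handling).
void store(const char *table[], const char *name)
{
    unsigned int code = demo_hash(name);
    assert(code < TABLE_SIZE);
    table[code] = name;
}

// Hash the name again and look only at that one array location.
bool contains(const char *table[], const char *name)
{
    unsigned int code = demo_hash(name);
    assert(code < TABLE_SIZE);
    return table[code] != NULL && strcmp(table[code], name) == 0;
}
```

Starting from a table whose 10 entries are all `NULL`, calling `store` for John and Paul fills locations 4 and 6, and `contains` finds each of them by checking a single location rather than scanning the array.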

Well there's really no limit to the number of possible hash functions. In fact there's a number of really, really good ones on the internet. There's a number of really, really bad ones on the internet. It's also pretty easy to write a bad one. >> So what makes up a good hash function, right? Well a good hash function should use only the data being hashed, and all of the data being hashed. So we don't want to use anything-- we don't incorporate anything else other than the data.

And we want to use all of the data. We don't want to just use a piece of it, we want to use all of it. A hash function should also be deterministic. What does that mean? Well it means that every time we pass the exact same piece of data into the hash function we always get the same hashcode out. If I pass John into the hash function I get out 4. I should be able to do that 10,000 times and I'll always get 4. So no random numbers effectively can be involved in our hash tables--

in our hash functions. >> A hash function should also uniformly distribute data. If every time you run data through the hash function you get the hashcode 0, that's probably not so great, right? You probably want a big range of hash codes, so things can be spread out throughout the table. And also it would be great if really similar data, like John and Jonathan, maybe were spread out to way different locations in the hash table. That would be a nice advantage.

>> Here's an example of a hash function. I wrote this one up earlier. It's not a particularly good hash function for reasons that don't really bear going into right now. But do you see what's going on here? It seems like we're declaring a variable called sum and setting it equal to 0. And then apparently I'm doing something so long as str[j] is not equal to backslash 0. What am I doing there?

>> This is basically just another way of implementing strlen and detecting when you've reached the end of the string. So I don't have to actually calculate the length of the string, I'm just using when I hit the backslash 0 character I know I've reached the end of the string. And then I'm going to keep iterating through that string, adding str[j] to sum, and then at the end of the day going to return sum mod HASH_MAX. >> Basically all this hash function is doing is adding up

all of the ASCII values of my string, and then it's returning some hashcode modded by HASH_MAX. It's probably the size of my array, right? I don't want to be getting hash codes-- if my array is of size 10, I don't want to be getting out hash codes 11, 12, 13. I can't put things into those locations of the array, that would be illegal. I'd suffer a segmentation fault. >> Now here is another quick aside.
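The code itself is shown on screen in the video and is not part of this transcript, so what follows is a reconstruction in C from the spoken description, not the original. The names `hash`, `str`, `sum`, and `HASH_MAX` follow what is said aloud; the value 10 for `HASH_MAX` is an assumption taken from the 10-element example table, and the `unsigned` types and the cast are choices made for this sketch.

```c
#include <assert.h>
#include <stddef.h>

#define HASH_MAX 10 // assumed: the size of the array in the running example

// Add up the character values of str, then reduce the total into the range
// 0 .. HASH_MAX - 1 so the result is always a valid array index.
unsigned int hash(const char *str)
{
    assert(str != NULL);
    unsigned int sum = 0;
    // Walk the string until the terminating '\0', as strlen would.
    for (int j = 0; str[j] != '\0'; j++)
    {
        sum += (unsigned char) str[j];
    }
    return sum % HASH_MAX;
}
```

Because addition ignores order, any two anagrams such as "ab" and "ba" receive the same hashcode from a function like this, which is probably one of the reasons it does not distribute data well.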

Generally you're probably not going to want to write your own hash functions. It is actually a bit of an art, not a science. And there's a lot that goes into them. The internet, like I said, is full of really good hash functions, and you should use the internet to find hash functions because it's really just kind of an unnecessary waste of time to create your own. >> You can write simple ones for testing purposes. But when you actually are going to start hashing data and storing it into a hash table you're probably going to want

to use some function that was generated for you, that exists on the internet. If you do just be sure to cite your sources. There's no reason to plagiarize anything here. >> The computer science community is definitely growing, and really values open source, and it's really important to cite your sources so that people can get attribution for the work that they're doing to the benefit of the community. So always be sure-- and not just for hash functions, but generally when you use code from an outside source,

always cite your source. Give credit to the person who did some of the work so you don't have to. >> OK so let's revisit this hash table for a second. This is where we left off after we inserted John and Paul into this hash table. Do you see a problem here? You might see two. But in particular, do you see this possible problem? >> What if I hash Ringo, and it turns out that after processing

that data through the hash function Ringo also generated the hashcode 6. I've already got data at hashcode-- array location 6. So it's probably going to be a bit of a problem for me now, right? >> We call this a collision. And the collision occurs when two pieces of data run through the same hash function yield the same hashcode. Presumably we still want to get both pieces of data into the hash table, otherwise we wouldn't be running Ringo arbitrarily through the hash function. We presumably want to get Ringo into that array.

>> How do we do it though, if he and Paul both yield hashcode 6? We don't want to overwrite Paul, we want Paul to be there too. So we need to find a way to get elements into the hash table that still preserves our quick insertion and quick look up. And one way to deal with it is to do something called linear probing. >> Using this method if we have a collision, well, what do we do? Well we can't put him in array location 6, or whatever hashcode was generated, let's put him at hashcode plus 1. And if that's full let's put him in hashcode plus 2.

The benefit of this being if he's not exactly where we think he is, and we have to start searching, maybe we don't have to go too far. Maybe we don't have to search all n elements of the hash table. Maybe we have to search a couple of them. >> And so we're still tending towards that average case being close to 1 vs close to n, so maybe that'll work. So let's see how this might work out in reality. And let's see if maybe we can detect the problem that might occur here. >> Let's say we hash Bart.

So now we're going to run a new set of strings through the hash function, and we run Bart through the hash function, we get hashcode 6. We take a look, we see 6 is empty, so we can put Bart there. >> Now we hash Lisa and that also generates hashcode 6. Well now that we're using this linear probing method we start at 6, we see that 6 is full. We can't put Lisa in 6. So where do we go? Let's go to 7.

7's empty, so that works. So let's put Lisa there. >> Now we hash Homer and we get 7. OK well we know that 7's full now, so we can't put Homer there. So let's go to 8. Is 8 available? Yeah, and 8's close to 7, so if we have to start searching we're not going to have to go too far. And so let's put Homer at 8.

>> Now we hash Maggie and it returns 3, thank goodness we're able to just put Maggie there. We don't have to do any sort of probing for that. Now we hash Marge, and Marge also returns 6. >> Well 6 is full, 7 is full, 8 is full, 9, all right thank god, 9 is empty. I can put Marge at 9. Already we can see that we're starting to have this problem where now we're starting to stretch things kind of far away from their hash codes. And that theta of 1, that average case of being constant time,

is starting to get a little more-- starting to tend a little more towards theta of n. We're starting to lose that advantage of hash tables. >> This problem that we just saw is something called clustering. And what's really bad about clustering is that once you now have two elements that are side by side it makes it even more likely, you have double the chance, that you're going to have another collision with that cluster, and the cluster will grow by one.
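The linear probing walkthrough above can be sketched in C as follows. The names `probe_insert` and `TABLE_SIZE` are hypothetical, and the hashcode is passed in directly, using the values from the example (Bart 6, Lisa 6, Homer 7, Maggie 3, Marge 6), rather than computed by a real hash function. Wrapping around to location 0 after the last slot is an assumption of this sketch; the walkthrough only shows stepping forward.

```c
#include <assert.h>
#include <stddef.h>

#define TABLE_SIZE 10

// Try the slot at hashcode first, then hashcode + 1, hashcode + 2, and so on,
// wrapping past the end of the array (an assumption of this sketch).
// Returns the index that was used, or -1 if every slot is already taken.
int probe_insert(const char *table[], unsigned int hashcode, const char *name)
{
    assert(hashcode < TABLE_SIZE);
    for (unsigned int step = 0; step < TABLE_SIZE; step++)
    {
        unsigned int i = (hashcode + step) % TABLE_SIZE;
        if (table[i] == NULL)
        {
            table[i] = name;
            return (int) i;
        }
    }
    return -1; // table is full: give up instead of circling forever
}
```

With an initially empty table, the five inserts from the walkthrough land at 6, 7, 8, 3, and 9, so Marge ends up three slots away from her hashcode, which is the cluster forming around location 6.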

And you'll keep growing and growing your likelihood of having a collision. And eventually it's just as bad as not sorting the data at all. >> The other problem though is we still, and so far up to this point, we've just been sort of understanding what a hash table is, we still only have room for 10 strings. If we want to continue to hash the citizens of Springfield, we can only get 10 of them in there. And if we try and add an 11th or 12th, we don't have a place to put them. We could just be spinning around in circles trying to find an empty spot,

and we maybe get stuck in an infinite loop. >> So this sort of lends itself to the idea of something called chaining. And this is where we're going to bring linked lists back into the picture. What if instead of storing just the data itself in the array, every element of the array could hold multiple pieces of data? Well that doesn't make sense, right? We know that an array can only hold-- each element of an array can only hold one piece of data of that data type. >> But what if that data type is a linked list, right?

So what if every element of the array was a pointer to the head of a linked list? And then we could build those linked lists and grow them arbitrarily, because linked lists allow us to grow and shrink a lot more flexibly than an array does. So what if we now use, we leverage this, right? We start to grow these chains out of these array locations. >> Now we can fit an infinite amount of data, or not infinite, an arbitrary amount of data, into our hash table

without ever running into the problem of collision. We've also eliminated clustering by doing this. And well we know that when we insert into a linked list, if you recall from our video on linked lists, singly linked lists and doubly linked lists, it's a constant time operation. We're just adding to the front. >> And for look up, well we do know that look up in a linked list can be a problem, right? We have to search through it from beginning to end.

There's no random access in a linked list. But if instead of having one linked list where a lookup would be O of n, we now have 10 linked lists, or 1,000 linked lists, now it's O of n divided by 10, or O of n divided by 1,000. >> And while when we're talking theoretically about complexity we disregard constants, in the real world these things actually matter, right? We actually will notice that this happens to run 10 times faster, or 1,000 times faster,

because we're distributing one long chain across 1,000 smaller chains. And so each time we have to search through one of those chains we can ignore the 999 chains we don't care about, and just search that one. >> Which is on average going to be 1,000 times shorter. And so we still are sort of tending towards this average case of being constant time, but only because we are leveraging dividing by some huge constant factor. Let's see how this might actually look though. So this was the hash table we had before. We declared a hash table that

was capable of storing 10 strings. We're not going to do that anymore. We already know the limitations of that method. Now our hash table's going to be an array of 10 node pointers, pointers to heads of linked lists. >> And right now it's null. Each one of those 10 pointers is null. There's nothing in our hash table right now. >> Now let's start to put some things into this hash table.

And let's see how this method is going to benefit us a little bit. Let's now hash Joey. We'll run the string Joey through a hash function and we return 6. Well what do we do now? >> Well now working with linked lists, we're not working with arrays. And when we're working with linked lists we know we need to start dynamically allocating space and building chains. That's sort of how-- those are the core elements of building a linked list. So let's dynamically allocate space for Joey,

and then let's add him to the chain. >> So now look what we've done. When we hash Joey we got the hashcode 6. Now the pointer at array location 6 points to the head of a linked list, and right now it's the only element of a linked list. And the node in that linked list is Joey. >> So if we need to look up Joey later, we just hash Joey again, we get 6 again because our hash function is deterministic. And then we start at the head of the linked list pointed

to by array location 6, and we can iterate across that trying to find Joey. And if we build our hash table effectively, and our hash function effectively to distribute data well, on average each of those linked lists at every array location will be 1/10 the size of if we just had it as a single huge linked list with everything in it. >> If we distribute that huge linked list across 10 linked lists each list will be 1/10 the size.

And thus 10 times quicker to search through. So let's do this again. Let's now hash Ross. >> And let's say Ross, when we do that the hash code we get back is 2. Well now we dynamically allocate a new node, we put Ross in that node, and we say now array location 2, instead of pointing to null, points to the head of a linked list whose only node is Ross. And we can do this one more time, we can hash Rachel and get hashcode 4. Malloc a new node, put Rachel in the node, and say array location

4 now points to the head of a linked list whose only element happens to be Rachel. >> OK but what happens if we have a collision? Let's see how we handle collisions using the separate chaining method. Let's hash Phoebe. We get the hashcode 6. In our previous example we were just storing the strings in the array. This was a problem. >> We don't want to clobber Joey, and we've already

seen that we can get some clustering problems if we try and step through and probe. But what if we just kind of treat this the same way, right? It's just like adding an element to the head of a linked list. Let's just malloc space for Phoebe. >> We'll say Phoebe's next pointer points to the old head of the linked list, and then 6 just points to the new head of the linked list. And now look, we've chained Phoebe in. We can now store two elements with hashcode 6,

and we don't have any problems. >> That's pretty much all there is to chaining. And chaining is definitely the method that's going to be most effective for you if you are storing data in a hash table. But this combination of arrays and linked lists together to form a hash table really dramatically improves your ability to store large amounts of data, and very quickly and efficiently search through that data. >> There's still one more data structure out there

that might even be a bit better in terms of guaranteeing that our insertion, deletion, and look up times are even faster. And we'll see that in a video on tries. I'm Doug Lloyd, this is CS50.
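The chaining steps walked through above (get a hashcode, malloc a node, point it at the old head, make the array location point at the new node) might be sketched in C like this. The struct layout and the names `node`, `chain_insert`, `chain_contains`, `chain_free`, and `TABLE_SIZE` are hypothetical, and the hashcode is passed in directly using the values from the example (Joey 6, Ross 2, Rachel 4, Phoebe 6) rather than computed by a real hash function.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

#define TABLE_SIZE 10

// One link in a chain: the stored string plus a pointer to the next node.
typedef struct node
{
    const char *name;
    struct node *next;
} node;

// Insert at the head of the chain for this hashcode: constant time,
// however long the chain already is. Returns false if malloc fails.
bool chain_insert(node *table[], unsigned int hashcode, const char *name)
{
    assert(hashcode < TABLE_SIZE);
    node *n = malloc(sizeof(node));
    if (n == NULL)
    {
        return false;
    }
    n->name = name;
    n->next = table[hashcode]; // new node points at the old head (or NULL)
    table[hashcode] = n;       // array location now points at the new head
    return true;
}

// Walk only the one chain that the hashcode selects.
bool chain_contains(node *table[], unsigned int hashcode, const char *name)
{
    assert(hashcode < TABLE_SIZE);
    for (node *cur = table[hashcode]; cur != NULL; cur = cur->next)
    {
        if (strcmp(cur->name, name) == 0)
        {
            return true;
        }
    }
    return false;
}

// Free every node in every chain and reset the heads to NULL.
void chain_free(node *table[])
{
    for (int i = 0; i < TABLE_SIZE; i++)
    {
        while (table[i] != NULL)
        {
            node *next = table[i]->next;
            free(table[i]);
            table[i] = next;
        }
    }
}
```

Starting from an array of 10 `NULL` pointers and inserting the four names in the order of the walkthrough leaves location 6 pointing at Phoebe, whose next pointer leads to Joey, so both elements with hashcode 6 are stored without overwriting either one.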