Class 13 Slides

Another data structure, in brief: hash tables
Short code example
Main program for inserting values into a hash table
Last sort method: Heapsort
Storing a heap tree in an array
The Swapdown method
"Heapify"
Extracting the sorted elements
The full algorithm, in brief, plus efficiency analysis
SwapDown code
HeapSort code

--------------------------------------------------------------------------------
Another data structure, in brief: hash tables

Linked lists are very useful when you don’t know in advance how much data you’ll need to store. But if the number of items in the list is large, searching is pretty inefficient, since you’re limited to linear search.

However, if you could break that large linked list up into several smaller linked lists, you’d get some increase in efficiency. The gain is not enough to change the theoretical order of the search algorithm, but it may still be significant in practice. The key is to be able to determine, when asked to search for a particular item, exactly which of the smaller linked lists it’s in.

A hash table uses an array to store pointers to several linked lists. The linked lists may or may not be sorted. To determine which list to store an item in, or to search for it in, we use a hash function to map the item to a list. The hash function uses data associated with the item to calculate a number within the range of legal subscripts for the array. It should be chosen so that it scatters the items well across the available subscripts, and therefore across the linked lists; that is, so that it is relatively unlikely that two items will hash to the same value.

For instance, you may need to be able to store and retrieve a potentially large number of string values. This is exactly what a compiler has to do: it must keep track of the names of all the variables in the program it’s compiling, and be able to retrieve them quickly. Suppose it chooses a hash table size of 200.
Then it might calculate the hash value for a variable name string this way:

    hash(string) = (sum of ASCII values of characters in the string) % 200

So the hash value for “santa” would be:

    115 + 97 + 110 + 116 + 97 = 535, and 535 % 200 = 135

You can adjust this to other hash table sizes simply by changing the value in the modulus calculation.

--------------------------------------------------------------------------------
Short code example

The hash function is pretty easy to code:

    const int HASHSIZE = 200;

    int hash(const apstring& str)
    {
        int i, sum = 0;

        // Add up the character codes, then fold the sum into 0..HASHSIZE-1
        for (i = 0; i < str.length(); i++)
            sum += str[i];
        return sum % HASHSIZE;
    }

The data type to be stored in your hash table could be defined as:

    struct hashItem
    {
        apstring varName;
        hashItem *next;
    };

And your hash table would be set up like this:

    apvector<hashItem*> hashtable(HASHSIZE, NULL);

--------------------------------------------------------------------------------
Main program for inserting values into a hash table

    // Assumes the iostream header and the AP classes (apstring, apvector)
    // are included, along with the definitions from the previous slide.
    int main()
    {
        apstring theWord;
        int hashvalue;
        hashItem *newItem;
        apvector<hashItem*> hashtable(HASHSIZE, NULL);

        cout << "Enter a word: ";
        getline(cin, theWord);
        while (theWord != "")
        {
            // Build a new list node for this word
            newItem = new hashItem;
            newItem->varName = theWord;
            newItem->next = NULL;

            // Insert the node at the front of the list it hashes to
            hashvalue = hash(theWord);
            if (hashtable[hashvalue] != NULL)
                newItem->next = hashtable[hashvalue];
            hashtable[hashvalue] = newItem;

            cout << "Enter a word: ";
            getline(cin, theWord);
        }
        return 0;
    }

--------------------------------------------------------------------------------
Last sort method: Heapsort

Again changing track completely! Heapsort is kind of odd, because we use diagrams of binary trees to illustrate the way the algorithm works, but it is best implemented using an array. Fortunately, this is not as difficult as it may sound.

Heapsort relies on the definition of a heap.
A heap is a binary tree with the following properties:

1. It is complete, which means that every level of the tree is completely filled, except possibly the bottom level, and on that level, the nodes are in the leftmost positions.
2. The item stored at each node is greater than or equal to the items stored in each of its children.

The first of these binary trees is a heap. The second is not, because it is not complete. The third is complete, but it does not have the second property.

--------------------------------------------------------------------------------
Storing a heap tree in an array

To store the data values in an array, we number the nodes in the tree from top to bottom, left to right, starting from 0 at the root. Then we use those values as the subscripts into the array. So for our example heap:

Notice that the children of node i can be found at 2*i + 1 and 2*i + 2. Also, the parent of a node i can be found at (i - 1) / 2, using integer division. So we will still be able to traverse the tree even though it’s stored in an array. In fact, we actually have a random access tree!

--------------------------------------------------------------------------------
The Swapdown method

Heapsort relies on a subalgorithm called Swapdown, which it uses in two different contexts. So before we can understand heapsort itself, we have to know how to Swapdown.

Swapdown works on a tree where the two subtrees of the root node are heaps, but the root node itself may not satisfy heap condition #2. That is, the item at the root may be less than one (or both) of its child items.

To correct the problem, we swap the item at the root with the larger of its two children (diagram on left). Notice that this, unfortunately, makes the left subtree a non-heap. So we simply “Swapdown” again (diagram on right). We continue until we find a place for the item we’re swapping down that satisfies condition #2 (which may mean swapping it as far down as the bottom level).
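
The index arithmetic from the array-storage slide, and heap condition #2, can be sketched in a few lines. This is only an illustration: it uses std::vector<int> in place of apvector (an assumption, so it compiles with a standard C++11 compiler), and the names leftChild, rightChild, parentOf, and isHeap are made up for this sketch.

```cpp
#include <cstddef>
#include <vector>

// Index helpers for a complete binary tree stored in an array,
// with the root at subscript 0 (hypothetical names).
inline std::size_t leftChild(std::size_t i)  { return 2 * i + 1; }
inline std::size_t rightChild(std::size_t i) { return 2 * i + 2; }
inline std::size_t parentOf(std::size_t i)   { return (i - 1) / 2; }  // only valid for i > 0

// Returns true if every item is <= its parent, which is heap condition #2.
// Condition #1 (completeness) holds automatically, because an array
// filled from subscript 0 upward has no gaps.
bool isHeap(const std::vector<int>& a)
{
    for (std::size_t i = 1; i < a.size(); i++)
        if (a[parentOf(i)] < a[i])
            return false;
    return true;
}
```

Checking each node against its parent visits every parent/child pair exactly once, so it is equivalent to checking each node against both of its children.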
--------------------------------------------------------------------------------
“Heapify”

As mentioned, Swapdown relies on the two subtrees being heaps already. How do they get that way? Well, by performing Swapdown on them.

If a node is a leaf, it’s already a heap. So we start from the lowest rightmost non-leaf node (which happens to be the parent of the lowest rightmost leaf node), and perform Swapdown right to left, bottom to top, on all the subtrees (all the non-leaf nodes) in the tree.

Notice that in our small example, that means we perform Swapdown on nodes 2, 1, and 0, in that order. Convenient that we numbered them that way, eh?

Here is a diagram of this Heapify process on a slightly larger tree:

--------------------------------------------------------------------------------
Extracting the sorted elements

Notice that none of our tree processing orders (pre-order, in-order, post-order) print the elements in order! So how the heck do we go from a heap to a sorted array?

One thing the heapify process has accomplished is to push the largest item to the top of the tree. What we will do is, in effect, remove that largest item from the heap, replace it with one of the other elements in the heap, then make it a heap again.

To make things simpler, the item we’ll replace it with will be the highest numbered leaf. This is because removing the highest numbered leaf doesn’t make the tree incomplete (condition #1).

What we actually do is swap the largest item (which is at position 0) with the highest numbered leaf (which is at the last position in the heap), then we reduce the size of the part of our array we consider the heap. Since moving the highest numbered leaf doesn’t disturb the heapness of the two subtrees of the root, we then only need to perform Swapdown on the root rather than do the whole heapify process.
Array before: 9 7 8 6 4 1 3 2 5
Array after:  5 7 8 6 4 1 3 2 9

In terms of a loop invariant, we could write (for array A of size N):

    A[0] through A[lastHeapPos] form a heap, and
    A[lastHeapPos + 1] through A[N - 1] are sorted

Each time through our main loop, the largest item remaining in the heap is “removed”, the heap is adjusted, and lastHeapPos decreases by one.

--------------------------------------------------------------------------------
The full algorithm, in brief, plus efficiency analysis

There are two main phases to Heapsort:

1. Heapify: perform SwapDown on each non-leaf node, bottom to top, right to left, to convert the items into a heap.
2. Order the elements from the heap: swap and prune the largest number, then perform SwapDown on the new root.

Since the tree we’re working with is complete, it will never have more than about log N levels (log base 2). And since SwapDown looks at only a constant number (2) of items per level, SwapDown is O(log N).

In the Heapify phase, SwapDown gets performed no more than N times (in fact, closer to 1/2 N, since about half the nodes are leaf nodes). In the Ordering phase, the same is true: SwapDown gets called exactly N times. So the whole algorithm is O(N log N). Finally, we have an array-based sort algorithm that does as well as the file-based Quicksort!

The code follows; I’ve written SwapDown recursively, which I think is more natural and easier to read than writing it with a loop.
--------------------------------------------------------------------------------
SwapDown code

    void SwapDown(apvector<int>& A, int parent, int nItems)
    {
        // Calculate indices for left and right children
        int leftChild = 2 * parent + 1;
        int rightChild = leftChild + 1;

        // Check whether leftChild exists; if not, this is a leaf
        if (leftChild >= nItems)
            return;

        // Check whether rightChild exists
        if (rightChild >= nItems)
        {
            // No right child; swap if parent less than child
            if (A[leftChild] > A[parent])
                swap(A[leftChild], A[parent]);
            return;
        }

        // If parent is greater than or equal to both children, stop
        if ( (A[leftChild] <= A[parent]) && (A[rightChild] <= A[parent]) )
            return;

        // Swap greater child with parent; do SwapDown on subtree
        if (A[leftChild] > A[rightChild])
        {
            swap(A[leftChild], A[parent]);
            SwapDown(A, leftChild, nItems);
        }
        else
        {
            swap(A[rightChild], A[parent]);
            SwapDown(A, rightChild, nItems);
        }
    }

--------------------------------------------------------------------------------
HeapSort code

    void HeapSort(apvector<int>& A)
    {
        int node;
        int nItems = A.length();

        // Perform SwapDown on each node, right to left (bottom to top);
        // leaf nodes return immediately, so including them is harmless
        for (node = nItems - 1; node >= 0; node--)
            SwapDown(A, node, nItems);

        int swapPos;

        // Repeatedly swap largest element into position and re-heapify
        for (swapPos = nItems - 1; swapPos >= 0; swapPos--)
        {
            swap(A[0], A[swapPos]);
            SwapDown(A, 0, swapPos);
        }
    }

Question: What would need to be changed in this code to have it sort apstrings instead of integers?
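
For comparison with the recursive SwapDown above, here is a loop-based sketch of the same two phases. It is not the version from these slides: it assumes std::vector<int> in place of apvector, the names swapDownLoop and heapSortLoop are invented for illustration, and the Heapify loop starts at the last non-leaf node (subscript n/2 - 1) rather than at the last node.

```cpp
#include <cstddef>
#include <utility>   // std::swap
#include <vector>

// Loop-based SwapDown (hypothetical name): push the item at 'parent' down
// until it is >= its children, looking only at the first nItems slots.
void swapDownLoop(std::vector<int>& a, std::size_t parent, std::size_t nItems)
{
    while (true) {
        std::size_t left = 2 * parent + 1;
        if (left >= nItems)
            return;                           // no children: a leaf

        // Pick the larger child (the right child may not exist)
        std::size_t larger = left;
        std::size_t right = left + 1;
        if (right < nItems && a[right] > a[left])
            larger = right;

        if (a[larger] <= a[parent])
            return;                           // condition #2 holds here

        std::swap(a[larger], a[parent]);
        parent = larger;                      // continue one level down
    }
}

void heapSortLoop(std::vector<int>& a)
{
    std::size_t n = a.size();

    // Phase 1, Heapify: non-leaf nodes n/2 - 1 down to 0
    for (std::size_t node = n / 2; node-- > 0; )
        swapDownLoop(a, node, n);

    // Phase 2: swap the largest item to the end, shrink the heap, repair the root
    for (std::size_t last = n; last-- > 1; ) {
        std::swap(a[0], a[last]);
        swapDownLoop(a, 0, last);
    }
}
```

Calling heapSortLoop on a vector sorts it in place in ascending order. The loop version trades the readability of the recursive one for not using any call-stack space per level.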