Monday, May 18, 2009

Hashcode & Equals in the Scala 2.8 Collections Library

[Edit]
Martin commented that the disparity between equals and hashCode has been fixed on the Scala 2.8 trunk, so that they now follow the hashCode-equals contract; most of this post is therefore out of date.

I just finished reading the whitepaper on the collections library that is going to be put into scala 2.8

First let me say that I really enjoy the new collections library. It looks like a vast improvement over the old one, and it should give Scala one of the nicest, most growable collections libraries of any language out there. With how well abstracted everything is, the new library will be a great starting point for letting users grow the language, and could in short order give Scala the end-all, be-all of collections libraries.

Having just watched Guy Steele's talk on Growing a Language (thanks to debasishg on Twitter for tweeting it), it seems like Martin Odersky and the other architects of the new library really channeled what he was getting at in that talk. The new library, while fairly complete in its own right, gives users both the flexibility to implement collections that are highly optimized and the ability to implement them very quickly by reusing existing code. It lets the vendor of a collection class spend more time on the implementation and less time providing the bazillion methods that clients reasonably expect from a collection class.

One thing did jump out at me as a bad idea in the new library. It is on page 15 and relates to how the new collections library handles hashCode for mutable collections, and equals for all collections.

Per the whitepaper:

It does not matter for the equality check whether a collection is mutable or immutable. For a mutable collection one simply considers its current elements at the time the equality test is performed. This means that a mutable collection might be equal to different collections at different times, depending on what elements are added or removed. This is a potential trap when using a mutable collection as a key in a hashmap.
Example:
val map = HashMap[Array[Int], Int]()
val xs = Array(1, 2, 3)
map += (xs -> 6)
xs(0) = 0
map(xs)
In this example, the access on the last line will most likely fail because the hashcode of the array xs has changed in the second-to-last line. Therefore, the hashcode-based lookup will look at a different place than the one where xs was stored.
To avoid this trap, mutable collection classes will not support taking their hashCode—they will throw an UnsupportedOperationException with the message “unsuitable as hash key” if hashCode is called.
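To see the trap in action with a collection whose hashCode actually is element-based, here is a sketch using ArrayBuffer instead of Array (this assumes, as in current Scala, that mutable sequences hash by their elements and that mutable.HashMap caches the key's hash on insertion):

```scala
import scala.collection.mutable

object MutableKeyTrap {
  def main(args: Array[String]): Unit = {
    val map = mutable.HashMap[mutable.ArrayBuffer[Int], Int]()
    val xs  = mutable.ArrayBuffer(1, 2, 3)
    map += (xs -> 6)

    val before = xs.hashCode  // hash of the elements (1, 2, 3)
    xs(0) = 0                 // mutate the key in place
    val after  = xs.hashCode  // almost certainly a different value

    // The lookup now hashes to a different place than the one
    // where the entry was stored, so the key becomes "invisible".
    println(before != after)  // very likely true
    println(map.get(xs))      // very likely None
  }
}
```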
I have a very bad feeling about any class throwing an exception when you ask for its hashCode, the same way I would have a bad feeling if a class threw an exception just for calling its toString or equals methods. I am generally not a big fan of surprise exceptions, and while I am well aware of the dangers of using mutable keys in a hash map, I just don't think the hashCode method should be throwing exceptions. It's going to be a real gotcha for any programmer.

Programmer adds gotcha to brain: "I can generally call hashCode on any object, except for this subtype of objects, which will throw an exception at me because it is dangerous to use them in hashmaps. If I am writing generic code that calls hashCode on objects, I need to catch exceptions, because certain objects decided they never want to be used in a hashmap, even though my code doesn't mutate them afterwards; I am just sending these objects out of the system for serialization, and I need a good key for them that obeys the hashCode contract. In that case I am in trouble, because there is no longer a guaranteed way to get a hashCode that obeys the hashCode/equals contract. What do I do?"

Further, if a mutable collection is excluded from being used as a key because it could be mutated after being added to a map, shouldn't any case class with a var field also be Unhashable? (A case class in Scala provides some syntactic sugar by auto-implementing the equals and hashCode methods based on the fields of the class.) Isn't Unhashable really just a synonym for Mutable? Is the real issue here forcing all clients to never use a mutable value as a key in a hashmap? Should any mutable class then throw UnsupportedOperationException when its hashCode is called? Is it really a good idea to enforce this constraint this hard?
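The same hazard is easy to demonstrate with a case class holding a var field (a sketch; the Point class here is made up for illustration):

```scala
// A case class with a mutable field: equals and hashCode are
// auto-generated from the fields, so mutating x changes both.
case class Point(var x: Int, y: Int)

object CaseClassKeyTrap {
  def main(args: Array[String]): Unit = {
    val p  = Point(1, 2)
    val h1 = p.hashCode
    p.x = 42            // mutate a field that hashCode depends on
    val h2 = p.hashCode
    println(h1 != h2)   // almost certainly true: the hash moved
  }
}
```

So by the whitepaper's own reasoning, Point is exactly as "unsuitable as hash key" as a mutable collection is, yet it hashes without complaint.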

Having said that, and given the subtlety of the errors that occur when programmers accidentally use mutable values as hash keys and then change them later, I do think making some sort of bell go off that says: "Warning, you just used a mutable value as a key in a hashmap! I hope you know what you're doing, because you probably just messed up" is a good idea.

Another thing that scares me about the 'this is Unhashable because it is mutable' ideology is that anyone who needs a hashCode for a mutable object for some reason (maybe some sort of DB serialization scenario, who knows) will be forced either to implement their own hash function or to first coerce the object into an immutable version just to get a hashCode, with possibly severe performance impacts. Since there is no built-in way to get a hashCode for a mutable collection, people will have to write their own hashing functions, and that is generally a bad idea: writing good hash functions is hard. It's better to do it right once in the library than to have everyone do it their own way. Can it really be said with absolute certainty that just because something is mutable, it must never generate a hashCode for any reason?
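The coercion workaround might look like this: pay an O(n) copy to get an immutable snapshot whose hashCode cannot change underneath you (a sketch; snapshotHash is a made-up helper name):

```scala
import scala.collection.mutable

object StableHash {
  // Hypothetical helper: copy the mutable buffer into an immutable
  // List and hash that, so later mutation cannot invalidate the hash.
  def snapshotHash[A](xs: mutable.Buffer[A]): Int = xs.toList.hashCode

  def main(args: Array[String]): Unit = {
    val buf = mutable.ArrayBuffer(1, 2, 3)
    val h   = snapshotHash(buf)  // hash of an immutable copy
    buf += 4                     // later mutation does not affect h
    println(h == List(1, 2, 3).hashCode)  // prints "true"
  }
}
```

Workable, but every caller pays the copy, and every caller has to remember to do it.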

It would be really nice if this could be enforced by the compiler as a constraint on the type parameters of HashMap; namely, the key type of a hashmap should be able to somehow say something like:
HashMap[K <: !Unhashable, V]
where K <: !Unhashable means K is not a subtype of Unhashable, and the compiler enforces the constraint, rather than the runtime.
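Scala has no negative subtype bounds, but a similar compile-time guarantee could be approximated with a marker type class: keys are accepted only when evidence that the type is hashable is in scope. This is a sketch under that assumption; SafeHashMap and Hashable are made-up names, not anything in the library:

```scala
object SafeHashing {
  // Marker type class: evidence that K is safe to use as a hash key.
  trait Hashable[K]

  // Provide instances only for types we trust not to mutate.
  implicit val intHashable: Hashable[Int]       = new Hashable[Int] {}
  implicit val stringHashable: Hashable[String] = new Hashable[String] {}

  // Hypothetical wrapper that demands the evidence at compile time
  // via the context bound [K: Hashable].
  class SafeHashMap[K: Hashable, V] {
    private val underlying = scala.collection.mutable.HashMap[K, V]()
    def update(k: K, v: V): Unit = underlying(k) = v
    def get(k: K): Option[V]     = underlying.get(k)
  }

  def main(args: Array[String]): Unit = {
    val m = new SafeHashMap[String, Int]
    m("one") = 1
    println(m.get("one"))  // prints "Some(1)"
    // new SafeHashMap[Array[Int], Int]  // would not compile:
    //   no implicit Hashable[Array[Int]] in scope
  }
}
```

The error moves from runtime to compile time, which is exactly where a "you cannot hash this" rule belongs.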


The other thing that stuck out was a potential gotcha around the way equality is handled.

The Whitepaper on Equality:
The collection libraries have a uniform approach to equality and hashing. The idea is, first, to divide collections into sets, maps, and sequences. Collections in different categories are always unequal. For instance, Set(1, 2, 3) is unequal to List(1, 2, 3) even though they contain the same elements. On the other hand, within the same category, collections are equal if and only if they have the same elements (for sequences: the same elements in the same order). For example, List(1, 2, 3) == Array(1, 2, 3), and HashSet(1, 2) == TreeSet(2, 1).

From the above, it seems that List(1,2,3) will now equal Array(1,2,3) even though they have different hashCodes (List's is some defined value, while Array's is undefined because it throws an exception), which violates the hashCode contract. Maybe this doesn't matter much in this particular case, because the mutable collections can never be used in a hashmap, but what about two different implementations of an immutable collection within the same category? If this is the direction being taken, then clients probably should never override equals or hashCode in their own implementations of collections. Conceptually, equals and hashCode should be final at the level of Map, Set, and Sequence; otherwise any client who writes a collection inheriting from one of these has to ensure that it generates the same hashCode as every other collection of the same category containing the same values, or else they risk breaking the contract put forth in the whitepaper.
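The contract in question is easy to state as a property: if two collections compare equal, their hashCodes must agree. A quick check with two immutable sequence implementations (a sketch; current Scala does satisfy this for List and Vector):

```scala
object ContractCheck {
  def main(args: Array[String]): Unit = {
    val a = List(1, 2, 3)
    val b = Vector(1, 2, 3)

    // Same category (sequences), same elements: equal per the whitepaper.
    println(a == b)                    // prints "true"
    // The hashCode/equals contract then demands equal hashes too.
    println(a.hashCode == b.hashCode)  // prints "true"
  }
}
```

Any user-written sequence that computes its hash differently would silently break this property, which is why the hashing scheme has to be owned by the library.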

Another approach might be to add a method to Sequence, Map, and Set called sameElements (or deepEquals, or equalElems, or whatever) that returns true iff the two collections are of the same category and contain the same elements.
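Such a method could be sketched as a standalone function: compare only within the same category, and fall back to element equality (sameElems is a made-up name here, and the element comparison leans on the per-category equals described above; note that modern Scala does ship a sameElements on iterables, with different semantics):

```scala
object SameCategory {
  // Hypothetical: true iff both collections belong to the same
  // category (both sequences, both sets, or both maps) and hold
  // the same elements.
  def sameElems(a: Iterable[_], b: Iterable[_]): Boolean = (a, b) match {
    case (x: Seq[_],    y: Seq[_])    => x == y
    case (x: Set[_],    y: Set[_])    => x == y
    case (x: Map[_, _], y: Map[_, _]) => x == y
    case _                            => false  // different categories
  }

  def main(args: Array[String]): Unit = {
    println(sameElems(List(1, 2, 3), Vector(1, 2, 3)))  // prints "true"
    println(sameElems(List(1, 2, 3), Set(1, 2, 3)))     // prints "false"
  }
}
```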

I am not really sure which is preferable. On the one hand, I like the idea of treating all sets, maps, and sequences uniformly, in the sense that if they have the same values then they are equal. On the other hand, it would still be nice to be able to quickly tell whether two maps also iterate in the same order (I suppose something like this could do that as well):
def sameOrder[A](m1: Iterable[A], m2: Iterable[A]): Boolean =
  m1.size == m2.size &&
    (m1.iterator zip m2.iterator).forall { case (a, b) => a == b }

1 comment:

martin said...

Good observations. In fact we have already changed these decisions in trunk, but the collections whitepaper has not yet been updated.
Hashcode and equals will be defined for all collections based on their elements. This means you can use mutable collections as keys in hashmaps, but you have to be careful not to mutate them, or else the key might become invisible.

As to arrays, they will be treated differently. Hashcode and equality for arrays will be as in Java.