Workhorse collections
It’s good to identify a few collection types to use as your default workhorses all over the place. The advantages of a custom collection type for a particular situation are usually outweighed by the mental burden of selecting the theoretically perfect collection and by the friction of converting between collection types in different parts of your code base.
What collections should you choose?
Let me share a few things I’ve seen work better or worse in my experience. I will break this down by the various features you will want from your workhorses.
Immutability
It is begging for problems to pass around mutable collections all over a code base. You will eventually update something that another part of the code did not expect to be updated. On top of these actual bugs, you will spend a non-zero amount of time scrutinizing a particular collection in your code and wondering whether you are allowed to change it or not. You will then add defensive copies to prevent all of this, but now you have to think about and decide where to put those defensive copies.
The Google Web Toolkit used ArrayList all over the place, back when I was working on it. It is a very good collection type, but it’s mutable, and it has exactly the problems I mentioned in the previous paragraph. If I remember correctly, we defensively cloned a fair amount as a way to minimize the risks.
To do better, one way to get immutable collections is via persistent collections. Persistent collections are immutable but nonetheless allow a form of update: they support creating a modified version of the collection with just one small change, such as the addition of a single element. This combination of features is challenging, and if we are fair about it, we should probably say that persistent collections are good but not great at each of those goals.
Here are some examples of persistent collections, to give an idea of what they are like.
- Linked lists. These are the workhorse for functional languages like ML and Scala. They are very powerful and capable, and they are part of how the Lisp family of languages took off. A “cons cell” is a link in a list, and you can build “s-expressions” out of cons cells and then build everything else out of those. Talk about a workhorse! Possibly the main thing to get used to with immutable linked lists is that they can be difficult to process in any order other than front to back. Additionally, many good idioms with linked lists require that your runtime support very large stacks, so an environment like Java with fixed stack sizes can make linked lists less attractive. On the whole, though, if you’ve never done a large project with linked lists, give it a try some time. It is a comfortable way to program that scales to many problems. One trick is to choose the order of the list so that it matches what you are trying to do; reversing a list is a linear-time operation, so you can even reverse lists back and forth essentially for free whenever you are about to do something that traverses the whole list anyway.
- Trees, including binary trees, red-black trees, and b-trees. Trees are a workhorse in Scala, and they have the advantage that you can make a new tree, without disturbing a previous tree, just by constructing a new root node and then a path from the root down to one of the leaves. In this way, you can add or remove an element by adjusting nodes, resulting in two trees that differ by the one element and that share all the other nodes with each other.
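To make the linked-list idioms concrete, here is a hand-rolled immutable cons list in Java. The JDK has no persistent list, so this is a sketch of my own, with all names invented for illustration:

```java
// A tiny immutable cons list. Prepending is O(1) and shares the
// tail; reverse is a single O(n) pass.
class ConsDemo {
    sealed interface IntList permits Nil, Cons {}
    record Nil() implements IntList {}
    record Cons(int head, IntList tail) implements IntList {}

    static IntList reverse(IntList xs) {
        IntList acc = new Nil();
        for (IntList cur = xs; cur instanceof Cons c; cur = c.tail())
            acc = new Cons(c.head(), acc);
        return acc;
    }

    // Keep only the even numbers, doubled. Prepending while we
    // traverse builds the result backwards; the one reverse at the
    // end is linear, the same order of work as the traversal
    // itself, which is why it is essentially free.
    static IntList evensDoubled(IntList xs) {
        IntList acc = new Nil();
        for (IntList cur = xs; cur instanceof Cons c; cur = c.tail())
            if (c.head() % 2 == 0) acc = new Cons(c.head() * 2, acc);
        return reverse(acc);
    }
}
```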
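Here is a minimal sketch of the path-copying idea as a persistent binary search tree in Java. It is an unbalanced toy with names of my choosing, not production code, but it shows how an insert leaves the old tree fully intact:

```java
// A persistent binary search tree: insert copies only the nodes
// on the path from the root down to the insertion point, and the
// old and new trees share every other node.
class TreeDemo {
    sealed interface Tree permits Leaf, Node {}
    record Leaf() implements Tree {}
    record Node(Tree left, int key, Tree right) implements Tree {}

    static Tree insert(Tree t, int k) {
        if (t instanceof Node n) {
            if (k < n.key()) return new Node(insert(n.left(), k), n.key(), n.right()); // share right subtree
            if (k > n.key()) return new Node(n.left(), n.key(), insert(n.right(), k)); // share left subtree
            return n; // key already present: share everything
        }
        return new Node(new Leaf(), k, new Leaf());
    }

    static boolean contains(Tree t, int k) {
        if (t instanceof Node n) {
            if (k < n.key()) return contains(n.left(), k);
            if (k > n.key()) return contains(n.right(), k);
            return true;
        }
        return false;
    }
}
```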
Another way to get immutable collections is the builder pattern. Java does this with its String type, where you make a string with StringBuilder and then convert it to an immutable String. If you compare this to C’s strings, it is quite the breath of fresh air to no longer worry about when and whether a string might change. Ruby, on the other hand, has ongoing issues with strings being mutable by default.
For general-purpose lists, sets, and maps, the Guava library follows this approach, and I think it is a good choice for any code base that is in plain Java. It’s just very common that you build a collection in one place and then use it in others. You can use the builder in the part of the code where you create the collection, and then call build() on it to convert it to an immutable collection for the rest of its existence.
The Scala 2 collections adopt this idea internally and give every collection an associated builder.
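Here is a JDK-only sketch of the same build-then-freeze shape (Guava’s builders look similar; the method itself is my own invention): mutate a local ArrayList while constructing, then snapshot it with List.copyOf, which returns an unmodifiable list.

```java
import java.util.ArrayList;
import java.util.List;

class BuilderDemo {
    // Accumulate into a local mutable list, then freeze the result.
    // Callers only ever see the immutable snapshot; any attempt to
    // mutate it throws UnsupportedOperationException.
    static List<Integer> firstSquares(int n) {
        List<Integer> building = new ArrayList<>();
        for (int i = 1; i <= n; i++) building.add(i * i);
        return List.copyOf(building); // immutable from here on
    }
}
```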
A third way, one that I hope catches on more widely, is to use a static checker similar to Rust’s borrow checker. Rust has it built in, but it seems like it could be added to Scala or Java by way of annotation processors or compiler plugins, or could even become an optional feature in a newer version of the language. With borrow checking, you can create an object and mutate it yourself, but then hand it to others in an immutable way. To give others an immutable version, you can either hand it over to them entirely—a transfer of ownership—or you can let them borrow a reference temporarily that they have to give back. There is a rule that at any one time, there is either one mutable reference or any number of immutable references to an object; as well, the system integrates with function calls, allowing you to loan a reference to a function you are calling so long as it gives it back afterwards.
It’s a really neat trick that I hope is explored more widely. It both beats and vindicates const from C: it is actually practical, not as annoying as const can be, and it shows that the basic idea was sound and just needed borrow checking to really come together. It’s extremely common to call a function that doesn’t modify a value while your own copy of that value stays mutable, and that combination is hard to support without a borrowing mechanism that understands function calls.
Concrete types
I have had better luck using concrete collection types in APIs rather than abstract types such as Java’s List and Scala’s Seq.
The idea of the abstract types is that the caller can use different collection types and you can field whatever they give you. Likewise, in a return type, you can give a caller anything you want and can even change your mind later. As with many things, though, this advice leads to PR wars and smug put-downs if you start to question where it runs out of steam, because people will think you don’t understand or don’t believe in the more basic ideas of abstraction.
There are just other factors that come up in practice. You don’t merely accept an abstract parameter; you then have to call methods on it, and that is where the trouble starts.
When you try to form a useful abstract Seq, Set, or Map type, it initially looks very promising. You can see how to iterate, filter, or map pretty much any collection type under the sun, and it seems like it will be simple to define abstract versions of these operations and put them in the base types.
It turns out, though, that collections are trickier to abstract than they look. I am in awe of what came of the Scala 2 collections as a technical tour de force. Still, there are some mathematical realities about the different operations that don’t abstract as well as you would expect. I remember struggling with this when I was in the Scala lab at EPFL, not far from the red/black curly staircases that became the language’s logo, and trying examples like these:
- Right off the bat, a linked list should be a sequence type, but you usually shouldn’t index into it by an integer, even though you can. Likewise, you shouldn’t normally decompose an array into a head and a tail, even though you can if you absolutely have to. So you cannot use either a linked list or an array in its idiomatic way if you have to access it through an abstraction, because you might get the other one, and your performance will get very bad if you do.
- Usually, calling map will give you back the collection type you started with; for example, an array or a list will always map to an array or a list. However, it isn’t true for ranges, because you might start with a range of numbers from a to b but end up with something that is no longer a range. It’s also not true for tree-based sets and maps, because you might map to something that isn’t ordered, and you need ordered elements to implement a tree.
- Okay, well, map is problematic, but filter is better. It is not perfect, though. Filtering will usually give you back the same type that you started with. Filtering a tree is fine, because the element type didn’t change, so it will still be ordered. However, ranges are still a problem, because if you select the odd numbers from a range, it is no longer a contiguous set of integers.
- In practice, the abstractions will add performance overhead in one way or another. I struggled, in both Scala and, later, X10, to come up with a performant for loop that works via the common abstractions of the language, and I never got it fully worked out. I do still think it’s valuable, and even doable, but it’s also worth trying to make the problem easier by just using fewer abstractions to begin with. The whole thing reminds me of how the Internet beat out OSI; the Internet developed a working DNS while OSI was still over-engineering its equivalent, so that now everyone knows “OSI” as just a layering model.
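The hazard in the first bullet shows up directly in Java’s own collections: both ArrayList and LinkedList sit behind java.util.List, and the abstraction cannot tell you which operations are cheap. (The method name below is mine.)

```java
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;

class AbstractionDemo {
    // The abstraction happily lets you call get(i) on either
    // implementation. On an ArrayList this loop is O(n); on a
    // LinkedList each get(i) walks the links from one end, making
    // the very same loop O(n^2).
    static long sumByIndex(List<Integer> xs) {
        long sum = 0;
        for (int i = 0; i < xs.size(); i++) sum += xs.get(i);
        return sum;
    }
}
```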
On the flip side, the practical issues with concrete types like ArrayList or ImmutableMap are overblown. First, they only matter at all when calling across a module boundary to a library that uses a different workhorse type than yours. All calls within your own code will use the concrete types consistently, gaining the advantages of known concrete types without the disadvantages of mixed types.
In many cases where you do have mixed types, one of two things is true: the callee is going to iterate over every element of the collection, or you are going to construct a special collection just for calling that API. In the first case, the only penalty you pay for the mixed types is copying the collection to the correct type, a linear operation that you normally wouldn’t worry about when it sits right next to other linear operations. In the second case, where you generate a custom collection just for a particular API call, you can skip the copy and generate the correct concrete type right away.
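A small sketch of that boundary copy in Java, where sumAll stands in for a hypothetical third-party API whose workhorse type differs from ours:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

class BoundaryDemo {
    // Hypothetical API from another module; its workhorse is List.
    static int sumAll(List<Integer> xs) {
        int sum = 0;
        for (int x : xs) sum += x;
        return sum;
    }

    // Our workhorse is a Set. The one-time copy at the boundary is
    // O(n), the same order of work as the traversal the callee
    // performs anyway.
    static int ourTotal(Set<Integer> ours) {
        return sumAll(new ArrayList<>(ours));
    }
}
```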
The good and bad of hashing
I am currently pretty down on hash tables, both as sets and as maps. I have used them heavily, and they are certainly practical. They have the huge advantage that most operations run in expected constant time.
They have a few issues, though.
- They do not have stable behavior. Depending on the details, merely re-running the same program can be enough for the elements to come out in a different order.
- The order of elements is just wild. It’s not the order you put them into the collection, unless you add links, and it’s not any natural ordering of the elements, such as alphabetic order for strings. It’s just a crazy pile of stuff that makes your stomach sink if they get printed onto a screen for you to debug.
- You can fix the stability by adding links, and for a phase of GWT’s existence, we modified it to use linked collections all over the place. I think this is a good approach if you stick with hashed collections, but it comes at the cost of memory and time overhead for the framework to maintain all the links.
- They are hard to make persistent, due to the use of an array for their base storage. There is such a thing as a persistent array, but it’s tricky to implement, and I’ve not seen one used in practice. Immediate practical problems come up; they don’t seem insurmountable, but I’ve never seen them surmounted, so I really can’t say.
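The “adding links” fix is exactly what the JDK’s LinkedHashMap does. A sketch (the method name is mine):

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class LinkedDemo {
    // LinkedHashMap threads a doubly linked list through its
    // entries, so iteration follows insertion order, at the cost
    // of maintaining those extra links. A plain HashMap makes no
    // ordering promise at all.
    static List<String> insertionOrderKeys() {
        Map<String, Integer> m = new LinkedHashMap<>();
        m.put("carol", 3);
        m.put("alice", 1);
        m.put("bob", 2);
        return new ArrayList<>(m.keySet());
    }
}
```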
I’ve been happy to learn that Rust includes a very mature b-tree implementation for its in-memory sets and maps. These have the advantages of other tree-like data types, but with fewer pointer indirections than red-black trees or binary trees, because each node holds multiple keys and values. That makes copying a node more expensive than with a red-black or binary tree, but you only need to copy nodes at all if you are implementing a persistent b-tree; if you are getting your immutable collections some other way, then you don’t need to copy nodes and can do fine with a larger node size.
I love how the behavior of a tree-based set or map is not just stable but user-sensible. By using a sorted order consistently throughout your code, you get better debug logs, unit tests, JSON APIs, and build-cache behavior by default, without even particularly trying. These advantages really stack up, and qualitative improvements like these are usually what save you the most time when developing software. You can always optimize a specific hot spot to use a hash table when profiling indicates it is worth the effort.
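For example, the JDK’s TreeMap (a red-black tree) iterates in key order regardless of insertion order, which is what makes anything printed or serialized from it stable. The report method is my own illustration:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class SortedDemo {
    // Keys come out sorted no matter what order they went in, so
    // debug output and serialized forms are deterministic across
    // runs and easy for a human to scan.
    static List<String> report() {
        Map<String, Integer> scores = new TreeMap<>();
        scores.put("carol", 3);
        scores.put("alice", 1);
        scores.put("bob", 2);
        List<String> lines = new ArrayList<>();
        scores.forEach((k, v) -> lines.add(k + "=" + v));
        return lines;
    }
}
```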