


This chapter gives an overview of the container selection problem and its effect on program correctness and performance.
It then describes how current programming languages approach this problem, and how the existing literature proposes to solve it.
Finally, it examines the gaps in the existing literature and how this thesis aims to address them.

\section{Container Types}

The majority of programs make extensive use of collection data types, that is, types intended to hold many different instances of other data types.

In many cases, these collections have very different properties and purposes.
For instance, a \code{HashMap} is associative, mapping arbitrary keys to values and disallowing duplicate keys.
By contrast, a \code{HashSet} stores some set of values, without ordering or keys.
A social networking site may use a \code{HashMap} to map usernames to followers, and a \code{HashSet} to store a set of names of followers.

In this case, \code{HashMap} and \code{HashSet} each have a different set of operations that make sense.
This results in a different set of methods: \code{HashMap} would likely have methods such as \code{insert(Key, Value)} and \code{get(Key)}, whereas \code{HashSet} would have neither, and would instead have \code{insert(T)} and \code{contains(T)}.
We will refer to the set of methods supported by a container as its ``syntactic properties''.
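
As a small illustration of these differing method sets, the following sketch models the social networking example in Java. Note that Java's standard library names the operations \code{put} and \code{add} rather than \code{insert}; the class name \code{FollowerDemo} and the usernames are invented for this example.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

final class FollowerDemo {
    // Builds a map from a username to the set of names following that user.
    static Map<String, Set<String>> buildFollowers() {
        Set<String> aliceFollowers = new HashSet<>();
        aliceFollowers.add("bob");   // set operation: takes a value only
        aliceFollowers.add("carol");

        Map<String, Set<String>> followers = new HashMap<>();
        followers.put("alice", aliceFollowers); // map operation: takes key and value
        return followers;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> followers = buildFollowers();
        System.out.println(followers.get("alice").contains("bob")); // true
        System.out.println(followers.containsKey("dave"));          // false
    }
}
```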

However, syntactic properties alone are not enough to identify a container.
Note that an ordered container such as a \code{Vector} could provide the same methods as a \code{HashSet}, and more besides.
As application developers, we may require a container that does not allow duplicates, a constraint which \code{HashSet} satisfies but \code{Vector} does not.
Therefore, we say that a container implementation must also have ``semantic properties''. We will avoid defining these formally for now, although informally they can be thought of as conditions that always hold for the container.

Depending on the structure of the program, these collections will have varying interfaces: they may be associative (mapping key to value), ordered (mapping index to value), or unordered (only keeping track of whether an element is contained or not).
In many programming languages, different implementations of these collections implement a shared interface, for instance \code{Collection} in Java.
However, these interfaces are normally concerned only with the programming interface, and make no guarantees about the semantic properties of the implementation. In Java, both the \code{HashSet} and the \code{ArrayList} classes implement \code{Collection}, but the former does not store duplicates and the latter does.
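
The difference can be observed directly. In the sketch below, the same code is run against both implementations through the shared \code{Collection} interface and produces different results; the class and method names are hypothetical and serve only as illustration.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashSet;

final class SharedInterfaceDemo {
    // Adds the same element twice through the Collection interface and
    // reports how many elements the container holds afterwards.
    static int sizeAfterDuplicateAdd(Collection<String> names) {
        names.add("alice");
        names.add("alice");
        return names.size();
    }

    public static void main(String[] args) {
        // HashSet drops the duplicate, whereas ArrayList keeps both copies.
        System.out.println(sizeAfterDuplicateAdd(new HashSet<>()));   // 1
        System.out.println(sizeAfterDuplicateAdd(new ArrayList<>())); // 2
    }
}
```

Nothing in the signature of \code{sizeAfterDuplicateAdd} reveals which behaviour the caller will get, which is the gap between syntactic and semantic properties described above.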

In practice, the main way for developers to guarantee the semantic properties of a container is to pick a concrete implementation rather than an interface.
This forces the developer to make a comparatively low-level choice, for instance between \code{HashSet} and \code{LinkedHashSet}.
In many cases, the developer does not care about or understand the implications of this choice, and so will simply choose arbitrarily.
Depending on the application, however, the choice of concrete implementation can have a large effect on performance.
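
To make the choice between \code{HashSet} and \code{LinkedHashSet} concrete, the following sketch runs the same insertions against both. Both reject duplicates, but \code{LinkedHashSet} additionally guarantees that iteration follows insertion order, which it achieves by maintaining a linked list through its entries. The class name and data are invented for this example, and it demonstrates only the behavioural difference, not a performance measurement.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

final class SetChoiceDemo {
    // Inserts three names, one of them twice, and returns the iteration order.
    static List<String> iterationOrder(Set<String> followers) {
        followers.add("carol");
        followers.add("alice");
        followers.add("bob");
        followers.add("alice"); // duplicate, ignored by any Set
        return new ArrayList<>(followers);
    }

    public static void main(String[] args) {
        // LinkedHashSet guarantees insertion order: [carol, alice, bob]
        System.out.println(iterationOrder(new LinkedHashSet<>()));
        // HashSet holds the same three elements, but its order is unspecified.
        System.out.println(iterationOrder(new HashSet<>()));
    }
}
```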

\section{Prior Literature}

\subsection{Chameleon}

Chameleon is one approach that attempts to solve the container selection problem.
It works on Java programs, and requires both a runtime library and a modified garbage collector.

First, it runs the program normally, and uses a ``semantic profiler'' to collect data on the collections in use.
The modified garbage collector tracks the space used by collections, and the minimum space that could be used by all of the items of that collection.
The runtime library also tracks the number of each operation performed.
These statistics are tracked per individual collection instantiated, then aggregated by ``allocation context'', which is a portion of the call stack at the point where the collection was first instantiated.
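
The operation-counting part of this profiling step could look roughly like the sketch below. This is not Chameleon's implementation: \code{OperationProfile}, \code{ProfiledList}, and the context string are hypothetical, the allocation context is passed in by hand rather than derived from the call stack, and the space measurements taken by the garbage collector are omitted entirely.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class OperationProfile {
    // Hypothetical store: allocation context -> operation name -> count.
    private static final Map<String, Map<String, Long>> COUNTS = new HashMap<>();

    static void record(String context, String operation) {
        COUNTS.computeIfAbsent(context, k -> new HashMap<>())
              .merge(operation, 1L, Long::sum);
    }

    static long count(String context, String operation) {
        return COUNTS.getOrDefault(context, new HashMap<>())
                     .getOrDefault(operation, 0L);
    }

    public static void main(String[] args) {
        // Two instances created at the same (hypothetical) allocation context.
        ProfiledList<String> a = new ProfiledList<>("Feed.load:42");
        ProfiledList<String> b = new ProfiledList<>("Feed.load:42");
        a.add("x");
        b.add("y");
        a.get(0);
        a.get(0);
        b.get(0);
        // Counts from both instances are aggregated under one context.
        System.out.println(count("Feed.load:42", "get(int)")); // 3
    }
}

// A list wrapper that reports each operation under its allocation context.
final class ProfiledList<T> {
    private final List<T> inner = new ArrayList<>();
    private final String context;

    ProfiledList(String context) {
        this.context = context;
    }

    void add(T value) {
        OperationProfile.record(context, "add");
        inner.add(value);
    }

    T get(int index) {
        OperationProfile.record(context, "get(int)");
        return inner.get(index);
    }
}
```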

These aggregated statistics are then passed to a rules engine, which uses a set of rules to suggest the optimal container for a given allocation site.
These rules are written using a simple language, which selects the type in use, checks some condition, and then makes a suggestion if the condition is met.
For example, \code{LinkedList -> \#get(int) > X -> ArrayList} would be evaluated in contexts where a \code{LinkedList} is used, and if the number of get operations is greater than X, it would suggest an \code{ArrayList} instead.
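
One way to picture how such a rule is evaluated is the sketch below. It is an assumption-laden illustration rather than Chameleon's rule engine: the \code{Rule} class, its fields, and the threshold value of 100 standing in for X are all hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical encoding of a rule "source -> #operation > threshold -> target".
final class Rule {
    final String sourceType;
    final String operation;
    final long threshold;
    final String suggestedType;

    Rule(String sourceType, String operation, long threshold, String suggestedType) {
        this.sourceType = sourceType;
        this.operation = operation;
        this.threshold = threshold;
        this.suggestedType = suggestedType;
    }

    // Returns a suggestion only if the rule applies to the type in use and
    // the recorded operation count exceeds the threshold.
    Optional<String> evaluate(String typeInUse, Map<String, Long> operationCounts) {
        if (!sourceType.equals(typeInUse)) {
            return Optional.empty();
        }
        long observed = operationCounts.getOrDefault(operation, 0L);
        return observed > threshold ? Optional.of(suggestedType) : Optional.empty();
    }

    public static void main(String[] args) {
        // X = 100 is an arbitrary threshold chosen for illustration.
        Rule rule = new Rule("LinkedList", "get(int)", 100, "ArrayList");
        Map<String, Long> counts = new HashMap<>();
        counts.put("get(int)", 250L);
        System.out.println(rule.evaluate("LinkedList", counts)); // Optional[ArrayList]
    }
}
```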

%% todo: something about online selection part

This results in a flexible engine for providing suggestions, which can be extended with new rules and types as necessary.
However, this approach has some drawbacks.

Firstly, the use of a modified runtime in order to collect statistics may be a significant barrier to adoption.
Compared to other options that use a runtime library or code generation, this is a much more invasive approach, although it has the benefit of making more measurements available.

Secondly, the use of hand-written rules limits suggestions to the types and patterns that the developer is aware of and chooses to implement.
Although users are able to add rules, effort is still required before Chameleon can support or suggest a new container implementation.
It is also limited to patterns that the developer is able to formalise, such as the above rule for indexing a linked list.
In many cases, there may be patterns that could be used to suggest a better option, but that the developer does not see or cannot formalise.

Finally, Chameleon assumes that all implementations are semantically identical.
In other words, the program will function the same no matter which one is used.
This results in selection rules needing to be more restricted than they otherwise could be.
For instance, a rule cannot suggest a \code{HashSet} instead of a \code{LinkedList}, as the two are not semantically identical.
Chameleon has no way of knowing if doing so will break the program's functionality, and so it does not make a suggestion.