Diffstat (limited to 'thesis/parts/background.tex')
-rw-r--r-- | thesis/parts/background.tex | 122 |
1 files changed, 72 insertions, 50 deletions
diff --git a/thesis/parts/background.tex b/thesis/parts/background.tex
index ca2a670..633efde 100644
--- a/thesis/parts/background.tex
+++ b/thesis/parts/background.tex
@@ -1,65 +1,83 @@
-This chapter provides an overview of the problem of container selection, and its effect on program correctness and performance.
-Then, it provides an overview of how current programming languages approach this problem, and how the existing literature proposes to solve it.
-Finally, we examine the gaps in the existing literature, and how this paper aims to contribute to it.
+In this chapter, we provide an overview of the problem of container selection, and its effect on program correctness and performance.
+We then provide an overview of how modern programming languages approach this problem, and how existing literature differs.
+Finally, we examine the gaps in the existing literature, and explain how this paper aims to contribute to it.
-\section{Container Types}
+\section{Container Selection}
-The majority of programs make extensive use of collection data types, that is, types intended to hold many different instances of other data types.
+The vast majority of programs will make extensive use of collection data types - types intended to hold many different instances of other data types.
+This can refer to anything from fixed-size arrays, to growable linked lists, to associative key-value mappings or dictionaries.
+In some cases, these are built-in parts of the language: In Go, a list of ints has type \code{[]int} and a dictionary from string to string has type \code{map[string]string}.
-In many cases, these collections have very different properties and purposes.
-For instance, a \code{HashMap} is associative, mapping arbitrary keys to values and disallowing duplicate keys.
-By contrast, a \code{HashSet} stores some set of values, without ordering or keys.
-A social networking site may use a \code{HashMap} to map usernames to followers, and a \code{HashSet} to store a set of names of followers.
+In other languages, these are instead part of some standard library.
+In Rust, you might write \code{Vec<isize>} and \code{HashMap<String, String>} for the same types.
+This forces us to make a choice upfront: what type should we use?
-In this case, \code{HashMap} and \code{HashSet} both have a different set of operations that make sense.
-This results in a different set of methods. HashMap would likely have methods such as \code{insert(Key, Value)} and \code{get(Key)}, whereas \code{HashSet} would have neither and would instead have \code{insert(T)} and \code{contains(T)}.
-We will refer to the set of methods supported by a container as its ``syntactic properties''.
+In this case the answer is obvious - the two have very different purposes and don't support the same operations.
+However, if we were to consider \code{Vec<isize>} and \code{HashSet<isize>}, the answer is much less obvious.
+If we care about the ordering, or about preserving duplicates, then we must use \code{Vec<isize>}.
+But if we don't, then \code{HashSet<isize>} may be more performant - for example if we use \code{contains} a lot.
-However, syntactic properties alone are not enough to identify a container.
-Note that an ordered container such as a \code{Vector} would be able to provide the same methods as a \code{HashSet}, and some extra.
-As an application developer, we may require a container that does not allow duplicates, a constraint which \code{HashSet} satisfies but that \code{Vector} does not.
-Therefore, we say that a container implementation must also have ``semantic properties''. We will avoid defining these formally for now, although informally they can be though of as conditions that will always hold for the container.
+We refer to this problem as container selection, and split it into two parts: Functional requirements, and non-functional requirements.
-Depending on the structure of the program, these collections will have varying interfaces, for instance they may be associative (mapping key to value), ordered (mapping index to value), or unordered (only keeping track of whether an element is contained or not).
-In many programming languages, different implementations of these collections will implement a shared interface, for instance Collection in Java.
-However, these interfaces are normally concerned only with the programming interface, and make no guarantees on the semantic properties of the implementation. In Java, both the HashSet and the ArrayList class implement Collection, however the former does not store duplicates and the latter does.
+\subsection{Functional requirements}
-In practice, the main way for developers to guarantee the semantic properties of some container, is to pick a concrete implementation rather than an interface.
-This forces the developer to make a comparatively low-level choice, for instance between HashSet and LinkedHashSet.
-In many cases, the developer does not care or understand about the implications of this choice, and so will simply choose at random.
-Depending on the application however, the choice of concrete implementation can have a large effect on performance.
+Functional requirements refers to a similar definition as is normally used for software: The container must behave the way that the program expects it to.
+
+Continuing with our previous example, we can first note that \code{Vec} and \code{HashSet} implement different sets of methods.
+\code{Vec} implements methods like \code{.get(index)} and \code{.push(value)}, while \code{HashSet} implements neither - they don't make sense for an unordered collection.
+Similarly, \code{HashSet} implements \code{.replace(value)} and \code{.is\_subset(other)}, neither of which make sense for \code{Vec}.
+If we try to swap \code{Vec} for \code{HashSet}, the resulting program may not compile.
+
+These restrictions form the first part of our functional requirements - the ``syntactic properties'' of the containers must satisfy the program's requirements.
+In object-oriented programming, we might say they must implement an interface.
+
+However, syntactic properties alone are not always enough to select an appropriate container.
+Suppose our program only requires a container to have \code{.insert(value)}, \code{.contains(value)}, and \code{.len()}.
+Both \code{Vec} and \code{HashSet} will satisfy these requirements.
+
+However, our program might rely on \code{.len()} returning a count including duplicates.
+In this case, \code{HashSet} would give us different behaviour, possibly causing our program to behave incorrectly.
+
+To express this, we say that a container implementation also has ``semantic properties'' that must satisfy our requirements.
+Intuitively we can think of this as what conditions the container upholds.
+For a set, this would include that there are never any duplicates % TODO
+
+\subsection{Non-functional requirements}
+
+While meeting the functional requirements is generally enough to ensure a program runs correctly, we also want to ensure we choose the 'best' type we can.
+There are many measures for this, but we will focus primarily on time: how much we can affect the runtime of the program.
+
+If we assume we can find a selection of types that satisfy the functional requirements, then one obvious solution is just to benchmark the program with each of these implementations in place, and see which works best.
+
+This will obviously work, however note that as well as our program, we need to develop benchmarks.
+If the benchmarks are flawed, or don't represent how our program is used in practice, then we may get drastically different results in the 'real world'.
+
+%% TODO: Motivate how this improves performance
 \section{Prior Literature}
-\subsection{Chameleon}
+\subsection{Approaches in common programming languages}
-Chameleon\parencite{shacham_chameleon_2009} is one paper which attempts to solve the container selection problem.
-It works on Java programs, and requires both a runtime library and a modified garbage collector.
+%% TODO
-First, it runs the program normally, and collects data on the collections used using a ``semantic profiler''.
-The modified garbage collector tracks the space used by collections, and the minimum space that could be used by all of the items of that collection.
-The runtime library also tracks the number of each operation performed.
-These statistics are tracked per individual collection instantiated, then aggregated by 'allocation context', which is a portion of the stack frame where the collection was first instantiated.
+\subsection{Chameleon}
-These aggregated statistics are then passed to a rules engine, which uses a set of rules to suggest the optimal container for a given allocation site.
-These rules are written using a simple language, which selects the type in use, checks some condition, and then makes a suggestion if the condition is met.
-For example, \code{LinkedList -> \#get(int) > X -> ArrayList} would be evaluated in contexts where a \code{LinkedList} is used, and if the number of get operations is greater than X, it would suggest an \code{ArrayList} instead.
+Chameleon\parencite{shacham_chameleon_2009} is a solution that focuses on the non-functional requirements of container selection.
-%% todo: something about online selection part
+First, it runs the program with some example input, and collects data on the collections used using a ``semantic profiler''.
+This data includes the space used by collections, the minimum space that could be used by all of the items of that collection, and the number of each operation performed.
+These statistics are tracked per individual collection allocated, and then aggregated by 'allocation context' - a portion of the callstack where the allocation occurred.
+These aggregated statistics are then passed to a rules engine, which uses a set of rules to suggest places a different container type might improve performance.
+For example, a rule could check when a linked list often has items accessed by index, and suggest a different list implementation as a replacement.
 This results in a flexible engine for providing suggestions, which can be extended with new rules and types as necessary.
-However, this approach has some drawbacks.
-Firstly, the use of a modified runtime in order to collect statistics may be a significant barrier to adoption.
-Compared to other options that use a runtime library or code generation, this is a much more invasive approach, although with the benefit of generating more measurements to use.
+%% todo: something about online selection part
-Secondly, the use of specified rules limits the use to types/patterns the developer is aware of and chooses to implement.
-Although users are able to add rules, Chameleon still requires effort in order for it to support or to suggest a new container implementation.
-It is also limited to patterns that the developer is able to formalise, such as the above rule for indexing a linked list.
-In many cases, there may be patterns that could be used to suggest a better option, but that the developer does not see or cannot formalise.
+Unfortunately, this does require the developer to come up with and add replacement rules for each implementation.
+In many cases, there may be patterns that could be used to suggest a better option, but that the developer does not see or is not able to formalise.
-Finally, Chameleon assumes that all implementations are semantically identical.
-In other words, the program will function the same no matter which one is used.
+Chameleon also makes no attempt to select based on functional requirements.
 This results in selection rules needing to be more restricted than they otherwise could be.
 For instance, a rule cannot suggest a \code{HashSet} instead of a \code{LinkedList}, as the two are not semantically identical.
 Chameleon has no way of knowing if doing so will break the program's functionality, and so it does not make a suggestion.
@@ -73,9 +91,9 @@ Chameleon has no way of knowing if doing so will break the program's functionali
 %% - focuses on the performance difference between microarchitectures
 %% - intended to be run at each install site
-Brainy\parencite{jung_brainy_2011} attempts to solve the container selection problem using Machine Learning.
+Brainy\parencite{jung_brainy_2011} also focuses on non-functional requirements, but uses Machine Learning techniques instead of set rules.
-Similar to Chameleon, Brainy runs the program with developer-provided input and collects statistics on how the collection is used.
+Similar to Chameleon, Brainy runs the program with example input, and collects statistics on how collections are used.
 Unlike Chameleon, these statistics include some hardware counters, such as cache utilisation and branch misprediction rate.
 This profiling information is then fed to an ML model, which predicts the implementation likely to be most performant for the specific program and microarchitecture, from the models that the model was trained to use.
@@ -90,14 +108,10 @@ This is intended to avoid overfitting on specific applications, as a large numbe
 However, the applications generated are unlikely to be representative of real applications.
 In practice, there are usually patterns of certain combinations that are repeated, meaning the next operation is never truly random.
-Like the other approaches mentioned, Brainy picks from a list of semantically different containers at each site.
-However, this list is picked depending on the usage at each call site, meaning it is somewhat aware of the semantics required by each usage.
-The set of alternate datasets is decided based on the original data structure (vector, list, set), and whether the order is ever used.
+Brainy determines which types satisfy the functional requirements based on the original data structure (vector, list, set), and whether the order is ever used.
 This allows for a bigger pool of containers to choose from, for instance a vector can also be swapped for a set in some circumstances.
 However, this approach is still limited in the semantics it can identify, for instance it cannot differentiate a stack or queue from any other type of list.
-While Brainy achieves significant improvements in program performance (``average performance improvements of 27\% and 33\% on both microarchitectures''), it is subject to similar limitations as Chameleon.
-
 \subsection{CollectionSwitch}
 %% - online selection - uses library so easier to integrate
@@ -117,3 +131,11 @@ The total cost for each collection type is then calculated for each individual i
 If switching to another implementation will drop the average total cost more than a certain threshold, then CollectionSwitch will start using that collection for newly allocated instances, and may also switch existing instances over to it.
 By generating a cost model based on benchmarks, CollectionSwitch manages to be more flexible than other rules-based approaches such as Chameleon.
+
+%% TODO: comment on functional selection
+
+\subsection{Primrose}
+
+%% TODO
+
+\section{Contributions}