author | Aria <me@aria.rip> | 2023-10-17 15:24:43 +0100 |
---|---|---|
committer | Aria <me@aria.rip> | 2023-10-17 15:24:43 +0100 |
commit | 6ff27479567d703285ec6b7f042c23cac0d4782d (patch) | |
tree | 5a8b27629bf5dffea3a17362d163afd3b9730eb7 /thesis/parts/background.tex | |
parent | 76c75b372609a71883f338541c2e7adf5f6ba64a (diff) |
remove a lot of stuff
Diffstat (limited to 'thesis/parts/background.tex')
-rw-r--r-- | thesis/parts/background.tex | 79 |
1 files changed, 26 insertions, 53 deletions
diff --git a/thesis/parts/background.tex b/thesis/parts/background.tex
index 630e253..dbf6b74 100644
--- a/thesis/parts/background.tex
+++ b/thesis/parts/background.tex
@@ -7,16 +7,10 @@ Finally, we examine the gaps in the existing literature, and explain how this pa
 The vast majority of programs will use make extensive use of collection data types - types intended to hold many different instances of other data types.
 This can refer to anything from fixed-size arrays, to growable linked lists, to associative key-value mappings or dictionaries.
-In some cases, core collections are built-in parts of the language: In Go, a growable list (or vector) of ints has type \code{[]int}.
-
-In other languages, vectors are instead part of some standard library, or must be defined by the user.
-In Rust, you might write \code{Vec<isize>} and \code{HashMap<String, String>} for the same purpose.
-This forces us to make a choice upfront: what type should we use?
-
-In this case the answer is obvious - the two have very different purposes and don't support the same operations.
-However, if we were to consider \code{Vec<isize>} and \code{HashSet<isize>}, the answer is much less obvious.
-If we care about the ordering, or about preserving duplicates, then we must use \code{Vec<isize>}.
-But if we don't, then \code{HashSet<isize>} might be more performant if we use \code{contains} a lot.
+In many languages, the standard library provides a variety of collections, forcing us to choose which one is best.
+Consider the Rust types \code{Vec<T>} (a dynamic array) and \code{HashSet<T>} (a hash-based set).
+If we care about the ordering, or about preserving duplicates, then we must use \code{Vec<T>}.
+But if we don't, then \code{HashSet<T>} might be more performant, if we use \code{contains} a lot.
 We refer to this problem as container selection, and say that we must satisfy both functional requirements, and non-functional requirements.
@@ -25,34 +19,32 @@ We refer to this problem as container selection, and say that we must satisfy bo
 The functional requirements tell us how the container will be used, and how it must behave.
 Continuing with our previous example, we can see that \code{Vec} and \code{HashSet} implement different methods.
-\code{Vec} implements methods like \code{.get(index)} and \code{.push(value)}, while \code{HashSet} implements neither - they don't make sense for an unordered collection.
-Similarly, \code{HashSet} implements \code{.replace(value)} and \code{.is\_subset(other)}, neither of which make sense for \code{Vec}.
+\code{Vec} implements \code{.get(index)} while \code{HashSet} doesn't - it wouldn't make sense for an unordered collection.
 If we try to swap \code{Vec} for \code{HashSet}, the resulting program will likely not compile.
-These restrictions form the first part of our functional requirements - the ``syntactic properties'' of the containers must satisfy the program's requirements.
-In object-oriented programming, we might say they must implement an interface.
-In Rust, we would say that they implement a trait, or that they belong to a type class.
+We will call the operations a container implements the ``syntactic properties'' of the container.
+In object-oriented programming, we might say they must implement an interface, while in Rust, we would say that they implement a trait.
 However, syntactic properties alone are not always enough to select an appropriate container.
 Suppose our program only requires a container to have \code{.insert(value)}, and \code{.len()}.
-Both \code{Vec} and \code{HashSet} will satisfy these requirements, however our program might rely on \code{.len()} returning a count including duplicates.
+Both \code{Vec} and \code{HashSet} will satisfy these requirements, but we might rely on \code{.len()} including duplicates.
 In this case, \code{HashSet} would give us different behaviour, causing our program to behave incorrectly.
-To express this, we say that a container implementation also has ``semantic properties'' that must satisfy our requirements.
+Therefore we also say that a container implementation has ``semantic properties''.
 Intuitively we can think of this as what conditions the container upholds.
 For a \code{HashSet}, this would include that there are never any duplicates, whereas for a Vec it would include that ordering is preserved.
 \subsection{Non-functional requirements}
-While meeting the functional requirements is generally enough to ensure a program runs correctly, we also want to ensure we choose the 'best' type that we can.
-For our purposes, we will only consider program run time, although other approaches also consider the balance between memory usage and time.
+While meeting the functional requirements should ensure our program runs correctly, we also want to choose the 'best' type that we can.
+Here we will consider 'best' as striking a balance between runtime and memory usage.
 Prior work has shown that properly considering container selection selection can give substantial performance improvements, even in large applications.
 For instance, tuning performed in \cite{chung_towards_2004} achieved an up to 70\% increase in the throughput of a complex web application, and a 15-40\% decrease in the runtime of several scientific applications.
 \cite{l_liu_perflint_2009} found and suggested fixes for ``hundreds of suboptimal patterns in a set of large C++ benchmarks,'' with one such case improving performance by 17\%.
 Similarly, \cite{jung_brainy_2011} achieves an average speedup of 27-33\% on real-world applications and libraries.
-If we assume we can find a selection of types that satisfy our functional requirements, then one obvious solution is to benchmark the program with each of these implementations in place, and see which works best.
+If we can find a selection of types that satisfy our functional requirements, then one obvious solution is to benchmark the program with each of these implementations in place, and see which works best.
 This will obviously work, so long as our benchmarks are roughly representative of 'real world' inputs.
 Unfortunately, this technique scales poorly for bigger applications.
@@ -61,14 +53,15 @@ This quickly becomes unfeasible, and so we must find other selection methods.
 \section{Prior Literature}
-This section outlines the options available in current programming languages, and in existing literature.
+In this section, we outline methods for container selection available in current programming languages, and their limitations.
+We then examine some of the existing solutions for container selection, and their limitations.
 \subsection{Approaches in common programming languages}
 Modern programming languages broadly take one of two approaches to container selection.
 Some languages, usually higher-level ones, recommend built-in structures as the default, using implementations that perform fine for the vast majority of use-cases.
-One popular examples is Python, which uses dynamic arrays as its default implementation.
+One popular examples is Python, which uses dynamic arrays as its built-in list implementation.
 This approach prioritises developer ergonomics: Programmers do not need to think about how these are implemented.
 Usually, other implementations are possible, but are used only when needed and come at the cost of code readability.
@@ -76,66 +69,46 @@ In other languages, collections are given as part of a standard library, or must
 Java comes with growable lists as part of its standard library, as does Rust (with some macros to make use easier).
 In both cases, the ``blessed'' implementation of collections is not special - users can implement their own.
-Often, interfaces or their closest equivalent are used to distinguish 'similar' collections.
+Often interfaces, or their closest equivalent, are used to distinguish 'similar' collections.
 In Java, ordered collections implement the interface \code{List<E>}, while similar interfaces exist for \code{Set<E>}, \code{Queue<E>}, etc.
 This means that when the developer chooses a type, the compiler enforces the syntactic requirements of the collection, and the writer of the implementaiton ``promises'' they have met the semantic requirements.
-Many other languages give much weaker guarantees, for instance Rust has no typeclasses for List or Set.
-Its closest equivalents are traits like \code{Index<I>} and \code{IntoIterator}, neither of which have particularly strong semantic guarantees.
-
 Whilst the approach Java takes is the most expressive, both of these approaches either put the choice on the developer, or remove the choice entirely.
-This means that developers are forced to guess based on their knowledge of the underlying implementations, or more often to just pick the most common implementation.
-The papers we will examine all attempt to choose for the developer, based on a variety of techniques.
+This means that developers are forced to guess based on their knowledge of the underlying implementations, or to just pick the most common implementation.
 \subsection{Chameleon}
 Chameleon\parencite{shacham_chameleon_2009} is a tool for Java codebases, which uses a rules engine to identify sub-optimal choices.
-It works by collecting data from benchmarks using a ``semantic profiler''.
-This data includes the space used by collections over time, and the counts of each operation performed.
-These statistics are tracked per individual collection allocated, and then aggregated by 'allocation context' - a portion of the callstack where the allocation occured.
+It first collects statistics from program benchmarks using a ``semantic profiler''.
+This includes the space used by collections over time, and the counts of each operation performed.
+These statistics are tracked per individual collection allocated, and then aggregated by 'allocation context' - the call stack at the point where the allocation occured.
 These aggregated statistics are then passed to a rules engine, which uses a set of rules to suggest places a different container type might improve performance.
 This results in a flexible engine for providing suggestions, which can be extended with new rules and types as necessary.
-Unfortunately, this does require the developer to come up with and add replacement rules for each implementation.
-In many cases, there may be patterns that could be used to suggest a better option, but that the developer does not see or is not able to formalise.
-
 To satisfy functional requirements, Chameleon only suggests new types that behave identically to the existing type.
 This results in selection rules needing to be more restricted than they otherwise could be.
 For instance, a rule cannot suggest a \code{HashSet} instead of a \code{LinkedList}, as the two are not semantically identical.
 Chameleon has no way of knowing if doing so will break the program's functionality, and so it does not make a suggestion.
-\subsection{Brainy}
-
-Brainy\parencite{jung_brainy_2011} also focuses on non-functional requirements, but uses machine learning techniques instead of defined rules.
-
-Similar to Chameleon, Brainy runs the program with example input, and collects statistics on how collections are used.
-Unlike Chameleon, these statistics include some hardware counters, such as cache utilisation and branch misprediction rate.
-
-This profiling information is then fed to an ML model, which predicts the best implementation.
-
-Of the existing literature, Brainy appears to be the only method which directly accounts for hardware factors.
-The authors propose that their tool can be run at install-time for each target system, and then used by developers or by applications integrated with it to select the best data structure for that hardware.
-The paper itself demonstrates the effectiveness of this, finding that ``On average, 43\% of the randomly generated applications have different optimal data structures [across different architectures].''
-
-Brainy determines which types satisfy the functional requirements based on the original data structure (vector, list, set), and whether the order is ever used.
-This allows for a bigger pool of containers to choose from, for instance a vector can also be swapped for a set in some circumstances.
-However, this approach is still limited in the semantics it can identify, for instance it cannot differentiate a stack or queue from any other type of list.
+A similar rules-based approach was also used in \cite{l_liu_perflint_2009}, while \cite{jung_brainy_2011} uses a machine learning approach with similar statistics collection.
 \subsection{CollectionSwitch}
-CollectionSwitch\parencite{costa_collectionswitch_2018} takes a different approach to the container selection problem, adapting as the program runs and new information becomes available.
+CollectionSwitch\parencite{costa_collectionswitch_2018} is an online solution, which adapts as the program runs and new information becomes available.
 First, a performance model is built for each container implementation.
 This is done by performing each operation many times in succession, varying the length of the collection.
-This data is used to fit a polynomial, which gives an estimate of cost per operation at a given n.
+This data is used to fit a polynomial, which gives an estimate of cost of a specific operation at a given n.
 This is then combined with the frequency of each operation counts to give cost estimates for each collection type, operation, and 'cost dimension' (time and space).
 Rules then decide when switching to a new implementation is worth it based on these cost estimates and defined thresholds.
 By generating a cost model based on benchmarks, CollectionSwitch manages to be more flexible than other rules-based approaches such as Chameleon.
-It expects applications to use Java's \code{List}, \code{Set}, and \code{Map} interfaces, which express enough functional requirements for most problems.
+It expects applications to use Java's \code{List}, \code{Set}, or \code{Map} interfaces, which express enough functional requirements for most problems.
+
+\cite{hutchison_coco_2013} and \cite{osterlund_dynamically_2013} both also attempt online selection, however do so with a rules-based approach more similar to Chameleon \cite{shacham_chameleon_2009}.
 \subsection{Primrose}
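
As an illustration of the distinction the changed text draws between syntactic and semantic properties, the following is a minimal Rust sketch; it is not code from the thesis or from this commit. The trait `Container`, its methods `insert_value` and `count`, and the helper `count_after_inserts` are hypothetical names introduced here. Only `Vec::push`, `HashSet::insert`, and `len` come from Rust's standard library.

```rust
use std::collections::HashSet;
use std::hash::Hash;

/// Hypothetical trait standing in for the "syntactic properties" a program
/// needs: something to insert a value and something to report a length.
trait Container<T> {
    fn insert_value(&mut self, value: T);
    fn count(&self) -> usize;
}

// Vec keeps every inserted element, including duplicates, in insertion order.
impl<T> Container<T> for Vec<T> {
    fn insert_value(&mut self, value: T) {
        self.push(value);
    }
    fn count(&self) -> usize {
        self.len()
    }
}

// HashSet silently drops values that are already present.
impl<T: Eq + Hash> Container<T> for HashSet<T> {
    fn insert_value(&mut self, value: T) {
        self.insert(value);
    }
    fn count(&self) -> usize {
        self.len()
    }
}

/// Compiles for any container that satisfies the trait, but the result
/// depends on the semantic properties of the chosen implementation.
fn count_after_inserts<C: Container<i32> + Default>(values: &[i32]) -> usize {
    let mut container = C::default();
    for &v in values {
        container.insert_value(v);
    }
    container.count()
}
```

Calling `count_after_inserts::<Vec<i32>>(&[1, 2, 2, 3])` yields 4, while `count_after_inserts::<HashSet<i32>>(&[1, 2, 2, 3])` yields 3. Both types meet the same syntactic requirements, yet a program relying on the count including duplicates would behave incorrectly with `HashSet`.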