author    Aria Shrimpton <me@aria.rip>    2024-04-01 15:20:39 +0100
committer Aria Shrimpton <me@aria.rip>    2024-04-01 15:20:39 +0100
commit    47941ae2594c8eb3cea07d40352a96a7243b8cee (patch)
tree      43e8a63ed49383b635c0ccaa53cd75e894336369 /thesis/parts/background.tex
parent    1030ca8daa6b205722b9747997ceaecc21de9ef7 (diff)

redraft #2

Diffstat (limited to 'thesis/parts/background.tex'):
-rw-r--r--  thesis/parts/background.tex | 80
1 file changed, 44 insertions(+), 36 deletions(-)
diff --git a/thesis/parts/background.tex b/thesis/parts/background.tex
index e65afad..ea30f36 100644
--- a/thesis/parts/background.tex
+++ b/thesis/parts/background.tex
@@ -1,10 +1,10 @@
-In this chapter, we provide an overview of the problem of container selection and its effect on program correctness and performance.
+In this chapter, we explain the problem of container selection and its effect on program correctness and performance.
 We then provide an overview of approaches taken by modern programming languages and existing literature.
 Finally, we explain how our system is novel, and the weaknesses in existing literature it addresses.
 
 \section{Container Selection}
 
-A container data structure is simply a structure which holds a collection of related values.
+A container data type is simply a structure which holds a collection of related values.
 This could include a list (growable or otherwise), a set (with no duplicate elements), or something more complex (like a min heap).
 
 In many languages, the standard library provides implementations of various container types, with users able to choose which is best for their program.
@@ -12,8 +12,7 @@ This saves users a lot of time, however selecting the best type is not always st
 Consider a program which needs to store and query a set of numbers, and doesn't care about ordering or duplicates.
 
 If the number of items ($n$) is small enough, it might be fastest to use a dynamically-sized array (known as a vector in many languages), and scan through each time we want to check if a number is inside.
-On the other hand, if the set we deal with is much larger, we might need to use a more complex method to keep things fast.
-A common example would be a hash set, which provides roughly the same lookup speed regardless of size, at the cost of being slower overall.
+On the other hand, if the set we deal with is much larger, we might instead use a hash set, which provides roughly the same lookup speed regardless of size at the cost of being slower overall.
 
 In this case, there are two factors driving our decision.
 Our \emph{functional requirements} -- that we don't care about ordering or duplicates -- and our \emph{non-functional requirements} -- that we want our program to use resources efficiently.
@@ -24,7 +23,7 @@ Functional requirements tell us how the container must behave in order for the p
 Continuing with our previous example, we'll compare Rust's \code{Vec} implementation (a dynamic array), with its \code{HashSet} implementation (a hash table).
 
 To start with, we can see that the two types have different methods.
-\code{Vec} has a \code{.get(index)} method, while \code{HashSet} does not; \code{HashSet}s aren't ordered so this wouldn't make sense.
+\code{Vec} has a \code{.get()} method, while \code{HashSet} does not.
 If we were building a program that needed an ordered collection, replacing \code{Vec} with \code{HashSet} would likely cause the compiler to raise an error.
 
 We will call the operations that a container implementation provides the \emph{syntactic properties} of the implementation.
@@ -35,9 +34,9 @@ Suppose our program only requires a type with the \code{.contains(value)} and \c
 Both \code{Vec} and \code{HashSet} satisfy these requirements, but our program might also rely on the count returned from \code{.len()} including duplicates.
 
 In this case, \code{HashSet} would give us different behaviour, causing our program to behave incorrectly.
-Therefore, we also say that a container implementation has \code{semantic properties}.
-Intuitively we can think of this as what conditions are upheld.
-For a \code{HashSet}, this would include that there are never any duplicates.
+Therefore, we also say that a container implementation has \emph{semantic properties}.
+Intuitively we can think of this as the conditions upheld by the container.
+A \code{HashSet} would have the property that there are never any duplicates.
 A \code{Vec} would not have this property, but would have the property that insertion order is preserved.
 
 To select a correct container implementation, we then need to ensure we meet some syntactic and semantic requirements specific to our program.
@@ -48,18 +47,17 @@ So long as we specify our requirements correctly, and use an implementation whic
 While meeting our program's functional requirements should ensure that it runs correctly, this doesn't say anything about our program's efficiency.
 We also want to choose the most efficient implementation available, striking a balance between runtime and memory usage.
 
-Prior work has demonstrated that the right container implementation can give substantial performance improvements.
-Perflint\citep{l_liu_perflint_2009} found and suggested fixes for ``hundreds of suboptimal patterns in a set of large C++ benchmarks,'' with one such case improving performance by 17\%.
-Similarly, Brainy\citep{jung_brainy_2011} demonstrates an average increase in speed of 27-33\% on real-world applications and libraries using a similar approach.
+Prior work has demonstrated that changing container implementation can give substantial performance improvements.
+Perflint \citep{l_liu_perflint_2009} found and suggested fixes for ``hundreds of suboptimal patterns in a set of large C++ benchmarks,'' with one such case improving performance by 17\%.
+Similarly, Brainy \citep{jung_brainy_2011} found a 27-33\% speedup of real-world applications and libraries using a similar approach.
 
-If we can find a set of implementations that satisfy our functional requirements, then an obvious solution is to benchmark the program with each of these implementations in place.
-This will obviously work, as long as our benchmarks are roughly representative of the real world.
+If we can find a set of implementations that satisfy our functional requirements, then one obvious solution is to benchmark the program with each of these implementations in place.
+This will obviously work, so long as our benchmarks are roughly representative of the real world.
 
 Unfortunately, this technique scales poorly for larger applications.
 As the number of types we must select increases, the number of combinations we have to try increases exponentially.
-This quickly becomes unfeasible, so we must explore other selection methods.
 
-\section{Prior literature}
+\section{Prior art}
 
 In this section we outline existing methods for container selection, in both current programming languages and literature.
@@ -68,60 +66,70 @@ In this section we outline existing methods for container selection, in both cur
 Modern programming languages broadly take one of two approaches to container selection.
 Some languages, usually higher-level ones, recommend built-in structures as the default, using implementations that perform well enough for the vast majority of use-cases.
 
-One popular examples is Python, which uses dynamic arrays as its built-in list implementation.
+A popular example is Python, which uses dynamic arrays as its built-in list implementation.
+
 This approach prioritises developer ergonomics: programmers do not need to think about how these are implemented.
 Often other implementations are possible, but are used only when needed and come at the cost of code readability.
 
 In other languages, collections are given as part of a standard library or must be written by the user.
 Java comes with growable lists as part of its standard library, as does Rust.
-In both cases, the standard library implementation is not special --- users can implement their own and use them in the same ways.
+In both cases, the standard library implementation is not special -- users can implement their own and use them in the same ways.
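The exponential growth mentioned in this hunk is easy to quantify; a tiny Rust sketch (hypothetical figures, not part of the patch): with $m$ candidate implementations per container and $n$ containers to select, exhaustive benchmarking needs $m^n$ whole-program runs.

```rust
fn main() {
    // Hypothetical figures: m candidate implementations per container,
    // n containers to select => m^n whole-program benchmark runs.
    let m: u64 = 5;
    for n in [1u32, 5, 10] {
        println!("{n} containers -> {} benchmark runs", m.pow(n));
    }
    // Even modest programs quickly become infeasible to benchmark exhaustively:
    assert_eq!(m.pow(10), 9_765_625);
}
```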
-Interfaces, or their closest equivalent, are often used to abstract over 'similar' collections.
+Interfaces, or their closest equivalent, are often used to abstract over similar collections.
 In Java, ordered collections implement the interface \code{List<E>}, with similar interfaces for \code{Set<E>}, \code{Queue<E>}, etc.
+This allows most code to be implementation-agnostic, with functional requirements specified by the interface used.
+
+Whilst this provides some flexibility, it still requires the developer to choose a concrete implementation at some point.
+In most cases, developers will simply choose the most common implementation and assume it will be fast enough.
 
-This allows most code to be implementation-agnostic, but still requires the developer to choose a concrete implementation at some point.
-This means that developers are forced to guess based on their knowledge of the underlying implementations, or to simply choose the most common implementation.
+Otherwise, developers are forced to guess based on their knowledge of specific implementations and their program's behaviour.
+For more complex programs or data structures, it can be difficult or impossible to reason about an implementation's performance.
 
 \subsection{Rules-based approaches}
 
-One approach to this problem is to allow the developer to make the choice initially, but use some tool to detect poor choices.
-Chameleon\citep{shacham_chameleon_2009} uses this approach.
+One way to address this is to allow the developer to make the choice initially, but attempt to detect cases where the wrong choice was made.
+Chameleon \citep{shacham_chameleon_2009} is one system which uses this approach.
 
 First, it collects statistics from program benchmarks using a ``semantic profiler''.
 This includes the space used by collections over time and the counts of each operation performed.
 These statistics are tracked per individual collection allocated and then aggregated by 'allocation context' --- the call stack at the point where the allocation occurred.
 
-These aggregated statistics are passed to a rules engine, which uses a set of rules to suggest different container types which might have better performance.
+These aggregated statistics are passed to a rules engine, which uses a set of rules to identify cases where a different container implementation might perform better.
 This results in a flexible engine for providing suggestions which can be extended with new rules and types as necessary.
 A similar approach is used by \cite{l_liu_perflint_2009} for the C++ standard library.
 
-However, adding new implementations requires the developer to write new suggestion rules.
-This can be difficult, as it requires the developer to understand all of the existing implementations' performance characteristics.
+By using the developer's selection as a baseline, both of these tools function similarly to a linter, which the developer can use to catch mistakes and suggest improvements.
+This makes it easy to integrate into existing projects and workflows.
 
-To satisfy functional requirements, Chameleon only suggests new types that behave identically to the existing type.
-This results in selection rules being more restricted than they otherwise could be.
-For instance, a rule cannot suggest a \code{HashSet} instead of a \code{LinkedList} as the two are not semantically identical.
-Chameleon has no way of knowing if doing so will break the program's functionality and so it does not make the suggestion.
+However, the use of suggestion rules means that adding a new container implementation requires writing new suggestion rules.
+This requires the developer to understand all of the existing implementations' performance characteristics, and how they relate to the new implementation.
+In effect, the difficulty of selecting an implementation is offloaded to whoever writes the suggestion rules.
 
-CoCo\citep{hutchison_coco_2013} and \cite{osterlund_dynamically_2013} use similar techniques, but work as the program runs.
+To ensure that functional requirements are satisfied, both systems will only suggest implementations that behave identically to the existing one.
+This results in selection rules being more restricted than necessary.
+For instance, a rule could not suggest a \code{HashSet} instead of a \code{Vec}, as the two are not semantically identical.
+
+CoCo \citep{hutchison_coco_2013} and \cite{osterlund_dynamically_2013} use similar techniques, but work as the program runs.
 This was shown to work well for programs with different phases of execution, such as loading and then working on data.
 However, the overhead from profiling and from checking rules may not be worth the improvements in other programs, where access patterns are roughly the same throughout.
 
 \subsection{ML-based approaches}
 
-Brainy\citep{jung_brainy_2011} gathers similar statistics, but uses machine learning for selection instead of programmed rules.
+Brainy \citep{jung_brainy_2011} gathers similar statistics, but uses machine learning for selection instead of programmed rules.
 
 ML has the advantage of being able to detect patterns a human may not be aware of.
 For example, Brainy takes into account statistics from hardware counters, which are difficult for a human to reason about.
 This also makes it easier to add new collection implementations, as rules do not need to be written by hand.
 
+Whilst this offers increased flexibility, it comes at the cost of requiring a more lengthy model training process when implementations are changed.
+
 \subsection{Estimate-based approaches}
 
-CollectionSwitch\citep{costa_collectionswitch_2018} is another solution, which attempts to estimate the performance characteristics of each implementation individually.
+CollectionSwitch \citep{costa_collectionswitch_2018} also avoids forcing developers to write rules, by estimating the performance characteristics of each implementation individually.
 
 First, a performance model is built for each container implementation.
-This gives an estimate of some cost for each operation at a given collection size.
-This cost might be a measurement of memory usage, or execution time.
+This gives an estimate of some cost dimensions for each operation at a given collection size.
+The originally proposed cost dimensions were memory usage and execution time.
 
 The system then collects data on how the program uses containers as it runs, and combines this with the built cost models to estimate the performance impact for each collection type.
 It may then decide to switch between container types if the potential change in cost seems high enough.
@@ -152,7 +160,7 @@ As we note above, this scales poorly.
 
 \section{Contribution}
 
-Of the tools presented, none are able to deal with both functional and non-functional requirements properly.
+Of the tools presented, none are designed to deal with both functional and non-functional requirements well.
 Our contribution is a system for container selection that addresses both of these aspects.
 
 Users are able to specify their functional requirements in a way that is expressive enough for most usecases, and easy to integrate with existing projects.
@@ -160,4 +168,4 @@ We then find which implementations in our library satisfy these requirements, an
 
 We also aim to make it easy to add new container implementations, and for our system to scale up to large projects without selection time becoming an issue.
 
-Whilst the bulk of our system is focused on offline selection (done before the program is compiled), we also attempt to detect when changing implementation at runtime is desirable.
+Whilst the bulk of our system is focused on offline selection (done before the program is compiled), we also attempt to detect when changing implementation at runtime is desirable, a technique which has largely only been applied to higher-level languages.