Diffstat (limited to 'thesis/parts/background.tex')
-rw-r--r-- thesis/parts/background.tex | 152
1 files changed, 81 insertions, 71 deletions
diff --git a/thesis/parts/background.tex b/thesis/parts/background.tex
index c832e50..0bf265b 100644
--- a/thesis/parts/background.tex
+++ b/thesis/parts/background.tex
@@ -1,138 +1,148 @@
-In this chapter we provide an overview of the problem of container selection, and its effect on program correctness and performance.
-We then provide an overview of approaches from modern programming languages, and existing literature.
-Finally, we examine the gaps in the existing literature, and explain what we aim to contribute.
+In this chapter we provide an overview of the problem of container selection and its effect on program correctness and performance.
+We then provide an overview of approaches from modern programming languages and existing literature.
\section{Container Selection}
-The vast majority of programs will make extensive use of collection data types --- types intended to hold many different instances of other data types.
-This includes structures like fixed-size arrays, growable lists, and key-value mappings.
+The vast majority of programs will make extensive use of collection data types --- types intended to hold multiple instances of other types.
-In many languages, the standard library provides a variety of collections, forcing us to choose which is best.
-Consider the Rust types \code{Vec<T>} (a dynamic array) and \code{HashSet<T>} (a hash-based set). %comma separate definitions instead of parentheses
-If we care about the ordering, or about preserving duplicates, then we must use \code{Vec<T>}.
-But if we don't, then \code{HashSet<T>} might be more performant, if we use \code{contains} a lot.%bad contraction (there can be good contractions)
+In many languages, the standard library provides a variety of collections, forcing users to choose which is best for their program.
+Consider the Rust types \code{Vec<T>}, a dynamic array, and \code{HashSet<T>}, a hash-based set.
+If a user cares about ordering, or about preserving duplicates, then they must use \code{Vec<T>}.
+But if they do not, then \code{HashSet<T>} might be more performant, provided \code{contains} is used repeatedly.
-We refer to this problem as container selection, and say that we must satisfy both functional requirements, and non-functional requirements.%comma?
+We refer to this problem as container selection, and say that we must satisfy both functional requirements and non-functional requirements.
\subsection{Functional requirements}
-The functional requirements tell us how the container will be used, and how it must behave.%comma?
+The functional requirements tell us how the container will be used and how it must behave.
Continuing with our previous example, we can see that \code{Vec} and \code{HashSet} implement different methods.
-\code{Vec} implements \code{.get(index)} while \code{HashSet} doesn't - it wouldn't make sense for an unordered collection.%contraction
-If we try to swap \code{Vec} for \code{HashSet}, the resulting program will likely not compile.
+\code{Vec} implements \code{.get(index)}, while \code{HashSet} does not; this would not be possible for an unordered collection.
+If we attempt to replace \code{Vec} with \code{HashSet}, the resulting program will likely not compile.
We will call the operations a container implements the ``syntactic properties'' of the container.
-In object-oriented programming, we might say they must implement an interface, while in Rust, we would say that they implement a trait.
+In object-oriented programming, we might say they must implement an ``interface'', while in Rust, we would say that they implement a ``trait''.
However, syntactic properties alone are not always enough to select an appropriate container.
-Suppose our program only requires a container to have \code{.insert(value)}, and \code{.len()}.
+Suppose our program only requires a container to have \code{.insert(value)} and \code{.len()}.
Both \code{Vec} and \code{HashSet} will satisfy these requirements, but we might rely on \code{.len()} including duplicates.
In this case, \code{HashSet} would give us different behaviour, causing our program to behave incorrectly.
Therefore we also say that a container implementation has ``semantic properties''.
Intuitively, we can think of these as the conditions the container upholds.
-For a \code{HashSet}, this would include that there are never any duplicates, whereas for a Vec it would include that ordering is preserved.
+For a \code{HashSet}, this would include that there are never any duplicates, whereas for a \code{Vec} it would include that ordering is preserved.
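The difference in semantic properties can be seen in a minimal Rust sketch. It assumes \code{Vec::push} stands in for the abstract \code{.insert(value)} operation, and the helper names \code{len\_with\_vec} and \code{len\_with\_hash\_set} are hypothetical, introduced only for illustration.

```rust
use std::collections::HashSet;

/// Inserts every value into a `Vec` and reports the resulting length.
/// `push` is used here as the `Vec` counterpart of an abstract `insert(value)`.
fn len_with_vec(values: &[i32]) -> usize {
    let mut container = Vec::new();
    for &value in values {
        container.push(value);
    }
    container.len()
}

/// Inserts every value into a `HashSet` and reports the resulting length.
/// Duplicate values are silently discarded by the set.
fn len_with_hash_set(values: &[i32]) -> usize {
    let mut container = HashSet::new();
    for &value in values {
        container.insert(value);
    }
    container.len()
}

fn main() {
    // The value 2 appears twice, so the two containers disagree on `len`.
    let values = [1, 2, 2, 3];
    println!("Vec len: {}", len_with_vec(&values));
    println!("HashSet len: {}", len_with_hash_set(&values));
}
```

Both functions compile against the same two operations, yet a program that relies on \code{.len()} counting duplicates is only correct with the \code{Vec} version.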
\subsection{Non-functional requirements}
-While meeting the functional requirements should ensure our program runs correctly, we also want to choose the 'best' type that we can.
-Here we will consider 'best' as striking a balance between runtime and memory usage.
+While meeting the functional requirements should ensure our program runs correctly, we also want to choose the `best' type that we can, striking an ideal balance between runtime and memory usage.
-Prior work has shown that properly considering container selection selection can give substantial performance improvements.
-For instance, \cite{l_liu_perflint_2009} found and suggested fixes for ``hundreds of suboptimal patterns in a set of large C++ benchmarks,'' with one such case improving performance by 17\%.
-Similarly, \cite{jung_brainy_2011} achieves an average speedup of 27-33\% on real-world applications and libraries.
+Prior work has demonstrated that proper container selection can result in substantial performance improvements.
+\cite{l_liu_perflint_2009} found and suggested fixes for ``hundreds of suboptimal patterns in a set of large C++ benchmarks,'' with one such case improving performance by 17\%.
+Similarly, \cite{jung_brainy_2011} demonstrates an average increase in speed of 27--33\% on real-world applications and libraries using a similar approach.
-If we can find a selection of types that satisfy our functional requirements, then one obvious solution is to benchmark the program with each of these implementations in place, and see which works best.
-This will obviously work, so long as our benchmarks are roughly representative of 'real world' inputs.
+If we can find a selection of types that satisfy our functional requirements, then one obvious solution is to benchmark the program with each of these implementations in place and determine which works best.
+This will work so long as our benchmarks are roughly representative of `real-world' inputs.
-Unfortunately, this technique scales poorly for bigger applications.
-As the number of container types we must select increases, the number of combinations we must try increases exponentially (assuming they all have roughly the same number of candidates).
-This quickly becomes unfeasible, and so we must find other selection methods.
+Unfortunately, this technique scales poorly for larger applications.
+As the number of container types we must select grows, the number of combinations we must try grows exponentially (assuming each type has roughly the same number of candidate implementations).
+This quickly becomes unfeasible, so we must explore other selection methods.
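To make the growth concrete, the following sketch counts the whole-program configurations that exhaustive benchmarking would need to run. It assumes every container type has the same number of candidate implementations; the function name \code{combinations} and the figure of three candidates per type are illustrative assumptions, not values taken from the cited work.

```rust
/// Number of whole-program configurations to benchmark when each of `sites`
/// container types has `candidates_per_site` valid implementations.
/// Returns `None` if the count does not fit in a `u64`.
fn combinations(candidates_per_site: u64, sites: u32) -> Option<u64> {
    candidates_per_site.checked_pow(sites)
}

fn main() {
    // Assumed for illustration: three candidate implementations per type.
    let candidates = 3;
    for sites in [1u32, 2, 5, 10, 20] {
        match combinations(candidates, sites) {
            Some(n) => println!("{} types -> {} configurations", sites, n),
            None => println!("{} types -> too many configurations for u64", sites),
        }
    }
}
```

Under these assumptions, ten container types already require tens of thousands of full benchmark runs.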
-\section{Prior Literature}
+\section{Prior literature}
-In this section, we outline methods for container selection available in current programming languages, and their limitations.
-We then examine some of the existing solutions for container selection, and their limitations.
+In this section we outline the methods for container selection available both within current programming languages and in the existing literature, along with their limitations.
\subsection{Approaches in common programming languages}
Modern programming languages broadly take one of two approaches to container selection.
-Some languages, usually higher-level ones, recommend built-in structures as the default, using implementations that perform fine for the vast majority of use-cases.
+Some languages, usually higher-level ones, recommend built-in structures as the default, using implementations that perform well enough for the vast majority of use-cases.
One popular example is Python, which uses dynamic arrays as its built-in list implementation.
-This approach prioritises developer ergonomics: Programmers do not need to think about how these are implemented.
-Usually, other implementations are possible, but are used only when needed and come at the cost of code readability.
+This approach prioritises developer ergonomics: programmers do not need to think about how these are implemented.
+Often other implementations are possible, but are used only when needed and come at the cost of code readability.
-In other languages, collections are given as part of a standard library, or must be written by the user.
-Java comes with growable lists as part of its standard library, as does Rust (with some macros to make use easier).
-In both cases, the ``blessed'' implementation of collections is not special - users can implement their own.
+In other languages, collections are given as part of a standard library or must be written by the user.
+Java comes with growable lists as part of its standard library, as does Rust.
+In both cases, the ``blessed'' implementation of collections is not special --- users can implement their own and use them in the same ways.
Often interfaces, or their closest equivalent, are used to distinguish `similar' collections.
-In Java, ordered collections implement the interface \code{List<E>}, while similar interfaces exist for \code{Set<E>}, \code{Queue<E>}, etc.
-This means that when the developer chooses a type, the compiler enforces the syntactic requirements of the collection, and the writer of the implementaiton ``promises'' they have met the semantic requirements.
+In Java, ordered collections implement the interface \code{List<E>}, with similar interfaces for \code{Set<E>}, \code{Queue<E>}, etc.
-Whilst the approach Java takes is the most expressive, both of these approaches either put the choice on the developer, or remove the choice entirely.
-This means that developers are forced to guess based on their knowledge of the underlying implementations, or to just pick the most common implementation.
+This allows most code to be implementation-agnostic; however, the developer must still choose a concrete implementation at some point.
+This means that developers are forced to guess based on their knowledge of the underlying implementations, or simply choose the most common implementation.
\subsection{Rules-based approaches}
-One approach to the container selection problem is to allow the developer to make the choice initially, but use some tool to detect bad choices.
+One approach to the container selection problem is to allow the developer to make the choice initially, but use some tool to detect poor choices.
+Chameleon\parencite{shacham_chameleon_2009} is a solution of this type.
-Chameleon\parencite{shacham_chameleon_2009} is one example of this.
It first collects statistics from program benchmarks using a ``semantic profiler''.
-This includes the space used by collections over time, and the counts of each operation performed.
-These statistics are tracked per individual collection allocated, and then aggregated by 'allocation context' - the call stack at the point where the allocation occured.
+This includes the space used by collections over time and the counts of each operation performed.
+These statistics are tracked per individual collection allocated and then aggregated by `allocation context' --- the call stack at the point where the allocation occurred.
-These aggregated statistics are then passed to a rules engine, which uses a set of rules to suggest places a different container type might improve performance.
-This results in a flexible engine for providing suggestions, which can be extended with new rules and types as necessary.
+These aggregated statistics are passed to a rules engine, which uses a set of rules to suggest container types which might improve performance.
+This results in a flexible engine for providing suggestions which can be extended with new rules and types as necessary.
+
+However, adding new implementations requires the developer to write new suggestion rules.
+This can be difficult as it requires the developer to understand all of the existing implementations' performance characteristics.
To satisfy functional requirements, Chameleon only suggests new types that behave identically to the existing type.
-This results in selection rules needing to be more restricted than they otherwise could be.
-For instance, a rule cannot suggest a \code{HashSet} instead of a \code{LinkedList}, as the two are not semantically identical.
-Chameleon has no way of knowing if doing so will break the program's functionality, and so it does not make a suggestion.
+This results in selection rules being more restricted than they otherwise could be.
+For instance, a rule cannot suggest a \code{HashSet} instead of a \code{LinkedList} as the two are not semantically identical.
+Chameleon has no way of knowing if doing so will break the program's functionality and so it does not make the suggestion.
-A similar rules-based approach is used by \cite{l_liu_perflint_2009} for the C++ standard library.
\cite{hutchison_coco_2013} and \cite{osterlund_dynamically_2013} use similar techniques, but work as the program runs.
-This works well for programs with different phases of execution, however does incur an overhead.
+This works well for programs with different phases of execution, such as loading and then working on data.
+However, the overhead from profiling and from checking rules may not be worth the improvements in other programs, where access patterns are roughly the same throughout.
\subsection{ML-based approaches}
-%% TODO
-\cite{jung_brainy_2011} uses a machine learning approach with similar statistics collection
+Brainy\parencite{jung_brainy_2011} gathers statistics similarly; however, it uses machine learning (ML) for selection instead of programmed rules.
+
+ML has the advantage of being able to detect patterns a human may not be aware of.
+For example, Brainy takes into account statistics from hardware counters, which are difficult for a human to reason about.
+This approach also makes it easier to add new collection implementations, as rules do not need to be written by hand.
-\cite{thomas_framework_2005} also uses an ML approach, but focuses on parallel algorithms rather than data structures, and does not take hardware counters into account.
+\cite{thomas_framework_2005} also uses an ML approach, but focuses on parallel algorithms rather than data structures.
+On installation, it collects information about the system and the performance of various algorithms.
+This is then
\subsection{Estimate-based approaches}
-CollectionSwitch\parencite{costa_collectionswitch_2018} is an online solution, which adapts as the program runs and new information becomes available.
+CollectionSwitch\parencite{costa_collectionswitch_2018} is an online solution which adapts as the program runs and new information becomes available.
First, a performance model is built for each container implementation.
-This is done by performing each operation many times in succession, varying the length of the collection.
-This data is used to fit a polynomial, which gives an estimate of cost of a specific operation at a given n.
+This gives an estimate of some cost for each operation at a given collection size.
+We call the measure of cost the ``cost dimension''.
+Examples of cost dimensions include memory usage and execution time.
+
+This is combined with profiling information to give cost estimates for each collection type and cost dimension.
+Switching between container types is then done based on the potential change in each cost dimension.
+For instance, we may choose to switch if we reduce the estimated space cost by more than 20\%, so long as the estimated time cost does not increase by more than 20\%.
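The threshold rule in this example can be sketched as follows. This is a minimal illustration of the rule as described here and not CollectionSwitch's actual implementation; the \code{CostEstimate} type, the \code{should\_switch} function, and the fixed 20\% thresholds are assumptions made for the sketch.

```rust
/// Estimated cost of running the profiled workload on one implementation.
/// Both fields are assumed to be strictly positive.
#[derive(Debug, Clone, Copy)]
struct CostEstimate {
    time: f64,
    space: f64,
}

/// Returns true when the candidate cuts estimated space by more than 20%
/// without increasing estimated time by more than 20%.
fn should_switch(current: CostEstimate, candidate: CostEstimate) -> bool {
    // Relative change in each cost dimension, measured against the current choice.
    let space_saving = (current.space - candidate.space) / current.space;
    let time_increase = (candidate.time - current.time) / current.time;
    space_saving > 0.20 && time_increase <= 0.20
}

fn main() {
    let current = CostEstimate { time: 100.0, space: 100.0 };
    // 30% less space for 10% more time: within both thresholds.
    let compact = CostEstimate { time: 110.0, space: 70.0 };
    // 30% less space, but 30% more time: the time threshold is exceeded.
    let slow = CostEstimate { time: 130.0, space: 70.0 };
    println!("switch to compact: {}", should_switch(current, compact));
    println!("switch to slow: {}", should_switch(current, slow));
}
```

A symmetric rule that trades space for time could be written the same way by swapping the two cost dimensions.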
-This is then combined with the frequency of each operation counts to give cost estimates for each collection type, operation, and 'cost dimension' (time and space).
-Rules then decide when switching to a new implementation is worth it based on these cost estimates and defined thresholds.
+By generating a cost model based on benchmarks, CollectionSwitch manages to be more flexible than rule-based approaches.
+As with ML approaches, adding new implementations requires little extra work, with the added advantage that no model needs to be re-trained.
-By generating a cost model based on benchmarks, CollectionSwitch manages to be more flexible than other rules-based approaches such as Chameleon.
-It expects applications to use Java's \code{List}, \code{Set}, or \code{Map} interfaces, which express enough functional requirements for most problems.
+A similar approach is used by \cite{l_liu_perflint_2009} for the C++ standard library. %meowr
\subsection{Functional requirements}
-Most of the approaches highlighted above have focused on non-functional requirements, and used programming language features to enforce functional requirements.
-By contrast, Primrose \parencite{qin_primrose_2023} focuses on the functional requirements of container selection.
+Most of the approaches we have highlighted focus on non-functional requirements, and use programming language features to enforce functional requirements.
+We will now examine tools which focus on container selection based on functional requirements.
-It allows the application developer to specify semantic requirements using a DSL, and syntactic requirements using Rust's traits.
+Primrose \parencite{qin_primrose_2023} is one such tool, which uses a model-based approach.
+It allows the application developer to specify semantic requirements using a Domain-Specific Language (DSL), and syntactic requirements using Rust's traits.
-A semantic property is simply a predicate, acting on an abstract model of the container type.
-Similarly, each implementation provides an abstract version of its operations acting on this model.
-An SMT solver then checks if a given implementation will always meet the conditions required by the predicate(s).
+The semantic requirements are expressed as a list of predicates, each representing a semantic property.
+Predicates act on an abstract model of the container type.
+Each implementation also specifies the conditions it upholds using an abstract model.
+A constraint solver then checks if a given implementation will always meet the conditions required by the predicate(s).
-Developers must then choose which of these implementations will work best for their non-functional requirements.
+This allows developers to express any combination of semantic requirements, rather than limiting them to common ones (as in Java's approach).
+It can also be extended with new implementations as needed, though this does require modelling the semantics of the new implementation.
-This allows developers to express any combination of semantic requirements, rather than limiting them to common ones like Java's approach.
-It can also be extended with new implementations as needed, although this does require modelling the semantics of the new implementation.
+\cite{franke_collection_2022} also uses the idea of refinement types, but is limited to properties that are defined by the library authors and implemented by the containers themselves.
-\cite{franke_collection_2022} also uses the idea of refinement types, but is limited to properties defined by the library authors.
+To select the final container implementation, both tools rely on benchmarking each candidate.
+As we note above, this scales poorly.
-\section{Contributions}
+We will create a container selection method that builds primarily on the Primrose approach while incorporating elements of CollectionSwitch's approach, in order to address the scaling issue that many existing approaches face.