From af4e8c6f7b761b950ede3f66cb459dd053351abe Mon Sep 17 00:00:00 2001
From: Aria Shrimpton
Date: Sat, 30 Mar 2024 01:13:26 +0000
Subject: some redrafting

---
 thesis/parts/background.tex   | 56 ++++++++++++++++++++++++-------------------
 thesis/parts/introduction.tex | 25 ++++++++++++-------
 2 files changed, 48 insertions(+), 33 deletions(-)

diff --git a/thesis/parts/background.tex b/thesis/parts/background.tex
index f705aad..6b4e295 100644
--- a/thesis/parts/background.tex
+++ b/thesis/parts/background.tex
@@ -1,54 +1,62 @@
 In this chapter, we provide an overview of the problem of container selection and its effect on program correctness and performance.
-We then provide an overview of approaches from modern programming languages and existing literature.
-Finally, we explain how our system is novel, and the weaknesses in existing literature it solves.
+We then provide an overview of approaches taken by modern programming languages and existing literature.
+Finally, we explain how our system is novel, and the weaknesses in existing literature it addresses.
 \section{Container Selection}
-The vast majority of programs will make extensive use of collection data types --- types intended to hold multiple instances of other types.
+A container data structure is simply a structure which holds a collection of related values.
+This could include a list (growable or otherwise), a set (with no duplicate elements), or something more complex (like a min heap).
-In many languages, the standard library provides a variety of collections, with users able to choose which is best for their program.
+In many languages, the standard library provides implementations of various container types, with users able to choose which is best for their program.
 This saves users a lot of time, however selecting the best type is not always straightforward.
 Consider a program which needs to store and query a set of numbers, and doesn't care about ordering or duplicates.
-If the number of items ($n$) is small enough, it might be fastest to use a dynamic array, and scan through each time we want to check if a number is inside.
-On the other hand, if the set we deal with is much larger, we may want the constant-time lookups provided by hash sets, at the cost of a generally slower lookup.
+If the number of items ($n$) is small enough, it might be fastest to use a dynamically-sized array (known as a vector in many languages), and scan through each time we want to check if a number is inside.
+On the other hand, if the set we deal with is much larger, we might need to use a more complex method to keep things fast.
+A common example would be a hash set, which provides roughly the same lookup speed regardless of size, at the cost of being slower overall.
 In this case, there are two factors driving our decision.
-Our functional requirements, that we don't care about ordering or duplicates, and our non-functional requirements, that we want our program to be fast.
+Our \emph{functional requirements} -- that we don't care about ordering or duplicates -- and our \emph{non-functional requirements} -- that we want our program to use resources efficiently.
 \subsection{Functional requirements}
-Functional requirements tell us how the container will be used and how it must behave.
-Continuing with our previous example, we'll compare Rust's \code{Vec} type (a dynamic array), with the \code{HashSet} type.
+Functional requirements tell us how the container must behave in order for the program it's used in to function correctly.
+Continuing with our previous example, we'll compare Rust's \code{Vec} implementation (a dynamic array), with its \code{HashSet} implementation (a hash table).
-Note that the two types have different methods: \code{Vec} implements \code{.get(index)}, while \code{HashSet} does not; \code{HashSet}s aren't ordered so this doesn't make sense.
-If we were building a program that needed an ordered collection, replacing \code{Vec} with \code{HashSet} probably wouldn't compile.
+To start with, we can see that the two types have different methods.
+\code{Vec} has a \code{.get(index)} method, while \code{HashSet} does not; \code{HashSet}s aren't ordered so this wouldn't make sense.
+If we were building a program that needed an ordered collection, replacing \code{Vec} with \code{HashSet} would likely cause the compiler to raise an error.
-We will call the operations a container provides the ``syntactic properties'' of the container.
-In object-oriented programming, we might say they must implement an ``interface'', while in Rust, we could say that they implement a ``trait''.
+We will call the operations that a container implementation provides the \emph{syntactic properties} of the implementation.
+In object-oriented programming, we might say they must implement an interface, while in Rust, we would say that they implement a trait.\footnote{Rust's traits are most similar to typeclasses from Haskell and other FP languages. Interfaces and typeclasses have important differences, but they are irrelevant in this case.}
-However, syntactic properties alone are not always enough to select an appropriate container.
-Suppose our program only requires a container to have \code{.insert(value)} and \code{.len()}.
-Both \code{Vec} and \code{HashSet} will satisfy these requirements, but we might rely on \code{.len()} including duplicates.
+However, syntactic properties alone are not always enough to select an appropriate container implementation.
+Suppose our program only requires a type with the \code{.contains(value)} and \code{.len()} methods.
+Both \code{Vec} and \code{HashSet} satisfy these requirements, but our program might also rely on the count returned from \code{.len()} including duplicates.
 In this case, \code{HashSet} would give us different behaviour, causing our program to behave incorrectly.
-Therefore we also say that a container implementation has ``semantic properties''.
-Intuitively we can think of this as what conditions the container upholds.
-For a \code{HashSet}, this would include that there are never any duplicates, whereas for a \code{Vec} it would include that ordering is preserved.
+Therefore, we also say that a container implementation has \emph{semantic properties}.
+Intuitively we can think of this as what conditions are upheld.
+For a \code{HashSet}, this would include that there are never any duplicates.
+A \code{Vec} would not have this property, but would have the property that insertion order is preserved.
+
+To select a correct container implementation, we then need to ensure we meet some syntactic and semantic requirements specific to our program.
+So long as we specify our requirements correctly, and use an implementation which provides all of the properties we're looking for, our program shouldn't be able to tell the difference.
 \subsection{Non-functional requirements}
-While meeting the functional requirements should ensure our program runs correctly, we also want to choose the 'best' type that we can, striking a balance between runtime and memory usage.
+While meeting our program's functional requirements should ensure that it runs correctly, this doesn't say anything about our program's efficiency.
+We likely also want to choose the most efficient implementation available, striking a balance between runtime and memory usage.
-Prior work has demonstrated that proper container selection can result in substantial performance improvements.
+Prior work has demonstrated that the right container implementation can give substantial performance improvements.
 \cite{l_liu_perflint_2009} found and suggested fixes for ``hundreds of suboptimal patterns in a set of large C++ benchmarks,'' with one such case improving performance by 17\%.
 Similarly, \cite{jung_brainy_2011} demonstrates an average increase in speed of 27-33\% on real-world applications and libraries using a similar approach.
-If we can find a selection of types that satisfy our functional requirements, then one obvious solution is to benchmark the program with each of these implementations in place and determine which works best.
-This will work so long as our benchmarks are roughly representative of 'real world' inputs.
+If we can find a set of implementations that satisfy our functional requirements, then an obvious solution is to benchmark the program with each of these implementations in place.
+This will obviously work, as long as our benchmarks are roughly representative of the real world.
 Unfortunately, this technique scales poorly for larger applications.
-As the number of types we must select increases linearly, the number of possible permutations increases exponentially (provided they have roughly the same number of candidates).
+As the number of types we must select increases, the number of combinations we have to try increases exponentially.
 This quickly becomes unfeasible, so we must explore other selection methods.
 \section{Prior literature}
diff --git a/thesis/parts/introduction.tex b/thesis/parts/introduction.tex
index aae5474..17830a9 100644
--- a/thesis/parts/introduction.tex
+++ b/thesis/parts/introduction.tex
@@ -3,16 +3,17 @@
 %% **** Container types common in programs
-A common requirement when programming is the need to keep a collection of data together, for example in a list.
+Almost every program makes extensive use of container data structures -- structures which hold a collection of values.
 Often, programmers will have some requirements they want to impose on this collection, such as not storing duplicate elements, or storing the items in sorted order.
 %% **** Functionally identical implementations
-However, implementing these collection types manually is usually a waste of time, as is fine-tuning a custom implementation to perform better.
-Most programmers will simply use one or two collection types provided by their language.
+However, implementing these collection types manually wastes time, and can be hard to do right for more complicated structures.
+Most programmers will simply use one or two of the collection types provided by their language.
+Some languages, such as Python, go a step further, providing built-in implementations of growable lists and associative maps, with special syntax for both.
 %% **** Large difference in performance
-Often, this is not the best choice.
-The underlying implementation of container types which function the same can have a drastic effect on performance (\cite{l_liu_perflint_2009}, \cite{jung_brainy_2011}).
+Unfortunately, the underlying implementation of container types which function the same can have a drastic effect on performance (\cite{l_liu_perflint_2009}, \cite{jung_brainy_2011}).
+By largely ignoring the performance characteristics of their implementation, programmers may be missing out on large performance gains.
 %% *** Motivate w/ effectiveness claims
 We propose a system, Candelabra, for the automatic selection of container implementations, based on both user-specified requirements and inferred requirements for performance.
@@ -22,11 +23,17 @@ In our testing, we are able to accurately select the best performing containers
 %% **** Ease of adding new container types
 We have designed our system with flexibility in mind: adding new container implementations requires little effort.
 %% **** Ease of integration into existing projects
-It is easy to adopt our system incrementally, and we integrate with existing tools to making doing so easy.
+It is easy to adopt our system incrementally, and we integrate with existing tools to make doing so easy.
 %% **** Scalability to larger projects
-The time it takes to select containers scales roughly linearly, even in complex cases, allowing our tool to be used even on larger projects.
+The time it takes to select containers scales roughly linearly, even in complex cases, allowing our system to be used even on larger projects.
 %% **** Flexibility of selection
-Our system is also able to suggest adaptive containers: containers which switch underlying implementation as they grow.
+It is also able to suggest adaptive containers: containers which switch from one underlying implementation to another once they get past a certain size.
 %% **** Overview of results
-Whilst we saw reasonable suggestions in our test cases, we found the overhead of switching and of checking the current implementation to be more of a problem than expected, which future work could improve on.
+Whilst we saw reasonable suggestions in our test cases, we found important performance concerns which future work could improve on.
+
+In chapter \ref{chap:background}, we give a more thorough description of the container selection problem, and examine previous work. We outline gaps in existing literature, and how we aim to contribute.
+
+Chapter \ref{chap:design} explains the design of our solution, and how it fulfills the aims set out in chapter \ref{chap:background}. Chapter \ref{chap:implementation} expands on this, describing the implementation work in detail and the challenges faced.
+
+We evaluate the effectiveness of our solution in chapter \ref{chap:results}, and identify several shortcomings that future work could improve upon.
--
cgit v1.2.3