Diffstat (limited to 'thesis/parts/background.tex')
-rw-r--r-- thesis/parts/background.tex 56
1 files changed, 32 insertions, 24 deletions
diff --git a/thesis/parts/background.tex b/thesis/parts/background.tex
index f705aad..6b4e295 100644
--- a/thesis/parts/background.tex
+++ b/thesis/parts/background.tex
@@ -1,54 +1,62 @@
In this chapter, we provide an overview of the problem of container selection and its effect on program correctness and performance.
-We then provide an overview of approaches from modern programming languages and existing literature.
-Finally, we explain how our system is novel, and the weaknesses in existing literature it solves.
+We then provide an overview of approaches taken by modern programming languages and existing literature.
+Finally, we explain how our system is novel, and the weaknesses in existing literature it addresses.
\section{Container Selection}
-The vast majority of programs will make extensive use of collection data types --- types intended to hold multiple instances of other types.
+A container data structure is simply a structure which holds a collection of related values.
+This could include a list (growable or otherwise), a set (with no duplicate elements), or something more complex (like a min heap).
-In many languages, the standard library provides a variety of collections, with users able to choose which is best for their program.
+In many languages, the standard library provides implementations of various container types, with users able to choose which is best for their program.
This saves users a lot of time; however, selecting the best type is not always straightforward.
Consider a program which needs to store and query a set of numbers, and doesn't care about ordering or duplicates.
-If the number of items ($n$) is small enough, it might be fastest to use a dynamic array, and scan through each time we want to check if a number is inside.
-On the other hand, if the set we deal with is much larger, we may want the constant-time lookups provided by hash sets, at the cost of a generally slower lookup.
+If the number of items ($n$) is small enough, it might be fastest to use a dynamically sized array (known as a vector in many languages), and scan through it each time we want to check whether a number is present.
+On the other hand, if the set we deal with is much larger, we might need to use a more complex method to keep things fast.
+A common example would be a hash set, which provides roughly the same lookup speed regardless of size, at the cost of more overhead on each individual lookup, which makes it slower for small collections.
In this case, there are two factors driving our decision.
-Our functional requirements, that we don't care about ordering or duplicates, and our non-functional requirements, that we want our program to be fast.
+These are our \emph{functional requirements} (that we don't care about ordering or duplicates) and our \emph{non-functional requirements} (that we want our program to use resources efficiently).
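As a minimal sketch of the two options above, the membership query can be written against either implementation. The function names \code{contains\_vec} and \code{contains\_set} are hypothetical and chosen only for illustration; which version is faster for a given $n$ has to be measured rather than assumed.

```rust
use std::collections::HashSet;

/// Membership test on a dynamic array: compares against each element in turn,
/// so the work grows linearly with the number of items.
/// (Hypothetical helper name, for illustration only.)
fn contains_vec(items: &[u32], x: u32) -> bool {
    items.contains(&x)
}

/// Membership test on a hash set: hashes the value and probes the table,
/// so the work is roughly independent of the number of items.
/// (Hypothetical helper name, for illustration only.)
fn contains_set(items: &HashSet<u32>, x: u32) -> bool {
    items.contains(&x)
}
```

Both functions return the same answer for the same stored numbers, which is exactly why the choice between them is driven by efficiency rather than correctness in this example.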
\subsection{Functional requirements}
-Functional requirements tell us how the container will be used and how it must behave.
-Continuing with our previous example, we'll compare Rust's \code{Vec} type (a dynamic array), with the \code{HashSet} type.
+Functional requirements tell us how the container must behave in order for the program it's used in to function correctly.
+Continuing with our previous example, we'll compare Rust's \code{Vec} implementation (a dynamic array), with its \code{HashSet} implementation (a hash table).
-Note that the two types have different methods: \code{Vec} implements \code{.get(index)}, while \code{HashSet} does not; \code{HashSet}s aren't ordered so this doesn't make sense.
-If we were building a program that needed an ordered collection, replacing \code{Vec} with \code{HashSet} probably wouldn't compile.
+To start with, we can see that the two types have different methods.
+\code{Vec} has a \code{.get(index)} method, while \code{HashSet} does not; a \code{HashSet} isn't ordered, so indexing into it wouldn't make sense.
+If we were building a program that needed an ordered collection, replacing \code{Vec} with \code{HashSet} would likely cause the compiler to raise an error.
-We will call the operations a container provides the ``syntactic properties'' of the container.
-In object-oriented programming, we might say they must implement an ``interface'', while in Rust, we could say that they implement a ``trait''.
+We will call the operations that a container implementation provides the \emph{syntactic properties} of the implementation.
+In object-oriented programming, we might say that an implementation must implement an interface, while in Rust, we would say that it implements a trait.\footnote{Rust's traits are most similar to typeclasses from Haskell and other functional programming languages. Interfaces and typeclasses have important differences, but they are irrelevant in this case.}
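To make the idea of syntactic properties concrete, the following is a minimal sketch of how a set of required operations could be written as a Rust trait. The trait \code{Container}, its methods \code{contains\_item} and \code{count}, and the function \code{has\_any} are all hypothetical names introduced here for illustration; they are not part of the standard library or of the system described in this thesis.

```rust
use std::collections::HashSet;
use std::hash::Hash;

/// Hypothetical trait naming the operations our program needs.
trait Container<T> {
    fn contains_item(&self, value: &T) -> bool;
    fn count(&self) -> usize;
}

/// `Vec` provides both operations, via a linear scan and its stored length.
impl<T: PartialEq> Container<T> for Vec<T> {
    fn contains_item(&self, value: &T) -> bool {
        self.as_slice().contains(value)
    }
    fn count(&self) -> usize {
        self.len()
    }
}

/// `HashSet` provides the same operations, via a hash lookup.
impl<T: Eq + Hash> Container<T> for HashSet<T> {
    fn contains_item(&self, value: &T) -> bool {
        self.contains(value)
    }
    fn count(&self) -> usize {
        self.len()
    }
}

/// Code written against the trait compiles with either implementation.
fn has_any<C: Container<u32>>(c: &C, wanted: &[u32]) -> bool {
    wanted.iter().any(|w| c.contains_item(w))
}
```

Because \code{has\_any} only names the trait, the compiler accepts any type that provides the listed operations, and rejects any type that does not.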
-However, syntactic properties alone are not always enough to select an appropriate container.
-Suppose our program only requires a container to have \code{.insert(value)} and \code{.len()}.
-Both \code{Vec} and \code{HashSet} will satisfy these requirements, but we might rely on \code{.len()} including duplicates.
+However, syntactic properties alone are not always enough to select an appropriate container implementation.
+Suppose our program only requires a type with the \code{.contains(value)} and \code{.len()} methods.
+Both \code{Vec} and \code{HashSet} satisfy these requirements, but our program might also rely on the count returned from \code{.len()} including duplicates.
In this case, \code{HashSet} would give us different behaviour, causing our program to behave incorrectly.
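The difference can be seen in a small sketch. The helper names \code{count\_in\_vec} and \code{count\_in\_set} are hypothetical; both functions add the same values and then ask for the length.

```rust
use std::collections::HashSet;

/// Adds every value to a `Vec` and reports its length.
/// Duplicates are stored, so they are counted.
fn count_in_vec(values: &[u32]) -> usize {
    let mut v = Vec::new();
    for &x in values {
        v.push(x);
    }
    v.len()
}

/// Adds every value to a `HashSet` and reports its length.
/// Inserting a value that is already present leaves the set unchanged.
fn count_in_set(values: &[u32]) -> usize {
    let mut s = HashSet::new();
    for &x in values {
        s.insert(x);
    }
    s.len()
}
```

Given the input \code{[1, 2, 2, 3]}, the first function returns 4 and the second returns 3, even though both containers offer a \code{.len()} method with the same signature.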
-Therefore we also say that a container implementation has ``semantic properties''.
-Intuitively we can think of this as what conditions the container upholds.
-For a \code{HashSet}, this would include that there are never any duplicates, whereas for a \code{Vec} it would include that ordering is preserved.
+Therefore, we also say that a container implementation has \emph{semantic properties}.
+Intuitively, we can think of these as the conditions that the implementation upholds.
+For a \code{HashSet}, this would include that there are never any duplicates.
+A \code{Vec} would not have this property, but would have the property that insertion order is preserved.
+
+To select a correct container implementation, we then need to ensure we meet some syntactic and semantic requirements specific to our program.
+So long as we specify our requirements correctly, and use an implementation which provides all of the properties we're looking for, our program shouldn't be able to tell the difference.
\subsection{Non-functional requirements}
-While meeting the functional requirements should ensure our program runs correctly, we also want to choose the 'best' type that we can, striking a balance between runtime and memory usage.
+While meeting our program's functional requirements should ensure that it runs correctly, this doesn't say anything about our program's efficiency.
+We likely also want to choose the most efficient implementation available, striking a balance between runtime and memory usage.
-Prior work has demonstrated that proper container selection can result in substantial performance improvements.
+Prior work has demonstrated that the right container implementation can give substantial performance improvements.
\cite{l_liu_perflint_2009} found and suggested fixes for ``hundreds of suboptimal patterns in a set of large C++ benchmarks,'' with one such case improving performance by 17\%.
Similarly, \cite{jung_brainy_2011} demonstrates an average increase in speed of 27--33\% on real-world applications and libraries using a similar approach.
-If we can find a selection of types that satisfy our functional requirements, then one obvious solution is to benchmark the program with each of these implementations in place and determine which works best.
-This will work so long as our benchmarks are roughly representative of 'real world' inputs.
+If we can find a set of implementations that satisfy our functional requirements, then an obvious solution is to benchmark the program with each of these implementations in place.
+This works as long as our benchmarks are roughly representative of real-world inputs.
Unfortunately, this technique scales poorly for larger applications.
-As the number of types we must select increases linearly, the number of possible permutations increases exponentially (provided they have roughly the same number of candidates).
+As the number of container types we must select increases, the number of combinations we have to try increases exponentially (assuming each type has roughly the same number of candidate implementations).
This quickly becomes infeasible, so we must explore other selection methods.
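The growth can be illustrated with a short sketch. It assumes that each container type in the program is chosen independently, so the number of whole-program configurations is the product of the candidate counts; the function name \code{configurations} is hypothetical.

```rust
/// Number of whole-program configurations to benchmark, assuming each
/// container type is chosen independently of the others.
/// `candidates_per_type[i]` is the number of candidate implementations
/// for the i-th container type in the program.
fn configurations(candidates_per_type: &[u64]) -> u64 {
    candidates_per_type.iter().product()
}
```

For example, four container types with three candidates each already give $3^4 = 81$ configurations, and each of them requires a full benchmark run.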
\section{Prior literature}