author Aria Shrimpton <me@aria.rip> 2024-03-30 01:13:26 +0000
committer Aria Shrimpton <me@aria.rip> 2024-03-30 01:13:26 +0000
commit af4e8c6f7b761b950ede3f66cb459dd053351abe
tree 34bf2609f34344465e7fd95666325b8e241ff464
parent 05eef0761474d79250f53a81c5fd33faa0894d33
some redrafting
-rw-r--r-- thesis/parts/background.tex 56
-rw-r--r-- thesis/parts/introduction.tex 25
2 files changed, 48 insertions, 33 deletions
diff --git a/thesis/parts/background.tex b/thesis/parts/background.tex
index f705aad..6b4e295 100644
--- a/thesis/parts/background.tex
+++ b/thesis/parts/background.tex
@@ -1,54 +1,62 @@
In this chapter, we provide an overview of the problem of container selection and its effect on program correctness and performance.
-We then provide an overview of approaches from modern programming languages and existing literature.
-Finally, we explain how our system is novel, and the weaknesses in existing literature it solves.
+We then provide an overview of approaches taken by modern programming languages and existing literature.
+Finally, we explain how our system is novel, and the weaknesses in existing literature it addresses.
\section{Container Selection}
-The vast majority of programs will make extensive use of collection data types --- types intended to hold multiple instances of other types.
+A container data structure is simply a structure which holds a collection of related values.
+This could include a list (growable or otherwise), a set (with no duplicate elements), or something more complex (like a min heap).
-In many languages, the standard library provides a variety of collections, with users able to choose which is best for their program.
+In many languages, the standard library provides implementations of various container types, with users able to choose which is best for their program.
This saves users a lot of time; however, selecting the best type is not always straightforward.
Consider a program which needs to store and query a set of numbers, and doesn't care about ordering or duplicates.
-If the number of items ($n$) is small enough, it might be fastest to use a dynamic array, and scan through each time we want to check if a number is inside.
-On the other hand, if the set we deal with is much larger, we may want the constant-time lookups provided by hash sets, at the cost of a generally slower lookup.
+If the number of items ($n$) is small enough, it might be fastest to use a dynamically-sized array (known as a vector in many languages), and scan through each time we want to check if a number is inside.
+On the other hand, if the set we deal with is much larger, we might need to use a more complex method to keep things fast.
+A common example would be a hash set, which provides roughly the same lookup speed regardless of size, at the cost of being slower overall.
In this case, there are two factors driving our decision.
-Our functional requirements, that we don't care about ordering or duplicates, and our non-functional requirements, that we want our program to be fast.
+Our \emph{functional requirements} -- that we don't care about ordering or duplicates -- and our \emph{non-functional requirements} -- that we want our program to use resources efficiently.
\subsection{Functional requirements}
-Functional requirements tell us how the container will be used and how it must behave.
-Continuing with our previous example, we'll compare Rust's \code{Vec} type (a dynamic array), with the \code{HashSet} type.
+Functional requirements tell us how the container must behave in order for the program it's used in to function correctly.
+Continuing with our previous example, we'll compare Rust's \code{Vec} implementation (a dynamic array), with its \code{HashSet} implementation (a hash table).
-Note that the two types have different methods: \code{Vec} implements \code{.get(index)}, while \code{HashSet} does not; \code{HashSet}s aren't ordered so this doesn't make sense.
-If we were building a program that needed an ordered collection, replacing \code{Vec} with \code{HashSet} probably wouldn't compile.
+To start with, we can see that the two types have different methods.
+\code{Vec} has a \code{.get(index)} method, while \code{HashSet} does not; \code{HashSet}s aren't ordered so this wouldn't make sense.
+If we were building a program that needed an ordered collection, replacing \code{Vec} with \code{HashSet} would likely cause the compiler to raise an error.
-We will call the operations a container provides the ``syntactic properties'' of the container.
-In object-oriented programming, we might say they must implement an ``interface'', while in Rust, we could say that they implement a ``trait''.
+We will call the operations that a container implementation provides the \emph{syntactic properties} of the implementation.
+In object-oriented programming, we might say they must implement an interface, while in Rust, we would say that they implement a trait.\footnote{Rust's traits are most similar to typeclasses from Haskell and other FP languages. Interfaces and typeclasses have important differences, but they are irrelevant in this case.}
-However, syntactic properties alone are not always enough to select an appropriate container.
-Suppose our program only requires a container to have \code{.insert(value)} and \code{.len()}.
-Both \code{Vec} and \code{HashSet} will satisfy these requirements, but we might rely on \code{.len()} including duplicates.
+However, syntactic properties alone are not always enough to select an appropriate container implementation.
+Suppose our program only requires a type with the \code{.contains(value)} and \code{.len()} methods.
+Both \code{Vec} and \code{HashSet} satisfy these requirements, but our program might also rely on the count returned from \code{.len()} including duplicates.
In this case, \code{HashSet} would give us different behaviour, causing our program to behave incorrectly.
-Therefore we also say that a container implementation has ``semantic properties''.
-Intuitively we can think of this as what conditions the container upholds.
-For a \code{HashSet}, this would include that there are never any duplicates, whereas for a \code{Vec} it would include that ordering is preserved.
+Therefore, we also say that a container implementation has \emph{semantic properties}.
+Intuitively we can think of this as what conditions are upheld.
+For a \code{HashSet}, this would include that there are never any duplicates.
+A \code{Vec} would not have this property, but would have the property that insertion order is preserved.
+
+To select a correct container implementation, we then need to ensure we meet some syntactic and semantic requirements specific to our program.
+So long as we specify our requirements correctly, and use an implementation which provides all of the properties we're looking for, our program shouldn't be able to tell the difference.
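The difference between syntactic and semantic properties can be illustrated with a short Rust sketch. It is not part of the commit: the helper names `len_in_vec` and `len_in_set` and the sample values are assumptions made for illustration. Both containers offer `.len()`, but only `Vec` counts duplicates.

```rust
use std::collections::HashSet;

// Hypothetical helper: store every value in a Vec and report its length.
fn len_in_vec(values: &[i32]) -> usize {
    let mut container: Vec<i32> = Vec::new();
    for &v in values {
        container.push(v); // duplicates are kept
    }
    container.len()
}

// Hypothetical helper: store every value in a HashSet and report its length.
fn len_in_set(values: &[i32]) -> usize {
    let mut container: HashSet<i32> = HashSet::new();
    for &v in values {
        container.insert(v); // a value already present is not stored again
    }
    container.len()
}

fn main() {
    // Illustrative input containing two duplicated values.
    let values = [3, 1, 3, 2, 1];
    println!("Vec length:     {}", len_in_vec(&values));
    println!("HashSet length: {}", len_in_set(&values));
}
```

A program that relies on the count including duplicates would compile against either type, yet only behave correctly with `Vec`.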
\subsection{Non-functional requirements}
-While meeting the functional requirements should ensure our program runs correctly, we also want to choose the 'best' type that we can, striking a balance between runtime and memory usage.
+While meeting our program's functional requirements should ensure that it runs correctly, this doesn't say anything about our program's efficiency.
+We likely also want to choose the most efficient implementation available, striking a balance between runtime and memory usage.
-Prior work has demonstrated that proper container selection can result in substantial performance improvements.
+Prior work has demonstrated that the right container implementation can give substantial performance improvements.
\cite{l_liu_perflint_2009} found and suggested fixes for ``hundreds of suboptimal patterns in a set of large C++ benchmarks,'' with one such case improving performance by 17\%.
Similarly, \cite{jung_brainy_2011} demonstrates an average increase in speed of 27--33\% on real-world applications and libraries using a similar approach.
-If we can find a selection of types that satisfy our functional requirements, then one obvious solution is to benchmark the program with each of these implementations in place and determine which works best.
-This will work so long as our benchmarks are roughly representative of 'real world' inputs.
+If we can find a set of implementations that satisfy our functional requirements, then an obvious solution is to benchmark the program with each of these implementations in place.
+This will obviously work, as long as our benchmarks are roughly representative of the real world.
Unfortunately, this technique scales poorly for larger applications.
-As the number of types we must select increases linearly, the number of possible permutations increases exponentially (provided they have roughly the same number of candidates).
+As the number of types we must select increases, the number of combinations we have to try increases exponentially.
This quickly becomes unfeasible, so we must explore other selection methods.
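As a rough illustration of this growth, the sketch below counts how many whole-program configurations exhaustive benchmarking would have to try. It is not taken from the thesis: the helper name `configurations` and the figures (four candidate implementations per container type) are assumptions, chosen only to show the arithmetic.

```rust
// Hypothetical helper: number of whole-program configurations to benchmark,
// given the number of candidate implementations for each container type.
// Each choice is independent, so the counts multiply.
fn configurations(candidates_per_type: &[u64]) -> u64 {
    candidates_per_type.iter().product()
}

fn main() {
    // One container type with four candidates: four benchmark runs.
    println!("1 type:   {}", configurations(&[4]));
    // Ten container types with four candidates each: 4^10 runs.
    println!("10 types: {}", configurations(&[4; 10]));
}
```

Under these assumed numbers, going from one container type to ten raises the number of runs from 4 to 1,048,576, which is why benchmarking every combination stops being practical.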
\section{Prior literature}
diff --git a/thesis/parts/introduction.tex b/thesis/parts/introduction.tex
index aae5474..17830a9 100644
--- a/thesis/parts/introduction.tex
+++ b/thesis/parts/introduction.tex
@@ -3,16 +3,17 @@
%% **** Container types common in programs
-A common requirement when programming is the need to keep a collection of data together, for example in a list.
+Almost every program makes extensive use of container data structures -- structures which hold a collection of values.
Often, programmers will have some requirements they want to impose on this collection, such as not storing duplicate elements, or storing the items in sorted order.
%% **** Functionally identical implementations
-However, implementing these collection types manually is usually a waste of time, as is fine-tuning a custom implementation to perform better.
-Most programmers will simply use one or two collection types provided by their language.
+However, implementing these collection types manually wastes time, and can be hard to do right for more complicated structures.
+Most programmers will simply use one or two of the collection types provided by their language.
+Some languages, such as Python, go a step further, providing built-in implementations of growable lists and associative maps, with special syntax for both.
%% **** Large difference in performance
-Often, this is not the best choice.
-The underlying implementation of container types which function the same can have a drastic effect on performance (\cite{l_liu_perflint_2009}, \cite{jung_brainy_2011}).
+Unfortunately, the underlying implementation of container types which function the same can have a drastic effect on performance (\cite{l_liu_perflint_2009}, \cite{jung_brainy_2011}).
+By largely ignoring the performance characteristics of their implementation, programmers may be missing out on large performance gains.
%% *** Motivate w/ effectiveness claims
We propose a system, Candelabra, for the automatic selection of container implementations, based on both user-specified requirements and inferred requirements for performance.
@@ -22,11 +23,17 @@ In our testing, we are able to accurately select the best performing containers
%% **** Ease of adding new container types
We have designed our system with flexibility in mind: adding new container implementations requires little effort.
%% **** Ease of integration into existing projects
-It is easy to adopt our system incrementally, and we integrate with existing tools to making doing so easy.
+It is easy to adopt our system incrementally, and we integrate with existing tools to make doing so easy.
%% **** Scalability to larger projects
-The time it takes to select containers scales roughly linearly, even in complex cases, allowing our tool to be used even on larger projects.
+The time it takes to select containers scales roughly linearly, even in complex cases, allowing our system to be used even on larger projects.
%% **** Flexibility of selection
-Our system is also able to suggest adaptive containers: containers which switch underlying implementation as they grow.
+It is also able to suggest adaptive containers: containers which switch from one underlying implementation to another once they get past a certain size.
%% **** Overview of results
-Whilst we saw reasonable suggestions in our test cases, we found the overhead of switching and of checking the current implementation to be more of a problem than expected, which future work could improve on.
+Whilst we saw reasonable suggestions in our test cases, we found important performance concerns which future work could improve on.
+
+In chapter \ref{chap:background}, we give a more thorough description of the container selection problem, and examine previous work. We outline gaps in existing literature, and how we aim to contribute.
+
+Chapter \ref{chap:design} explains the design of our solution, and how it fulfills the aims set out in chapter \ref{chap:background}. Chapter \ref{chap:implementation} expands on this, describing the implementation work in detail and the challenges faced.
+
+We evaluate the effectiveness of our solution in chapter \ref{chap:results}, and identify several shortcomings that future work could improve upon.