Diffstat (limited to 'thesis/parts/background.tex')
-rw-r--r-- | thesis/parts/background.tex | 56 |
1 files changed, 32 insertions, 24 deletions
diff --git a/thesis/parts/background.tex b/thesis/parts/background.tex
index f705aad..6b4e295 100644
--- a/thesis/parts/background.tex
+++ b/thesis/parts/background.tex
@@ -1,54 +1,62 @@
 In this chapter, we provide an overview of the problem of container selection and its effect on program correctness and performance.
-We then provide an overview of approaches from modern programming languages and existing literature.
-Finally, we explain how our system is novel, and the weaknesses in existing literature it solves.
+We then provide an overview of approaches taken by modern programming languages and existing literature.
+Finally, we explain how our system is novel, and the weaknesses in existing literature it addresses.
 \section{Container Selection}
-The vast majority of programs will make extensive use of collection data types --- types intended to hold multiple instances of other types.
+A container data structure is simply a structure which holds a collection of related values.
+This could include a list (growable or otherwise), a set (with no duplicate elements), or something more complex (like a min heap).
-In many languages, the standard library provides a variety of collections, with users able to choose which is best for their program.
+In many languages, the standard library provides implementations of various container types, with users able to choose which is best for their program.
 This saves users a lot of time, however selecting the best type is not always straightforward.
 Consider a program which needs to store and query a set of numbers, and doesn't care about ordering or duplicates.
-If the number of items ($n$) is small enough, it might be fastest to use a dynamic array, and scan through each time we want to check if a number is inside.
-On the other hand, if the set we deal with is much larger, we may want the constant-time lookups provided by hash sets, at the cost of a generally slower lookup.
+If the number of items ($n$) is small enough, it might be fastest to use a dynamically-sized array (known as a vector in many languages), and scan through each time we want to check if a number is inside.
+On the other hand, if the set we deal with is much larger, we might need to use a more complex method to keep things fast.
+A common example would be a hash set, which provides roughly the same lookup speed regardless of size, at the cost of being slower overall.
 In this case, there are two factors driving our decision.
-Our functional requirements, that we don't care about ordering or duplicates, and our non-functional requirements, that we want our program to be fast.
+Our \emph{functional requirements} -- that we don't care about ordering or duplicates -- and our \emph{non-functional requirements} -- that we want our program to use resources efficiently.
 \subsection{Functional requirements}
-Functional requirements tell us how the container will be used and how it must behave.
-Continuing with our previous example, we'll compare Rust's \code{Vec} type (a dynamic array), with the \code{HashSet} type.
+Functional requirements tell us how the container must behave in order for the program it's used in to function correctly.
+Continuing with our previous example, we'll compare Rust's \code{Vec} implementation (a dynamic array), with its \code{HashSet} implementation (a hash table).
-Note that the two types have different methods: \code{Vec} implements \code{.get(index)}, while \code{HashSet} does not; \code{HashSet}s aren't ordered so this doesn't make sense.
-If we were building a program that needed an ordered collection, replacing \code{Vec} with \code{HashSet} probably wouldn't compile.
+To start with, we can see that the two types have different methods.
+\code{Vec} has a \code{.get(index)} method, while \code{HashSet} does not; \code{HashSet}s aren't ordered so this wouldn't make sense.
+If we were building a program that needed an ordered collection, replacing \code{Vec} with \code{HashSet} would likely cause the compiler to raise an error.
-We will call the operations a container provides the ``syntactic properties'' of the container.
-In object-oriented programming, we might say they must implement an ``interface'', while in Rust, we could say that they implement a ``trait''.
+We will call the operations that a container implementation provides the \emph{syntactic properties} of the implementation.
+In object-oriented programming, we might say they must implement an interface, while in Rust, we would say that they implement a trait.\footnote{Rust's traits are most similar to typeclasses from Haskell and other FP languages. Interfaces and typeclasses have important differences, but they are irrelevant in this case.}
-However, syntactic properties alone are not always enough to select an appropriate container.
-Suppose our program only requires a container to have \code{.insert(value)} and \code{.len()}.
-Both \code{Vec} and \code{HashSet} will satisfy these requirements, but we might rely on \code{.len()} including duplicates.
+However, syntactic properties alone are not always enough to select an appropriate container implementation.
+Suppose our program only requires a type with the \code{.contains(value)} and \code{.len()} methods.
+Both \code{Vec} and \code{HashSet} satisfy these requirements, but our program might also rely on the count returned from \code{.len()} including duplicates.
 In this case, \code{HashSet} would give us different behaviour, causing our program to behave incorrectly.
-Therefore we also say that a container implementation has ``semantic properties''.
-Intuitively we can think of this as what conditions the container upholds.
-For a \code{HashSet}, this would include that there are never any duplicates, whereas for a \code{Vec} it would include that ordering is preserved.
+Therefore, we also say that a container implementation has \emph{semantic properties}.
+Intuitively we can think of this as what conditions are upheld.
+For a \code{HashSet}, this would include that there are never any duplicates.
+A \code{Vec} would not have this property, but would have the property that insertion order is preserved.
+
+To select a correct container implementation, we then need to ensure we meet some syntactic and semantic requirements specific to our program.
+So long as we specify our requirements correctly, and use an implementation which provides all of the properties we're looking for, our program shouldn't be able to tell the difference.
 \subsection{Non-functional requirements}
-While meeting the functional requirements should ensure our program runs correctly, we also want to choose the 'best' type that we can, striking a balance between runtime and memory usage.
+While meeting our program's functional requirements should ensure that it runs correctly, this doesn't say anything about our program's efficiency.
+We likely also want to choose the most efficient implementation available, striking a balance between runtime and memory usage.
-Prior work has demonstrated that proper container selection can result in substantial performance improvements.
+Prior work has demonstrated that the right container implementation can give substantial performance improvements.
 \cite{l_liu_perflint_2009} found and suggested fixes for ``hundreds of suboptimal patterns in a set of large C++ benchmarks,'' with one such case improving performance by 17\%.
 Similarly, \cite{jung_brainy_2011} demonstrates an average increase in speed of 27-33\% on real-world applications and libraries using a similar approach.
-If we can find a selection of types that satisfy our functional requirements, then one obvious solution is to benchmark the program with each of these implementations in place and determine which works best.
-This will work so long as our benchmarks are roughly representative of 'real world' inputs.
+If we can find a set of implementations that satisfy our functional requirements, then an obvious solution is to benchmark the program with each of these implementations in place.
+This will obviously work, as long as our benchmarks are roughly representative of the real world.
 Unfortunately, this technique scales poorly for larger applications.
-As the number of types we must select increases, the number of possible permutations increases exponentially (provided they have roughly the same number of candidates).
+As the number of types we must select increases, the number of combinations we have to try increases exponentially.
 This quickly becomes unfeasible, so we must explore other selection methods.
 \section{Prior literature}
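The distinction the revised text draws between syntactic and semantic properties can be illustrated with a short Rust sketch. It is not part of the commit: the helper names `len_with_vec` and `len_with_hash_set` and the sample values are assumptions made for illustration. Both `Vec` and `HashSet` provide `.contains(value)` and `.len()`, so either compiles, but `.len()` counts duplicates only for `Vec`.

```rust
use std::collections::HashSet;

/// Hypothetical helper: length reported by a `Vec` after adding every value.
fn len_with_vec(values: &[i32]) -> usize {
    let mut container = Vec::new();
    for &v in values {
        container.push(v); // duplicates are kept, insertion order is preserved
    }
    container.len()
}

/// Hypothetical helper: length reported by a `HashSet` after adding every value.
fn len_with_hash_set(values: &[i32]) -> usize {
    let mut container = HashSet::new();
    for &v in values {
        container.insert(v); // a repeated value is silently dropped
    }
    container.len()
}

fn main() {
    let values = [3, 1, 3, 2]; // the value 3 appears twice

    // Syntactic properties: both types offer `.contains` and `.len`.
    let as_vec: Vec<i32> = values.to_vec();
    let as_set: HashSet<i32> = values.iter().copied().collect();
    assert!(as_vec.contains(&3) && as_set.contains(&3));

    // Semantic properties: the lengths disagree once duplicates are present.
    println!("Vec len:     {}", len_with_vec(&values));
    println!("HashSet len: {}", len_with_hash_set(&values));
}
```

A program that relies on `.len()` counting duplicates would compile with either type but behave correctly only with `Vec`, which is why the text treats semantic requirements separately from syntactic ones.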