Diffstat (limited to 'thesis/parts')

 thesis/parts/introduction.tex | 24
 thesis/parts/results.tex      | 52
 2 files changed, 64 insertions, 12 deletions
diff --git a/thesis/parts/introduction.tex b/thesis/parts/introduction.tex
index 452a6d8..6886f35 100644
--- a/thesis/parts/introduction.tex
+++ b/thesis/parts/introduction.tex
@@ -3,28 +3,30 @@
 %% **** Container types common in programs
-The vast majority of programs will make extensive use of collection data types --- types intended to hold multiple instances of other types.
-This allows programmers to use things like growable lists, sets, or trees without worrying about implementing them themselves.
+A common requirement when programming is the need to keep a collection of data together, for example in a list.
+Often, programmers will have some requirements they want to impose on this collection, such as not storing duplicate elements, or storing the items in sorted order.
 
 %% **** Functionally identical implementations
+However, implementing these collection types manually is usually a waste of time, as is fine-tuning their implementation to perform better.
+Most programmers will simply use one or two collection types provided by their language.
-However, this still leaves the problem of selecting the ``best'' underlying implementation.
-Most programmers will simply stick with the same one every time, with some languages like Python even building in a single implementation for everyone.
 
 %% **** Large difference in performance
-While this is simplest, it can have a drastic effect on performance in many cases (\cite{l_liu_perflint_2009}, \cite{jung_brainy_2011}).
+Often, this is not the best choice.
+The underlying implementation of container types which function the same can have a drastic effect on performance (\cite{l_liu_perflint_2009}, \cite{jung_brainy_2011}).
 
 %% *** Motivate w/ effectiveness claims
+We propose a system, Candelabra, for the automatic selection of container implementations, based on both user-specified requirements and inferred requirements for performance.
 
 %% *** Overview of aims & approach
-
-We propose a system for the automatic selection of container implementations, based on both user-specified requirements and inferred requirements for performance.
-%% **** Scalability to larger projects
-%% **** Ease of integration into existing projects
 %% **** Ease of adding new container types
-Our system is built to be scalable, both in the sense that it can be applied to large projects, and that new container types can be added with ease.
+We have designed our system with flexibility in mind --- adding new container implementations requires little effort.
+%% **** Ease of integration into existing projects
+It is easy to adopt our system incrementally, and we integrate with existing tools to make doing so easy.
+%% **** Scalability to larger projects
+The time it takes to select containers scales roughly linearly, even in complex cases, allowing our tool to be used even on larger projects.
 %% **** Flexibility of selection
-We are also able to detect some cases where the optimal container type varies at runtime, and supply containers which start off as one implementation, and move to another when it is more optimal to do so.
+Our system is also able to suggest adaptive containers --- containers which switch underlying implementation as they grow.
 
 %% *** Overview of results
 \todo{Overview of results}
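The adaptive containers added to the introduction above are only described at a high level in this commit. As a rough illustration of the idea, and not Candelabra's actual implementation, the Rust sketch below shows a set-like container that starts life as a plain Vec and migrates to a HashSet once it grows past a fixed threshold. The type name AdaptiveSet, its insert/contains API, and the threshold value are all assumptions made for this example.

use std::collections::HashSet;
use std::hash::Hash;

// Hypothetical switch-over point; a real system would likely derive this
// from its cost models rather than hard-coding it.
const THRESHOLD: usize = 256;

// Illustrative adaptive set: backed by a Vec while small, by a HashSet once
// it grows. Not the code Candelabra generates.
enum AdaptiveSet<T: Hash + Eq> {
    Small(Vec<T>),
    Large(HashSet<T>),
}

impl<T: Hash + Eq> AdaptiveSet<T> {
    fn new() -> Self {
        AdaptiveSet::Small(Vec::new())
    }

    fn insert(&mut self, value: T) {
        match self {
            AdaptiveSet::Small(items) => {
                // A linear scan is cheap while the container is small.
                if !items.contains(&value) {
                    items.push(value);
                }
                // Once past the threshold, move everything into a HashSet.
                if items.len() > THRESHOLD {
                    let migrated: HashSet<T> = items.drain(..).collect();
                    *self = AdaptiveSet::Large(migrated);
                }
            }
            AdaptiveSet::Large(set) => {
                set.insert(value);
            }
        }
    }

    fn contains(&self, value: &T) -> bool {
        match self {
            AdaptiveSet::Small(items) => items.contains(value),
            AdaptiveSet::Large(set) => set.contains(value),
        }
    }
}

The switch keeps contains cheap at both ends of the size range, which is the behaviour the cost models in the results chapter below are meant to capture.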
diff --git a/thesis/parts/results.tex b/thesis/parts/results.tex
index 896fa9d..2e2373a 100644
--- a/thesis/parts/results.tex
+++ b/thesis/parts/results.tex
@@ -58,11 +58,61 @@ This could explain why we see a roughly linear growth.
 \subsection{Contains operations}
 We now examine the cost of the \code{contains} operation.
+Figure \ref{fig:cm_contains} shows our built cost models.
+These are grouped for readability, with the first graph showing lists, the second showing sets and sorted lists, and the third showing key-value mappings.
-\subsection{Outliers / errors}
+Notably, the observations in these graphs have a much wider spread than our \code{insert} operations do.
+This is probably because we attempt to get a different random element in our container every time, so our observations show the best and worst cases of our data structures.
+This is desirable assuming that \code{contains} operations are actually randomly distributed in the real world, which seems likely.
+
+\begin{figure}[h]
+  \centering
+  \includegraphics[width=10cm]{assets/contains_lists.png}
+  \par\centering\rule{11cm}{0.5pt}
+  \includegraphics[width=10cm]{assets/contains_sets.png}
+  \par\centering\rule{11cm}{0.5pt}
+  \includegraphics[width=10cm]{assets/contains_mappings.png}
+  \caption{Estimated cost of the \code{contains} operation on lists, sets/sorted lists, and \code{Mapping}s}
+  \label{fig:cm_contains}
+\end{figure}
+
+Both \code{LinkedList} and \code{Vec} implementations have roughly linear growth, which makes sense as these are not kept ordered.
+\code{LinkedList} has a significantly higher cost at all points, and a wider spread of outliers.
+This makes sense as each item in a linked list is not guaranteed to be in the same place in memory, so traversing them is likely to be more expensive, making the best and worst cases further apart.
+Some of the spread could also be explained by heap allocations being placed in different locations in memory, with more or less locality between each run.
+
+\code{SortedVec} and \code{SortedUniqueVec} both exhibit a wide spread of observations, with what looks like roughly linear growth.
+Looking at the raw output, we find the following equations being used for each cost model:
+
+\begin{align*}
+C(n) &\approx 22.8 + 4.6\log_2 n + 0.003n - (1 \times 10^{-9}) n^2 & \textrm{SortedVec} \\
+C(n) &\approx -5.9 + 8.8\log_2 n - (4 \times 10^{-5}) n - (3 \times 10^{-8}) n^2 & \textrm{SortedUniqueVec}
+\end{align*}
+
+As both of these implementations use a binary search for \code{contains}, the dominating logarithmic factors are expected.
+\code{SortedUniqueVec} likely has a larger $n^2$ coefficient due to more collisions happening at larger container sizes.
+\todo{elaborate: we insert that many random items, but some may be duplicates}
+
+\code{HashSet} appears roughly linear as expected, with only a slow logarithmic rise, probably due to collisions.
+\code{BTreeSet} is consistently above it, with a slightly higher logarithmic rise.
+
+\code{BTreeMap} and \code{HashMap} both mimic their set counterparts, though they are more expensive in most places.
+This is probably due to the increased size more quickly exhausting the CPU cache.
 
 \subsection{Evaluation}
+In the cost models we examined, we found that most were in line with our expectations.
+
+Although we will not examine them in detail, we briefly describe observations from the rest of the built cost models:
+
+\begin{enumerate}
+\item Our models for \code{push} and \code{pop} operations are much the same as for \code{insert} operations, as they share the same inner implementation.
+\item \code{first}, \code{last}, and \code{nth} operations show the time complexity we expect.
+However, some overfitting appears to occur, meaning our cost models may not generalise as well outside of the range of $n$ values they were benchmarked with.
+\end{enumerate}
+
+Overall, our cost models appear to be a good representation of each implementation's performance impact.
+Future improvements could address the overfitting problems some operations had, either by pre-processing the data to detect and remove outliers, or by employing a more complex fitting procedure.
+
 %% * Predictions
 \section{Selections}
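As a worked illustration of the fitted contains cost models quoted in the results hunk above, the sketch below writes the two reported equations out as plain Rust functions and evaluates them at a few container sizes. The coefficients are taken directly from the diff; the function names and the comparison loop are assumptions made for this example.

// Illustrative only: the SortedVec / SortedUniqueVec `contains` cost models
// quoted above, written as plain functions so they can be compared at a
// given container size n. Coefficients come from the thesis text; the names
// and the printing loop are assumptions.
fn cost_sorted_vec(n: f64) -> f64 {
    22.8 + 4.6 * n.log2() + 0.003 * n - 1e-9 * n * n
}

fn cost_sorted_unique_vec(n: f64) -> f64 {
    -5.9 + 8.8 * n.log2() - 4e-5 * n - 3e-8 * n * n
}

fn main() {
    for n in [100.0, 1_000.0, 10_000.0, 100_000.0f64] {
        println!(
            "n = {:>9}: SortedVec ~ {:7.1}, SortedUniqueVec ~ {:7.1}",
            n,
            cost_sorted_vec(n),
            cost_sorted_unique_vec(n)
        );
    }
}

Plugging in n = 1000, for example, gives roughly 72 for SortedVec and 82 for SortedUniqueVec, consistent with the dominating logarithmic terms discussed in the hunk above.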