Diffstat (limited to 'thesis/parts/results.tex')
-rw-r--r-- | thesis/parts/results.tex | 83 |
1 files changed, 48 insertions, 35 deletions
diff --git a/thesis/parts/results.tex b/thesis/parts/results.tex index f16c502..cc66073 100644 --- a/thesis/parts/results.tex +++ b/thesis/parts/results.tex @@ -1,6 +1,6 @@ -In this chapter, we present our results from benchmarking our system. +In this chapter, we present the methodology used for benchmarking our system, and comment on the results we got. We examine the produced cost models of certain operations in detail, with reference to the expected asymptotics of each operation. -We then compare the selections made by our system to the actual optimal selections for a variety of test cases. +We then compare the selections made by our system to the actual optimal selections (obtained by brute force) for a variety of test cases. This includes examining when adaptive containers are suggested, and their effectiveness. %% * Testing setup, benchmarking rationale @@ -26,11 +26,11 @@ The most important software versions are listed below. We start by examining some of our generated cost models, and comparing them both to the observations they are based on, and what we expect from asymptotic analysis. As we build a total of 77 cost models from our library, we will not examine them all in detail. -We look at models of the most common operations, and group them by containers that are commonly selected together. +We look at models of the most common operations, grouped by containers that are commonly selected together. \subsection{Insertion operations} Starting with the \code{insert} operation, Figure \ref{fig:cm_insert} shows how the estimated cost changes with the size of the container. -The lines correspond to our fitted curves, while the points indicate the raw observations these curves are fitted from. +The lines correspond to our fitted curves, while the points indicate the raw observations we drew from. \begin{figure}[h!] 
\centering @@ -39,13 +39,13 @@ The lines correspond to our fitted curves, while the points indicate the raw obs \label{fig:cm_insert} \end{figure} -Starting with the operation on a \code{Vec}, we see that insertion is very cheap, and gets slightly cheaper as the size of the container increases. +Starting with \code{Vec}, we see that insertion is very cheap, and gets slightly cheaper as the size of the container increases. This roughly agrees with the expected $O(1)$ time of amortised inserts on a Vec. However, we also note a sharply increasing curve when $n$ is small, and a slight 'bump' around $n=35,000$. The former appears to be in line with the observations, and is likely due to the static growth rate of Rust's Vec implementation. The latter appears to diverge from the observations, and may indicate poor fitting. -\code{LinkedList} has a more stable, but significantly slower insertion. +\code{LinkedList} has a significantly slower insertion. This is likely because it requires a syscall for heap allocation for every item inserted, no matter the current size. This would also explain why data points appear spread out more, as system calls have more unpredictable latency, even on systems with few other processes running. Notably, insertion appears to start to get cheaper past $n=24,000$, although this is only weakly suggested by observations. @@ -66,16 +66,17 @@ This is what we expect for hash-based collections, with the slight growth likely \code{BTreeSet} has similar behaviour, but settles at a larger value overall. \code{BTreeMap} appears to grow more rapidly, and cost more overall. -It's important to note that Rust's \code{BTreeSet}s are not based on binary tree search, but instead a more general tree search originally proposed by \cite{bayer_organization_1970}, where each node contains $B-1$ to $2B-1$ elements in an array. 
+It's important to note that Rust's \code{BTreeSet} is not based on binary tree search, but instead a more general tree search originally proposed by \cite{bayer_organization_1970}, where each node contains $B-1$ to $2B-1$ elements in an unsorted array. The standard library documentation\citep{rust_documentation_team_btreemap_2024} states that search is expected to take $O(B\lg n)$ comparisons. -Since both of these implementations require searching the collection before inserting, the close-to-logarithmic growth makes sense. +Since both of these implementations require searching the collection before inserting, the close-to-logarithmic growth seems to make sense. \subsubsection{Small n values} +\label{section:cm_small_n} -Whilst our main figures for insertion operations indicate a clear winner within each category, looking at small $n$ values reveals some more complexity. +Whilst our main figures for insertion operations indicate a clear winner within each category, looking at small $n$ values reveals more complexity. Figure \ref{fig:cm_insert_small_n} shows the cost models for insert operations on different set implementations at smaller n values. -In particular, for $n<1800$ the overhead from sorting a vec is less than running the default hasher function (at least on this hardware). +Note that for $n<1800$ the overhead from sorting a vec is less than running the default hasher function (at least on this hardware). We also see a sharp spike in the cost for \code{SortedVecSet} at low $n$ values, and an area of supposed 0 cost from around $n=200$ to $n=800$. This seems inaccurate, and indicates that our current fitting procedure may not be able to deal with low $n$ values properly. @@ -100,22 +101,23 @@ Figure \ref{fig:cm_contains} shows our built cost models, again grouped for read \label{fig:cm_contains} \end{figure} -Notably, the observations in these graphs have a much wider spread than our \code{insert} operations do. 
+The observations in these graphs have a much wider spread than our \code{insert} operations do. This is probably because we attempt to get a different random element in our container every time, so our observations show the best and worst case of our data structures. This is desirable assuming that \code{contains} operations are actually randomly distributed in the real world, which seems likely. For the \code{SortedVec} family, we would expect to see roughly logarithmic growth, as contains is based on binary search. This is the case for \code{SortedVecMap}, however \code{SortedVec} and \code{SortedVecSet} both show exponential growth with a 'dip' around $n=25,000$. -It's unclear why this happened, although it could be due to how the elements we query are distributed throughout the list. +It's unclear why this happened, although it could be due to how the elements we query are randomly distributed throughout the list. A possible improvement would be to run contains with a known distribution of values, including low, high, and not present values in equal parts. The \code{Vec} family exhibits roughly linear growth, which is expected, since this implementation scans through the whole array each time. + \code{LinkedList} has roughly logarithmic growth, at a significantly higher cost. The higher cost is expected, although it's unclear why growth is logarithmic rather than linear. As the spread of points also appears to increase at larger $n$ values, it's possible that this is due to larger $n$ values causing a higher proportion of the program's memory to be dedicated to the container, resulting in better cache utilisation. \code{HashSet} appears roughly linear as expected, with only a slow logarithmic rise, probably due to an increasing amount of collisions. -\code{BTreeSet} is consistently above it, with a slightly higher logarithmic rise. +\code{BTreeSet} is consistently above it, with a slightly faster logarithmic rise. 
\code{BTreeMap} and \code{HashMap} both mimic their set counterparts, but with a slightly lower cost and growth rate. It's unclear why this is, however it could be related to the larger spread in observations for both implementations. @@ -123,30 +125,30 @@ It's unclear why this is, however it could be related to the larger spread in ob \subsection{Evaluation} Overall, our cost models appear to be a good representation of each implementation's performance impact. -Future improvements could address the overfitting problems some operations had, such as by employing a more complex fitting procedure, or by doing more to ensure operations have their best and worst cases tested fairly. +Future improvements should focus on improving accuracy at lower $n$ values, such as by employing a more complex fitting procedure, or on ensuring operations have their best and worst cases tested fairly. %% * Predictions \section{Selections} -We now proceed with end-to-end testing of the system, selecting containers for a selection of programs with varying needs. +We now proceed with end-to-end testing of the system, selecting containers for a sample of test programs with varying needs. \subsection{Benchmarks} %% ** Chosen benchmarks -Our test cases broadly fall into two categories: Example cases, which repeat a few operations many times, and 'real' cases, which are implementations of common algorithms and solutions to programming puzzles. -We expect the results from our example cases to be relatively unsurprising, while our real cases are more complex and harder to predict. +Our test programs broadly fall into two categories: Examples, which repeat a few operations many times, and real-life programs, which are implementations of common algorithms and solutions to programming puzzles. +We expect the results from our example programs to be relatively obvious, while our real programs are more complex and harder to predict. 
-Most of our real cases are solutions to puzzles from Advent of Code\citep{wastl_advent_2015}, a popular collection of programming puzzles. -Table \ref{table:test_cases} lists and briefly describes our test cases. +Most of our real programs are solutions to puzzles from Advent of Code\citep{wastl_advent_2015}, a popular collection of programming puzzles. +Table \ref{table:test_cases} lists and briefly describes our test programs. \begin{table}[h!] \centering \begin{tabular}{|c|c|} Name & Description \\ \hline - example\_sets & Repeated insert and contains on a set. \\ - example\_stack & Repeated push and pop from a stack. \\ - example\_mapping & Repeated insert and get from a mapping. \\ + example\_sets & Repeated insert and contains operations on a set. \\ + example\_stack & Repeated push and pop operations on a stack. \\ + example\_mapping & Repeated insert and get operations on a mapping. \\ prime\_sieve & Sieve of Eratosthenes algorithm. \\ aoc\_2021\_09 & Flood-fill like algorithm (Advent of Code 2021, Day 9) \\ aoc\_2022\_08 & Simple 2D raycasting (AoC 2022, Day 8) \\ @@ -154,13 +156,14 @@ Table \ref{table:test_cases} lists and briefly describes our test cases. aoc\_2022\_14 & Simple 2D particle simulation (AoC 2022, Day 14) \\ \end{tabular} - \caption{Our test applications} + \caption{Our test programs} \label{table:test_cases} \end{table} %% ** Effect of selection on benchmarks (spread in execution time) Table \ref{table:benchmark_spread} shows the difference in benchmark results between the slowest possible assignment of containers, and the fastest. -Even in our example projects, we see that the wrong choice of container can slow down our programs substantially, with the exception of two of our test cases which were largely unaffected. +Even in our example programs, we see that the wrong choice of container can slow down our programs substantially. +In all but two programs, the wrong implementation can more than double the runtime. \begin{table}[h!] 
\centering @@ -176,15 +179,15 @@ example\_sets & $1.33$ & $1.6$ \\ example\_stack & $0.36$ & $19.2$ \\ prime\_sieve & $26093.26$ & $34.1$ \\ \end{tabular} -\caption{Spread in total benchmark results by project} +\caption{Spread in total benchmark results by program} \label{table:benchmark_spread} \end{table} \subsection{Prediction accuracy} -We now compare the implementations suggested by our system to the selection that is actually best, obtained by brute force. -For now, we ignore suggestions for adaptive containers. +We now compare the implementations suggested by our system to the selection that is actually best, which we obtain by brute-forcing all possible implementations. +We leave analysis of adaptive container suggestions to section \ref{section:results_adaptive_containers} Table \ref{table:predicted_actual} shows the predicted best assignments alongside the actual best assignment, obtained by brute-force. In all but two of our test cases (marked with *), we correctly identify the best container. @@ -212,13 +215,14 @@ In all but two of our test cases (marked with *), we correctly identify the best Both of these failures appear to be caused by being overly eager to suggest a \code{LinkedList}. From looking at detailed profiling information, it seems that both of these container types had a relatively small amount of items in them. -Therefore this is likely caused by our cost models being inaccurate at small $n$ values, such as in Figure \ref{fig:cm_insert_small_n}. +Therefore this is likely caused by our cost models being inaccurate at small $n$ values, as mentioned in section \ref{section:cm_small_n}. -Overall, our results show our system is able to suggest the best containers, at least for large enough $n$ values. -Unfortunately, these tests are somewhat limited, as the best container seems relatively predictable: \code{Vec} where uniqueness is not important, and \code{Hash*} otherwise. 
-Therefore more thorough testing is needed to fully establish the system's effectiveness. +Overall, our results suggest that our system is effective, at least for large enough $n$ values. +Unfortunately, these tests are somewhat limited, as the best container is almost always predictable: \code{Vec} where uniqueness is not important, and \code{Hash*} otherwise. +Therefore, more thorough testing is needed to fully establish the system's effectiveness. \subsection{Adaptive containers} +\label{section:results_adaptive_containers} We now look at cases where an adaptive container was suggested, and evaluate the result. @@ -252,10 +256,10 @@ As the $n$ threshold after which we switch is outside the range we benchmark our %% ** Comment on relative performance speedup Table \ref{table:adaptive_perfcomp} compares our adaptive container suggestions with the fastest non-adaptive implementation. -Since we must select an implementation for all containers before selecting a project, we show all possible combinations of adaptive and non-adaptive container selections. +Since we must select an implementation for all containers before selecting a project, we show all possible combinations of adaptive and non-adaptive container selections where appropriate. Note that the numbered columns indicate the benchmark 'size', not the actual size that the container reaches within that benchmark. -The exact definition of this varies by benchmark. +What this means exactly varies by benchmark. \begin{table}[h] \centering @@ -299,9 +303,18 @@ In the \code{aoc_2022_09} project, the adaptive container is marginally faster u This shows that adaptive containers as we have implemented them are not effective in practice. Even in cases where we never reach the size threshold, the presence of adaptive containers has an overhead which slows down the program 3x in the worst case (\code{example_mapping}, size = 150). 
-One explanation for this could be that every operation now requires checking which inner implementation we are using, resulting in branching overhead. -More work could be done to minimise the overhead introduced, such as by using indirect jumps rather than branching instructions. +One explanation for this could be that every operation now requires checking which inner implementation we are using, resulting in an additional check for each operation. +More work could be done to minimise this overhead, although it's unclear exactly how much this could be minimised. It is also unclear if the threshold values that we suggest are the optimal ones. Currently, we decide our threshold by picking a value between two partitions with different best containers. Future work could take a more complex approach that finds the best threshold value based on our cost models, and takes the overhead of all operations into account. + +\subsection{Evaluation} + +Overall, we find that the main part of our container selection system appears to have merit. +Whilst our testing has limitations, it shows that we can correctly identify the best container even in complex programs. +More work is needed on improving our system's performance for very small containers, and on testing with a wider range of programs. + +Our proposed technique for identifying adaptive containers appears ineffective. +The primary challenges appear to be in the overhead introduced to each operation, and in finding the correct point at which to switch implementations.