author     Aria Shrimpton <me@aria.rip>  2024-03-29 21:49:15 +0000
committer  Aria Shrimpton <me@aria.rip>  2024-03-29 21:49:15 +0000
commit     7924e466d32cf93b7e455d1360bc22fa86340100 (patch)
tree       0f1c5e28607a8421e778811b020f396dc7ea8c6b /thesis/parts/results.tex
parent     61a6ce0ede29e5595473852a39cc216288d50a25 (diff)
rest of results chapter
Diffstat (limited to 'thesis/parts/results.tex')
-rw-r--r--  thesis/parts/results.tex  168
1 file changed, 124 insertions(+), 44 deletions(-)
diff --git a/thesis/parts/results.tex b/thesis/parts/results.tex
index bd30fcf..6562d38 100644
--- a/thesis/parts/results.tex
+++ b/thesis/parts/results.tex
@@ -34,7 +34,7 @@ The lines correspond to our fitted curves, while the points indicate the raw obs
 \begin{figure}[h!]
   \centering
-  \includegraphics[width=15cm]{assets/insert.png}
+  \includegraphics[width=12cm]{assets/insert.png}
   \caption{Estimated cost of insert operation by implementation}
   \label{fig:cm_insert}
 \end{figure}
@@ -82,7 +82,7 @@ This seems inaccurate, and is likely a result of few data points at low n values
 \begin{figure}[h!]
   \centering
-  \includegraphics[width=15cm]{assets/insert_small_n.png}
+  \includegraphics[width=12cm]{assets/insert_small_n.png}
   \caption{Estimated cost of insert operation on set implementations, at small n values}
   \label{fig:cm_insert_small_n}
 \end{figure}
@@ -94,7 +94,7 @@ Figure \ref{fig:cm_contains} shows our built cost models, again grouped for read
 \begin{figure}[h!]
   \centering
-  \includegraphics[width=15cm]{assets/contains.png}
+  \includegraphics[width=12cm]{assets/contains.png}
   \caption{Estimated cost of \code{contains} operation by implementation}
   \label{fig:cm_contains}
 \end{figure}
@@ -127,10 +127,12 @@ Future improvements could address the overfitting problems some operations had,
 %% * Predictions
 \section{Selections}
 
+We now proceed with end-to-end testing of the system, selecting containers for a range of programs with varying needs.
+
 \subsection{Benchmarks}
 %% ** Chosen benchmarks
 
-Our test cases broadly fall into two categories: Example cases, which just repeat a few operations many times, and our 'real' cases, which are implementations of common algorithms and solutions to programming puzles.
+Our test cases broadly fall into two categories: example cases, which repeat a few operations many times, and 'real' cases, which are implementations of common algorithms and solutions to programming puzzles.
 We expect the results from our example cases to be relatively unsurprising, while our real cases are more complex and harder to predict.
 
 Most of our real cases are solutions to puzzles from Advent of Code\citep{wastl_advent_2015}, a popular collection of programming puzzles.
@@ -157,77 +159,155 @@ Table \ref{table:test_cases} lists and briefly describes our test cases.
 %% ** Effect of selection on benchmarks (spread in execution time)
 Table \ref{table:benchmark_spread} shows the difference in benchmark results between the slowest possible assignment of containers, and the fastest.
-Even in our example projects, we see that the wrong choice of container can slow down our programs substantially.
+Even in our example projects, we see that the wrong choice of container can slow down our programs substantially, with the exception of two of our test cases, which were largely unaffected.
 
 \begin{table}[h!]
 \centering
 \begin{tabular}{|c|c|c|}
-Project & worst - best time (seconds) & Maximum slowdown \\
+Project & Maximum slowdown (ms) & Maximum relative slowdown \\
 \hline
-aoc\_2021\_09 & 29.685 & 4.75 \\
-aoc\_2022\_08 & 0.036 & 2.088 \\
-aoc\_2022\_09 & 10.031 & 132.844 \\
-aoc\_2022\_14 & 0.293 & 2.036 \\
-prime\_sieve & 28.408 & 18.646 \\
-example\_mapping & 0.031 & 1.805 \\
-example\_sets & 0.179 & 12.65 \\
-example\_stack & 1.931 & 8.454 \\
+aoc\_2021\_09 & $55206.94$ & $12.0$ \\
+aoc\_2022\_08 & $12161.38$ & $392.5$ \\
+aoc\_2022\_09 & $18.96$ & $0.3$ \\
+aoc\_2022\_14 & $83.82$ & $0.3$ \\
+example\_mapping & $85.88$ & $108.4$ \\
+example\_sets & $1.33$ & $1.6$ \\
+example\_stack & $0.36$ & $19.2$ \\
+prime\_sieve & $26093.26$ & $34.1$ \\
 \end{tabular}
 \caption{Spread in total benchmark results by project}
 \label{table:benchmark_spread}
+
 \end{table}
 
-%% ** Summarise predicted versus actual
 \subsection{Prediction accuracy}
 
-We now compare the implementations suggested by our system, to the selection that is actually best.
+We now compare the implementations suggested by our system to the selection that is actually best, obtained by brute force.
 For now, we ignore suggestions for adaptive containers.
 
 Table \ref{table:predicted_actual} shows the predicted best assignments alongside the actual best assignment, obtained by brute-force.
-In all but two of our test cases (marked with *), we correctly identify the best container.
-
-\todo{but also its just vec/hashset every time, which is kinda boring. we should either get more variety (by adding to the library or adding new test cases), or mention this as a limitation in testing}
+In all but three of our test cases (marked with *), we correctly identify the best container.
 
 \begin{table}[h!]
 \centering
-\begin{tabular}{|c|c|c|c|}
-Project & Container Type & Best implementation & Predicted best \\
-\hline
-aoc\_2022\_09 & Set & HashSet & HashSet \\
-example\_stack & StackCon & Vec & Vec \\
-aoc\_2021\_09 & Set & HashSet & HashSet \\
-aoc\_2021\_09 & Map & HashMap & HashMap \\
-aoc\_2022\_14 & Set & HashSet & HashSet \\
-aoc\_2022\_14 & List & Vec & LinkedList \\
-aoc\_2022\_08 & Map & HashMap & HashMap \\
-example\_sets & Set & HashSet & HashSet \\
-example\_mapping & Map & HashMap & HashMap \\
-prime\_sieve & Primes & HashSet & BTreeSet \\
-prime\_sieve & Sieve & Vec & LinkedList \\
-\end{tabular}
+  \begin{tabular}{|c|c|c|c|c|}
+    Project & Container Type & Best implementation & Predicted best & \\
+    \hline
+    aoc\_2021\_09 & Set & HashSet & HashSet & \\
+    aoc\_2021\_09 & Map & HashMap & HashMap & \\
+    aoc\_2022\_08 & Map & HashMap & HashMap & \\
+    aoc\_2022\_09 & Set & HashSet & HashSet & \\
+    aoc\_2022\_14 & Set & HashSet & HashSet & \\
+    aoc\_2022\_14 & List & Vec & LinkedList & * \\
+    example\_mapping & Map & HashMap & HashMap & \\
+    example\_sets & Set & HashSet & HashSet & \\
+    example\_stack & StackCon & Vec & Vec & \\
+    prime\_sieve & Sieve & Vec & LinkedList & * \\
+    prime\_sieve & Primes & HashSet & BTreeSet & * \\
+  \end{tabular}
 \caption{Actual best vs predicted best implementations}
 \label{table:predicted_actual}
 \end{table}
 
-%% ** Evaluate performance
-\subsection{Evaluation}
+Two of these failures appear to come from the system being overly eager to suggest a \code{LinkedList}.
+From looking at detailed profiling information, it seems that both of these container types held a relatively small number of items.
+Therefore this is likely caused by our cost models being inaccurate at small $n$ values, such as in Figure \ref{fig:cm_insert_small_n}.
 
-%% ** Comment on distribution of best implementation
+Our only other failure comes from suggesting a \code{BTreeSet} instead of a \code{HashSet}.
+Our cost models suggest that a \code{BTreeSet} is more suitable for the \code{prime_sieve} benchmarks with a smaller $n$ value, but not for the larger ones.
+However, because the smaller benchmarks complete in less time, Criterion (the benchmarking framework used) chooses to run them for more iterations.
+This causes the smaller $n$ values to carry more weight than they should.
 
-%% ** Surprising ones / Explain failures
+This could be worked around by adjusting Criterion's settings to run all benchmarks for the same number of iterations, at the cost of the increased accuracy for smaller benchmarks that the existing strategy gives.
+Another method would be to fix the number of iterations only when profiling the application, and to run benchmarks as normal otherwise.
+Whilst this should be possible, Criterion does not currently support it.
 
-%% * Performance of adaptive containers
-\section{Adaptive containers}
+Overall, our results show that our system is able to suggest the best containers, at least for large $n$ values.
+Unfortunately, these tests are somewhat limited, as the best container seems relatively predictable: \code{Vec} where uniqueness is not important, and \code{Hash*} otherwise.
+Therefore, more thorough testing is needed to fully establish the system's effectiveness.
 
-\todo{These also need more work, and better test cases}
+\subsection{Adaptive containers}
+
+We now look at cases where an adaptive container was suggested, and evaluate the result.
+
+Table \ref{table:adaptive_suggestions} shows the container types for which adaptive containers were suggested, along with the inner types and the threshold at which to switch.
 
 %% ** Find where adaptive containers get suggested
+\begin{table}[h]
+  \centering
+  \begin{tabular}{|c|c|c|}
+    Project & Container Type & Suggestion \\
+    \hline
+    aoc\_2022\_08 & Map & SortedVecMap until n=1664, then HashMap \\
+    aoc\_2022\_09 & Set & HashSet until n=185540, then BTreeSet \\
+    example\_mapping & Map & VecMap until n=225, then HashMap \\
+    prime\_sieve & Primes & BTreeSet until n=34, then HashSet \\
+    prime\_sieve & Sieve & LinkedList until n=747, then Vec \\
+  \end{tabular}
+  \caption{Suggestions for adaptive containers}
+  \label{table:adaptive_suggestions}
+\end{table}
+
+The suggested containers for both \code{aoc_2022_08} and \code{example_mapping} are unsurprising.
+Since hashing incurs a roughly constant cost, it makes sense that below a certain $n$ value, simply searching through a list is more effective.
+The choice of \code{SortedVecMap} over \code{VecMap} likely comes down to the relative frequency of \code{insert} operations compared to other operations.
+
+The suggestion to start with a \code{LinkedList} for \code{prime_sieve / Sieve} is likely due to the same issues that cause a \code{LinkedList} to be suggested in the non-adaptive case.
+This may also be the case for the suggestion of \code{BTreeSet} for \code{prime_sieve / Primes}.
+
+The suggestion of \code{BTreeSet} for \code{aoc_2022_09} is the most surprising.
+As the $n$ threshold at which we switch is outside the range we benchmark our implementations at, this suggestion comes from our model attempting to generalise far outside the range it has seen before.
 %% ** Comment on relative performance speedup
+Table \ref{table:adaptive_perfcomp} compares our adaptive container suggestions with the fastest non-adaptive implementation.
+Since we must select an implementation for every container type in a project at once, we show all possible combinations of adaptive and non-adaptive container selections.
+
+Note that the numbered columns indicate the benchmark 'size', not the actual size that the container reaches within that benchmark.
+The exact definition of 'size' varies by benchmark.
+
+\begin{table}[h]
+  \centering
+  \begin{adjustbox}{angle=90}
+    \begin{tabular}{|c|c|c|c|c|c|}
+      \hline
+      Project & Assignment & 100 & 1000 & 2000 & \\
+      \hline
+      aoc\_2022\_09 & Set=HashSet & 1ms $\pm$ 5us & 13ms $\pm$ 828us & 27ms $\pm$ 1ms & \\
+      aoc\_2022\_09 & Set=Adaptive & 1ms $\pm$ 2us & 11ms $\pm$ 17us & 39ms $\pm$ 684us & \\
+      \hline
+      & & 100 & 200 & & \\
+      \hline
+      aoc\_2022\_08 & Map=HashMap & 1ms $\pm$ 9us & 5ms $\pm$ 66us & & \\
+      aoc\_2022\_08 & Map=Adaptive & 1ms $\pm$ 6us & 5ms $\pm$ 41us & & \\
+      \hline
+      & & 50 & 150 & 2500 & 7500 \\
+      \hline
+      example\_mapping & Map=HashMap & 3us $\pm$ 7ns & 11us $\pm$ 49ns & 185us $\pm$ 2us & 591us $\pm$ 1us \\
+      example\_mapping & Map=Adaptive & 4us $\pm$ 7ns & 33us $\pm$ 55ns & 187us $\pm$ 318ns & 595us $\pm$ 1us \\
+      \hline
+      & & 50 & 500 & 50000 & \\
+      \hline
+      prime\_sieve & Sieve=Vec, Primes=HashSet & 1us $\pm$ 3ns & 78us $\pm$ 1us & 766ms $\pm$ 1ms & \\
+      prime\_sieve & Sieve=Vec, Primes=Adaptive & 1us $\pm$ 3ns & 84us $\pm$ 138ns & 785ms $\pm$ 730us & \\
+      prime\_sieve & Sieve=Adaptive, Primes=HashSet & 2us $\pm$ 6ns & 208us $\pm$ 568ns & 763ms $\pm$ 1ms & \\
+      prime\_sieve & Sieve=Adaptive, Primes=Adaptive & 2us $\pm$ 4ns & 205us $\pm$ 434ns & 762ms $\pm$ 2ms & \\
+      \hline
+    \end{tabular}
+  \end{adjustbox}
+  \caption{Adaptive containers vs the best single container, by size of benchmark}
+  \label{table:adaptive_perfcomp}
+\end{table}
+
+In all but one project, the non-adaptive containers are as fast as or faster than the adaptive containers at all benchmark sizes.
+In the \code{aoc_2022_09} project, the adaptive container is marginally faster until the benchmark size reaches 2000, at which point it is significantly slower.
 
 %% ** Suggest future improvements?
+This shows that adaptive containers as we have implemented them are not effective in practice.
+Even in cases where we never reach the size threshold, the presence of adaptive containers adds an overhead which slows down the program by 3x in the worst case (\code{example_mapping}, size = 150).
 
-%% * Selection time / developer experience
-%% \section{Selection time}
+One explanation for this could be that every operation now requires checking which inner implementation we are using, resulting in branching overhead.
+More work could be done to minimise this overhead, for instance by using indirect jumps rather than branching instructions.
 
-%% ** Mention speedup versus naive brute force
+It is also unclear whether the threshold values we suggest are optimal.
+Currently, we decide the threshold by picking a value between two partitions with different best containers.
+Future work could take a more sophisticated approach that finds the best threshold value based on our cost models, and takes the overhead of all operations into account.
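
The branching overhead discussed in the closing paragraphs is easier to see with a concrete sketch. The following is a minimal, hypothetical Rust illustration of an adaptive container of the kind described, not the thesis implementation: it starts as a plain vector of key-value pairs and migrates to a HashMap once it passes a threshold, and every operation must first check which inner representation is active. The names (AdaptiveMap, THRESHOLD) and the threshold value are invented for illustration.

use std::collections::HashMap;
use std::hash::Hash;

// Illustrative switch point only; real thresholds would come from the cost models.
const THRESHOLD: usize = 256;

// A map that starts as a plain vector of pairs and migrates to a HashMap
// once it grows past THRESHOLD.
enum AdaptiveMap<K, V> {
    Small(Vec<(K, V)>),
    Large(HashMap<K, V>),
}

impl<K: Hash + Eq, V> AdaptiveMap<K, V> {
    fn new() -> Self {
        AdaptiveMap::Small(Vec::new())
    }

    fn insert(&mut self, key: K, value: V) {
        // Every operation first branches on the active variant -- the
        // per-operation overhead discussed in the text.
        match self {
            AdaptiveMap::Small(pairs) => {
                if let Some(slot) = pairs.iter_mut().find(|(k, _)| *k == key) {
                    slot.1 = value;
                } else {
                    pairs.push((key, value));
                }
            }
            AdaptiveMap::Large(map) => {
                map.insert(key, value);
            }
        }
        self.maybe_switch();
    }

    fn get(&self, key: &K) -> Option<&V> {
        match self {
            AdaptiveMap::Small(pairs) => pairs.iter().find(|(k, _)| k == key).map(|(_, v)| v),
            AdaptiveMap::Large(map) => map.get(key),
        }
    }

    // Once past the threshold, move every element into the hash-based
    // implementation; this one-off migration is also part of the cost.
    fn maybe_switch(&mut self) {
        if let AdaptiveMap::Small(pairs) = self {
            if pairs.len() > THRESHOLD {
                let map: HashMap<K, V> = std::mem::take(pairs).into_iter().collect();
                *self = AdaptiveMap::Large(map);
            }
        }
    }
}

fn main() {
    let mut m: AdaptiveMap<u32, u32> = AdaptiveMap::new();
    for i in 0..1_000 {
        m.insert(i, i * 2); // crosses THRESHOLD and switches representation
    }
    assert_eq!(m.get(&500), Some(&1000));
}

The match at the top of every call is the extra branch the text refers to; replacing it with an indirect call through a trait object or function pointer is the kind of change the text suggests investigating.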
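The threshold-selection strategy mentioned in the final paragraph, picking a value between two adjacent partitions whose best containers differ, can be sketched in the same spirit. switch_threshold and the partition data below are invented for illustration and are not taken from the project.

// Given the best implementation for each profiled partition (sorted by n),
// place the switch point halfway between the last partition won by one
// implementation and the first partition won by the next.
fn switch_threshold(partitions: &[(usize, &str)]) -> Option<usize> {
    partitions
        .windows(2)
        .find(|pair| pair[0].1 != pair[1].1)
        .map(|pair| (pair[0].0 + pair[1].0) / 2)
}

fn main() {
    let partitions = [
        (100, "VecMap"),
        (200, "VecMap"),
        (400, "HashMap"),
        (2_000, "HashMap"),
    ];
    // The best implementation changes between n=200 and n=400, so switch at n=300.
    assert_eq!(switch_threshold(&partitions), Some(300));
}

As the final paragraph notes, a more sophisticated approach would derive the threshold from the cost models themselves and account for the adaptive container's own overhead.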
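On the earlier point about Criterion weighting the smaller prime_sieve benchmarks too heavily: Criterion's per-group configuration can at least be made uniform across benchmarks, as in the illustrative sketch below. Note that this fixes the sample count and measurement time rather than the exact iteration count, so, as the text says, it only approximates running every benchmark for the same number of iterations; the benchmark body here is a made-up stand-in.

use std::time::Duration;

use criterion::{criterion_group, criterion_main, Criterion};

// A stand-in benchmark body, not one of the thesis benchmarks.
fn bench_insert(c: &mut Criterion) {
    c.bench_function("vec_insert_1000", |b| {
        b.iter(|| {
            let mut v = Vec::with_capacity(1000);
            for i in 0..1000u64 {
                v.push(i);
            }
            v
        })
    });
}

criterion_group! {
    name = benches;
    // Give every benchmark the same sample count and measurement time, instead
    // of letting quick benchmarks accumulate disproportionately many runs.
    config = Criterion::default()
        .sample_size(100)
        .measurement_time(Duration::from_secs(5));
    targets = bench_insert
}
criterion_main!(benches);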