
In this chapter, we present the methodology used for benchmarking our system, and discuss the results.
We examine the produced cost models of certain operations in detail, with reference to the expected asymptotics of each operation.
We then compare the selections made by our system to the actual optimal selections (obtained by brute force) for a variety of test cases.
This includes examining when adaptive containers are suggested, and their effectiveness.

%% * Testing setup, benchmarking rationale
\section{Testing setup}

%% ** Specs and VM setup
In order to ensure consistent results and reduce the effect of other running processes, all benchmarks were run on a KVM virtual machine on server hardware.
We used 4 cores of an Intel Xeon E5-2687Wv4 CPU, and 4GiB of RAM.

%% ** Reproducibility
The VM was managed and provisioned using NixOS, meaning it can be easily reproduced with the exact software we used.
Instructions on how to do so are in the supplementary materials.
The most important software versions are listed below.

\begin{itemize}
\item Linux 6.1.64
\item Rust nightly 2024-01-25
\item LLVM 17.0.6
\item Racket 8.10
\end{itemize}

\section{Cost models}

We start by examining some of our generated cost models, comparing them both to the observations they are based on and to what we expect from asymptotic analysis.
As we build a total of 77 cost models from our library, we will not examine them all in detail.
We look at models of the most common operations, grouped by containers that are commonly selected together.

\subsection{Insertion operations}
Starting with the \code{insert} operation, Figure \ref{fig:cm_insert} shows how the estimated cost changes with the size of the container.
The lines correspond to our fitted curves, while the points indicate the raw observations we drew from.

\begin{figure}[h!]
  \centering
  \includegraphics[width=12cm]{assets/insert.png}
  \caption{Estimated cost of insert operation by implementation}
  \label{fig:cm_insert}
\end{figure}

Starting with \code{Vec}, we see that insertion is very cheap, and gets slightly cheaper as the size of the container increases.
This roughly agrees with the expected $O(1)$ time of amortised inserts on a Vec.
However, we also note a sharply increasing curve when $n$ is small, and a slight `bump' around $n=35{,}000$.
The former appears to be in line with the observations, and is likely due to the static growth rate of Rust's Vec implementation.
The latter appears to diverge from the observations, and may indicate poor fitting.
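
As a reminder of where the amortised bound comes from, the following is a sketch under a simplifying assumption (not a statement about the exact growth policy of Rust's \code{Vec}): the backing array starts with capacity 1 and doubles whenever it is full.
Inserting $n$ elements then triggers a reallocation at each capacity $2^j < n$, copying
\[
  \sum_{j=0}^{\lceil \lg n \rceil - 1} 2^{j} \;=\; 2^{\lceil \lg n \rceil} - 1 \;<\; 2n
\]
elements in total, so fewer than two copies per insert on average, independent of $n$.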

\code{LinkedList} has a significantly slower insertion.
This is likely because it requires a heap allocation, and potentially a system call, for every item inserted, no matter the current size.
This would also explain why data points appear spread out more, as system calls have more unpredictable latency, even on systems with few other processes running.
Notably, insertion appears to start to get cheaper past $n=24{,}000$, although this is only weakly suggested by observations.

It's unsurprising that these two implementations are the cheapest, as they have no ordering or uniqueness guarantees, unlike our other implementations.

The \code{SortedVec} family of containers (\code{SortedVec}, \code{SortedVecSet}, and \code{SortedVecMap}) all exhibit roughly logarithmic growth.
This is expected, as internally all of these containers perform a binary search to determine where the new element should go, which is $O(\lg n)$ time.
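
The binary-search step can be illustrated with a minimal sketch.
The function name \code{sorted_insert} is hypothetical and this is not the code of our library; it only uses \code{binary_search} and \code{insert} from the standard \code{Vec} API, and it keeps duplicates, as a plain sorted list would.

\begin{verbatim}
/// Hypothetical sketch: insert `item` into an already sorted Vec,
/// keeping it sorted, and return the index it was placed at.
fn sorted_insert<T: Ord>(items: &mut Vec<T>, item: T) -> usize {
    // binary_search returns Ok(i) if an equal element is at index i,
    // or Err(i) with the index where the element could be inserted.
    let idx = match items.binary_search(&item) {
        Ok(i) | Err(i) => i,
    };
    // Shifts every element after `idx` one place to the right.
    items.insert(idx, item);
    idx
}
\end{verbatim}

Note that the search performs $O(\lg n)$ comparisons, while \code{Vec::insert} additionally shifts the elements after the insertion point.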

\code{SortedVecMap} exhibits roughly the same shape as its siblings, but with a slightly higher growth rate.
This pattern is shared across all of the \code{*Map} types we examine, and could be explained by the increased size of each element reducing the effectiveness of the cache.

\code{VecMap} and \code{VecSet} both have a significantly higher, roughly linear, growth rate.
Both of these implementations work by scanning through the existing array before each insertion to check for existing keys, so a linear growth rate is expected.
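
A minimal sketch of this scan-then-push pattern is given below.
The function name \code{vec_set_insert} is hypothetical and the code is an illustration rather than our library's implementation.

\begin{verbatim}
/// Hypothetical sketch: treat a Vec as a set by rejecting duplicates.
/// Returns true if the item was newly inserted.
fn vec_set_insert<T: PartialEq>(items: &mut Vec<T>, item: T) -> bool {
    // Linear scan over all existing elements: O(n) comparisons
    // whenever the item is not already present.
    if items.contains(&item) {
        return false;
    }
    items.push(item);
    true
}
\end{verbatim}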

\code{HashSet} and \code{HashMap} insertions are much less expensive, and mostly constant, with only slight growth at very large $n$ values.
This is what we expect for hash-based collections, with the slight growth likely due to more hash collisions as the size of the collection increases.

\code{BTreeSet} has similar behaviour, but settles at a larger value overall.
\code{BTreeMap} appears to grow more rapidly, and cost more overall.
It's important to note that Rust's \code{BTreeSet} is not based on binary tree search, but instead a more general tree search originally proposed by \cite{bayer_organization_1970}, where each node contains $B-1$ to $2B-1$ elements in a contiguous array.
The standard library documentation\citep{rust_documentation_team_btreemap_2024} states that search is expected to take $O(B\lg n)$ comparisons.
Since both of these implementations require searching the collection before inserting, the close-to-logarithmic growth seems to make sense.

\subsubsection{Small n values}
\label{section:cm_small_n}

Whilst our main figures for insertion operations indicate a clear winner within each category, looking at small $n$ values reveals more complexity.
Figure \ref{fig:cm_insert_small_n} shows the cost models for insert operations on different set implementations at smaller $n$ values.

Note that for $n<1800$ the overhead from keeping a \code{Vec} sorted is less than that of running the default hash function (at least on this hardware).

We also see a sharp spike in the cost for \code{SortedVecSet} at low $n$ values, and a region of supposedly zero cost from around $n=200$ to $n=800$.
This seems inaccurate, and indicates that our current fitting procedure may not handle low $n$ values properly.
More work is required to improve this.

\begin{figure}[h!]
  \centering
  \includegraphics[width=12cm]{assets/insert_small_n.png}
  \caption{Estimated cost of insert operation on set implementations, at small $n$ values}
  \label{fig:cm_insert_small_n}
\end{figure}

\subsection{Contains operations}

We now examine the cost of the \code{contains} operation.
Figure \ref{fig:cm_contains} shows our built cost models, again grouped for readability.

\begin{figure}[h!]
  \centering
  \includegraphics[width=12cm]{assets/contains.png}
  \caption{Estimated cost of \code{contains} operation by implementation}
  \label{fig:cm_contains}
\end{figure}

The observations in these graphs have a much wider spread than our \code{insert} operations do.
This is probably because we attempt to get a different random element in our container every time, so our observations show the best and worst case of our data structures.
This is desirable assuming that \code{contains} operations are actually randomly distributed in the real world, which seems likely.

For the \code{SortedVec} family, we would expect to see roughly logarithmic growth, as contains is based on binary search.
This is the case for \code{SortedVecMap}; however, \code{SortedVec} and \code{SortedVecSet} both show exponential growth with a `dip' around $n=25{,}000$.
It's unclear why this happened, although it could be due to how the elements we query are randomly distributed throughout the list.
A possible improvement would be to run contains with a known distribution of values, including low, high, and not present values in equal parts.
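
One way such a workload might be generated is sketched below.
The function \code{balanced_queries} is hypothetical, and the sketch assumes the container under test holds exactly the keys $0$ to $n-1$, which is an assumption made here for illustration rather than a description of our benchmarks.

\begin{verbatim}
/// Hypothetical sketch: build a query workload for a container that is
/// assumed to hold exactly the keys 0..n. The result has equal parts
/// low keys, high keys, and keys that are not present.
fn balanced_queries(n: u64, per_class: u64) -> Vec<u64> {
    // Clamp so that the low and high ranges cannot overlap.
    let k = per_class.min(n / 2);
    let mut queries = Vec::with_capacity(3 * k as usize);
    queries.extend(0..k);      // low: the smallest keys present
    queries.extend(n - k..n);  // high: the largest keys present
    queries.extend(n..n + k);  // absent: keys just past the stored range
    queries
}
\end{verbatim}

In a real benchmark the queries would presumably be shuffled before use, so that the three classes are interleaved.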

The \code{Vec} family exhibits roughly linear growth, which is expected, since this implementation scans through the whole array each time.

\code{LinkedList} has roughly logarithmic growth, at a significantly higher cost.
The higher cost is expected, although it's unclear why growth is logarithmic rather than linear.
As the spread of points also appears to increase at larger $n$ values, it's possible that this is due to larger $n$ values causing a higher proportion of the program's memory to be dedicated to the container, resulting in better cache utilisation.

\code{HashSet} appears roughly constant as expected, with only a slow logarithmic rise, probably due to an increasing number of collisions.
\code{BTreeSet} is consistently above it, with a slightly faster logarithmic rise.

\code{BTreeMap} and \code{HashMap} both mimic their set counterparts, but with a slightly lower cost and growth rate.
It's unclear why this is; however, it could be related to the larger spread in observations for both implementations.

\subsection{Evaluation}

Overall, our cost models appear to be a good representation of each implementation's performance impact.
Future improvements should focus on improving accuracy at lower $n$ values, such as by employing a more complex fitting procedure, or on ensuring operations have their best and worst cases tested fairly.

%% * Predictions
\section{Selections}

We now proceed with end-to-end testing of the system, selecting containers for a sample of test programs with varying needs.

\subsection{Benchmarks}

%% ** Chosen benchmarks
Our test programs broadly fall into two categories: examples, which repeat a few operations many times, and real-life programs, which are implementations of common algorithms and solutions to programming puzzles.
We expect the results from our example programs to be relatively obvious, while our real programs are more complex and harder to predict.

Most of our real programs are solutions to puzzles from Advent of Code\citep{wastl_advent_2015}, a popular collection of programming puzzles.
Table \ref{table:test_cases} lists and briefly describes our test programs.

\begin{table}[h!]
  \centering
  \begin{tabular}{|c|c|}
    Name & Description \\
    \hline
    example\_sets & Repeated insert and contains operations on a set. \\
    example\_stack & Repeated push and pop operations on a stack. \\
    example\_mapping & Repeated insert and get operations on a mapping. \\
    prime\_sieve & Sieve of Eratosthenes algorithm. \\
    aoc\_2021\_09 & Flood-fill like algorithm (Advent of Code 2021, Day 9) \\
    aoc\_2022\_08 & Simple 2D raycasting (AoC 2022, Day 8) \\
    aoc\_2022\_09 & Simple 2D soft-body simulation (AoC 2022, Day 9) \\
    aoc\_2022\_14 & Simple 2D particle simulation (AoC 2022, Day 14) \\
  \end{tabular}

  \caption{Our test programs}
  \label{table:test_cases}
\end{table}

%% ** Effect of selection on benchmarks (spread in execution time)
Table \ref{table:benchmark_spread} shows the difference in benchmark results between the slowest possible assignment of containers, and the fastest.
Even in our example programs, we see that the wrong choice of container can slow down our programs substantially.
In all but two programs, the wrong implementation can more than double the runtime.

\begin{table}[h!]
\centering
\begin{tabular}{|c|c|c|}
Project & Maximum slowdown (ms) & Maximum relative slowdown \\
\hline
aoc\_2021\_09 & $55206.94$ & $12.0$ \\
aoc\_2022\_08 & $12161.38$ & $392.5$ \\
aoc\_2022\_09 & $18.96$ & $0.3$ \\
aoc\_2022\_14 & $83.82$ & $0.3$ \\
example\_mapping & $85.88$ & $108.4$ \\
example\_sets & $1.33$ & $1.6$ \\
example\_stack & $0.36$ & $19.2$ \\
prime\_sieve & $26093.26$ & $34.1$ \\
\end{tabular}
\caption{Spread in total benchmark results by program}
\label{table:benchmark_spread}

\end{table}
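
The two columns of Table \ref{table:benchmark_spread} can be related as follows, assuming (an interpretation, as the definition is not spelled out above) that the relative slowdown is the absolute slowdown divided by the fastest runtime.
Writing $t_{\min}$ and $t_{\max}$ for the total benchmark time under the fastest and slowest assignment,
\[
  \Delta = t_{\max} - t_{\min}, \qquad r = \frac{\Delta}{t_{\min}}, \qquad \frac{t_{\max}}{t_{\min}} = 1 + r .
\]
Under this reading, the runtime more than doubles exactly when $r > 1$.
For example\_sets, $r = 1.6$ corresponds to a slowest runtime of $2.6$ times the fastest, while aoc\_2022\_09 and aoc\_2022\_14, with $r = 0.3$, are the two programs that stay below a doubling.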

\subsection{Prediction accuracy}

We now compare the implementations suggested by our system to the selection that is actually best, which we obtain by brute-forcing all possible implementations.
We leave analysis of adaptive container suggestions to section \ref{section:results_adaptive_containers}.

Table \ref{table:predicted_actual} shows the predicted best assignments alongside the actual best assignment, obtained by brute-force.
In all but two of our test cases (marked with *), we correctly identify the best container.

\begin{table}[h!]
  \centering
  \begin{tabular}{|c|c|c|c|c|}
    Project & Container Type & Best implementation & Predicted best &   \\
    \hline
    aoc\_2021\_09 & Map & HashMap & HashMap &  \\
    aoc\_2021\_09 & Set & HashSet & HashSet &  \\
    aoc\_2022\_08 & Map & HashMap & HashMap &  \\
    aoc\_2022\_09 & Set & HashSet & HashSet &  \\
    aoc\_2022\_14 & Set & HashSet & HashSet &  \\
    aoc\_2022\_14 & List & Vec & LinkedList & * \\
    example\_mapping & Map & HashMap & HashMap &  \\
    example\_sets & Set & HashSet & HashSet &  \\
    example\_stack & StackCon & Vec & Vec &  \\
    prime\_sieve & Primes & BTreeSet & BTreeSet &  \\
    prime\_sieve & Sieve & Vec & LinkedList & * \\
  \end{tabular}
  \caption{Actual best vs predicted best implementations}
  \label{table:predicted_actual}
\end{table}

Both of these failures appear to be caused by being overly eager to suggest a \code{LinkedList}.
From looking at detailed profiling information, it seems that both of these container types had a relatively small number of items in them.
Therefore this is likely caused by our cost models being inaccurate at small $n$ values, as mentioned in section \ref{section:cm_small_n}.

Overall, our results suggest that our system is effective, at least for large enough $n$ values.
Unfortunately, these tests are somewhat limited, as the best container is almost always predictable: \code{Vec} where uniqueness is not important, and \code{Hash*} otherwise.
Therefore, more thorough testing is needed to fully establish the system's effectiveness.

\subsection{Adaptive containers}
\label{section:results_adaptive_containers}

We now look at cases where an adaptive container was suggested, and evaluate the result.

Table \ref{table:adaptive_suggestions} shows the container types for which adaptive containers were suggested, along with the inner types and the threshold at which to switch.

%% ** Find where adaptive containers get suggested
\begin{table}[h]
  \centering
  \begin{tabular}{|c|c|c|}
    Project & Container Type & Suggestion \\
    \hline
    aoc\_2022\_08 & Map & SortedVecMap until n=1664, then HashMap \\
    aoc\_2022\_09 & Set & HashSet until n=185540, then BTreeSet \\
    example\_mapping & Map & VecMap until n=225, then HashMap \\
    prime\_sieve & Primes & BTreeSet until n=34, then HashSet \\
    prime\_sieve & Sieve & LinkedList until n=747, then Vec \\
  \end{tabular}
  \caption{Suggestions for adaptive containers}
  \label{table:adaptive_suggestions}
\end{table}

The suggested containers for both \code{aoc_2022_08} and \code{example_mapping} are unsurprising.
Since hashing incurs a roughly constant cost, it makes sense that below a certain $n$ value, simply searching through a list is more effective.
The suggestion of \code{SortedVecMap} vs \code{VecMap} likely has to do with the relative frequency of \code{insert} operations compared to others.

The suggestion to start with a \code{LinkedList} for \code{prime_sieve / Sieve} is likely due to the same issues that cause a \code{LinkedList} to be suggested in the non-adaptive case.
This may also be the case for the suggestion of \code{BTreeSet} for \code{prime_sieve / Primes}.

The suggestion of \code{BTreeSet} for \code{aoc_2022_09} is most surprising.
As the $n$ threshold after which we switch is outside the range we benchmark our implementations at, this suggestion is based on our model attempting to generalise far outside the range it has seen before.

%% ** Comment on relative performance speedup
Table \ref{table:adaptive_perfcomp} compares our adaptive container suggestions with the fastest non-adaptive implementation.
Since we must select an implementation for all containers before benchmarking a project, we show all possible combinations of adaptive and non-adaptive container selections where appropriate.

Note that the numbered columns indicate the benchmark `size', not the actual size that the container reaches within that benchmark.
What this means exactly varies by benchmark.

\begin{table}[h]
  \centering
  \begin{adjustbox}{angle=90}
    \begin{tabular}{|c|c|c|c|c|c|}
      \hline
      Project & Implementations & \multicolumn{4}{|c|}{Benchmark size} \\
      \hline
       &  & 100 & 200 & & \\
      \hline
      aoc\_2022\_08 & Map=HashMap & 1ms $\pm$ 12us & 6ms $\pm$ 170us & & \\
      aoc\_2022\_08 & Map=Adaptive & 1ms $\pm$ 74us & 6ms $\pm$ 138us & & \\
      \hline
       &  & 100 & 1000 & 2000 & \\
      \hline
      aoc\_2022\_09 & Set=HashSet & 1ms $\pm$ 6us & 10ms $\pm$ 51us & 22ms $\pm$ 214us & \\
      aoc\_2022\_09 & Set=Adaptive & 1ms $\pm$ 3us & 10ms $\pm$ 27us & 40ms $\pm$ 514us & \\
      \hline
       &  & 50 & 150 & 2500 & 7500 \\
      \hline
      example\_mapping & Map=HashMap & 3us $\pm$ 6ns & 11us $\pm$ 20ns & 184us $\pm$ 835ns & 593us $\pm$ 793ns \\
      example\_mapping & Map=Adaptive & 4us $\pm$ 9ns & 33us $\pm$ 55ns & 192us $\pm$ 311ns & 654us $\pm$ 19us \\
      \hline
       &  & 50 & 500 & 50000 & \\
      \hline
      prime\_sieve & Primes=BTreeSet, Sieve=Vec & 1us $\pm$ 2ns & 75us $\pm$ 490ns & 774ms $\pm$ 4ms & \\
      prime\_sieve & Primes=BTreeSet, Sieve=Adaptive & 2us $\pm$ 7ns & 194us $\pm$ 377ns & 765ms $\pm$ 4ms & \\
      prime\_sieve & Primes=Adaptive, Sieve=Vec & 1us $\pm$ 10ns & 85us $\pm$ 179ns & 788ms $\pm$ 2ms & \\
      prime\_sieve & Primes=Adaptive, Sieve=Adaptive & 2us $\pm$ 5ns & 203us $\pm$ 638ns & 758ms $\pm$ 4ms & \\
      \hline
    \end{tabular}
  \end{adjustbox}
  \caption{Adaptive containers vs the best single container, by size of benchmark}
  \label{table:adaptive_perfcomp}
\end{table}

In all but one project, the non-adaptive containers are as fast as or faster than the adaptive containers at all benchmark sizes.
In the \code{aoc_2022_09} project, the adaptive container is marginally faster until the benchmark size reaches 2000, at which point it is significantly slower.

%% ** Suggest future improvements?
This shows that adaptive containers as we have implemented them are not effective in practice.
Even in cases where we never reach the size threshold, the presence of adaptive containers has an overhead which slows down the program 3x in the worst case (\code{example_mapping}, size = 150).
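
The overhead can be read directly off Table \ref{table:adaptive_perfcomp} for example\_mapping by dividing the adaptive time by the \code{HashMap} time at each benchmark size (times in microseconds, as reported, so the ratios are approximate):
\[
  \frac{4}{3} \approx 1.33, \qquad
  \frac{33}{11} = 3.0, \qquad
  \frac{192}{184} \approx 1.04, \qquad
  \frac{654}{593} \approx 1.10
\]
for sizes 50, 150, 2500, and 7500 respectively.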

One explanation for this could be that every operation now requires checking which inner implementation we are using, resulting in an additional check for each operation.
More work could be done to minimise this overhead, although it's unclear exactly how much this could be minimised.
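
To make the source of this per-operation check concrete, the following is a minimal sketch of an adaptive set built as an enum over two inner implementations.
All names (\code{AdaptiveSet}, \code{Inner}, \code{is_large}) are hypothetical, the element type is fixed to \code{u64} for brevity, and this is an illustration under those assumptions rather than the implementation evaluated above.

\begin{verbatim}
use std::collections::HashSet;

/// The two inner implementations the adaptive set can be backed by.
enum Inner {
    Small(Vec<u64>),
    Large(HashSet<u64>),
}

/// Hypothetical adaptive set: a Vec until it holds more than
/// `threshold` items, then a HashSet.
struct AdaptiveSet {
    inner: Inner,
    threshold: usize,
}

impl AdaptiveSet {
    fn new(threshold: usize) -> Self {
        AdaptiveSet { inner: Inner::Small(Vec::new()), threshold }
    }

    /// Every call pays for this `match` before doing any real work.
    fn contains(&self, x: u64) -> bool {
        match &self.inner {
            Inner::Small(v) => v.contains(&x),
            Inner::Large(h) => h.contains(&x),
        }
    }

    fn insert(&mut self, x: u64) -> bool {
        let inserted = match &mut self.inner {
            Inner::Small(v) => {
                if v.contains(&x) {
                    false
                } else {
                    v.push(x);
                    true
                }
            }
            Inner::Large(h) => h.insert(x),
        };
        // One-off migration once the small variant passes the threshold.
        let migrate = matches!(&self.inner,
            Inner::Small(v) if v.len() > self.threshold);
        if migrate {
            let old = std::mem::replace(
                &mut self.inner, Inner::Large(HashSet::new()));
            if let Inner::Small(v) = old {
                self.inner = Inner::Large(v.into_iter().collect());
            }
        }
        inserted
    }

    fn is_large(&self) -> bool {
        matches!(self.inner, Inner::Large(_))
    }
}
\end{verbatim}

In this sketch, \code{insert} performs a dispatch and a threshold check on every call, even when the container never grows past the threshold.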

It is also unclear if the threshold values that we suggest are the optimal ones.
Currently, we decide our threshold by picking a value between two partitions with different best containers.
Future work could take a more complex approach that finds the best threshold value based on our cost models, and takes the overhead of all operations into account.
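
One possible shape for such an approach is sketched below.
The notation is hypothetical and introduced only for illustration: $C_{X,o}(n)$ is the modelled cost of operation $o$ on implementation $X$ at size $n$, $f_o(n)$ is the number of times $o$ is performed at size $n$ according to the profiling information, and $\epsilon_o$ is an estimated per-call overhead of the adaptive wrapper.
Switching from $A$ to $B$ at threshold $t$, with a one-off migration cost $M(t)$, would then be estimated to cost
\[
  T_{A \to B}(t) = \sum_{o} \Bigl( \sum_{n \le t} f_o(n)\, C_{A,o}(n) + \sum_{n > t} f_o(n)\, C_{B,o}(n) + \epsilon_o \sum_{n} f_o(n) \Bigr) + M(t),
\]
and the threshold would be chosen as $t^{*} = \arg\min_t T_{A \to B}(t)$, with the adaptive container suggested only if $T_{A \to B}(t^{*})$ is below the estimated cost of the best single implementation.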

\subsection{Evaluation}

Overall, we find that the main part of our container selection system appears to have merit.
Whilst our testing has limitations, it shows that we can correctly identify the best container even in complex programs.
More work is needed on improving our system's performance for very small containers, and on testing with a wider range of programs.

Our proposed technique for identifying adaptive containers appears ineffective.
The primary challenges appear to be in the overhead introduced to each operation, and in finding the correct point at which to switch implementations.