-rw-r--r--  Tasks.org                          | 12
-rw-r--r--  thesis/main.tex                    |  5
-rw-r--r--  thesis/parts/acknowledgements.tex  |  4
-rw-r--r--  thesis/parts/implementation.tex    | 69
-rw-r--r--  thesis/parts/results.tex           | 76
5 files changed, 94 insertions, 72 deletions
@@ -282,13 +282,13 @@ Ideas:
 **** DONE Conclusion
 
-*** TODO Predictions
+*** DONE Predictions
 
-**** TODO Chosen benchmarks
+**** DONE Chosen benchmarks
 
-**** TODO Effect of selection on benchmarks (spread in execution time)
+**** DONE Effect of selection on benchmarks (spread in execution time)
 
-**** TODO Summarise predicted versus actual
+**** DONE Summarise predicted versus actual
 
 **** TODO Evaluate performance
 
@@ -304,8 +304,4 @@ Ideas:
 **** TODO Suggest future improvements?
 
-*** TODO Selection time / developer experience
-
-**** TODO Mention speedup versus naive brute force
-
 ** TODO Conclusion
diff --git a/thesis/main.tex b/thesis/main.tex
index 2c2ed90..a3c71c3 100644
--- a/thesis/main.tex
+++ b/thesis/main.tex
@@ -9,12 +9,13 @@
 \usepackage{amsmath}
 \usepackage{microtype}
+\usepackage{calc}
 \usepackage[style=numeric]{biblatex}
 \addbibresource{biblio.bib}
 
 %% Convenience macros
-\newcommand{\code}{\lstinline}
-\newcommand{\todo}[1]{\colorbox{yellow}{TODO: #1} \par}
+\newcommand{\code}[1]{\lstinline$#1$}
+\newcommand{\todo}[1]{\par\noindent\colorbox{yellow}{\begin{minipage}{\linewidth-2\fboxsep}TODO: #1\end{minipage}}\par}
 
 %% Code blocks
 \usepackage{listings, listings-rust}
diff --git a/thesis/parts/acknowledgements.tex b/thesis/parts/acknowledgements.tex
index 47cceed..0337f02 100644
--- a/thesis/parts/acknowledgements.tex
+++ b/thesis/parts/acknowledgements.tex
@@ -1,5 +1,3 @@
 Firstly, I'd like to express my deepest gratitude to my supervisor, Liam O'Connor, for his help.
 
-I'd also like to thank my partner, Lucy, and my friend Artemis for their support throughout.
-
-I would also like to thank the Tardis Project for the compute resources I used for benchmarking, and the members of CompSoc for their advice.
+I'd also like to thank the Tardis Project for the compute resources used for benchmarking, and the members of CompSoc for their advice.
diff --git a/thesis/parts/implementation.tex b/thesis/parts/implementation.tex
index 1c131ed..cd7b4b7 100644
--- a/thesis/parts/implementation.tex
+++ b/thesis/parts/implementation.tex
@@ -1,4 +1,5 @@
-\todo{Introduction}
+This chapter elaborates on some implementation details glossed over in the previous chapter.
+With reference to the source code, we explain the structure of our system's implementation, and highlight areas that posed difficulties.
 
 \section{Modifications to Primrose}
 
@@ -14,7 +15,29 @@ Operations on mapping implementations can be modelled and checked against constr
 They are modelled in Rosette as a list of key-value pairs.
 \code{src/crates/library/src/hashmap.rs} shows how mapping container types can be declared, and operations on them modelled.
 
-\todo{add and list library types}
+Table \ref{table:library} shows the library of container types we used.
+Most come from the Rust standard library, with the exceptions of \code{SortedVec} and \code{SortedUniqueVec}, which use \code{Vec} internally.
+The library source can be found in \code{src/crates/library}.
+
+\todo{This might be expanded}
+
+\begin{table}[h]
+  \centering
+  \begin{tabular}{|c|c|}
+    Implementation & Description \\
+    \hline
+    \code{LinkedList} & Doubly-linked list \\
+    \code{Vec} & Contiguous growable array \\
+    \code{SortedVec} & Vec kept in sorted order \\
+    \code{SortedUniqueVec} & Vec kept in sorted order, with no duplicates \\
+    \code{HashMap} & Hash map with quadratic probing \\
+    \code{HashSet} & Hash map with empty values \\
+    \code{BTreeMap} & B-Tree\parencite{bayer_organization_1970} map with linear search \\
+    \code{BTreeSet} & B-Tree map with empty values \\
+  \end{tabular}
+  \caption{Implementations in our library}
+  \label{table:library}
+\end{table}
 
 We also added new syntax to the language to support defining properties that only make sense for mappings (\code{dictProperty}), however this was unused.
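To make the sorted variants mentioned above concrete, here is a minimal sketch of how a `SortedVec` can be built on `Vec`, keeping elements ordered via a binary search on insert. This is our own simplified illustration, not the code from `src/crates/library`; names and details are assumptions.

```rust
/// A Vec kept in sorted order, as an illustrative sketch.
/// (A SortedUniqueVec would additionally skip the insert when
/// binary_search returns Ok, i.e. the value is already present.)
struct SortedVec<T: Ord> {
    inner: Vec<T>,
}

impl<T: Ord> SortedVec<T> {
    fn new() -> Self {
        Self { inner: Vec::new() }
    }

    /// O(log n) search for the insertion point, plus an O(n) shift
    /// to keep the underlying Vec sorted.
    fn insert(&mut self, v: T) {
        let idx = self.inner.binary_search(&v).unwrap_or_else(|i| i);
        self.inner.insert(idx, v);
    }

    /// Membership test via binary search, O(log n).
    fn contains(&self, v: &T) -> bool {
        self.inner.binary_search(v).is_ok()
    }
}

fn main() {
    let mut s = SortedVec::new();
    for v in [3, 1, 2, 2] {
        s.insert(v);
    }
    assert_eq!(s.inner, vec![1, 2, 2, 3]);
    assert!(s.contains(&3));
}
```

Keeping the data sorted is what lets `contains` use a binary search, which matters later when comparing fitted cost models.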
@@ -39,39 +62,38 @@ When benchmarks need to be run for an implementation, we dynamically generate a
 As Rust's generics are monomorphised, our generic code is compiled as if we were using the concrete type in our code, so we don't need to worry about affecting the benchmark results.
 
 Each benchmark is run in a 'warmup' loop for a fixed amount of time (currently 500ms), then runs for a fixed number of iterations (currently 50).
-This is important because we use every observation when fitting our cost models, so varying our number of iterations would change our curve's fit.
-We repeat each benchmark at a range of $n$ values, ranging from $64$ to $65,536$.
+This is important because we use every observation when fitting our cost models, so varying the number of iterations would change our curve's fit.
+We repeat each benchmark at a range of $n$ values, from $10$ to $60,000$.
 
 Each benchmark we run corresponds to one container operation.
-For most operations, we prepare a container of size $n$ and run the operation once per iteration.
+For most operations, we insert $n$ random values into a new container, then run the operation once per iteration.
 For certain operations which are commonly amortized (\code{insert}, \code{push}, and \code{pop}), we instead run the operation itself $n$ times and divide all data points by $n$.
 
-Our benchmarker crate outputs every observation in a similar format to Criterion (a popular benchmarking crate for Rust).
-We then parse this from our main program, and use least squares to fit a polynomial to our data.
-We initially tried other approaches to fitting a curve to our data, however we found that they all overfitted, resulting in more sensitivity to benchmarking noise.
+We use least squares to fit a polynomial to all of our data.
 As operations on most common data structures are of polynomial or logarithmic complexity, we believe that least squares fitting is good enough to capture the cost of most operations.
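The fitting step described above can be sketched as follows. This is an illustrative reimplementation, not the project's code: we assume a basis of $1$, $\log_2 n$, $n$, and $n^2$ (matching the shape of the fitted curves shown in the results chapter) and solve the normal equations directly with Gaussian elimination; the actual basis and solver may differ.

```rust
// Sketch: least-squares fit of a cost model over the basis
// [1, log2(n), n, n^2], solving the normal equations (X^T X) c = X^T y.

/// Evaluate the assumed basis functions at a given container size n.
fn basis(n: f64) -> [f64; 4] {
    [1.0, n.log2(), n, n * n]
}

/// Fit coefficients to observations (ns[i], ys[i]) by least squares.
fn fit(ns: &[f64], ys: &[f64]) -> [f64; 4] {
    // Accumulate the augmented matrix [X^T X | X^T y].
    let mut a = [[0.0f64; 5]; 4];
    for (&n, &y) in ns.iter().zip(ys) {
        let b = basis(n);
        for i in 0..4 {
            for j in 0..4 {
                a[i][j] += b[i] * b[j];
            }
            a[i][4] += b[i] * y;
        }
    }
    // Forward elimination with partial pivoting.
    for col in 0..4 {
        let pivot = (col..4)
            .max_by(|&i, &j| a[i][col].abs().partial_cmp(&a[j][col].abs()).unwrap())
            .unwrap();
        a.swap(col, pivot);
        for row in (col + 1)..4 {
            let f = a[row][col] / a[col][col];
            for k in col..5 {
                a[row][k] -= f * a[col][k];
            }
        }
    }
    // Back substitution.
    let mut c = [0.0f64; 4];
    for i in (0..4).rev() {
        let mut s = a[i][4];
        for j in (i + 1)..4 {
            s -= a[i][j] * c[j];
        }
        c[i] = s / a[i][i];
    }
    c
}

fn main() {
    // Synthetic observations from a known cost 2 + 3*log2(n) + 0.5*n.
    let ns: Vec<f64> = (1..=12).map(|i| (10 * i) as f64).collect();
    let ys: Vec<f64> = ns.iter().map(|&n| 2.0 + 3.0 * n.log2() + 0.5 * n).collect();
    println!("coefficients: {:?}", fit(&ns, &ys));
}
```

With noise-free synthetic data the fit recovers the generating coefficients; on real benchmark observations the same machinery produces the smoothed curves, which is why every observation (rather than a per-$n$ average) goes into the fit.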
-
-\todo{variable coefficients, which ones we tried}
+We originally experimented with coefficients up to $x^3$, but found that this led to bad overfitting.
 
 \section{Profiling}
 
-We implement profiling by using a \code{ProfilerWrapper} type (\code{src/crates/library/src/profiler.rs}), which takes as a type parameter the 'inner' container implementation.
+We implement profiling using a \code{ProfilerWrapper} type (\code{src/crates/library/src/profiler.rs}), which takes as type parameters the 'inner' container implementation and an index later used to match profiling output to the container type it came from.
 We then implement any Primrose traits that the inner container implements, counting the number of times each operation is called.
 We also check the length of the container after each insertion operation, and track the maximum.
 
 This tracking is done per-instance, and recorded when the instance goes out of scope and its \code{Drop} implementation is called.
-We write the counts of each operation and maximum size of the collection to a location specified by an environment variable, and a constant generic parameter which allows us to match up container types to their profiler outputs.
+We write the counts of each operation and the maximum size of the collection to a location specified by an environment variable.
 
 When we want to profile a program, we pick any valid inner implementation for each selection site, and use that candidate with our profiling wrapper as the concrete implementation for that site.
 
 This approach has the advantage of giving us information on each individual collection allocated, rather than only statistics for the type as a whole.
 For example, if one instance of a container type is used in a very different way from the rest, we will be able to see it more clearly than a normal profiling tool would allow us to.
-Although it has some amount of overhead, it's not important as we aren't measuring the program's execution time when profiling.
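The wrapper-based profiling described above can be sketched roughly like this. The real type lives in `src/crates/library/src/profiler.rs`; this simplified stand-in is our own, fixes the inner implementation to `Vec`, tracks only two operations, and prints to stdout instead of the file named by an environment variable.

```rust
// Sketch of per-instance profiling via a wrapper type whose Drop impl
// records the collected statistics (illustrative, not the real profiler).

/// Wraps an inner container and counts operations per instance.
/// The const generic ID stands in for the index used to match
/// profiler output back to a selection site.
struct ProfilerWrapper<T, const ID: usize> {
    inner: Vec<T>, // inner implementation fixed to Vec for this sketch
    insert_count: u64,
    contains_count: u64,
    max_len: usize,
}

impl<T: PartialEq, const ID: usize> ProfilerWrapper<T, ID> {
    fn new() -> Self {
        Self { inner: Vec::new(), insert_count: 0, contains_count: 0, max_len: 0 }
    }

    fn insert(&mut self, v: T) {
        self.insert_count += 1;
        self.inner.push(v);
        // Track the maximum length after each insertion.
        self.max_len = self.max_len.max(self.inner.len());
    }

    fn contains(&mut self, v: &T) -> bool {
        self.contains_count += 1;
        self.inner.contains(v)
    }
}

impl<T, const ID: usize> Drop for ProfilerWrapper<T, ID> {
    /// Stats are recorded when the instance goes out of scope.
    fn drop(&mut self) {
        // The real implementation writes to a path taken from an
        // environment variable; here we just print.
        println!(
            "site={} inserts={} contains={} max_len={}",
            ID, self.insert_count, self.contains_count, self.max_len
        );
    }
}

fn main() {
    let mut c: ProfilerWrapper<u32, 0> = ProfilerWrapper::new();
    for i in 0..10 {
        c.insert(i);
    }
    assert!(c.contains(&5));
} // `c` is dropped here, emitting one profile line for this instance
```

Because each instance carries its own counters and reports on `Drop`, two containers of the same type with very different usage patterns produce separate records, which is the property the text above highlights.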
+
+Although there is noticeable overhead in our current implementation, it's not important as we aren't measuring the program's execution time when profiling.
+Future work could likely reduce this overhead by batching file outputs, though this wasn't necessary for us.
 
 \section{Selection and Codegen}
 
 %% Selection Algorithm incl Adaptiv
-Selection is done per container site.
+Selection is done per container type.
 For each candidate implementation, we calculate its cost on each partition in the profiler output, then sum these values to get the total estimated cost for each implementation.
 This provides us with estimates for each singular candidate.
 
@@ -144,22 +166,3 @@ fn _StackCon<S: PartialEq + Ord + std::hash::Hash>() -> StackCon<S> {
 }
 \end{lstlisting}
 \end{figure}
-
-\section{Miscellaneous concerns}
-
-In this section, we highlight some other design decisions we made, and justify them.
-
-\todo{Explain cargo's role in rust projects \& how it is integrated}
-
-%% get project metadata from cargo
-%% available benchmarks and source directories
-%% works with most projects
-%% could be expanded to run as cargo command
-
-\todo{Caching and stuff}
-
-\todo{Ease of use}
-
-%% parse minimal amount of information from criterion benchmark
-%% most common benchmarking tool, closest there is to a standard
-%% should be easy to adapt if/when cargo ships proper benchmarking support
diff --git a/thesis/parts/results.tex b/thesis/parts/results.tex
index 6915b12..747631d 100644
--- a/thesis/parts/results.tex
+++ b/thesis/parts/results.tex
@@ -19,7 +19,7 @@ Starting with the \code{insert} operation, Figure \ref{fig:cm_insert} shows how
 The lines correspond to our fitted curves, while the points indicate the raw observations they are drawn from.
 To help readability, we group these into regular \code{Container} implementations, and our associative key-value \code{Mapping} implementations.
 
-\begin{figure}[h]
+\begin{figure}[h!]
 \centering
 \includegraphics[width=10cm]{assets/insert_containers.png}
 \par\centering\rule{11cm}{0.5pt}
@@ -42,7 +42,6 @@ This is likely due to hash collisions being more likely as the size of the colle
 \code{BTreeSet} insertions are also expensive, however the cost appears to level out as the collection size goes up (a logarithmic curve).
 It's important to note that Rust's \code{BTreeSet}s are not based on binary tree search, but instead on a more general tree search originally proposed by R. Bayer and E. McCreight\parencite{bayer_organization_1970}, where each node contains $B-1$ to $2B-1$ elements in an array.
-\todo{The standard library documentation states that searches are expected to take $B\log(n)$ comparisons on average\parencite{rust_documentation_team_btreemap_2024}, which would explain the logarithm-like growth.}
 
 Our two mapping types, \code{BTreeMap} and \code{HashMap}, mimic the behaviour of their set counterparts.
 
@@ -52,7 +51,7 @@ This would suggest we should see a roughly logarithmic complexity.
 However, as we will be inserting most elements near the middle of a list, we will on average be copying half the list every time.
 This could explain why we see a roughly linear growth.
 
-\todo{Graph this, and justify further}
+\todo{This explanation could be better}
 
 \subsection{Contains operations}
 
@@ -64,7 +63,7 @@ Notably, the observations in these graphs have a much wider spread than our \code{insert} observations.
 This is probably because we attempt to get a different random element from our container every time, so our observations show the best and worst cases of our data structures.
 This is desirable, assuming that \code{contains} operations are actually randomly distributed in the real world, which seems likely.
 
-\begin{figure}[h]
+\begin{figure}[h!]
 \centering
 \includegraphics[width=10cm]{assets/contains_lists.png}
 \par\centering\rule{11cm}{0.5pt}
@@ -89,11 +88,11 @@ C(n) &\approx -5.9 + 8.8\log_2 n - (4 * 10^{-5}) n - (3 * 10^{-8}) * n^2 & \text
 \end{align*}
 
 As both of these implementations use a binary search for \code{contains}, the dominating logarithmic factors are expected.
-\code{SortedUniqueVec} likely has a larger $n^2$ coefficient due to more collisions happening at larger container sizes.
-\todo{elaborate: we insert that many random items, but some may be duplicates}
+This is possibly a case of overfitting, as the observations for both implementations also have a wide spread.
 
-\code{HashSet} appears roughly linear as expected, with only a slow logarithmic rise, probably due to collisions.
+\code{HashSet} appears roughly linear as expected, with only a slow logarithmic rise, probably due to an increasing number of collisions.
 \code{BTreeSet} is consistently above it, with a slightly higher logarithmic rise.
+The standard library documentation states that searches are expected to take $B\log(n)$ comparisons on average\parencite{rust_documentation_team_btreemap_2024}, which is in line with our observations.
 
 \code{BTreeMap} and \code{HashMap} both mimic their set counterparts, though are more expensive in most places.
 This is probably due to the increased size more quickly exhausting the CPU cache.
@@ -124,7 +123,7 @@ We expect the results from our example cases to be relatively unsurprising, whil
 Most of our real cases are solutions to puzzles from Advent of Code\parencite{wastl_advent_2015}, a popular collection of programming puzzles.
 
 Table \ref{table:test_cases} lists and briefly describes our test cases.
 
-\begin{table}[h]
+\begin{table}[h!]
 \centering
 \begin{tabular}{|c|c|}
 Name & Description \\
@@ -147,29 +146,56 @@ Table \ref{table:test_cases} lists and briefly describes our test cases.
 Table \ref{table:benchmark_spread} shows the difference in benchmark results between the slowest possible assignment of containers and the fastest.
 Even in our example projects, we see that the wrong choice of container can slow down our programs substantially.
-
-\begin{table}[h]
+\begin{table}[h!]
 \centering
-\begin{tabular}{|c|c|}
-  Project & Total difference between best and worst benchmarks (seconds) & Maximum slowdown from bad container choices \\
-  \hline
-  aoc\_2021\_09 & 29.685 & 4.75 \\
-  aoc\_2022\_08 & 0.036 & 2.088 \\
-  aoc\_2022\_09 & 10.031 & 132.844 \\
-  aoc\_2022\_14 & 0.293 & 2.036 \\
-  prime\_sieve & 28.408 & 18.646 \\
-  example\_mapping & 0.031 & 1.805 \\
-  example\_sets & 0.179 & 12.65 \\
-  example\_stack & 1.931 & 8.454 \\
+\begin{tabular}{|c|c|c|}
+Project & Worst minus best time (seconds) & Maximum slowdown \\
+\hline
+aoc\_2021\_09 & 29.685 & 4.75 \\
+aoc\_2022\_08 & 0.036 & 2.088 \\
+aoc\_2022\_09 & 10.031 & 132.844 \\
+aoc\_2022\_14 & 0.293 & 2.036 \\
+prime\_sieve & 28.408 & 18.646 \\
+example\_mapping & 0.031 & 1.805 \\
+example\_sets & 0.179 & 12.65 \\
+example\_stack & 1.931 & 8.454 \\
 \end{tabular}
 \caption{Spread in total benchmark results by project}
 \label{table:benchmark_spread}
 \end{table}
-
 %% ** Summarise predicted versus actual
 \subsection{Prediction accuracy}
 
+We now compare the implementations suggested by our system to the selection that is actually best.
+For now, we ignore suggestions for adaptive containers.
+
+Table \ref{table:predicted_actual} shows the predicted best assignments alongside the actual best assignments, obtained by brute force.
+In all but two of our test cases (marked with *), we correctly identify the best container.
+
+\todo{But it's also just Vec/HashSet every time, which is kinda boring. We should either get more variety (by adding to the library or adding new test cases), or mention this as a limitation in testing}
+
+\begin{table}[h!]
+  \centering
+  \begin{tabular}{c|c|c|c|c|}
+      & Project & Container Type & Actual Best & Predicted Best \\
+    \hline
+      & aoc\_2021\_09 & Map & HashMap & HashMap \\
+      & aoc\_2021\_09 & Set & HashSet & HashSet \\
+      & aoc\_2022\_14 & Set & HashSet & HashSet \\
+    * & aoc\_2022\_14 & List & Vec & LinkedList \\
+      & example\_stack & StackCon & Vec & Vec \\
+      & example\_sets & Set & HashSet & HashSet \\
+      & example\_mapping & Map & HashMap & HashMap \\
+      & aoc\_2022\_08 & Map & HashMap & HashMap \\
+    * & prime\_sieve & Primes & BTreeSet & HashSet \\
+      & prime\_sieve & Sieve & Vec & Vec \\
+      & aoc\_2022\_09 & Set & HashSet & HashSet \\
+  \end{tabular}
+  \caption{Actual best vs predicted best implementations}
+  \label{table:predicted_actual}
+\end{table}
+
 %% ** Evaluate performance
 \subsection{Evaluation}
 
@@ -180,7 +206,7 @@ Even in our example projects, we see that the wrong choice of container can slow
 %% * Performance of adaptive containers
 \section{Adaptive containers}
 
-\todo{Try and make these fucking things work}
+\todo{These also need more work, and better test cases}
 
 %% ** Find where adaptive containers get suggested
 
@@ -189,8 +215,6 @@ Even in our example projects, we see that the wrong choice of container can slow
 %% ** Suggest future improvements?
 
 %% * Selection time / developer experience
-\section{Selection time}
-
-\todo{selection time}
+%% \section{Selection time}
 
 %% ** Mention speedup versus naive brute force
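For reference, the selection step this commit touches on in the implementation chapter (evaluate each candidate's fitted cost models against the profiled operation counts on every partition, sum, and pick the cheapest) can be sketched as below. Every name, cost model, and number here is made up for illustration; it is not the project's code.

```rust
// Sketch of cost-based candidate selection from profiler output.

/// Profiler output for one partition: a typical container size and
/// how often each operation was called.
struct Partition {
    avg_n: f64,
    op_counts: Vec<(&'static str, f64)>, // (operation, times called)
}

/// A candidate implementation with one fitted cost model per operation,
/// each mapping a container size n to an estimated per-call cost.
struct Candidate {
    name: &'static str,
    models: Vec<(&'static str, fn(f64) -> f64)>,
}

impl Candidate {
    /// Total estimated cost: sum model(n) * count over every
    /// operation in every partition.
    fn estimated_cost(&self, partitions: &[Partition]) -> f64 {
        partitions
            .iter()
            .map(|p| {
                p.op_counts
                    .iter()
                    .map(|(op, count)| {
                        self.models
                            .iter()
                            .find(|(o, _)| o == op)
                            .map_or(0.0, |(_, m)| m(p.avg_n) * count)
                    })
                    .sum::<f64>()
            })
            .sum()
    }
}

fn main() {
    // Hypothetical models: cheap insert + linear-scan contains,
    // versus pricier insert + constant-time contains.
    let vec_like = Candidate {
        name: "Vec",
        models: vec![
            ("insert", (|_n| 1.0) as fn(f64) -> f64),
            ("contains", (|n| 0.5 * n) as fn(f64) -> f64),
        ],
    };
    let hash_like = Candidate {
        name: "HashSet",
        models: vec![
            ("insert", (|_n| 3.0) as fn(f64) -> f64),
            ("contains", (|_n| 4.0) as fn(f64) -> f64),
        ],
    };
    // One profiled partition: contains-heavy usage at n around 1000.
    let parts = vec![Partition {
        avg_n: 1000.0,
        op_counts: vec![("insert", 1000.0), ("contains", 5000.0)],
    }];
    let best = [&vec_like, &hash_like]
        .into_iter()
        .min_by(|a, b| {
            a.estimated_cost(&parts)
                .partial_cmp(&b.estimated_cost(&parts))
                .unwrap()
        })
        .unwrap();
    println!("selected: {}", best.name);
}
```

With a contains-heavy profile like this, the hash-based candidate's constant-time lookups dominate and it is selected, mirroring why `HashSet`/`HashMap` win in most of the test cases tabulated above.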