From 43cd2c2362b123de24b4381d1fa46acaeb602c18 Mon Sep 17 00:00:00 2001
From: Aria Shrimpton <me@aria.rip>
Date: Sun, 10 Mar 2024 13:53:12 +0000
Subject: rest of cost model section

---
 thesis/parts/results.tex | 52 +++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 51 insertions(+), 1 deletion(-)


diff --git a/thesis/parts/results.tex b/thesis/parts/results.tex
index 896fa9d..2e2373a 100644
--- a/thesis/parts/results.tex
+++ b/thesis/parts/results.tex
@@ -58,11 +58,61 @@ This could explain why we see a roughly linear growth.
 \subsection{Contains operations}
 
 We now examine the cost of the \code{contains} operation.
+Figure \ref{fig:cm_contains} shows our built cost models.
+These are grouped for readability, with the first graph showing lists, the second showing sets and sorted lists, and the third showing key-value mappings.
 
-\subsection{Outliers / errors}
+Notably, the observations in these graphs have a much wider spread than our \code{insert} operations do.
+This is probably because we look up a different random element of the container each time, so our observations cover both the best and worst cases of our data structures.
+This is desirable assuming that \code{contains} operations are actually randomly distributed in the real world, which seems likely.
+
+\begin{figure}[h]
+  \centering
+  \includegraphics[width=10cm]{assets/contains_lists.png}
+  \par\centering\rule{11cm}{0.5pt}
+  \includegraphics[width=10cm]{assets/contains_sets.png}
+  \par\centering\rule{11cm}{0.5pt}
+  \includegraphics[width=10cm]{assets/contains_mappings.png}
+  \caption{Estimated cost of \code{contains} operation on lists, sets/sorted lists, and \code{Mapping}s}
+  \label{fig:cm_contains}
+\end{figure}
+
+Both \code{LinkedList} and \code{Vec} implementations have roughly linear growth, which makes sense as these are not kept ordered.
+\code{LinkedList} has a significantly higher cost at all points, and a wider spread of outliers.
+This makes sense as the items in a linked list are not guaranteed to be contiguous in memory, so traversing them is likely to be more expensive, making the best and worst cases further apart.
+Some of the spread could also be explained by heap allocations being placed at different locations in memory, giving more or less locality from one run to the next.
+
+\code{SortedVec} and \code{SortedUniqueVec} both exhibit a wide spread of observations, with what looks like a roughly linear growth.
+Looking at the raw output, we find the following equations being used for each cost model:
+
+\begin{align*}
+C(n) &\approx 22.8 + 4.6\log_2 n + 0.003n - (1 \times 10^{-9}) n^2 & \textrm{SortedVec} \\
+C(n) &\approx -5.9 + 8.8\log_2 n - (4 \times 10^{-5}) n - (3 \times 10^{-8}) n^2 & \textrm{SortedUniqueVec}
+\end{align*}
+
+As both of these implementations use a binary search for \code{contains}, the dominating logarithmic factors are expected.
+\code{SortedUniqueVec} likely has an $n^2$ coefficient of larger magnitude due to more collisions (duplicate items) happening at larger container sizes.
+\todo{elaborate: we insert that many random items, but some may be duplicates}
+
+\code{HashSet} appears roughly constant as expected, with only a slow logarithmic rise, probably due to collisions.
+\code{BTreeSet} is consistently above it, with a slightly higher logarithmic rise.
+
+\code{BTreeMap} and \code{HashMap} both mimic their set counterparts, though they are more expensive at most sizes.
+This is probably due to the increased size more quickly exhausting CPU cache.
 
 \subsection{Evaluation}
 
+Most of the cost models we examined were in line with our expectations.
+
+Although we will not examine them in detail, we briefly describe observations from the rest of the built cost models:
+
+\begin{enumerate}
+\item Our models for \code{push} and \code{pop} operations are very similar to those for \code{insert} operations, as they share the same inner implementation.
+\item \code{first}, \code{last}, and \code{nth} operations show the time complexity we expect. However, some overfitting appears to occur, meaning our cost models may not generalise as well outside the range of $n$ values they were benchmarked with.
+\end{enumerate}
+
+Overall, our cost models appear to be a good representation of each implementation's performance impact.
+Future improvements could address the overfitting problems some operations had, either by pre-processing the data to detect and remove outliers, or by employing a more complex fitting procedure.
+
 %% * Predictions
 \section{Selections}
 
-- 
cgit v1.2.3