From 43cd2c2362b123de24b4381d1fa46acaeb602c18 Mon Sep 17 00:00:00 2001
From: Aria Shrimpton
Date: Sun, 10 Mar 2024 13:53:12 +0000
Subject: rest of cost model section

---
 thesis/parts/results.tex | 52 +++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 51 insertions(+), 1 deletion(-)

(limited to 'thesis/parts/results.tex')

diff --git a/thesis/parts/results.tex b/thesis/parts/results.tex
index 896fa9d..2e2373a 100644
--- a/thesis/parts/results.tex
+++ b/thesis/parts/results.tex
@@ -58,11 +58,61 @@ This could explain why we see a roughly linear growth.
 \subsection{Contains operations}
 
 We now examine the cost of the \code{contains} operation.
+Figure \ref{fig:cm_contains} shows our built cost models.
+These are grouped for readability, with the first graph showing lists, the second showing sets and sorted lists, and the third showing key-value mappings.
 
-\subsection{Outliers / errors}
+Notably, the observations in these graphs have a much wider spread than those for our \code{insert} operations.
+This is probably because we look up a different random element in our container each time, so our observations cover both the best and worst cases of each data structure.
+This is desirable, assuming that \code{contains} operations are actually randomly distributed in the real world, which seems likely.
+
+\begin{figure}[h]
+    \centering
+    \includegraphics[width=10cm]{assets/contains_lists.png}
+    \par\centering\rule{11cm}{0.5pt}
+    \includegraphics[width=10cm]{assets/contains_sets.png}
+    \par\centering\rule{11cm}{0.5pt}
+    \includegraphics[width=10cm]{assets/contains_mappings.png}
+    \caption{Estimated cost of the \code{contains} operation on lists, sets/sorted lists, and \code{Mapping}s}
+    \label{fig:cm_contains}
+\end{figure}
+
+Both \code{LinkedList} and \code{Vec} implementations have roughly linear growth, which makes sense as neither is kept ordered, so a lookup has to scan the elements one by one.
+\code{LinkedList} has a significantly higher cost at all points, and a wider spread of outliers.
+This makes sense as the items in a linked list are not guaranteed to be contiguous in memory, so traversing them is likely to be more expensive, making the best and worst cases further apart.
+Some of the spread could also be explained by heap allocations being placed in different locations in memory, with more or less locality from run to run.
+
+\code{SortedVec} and \code{SortedUniqueVec} both exhibit a wide spread of observations, with what looks like roughly linear growth.
+Looking at the raw output, we find the following equations being used for each cost model:
+
+\begin{align*}
+C(n) &\approx 22.8 + 4.6\log_2 n + 0.003n - (1 \times 10^{-9}) n^2 & \textrm{SortedVec} \\
+C(n) &\approx -5.9 + 8.8\log_2 n - (4 \times 10^{-5}) n - (3 \times 10^{-8}) n^2 & \textrm{SortedUniqueVec}
+\end{align*}
+
+As both of these implementations use a binary search for \code{contains}, the dominant logarithmic terms are expected.
+\code{SortedUniqueVec} likely has a larger-magnitude $n^2$ coefficient due to more collisions happening at larger container sizes.
+\todo{elaborate: we insert that many random items, but some may be duplicates}
+
+\code{HashSet} appears roughly constant as expected, with only a slow logarithmic rise, probably due to collisions.
+\code{BTreeSet} is consistently above it, with a slightly higher logarithmic rise.
+
+\code{BTreeMap} and \code{HashMap} both mimic their set counterparts, though they are more expensive in most places.
+This is probably due to the increased size of each entry exhausting the CPU cache more quickly.
 
 \subsection{Evaluation}
 
+Most of the cost models we examined were in line with our expectations.
+
+Although we will not examine them in detail, we briefly describe observations from the rest of the built cost models:
+
+\begin{enumerate}
+\item Our models for \code{push} and \code{pop} operations are very similar to those for \code{insert} operations, as they share the same inner implementation.
+\item \code{first}, \code{last}, and \code{nth} operations show the time complexity we expect. However, some overfitting appears to occur, meaning our cost models may not generalise as well outside the range of $n$ values they were benchmarked with.
+\end{enumerate}
+
+Overall, our cost models appear to be a good representation of each implementation's performance impact.
+Future improvements could address the overfitting problems some operations had, either by pre-processing the data to detect and remove outliers, or by employing a more complex fitting procedure.
+
 %% * Predictions
 
 \section{Selections}
-- 
cgit v1.2.3