From 61a6ce0ede29e5595473852a39cc216288d50a25 Mon Sep 17 00:00:00 2001 From: Aria Shrimpton Date: Fri, 29 Mar 2024 14:41:15 +0000 Subject: rest of cost model analysis --- thesis/parts/results.tex | 52 ++++++++++++++++++------------------------------ 1 file changed, 19 insertions(+), 33 deletions(-) (limited to 'thesis/parts/results.tex') diff --git a/thesis/parts/results.tex b/thesis/parts/results.tex index 9d0b5b4..bd30fcf 100644 --- a/thesis/parts/results.tex +++ b/thesis/parts/results.tex @@ -75,7 +75,10 @@ Since both of these implementations require searching the collection before inse Whilst our main figures for insertion operations indicate a clear winner within each category, looking at small $n$ values reveals some more complexity. Figure \ref{fig:cm_insert_small_n} shows the cost models for insert operations on different set implementations at smaller n values. -\todo{Explain this} +In particular, for $n<1800$ the overhead from sorting a vec is less than running the default hasher function (at least on this hardware). + +We also see a sharp spike in the cost for \code{SortedVecSet} at low $n$ values, and an area of supposed 0 cost from around $n=200$ to $n=800$. +This seems inaccurate, and is likely a result of few data points at low $n$ values resulting in poor fitting. \begin{figure}[h!] \centering @@ -87,12 +90,7 @@ Figure \ref{fig:cm_insert_small_n} shows the cost models for insert operations o \subsection{Contains operations} We now examine the cost of the \code{contains} operation. -Figure \ref{fig:cm_contains} shows our built cost models. -These are grouped for readability, with the first graph showing sets and sorted lists, the second showing sets and sorted lists, and the third showing key-value mappings. - -Notably, the observations in these graphs have a much wider spread than our \code{insert} operations do. 
-This is probably because we attempt to get a different random element in our container every time, so our observations show the best and worst case of our data structures. -This is desirable assuming that \code{contains} operations are actually randomly distributed in the real world, which seems likely. +Figure \ref{fig:cm_contains} shows our built cost models, again grouped for readability. \begin{figure}[h!] \centering @@ -101,42 +99,30 @@ This is desirable assuming that \code{contains} operations are actually randomly \label{fig:cm_contains} \end{figure} -Both \code{LinkedList} and \code{Vec} implementations have roughly linear growth, which makes sense as these are not kept ordered. -\code{LinkedList} has a significantly higher cost at all points, and a wider spread of outliers. -This makes sense as each item in a linked list is not guaranteed to be in the same place in memory, so traversing them is likely to be more expensive, making the best and worst cases further apart. -Some of the spread could also be explained by heap allocations being put in different locations in memory, with less or more locality between each run. - -\code{SortedVec} and \code{SortedUniqueVec} both exhibit a wide spread of observations, with what looks like a roughly linear growth. -Looking at the raw output, we find the following equations being used for each cost model: +Notably, the observations in these graphs have a much wider spread than our \code{insert} operations do. +This is probably because we attempt to get a different random element in our container every time, so our observations show the best and worst case of our data structures. +This is desirable assuming that \code{contains} operations are actually randomly distributed in the real world, which seems likely. 
-\begin{align*} -C(n) &\approx 22.8 + 4.6\log_2 n + 0.003n - (1 * 10^{-9}) * n^2 & \textrm{SortedVec} \\ -C(n) &\approx -5.9 + 8.8\log_2 n - (4 * 10^{-5}) n - (3 * 10^{-8}) * n^2 & \textrm{SortedUniqueVec} -\end{align*} +For the \code{SortedVec} family, we would expect to see roughly logarithmic growth, as \code{contains} is based on binary search. +This is the case for \code{SortedVecMap}; however, \code{SortedVec} and \code{SortedVecSet} both show exponential growth with a `dip' around $n=25{,}000$. +It's unclear why this happened, although it could be due to how the elements we query are distributed throughout the list. +A possible improvement would be to run \code{contains} with a known distribution of values, including low, high, and not present values in equal parts. -As both of these implementations use a binary search for \code{contains}, the dominating logarithmic factors are expected. -This is possibly a case of overfitting, as the observations for both implementations also have a wide spread. +The \code{Vec} family exhibits roughly linear growth, which is expected, since this implementation scans through the whole array each time. +\code{LinkedList} has roughly logarithmic growth, at a significantly higher cost. +The higher cost is expected, although it's unclear why growth is logarithmic rather than linear. +As the spread of points also appears to increase at larger $n$ values, it's possible that this is due to larger $n$ values causing a higher proportion of the program's memory to be dedicated to the container, resulting in better cache utilisation. \code{HashSet} appears roughly linear as expected, with only a slow logarithmic rise, probably due to an increasing amount of collisions. \code{BTreeSet} is consistently above it, with a slightly higher logarithmic rise. -The standard library documentation states that searches are expected to take $B\log(n)$ comparisons on average\citep{rust_documentation_team_btreemap_2024}, which is in line with observations. 
-\code{BTreeMap} and \code{HashMap} both mimic their set counterparts, though are more expensive in most places. -This is probably due to the increased size more quickly exhausting CPU cache. +\code{BTreeMap} and \code{HashMap} both mimic their set counterparts, but with a slightly lower overall cost this time. +\todo{It's unclear why this is.} \subsection{Evaluation} -In the cost models we examined, we found that most were in line with our expectations. - -Although we will not examine them in detail, we briefly describe observations from the rest of the built cost models: - -\begin{enumerate} -\item Our models for \code{push} and \code{pop} operations are pretty much the same as for \code{insert} operations, as they are the same inner implementation. -\item \code{first}, \code{last}, and \code{nth} operations show the time complexity we expect. However, some overfitting appears to occur, meaning our cost models may not generalise as well outside of the range of n values they were benchmarked with. -\end{enumerate} - Overall, our cost models appear to be a good representation of each implementations performance impact. -Future improvements could address the overfitting problems some operations had, either by pre-processing the data to detect and remove outliers, or by employing a more complex fitting procedure. +Future improvements could address the overfitting problems some operations had, such as by employing a more complex fitting procedure, or by doing more to ensure operations have their best and worst cases tested fairly. %% * Predictions \section{Selections} -- cgit v1.2.3