author	Aria Shrimpton <me@aria.rip>	2024-03-29 14:41:15 +0000
committer	Aria Shrimpton <me@aria.rip>	2024-03-29 14:41:15 +0000
commit	61a6ce0ede29e5595473852a39cc216288d50a25 (patch)
tree	e1b52de940bdb515215da6fe103d8fa7480467c7 /thesis/parts
parent	2f75ce401867feaddce578e09be542407c327f48 (diff)
rest of cost model analysis
Diffstat (limited to 'thesis/parts')
-rw-r--r--	thesis/parts/results.tex | 52
1 file changed, 19 insertions(+), 33 deletions(-)
diff --git a/thesis/parts/results.tex b/thesis/parts/results.tex
index 9d0b5b4..bd30fcf 100644
--- a/thesis/parts/results.tex
+++ b/thesis/parts/results.tex
@@ -75,7 +75,10 @@ Since both of these implementations require searching the collection before inse
Whilst our main figures for insertion operations indicate a clear winner within each category, looking at small $n$ values reveals some more complexity.
Figure \ref{fig:cm_insert_small_n} shows the cost models for insert operations on different set implementations at smaller $n$ values.
-\todo{Explain this}
+In particular, for $n < 1800$ the overhead of keeping a \code{Vec} sorted is less than that of running the default hasher function (at least on this hardware).
+
+We also see a sharp spike in the cost for \code{SortedVecSet} at low $n$ values, and an area of supposedly zero cost from around $n=200$ to $n=800$.
+This seems inaccurate, and is likely an artefact of having few data points at low $n$ values, resulting in a poor fit.
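As an illustrative aside, the two insertion strategies whose crossover is described above can be sketched in Rust. This is not the thesis's code: `sorted_vec_insert` is a hypothetical helper, and the point is only that a sorted-\code{Vec} insert pays a binary search plus an element shift, while a \code{HashSet} insert always pays the hasher cost.

```rust
// Sketch of the two insert strategies compared above (illustrative only;
// names like `sorted_vec_insert` are not from the thesis codebase).
use std::collections::HashSet;

/// Insert into a Vec kept in sorted order: an O(log n) binary search for
/// the position, then an O(n) shift. For small n the shift is cheap.
fn sorted_vec_insert(v: &mut Vec<u32>, x: u32) {
    let idx = v.binary_search(&x).unwrap_or_else(|i| i);
    v.insert(idx, x);
}

fn main() {
    let mut sorted = Vec::new();
    let mut hashed = HashSet::new();
    for x in [5u32, 1, 3, 2, 4] {
        sorted_vec_insert(&mut sorted, x);
        // HashSet::insert is amortised O(1), but always pays the hasher cost.
        hashed.insert(x);
    }
    assert_eq!(sorted, vec![1, 2, 3, 4, 5]);
    assert_eq!(hashed.len(), 5);
}
```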
\begin{figure}[h!]
\centering
@@ -87,12 +90,7 @@ Figure \ref{fig:cm_insert_small_n} shows the cost models for insert operations o
\subsection{Contains operations}
We now examine the cost of the \code{contains} operation.
-Figure \ref{fig:cm_contains} shows our built cost models.
-These are grouped for readability, with the first graph showing sets and sorted lists, the second showing sets and sorted lists, and the third showing key-value mappings.
-
-Notably, the observations in these graphs have a much wider spread than our \code{insert} operations do.
-This is probably because we attempt to get a different random element in our container every time, so our observations show the best and worst case of our data structures.
-This is desirable assuming that \code{contains} operations are actually randomly distributed in the real world, which seems likely.
+Figure \ref{fig:cm_contains} shows our built cost models, again grouped for readability.
\begin{figure}[h!]
\centering
@@ -101,42 +99,30 @@ This is desirable assuming that \code{contains} operations are actually randomly
\label{fig:cm_contains}
\end{figure}
-Both \code{LinkedList} and \code{Vec} implementations have roughly linear growth, which makes sense as these are not kept ordered.
-\code{LinkedList} has a significantly higher cost at all points, and a wider spread of outliers.
-This makes sense as each item in a linked list is not guaranteed to be in the same place in memory, so traversing them is likely to be more expensive, making the best and worst cases further apart.
-Some of the spread could also be explained by heap allocations being put in different locations in memory, with less or more locality between each run.
-
-\code{SortedVec} and \code{SortedUniqueVec} both exhibit a wide spread of observations, with what looks like a roughly linear growth.
-Looking at the raw output, we find the following equations being used for each cost model:
+Notably, the observations in these graphs have a much wider spread than our \code{insert} operations do.
+This is probably because we attempt to get a different random element in our container every time, so our observations show the best and worst case of our data structures.
+This is desirable assuming that \code{contains} operations are actually randomly distributed in the real world, which seems likely.
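A minimal sketch of the querying strategy described above, assuming a simple deterministic PRNG in place of whatever random source the benchmarks actually use; `lcg` and `random_queries` are illustrative names, not from the thesis codebase.

```rust
// Sketch of the benchmarking strategy described above: query a different
// pseudo-randomly chosen element each time, so observations cover both the
// best and worst cases of the container. Illustrative only.

/// Deterministic linear congruential generator (stands in for a real RNG).
fn lcg(state: &mut u64) -> u64 {
    *state = state
        .wrapping_mul(6364136223846793005)
        .wrapping_add(1442695040888963407);
    *state
}

/// Pick `n_queries` elements of `contents` pseudo-randomly (with repeats).
fn random_queries(contents: &[u32], n_queries: usize) -> Vec<u32> {
    let mut state = 42u64;
    (0..n_queries)
        .map(|_| contents[(lcg(&mut state) % contents.len() as u64) as usize])
        .collect()
}

fn main() {
    let contents = [10u32, 20, 30, 40, 50];
    let queries = random_queries(&contents, 8);
    assert_eq!(queries.len(), 8);
    assert!(queries.iter().all(|q| contents.contains(q)));
}
```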
-\begin{align*}
-C(n) &\approx 22.8 + 4.6\log_2 n + 0.003n - (1 * 10^{-9}) * n^2 & \textrm{SortedVec} \\
-C(n) &\approx -5.9 + 8.8\log_2 n - (4 * 10^{-5}) n - (3 * 10^{-8}) * n^2 & \textrm{SortedUniqueVec}
-\end{align*}
+For the \code{SortedVec} family, we would expect to see roughly logarithmic growth, as contains is based on binary search.
+This is the case for \code{SortedVecMap}, however \code{SortedVec} and \code{SortedVecSet} both show exponential growth with a 'dip' around $n=25,000$.
+It's unclear why this happens, although it could be due to how the queried elements are distributed throughout the list.
+A possible improvement would be to run \code{contains} with a known distribution of values, including low, high, and not-present values in equal parts.
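The suggested improvement could look roughly like the following sketch, where `balanced_queries` is a hypothetical helper (not part of the thesis codebase) that draws queries equally from the low end, the high end, and values guaranteed to be absent.

```rust
// Sketch of the suggested improvement above: build a query set with a
// known distribution -- low values, high values, and absent values in
// equal parts. `balanced_queries` is a hypothetical helper.

/// Given the container's sorted contents, return queries drawn equally
/// from the low end, the high end, and values guaranteed to be absent.
fn balanced_queries(sorted_contents: &[u32]) -> Vec<u32> {
    let k = sorted_contents.len() / 3;
    let mut queries = Vec::new();
    // Low values: from the start of the sorted contents.
    queries.extend_from_slice(&sorted_contents[..k]);
    // High values: from the end.
    queries.extend_from_slice(&sorted_contents[sorted_contents.len() - k..]);
    // Absent values: strictly greater than the maximum, so never present.
    let max = *sorted_contents.last().unwrap();
    queries.extend((1..=k as u32).map(|i| max + i));
    queries
}

fn main() {
    let queries = balanced_queries(&[1, 2, 3, 4, 5, 6, 7, 8, 9]);
    assert_eq!(queries, vec![1, 2, 3, 7, 8, 9, 10, 11, 12]);
}
```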
-As both of these implementations use a binary search for \code{contains}, the dominating logarithmic factors are expected.
-This is possibly a case of overfitting, as the observations for both implementations also have a wide spread.
+The \code{Vec} family exhibits roughly linear growth, which is expected, since this implementation scans through the whole array each time.
+\code{LinkedList} has roughly logarithmic growth, at a significantly higher cost.
+The higher cost is expected, although it's unclear why the growth is logarithmic rather than linear.
+As the spread of points also appears to increase at larger $n$ values, it's possible that larger $n$ values cause a higher proportion of the program's memory to be dedicated to the container, resulting in better cache utilisation.
\code{HashSet} appears roughly constant as expected, with only a slow logarithmic rise, probably due to an increasing number of collisions.
\code{BTreeSet} is consistently above it, with a slightly higher logarithmic rise.
-The standard library documentation states that searches are expected to take $B\log(n)$ comparisons on average\citep{rust_documentation_team_btreemap_2024}, which is in line with observations.
-\code{BTreeMap} and \code{HashMap} both mimic their set counterparts, though are more expensive in most places.
-This is probably due to the increased size more quickly exhausting CPU cache.
+\code{BTreeMap} and \code{HashMap} both mimic their set counterparts, but with a slightly lower overall cost this time.
+\todo{It's unclear why this is.}
\subsection{Evaluation}
-In the cost models we examined, we found that most were in line with our expectations.
-
-Although we will not examine them in detail, we briefly describe observations from the rest of the built cost models:
-
-\begin{enumerate}
-\item Our models for \code{push} and \code{pop} operations are pretty much the same as for \code{insert} operations, as they are the same inner implementation.
-\item \code{first}, \code{last}, and \code{nth} operations show the time complexity we expect. However, some overfitting appears to occur, meaning our cost models may not generalise as well outside of the range of n values they were benchmarked with.
-\end{enumerate}
-
Overall, our cost models appear to be a good representation of each implementation's performance impact.
-Future improvements could address the overfitting problems some operations had, either by pre-processing the data to detect and remove outliers, or by employing a more complex fitting procedure.
+Future improvements could address the overfitting problems some operations had, such as by employing a more complex fitting procedure, or by doing more to ensure operations have their best and worst cases tested fairly.
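One concrete form such an improvement could take is pre-processing the observations to drop outliers before fitting, for example with the interquartile-range rule. The sketch below is an assumed approach, not the procedure the thesis implements.

```rust
// Sketch of one way to address the overfitting noted above: drop outlier
// observations with the interquartile-range (IQR) rule before fitting.
// Assumed pre-processing step, not the thesis's actual procedure.

/// Keep only observations within [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
fn filter_outliers(mut obs: Vec<f64>) -> Vec<f64> {
    obs.sort_by(|a, b| a.partial_cmp(b).unwrap());
    // Approximate quartiles by index; adequate for a rough filter.
    let q1 = obs[obs.len() / 4];
    let q3 = obs[(3 * obs.len()) / 4];
    let iqr = q3 - q1;
    let (lo, hi) = (q1 - 1.5 * iqr, q3 + 1.5 * iqr);
    obs.into_iter().filter(|&x| x >= lo && x <= hi).collect()
}

fn main() {
    // 100.0 lies far outside the spread of the other observations.
    let cleaned = filter_outliers(vec![1.0, 2.0, 2.0, 3.0, 100.0]);
    assert_eq!(cleaned, vec![1.0, 2.0, 2.0, 3.0]);
}
```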
%% * Predictions
\section{Selections}