-rw-r--r--  Tasks.org                 |  2
-rw-r--r--  analysis/vis.livemd       | 65
-rw-r--r--  thesis/biblio.bib         | 24
-rw-r--r--  thesis/parts/analysis.tex |  7
-rw-r--r--  thesis/parts/results.tex  | 38
5 files changed, 109 insertions, 27 deletions
@@ -274,7 +274,7 @@ Ideas:
 *** DONE Cost model analysis
-**** TODO Insertion operations
+**** DONE Insertion operations
 **** TODO Contains operations
diff --git a/analysis/vis.livemd b/analysis/vis.livemd
index 044e41f..2fa59d2 100644
--- a/analysis/vis.livemd
+++ b/analysis/vis.livemd
@@ -132,9 +132,11 @@
 mapping_impls = ["HashMap", "BTreeMap"]
 list_impls = ["Vec", "LinkedList", "SortedVec"]
 stack_impls = ["Vec", "LinkedList"]
-inspect_op = "insert"
-# impls = set_impls ++ list_impls
-impls = mapping_impls
+inspect_op = "clear"
+# impls = set_impls ++ list_impls ++ mapping_impls
+impls = ["Vec"]
+# impls = mapping_impls
+# impls = ["SortedUniqueVec", "SortedVec"]

 Tucan.layers([
   cost_models
@@ -160,7 +162,7 @@ Tucan.layers([
 |> Tucan.Axes.set_y_title("Estimated cost")
 |> Tucan.Axes.set_x_title("Size of container (n)")
 |> Tucan.Scale.set_x_domain(startn, endn)
-|> Tucan.Scale.set_y_domain(0, 200)
+# |> Tucan.Scale.set_y_domain(0, 200)
 |> Tucan.set_size(500, 250)
 |> Tucan.Legend.set_title(:color, "Implementation")
 |> Tucan.Legend.set_orientation(:color, "bottom")
@@ -316,33 +318,54 @@ estimated_costs =
   |> DF.new()
 ```

-## Estimates vs results
+## Estimates vs results (ignoring adaptive containers)

 ```elixir
-# Compare each assignments position in the estimates to its position in the results
-sorted_estimates =
+# Don't worry about adaptive containers for now
+singular_estimated_costs =
   estimated_costs
+  |> DF.to_rows_stream()
+  |> Enum.filter(fn %{"using" => using} ->
+    Enum.all?(using, fn %{"impl" => impl} -> !String.contains?(impl, "until") end)
+  end)
+  |> DF.new()
+
+singular_benchmarks =
+  benchmarks
+  |> DF.to_rows_stream()
+  |> Enum.filter(fn %{"using" => using} ->
+    Enum.all?(using, fn %{"impl" => impl} -> !String.contains?(impl, "until") end)
+  end)
+  |> DF.new()
+
+DF.n_rows(singular_benchmarks)
+```
+
+```elixir
+# Compare each assignment's position in the estimates to its position in the results
+sorted_singular_estimates =
+  singular_estimated_costs
   |> DF.group_by(["proj"])
   |> DF.sort_by(estimated_cost)

-sorted_results =
-  benchmarks
+sorted_singular_results =
+  singular_benchmarks
   |> DF.group_by(["proj"])
   |> DF.sort_by(time)

-position_comparison =
-  sorted_estimates
+singular_position_comparison =
+  sorted_singular_estimates
   |> DF.to_rows_stream()
   |> Enum.map(fn %{"proj" => proj, "using" => using} ->
     %{
       proj: proj,
       using: using,
       pos_estimate:
-        DF.filter(sorted_estimates, proj == ^proj)["using"]
+        DF.filter(sorted_singular_estimates, proj == ^proj)["using"]
         |> SE.to_list()
         |> Enum.find_index(fn u -> u == using end),
       pos_results:
-        DF.filter(sorted_results, proj == ^proj)["using"]
+        DF.filter(sorted_singular_results, proj == ^proj)["using"]
         |> SE.to_list()
         |> Enum.find_index(fn u -> u == using end)
     }
@@ -352,7 +375,19 @@ position_comparison =

 ```elixir
 # Everywhere we predicted wrong.
-position_comparison
-|> DF.filter(pos_estimate != pos_results)
+singular_position_comparison
+|> DF.filter(pos_estimate == 0 and pos_estimate != pos_results)
 |> DF.collect()
 ```
+
+```elixir
+singular_estimated_costs
+|> DF.filter(proj == "aoc_2022_14")
+|> DF.sort_by(estimated_cost)
+```
+
+```elixir
+singular_benchmarks
+|> DF.filter(proj == "aoc_2022_14")
+|> DF.sort_by(time)
+```
diff --git a/thesis/biblio.bib b/thesis/biblio.bib
index 203e938..bea3669 100644
--- a/thesis/biblio.bib
+++ b/thesis/biblio.bib
@@ -200,3 +200,27 @@
   urldate = {2024-03-08},
   date = {2015},
 }
+
+@misc{rust_documentation_team_btreemap_2024,
+  title = {{BTreeMap} documentation},
+  url = {https://doc.rust-lang.org/stable/std/collections/struct.BTreeMap.html},
+  author = {{Rust Documentation Team}},
+  urldate = {2024-03-08},
+  date = {2024},
+}
+
+@inproceedings{bayer_organization_1970,
+  location = {Houston, Texas},
+  title = {Organization and maintenance of large ordered indices},
+  url = {http://portal.acm.org/citation.cfm?doid=1734663.1734671},
+  doi = {10.1145/1734663.1734671},
+  eventtitle = {the 1970 {ACM} {SIGFIDET} (now {SIGMOD}) Workshop},
+  pages = {107},
+  booktitle = {Proceedings of the 1970 {ACM} {SIGFIDET} (now {SIGMOD}) Workshop on Data Description, Access and Control - {SIGFIDET} '70},
+  publisher = {{ACM} Press},
+  author = {Bayer, R. and {McCreight}, E.},
+  urldate = {2024-03-08},
+  date = {1970},
+  langid = {english},
+  file = {Full Text:/home/aria/Zotero/storage/84VSCDAG/Bayer and McCreight - 1970 - Organization and maintenance of large ordered indi.pdf:application/pdf},
+}
diff --git a/thesis/parts/analysis.tex b/thesis/parts/analysis.tex
deleted file mode 100644
index beeda83..0000000
--- a/thesis/parts/analysis.tex
+++ /dev/null
@@ -1,7 +0,0 @@
-\todo{Cost models vs Asymptotics}
-
-\todo{Accuracy of estimated costs}
-
-\todo{Areas of improvement}
-
-\todo{Developer experience / time to run selection}
diff --git a/thesis/parts/results.tex b/thesis/parts/results.tex
index 17b6088..896fa9d 100644
--- a/thesis/parts/results.tex
+++ b/thesis/parts/results.tex
@@ -14,8 +14,9 @@ We start by looking at our generated cost models, and comparing them both to the
 As we build a total of 51 cost models from our library, we will not examine all of them.
 We look at ones for the most common operations, and group them by containers that are commonly selected together.

-%% ** Insertion operations
+\subsection{Insertion operations}
 Starting with the \code{insert} operation, Figure \ref{fig:cm_insert} shows how the estimated cost changes with the size of the container.
+The lines correspond to our fitted curves, while the points indicate the raw observations they are drawn from.
 To help readability, we group these into regular \code{Container} implementations, and our associative key-value \code{Mapping} implementations.
 \begin{figure}[h]
@@ -27,11 +28,40 @@ To help readability, we group these into regular \code{Container} implementation
   \label{fig:cm_insert}
 \end{figure}

-%% ** Contains operations
-%% ** Comment on some bad/weird ones
+For \code{Vec}, we see that insertion is incredibly cheap, and gets slightly cheaper as the size of the container increases.
+This is to be expected, as Rust's \code{Vec} implementation grows its backing buffer by a constant factor whenever it reaches capacity, so we would expect amortised inserts to require fewer resizes as $n$ increases.

-%% ** Conclusion
+\code{LinkedList} has a more stable, but significantly slower, insertion cost.
+This is likely because it requires a heap allocation for every item inserted, no matter the current size.
+This would also explain why its data points appear more spread out, as it can be hard to predict the performance of kernel calls, even on systems with few other processes running.
+
+It's unsurprising that these two implementations are the cheapest, as they have no ordering or uniqueness guarantees, unlike our other implementations.
+
+\code{HashSet} insertions are the next most expensive; however, the cost appears to rise as the size of the collection goes up.
+This is likely because hash collisions become more likely as the size of the collection increases.
+
+\code{BTreeSet} insertions are also expensive; however, the cost appears to level out as the collection size goes up (a logarithmic curve).
+It's important to note that Rust's \code{BTreeSet} is not based on binary search trees, but on the more general B-Tree originally proposed by R.~Bayer and E.~McCreight\parencite{bayer_organization_1970}, in which each node contains between $B-1$ and $2B-1$ elements in an array.
+\todo{The standard library documentation states that searches are expected to take $B\log(n)$ comparisons on average\parencite{rust_documentation_team_btreemap_2024}, which would explain the logarithm-like growth.}
+
+Our two mapping types, \code{BTreeMap} and \code{HashMap}, mimic the behaviour of their set counterparts.
+
+Our two outlier containers, \code{SortedUniqueVec} and \code{SortedVec}, both have a substantially higher insertion cost, which grows roughly linearly.
+Internally, both of these containers perform a binary search to determine where the new element should go.
+On its own, this would suggest a roughly logarithmic cost.
+However, as most elements are inserted somewhere near the middle of the list, on average half of the list must be copied along for every insert.
+This could explain the roughly linear growth we see.
+
+\todo{Graph this, and justify further}
+
+\subsection{Contains operations}
+
+We now examine the cost of the \code{contains} operation.
+
+\subsection{Outliers / errors}
+
+\subsection{Evaluation}

 %% * Predictions
 \section{Selections}
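Reviewer note on the \code{Vec} paragraph added in results.tex: the "fewer resizes as $n$ increases" claim can be sanity-checked directly by counting capacity changes while pushing. This sketch is not part of the thesis's benchmark suite; `count_resizes` is a hypothetical helper written for illustration, using only the std `Vec` API. Geometric capacity growth means the number of reallocations grows only logarithmically in the number of inserts, which is why the amortised per-insert cost falls.

```rust
// Count how often a Vec reallocates while pushing n elements.
// Because Vec grows its capacity geometrically, the number of
// reallocations is roughly logarithmic in n, so the amortised
// cost per push shrinks as n grows.
fn count_resizes(n: u64) -> u32 {
    let mut v: Vec<u64> = Vec::new();
    let mut cap = v.capacity();
    let mut resizes = 0;
    for i in 0..n {
        v.push(i);
        if v.capacity() != cap {
            resizes += 1;
            cap = v.capacity();
        }
    }
    resizes
}

fn main() {
    // A million pushes trigger only a handful of reallocations.
    let r = count_resizes(1_000_000);
    assert!(r < 30);
    println!("reallocations for 1e6 pushes: {r}");
}
```

The exact count depends on the allocator and Rust version, so the assertion only bounds it loosely rather than pinning a number.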
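Reviewer note on the \code{SortedVec} paragraph: the "binary search, then copy half the list" explanation for the roughly linear insertion cost can be sketched in a few lines. This is an illustrative guess at how such a container might work, not the project's actual implementation (`sorted_insert` is a hypothetical name): `binary_search` finds the insertion point in $O(\log n)$ comparisons, but `Vec::insert` then shifts every later element along, which is $O(n)$ on average and dominates the measured cost.

```rust
// Hypothetical sketch of a sorted-vector insert: find the position with
// a binary search (cheap), then Vec::insert, which shifts all elements
// after that position (the O(n) step that dominates the cost model).
fn sorted_insert(v: &mut Vec<u32>, x: u32) {
    // binary_search returns Ok(i) if x is already at index i, or
    // Err(i) with the insertion point; either index keeps v sorted.
    let idx = v.binary_search(&x).unwrap_or_else(|i| i);
    v.insert(idx, x); // shifts v[idx..] right by one
}

fn main() {
    let mut v = Vec::new();
    for x in [5, 1, 4, 2, 2, 3] {
        sorted_insert(&mut v, x); // duplicates allowed, as in SortedVec
    }
    assert_eq!(v, vec![1, 2, 2, 3, 4, 5]);
}
```

A `SortedUniqueVec` variant would presumably differ only by skipping the insert on `Ok(_)`, which does not change the dominant shift cost.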