collapse and the fastverse: Reflections on the Past, Present and Future
Last week collapse reached 1M downloads off CRAN. This is a milestone that was largely unforeseen for a package that started 4 years ago as a collection of functions intended to ease the R life of an economics master's student. Today, collapse provides cutting-edge performance in many areas of statistical computing and data manipulation, and a breadth of statistical algorithms that can meet applied economists' or statisticians' demands on a programming environment like R. It is also the only programming framework in R that is effectively class-agnostic. Version 1.9.5, just released to CRAN this week, is also the first version that includes Single Instruction Multiple Data (SIMD) instructions for a limited set of operations. The future will see more efforts to take advantage of the capabilities of modern processors.
Meanwhile, the fastverse – a lightweight collection of C/C++-based R packages for statistical computing and data manipulation – is becoming more popular as an alternative to the tidyverse for data analysis and as a backend to statistical packages developed for R – a welcome and needed trend.
It is on this positive occasion that I decided it was the right time to provide you with a personal note, or rather, some reflections, regarding the history, present state, and the future of collapse and the fastverse.
The Past
collapse started in 2019 as a small package with only two functions: collap() – intended to facilitate the aggregation of mixed-type data in R, and qsu() – intended to facilitate summarizing panel data in R. Both were inspired by STATA's collapse and (xt)summarize commands, and implemented with data.table as a backend. The package – called collapse alluding to the STATA command – would probably have stayed this way, had unforeseen events not affected my career plans.
Having completed a master's in international economics in summer 2019, I was preparing for a two-year posting as an ODI Fellow at the Central Bank of Papua New Guinea in the fall. However, things did not work out: I had difficulties getting a working visa and there were coordination issues with the Bank, so ODI decided to offer me a posting in the Ugandan Ministry of Finance, starting in January 2020. This gave me 4 months, September–December 2019, during which I ended up writing a new backend for collapse – in C++.
While collapse with the data.table backend was never particularly slow, some of the underlying metaprogramming seemed arcane, especially because I wanted to utilize data.table's GForce optimizations, which require the aggregation function to be recognizable in the call so that data.table can internally replace it with an optimized version. But there were also statistical limitations. As an economist, I often employed sampling or trade weights in statistics, and in this respect, R was quite limited. There was also no straightforward way to aggregate categorical data using, as I would have it, a (weighted) statistical mode function. I also felt R was lacking basic things in the time series domain – evidenced by the lengths I went to in handling (irregular) trade panels. Finally, I felt limited by the division of software development around different classes in R. I found data.table useful for analytics, but the class too complex to behave in predictable ways. Thus I often ended up converting back to 'data.frame' or 'tibble' to use functions from a different package. Sometimes it would also have been practical to simply keep data as a vector or matrix – in linear-algebra-heavy programs – but I needed data.table to do something 'by groups'. So, in short, my workflow in R employed frequent object conversions, faced statistical limitations, and, in the case of early collapse's data.table backend, also involved tedious metaprogramming.
The will for change pushed me to rethink, in practical terms, the way statistics could be done in R. This required a framework that encompassed increased statistical complexity, including advanced statistical algorithms like (weighted) medians, quantiles, and modes, support for (irregular) time series and panels, etc., and enabled these operations to be vectorized efficiently across many groups and columns without a limiting syntax that would again encourage metaprogramming. The framework would also need to be class-agnostic/support multiple R objects and classes, to easily integrate with different frameworks and reduce the need for object conversions. These considerations led to the creation of a comprehensive set of S3 generic grouped and weighted Fast Statistical Functions for vectors, matrices, and data.frame-like objects, initially programmed fully in C++. The functions natively supported R factors for grouping. To facilitate programming further, I created multivariate grouping ('GRP') objects that could be used to perform multiple statistical operations across the same groups without grouping overhead. With this backend at hand, it was easy to reimplement collap()^{1}, and also provide a whole array of other useful functions, including dplyr-like functions like fgroup_by(), and time series functions that could be used ad hoc but also supported plm's indexed 'pseries' and 'pdata.frame' classes. collapse 1.0.0, released to CRAN on 19th March 2020 (me sitting in the Ugandan finance ministry), was already a substantial piece of statistical software offering cutting-edge performance (see the benchmarks in the introductory blog post).
To cut a long story short, over the following 3 years collapse became better, broader, and faster in multiple iterations. Additional speed came especially from rewriting central parts of the package in C – reimplementing some core algorithms in C rather than relying on the C++ standard library or Rcpp sugar – as well as from introducing data transformation by reference and OpenMP multithreading. For example, fmode(), rewritten from C++ to C for v1.8.0 (May 2022), became about 3x faster in serial mode (grouped execution), with additional gains through multithreading across groups. Other noteworthy functionality included a modern reimplementation of 'pseries' and 'pdata.frame' through 'indexed_frame' and 'indexed_series' classes, fully fledged fsummarise(), fmutate() and across() functions enabling tidyverse-like programming with vectorization for Fast Statistical Functions in the backend, a set of functions facilitating memory-efficient R programming and low-cost data object conversions, functions to effectively deal with (nested) lists of data objects – such as unlisting to data frame with unlist2d() – and additional descriptive statistical tools like qtab() and descr(). 2022 in particular saw two major updates, v1.7 and v1.8, as well as the bulk of development for v1.9 – released in January 2023. In improving collapse, I always took inspiration from other packages, most notably data.table, kit, dplyr, fixest, and R itself, to which I am highly indebted. The presentation of collapse at UseR! 2022 in June 2022 marks another milestone in its establishment in the R community.
While using R and improving collapse, I became increasingly aware that I was not alone in the pursuit of making R faster and statistically more powerful. Apart from well-established packages like data.table, matrixStats, and fst, I noticed and started using many smaller packages like kit, roll, stringfish, qs, Rfast, coop, fastmap, fasttime, rrapply, etc., aimed at improving particular aspects of R in a statistical or computational sense, often offering clean C or C++ implementations with few R-level dependencies. I saw a pattern of common traits and development efforts that were largely complementary and needed encouragement. My impression at the time – largely unaltered today – was that such efforts were ignored by large parts of the R user community. One reason is of course the lack of visibility and coordination, compared to institutional stakeholders like RStudio and H2O backing the tidyverse and data.table. Another consideration, it seemed to me, was that the tidyverse is particularly popular simply because there exists an R package and website called tidyverse which loads a set of packages that work well together, and thus relieves users of the burden of searching CRAN and choosing their own collection of data manipulation packages.
Thus I decided in early 2021 to also create a meta-package and GitHub repo called fastverse and use it to promote high-performance R packages with few dependencies. The first version, 0.1.6, made it to CRAN in August 2021, attaching 6 core packages (data.table, collapse, matrixStats, kit, fst and magrittr) and allowing easy extension with additional packages using the fastverse_extend() function. With 7 total dependencies instead of 80, it was a considerably more lightweight and computationally powerful alternative to the tidyverse. The README of the GitHub repository has grown largely thanks to suggestions from the community and now lists many of the highest-performing and (mostly) lightweight R packages. Over time I also introduced more useful functionality into the fastverse package, such as the ability to configure the environment and the set of packages attached using a .fastverse file, an option to install missing packages on the fly, and the fastverse_child() function to create wholly separate package verses. Observing my own frequent usage of data.table, collapse, kit, and magrittr in combination, I ran a poll on Twitter in fall 2022 suggesting the removal of matrixStats and fst from the core set of packages – which was accepted and implemented from v0.3.0 (November 2022). The fastverse package has thus become an extremely lightweight, customizable, and fast tidyverse alternative.
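In practice, extending the core set looks like this (a sketch; xts and roll here are just example packages, and `install = TRUE` installs them if missing):

```r
library(fastverse)

# Attach additional high-performance packages for the current session
# (install = TRUE installs any that are missing)
fastverse_extend(xts, roll, install = TRUE)

# A permanent project-level configuration is also possible: list the
# desired packages in a '.fastverse' file in the project root, e.g.
#   data.table
#   collapse
#   kit
#   xts
```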
The Present
Today, both collapse and fastverse are well established in a part of the R community closer to econometrics and high-performance statistics. A growing number of econometric packages benefit from collapse as a computational backend, most notably the well-known plm package – which experienced order-of-magnitude performance gains. I am also developing dfms (first CRAN release October 2022), demonstrating that very efficient estimation of Dynamic Factor Models is possible in R by combining collapse and RcppArmadillo. collapse is also powering various shiny apps across the web. I ended up creating a collapse-powered public macroeconomic data portal for Uganda, and later, at the Kiel Institute for the World Economy, for Africa at large.
So collapse has made it into production in my own work and the work of others. Core benefits in my experience are that it is lightweight to install on a server, has very low baseline function execution times (a few microseconds rather than the milliseconds of most other frameworks) making for speedy reaction times, scales very well to large data, and supports multiple R objects and modes of programming – reducing the need for metaprogramming. Since my own work and the work of others depends on it, API stability has always been important. collapse has not seen any major API changes in updates v1.7–v1.9, and currently no further API changes are planned. This lightweight and robust nature – characteristic of all core fastverse packages, esp. data.table – stands in contrast to dplyr, whose core API involving summarise(), mutate() and across() keeps changing to such an extent that at some point in 2022 I removed the unit tests of fsummarise() and fmutate() against the dplyr versions from CRAN.
Apart from development, it has also been very fun using the fastverse in the wild for some research projects. Lately, I’ve been working a lot with geospatial data, where the fastverse has enabled numerous interesting applications.
For example, I was interested in how the area of OSM buildings needs to be scaled using a power weight to correlate optimally with nightlights luminosity within a million cells of populated places in Sub-Saharan Africa. Having extracted around 12 million buildings from OSM, I programmed the following objective function and optimized it for power weights between 0.0001 and 5.
```r
library(fastverse)
library(microbenchmark)

a <- abs(rnorm(12e6, 100, 100))        # Think of this as building areas in m^2
g <- GRP(sample.int(1e6, 12e6, TRUE))  # Think of this as grid cells
y <- fsum(a^1.5, g, use.g.names = FALSE) +  # Think of this as nightlights
  rnorm(g$N.groups, sd = 10000)
length(y)
## [1] 999989

# Objective function
cor_ay <- function(w, a, y) {
  aw_agg = fsum(a^w, g, use.g.names = FALSE, na.rm = FALSE)
  cor(aw_agg, y)
}

# Checking the speed of the objective
microbenchmark(cor_ay(2, a, y))
## Unit: milliseconds
##             expr      min      lq     mean   median       uq      max neval
##  cor_ay(2, a, y) 30.42331 32.1136 35.02078 34.43118 36.75326 55.36505   100

# Now the optimization
system.time(res <- optimise(cor_ay, c(0.0001, 5), a, y, maximum = TRUE))
##  user  system elapsed 
##  1.375   0.051   1.427 
res
## $maximum
## [1] 1.501067
## 
## $objective
## [1] 0.5792703
```
The speed of the objective due to GRP()
and fsum()
^{2} allowed further subdivision of buildings into different classes, and experimentation with finer spatial resolutions.
Another recent application involved finding the 100 nearest neighbors for each of around 100,000 cells (rows) in a rich geospatial dataset with about 50 variables (columns), and estimating a simple proximity-weighted linear regression of an outcome of interest y on a variable of interest z. Since computing a distance matrix on 100,000 rows upfront is infeasible memory-wise, I needed to go row-by-row. Here the functions dapply(), fdist() and flm() from collapse, and topn() from kit, came in very handy.
```r
# Generating the data
X <- rnorm(5e6)      # 100,000 x 50 matrix of characteristics
dim(X) <- c(1e5, 50)
z <- rnorm(1e5)      # Characteristic of interest
y <- z + rnorm(1e5)  # Outcome of interest
# Transform X to compute Mahalanobis distance (takes account of correlations)
Xm <- t(forwardsolve(t(chol(cov(X))), t(X)))

# Coefficients for a single row
coef_i <- function(xi) {
  mdist = fdist(Xm, xi)                           # Distances to all rows
  best_idx = topn(mdist, 101L, decreasing = FALSE)
  best_idx = best_idx[mdist[best_idx] > 0]        # Removing the point itself (mdist = 0)
  weights = 1 / mdist[best_idx]                   # Distance weights
  flm(y[best_idx], z[best_idx], weights, add.icpt = TRUE)  # Weighted lm coefficients
}

# Benchmarking a single execution
microbenchmark(coef_i(Xm[1L, ]))
## Unit: microseconds
##              expr     min      lq     mean   median       uq      max neval
##  coef_i(Xm[1L, ]) 927.051 965.591 1149.184 998.473 1031.068 3167.988   100

# Compute coefficients for all rows
system.time(coefs <- dapply(Xm, coef_i, MARGIN = 1))
##    user  system elapsed 
## 214.942  10.322 114.123 
head(coefs, 3)
##         coef_i1  coef_i2
## [1,] 0.04329208 1.189331
## [2,] 0.20741015 1.107963
## [3,] 0.02860692 1.106427
```
Due to the efficiency of fdist()
and topn()
, a single call to the function takes around 1.2 milliseconds on the M1, giving a total execution time of around 120 seconds for 100,000 iterations of the program – one for each row of Xm
.
A final recent application involved creating geospatial GINI coefficients for South Africa using remotely sensed population and nightlights data. Since population data from WorldPop and Nightlights from Google Earth Engine are easily obtained from the web, I reproduce the exercise here in full.
```r
# Downloading 1km2 UN-adjusted population data for South Africa from WorldPop
pop_year <- function(y) sprintf("https://data.worldpop.org/GIS/Population/Global_2000_2020_1km_UNadj/%i/ZAF/zaf_ppp_%i_1km_Aggregated_UNadj.tif", y, y)
for (y in 2014:2020)
  download.file(pop_year(y), mode = "wb",
                destfile = sprintf("data/WPOP_SA_1km_UNadj/zaf_ppp_%i_1km_Aggregated_UNadj.tif", y))
```
VIIRS Nightlights are available on Google Earth Engine on a monthly basis from 2014 to 2022. I extracted annual median composites for South Africa using instructions found here and saved them to my Google Drive^{3}.
```r
# Reading population files using terra and creating a single data.table
pop_data <- list.files(pop_path) %>%
  set_names(substr(., 9, 12)) %>%
  lapply(function(x) paste0(pop_path, "/", x) %>%
           terra::rast() %>%
           terra::as.data.frame(xy = TRUE) %>%
           set_names(c("lon", "lat", "pop"))) %>%
  unlist2d("year", id.factor = TRUE, DT = TRUE) %>%
  fmutate(year = as.integer(levels(year))[year]) %T>%
  print(2)
##           year      lon       lat      pop
##        1: 2014 29.62792 -22.12958 38.21894
##        2: 2014 29.63625 -22.12958 19.25175
##       ---                                 
## 11420562: 2020 37.83625 -46.97958  0.00000
## 11420563: 2020 37.84458 -46.97958  0.00000

# Same for nightlights
nl_data <- list.files(nl_annual_path) %>%
  set_names(substr(., 1, 4)) %>%
  lapply(function(x) paste0(nl_annual_path, "/", x) %>%
           terra::rast() %>%
           terra::as.data.frame(xy = TRUE) %>%
           fsubset(avg_rad %!=% -9999)) %>%  # Values outside land area of SA, coded using a mask in GEE
  unlist2d("year", id.factor = TRUE, DT = TRUE) %>%
  frename(x = lon, y = lat) %>%
  fmutate(year = as.integer(levels(year))[year]) %T>%
  print(2)
##           year      lon       lat    avg_rad
##        1: 2014 29.64583 -22.12500 0.08928293
##        2: 2014 29.65000 -22.12500 0.04348474
##       ---                                   
## 58722767: 2022 20.00833 -34.83333 0.37000000
## 58722768: 2022 20.01250 -34.83333 0.41000000
```
Since nightlights are available up to 2022 but population only up to 2020, I did a crude cell-level population forecast for 2021 and 2022 based on 1.6 million linear models of cell-level population between 2014 and 2020.
```r
# Unset unneeded options for greater efficiency
set_collapse(na.rm = FALSE, sort = FALSE)

system.time({
  # Forecasting population in 1.6 million grid cells based on linear regression
  pop_forecast <- pop_data %>%
    fgroup_by(lat, lon) %>%
    fmutate(dm_year = fwithin(year)) %>%
    fsummarise(pop_2020 = flast(pop),
               beta = fsum(pop, dm_year) %/=% fsum(dm_year^2)) %>%
    fmutate(pop_2021 = pop_2020 + beta,
            pop_2022 = pop_2021 + beta, beta = NULL)
})
##  user  system elapsed 
## 0.210   0.054   0.263 
head(pop_forecast)
##          lat      lon  pop_2020  pop_2021  pop_2022
## 1: -22.12958 29.62792  52.35952  54.35387  56.34821
## 2: -22.12958 29.63625  23.65122  24.44591  25.24060
## 3: -22.12958 29.64458  33.29427  34.10418  34.91409
## 4: -22.12958 29.65292 194.08760 216.78054 239.47348
## 5: -22.12958 29.66125 123.92527 139.52940 155.13353
## 6: -22.13792 29.56958  13.61020  13.39950  13.18880
```
The above expression is an optimized version of univariate linear regression: beta = cov(pop, year)/var(year) = sum(pop * dm_year) / sum(dm_year^2), where dm_year = year - mean(year), fully vectorized across 1.6 million groups. Two further tricks are applied here: fsum() has an argument for sampling weights, which I utilize here instead of writing fsum(pop * dm_year), which would require materializing a vector pop * dm_year before summing. The division by reference (%/=%) saves another unneeded copy. The expression could also have been written in one line as fsummarise(beta = fsum(pop, W(year)) %/=% fsum(W(year)^2)); in any case, 3/4 of the computation time here is actually spent on grouping 11.4 million records by lat and lon.
```r
# Appending population data with cell-level forecasts for 2021 and 2022
pop_data_forecast <- rbind(pop_data,
  pop_forecast %>%
    fselect(-pop_2020) %>%
    rm_stub("pop_") %>%
    melt(1:2, variable.name = "year", value.name = "pop") %>%
    fmutate(year = as.integer(levels(year))[year]) %>%
    colorder(year))
```
As you may have noticed, the nightlights data has a higher resolution of around 464m than the population data at 1km resolution. To match the two datasets, I use a function that transforms the coordinates to a rectilinear grid of a certain size in km, using an Approximation to the Haversine Formula which rescales longitude coordinates based on the latitude coordinate (to have them approximately represent distance as at the equator). The coordinates are then divided by the grid size in km transformed to degrees at the equator, and the modulus from this division is removed. Afterward, half of the grid size is added again, reflecting the grid centroids. Finally, longitudes are rescaled back to their original extent using the same scale factor.
```r
# Transform coordinates to cell centroids of a rectilinear square grid of a certain size in kms
round_to_kms_fast <- function(lon, lat, km, round = TRUE, digits = 6) {
  degrees = km / (40075.017 / 360)  # Gets the degree-distance of the kms at the equator
  div = if(round) round(degrees, digits) else degrees  # Round to precision
  res_lat = TRA(lat, div, "-%%") %+=% (div/2)  # This transforms the data to the grid centroid
  scale_lat = cos(res_lat * pi/180)  # Approx. scale factor based on grid centroid
  res_lon = setTRA(lon * scale_lat, div, "-%%") %+=% (div/2) %/=% scale_lat
  return(list(lon = res_lon, lat = res_lat))
}
```
The virtue of this approach, while it appears crude and does not fully respect the spherical earth model, is that it allows arbitrary grid sizes and transforms coordinates from different datasets in the same way. To determine the grid size, I take the largest 2-digit grid size that keeps the population cells unique, i.e., the largest number such that:
```r
pop_data %>%
  ftransform(round_to_kms_fast(lon, lat, 0.63)) %>%
  fselect(year, lat, lon) %>%
  any_duplicated()
## [1] FALSE
```
It turns out that 0.63km is the ideal grid size. I apply this to both datasets and merge them, aggregating nightlights using the mean^{4}.
```r
system.time({
  nl_pop_data <- pop_data_forecast %>%
    ftransform(round_to_kms_fast(lon, lat, 0.63)) %>%
    merge(nl_data %>%
            ftransform(round_to_kms_fast(lon, lat, 0.63)) %>%
            fgroup_by(year, lat, lon) %>%
            fmean(),
          by = .c(year, lat, lon))
})
##  user  system elapsed 
## 8.195   1.380   4.280 
head(nl_pop_data, 2)
## Key: <year, lat, lon>
##    year       lat      lon      pop    avg_rad
## 1: 2014 -34.82266 19.98068 2.140570 0.07518135
## 2: 2014 -34.82266 19.98758 4.118959 0.09241374
```
Given the matched data, I define a function to compute the weighted GINI coefficient and an unweighted version for comparison.
```r
# Taken from Wikipedia: with small-sample correction
gini_wiki <- function(x) 1 + 2/(length(x)-1) * (sum(seq_along(x)*sort(x)) / sum(x) - length(x))

# No small-sample correction
gini_noss <- function(x) 2/length(x) * sum(seq_along(x)*sort(x)) / sum(x) - (length(x)+1)/length(x)

# Closed-form sum of the q consecutive integers starting at k+1 (vectorized)
Skp1qm1 <- function(k, q) (q-1)/2 * (2*(k+1) + q) + k + 1
all.equal(Skp1qm1(30-1, 70+1), sum(30:100))
## [1] TRUE

# Weighted GINI (by default without correction)
w_gini <- function(x, w, sscor = FALSE) {
  o = radixorder(x)
  w = w[o]
  x = x[o]
  sw = sum(w)
  csw = cumsum(w)
  sx = Skp1qm1(c(0, csw[-length(csw)]), w)  # Sum of weighted ranks per observation
  if(sscor) return(1 + 2/(sw-1)*(sum(sx*x) / sum(x*w) - sw))
  2/sw * sum(sx*x) / sum(x*w) - (sw+1)/sw
}
```
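As a quick self-contained sanity check (base R, toy numbers of my own), the rank-based formula used in gini_noss is equivalent to the mean-absolute-difference definition of the GINI:

```r
x <- c(1, 2, 4, 8, 16)  # toy 'luminosity' values
n <- length(x)

# Rank-based formula (as in gini_noss, no small-sample correction)
g_rank <- 2/n * sum(seq_along(x) * sort(x)) / sum(x) - (n + 1)/n

# Mean-absolute-difference definition: sum |x_i - x_j| / (2 n^2 mean(x))
g_mad <- sum(abs(outer(x, x, "-"))) / (2 * n^2 * mean(x))

all.equal(g_rank, g_mad)
## [1] TRUE
```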
This computes the populationweighted and unweighted GINI on a percentage scale for each year.
```r
raw_gini_ts <- nl_pop_data %>%
  fsubset(pop > 0 & avg_rad > 0) %>%
  fgroup_by(year) %>%
  fsummarise(gini = gini_noss(avg_rad)*100,
             w_gini = w_gini(avg_rad, pop)*100) %T>%
  print()
##    year     gini   w_gini
## 1: 2014 79.34750 55.51574
## 2: 2015 91.35048 55.06437
## 3: 2016 92.16993 54.75063
## 4: 2017 55.96135 53.53097
## 5: 2018 59.87219 52.84233
## 6: 2019 64.43899 52.23766
## 7: 2020 53.05498 51.15202
## 8: 2021 52.19359 50.26020
## 9: 2022 48.07294 49.69182

# Plotting
library(ggplot2)
raw_gini_ts %>%
  melt(1) %>%
  ggplot(aes(x = year, y = value, colour = variable)) +
  geom_line()
```
As evident from the plot, the population-weighted GINI is smoother, which could be due to unpopulated areas exhibiting greater fluctuations in nightlights (such as fires or flares).
A final thing that we can do is calibrate the GINI to an official estimate. I use the africamonitor R API to get World Bank GINI estimates for South Africa.
```r
WB_GINI <- africamonitor::am_data("ZAF", "SI_POV_GINI") %T>% print()
## Key: <Date>
##          Date SI_POV_GINI
## 1: 1993-01-01        59.3
## 2: 2000-01-01        57.8
## 3: 2005-01-01        64.8
## 4: 2008-01-01        63.0
## 5: 2010-01-01        63.4
## 6: 2014-01-01        63.0
```
The last estimate in the series is from 2014, putting the GINI at 63%. To bring the nightlights data in line with this estimate, I again use optimise() to determine an appropriate power weight:
```r
np_pop_data_pos_14 <- nl_pop_data %>%
  fsubset(pop > 0 & avg_rad > 0 & year == 2014, year, pop, avg_rad)

objective <- function(k) {
  nl_gini = np_pop_data_pos_14 %$% w_gini(avg_rad^k, pop) * 100
  abs(63 - nl_gini)
}

res <- optimise(objective, c(0.0001, 5)) %T>% print()
## $minimum
## [1] 1.308973
## 
## $objective
## [1] 0.0002598319
```
With the ideal weight determined, it is easy to obtain a final calibrated nightlightsbased GINI series and use it to extend the World Bank estimate.
```r
final_gini_ts <- nl_pop_data %>%
  fsubset(pop > 0 & avg_rad > 0) %>%
  fgroup_by(year) %>%
  fsummarise(nl_gini = w_gini(avg_rad^res$minimum, pop)*100) %T>%
  print()
##    year  nl_gini
## 1: 2014 62.99974
## 2: 2015 62.49832
## 3: 2016 62.22387
## 4: 2017 61.22247
## 5: 2018 60.54190
## 6: 2019 59.82656
## 7: 2020 58.83987
## 8: 2021 57.93322
## 9: 2022 57.51001

final_gini_ts %>%
  merge(WB_GINI %>% fcompute(year = year(Date), wb_gini = SI_POV_GINI),
        by = "year", all = TRUE) %>%
  melt("year", na.rm = TRUE) %>%
  ggplot(aes(x = year, y = value, colour = variable)) +
  geom_line() +
  scale_y_continuous(limits = c(50, 70))
```
It should be noted at this point that this estimate, and the declining trend it shows, may be seriously misguided. Research by Galimberti et al. (2020), using the old DMSP-OLS nightlights series from 1992–2013 for 234 countries and territories, shows that nightlights-based inequality measures resemble the cross-sectional variation in inequality between countries much better than the time series dimension within countries.
The example is nevertheless instructive in showing how the fastverse, in various respects, facilitates and enables complex data science in R.
The Future
Future development of collapse will see increased use of SIMD instructions to further improve performance. The impact of such instructions – visible in frameworks like Apache Arrow and Python's polars (which is based on Arrow) – can be considerable. The following benchmark computes the means of a matrix with 100 columns and 1 million rows using base R, collapse 1.9.0 (no SIMD), and collapse 1.9.5 (with SIMD).
```r
library(collapse)
library(microbenchmark)
fmean19 <- collapsedev19::fmean  # collapse 1.9.0 (no SIMD)
m <- rnorm(1e8)
dim(m) <- c(1e6, 100)  # matrix with 100 columns and 1 million rows
microbenchmark(
  colMeans(m),
  fmean19(m, na.rm = FALSE),
  fmean(m, na.rm = FALSE),
  fmean(m),  # default is na.rm = TRUE, can be changed with set_collapse()
  fmean19(m, nthreads = 4, na.rm = FALSE),
  fmean(m, nthreads = 4, na.rm = FALSE),
  fmean(m, nthreads = 4)
)
## Unit: milliseconds
##                                     expr      min       lq     mean   median       uq       max neval
##                              colMeans(m) 93.09308 97.52766 98.80975 97.99094 99.05563 190.67317   100
##                fmean19(m, na.rm = FALSE) 93.04056 97.47590 97.68101 98.05097 99.14058 100.48612   100
##                  fmean(m, na.rm = FALSE) 12.75202 13.04181 14.05289 13.49043 13.81448  18.79206   100
##                                 fmean(m) 12.67806 13.02974 14.02059 13.49638 13.81009  18.72581   100
##  fmean19(m, nthreads = 4, na.rm = FALSE) 24.84251 25.20573 26.12640 25.52416 27.08612  28.71300   100
##    fmean(m, nthreads = 4, na.rm = FALSE) 13.07941 13.18853 13.96326 13.38853 13.68627  18.04652   100
##                   fmean(m, nthreads = 4) 13.05813 13.18277 13.99704 13.33753 13.71505  19.18242   100
```
Despite these impressive results, I am somewhat doubtful that much of collapse will benefit from SIMD. The main reason is that SIMD is a low-level form of vectorization that can be used to speed up simple operations like addition, subtraction, division, and multiplication. This is especially effective with large amounts of adjacent data. But with many groups and little data in each group, serial programming can be just as efficient or even more efficient if it allows writing grouped operations in a non-nested way. So it depends on the data-to-groups ratio. My arrow benchmark from August 2022 showed just that: with few groups relative to the data size, arrow considerably outperforms collapse and data.table, but with more groups the latter catch up considerably, and collapse took the lead with many very small groups. More complex statistical algorithms like the median (involving selection) or the mode / distinct value count (involving hashing) also cannot (to my knowledge) benefit from SIMD, and here the collapse implementations are already pretty much state of the art.
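The data-to-groups ratio argument can be probed directly with a sketch like the following (group counts are made up for illustration; exact timings are machine-dependent):

```r
library(collapse)
library(microbenchmark)

x <- rnorm(1e6)
g_few  <- GRP(sample.int(10L, 1e6, TRUE))  # 10 large groups
g_many <- GRP(sample.int(5e5, 1e6, TRUE))  # ~500,000 tiny groups

# With few groups each mean aggregates ~100,000 elements and low-level
# vectorization pays off; with tiny groups per-group overhead dominates
microbenchmark(fmean(x, g_few), fmean(x, g_many), times = 10)
```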
Apart from additional vectorization, I am also considering broadening the package to support further data manipulation operations such as table joins. This may take a while for me to get into, though, so I cannot promise an update including joins in 2023. At this stage, I am very happy with the API, so no changes are planned here, and I will also try to keep collapse harmonious with other fastverse packages, in particular data.table and kit.
Most of all, I hope to see an increased breadth of statistical R packages using collapse as a backend, so that its potential for improving the performance and complexity of statistical software is realized in the community. I have in the past assisted package maintainers interested in developing collapse backends and hope to further increase collaborations along these lines.
Lastly, I wish to thank all users who provided feedback and inspiration or promoted this software in the community, and, more generally, all the people who encouraged, contributed to, and facilitated these projects. Much credit is also due to the CRAN maintainers, who endured many of my mistakes and insisted on high standards, which made collapse better and more robust.

^{2} Actually, what is taking most time here is raising 12 million data points to a fractional power.

^{3} You can download the composites for South Africa from my drive. It actually took me a while to figure out how to properly extract the images from Earth Engine. You may find my answer here helpful if you want to export images for other countries.

^{4} Not the sum, as there could be a differing number of nightlights observations for each cell.