The CSV package offers a very memory-efficient way of iterating through the rows of a (possibly larger-than-memory) CSV file.
Here is a toy example (Iris dataset) of how to iterate through the rows of a CSV file one-by-one and calculate histograms grouped by another variable.
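```julia
using OnlineStats, CSV, Plots

url = "https://gist.githubusercontent.com/joshday/df7bdaa1d58b398592e7656395de6335/raw/5a1c83f498f8ca7e25ff2372340e44b3389be9b1/iris.csv"

# Iterate over the rows lazily; `reusebuffer = true` reuses one buffer for every row
rows = CSV.Rows(download(url); reusebuffer = true)

# Lazily map each row to a `group => value` pair
itr = (string(row.variety) => parse(Float64, row.sepal_length) for row in rows)

# One histogram of sepal length per variety
o = GroupBy(String, Hist(4:0.25:8))
fit!(o, itr)

plot(o, layout=(3,1))
```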
The ThreadsX package offers multithreaded implementations of many functions in Base and supports OnlineStats via `ThreadsX.reduce`.
- See "A quick introduction to data parallelism in Julia" by ThreadsX author Takafumi Arakaki (@tkf) for more details.
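A minimal sketch of how the two packages fit together, assuming both are installed. It relies on `ThreadsX.reduce` accepting a mergeable OnlineStat as its first argument, as described in the ThreadsX README; the input `1:10` is only a stand-in for real data.

```julia
using ThreadsX, OnlineStats

# ThreadsX.reduce takes the stat as its first argument and the data as its
# second; the work may be split across threads and the partial results merged.
m = ThreadsX.reduce(Mean(), 1:10)

nobs(m)   # number of observations seen: 10
value(m)  # mean of 1:10, which is 5.5
```

Start Julia with more than one thread (for example `julia -t 4`) for the reduction to actually run in parallel.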
OnlineStats can be merged together to facilitate embarrassingly parallel computations.
In general, `fit!` is a cheaper operation than `merge!`.
Not every OnlineStat can be merged. In these cases, OnlineStats either uses an approximation or provides a warning that no merging occurred.
```julia
y1 = randn(10_000)
y2 = randn(10_000)
y3 = randn(10_000)

a = Series(Mean(), Variance(), KHist(20))
b = Series(Mean(), Variance(), KHist(20))
c = Series(Mean(), Variance(), KHist(20))

fit!(a, y1)
fit!(b, y2)
fit!(c, y3)

merge!(a, b)  # merge `b` into `a`
merge!(a, c)  # merge `c` into `a`
```
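One way to convince yourself that merging behaves as expected is to compare a merged stat against one fitted on all of the data in a single pass. The sketch below does this for `Mean` and `Variance` only; `KHist` is left out because, as far as I know, its merge is approximate. The names `full` and `agree` are illustrative, not part of the package.

```julia
using OnlineStats

y1 = randn(10_000)
y2 = randn(10_000)

# Fit each chunk separately, then merge the second stat into the first.
a = fit!(Series(Mean(), Variance()), y1)
b = fit!(Series(Mean(), Variance()), y2)
merge!(a, b)

# Fit a single Series on all of the data for comparison.
full = fit!(Series(Mean(), Variance()), vcat(y1, y2))

# Both have seen every observation, and the estimates should agree up to
# floating-point error (the absolute tolerance guards against a mean near zero).
same_count = nobs(a) == nobs(full)
agree = all(isapprox.(value(a), value(full); atol = 1e-8))
```

Both `same_count` and `agree` should be `true` here, which is what makes the split-fit-merge pattern safe for these two stats.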
```julia
using Distributed

addprocs(3)
@everywhere using OnlineStats

s = @distributed merge for i in 1:3
    o = Series(Mean(), Variance(), KHist(20))
    fit!(o, randn(10_000))
end
```