Big Data
OnlineStats + CSV
The CSV package offers a very memory-efficient way of iterating through the rows of a (possibly larger-than-memory) CSV file.
Example
Here is a toy example (Iris dataset) of how to iterate through the rows of a CSV file one-by-one and calculate histograms grouped by another variable.
using OnlineStats, CSV, Plots
url = "https://gist.githubusercontent.com/joshday/df7bdaa1d58b398592e7656395de6335/raw/5a1c83f498f8ca7e25ff2372340e44b3389be9b1/iris.csv"
rows = CSV.Rows(download(url); reusebuffer = true)
itr = (string(row.variety) => parse(Float64, row.sepal_length) for row in rows)
o = GroupBy(String, Hist(4:0.25:8))
fit!(o, itr)
plot(o, layout=(3,1))Threaded Parallelism
The ThreadsX package offers multithreaded implementations of many functions in Base and supports OnlineStats via ThreadsX.reduce(::OnlineStat, data).
- See "A quick introduction to data parallelism in Julia" by ThreadsX author Takafumi Arakaki (
@tkf) for more details.
Distributed Parallelism
OnlineStats can be merged together to facilitate Embarassingly parallel computations.
Not every OnlineStat can be merged. In these cases, OnlineStats either uses an approximation or provides a warning that no merging occurred.
Examples
Simplified (Not Actually in Parallel)
y1 = randn(10_000)
y2 = randn(10_000)
y3 = randn(10_000)
a = Series(Mean(), Variance(), KHist(20))
b = Series(Mean(), Variance(), KHist(20))
c = Series(Mean(), Variance(), KHist(20))
fit!(a, y1)
fit!(b, y2)
fit!(c, y3)
merge!(a, b) # merge `b` into `a`
merge!(a, c) # merge `c` into `a`In Parallel
using Distributed
addprocs(3)
@everywhere using OnlineStats
s = @distributed merge for i in 1:3
o = Series(Mean(), Variance(), KHist(20))
fit!(o, randn(10_000))
end