Big Data

OnlineStats + CSV

The CSV package offers a very memory-efficient way of iterating through the rows of a (possibly larger-than-memory) CSV file.

Example

Here is a toy example (using the Iris dataset) that iterates through the rows of a CSV file one by one and calculates a histogram of one variable for each level of a grouping variable.

using OnlineStats, CSV, Plots

url = "https://gist.githubusercontent.com/joshday/df7bdaa1d58b398592e7656395de6335/raw/5a1c83f498f8ca7e25ff2372340e44b3389be9b1/iris.csv"

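# Lazily iterate over rows; `reusebuffer = true` reuses a single buffer across rows
rows = CSV.Rows(download(url); reusebuffer = true)

# Generator of (group => value) pairs; CSV.Rows yields strings, so parse the number
itr = (row.variety => parse(Float64, row.sepal_length) for row in rows)

# One histogram (bin edges 4:0.25:8) per distinct `variety`
o = GroupBy(String, Hist(4:0.25:8))

fit!(o, itr)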

plot(o, layout=(3,1))
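
To see what the grouped fit produces without a download or a plot, the same pattern can be sketched on a small in-memory collection. The names `group_values` and the use of `Mean()` in place of `Hist` are illustrative choices, not part of the example above, and the sketch assumes that `value(o)` on a `GroupBy` returns the dictionary mapping each group to its own stat.

```julia
using OnlineStats

# Hypothetical stand-in for the (group => value) pairs generated from CSV rows
group_values = ["a" => 1.0, "b" => 10.0, "a" => 3.0, "b" => 20.0]

# One running mean per distinct group key
o = GroupBy(String, Mean())
fit!(o, group_values)

# Assumption: value(o) is a dictionary of group => stat
for (group, stat) in value(o)
    println(group, ": n = ", nobs(stat), ", mean = ", value(stat))
end
```

Each group accumulates its own copy of the stat, so memory use grows with the number of distinct groups rather than with the number of rows.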

Threaded Parallelism

The ThreadsX package offers multithreaded implementations of many functions in Base and supports OnlineStats via ThreadsX.reduce(::OnlineStat, data).
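
A minimal sketch of that call, assuming ThreadsX is installed and Julia was started with more than one thread (for example `julia -t 4`); with a single thread the code still runs, just without a speedup. The variable names are illustrative.

```julia
using OnlineStats, ThreadsX

data = 1:10

# Sequential fit, for reference
sequential = fit!(Mean(), data)

# Multithreaded reduction: chunks of `data` are fitted separately and then merged
threaded = ThreadsX.reduce(Mean(), data)

println(value(sequential))
println(value(threaded))
```

Because the threaded version combines partial results by merging, the two values should agree up to floating-point rounding rather than bit for bit.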

Distributed Parallelism

OnlineStats can be merged together to facilitate embarrassingly parallel computations.

Note

In general, fit! is a cheaper operation than merge!.

Warning

Not every OnlineStat can be merged. In these cases, OnlineStats either uses an approximation or provides a warning that no merging occurred.

Examples

Simplified (Not Actually in Parallel)

y1 = randn(10_000)
y2 = randn(10_000)
y3 = randn(10_000)

a = Series(Mean(), Variance(), KHist(20))
b = Series(Mean(), Variance(), KHist(20))
c = Series(Mean(), Variance(), KHist(20))

fit!(a, y1)
fit!(b, y2)
fit!(c, y3)

merge!(a, b)  # merge `b` into `a`
merge!(a, c)  # merge `c` into `a`
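
As a sanity check on merging, the sketch below compares a merged `Mean` against one fitted on the pooled data. It uses a bare `Mean` rather than a `Series` to keep the comparison simple; the names `merged` and `pooled` are illustrative.

```julia
using OnlineStats, Statistics

y1 = randn(10_000)
y2 = randn(10_000)

# Fit each chunk separately, then merge the second stat into the first
merged = merge!(fit!(Mean(), y1), fit!(Mean(), y2))

# Fit a single stat on all of the data at once
pooled = fit!(Mean(), vcat(y1, y2))

println(nobs(merged) == nobs(pooled))          # both have seen 20_000 observations
println(value(merged) ≈ value(pooled))         # agree up to floating-point rounding
println(value(merged) ≈ mean(vcat(y1, y2)))    # and match the batch mean
```

For `Mean` the merge is exact up to rounding. Stats that merge only approximately (see the warning above) would not be expected to pass this kind of check as tightly.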

In Parallel

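using Distributed
addprocs(3)                    # start 3 worker processes
@everywhere using OnlineStats  # load OnlineStats on every process

# Each iteration fits its own Series on a worker; `merge` combines the results
s = @distributed merge for i in 1:3
    o = Series(Mean(), Variance(), KHist(20))
    fit!(o, randn(10_000))  # `fit!` returns `o`, which is passed to `merge`
end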