# Big Data

## OnlineStats + CSV

The CSV package offers a very memory-efficient way of iterating through the rows of a (possibly larger-than-memory) CSV file.

### Example

Here is a toy example (Iris dataset) of how to iterate through the rows of a CSV file one-by-one and calculate histograms grouped by another variable.

using OnlineStats, CSV, Plots

url = "https://gist.githubusercontent.com/joshday/df7bdaa1d58b398592e7656395de6335/raw/5a1c83f498f8ca7e25ff2372340e44b3389be9b1/iris.csv"

rows = CSV.Rows(download(url); reusebuffer = true)  # lazily stream rows, reusing one buffer

itr = (string(row.variety) => parse(Float64, row.sepal_length) for row in rows)

o = GroupBy(String, Hist(4:0.25:8))

fit!(o, itr)

plot(o, layout=(3,1))

The ThreadsX package offers multithreaded implementations of many functions in Base and supports OnlineStats via `ThreadsX.reduce(::OnlineStat, data)`.
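As a rough sketch of that call, assuming both ThreadsX and OnlineStats are installed and Julia was started with more than one thread (for example `julia -t 4`), the following reduces a vector into a `Mean`. The names `data` and `o` are illustrative and not part of either package.

```julia
using OnlineStats, ThreadsX

data = randn(100_000)   # illustrative in-memory data

# Conceptually, the data is processed in chunks across threads and the
# partial statistics are combined into a single result.
o = ThreadsX.reduce(Mean(), data)

value(o)   # should closely match sum(data) / length(data)
```

With a single thread the call should still work; it simply gains nothing from parallelism.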

## Distributed Parallelism

OnlineStat objects can be merged together to facilitate embarrassingly parallel computations: fit one statistic per chunk of data, then combine the results with `merge!`.

Note: In general, `fit!` is a cheaper operation than `merge!`, so it pays to fit many observations per worker and merge only a few times.

Warning: Not every OnlineStat can be merged. In these cases, OnlineStats either uses an approximation or provides a warning that no merging occurred.

### Examples

#### Simplified (Not Actually in Parallel)

y1 = randn(10_000)
y2 = randn(10_000)
y3 = randn(10_000)

a = Series(Mean(), Variance(), KHist(20))
b = Series(Mean(), Variance(), KHist(20))
c = Series(Mean(), Variance(), KHist(20))

fit!(a, y1)
fit!(b, y2)
fit!(c, y3)

merge!(a, b)  # merge b into a
merge!(a, c)  # merge c into a
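One way to sanity-check a merge like this, offered as a sketch rather than as part of the original example, is to fit a separate `Series` on the concatenated data in a single pass and compare. `Mean` and `Variance` should agree up to floating-point error. `KHist` is left out here because its adaptive bins may differ between the two paths. The names `parts`, `merged`, and `single` are illustrative.

```julia
using OnlineStats

y1, y2, y3 = randn(10_000), randn(10_000), randn(10_000)

# Fit one Series per chunk, then fold the partial results into the first one.
parts = [fit!(Series(Mean(), Variance()), y) for y in (y1, y2, y3)]
merged = foldl(merge!, parts)

# Fit a single Series on all of the data for comparison.
single = fit!(Series(Mean(), Variance()), vcat(y1, y2, y3))

nobs(merged) == nobs(single)          # same number of observations
value(merged)[1] ≈ value(single)[1]   # means agree up to rounding
value(merged)[2] ≈ value(single)[2]   # variances agree up to rounding
```

Note that `merge!` mutates its first argument, so after the fold `parts[1]` and `merged` refer to the same object.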

#### In Parallel

using Distributed