# Big Data

## OnlineStats + CSV

The CSV package offers a very memory-efficient way of iterating through the rows of a (possibly larger-than-memory) CSV file.

### Example

Here is a toy example (Iris dataset) of how to iterate through the rows of a CSV file one-by-one and calculate histograms grouped by another variable.

```
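using OnlineStats, CSV, Plots

url = "https://gist.githubusercontent.com/joshday/df7bdaa1d58b398592e7656395de6335/raw/5a1c83f498f8ca7e25ff2372340e44b3389be9b1/iris.csv"
# Stream rows lazily; `reusebuffer = true` reuses a single row buffer across iterations
rows = CSV.Rows(download(url); reusebuffer = true)
# Lazily yield `group => value` pairs; CSV.Rows yields strings by default, so parse the number
# and convert the key to a String to match `GroupBy(String, ...)`
itr = (String(row.variety) => parse(Float64, row.sepal_length) for row in rows)
# One histogram (bin edges 4:0.25:8) per distinct variety
o = GroupBy(String, Hist(4:0.25:8))
fit!(o, itr)
plot(o, layout=(3,1))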
```
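The same row-by-row pattern can be tried without network access or plotting. The following is a minimal sketch that assumes `CSV.Rows` is given an in-memory `IOBuffer`; the column name `x` and the data are made up for illustration.

```julia
using OnlineStats, CSV

# Hypothetical in-memory CSV with a single numeric column `x`
csv_text = """
x
1.0
2.0
3.0
4.0
"""

rows = CSV.Rows(IOBuffer(csv_text))

# CSV.Rows yields string values by default, so parse each one lazily
itr = (parse(Float64, row.x) for row in rows)

# Fit a running mean one observation at a time
o = fit!(Mean(), itr)

println(nobs(o))   # number of observations seen
println(value(o))  # current estimate of the mean
```

Only one row is materialized at a time, so the same code shape applies when the source is a file too large to load at once.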

## Threaded Parallelism

The ThreadsX package offers multithreaded implementations of many functions in Base and supports OnlineStats via `ThreadsX.reduce(::OnlineStat, data)`.

- See "A quick introduction to data parallelism in Julia" by ThreadsX author Takafumi Arakaki (`@tkf`) for more details.
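As a rough illustration of the `ThreadsX.reduce(::OnlineStat, data)` form mentioned above, the sketch below compares a serial fit with a threaded one. It assumes Julia was started with more than one thread (for example `julia --threads=4`); the data are random and purely illustrative.

```julia
using OnlineStats, ThreadsX

data = randn(1_000_000)

# Serial baseline: a single pass over the data
serial = fit!(Mean(), data)

# Threaded version: the data are processed in chunks and the
# per-chunk results are combined into a single Mean
threaded = ThreadsX.reduce(Mean(), data)

println(Threads.nthreads())
println(value(serial) ≈ value(threaded))
```

The two estimates should agree up to floating-point error, since the threaded result is built by merging partial results.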

## Distributed Parallelism

`OnlineStat`s can be merged together to facilitate embarrassingly parallel computations.

In general, `fit!` is a cheaper operation than `merge!`.

Not every `OnlineStat` can be merged. In these cases, **OnlineStats** either uses an approximation or provides a warning that no merging occurred.
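For a stat that merges exactly, such as `Mean`, merging two fitted stats should match fitting all the data at once. A small sketch of that check, with made-up data sizes:

```julia
using OnlineStats, Statistics

y1 = randn(1_000)
y2 = randn(2_000)

a = fit!(Mean(), y1)
b = fit!(Mean(), y2)

merge!(a, b)  # `a` now summarizes both y1 and y2

println(nobs(a))                           # total observations across both stats
println(value(a) ≈ mean(vcat(y1, y2)))     # compare with the mean of the combined data
```

This property is what makes it safe to fit independent chunks separately and combine them afterwards.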

### Examples

#### Simplified (Not Actually in Parallel)

```
using OnlineStats

y1 = randn(10_000)
y2 = randn(10_000)
y3 = randn(10_000)
a = Series(Mean(), Variance(), KHist(20))
b = Series(Mean(), Variance(), KHist(20))
c = Series(Mean(), Variance(), KHist(20))
fit!(a, y1)
fit!(b, y2)
fit!(c, y3)
merge!(a, b) # merge `b` into `a`
merge!(a, c) # merge `c` into `a`
```

#### In Parallel

```
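using Distributed
addprocs(3)                    # start 3 worker processes
@everywhere using OnlineStats  # load OnlineStats on every process

# Each iteration fits its own Series; `merge` combines the per-iteration results
s = @distributed merge for i in 1:3
    o = Series(Mean(), Variance(), KHist(20))
    fit!(o, randn(10_000))
end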
```