Basics

OnlineStats is a Julia package for statistical analysis with algorithms that run both online and in parallel. Online algorithms are well suited to streaming data or to data too large to hold in memory. Observations are processed one at a time, and all algorithms use O(1) memory.
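
To make the "one observation at a time, O(1) memory" idea concrete, here is a minimal sketch of a running mean in plain Julia. The `RunningMean` type and `update!` function are hypothetical names chosen for illustration. They are not part of OnlineStats and are not claimed to match its internals.

```julia
# Hypothetical helper, not part of OnlineStats: a running mean that stores
# only the observation count and the current mean.
mutable struct RunningMean
    n::Int
    mean::Float64
end
RunningMean() = RunningMean(0, 0.0)

# Fold one observation into the summary; the observation itself is not kept.
function update!(s::RunningMean, x::Real)
    s.n += 1
    s.mean += (x - s.mean) / s.n
    return s
end

s = RunningMean()
for x in (2.0, 4.0, 6.0, 8.0)
    update!(s, x)
end
```

However many observations are streamed through `update!`, the summary occupies the same two fields.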

Installation

import Pkg
Pkg.add("OnlineStats")

Basics

Every Stat is <: OnlineStat

julia> using OnlineStats

julia> m = Mean()
Mean: n=0 | value=0.0

Stats Can Be Updated

julia> y = randn(100);

julia> fit!(m, y)
Mean: n=100 | value=0.298034

Stats Can Be Merged

julia> y2 = randn(100);

julia> m2 = fit!(Mean(), y2)
Mean: n=100 | value=0.133769

julia> merge!(m, m2)
Mean: n=200 | value=0.215901
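
For intuition about why merging is possible, two mean summaries can be combined using only their counts and values. The sketch below assumes a count-weighted update and uses a hypothetical `merge_means` helper written in plain Julia. It is not taken from the OnlineStats source. With equal counts the result is the simple average of the two means, which agrees with the values printed above.

```julia
# Hypothetical helper, not the OnlineStats implementation: combine two
# (count, mean) summaries into the summary of the pooled data.
function merge_means(n1::Integer, m1::Real, n2::Integer, m2::Real)
    n = n1 + n2
    n == 0 && return (0, 0.0)
    # Shift the first mean toward the second in proportion to its share of n.
    return (n, m1 + (n2 / n) * (m2 - m1))
end

mean_of(v) = sum(v) / length(v)

a = [1.0, 2.0, 3.0]
b = [10.0, 20.0]

# Merge the two summaries without revisiting the raw data.
n, m = merge_means(length(a), mean_of(a), length(b), mean_of(b))
```

Because only summaries are exchanged, each chunk of data can be fitted on a separate worker and the results merged afterwards.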

Stats Have a Value

julia> value(m)
0.21590105719631947

Details of fit!-ting

Stats are subtypes of the parametric abstract type OnlineStat{T}, where T is the type of a single observation. For example, Mean <: OnlineStat{Number}.

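# Single observation: x is already a T, so the stat is updated directly.
fit!(::OnlineStat{T}, x::T) where {T} = ...

# Anything else is treated as a collection and fitted element by element.
function fit!(o::OnlineStat{T}, y::S) where {T, S}
    for yi in y
        fit!(o, yi)
    end
    o
end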
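
The two-method pattern above can be tried out with a small self-contained sketch. Everything here (`ToyStat`, `ToySum`, `toyfit!`, `_update!`) is a hypothetical stand-in written for illustration, and the guard against non-terminating recursion is an assumption about how such an error could be raised, not a copy of the library's code.

```julia
# Toy stand-ins for illustration only; these names are hypothetical and are
# not OnlineStats types or functions.
abstract type ToyStat{T} end

mutable struct ToySum <: ToyStat{Number}
    total::Float64
    n::Int
end
ToySum() = ToySum(0.0, 0)

_update!(o::ToySum, x::Number) = (o.total += x; o.n += 1)

# Method 1: the argument is a single observation of type T.
toyfit!(o::ToyStat{T}, x::T) where {T} = (_update!(o, x); o)

# Method 2: anything else is iterated, and each element is fitted in turn.
function toyfit!(o::ToyStat{T}, y::S) where {T, S}
    # Guard: if iterating y only yields more values of type S, the recursion
    # would never reach Method 1, so fail with a readable message instead.
    S == eltype(y) && error("The input is a $T. Found $S.")
    for yi in y
        toyfit!(o, yi)
    end
    return o
end

s = toyfit!(ToySum(), [1.0, 2.0, 3.0])   # dispatches to Method 2, then Method 1
```

Calling `toyfit!(ToySum(), "asdf")` in this sketch reaches the guard on the first `Char` and raises an error rather than recursing.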

A Common Error

julia> fit!(Mean(), "asdf")
ERROR: The input for Mean is a Number.  Found Char.

Here is what's happening:

  1. String is not a subtype of Number, so OnlineStats attempts to iterate through "asdf".

  2. The first element of "asdf" is the Char 'a'.

  3. Char is not a subtype of Number either, so OnlineStats produces the error above instead of recursing into 'a' until the stack overflows.

When you see this error:

  1. Check that eltype(x) in fit!(stat, x) is what you think it is.

  2. Check whether the stat is parameterized by observation type (use ?Stat).

    • For example, Extrema is a parametric type that defaults to Float64. If the data is Int64, use Extrema(Int64).
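
Both checks can be done in the REPL before calling fit!. The following sketch uses only Base Julia. The variable names are illustrative.

```julia
text = "asdf"
ints = [1, 2, 3]

text_eltype = eltype(text)        # Char: iterating a String yields Chars
ints_eltype = eltype(ints)        # Int (Int64 on 64-bit systems)

# Would each element be accepted by a stat whose observation type is Number?
text_ok = text_eltype <: Number   # false
ints_ok = ints_eltype <: Number   # true

# Does the data differ from a stat's Float64 default observation type?
needs_typed_stat = ints_eltype != Float64   # true
```

When the last check is true for a parametric stat, pass the element type to the constructor as described above.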

Helper functions

To iterate over the rows/columns of a matrix, use eachrow or eachcol, respectively.

julia> fit!(CovMatrix(), eachrow(randn(100,2)))
CovMatrix: n=100 | value=[NaN NaN; NaN 1.08482]
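
To see what each iterator hands over as a single observation, the Base-only sketch below collects the rows and columns of a small matrix. It assumes Julia 1.1 or later, where eachrow and eachcol are available.

```julia
A = [1 2 3; 4 5 6]           # 2×3 matrix

rows = collect(eachrow(A))   # 2 observations, each of length 3
cols = collect(eachcol(A))   # 3 observations, each of length 2

first_row = rows[1]          # view of the elements 1, 2, 3
first_col = cols[1]          # view of the elements 1, 4
```

Use eachrow when each row of the matrix is one observation, and eachcol when each column is.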