Basics
OnlineStats is a Julia package for statistical analysis with algorithms that run both online and in parallel. Online algorithms are well suited for streaming data or for data that is too large to hold in memory. Observations are processed one at a time, and all algorithms use O(1) memory.
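To make the online and parallel ideas concrete, here is a minimal sketch in plain Julia of a running mean that stores only a count and the current value. The names RunningMean, update!, and combine! are hypothetical and are not part of OnlineStats; the package's own Mean may be implemented differently.

```julia
# Hypothetical illustration; not the OnlineStats implementation.
mutable struct RunningMean
    n::Int          # number of observations seen so far
    value::Float64  # current mean
end
RunningMean() = RunningMean(0, 0.0)

# Online update: fold in one observation using O(1) memory.
function update!(m::RunningMean, x::Real)
    m.n += 1
    m.value += (x - m.value) / m.n
    return m
end

# Merge: combine two partial results, e.g. computed by separate workers.
function combine!(a::RunningMean, b::RunningMean)
    n = a.n + b.n
    n == 0 && return a
    a.value += (b.value - a.value) * (b.n / n)
    a.n = n
    return a
end

a = RunningMean()
foreach(x -> update!(a, x), 1:4)   # observations 1, 2, 3, 4
b = RunningMean()
foreach(x -> update!(b, x), 5:6)   # observations 5, 6
combine!(a, b)                     # a now summarizes all six observations
```

Because each partial result is a fixed-size summary, the data can be split across workers and the summaries merged afterwards, which is the pattern that fit! and merge! expose in the sections that follow.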
Installation
import Pkg; Pkg.add("OnlineStats")
Basics
Every Stat is <: OnlineStat
julia> using OnlineStats
julia> m = Mean()
Mean: n=0 | value=0.0
Stats Can Be Updated
julia> y = randn(100);
julia> fit!(m, y)
Mean: n=100 | value=0.298034
Stats Can Be Merged
julia> y2 = randn(100);
julia> m2 = fit!(Mean(), y2)
Mean: n=100 | value=0.133769
julia> merge!(m, m2)
Mean: n=200 | value=0.215901
Stats Have a Value
julia> value(m)
0.21590105719631947
Details of fit!-ting
Stats are subtypes of the parametric abstract type OnlineStat{T}, where T is the type of a single observation. For example, Mean <: OnlineStat{Number}.
One of the two fit! methods updates the stat from a single observation:
fit!(::OnlineStat{T}, x::T) = ...
In any other case, OnlineStats will attempt to iterate through x and fit! each element (with checks to avoid stack overflows).
function fit!(o::OnlineStat{T}, y::S) where {T, S}
    for yi in y
        fit!(o, yi)
    end
    o
end
A Common Error
julia> fit!(Mean(), "asdf")
ERROR: The input for Mean is a Number. Found Char.
Here is what's happening:
String is not a subtype of Number, so OnlineStats attempts to iterate through "asdf".
The first element of "asdf" is the Char 'a'.
The above error is produced (rather than a stack overflow).
When you see this error:
Check that eltype(x) in fit!(stat, x) is what you think it is.
Check if the stat is parameterized by observation type (use ?Stat).
For example, Extrema is a parametric type that defaults to Float64. If my data is Int64, I need to use Extrema(Int64).
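Both the diagnosis and the two checks above can be reproduced in a short session. This is a sketch that assumes the Extrema(Int64) constructor mentioned above is available in your installed version of OnlineStats; the printed form of value(o) is omitted because it may vary between versions.

```julia
using OnlineStats

# Why fit!(Mean(), "asdf") fails: the String is iterated, yielding Chars.
s = "asdf"
s isa Number          # false: a String is not a Number
eltype(s)             # Char: iterating a String yields characters
first(s)              # 'a', the first element handed to fit!

# Check 1: confirm the element type of the data.
x = [3, 1, 4, 1, 5]
eltype(x)             # Int64 on a 64-bit machine

# Check 2: construct the stat for that observation type.
o = fit!(Extrema(eltype(x)), x)
value(o)
```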
Helper functions
To iterate over the rows/columns of a matrix, use eachrow or eachcol, respectively.
fit!(CovMatrix(), eachrow(randn(100,2)))
CovMatrix: n=100 | value=[NaN NaN; NaN 1.08482]
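To see what fit! receives in a call like this, the following plain-Julia sketch inspects the two iterators (eachrow and eachcol are part of Base as of Julia 1.1). With eachrow, each row of the matrix is passed to the stat as one observation.

```julia
x = randn(100, 2)

rows = eachrow(x)   # iterates over 100 rows, each with 2 elements
cols = eachcol(x)   # iterates over 2 columns, each with 100 elements

length(rows)          # 100
length(first(rows))   # 2
length(cols)          # 2
length(first(cols))   # 100
```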