Basics
OnlineStats is a Julia package for statistical analysis with algorithms that run both online and in parallel. Online algorithms are well suited for streaming data or for data too large to hold in memory. Observations are processed one at a time, and all algorithms use O(1) memory.
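To make the "one observation at a time, O(1) memory" idea concrete, here is a minimal sketch of a running mean in plain Julia. The names RunningMean and update! are hypothetical and chosen for illustration; this is not the package's implementation, only one way such an update can be written.

```julia
# Hypothetical running-mean accumulator (illustration only, not OnlineStats code).
mutable struct RunningMean
    n::Int          # number of observations seen so far
    value::Float64  # current mean
end
RunningMean() = RunningMean(0, 0.0)

# Update with a single observation. Only n and value are stored,
# so memory use does not grow with the number of observations.
function update!(m::RunningMean, x::Real)
    m.n += 1
    m.value += (x - m.value) / m.n
    return m
end

m = RunningMean()
for x in (2.0, 4.0, 6.0)   # observations arrive one at a time
    update!(m, x)
end
println(m.value)           # prints 4.0
```

The data itself is never stored; each observation is folded into the two fields and then discarded.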
Installation
import Pkg; Pkg.add("OnlineStats")
Basics
Every Stat is <: OnlineStat
julia> using OnlineStats
julia> m = Mean()
Mean: n=0 | value=0.0
Stats Can Be Updated
julia> y = randn(100);
julia> fit!(m, y)
Mean: n=100 | value=0.298034
Stats Can Be Merged
julia> y2 = randn(100);
julia> m2 = fit!(Mean(), y2)
Mean: n=100 | value=0.133769
julia> merge!(m, m2)
Mean: n=200 | value=0.215901
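Merging is what makes it possible to run these algorithms in parallel: each worker can fit its own stat on a share of the data, and the partial results are combined afterwards. The sketch below imitates this serially, assuming the data is split into four chunks with Iterators.partition. The chunk size and variable names are arbitrary choices for illustration.

```julia
using OnlineStats, Statistics

y = randn(1_000)
chunks = Iterators.partition(y, 250)   # four chunks of 250 observations

# Fit an independent Mean on each chunk (this step could run on separate workers).
parts = [fit!(Mean(), chunk) for chunk in chunks]

# Combine the partial results. merge! mutates and returns its first argument.
total = reduce(merge!, parts)

nobs(total)               # 1000
value(total) ≈ mean(y)    # true, up to floating-point error
```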
Stats Have a Value
julia> value(m)
0.21590105719631947
Details of fit!-ting
Stats are subtypes of the parametric abstract type OnlineStat{T}, where T is the type of a single observation. For example, Mean <: OnlineStat{Number}.
One of the two fit! methods updates the stat from a single observation:
fit!(::OnlineStat{T}, x::T) = ...
In any other case, OnlineStats will attempt to iterate through x and fit! each element (with checks to avoid stack overflows).
function fit!(o::OnlineStat{T}, y::S) where {T, S}
    for yi in y
        fit!(o, yi)
    end
    o
end
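The two-method dispatch described above can be reproduced in a few lines of plain Julia. ToyStat, ToySum, update! and toyfit! are hypothetical stand-ins invented for this sketch; they are not part of OnlineStats. The sketch also leaves out the package's checks against stack overflows, so passing it a string would recurse without end.

```julia
# Hypothetical stand-ins for OnlineStat{T} and fit! (illustration only).
abstract type ToyStat{T} end

mutable struct ToySum <: ToyStat{Number}
    total::Float64
    n::Int
end
ToySum() = ToySum(0.0, 0)

# The per-stat update rule for one observation.
update!(o::ToySum, x) = (o.total += x; o.n += 1; o)

# Method 1: x is a single observation of type T.
toyfit!(o::ToyStat{T}, x::T) where {T} = update!(o, x)

# Method 2: anything else is treated as a collection of observations.
function toyfit!(o::ToyStat{T}, y::S) where {T, S}
    for yi in y
        toyfit!(o, yi)
    end
    return o
end

s = ToySum()
toyfit!(s, 1.5)          # method 1: one observation
toyfit!(s, [1, 2, 3])    # method 2: iterate, then method 1 for each element
(s.total, s.n)           # (7.5, 4)
```

Julia picks method 1 whenever the second argument is a subtype of T, because that signature is more specific. Everything else falls through to the iterating method.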
A Common Error
julia> fit!(Mean(), "asdf")
ERROR: The input for Mean is a Number. Found Char.
Here is what's happening:
1. String is not a subtype of Number, so OnlineStats attempts to iterate through "asdf".
2. The first element of "asdf" is the Char 'a'.
3. The above error is produced (rather than a stack overflow).
When you see this error:
- Check that eltype(x) in fit!(stat, x) is what you think it is.
- Check if the stat is parameterized by observation type (use ?Stat). For example, Extrema is a parametric type that defaults to Float64. If my data is Int64, I need to use Extrema(Int64).
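Both checks can be tried at the REPL. The sketch below inspects eltype for a string and for an integer vector, then builds an Extrema that matches the data. It assumes the installed version of OnlineStats defines minimum and maximum for Extrema; if it does not, inspect the stat with value instead.

```julia
using OnlineStats

# eltype shows what a single observation will look like.
eltype("asdf")       # Char, which Mean cannot accept
eltype([1, 2, 3])    # Int64 on a 64-bit system

# Parameterize the stat by the observation type of the data.
x = [3, 1, 2]
o = fit!(Extrema(eltype(x)), x)
minimum(o), maximum(o)    # (1, 3)
```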
Helper functions
To iterate over the rows/columns of a matrix, use eachrow or eachcol, respectively.
julia> fit!(CovMatrix(), eachrow(randn(100,2)))
CovMatrix: n=100 | value=[NaN NaN; NaN 1.08482]
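Which helper to use depends on how the matrix stores observations. The sketch below uses only Base functions and a small hand-written matrix (an arbitrary example) to show what each iterator yields.

```julia
X = [1.0 2.0;
     3.0 4.0;
     5.0 6.0]              # 3 observations (rows) of 2 variables (columns)

first(eachrow(X))          # [1.0, 2.0], the first observation
first(eachcol(X))          # [1.0, 3.0, 5.0], the first variable

# If observations are stored as columns instead, iterate with eachcol.
Xt = permutedims(X)        # 2×3, one observation per column
collect(eachcol(Xt)) == collect(eachrow(X))    # true
```

For a stat such as CovMatrix, where each observation is a vector, pass the iterator that yields one observation per element: eachrow(X) for the layout above, or eachcol(Xt) for the transposed one.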