Data Surrogates

Some OnlineStats are especially useful for out-of-core computations. Once fit, they act as a stand-in for the data: you can get summaries, quantiles, regressions, etc., without revisiting the dataset.

Linear Regressions

The LinRegBuilder type lets you fit any linear regression model in which the response y is any one of the variables and the predictors (the x's) are any subset of the remaining variables.

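using OnlineStats

# make some data: 10 predictors and a response with known coefficients
x = randn(10^6, 10)
y = x * range(-1, stop=1, length=10) + randn(10^6)

# fit on the 11 columns of [x y]; column 11 holds y
o = fit!(LinRegBuilder(11), [x y])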

# adds intercept term by default as last coefficient
coef(o; y = 11, verbose = true)
[ Info: Regress 11 on [1, 2, 3, 4, 5, 6, 7, 8, 9, 10] with bias
11-element Array{Float64,1}:
 -0.9997610542099887
 -0.7778464975922809
 -0.5557311702691645
 -0.3345800405075213
 -0.10865244093437426
  0.11222992875390482
  0.33265271415582315
  0.5538212153934471
  0.7766384101713056
  0.9992954228747221
  0.00014274199980085581
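
One way to see why a single fitted object can answer many different regressions is that the cross-product matrix of the data (with a column of ones appended) is a sufficient statistic for least squares. The sketch below illustrates that idea using only Base Julia and Random. It is not OnlineStats' implementation, and the helpers accumulate_crossprod and regress are hypothetical names introduced here for illustration.

```julia
using Random

# Hypothetical helper (not part of OnlineStats): accumulate Z'Z chunk by chunk,
# where each row of Z is [variables..., 1]. Only a (p+1) x (p+1) matrix is kept.
function accumulate_crossprod(chunks, p)
    A = zeros(p + 1, p + 1)
    n = 0
    for chunk in chunks
        Z = hcat(chunk, ones(size(chunk, 1)))
        A .+= Z' * Z
        n += size(chunk, 1)
    end
    return A, n
end

# Hypothetical helper: regress variable `y` on the variables in `x` plus an
# intercept by solving the normal equations from the stored cross-products.
function regress(A, y, x)
    idx = vcat(x, size(A, 1))      # chosen predictors, then the column of ones
    return A[idx, idx] \ A[idx, y]
end

Random.seed!(1)
data = randn(10_000, 4)
data[:, 4] = 2 .* data[:, 1] .- data[:, 2] .+ 0.1 .* randn(10_000)

# Pretend the data arrives in 10 chunks of 1000 rows
chunks = [data[i:i+999, :] for i in 1:1000:10_000]
A, n = accumulate_crossprod(chunks, 4)

β = regress(A, 4, [1, 2, 3])                                # from the summary only
β_direct = hcat(data[:, 1:3], ones(10_000)) \ data[:, 4]    # from the full data
```

Both β and β_direct should agree up to floating-point error, and the intercept is the last coefficient, matching the ordering shown above. Solving the normal equations directly is the simplest choice for a sketch; it is less numerically stable than a QR-based fit when predictors are nearly collinear.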

Histograms

The Hist type for online histograms uses a different algorithm depending on whether the constructor is given the number of bins or the bin edges. Once fit, a Hist can be used to calculate approximate summary statistics without revisiting the data.

o = Hist(20)        # adaptively find bins
o2 = Hist(0:.5:5)  # specify the bin edges
s = Series(o, o2)

using Random, Statistics
fit!(s, randexp(100_000))

# approximate statistics, computed from the histogram rather than the raw data
quantile(o, .5)
quantile(o, [.2, .8])
mean(o)
var(o)
std(o)

using Plots
plot(s)
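
To illustrate how bin counts alone can yield approximate statistics, here is a minimal sketch for the fixed-edges case. It assumes right-open bins and drops values outside the edges. The helpers bin_counts, bin_midpoints, approx_mean, and approx_quantile are hypothetical names introduced for illustration and are not OnlineStats functions; the library's own algorithms may differ.

```julia
using Random, Statistics

# Hypothetical helper: count observations into bins [edges[k], edges[k+1]).
function bin_counts(data, edges)
    counts = zeros(Int, length(edges) - 1)
    for v in data
        k = searchsortedlast(edges, v)                 # last edge that is <= v
        1 <= k <= length(counts) && (counts[k] += 1)   # ignore values outside the edges
    end
    return counts
end

bin_midpoints(edges) = [(edges[i] + edges[i+1]) / 2 for i in 1:length(edges)-1]

# Treat every observation as if it sat at the midpoint of its bin
approx_mean(counts, edges) = sum(counts .* bin_midpoints(edges)) / sum(counts)

# Walk the cumulative counts, then interpolate linearly inside the target bin
function approx_quantile(counts, edges, q)
    target = q * sum(counts)
    cum = 0
    for k in eachindex(counts)
        if counts[k] > 0 && cum + counts[k] >= target
            frac = (target - cum) / counts[k]
            return edges[k] + frac * (edges[k+1] - edges[k])
        end
        cum += counts[k]
    end
    return last(edges)
end

Random.seed!(1)
data = randexp(100_000)
edges = 0:0.1:20
counts = bin_counts(data, edges)

approx_mean(counts, edges)             # compare with mean(data)
approx_quantile(counts, edges, 0.5)    # compare with median(data)
```

The accuracy of such estimates depends on the bin width: narrower bins track the data more closely at the cost of storing more counts.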