Diagnostics post-inference

After performing inference, we can use the Gaussian process to diagnose the quality of the fit. This can be done in the time domain or the frequency domain.

In this section, we will work with the following time series y with measurement error yerr indexed by time t.

using DelimitedFiles
using Plots
using CairoMakie
using MCMCChains
using PairPlots
using Pioran
using Random

data = readdlm("data/subset_simu_single_subset_time_series.txt",comments=true)
t, y, yerr = data[:,1], data[:,2], data[:,3]
Plots.scatter(t, y,yerr=yerr, label="data",xlabel="Time (days)",ylabel="Value",legend=false,framestyle = :box,ms=3)
Example block output

Let's take posterior samples from a run of the inference with ultranest and plot the pairplot of the samples.

data = readdlm("data/inference/chains/equal_weighted_post.txt", comments=true)
samples = convert(Array{Float64},reshape(Array(data[2:end,:]),(size(data[2:end,:])...,1)))
c = Chains(samples,data[1,:])
pairplot(c)
Example block output

We can now use the function run_posterior_predict_checks to perform the diagnostics. This function calls several other functions to plot the graphical diagnostics.

f0, fM = 1 / (t[end] - t[1])/4.0, 1 / minimum(diff(t)) / 2 * 4.0
basis_function = "SHO"
n_components = 20
model = SingleBendingPowerLaw
paramnames = data[1,:]
array = data[2:end,:]
6191×7 Matrix{Any}:
 0.696778  0.00295622  2.54877  …  1.11704    0.48407      2.37838e-5
 1.14882   0.0125038   3.1593      1.07506    0.173744     0.000121685
 1.12806   0.00376592  2.82925     0.987719   0.652464     0.000338202
 0.35729   0.00213092  2.76632     0.959707   0.357615     4.44739e-6
 0.507734  0.00262427  2.81882     0.969674   0.291513     0.00124111
 0.79309   0.00783003  3.5739   …  1.19599    0.152513     1.88203e-6
 0.521578  0.00592803  2.97459     1.03736    0.277896     4.22953e-5
 0.979505  0.00791432  3.33333     1.35453    0.0530084    0.00248094
 0.68002   0.00257445  2.44408     1.01985    0.183394     0.000101945
 0.582614  0.00419094  3.00525     1.25601   -0.345007     0.00108161
 ⋮                              ⋱             ⋮            
 0.737521  0.00836645  2.98214     0.91934    0.163393     2.91151e-6
 0.774934  0.0160586   3.54052     0.806934   0.0693059    0.00330326
 0.296263  0.00153782  2.50223     1.05533    0.352117     0.00160173
 0.390908  0.00205682  2.18926  …  0.781045   0.25945      3.99982e-6
 0.322226  0.00537511  2.8954      1.19504    0.347022     8.14029e-6
 0.94519   0.0125508   3.43676     0.8425     0.0372545    8.46856e-6
 0.814284  0.00121973  2.54089     1.18884    0.182327     3.9866e-6
 0.711486  0.00703952  3.40933     1.10685    0.199357     0.00114942
 0.328705  0.00442217  3.1288   …  1.38679   -0.000307503  1.55874e-5

This function requires a function GP_model to describe the GP model used and it requires the dictionary paramnames_split to split the parameters between the PSD model, the mean or the errors.

function GP_model(t, y, σ, params, basis_function=basis_function, n_components=n_components, model=model)

    T = (t[end] - t[1]) # duration of the time series
    Δt = minimum(diff(t)) # min time separation

    f_min, f_max = 1 / T, 1 / Δt / 2

    α₁, f₁, α₂, variance, ν, μ, c = params

    # Rescale the measurement variance
    σ² = ν .* σ .^ 2  ./ (y.-c).^2

    # Define power spectral density function
    𝓟 = model(α₁, f₁, α₂)

    # Approximation of the PSD to form a covariance function
    𝓡 = approx(𝓟, f_min, f_max, n_components, variance, basis_function=basis_function)

    # Build the GP
    f = ScalableGP(μ, 𝓡)

    # Condition on the times and errors
    fx = f(t, σ²)
    return fx
end
GP_model (generic function with 4 methods)

The associated paramnames_split can be written as follows:

paramnames_split = Dict("psd" => ["α₁", "f₁", "α₂"],
        "norm" => "variance",
        "scale_err" => "ν",
        "log_transform" =>"c",
        "mean" => "μ")
Dict{String, Any} with 5 entries:
  "log_transform" => "c"
  "scale_err"     => "ν"
  "norm"          => "variance"
  "mean"          => "μ"
  "psd"           => ["α₁", "f₁", "α₂"]
figs = run_posterior_predict_checks(array, paramnames,paramnames_split, t, y, yerr, model,GP_model, true; S_low=20., S_high=20., path="", basis_function=basis_function, n_components=n_components, is_integrated_power=false)
4-element Vector{Any}:
 Figure()
 Figure()
 Figure()
 Figure()

In the Fourier domain

We can plot the posterior predictive distribution of the power spectral density and the approximate power spectral density. We can verify if the approximation is valid with the posterior samples. The noise level is given by $2 \nu \sigma^2_{\rm yerr}\Delta t$.

This figure is plotted using the function Pioran.plot_psd_ppc.

figs[1]
Example block output

A second diagnostic in the frequency domain is the posterior predictive periodograms using the Lomb-Scargle periodogram of simulated data. Using the posterior samples, we can draw realisations of the process at the same instants t and compute the Lomb-Scargle periodogram for each realisation. To do so, we use the package LombScargle.jl. We can then compare the median of the periodogram with the periodogram of the data. This figure is plotted using the function Pioran.plot_lsp_ppc.

figs[2]
Example block output

In the time domain

In the time domain, we can draw realisations of the process conditioned on the observations and compare it with the data with the function Pioran.plot_ppc_timeseries. The predictive time series is plotted with the function Pioran.plot_simu_ppc_timeseries.

figs[3]
Example block output
Note

As mentioned in Sampling from the conditioned distribution it can be very expensive to sample from the conditioned distribution, especially if the number of points is large.

Finally, we can compute the residuals of the realisations and the data. This figure is plotted using the function Pioran.plot_residuals_diagnostics. The distribution of the residuals is also plotted, the lower panel shows the autocorrelation of the residuals. One should note that the lags of the autocorrelation are not the same as the lags of the time series. The lags are in indexes of the residuals, therefore it may be difficult to interpret the autocorrelation in terms of time and realisation of a white noise process.

figs[4]
Example block output