Performance¶

Benchmarks and optimization tips.

Benchmarks¶

Typical performance on a modern CPU:

Grid Size	Samples	Time (s)	Samples/sec
50	10	0.002	5000
100	10	0.008	1250
200	10	0.030	333
500	10	0.190	53
1000	10	0.760	13

Benchmarked on an Apple M1 Pro.
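To see how run time grows with grid size over the measured range, one can estimate the slope of log(time) against log(grid size) from the table above. This is only a sketch: the helper name `scaling_exponent` is made up for this example, and the numbers are copied from the table. Over these sizes the exponent comes out close to 2 rather than 3, possibly because lower-order steps still contribute noticeably at small \(n\); treat that explanation as a guess.

```python
import math

# (grid size, seconds for 10 samples), copied from the benchmark table
timings = [(50, 0.002), (100, 0.008), (200, 0.030), (500, 0.190), (1000, 0.760)]

def scaling_exponent(n0, t0, n1, t1):
    """Slope of log(time) versus log(n) between two measurements."""
    return math.log(t1 / t0) / math.log(n1 / n0)

# Exponent between each pair of consecutive rows
pairwise = [
    scaling_exponent(n0, t0, n1, t1)
    for (n0, t0), (n1, t1) in zip(timings, timings[1:])
]

# Exponent across the whole measured range
overall = scaling_exponent(*timings[0], *timings[-1])
print([round(p, 2) for p in pairwise], round(overall, 2))
```

Extrapolating to much larger grids with this exponent would likely underestimate the run time once the cubic decomposition term takes over.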

Complexity¶

Prior Sampling¶

Time complexity: \(O(n^3)\) for the covariance decomposition, where \(n\) is the total grid size.

The Cholesky decomposition of the covariance matrix dominates.
Sampling is \(O(n^2 \cdot k)\), where \(k\) is the number of samples (one triangular matrix-vector product per sample).

For joint sampling (f, g, h):

Total grid size: \(n = n_f + n_g + n_h\)
Decomposition: \(O(n^3)\)
Per sample: \(O(n^2)\)
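As a rough illustration of where the cost goes, the sketch below draws samples from a zero-mean Gaussian using a single Cholesky factorization. It uses plain NumPy. The squared-exponential kernel, the `jitter` term, and the helper names `rbf_kernel` and `sample_gaussian` are assumptions made for this example, not the library's implementation.

```python
import numpy as np

def rbf_kernel(x, ell=1.0, sigma=1.0):
    """Squared-exponential covariance matrix on a 1-D grid (illustrative kernel)."""
    d = x[:, None] - x[None, :]
    return sigma**2 * np.exp(-0.5 * (d / ell) ** 2)

def sample_gaussian(K, k, rng, jitter=1e-6):
    """Draw k zero-mean samples with covariance K via one Cholesky factorization."""
    n = K.shape[0]
    # Factor once; the small diagonal jitter keeps the matrix numerically
    # positive definite.
    L = np.linalg.cholesky(K + jitter * np.eye(n))
    z = rng.standard_normal((n, k))  # k independent standard-normal vectors
    return (L @ z).T                 # one matrix product yields all k samples

rng = np.random.default_rng(0)
x = np.linspace(0, 5, 100)
samples = sample_gaussian(rbf_kernel(x), k=10, rng=rng)
print(samples.shape)  # (10, 100)
```

The factorization runs once regardless of `k`, which is why requesting more samples in one call is comparatively cheap.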

Posterior Sampling¶

Additional operations:

Conditioning: \(O(m^3)\) where \(m\) is number of observations
Cross-covariance: \(O(m \cdot n)\)
Predictive mean: \(O(m \cdot n)\)

When \(m \ll n\), these extra steps are cheap compared with the \(O(n^3)\) decomposition, so posterior sampling costs roughly the same as prior sampling.
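The conditioning steps listed above can be sketched in NumPy as follows. This is a minimal example under stated assumptions: a squared-exponential kernel, a small `noise` term on the diagonal, and hypothetical helper names (`rbf_kernel`, `predictive_mean`). It computes only the predictive mean and is not the library's code.

```python
import numpy as np

def rbf_kernel(xa, xb, ell=1.0):
    """Squared-exponential cross-covariance between two 1-D grids (illustrative)."""
    d = xa[:, None] - xb[None, :]
    return np.exp(-0.5 * (d / ell) ** 2)

def predictive_mean(x_train, y_train, x_grid, ell=1.0, noise=1e-6):
    """Posterior mean on x_grid given m observations (x_train, y_train)."""
    m = len(x_train)
    K_mm = rbf_kernel(x_train, x_train, ell) + noise * np.eye(m)
    L = np.linalg.cholesky(K_mm)             # conditioning: O(m^3)
    # Solve K_mm @ alpha = y_train using the Cholesky factor
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    K_nm = rbf_kernel(x_grid, x_train, ell)  # cross-covariance: O(m n)
    return K_nm @ alpha                      # predictive mean: O(m n)

x_train = np.linspace(0, 5, 20)   # m = 20 observations
y_train = np.sin(x_train)
x_grid = np.linspace(0, 5, 200)   # n = 200 prediction points
mean = predictive_mean(x_train, y_train, x_grid)
print(mean.shape)  # (200,)
```

Only the \(m \times m\) system is factorized here, so the cost of this part is governed by the number of observations rather than by the grid size.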

Optimization Tips¶

1. Use Coarser Grids¶

The integral is smoother than the function itself, so it can be sampled on fewer points:

# Fine grid for f, coarse grid for its integral g
spec = SamplingSpec(
    x_f=np.linspace(0, 5, 200),  # Fine
    x_g=np.linspace(0, 5, 50)    # Coarse
)

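Compared with 200 points on both grids, the total grid size drops from 400 to 250, which cuts the \(O(n^3)\) decomposition cost by about 75% with little loss of accuracy when the integral is smooth.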
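The saving can be checked with a line of arithmetic. The sketch below assumes the baseline is 200 points on both grids and that the cost is dominated by the cubic decomposition term; `relative_cubic_cost` is a name invented for this example.

```python
def relative_cubic_cost(n_new, n_old):
    """Ratio of O(n^3) decomposition costs for two total grid sizes."""
    return (n_new / n_old) ** 3

n_fine_both = 200 + 200  # assumed baseline: f and g both on 200 points
n_mixed = 200 + 50       # f on the fine grid, g on the coarse grid
ratio = relative_cubic_cost(n_mixed, n_fine_both)
print(f"cost ratio: {ratio:.3f}, saving: {1 - ratio:.1%}")
```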

2. Sample Only What You Need¶

Skip unused quantities:

# Only need derivative
spec = SamplingSpec(x_h=x)  # Don't request f or g

A smaller covariance matrix means a faster decomposition.

3. Batch Sampling¶

Generate multiple samples in one call:

# Efficient
result = sample_prior(spec, n_samples=100)

# Inefficient
samples = [sample_prior(spec, n_samples=1) for _ in range(100)]

A single call amortizes the one-time decomposition cost over all samples.
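The difference can be made visible by counting factorizations. The sketch below uses a stand-in sampler named `draw`, not the library's `sample_prior`, and assumes that every call repeats the decomposition. The toy covariance matrix and the counter are likewise only for illustration.

```python
import numpy as np

calls = {"cholesky": 0}

def counted_cholesky(K):
    """np.linalg.cholesky wrapper that counts how often it runs."""
    calls["cholesky"] += 1
    return np.linalg.cholesky(K)

def draw(K, n_samples, rng):
    """Stand-in sampler: factor the covariance, then draw n_samples vectors."""
    L = counted_cholesky(K)
    return (L @ rng.standard_normal((K.shape[0], n_samples))).T

rng = np.random.default_rng(0)
K = np.eye(50) + 0.5  # small positive-definite covariance for illustration

batched = draw(K, 100, rng)
batched_calls = calls["cholesky"]                 # 1 factorization

looped = np.vstack([draw(K, 1, rng) for _ in range(100)])
looped_calls = calls["cholesky"] - batched_calls  # 100 factorizations
print(batched.shape, looped.shape, batched_calls, looped_calls)
```

Both approaches return arrays of the same shape, but the loop repeats the expensive step once per sample.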

4. Larger Length Scales¶

Small length scales require finer grids for accuracy, so prefer the largest length scale that suits your problem:

# Small ell needs dense grid
ell_small = 0.1
x_dense = np.linspace(0, 5, 500)

# Large ell works with coarse grid
ell_large = 2.0
x_coarse = np.linspace(0, 5, 100)
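One way to compare the two settings is to count how many grid points fall within one length scale. This is an illustrative diagnostic only, not a criterion the library applies, and `points_per_length_scale` is a hypothetical helper that assumes a uniform grid.

```python
import numpy as np

def points_per_length_scale(x, ell):
    """Grid points per length scale on a uniformly spaced grid."""
    dx = x[1] - x[0]
    return ell / dx

x_dense = np.linspace(0, 5, 500)
x_coarse = np.linspace(0, 5, 100)

dense_small = points_per_length_scale(x_dense, 0.1)    # about 10
coarse_large = points_per_length_scale(x_coarse, 2.0)  # about 40
coarse_small = points_per_length_scale(x_coarse, 0.1)  # about 2
print(dense_small, coarse_large, coarse_small)
```

With only about two points per length scale, the small length scale is poorly resolved on the coarse grid, which is why it calls for the denser one.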

5. Sparse Observations¶

For posterior sampling, use fewer observations:

# Instead of 100 observations
x_train_dense = np.linspace(0, 5, 100)

# Use 20 well-placed observations
x_train_sparse = np.linspace(0, 5, 20)

Conditioning cost scales as \(O(m^3)\), so going from 100 to 20 observations cuts that overhead by a factor of \(5^3 = 125\).

Memory Usage¶

Memory scales as \(O(n^2)\) for the covariance matrix.

Approximate memory for one dense matrix of 8-byte (double-precision) entries:

Grid Size	Matrix Size	Memory (MB)
100	100×100	0.08
500	500×500	2
1000	1000×1000	8
5000	5000×5000	200
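The table values are consistent with storing one dense matrix of 8-byte floats. A small helper (hypothetical, named `dense_matrix_mb` here) reproduces them and can be used for other grid sizes. Actual usage may be higher, since the Cholesky factor and temporary arrays typically need storage of their own.

```python
def dense_matrix_mb(n, bytes_per_entry=8):
    """Memory in MB (10^6 bytes) for one dense n x n matrix."""
    return n * n * bytes_per_entry / 1e6

for n in (100, 500, 1000, 5000):
    print(n, dense_matrix_mb(n))
```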

For large grids (\(n > 1000\)), consider:

Using sparse/low-rank approximations
Splitting into smaller regions
Increasing ell to reduce required resolution

Parallelization¶

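The C implementation uses GSL's BLAS routines, which may run multithreaded if GSL is linked against a multithreaded BLAS (e.g., OpenBLAS, MKL).

Check the BLAS backend that NumPy was built against (the C implementation may be linked to a different library):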

import numpy as np
np.show_config()
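Thread counts for common BLAS builds are usually controlled through environment variables. The sketch below sets them from Python before NumPy is imported. The value `4` is arbitrary, and whether the C implementation's BLAS honours these variables depends on how it was built and linked, so treat this as an assumption to verify on your system.

```python
import os

# Set thread limits before the numerical libraries are first imported;
# most BLAS builds read these variables once at load time.
os.environ.setdefault("OMP_NUM_THREADS", "4")       # OpenMP-based builds
os.environ.setdefault("OPENBLAS_NUM_THREADS", "4")  # OpenBLAS
os.environ.setdefault("MKL_NUM_THREADS", "4")       # Intel MKL

import numpy as np  # imported after the environment is configured

np.show_config()
```

`setdefault` leaves any value already exported in the shell untouched, so settings made outside Python take precedence.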

Profiling¶

Profile your code to identify bottlenecks:

import cProfile
import pstats

with cProfile.Profile() as pr:
    result = sample_prior(spec, n_samples=100)

stats = pstats.Stats(pr)
stats.sort_stats('cumtime')
stats.print_stats(10)
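For plain wall-clock numbers comparable to the benchmark table, a small timing helper built on `time.perf_counter` is often enough. The helper name `best_time` and the stand-in workload are assumptions for this sketch; substitute your own call.

```python
import time

def best_time(fn, repeats=5):
    """Return the fastest wall-clock time of fn() over several runs, in seconds."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best

# Stand-in workload; replace with e.g. lambda: sample_prior(spec, n_samples=10)
elapsed = best_time(lambda: sum(i * i for i in range(10_000)))
print(f"{elapsed:.6f} s")
```

Taking the minimum over several runs reduces the influence of background activity on the measurement.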

Next Steps¶

Troubleshooting - Numerical issues
Testing - Validation and tests