Authors: Hal Canary and Cory Quammen
The MADAI Distribution Sampling tools enable you to estimate the most likely parameters of a model and the surrounding probability density by comparing the output from the model at a given set of parameters to a set of values obtained through actual experiment. The tools are intended to help answer the question of which parameters of a model best explain observed aspects of the system being modeled.
This tutorial will introduce you to the MADAI Distribution Sampling tools. It will walk you through generating a probability distribution of an example model in two different ways, computing information about the samples in the distribution, and visualizing the distribution. The tutorial is designed to be read from start to finish.
Throughout the document, some special notation is used.
This indicates that you should enter the command "command" at the terminal:
$ command
This indicates output produced by a program:
output
Aside: Indicates additional information that is not critical to completing the tutorial but which may be useful to know. You may opt to skip these sections and come back to them later.
On Debian or Ubuntu:
$ sudo apt-get -y install build-essential
$ sudo apt-get -y install cmake libboost-dev libeigen3-dev gnuplot
On Fedora, Red Hat, or CentOS:
$ sudo yum -y groupinstall "Development Tools"
$ sudo yum -y install cmake boost-devel eigen3-devel gnuplot
On MacOS with MacPorts:
$ sudo port install cmake boost eigen3 gnuplot
Download DistributionSampling-VERSION.tgz from the MADAI web site and unpack it:
$ tar -x -z -f DistributionSampling-VERSION.tgz
We will install the tools into ${HOME}/local.
$ cd DistributionSampling
$ build.sh "${HOME}/local"
Aside: MacOS 10.x users who do not have macports and need Boost or Eigen3 can still install this library. You must get CMake from www.cmake.org. Then run the command:
$ full_build.sh "${HOME}/local"
This will download the necessary Boost and Eigen3 headers into a temporary directory for the compilation.
Aside: The build.sh and full_build.sh scripts simply call CMake. You can also run CMake directly.
$ cd ..
$ mkdir build
$ cd build
$ cmake "../DistributionSampling" \
    -DCMAKE_INSTALL_PREFIX:PATH="${HOME}/local" \
    -DBoost_INCLUDE_DIR:PATH="/usr/include" \
    -DEIGEN3_INCLUDE_DIR:PATH="/usr/include/eigen3" \
    -DCMAKE_BUILD_TYPE:STRING=Release \
    -DBUILD_TESTING:BOOL=0 \
    -DUSE_OPENMP:BOOL=0 \
    -DUSE_GPROF:BOOL=0
$ make
$ make install
$ PATH="${PATH}:${HOME}/local/bin"
$ cd ../DistributionSampling
Your executables will be in the directory ${HOME}/local/bin. You should make these executables available by including this directory in your path. If you use bash, write:
$ PATH="${PATH}:${HOME}/local/bin"
If you have csh/tcsh as your shell, write:
$ set path = ($path "${HOME}/local/bin")
If you do not know your shell type, write:
$ echo $SHELL
Aside: If you wish to compile your own software that links against the library, the include files are in ${HOME}/local/include and the library is in ${HOME}/local/lib/madai.
The MADAI Workbench version 1.8 or higher is required for Section 3 of the tutorial. The MADAI Workbench is a customized version of ParaView with additional filters that support visualization of high dimensional data.
This section describes how to perform Markov Chain Monte Carlo sampling with a "fast model". By a "fast model" we mean that it can execute millions of times in the time it takes to get a cup of coffee.
To invoke such a model from the MADAI Distribution Sampling tools, you will need to build an executable program to interface with the madai_generate_trace program. Your program will write information about the model to stdout, read parameter values from stdin, and write model outputs to stdout. madai_generate_trace will start your program and interactively query it for model outputs at given parameter vectors.
In this example, we have written a program that models a parabolic potential. In this case, the program, named parabolic_interactive.py, is written in Python. It is located in the directory DistributionSampling/tutorial/parabolic_example/.
Aside: The Parabolic Potential Example.
This is a simple model with three inputs and three outputs. It is based on a picture of a thermalized particle in a one-dimensional harmonic oscillator potential at constant temperature, with the complication that the particle is constrained to x > 0. The particle has mass m = 1 and the potential has the form
V(x) = (x - X0)^2 / (2 × Kinv)  if x > 0
V(x) = ∞  if x ≤ 0
The three parameters are the temperature TEMP, the offset of the potential X0, and the inverse spring constant, Kinv. We use Kinv = 1/K instead of K because some of the model dependencies will go as 1/K, and it's a bad idea to pick parameters for which the behavior is singular. The three observables are the average energy <E>, the average position, <x> and the average squared position <x^2>.
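The potential defined above is easy to write down in code. Here is a minimal sketch of just the potential function (not the tutorial's parabolic_interactive.py program, which also computes the thermal averages):

```python
import math

def potential(x, x0, kinv):
    """V(x) = (x - X0)^2 / (2 * Kinv) for x > 0; infinite for x <= 0."""
    if x <= 0:
        return math.inf
    return (x - x0) ** 2 / (2.0 * kinv)
```

For example, potential(1.0, 0.0, 1.0) evaluates the potential one unit away from the minimum with unit inverse spring constant.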
This model has the parameters X0, Kinv, and TEMP; and the observables MEAN_X, MEAN_X_SQUARED, and MEAN_ENERGY. When we run the program, the output looks like:
$ cd tutorial
$ python parabolic_example/parabolic_interactive.py
# ParabolicPotentialModel
VERSION 1
PARAMETERS 3
X0 UNIFORM -2.0 2.0
Kinv UNIFORM 0.25 4.0
TEMP UNIFORM 0.25 4.0
OUTPUTS 3
MEAN_X
MEAN_X_SQUARED
MEAN_ENERGY
VARIANCE 3
END_OF_HEADER
STOP
Line 1 (# ParabolicPotentialModel) is a comment. In this case, it reports the name of the model.
Line 2 (VERSION 1) indicates the version of the input/output format the Distribution Sampling tools use to interface with external programs.
Line 3 (PARAMETERS 3) begins the parameters description section and indicates how many parameters are in the model.
Lines 4-6 describe the parameters. The first item is the parameter name, the second item is the type of prior distribution for the parameter, the third and fourth items indicate the minimum and maximum values where the uniform distribution is non-zero.
Line 7 (OUTPUTS 3) begins the output description section.
Lines 8-10 give the output parameter names.
Line 11 (VARIANCE 3) gives the form of the model output covariance.
Line 12 (END_OF_HEADER) indicates that the header describing the model inputs and outputs has ended.
The external model then waits for a list of parameter values (encoded as text) on stdin. Then it returns the model outputs followed by the model covariances. For example, if you type the first line shown below, you will get the output values from the model (the second line) and the output variances (the third line):
0 2.25 2.25
1.7952402618064471 5.0625 2.25
0.1125 0.6570540589064506 0.25851201309032235
Aside: The Interactive Model Language.
To sample the distribution of the example model, first make sure you are in the DistributionSampling/tutorial directory. Then make a working directory:
$ mkdir parabolic_fast
$ cd parabolic_fast
For the rest of this section, we will assume that you are in your working directory and will refer to it as . (a single dot).
Next, copy the “experimental” results file into the working directory. This specifies all of the experimentally observed measurements and errors. (If you skip this step, the tools assume that the measurements are 0.0 and the errors are 1.0.)
$ cp ../parabolic_example/experimental_results.dat .
$ cat experimental_results.dat
MEAN_X 1.28759997 0.050
MEAN_X_SQUARED 2.28759997 0.179
MEAN_ENERGY 0.856200015 0.139
Next, we will create a settings.dat file in this directory with some default values in it. All of the Distribution Sampling tools will look for this file.
$ madai_print_default_settings > ./settings.dat
Aside: madai_print_default_settings produces numerous settings. A detailed description of the options in settings.dat can be found in the MADAI Statistics Manual.
Now, use your favorite text editor to edit the settings.dat file. Change the setting EXTERNAL_MODEL_EXECUTABLE to ../parabolic_example/parabolic_interactive.py.
Alternatively, use the madai_change_setting program to change the setting:
$ madai_change_setting . EXTERNAL_MODEL_EXECUTABLE \
"../parabolic_example/parabolic_interactive.py"
The final step is to run the Markov chain Monte Carlo (MCMC) routine on your model. The madai_generate_trace program uses the external model to produce model outputs for points in parameter space. These model outputs are compared to the experimental values to calculate likelihood.
The Metropolis-Hastings MCMC algorithm is used to draw a large number of samples from the distribution proportional to likelihood. These values are stored in a comma-separated-value (CSV) file specified in the arguments.
$ madai_generate_trace . "mcmc.csv"
We use the term "trace" to refer to the samples generated from the MCMC algorithm.
$ head mcmc.csv
"X0","Kinv","TEMP","MEAN_X","MEAN_X_SQUARED","MEAN_ENERGY","LogLikelihood"
0.753519,1.57836,0.899513,1.28201,2.38578,0.77336,-4.36406
0.753519,1.57836,0.899513,1.28201,2.38578,0.77336,-4.36406
0.746218,1.66737,0.818389,1.26077,2.30536,0.703248,-4.78413
0.767981,1.59083,0.805214,1.24539,2.2374,0.689978,-5.1405
0.772022,1.54362,0.861303,1.26323,2.30477,0.738468,-4.5119
0.772022,1.54362,0.861303,1.26323,2.30477,0.738468,-4.5119
0.772022,1.54362,0.861303,1.26323,2.30477,0.738468,-4.5119
0.695059,1.4549,0.863633,1.19886,2.08978,0.743292,-6.54539
0.695059,1.4549,0.863633,1.19886,2.08978,0.743292,-6.54539
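The repeated rows come from rejected proposals: Metropolis-Hastings re-records the current sample whenever a proposal is rejected. The core of the algorithm can be sketched in a few lines of Python. This is a generic random-walk sketch, not the library's implementation; the log-likelihood function, step size, and seed here are illustrative:

```python
import math
import random

def metropolis_hastings(log_likelihood, start, step, n_samples, seed=0):
    """Random-walk Metropolis: propose a Gaussian step, accept with
    probability min(1, exp(ll_new - ll_old))."""
    rng = random.Random(seed)
    current = list(start)
    current_ll = log_likelihood(current)
    trace = []
    for _ in range(n_samples):
        proposal = [x + rng.gauss(0.0, step) for x in current]
        proposal_ll = log_likelihood(proposal)
        if math.log(rng.random()) < proposal_ll - current_ll:
            current, current_ll = proposal, proposal_ll
        # A rejected proposal repeats the previous sample, which is why
        # duplicate rows appear in the trace.
        trace.append(list(current) + [current_ll])
    return trace
```

Run long enough, the recorded samples are distributed proportionally to the likelihood, which is exactly what the trace CSV captures.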
The madai_generate_trace program will produce the number of points specified in the settings.dat file under the SAMPLER_NUMBER_OF_SAMPLES setting. The more parameters a model has, the larger the number of samples that will need to be drawn to fill the parameter space. Once you are sure that the program is working correctly, set SAMPLER_NUMBER_OF_SAMPLES to a large number (e.g., one million) and let the program run for a while.
$ madai_change_setting . SAMPLER_NUMBER_OF_SAMPLES 1000000
old: SAMPLER_NUMBER_OF_SAMPLES 100
new: SAMPLER_NUMBER_OF_SAMPLES 1000000
It is not unusual for the MCMC algorithm to start in a region of low likelihood. The result is that some of the first samples in the generated trace appear to stick out from the main regions of high likelihood in the distribution. We refer to this phase of the MCMC evolution as the "burn-in" phase. To remove these samples, you can specify a setting called MCMC_NUMBER_OF_BURN_IN_SAMPLES that discards the first N samples.
$ madai_change_setting . MCMC_NUMBER_OF_BURN_IN_SAMPLES 200
old: MCMC_NUMBER_OF_BURN_IN_SAMPLES 0
new: MCMC_NUMBER_OF_BURN_IN_SAMPLES 200
$ madai_generate_trace . "million.csv"
Here, we discard 200 samples from the trace generated by the MCMC algorithm. Note that the number of burn-in samples is not subtracted from the number of samples specified by SAMPLER_NUMBER_OF_SAMPLES.
The utility madai_analyze_trace computes some basic statistics about the samples in the generated trace that may be useful to know. Before running it, we first need to create a file listing the names of the parameters. We will borrow a file used in the next section to do this:
$ cp ../parabolic_example/parameter_priors.dat .
Example output from this program is shown below.
$ madai_analyze_trace . million.csv
parameter mean std.dev. scaled dev. best value
X0 0.644224 0.475285 0.411609 0.999923
Kinv 1.68241 0.903106 0.834254 1.00104
TEMP 0.929729 0.158476 0.146394 0.998231
best log likelihood
-4.0299
covariance:
X0 Kinv TEMP
X0 0.225896 -0.401274 0.014398
Kinv -0.401274 0.8156 -0.0649426
TEMP 0.014398 -0.0649426 0.0251147
scaled covariance:
X0 Kinv TEMP
X0 0.169422 -0.321019 0.0115184
Kinv -0.321019 0.695979 -0.0554177
TEMP 0.0115184 -0.0554177 0.0214312
While visualization and analysis software is capable of handling tens of millions of points, for faster visualization and analysis you may need to reduce the number of points. One way to keep the same random distribution is to pick out every N-th line. For example, to reduce one million points down to fifty thousand, pick out every 20th line:
$ madai_subsample_text_file 20 million.csv mcmc_50000.csv
The madai_subsample_text_file program copies every N-th sample point from a CSV file into a new CSV file.
Aside: A faster way to interface with the Distribution Sampling tools is to link directly to the Distribution Sampling library. For a complete example, look in the DistributionSampling/tutorial/examples/ParabolicPotentialModelCxx directory. A skeleton for your own C++ model can be found in DistributionSampling/tutorial/examples/MCMC_Example.
Now that the distribution samples have been computed, it is useful to visualize the distribution. One way to do this is to create a scatterplot matrix that shows pair-wise scatterplots of the parameter values.
The parameter_priors.dat file copied earlier lists the parameter names and some other information. Next, generate the scatterplot matrix with
$ madai_gnuplot_scatterplot_matrix mcmc.csv mcmc.pdf parameter_priors.dat 50
You will see the plot shown in Figure 1.
Figure 1: Scatterplot matrix of the trace data. The horizontal axis is labeled with the input parameter names. The left vertical axis also lists the input parameter names except for the first row, which is labeled "likelihood". Each plot in the matrix, with the exception of plots on the diagonal, is the scatterplot of the parameter by which it is labeled on the horizontal axis against the parameter by which it is labeled on the vertical axis. The plots in the upper right of the matrix are redundant with the plots in the lower left. Each plot on the diagonal is a histogram that shows the density of the sample points in each dimension.
The Distribution Sampling tools use a Gaussian Process Emulator to emulate a slow model much more quickly than if the model were run directly. An emulator is basically an interpolator that can estimate the uncertainty associated with the outputs it produces. To train the emulator, the software requires hundreds of sample points. At each training point, you will need the parameter values and the model outputs.
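The core idea of an interpolator that also reports uncertainty can be sketched with ordinary Gaussian-process regression. This is a generic illustration of the technique (a squared-exponential kernel with fixed hyper-parameters), not the MADAI implementation:

```python
import numpy as np

def gp_predict(X_train, y_train, X_test, length_scale=1.0, noise=1e-6):
    """Gaussian-process regression with a squared-exponential kernel.
    Returns the predictive mean and variance at each test point."""
    def kernel(A, B):
        # Squared Euclidean distances between rows of A and rows of B.
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-0.5 * d2 / length_scale ** 2)

    K = kernel(X_train, X_train) + noise * np.eye(len(X_train))
    Ks = kernel(X_test, X_train)
    Kss = kernel(X_test, X_test)
    alpha = np.linalg.solve(K, y_train)
    mean = Ks @ alpha
    var = np.diag(Kss - Ks @ np.linalg.solve(K, Ks.T))
    return mean, var
```

Near a training point the predictive variance collapses toward zero; far from all training points it grows back toward the prior variance, which is exactly the "estimate of uncertainty" behavior the emulator relies on.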
We will use the parabolic potential model as an example again.
This model has the parameters X0, Kinv, and TEMP; and the observables MEAN_X, MEAN_X_SQUARED, and MEAN_ENERGY.
First move into the DistributionSampling/tutorial directory. Then make a working directory:
$ mkdir parabolic_emulator
$ cd parabolic_emulator
For the rest of this section, we will assume that you are in your working directory and will refer to it as . (a single dot).
The first thing we will do is create a settings.dat file in this directory with some default values in it. All of the Distribution Sampling tools will look for this file.
$ madai_print_default_settings > ./settings.dat
Next, we need to create a parameter_priors.dat file that contains our assumptions about the prior probability distributions for the parameter values.
$ cp ../parabolic_example/parameter_priors.dat .
$ cat parameter_priors.dat
UNIFORM X0 -2.0 2.0
UNIFORM Kinv 0.25 4.0
UNIFORM TEMP 0.25 4.0
Next, we specify all of the experimentally observed measurements and errors:
$ cp ../parabolic_example/experimental_results.dat .
$ cat experimental_results.dat
MEAN_X 1.28759997 0.050
MEAN_X_SQUARED 2.28759997 0.179
MEAN_ENERGY 0.856200015 0.139
Finally, we create a file that contains the list of observables that we want to make use of. This should be a subset of the observables named in the results.dat files.
$ cp ../parabolic_example/observable_names.dat .
$ cat observable_names.dat
MEAN_X
MEAN_X_SQUARED
MEAN_ENERGY
We are now ready to run the first Distribution Sampling tool in this workflow, madai_generate_training_points. This program will generate a Latin hypercube in the parameter space. The number of sample points is determined by the GENERATE_TRAINING_POINTS_NUMBER_OF_POINTS setting. Like the other Distribution Sampling tools, the command-line argument is the name of the working directory.
$ madai_generate_training_points .
The output of madai_generate_training_points is a series of files model_output/run*/parameters.dat. For example:
$ cat ./model_output/run0001/parameters.dat
X0 -1.5
Kinv 2.51875
TEMP 3.45625
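A Latin hypercube design stratifies each parameter range into as many intervals as there are points, and places exactly one point in each interval per dimension. A generic sketch of the construction (not the MADAI code):

```python
import random

def latin_hypercube(bounds, n_points, seed=0):
    """Sample n_points from a Latin hypercube over uniform parameter ranges.
    bounds is a list of (low, high) pairs, one per parameter."""
    rng = random.Random(seed)
    dims = len(bounds)
    points = [[0.0] * dims for _ in range(n_points)]
    for d, (low, high) in enumerate(bounds):
        # One stratum per point in this dimension, in shuffled order.
        strata = list(range(n_points))
        rng.shuffle(strata)
        for i in range(n_points):
            u = (strata[i] + rng.random()) / n_points
            points[i][d] = low + u * (high - low)
    return points
```

The stratification guarantees good one-dimensional coverage of every parameter even with relatively few training points, which is why it is a common design for emulator training.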
The next task is to actually evaluate the model at each of these points in parameter space. This step is typically left up to you because your modeling program likely has its own way of reading in parameters and writing outputs. You will need to read in the parameter values in the parameters.dat files and write results as described below.
For this tutorial, we have written a little Python program to generate model outputs from the files and directories created by madai_generate_training_points. The parabolic_evaluate.py program can be found in the directory DistributionSampling/tutorial/parabolic_example/. To run it, write:
$ ../parabolic_example/parabolic_evaluate.py model_output/run*
parabolic_evaluate.py takes as command-line arguments the names of directories that contain a file named parameters.dat and writes a file called results.dat in each directory.
Verify that it works:
$ cat ./model_output/run0001/results.dat
MEAN_X 1.88486922161 0.147524825771
MEAN_X_SQUARED 5.87812585509 0.991401491412
MEAN_ENERGY 4.46415150519 0.369616766285
Aside: It doesn't matter how you populate the results.dat files. We expect that for your actual simulation runs, you will be running the full model on a supercomputer or on a dozen nodes of a cluster. Also, the parameters.dat files don't have to form a Latin hypercube. You are free to select them using any method.
After generating the training points, we compute the principal component analysis decomposition of the outputs.
$ madai_pca_decompose .
The file pca_decomposition.dat generated by the command above contains the PCA data. The eigenvalues are sorted in order of increasing magnitude.
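The decomposition itself is the standard eigendecomposition of the output covariance matrix. A generic sketch of the computation (madai_pca_decompose's exact conventions may differ):

```python
import numpy as np

def pca_decompose(outputs):
    """PCA of an (n_runs, n_outputs) array of model outputs: eigenvalues
    and eigenvectors of the output covariance matrix.  np.linalg.eigh
    returns eigenvalues in increasing order, matching the file's ordering."""
    centered = outputs - outputs.mean(axis=0)
    cov = np.cov(centered, rowvar=False)
    eigenvalues, eigenvectors = np.linalg.eigh(cov)
    return eigenvalues, eigenvectors
```

The components with the largest eigenvalues capture most of the variation across training runs; the emulator is trained on the retained components rather than on the raw outputs.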
The next step is to generate a Gaussian Process Emulator on the training points. The hyper-parameters of the emulator are the hyper-parameters of the covariance (kernel) functions on the parameter space and the regression function. The madai_train_emulator program generates “okay” values for the hyper-parameters by default.
Aside: For more information about covariance functions and hyperparameters, please see the MADAI Statistics Manual.
To run the basic training program:
$ madai_train_emulator .
madai_train_emulator writes out a file called emulator_state.dat which contains hyper-parameters for each of the retained PCA-decomposed sub-models.
One of the Distribution Sampling tools is a program that follows the same communication protocol as the parabolic_interactive.py program used in Section 2a. This program, called madai_emulate, can be run interactively at the command line.
$ madai_emulate .
VERSION 1
PARAMETERS
3
X0 UNIFORM -2 2
Kinv UNIFORM 0.25 4
TEMP UNIFORM 0.25 4
OUTPUTS
3
MEAN_X
MEAN_X_SQUARED
MEAN_ENERGY
COVARIANCE
TRIANGULAR_MATRIX
6
END_OF_HEADER
As with the parabolic_interactive.py program, madai_emulate waits for a vector of parameter values to be written to standard input and writes the model outputs and covariance to standard output. Below is what you should see when you enter -0.6 3.5 3.4 at the terminal. Note that the numbers may not exactly match what is below due to randomness when generating the Latin hypercube sampling.
-0.6 3.5 3.4
2.4367666912344625
9.5582566165380527
3.8049408478401152
0.012703386214090669
-8.7378171278191194e-18
-7.839420506032379e-19
0.41323617041363248
7.7322297868210109e-19
0.08618840459630378
Now we need to set madai_emulate as the EXTERNAL_MODEL_EXECUTABLE and set its single argument.
$ madai_change_setting . EXTERNAL_MODEL_EXECUTABLE madai_emulate
old: EXTERNAL_MODEL_EXECUTABLE
new: EXTERNAL_MODEL_EXECUTABLE madai_emulate
An external program may require command-line arguments. The setting EXTERNAL_MODEL_ARGUMENTS provides this capability. Arguments should all be specified on one line and separated by white space. Double quotation marks should surround arguments containing spaces.
madai_emulate takes a single argument, the current working directory. Set that argument as shown below.
$ madai_change_setting . EXTERNAL_MODEL_ARGUMENTS .
old: EXTERNAL_MODEL_ARGUMENTS
new: EXTERNAL_MODEL_ARGUMENTS .
The final step is to run the Markov chain Monte Carlo (MCMC) routine on the code. The madai_generate_trace program uses the trained Gaussian Process model emulator to produce model outputs for points in parameter space. These model outputs are compared to observed values to calculate relative likelihood.
$ madai_generate_trace . "mcmc.csv"
$ head mcmc.csv
"X0","K","TEMP","MEAN_X","MEAN_X_SQUARED","MEAN_ENERGY","LogLikelihood"
-0.167188,1.63734,0.790651,0,0,0,-7.0052
-0.176625,1.60696,0.673026,-0.176625,1.87896,0.673026,-6.89768
-0.181315,1.61748,0.744548,-0.181315,1.89524,0.744548,-6.97992
-0.181315,1.61748,0.744548,-0.181315,1.89524,0.744548,-6.97992
-0.222542,1.62965,0.668291,-0.222542,1.85553,0.668291,-6.85991
-0.188486,1.62792,0.638246,-0.188486,1.84436,0.638246,-6.81262
-0.198998,1.65095,0.659951,-0.198998,1.82891,0.659951,-6.80038
-0.19968,1.6921,0.574281,-0.19968,1.75331,0.574281,-6.61224
-0.19968,1.6921,0.574281,-0.19968,1.75331,0.574281,-6.61224
We expect the emulator to be used frequently enough that we have written madai_generate_trace to use the emulator internally if EXTERNAL_MODEL_EXECUTABLE is not set or is set to an empty value. In fact, the default behavior of madai_generate_trace is to use the built-in emulator. You may use the madai_emulate program to access the emulator if you wish, but it will typically be slower than the built-in emulator.
To use the internal emulator, run
$ madai_change_setting . EXTERNAL_MODEL_EXECUTABLE ""
The madai_generate_trace program will produce the number of points specified by the SAMPLER_NUMBER_OF_SAMPLES entry in the settings.dat file. Depending on the number of parameters, a larger or smaller number of samples will need to be drawn. Once you are sure that the program is working correctly, set SAMPLER_NUMBER_OF_SAMPLES to a large number (a million or more) and let the program run for a while.
$ madai_change_setting . SAMPLER_NUMBER_OF_SAMPLES 1000000
old: SAMPLER_NUMBER_OF_SAMPLES 100
new: SAMPLER_NUMBER_OF_SAMPLES 1000000
As before, we can generate a scatterplot of the samples generated from the emulator. Figure 2 shows a side-by-side comparison of two scatterplot matrices. The one on the left shows the density of samples from direct execution of the model. The one on the right shows the density of samples from the Gaussian Process Emulator trained on the parabolic potential training data. While the bulk of the density is in the correct place in parameter space on the right scatterplot matrix, the shape of the distribution is quite different.
$ madai_gnuplot_scatterplot_matrix mcmc.csv mcmc.pdf parameter_priors.dat 50
Figure 2: Left - Scatterplot matrix of the parabolic potential model sampled directly. Right - Scatterplot matrix of the emulated parabolic potential model with basic training of the Gaussian Process Emulator enabled.
As Figure 2 shows, the default basic training mode offered by the emulator does not always capture the shape of the high-dimensional distribution. For better results, we can enable a slightly slower but more thorough emulator training algorithm.
Change the setting EMULATOR_TRAINING_ALGORITHM as follows:
$ madai_change_setting . EMULATOR_TRAINING_ALGORITHM exhaustive_geometric_kfold_common
Now retrain the emulator with
$ madai_train_emulator .
and generate a new trace.
$ madai_generate_trace . "mcmc_better.csv"
Figure 3 shows the results of the emulator with improved training. The matrix plot from the samples generated with the emulator (right plot) trained using exhaustive geometric k-fold validation better matches the scatterplot matrix generated from the parabolic potential model directly (left plot).
Figure 3: Left - Scatterplot matrix of the parabolic potential model sampled directly. Right - Scatterplot matrix of the emulated parabolic potential model with more advanced Gaussian Process Emulator training.
So far, we have been visualizing the results of the MCMC sampling with scatterplot matrices. This section describes more advanced visualizations available in the MADAI Workbench.
The Distribution Sampling tools can also uniformly sample from the parameter space. In the next example, the model has four input parameters. We will first visualize a two-dimensional slice through parameter space. After that, we will visualize a three-dimensional volume.
To produce a slice, set the SAMPLER setting to PercentileGrid and set the SAMPLER_INACTIVE_PARAMETERS_FILE setting to point at a file that specifies which parameters should be inactivated, or locked down to a single constant value. The inactivated parameters are inactive in the sense that they are not changed by the MCMC sampler.
For this example, first make sure you are in the DistributionSampling/tutorial directory. Then make a working directory:
$ mkdir slice
$ cd slice
Next, point EXTERNAL_MODEL_EXECUTABLE at the example program.
$ madai_change_setting . EXTERNAL_MODEL_EXECUTABLE ../cup_example/cup.py
Set the EXPERIMENTAL_RESULTS_FILE location:
$ madai_change_setting . EXPERIMENTAL_RESULTS_FILE ../cup_example/experimental_results.dat
Then we set the inactive parameters file:
$ madai_change_setting . SAMPLER_INACTIVE_PARAMETERS_FILE ../cup_example/inactive_parameters_2d.dat
Let's take a look at the format of the inactive parameters file. It has one line for each inactive parameter. Each parameter line is followed by the value at which it will be fixed. Active parameters are not listed in this file.
$ cat ../cup_example/inactive_parameters_2d.dat
x2 0.0
x4 0.0
Finally, we need to change the sampler method from the default MetropolisHastings, which returns samples whose equilibrium distribution is proportional to posterior likelihood, to PercentileGrid, which generates non-random samples evenly spaced in the active parameter space according to the prior distribution assigned to each parameter. If the prior distribution for a parameter is a uniform distribution from 0 to 100, and we ask the PercentileGrid sampler for 50 samples, that parameter will have the values {1,3,5,...,97,99}. If the prior distribution for a parameter is a Gaussian distribution, then the samples will be clustered more tightly near the mean of the distribution in that parameter dimension.
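For a uniform prior, the percentile spacing just described places the k-th of n samples at the (2k+1)/(2n) quantile of the range. A one-dimensional sketch of the idea:

```python
def percentile_grid_1d(low, high, n):
    """Evenly spaced percentile samples of a uniform prior on [low, high]:
    the k-th sample sits at the (2k + 1)/(2n) quantile."""
    return [low + (high - low) * (2 * k + 1) / (2.0 * n) for k in range(n)]
```

For a non-uniform prior, the same quantiles would be passed through the prior's inverse cumulative distribution function instead, which produces the clustering near the mean described above for Gaussian priors.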
$ madai_change_setting . SAMPLER PercentileGrid
Since we have two active parameters, the PercentileGrid sampler will return a rectangular grid of points. If we want 300 samples in each direction, we will need 300² = 90,000 samples.
$ madai_change_setting . SAMPLER_NUMBER_OF_SAMPLES 90000
We will use the same madai_generate_trace program to generate the grid of samples:
$ madai_generate_trace . grid.csv
To visualize this slice, open the grid.csv file in the MADAI Workbench.
Convert from tabular data to 2D spatial data using the Table To Points filter. Set X and Y to x1 and x3. Check the “2D Points” checkbox. After applying the filter, create a new 3D View to see the points in space. When you zoom in, you should see discrete points.
Use the Delaunay2D filter to make the points into a surface. Here, this surface is colored with LogLikelihood.
If you want to see Likelihood rather than LogLikelihood, use the Calculator filter to calculate it. Make the settings of the Calculator match what is shown below. Next, change the color map to the Black-Body Radiation color map. This color map better emphasizes the regions of high likelihood.
Make a three-dimensional “slice” through this model's parameter space. Hints: you only need to deactivate one parameter; use a smaller number of samples; use the Delaunay3D filter; and use the Contour filter to find the high-likelihood region.
There are limitations to viewing the results of the MCMC sampling using either scatterplot matrices or slices, so we will use the MADAI Workbench to project those points into three dimensions.
First, change to the parabolic_emulator directory.
$ cd ../parabolic_emulator
Then use madai_subsample_text_file to reduce the size of your MCMC trace down to about 10,000 points.
Next, load the CSV file in the MADAI Workbench.
After loading the CSV file, convert from tabular data to 3D spatial data using the Table To Points filter. Set X, Y, and Z to the three model parameters X0, Kinv, and TEMP.
After applying the filter, create a new 3D View to see the points in space.
In general, the scales for each parameter will not be comparable, sometimes by many orders of magnitude. You may need to apply the Rescale Points Filter to rescale each dimension.
Understanding the shape of a collection of points in space is difficult, which is why we provide the Percentile Surface Filter, which draws a surface around the densest part of the point cloud.
To figure out which part of the cloud is densest, the filter does not calculate density, but makes use of the fact that the density increases with log-likelihood. It simply draws a surface around the 95% of the points with the highest log-likelihood. Before running the filter, be sure that “Point Scalars to Use for Percentile” is set to LogLikelihood.
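The selection step just described (keep the 95% of points with the highest log-likelihood) can be sketched as:

```python
def top_percentile_points(points, log_likelihoods, percentile=95.0):
    """Keep the given fraction of points with the highest log-likelihood,
    as the Percentile Surface filter does before drawing its surface."""
    n_keep = int(round(len(points) * percentile / 100.0))
    order = sorted(range(len(points)),
                   key=lambda i: log_likelihoods[i], reverse=True)
    return [points[i] for i in order[:n_keep]]
```

The filter then wraps a surface around the retained points; the surface construction itself is left to the Workbench.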
The resulting surface can be colored with one of the other parameters not used for projection to see the relationship between that parameter and the three parameters we used as spatial dimensions.
The MaximizeNormalizedShapeIndexProjectionFilter will search for a percentile surface projection with the most interesting shape. We define the most interesting shape to be the one with the largest normalized shape index.
For this example, we will use the dataset f_gomez_emu_reg0 (Facundo Gómez's Galaxy Formation Model Emulator). First we will extract the tarfile:
$ tar -x -z -f f_gomez_emu_reg0_2013.tar.gz
Next, we will make a working directory and move to it.
$ mkdir galaxy
$ cd galaxy
Next, copy over the parameter_priors.dat and observable_names.dat files.
$ cp ../f_gomez_emu_reg0_2013/parameter_priors.dat .
$ cp ../f_gomez_emu_reg0_2013/observable_names.dat .
Next, make a settings.dat file, and be sure to point the MODEL_OUTPUT_DIRECTORY and EXPERIMENTAL_RESULTS_FILE settings at the right locations. We also suggest setting EMULATOR_TRAINING_ALGORITHM to exhaustive_geometric_kfold_common for the best results.
### settings.dat ###
MODEL_OUTPUT_DIRECTORY ../f_gomez_emu_reg0_2013/model_output
EXPERIMENTAL_RESULTS_FILE ../f_gomez_emu_reg0_2013/experimental_results.dat
VERBOSE 1
EMULATOR_TRAINING_ALGORITHM exhaustive_geometric_kfold_common
MCMC_NUMBER_OF_BURN_IN_SAMPLES 200
SAMPLER_NUMBER_OF_SAMPLES 1000000
EXTERNAL_MODEL_EXECUTABLE madai_emulate
EXTERNAL_MODEL_ARGUMENTS .
We are ready to train the emulator:
$ madai_pca_decompose .
PCA decomposition succeeded.
Wrote PCA decomposition file './pca_decomposition.dat'.
$ madai_train_emulator .
Emulator training succeeded.
Wrote emulator state file './emulator_state.dat'.
That should take about a minute. Next, we will run the MCMC on the emulator.
$ madai_generate_trace . mcmc-1000000.csv
Using external model executable 'madai_emulate'.
Using MetropolisHastingsSampler for sampling
Succeeded writing trace file 'mcmc-1000000.csv'.
This should take a while (around half an hour). The sampling process runs on at most one core. If your system has N cores, you can divide SAMPLER_NUMBER_OF_SAMPLES by N and run N jobs in parallel.
$ madai_change_setting . SAMPLER_NUMBER_OF_SAMPLES 250000
old: SAMPLER_NUMBER_OF_SAMPLES 1000000
new: SAMPLER_NUMBER_OF_SAMPLES 250000
$ madai_launch_multiple_madai_generate_trace . 4 mcmc_parallel
$ madai_catenate_traces mcmc_parallel_*.csv > mcmc-1000000.csv
However you create mcmc-1000000.csv, you can analyze and plot it:
$ madai_analyze_trace . mcmc-1000000.csv > mcmc-analysis.txt
$ madai_gnuplot_scatterplot_matrix mcmc-1000000.csv mcmc.pdf parameter_priors.dat 100
input: /d/data/galaxy/mcmc-1000000.csv
output: /d/data/galaxy/mcmc.pdf
gnuplot file: /tmp/madai_gnuplot_scatterplot_matrix_1jkDqq/scatterplot_matrix.gplt
Let's subsample down to ten thousand points.
$ madai_subsample_text_file 100 mcmc-1000000.csv mcmc-10000.csv
Now we can open mcmc-10000.csv in the workbench and apply the Maximize Normalized Shape Index Projection filter.
This filter operates directly on the table data. Be sure to check only the parameters, and select LogLikelihood for the column to use for the percentile.
The filter has two outputs: the projected points and the percentile surface around those points.
To find out which projection was used, look at the field data attached to the output. Click the "+" sign next to the tab named "Layout #1" over the visualization window. In the list of buttons that appears, click on "Spreadsheet View". Select "MaximizeNormalizedShapeIndexProjection1" as the dataset to show in the spreadsheet view and select the attribute "Field Data".
Under the "Best Dimensions" column, you will see the three parameters that, when assigned to the spatial dimensions X, Y, and Z, produce the most interesting shape.

Congratulations! You have successfully completed the tutorial for the Distribution Sampling tools.