MADAI Distribution Sampling Tutorial


Authors: Hal Canary and Cory Quammen


Contents


0. Introduction

The MADAI Distribution Sampling tools enable you to estimate the most likely parameters of a model and the surrounding probability density by comparing the output from the model at a given set of parameters to a set of values obtained through actual experiment. The tools are intended to help answer which parameters of a model of a system best explain aspects of the system being modeled.

This tutorial will introduce you to the MADAI Distribution Sampling tools. It will walk you through generating a probability distribution of an example model in two different ways, computing information about the samples in the distribution, and visualizing the distribution. The tutorial is designed to be read from start to finish.

0a. Tutorial Notation

Throughout the document, some special notation is used. Commands to be typed at a shell prompt are shown prefixed with a '$' character; lines without that prefix show program output.

1. Installing Software

1a. The Distribution Sampling Tools

  1. Install the prerequisite packages (CMake, Boost, Eigen3):
    • Ubuntu/Debian-based:
      $ sudo apt-get -y install build-essential
      $ sudo apt-get -y install cmake libboost-dev libeigen3-dev gnuplot
    • Red Hat-based:
      $ sudo yum -y groupinstall "Development Tools"
      $ sudo yum -y install cmake boost-devel eigen3-devel gnuplot
    • MacOS 10.x with macports:
      $ sudo port install cmake boost eigen3 gnuplot
  2. Download the file DistributionSampling-VERSION.tgz from the MADAI web site.
  3. Navigate to the directory where you downloaded the Distribution Sampling tools.
  4. Extract the tools archive.
    $ tar -x -z -f DistributionSampling-VERSION.tgz
  5. Build and install. In this example, we will install in ${HOME}/local.
    $ cd DistributionSampling
    $ ./build.sh "${HOME}/local"

Aside: MacOS 10.x users who do not have macports and need Boost or Eigen3 can still install this library. You must get CMake from www.cmake.org. Then run the command:

$ ./full_build.sh "${HOME}/local"

This will download the necessary Boost and Eigen3 headers into a temporary directory for the compilation.

Aside: The build.sh and full_build.sh scripts simply call CMake. You can also run CMake directly.

$ cd ..
$ mkdir build
$ cd build
$ cmake "../DistributionSampling" \
    -DCMAKE_INSTALL_PREFIX:PATH="${HOME}/local" \
    -DBoost_INCLUDE_DIR:PATH="/usr/include" \
    -DEIGEN3_INCLUDE_DIR:PATH="/usr/include/eigen3" \
    -DCMAKE_BUILD_TYPE:STRING=Release \
    -DBUILD_TESTING:BOOL=0 \
    -DUSE_OPENMP:BOOL=0 \
    -DUSE_GPROF:BOOL=0
$ make
$ make install
$ PATH="${PATH}:${HOME}/local/bin"
$ cd ../DistributionSampling

Your executables will be in the directory ${HOME}/local/bin. You should make these executables available by including this directory in your path. If you use bash, write

$ PATH="${PATH}:${HOME}/local/bin"

If you have csh/tcsh as your shell, write:

$ set path = ($path "${HOME}/local/bin")

If you do not know your shell type, write:

$ echo $SHELL

Aside: If you wish to compile your own software that links against the library, the include files are in ${HOME}/local/include and the library is in ${HOME}/local/lib/madai .

1b. MADAI Workbench

The MADAI Workbench version 1.8 or higher is required for Section 3 of the tutorial. The MADAI Workbench is a customized version of ParaView with additional filters that support visualization of high dimensional data.

2. Running the Software

2a. Markov Chain Monte Carlo with a Fast Model

This section describes how to perform Markov Chain Monte Carlo sampling with a "fast model". By a "fast model" we mean one that can execute millions of times in the time it takes to get a cup of coffee.

To invoke such a model from the MADAI Distribution Tools, you will need to build an executable program to interface with the madai_generate_trace program. Your program will write information about the model to stdout, read parameter values from stdin, and write model outputs to stdout. madai_generate_trace will start your program and interactively query it for model outputs at given parameter vectors.

In this example, we have written a program that models a parabolic potential. The program, named parabolic_interactive.py, is written in Python and is located in the directory DistributionSampling/tutorial/parabolic_example/.

Aside: The Parabolic Potential Example.

This is a simple model with three inputs and three outputs. It is based on a picture of a thermalized particle in a one-dimensional harmonic oscillator potential at constant temperature, with the complication that the particle is constrained to x > 0. The particle has mass m = 1 and the potential has the form

V(x) = (x - X0)^2 / (2 × Kinv)   if x > 0
V(x) = ∞   if x ≤ 0

The three parameters are the temperature TEMP, the offset of the potential X0, and the inverse spring constant, Kinv. We use Kinv = 1/K instead of K because some of the model dependencies will go as 1/K, and it's a bad idea to pick parameters for which the behavior is singular. The three observables are the average energy <E>, the average position, <x> and the average squared position <x^2>.

This model has the parameters X0, Kinv, and TEMP; and the observables MEAN_X, MEAN_X_SQUARED, and MEAN_ENERGY. When we run the program, the output looks like:

$ cd tutorial
$ python parabolic_example/parabolic_interactive.py
# ParabolicPotentialModel
VERSION 1
PARAMETERS 3
X0	UNIFORM	-2.0	2.0
Kinv	UNIFORM	0.25	4.0
TEMP	UNIFORM	0.25	4.0
OUTPUTS 3
MEAN_X
MEAN_X_SQUARED
MEAN_ENERGY
VARIANCE 3
END_OF_HEADER
STOP

The external model then waits for a list of parameter values (encoded as text) on stdin. Then it returns the model outputs followed by the model covariances. For example, if you type the first line shown below, you will get the output values from the model (the second line) and the output variances (the third line):

0 2.25 2.25
1.7952402618064471 5.0625 2.25
0.1125 0.6570540589064506 0.25851201309032235

Aside: The Interactive Model Language.

  1. First the external model (your program) will output on standard output (stdout) a list of comments (each beginning with a '#' and ending with a '\n').
  2. Then it will output the string "VERSION 1 \n"
  3. Then it will output the string "PARAMETERS Nparam \n", where Nparam is the number of parameters.
  4. Then, for each parameter, it will output the name of the parameter (a string without whitespace), whitespace, the prior distribution, and '\n'
    • A uniform prior distribution is in the format "UNIFORM MIN MAX"
    • A Gaussian prior distribution is in the format "GAUSSIAN MEAN STDDEV"
  5. Then the string "OUTPUTS Nouts \n", where Nouts is the number of outputs.
  6. Then, for each output, it will output the name of the output (a string without whitespace), followed by a '\n'
  7. Then it either outputs:
    • "VARIANCE Nouts \n" (Nouts is still the number of outputs),
      or
    • "COVARIANCE TRIANGULAR_MATRIX M \n", where M = Nouts×(Nouts+1)/2
      or
    • "COVARIANCE FULL_MATRIX K \n", where K=Nouts^2
  8. The external model then prints "END_OF_HEADER \n"
  9. The external model will then wait for Nparam ASCII-encoded floating-point numbers on standard input. These will be interpreted as a set of parameter values.
  10. The external model should calculate model outputs at that point in parameter space, as well as the (co)variance. It will then print those numbers (ASCII-encoded) onto standard output. Remember to flush standard output after writing.
  11. The external model should then wait for the next set of parameters and repeat the output calculation.
  12. The external model should exit when it reads the string "STOP", reaches end-of-file, or receives an interrupt signal.
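The protocol above can be sketched as a minimal interactive model in Python. This is an illustrative toy (one parameter X, one output Y = X², a constant variance), not part of the tutorial code:

```python
import sys

def evaluate(x):
    """Toy model: one output y = x*x with a constant variance."""
    return x * x, 0.01

def serve(stdin=None, stdout=None):
    stdin = stdin if stdin is not None else sys.stdin
    stdout = stdout if stdout is not None else sys.stdout
    # Steps 1-8 of the protocol: write the header and flush it.
    stdout.write('# MinimalExampleModel\n'
                 'VERSION 1\n'
                 'PARAMETERS 1\n'
                 'X UNIFORM -2.0 2.0\n'
                 'OUTPUTS 1\n'
                 'Y\n'
                 'VARIANCE 1\n'
                 'END_OF_HEADER\n')
    stdout.flush()
    # Steps 9-12: read parameter values, answer with output and variance.
    for line in stdin:
        tokens = line.split()
        if not tokens or tokens[0] == 'STOP':
            break
        y, variance = evaluate(float(tokens[0]))
        stdout.write('%r\n%r\n' % (y, variance))
        stdout.flush()  # flush so the sampler is never left waiting

if __name__ == '__main__':
    serve()
```

A model like this can be set as EXTERNAL_MODEL_EXECUTABLE and driven by madai_generate_trace in the same way as parabolic_interactive.py.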

To generate samples of the distribution for the example model, first make sure you are in the DistributionSampling/tutorial directory.

Then make a working directory:

$ mkdir parabolic_fast
$ cd parabolic_fast

For the rest of this section, we will assume that you are in your working directory and will refer to it as . (a single dot).

Next, copy the “experimental” results file into the working directory. This specifies all of the experimentally observed measurements and errors. (If you skip this step, the tools assume that each measurement is 0.0 and each error is 1.0.)

$ cp ../parabolic_example/experimental_results.dat .
$ cat experimental_results.dat
MEAN_X         1.28759997   0.050
MEAN_X_SQUARED 2.28759997   0.179
MEAN_ENERGY    0.856200015  0.139
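Each line of this file gives an observable name, its measured value, and its measurement error. If you need to read this format in your own scripts, a minimal parser might look like this sketch (read_experimental_results is an illustrative helper, not part of the tools):

```python
def read_experimental_results(path):
    """Parse the format shown above: one observable per line,
    giving its name, measured value, and measurement error."""
    results = {}
    with open(path) as f:
        for line in f:
            fields = line.split()
            if len(fields) == 3:  # skip blank or malformed lines
                name, value, error = fields
                results[name] = (float(value), float(error))
    return results
```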

Next, we will create a settings.dat file in this directory with some default values in it. All of the Distribution Sampling tools will look for this file.

$ madai_print_default_settings > ./settings.dat

Aside: madai_print_default_settings produces numerous settings. A detailed description of the options in settings.dat can be found in the MADAI Statistics Manual.

Now, use your favorite text editor to edit the settings.dat file. Change the setting EXTERNAL_MODEL_EXECUTABLE to ../parabolic_example/parabolic_interactive.py. Alternatively, use the madai_change_setting program to change the setting:

$ madai_change_setting . EXTERNAL_MODEL_EXECUTABLE \
    "../parabolic_example/parabolic_interactive.py"

The final step is to run the Markov chain Monte Carlo (MCMC) routine on your model. The madai_generate_trace program uses the external model to produce model outputs for points in parameter space. These model outputs are compared to the experimental values to calculate likelihood.

The Metropolis-Hastings MCMC algorithm is used to draw a large number of samples from the distribution proportional to likelihood. These values are stored in a comma-separated-value (CSV) file specified in the arguments.

$ madai_generate_trace . "mcmc.csv"

We use the term "trace" to refer to the samples generated from the MCMC algorithm.

$ head mcmc.csv
"X0","Kinv","TEMP","MEAN_X","MEAN_X_SQUARED","MEAN_ENERGY","LogLikelihood"
0.753519,1.57836,0.899513,1.28201,2.38578,0.77336,-4.36406
0.753519,1.57836,0.899513,1.28201,2.38578,0.77336,-4.36406
0.746218,1.66737,0.818389,1.26077,2.30536,0.703248,-4.78413
0.767981,1.59083,0.805214,1.24539,2.2374,0.689978,-5.1405
0.772022,1.54362,0.861303,1.26323,2.30477,0.738468,-4.5119
0.772022,1.54362,0.861303,1.26323,2.30477,0.738468,-4.5119
0.772022,1.54362,0.861303,1.26323,2.30477,0.738468,-4.5119
0.695059,1.4549,0.863633,1.19886,2.08978,0.743292,-6.54539
0.695059,1.4549,0.863633,1.19886,2.08978,0.743292,-6.54539

The madai_generate_trace program will produce the number of points specified in the settings.dat under the SAMPLER_NUMBER_OF_SAMPLES setting. The more parameters a model has, the larger the number of samples that will need to be drawn to fill the parameter space. Once you are sure that the program is working correctly, set SAMPLER_NUMBER_OF_SAMPLES to a large number (e.g., one million) and let the program run for a while.

$ madai_change_setting . SAMPLER_NUMBER_OF_SAMPLES 1000000
old: SAMPLER_NUMBER_OF_SAMPLES 100
new: SAMPLER_NUMBER_OF_SAMPLES 1000000

It is not unusual for the MCMC algorithm to start in a region of low likelihood. The result is that some of the first samples in the generated trace appear to stick out from the main regions of high likelihood in the distribution. We refer to this phase of the MCMC evolution as the "burn-in" phase. To remove these samples, you can specify a setting called MCMC_NUMBER_OF_BURN_IN_SAMPLES that discards the first N samples.

$ madai_change_setting . MCMC_NUMBER_OF_BURN_IN_SAMPLES 200
old: MCMC_NUMBER_OF_BURN_IN_SAMPLES 0
new: MCMC_NUMBER_OF_BURN_IN_SAMPLES 200
$ madai_generate_trace . "million.csv"

Here, we discard 200 samples from the trace generated by the MCMC algorithm. Note that the number of burn-in samples is not subtracted from the number of samples specified by SAMPLER_NUMBER_OF_SAMPLES.

2b. Basic Statistics

The utility madai_analyze_trace computes some basic statistics about the samples in the generated trace that may be useful to know. Before doing that, we first need to create a file listing the names of the parameters. We will borrow a file used in the next section to do this:

$ cp ../parabolic_example/parameter_priors.dat .

Example output from this program is shown below.

$ madai_analyze_trace . million.csv
     parameter          mean      std.dev.   scaled dev.    best value
            X0      0.644224      0.475285      0.411609      0.999923
          Kinv       1.68241      0.903106      0.834254       1.00104
          TEMP      0.929729      0.158476      0.146394      0.998231

best log likelihood
       -4.0299

covariance:
                          X0          Kinv          TEMP
            X0      0.225896     -0.401274      0.014398
          Kinv     -0.401274        0.8156    -0.0649426
          TEMP      0.014398    -0.0649426     0.0251147

scaled covariance:
                          X0          Kinv          TEMP
            X0      0.169422     -0.321019     0.0115184
          Kinv     -0.321019      0.695979    -0.0554177
          TEMP     0.0115184    -0.0554177     0.0214312
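If you want to check these numbers or post-process a trace yourself, the parameter means and the covariance matrix can be recomputed from the trace CSV using only the standard library. A sketch (it omits the scaled quantities and the best-value column):

```python
import csv

def trace_parameter_stats(path, parameter_names):
    """Compute per-parameter means and the sample covariance matrix
    from a trace CSV file like the one shown above."""
    columns = {name: [] for name in parameter_names}
    with open(path) as f:
        for row in csv.DictReader(f):
            for name in parameter_names:
                columns[name].append(float(row[name]))
    n = len(columns[parameter_names[0]])
    means = {name: sum(vals) / n for name, vals in columns.items()}
    cov = {}
    for a in parameter_names:
        for b in parameter_names:
            # Sample covariance with the usual n-1 normalization.
            cov[(a, b)] = sum(
                (x - means[a]) * (y - means[b])
                for x, y in zip(columns[a], columns[b])) / (n - 1)
    return means, cov
```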

2c. Downsampling

While visualization and analysis software can handle tens of millions of points, you may want to reduce the number of points for faster interaction. One way to preserve the distribution is to pick out every N-th line. For example, to reduce one million points down to fifty thousand, pick out every 20th line:

$ madai_subsample_text_file 20 million.csv mcmc_50000.csv

The madai_subsample_text_file program will sample every N-th sample point from a CSV file into a new CSV file.
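The every-N-th-row behavior is simple enough to sketch in a few lines of Python (subsample_csv is an illustrative helper, not part of the tools):

```python
def subsample_csv(input_csv, output_csv, n):
    """Copy the header row, then keep every n-th data row."""
    with open(input_csv) as src, open(output_csv, 'w') as dst:
        dst.write(src.readline())       # header row
        for i, row in enumerate(src):
            if i % n == 0:              # keep rows 0, n, 2n, ...
                dst.write(row)
```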

Aside: A faster way to interface with the Distribution Sampling tools is to link directly to the Distribution Sampling library. For a complete example, look in the DistributionSampling/tutorial/examples/ParabolicPotentialModelCxx directory.

A skeleton for your own C++ model can be found in DistributionSampling/tutorial/examples/MCMC_Example .

2d. Plotting

Now that the distribution samples have been computed, it is useful to visualize the distribution. One way to do this is to create a scatterplot matrix that shows pair-wise scatterplots of the parameter values.

The parameter_priors.dat file copied in Section 2b lists the parameter names and their prior distributions. Next, generate the scatterplot matrix with

$ madai_gnuplot_scatterplot_matrix mcmc.csv mcmc.pdf parameter_priors.dat 50

You will see the plot shown in Figure 1.

Figure 1: Scatterplot matrix of the trace data. The horizontal axis is labeled with the input parameter names. The left vertical axis also lists the input parameter names except for the first row which is labeled "likelihood". Each plot in the matrix, with the exception of plots on the diagonal, is the scatterplot of the parameter by which it is labeled in the horizontal axis and the parameter by which it is labeled in the vertical axis. The plots in the upper right of the matrix are redundant with the plots in the lower left. Each plot on the diagonal is a histogram that shows the density of the sample points in each dimension.

2e. Use an Emulator for a Slow Model

The Distribution Sampling tools use a Gaussian Process Emulator to emulate a slow model much more quickly than if the model were run directly. An emulator is basically an interpolator that can estimate the uncertainty associated with the outputs it produces. To train the emulator, the software requires hundreds of sample points. At each training point, you will need the parameter values and the model outputs.

We will use the parabolic potential model as an example again. This model has the parameters X0, Kinv, and TEMP; and the observables MEAN_X, MEAN_X_SQUARED, and MEAN_ENERGY.

First move into the DistributionSampling/tutorial directory.

Then make a working directory:

$ mkdir parabolic_emulator
$ cd parabolic_emulator

For the rest of this section, we will assume that you are in your working directory and will refer to it as . (a single dot).

The first thing we will do is create a settings.dat file in this directory with some default values in it. All of the Distribution Sampling tools will look for this file.

$ madai_print_default_settings > ./settings.dat

Next, we need to create a parameter_priors.dat file that contains our assumptions about the prior probability distributions of the parameter values.

$ cp ../parabolic_example/parameter_priors.dat .
$ cat parameter_priors.dat
UNIFORM X0      -2.0    2.0
UNIFORM Kinv    0.25    4.0
UNIFORM TEMP    0.25    4.0

Next, we specify all of the experimentally observed measurements and errors:

$ cp ../parabolic_example/experimental_results.dat .
$ cat experimental_results.dat
MEAN_X         1.28759997   0.050
MEAN_X_SQUARED 2.28759997   0.179
MEAN_ENERGY    0.856200015  0.139

Finally, we create a file that contains the list of observables that we want to make use of. These should be a subset of the outputs listed in the results.dat files.

$ cp ../parabolic_example/observable_names.dat .
$ cat observable_names.dat
MEAN_X
MEAN_X_SQUARED
MEAN_ENERGY

We are now ready to run the first Distribution Sampling tool in this workflow, madai_generate_training_points. This program will generate a Latin hypercube in the parameter space. The number of sample points is determined by the GENERATE_TRAINING_POINTS_NUMBER_OF_POINTS setting. As with the other Distribution Sampling tools, the command-line argument is the name of the working directory.

$ madai_generate_training_points .

The output of madai_generate_training_points is a series of files model_output/run*/parameters.dat. For example:

$ cat ./model_output/run0001/parameters.dat
X0 -1.5
Kinv 2.51875
TEMP 3.45625

The next task is to actually evaluate the model at this point in parameter space. This step is typically left up to you because your modeling program likely has its own way of reading in parameters and writing outputs. You will need to read in the parameter values in the parameters.dat files and write results as described below.

For this tutorial, we have written a little Python program to generate model outputs from the files and directories created by madai_generate_training_points. The parabolic_evaluate.py program can be found in the directory DistributionSampling/tutorial/parabolic_example/. To run it, write

$ ../parabolic_example/parabolic_evaluate.py model_output/run*

parabolic_evaluate.py takes as command-line arguments the names of directories, each containing a file named parameters.dat, and writes a file called results.dat in each directory.

Verify that it works:

$ cat ./model_output/run0001/results.dat
MEAN_X 1.88486922161 0.147524825771
MEAN_X_SQUARED 5.87812585509 0.991401491412
MEAN_ENERGY 4.46415150519 0.369616766285

Aside: It doesn't matter how you populate the results.dat files. We expect that for your actual simulation runs, you will be running the full model on a supercomputer or on a dozen nodes of a cluster.

Also, the parameters.dat files don't have to be a Latin hypercube. You are free to select them using any method.
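A wrapper for your own model, in the spirit of parabolic_evaluate.py, might follow this sketch. Here my_model is a hypothetical stand-in for your simulation; only the parameters.dat and results.dat formats shown above are taken from the tutorial:

```python
import os
import sys

def my_model(parameters):
    """Hypothetical stand-in for your simulation: maps a parameter
    dictionary to {output name: (value, error)}. Replace this with
    a call into your own code."""
    x0 = parameters['X0']
    return {'MEAN_X': (x0 + 1.0, 0.1)}

def run_one(directory):
    """Read parameters.dat in `directory`, evaluate the model, and
    write results.dat next to it."""
    parameters = {}
    with open(os.path.join(directory, 'parameters.dat')) as f:
        for line in f:
            name, value = line.split()
            parameters[name] = float(value)
    with open(os.path.join(directory, 'results.dat'), 'w') as f:
        for name, (value, error) in my_model(parameters).items():
            f.write('%s %r %r\n' % (name, value, error))

if __name__ == '__main__':
    for d in sys.argv[1:]:   # e.g. model_output/run*
        run_one(d)
```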

After generating the training points, we compute the principal component analysis decomposition of the outputs.

$ madai_pca_decompose .

The file pca_decomposition.dat generated by the command above contains the PCA data. The eigenvalues are sorted in order of increasing magnitude.

The next step is to generate a Gaussian Process Emulator on the training points. The hyper-parameters of the emulator are the hyper-parameters of the covariance (kernel) functions on the parameter space and the regression function. The madai_train_emulator program generates “okay” values for the hyper-parameters by default.

Aside: For more information about covariance functions and hyperparameters, please see the MADAI Statistics Manual.

To run the basic training program:

$ madai_train_emulator .

madai_train_emulator writes out a file called emulator_state.dat, which contains the hyper-parameters for each of the retained PCA-decomposed sub-models.

One of the Distribution Sampling tools is a program that follows the same communication protocol as the parabolic_interactive.py program used in Section 2a. This program, called madai_emulate, can be run interactively at the command line.

$ madai_emulate .
VERSION 1
PARAMETERS
3
X0	UNIFORM	-2	2
Kinv	UNIFORM	0.25	4
TEMP	UNIFORM	0.25	4
OUTPUTS
3
MEAN_X
MEAN_X_SQUARED
MEAN_ENERGY
COVARIANCE
TRIANGULAR_MATRIX
6
END_OF_HEADER

As with the parabolic_interactive.py program, madai_emulate waits for parameter values to be written to standard input and writes the model outputs and covariance to standard output. Below is what you should see when you enter -0.6 3.5 3.4 at the terminal. Note that the numbers may not exactly match what is below due to randomness when generating the Latin hypercube sampling.

-0.6 3.5 3.4
2.4367666912344625
9.5582566165380527
3.8049408478401152
0.012703386214090669
-8.7378171278191194e-18
-7.839420506032379e-19
0.41323617041363248
7.7322297868210109e-19
0.08618840459630378

Now we need to set madai_emulate as the EXTERNAL_MODEL_EXECUTABLE and set its single argument.

$ madai_change_setting . EXTERNAL_MODEL_EXECUTABLE madai_emulate
old: EXTERNAL_MODEL_EXECUTABLE
new: EXTERNAL_MODEL_EXECUTABLE madai_emulate

An external program may require command-line arguments. The setting EXTERNAL_MODEL_ARGUMENTS provides this capability. Arguments should all be specified on one line and separated by white space. Double quotation marks should surround arguments that contain spaces.

madai_emulate takes a single argument, the current working directory. Set that argument as shown below.

$ madai_change_setting . EXTERNAL_MODEL_ARGUMENTS .
old: EXTERNAL_MODEL_ARGUMENTS
new: EXTERNAL_MODEL_ARGUMENTS .

The final step is to run the Markov chain Monte Carlo (MCMC) routine on the code. The madai_generate_trace program uses the trained Gaussian Process model emulator to produce model outputs for points in parameter space. These model outputs are compared to observed values to calculate relative likelihood.

$ madai_generate_trace . "mcmc.csv"
$ head mcmc.csv
"X0","Kinv","TEMP","MEAN_X","MEAN_X_SQUARED","MEAN_ENERGY","LogLikelihood"
-0.167188,1.63734,0.790651,0,0,0,-7.0052
-0.176625,1.60696,0.673026,-0.176625,1.87896,0.673026,-6.89768
-0.181315,1.61748,0.744548,-0.181315,1.89524,0.744548,-6.97992
-0.181315,1.61748,0.744548,-0.181315,1.89524,0.744548,-6.97992
-0.222542,1.62965,0.668291,-0.222542,1.85553,0.668291,-6.85991
-0.188486,1.62792,0.638246,-0.188486,1.84436,0.638246,-6.81262
-0.198998,1.65095,0.659951,-0.198998,1.82891,0.659951,-6.80038
-0.19968,1.6921,0.574281,-0.19968,1.75331,0.574281,-6.61224
-0.19968,1.6921,0.574281,-0.19968,1.75331,0.574281,-6.61224

We expect the emulator to be used frequently enough that we have written madai_generate_trace to use the emulator internally if EXTERNAL_MODEL_EXECUTABLE is not set or is set to an empty value. In fact, the default behavior of madai_generate_trace is to use the built-in emulator. You may use the madai_emulate program to access the emulator if you wish, but it will typically be slower than the built-in emulator.

To use the internal emulator, run

$ madai_change_setting . EXTERNAL_MODEL_EXECUTABLE ""

The madai_generate_trace program will produce the number of points specified by the SAMPLER_NUMBER_OF_SAMPLES entry in the settings.dat file. Depending on the number of parameters, a larger or smaller number of samples will need to be drawn. Once you are sure that the program is working correctly, set SAMPLER_NUMBER_OF_SAMPLES to a large number (a million or more) and let the program run for a while.

$ madai_change_setting . SAMPLER_NUMBER_OF_SAMPLES 1000000
old: SAMPLER_NUMBER_OF_SAMPLES 100
new: SAMPLER_NUMBER_OF_SAMPLES 1000000

As before, we can generate a scatterplot of the samples generated from the emulator. Figure 2 shows a side-by-side comparison of two scatterplot matrices. The one on the left shows the density of samples from direct execution of the model. The one on the right shows the density of samples from the Gaussian Process Emulator trained on the parabolic potential training data. While the bulk of the density is in the correct place in parameter space on the right scatterplot matrix, the shape of the distribution is quite different.

$ madai_gnuplot_scatterplot_matrix mcmc.csv mcmc.pdf parameter_priors.dat 50

Figure 2: Left - Scatterplot matrix of the parabolic potential model sampled directly. Right - Scatterplot matrix of the emulated parabolic potential model with basic training of the Gaussian Process Emulator enabled.

As Figure 2 shows, the default basic training mode offered by the emulator does not always capture the shape of the high-dimensional distribution. For better results, we can enable a slightly slower but more thorough emulator training algorithm.

Change the setting EMULATOR_TRAINING_ALGORITHM to the following:

$ madai_change_setting . EMULATOR_TRAINING_ALGORITHM exhaustive_geometric_kfold_common

Now retrain the emulator with

$ madai_train_emulator .

and generate a new trace.

$ madai_generate_trace . "mcmc_better.csv"

Figure 3 shows the results of the emulator with improved training. The matrix plot from the samples generated with the emulator (right plot) trained using exhaustive geometric k-fold validation better matches the scatterplot matrix generated from the parabolic potential model directly (left plot).

Figure 3: Left - Scatterplot matrix of the parabolic potential model sampled directly. Right - Scatterplot matrix of the emulated parabolic potential model with more advanced Gaussian Process Emulator training.

3. Visualizing the Results

So far, we have been visualizing the results of the MCMC sampling with scatterplot matrices. This section describes more advanced visualizations available in the MADAI Workbench.

3a. Visualizing a Slice of Parameter Space

The Distribution Sampling tools can also uniformly sample from the parameter space. In the next example, the model has four input parameters. We will first visualize a two-dimensional slice through parameter space. After that, we will visualize a three-dimensional volume.

To produce a slice, set the SAMPLER setting to PercentileGrid and set the SAMPLER_INACTIVE_PARAMETERS_FILE setting to point at a file that specifies which parameters should be inactivated or locked down to a single constant value. The inactivated parameters are inactive in the sense that they are not changed by the MCMC sampler.

For this example, first make sure you are in the DistributionSampling/tutorial directory.

Then make a working directory:

$ mkdir slice
$ cd slice

Next, point EXTERNAL_MODEL_EXECUTABLE at the example program.

$ madai_change_setting . EXTERNAL_MODEL_EXECUTABLE ../cup_example/cup.py

Set the EXPERIMENTAL_RESULTS_FILE location:

$ madai_change_setting . EXPERIMENTAL_RESULTS_FILE ../cup_example/experimental_results.dat

Then we set the inactive parameters file,

$ madai_change_setting . SAMPLER_INACTIVE_PARAMETERS_FILE ../cup_example/inactive_parameters_2d.dat

Let's take a look at the format of the inactive parameters file. It has one line for each inactive parameter. Each parameter line is followed by the value at which it will be fixed. Active parameters are not listed in this file.

$ cat ../cup_example/inactive_parameters_2d.dat
x2 0.0
x4 0.0

Finally, we need to change the sampler method from the default MetropolisHastings, which returns samples whose equilibrium distribution is proportional to posterior likelihood, to PercentileGrid, which generates non-random samples evenly spaced in the active parameter space according to the prior distributions assigned to each parameter. If the prior distribution for a parameter is a uniform distribution from 0 to 100, and we ask the PercentileGrid sampler for 50 samples, that parameter will have the values {1, 3, 5, ..., 97, 99}. If the prior distribution for a parameter is a Gaussian distribution, then the samples will be clustered more tightly near the mean of the distribution in that parameter dimension.

$ madai_change_setting . SAMPLER PercentileGrid

Since we have two active parameters, the PercentileGrid sampler will return a rectangular grid of points. If we want 300 samples in each direction, we will need 300² = 90,000 samples.

$ madai_change_setting . SAMPLER_NUMBER_OF_SAMPLES 90000
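The percentile placement described above is easy to sketch for a uniform prior: with n samples, sample i is placed at the (i + 0.5)/n percentile of the prior. (percentile_grid_uniform is an illustrative helper, not a MADAI function.)

```python
def percentile_grid_uniform(minimum, maximum, n):
    """Place n samples at the midpoint percentiles (i + 0.5)/n of a
    UNIFORM(minimum, maximum) prior; this reproduces the
    {1, 3, ..., 99} example above for UNIFORM(0, 100) with n = 50."""
    width = maximum - minimum
    return [minimum + width * (i + 0.5) / n for i in range(n)]
```

For a Gaussian prior, the same percentiles would instead be pushed through the Gaussian inverse CDF, which is why those samples cluster near the mean.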

We will use the same madai_generate_trace program to generate the grid samples:

$ madai_generate_trace . grid.csv

Viewing the Slice

To visualize this slice, open the grid.csv file in the MADAI Workbench.

Convert from tabular data to 2D spatial data using the Table To Points filter. Set X and Y to x1 and x3. Check the “2D Points” checkbox. After applying the filter, create a new 3D View to see the points in space. When you zoom in, you should see discrete points.

Use the Delaunay2D filter to make the points into a surface. Here, this surface is colored with LogLikelihood.

If you want to see Likelihood rather than LogLikelihood, use the Calculator filter to compute it as exp(LogLikelihood). Next, change the color map to the Black-Body Radiation color map, which better emphasizes the regions of high likelihood.

Exercise:

Make a three-dimensional “slice” through this model's parameter space. Hints: you only need to deactivate one parameter; use a smaller number of samples; use the Delaunay3D filter; and use the Contour filter to find the high-likelihood region.

3b. Percentile Surface Filter

There are limitations to viewing the results of the MCMC sampling using either scatterplot matrices or slices, so we will use the MADAI Workbench to project those points into three dimensions.

First, change to the parabolic_emulator directory.

$ cd ../parabolic_emulator

Then use madai_subsample_text_file to reduce the size of your MCMC trace down to about 10000 points.

Next, load the CSV file in the MADAI Workbench.

After loading the CSV file, convert from tabular data to 3D spatial data using the Table To Points filter. Set X, Y, and Z to the three model parameters X0, Kinv, and TEMP. After applying the filter, create a new 3D View to see the points in space.

In general, the scales for each parameter will not be comparable, sometimes by many orders of magnitude. You may need to apply the Rescale Points Filter to rescale each dimension.

Understanding the shape of a collection of points in space is difficult. For this reason, we provide the Percentile Surface Filter, which draws a surface around the densest part of the point cloud.

To figure out which part of the cloud is densest, the filter does not calculate density, but makes use of the fact that the density increases with log-likelihood. It simply draws a surface around the 95% of the points with the highest log-likelihood. Before running the filter, be sure that “Point Scalars to Use for Percentile” is set to LogLikelihood.
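The selection step can be sketched as follows (top_percentile is an illustrative helper, not the filter's actual implementation):

```python
def top_percentile(points, log_likelihoods, fraction=0.95):
    """Return the given fraction of points with the highest
    log-likelihood, mirroring what the Percentile Surface Filter
    keeps before drawing its surface."""
    order = sorted(range(len(points)),
                   key=lambda i: log_likelihoods[i], reverse=True)
    keep = order[:int(round(fraction * len(points)))]
    return [points[i] for i in keep]
```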

The resulting surface can be colored with one of the other parameters not used for projection to see the relationship between that parameter and the three parameters we used as spatial dimensions.

3c. Optimal Percentile Surface Projection

The MaximizeNormalizedShapeIndexProjectionFilter will search for a percentile surface projection with the most interesting shape. We define the most interesting shape to be the one with the largest normalized shape index.

For this example, we will use the dataset f_gomez_emu_reg0 (Facundo Gómez's Galaxy Formation Model Emulator). First we will extract the tarfile:

$ tar -x -z -f f_gomez_emu_reg0_2013.tar.gz

Next, we will make a working directory and move to it.

$ mkdir galaxy
$ cd galaxy

Next, copy over the parameter_priors.dat and observable_names.dat files.

$ cp ../f_gomez_emu_reg0_2013/parameter_priors.dat .
$ cp ../f_gomez_emu_reg0_2013/observable_names.dat .

Next, make a settings.dat file, being sure to point MODEL_OUTPUT_DIRECTORY and EXPERIMENTAL_RESULTS_FILE at the extracted data. We also suggest setting EMULATOR_TRAINING_ALGORITHM to exhaustive_geometric_kfold_common for the best results.

### settings.dat ###
MODEL_OUTPUT_DIRECTORY ../f_gomez_emu_reg0_2013/model_output
EXPERIMENTAL_RESULTS_FILE ../f_gomez_emu_reg0_2013/experimental_results.dat
VERBOSE 1
EMULATOR_TRAINING_ALGORITHM exhaustive_geometric_kfold_common
MCMC_NUMBER_OF_BURN_IN_SAMPLES 200
SAMPLER_NUMBER_OF_SAMPLES 1000000
EXTERNAL_MODEL_EXECUTABLE madai_emulate
EXTERNAL_MODEL_ARGUMENTS .

We are ready to train the emulator:

$ madai_pca_decompose .
PCA decomposition succeeded.
Wrote PCA decomposition file './pca_decomposition.dat'.
$ madai_train_emulator .
Emulator training succeeded.
Wrote emulator state file './emulator_state.dat'.

That should take about a minute. Next, we will run the MCMC on the emulator.

$ madai_generate_trace . mcmc-1000000.csv
Using external model executable 'madai_emulate'.
Using MetropolisHastingsSampler for sampling
Succeeded writing trace file 'mcmc-1000000.csv'.

This should take a while (around half an hour). The sampling process runs on at most one core. If your system has N cores, you can divide SAMPLER_NUMBER_OF_SAMPLES by N and run N jobs in parallel.

$ madai_change_setting . SAMPLER_NUMBER_OF_SAMPLES 250000
old: SAMPLER_NUMBER_OF_SAMPLES 1000000
new: SAMPLER_NUMBER_OF_SAMPLES 250000
$ madai_launch_multiple_madai_generate_trace . 4 mcmc_parallel
$ madai_catenate_traces mcmc_parallel_*.csv > mcmc-1000000.csv

Whichever way you create mcmc-1000000.csv, you can analyze and plot it:

$ madai_analyze_trace . mcmc-1000000.csv > mcmc-analysis.txt
$ madai_gnuplot_scatterplot_matrix mcmc-1000000.csv mcmc.pdf parameter_priors.dat 100
input:   /d/data/galaxy/mcmc-1000000.csv
output:  /d/data/galaxy/mcmc.pdf
gnuplot file: /tmp/madai_gnuplot_scatterplot_matrix_1jkDqq/scatterplot_matrix.gplt

Let's subsample down to ten thousand points.

$ madai_subsample_text_file 100 mcmc-1000000.csv mcmc-10000.csv

Now we can open mcmc-10000.csv in the workbench and apply the Maximize Normalized Shape Index Projection filter.

This filter operates directly on the table data. Be sure to check only the parameters, and select LogLikelihood for the column to use for the percentile.

The filter has two outputs: the projected points and the percentile surface around those points.

To find out which projection was used, look at the field data attached to the output. Click the "+" sign next to the tab named "Layout #1" over the visualization window. In the list of buttons that appears, click on "Spreadsheet View". Select "MaximizeNormalizedShapeIndexProjection1" as the dataset to show in the spreadsheet view and select the attribute "Field Data".

Under the "Best Dimensions" column, you will see the three parameters that, when assigned to the spatial dimensions X, Y, and Z, produce the most interesting shape.


Congratulations! You have successfully completed the tutorial for the Distribution Sampling tools.