Working with Probability Distributions

deeplenstronomy has several built-in probability distributions directly callable from the configuration file. Sometimes these are enough, and sometimes not. If you find you need more flexibility than the built-in distributions, you can supply any distribution you want as a text file. This notebook will review the standard way of using the built-in distributions and then demonstrate the text file method.

Using built-in probability distributions

The standard way of using one of these distributions is to use the DISTRIBUTION keyword within the configuration file. Let's look at a quick example.

At present, all of the parameters are set to constant values. This configuration file offers no variance in the resulting dataset. As an example, let's play with drawing the exposure_time and magnitude of GALAXY_2 from distributions.

We'll replace exposure_time: 90 with

exposure_time:
    DISTRIBUTION:
        NAME: uniform
        PARAMETERS:
            minimum: 30.0
            maximum: 300.0

to draw the exposure_time from a uniform distribution on the interval [30.0, 300.0]. Similarly, we can replace magnitude: 21.5 with

magnitude:
    DISTRIBUTION:
        NAME: normal
        PARAMETERS:
            mean: 20.0
            std: 1.0

to draw the magnitude of GALAXY_2 from a normal distribution with mean 20.0 and standard deviation 1.0.
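
To build intuition for what these two entries request, here is a minimal standard-library sketch of the same kinds of draws. It does not call deeplenstronomy and makes no claim about how the package samples internally; the function names `draw_exposure_time` and `draw_magnitude` are hypothetical and exist only for this illustration.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is repeatable


def draw_exposure_time(minimum=30.0, maximum=300.0):
    """Uniform draw between minimum and maximum, like NAME: uniform."""
    return rng.uniform(minimum, maximum)


def draw_magnitude(mean=20.0, std=1.0):
    """Gaussian draw with the given mean and std, like NAME: normal."""
    return rng.gauss(mean, std)


exposure_times = [draw_exposure_time() for _ in range(10_000)]
magnitudes = [draw_magnitude() for _ in range(10_000)]

# The uniform draws stay inside the interval; the normal draws cluster around the mean.
print(min(exposure_times), max(exposure_times))
print(sum(magnitudes) / len(magnitudes))
```

Each simulated image would receive one such draw per parameter, which is why the dataset's metadata ends up tracing out the requested distributions.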

I have put these updates in a new config file called "demo_distributions.yaml", removed configurations 2-4 for efficiency, and increased the number of images to simulate to better characterize the distributions.

We can verify that the distributions were used in the data set by inspecting the metadata:

And we can see that the uniform distribution of exposure time and the normal distribution of magnitude have been recovered. With the exception of a few safe-guarded parameters, any single parameter can be sampled from an underlying built-in distribution using the method above.

Correlations and Non-Standard Distributions

If you would like to create correlations between parameters or use your own empirical distribution in the construction of your dataset, you can make use of deeplenstronomy's USERDIST feature. To use this feature, you add an entry to the configuration file that looks like this:

DISTRIBUTIONS:
    USERDIST_1:
        FILENAME: data/seeing.txt
        MODE: interpolate
        STEP: 20
    USERDIST_2:
        FILENAME: data/Rsersic_magnitude.txt
        MODE: sample

The DISTRIBUTIONS section goes at the same level of the yaml file as the DATASET, COSMOLOGY, IMAGE, etc. sections. Let's dive into what each part of that entry means.

USERDIST_#

FILENAME

MODE

Writing probability distribution files

Let's inspect "seeing.txt" to learn how to work with a one-dimensional distribution.

These files are whitespace-separated, use the parameter name as a column header, and specify the probability weight associated with each point in parameter space. The weights are defined relative to each other and do not need to sum to one. There is also no requirement of regular spacing in the distributions, though regular spacing may lead to more accurate interpolations.
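
As a rough sketch of the file layout described above, the following standard-library snippet writes a small one-dimensional distribution to a temporary file, reads it back, and draws from it. The seeing values and weights are made up for illustration, and the weight column name `WEIGHT` and its position as the last column are assumptions, so check them against the actual "seeing.txt" before relying on them.

```python
import os
import random
import tempfile

# Hypothetical 1-D distribution: seeing values and relative weights.
# The weights sum to 10, not 1, and the spacing is irregular.
rows = [(0.7, 1.0), (0.9, 4.0), (1.0, 3.0), (1.4, 2.0)]

# Assumption: the weight column is named WEIGHT and comes last.
path = os.path.join(tempfile.mkdtemp(), "seeing.txt")
with open(path, "w") as f:
    f.write("seeing WEIGHT\n")
    for value, weight in rows:
        f.write(f"{value} {weight}\n")

# Read the file back by splitting each line on whitespace.
with open(path) as f:
    header = f.readline().split()
    table = [[float(x) for x in line.split()] for line in f if line.strip()]

values = [r[0] for r in table]
weights = [r[1] for r in table]

# Normalizing shows why only the relative weights matter.
total = sum(weights)
probabilities = [w / total for w in weights]

rng = random.Random(1)
draws = rng.choices(values, weights=weights, k=5)
print(header, probabilities, draws)
```

Multiplying every weight by the same constant would leave `probabilities` unchanged, which is the sense in which the weights are only defined relative to each other.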

At present, the supplied seeing distribution will be applied to all bands and all configurations in the dataset. If, for example, we only wanted the distribution to apply to the $g$-band seeing, the column name would be changed to seeing-g. If we wanted the distribution to only apply to CONFIGURATION_1, then we could use CONFIGURATION_1-seeing as the column name. And if we only want to target the $g$-band seeing in CONFIGURATION_1, then we would use CONFIGURATION_1-seeing-g.
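
The naming pattern in the paragraph above can be summarized with a small helper. This is only a sketch: `userdist_column` is a hypothetical function written for this illustration, not part of deeplenstronomy, and it simply joins the optional configuration prefix, the parameter name, and the optional band suffix with hyphens.

```python
def userdist_column(parameter, band=None, configuration=None):
    """Build a column name as [CONFIGURATION-]parameter[-band]."""
    parts = []
    if configuration is not None:
        parts.append(configuration)  # restrict to one configuration
    parts.append(parameter)
    if band is not None:
        parts.append(band)  # restrict to one band
    return "-".join(parts)


print(userdist_column("seeing"))
print(userdist_column("seeing", band="g"))
print(userdist_column("seeing", configuration="CONFIGURATION_1"))
print(userdist_column("seeing", band="g", configuration="CONFIGURATION_1"))
```

The four calls reproduce the four cases from the text: all bands and configurations, $g$-band only, CONFIGURATION_1 only, and $g$-band in CONFIGURATION_1 only.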

We can verify this distribution was put into the dataset by plotting the raw text file distribution over the simulated seeing values:

Let's now use this feature to input a correlation, and let's plan to use the sample mode instead of the interpolate mode.

Here we're using a distribution in the text file to jointly draw the size of the galaxy measured in the $i$-band (PLANE_1-OBJECT_1-LIGHT_PROFILE_1-R_sersic-i) and its magnitude in all bands (PLANE_1-OBJECT_1-LIGHT_PROFILE_1-magnitude). We'll compare the raw distribution from the text file to the simulated metadata parameters:

Notice that in the left plot, points are assigned a color based on the weight in the text file, while in the right plot we have a histogram counting the number of simulated images with a particular parameter value combination.
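
One way to picture how a weighted table can carry a correlation is to draw whole rows at a time, so that each size and magnitude pair stays together. The sketch below does this with the standard library. It rests on the assumption that the sample mode picks rows with probability proportional to their weight; that reading is not confirmed here, and the rows themselves are invented for illustration.

```python
import random

# Hypothetical rows: (R_sersic in the i-band, magnitude, relative weight).
rows = [
    (0.5, 23.0, 1.0),
    (1.0, 21.5, 3.0),
    (1.5, 20.5, 4.0),
    (2.0, 19.5, 2.0),
]

rng = random.Random(2)
weights = [w for _, _, w in rows]

# Draw entire rows, so each (R_sersic, magnitude) pair stays together.
sampled = rng.choices(rows, weights=weights, k=1000)

# Every sampled pair is one of the pairs listed in the table.
pairs = {(r, m) for r, m, _ in sampled}
allowed = {(r, m) for r, m, _ in rows}
print(pairs <= allowed)
```

Because the parameters are never drawn independently, larger sizes only ever appear with the magnitudes listed beside them, which is the kind of correlation a two-dimensional histogram of the metadata would reveal.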

How am I supposed to know what to put as the column names in my text files?

Good question.

The column names are certainly scary to look at, but deeplenstronomy has functionality to help you out. Let's revisit the example of trying to correlate the $g$-band magnitude of a galaxy with the $i$-band R_sersic. Now that we've simulated a dataset, we can search for USERDIST column names:

The dataset.search() function returns all possible USERDIST column names containing the parameter of interest. The returned object is a dictionary where the keys are the object names in the SPECIES section and the values are the possible USERDIST column names.

Looking at the output, and knowing you are concerned with the object named LENS, you would be pointed directly to the column name CONFIGURATION_1-PLANE_1-OBJECT_1-LIGHT_PROFILE_1-magnitude-g, where you could leave off the CONFIGURATION_1 part to apply the USERDIST to all configurations. In this case, there is only one CONFIGURATION, so the prefix doesn't matter.
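
The prefix-dropping step can be sketched in plain Python. The dictionary below is a hand-written stand-in shaped like the search result described above (object names as keys, candidate column names as values); storing the values in a list is an assumption, and `drop_configuration_prefix` is a hypothetical helper, not a deeplenstronomy function.

```python
# Hypothetical search result, shaped as the text describes:
# keys are object names, values are candidate column names.
search_result = {
    "LENS": ["CONFIGURATION_1-PLANE_1-OBJECT_1-LIGHT_PROFILE_1-magnitude-g"],
}


def drop_configuration_prefix(column):
    """Remove a leading CONFIGURATION_<n>- so the column applies to all configurations."""
    first, _, rest = column.partition("-")
    if first.startswith("CONFIGURATION_") and rest:
        return rest
    return column  # no configuration prefix to remove


column = search_result["LENS"][0]
general_column = drop_configuration_prefix(column)
print(general_column)
```

A name without a configuration prefix, such as seeing-g, passes through the helper unchanged.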

We can repeat the process for R_sersic: