deeplenstronomy
has several built-in probability distributions directly callable from the configuration file. Sometimes these are enough, and sometimes not. If you find you need more flexibility than the built-in distributions, you can supply any distribution you want as a text file. This notebook will review the standard way of using the built-in distributions and then demonstrate the text file method.
from deeplenstronomy import distributions
[x for x in dir(distributions) if not x.startswith('_')]
['delta_function', 'delve_exposure_time', 'delve_magnitude_zero_point', 'delve_seeing', 'delve_sky_brightness', 'des_ccd_gain', 'des_exposure_time', 'des_magnitude_zero_point', 'des_num_exposures', 'des_seeing', 'des_sky_brightness', 'lognormal', 'lsst_exposure_time', 'lsst_magnitude_zero_point', 'lsst_num_exposures', 'lsst_seeing', 'lsst_sky_brightness', 'normal', 'np', 'poisson', 'poisson_noise', 'random', 'symmetric_uniform_annulus', 'uniform', 'uniform_int', 'ztf_magnitude_zero_point', 'ztf_seeing', 'ztf_sky_brightness']
The standard way of using one of these distributions is to use the DISTRIBUTION
keyword within the configuration file. Let's look at a quick example.
import deeplenstronomy.deeplenstronomy as dl
! cat data/demo.yaml
DATASET: NAME: MyDeeplenstronomyDataset PARAMETERS: SIZE: 100 OUTDIR: MySimulationResults COSMOLOGY: PARAMETERS: H0: 70 Om0: 0.3 IMAGE: PARAMETERS: exposure_time: 90 numPix: 100 pixel_scale: 0.263 psf_type: 'GAUSSIAN' read_noise: 7 ccd_gain: 6.083 SURVEY: PARAMETERS: BANDS: g,r,i,z,Y seeing: 0.9 magnitude_zero_point: 30.0 sky_brightness: 23.5 num_exposures: 10 SPECIES: GALAXY_1: NAME: LENS LIGHT_PROFILE_1: NAME: SERSIC_ELLIPSE PARAMETERS: magnitude: 19.5 center_x: 0.0 center_y: 0.0 R_sersic: 10 n_sersic: 4 e1: 0.2 e2: -0.1 LIGHT_PROFILE_2: NAME: SERSIC_ELLIPSE PARAMETERS: magnitude: 18.0 center_x: 0.0 center_y: 0.0 R_sersic: 3 n_sersic: 8 e1: 0.05 e2: -0.05 MASS_PROFILE_1: NAME: SIE PARAMETERS: theta_E: 1.0 e1: 0.1 e2: -0.1 center_x: 0.0 center_y: 0.0 SHEAR_PROFILE_1: NAME: SHEAR PARAMETERS: gamma1: 0.08 gamma2: 0.01 GALAXY_2: NAME: SOURCE LIGHT_PROFILE_1: NAME: SERSIC_ELLIPSE PARAMETERS: magnitude: 21.5 center_x: 0.0 center_y: 0.0 R_sersic: 6 n_sersic: 5 e1: 0.2 e2: -0.1 SHEAR_PROFILE_1: NAME: SHEAR PARAMETERS: gamma1: 0.08 gamma2: 0.01 POINTSOURCE_1: NAME: AGN HOST: SOURCE PARAMETERS: magnitude: 16 POINTSOURCE_2: NAME: SUPERNOVA HOST: SOURCE PARAMETERS: magnitude: 21.0 sep: 2.0 sep_unit: arcsec POINTSOURCE_3: NAME: STAR HOST: Foreground PARAMETERS: magnitude: 14.0 NOISE_1: NAME: POISSON_NOISE PARAMETERS: mean: 2.0 GEOMETRY: CONFIGURATION_1: NAME: GALAXY_AGN FRACTION: 0.25 PLANE_1: OBJECT_1: LENS PARAMETERS: REDSHIFT: 0.2 PLANE_2: OBJECT_1: SOURCE OBJECT_2: AGN PARAMETERS: REDSHIFT: 0.7 NOISE_SOURCE_1: POISSON_NOISE CONFIGURATION_2: NAME: GALAXY_AGN_NOISELESS FRACTION: 0.25 PLANE_1: OBJECT_1: LENS PARAMETERS: REDSHIFT: 0.2 PLANE_2: OBJECT_1: SOURCE OBJECT_2: AGN PARAMETERS: REDSHIFT: 0.7 CONFIGURATION_3: NAME: LENSED_SNE FRACTION: 0.25 PLANE_1: OBJECT_1: LENS PARAMETERS: REDSHIFT: 0.2 PLANE_2: OBJECT_1: SOURCE OBJECT_2: SUPERNOVA PARAMETERS: REDSHIFT: 0.7 NOISE_SOURCE_1: POISSON_NOISE CONFIGURATION_4: NAME: SPICY_LIFE FRACTION: 0.25 PLANE_1: OBJECT_1: LENS OBJECT_2: STAR OBJECT_3: STAR PARAMETERS: REDSHIFT: 0.2 PLANE_2: OBJECT_1: LENS PARAMETERS: REDSHIFT: 0.7 PLANE_3: OBJECT_1: SOURCE OBJECT_2: SUPERNOVA OBJECT_3: AGN PARAMETERS: REDSHIFT: 1.3 NOISE_SOURCE_1: POISSON_NOISE
At present, all of the parameters are set to constant values. This configuration file offers no variance in the resulting dataset. As an example, let's play with drawing the exposure_time
and magnitude
of GALAXY_2
from distributions.
We'll replace exposure_time: 90
with
exposure_time:
DISTRIBUTION:
NAME: uniform
PARAMETERS:
minimum: 30.0
maximum: 300.0
to draw the exposure_time
from a uniform distribution on the interval [30.0, 300.0]. Similarly, we can replace magnitude: 21.5
with
magnitude:
DISTRIBUTION:
NAME: normal
PARAMETERS:
mean: 20.0
std: 1.0
to draw the magnitude
of GALAXY_2
from a normal distribution with mean 20.0 and standard deviation 1.0.
I have put these update in a new config file called "demo_distributions.yaml", removed configurations 2-4 for efficiency, and increased the number of images to simualte to better characterize the distributions.
! cat data/demo_distributions.yaml
DATASET: NAME: MyDeeplenstronomyDataset PARAMETERS: SIZE: 1000 OUTDIR: MySimulationResults COSMOLOGY: PARAMETERS: H0: 70 Om0: 0.3 IMAGE: PARAMETERS: exposure_time: DISTRIBUTION: NAME: uniform PARAMETERS: minimum: 30.0 maximum: 300.0 numPix: 100 pixel_scale: 0.263 psf_type: 'GAUSSIAN' read_noise: 7 ccd_gain: 6.083 SURVEY: PARAMETERS: BANDS: g,r,i,z,Y seeing: 0.9 magnitude_zero_point: 30.0 sky_brightness: 23.5 num_exposures: 10 SPECIES: GALAXY_1: NAME: LENS LIGHT_PROFILE_1: NAME: SERSIC_ELLIPSE PARAMETERS: magnitude: 19.5 center_x: 0.0 center_y: 0.0 R_sersic: 10 n_sersic: 4 e1: 0.2 e2: -0.1 LIGHT_PROFILE_2: NAME: SERSIC_ELLIPSE PARAMETERS: magnitude: 18.0 center_x: 0.0 center_y: 0.0 R_sersic: 3 n_sersic: 8 e1: 0.05 e2: -0.05 MASS_PROFILE_1: NAME: SIE PARAMETERS: theta_E: 2.0 e1: 0.1 e2: -0.1 center_x: 0.0 center_y: 0.0 SHEAR_PROFILE_1: NAME: SHEAR PARAMETERS: gamma1: 0.08 gamma2: 0.01 GALAXY_2: NAME: SOURCE LIGHT_PROFILE_1: NAME: SERSIC_ELLIPSE PARAMETERS: magnitude: DISTRIBUTION: NAME: normal PARAMETERS: mean: 20.0 std: 1.0 center_x: 0.0 center_y: 0.0 R_sersic: 6 n_sersic: 5 e1: 0.2 e2: -0.1 SHEAR_PROFILE_1: NAME: SHEAR PARAMETERS: gamma1: 0.08 gamma2: 0.01 POINTSOURCE_1: NAME: AGN HOST: SOURCE PARAMETERS: magnitude: 16 POINTSOURCE_2: NAME: SUPERNOVA HOST: SOURCE PARAMETERS: magnitude: 21.0 sep: 2.0 sep_unit: arcsec POINTSOURCE_3: NAME: STAR HOST: Foreground PARAMETERS: magnitude: 14.0 NOISE_1: NAME: POISSON_NOISE PARAMETERS: mean: 2.0 DISTRIBUTIONS: USERDIST_1: FILENAME: data/seeing.txt MODE: interpolate STEP: 20 USERDIST_2: FILENAME: data/Rsersic_magnitude.txt MODE: sample GEOMETRY: CONFIGURATION_1: NAME: GALAXY_AGN FRACTION: 0.25 PLANE_1: OBJECT_1: LENS PARAMETERS: REDSHIFT: 0.2 PLANE_2: OBJECT_1: SOURCE OBJECT_2: AGN PARAMETERS: REDSHIFT: 0.7 NOISE_SOURCE_1: POISSON_NOISE
dataset = dl.make_dataset("data/demo_distributions.yaml", verbose=True)
Entering main organization loop Organizing CONFIGURATION_1 Generating images for CONFIGURATION_1 Progress: 100.0 % --- Elapsed Time: 0 H 1 M 11 S
We can verify that the distributions were used in the data set by inspecting the metadata:
exp_times = dataset.CONFIGURATION_1_metadata['exposure_time-g'].values
magnitudes = dataset.CONFIGURATION_1_metadata['PLANE_2-OBJECT_1-LIGHT_PROFILE_1-magnitude-g'].values
import matplotlib.pyplot as plt
fig, (ax1, ax2) = plt.subplots(1, 2)
ax1.hist(exp_times)
ax1.set_xlabel("Exposure Time (s)")
ax2.hist(magnitudes)
ax2.set_xlabel("GALAXY_2 Magnitude")
plt.show()
And we can see the uniform distribution of exposure time and the normal distribution of magnitude have been recovereed. With the exception of a few safe-guarded parameters, any single parameter can be sampled from an underlying built-in distribution using the method above.
If you would like to create correlations between parameters or utilize your own empriical distribution in the construction of your dataset, you can make use of deeplenstronomy
's USERDIST
feature. To use this feature, you add an entry to the configuration file that looks like this:
DISTRIBUTIONS:
USERDIST_1:
FILENAME: data/seeing.txt
MODE: interpolate
STEP: 20
USERDIST_2:
FILENAME: data/Rsersic_magnitude.txt
MODE: sample
The DISTRIBUTIONS
section goes a the same level of the yaml file as the DATASET
, COSMOLOGY
, IMAGE
etc. sections. Let's dive into what each of the parts of that entry mean.
USERDIST_#
USERDIST
entry represents an independent probability distribution you want to add into your dataset. They must be indexed as 1, 2, 3, ... for deeplenstronomy
to track them properly.FILENAME
MODE
This tells deeplenstronomy
how you want to use the probability distribution you supplied. The options are interpolate
and sample
.
interpolate
will use the points in your distribution to form the basis of a grid that will be sampled during dataset generation.
STEP
is only used if you use the interpolate
mode, and it is the number of points in each dimension of the grid that is sampled. The number of elements in the grid is the value of STEP
raised to the power of the number of dimensions in your distribution. This can easily exceed the available memory in your computer if you have high-dimensional distributions and a large STEP
value. If STEP
is not specified, the default of 10 will be used.sample
will draw only from the raw points specified and not perform any interpolation. This mode is recommended if you have a highly-dimensional distribution with a lot of structure in each dimension for memory considerations.
Let's inspect "seeing.txt" to learn how to work with a one-dimensional distribution.
! cat data/seeing.txt
seeing WEIGHT 0.5 0.1 0.6 0.2 0.7 0.25 0.8 0.3 0.9 0.3 1.0 0.25 1.1 0.6 1.2 0.9 1.3 0.2 1.4 0.1
These files are whitespace-separated, use the paramater name as a column header, and specify the probability weight associated with each point in parameter space. The weights are defined relative to each other and do not need to sum to one. There is also no requirement of regular spacing in the distributions, though it may lead to more accurate interpolations.
At present, the supplied seeing distribution will be applied to all bands and all configurations in the dataset. If, for example, we only wanted the distribution to apply to the $g$-band seeing, the column name would be changed to seeing-g
. If we wanted the distribution to only apply to CONFIGURATION_1
, then we could use CONFIGURATION_1-seeing
as the column name. And if we only want to target the $g$ -band seeing in CONFIGURATION_1
, then we would use CONFIGURATION_1-seeing-g
.
We can verify this distribution was put into the dataset by plotting the raw text file distribution over the simulated seeing values:
import numpy as np
import pandas as pd
df = pd.read_csv('data/seeing.txt', delim_whitespace=True)
# just normalizing weights to an AUC of 1.0 using a rectangular sum
norm_const = sum((df['seeing'].values.max() - df['seeing'].values.min()) / len(df) * df['WEIGHT'].values.mean() * df['seeing'].values)
plt.figure()
plt.scatter(df['seeing'].values, df['WEIGHT'].values / norm_const, color='black', label='Raw Distribution Points')
plt.hist(dataset.CONFIGURATION_1_metadata['seeing-g'].values, density=True, histtype='step', lw=3, color='red', label="Simulated Values")
plt.xlabel("Seeing (arcsec)")
plt.legend(loc='upper left')
plt.show()
Let's now use this feature to input a correlation, and let's plan to use the sample
mode instead of the interpolate
mode.
! head data/Rsersic_magnitude.txt
PLANE_1-OBJECT_1-LIGHT_PROFILE_1-R_sersic-i PLANE_1-OBJECT_1-LIGHT_PROFILE_1-magnitude WEIGHT 0.56 19.25 0.63 1.34 17.26 0.89 1.29 17.65 0.73 2.10 17.02 0.01 0.83 18.33 0.60 2.61 16.00 0.10 0.73 18.84 0.88 0.77 18.79 0.10 0.69 14.78 0.09
Here we're using a distribution in the text file to draw the size of the galaxy measured in the $i$-band (PLANE_1-OBJECT_1-LIGHT_PROFILE_1-R_sersic-i
) with the magnitude measured in all bands (PLANE_1-OBJECT_1-LIGHT_PROFILE_1-magnitude
). We'll compare the raw distribution from the text file to the simulated metadata parameters:
df = pd.read_csv('data/Rsersic_magnitude.txt', delim_whitespace=True)
r_sersic = dataset.CONFIGURATION_1_metadata['PLANE_1-OBJECT_1-LIGHT_PROFILE_1-R_sersic-i'].values
magnitudes = dataset.CONFIGURATION_1_metadata['PLANE_1-OBJECT_1-LIGHT_PROFILE_1-magnitude-g'].values
fig, (ax1, ax2) = plt.subplots(1, 2)
ax1.scatter(df['PLANE_1-OBJECT_1-LIGHT_PROFILE_1-R_sersic-i'].values,
df['PLANE_1-OBJECT_1-LIGHT_PROFILE_1-magnitude'].values,
c=df['WEIGHT'].values)
ax1.set_xlabel("GALAXY_1 R_sersic i-band (arcsec)")
ax1.set_ylabel("GALAXY_1 magnitude g-band")
ax1.set_title("Raw Distribution")
ax2.hist2d(r_sersic, magnitudes,
bins=[np.linspace(r_sersic.min(), r_sersic.max(), 10),
np.linspace(magnitudes.min(), magnitudes.max(), 10)])
ax2.set_xlabel("GALAXY_1 R_sersic i-band (arcsec)")
ax2.set_ylabel("GALAXY_1 magnitude g-band")
ax2.set_title("Simulated Dataset")
fig.tight_layout()
fig.show()
Notice that in the left plot, points are assigned a color based on the weight in the text file, while in the right plot we have a histogram counting the number of simulated images with a particular parameter value combination.
Good question.
The column names are certianly scary to look at, but deeplenstronomy
has a functionality to help you out. Let's revisit the example of trying to correlate the $g$-band magnitude
of a galaxy with the $i$-band R_sersic
. Now that we've simualted a dataset, we can search
for USERDIST column names:
dataset.search('magnitude')
{'LENS': ['CONFIGURATION_1-PLANE_1-OBJECT_1-LIGHT_PROFILE_1-magnitude-g', 'CONFIGURATION_1-PLANE_1-OBJECT_1-LIGHT_PROFILE_1-magnitude-r', 'CONFIGURATION_1-PLANE_1-OBJECT_1-LIGHT_PROFILE_1-magnitude-i', 'CONFIGURATION_1-PLANE_1-OBJECT_1-LIGHT_PROFILE_1-magnitude-z', 'CONFIGURATION_1-PLANE_1-OBJECT_1-LIGHT_PROFILE_1-magnitude-Y', 'CONFIGURATION_1-PLANE_1-OBJECT_1-LIGHT_PROFILE_2-magnitude-g', 'CONFIGURATION_1-PLANE_1-OBJECT_1-LIGHT_PROFILE_2-magnitude-r', 'CONFIGURATION_1-PLANE_1-OBJECT_1-LIGHT_PROFILE_2-magnitude-i', 'CONFIGURATION_1-PLANE_1-OBJECT_1-LIGHT_PROFILE_2-magnitude-z', 'CONFIGURATION_1-PLANE_1-OBJECT_1-LIGHT_PROFILE_2-magnitude-Y'], 'SOURCE': ['CONFIGURATION_1-PLANE_2-OBJECT_1-LIGHT_PROFILE_1-magnitude-g', 'CONFIGURATION_1-PLANE_2-OBJECT_1-LIGHT_PROFILE_1-magnitude-r', 'CONFIGURATION_1-PLANE_2-OBJECT_1-LIGHT_PROFILE_1-magnitude-i', 'CONFIGURATION_1-PLANE_2-OBJECT_1-LIGHT_PROFILE_1-magnitude-z', 'CONFIGURATION_1-PLANE_2-OBJECT_1-LIGHT_PROFILE_1-magnitude-Y'], 'AGN': ['CONFIGURATION_1-PLANE_2-OBJECT_2-magnitude-g', 'CONFIGURATION_1-PLANE_2-OBJECT_2-magnitude-r', 'CONFIGURATION_1-PLANE_2-OBJECT_2-magnitude-i', 'CONFIGURATION_1-PLANE_2-OBJECT_2-magnitude-z', 'CONFIGURATION_1-PLANE_2-OBJECT_2-magnitude-Y']}
The dataset.search()
function returns all possible USERDIST column names containing the parameter of interest. The returned object is a dictionary where the keys are the object names in the SPECIES
section and the values are the possible USERDIST column names.
Looking at the output, and knowing you are concerned with the object named LENS
would point you directly to the column name CONFIGURATION_1-PLANE_1-OBJECT_1-LIGHT_PROFILE_1-magnitude-g
, where we could leave off the CONFIGURATION_1
part to apply the USERDIST to all configurations. In this case, there is only one CONFIGURATION, so the prefix doesn't matter.
We can repeat the process for R_sersic
:
dataset.search('R_sersic')
{'LENS': ['CONFIGURATION_1-PLANE_1-OBJECT_1-LIGHT_PROFILE_1-R_sersic-g', 'CONFIGURATION_1-PLANE_1-OBJECT_1-LIGHT_PROFILE_1-R_sersic-r', 'CONFIGURATION_1-PLANE_1-OBJECT_1-LIGHT_PROFILE_1-R_sersic-i', 'CONFIGURATION_1-PLANE_1-OBJECT_1-LIGHT_PROFILE_1-R_sersic-z', 'CONFIGURATION_1-PLANE_1-OBJECT_1-LIGHT_PROFILE_1-R_sersic-Y', 'CONFIGURATION_1-PLANE_1-OBJECT_1-LIGHT_PROFILE_2-R_sersic-g', 'CONFIGURATION_1-PLANE_1-OBJECT_1-LIGHT_PROFILE_2-R_sersic-r', 'CONFIGURATION_1-PLANE_1-OBJECT_1-LIGHT_PROFILE_2-R_sersic-i', 'CONFIGURATION_1-PLANE_1-OBJECT_1-LIGHT_PROFILE_2-R_sersic-z', 'CONFIGURATION_1-PLANE_1-OBJECT_1-LIGHT_PROFILE_2-R_sersic-Y'], 'SOURCE': ['CONFIGURATION_1-PLANE_2-OBJECT_1-LIGHT_PROFILE_1-R_sersic-g', 'CONFIGURATION_1-PLANE_2-OBJECT_1-LIGHT_PROFILE_1-R_sersic-r', 'CONFIGURATION_1-PLANE_2-OBJECT_1-LIGHT_PROFILE_1-R_sersic-i', 'CONFIGURATION_1-PLANE_2-OBJECT_1-LIGHT_PROFILE_1-R_sersic-z', 'CONFIGURATION_1-PLANE_2-OBJECT_1-LIGHT_PROFILE_1-R_sersic-Y']}