So, what can we do with this data?

Figure 3.

Simplified diagram showing exoplanet transiting in front of the host star and its effect on the star’s apparent brightness over time.

Credit: NASA.

Transit data primarily consists of a transit “light curve” displaying stellar brightness over time.

However, this data must be heavily reduced and cleaned up before any science can be done.

This is where data science comes in to save the day by taking the raw observational data and finding the signals hidden within the noise.

In my project, the light curves used were measured with the Kepler Space Telescope and extracted from the raw data archives on MAST (Mikulski Archive for Space Telescopes).

These look something like Figure 4 before any analysis or reduction has been done, but this is not even close to the ideal flat line with a dip seen in Figure 3.

Figure 4.

Example raw light curve for exoplanet Kepler 1656b showing high noise and variability before reduction.

Credit: Kepler Space Telescope; MAST; M. MacDougall; E. Petigura; Brady et al. 2018.

Before attempting to extract any scientific measurements from this light curve, we normalize the data and mask out bad data points that recorded non-real, null, or infinite brightness values.
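As a rough illustration of this first step, here is a minimal sketch using NumPy. The function name `normalize_and_mask` and the sample values are made up for the example, and normalizing by the median is an assumption; the actual pipeline may define "bad" points and the normalization differently.

```python
import numpy as np

def normalize_and_mask(time, flux):
    """Drop non-finite samples, then scale the flux so its median is 1."""
    time = np.asarray(time, dtype=float)
    flux = np.asarray(flux, dtype=float)
    # Keep only samples where both time and flux are finite (no NaN or inf).
    good = np.isfinite(time) & np.isfinite(flux)
    time, flux = time[good], flux[good]
    # Assumption: dividing by the median puts the baseline near 1.0.
    return time, flux / np.median(flux)

# Hypothetical raw samples containing one NaN and one infinite value.
time = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 2.5])
flux = np.array([1000.0, np.nan, 1010.0, np.inf, 990.0, 1000.0])
clean_time, clean_flux = normalize_and_mask(time, flux)
```

The median is used here rather than the mean because it is less sensitive to the transit dips and to any outliers that are still present at this stage.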

We then detrend our data with a Savitzky-Golay filter, removing the low-frequency background trend responsible for the strong flux variations.

The filter works by convolution: it slides over successive subsets of adjacent points, fits each subset with a low-degree polynomial by linear least squares, and takes the fitted curve as the smooth trend, which we then remove to flatten the data.
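To make the mechanics concrete, here is a small NumPy-only sketch of a Savitzky-Golay style trend estimate. It is written out by hand purely to show the least-squares-plus-convolution idea; in practice a library routine such as `scipy.signal.savgol_filter` would normally be used. The function name `savgol_trend`, the window sizes, the reflected edge handling, and the choice to divide (rather than subtract) the trend are all assumptions for the example.

```python
import numpy as np

def savgol_trend(flux, window_length=101, polyorder=2):
    """Smooth trend from a sliding least-squares polynomial fit (odd window_length)."""
    flux = np.asarray(flux, dtype=float)
    half = window_length // 2
    # Sample offsets relative to the centre of the window.
    offsets = np.arange(-half, half + 1, dtype=float)
    # Polynomial basis 1, x, x^2, ... evaluated at each offset.
    design = np.vander(offsets, polyorder + 1, increasing=True)
    # Row 0 of the pseudo-inverse turns the window samples into the fitted
    # value at the window centre, so the whole fit is one weighted sum.
    weights = np.linalg.pinv(design)[0]
    # Reflect the ends so every sample has a full window around it.
    padded = np.pad(flux, half, mode="reflect")
    # Reversing the weights makes np.convolve act as a sliding weighted sum.
    return np.convolve(padded, weights[::-1], mode="valid")

# Synthetic example: a flat star with slow 5% variability and no transits.
t = np.linspace(0.0, 10.0, 501)
raw = 1.0 + 0.05 * np.sin(2 * np.pi * t / 10.0)
trend = savgol_trend(raw, window_length=51, polyorder=2)
flat = raw / trend  # assumption: remove the trend by division
```

One design point worth keeping in mind is that the window should be much longer than a transit, otherwise the filter will treat the transit itself as part of the trend and partly erase it.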

The final step in the reduction process is removing significant outliers that might substantially skew statistical modeling results.

This is done through both a 6σ (six-sigma) cutoff and manual clipping of high-variance regions of the post-reduction data.
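The automatic part of this step can be sketched as a simple mask. The function name `sigma_clip_mask` is hypothetical, and estimating the scatter from the median absolute deviation (MAD) is an assumption; the pipeline described here may use a plain standard deviation or an iterative clip instead.

```python
import numpy as np

def sigma_clip_mask(flux, n_sigma=6.0):
    """Boolean mask that is True for points within n_sigma of the median."""
    flux = np.asarray(flux, dtype=float)
    center = np.median(flux)
    # Robust scatter estimate: 1.4826 * MAD matches the standard deviation
    # for Gaussian noise but is not inflated by the outliers themselves.
    sigma = 1.4826 * np.median(np.abs(flux - center))
    return np.abs(flux - center) <= n_sigma * sigma

# Synthetic flattened light curve with one injected outlier.
rng = np.random.default_rng(0)
flux = 1.0 + 1e-3 * rng.standard_normal(1000)
flux[100] = 1.05
keep = sigma_clip_mask(flux)
clipped_flux = flux[keep]
```

Note that a symmetric cut like this can also throw away a deep transit, so in-transit points may need to be protected before clipping. Whether and how the original pipeline does that is not stated here.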

The pipeline I built to process the raw data took about a month to finalize, and even then, I have only verified it on one exoplanet light curve.

In the end, we have a flattened, normalized, higher signal-to-noise light curve with clearly defined transits occurring periodically (Figure 5).

Figure 5.

Example of reduced light curve for Kepler 1656b showing normalized flux with periodic dimming.

Credit: M. MacDougall; E. Petigura; Brady et al. 2018.

From the data contained in this light curve, the best-constrained property of the planet that can be derived is an estimate of the planet’s radius based on how much stellar light is blocked during transit.
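The link between blocked light and planet size is, to first order, purely geometric: the fractional transit depth is roughly the ratio of the planet's disk area to the star's, so depth ≈ (Rp / Rs)². The sketch below inverts that relation. It ignores limb darkening and grazing transits, the function name is hypothetical, and the example numbers are illustrative rather than measured values for Kepler 1656b.

```python
import math

# Approximate ratio of the Sun's radius to the Earth's radius.
R_SUN_IN_EARTH_RADII = 109.2

def planet_radius_from_depth(depth, stellar_radius_rsun):
    """Planet radius in Earth radii from fractional depth and stellar radius in solar radii."""
    # depth ~ (Rp / Rs)^2, so Rp ~ Rs * sqrt(depth).
    return math.sqrt(depth) * stellar_radius_rsun * R_SUN_IN_EARTH_RADII

# Illustrative case: a 100 parts-per-million dip around a Sun-sized star.
radius_earth = planet_radius_from_depth(1e-4, 1.0)
```

A dip of only 100 parts per million around a Sun-like star corresponds to a roughly Earth-sized planet, which is why the reduction steps above matter so much.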

We can also determine the orbital period of a transiting planet by noting how far apart in time transits occur.

To ensure that all potential transits are actually related events occurring at a fixed orbital period, we must phase-fold our data around the transit midpoint and see if all transit candidates stacked on top of one another have the same shape and depth (Figure 6).

Figure 6.

Example of the phase-folded light curve for Kepler 1656b, with a period of 31.578659 days.

Credit: M. MacDougall; E. Petigura; Brady et al. 2018.
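Phase-folding itself is only a couple of lines. Here is a minimal sketch, assuming times in days; the function name `phase_fold` and the transit midpoint `t0 = 5.0` are placeholders for the example, while the period is the one quoted in Figure 6.

```python
import numpy as np

def phase_fold(time, period, t0):
    """Time relative to the nearest transit midpoint, in the range [-period/2, period/2)."""
    time = np.asarray(time, dtype=float)
    # Shift by half a period before wrapping so the transit lands at phase 0.
    return (time - t0 + 0.5 * period) % period - 0.5 * period

period = 31.578659   # days, from Figure 6
t0 = 5.0             # hypothetical time of first transit midpoint, in days
time = np.arange(0.0, 200.0, 0.5)
phase = phase_fold(time, period, t0)

# Sorting by phase stacks every transit on top of the others for plotting.
order = np.argsort(phase)
folded_time = phase[order]
```

If the assumed period is even slightly off, successive transits drift apart in the folded plot instead of stacking, which is what makes this a useful check.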

Once a period and radius have been estimated, we can approximate a variety of other orbital parameters including inclination, semi-major axis (distance between planet and star), and eccentricity (how circular or elongated the orbit is).

However, there is no simple equation that we can plug values into and suddenly know everything about the planetary system.

Although the physics of orbital dynamics is well understood, many of the parameters involved are degenerate with one another, so a wide range of parameter combinations could produce nearly identical light curves.

In order to home in on the parameter values that produce a model which best fits our data, we must set up a statistical modeling program to optimize the fit.

The modeling software used in this process is a Python package known as BATMAN (Bad-Ass Transit Model cAlculatioN — this is the actual name) which takes in various orbital parameters and produces an idealized light curve model based on the inputs.

Thus, given a guess of the optimal parameters, we can produce an accompanying model and compare it to the actual data to assess the fit.

This assessment is made using a chi-squared statistic, which sums the squared, uncertainty-weighted differences between each observed data point and the corresponding modeled data point.

The better the fit, the lower the total chi-squared value for a model derived from a particular set of guessed parameters.

Figure 7.

Example of attempting to fit a BATMAN model to the phase-folded light curve of Kepler 1656b.

Credit: M. MacDougall; E. Petigura; Brady et al. 2018.
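The chi-squared comparison described above can be sketched in a few lines. To keep the example self-contained, a toy box-shaped dip (`box_transit`) stands in for the BATMAN model; it is not how BATMAN computes a light curve, and every number below is invented for illustration.

```python
import numpy as np

def chi_squared(observed, model, sigma):
    """Sum of squared residuals, each scaled by its measurement uncertainty."""
    observed = np.asarray(observed, dtype=float)
    model = np.asarray(model, dtype=float)
    return float(np.sum(((observed - model) / sigma) ** 2))

def box_transit(phase, depth, duration):
    """Toy stand-in for a transit model: flux is 1 outside transit, 1 - depth inside."""
    phase = np.asarray(phase, dtype=float)
    return np.where(np.abs(phase) < 0.5 * duration, 1.0 - depth, 1.0)

# Synthetic phase-folded data with a known depth and Gaussian noise.
rng = np.random.default_rng(1)
phase = np.linspace(-0.5, 0.5, 201)   # days from mid-transit
sigma = 2e-4                          # assumed per-point flux uncertainty
data = box_transit(phase, 1e-3, 0.125) + sigma * rng.standard_normal(phase.size)

# A guess close to the truth should score lower than a poor guess.
good_fit = chi_squared(data, box_transit(phase, 1e-3, 0.125), sigma)
poor_fit = chi_squared(data, box_transit(phase, 5e-4, 0.125), sigma)
```

For a good fit with realistic uncertainties, the total chi-squared should come out close to the number of data points, which gives a quick sanity check on both the model and the assumed noise level.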

The catch here is that the fastest way to find the best fit model is to already have a pretty good idea of what the best parameter estimates are.

This problem is made even worse by the fact that the Kepler Space Telescope long cadence data is measured only every 30 minutes, so for a transit that occurs over the course of about 3 hours we have at most 6 data points per transit.

With such low resolution, it’s difficult to precisely determine the shape of the transit, including its depth, flattening, steepness, and symmetry, all of which are used to infer the orbital parameters.

It is much easier to get a rough estimate by modeling the phase-folded light curve as seen in Figure 7, but ultimately we must rely on statistical modeling to be able to properly model the entire unfolded light curve at each time stamp in the original data.

Luckily, this particular system was already studied in depth by Brady et al. 2018, where the planet was found to have an eccentricity of roughly 0.84 among other precise estimates.

We take this information (planet radius, distance from the host star, period, inclination, eccentricity, position along an orbit, and time of first transit) as the initial condition to an optimization tool known as a Markov Chain Monte Carlo (MCMC).

An MCMC is designed to take as its input a set of guessed parameter values, create a model from these values, and use chi-squared testing to compare the model to the data.

Once this process has completed, the program creates a new set of guessed parameters that are slightly perturbed from the initial set.

Then, given the chi-squared value of the last fit, the program weighs whether or not it is valuable to take the proposed step to this new set of parameters or stay where it is.

We run this for 10^6 steps from 20 different initial guesses (walkers) simultaneously, with each step slowly getting closer to the optimal fit.
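The propose, compare, and accept-or-stay loop can be illustrated with a bare-bones random-walk Metropolis sampler. This is a deliberately simplified sketch and not the code used in this project: it follows a single walker instead of 20, fits a single parameter (the transit depth), assumes a flat prior, and uses a toy box-shaped dip in place of a BATMAN model. All names and numbers are invented for the example.

```python
import numpy as np

def box_transit(phase, depth, duration=0.125):
    """Toy stand-in for a transit model (not BATMAN): a flat-bottomed dip."""
    return np.where(np.abs(phase) < 0.5 * duration, 1.0 - depth, 1.0)

def chi_squared(data, model, sigma):
    return float(np.sum(((data - model) / sigma) ** 2))

def metropolis(data, phase, sigma, start_depth, n_steps, step_size, rng):
    """Random-walk Metropolis sampler over one parameter, the transit depth."""
    current = start_depth
    current_chi2 = chi_squared(data, box_transit(phase, current), sigma)
    chain = np.empty(n_steps)
    for i in range(n_steps):
        # Propose a value slightly perturbed from the current one.
        proposal = current + step_size * rng.standard_normal()
        proposal_chi2 = chi_squared(data, box_transit(phase, proposal), sigma)
        # A better fit is always accepted; a worse fit is accepted with
        # probability exp(-delta_chi2 / 2), the Gaussian likelihood ratio.
        if np.log(rng.random()) < -0.5 * (proposal_chi2 - current_chi2):
            current, current_chi2 = proposal, proposal_chi2
        # Record the current position whether or not the step was taken.
        chain[i] = current
    return chain

rng = np.random.default_rng(42)
phase = np.linspace(-0.5, 0.5, 401)
sigma = 2e-4
true_depth = 1e-3
data = box_transit(phase, true_depth) + sigma * rng.standard_normal(phase.size)

chain = metropolis(data, phase, sigma, start_depth=5e-4,
                   n_steps=5000, step_size=5e-5, rng=rng)
posterior = chain[1000:]              # discard the early burn-in steps
depth_estimate = posterior.mean()
depth_uncertainty = posterior.std()
```

Because worse fits are sometimes accepted, the walker does not simply slide downhill to the best fit. It wanders around it, and the spread of the positions it visits is what provides the uncertainty on each parameter.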

We would ideally end up with a Gaussian distribution of our walkers’ final parameters, but so far we have not been able to achieve this convergence, unfortunately.

This is likely a matter of poorly constrained prior assumptions, bad initial guesses, or too few steps — all of which we are still looking into.

Nevertheless, we still have fairly well-constrained estimates of all of the orbital parameters in this system which we can use to model the light curve data to a substantial level of accuracy (Figure 8).

Figure 8.

Example plotting a best-fit BATMAN model with the reduced light curve of Kepler 1656b demonstrating strong agreement.

Credit: M. MacDougall; E. Petigura; Brady et al. 2018.

Although our optimization method has not yet been finalized, we have made significant progress towards our goal of automatically identifying the best-fit light curve model for a given planetary system using BATMAN.

We will continue to work on getting our MCMC up and running well while also looking into other statistical techniques that might better assess how good a fit is.

We aim to test our program on a larger sample of Kepler planet candidates with previously estimated orbital parameters to improve our ability to properly model these already well-studied light curves.

Once our technique has demonstrated consistently high accuracy, we will begin to use it to model new TESS candidates in an attempt to better understand the orbital characteristics of the population of planets being observed.

With such knowledge, we may be able to gain new insight into the habitability of distant worlds and the likelihood of finding life beyond Earth.
