Whether life exists beyond Earth is one of the most profound questions of all time. The answer will change us forever, whether it reveals a universe rich with life, one in which life is rare and fragile, or even a universe in which we can find no other life at all.
Source: https://exoplanets.nasa.gov/what-is-an-exoplanet/overview/
For this project, I set out to see whether I could train a Machine Learning model to predict whether an astronomical object might be an exoplanet.
Image credit: NASA/JPL-Caltech/Lizbeth B. De La Torre
Exoplanets are planets that orbit a star other than our Sun, beyond our solar system. These bodies are very hard to see directly with telescopes because they are hidden by the bright glare of the stars they orbit. In 2009, NASA launched the Kepler spacecraft to search for exoplanets. Kepler looked for planets in a wide range of sizes and orbits, circling stars of varied size and temperature. [2] The Kepler spacecraft detected exoplanets using the transit method: as a planet passes (transits) in front of its star, it blocks out a bit of the star’s light. By observing the change in the star’s brightness, astronomers can work out the size of the orbiting planet as well as how far it is from the star. The Kepler Objects of Interest (KOI) dataset contains various transit measurements, in addition to many others that aid in identifying exoplanets. This data in turn aids in calculating the planet’s temperature and the chances that it may contain liquid water, the stuff of life…
Source: https://spaceplace.nasa.gov/all-about-exoplanets/en/
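The transit reasoning above can be made concrete with two textbook relations. The sketch below is illustrative only and is not taken from the project: it assumes the fractional dip in brightness equals the squared ratio of planet radius to star radius, and that the orbital distance follows from the time between successive transits via Kepler's third law. The function names and constants are hypothetical helpers chosen for this example.

```python
import math

R_SUN_KM = 695_700.0  # nominal solar radius in km (assumed constant for this sketch)


def planet_radius_km(transit_depth, star_radius_km=R_SUN_KM):
    """Estimate planet radius from the fractional dip in brightness.

    Uses depth = (R_planet / R_star)^2, so R_planet = R_star * sqrt(depth).
    """
    if not 0.0 < transit_depth < 1.0:
        raise ValueError("transit_depth must be a fraction between 0 and 1")
    return star_radius_km * math.sqrt(transit_depth)


def orbit_semi_major_axis_au(period_days, star_mass_solar=1.0):
    """Estimate orbital distance from the period between transits.

    Kepler's third law in solar units: a^3 = M * P^2
    (a in AU, P in years, M in solar masses, planet mass neglected).
    """
    period_years = period_days / 365.25
    return (star_mass_solar * period_years ** 2) ** (1.0 / 3.0)


# A 1% dip in the light of a Sun-like star suggests a planet about a tenth
# of the star's radius, which is close to Jupiter's size.
print(f"radius: {planet_radius_km(0.01):,.0f} km")
print(f"distance: {orbit_semi_major_axis_au(365.25):.2f} AU")
```

Both relations ignore effects such as limb darkening and grazing transits, so they give rough estimates rather than catalog-quality values.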
The NASA Exoplanet Archive is an online astronomical exoplanet and stellar catalog and data service that collates and cross-correlates astronomical data and information on exoplanets and their host stars. Because astronomers use this data to decide whether an object is an exoplanet, it seems likely that a Machine Learning model can be trained to make predictions from the same data.
A baseline model was established by training a Random Forest Classifier with 1,000 estimators, a number chosen arbitrarily rather than tuned. The function containing the model also applied a StandardScaler to standardize the data prior to model fitting. This resulted in an accuracy score of 86%.
Measure | Precision | Recall | F1 score | Support |
---|---|---|---|---|
Candidate | 0.78 | 0.66 | 0.71 | 601 |
Confirmed | 0.86 | 0.87 | 0.87 | 600 |
False Positive | 0.89 | 0.96 | 0.93 | 1190 |
Accuracy | | | 0.86 | 2391 |
Macro avg | 0.85 | 0.83 | 0.84 | 2391 |
Weighted avg | 0.86 | 0.86 | 0.86 | 2391 |
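A report of this kind could come from a baseline along the following lines. This is a minimal sketch and not the project's actual code. It assumes scikit-learn, wraps the scaler and the forest in a pipeline, and uses synthetic three-class data from `make_classification` as a stand-in for the KOI features. The helper name `build_baseline`, the split ratio, and the random seeds are all assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def build_baseline(n_estimators=1000, random_state=42):
    """Standardize the features, then fit a Random Forest (hypothetical helper)."""
    return make_pipeline(
        StandardScaler(),
        RandomForestClassifier(n_estimators=n_estimators, random_state=random_state),
    )


# Synthetic stand-in data: three classes, loosely mirroring the three dispositions.
X, y = make_classification(
    n_samples=600, n_features=20, n_informative=8, n_classes=3, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Fewer trees than the 1,000 used in the write-up, only to keep the demo quick.
model = build_baseline(n_estimators=100)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

print(f"accuracy: {accuracy_score(y_test, y_pred):.2f}")
print(classification_report(y_test, y_pred))
```

Putting the scaler inside the pipeline means it is fitted on the training split only, which avoids leaking test-set statistics into the model.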

Actual | Predicted Candidate | Predicted Confirmed | Predicted False Positive |
---|---|---|---|
Candidate | 1104 | 289 | 372 |
Confirmed | 198 | 1517 | 43 |
False Positive | 163 | 7 | 3480 |
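Per-class precision, recall, and F1 can be derived directly from a confusion matrix like the one above. The sketch below does this in plain Python with the counts copied from the table; the function names are hypothetical. This matrix covers 7,173 observations while the classification report lists a support of 2,391, so the derived figures need not match the report exactly.

```python
labels = ["Candidate", "Confirmed", "False Positive"]

# Rows are actual classes, columns are predicted classes (counts from the table).
matrix = [
    [1104, 289, 372],
    [198, 1517, 43],
    [163, 7, 3480],
]


def per_class_metrics(cm):
    """Return a (precision, recall, f1) tuple for each class in the matrix."""
    n = len(cm)
    results = []
    for k in range(n):
        true_positive = cm[k][k]
        predicted_k = sum(cm[i][k] for i in range(n))  # column total
        actual_k = sum(cm[k])                          # row total
        precision = true_positive / predicted_k if predicted_k else 0.0
        recall = true_positive / actual_k if actual_k else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        results.append((precision, recall, f1))
    return results


def accuracy(cm):
    """Fraction of observations on the diagonal of the matrix."""
    correct = sum(cm[k][k] for k in range(len(cm)))
    total = sum(sum(row) for row in cm)
    return correct / total


for name, (p, r, f1) in zip(labels, per_class_metrics(matrix)):
    print(f"{name:15s} precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
print(f"accuracy={accuracy(matrix):.2f}")
```

Reading the matrix this way shows that most of the errors involve the Candidate class, which is confused with both of the other dispositions.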
This was followed by a randomized hyperparameter search using RandomizedSearchCV. Interestingly, this did not result in an improvement in accuracy.
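For readers unfamiliar with this step, a randomized search over a scaler-plus-forest pipeline might look like the sketch below. The write-up does not state which hyperparameters or ranges were searched, so the search space, the iteration count, the fold count, and the synthetic data are all assumptions made for illustration.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the KOI features (assumption, not the real data).
X, y = make_classification(
    n_samples=300, n_features=12, n_informative=6, n_classes=3, random_state=0
)

pipeline = make_pipeline(StandardScaler(), RandomForestClassifier(random_state=0))

# Hypothetical search space; keys use the step name that make_pipeline generates.
param_distributions = {
    "randomforestclassifier__n_estimators": randint(50, 200),
    "randomforestclassifier__max_depth": [None, 5, 10, 20],
    "randomforestclassifier__min_samples_leaf": randint(1, 10),
    "randomforestclassifier__max_features": ["sqrt", "log2"],
}

search = RandomizedSearchCV(
    pipeline,
    param_distributions=param_distributions,
    n_iter=5,            # number of sampled configurations
    cv=3,                # 3-fold cross-validation per configuration
    scoring="accuracy",
    random_state=0,
)
search.fit(X, y)

print(search.best_params_)
print(f"best cross-validated accuracy: {search.best_score_:.2f}")
```

A search like this samples only a few configurations, so failing to beat the baseline does not by itself show that the baseline settings are optimal.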
The eight categorical variables that were dropped were either informational (URL paths, transit model used) or predominantly a single value. As an example, “fittype” took one of three values, such as Least Mean Squares, and roughly 90% of the entries in this column were the same value.
A StandardScaler was employed rather than a min/max scaler (normalization). Standardization does not bound the variables to a specific range, and some of the variables had an extreme range. As one example, the minimum value for koi_period was 0.24, and the maximum value was 129,995. I examined these extreme values and decided they were valid for predicting false positives, so I did not want to exclude them. Standardization is also less affected by outliers than min/max scaling.
I had originally hoped to train a model on subsets of the data based on categories, but ran out of time. I feel this is still worth pursuing: if one of the subsets results in better accuracy than the entire dataset, less observational data will be required to predict object disposition.
Lastly, other models were evaluated (SVM, AdaBoost, XGBoost, TPOT), but the original Random Forest Classifier still produced the best results.
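The difference between the two scaling approaches discussed above can be seen on a small example. In the sketch below, only the smallest and largest values (0.24 and 129,995) come from the write-up; the values in between are made up for illustration. The helpers are plain-Python stand-ins for what StandardScaler and a min/max scaler compute, not the project's code.

```python
from statistics import fmean, pstdev

# First and last values are the koi_period extremes quoted in the text;
# the intermediate values are hypothetical.
periods = [0.24, 2.5, 9.8, 45.0, 365.0, 129_995.0]


def standardize(values):
    """Subtract the mean and divide by the population standard deviation."""
    mean = fmean(values)
    std = pstdev(values)
    return [(v - mean) / std for v in values]


def min_max(values):
    """Rescale linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


for raw, z, m in zip(periods, standardize(periods), min_max(periods)):
    print(f"{raw:>12,.2f}  standardized={z:+.3f}  min_max={m:.6f}")
```

Min/max scaling pins the result to the interval from 0 to 1, so a single extreme value defines the whole scale. Standardized values are unbounded and are centered on zero with unit variance, although an extreme value still shifts the mean and inflates the standard deviation.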
While 86% accuracy is not exemplary, it was achieved with a simple baseline model. I am confident greater accuracy is possible, either by further tuning the hyperparameters of the Random Forest Classifier or by training on a larger dataset, if one becomes available. Machine Learning algorithms can process data far faster than humans can, so deploying a model that rapidly classifies observations of the vast number of astronomical objects in the observable universe would free up astronomers for more creative endeavors, an area where humans still dominate.
References
[1] NASA Exoplanet Archive – NASA Exoplanet Science Institute https://exoplanetarchive.ipac.caltech.edu/index.html
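[2] What Is an Exoplanet? | NASA Space Place – NASA Science for Kids https://spaceplace.nasa.gov/all-about-exoplanets/en/
[3] Overview | What is an Exoplanet? – Exoplanet Exploration: Planets Beyond our Solar System https://exoplanets.nasa.gov/what-is-an-exoplanet/overview/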