This Project was part of course designed by IBM
- Introduction
- Project Overview
- Technologies Used
- Data
- Exploratory Data Analysis (EDA) Using Python
- Exploratory Data Analysis (EDA) Using SQL
- Machine Learning Prediction
- Application Development
- Results
- Conclusion
- Getting Started
- Project Structure
- License
- Contact
Welcome to the SpaceX Reusable Rockets Prediction project! This project aims to predict the success of SpaceX's Falcon 9 rocket landings using machine learning techniques. By analyzing historical launch data, we identify key factors that influence landing success and build a predictive model to forecast future outcomes.
The project encompasses the following key components:
- Data Collection & Preprocessing: Gathering and cleaning data related to SpaceX launches.
- Exploratory Data Analysis (EDA):
- Using Python for data visualization and pattern recognition.
- Utilizing SQL for in-depth data querying and analysis.
- Machine Learning Prediction: Building and evaluating classification models to predict landing success.
- Application Development: Creating an interactive web application to showcase the model's predictions.
- Programming Languages: Python, SQL
- Libraries & Frameworks: Pandas, NumPy, Matplotlib, Seaborn, Scikit-Learn, Flask
- Databases: SQLite/MySQL (specify which one you used)
- Tools: Jupyter Notebook, Git, GitHub
- Deployment: Heroku (if applicable)
The dataset includes historical data of SpaceX launches, encompassing various attributes such as:
- Launch Site: The location from where the rocket was launched.
- Payload Mass: The mass of the payload carried by the rocket.
- Orbit Type: The type of orbit the payload was intended for.
- Launch Outcome: Success or failure of the mission.
- Landing Outcome: Success or failure of the rocket landing.
You can include a Data Dictionary or link to the data source if applicable.
In this section, we perform EDA using Python to uncover insights and patterns in the data.
- Data Cleaning: Handling missing values, correcting data types, and removing duplicates.
- Visualization:
- Distribution of payload mass.
- Success rate by launch site.
- Correlation between different features.
- Insights:
- Identifying the most successful launch sites.
- Understanding the impact of payload mass on launch and landing success.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Load data
data = pd.read_csv('spacex_launch_data.csv')
# Data Cleaning
data.dropna(inplace=True)
# *Visualization*
sns.countplot(x='Launch Site', hue='Landing Outcome', data=data)
plt.title('Landing Success by Launch Site')
plt.show()
# Plot a scatter point chart with x axis to be Pay Load Mass (kg) and y axis to be the launch site, and hue to be the class value
sns.catplot(x="LaunchSite", y="PayloadMass", hue="Class", data=df, aspect = 5)
plt.xlabel("Launch Site", size=20)
plt.ylabel("Payload Mass (kg)", size=20)
# A function to Extract years from the date
year=[]
def Extract_year(date):
for i in df["Date"]:
year.append(i.split("-")[0])
return year
Extract_year(df["Date"])
zipped = zip(df['Date'], df['Orbit'], df['Outcome'],df['Class'], year)
df1=pd.DataFrame(zipped, columns=['Date', 'Orbit', 'Outcome', 'Class', 'Year'])
df1- Created dummy variables to categorical columns
- Cast all numeric columns to float64
- Overview
- Understand the Spacex DataSet
- Load the dataset into the corresponding table in a Db2 database
- Execute SQL queries to answer assignment questions
- Display the names of the unique launch sites in the space mission
- Display 5 records where launch sites begin with the string 'CCA'
- Display the total payload mass carried by boosters launched by NASA (CRS)
- Display average payload mass carried by booster version F9 v1.1
- List the date when the first successful landing outcome in ground pad was acheived.
- List the names of the boosters which have success in drone ship and have payload mass greater than 4000 but less than 6000
- List the total number of successful and failure mission outcomes
SELECT MISSION_OUTCOME, COUNT(MISSION_OUTCOME) AS TOTAL_NUMBER
FROM SPACEXTBL
GROUP BY MISSION_OUTCOME;- List the names of the booster_versions which have carried the maximum payload mass. Use a subquery
SELECT DISTINCT BOOSTER_VERSION
FROM SPACEXTBL
WHERE PAYLOAD_MASS__KG_ = (
SELECT MAX(PAYLOAD_MASS__KG_)
FROM SPACEXTBL);- List the failed landing_outcomes in drone ship, their booster versions, and launch site names for in year 2015
SELECT LANDING__OUTCOME, BOOSTER_VERSION, LAUNCH_SITE
FROM SPACEXTBL
WHERE Landing__Outcome = 'Failure (drone ship)'
AND YEAR(DATE) = 2015;- Rank the count of landing outcomes (such as Failure (drone ship) or Success (ground pad)) between the date 2010-06-04 and 2017-03-20, in descending order
SELECT LANDING__OUTCOME, COUNT(LANDING__OUTCOME) AS TOTAL_NUMBER
FROM SPACEXTBL
WHERE DATE BETWEEN '2010-06-04' AND '2017-03-20'
GROUP BY LANDING__OUTCOME
ORDER BY TOTAL_NUMBER DESC- Performing exploratory Data Analysis and determine Training Labels
- creating a column for the class
- Standardizing the data
- Spliting into train data and test data
- Finding best Hyperparameter for SVM, Classification Trees and Logistic Regression
- Finding the method performs best using test data
- Creating a NumPy array from the column Class in data, by applying the method to_numpy() then assign it to the variable Y,make sure the output is a Pandas series (only one bracket df['name of column']).
- Standardizing the data in X then reassign it to the variable X using the transform provided below.
- Using the function train_test_split to split the data X and Y into training and test data. Set the parameter test_size to 0.2 and random_state to 2. The training data and test data should be assigned to the following labels.
- Creating a logistic regression object then create a GridSearchCV object logreg_cv with cv = 10. Fit the object to find the best parameters from the dictionary parameters.
- Calculating the accuracy on the test data using the method score:
- Creating a support vector machine object then create a GridSearchCV object svm_cv with cv - 10. Fit the object to find the best parameters from the dictionary parameters.
- Calculating the accuracy on the test data using the method score:
- Creating a decision tree classifier object then create a GridSearchCV object tree_cv with cv = 10. Fit the object to find the best parameters from the dictionary parameters.
- Calculating the accuracy of tree_cv on the test data using the method score
- Creatoing a k nearest neighbors object then create a GridSearchCV object knn_cv with cv = 10. Fit the object to find the best parameters from the dictionary parameters.
- Calculatoing the accuracy of knn_cv on the test data using the method score:
- Finding the method performs best:
# Assuming that we have already fitted the logreg_cv, svm_cv, tree_cv, and knn_cv models using GridSearchCV
# Also, X_test, Y_test are defined
# Calculating accuracy on the test data for each model
test_accuracy_logreg = logreg_cv.score(X_test, Y_test)
test_accuracy_svm = svm_cv.score(X_test, Y_test)
test_accuracy_tree = tree_cv.score(X_test, Y_test)
test_accuracy_knn = knn_cv.score(X_test, Y_test)
# Printing the accuracy scores for each model
print("Accuracy on Test Data (Logistic Regression): ", test_accuracy_logreg)
print("Accuracy on Test Data (SVM): ", test_accuracy_svm)
print("Accuracy on Test Data (Decision Tree): ", test_accuracy_tree)
print("Accuracy on Test Data (KNN): ", test_accuracy_knn)
# Finding the method that performs best
best_method = max(test_accuracy_logreg, test_accuracy_svm, test_accuracy_tree, test_accuracy_knn)
# Printing the best method
print("\nBest Performing Method:")
if best_method == test_accuracy_logreg:
print("Logistic Regression")
elif best_method == test_accuracy_svm:
print("Support Vector Machine (SVM)")
elif best_method == test_accuracy_tree:
print("Decision Tree")
elif best_method == test_accuracy_knn:
print("K-Nearest Neighbors (KNN)")- Import required libraries
- Read the airline data into pandas dataframe
- Creating a dash application
- Creating an app layout
- TASK 1: Add a dropdown list to enable Launch Site selection
- The default select value is for ALL sites
- dcc.Dropdown(id='site-dropdown',...)
- TASK 2: Add a pie chart to show the total successful launches count for all sites
- Add a slider to select payload range
- Add a scatter chart to show the correlation between payload and launch success
- Add a callback function for
site-dropdownas input,success-pie-chartas output - Add a callback function for
site-dropdownandpayload-slideras inputs,success-payload-scatter-chartas output - Run the app
- Model Accuracy: Achieved an accuracy of 88% with the Random Forest classifier.
- Key Insights:
- Higher payload mass tends to decrease the probability of a successful landing.
- Certain launch sites have higher success rates, indicating better infrastructure or operational efficiency.
- Application Impact: The web app provides a user-friendly interface for stakeholders to predict landing success, aiding in decision-making processes.
This project successfully demonstrates the application of data science techniques to predict the success of SpaceX's reusable rockets. Through comprehensive data analysis and machine learning modeling, we identified key factors influencing landing outcomes and developed an interactive tool to leverage these insights. This work not only reinforces foundational data science skills but also contributes to the ongoing advancements in aerospace technology.
- Python 3.8+
- Pip
- Virtual Environment (optional but recommended)
- SQL Database (SQLite/MySQL)
- Clone the Repository
git clone https://github.com/yourusername/spacex-reusable-rockets-prediction.git
cd spacex-reusable-rockets-prediction- Create and Activate Virtual Environment
python3 -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate- Install Dependencies
pip install -r requirements.txt- Set Up the Database
Import the SQL schema and data.
sqlite3 spacex.db < schema.sqlAdjust the commands based on your chosen database
spacex-reusable-rockets-prediction/
│
├── assets/
│ └── app_screenshot.png
│
├── data/
│ ├── spacex_launch_data.csv
│ └── schema.sql
│
├── notebooks/
│ ├── EDA_Python.ipynb
│ └── EDA_SQL.ipynb
│
├── src/
│ ├── data_preprocessing.py
│ ├── model_training.py
│ └── app.py
│
├── requirements.txt
├── README.md
└── LICENSEThis project is licensed under the MIT License. See the LICENSE file for details.
Mohammed Ammaruddin md.ammaruddin2020@gmail.com https://www.linkedin.com/in/m-ammaruddin/