Cryptocurrencies coins by Worldspectrum | Free License
This project analyzes which cryptocurrencies are available on the trading market and how they can be grouped into classes. To do this, I used unsupervised learning and Amazon SageMaker, clustering the cryptocurrencies and creating plots to present the results.
Main Tasks:
- Data Preprocessing: Prepare data for dimension reduction with PCA and clustering using K-Means.
- Reducing Data Dimensions Using PCA: Reduce the data dimensions using the `PCA` algorithm from `sklearn`.
- Clustering Cryptocurrencies Using K-Means: Predict clusters for the cryptocurrency data using the `KMeans` algorithm from `sklearn`.
- Visualizing Results: Create plots and data tables to present the results.
- Optional Challenge: Deploy the notebook to Amazon SageMaker.
Load the information about cryptocurrencies and perform data preprocessing tasks.
- Using the CSV file, created a `Path` object and read the file data directly into a DataFrame named `crypto_df` using `pd.read_csv()`.
- Using the `requests` library, retrieve the necessary data from the following CryptoCompare API endpoint: https://min-api.cryptocompare.com/data/all/coinlist
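The two loading options above can be sketched as follows. This is a minimal sketch, not the notebook's actual code: the function names are hypothetical, reading the first CSV column as the index is an assumption, and so is the layout of the API response (coins stored under a `"Data"` key, keyed by symbol).

```python
from pathlib import Path

import pandas as pd

COINLIST_URL = "https://min-api.cryptocompare.com/data/all/coinlist"


def load_from_csv(csv_path: Path) -> pd.DataFrame:
    """Read the CSV file into a DataFrame.

    Assumption: the first column of the file holds the coin symbols
    and should become the index.
    """
    return pd.read_csv(csv_path, index_col=0)


def coinlist_to_frame(payload: dict) -> pd.DataFrame:
    """Turn a coin-list style JSON payload into one row per coin.

    Assumption: the coins sit under a "Data" key as {symbol: {field: value}}.
    """
    return pd.DataFrame.from_dict(payload["Data"], orient="index")


def load_from_api(url: str = COINLIST_URL) -> pd.DataFrame:
    """Download the coin list and convert it to a DataFrame."""
    import requests  # imported here so the helpers above also work offline

    response = requests.get(url, timeout=30)
    response.raise_for_status()  # fail loudly on HTTP errors
    return coinlist_to_frame(response.json())
```

Either `load_from_csv(Path("some_file.csv"))` or `load_from_api()` would then produce the `crypto_df` DataFrame used in the following steps; the file name here is only a placeholder.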
With the data loaded into a Pandas DataFrame, continue with the following data preprocessing tasks.
- Keep only the necessary columns: 'CoinName', 'Algorithm', 'IsTrading', 'ProofType', 'TotalCoinsMined', 'TotalCoinSupply'.
- Keep only the cryptocurrencies that are trading.
- Keep only the cryptocurrencies with a working algorithm.
- Remove the `IsTrading` column.
- Remove all cryptocurrencies with at least one null value.
- Remove all cryptocurrencies that have no coins mined.
- Drop all rows that contain 'N/A' text values.
- Store the names of all cryptocurrencies in a DataFrame named `coins_name`, using `crypto_df.index` as the index for this new DataFrame.
- Remove the `CoinName` column.
- Create dummy variables for all the text features, and store the resulting data in a DataFrame named `X`.
- Use the `StandardScaler` from `sklearn` to standardize all the data in the `X` DataFrame.
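The preprocessing steps above could look roughly like the sketch below. The `preprocess` helper and the sample rows are hypothetical, and two points are assumptions rather than facts from the notebook: a coin is treated as having no working algorithm when its `Algorithm` value is the text 'N/A', and `TotalCoinSupply` is assumed to arrive as text that needs converting to numbers.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

COLUMNS = ["CoinName", "Algorithm", "IsTrading", "ProofType",
           "TotalCoinsMined", "TotalCoinSupply"]


def preprocess(crypto_df):
    """Apply the listed cleaning steps; return (crypto_df, coins_name, X, X_scaled)."""
    df = crypto_df[COLUMNS].copy()
    df = df[df["IsTrading"].eq(True)]       # only coins that are trading
    df = df[df["Algorithm"] != "N/A"]       # assumption: 'N/A' means no working algorithm
    df = df.drop(columns="IsTrading")
    df = df.dropna()                        # rows with at least one null value
    df = df[df["TotalCoinsMined"] > 0]      # coins with no coins mined
    df = df[(df != "N/A").all(axis=1)]      # rows containing 'N/A' text anywhere

    coins_name = df[["CoinName"]].copy()    # keeps the same index as df
    df = df.drop(columns="CoinName")

    # Assumption: the supply column is text and must be numeric before scaling.
    df["TotalCoinSupply"] = pd.to_numeric(df["TotalCoinSupply"])

    X = pd.get_dummies(df, columns=["Algorithm", "ProofType"], dtype=float)
    X_scaled = StandardScaler().fit_transform(X)
    return df, coins_name, X, X_scaled


# Hypothetical sample rows, only to exercise each filter once.
raw = pd.DataFrame(
    {
        "CoinName": ["Alpha", "Beta", "Gamma", "Delta", "Epsilon", "Zeta", "Eta"],
        "Algorithm": ["Scrypt", "SHA-256", "N/A", "X11", "Scrypt", "SHA-256", "Scrypt"],
        "IsTrading": [True, True, True, False, True, True, True],
        "ProofType": ["PoW", "PoS", "PoW", "PoW", "PoW/PoS", "PoW", "PoW"],
        "TotalCoinsMined": [100.0, 250.0, 50.0, 75.0, None, 0.0, 10.0],
        "TotalCoinSupply": ["1000", "500", "200", "300", "400", "21000000", "50"],
    },
    index=["ALP", "BET", "GAM", "DEL", "EPS", "ZET", "ETA"],
)

crypto_df, coins_name, X, X_scaled = preprocess(raw)
print(X.columns.tolist())
print(X_scaled.shape)
```

Passing `dtype=float` to `pd.get_dummies` keeps every column numeric, so the scaler receives a uniform matrix.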
Used the `PCA` algorithm from `sklearn` to reduce the dimensions of the `X` DataFrame down to three principal components.
After reducing the data dimensions, created a DataFrame named `pcs_df` with the column names "PC 1", "PC 2", and "PC 3", using `crypto_df.index` as the index for this new DataFrame.
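As an illustration of this step, the sketch below runs `PCA` on random stand-in data. The random matrix, the coin labels, and the `random_state` value are placeholders, not the project's data or settings.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# Stand-in for the standardized X data: 20 coins, 6 features (hypothetical).
rng = np.random.default_rng(0)
index = [f"COIN{i}" for i in range(20)]
X_scaled = rng.normal(size=(20, 6))

# Reduce the feature matrix to three principal components.
pca = PCA(n_components=3, random_state=0)
principal_components = pca.fit_transform(X_scaled)

# Wrap the components in a DataFrame that reuses the original index.
pcs_df = pd.DataFrame(
    principal_components,
    columns=["PC 1", "PC 2", "PC 3"],
    index=index,
)

# Share of the total variance captured by each component.
print(pca.explained_variance_ratio_)
print(pcs_df.head())
```

Checking `explained_variance_ratio_` is a quick way to see how much information the three components retain.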
Final DataFrame looks like:
Used the `KMeans` algorithm from `sklearn` to cluster the cryptocurrencies using the PCA data.
Performed the following tasks:
- Created an Elbow Curve to find the best value for `k` using the `pcs_df` DataFrame.
- After defining the best value for `k`, ran the `KMeans` algorithm on `pcs_df` to predict the `k` clusters for the cryptocurrency data.
- Created a new DataFrame named `clustered_df` that includes the following columns: "Algorithm", "ProofType", "TotalCoinsMined", "TotalCoinSupply", "PC 1", "PC 2", "PC 3", "CoinName", "Class". Maintained the index of the `crypto_df` DataFrame, as shown below.
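A minimal sketch of the three clustering tasks, run on synthetic data from `make_blobs` rather than the real coins. The range of `k` values, `best_k = 4`, `n_init`, and `random_state` are assumptions chosen for the example; the README does not state which value of `k` was selected.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-ins for pcs_df, crypto_df and coins_name (hypothetical data).
points, _ = make_blobs(n_samples=60, centers=4, n_features=3, random_state=0)
index = [f"COIN{i}" for i in range(60)]
pcs_df = pd.DataFrame(points, columns=["PC 1", "PC 2", "PC 3"], index=index)
crypto_df = pd.DataFrame(
    {"Algorithm": "Scrypt", "ProofType": "PoW",
     "TotalCoinsMined": 1.0, "TotalCoinSupply": 10.0},
    index=index,
)
coins_name = pd.DataFrame({"CoinName": [f"Coin {i}" for i in range(60)]}, index=index)

# Elbow curve data: inertia for each candidate k.
k_values = list(range(1, 11))
inertia = []
for k in k_values:
    model = KMeans(n_clusters=k, n_init=10, random_state=0)
    model.fit(pcs_df)
    inertia.append(model.inertia_)
elbow_df = pd.DataFrame({"k": k_values, "inertia": inertia})

# Assumption: the elbow is read off the curve by eye; 4 suits this synthetic data.
best_k = 4
model = KMeans(n_clusters=best_k, n_init=10, random_state=0)
predictions = model.fit_predict(pcs_df)

# Combine the original features, the components, the names and the class labels.
clustered_df = pd.concat([crypto_df, pcs_df, coins_name], axis=1)
clustered_df["Class"] = predictions
print(elbow_df)
print(clustered_df.head())
```

Because all three DataFrames share the same index, `pd.concat(..., axis=1)` aligns the rows without any extra merge keys.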
Created data visualizations to present the final results. Performed the following tasks:
- Created a 3D scatter plot using Plotly Express to plot the clusters from the `clustered_df` DataFrame. Included the parameters `hover_name="CoinName"` and `hover_data=["Algorithm"]` to show this additional information on each data point.
- Used `hvplot.table` to create a data table with all the currently tradable cryptocurrencies. The table has the following columns: "CoinName", "Algorithm", "ProofType", "TotalCoinSupply", "TotalCoinsMined", "Class".
- Created a scatter plot using `hvplot.scatter` to present the clustered cryptocurrency data, with `x="TotalCoinsMined"` and `y="TotalCoinSupply"`, to contrast the number of available coins versus the total number of mined coins. Used the `hover_cols=["CoinName"]` parameter to include the cryptocurrency name on each data point.
Uploaded the Jupyter notebook to Amazon SageMaker and deployed it.
The `hvplot` and Plotly Express libraries are not included in the built-in Anaconda environments, so the `altair` library was used instead.
Performed the following tasks:
- Uploaded the Jupyter notebook and renamed it to `crypto_clustering_sm.ipynb`.
- Selected the `conda_python3` environment.
- Installed the `altair` library by running the following code before the initial imports.
!pip install -U altair
- Used the `altair` scatter plot to create the Elbow Curve.
- Used the `altair` scatter plot, instead of the 3D scatter plot from Plotly Express, to visualize the clusters. Since this is a 2D scatter plot, used `x="PC 1"` and `y="PC 2"` for the axes, and added the following columns as tooltips: "CoinName", "Algorithm", "TotalCoinsMined", "TotalCoinSupply".
- Used the `altair` scatter plot to visualize the tradable cryptocurrencies using `x="TotalCoinsMined"` and `y="TotalCoinSupply"` for the axes.
- Showed the table of currently tradable cryptocurrencies using the `display()` command.