Pre-requisite: Getting started with machine learning
scikit-learn is an open-source Python library that implements a range of machine learning, pre-processing, cross-validation, and visualization algorithms using a unified interface.
Important features of scikit-learn:
Simple and efficient tools for predictive data analysis, accessible to non-experts and reusable in many contexts.
Built on NumPy, SciPy, and matplotlib.
Open source and commercially usable (BSD license).
Covers classification, regression, clustering, dimensionality reduction, model selection, and preprocessing.
In this article, we are going to see how we can easily build a machine learning model using scikit-learn.
Installation:
At the time of writing, the latest version of scikit-learn is 1.1, which requires Python 3.8 or newer.
Scikit-learn requires:
Python (3.8 or newer)
NumPy
SciPy
Before installing scikit-learn, ensure that you have NumPy and SciPy installed. Once you have a working installation of NumPy and SciPy, the easiest way to install scikit-learn is using pip:
pip install -U scikit-learn
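To confirm that the installation worked, you can import the library and print its version from a Python shell (a quick sanity check, shown here as a minimal sketch):

# check that scikit-learn is importable and see which version is installed
import sklearn
print(sklearn.__version__)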
Let us get started with the modeling process now.
Step 1: Loading a dataset
A dataset is nothing but a collection of data. A dataset generally has two main components:
Features: the individual measurable variables of the data, also called predictors or attributes. The set of feature values for all samples forms the feature matrix, conventionally named X.
Response: the output variable that depends on the features, also called the target or label. The set of response values forms the response vector, conventionally named y.
Loading exemplar dataset: scikit-learn comes loaded with a few example datasets, such as the iris and digits datasets for classification and the diabetes dataset for regression (the boston house prices dataset was historically included but has been deprecated and removed in recent versions).
Given below is an example of how one can load an exemplar dataset:
# load the iris dataset as an example
from sklearn.datasets import load_iris
iris = load_iris()

# store the feature matrix (X) and response vector (y)
X = iris.data
y = iris.target

# store the feature and target names
feature_names = iris.feature_names
target_names = iris.target_names

# printing features and target names of our dataset
print("Feature names:", feature_names)
print("Target names:", target_names)

# X and y are numpy arrays
print("\nType of X is:", type(X))

# printing first 5 input rows
print("\nFirst 5 rows of X:\n", X[:5])
Output:
Feature names: ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
Target names: ['setosa' 'versicolor' 'virginica']

Type of X is: <class 'numpy.ndarray'>

First 5 rows of X:
 [[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]
 [4.6 3.1 1.5 0.2]
 [5.  3.6 1.4 0.2]]
Loading external dataset: Now, consider the case when we want to load an external dataset. For this purpose, we can use the pandas library for easily loading and manipulating datasets.
To install pandas, use the following pip command:
pip install pandas
In pandas, the two most important data structures are:
Series: a one-dimensional labeled array capable of holding any data type.
DataFrame: It is a 2-dimensional labeled data structure with columns of potentially different types. You can think of it like a spreadsheet or SQL table, or a dict of Series objects. It is generally the most commonly used pandas object.
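As a quick illustration of these two structures, here is a minimal sketch using made-up values (not part of the weather dataset used below):

import pandas as pd

# a Series: one-dimensional labeled array
temperatures = pd.Series([30, 25, 27], index=['mon', 'tue', 'wed'])
print(temperatures)

# a DataFrame: 2-dimensional labeled structure with columns of different types
df = pd.DataFrame({
    'Outlook': ['sunny', 'rainy', 'overcast'],
    'Temperature': [30, 25, 27],
    'Windy': [False, True, False]
})
print(df)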
Note: The CSV file used in the example below can be downloaded from here: weather.csv
import pandas as pd

# reading csv file
data = pd.read_csv('weather.csv')

# shape of dataset
print("Shape:", data.shape)

# column names
print("\nFeatures:", data.columns)

# storing the feature matrix (X) and response vector (y)
X = data[data.columns[:-1]]
y = data[data.columns[-1]]

# printing first 5 rows of feature matrix
print("\nFeature matrix:\n", X.head())

# printing first 5 values of response vector
print("\nResponse vector:\n", y.head())
Output:
Shape: (14, 5)

Features: Index(['Outlook', 'Temperature', 'Humidity', 'Windy', 'Play'], dtype='object')

Feature matrix:
     Outlook Temperature Humidity  Windy
0  overcast         hot     high  False
1  overcast        cool   normal   True
2  overcast        mild     high   True
3  overcast         hot   normal  False
4     rainy        mild     high  False

Response vector:
0    yes
1    yes
2    yes
3    yes
4    yes
Name: Play, dtype: object
Step 2: Splitting the dataset
One important aspect of any machine learning model is determining its accuracy. An obvious approach is to train the model on the given dataset, then predict the response values for that same dataset and compute the accuracy from those predictions.
But this method has several flaws in it, like:
The goal is to estimate how well the model performs on data it has never seen, and testing on the training data does not measure that.
A model that simply memorizes the training data can achieve a very high (even perfect) score while generalizing poorly, a problem known as overfitting.
As a result, the accuracy obtained this way is overly optimistic, as the short demonstration below shows.
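Here is a minimal sketch of that effect, using the iris dataset and a KNN classifier (which we will meet again in Step 3):

from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

iris = load_iris()
X, y = iris.data, iris.target

# train and evaluate on the SAME data
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X, y)

# with n_neighbors=1, every training point is its own nearest neighbor,
# so the "accuracy" is a perfect 1.0 -- which says nothing about new data
print(accuracy_score(y, knn.predict(X)))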
A better option is to split our data into two parts: the first one for training our machine learning model, and the second one for testing our model.
To summarize:
Split the dataset into two pieces: a training set and a testing set.
Train the model on the training set.
Test the model on the testing set and evaluate how well it performs.
Advantages of train/test split:
The model is trained and tested on different data.
The response values are known for the test set, so the predictions can be evaluated.
Testing accuracy is a better estimate of out-of-sample performance than training accuracy.
Consider the example below:
# load the iris dataset as an example
from sklearn.datasets import load_iris
iris = load_iris()

# store the feature matrix (X) and response vector (y)
X = iris.data
y = iris.target

# splitting X and y into training and testing sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=1)

# printing the shapes of the new X objects
print(X_train.shape)
print(X_test.shape)

# printing the shapes of the new y objects
print(y_train.shape)
print(y_test.shape)
Output:
(90, 4)
(60, 4)
(90,)
(60,)
The train_test_split function takes several arguments which are explained below:
X, y: the feature matrix and response vector to be split.
test_size: the proportion of the data to hold out for testing. Here, test_size=0.4 means 40% of the samples go to the test set and the remaining 60% are used for training.
random_state: a seed for the random number generator used to shuffle the data before splitting. Fixing it (here to 1) makes the split reproducible, so the same train/test partition is produced on every run.
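As a small illustration of the last point, here is a sketch showing that the same random_state yields the same split, while a different seed generally does not:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
import numpy as np

X, y = load_iris(return_X_y=True)

# two calls with the same seed produce identical splits
X_tr1, X_te1, y_tr1, y_te1 = train_test_split(X, y, test_size=0.4, random_state=1)
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(X, y, test_size=0.4, random_state=1)
print(np.array_equal(X_te1, X_te2))   # True

# a different seed gives a different partition of the same data
X_tr3, X_te3, y_tr3, y_te3 = train_test_split(X, y, test_size=0.4, random_state=2)
print(np.array_equal(X_te1, X_te3))   # False (almost certainly)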
Step 3: Training the model
Now, it’s time to train some prediction models using our dataset. Scikit-learn provides a wide range of machine learning algorithms that share a unified/consistent interface for fitting, predicting, evaluating accuracy, etc.
The example given below uses the KNN (K nearest neighbors) classifier.
Note: We will not go into the details of how the algorithm works as we are interested in understanding its implementation only.
Now, consider the example below:
# load the iris dataset as an example
from sklearn.datasets import load_iris
iris = load_iris()

# store the feature matrix (X) and response vector (y)
X = iris.data
y = iris.target

# splitting X and y into training and testing sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=1)

# training the model on the training set
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)

# making predictions on the testing set
y_pred = knn.predict(X_test)

# comparing actual response values (y_test) with predicted response values (y_pred)
from sklearn import metrics
print("kNN model accuracy:", metrics.accuracy_score(y_test, y_pred))

# making predictions for out-of-sample data
sample = [[3, 5, 4, 2], [2, 3, 5, 4]]
preds = knn.predict(sample)
pred_species = [iris.target_names[p] for p in preds]
print("Predictions:", pred_species)

# saving the model (sklearn.externals.joblib was removed in recent versions; use joblib directly)
import joblib
joblib.dump(knn, 'iris_knn.pkl')
Output:
kNN model accuracy: 0.983333333333
Predictions: ['versicolor', 'virginica']
Important points to note from the above code:
We create a KNN classifier object using:
knn = KNeighborsClassifier(n_neighbors=3)
The classifier is trained on the X_train data. This process is termed fitting; we pass the feature matrix and the corresponding response vector:
knn.fit(X_train, y_train)
The classifier then predicts responses for the X_test data. This process is termed prediction:
y_pred = knn.predict(X_test)
The model is evaluated by comparing y_test and y_pred using the accuracy_score method of the metrics module:
print(metrics.accuracy_score(y_test, y_pred))
If you want the model to make predictions on out-of-sample data, the samples are passed in the same way as any feature matrix:
sample = [[3, 5, 4, 2], [2, 3, 5, 4]]
preds = knn.predict(sample)
If you do not want to retrain the classifier every time, you can save the trained classifier using joblib:
joblib.dump(knn, 'iris_knn.pkl')
To load an already saved classifier, use:
knn = joblib.load('iris_knn.pkl')
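Putting the last two points together, a minimal sketch of reusing the saved model in a fresh session (assuming 'iris_knn.pkl' was written by the code above) might look like this:

import joblib
from sklearn.datasets import load_iris

# load the previously saved classifier from disk
knn = joblib.load('iris_knn.pkl')

# reuse it for predictions without retraining
iris = load_iris()
sample = [[3, 5, 4, 2]]
pred = knn.predict(sample)
print(iris.target_names[pred[0]])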
As we approach the end of this article, here are some benefits of using scikit-learn over some other machine learning libraries (like the R libraries):
Consistent interface to many different machine learning models: once you know fit/predict for one estimator, you know it for all of them.
Sensible default values for most tuning parameters, with plenty of options to adjust when needed.
Extensive, well-maintained documentation and an active community.
Good integration with the rest of the scientific Python ecosystem (NumPy, SciPy, pandas, matplotlib).
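To illustrate the first point, here is a short sketch showing that a completely different model (logistic regression) is trained and evaluated with exactly the same calls we used for KNN above:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn import metrics

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=1)

# same fit/predict interface as KNeighborsClassifier
logreg = LogisticRegression(max_iter=200)
logreg.fit(X_train, y_train)
y_pred = logreg.predict(X_test)
print("Logistic regression accuracy:", metrics.accuracy_score(y_test, y_pred))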