Learn to Publish Your Python Package in Simple Ways
Learn How to Turn Your ML Project into a Reusable Python Library
As a data scientist working in Python, you have almost certainly installed machine learning packages for your data work. But have you ever thought about creating and publishing your own package, especially if you have a reusable pipeline you want to share with the community?
Building and publishing your package has many advantages, including consistency across coding projects, easy collaboration with the community, and improved credibility within the field.
Given how valuable building and publishing your own package can be, this article will teach you how to do it in a few easy steps.
Curious about it? Let’s get into it.
Build and Publish Your Python ML Packages
In this article, we will build a reusable machine learning classifier pipeline that we can adapt simply by changing parameters instead of structuring the pipeline from scratch each time. This pipeline will be distributed as a package called selectml. You can also find the whole code in the selectml repository.
As a reminder, this article assumes the reader is already comfortable using Python and understands how to build a simple machine learning model. We will not explore machine learning itself; we will focus on building and publishing the ML package.
Let’s start our project by preparing the structure.
Step 1: Prepare Your Project Structure
The first step is to prepare our project structure. For the selectml package, we will use the following structure. Also, don’t forget to create a virtual environment to isolate everything from your main environment.
selectml/
├── selectml/
│ ├── __init__.py
│ ├── preprocessing.py
│ ├── models.py
│ └── pipeline.py
├── tests/
│ └── test_pipeline.py
├── README.md
├── setup.py
├── LICENSE
└── requirements.txt
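If you’d rather script this scaffolding than create each file by hand, here is a minimal sketch using only the Python standard library; run it from the directory where you want the project to live (the paths simply mirror the layout above).
from pathlib import Path

# Paths mirroring the project layout above
files = [
    "selectml/selectml/__init__.py",
    "selectml/selectml/preprocessing.py",
    "selectml/selectml/models.py",
    "selectml/selectml/pipeline.py",
    "selectml/tests/test_pipeline.py",
    "selectml/README.md",
    "selectml/setup.py",
    "selectml/LICENSE",
    "selectml/requirements.txt",
]

for name in files:
    path = Path(name)
    path.parent.mkdir(parents=True, exist_ok=True)  # create parent folders as needed
    path.touch(exist_ok=True)                       # create an empty file if missing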
Once all the folders and files are in place as structured above, let’s move on to the next step.
Step 2: Develop The Library Code
This is the step in which we will develop all the machine learning pipeline code required for the selectml library. We will divide it into three parts: the data preprocessing, the models, and the pipeline.
Let’s start with preprocessing, using the following code inside the preprocessing.py file.
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer


class DataPreprocessor:
    def __init__(self):
        self.preprocessor = None

    def fit_transform(self, df, numerical_features, categorical_features):
        """Fits transformers on numerical and categorical features and transforms the data."""
        numeric_transformer = StandardScaler()
        categorical_transformer = OneHotEncoder(handle_unknown='ignore')
        self.preprocessor = ColumnTransformer(
            transformers=[
                ('num', numeric_transformer, numerical_features),
                ('cat', categorical_transformer, categorical_features)
            ]
        )
        return self.preprocessor.fit_transform(df)

    def transform(self, df):
        """Transforms new data using the previously fitted transformer."""
        if self.preprocessor is None:
            raise ValueError("The preprocessor has not been fitted. Call fit_transform() first.")
        return self.preprocessor.transform(df)
The code above defines a class called DataPreprocessor with two methods that transform all the numerical and categorical features into a form the machine learning model can accept.
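As a quick illustration, here is a minimal sketch of how DataPreprocessor could be used on its own, assuming you run it from the project root (or after installing the package); the toy DataFrame and column lists are made up for the example.
import pandas as pd
from selectml.preprocessing import DataPreprocessor

# Hypothetical toy data for illustration only
df = pd.DataFrame({
    'age': [25, 32, 47],
    'income': [50000, 60000, 80000],
    'city': ['New York', 'Chicago', 'Houston']
})

prep = DataPreprocessor()
# Fit the transformers and transform the data in one call
X = prep.fit_transform(df, numerical_features=['age', 'income'], categorical_features=['city'])
print(X.shape)  # two scaled numeric columns plus one-hot columns for 'city'
Note that calling transform() on new data before fit_transform() raises a ValueError, which is exactly the guard we built into the class.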
Next, we will prepare the model selector class, where we only need to pass a string parameter to switch between models.
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC


class ModelSelector:
    def __init__(self, model_type='logistic'):
        self.model_type = model_type
        if model_type == 'logistic':
            self.model = LogisticRegression()
        elif model_type == 'random_forest':
            self.model = RandomForestClassifier(n_estimators=100)
        elif model_type == 'svm':
            self.model = SVC(probability=True)
        else:
            raise ValueError("Unsupported model type. Choose 'logistic', 'random_forest', or 'svm'.")

    def train(self, X, y):
        """Fits the selected model on the training data."""
        self.model.fit(X, y)
        return self

    def predict(self, X):
        """Generates predictions from the fitted model."""
        return self.model.predict(X)

    def predict_proba(self, X):
        """Provides probability estimates if available."""
        if hasattr(self.model, "predict_proba"):
            return self.model.predict_proba(X)
        else:
            raise AttributeError("This model does not support probability estimates.")
The ModelSelector class comes with three methods, similar to the Scikit-Learn API, that let us train the model and make predictions on the data we pass.
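To see it in isolation, here is a minimal sketch with placeholder arrays; any of the three model strings would work here.
import numpy as np
from selectml.models import ModelSelector

# Placeholder training data for illustration only
X = np.array([[0.1, 1.0], [0.4, 0.2], [0.9, 0.8], [0.3, 0.6]])
y = np.array([0, 0, 1, 1])

selector = ModelSelector(model_type='logistic')  # or 'random_forest' / 'svm'
selector.train(X, y)
print(selector.predict(X))        # predicted class labels
print(selector.predict_proba(X))  # class probabilities; LogisticRegression supports this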
Lastly, we will tie together both modules we have developed previously to create a smooth machine-learning pipeline.
from .preprocessing import DataPreprocessor
from .models import ModelSelector
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report


class ModelSelectionPipeline:
    def __init__(self, model_type='logistic'):
        self.preprocessor = DataPreprocessor()
        self.model_selector = ModelSelector(model_type=model_type)

    def run_pipeline(self, df, target, numerical_features, categorical_features, test_size=0.2, random_state=42):
        """
        Executes the entire pipeline:
        - Preprocesses the data
        - Splits into training and test sets
        - Trains the selected model
        - Evaluates model performance
        """
        X = df.drop(columns=[target])
        y = df[target]
        X_processed = self.preprocessor.fit_transform(X, numerical_features, categorical_features)
        X_train, X_test, y_train, y_test = train_test_split(X_processed, y, test_size=test_size, random_state=random_state)
        self.model_selector.train(X_train, y_train)
        predictions = self.model_selector.predict(X_test)
        acc = accuracy_score(y_test, predictions)
        report = classification_report(y_test, predictions)
        return {
            'accuracy': acc,
            'report': report,
            'model': self.model_selector.model
        }
The ModelSelectionPipeline class runs an end-to-end pipeline, from preprocessing the data to evaluating model performance. That’s everything for our core modules. Let’s expose them all by adding the following code to the __init__.py file.
from .preprocessing import DataPreprocessor
from .models import ModelSelector
from .pipeline import ModelSelectionPipeline

__all__ = ["DataPreprocessor", "ModelSelector", "ModelSelectionPipeline"]
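With the init file in place, everything can be imported from the package root. Here is a quick sketch of what that looks like once the package is installed (or when running from the project root):
# The __init__.py above makes all three classes importable from the package root
from selectml import DataPreprocessor, ModelSelector, ModelSelectionPipeline

pipeline = ModelSelectionPipeline(model_type='logistic')
print(type(pipeline.preprocessor).__name__)    # DataPreprocessor
print(type(pipeline.model_selector).__name__)  # ModelSelector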
With all the code ready, let’s add a unit test to verify that the code works correctly.
Step 3: Unit Test
A unit test validates the pipeline we created previously. It’s a simple check with example data to see whether our pipeline produces the desired output.
To do that, fill the test_pipeline.py file with the following code.
import pandas as pd
from selectml.pipeline import ModelSelectionPipeline


def test_model_selection_pipeline():
    data = {
        'age': [25, 32, 47, 51, 23, 45, 36, 29, 40, 33, 28, 52, 37, 46, 31, 44, 39, 27, 50, 35],
        'income': [50000, 60000, 80000, 90000, 40000, 75000, 65000, 55000, 70000, 62000,
                   48000, 91000, 68000, 77000, 59000, 80000, 72000, 53000, 85000, 66000],
        'city': ['New York', 'Los Angeles', 'Chicago', 'Houston', 'Phoenix', 'Philadelphia',
                 'San Antonio', 'San Diego', 'Dallas', 'San Jose', 'Austin', 'Jacksonville',
                 'Fort Worth', 'Columbus', 'Charlotte', 'Indianapolis', 'San Francisco',
                 'Seattle', 'Denver', 'Washington'],
        'purchased': [0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0]
    }
    df = pd.DataFrame(data)
    numerical_features = ['age', 'income']
    categorical_features = ['city']

    pipeline = ModelSelectionPipeline(model_type='random_forest')
    result = pipeline.run_pipeline(df, target='purchased',
                                   numerical_features=numerical_features,
                                   categorical_features=categorical_features)

    assert 'accuracy' in result
    assert 'report' in result
    print("Test passed with accuracy:", result['accuracy'])
We can run the test using the command below. If you want to try it right away, you can change the test to import the pipeline class directly from the module we created above; in this article, however, we will try out the test after publishing our library.
pytest tests/test_pipeline.py
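Alternatively, if you prefer to launch the test from inside a Python session rather than the shell, pytest also offers a programmatic entry point; here is a minimal sketch.
import pytest

# Run the test file programmatically; the return value is an exit code
# (0 means every test passed)
exit_code = pytest.main(["tests/test_pipeline.py", "-v"])
print("pytest exit code:", exit_code)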
With everything in place, let’s document our package.