In my previous post, I went over why it is important to drill down into a model’s classification report when evaluating its performance in relative terms. The overall accuracy score is not fully representative and can be misleading when classes are imbalanced, particularly when Type I and Type II errors differ massively in their bottom-line impact.
We’ll start off with a couple of models and go over the results to see what recommendations we can make for the business functions in charge of managing customer turnover.
The XGBoost model with some simple preprocessing and hyperparameter…
I’ve noticed that there are more than enough cool visualizations and mostly cookie-cutter solutions to ML problems, but not enough focus on the business side of understanding the problem and evaluating the ML model results.
An ML problem exists to provide actionable insight and value to the business.
So here goes…
According to this article, the cost of acquiring a new customer is almost 50x the cost of retaining an existing customer for telecom companies.
However, the solutions I’ve found for this Kaggle data set don’t explain the impact of the model nor dig deep…
The purpose of eval() and query() is to improve performance by using C within the bounds of NumPy.
However, a reduction in computational time is not guaranteed in all situations, especially since these libraries are continually updated. Dated articles and books may not accurately reflect the current performance of these libraries.
I will not discuss Numba, which is also a powerful performance-enhancing tool.
NumPy vs. NumExpr
NumPy is built for mathematical computations and takes advantage of vectorization and broadcasting. One downside is the additional overhead from allocating memory for each computational step.
This is where NumExpr…
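To make the per-step memory allocation mentioned above concrete, here is a minimal sketch in plain NumPy. It does not use NumExpr itself; the array names and sizes are made up for illustration, and the second half only shows one way to avoid temporaries by hand, not necessarily what any library does internally.

```python
import numpy as np

# Illustrative data only; names and sizes are assumptions, not from the post.
rng = np.random.default_rng(0)
a = rng.random(100_000)
b = rng.random(100_000)

# A compound expression is evaluated one step at a time.
# 2 * a, 3 * b, and their sum each allocate a new full-size array.
naive = 2 * a + 3 * b

# The same arithmetic with explicitly reused buffers via the out= argument,
# so no additional temporaries are created inside the expression.
result = np.empty_like(a)
scratch = np.empty_like(a)
np.multiply(a, 2, out=result)        # result = 2 * a
np.multiply(b, 3, out=scratch)       # scratch = 3 * b
np.add(result, scratch, out=result)  # result = 2 * a + 3 * b
```

Both versions compute the same values; they differ only in how many intermediate arrays get allocated along the way.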
ANOVA stands for “Analysis of Variance”. Here we will focus on single-factor ANOVA.
For example, if you want to know whether a statistically significant difference exists between the means of multiple categories, then you would use ANOVA.
Let’s say you have variables 1, 2, and 3. You would state your null hypothesis as follows: H₀: μ₁ = μ₂ = μ₃.
This null hypothesis states that there is no statistically significant difference between the means of variables 1, 2, and 3.
You could extend this to n variables.
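As a rough illustration of how that null hypothesis gets tested, here is a small sketch that computes the single-factor ANOVA F statistic by hand using only the standard library. The function name one_way_anova_f and the three sample lists are made up for this example. It stops at the F statistic; turning that into a p-value requires the F distribution, which a library routine such as scipy.stats.f_oneway handles for you.

```python
from statistics import mean


def one_way_anova_f(*groups):
    """Return (F statistic, df_between, df_within) for single-factor ANOVA."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    group_means = [mean(g) for g in groups]

    # Variation of the group means around the grand mean.
    ss_between = sum(
        len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means)
    )
    # Variation of each observation around its own group mean.
    ss_within = sum(
        (x - m) ** 2 for g, m in zip(groups, group_means) for x in g
    )

    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical observations for variables 1, 2, and 3.
var_1 = [1.0, 2.0, 3.0]
var_2 = [2.0, 3.0, 4.0]
var_3 = [5.0, 6.0, 7.0]

f_stat, df_b, df_w = one_way_anova_f(var_1, var_2, var_3)
print(f_stat, df_b, df_w)
```

A large F statistic means the group means vary more than the spread within each group would suggest, which is evidence against the null hypothesis. Because the function accepts any number of groups, the same call works for n variables.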
import numpy as np
import tensorflow as tf
# Determine CSV, label, and key columns
CSV_COLUMNS = 'weight_pounds,is_male,mother_age,plurality,gestation_weeks,key'.split(',')
LABEL_COLUMN = 'weight_pounds'
KEY_COLUMN = 'key'
# Set default values for each CSV column
DEFAULTS = [[0.0], ['null'], [0.0], ['null'], [0.0], ['nokey']]
TRAIN_STEPS = 1000
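def read_dataset(filename, mode, batch_size=512):
    # NOTE: value_column is not defined in this excerpt; it is presumably the raw CSV line(s) being parsed.
    # tf.decode_csv is the TF 1.x name; in TF 2.x the same op is tf.io.decode_csv.
    columns = tf.decode_csv(value_column, record_defaults=DEFAULTS)
    features = dict(zip(CSV_COLUMNS, columns))
    label = features.pop(LABEL_COLUMN) …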