Who Will Become a Customer?

Customer Segmentation and Customer Identification from a Mailout Campaign for Arvato Financial Services


This blog post covers the Capstone project for Udacity’s Data Scientist Nanodegree program. The project is based on a real-life data science problem, with data provided by Udacity’s partners at Bertelsmann Arvato Analytics.

The goal of this project is to identify which individuals are most likely to respond to a marketing campaign and become customers of the mail-order company.

Project Overview

We analyze demographics data for customers of a mail-order sales company in Germany and compare it against demographics data for the general population. We then build customer segments of the general population that the company can target with its marketing in order to grow, and use a model to predict which individuals are most likely to become customers.

The project uses two approaches:

I. Unsupervised learning, to perform customer segmentation and identify the clusters/segments of the general population that best match the mail-order company’s customer base.

II. Supervised learning, to identify which targets of the company’s marketing campaign could become its customers.

Finally, we make predictions on the campaign data as part of a Kaggle competition, ranking individuals by how likely they are to convert into customers.

Project Steps

In this project we are provided with 4 datasets in total.

  • Udacity_AZDIAS_052018.csv: Demographics data for the general population of Germany; 891,211 persons (rows) x 366 features (columns).
  • Udacity_CUSTOMERS_052018.csv: Demographics data for customers of a mail-order company; 191,652 persons (rows) x 369 features (columns).
  • Udacity_MAILOUT_052018_TRAIN.csv: Demographics data for individuals who were targets of a marketing campaign; 42,982 persons (rows) x 367 features (columns).
  • Udacity_MAILOUT_052018_TEST.csv: Demographics data for individuals who were targets of a marketing campaign; 42,833 persons (rows) x 366 features (columns).

Two additional files, DIAS Information Levels - Attributes 2017.xlsx and DIAS Attributes - Values 2017.xlsx, were also very important during the modeling process: they explain the features in the datasets and their values.

First, we clean the data.

Some columns whose names start with ‘LP’ contain strange values that are not specified in the attribute description. These columns also carry a lot of information that can be compressed. For these reasons, I recoded these columns so that they contain only useful information.

According to the attributes dataframe, some columns record unknown values using specific codes. There are 232 such columns, displayed below. I converted all of these unknown values into NaN values. (Figure1)
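The conversion can be sketched as follows. This is a minimal illustration, not the project's actual code: the `unknown_codes` mapping, the column names and the codes themselves are hypothetical stand-ins for what would be read from the attributes file.

```python
import pandas as pd

# Hypothetical mapping: column name -> codes that mean "unknown".
# In the project this information comes from the attributes file.
unknown_codes = {
    "FEATURE_A": [-1, 0],
    "FEATURE_B": [-1, 0, 9],
}

def replace_unknown_with_nan(df, unknown_codes):
    """Return a copy of df in which the listed unknown codes become NaN."""
    df = df.copy()
    for col, codes in unknown_codes.items():
        if col in df.columns:
            # mask() sets an entry to NaN wherever the condition is True
            df[col] = df[col].mask(df[col].isin(codes))
    return df

toy = pd.DataFrame({
    "FEATURE_A": [-1, 1, 2, 0],
    "FEATURE_B": [1, 9, 3, 4],
    "OTHER": [5, 6, 7, 8],   # not in the mapping, left untouched
})
cleaned = replace_unknown_with_nan(toy, unknown_codes)
```

Working on a copy keeps the raw dataframe available for comparison while the cleaning rules are still being adjusted.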


After that, I could identify, understand and deal with the missing data.

  • The plot above (Figure2) shows an overlap between the Azdias and Customers data, i.e. data is missing from the same columns.
  • The percentages of missing values in Azdias and Customers also coincide closely, which is consistent with the customers data being a subset of the Azdias data.

We can set a threshold on the missing percentage and count how many columns exceed it. I set the threshold value at 30%.

As a result:

  • 11 features have more than 30% missing values in the Customers data, and 9 features in the Azdias data. In total, 279 columns have missing values in both dataframes.
  • I dropped the columns above the 30% threshold from both dataframes. (Figure3)
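The thresholding step can be sketched like this. The toy dataframes and the helper `columns_above_threshold` are hypothetical, and dropping a column from both frames when it exceeds the threshold in either one is an assumption made here so that the two frames keep identical columns.

```python
import numpy as np
import pandas as pd

def columns_above_threshold(df, threshold=30.0):
    """Return the columns whose share of missing values (in %) exceeds threshold."""
    missing_pct = df.isna().mean() * 100
    return missing_pct[missing_pct > threshold].index.tolist()

azdias_toy = pd.DataFrame({
    "A": [1, np.nan, 3, 4],        # 25% missing
    "B": [np.nan, np.nan, 3, 4],   # 50% missing
    "C": [1, 2, 3, 4],             # 0% missing
})
customers_toy = pd.DataFrame({
    "A": [np.nan, np.nan, 3, 4],   # 50% missing
    "B": [np.nan, 2, 3, 4],        # 25% missing
    "C": [1, 2, 3, 4],             # 0% missing
})

# Assumption: drop a column from both frames if it is above the
# threshold in at least one of them.
to_drop = sorted(set(columns_above_threshold(azdias_toy))
                 | set(columns_above_threshold(customers_toy)))
azdias_reduced = azdias_toy.drop(columns=to_drop)
customers_reduced = customers_toy.drop(columns=to_drop)
```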

After removing the columns with more than 30% missing values, we can check the rows of the remaining dataframes for missing values.


As Figure4 shows:

  • Most rows have fewer than 50 missing values in both dataframes.
  • The Customers data has comparatively more rows with missing values than Azdias.

We dropped all rows with more than 50 missing values.


After removing all rows with more than 50 missing features, some rows still had between 1 and 50 missing values, as seen in the plot above. I filled these remaining missing values with the most common value of each column. (Figure5)
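Both row-level steps can be sketched with pandas. The helper names and the toy data are hypothetical, and the cutoff is lowered to 1 here only so the effect is visible on a three-column example.

```python
import numpy as np
import pandas as pd

def drop_sparse_rows(df, max_missing=50):
    """Keep only the rows with at most max_missing missing values."""
    return df[df.isna().sum(axis=1) <= max_missing]

def fill_with_mode(df):
    """Fill each column's NaNs with that column's most frequent value."""
    modes = df.mode(dropna=True).iloc[0]   # first mode of every column
    return df.fillna(modes)

toy = pd.DataFrame({
    "A": [1.0, 1.0, np.nan, 2.0],
    "B": [3.0, np.nan, np.nan, 3.0],
    "C": [5.0, 6.0, np.nan, 6.0],
})
kept = drop_sparse_rows(toy, max_missing=1)   # removes the all-NaN row
filled = fill_with_mode(kept)
```

Computing the modes after the sparse rows are removed means the fill values are estimated only from rows that are actually kept.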

Before applying the clustering technique (KMeans), I reduced the dimensionality of the dataset with PCA to avoid the curse of dimensionality. I first ran PCA on all 353 features and plotted the cumulative explained variance against the number of components. The aim was to retain at least 90% of the variance, so I decided to keep 150 components. (Figure6)
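A minimal sketch of picking the number of components from the cumulative explained variance, using scikit-learn on synthetic data. The data, the standard scaling step and the variable names are assumptions for illustration; on the real dataset the same logic led to 150 components.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-in: 20 correlated features driven by 5 latent factors
latent = rng.normal(size=(500, 5))
X = latent @ rng.normal(size=(5, 20)) + 0.1 * rng.normal(size=(500, 20))

# Assumption: features are standardised before PCA
X_scaled = StandardScaler().fit_transform(X)

# Fit PCA with all components and accumulate the explained variance
pca = PCA().fit(X_scaled)
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components whose cumulative variance reaches 90%
n_components = int(np.searchsorted(cumulative, 0.90) + 1)
X_reduced = PCA(n_components=n_components).fit_transform(X_scaled)
```

Plotting `cumulative` against the component index gives the kind of curve described above.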


Next, we look at the feature weights that PCA assigns to some of the components to understand what each component is composed of.

Component 0:

  • Has a high positive weight on the moving patterns of people.
  • Has a high weight on the number of 1–2 family houses in the neighbourhood and a negative weight on the number of 6–10 family houses.
  • KBA13_*: these features have no description in the attributes data, but some similar features that do have a description correspond to shares of cars with some specification.

Component 1:

  • Has high positive weights on features describing online activity and transactions in the last 12 and 24 months.
  • Has a negative weight on features describing when the last transaction was made.

Component 3:

  • This component corresponds to people who are always financially prepared.
  • This component has a negative weight on people who save or invest money.
  • The age determined through prename analysis also has a big impact on this component.
  • The movement a person witnessed or participated in during their youth has a negative weight on this component.
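Interpretations like the ones above come from sorting a component's weights by feature. A minimal sketch, assuming a fitted scikit-learn `PCA` object; the helper `top_weights`, the feature names and the data are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

def top_weights(pca, feature_names, component, n=3):
    """Return the n most positive and n most negative weights of one component."""
    weights = pd.Series(pca.components_[component], index=feature_names)
    weights = weights.sort_values(ascending=False)
    return weights.head(n), weights.tail(n)

rng = np.random.default_rng(1)
feature_names = [f"feature_{i}" for i in range(6)]
X = rng.normal(size=(200, 6))
# Make two features move together so they dominate the first component
X[:, 1] = X[:, 0] + 0.1 * rng.normal(size=200)

pca = PCA(n_components=3).fit(X)
most_positive, most_negative = top_weights(pca, feature_names, component=0, n=2)
```

The sign of a component is arbitrary, so what matters is which features sit at opposite ends of the sorted weights, not whether a particular end is positive or negative.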

With the dimensionality reduced, we can move on to clustering. To decide on the number of clusters, we use the elbow method. The idea behind the elbow method is to choose the number of clusters such that adding one more cluster no longer meaningfully improves the intra-cluster variation. In other words, adding a cluster no longer substantially reduces the sum of squared distances between the points and their cluster centres.

From the elbow plot, we can see that the sum of squared errors decreases steeply until around 7 clusters, after which the slope flattens. (Figure7)
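The elbow curve is built by fitting KMeans for a range of cluster counts and recording the within-cluster sum of squared distances, which scikit-learn exposes as `inertia_`. A minimal sketch on synthetic blobs; the data and the range of k are assumptions, and in the project the input would be the PCA-reduced data.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for the PCA-reduced data
X, _ = make_blobs(n_samples=600, centers=4, cluster_std=1.0, random_state=42)

ks = list(range(1, 9))
inertias = []
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    # inertia_ is the sum of squared distances of points to their closest centre
    inertias.append(km.inertia_)

# Drop in inertia gained by each additional cluster; the elbow is where
# these gains become small compared with the earlier ones.
gains = [inertias[i] - inertias[i + 1] for i in range(len(inertias) - 1)]
```

Plotting `inertias` against `ks` gives the elbow plot; the choice of the elbow remains a visual judgment.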


What percentage of each population falls into each cluster?

  • The distribution of the general population is close to uniform (although not perfectly uniform).
  • The customers are mostly from clusters 0, 1 and 5. (Figure8)

What is the ratio of the proportion of customers to the proportion of the general population in each cluster?

A ratio greater than 1 indicates that customers are over-represented in that cluster relative to the general population.

  • This also gives an idea of which clusters can be targeted for future customers.
  • In this context, clusters 0, 1 and 5 could be targeted to win new customers.
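The ratio can be computed directly from the cluster labels assigned to each population. A minimal sketch with pandas; the function name and the toy labels are hypothetical.

```python
import pandas as pd

def cluster_ratio(general_labels, customer_labels):
    """Share of each population per cluster and the customers/general ratio."""
    general_share = pd.Series(general_labels).value_counts(normalize=True)
    customer_share = pd.Series(customer_labels).value_counts(normalize=True)
    table = pd.DataFrame({"general": general_share,
                          "customers": customer_share}).fillna(0.0)
    # A ratio above 1 means customers are over-represented in the cluster
    table["ratio"] = table["customers"] / table["general"]
    return table.sort_index()

general_labels = [0, 0, 1, 1, 2, 2, 3, 3]    # 25% of the population per cluster
customer_labels = [0, 0, 0, 1, 1, 1, 2, 3]   # customers concentrated in 0 and 1
ratios = cluster_ratio(general_labels, customer_labels)
target_clusters = ratios.index[ratios["ratio"] > 1].tolist()
```

In this toy example clusters 0 and 1 have a ratio of 1.5 and would be the ones to target.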

Supervised Learning Model

With customer segmentation complete, we move on to the last part of the project: analyzing the MAILOUT_TRAIN and MAILOUT_TEST datasets and predicting whether or not a person will become a customer of the company following the campaign.

In the MAILOUT_TRAIN dataset, only 532 of roughly 43,000 individuals (about 1.2%) responded to the mail-out campaign, which means the training data is highly imbalanced. Accuracy, precision and recall are therefore not appropriate evaluation metrics, and we use ROC AUC instead. (Figure10)
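To see why accuracy is misleading at this level of imbalance, consider a toy comparison using the class counts quoted above (532 responders among 42,982 rows). The scores are synthetic and only illustrate the behaviour of the metrics.

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

rng = np.random.default_rng(0)
n, n_pos = 42982, 532
y_true = np.zeros(n, dtype=int)
y_true[:n_pos] = 1

# A "model" that predicts the majority class for everyone
always_negative = np.zeros(n, dtype=int)
acc = accuracy_score(y_true, always_negative)   # close to 0.99, yet useless

# ROC AUC works on scores; constant scores cannot rank anyone and give 0.5
auc_constant = roc_auc_score(y_true, np.zeros(n))

# Synthetic scores that are somewhat higher for responders
scores = rng.normal(size=n) + 1.0 * y_true
auc_informative = roc_auc_score(y_true, scores)
```

The constant predictor reaches almost 99% accuracy without finding a single responder, while ROC AUC separates it clearly from a model that ranks responders higher.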


Steps followed:

  1. Clean the data using the same preprocessing pipeline built for the AZDIAS and CUSTOMERS datasets.
  2. Split the data into training and validation sets using a stratified split (to deal with the data imbalance).
  3. Evaluate and select the best-performing algorithm: we tried 3 algorithms (AdaBoostRegressor, GradientBoostingRegressor and XGB Classifier) and used the ROC AUC metric to decide which one to use. In this case we picked XGB Classifier.
  4. Fine-tune the algorithm: various hyperparameters are validated and the best-performing values are selected.
  5. Train the model and run inference on the validation set.
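The split-and-compare steps can be sketched with scikit-learn alone. The data is synthetic, and only the two scikit-learn estimators named above are included so that the sketch has no extra dependency; XGB Classifier could be added to the `candidates` dict in the same way, scored with its predicted probabilities.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostRegressor, GradientBoostingRegressor
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced stand-in for the cleaned MAILOUT_TRAIN data
X, y = make_classification(n_samples=3000, n_features=20, n_informative=5,
                           weights=[0.95, 0.05], random_state=42)

# stratify=y keeps the share of responders equal in both splits
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

candidates = {
    "AdaBoostRegressor": AdaBoostRegressor(random_state=42),
    "GradientBoostingRegressor": GradientBoostingRegressor(random_state=42),
}

val_auc = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    # Regressor outputs are used as ranking scores for ROC AUC
    val_auc[name] = roc_auc_score(y_val, model.predict(X_val))

best_name = max(val_auc, key=val_auc.get)
```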


As shown below, the final model has the highest AUC test score (0.7855), so we select the XGB Classifier model with the following final parameters: (Figure11)

  • learning_rate=0.05
  • n_estimators=76
  • max_depth=9
  • min_child_weight=6
  • gamma=0.1
  • subsample=0.8
  • colsample_bytree=0.9
  • reg_alpha=0.05
  • objective=‘binary:logistic’
  • nthread=4
  • scale_pos_weight=1
  • seed=42
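For reference, the parameters above could be passed to the scikit-learn style wrapper of XGBoost roughly as follows. This is a sketch, not the project's exact code: `n_jobs` and `random_state` are used here as the wrapper's names for `nthread` and `seed`, and the import is guarded in case xgboost is not installed.

```python
# Final parameters as listed above
final_params = {
    "learning_rate": 0.05,
    "n_estimators": 76,
    "max_depth": 9,
    "min_child_weight": 6,
    "gamma": 0.1,
    "subsample": 0.8,
    "colsample_bytree": 0.9,
    "reg_alpha": 0.05,
    "objective": "binary:logistic",
    "n_jobs": 4,             # listed above as nthread
    "scale_pos_weight": 1,
    "random_state": 42,      # listed above as seed
}

try:
    from xgboost import XGBClassifier
    model = XGBClassifier(**final_params)
except ImportError:
    model = None   # xgboost missing; the dict still documents the setup
```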

Feature importance: “D19_SOZIALES” is the most important feature. Although no description is available for D19_SOZIALES specifically, all D19 features represent transactions in a certain product group. People involved in this kind of transaction (D19_SOZIALES) appear most likely to respond to the marketing campaign and become customers of the mail-order company. (Figure12)



To conclude, hyperparameter tuning improved the XGBoost model’s score from 0.774917 to 0.7855.

This is a decent improvement, but a more significant jump might be obtained with other methods such as feature engineering, model ensembles, stacking, etc.

Advantages of XGBoost over other algorithms:

  • Regularization: XGBoost is also known as a ‘regularized boosting’ technique.
  • Parallel Processing: XGBoost implements parallel processing and is considerably faster than GBM. XGBoost also supports running on Hadoop.
  • High Flexibility: XGBoost allows users to define custom optimization objectives and evaluation criteria.
  • Handling Missing Values: XGBoost has a built-in routine to handle missing values.
  • Tree Pruning: XGBoost makes splits up to the specified max_depth and then prunes the tree backwards, removing splits beyond which there is no positive gain.
  • Built-in Cross-Validation: XGBoost allows the user to run cross-validation at each iteration of the boosting process, which makes it easy to find the optimum number of boosting iterations in a single run.

Kaggle Submission

With the fine-tuned model trained, we run the MAILOUT_TEST dataset through the preprocessing pipeline, make predictions, and export and submit the results to Kaggle.
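The export step can be sketched as below. Everything here is a stand-in: the logistic regression replaces the tuned model, the data is random, and the column names `LNR` (person id) and `RESPONSE` (predicted probability) are an assumption about the submission format rather than something stated in this post.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Stand-in model trained on random data (the project uses the tuned XGB model)
X_train = rng.normal(size=(200, 3))
y_train = (X_train[:, 0] + 0.5 * rng.normal(size=200) > 1.0).astype(int)
model = LogisticRegression().fit(X_train, y_train)

# Stand-in for the preprocessed MAILOUT_TEST rows and their ids
test_ids = np.arange(1000, 1010)
X_test = rng.normal(size=(10, 3))

# Submit probabilities, not hard labels, since the score is rank-based
submission = pd.DataFrame({
    "LNR": test_ids,                                  # assumed id column name
    "RESPONSE": model.predict_proba(X_test)[:, 1],    # probability of class 1
})
csv_text = submission.to_csv(index=False)   # pass a file path to write to disk
```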

Our model reached a score of more than 80% on the test set. Screenshot from the leaderboard at the time of submission:


Using the real-life demographic data provided by Arvato Financial Services, we were able to segment customers and identify key features that help identify potential customers for a company. Throughout the project, the aim was to make campaign targeting more precise by using machine learning algorithms in the decision process.

Next Steps

As an extension of this project, several approaches could be explored, such as:

  1. Revisit the preprocessing step, for example how features are treated in terms of variable type and missing values.
  2. Try methods that may yield a significant jump in performance, such as feature engineering, model ensembles and stacking.
  3. Try and test other learning techniques (I am curious about PU learning to cluster data).

Github Link

Kaggle Leaderboard

Thanks for reading….


