
Project: Identify Customer Segments (Part 2)

  • tcanengin
  • Jan 20, 2025
  • 16 min read

Updated: Jan 21, 2025


Step 1.2.3: Complete Feature Selection

To finish this step, make sure that your data frame now only has the columns that you want to keep. To summarize, the dataframe should consist of the following:

  • All numeric, interval, and ordinal type columns from the original dataset.

  • Binary categorical features (all numerically-encoded).

  • Engineered features from other multi-level categorical features and mixed features.

Make sure that for any new columns you have engineered, you've excluded the original columns from the final dataset; otherwise, their values will interfere with the analysis later in the project. For example, you should not keep "PRAEGENDE_JUGENDJAHRE", since its raw values won't be useful for the algorithm: only the engineered features derived from it should be retained. As a reminder, your data should only be from the subset with few or no missing values.

In [49]:

# If there are other re-engineering tasks you need to perform, make sure you
# take care of them here. (Dealing with missing data will come in step 2.1.)

mixedones = feat_info.loc[feat_info['type'] == 'mixed', 'attribute'].values
print(mixedones)

azdiasNan = azdiasNan.drop(['LP_LEBENSPHASE_FEIN', 'LP_LEBENSPHASE_GROB',
                            'WOHNLAGE', 'PLZ8_BAUMAX'], axis=1)
azdiasNan.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891221 entries, 0 to 891220
Columns: 192 entries, ALTERSKATEGORIE_GROB to CAMEO_INTL_2015_L
dtypes: float64(44), int64(20), uint8(128)
memory usage: 544.0 MB

In [51]:

# Do whatever you need to in order to ensure that the dataframe only contains
# the columns that should be passed to the algorithm functions.
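
# One way to fill in this cell (a minimal sketch, assuming azdiasNan and
# feat_info are the objects built above): cross-check the remaining columns
# against feat_info so that no untouched multi-level categorical or mixed
# column slips through to the modelling steps. Binary features re-encoded in
# place (e.g. OST_WEST_KZ, ANREDE_KZ) are expected to appear here and can be
# ignored; anything else should have been dropped or replaced by engineered
# or dummy columns.
original_cat_mixed = feat_info.loc[feat_info['type'].isin(['categorical', 'mixed']),
                                   'attribute'].values
leftover = [col for col in azdiasNan.columns if col in original_cat_mixed]
print('Columns to double-check before modelling:', leftover)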

Step 1.3: Create a Cleaning Function

Even though you've finished cleaning up the general population demographics data, it's important to look ahead to the future and realize that you'll need to perform the same cleaning steps on the customer demographics data. In this substep, complete the function below to execute the main feature selection, encoding, and re-engineering steps you performed above. Then, when it comes to looking at the customer data in Step 3, you can just run this function on that DataFrame to get the trimmed dataset in a single step.

In [90]:

def clean_data(dataframe):
    """
    Perform feature trimming, re-encoding, and engineering for demographics data.

    INPUT:  Demographics DataFrame
    OUTPUT: Trimmed and cleaned demographics DataFrame, plus the rows that were
            set aside for having too many missing values
    """
    # Convert missing value codes into NaNs.
    for i in dataframe.columns:
        dataframe[i] = dataframe[i].replace(nan_Info.loc[i][0], np.nan)

    # Remove the outlier columns identified earlier.
    dataframe = dataframe.drop(['TITEL_KZ', 'AGER_TYP', 'KK_KUNDENTYP',
                                'KBA05_BAUMAX', 'GEBURTSJAHR', 'ALTER_HH'], axis=1)

    # Split off rows with more than 30 missing values.
    numrows = dataframe.isnull().sum(axis=1)
    numrowsExclude = numrows[numrows > 30]
    dataframe30 = dataframe.loc[numrowsExclude.index]
    dataframe = dataframe[~dataframe.index.isin(numrowsExclude.index)]

    # Re-encode the binary categorical features numerically.
    dataframe['SOHO_KZ'].replace([0.0, 1.0], [1.0, 2.0], inplace=True)        # 0.0 -> 1.0, 1.0 -> 2.0
    dataframe['GREEN_AVANTGARDE'].replace([0, 1], [1.0, 2.0], inplace=True)   # 0 -> 1.0, 1 -> 2.0
    dataframe['OST_WEST_KZ'].replace(['W', 'O'], [1.0, 2.0], inplace=True)    # W -> 1.0, O -> 2.0
    dataframe['ANREDE_KZ'].replace([2, 1], [2.0, 1.0], inplace=True)          # 1 -> 1.0, 2 -> 2.0

    # One-hot encode the multi-level categorical features.
    dataframe = pd.get_dummies(dataframe, columns=Multi_Category)

    # Engineer new features from the mixed-type columns, then drop the originals.
    dataframe['PRAEGENDE_JUGENDJAHRE_DM'] = dataframe['PRAEGENDE_JUGENDJAHRE'].apply(dominating_movement)
    dataframe['PRAEGENDE_JUGENDJAHRE_D'] = dataframe['PRAEGENDE_JUGENDJAHRE'].apply(decades)
    dataframe = dataframe.drop(['PRAEGENDE_JUGENDJAHRE'], axis=1)
    dataframe['CAMEO_INTL_2015_W'] = dataframe['CAMEO_INTL_2015'].apply(wealth)
    dataframe['CAMEO_INTL_2015_L'] = dataframe['CAMEO_INTL_2015'].apply(lifestage)
    dataframe = dataframe.drop(['CAMEO_INTL_2015'], axis=1)
    dataframe = dataframe.drop(['LP_LEBENSPHASE_FEIN', 'LP_LEBENSPHASE_GROB',
                                'WOHNLAGE', 'PLZ8_BAUMAX'], axis=1)

    # Return the cleaned dataframe and the high-missing rows.
    return dataframe, dataframe30
    
    

Step 2: Feature Transformation

Step 2.1: Apply Feature Scaling

Before we apply dimensionality reduction techniques to the data, we need to perform feature scaling so that the principal component vectors are not influenced by the natural differences in scale for features. Starting from this part of the project, you'll want to keep an eye on the API reference page for sklearn to help you navigate to all of the classes and functions that you'll need. In this substep, you'll need to check the following:

  • sklearn requires that data not have missing values in order for its estimators to work properly. So, before applying the scaler to your data, make sure that you've cleaned the DataFrame of the remaining missing values. This can be as simple as just removing all data points with missing data, or applying an Imputer to replace all missing values. You might also try a more complicated procedure where you temporarily remove missing values in order to compute the scaling parameters before re-introducing those missing values and applying imputation (a short sketch of this last approach follows this list). Think about how much missing data you have and what possible effects each approach might have on your analysis, and justify your decision in the discussion section below.

  • For the actual scaling function, a StandardScaler instance is suggested, scaling each feature to mean 0 and standard deviation 1.

  • For these classes, you can make use of the .fit_transform() method to both fit a procedure to the data as well as apply the transformation to the data at the same time. Don't forget to keep the fit sklearn objects handy, since you'll be applying them to the customer demographics data towards the end of the project.
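
As a reference for the last option above, here is a minimal sketch of "fit the scaler on complete cases only, then impute and transform everything", assuming azdiasNan is the cleaned general-population DataFrame from Step 1 (scaler_cc and imputer_cc are illustrative names; newer scikit-learn versions use sklearn.impute.SimpleImputer instead of Imputer):

# Fit scaling parameters on rows without any missing values...
from sklearn.preprocessing import Imputer, StandardScaler

complete_rows = azdiasNan.dropna()
scaler_cc = StandardScaler().fit(complete_rows)

# ...then impute the full dataset and apply those scaling parameters to it.
imputer_cc = Imputer(strategy='most_frequent')
azdias_imputed = imputer_cc.fit_transform(azdiasNan)
azdias_scaled_cc = scaler_cc.transform(azdias_imputed)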

In [53]:

# Apply an Imputer to replace the remaining missing values.
# (Imputer lives in sklearn.preprocessing in the scikit-learn version used here;
# newer releases provide sklearn.impute.SimpleImputer instead.)
from sklearn.preprocessing import Imputer

imputer = Imputer(strategy='most_frequent')
azdiasNanImputer = imputer.fit_transform(azdiasNan)

In [54]:

# Apply feature scaling to the general population demographics data.
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
Features_scaled = scaler.fit_transform(azdiasNanImputer)

Discussion 2.1: Apply Feature Scaling

(Double-click this cell and replace this text with your own text, reporting your decisions regarding feature scaling.)

There are many columns that contain missing values. I applied the "most frequent" imputation strategy so as not to lose any variables; I preferred it because it is fast and performs well on categorical data. Although multiple imputation would likely predict the missing values more accurately, the simpler most-frequent strategy was sufficient for this analysis.

Step 2.2: Perform Dimensionality Reduction

On your scaled data, you are now ready to apply dimensionality reduction techniques.

  • Use sklearn's PCA class to apply principal component analysis on the data, thus finding the vectors of maximal variance in the data. To start, you should not set any parameters (so all components are computed) or set a number of components that is at least half the number of features (so there's enough features to see the general trend in variability).

  • Check out the ratio of variance explained by each principal component as well as the cumulative variance explained. Try plotting the cumulative or sequential values using matplotlib's plot() function. Based on what you find, select a value for the number of transformed features you'll retain for the clustering part of the project (a short sketch after this list shows one way to turn the cumulative curve into a component count).

  • Once you've made a choice for the number of components to keep, make sure you re-fit a PCA instance to perform the decided-on transformation.
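
Once the PCA in the next cell has been fit, one way to turn the cumulative explained variance into a concrete component count is sketched below (the 0.85 threshold is only an illustration, not a recommendation):

# Smallest number of components that reaches a chosen share of total variance.
import numpy as np

cumvar = np.cumsum(pca.explained_variance_ratio_)
n_keep = int(np.argmax(cumvar >= 0.85)) + 1
print('Components needed for 85% of the variance:', n_keep)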

In [55]:

# Apply PCA to the data.
from sklearn.decomposition import PCA

pca = PCA()
missing_pca = pca.fit_transform(Features_scaled)

In [56]:

# Investigate the variance accounted for by each principal component.

def scree_plot(pca):

    num_components=len(pca.explained_variance_ratio_)
    ind = np.arange(num_components)
    value = pca.explained_variance_ratio_
 
    plt.figure(figsize=(20, 15))
    ax = plt.subplot(111)
    cumvalue = np.cumsum(value)
    ax.bar(ind, value)
    ax.plot(ind, cumvalue)
    for i in range(num_components):
        ax.annotate(r"%s%%" % ((str(value[i]*100)[:4])), (ind[i]+0.2, value[i]), va="bottom", ha="center", fontsize=12)
 
    ax.xaxis.set_tick_params(width=0)
    ax.yaxis.set_tick_params(width=2, length=12)
 
    ax.set_xlabel("Principal Component")
    ax.set_ylabel("Variance Explained (%)")
    plt.title('Explained Variance Per Principal Component')

scree_plot(pca)

# Re-apply PCA to the data while selecting for number of components to retain.
pca = PCA(50)
Features_Pca = pca.fit_transform(Features_scaled)
scree_plot(pca)



print(Features_scaled.shape)
print(Features_Pca.shape)

(891221, 192)
(891221, 50)

Discussion 2.2: Perform Dimensionality Reduction

(Double-click this cell and replace this text with your own text, reporting your findings and decisions regarding dimensionality reduction. How many principal components / transformed features are you retaining for the next step of the analysis?)

I first fit PCA() without specifying the number of components, so all components were computed. Based on the explained-variance plot, I decided to retain 50 components for the next step of the analysis.

Step 2.3: Interpret Principal Components

Now that we have our transformed principal components, it's a nice idea to check out the weight of each variable on the first few components to see if they can be interpreted in some fashion.

As a reminder, each principal component is a unit vector that points in the direction of highest variance (after accounting for the variance captured by earlier principal components). The further a weight is from zero, the more the principal component is in the direction of the corresponding feature. If two features have large weights of the same sign (both positive or both negative), then increases in one can be expected to be associated with increases in the other. In contrast, features with different signs can be expected to show a negative correlation: increases in one variable should result in a decrease in the other.

  • To investigate the features, you should map each weight to their corresponding feature name, then sort the features according to weight. The most interesting features for each principal component, then, will be those at the beginning and end of the sorted list. Use the data dictionary document to help you understand these most prominent features, their relationships, and what a positive or negative value on the principal component might indicate.

  • You should investigate and interpret feature associations from the first three principal components in this substep. To help facilitate this, you should write a function that you can call at any time to print the sorted list of feature weights, for the i-th principal component. This might come in handy in the next step of the project, when you interpret the tendencies of the discovered clusters.
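
A minimal sketch of such a reusable helper (pca_weights is an illustrative name; it assumes pca has been fit as above and that azdiasNan provides the feature names):

import pandas as pd

def pca_weights(pca, features, i):
    '''Return the weights of the i-th principal component (1-based), sorted descending.'''
    weights = pd.Series(pca.components_[i - 1], index=features.columns)
    return weights.sort_values(ascending=False)

# Example: the five strongest positive and negative weights of the first component.
w = pca_weights(pca, azdiasNan, 1)
print(w.head(5))
print(w.tail(5))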

In [59]:

# Map weights for the first principal component to corresponding feature names
# and then print the linked values, sorted by weight.
# HINT: Try defining a function here or in a new cell that you can reuse in the
# other cells.


def plot_comp(data, pca, PComponent):
    '''Plot the features with the largest absolute weights for the given PCA component.'''
    comp = pd.DataFrame(np.round(pca.components_, 4), columns = data.keys()).iloc[PComponent-1]
    comp.sort_values(ascending=False, inplace=True)
    comp = pd.concat([comp.head(5), comp.tail(5)])
    
    comp.plot(kind='bar', title='Component ' + str(PComponent))
    ax = plt.gca()
    ax.grid(linewidth='1.5', alpha=1)
    ax.set_axisbelow(True)
    plt.show()

In [60]:

# Map weights for the first, second, and third principal components to their
# corresponding feature names, sorted by weight.

plot_comp(azdiasNan, pca, 1)   # first principal component
plot_comp(azdiasNan, pca, 2)   # second principal component
plot_comp(azdiasNan, pca, 3)   # third principal component

Discussion 2.3: Interpret Principal Components

(Double-click this cell and replace this text with your own text, reporting your observations from detailed investigation of the first few principal components generated. Can we interpret positive and negative values from them in a meaningful way?)

Comp-1

The first component deals with the financial, social, and living situation. CAMEO_INTL_2015_W, PLZ8_ANTG3, LP_STATUS_GROB_1.0, EWDICHTE, and ORTSGR_KLS9 have large positive weights in this component: 6-10 family houses and low income as a social status are associated with it. On the negative side are FINANZ_MINIMALIST, MOBI_REGIO, KBA05_ANTG1, PLZ8_ANTG1, and KBA05_GBZ, so low financial interest and movement patterns are negatively associated with the component.

Comp-2

The second component deals with age, lifestyle (financial / movement), and energy consumption. ALTERSKATEGORIE_GROB (age), FINANZ_VORSORGER (be prepared), SEMIO_ERL (event-oriented), RETOURTYP_BK_S, and ZABEOTYP_3 (energy consumption, fair supplied) have large positive weights in this component. On the negative side are SEMIO_REL (religious), FINANZ_ANLEGER (investor), FINANZ_UNAUFFAELLIGER (inconspicuous), FINANZ_SPARER (money-saver), and PRAEGENDE_JUGENDJAHRE_D (dominating movement of person's youth, max 90s).

Comp-3

The third component deals with personal traits. SEMIO_VERT (dreamful), SEMIO_KULT (cultural-minded), SEMIO_SOZ (socially-minded), SEMIO_FAM (family-minded), and SHOPPER_TYP_0.0 (external supplied hedonists) have large positive weights in this component. On the negative side are ZABEOTYP_3 (fair supplied), ANREDE_KZ (gender), SEMIO_DOM (dominant-minded), SEMIO_KAEM (combative attitude), and SEMIO_KRIT (critical-minded).

Step 3: Clustering

Step 3.1: Apply Clustering to General Population

You've assessed and cleaned the demographics data, then scaled and transformed them. Now, it's time to see how the data clusters in the principal components space. In this substep, you will apply k-means clustering to the dataset and use the average within-cluster distances from each point to their assigned cluster's centroid to decide on a number of clusters to keep.

  • Use sklearn's KMeans class to perform k-means clustering on the PCA-transformed data.

  • Then, compute the average difference from each point to its assigned cluster's center. Hint: The KMeans object's .score() method might be useful here, but note that in sklearn, scores tend to be defined so that larger is better. Try applying it to a small, toy dataset, or use an internet search to help your understanding (a toy example follows this list).

  • Perform the above two steps for a number of different cluster counts. You can then see how the average distance decreases with an increasing number of clusters. However, each additional cluster provides a smaller net benefit. Use this fact to select a final number of clusters in which to group the data. Warning: because of the large size of the dataset, it can take a long time for the algorithm to resolve. The more clusters to fit, the longer the algorithm will take. You should test for cluster counts through at least 10 clusters to get the full picture, but you shouldn't need to test for a number of clusters above about 30.

  • Once you've selected a final number of clusters to use, re-fit a KMeans instance to perform the clustering operation. Make sure that you also obtain the cluster assignments for the general demographics data, since you'll be using them in the final Step 3.3.
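
A tiny toy example of the .score() hint above (the array values are made up purely for illustration):

# KMeans.score() returns the negative of the within-cluster sum of squared
# distances, so values closer to zero are better; take the absolute value to
# treat it as an SSE to be minimized.
import numpy as np
from sklearn.cluster import KMeans

toy = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.1, 4.9]])
toy_model = KMeans(n_clusters=2, random_state=0).fit(toy)
print(toy_model.score(toy))          # small negative number
print(np.abs(toy_model.score(toy)))  # the same quantity as a positive SSE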

In [159]:

# Over a number of different cluster counts...
# run k-means clustering on the data and...
# compute the average within-cluster distances.

import time
from sklearn.cluster import KMeans

startTime = time.time()

def kmeans_score(data, center):
    '''Fit k-means with `center` clusters and return the within-cluster SSE.'''
    kmeans = KMeans(n_clusters=center)
    model = kmeans.fit(data)
    score = np.abs(model.score(data))   # .score() is negative SSE, so flip the sign

    return score


scored = []
centers = list(range(1,30,2))

for center in centers:
    scored.append(kmeans_score(Features_Pca, center))
    
# Investigate the change in within-cluster distance across number of clusters.
# HINT: Use matplotlib's plot function to visualize this relationship.

plt.plot(centers, scored, linestyle='--', marker='o', color='b');
plt.xlabel('K');
plt.ylabel('SSE');
plt.title('SSE vs. K');

print("Run time: %s min" % np.round(((time.time() - startTime)/60),2))
    
Run time: 66.15 min

# Re-fit the k-means model with the selected number of clusters and obtain
# cluster predictions for the general population demographics data.
import time

startTime = time.time()
kmeans = KMeans(n_clusters=8)
model_general = kmeans.fit(Features_Pca)
Predcluster = model_general.predict(Features_Pca)
print("Run time: %s min" % np.round(((time.time() - startTime)/60), 2))

Run time: 2.43 min

Discussion 3.1: Apply Clustering to General Population

(Double-click this cell and replace this text with your own text, reporting your findings and decisions regarding clustering. Into how many clusters have you decided to segment the population?)

I ran k-means clustering for cluster counts from 1 to 29 (in steps of 2). Based on where the SSE curve begins to flatten, I decided to segment the population into 8 clusters (n_clusters=8).

Step 3.2: Apply All Steps to the Customer Data

Now that you have clusters and cluster centers for the general population, it's time to see how the customer data maps on to those clusters. Take care to not confuse this for re-fitting all of the models to the customer data. Instead, you're going to use the fits from the general population to clean, transform, and cluster the customer data. In the last step of the project, you will interpret how the general population fits apply to the customer data.

  • Don't forget when loading in the customers data, that it is semicolon (;) delimited.

  • Apply the same feature wrangling, selection, and engineering steps to the customer demographics using the clean_data() function you created earlier. (You can assume that the customer demographics data has similar meaning behind missing data patterns as the general demographics data.)

  • Use the sklearn objects from the general demographics data, and apply their transformations to the customers data. That is, you should not be using a .fit() or .fit_transform() method to re-fit the old objects, nor should you be creating new sklearn objects! Carry the data through the feature scaling, PCA, and clustering steps, obtaining cluster assignments for all of the data in the customer demographics data.

In [108]:

# Load in the customer demographics data.
customers = pd.read_csv("Udacity_CUSTOMERS_Subset.csv", sep=";")



customers.shape
#customers.info()

Out[108]:

(191652, 85)

In [111]:

# Apply preprocessing, feature transformation, and clustering from the general
# demographics onto the customer data, obtaining cluster predictions for the
# customer demographics data.

features_customers, customers_many_missing = clean_data(customers)

features_customers.shape
features_customers.info()
#customers_many_missing.info()

# Which dummy columns exist for the general population but not for the customers?
print(list(set(azdiasNan.columns) - set(features_customers.columns)))

print(customers.shape[0]) # sanity check
customers.tail(3)

# GEBAEUDETYP_5.0 is missing because no customer row has GEBAEUDETYP == 5.0.
# Append a copy of the last row; it will be set to GEBAEUDETYP = 5.0 below and
# dropped again after re-encoding, so that both datasets share the same columns.
customers_extended = customers.copy()
customers_extended = pd.concat([customers_extended, customers_extended.iloc[-1:]], ignore_index=True)
<class 'pandas.core.frame.DataFrame'>
Int64Index: 141725 entries, 0 to 191651
Columns: 191 entries, ALTERSKATEGORIE_GROB to CAMEO_INTL_2015_L
dtypes: float64(44), int64(20), uint8(127)
memory usage: 87.4 MB
['GEBAEUDETYP_5.0']
191652

In [112]:

print(customers_extended.shape[0]) # sanity check
customers_extended.tail(3)
191653

Out[112]:


        AGER_TYP  ALTERSKATEGORIE_GROB  ANREDE_KZ  CJT_GESAMTTYP  FINANZ_MINIMALIST  FINANZ_SPARER  FINANZ_VORSORGER  FINANZ_ANLEGER  FINANZ_UNAUFFAELLIGER  FINANZ_HAUSBAUER  ...
191650       3.0                   3.0          2            4.0                  2              1                 5               1                      2                 5  ...
191651       3.0                   2.0          1            2.0                  5              1                 5               1                      1                 2  ...
191652       3.0                   2.0          1            2.0                  5              1                 5               1                      1                 2  ...

        ...  PLZ8_ANTG1  PLZ8_ANTG2  PLZ8_ANTG3  PLZ8_ANTG4  PLZ8_BAUMAX  PLZ8_HHZ  PLZ8_GBZ  ARBEIT  ORTSGR_KLS9  RELAT_AB
191650  ...         3.0         2.0         1.0         1.0          1.0       2.0       3.0     3.0          4.0       4.0
191651  ...         3.0         2.0         0.0         0.0          1.0       4.0       5.0     1.0          3.0       1.0
191652  ...         3.0         2.0         0.0         0.0          1.0       4.0       5.0     1.0          3.0       1.0

3 rows × 85 columns

In [113]:

# Set the appended helper row to GEBAEUDETYP = 5.0 so that clean_data() creates
# the GEBAEUDETYP_5.0 dummy column, then drop that helper row again.
customers_extended.loc[191652, 'GEBAEUDETYP'] = 5.0

features_customers, customers_many_missing = clean_data(customers_extended)

features_customers.drop([191652], inplace=True)

features_customers.tail(3)

Out[113]:


        ALTERSKATEGORIE_GROB  ANREDE_KZ  FINANZ_MINIMALIST  FINANZ_SPARER  FINANZ_VORSORGER  FINANZ_ANLEGER  FINANZ_UNAUFFAELLIGER  FINANZ_HAUSBAUER  GREEN_AVANTGARDE  HEALTH_TYP  ...
191649                   4.0        1.0                  5              1                 5               1                      1                 2               2.0         2.0  ...
191650                   3.0        2.0                  2              1                 5               1                      2                 5               1.0         2.0  ...
191651                   2.0        1.0                  5              1                 5               1                      1                 2               1.0         2.0  ...

        ...  CAMEO_DEU_2015_8D  CAMEO_DEU_2015_9A  CAMEO_DEU_2015_9B  CAMEO_DEU_2015_9C  CAMEO_DEU_2015_9D  CAMEO_DEU_2015_9E  PRAEGENDE_JUGENDJAHRE_DM  PRAEGENDE_JUGENDJAHRE_D  CAMEO_INTL_2015_W  CAMEO_INTL_2015_L
191649  ...                  0                  0                  0                  0                  0                  0                       2.0                      2.0                2.0                4.0
191650  ...                  0                  0                  0                  0                  0                  0                       1.0                      4.0                2.0                4.0
191651  ...                  0                  0                  0                  0                  0                  0                       1.0                      2.0                3.0                3.0

3 rows × 192 columns

In [119]:

features_customers.info()

# Use the imputer, scaler, PCA, and k-means objects that were fit on the general
# population; only transform / predict here, without re-fitting on the customers.
customers_clean_imputed = pd.DataFrame(imputer.transform(features_customers))

customers_clean_imputed.info()

# Apply scaler
CustomerScaled = scaler.transform(customers_clean_imputed)
CustomerScaled = pd.DataFrame(CustomerScaled, columns=list(customers_clean_imputed))

# PCA transformation
CustomerPca = pca.transform(CustomerScaled)

# Predict cluster assignments with the k-means model fit on the general population
CustomerskMeans = model_general.predict(CustomerPca)


print("Run time: %s min" % np.round(((time.time() - startTime)/60/60),2))
<class 'pandas.core.frame.DataFrame'>
Int64Index: 141725 entries, 0 to 191651
Columns: 192 entries, ALTERSKATEGORIE_GROB to CAMEO_INTL_2015_L
dtypes: float64(44), int64(20), uint8(128)
memory usage: 87.6 MB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 141725 entries, 0 to 141724
Columns: 192 entries, 0 to 191
dtypes: float64(192)
memory usage: 207.6 MB
Run time: 2.45 min

Step 3.3: Compare Customer Data to Demographics Data

At this point, you have clustered data based on demographics of the general population of Germany, and seen how the customer data for a mail-order sales company maps onto those demographic clusters. In this final substep, you will compare the two cluster distributions to see where the strongest customer base for the company is.

Consider the proportion of persons in each cluster for the general population, and the proportions for the customers. If we think the company's customer base to be universal, then the cluster assignment proportions should be fairly similar between the two. If there are only particular segments of the population that are interested in the company's products, then we should see a mismatch from one to the other. If there is a higher proportion of persons in a cluster for the customer data compared to the general population (e.g. 5% of persons are assigned to a cluster for the general population, but 15% of the customer data is closest to that cluster's centroid) then that suggests the people in that cluster to be a target audience for the company. On the other hand, the proportion of the data in a cluster being larger in the general population than the customer data (e.g. only 2% of customers closest to a population centroid that captures 6% of the data) suggests that group of persons to be outside of the target demographics.

Take a look at the following points in this step:

  • Compute the proportion of data points in each cluster for the general population and the customer data. Visualizations will be useful here: both for the individual dataset proportions, but also to visualize the ratios in cluster representation between groups. Seaborn's countplot() or barplot() function could be handy (a proportion-comparison sketch follows this list).

    • Recall the analysis you performed in step 1.1.3 of the project, where you separated out certain data points from the dataset if they had more than a specified threshold of missing values. If you found that this group was qualitatively different from the main bulk of the data, you should treat this as an additional data cluster in this analysis. Make sure that you account for the number of data points in this subset, for both the general population and customer datasets, when making your computations!

  • Which cluster or clusters are overrepresented in the customer dataset compared to the general population? Select at least one such cluster and infer what kind of people might be represented by that cluster. Use the principal component interpretations from step 2.3 or look at additional components to help you make this inference. Alternatively, you can use the .inverse_transform() method of the PCA and StandardScaler objects to transform centroids back to the original data space and interpret the retrieved values directly.

  • Perform a similar investigation for the underrepresented clusters. Which cluster or clusters are underrepresented in the customer dataset compared to the general population, and what kinds of people are typified by these clusters?
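
A minimal sketch of such a proportion comparison that also treats the high-missing rows as an extra cluster (labelled -1). It assumes Predcluster and CustomerskMeans are the cluster assignments computed above, customers_many_missing is the customer-side high-missing subset returned by clean_data(), and azdias_many_missing is a hypothetical name for the general-population rows set aside in step 1.1.3:

import numpy as np
import pandas as pd

def cluster_shares(assignments, n_missing):
    # Append n_missing pseudo-assignments with label -1 for the high-missing rows,
    # then return the share of points in each cluster.
    labels = np.concatenate([assignments, np.full(n_missing, -1)])
    return pd.Series(labels).value_counts(normalize=True).sort_index()

general_share = cluster_shares(Predcluster, len(azdias_many_missing))          # azdias_many_missing: assumed name
customer_share = cluster_shares(CustomerskMeans, len(customers_many_missing))

comparison = pd.DataFrame({'general': general_share, 'customers': customer_share})
comparison['ratio'] = comparison['customers'] / comparison['general']
print(comparison.round(3))   # ratio > 1 -> overrepresented among customers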

In [123]:

# Compare the proportion of data in each cluster for the customer data to the
# proportion of data in each cluster for the general population.

figure, axs1 = plt.subplots(nrows=1, ncols=2, figsize = (16,6))
figure.subplots_adjust(hspace = 1, wspace=0.3)

sns.countplot(CustomerskMeans, ax=axs1[0], palette="inferno")
axs1[0].set_title('Customer Cluster')
sns.countplot(Predcluster, ax=axs1[1], palette="inferno")
axs1[1].set_title('Population Cluster')

Out[123]:

Text(0.5,1,'Population Cluster')

# What kinds of people are part of a cluster that is overrepresented in the
# customer data compared to the general population?

pd.set_option('display.max_rows', 192)
centroid6 = scaler.inverse_transform(pca.inverse_transform(kmeans.cluster_centers_[6])).round(1)
PopularCluster = pd.Series(data=centroid6, index=features_customers.columns)
PopularCluster

Discussion 3.3: Compare Customer Data to Demographics Data

(Double-click this cell and replace this text with your own text, reporting findings and conclusions from the clustering analysis. Can we describe segments of the population that are relatively popular with the mail-order company, or relatively unpopular with the company?)

Segments of the population that are relatively popular with the mail-order company (centroid 6):

When I compare the segments of the population, independent workers who belong to the middle financial sector and live in multi-generational family households are more popular with the company than young, mobile people and urban parents.

For example:

Popular -> Estimated age based on given name (46-60 years old), financial typology (average), health typology (sanitary affine), personality typology (average affinity)

Unpopular -> Estimated age based on given name (30-45 years old), financial typology (low), health typology (jaunty hedonists), personality typology (very high affinity)
