Data Preparation
Before feeding data to the training process, we need to preprocess it to make it “a good teacher”. The work includes:
- collecting data
- transforming the data into the required form
- resolving data quality challenges
Suppose we’re going to predict the median house value, and census data is provided for analysis.
Quick Overview
We can use pandas for a quick overview.
import os
import pandas as pd

HOUSING_PATH = os.path.join("datasets", "housing")

def load_housing_data(housing_path=HOUSING_PATH):
    csv_path = os.path.join(housing_path, "housing.csv")
    return pd.read_csv(csv_path)

housing = load_housing_data()
In [2]: housing.head()
Out[2]:
longitude latitude housing_median_age total_rooms ... households median_income median_house_value ocean_proximity
0 -122.23 37.88 41.0 880.0 ... 126.0 8.3252 452600.0 NEAR BAY
1 -122.22 37.86 21.0 7099.0 ... 1138.0 8.3014 358500.0 NEAR BAY
2 -122.24 37.85 52.0 1467.0 ... 177.0 7.2574 352100.0 NEAR BAY
3 -122.25 37.85 52.0 1274.0 ... 219.0 5.6431 341300.0 NEAR BAY
4 -122.25 37.85 52.0 1627.0 ... 259.0 3.8462 342200.0 NEAR BAY
[5 rows x 10 columns]
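Besides head(), the describe() method summarizes the numerical attributes:
# count, mean, std, min/max and quartiles per numerical column
housing.describe()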
Split data
We need to exclude the test data from the data set before further exploration. Otherwise, our brain will detect patterns in all the data (including the test data) and lead us to select a particular kind of ML model; in other words, this is highly prone to overfitting. The page split data set covers the related topics.
Suppose we split it and get the training data set and test data set: strat_train_set and strat_test_set.
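The page above covers the details; as a minimal sketch, one way to obtain such a stratified split is scikit-learn’s train_test_split stratified on a derived income category. The income_cat buckets below are illustrative assumptions, not that page’s exact recipe:
from sklearn.model_selection import train_test_split

# illustrative income buckets for stratification (assumed, not canonical)
data = load_housing_data()
data["income_cat"] = pd.cut(data["median_income"],
                            bins=[0.0, 1.5, 3.0, 4.5, 6.0, float("inf")],
                            labels=[1, 2, 3, 4, 5])
strat_train_set, strat_test_set = train_test_split(
    data, test_size=0.2, stratify=data["income_cat"], random_state=42)
# drop the helper column once the split is done
for set_ in (strat_train_set, strat_test_set):
    set_.drop("income_cat", axis=1, inplace=True)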
We also need to separate the labels from the training data.
housing = strat_train_set.drop("median_house_value", axis=1) # drop labels for training set
housing_labels = strat_train_set["median_house_value"].copy()
Exploration
We’ll discuss the following topics:
- feature engineering
- feature scaling
- poor data quality
- representative data (discussed in split data set)
- text attributes
All of these steps can be packed into one pipeline and run automatically, as the sketch below shows.
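As a preview, here is a minimal sketch of such a pipeline using scikit-learn’s Pipeline and ColumnTransformer. It combines the imputation and one-hot encoding discussed below plus a standard scaler; the attribute lists assume the column names seen in the overview above.
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# numeric and categorical columns of the housing data
num_attribs = ["longitude", "latitude", "housing_median_age",
               "total_rooms", "total_bedrooms", "population",
               "households", "median_income"]
cat_attribs = ["ocean_proximity"]

# fill missing numbers with the median, then standardize
num_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])

# run the numeric pipeline and the one-hot encoder side by side
full_pipeline = ColumnTransformer([
    ("num", num_pipeline, num_attribs),
    ("cat", OneHotEncoder(), cat_attribs),
])
housing_prepared = full_pipeline.fit_transform(housing)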
correlations
We can check the relationship between the target “median house value” and each feature, and then decide whether to remove or even combine some features. One method is to use the pandas DataFrame.corr() method.
# make a copy to investigate correlations
housing = strat_train_set.copy()
# recent pandas versions need numeric_only=True to skip the text column
corr_matrix = housing.corr(numeric_only=True)
corr_matrix["median_house_value"].sort_values(ascending=False)
Out[16]:
median_house_value 1.000000
median_income 0.687160
total_rooms 0.135097
housing_median_age 0.114110
households 0.064506
total_bedrooms 0.047689
population -0.026920
longitude -0.047432
latitude -0.142724
Name: median_house_value, dtype: float64
The “median_income” feature has a strong correlation with the target, while “total_rooms” and “total_bedrooms” have weak correlations. We can also derive some new features and check their correlations.
housing["rooms_per_household"] = housing["total_rooms"]/housing["households"]
housing["bedrooms_per_room"] = housing["total_bedrooms"]/housing["total_rooms"]
housing["population_per_household"]=housing["population"]/housing["households"]
corr_matrix = housing.corr()
corr_matrix["median_house_value"].sort_values(ascending=False)
Out[17]:
median_house_value 1.000000
median_income 0.687160
rooms_per_household 0.146285
total_rooms 0.135097
housing_median_age 0.114110
households 0.064506
total_bedrooms 0.047689
population_per_household -0.021985
population -0.026920
longitude -0.047432
latitude -0.142724
bedrooms_per_room -0.259984
Name: median_house_value, dtype: float64
The new bedrooms_per_room and rooms_per_household features correlate with the target more strongly than the raw counts they are derived from.
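Note that corr() only captures linear correlations; a scatter matrix of the most promising attributes gives a quick visual sanity check:
from pandas.plotting import scatter_matrix

# visual check for nonlinear patterns the coefficients may miss
attributes = ["median_house_value", "median_income", "rooms_per_household"]
scatter_matrix(housing[attributes], figsize=(12, 8))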
check poor quality data
Poor data leads to a poor model, so we need to handle data quality issues before training. One common issue is missing data.
# make a copy for investigation
housing = strat_train_set.copy()
housing.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 16512 entries, 17606 to 15775
Data columns (total 10 columns):
longitude 16512 non-null float64
latitude 16512 non-null float64
housing_median_age 16512 non-null float64
total_rooms 16512 non-null float64
total_bedrooms 16354 non-null float64
population 16512 non-null float64
households 16512 non-null float64
median_income 16512 non-null float64
median_house_value 16512 non-null float64
ocean_proximity 16512 non-null object
dtypes: float64(9), object(1)
memory usage: 1.4+ MB
There are missing values in the total_bedrooms column. We can remove the affected rows, remove the whole feature, or fill the missing entries with some value (e.g., the median). We’ll use the scikit-learn tool SimpleImputer to fill each missing entry with the median value.
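For reference, the first two options are one-liners in pandas (a sketch, not applied here):
# option 1: drop the rows that miss total_bedrooms
housing.dropna(subset=["total_bedrooms"])
# option 2: drop the whole attribute
housing.drop("total_bedrooms", axis=1)
The imputer approach goes as follows: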
from sklearn.impute import SimpleImputer

# Remove the text attribute because the median can
# only be calculated on numerical attributes:
housing_num = housing.drop("ocean_proximity", axis=1)

imputer = SimpleImputer(strategy="median")
# the learned medians will be saved in imputer.statistics_
imputer.fit(housing_num)

# transform the dataframe into a numpy array X
# whose missing entries are filled with the medians
X = imputer.transform(housing_num)

# wrap the plain array back into a DataFrame
housing_tr = pd.DataFrame(X, columns=housing_num.columns,
                          index=housing.index)
housing_tr.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 16512 entries, 17606 to 15775
Data columns (total 9 columns):
longitude 16512 non-null float64
latitude 16512 non-null float64
housing_median_age 16512 non-null float64
total_rooms 16512 non-null float64
total_bedrooms 16512 non-null float64
population 16512 non-null float64
households 16512 non-null float64
median_income 16512 non-null float64
median_house_value 16512 non-null float64
dtypes: float64(9)
memory usage: 1.3 MB
The “total_bedrooms” column has no missing values now.
handle text and categorical attributes
Most ML algorithms prefer to work with numbers, so we need to convert text attributes (categories) into numerical attributes.
Even if an ML algorithm can handle text attributes, it may detect text patterns by chance, which is not what we expect.
# make a copy for investigation
housing = strat_train_set.copy()
housing["ocean_proximity"].value_counts()
Out[42]:
<1H OCEAN 7276
INLAND 5263
NEAR OCEAN 2124
NEAR BAY 1847
ISLAND 2
Name: ocean_proximity, dtype: int64
There are five categories. If we simply assign the five numbers [0, 1, 2, 3, 4] to them, a new issue arises: ML algorithms will assume two nearby values are more similar than two distant values, which is not the case here.
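For reference, this naive numbering is exactly what scikit-learn’s OrdinalEncoder produces:
from sklearn.preprocessing import OrdinalEncoder

# maps each category to an integer; fine for ordered categories,
# misleading for unordered ones like ours
ordinal_encoder = OrdinalEncoder()
housing_cat_encoded = ordinal_encoder.fit_transform(housing[["ocean_proximity"]])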
To fix the issue, we can add a binary attribute per category. This is called one-hot encoding: one and only one of the attributes is 1 (hot), while the other attributes are 0 (cold).
housing_cat = housing[["ocean_proximity"]]
from sklearn.preprocessing import OneHotEncoder
cat_encoder = OneHotEncoder()
housing_cat_1hot = cat_encoder.fit_transform(housing_cat)
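Note that fit_transform() returns a SciPy sparse matrix by default; convert it to a dense array to inspect the result:
housing_cat_1hot.toarray()   # dense 0/1 matrix, one column per category
cat_encoder.categories_      # the learned category list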
Feature Scaling
The feature scaling discussion will be added in the future.