sklearn.preprocessing.PolynomialFeatures のカラム名を取得する〜Boston Housingを例にして〜

こんにちは、company-rapperです!

前回のポスト

company-rapper.hatenablog.jp

で、sklearn.preprocessing.PolynomialFeatures()の仕様を調査しました。

今回は、その応用編として、PolynomialFeatures()で多項式化させたデータセットを回帰分析して、回帰係数がどのカラム名に対応しているかを出力してみたいと思います。

コード(Jupyter notebook形式)はGithubにアップしています。

github.com

今回は、例としてBoston Housingのデータセットを使用します。これは1970年の国勢調査で得られたボストンの住宅価格に関するデータです。scikit-learnのサンプルデータセットに収録されています。

プログラムの流れとしては

ライブラリインポート
データインポート
今回はBoston Housingを使用
グリッドサーチ
グリッドサーチ内では
StandardScaler()で標準化する
PolynomialFeatures()で多項式カラムを作る
ElasticNet()で回帰させる
を行なっています。
ベストモデルで係数出力(係数が大きい順に出力)
この例では、すなわち、ElasticNetで得られた回帰係数がどの特徴量に対応しているかをランキング形式で出力します。

です

まずはライブラリをインポートします。

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV
from sklearn.datasets import load_boston

次に、データセット *1をインポートし、学習データとテストデータに分けます

boston = load_boston()
# sklearn.utils.BunchからDataFrame形式へ変換
df_prices = pd.DataFrame(boston.target, columns=["MEDV"])
df_features = pd.DataFrame(boston.data, columns=boston.feature_names)

df_data = pd.concat([df_prices,df_features],axis=1)

# 価格が50以上の場合は50になってしまっているようなので、データを削除
# (「Boston Housing Prices - Evaluation & Validation」https://www.kaggle.com/samratp/boston-housing-prices-evaluation-validation)
df_data = df_data[df_data["MEDV"]<50]

df_features_ = df_data.drop(["MEDV"],axis=1)
df_prices = df_data["MEDV"]
X_train, X_test, y_train, y_test = train_test_split(df_features_, df_prices, test_size = 0.2, random_state=5)

次に、グリッドサーチ*2でハイパーパラメータを探索します

alpha_params = np.logspace(-3, 3, 7)
l1_ratio_params = np.arange(0.1, 1.0, 0.1)
 
steps = [
    ('ss', StandardScaler()),
    ('pf', PolynomialFeatures()),
    ('en', ElasticNet())
]
pipeline = Pipeline(steps=steps)
    
paramters = {
    'pf__degree': [1,2,3],
    'en__alpha': alpha_params,
    'en__l1_ratio': l1_ratio_params,
    'en__fit_intercept': [False]
}
 
model_CV = GridSearchCV(
    estimator = pipeline,  
    param_grid = paramters,
    cv = 5,
    n_jobs = -1, 
)
 
model_CV.fit(X_train, y_train)

best_param = model_CV.best_params_
best_estimator = model_CV.best_estimator_
 
pred_train = best_estimator.predict(X_train)
r2score_train = r2_score(y_train, pred_train)
rmse_train = np.sqrt(mean_squared_error(y_train, pred_train))

pred_test = best_estimator.predict(X_test)
r2score_test = r2_score(y_test, pred_test)
rmse_test = np.sqrt(mean_squared_error(y_test, pred_test))

print("結果")
print('RMSE (train) : {:.3f}'.format(rmse_train))
print('RMSE (test) : {:.3f}'.format(rmse_test))
print('R2 score (train): %.3f' % r2score_train)
print('R2 score (test) : %.3f' % r2score_test)
print('clf.best_params_', best_param)

出力は下記のようになります

RMSE (train) : 2.585
RMSE (test) : 2.798
R2 score (train): 0.889
R2 score (test) : 0.885
clf.best_params_ {'en__alpha': 0.1, 'en__fit_intercept': False, 'en__l1_ratio': 0.9, 'pf__degree': 2}

この出力から、ベストなモデルとなるハイパーパラメータはElasticNetのα=0.1, l1_ratio=0.9, PolynomialFeaturesの次元が2の時と得られました。R2 scoreで見ると、trainが0.889, testが0.885と過学習している心配はなさそうです。(※ElasticNetのfit_intercept=Falseとしているのは、PolynomialFeaturesですでにバイアス項を生成させているためです。)

さて、ここからが本番です。

ElasticNetは回帰型の分析ですから、回帰式の係数が分析の結果として得られています。この例では、

best_estimator.named_steps["en"].coef_

でその係数を得られますが、これはnumpy.ndarray形式であり、出力すると

array([ 1.75804749e+01, -2.04597703e+00, -4.86938426e-01, -9.23279354e-01, -0.00000000e+00, -4.54373705e-01, 2.50403603e+00, -7.75038911e-01, -7.67195961e-01, -0.00000000e+00, -9.50376753e-01, -5.91575503e-01, 5.94885088e-01, -2.48676454e+00, 1.18679136e-01, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, -0.00000000e+00, 1.38849955e-02, -0.00000000e+00, 0.00000000e+00, -0.00000000e+00, 0.00000000e+00, 0.00000000e+00, -1.02503909e-01, -0.00000000e+00, 1.54255789e-01, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 1.10347954e-01, -0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 4.46273726e-01, -0.00000000e+00, -0.00000000e+00, 5.59900185e-01, -0.00000000e+00, 0.00000000e+00, -0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, -0.00000000e+00, -0.00000000e+00, -0.00000000e+00, 1.70831778e-01, -2.04649122e-01, -4.64448865e-01, -0.00000000e+00, 4.09789119e-01, -0.00000000e+00, -0.00000000e+00, -0.00000000e+00, -0.00000000e+00, 0.00000000e+00, -0.00000000e+00, -4.38022921e-01, -0.00000000e+00, 4.22304702e-01, -0.00000000e+00, -0.00000000e+00, -0.00000000e+00, 0.00000000e+00, -0.00000000e+00, 5.18078151e-01, -1.61121546e-01, 0.00000000e+00, -4.24372747e-02, -5.16195159e-01, -4.26033726e-01, 2.12103848e-01, -5.75149604e-01, 3.74467667e-01, 1.51477181e-01, -0.00000000e+00, -0.00000000e+00, -2.59401414e-03, -6.81035927e-01, -1.35884843e-01, 1.20519287e-01, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, -0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 9.19710887e-01, 0.00000000e+00, -0.00000000e+00, -0.00000000e+00, 4.32392321e-01, 5.95061672e-01, -0.00000000e+00, -7.34775902e-01, 3.92426845e-01, 0.00000000e+00, -0.00000000e+00, -1.29572606e-01, -3.02331023e-02, 1.14892681e-01])

このように、係数の値しか入っていません。そこで、前回sklearn.preprocessing.PolynomialFeaturesの仕様を調査したときに作成した関数を手直しして、この配列に対応する特徴量名を付けてみましょう

# 1番目からn番目までの素数をnp.arrayで返す関数
import sympy
def primes(n=0):
    ret_primes = list()
    for i in np.arange(n):
        ret_primes.append(sympy.prime(i+1))
    return ret_primes

# コラム名リストからコラムに対してユニークな素数が割り付けられたデータフレームを返す関数
def generate_df_prime_from_columns(columns_original = np.array(["a","b","c"])):
    data_original = np.array(primes(len(columns_original)))
    return pd.DataFrame(data=data_original[np.newaxis,:],columns=columns_original,index=["original"])

# PolynomialFeatures()の出力に対応したカラムを返す関数
def get_columns_for_PolynomialFeatures(poly=PolynomialFeatures(2),columns_from=["a","b","c"],power=False):
    df_from = generate_df_prime_from_columns(columns_from)
    columns_from = df_from.columns
    data_from = df_from.values
    data_poly = poly.fit_transform(df_from)
    # columnsをもう一度作り直す
    columns_poly = list()
    for i, data_poly_0 in enumerate(data_poly[0]):
        if (data_poly_0 == 1):
            columns_poly.append("bias")
        else:
            prime_dict=sympy.factorint(data_poly_0)
            keys = list(prime_dict.keys())
            columns_str = ""
            if power:
                # 累乗で書ける部分は累乗で書く(例:a^2)
                for j, key in enumerate(keys):
                    columns_str += columns_from[list(data_from[0]).index(key)]
                    if prime_dict[key] > 1:
                        columns_str += "^" + str(prime_dict[key])
                    if (j < len(keys)-1):
                        columns_str += "×"
            else:
                # 単純に×で項目をつなげていく(例:a×a×b)
                for j, key in enumerate(keys):
                    for k in np.arange(prime_dict[key]):
                        columns_str += columns_from[list(data_from[0]).index(key)]
                        if (j < len(keys)-1) | (k < prime_dict[key]-1):
                            columns_str += "×"
            columns_poly.append(columns_str)
    return columns_poly

前回はinvestigate_PolynomialFeatures という関数でしたが、今回は少し手直ししてget_columns_for_PolynomialFeaturesという名前にし、特徴量の名前(list形式)を返す関数にしています。

さて、get_columns_for_PolynomialFeaturesを使用して、係数の大きな順番にランキング形式で表示してみましょう。

# 係数を出力
best_estimator.named_steps["en"].coef_

columns_for_coef_ = get_columns_for_PolynomialFeatures(poly=PolynomialFeatures(best_estimator.named_steps["pf"].degree),
                                                       columns_from=df_features_.columns)
df_coef = pd.DataFrame(data=best_estimator.named_steps["en"].coef_,index=columns_for_coef_,columns=["coef"])
df_coef["abs coef"] = abs(df_coef["coef"])

# 係数の絶対値が0.1以上のものを絶対値の降順で表示
display(df_coef[df_coef["abs coef"]>0.1].sort_values('abs coef', ascending=False))

	coef	abs coef
bias	17.580475	17.580475
RM	2.504036	2.504036
LSTAT	-2.486765	2.486765
CRIM	-2.045977	2.045977
TAX	-0.950377	0.950377
INDUS	-0.923279	0.923279
RAD×TAX	0.919711	0.919711
AGE	-0.775039	0.775039
DIS	-0.767196	0.767196
TAX×LSTAT	-0.734776	0.734776
AGE×B	-0.681036	0.681036
TAX×PTRATIO	0.595062	0.595062
B	0.594885	0.594885
PTRATIO	-0.591576	0.591576
RM×LSTAT	-0.575150	0.575150
INDUS×INDUS	0.559900	0.559900
RM×RM	0.518078	0.518078
RM×TAX	-0.516195	0.516195
ZN	-0.486938	0.486938
CHAS×RM	-0.464449	0.464449
NOX	-0.454374	0.454374
ZN×PTRATIO	0.446274	0.446274
NOX×RM	-0.438023	0.438023
TAX×TAX	0.432392	0.432392
RM×PTRATIO	-0.426034	0.426034
NOX×DIS	0.422305	0.422305
CHAS×DIS	0.409789	0.409789
PTRATIO×PTRATIO	0.392427	0.392427
AGE×AGE	0.374468	0.374468
RM×B	0.212104	0.212104
CHAS×NOX	-0.204649	0.204649
CHAS×CHAS	0.170832	0.170832
RM×AGE	-0.161122	0.161122
ZN×ZN	0.154256	0.154256
AGE×DIS	0.151477	0.151477
AGE×LSTAT	-0.135885	0.135885
B×B	-0.129573	0.129573
DIS×DIS	0.120519	0.120519
CRIM×CRIM	0.118679	0.118679
LSTAT×LSTAT	0.114893	0.114893
ZN×RM	0.110348	0.110348
CRIM×B	-0.102504	0.102504

このように、どの特徴量の係数が大きいかが分かるようになりました。Elasticnetで回帰する前にStandardScalerで標準化しているので、基本的には係数が大きい特徴量が住宅価格を決めるのに重要となっていると考えられます。

さて、今回はBoston Housingデータセットを例に、sklearn.preprocessing.PolynomialFeaturesで生成した特徴量の係数名を得てみました。何かのお役に立てば幸いです。

*1:Boston Housingデータセットについては下記サイトを参考にしました。

「Boston Housing Prices - Evaluation & Validation」https://www.kaggle.com/samratp/boston-housing-prices-evaluation-validation
「Linear Regression on Boston Housing Dataset」https://towardsdatascience.com/linear-regression-on-boston-housing-dataset-f409b7e4a155
「機械学習とBoston Housing Data（3）」
https://kiidax.wordpress.com/2016/09/14/機械学習とboston-housing-data（3）/

*2:グリッドサーチはこちらのサイトを参考にいたしました。

「GridSearchCV(scikit-learn)によるチューニング」http://starpentagon.net/analytics/scikit_learn_grid_search_cv/

company-rapper’s blog

ブログ

sklearn.preprocessing.PolynomialFeatures のカラム名を取得する〜Boston Housingを例にして〜