Add a new Anaconda Python environment & add it as a Jupyter Notebook kernel

Create a virtual environment for your project
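For example (the environment name myenv and the Python version are placeholders):

conda create -n myenv python=3.6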


Activate your virtual environment
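For example:

conda activate myenv
(older conda versions on Linux/macOS: source activate myenv)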

List conda environments
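Either of these commands works:

conda env list
conda info --envs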

 

Install additional Python packages to a virtual environment
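For example, with the environment activated (the package names are just placeholders):

conda install numpy pandas
pip install imbalanced-learn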


Add the new environment to Jupyter Notebook
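A typical way to do this, with the environment activated (the kernel name and display name are placeholders):

pip install ipykernel
python -m ipykernel install --user --name myenv --display-name "Python (myenv)"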

List Jupyter Notebook's kernels and their config locations
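For example:

jupyter kernelspec list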

Change the name/properties of the environment:

Go to the path shown under "jupyter kernelspec list" and edit the config file (kernel.json) directly.
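A kernel.json looks roughly like this (the interpreter path and names below are placeholders for whatever is on your machine); editing "display_name" changes the name shown in the notebook's kernel menu:

{
 "argv": ["/path/to/anaconda3/envs/myenv/bin/python", "-m", "ipykernel_launcher", "-f", "{connection_file}"],
 "display_name": "Python (myenv)",
 "language": "python"
}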

Delete an environment
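To remove both the conda environment and its registered kernel (names are placeholders):

conda remove -n myenv --all
jupyter kernelspec remove myenv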


 

 

 

Bokeh (5th project): interactive data visualization web app

 

How do you make useful and fun interactive data visualization web apps, and how do you deploy them online for public access?

Bokeh Plot

Click the app link here to run it; the app is interactive and animated.
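As a minimal sketch of what a Bokeh server app can look like (illustrative only, not the project's actual code; app.py and the names inside it are placeholders), save the file as app.py and start it with "bokeh serve --show app.py":

# app.py -- minimal interactive Bokeh server app (illustrative sketch)
import numpy as np
from bokeh.io import curdoc
from bokeh.layouts import column
from bokeh.models import ColumnDataSource, Slider
from bokeh.plotting import figure

x = np.linspace(0, 10, 200)
source = ColumnDataSource(data=dict(x=x, y=np.sin(x)))

plot = figure(title="Interactive sine wave")
plot.line('x', 'y', source=source, line_width=2)

freq = Slider(start=0.5, end=5.0, value=1.0, step=0.1, title="Frequency")

def update(attr, old, new):
    # recompute the curve whenever the slider moves
    source.data = dict(x=x, y=np.sin(freq.value * x))

freq.on_change('value', update)

# curdoc() is the document that `bokeh serve` renders in the browser
curdoc().add_root(column(freq, plot))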

Package a Python program into a single executable file with PyInstaller

The advantage of packaging a program is that the user does NOT need any programming environment to use the packaged software.

Packaging a program is very common. For example, when you install most software, the 'setup' installer places the packaged software on your computer as a folder and adds a shortcut to the executable on your desktop.

However, packaging a program into a single file is much more difficult; I'll explain why later. First, let's see how to do it for a Python project.

First things first, make sure you have PyInstaller installed. I've used other libraries like py2exe, but PyInstaller is the best in terms of compatibility, stability, user-friendliness, and more.

In the command window:

-> pyinstaller file_tobe_pack.py -i image.ico -F --upx-dir upxfolder

-c  (default)  opens a console window

-w  does NOT open a console window

(-F bundles everything into a single executable file, -i sets the executable's icon, and --upx-dir points to a UPX folder used to compress the result.)

The biggest problem when packaging, especially into a single file:

PyInstaller cannot find some specific library dependencies.

There are tons of solutions online, but many of them are complex and indirect.

Here is a very simple solution:

Locate the library you import in your program (the one that failed to package), copy its source code into the same directory as your own program, then CHANGE the name of the source file and change the name you import accordingly.

The packaging tool will then resolve that dependency from the local files you created.

Here’s an example:
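For instance (a hypothetical sketch; somelib and the file names are placeholders, not a real library): suppose PyInstaller fails on a module called somelib. Copy its source file into your project folder as somelib_local.py and import that copy instead:

# main.py -- illustrative only
# Before: PyInstaller misses the dependency
# import somelib

# After: import the renamed local copy that sits next to main.py,
# so the packager treats it as part of your own source tree
import somelib_local as somelib

somelib.run()  # placeholder call, whatever the library actually provides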

 

 

Travelers Auto Insurance Claims Prediction for a case competition

A lot like the Kaggle projects I have worked on: train and test data sets are given, and submissions are scored with the Gini index.

Here is some sample code I wrote in Python.

Techniques used:

Python Scikit-learn, Xgboost, imbalanced-learn for SMOTE over-sampling, etc.

Compared multiple algorithms before choosing the best one and tuning its parameters.

  • LinearRegression
  • Ridge Regression
  • LASSO Linear Regression
  • Elastic Net Regression
  • KNN
  • DecisionTree
  • SVM
  • Bagged Decision Trees
  • Random Forest
  • Extra Trees (Bagging)
  • AdaBoost (Boosting)
  • Stochastic Gradient Boosting (Boosting)
  • XGBoost Regressor
  • XGBoost XGBClassifier
travelers

environment

numpy, pandas, sklearn, xgboost, imbalanced-learn

under Linux:

  • pip install xgboost
  • pip install -U imbalanced-learn
In [43]:
import xgboost as xgb
import seaborn
# from gini import Gini
In [2]:
import pandas as pd
import numpy as np
from time import time
import matplotlib.pyplot as plt
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')
In [3]:
# load data
train1=pd.read_csv('Kangaroo_train.csv')
train2=pd.read_csv('Kangaroo_valid.csv')
print (train1.shape,train2.shape)
hold=pd.read_csv('Kangaroo_hold.csv')
print (hold.shape)
combine=pd.concat([train1,train2])
train=combine.copy()
((22610, 11), (22629, 11))
(22617, 10)

factorize categorical data

In [4]:
cat_vars = train.select_dtypes(['object']).columns

for var in cat_vars:
    train[var] = pd.factorize(train[var])[0]
    hold[var] = pd.factorize(hold[var])[0]
In [4]:
train.head(1)
Out[4]:
id claimcst0 veh_value exposure veh_body veh_age gender area agecat clm numclaims
0 1 0.0 6.43 0.241898 0 1 0 0 3 0 0

data clusters explore

In [5]:
from sklearn.decomposition import PCA
from sklearn.manifold import Isomap
from sklearn.cluster import KMeans
In [6]:
pca=PCA(2)
after=pca.fit_transform(train)
print(after.shape)
plt.scatter(after[:,0],after[:,1],alpha=.3)
(45239, 2)
Out[6]:
<matplotlib.collections.PathCollection at 0x7f9d1893e048>
In [7]:
K=KMeans(2)
gp=K.fit_transform(after)
print (gp.shape)
plt.scatter(gp[:,0],gp[:,1],alpha=.2,color='k')
(45239, 2)
Out[7]:
<matplotlib.collections.PathCollection at 0x7f9d18614e48>

split train data to train, validate

In [6]:
# preprocessing
y_train =train.pop('clm')
# y_claim_train=train.pop('claimcst0')
x_train = train.drop(['id','numclaims'], axis = 1)

y_hold = hold.pop('clm')
hold = hold.drop(['id','numclaims'], axis = 1)
In [7]:
from sklearn.cross_validation import train_test_split
x_train, x_valid, y_train, y_valid = train_test_split(
    x_train, y_train, test_size = 0.25, random_state = 345,stratify=y_train)
In [8]:
x_train.shape, x_valid.shape, y_train.shape, y_valid.shape, y_train.sum(),y_valid.sum()
Out[8]:
((33929, 8), (11310, 8), (33929,), (11310,), 2313, 771)

compare predicting clm0 rather than clm

In [9]:
y_train_clm=y_train.copy()
y_valid_clm=y_valid.copy()
In [10]:
y_train_clm0=x_train.pop('claimcst0')
y_valid_clm0=x_valid.pop('claimcst0')
In [11]:
x_train.head(1)
Out[11]:
veh_value exposure veh_body veh_age gender area agecat
17594 3.26 0.819075 3 2 0 3 5

comment below if to use clm0

In [12]:
# y_train_clm=y_train.copy()
# y_valid_clm=y_valid.copy()
# y_train=x_train.pop('claimcst0')
# y_valid=x_valid.pop('claimcst0')
In [13]:
x_train.head(1)
Out[13]:
veh_value exposure veh_body veh_age gender area agecat
17594 3.26 0.819075 3 2 0 3 5
In [14]:
fig=plt.figure()

fig.add_subplot(121)
ax=y_train_clm.plot.hist(bins=2,alpha=.5)
ax.set_title('training set clm')
fig.add_subplot(122)
ax=y_valid_clm.plot.hist(bins=2, alpha=.4,color='r')
ax.set_title('validate set clm')
fig.subplots_adjust(wspace=.4)

balance imbalanced training data set

In [15]:
from imblearn.over_sampling import SMOTE, RandomOverSampler, ADASYN
from imblearn.under_sampling import NearMiss
from imblearn.combine import SMOTEENN, SMOTETomek
from imblearn.ensemble import BalanceCascade


sm = SMOTE(kind='regular')
# sm = SMOTE(kind='svm')


# sm= RandomOverSampler()
x_train,y_train=sm.fit_sample(x_train,y_train)
x_train=pd.DataFrame(x_train)
y_train=pd.Series(y_train)
print x_train.shape, y_train.shape, y_train.sum(), y_valid.sum()
(63232, 7) (63232,) 31616 771
In [16]:
fig=plt.figure()

fig.add_subplot(121)
ax=y_train.plot.hist(bins=2,alpha=.5)
ax.set_title('training set clm')
fig.add_subplot(122)
ax=y_valid.plot.hist(bins=2, alpha=.4,color='r')
ax.set_title('validate set clm')
fig.subplots_adjust(wspace=.4)
In [23]:
x_train.head(1)
Out[23]:
0 1 2 3 4 5 6
0 3.26 0.819075 3.0 2.0 0.0 3.0 5.0
In [24]:
try:
    x_train.columns=train.drop(['id','numclaims'], axis = 1).columns
except:
    x_train.columns=train.drop(['id','numclaims','claimcst0'], axis = 1).columns
In [ ]:
 
In [25]:
# print (x_train['claimcst0']>0).sum()==(y_train).sum()
In [26]:
# y_train=x_train.pop('claimcst0')
In [27]:
# y_valid=x_valid.pop('claimcst0')
In [28]:
x_train.head(1)
Out[28]:
veh_value exposure veh_body veh_age gender area agecat
0 3.26 0.819075 3.0 2.0 0.0 3.0 5.0
In [29]:
x_valid.head(1)
Out[29]:
veh_value exposure veh_body veh_age gender area agecat
15047 0.79 0.775908 2 4 1 2 5
In [293]:
x_valid.dtypes
Out[293]:
veh_value    float64
exposure     float64
veh_body       int64
veh_age        int64
gender         int64
area           int64
agecat         int64
dtype: object

define normalized Gini

In [294]:
# calculate gini
def Gini(y_true, y_pred):
    # check and get number of samples
    assert y_true.shape == y_pred.shape
    n_samples = y_true.shape[0]
    
    # sort rows on prediction column 
    # (from largest to smallest)
    arr = np.array([y_true, y_pred]).transpose()
    true_order = arr[arr[:,0].argsort()][::-1,0]
    pred_order = arr[arr[:,1].argsort()][::-1,0]
    
    # get Lorenz curves
    L_true = np.cumsum(true_order) / np.sum(true_order)
    L_pred = np.cumsum(pred_order) / np.sum(pred_order)
    L_ones = np.linspace(1/n_samples, 1, n_samples)
    
    # get Gini coefficients (area between curves)
    G_true = np.sum(L_ones - L_true)
    G_pred = np.sum(L_ones - L_pred)
    
    # normalize to true Gini coefficient
    return G_pred/G_true
In [ ]:
 
In [39]:
# save to csv for submit & R gini calculation
def save():
    aa=pd.DataFrame(y_valid_clm0)
    pp=pd.Series(pred,index=aa.index)
    R_test=pd.concat([aa,pp],axis=1)
    R_test['split']='val'
    R_test.columns=['act','pred','split']
    R_test.to_csv('other_methods.csv')

Since the dataset is relatively small, we can optionally transform the categorical features into dummy (one-hot) form in order to remove features efficiently.

confusion_matrix

In [66]:
from sklearn.metrics import confusion_matrix

test a list of algorithms

general comparison, without tuning parameters

  • LinearRegression
  • Ridge Regression
  • LASSO Linear Regression
  • Elastic Net Regression
  • KNN
  • DecisionTree
  • SVM
  • Bagged Decision Trees
  • Random Forest
  • Extra Trees (Bagging)
  • AdaBoost (Boosting)
  • Stochastic Gradient Boosting (Boosting)
  • XGBoost Regressor
  • XGBoost XGBClassifier

compare predicting clm0 rather than clm

In [ ]:
 

LinearRegression

In [156]:
from sklearn.linear_model import LinearRegression
model = LinearRegression(normalize=True)
model
Out[156]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=True)
In [157]:
start=time()
model.fit(x_train,y_train)
pred=model.predict(x_valid)
result = Gini(y_valid, pred)
print ('used :%.4f s'%(time()-start))
print ('gini :%.6f '%result)
used :0.0053 s
gini :0.211997 

on clm 0.200869

In [158]:
judge=pd.Series(pred).apply(lambda x:1 if x>pd.Series(pred).mean() else 0)
In [160]:
confusion_matrix(y_valid_clm,judge)
Out[160]:
array([[5452, 5087],
       [ 250,  521]])

Ridge Regression

In [161]:
from sklearn.linear_model import Ridge
model=Ridge(normalize=True)
model
Out[161]:
Ridge(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=None,
   normalize=True, random_state=None, solver='auto', tol=0.001)
In [162]:
start=time()
model.fit(x_train,y_train)
pred=model.predict(x_valid)
result = Gini(y_valid, pred)
print ('used :%.4f s'%(time()-start))
print ('gini :%.6f '%result)
used :0.0046 s
gini :0.213310 

on clm 0.201315

LASSO Linear Regression

In [167]:
from sklearn.linear_model import Lasso
model=Lasso()
model
Out[167]:
Lasso(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=1000,
   normalize=False, positive=False, precompute=False, random_state=None,
   selection='cyclic', tol=0.0001, warm_start=False)
In [168]:
start=time()
model.fit(x_train,y_train)
pred=model.predict(x_valid)
result = Gini(y_valid_clm0, pred)
print ('used :%.4f s'%(time()-start))
print ('gini :%.6f '%result)
used :0.0154 s
gini :0.005823 

Elastic Net Regression

In [169]:
from sklearn.linear_model import ElasticNet
model = ElasticNet()
model
Out[169]:
ElasticNet(alpha=1.0, copy_X=True, fit_intercept=True, l1_ratio=0.5,
      max_iter=1000, normalize=False, positive=False, precompute=False,
      random_state=None, selection='cyclic', tol=0.0001, warm_start=False)
In [170]:
start=time()
model.fit(x_train,y_train)
pred=model.predict(x_valid)
result = Gini(y_valid_clm0, pred)
print ('used :%.4f s'%(time()-start))
print ('gini :%.6f '%result)
used :0.0069 s
gini :0.005823 

KNN

In [73]:
from sklearn.neighbors import KNeighborsRegressor
model = KNeighborsRegressor(800)
model
Out[73]:
KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='minkowski',
          metric_params=None, n_jobs=1, n_neighbors=800, p=2,
          weights='uniform')
In [74]:
start=time()
model.fit(x_train,y_train)
pred=model.predict(x_valid)
result = Gini(y_valid_clm0, pred)
print ('used :%.4f s'%(time()-start))
print ('gini :%.6f '%result)
used :3.5301 s
gini :0.056210 
In [81]:
pd.Series(pred).describe()
Out[81]:
count    11310.000000
mean         0.495275
std          0.071319
min          0.271250
25%          0.447500
50%          0.497500
75%          0.545000
max          0.737500
dtype: float64

Decision Tree

In [171]:
from sklearn.tree import DecisionTreeRegressor
model = DecisionTreeRegressor(max_depth=5,random_state=7,
                              min_samples_leaf=1,max_leaf_nodes=None,
                             min_samples_split=5,
                              min_weight_fraction_leaf=.1)
model
Out[171]:
DecisionTreeRegressor(criterion='mse', max_depth=5, max_features=None,
           max_leaf_nodes=None, min_impurity_split=1e-07,
           min_samples_leaf=1, min_samples_split=5,
           min_weight_fraction_leaf=0.1, presort=False, random_state=7,
           splitter='best')
In [172]:
start=time()
model.fit(x_train,y_train)
pred=model.predict(x_valid)
result = Gini(y_valid_clm0, pred)
print ('used :%.4f s'%(time()-start))
print ('gini :%.6f '%result)
used :0.0879 s
gini :0.251430 
In [173]:
train_set=model.predict(x_train)
train_set_gini = Gini(y_train, train_set)
train_set_gini
Out[173]:
0.33359751419411399

SVM

In [169]:
from sklearn.svm import SVR
model = SVR(cache_size=20000)
model
Out[169]:
SVR(C=1.0, cache_size=20000, coef0=0.0, degree=3, epsilon=0.1, gamma='auto',
  kernel='rbf', max_iter=-1, shrinking=True, tol=0.001, verbose=False)
In [63]:
start=time()
model.fit(x_train,y_train)
pred=model.predict(x_valid)
result = Gini(y_valid_clm0, pred)
print ('used :%.4f s'%(time()-start))
print ('gini :%.6f '%result)
used :108.2684 s
gini :0.058458 

Bagged Decision Trees

In [304]:
from sklearn.ensemble import BaggingRegressor
model = BaggingRegressor(n_estimators=300,verbose=1,n_jobs=-1,
                        max_features=.9, max_samples=.9,)
model
Out[304]:
BaggingRegressor(base_estimator=None, bootstrap=True,
         bootstrap_features=False, max_features=0.9, max_samples=0.9,
         n_estimators=300, n_jobs=-1, oob_score=False, random_state=None,
         verbose=1, warm_start=False)
In [305]:
start=time()
model.fit(x_train,y_train)
pred=model.predict(x_valid)
result = Gini(y_valid_clm0, pred)
print ('used :%.4f s'%(time()-start))
print ('gini :%.6f '%result)
[Parallel(n_jobs=8)]: Done   2 out of   8 | elapsed:    5.3s remaining:   15.8s
[Parallel(n_jobs=8)]: Done   8 out of   8 | elapsed:    5.6s finished
[Parallel(n_jobs=8)]: Done   2 out of   8 | elapsed:    0.2s remaining:    0.6s
used :6.3984 s
gini :0.139489 
[Parallel(n_jobs=8)]: Done   8 out of   8 | elapsed:    0.4s finished

Random Forest (Bagging)

In [306]:
from sklearn.ensemble import RandomForestRegressor
model = RandomForestRegressor(n_estimators=600,max_depth=3,n_jobs=-1)
model
Out[306]:
RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=3,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_split=1e-07, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           n_estimators=600, n_jobs=-1, oob_score=False, random_state=None,
           verbose=0, warm_start=False)
In [307]:
start=time()
model.fit(x_train,y_train)
pred=model.predict(x_valid)
result = Gini(y_valid_clm0, pred)
print ('used :%.4f s'%(time()-start))
print ('gini :%.6f '%result)
used :4.5205 s
gini :0.211273 
In [308]:
pred
Out[308]:
array([ 0.3596833 ,  0.34252775,  0.12714231, ...,  0.16006586,
        0.31602501,  0.12386025])
In [223]:
save()

on clm 0.205095

Extra Trees (Bagging)

In [309]:
from sklearn.ensemble import ExtraTreesRegressor
model = ExtraTreesRegressor(n_jobs=-1,n_estimators=1000,max_depth=3,
                           bootstrap=True,)
model
Out[309]:
ExtraTreesRegressor(bootstrap=True, criterion='mse', max_depth=3,
          max_features='auto', max_leaf_nodes=None,
          min_impurity_split=1e-07, min_samples_leaf=1,
          min_samples_split=2, min_weight_fraction_leaf=0.0,
          n_estimators=1000, n_jobs=-1, oob_score=False, random_state=None,
          verbose=0, warm_start=False)
In [310]:
start=time()
model.fit(x_train,y_train)
pred=model.predict(x_valid)
result = Gini(y_valid_clm0, pred)
print ('used :%.4f s'%(time()-start))
print ('gini :%.6f '%result)
used :3.4390 s
gini :0.190100 
In [311]:
save()

on clm 0.197320

AdaBoost (Boosting)

In [312]:
from sklearn.ensemble import AdaBoostRegressor
model = AdaBoostRegressor()
model
Out[312]:
AdaBoostRegressor(base_estimator=None, learning_rate=1.0, loss='linear',
         n_estimators=50, random_state=None)
In [313]:
start=time()
model.fit(x_train,y_train)
pred=model.predict(x_valid)
result = Gini(y_valid_clm0, pred)
print ('used :%.4f s'%(time()-start))
print ('gini :%.6f '%result)
used :0.7475 s
gini :0.201253 

on clm 0.204600

Stochastic Gradient Boosting (Boosting)

In [314]:
from sklearn.ensemble import GradientBoostingRegressor
model = GradientBoostingRegressor(n_estimators=300,max_depth=3,
                                 subsample=.9)
model
Out[314]:
GradientBoostingRegressor(alpha=0.9, criterion='friedman_mse', init=None,
             learning_rate=0.1, loss='ls', max_depth=3, max_features=None,
             max_leaf_nodes=None, min_impurity_split=1e-07,
             min_samples_leaf=1, min_samples_split=2,
             min_weight_fraction_leaf=0.0, n_estimators=300,
             presort='auto', random_state=None, subsample=0.9, verbose=0,
             warm_start=False)
In [315]:
start=time()
model.fit(x_train,y_train)
pred=model.predict(x_valid)
result = Gini(y_valid_clm0, pred)
print ('used :%.4f s'%(time()-start))
print ('gini :%.6f '%result)
used :4.8759 s
gini :0.170067 

on clm 0.221349 (trained on clm, Gini calculated with clm0)

In [316]:
train_set=model.predict(x_train)
train_set_gini = Gini(y_train, train_set)
train_set_gini
Out[316]:
0.8828912822708892

on clm 0.41400336589109327

XGBoost sklearn api

Regressor

In [317]:
from xgboost import XGBRegressor
model = XGBRegressor(learning_rate=0.01,max_depth=3,)
model
Out[317]:
XGBRegressor(base_score=0.5, colsample_bylevel=1, colsample_bytree=1, gamma=0,
       learning_rate=0.01, max_delta_step=0, max_depth=3,
       min_child_weight=1, missing=None, n_estimators=100, nthread=-1,
       objective='reg:linear', reg_alpha=0, reg_lambda=1,
       scale_pos_weight=1, seed=0, silent=True, subsample=1)
In [318]:
start=time()
model.fit(x_train,y_train)
pred=model.predict(x_valid)
result = Gini(y_valid_clm0, pred)
print ('used :%.4f s'%(time()-start))
print ('gini :%.6f '%result)
used :0.5483 s
gini :0.214805 

Classifier

In [319]:
from xgboost import XGBClassifier

model = XGBClassifier(learning_rate=0.0087,objective='binary:logistic',
                     n_estimators=300,reg_lambda=3,
                     subsample=.6,colsample_bylevel=.6,
                   )
model
Out[319]:
XGBClassifier(base_score=0.5, colsample_bylevel=0.6, colsample_bytree=1,
       gamma=0, learning_rate=0.0087, max_delta_step=0, max_depth=3,
       min_child_weight=1, missing=None, n_estimators=300, nthread=-1,
       objective='binary:logistic', reg_alpha=0, reg_lambda=3,
       scale_pos_weight=1, seed=0, silent=True, subsample=0.6)
In [320]:
start=time()
model.fit(x_train,y_train)
pred=model.predict(x_valid)
result = Gini(y_valid_clm0, pred)
print ('used :%.4f s'%(time()-start))
print ('gini :%.6f '%result)
used :1.5666 s
gini :0.009133 
In [ ]:
 
In [ ]:
 
In [321]:
x_train.head(1)
Out[321]:
veh_value exposure veh_body veh_age gender area agecat
0 3.26 0.819075 3.0 2.0 0.0 3.0 5.0
In [322]:
x_valid.head(1)
Out[322]:
veh_value exposure veh_body veh_age gender area agecat
15047 0.79 0.775908 2 4 1 2 5
In [ ]:
 

xgboost

linear

In [32]:
# apply xgboost
d_train = xgb.DMatrix(x_train, y_train)
d_valid = xgb.DMatrix(x_valid, y_valid_clm0)
d_test = xgb.DMatrix(x_valid)

# d_hold=xgb.DMatrix(x_hold)
def use_gini(a, b):
    y = b.get_label()
    return 'gini', Gini(y, a)
    
watchlist = [(d_train, 'train'), (d_valid, 'valid')]
In [48]:
xgb_params = {
#     'base_score': 0.5,
 'colsample_bylevel': 0.6,
 'colsample_bytree': 0.8,
#  'gamma': 0,
 'learning_rate': 0.01,
#  'max_delta_step': 1,
 'max_depth': 1,
 'min_child_weight': 9,
 'missing': None,

#  'objective': 'reg:linear',

 'objective': 'binary:logistic',
 'reg_alpha': 9,
 'reg_lambda': 9,
 'scale_pos_weight': 1,
 'seed': 27,
 'silent': 1,
 'subsample': 0.9}
In [82]:
start=time()

clf = xgb.train(xgb_params, d_train, 502, watchlist, early_stopping_rounds=1111, 
                verbose_eval=50, 
                feval=use_gini, 
                maximize=True)


'!!! used %.2f s'%(time()-start)
[0]	train-gini:-0.186319	valid-gini:0.114123
Multiple eval metrics have been passed: 'valid-gini' will be used for early stopping.

Will train until valid-gini hasn't improved in 1111 rounds.
[50]	train-gini:0.325454	valid-gini:0.191548
[100]	train-gini:0.35463	valid-gini:0.201859
[150]	train-gini:0.368934	valid-gini:0.215929
[200]	train-gini:0.373168	valid-gini:0.231201
[250]	train-gini:0.374377	valid-gini:0.23014
[300]	train-gini:0.379856	valid-gini:0.234736
[350]	train-gini:0.386277	valid-gini:0.234685
[400]	train-gini:0.391889	valid-gini:0.233804
[450]	train-gini:0.393826	valid-gini:0.231898
[500]	train-gini:0.401135	valid-gini:0.239523
Out[82]:
'!!! used 5.60 s'
In [83]:
clf.best_iteration,clf.best_ntree_limit,clf.best_score
Out[83]:
(501, 502, 0.240186)
In [84]:
pred=clf.predict(d_test)
In [85]:
save()

feature importance

In [96]:
feature_score=pd.Series(clf.get_fscore())
feature_score.sort()
feature_score.plot.barh(color='r',alpha=.3)
Out[96]:
<matplotlib.axes._subplots.AxesSubplot at 0x7efc04a8f5d0>
In [ ]:
 
In [ ]:
 

logistic

In [10]:
xgb_params3 = {
#     'base_score': 0.5,
 'colsample_bylevel': 0.6,
#  'colsample_bytree': 0.8,
#  'gamma': 0,
 'learning_rate': 0.0086,
 'max_depth': 1,
 'min_child_weight': .5,
#  'missing': None,

#  'objective': 'reg:linear',

 'objective': 'binary:logistic',
 'reg_alpha': 0.5,
 'reg_lambda': 3,
 'scale_pos_weight': .5,
 'seed': 27,
 'silent': 1,
 'subsample': 0.54,
'eval_metric':'auc'}
In [98]:
start=time()

clf = xgb.train(xgb_params3, d_train, 1597, watchlist, early_stopping_rounds=666, 
                verbose_eval=50,
                feval=use_gini, 
                maximize=True
               )


'!!! used %.2f s'%(time()-start)
[0]	train-gini:-0.43203	valid-gini:0.077052
Multiple eval metrics have been passed: 'valid-gini' will be used for early stopping.

Will train until valid-gini hasn't improved in 666 rounds.
[50]	train-gini:0.309446	valid-gini:0.192613
[100]	train-gini:0.351892	valid-gini:0.206371
[150]	train-gini:0.368609	valid-gini:0.215305
[200]	train-gini:0.365294	valid-gini:0.226315
[250]	train-gini:0.370303	valid-gini:0.229143
[300]	train-gini:0.36993	valid-gini:0.233933
[350]	train-gini:0.377201	valid-gini:0.23231
[400]	train-gini:0.381864	valid-gini:0.229641
[450]	train-gini:0.385241	valid-gini:0.230075
[500]	train-gini:0.391504	valid-gini:0.235022
[550]	train-gini:0.39584	valid-gini:0.23102
[600]	train-gini:0.399529	valid-gini:0.235704
[650]	train-gini:0.400685	valid-gini:0.23321
[700]	train-gini:0.402738	valid-gini:0.233753
[750]	train-gini:0.404563	valid-gini:0.233582
[800]	train-gini:0.405778	valid-gini:0.233925
[850]	train-gini:0.407957	valid-gini:0.236965
[900]	train-gini:0.410187	valid-gini:0.236511
[950]	train-gini:0.412109	valid-gini:0.237146
[1000]	train-gini:0.414372	valid-gini:0.237582
[1050]	train-gini:0.416637	valid-gini:0.238728
[1100]	train-gini:0.418962	valid-gini:0.239321
[1150]	train-gini:0.421577	valid-gini:0.239511
[1200]	train-gini:0.423235	valid-gini:0.239334
[1250]	train-gini:0.425139	valid-gini:0.23904
[1300]	train-gini:0.426691	valid-gini:0.238922
[1350]	train-gini:0.428596	valid-gini:0.238845
[1400]	train-gini:0.430435	valid-gini:0.239261
[1450]	train-gini:0.431998	valid-gini:0.239357
[1500]	train-gini:0.433579	valid-gini:0.239216
[1550]	train-gini:0.434974	valid-gini:0.239915
Out[98]:
'!!! used 19.15 s'
In [100]:
clf.best_iteration,clf.best_ntree_limit,clf.best_score
Out[100]:
(1596, 1597, 0.240364)
In [101]:
pred=clf.predict(d_test)
In [102]:
save()
In [ ]:
 
In [103]:
feature_score=pd.Series(clf.get_fscore())
feature_score.sort()
feature_score.plot.barh(color='g',alpha=.4)
Out[103]:
<matplotlib.axes._subplots.AxesSubplot at 0x7efc04a14f10>

Next step: train on 100% of the train+valid data.

Feature selection might help.

Converting categorical features to dummies might make feature selection easier.

In [10]:
def make_dummies(data):
    for i in ['veh_body', 'veh_age', 'gender', 'area', 'agecat']:
        data[i]=data[i].astype('int')
        data=pd.concat([data,pd.get_dummies(data[i])],axis=1)

    a=['veh_body_'+str(i) for i in range(13) ]
    b=['veh_age_'+str(i) for i in range(1,5) ]
    c=['gender_'+str(i) for i in range(2) ]
    d=['area_'+str(i) for i in range(6) ]
    e=['agecat_'+str(i) for i in range(1,7) ]
    data.pop('veh_body')
    data.pop('veh_age')
    data.pop('gender')
    data.pop('area')
    data.pop('agecat')
    data.columns=['veh_value','exposure']+a+b+c+d+e
    return data
In [11]:
x_train=make_dummies(x_train)
# x_valid=make_dummies(x_valid)
In [12]:
x_train.shape
Out[12]:
(45239, 33)
In [13]:
x_train.head(1)
Out[13]:
veh_value exposure veh_body_0 veh_body_1 veh_body_2 veh_body_3 veh_body_4 veh_body_5 veh_body_6 veh_body_7 ... area_2 area_3 area_4 area_5 agecat_1 agecat_2 agecat_3 agecat_4 agecat_5 agecat_6
0 6.43 0.241898 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0

1 rows × 33 columns

use all data to train the model

In [ ]:
 
In [ ]:
 
In [14]:
# load data
train1=pd.read_csv('Kangaroo_train.csv')
train2=pd.read_csv('Kangaroo_valid.csv')
print (train1.shape,train2.shape)
hold=pd.read_csv('Kangaroo_hold.csv')
print (hold.shape)
combine=pd.concat([train1,train2])
train=combine.copy()

cat_vars = train.select_dtypes(['object']).columns

for var in cat_vars:
    train[var] = pd.factorize(train[var])[0]
    hold[var] = pd.factorize(hold[var])[0]

# preprocessing
y_train =train.pop('clm')
# y_claim_train=train.pop('claimcst0')
x_train = train.drop(['id','numclaims'], axis = 1)

y_hold = hold.pop('clm')
hold = hold.drop(['id','numclaims'], axis = 1)

y_train_clm0=x_train.pop('claimcst0')
((22610, 11), (22629, 11))
(22617, 10)
In [15]:
from imblearn.over_sampling import SMOTE
sm = SMOTE(kind='regular')

x_train,y_train=sm.fit_sample(x_train,y_train)
x_train=pd.DataFrame(x_train)
y_train=pd.Series(y_train)
print x_train.shape, y_train.shape, y_train.sum()
x_train.columns=train.drop(['id','numclaims','claimcst0'], axis = 1).columns
(84310, 7) (84310,) 42155
In [16]:
x_train.head(1)
Out[16]:
veh_value exposure veh_body veh_age gender area agecat
0 6.43 0.241898 0.0 1.0 0.0 0.0 3.0

make dummies

In [17]:
x_train=make_dummies(x_train)
hold=make_dummies(hold)
In [ ]:
 
In [ ]:
 
In [ ]:
 
In [ ]:
 
In [18]:
# apply xgboost
d_train = xgb.DMatrix(x_train, y_train)
d_valid = xgb.DMatrix(x_train, y_train)
d_hold = xgb.DMatrix(hold)

# d_hold=xgb.DMatrix(x_hold)
def use_gini(a, b):
    y = b.get_label()
    return 'gini', Gini(y, a)
    
watchlist = [(d_train, 'train'), (d_valid, 'valid')]
In [19]:
xgb_params = {
#     'base_score': 0.5,
 'colsample_bylevel': 0.6,
 'colsample_bytree': 0.8,
#  'gamma': 0,
 'learning_rate': 0.01,
#  'max_delta_step': 1,
 'max_depth': 1,
 'min_child_weight': 9,
 'missing': None,

#  'objective': 'reg:linear',

 'objective': 'binary:logistic',
 'reg_alpha': 9,
 'reg_lambda': 9,
 'scale_pos_weight': 1,
 'seed': 27,
 'silent': 1,
 'subsample': 0.9}
In [20]:
# linear

start=time()

clf = xgb.train(xgb_params, d_train, int(502/0.75), watchlist, early_stopping_rounds=1111, 
                verbose_eval=50, 
                feval=use_gini, 
                maximize=True)


'!!! used %.2f s'%(time()-start)
[0]	train-gini:-0.127339	valid-gini:-0.127339
Multiple eval metrics have been passed: 'valid-gini' will be used for early stopping.

Will train until valid-gini hasn't improved in 1111 rounds.
[50]	train-gini:0.223111	valid-gini:0.223111
[100]	train-gini:0.273884	valid-gini:0.273884
[150]	train-gini:0.353242	valid-gini:0.353242
[200]	train-gini:0.383093	valid-gini:0.383093
[250]	train-gini:0.377712	valid-gini:0.377712
[300]	train-gini:0.378535	valid-gini:0.378535
[350]	train-gini:0.379602	valid-gini:0.379602
[400]	train-gini:0.381008	valid-gini:0.381008
[450]	train-gini:0.38618	valid-gini:0.38618
[500]	train-gini:0.391242	valid-gini:0.391242
[550]	train-gini:0.393174	valid-gini:0.393174
[600]	train-gini:0.394663	valid-gini:0.394663
[650]	train-gini:0.396026	valid-gini:0.396026
Out[20]:
'!!! used 15.14 s'
In [21]:
clf.best_iteration,clf.best_ntree_limit,clf.best_score
Out[21]:
(665, 666, 0.396638)
In [22]:
pred=clf.predict(d_hold)
In [17]:
save2()## final prediction
In [21]:
feature_score=pd.Series(clf.get_fscore())
feature_score.sort()
feature_score.plot.barh(color='m',alpha=.4)
Out[21]:
<matplotlib.axes._subplots.AxesSubplot at 0x7faf62e58890>

Optional: drop features

In [25]:
from sklearn.metrics import confusion_matrix
judge=pd.Series(pred).apply(lambda x:1 if x>0.522029 else 0)
In [26]:
confusion_matrix(y_hold,judge)
Out[26]:
array([[10801, 10276],
       [  440,  1100]])
In [27]:
pd.Series(pred).describe()
Out[27]:
count    22617.000000
mean         0.456469
std          0.157445
min          0.098007
25%          0.324611
50%          0.525218
75%          0.600220
max          0.645061
dtype: float64
In [28]:
(0.595391-0.522029)/5*8+0.522029
Out[28]:
0.6394082
In [29]:
(pd.Series(pred)>0.62).sum()
Out[29]:
758
In [30]:
y_hold.sum()
Out[30]:
1540
In [31]:
1540/22617.
Out[31]:
0.06809037449705974
In [32]:
xgb_params4 = {
#     'base_score': 0.5,
 'colsample_bylevel': 0.6,
#  'colsample_bytree': 0.8,
#  'gamma': 0,
 'learning_rate': 0.0086,
 'max_depth': 1,
 'min_child_weight': .5,
#  'missing': None,

#  'objective': 'reg:linear',

 'objective': 'binary:logistic',
 'reg_alpha': 0.5,
 'reg_lambda': 3,
 'scale_pos_weight': .5,
 'seed': 27,
 'silent': 1,
 'subsample': 0.54,
'eval_metric':'auc'}
In [34]:
start=time()

clf = xgb.train(xgb_params4, d_train, 500, watchlist, early_stopping_rounds=666, 
                verbose_eval=50,
                feval=use_gini, 
                maximize=True
               )


'!!! used %.2f s'%(time()-start)
[0]	train-gini:-0.127759	valid-gini:-0.127759
Multiple eval metrics have been passed: 'valid-gini' will be used for early stopping.

Will train until valid-gini hasn't improved in 666 rounds.
[50]	train-gini:0.252112	valid-gini:0.252112
[100]	train-gini:0.307819	valid-gini:0.307819
[150]	train-gini:0.368028	valid-gini:0.368028
[200]	train-gini:0.371885	valid-gini:0.371885
[250]	train-gini:0.371882	valid-gini:0.371882
[300]	train-gini:0.378054	valid-gini:0.378054
[350]	train-gini:0.375776	valid-gini:0.375776
[400]	train-gini:0.380253	valid-gini:0.380253
[450]	train-gini:0.386351	valid-gini:0.386351
Out[34]:
'!!! used 11.42 s'
In [35]:
clf.best_iteration,clf.best_ntree_limit,clf.best_score
Out[35]:
(494, 495, 0.391914)
In [37]:
pred=clf.predict(d_hold)
In [38]:
result=pd.Series(pred)
In [40]:
judge=pd.Series(pred).apply(lambda x:1 if x>0.494940 else 0)
In [41]:
confusion_matrix(y_hold,judge)
Out[41]:
array([[21077,     0],
       [ 1540,     0]])
In [42]:
result.describe()
Out[42]:
count    22617.000000
mean         0.315732
std          0.116792
min          0.081827
25%          0.199839
50%          0.369745
75%          0.422930
max          0.453982
dtype: float64

An image crawler I wrote, with a GUI (graphical user interface) added & packaged into a single .exe file

In 2014, I wrote a Python image crawler for a specific website, Cure WorldCosplay, a site where cosplayers from all over the world post their own pictures. It has about 10k active members and up to 10 million pictures posted.

The pro is that the program is packaged into a single executable file, so no programming environment is needed. The con is that some antivirus software may flag the file as unsafe.

Here is the program!

Click the names below to download:

With interface

Without interface


Theoretically, if you have enough disk space, you can download all the pictures on that website (about 9,800 gigabytes); the only limit is your bandwidth. I have deployed 36 crawlers on a Linux server at the same time, downloading pictures 24/7 at the maximum internet bandwidth.

 

How to use it: