Xi'an University of Science and Technology, Mathematical Modeling: Application of the Bootstrap Method in Innovative Practice

1. Introduction

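Ensemble learning combines the results of several different models to achieve better generalization performance. Deep learning models with multi-layer architectures currently outperform models that have only shallow structures. Deep ensemble learning combines the strengths of deep learning models with those of ensembling, so that the final model generalizes better. Ensemble models are broadly classified into three families: bagging, boosting, and stacking.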

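This article gives a theoretical overview of these three families of ensemble methods and lists specific reference papers for each. It also provides an example of an ensemble method implemented in Python.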

2. Bagging Strategy

2.1 Theoretical Overview

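Bagging, also known as bootstrap aggregating, is one of the standard techniques for building ensemble models and is used to improve the performance of an ensemble classifier. Its main idea is to generate a series of independent sets of observations, each with the same size and distribution as the original data. From these sets it builds an ensemble predictor that typically outperforms a single predictor trained on the original data. Bagging adds two steps to the original model: first, generate the bagging samples and pass each one to a base model; second, combine the predictions of the multiple predictors. Each sample can be drawn with or without replacement. How the base predictions are combined depends on the task: majority voting is mostly used for classification, while averaging is used to produce the ensemble output in regression.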
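The two steps described above (resample, then combine) can be sketched as follows. This is a minimal illustration rather than a reference implementation: the helper names `bagging_fit` and `bagging_predict`, the toy sine data, and the choice of 25 trees are all assumptions made for the example.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor


def bagging_fit(X, y, n_estimators=25, seed=0):
    """Fit one decision tree per bootstrap sample (rows drawn with replacement)."""
    rng = np.random.default_rng(seed)
    n = len(X)
    models = []
    for _ in range(n_estimators):
        idx = rng.integers(0, n, size=n)  # bootstrap sample, same size as the data
        models.append(DecisionTreeRegressor(random_state=0).fit(X[idx], y[idx]))
    return models


def bagging_predict(models, X):
    """Regression: average the base predictions (classification would vote instead)."""
    return np.mean([m.predict(X) for m in models], axis=0)


# Hypothetical toy data: a noisy sine curve.
rng = np.random.default_rng(42)
X_toy = rng.uniform(0, 6, size=(200, 1))
y_toy = np.sin(X_toy[:, 0]) + rng.normal(scale=0.3, size=200)

bag = bagging_fit(X_toy, y_toy)
bag_pred = bagging_predict(bag, X_toy)
```

For a classification task the averaging in `bagging_predict` would be replaced by a majority vote over the predicted labels.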

2.2 Related Papers

Smoothing Effects of Bagging: Von Mises Expansions of Bagged Statistical Functionals
Analyzing bagging
Support vector machine ensemble with bagging
Asymmetric bagging and random subspace for support vector machines-based relevance feedback in image retrieval
A case study on bagging, boosting and basic ensembles of neural networks for ocr
Response models based on bagging neural networks
Pricing and hedging derivative securities with neural networks: Bayesian regularization, early stopping, and bagging
Improved short-term load forecasting using bagged neural networks
Bagging survival trees
On building ensembles of stacked denoising autoencoding classifiers and their further improvement
Roughly balanced bagging for imbalanced data
Neighbourhood sampling in bagging for imbalanced data
Online bagging and boosting

3. Boosting Strategy

3.1 Theoretical Overview

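Boosting is used in ensemble models to turn weak learners into a model with better generalization ability. Compared with a single weak learner, techniques such as majority voting in classification problems, or a linear combination of weak learners in regression problems, can produce better predictions. Boosting methods such as AdaBoost and gradient boosting have been used in many different fields. AdaBoost uses a greedy technique to minimize a convex surrogate function that upper-bounds the misclassification loss: at each iteration the current model is augmented with an appropriately weighted predictor. AdaBoost learns an effective ensemble classifier because it exploits the misclassified samples at each stage of learning. AdaBoost minimizes the exponential loss function, while gradient boosting generalizes this framework to arbitrary differentiable loss functions. Boosting, also known as forward stagewise additive modeling, was originally proposed to improve the performance of classification trees. Given how well deep learning models perform across many fields and applications, boosting has recently been incorporated into deep learning models as well.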
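To make the reweighting idea concrete, here is a small sketch of discrete AdaBoost with decision stumps as weak learners. It assumes labels in {-1, +1}. The function names, the toy data, and the number of rounds are illustrative choices, not part of any particular paper listed below.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier


def adaboost_fit(X, y, n_rounds=20):
    """Discrete AdaBoost for labels in {-1, +1}, using depth-1 trees (stumps)."""
    n = len(X)
    w = np.full(n, 1.0 / n)                    # start with uniform sample weights
    stumps, alphas = [], []
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1, random_state=0)
        stump.fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        err = np.sum(w[pred != y])             # weighted error (weights sum to 1)
        err = np.clip(err, 1e-10, 1 - 1e-10)   # avoid division by zero / log(0)
        alpha = 0.5 * np.log((1 - err) / err)  # weight of this weak learner
        w = w * np.exp(-alpha * y * pred)      # up-weight misclassified samples
        w /= w.sum()
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, np.array(alphas)


def adaboost_predict(stumps, alphas, X):
    """Sign of the alpha-weighted vote of all stumps."""
    score = sum(a * s.predict(X) for a, s in zip(alphas, stumps))
    return np.where(score >= 0, 1, -1)


# Hypothetical toy data: a diagonal decision boundary that no single stump can fit.
rng = np.random.default_rng(0)
X_toy = rng.uniform(-1, 1, size=(300, 2))
y_toy = np.where(X_toy[:, 0] + X_toy[:, 1] > 0, 1, -1)

stumps, alphas = adaboost_fit(X_toy, y_toy)
ada_pred = adaboost_predict(stumps, alphas, X_toy)
train_acc = np.mean(ada_pred == y_toy)
```

The weight update `exp(-alpha * y * pred)` is the step that corresponds to minimizing the exponential loss. Gradient boosting would instead fit each new learner to the negative gradient of a chosen differentiable loss.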

3.2 Related Papers

Boosted deep belief network (DBN) as base classifiers for facial expression recognition
Decision trees as base classifiers for binary class classification problems
Decision trees as base classifiers for multiclass classification problems
Boosting based CNN with incremental approach for facial action unit recognition
Boosted CNN
Deep boosting for image denoising with dense connections
Deep boosting for image restoration and image denoising
Ensemble of CNN and boosted forest for edge detection, object proposal generation, pedestrian and face detection
CNN Boosting applied to bacterial cell images and crowd counting
Boosted deep independent embedding model for online scenarios
Hierarchical boosted deep metric learning with hierarchical label embedding
Transfer learning based deep incremental boosting
Snapshot boosting

4. Stacking Strategy

4.1 Theoretical Overview

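An ensemble can be formed either by combining the outputs of several base models in some way or by using some method to select the "best" base model. Stacking is an ensemble technique in which a meta-learning model is used to integrate the outputs of the base models. If the final decision stage is a linear model, stacking is often called "model blending" or simply "blending". The concept of stacking, or stacked regression, traces back to Wolpert's stacked generalization. In this technique the dataset is randomly split into J equal parts. In the j-th fold of cross-validation, one part is held out for testing and the remaining parts are used for training. From these train/test subsets we obtain the predictions of the different learning models, and these predictions serve as meta-data for building the meta-model. The meta-model makes the final prediction, which is also referred to as a winner-takes-all strategy.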
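The fold-wise procedure can be sketched as below. This is a minimal illustration under stated assumptions: the base models (a shallow tree and a ridge regression), the linear meta-model, the five folds, and the helper names `stacking_fit` and `stacking_predict` are all choices made for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeRegressor


def stacking_fit(X, y, base_factories, n_folds=5, seed=0):
    """Collect out-of-fold base predictions, then fit a linear meta-model on them."""
    oof = np.zeros((len(X), len(base_factories)))   # meta-data: one column per base model
    kf = KFold(n_splits=n_folds, shuffle=True, random_state=seed)
    for train_idx, test_idx in kf.split(X):
        for j, make_model in enumerate(base_factories):
            model = make_model().fit(X[train_idx], y[train_idx])
            oof[test_idx, j] = model.predict(X[test_idx])
    meta = LinearRegression().fit(oof, y)
    # Refit every base model on all data for use at prediction time.
    fitted = [make_model().fit(X, y) for make_model in base_factories]
    return fitted, meta


def stacking_predict(fitted, meta, X):
    """Feed the base models' outputs into the meta-model."""
    base_out = np.column_stack([m.predict(X) for m in fitted])
    return meta.predict(base_out)


# Hypothetical toy regression data.
rng = np.random.default_rng(1)
X_toy = rng.normal(size=(150, 3))
y_toy = X_toy[:, 0] ** 2 + 2.0 * X_toy[:, 1] + rng.normal(scale=0.1, size=150)

factories = [
    lambda: DecisionTreeRegressor(max_depth=3, random_state=0),
    lambda: Ridge(alpha=1.0),
]
fitted, meta = stacking_fit(X_toy, y_toy, factories)
stack_pred = stacking_predict(fitted, meta, X_toy)
```

Training the meta-model only on out-of-fold predictions keeps it from seeing base-model outputs on rows those base models were trained on.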

4.2 Related Papers

Combining Estimates in Regression and Classification
Deep convex net: A scalable architecture for speech pattern classification
Scalable stacking and learning for building deep architectures
Use of kernel deep convex networks and end-to-end learning for spoken language understanding
Random features for Kernel Deep Convex Network
A framework for parameter estimation and model selection in kernel deep stacking networks
A deep architecture with bilinear modeling of hidden representations: Applications to phonetic recognition
Tensor deep stacking networks
Sparse deep stacking network for image classification
Sparse deep stacking network for fault diagnosis of motor
Visual representation and classification by learning group sparse deep stacking network
Grasp for stacking via deep reinforcement learning
Particle swarm optimisation for evolving deep neural networks for image classification by evolving and stacking transferable blocks
Deep stacked hierarchical multi-patch network for image deblurring

5. Python Ensemble Example

5.1 Loading the Libraries

First we need to load the libraries. We restrict ourselves to well-established ones: pandas, numpy, and scikit-learn.

    import numpy as np
    import pandas as pd
    from sklearn.tree import DecisionTreeRegressor

5.2 Loading the Data and Hard-Coding the Labels

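Next, in this example we have one DataFrame of inputs and one of targets. In addition, we explicitly store the names of the input and target columns in the variables input_tags and target_tags, respectively.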

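    inputs = pd.read_csv("/kaggle/input/sound-the-alarm2/public.csv")  # inputs
    targets = pd.read_csv("/kaggle/input/sound-the-alarm2/public.targets.csv")  # targets
    # input_tags and target_tags (lists of column names) are hard-coded at this point;
    # their values are not listed in this article.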

5.3 Preprocessing and Train/Test Split

The preprocessing function aggregates the unstructured data into time windows and labels.

    def rolling(df, window, step):
        # yield (offset, slice) pairs for a sliding window over the rows of df
        count = 0
        df_length = len(df)
        while count < (df_length - window):
            yield count, df[count:window + count]
            count += step


    def preprocess(alarms, labels=None):
        alarms = alarms[alarms.major_down_time == False].drop(columns=['major_down_time'])
        # count frequencies
        t = alarms.groupby(['day', 'ag']).count().rename(columns={'date': 'freq'}).reset_index()
        # add an empty row for all columns so we always get the same shape output
        # (pd.concat replaces DataFrame.append, which was removed in pandas 2.0)
        t = pd.concat([t, pd.DataFrame({
            'day': [pd.to_datetime(alarms.date.values[0]).date()] * len(input_tags),
            'freq': [0] * len(input_tags),
            'ag': input_tags})])
        X = pd.pivot_table(t, values='freq', columns='ag', index='day', aggfunc=np.sum).reset_index()
        # ensure the columns are in the same order
        X = X[['day'] + input_tags]
        x = dict()
        # For model input we will take 30 days of history for every row, i.e. 3 dimensions
        # (sample_day, history date, column)
        # then flatten to 2 dimensions using date
        # (sample day, datecolumn)
        for offset, window in rolling(X, 30, 1):
            # prepare the X input
            d = window.tail(1).day.values[0]
            if d in labels[labels['window'] == '7 day'].date.values:  # make sure we have a label for the date
                x[d] = window.drop(columns=['day']).fillna(0)
        inputs = [x[y] for y in x]
        inp = np.array(inputs)
        X = inp.reshape((inp.shape[0], inp.shape[1] * inp.shape[2]))  # flatten to one row per day
        target1 = labels[(labels['window'] == '7 day') & labels.date.isin(list(x))].fillna(0)
        target2 = labels[(labels['window'] == '8-14 day') & labels.date.isin(list(x))].fillna(0)
        target3 = labels[(labels['window'] == '15-21 day') & labels.date.isin(list(x))].fillna(0)
        return X, x.keys(), target1[target_tags], target2[target_tags], target3[target_tags]
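The window-then-flatten step is easier to follow on a tiny table. The sketch below uses a hypothetical stand-in helper, `rolling_windows`, a six-row toy DataFrame, and a window of 3 rows instead of 30. It assumes the loop condition is a strict less-than comparison, in which case the final full window is not emitted.

```python
import numpy as np
import pandas as pd


def rolling_windows(frame, window, step):
    """Yield (offset, slice) pairs; a hypothetical stand-in for the rolling helper."""
    start = 0
    while start < len(frame) - window:   # strict comparison: an assumption
        yield start, frame[start:start + window]
        start += step


toy = pd.DataFrame({"a": range(6), "b": range(10, 16)})
windows = [w.to_numpy() for _, w in rolling_windows(toy, 3, 1)]

stacked = np.array(windows)                   # shape (samples, history, columns)
flat = stacked.reshape(stacked.shape[0], -1)  # shape (samples, history * columns)
```

Each row of `flat` lists the history day by day, so the first row is `[0, 10, 1, 11, 2, 12]`: day 0 of both columns, then day 1, then day 2.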

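In this example the data depend strongly on the time dimension, so splitting them randomly into training and test sets is not good practice. Instead, we put the earliest 80% of the observations into the training set and the most recent 20% into the test set.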

    X, dates, y1, y2, y3 = preprocess(inputs, targets)
    X_train, X_test = X[:int(X.shape[0] * 0.8)], X[int(X.shape[0] * 0.8):]
    dates_train, dates_test = list(dates)[:int(len(dates) * 0.8)], list(dates)[int(len(dates) * 0.8):]

5.4 The Ensemble Model

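To predict the labels for the three targets, we create an ensemble class that fits a separate decision tree model for each one and combines the individual predictions.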

    class EnsembleModel:
        def __init__(self):
            # one decision tree per prediction window
            self.models = dict()
            self.models['model1'] = DecisionTreeRegressor(random_state=1)
            self.models['model2'] = DecisionTreeRegressor(random_state=1)
            self.models['model3'] = DecisionTreeRegressor(random_state=1)

        def fit1(self, X, y):
            self.models['model1'].fit(X, y)

        def fit2(self, X, y):
            self.models['model2'].fit(X, y)

        def fit3(self, X, y):
            self.models['model3'].fit(X, y)

        def _predict(self, model_name, inp_X, dates):
            preds = self.models[model_name].predict(inp_X)
            preds = pd.DataFrame(dict(zip(target_tags, preds.T)))
            preds['date'] = dates
            return preds

        def _predict_all(self, inp_X, dates):
            p1 = self._predict('model1', inp_X, dates)
            p2 = self._predict('model2', inp_X, dates)
            p3 = self._predict('model3', inp_X, dates)
            p1['window'] = ['7 day'] * len(p1)
            p2['window'] = ['8-14 day'] * len(p2)
            p3['window'] = ['15-21 day'] * len(p3)
            return pd.concat([p1, p2, p3]).reset_index(drop=True)

        def predict(self, inp_X, dates):
            return self._predict_all(inp_X, dates)

5.5 Computing Predictions

In this section we initialize the ensemble model and fit each individual model in order to produce predictions.

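    models = EnsembleModel()
    n_train = len(X_train)  # target rows must line up with X_train (167 in the original run)
    models.fit1(X_train, y1[:n_train])
    models.fit2(X_train, y2[:n_train])
    models.fit3(X_train, y3[:n_train])
    pred = models.predict(X_test, dates_test)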

In addition, we build a scoring function that penalizes inaccurate predictions.

    def scoring(gt, pred):
        gt['date'] = pd.to_datetime(gt.date)
        pred['date'] = pd.to_datetime(pred.date)
        gt = gt.set_index(['window', 'date'])
        pred = pred.set_index(['window', 'date'])
        # align ground truth and predictions on (window, date)
        m = gt.join(pred, how='inner', rsuffix='_pred')
        p_cols = [c + "_pred" for c in target_tags]
        gt = m[target_tags].values
        pred = m[p_cols].values
        correct = np.bitwise_and(gt > 0, pred > 0).sum()     # positives predicted as positive
        incorrect = np.bitwise_and(gt == 0, pred > 0).sum()  # false alarms
        return correct - (incorrect / 2)

    scoring(targets, pred)
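A small worked example may help show what this score rewards. It assumes the rule counts a value as positive when it is greater than zero. The arrays `gt_toy` and `pred_toy` are made-up numbers for illustration, not data from the competition.

```python
import numpy as np

# Hypothetical ground truth and predictions: 3 dates x 2 target columns.
gt_toy = np.array([[1, 0],
                   [0, 0],
                   [2, 1]])
pred_toy = np.array([[1, 1],
                     [0, 1],
                     [0, 3]])

correct = np.bitwise_and(gt_toy > 0, pred_toy > 0).sum()     # true positives
incorrect = np.bitwise_and(gt_toy == 0, pred_toy > 0).sum()  # false alarms
score = correct - incorrect / 2
```

Here two positives are predicted correctly and two false alarms cost half a point each, so the score is 1.0. The missed event in the last row (true value 2, predicted 0) is neither rewarded nor penalized by this rule.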

5.6 Summary

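In this example we showed how to combine machine learning components into a single pipeline. We used a preprocessing function that cleans and aggregates the data. We then built an ensemble model that fits an individual model to each of the three targets and predicts all labels. Finally, we scored the predictions with a custom scoring function. Areas for improvement include adding more features and improving the predictive models. Furthermore, we could use cross-validators (among other tools) to estimate the necessary features and hyperparameters.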
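As one possible way to act on that last suggestion, the sketch below tunes a decision tree with scikit-learn's `GridSearchCV` and a `TimeSeriesSplit` cross-validator, which keeps each validation fold later in time than its training fold. The choice of these two tools, the parameter grid, and the synthetic data are assumptions made for illustration, not necessarily what the original workflow had in mind.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit
from sklearn.tree import DecisionTreeRegressor

# Hypothetical time-ordered data: 120 rows, 5 features.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(120, 5))
y_demo = 2.0 * X_demo[:, 0] + rng.normal(scale=0.1, size=120)

search = GridSearchCV(
    estimator=DecisionTreeRegressor(random_state=1),
    param_grid={"max_depth": [2, 4, 8], "min_samples_leaf": [1, 5]},
    cv=TimeSeriesSplit(n_splits=4),    # validation folds always follow training folds
    scoring="neg_mean_squared_error",
)
search.fit(X_demo, y_demo)
best_params = search.best_params_
```

After fitting, `search.best_estimator_` holds a tree refit on all of the data with the selected hyperparameters.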