Problems When Implementing K-Fold Cross-Validation in Python, and the Difference Between KFold and StratifiedKFold

Published: 2021-12-04 09:09:42. Source: Yisu Cloud (亿速云). Views: 207. Author: 柒染. Column: Cloud Computing.

This article covers problems that come up when implementing K-fold cross-validation in Python and explains how KFold and StratifiedKFold differ. The material is practical, so it is shared here for reference.

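How the data is split into training and test sets strongly affects the final model and its parameter values. K-fold cross-validation is generally used for model tuning, that is, for finding the hyperparameter values that give the best generalization performance, and it also serves to assess the performance of the current model or algorithm.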
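When k is large, more data is used for training in each iteration, which gives the lowest bias in the performance estimate. The cost is a longer runtime, because the model is fit more times, and the estimate can have higher variance because the training sets overlap heavily.
When k is small, the computational cost of repeatedly refitting and evaluating the model on different folds drops, and the averaged score can still give an accurate estimate of model performance, particularly when the dataset is large.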
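The trade-off above can be made concrete by counting how many fits a given k requires and how much data each fit sees. The following is a minimal sketch, not part of the original article: `fold_sizes` is a hypothetical helper, and the sample count of 100 is an assumed value chosen for illustration. It mirrors the usual convention that the first `n_samples % k` folds receive one extra sample.

```python
def fold_sizes(n_samples, k):
    """Return a list of (train_size, test_size) pairs, one per fold."""
    base, extra = divmod(n_samples, k)
    sizes = []
    for i in range(k):
        # The first `extra` folds hold one more test sample than the rest.
        test_size = base + (1 if i < extra else 0)
        sizes.append((n_samples - test_size, test_size))
    return sizes


n_samples = 100  # assumed dataset size, for illustration only
for k in (2, 5, 10):
    train_size, test_size = fold_sizes(n_samples, k)[0]
    print(f"k={k:>2}: {k} fits, {train_size} training samples per fit, "
          f"{test_size} held out")
```

A larger k means more fits, each on a larger share of the data; a smaller k means fewer fits on less data.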

Two-fold implementation code

The following module is typically used:

from sklearn.model_selection import KFold,StratifiedKFold

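StratifiedKFold parameters:

class sklearn.model_selection.StratifiedKFold(n_splits=5, shuffle=False, random_state=None)

n_splits: the number of folds. (Some older scikit-learn releases displayed the default as 'warn' while it was being changed from 3 to 5.)
shuffle: when True, the samples of each class are shuffled before being split into batches. This is what 5x2 cross-validation (five repetitions of two-fold) relies on, so that each repetition sees differently shuffled data; otherwise every repetition gets the same split.
random_state: controls the randomness, that is, the seed used by the random number generator. It only has an effect when shuffle=True.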
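To illustrate the "five repetitions of two-fold" use of shuffle, here is a minimal sketch. The toy arrays `X` and `y` and the choice of using the repetition number as the seed are assumptions made for this example, not something the original article specifies.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy data (hypothetical): 8 samples, 4 per class.
X = np.arange(16).reshape(8, 2)
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

splits = []
for repetition in range(5):
    # A different seed per repetition gives a different shuffle each time,
    # while keeping the whole experiment reproducible.
    skf = StratifiedKFold(n_splits=2, shuffle=True, random_state=repetition)
    for train_index, test_index in skf.split(X, y):
        splits.append((repetition, train_index, test_index))
        print(repetition, train_index, test_index, y[test_index])
```

Every test fold contains two samples of each class. scikit-learn also provides RepeatedStratifiedKFold, which packages the same repetition pattern in a single splitter.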

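Two points to note:
1. kf.split(x) returns indices into the dataset; use x[train_index] to extract the actual data.
2. With shuffle=True (shuffle as in shuffling cards), the indices drawn differ on every run of the code unless random_state is fixed. With shuffle=False, the indices stay the same.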

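    import numpy as np
    from sklearn.model_selection import KFold, StratifiedKFold

    x = np.array([[1, 1], [2, 2], [3, 3], [4, 4], [5, 5], [6, 6]])
    # No random_state is set, so the split changes from run to run.
    kf = KFold(n_splits=2, shuffle=True)
    for train_index, test_index in kf.split(x):
        print('train_index:', train_index)
        print("train_data:", x[train_index])
        print('test_index', test_index)
        print("-------- divider: with two folds, the test set becomes the training set --------")

Output from one run (the indices vary between runs because shuffle=True):

    train_index: [1 2 3]
    train_data: [[2 2]
     [3 3]
     [4 4]]
    test_index [0 4 5]
    -------- divider: with two folds, the test set becomes the training set --------
    train_index: [0 4 5]
    train_data: [[1 1]
     [5 5]
     [6 6]]
    test_index [1 2 3]
    -------- divider: with two folds, the test set becomes the training set --------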

The difference between KFold and StratifiedKFold

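"Stratified" refers to stratified sampling: it ensures that the proportion of each class in the training and test sets matches that of the original dataset.
In the example below, 6 samples carry 6 labels. Splitting into three folds means each round uses 4 samples for training and 2 for testing.

StratifiedKFold keeps the class proportions the same as in the original dataset, so a skewed split like the following cannot occur:
train_index=[0,1,2,3] train_label=[1,1,1,0]
test_index=[4,5] test_label=[0,0]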

    import numpy as np
    from sklearn.model_selection import KFold, StratifiedKFold

    x = np.array([[1, 1], [2, 2], [3, 3], [4, 4], [5, 5], [6, 6]])
    y = np.array([1, 1, 1, 0, 0, 0])
    kf = StratifiedKFold(n_splits=3, shuffle=True)
    # The labels y must be passed so the split can be stratified by class.
    for train_index, test_index in kf.split(x, y):
        print('train_index:', train_index)
        print('test_index', test_index)
        print("-------- divider between folds --------")

Output from one run (the indices vary between runs because shuffle=True):

    train_index: [0 1 4 5]
    test_index [2 3]
    -------- divider between folds --------
    train_index: [0 2 3 5]
    test_index [1 4]
    -------- divider between folds --------
    train_index: [1 2 3 4]
    test_index [0 5]
    -------- divider between folds --------
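To see the difference side by side, the following sketch runs both splitters on the same six labels without shuffling and collects the labels that land in each test fold. The variable names `kfold_labels` and `strat_labels` are introduced here for illustration only.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

x = np.array([[1, 1], [2, 2], [3, 3], [4, 4], [5, 5], [6, 6]])
y = np.array([1, 1, 1, 0, 0, 0])

# Plain KFold ignores the labels and cuts the data into contiguous blocks.
kfold_labels = [y[test_index].tolist()
                for _, test_index in KFold(n_splits=3).split(x)]

# StratifiedKFold uses y to keep the class ratio in every fold.
strat_labels = [y[test_index].tolist()
                for _, test_index in StratifiedKFold(n_splits=3).split(x, y)]

print("KFold test labels:          ", kfold_labels)
print("StratifiedKFold test labels:", strat_labels)
```

With plain KFold the first test fold holds only class 1 and the last only class 0, while every StratifiedKFold test fold holds one sample of each class.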

random_state

Why is a parameter like random_state needed?

1. When building a model:

    forest = RandomForestClassifier(n_estimators=100, random_state=0)
    forest.fit(X_train, y_train)

2. When generating a dataset:

    X, y = make_moons(n_samples=100, noise=0.25, random_state=3)

3. When splitting a dataset into training and test sets:

    X_train, X_test, y_train, y_test = train_test_split(cancer.data, cancer.target, stratify=cancer.target, random_state=42)

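If random_state is not set, every run builds a different model, generates a different dataset, and produces a different train/test split. Whether to set it therefore depends on whether you need reproducible results.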
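The reproducibility claim can be verified with a short check. This is a sketch under assumptions: it reuses make_moons and train_test_split with the seeds quoted above, but stratifies on the generated labels rather than on the cancer dataset, and the names `same_data` and `same_split` are introduced here for illustration.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split

# Generating the dataset twice with the same seed gives identical arrays.
X1, y1 = make_moons(n_samples=100, noise=0.25, random_state=3)
X2, y2 = make_moons(n_samples=100, noise=0.25, random_state=3)
same_data = np.array_equal(X1, X2) and np.array_equal(y1, y2)

# Splitting twice with the same seed gives identical train/test sets.
split_a = train_test_split(X1, y1, stratify=y1, random_state=42)
split_b = train_test_split(X1, y1, stratify=y1, random_state=42)
same_split = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))

print("same data:", same_data)
print("same split:", same_split)
```

Dropping the random_state arguments makes both comparisons depend on the global random state, so they will generally no longer hold.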

That covers the problems that arise when implementing K-fold cross-validation in Python and the difference between KFold and StratifiedKFold. Some of these points are likely to come up in day-to-day work.
