Choose a model that fits the data:

- thoroughly understand the characteristics of the data
- thoroughly understand the strengths and weaknesses of each model/algorithm family
- shortlist m reasonable candidate approaches, experiment and validate, then pick the single best one

On the data side this means its characteristics, scale, and order of magnitude; on the model side, complexity, amount of computation, computational cost, performance, efficiency, and accuracy.
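The "shortlist m candidates, validate, pick the best" step above can be sketched with cross-validation. This is a minimal illustration on synthetic data (`make_regression` stands in for a real tabular dataset; the candidate list and fold count are assumptions, not the author's setup):

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in for a tabular regression dataset (hypothetical data)
X, y = make_regression(n_samples=500, n_features=13, noise=10.0, random_state=0)

# m candidate approaches to compare
candidates = {
    "decision_tree": DecisionTreeRegressor(random_state=0),
    "knn": KNeighborsRegressor(n_neighbors=5),
    "random_forest": RandomForestRegressor(random_state=0),
}

# Score every candidate with 5-fold cross-validated MSE, then keep the best
scores = {}
for name, model in candidates.items():
    mse = -cross_val_score(model, X, y, cv=5,
                           scoring="neg_mean_squared_error").mean()
    scores[name] = mse

best = min(scores, key=scores.get)
print(best, scores[best])
```

Cross-validation averages the error over several splits, so the comparison is less sensitive to one lucky train/test split than a single hold-out run.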
Boston housing via the tpf wrapper

```python
from tpf.datasets import load_boston

X_train, y_train, X_test, y_test = load_boston(split=True, test_size=0.15, reload=False)
X_train.shape, y_train.shape, X_test.shape, y_test.shape
# ((430, 13), (430,), (76, 13), (76,))
```

Drop features whose importance falls below 0.01

```python
from sklearn.tree import DecisionTreeRegressor
import numpy as np

dtr = DecisionTreeRegressor()
dtr.fit(X_train, y_train)
feature = dtr.feature_importances_

# Feature indices, most important first
a = np.argsort(feature)[::-1]

# Keep only the top-6 feature columns (those clearing the 0.01
# importance threshold here); the labels stay untouched
X_train = X_train[:, a[:6]]
X_test = X_test[:, a[:6]]
```

The original official loader (deprecated and since removed from scikit-learn, kept here for reference)

```python
# An algorithm is worth little without a dataset to test it on
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split

X, y = load_boston(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
```
```python
from sklearn.tree import DecisionTreeRegressor

dtr = DecisionTreeRegressor()
dtr.fit(X_train, y_train)
y_pred_dtr = dtr.predict(X_test)
((y_pred_dtr - y_test) ** 2).mean()
# 19.292105263157893
```

Training is not deterministic; the result changes from run to run:

```python
from sklearn.tree import DecisionTreeRegressor

dtr = DecisionTreeRegressor()
dtr.fit(X_train, y_train)
y_pred_dtr = dtr.predict(X_test)
((y_pred_dtr - y_test) ** 2).mean()
# 17.45394736842105
```

Drop the columns whose importance is below 0.01 and retrain:

```python
import numpy as np

feature = dtr.feature_importances_
a = np.argsort(feature)[::-1]
a
# array([ 5, 12,  7,  0,  4,  6, 10,  9, 11,  2,  1,  8,  3])

# Keep only the top-6 feature columns; the labels stay untouched
X_train = X_train[:, a[:6]]
X_test = X_test[:, a[:6]]

dtr = DecisionTreeRegressor()
dtr.fit(X_train, y_train)
y_pred_dtr = dtr.predict(X_test)
((y_pred_dtr - y_test) ** 2).mean()
# 14.103333333333325
```

Using the trainer

```python
from sklearn.tree import DecisionTreeRegressor
from tpf import MlTrain

model = DecisionTreeRegressor()
# save_path: a .pkl path defined elsewhere
MlTrain.train(X_train, y_train, X_test, y_test, model, save_path=save_path, epoch=1000)
# 2.994999999999996

from tpf import pkl_load, pkl_save

model, loss = pkl_load(file_path=save_path)
y_pred_dtr = model.predict(X_test)
((y_pred_dtr - y_test) ** 2).mean()
# 2.994999999999996
```
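The manual argsort-and-slice above can also be expressed with scikit-learn's `SelectFromModel`, which implements the "drop features with importance below 0.01" rule directly. A minimal sketch on synthetic data (`make_regression` is a stand-in; the shapes are illustrative, not the housing set's):

```python
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor
from sklearn.feature_selection import SelectFromModel

# Hypothetical data standing in for the housing features;
# only 5 of the 13 columns actually carry signal
X, y = make_regression(n_samples=500, n_features=13, n_informative=5,
                       noise=5.0, random_state=0)

# Fit a tree, then keep only the columns whose importance is >= 0.01
selector = SelectFromModel(DecisionTreeRegressor(random_state=0),
                           threshold=0.01)
X_selected = selector.fit_transform(X, y)
print(X.shape, "->", X_selected.shape)  # fewer columns, same number of rows
```

`selector.transform(X_test)` applies the same column mask to the test set, which avoids the easy mistake of slicing rows instead of columns by hand.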
```python
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor

knn_reg = KNeighborsRegressor(n_neighbors=5)
knn_reg.fit(X_train, y_train)
y_pred_knn = knn_reg.predict(X_test)
((y_pred_knn - y_test) ** 2).mean()
# 21.96151578947368
```

Using the features selected by the decision tree brings no meaningful improvement for KNN:

```python
# 21.747733333333333
```
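One plausible reason feature selection barely moved KNN: KNN is driven by raw Euclidean distances, so columns with large numeric ranges dominate the metric, and scaling the features usually matters more than dropping some of them. A minimal sketch of that idea on synthetic data (`make_regression` is a stand-in; the pipeline, not the numbers, is the point):

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsRegressor

# Hypothetical data in place of the housing set
X, y = make_regression(n_samples=500, n_features=13, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Standardize every column to zero mean / unit variance before the
# distance computation, so no single feature dominates the metric
knn = make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=5))
knn.fit(X_train, y_train)
mse = ((knn.predict(X_test) - y_test) ** 2).mean()
print(mse)
```

The pipeline fits the scaler on the training split only and reuses those statistics at predict time, which keeps the test set untouched during preprocessing.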
```python
from sklearn.ensemble import RandomForestRegressor

rfr = RandomForestRegressor()
rfr.fit(X_train, y_train)
y_pred_rfr = rfr.predict(X_test)
((y_pred_rfr - y_test) ** 2).mean()
# 12.495555855263165
```

Using the features selected by the decision tree gives the random forest a clear improvement:

```python
from sklearn.tree import DecisionTreeRegressor
import numpy as np

dtr = DecisionTreeRegressor()
dtr.fit(X_train, y_train)
feature = dtr.feature_importances_
a = np.argsort(feature)[::-1]

# Keep only the top-6 feature columns; the labels stay untouched
X_train = X_train[:, a[:6]]
X_test = X_test[:, a[:6]]

from sklearn.ensemble import RandomForestRegressor

rfr = RandomForestRegressor()
rfr.fit(X_train, y_train)
y_pred_rfr = rfr.predict(X_test)
((y_pred_rfr - y_test) ** 2).mean()
# 7.368401833333395
```

Using the trainer

```python
from sklearn.ensemble import RandomForestRegressor
from tpf import MlTrain

model = RandomForestRegressor()
MlTrain.train(X_train, y_train, X_test, y_test, model,
              save_path="/media/xt/tpf/tpf/aitpf/source/models/fangjia_RandomForestRegressor.pkl",
              epoch=3000, loss_break=0.1)
# loss_start: 5.198032166666702
# 5.0362105000000055
```
Using the trainer

```python
from sklearn.tree import DecisionTreeRegressor
import numpy as np

dtr = DecisionTreeRegressor()
dtr.fit(X_train, y_train)
feature = dtr.feature_importances_
a = np.argsort(feature)[::-1]

# Keep only the top-6 feature columns; the labels stay untouched
X_train = X_train[:, a[:6]]
X_test = X_test[:, a[:6]]

from sklearn.svm import SVC, SVR
from tpf import MlTrain

model = SVR()
MlTrain.train(X_train, y_train, X_test, y_test, model,
              save_path="/media/xt/tpf/tpf/aitpf/source/models/fangjia_SVR.pkl",
              epoch=3000, loss_break=0.1)
# loss_start: 9.564113280675924
```

What stands out about SVM

SVM is fast: 3000 rounds finished in roughly five seconds.
- I assumed the code was broken,
- since every previous algorithm made me wait a while for 3000 rounds,
- and I went over the code several times before trusting it.

SVM is extremely stable:
- every run returns the same result
- all 3000 rounds returned 9.564113280675924
- unlike the other algorithms, which fluctuate somewhat from run to run
- which means a trainer loop adds little for SVM: given the same data, the prediction barely changes

The mathematics is very hard:
- the theory behind SVM is probably the most complex and most difficult among the common machine-learning algorithms

SVM is merely extremely stable; it does not mean the result on the same dataset is 100% unchanging:

```python
from tpf import MlTrain

MlTrain.train(X_train, y_train, X_test, y_test, model,
              save_path="fangjia_SVR.pkl",
              epoch=300000, loss_break=0.1)
# loss_start: 6.955530122835717
```

There was also a result around 5.x that a code flaw failed to save: the original trainer took the first run as the baseline and only saved later runs that beat it. SVM's first run turned out to be its best and never changed afterwards, so when the 5.x result finally appeared, it was never saved. Like a game that is over at the opening move: the first shot decides the whole board. That different result only showed up when I tried again the next day.
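The stability observation above can be checked directly: with identical training arrays, scikit-learn's `SVR` (backed by the libsvm solver) produces identical predictions on repeated fits, so run-to-run differences usually come from the data changing (e.g. a reloaded split), not from the solver. A minimal sketch on synthetic data (`make_regression` is an assumed stand-in for the housing set):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.svm import SVR

# Fixed synthetic data: both fits see exactly the same arrays
X, y = make_regression(n_samples=200, n_features=13, noise=5.0, random_state=0)

# Fit the same SVR twice on identical data; the solver has no random
# component, so the two prediction vectors should match exactly
pred1 = SVR().fit(X, y).predict(X)
pred2 = SVR().fit(X, y).predict(X)
print(np.array_equal(pred1, pred2))
```

This is why a retrain-and-keep-best loop adds nothing for SVR on a fixed dataset, and why the 5.x result only appeared once the underlying data differed.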