ML
scikit-learn¶
- Basic sklearn functions
fit
Train (fit) a model
predict
Predict on new data with a trained model
transform
Apply a feature transformation (mainly for preprocessing / dimensionality reduction / feature selection)
fit_transform
Equivalent to fit + transform: fit and transform in one call
score
Model evaluation; by default accuracy for classifiers, R^2 for regressors
datasets.load_*
Load a built-in dataset
train_test_split
Split into training and test sets
sklearn.model_selection.train_test_split(*arrays, test_size=None, train_size=None, random_state=None, shuffle=True, stratify=None)
- Evaluation metrics such as accuracy_score and confusion_matrix, plus
classification_report(y_test, y_pred)
for a per-class classification report (all demonstrated in the sketch below)
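A minimal sketch tying these pieces together on the built-in Iris data (the estimator and split values here are illustrative; any sklearn estimator follows the same fit/predict/score pattern):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
clf = LogisticRegression(max_iter=200)
clf.fit(X_train, y_train)                     # fit: train the model
y_pred = clf.predict(X_test)                  # predict: label new data
print(clf.score(X_test, y_test))              # score: accuracy for classifiers
print(classification_report(y_test, y_pred))  # per-class precision/recall/F1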
- Datasets
Built-in: Iris, diabetes, breast cancer ...
CSV
import pandas as pd
from sklearn.model_selection import train_test_split

# Read the CSV file
df = pd.read_csv("data.csv")

# Assume the last column is the label
X = df.iloc[:, :-1].values  # features
y = df.iloc[:, -1].values   # labels

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
Excel
import pandas as pd

df = pd.read_excel("data.xlsx")
X = df.drop("label", axis=1).values
y = df["label"].values
Images
from sklearn.datasets import load_files

data = load_files("dataset/", load_content=False)
X, y = data["filenames"], data["target"]
print(X[:5])  # image file paths
print(y[:5])  # labels
Supervised Learning¶
LinearRegression
from sklearn.linear_model import LinearRegression
LogisticRegression
from sklearn.linear_model import LogisticRegression
DecisionTree
from sklearn.tree import DecisionTreeClassifier
RandomForest
from sklearn.ensemble import RandomForestClassifier
- SVC (kernel functions, parameters C and gamma)
from sklearn.svm import SVC
kernel: kernel type ('linear' for a linear kernel, 'rbf' for the Gaussian kernel, etc.)
C: penalty strength; the larger C is, the less misclassification is tolerated
gamma: kernel coefficient (only used by 'rbf'/'poly' kernels); see the sketch below
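A minimal sketch of how kernel, C, and gamma are passed to SVC (the specific values are illustrative, not tuned):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Larger C penalizes misclassification harder; gamma controls how local the RBF kernel is
for C, gamma in [(0.1, "scale"), (1.0, "scale"), (10.0, 0.1)]:
    clf = SVC(kernel="rbf", C=C, gamma=gamma).fit(X_train, y_train)
    print(f"C={C}, gamma={gamma}: test accuracy = {clf.score(X_test, y_test):.3f}")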
- MLPClassifier multilayer perceptron (hidden layers, number of iterations)
class sklearn.neural_network.MLPClassifier(hidden_layer_sizes=(100,), activation='relu', *,
solver='adam', alpha=0.0001, batch_size='auto', learning_rate='constant', learning_rate_init=0.001,
power_t=0.5, max_iter=200, shuffle=True, random_state=None, tol=0.0001, verbose=False, warm_start=False,
momentum=0.9, nesterovs_momentum=True, early_stopping=False, validation_fraction=0.1, beta_1=0.9,
beta_2=0.999, epsilon=1e-08, n_iter_no_change=10, max_fun=15000)
GaussianNB
from sklearn.naive_bayes import GaussianNB
MultinomialNB
from sklearn.naive_bayes import MultinomialNB
Unsupervised Learning¶
- KMeans
from sklearn.cluster import KMeans
- DBSCAN
from sklearn.cluster import DBSCAN
- Hierarchical clustering
from sklearn.cluster import AgglomerativeClustering
- Dimensionality reduction (PCA, t-SNE; imports below)
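from sklearn.decomposition import PCA
from sklearn.manifold import TSNE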
Model Optimization and Advanced Topics¶
- Standardization: scale each feature column to zero mean and unit standard deviation
from sklearn.preprocessing import StandardScaler
- Normalization: rescale values into a fixed range
from sklearn.preprocessing import MinMaxScaler
- Missing values: fill, drop, or interpolate
from sklearn.impute import SimpleImputer
- Feature selection: pick a subset of a high-dimensional feature set, keeping the features most useful for the prediction target and dropping redundant or irrelevant ones (a complete example appears in the final cell below).
Filter methods: score features with statistical measures (e.g. variance, correlation coefficients), independent of any particular model.
Wrapper methods: evaluate feature subsets by the performance of a specific model (e.g. recursive feature elimination, RFE).
Embedded methods: select features during model training (e.g. the L1 regularization of Lasso regression).
- Grid search (GridSearchCV)
- Randomized search (RandomizedSearchCV); neither appears in the cells below, so a sketch of both follows
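A minimal sketch of both searches on the Iris data, assuming scipy is available for the sampling distributions; the grids and ranges are illustrative, not tuned:

from scipy.stats import loguniform
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Grid search: exhaustively evaluate every combination with 5-fold CV
grid = GridSearchCV(SVC(), {"C": [0.1, 1, 10], "gamma": ["scale", 0.1, 1]}, cv=5)
grid.fit(X, y)
print("grid search best:", grid.best_params_, grid.best_score_)

# Randomized search: sample a fixed number of combinations from distributions
rand = RandomizedSearchCV(
    SVC(),
    {"C": loguniform(1e-2, 1e2), "gamma": loguniform(1e-4, 1e1)},
    n_iter=20, cv=5, random_state=42,
)
rand.fit(X, y)
print("randomized search best:", rand.best_params_, rand.best_score_)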
Comprehensive Practice¶
- Classification task: MNIST handwritten digit recognition
- Regression task: house price prediction (Boston / California housing; note that load_boston was removed in scikit-learn 1.2, so prefer fetch_california_housing)
- Complete workflow on an open dataset (a sketch follows)
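A minimal end-to-end sketch for the regression task, assuming fetch_california_housing can download the data on first use:

from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# 1. Load the data
X, y = fetch_california_housing(return_X_y=True)

# 2. Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 3. Chain preprocessing and the model into one pipeline
model = make_pipeline(StandardScaler(), LinearRegression())
model.fit(X_train, y_train)

# 4. Evaluate on the held-out test set
y_pred = model.predict(X_test)
print("R^2:", r2_score(y_test, y_pred))
print("MSE:", mean_squared_error(y_test, y_pred))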
In [1]:
# Multivariate linear regression
import matplotlib.pyplot as plt
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
diabetes = load_diabetes()
print(diabetes.feature_names)
# ['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']
X = diabetes.data  # 10 features
y = diabetes.target  # target variable
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print("回归系数 (coef):", model.coef_)
print("截距 (intercept):", model.intercept_)
print("R^2:", r2_score(y_test, y_pred))
print("MSE:", mean_squared_error(y_test, y_pred))
plt.rcParams["font.sans-serif"] = ["SimHei"]  # CJK-capable font
plt.rcParams["axes.unicode_minus"] = False  # render the minus sign with a glyph SimHei has
plt.scatter(y_test, y_pred, color="blue", edgecolors="k", alpha=0.7)
plt.plot([y_test.min(), y_test.max()],
[y_test.min(), y_test.max()],
"r--", lw=2, label="理想预测")
plt.xlabel("真实值")
plt.ylabel("预测值")
plt.title("多维线性回归 - 糖尿病数据集")
plt.legend()
plt.show()
['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']
Regression coefficients (coef): [  37.90402135 -241.96436231  542.42875852  347.70384391 -931.48884588
  518.06227698  163.41998299  275.31790158  736.1988589    48.67065743]
Intercept (intercept): 151.34560453985995
R^2: 0.4526027629719197
MSE: 2900.19362849348
In [2]:
# Logistic regression (logit regression)
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
iris = datasets.load_iris()
print(iris["DESCR"])
X = iris.data[:, :2]  # first two features: sepal length, sepal width
y = (iris.target == 0).astype(int)  # is it setosa? (0 or 1)
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.3, random_state=42
)
model = LogisticRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print("准确率 Accuracy:", accuracy_score(y_test, y_pred))
print("混淆矩阵 Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("分类报告 Classification Report:\n", classification_report(y_test, y_pred))
# Plot the decision boundary
import numpy as np
x_min, x_max = X[:, 0].min() - .5, X[:, 0].max() + .5
y_min, y_max = X[:, 1].min() - .5, X[:, 1].max() + .5
xx, yy = np.meshgrid(np.linspace(x_min, x_max, 200),
np.linspace(y_min, y_max, 200))
Z = model.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
plt.contourf(xx, yy, Z, cmap=plt.cm.Paired, alpha=0.3)
plt.scatter(X[:, 0], X[:, 1], c=y, edgecolors="k", cmap=plt.cm.Paired)
plt.xlabel("Sepal length")
plt.ylabel("Sepal width")
plt.title("Logistic Regression on Iris (二分类)")
plt.show()
.. _iris_dataset:

Iris plants dataset
--------------------

**Data Set Characteristics:**

:Number of Instances: 150 (50 in each of three classes)
:Number of Attributes: 4 numeric, predictive attributes and the class
:Attribute Information:
    - sepal length in cm
    - sepal width in cm
    - petal length in cm
    - petal width in cm
    - class:
        - Iris-Setosa
        - Iris-Versicolour
        - Iris-Virginica

:Summary Statistics:

============== ==== ==== ======= ===== ====================
                Min  Max   Mean    SD   Class Correlation
============== ==== ==== ======= ===== ====================
sepal length:   4.3  7.9   5.84   0.83    0.7826
sepal width:    2.0  4.4   3.05   0.43   -0.4194
petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
petal width:    0.1  2.5   1.20   0.76    0.9565  (high!)
============== ==== ==== ======= ===== ====================

:Missing Attribute Values: None
:Class Distribution: 33.3% for each of 3 classes.
:Creator: R.A. Fisher
:Donor: Michael Marshall (MARSHALL%[email protected])
:Date: July, 1988

The famous Iris database, first used by Sir R.A. Fisher. The dataset is taken
from Fisher's paper. Note that it's the same as in R, but not as in the UCI
Machine Learning Repository, which has two wrong data points.

This is perhaps the best known database to be found in the pattern recognition
literature. Fisher's paper is a classic in the field and is referenced
frequently to this day. (See Duda & Hart, for example.) The data set contains
3 classes of 50 instances each, where each class refers to a type of iris
plant. One class is linearly separable from the other 2; the latter are NOT
linearly separable from each other.

.. dropdown:: References

    - Fisher, R.A. "The use of multiple measurements in taxonomic problems"
      Annual Eugenics, 7, Part II, 179-188 (1936); also in "Contributions to
      Mathematical Statistics" (John Wiley, NY, 1950).
    - Duda, R.O., & Hart, P.E. (1973) Pattern Classification and Scene Analysis.
      (Q327.D83) John Wiley & Sons. ISBN 0-471-22361-1. See page 218.
    - Dasarathy, B.V. (1980) "Nosing Around the Neighborhood: A New System
      Structure and Classification Rule for Recognition in Partially Exposed
      Environments". IEEE Transactions on Pattern Analysis and Machine
      Intelligence, Vol. PAMI-2, No. 1, 67-71.
    - Gates, G.W. (1972) "The Reduced Nearest Neighbor Rule". IEEE Transactions
      on Information Theory, May 1972, 431-433.
    - See also: 1988 MLC Proceedings, 54-64. Cheeseman et al's AUTOCLASS II
      conceptual clustering system finds 3 classes in the data.
    - Many, many more ...

Accuracy: 1.0
Confusion Matrix:
 [[26  0]
 [ 0 19]]
Classification Report:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00        26
           1       1.00      1.00      1.00        19

    accuracy                           1.00        45
   macro avg       1.00      1.00      1.00        45
weighted avg       1.00      1.00      1.00        45
In [3]:
# Decision tree
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
iris = load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.3, random_state=42
)
clf = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=42)
# clf = DecisionTreeClassifier(criterion='gini', max_depth=3, random_state=42)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))
plt.figure(figsize=(12, 6))
plot_tree(clf, filled=True, feature_names=iris.feature_names, class_names=iris.target_names)
plt.title("决策树 - 鸢尾花分类")
plt.show()
Accuracy: 0.9777777777777777
Confusion Matrix:
 [[19  0  0]
 [ 0 12  1]
 [ 0  0 13]]
Classification Report:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00        19
           1       1.00      0.92      0.96        13
           2       0.93      1.00      0.96        13

    accuracy                           0.98        45
   macro avg       0.98      0.97      0.97        45
weighted avg       0.98      0.98      0.98        45
In [4]:
# Random forest
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
iris = load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.3, random_state=42
)
rf = RandomForestClassifier(
    n_estimators=100,  # number of trees in the forest
    criterion="gini",  # split criterion for each tree
    max_depth=None,  # maximum tree depth; None means unlimited
random_state=42
)
# rf = RandomForestClassifier(
# n_estimators=100,
# criterion="entropy",
# max_depth=None,
# random_state=42
# )
rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))
import numpy as np
feature_importances = rf.feature_importances_
features = iris.feature_names
plt.barh(np.arange(len(features)), feature_importances, align="center")
plt.yticks(np.arange(len(features)), features)
plt.xlabel("Feature Importance")
plt.title("随机森林 - 特征重要性")
plt.show()
Accuracy: 1.0
Confusion Matrix:
 [[19  0  0]
 [ 0 13  0]
 [ 0  0 13]]
Classification Report:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00        19
           1       1.00      1.00      1.00        13
           2       1.00      1.00      1.00        13

    accuracy                           1.00        45
   macro avg       1.00      1.00      1.00        45
weighted avg       1.00      1.00      1.00        45
In [5]:
# Support vector machine
import matplotlib.pyplot as plt
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
iris = load_iris()
X, y = iris.data[:, :2], iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = SVC(kernel="linear", random_state=42)
# model = SVC(kernel='rbf', C=1.0, gamma='scale')  # Gaussian (RBF) kernel
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))
# Plot the decision boundary
x_min, x_max = X[:, 0].min() - 0.5, X[:, 0].max() + 0.5
y_min, y_max = X[:, 1].min() - 0.5, X[:, 1].max() + 0.5
xx, yy = np.meshgrid(np.linspace(x_min, x_max, 200),
np.linspace(y_min, y_max, 200))
Z = model.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
plt.contourf(xx, yy, Z, alpha=0.3, cmap=plt.cm.Paired)
plt.scatter(X[:, 0], X[:, 1], c=y, edgecolors='k', cmap=plt.cm.Paired)
plt.xlabel("Sepal length")
plt.ylabel("Sepal width")
plt.title("SVM on Iris Dataset (前两特征)")
plt.show()
Accuracy: 0.9
Confusion Matrix:
 [[10  0  0]
 [ 0  7  2]
 [ 0  1 10]]
Classification Report:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00        10
           1       0.88      0.78      0.82         9
           2       0.83      0.91      0.87        11

    accuracy                           0.90        30
   macro avg       0.90      0.90      0.90        30
weighted avg       0.90      0.90      0.90        30
In [6]:
# Multilayer perceptron (neural network)
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import classification_report, accuracy_score
digits = load_digits()
X = digits.data
y = digits.target
print(X.shape, y.shape)
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
mlp = MLPClassifier(hidden_layer_sizes=(64, 32, 16),
activation='relu',
                    solver='adam',  # Adam optimizer
max_iter=200,
random_state=42)
mlp.fit(X_train, y_train)
y_pred = mlp.predict(X_test)
print("测试集准确率:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))
(1797, 64) (1797,)
Test accuracy: 0.9694444444444444
              precision    recall  f1-score   support

           0       1.00      0.97      0.98        33
           1       0.96      0.96      0.96        28
           2       0.94      1.00      0.97        33
           3       0.97      0.97      0.97        34
           4       1.00      1.00      1.00        46
           5       0.94      0.98      0.96        47
           6       0.94      0.97      0.96        35
           7       1.00      0.97      0.99        34
           8       0.96      0.90      0.93        30
           9       0.97      0.95      0.96        40

    accuracy                           0.97       360
   macro avg       0.97      0.97      0.97       360
weighted avg       0.97      0.97      0.97       360
In [7]:
# Gaussian naive Bayes
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
iris = load_iris()
X = iris.data
y = iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
gnb = GaussianNB()
gnb.fit(X_train, y_train)
y_pred = gnb.predict(X_test)
print("GaussianNB 测试集准确率:", accuracy_score(y_test, y_pred))
print("混淆矩阵:\n", confusion_matrix(y_test, y_pred))
print("分类报告:\n", classification_report(y_test, y_pred))
GaussianNB test accuracy: 0.9777777777777777
Confusion matrix:
 [[19  0  0]
 [ 0 12  1]
 [ 0  0 13]]
Classification report:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00        19
           1       1.00      0.92      0.96        13
           2       0.93      1.00      0.96        13

    accuracy                           0.98        45
   macro avg       0.98      0.97      0.97        45
weighted avg       0.98      0.98      0.98        45
In [8]:
# Multinomial naive Bayes
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import MinMaxScaler
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
iris = load_iris()
X = iris.data
y = iris.target
# MultinomialNB requires non-negative features, so rescale them into a non-negative range
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.3, random_state=42)
mnb = MultinomialNB()
mnb.fit(X_train, y_train)
y_pred = mnb.predict(X_test)
print("MultinomialNB 测试集准确率:", accuracy_score(y_test, y_pred))
print("混淆矩阵:\n", confusion_matrix(y_test, y_pred))
print("分类报告:\n", classification_report(y_test, y_pred))
MultinomialNB test accuracy: 0.9333333333333333
Confusion matrix:
 [[19  0  0]
 [ 0 11  2]
 [ 0  1 12]]
Classification report:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00        19
           1       0.92      0.85      0.88        13
           2       0.86      0.92      0.89        13

    accuracy                           0.93        45
   macro avg       0.92      0.92      0.92        45
weighted avg       0.93      0.93      0.93        45
In [9]:
# k-means
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
iris = load_iris()
X = iris.data[:, :2]  # sepal length and width
kmeans = KMeans(n_clusters=3, random_state=42)
kmeans.fit(X)
y_kmeans = kmeans.predict(X)
plt.figure(figsize=(6, 5))
plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, cmap='viridis', s=50)
# cluster centers
centers = kmeans.cluster_centers_
plt.scatter(centers[:, 0], centers[:, 1], c='red', s=200, alpha=0.5, marker="X")
plt.title("K-means Clustering on Iris (first 2 features)")
plt.xlabel("Sepal length")
plt.ylabel("Sepal width")
plt.show()
In [10]:
# DBSCAN
import matplotlib.pyplot as plt
from sklearn.cluster import DBSCAN
from sklearn.datasets import load_iris
iris = load_iris()
X = iris.data[:, :2]
dbscan = DBSCAN(eps=0.5, min_samples=5)  # eps = neighborhood radius, min_samples = minimum points per core region
y_dbscan = dbscan.fit_predict(X)
# KMeans is a centroid-based algorithm: it can assign new points to the nearest
# centroid, so it has a predict() method.
# DBSCAN is density-based and keeps no centroids: it only labels the training
# data and cannot directly predict new points, so it has no predict() method.
# (One common workaround is sketched right below.)
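# A workaround sketch (not a DBSCAN API): assign a new point to the cluster of
# its nearest core sample if that core sample lies within eps, else call it noise.
import numpy as np
core_points = dbscan.components_  # coordinates of the core samples
core_labels = dbscan.labels_[dbscan.core_sample_indices_]  # their cluster labels
new_point = np.array([5.0, 3.5])  # a hypothetical new observation
dists = np.linalg.norm(core_points - new_point, axis=1)
nearest = np.argmin(dists)
pred = core_labels[nearest] if dists[nearest] <= dbscan.eps else -1
print("Predicted cluster for the new point:", pred)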
plt.figure(figsize=(6, 5))
plt.scatter(X[:, 0], X[:, 1], c=y_dbscan, cmap='viridis', s=50)
plt.title("DBSCAN Clustering on Iris (first 2 features)")
plt.xlabel("Sepal length")
plt.ylabel("Sepal width")
plt.show()
In [11]:
# Hierarchical (agglomerative) clustering
import matplotlib.pyplot as plt
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import load_iris
import scipy.cluster.hierarchy as sch
iris = load_iris()
X = iris.data[:, :2]  # use the first two features
# The linkage parameter selects the linkage method: 'ward', 'complete', 'average', or 'single'
hierarchical = AgglomerativeClustering(n_clusters=3, linkage='ward')
y_hierarchical = hierarchical.fit_predict(X)
plt.figure(figsize=(12, 5))
plt.subplot(121)
plt.scatter(X[:, 0], X[:, 1], c=y_hierarchical, cmap='viridis', s=50)
plt.title("Hierarchical Clustering on Iris (first 2 features)")
plt.xlabel("Sepal length")
plt.ylabel("Sepal width")
# 绘制树状图
plt.subplot(122)
dendrogram = sch.dendrogram(sch.linkage(X, method='ward'))
plt.title('Dendrogram')
plt.xlabel('Data points')
plt.ylabel('Euclidean distances')
plt.tight_layout()
plt.show()
In [12]:
# PCA dimensionality reduction
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.datasets import load_iris
iris = load_iris()
X = iris.data
y = iris.target
# reduce to two dimensions with PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)
plt.figure(figsize=(6, 5))
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y, cmap='viridis', s=50)
plt.xlabel("PCA Component 1")
plt.ylabel("PCA Component 2")
plt.title("PCA on Iris Dataset")
plt.show()
In [13]:
# t-SNE dimensionality reduction
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
from sklearn.datasets import load_iris
iris = load_iris()
X = iris.data
y = iris.target
tsne = TSNE(n_components=2, random_state=42, perplexity=30, learning_rate=200)
X_tsne = tsne.fit_transform(X)
plt.figure(figsize=(6, 5))
plt.scatter(X_tsne[:, 0], X_tsne[:, 1], c=y, cmap='viridis', s=50)
plt.xlabel("t-SNE Dim 1")
plt.ylabel("t-SNE Dim 2")
plt.title("t-SNE on Iris Dataset")
plt.show()
In [14]:
# Standardization, normalization, and missing-value handling
# Standardization
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
iris = load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Normalization
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
X_train_norm = scaler.fit_transform(X_train)
X_test_norm = scaler.transform(X_test)
# Missing-value handling
from sklearn.impute import SimpleImputer
import numpy as np
X_train_missing = X_train.copy()
X_train_missing[0, 0] = np.nan  # simulate a missing value
imputer = SimpleImputer(strategy="mean")  # other options: "median", "most_frequent", "constant"
X_train_imputed = imputer.fit_transform(X_train_missing)
In [11]:
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import VarianceThreshold, SelectKBest, RFE
from sklearn.linear_model import LogisticRegression, Lasso
from sklearn.preprocessing import StandardScaler
from scipy.stats import pearsonr  # pearsonr lives in scipy.stats, not sklearn
# Load the breast cancer dataset (569 samples, 30 features; predict benign vs. malignant tumors)
data = load_breast_cancer()
X, y = data.data, data.target
feature_names = data.feature_names
# 1. Filter method: variance threshold
selector_var = VarianceThreshold(threshold=0.8)
X_var = selector_var.fit_transform(X)
selected_features_var = feature_names[selector_var.get_support()]
print("方差选择保留的特征:", selected_features_var)
# 2. Filter method: correlation-coefficient selection
def pearsonr_score(X, y):
    # SelectKBest expects score_func to return (scores, p_values)
scores, p_values = [], []
for i in range(X.shape[1]):
r, p = pearsonr(X[:, i], y)
        scores.append(abs(r))  # use the absolute correlation |r| as the score
p_values.append(p)
return np.array(scores), np.array(p_values)
selector_corr = SelectKBest(score_func=pearsonr_score, k=5)
X_corr = selector_corr.fit_transform(X, y)
selected_features_corr = feature_names[selector_corr.get_support()]
print("相关系数选择保留的特征:", selected_features_corr)
# 3. Wrapper method: recursive feature elimination (RFE)
model = LogisticRegression(max_iter=1000)
selector_rfe = RFE(model, n_features_to_select=5)
X_rfe = selector_rfe.fit_transform(X, y)
selected_features_rfe = feature_names[selector_rfe.get_support()]
print("RFE选择保留的特征:", selected_features_rfe)
# 4. Embedded method: Lasso (L1 regularization)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
lasso = Lasso(alpha=0.01)
lasso.fit(X_scaled, y)
selected_features_lasso = feature_names[lasso.coef_ != 0]
print("Lasso选择保留的特征:", selected_features_lasso)
Features kept by the variance threshold: ['mean radius' 'mean texture' 'mean perimeter' 'mean area'
 'perimeter error' 'area error' 'worst radius' 'worst texture'
 'worst perimeter' 'worst area']
Features kept by correlation selection: ['mean perimeter' 'mean concave points' 'worst radius'
 'worst perimeter' 'worst concave points']
C:\Users\yz\AppData\Roaming\Python\Python312\site-packages\sklearn\linear_model\_logistic.py:470: ConvergenceWarning: lbfgs failed to converge after 1000 iteration(s) (status=1):
STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT
Increase the number of iterations to improve the convergence (max_iter=1000).
You might also want to scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
(the same ConvergenceWarning is emitted four times during RFE)
Features kept by RFE: ['mean radius' 'texture error' 'worst radius' 'worst compactness'
 'worst concavity']
Features kept by Lasso: ['mean texture' 'mean concave points' 'mean fractal dimension'
 'radius error' 'smoothness error' 'concavity error' 'worst radius'
 'worst texture' 'worst smoothness' 'worst concavity'
 'worst concave points' 'worst symmetry']