ML
scikit-learn¶
- Basic sklearn functions
fit
Train (fit) a model
predict
Predict on new data with a trained model
transform
Apply a feature transformation (mainly for preprocessing / dimensionality reduction / feature selection)
fit_transform
Equivalent to fit + transform: fit and transform in one call
score
Model evaluation; by default accuracy for classifiers, R^2 for regressors
datasets.load_*
Load a built-in dataset
train_test_split
Split into training and test sets
sklearn.model_selection.train_test_split(*arrays, test_size=None, train_size=None, random_state=None, shuffle=True, stratify=None)
- Evaluation metrics such as accuracy_score and confusion_matrix, plus
classification_report(y_test, y_pred)
for a per-class classification report (all demonstrated in the sketch below)
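A minimal sketch tying these pieces together on the built-in Iris data (the estimator and split values here are illustrative; any sklearn estimator follows the same fit/predict/score pattern):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
clf = LogisticRegression(max_iter=200)
clf.fit(X_train, y_train)                     # fit: train the model
y_pred = clf.predict(X_test)                  # predict: label new data
print(clf.score(X_test, y_test))              # score: accuracy for classifiers
print(classification_report(y_test, y_pred))  # per-class precision/recall/F1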
- Datasets
Built-in: Iris, diabetes, breast cancer ...
CSV
import pandas as pd
from sklearn.model_selection import train_test_split

# Read the CSV file
df = pd.read_csv("data.csv")

# Assume the last column is the label
X = df.iloc[:, :-1].values  # features
y = df.iloc[:, -1].values   # labels

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
Excel
import pandas as pd

df = pd.read_excel("data.xlsx")
X = df.drop("label", axis=1).values
y = df["label"].values
Images
from sklearn.datasets import load_files

data = load_files("dataset/", load_content=False)
X, y = data["filenames"], data["target"]
print(X[:5])  # image file paths
print(y[:5])  # labels
Supervised Learning¶
LinearRegression
from sklearn.linear_model import LinearRegression
LogisticRegression
from sklearn.linear_model import LogisticRegression
DecisionTree
from sklearn.tree import DecisionTreeClassifier
RandomForest
from sklearn.ensemble import RandomForestClassifier
- SVC (kernel functions, parameters C and gamma)
from sklearn.svm import SVC
kernel: kernel type ('linear' for a linear kernel, 'rbf' for the Gaussian kernel, etc.)
C: penalty strength; the larger C is, the less misclassification is tolerated
gamma: kernel coefficient (only used by 'rbf'/'poly' kernels); see the sketch below
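A minimal sketch of how kernel, C, and gamma are passed to SVC (the specific values are illustrative, not tuned):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Larger C penalizes misclassification harder; gamma controls how local the RBF kernel is
for C, gamma in [(0.1, "scale"), (1.0, "scale"), (10.0, 0.1)]:
    clf = SVC(kernel="rbf", C=C, gamma=gamma).fit(X_train, y_train)
    print(f"C={C}, gamma={gamma}: test accuracy = {clf.score(X_test, y_test):.3f}")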
- MLPClassifier multilayer perceptron (hidden layers, number of iterations)
class sklearn.neural_network.MLPClassifier(hidden_layer_sizes=(100,), activation='relu', *,
solver='adam', alpha=0.0001, batch_size='auto', learning_rate='constant', learning_rate_init=0.001,
power_t=0.5, max_iter=200, shuffle=True, random_state=None, tol=0.0001, verbose=False, warm_start=False,
momentum=0.9, nesterovs_momentum=True, early_stopping=False, validation_fraction=0.1, beta_1=0.9,
beta_2=0.999, epsilon=1e-08, n_iter_no_change=10, max_fun=15000)
GaussianNB
from sklearn.naive_bayes import GaussianNB
MultinomialNB
from sklearn.naive_bayes import MultinomialNB
Unsupervised Learning¶
- KMeans
from sklearn.cluster import KMeans
- DBSCAN
from sklearn.cluster import DBSCAN
- Hierarchical clustering
from sklearn.cluster import AgglomerativeClustering
- Dimensionality reduction (PCA, t-SNE; imports below)
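from sklearn.decomposition import PCA
from sklearn.manifold import TSNE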
Model Optimization and Advanced Topics¶
- Standardization: scale each feature column to zero mean and unit standard deviation
from sklearn.preprocessing import StandardScaler
- Normalization: rescale values into a fixed range
from sklearn.preprocessing import MinMaxScaler
- Missing values: fill, drop, or interpolate
from sklearn.impute import SimpleImputer
- Feature selection: pick a subset of a high-dimensional feature set, keeping the features most useful for the prediction target and dropping redundant or irrelevant ones (a complete example appears in the final cell below).
Filter methods: score features with statistical measures (e.g. variance, correlation coefficients), independent of any particular model.
Wrapper methods: evaluate feature subsets by the performance of a specific model (e.g. recursive feature elimination, RFE).
Embedded methods: select features during model training (e.g. the L1 regularization of Lasso regression).
- Grid search (GridSearchCV)
- Randomized search (RandomizedSearchCV); neither appears in the cells below, so a sketch of both follows
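A minimal sketch of both searches on the Iris data, assuming scipy is available for the sampling distributions; the grids and ranges are illustrative, not tuned:

from scipy.stats import loguniform
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Grid search: exhaustively evaluate every combination with 5-fold CV
grid = GridSearchCV(SVC(), {"C": [0.1, 1, 10], "gamma": ["scale", 0.1, 1]}, cv=5)
grid.fit(X, y)
print("grid search best:", grid.best_params_, grid.best_score_)

# Randomized search: sample a fixed number of combinations from distributions
rand = RandomizedSearchCV(
    SVC(),
    {"C": loguniform(1e-2, 1e2), "gamma": loguniform(1e-4, 1e1)},
    n_iter=20, cv=5, random_state=42,
)
rand.fit(X, y)
print("randomized search best:", rand.best_params_, rand.best_score_)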
Comprehensive Practice¶
- Classification task: MNIST handwritten digit recognition
- Regression task: house price prediction (Boston / California housing; note that load_boston was removed in scikit-learn 1.2, so prefer fetch_california_housing)
- Complete workflow on an open dataset (a sketch follows)
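A minimal end-to-end sketch for the regression task, assuming fetch_california_housing can download the data on first use:

from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# 1. Load the data
X, y = fetch_california_housing(return_X_y=True)

# 2. Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 3. Chain preprocessing and the model into one pipeline
model = make_pipeline(StandardScaler(), LinearRegression())
model.fit(X_train, y_train)

# 4. Evaluate on the held-out test set
y_pred = model.predict(X_test)
print("R^2:", r2_score(y_test, y_pred))
print("MSE:", mean_squared_error(y_test, y_pred))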
In [1]:
# Multivariate linear regression
import matplotlib.pyplot as plt
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
diabetes = load_diabetes()
print(diabetes.feature_names)
# ['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']
X = diabetes.data  # 10 features
y = diabetes.target  # target variable
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print("回归系数 (coef):", model.coef_)
print("截距 (intercept):", model.intercept_)
print("R^2:", r2_score(y_test, y_pred))
print("MSE:", mean_squared_error(y_test, y_pred))
plt.rcParams["font.sans-serif"] = ["SimHei"]  # CJK-capable font
plt.rcParams["axes.unicode_minus"] = False  # render the minus sign with a glyph SimHei has
plt.scatter(y_test, y_pred, color="blue", edgecolors="k", alpha=0.7)
plt.plot([y_test.min(), y_test.max()],
[y_test.min(), y_test.max()],
"r--", lw=2, label="理想预测")
plt.xlabel("真实值")
plt.ylabel("预测值")
plt.title("多维线性回归 - 糖尿病数据集")
plt.legend()
plt.show()
['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']
Regression coefficients (coef): [  37.90402135 -241.96436231  542.42875852  347.70384391 -931.48884588
  518.06227698  163.41998299  275.31790158  736.1988589    48.67065743]
Intercept (intercept): 151.34560453985995
R^2: 0.4526027629719197
MSE: 2900.19362849348
In [2]:
# Logistic regression (logit regression)
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
iris = datasets.load_iris()
print(iris["DESCR"])
X = iris.data[:, :2]  # first two features: sepal length, sepal width
y = (iris.target == 0).astype(int)  # is it setosa? (0 or 1)
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.3, random_state=42
)
model = LogisticRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print("准确率 Accuracy:", accuracy_score(y_test, y_pred))
print("混淆矩阵 Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("分类报告 Classification Report:\n", classification_report(y_test, y_pred))
# Plot the decision boundary
import numpy as np
x_min, x_max = X[:, 0].min() - .5, X[:, 0].max() + .5
y_min, y_max = X[:, 1].min() - .5, X[:, 1].max() + .5
xx, yy = np.meshgrid(np.linspace(x_min, x_max, 200),
np.linspace(y_min, y_max, 200))
Z = model.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
plt.contourf(xx, yy, Z, cmap=plt.cm.Paired, alpha=0.3)
plt.scatter(X[:, 0], X[:, 1], c=y, edgecolors="k", cmap=plt.cm.Paired)
plt.xlabel("Sepal length")
plt.ylabel("Sepal width")
plt.title("Logistic Regression on Iris (二分类)")
plt.show()
.. _iris_dataset:

Iris plants dataset
--------------------

**Data Set Characteristics:**

:Number of Instances: 150 (50 in each of three classes)
:Number of Attributes: 4 numeric, predictive attributes and the class
:Attribute Information:
    - sepal length in cm
    - sepal width in cm
    - petal length in cm
    - petal width in cm
    - class:
        - Iris-Setosa
        - Iris-Versicolour
        - Iris-Virginica

:Summary Statistics:

============== ==== ==== ======= ===== ====================
                Min  Max   Mean    SD   Class Correlation
============== ==== ==== ======= ===== ====================
sepal length:   4.3  7.9   5.84   0.83    0.7826
sepal width:    2.0  4.4   3.05   0.43   -0.4194
petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
petal width:    0.1  2.5   1.20   0.76    0.9565  (high!)
============== ==== ==== ======= ===== ====================

:Missing Attribute Values: None
:Class Distribution: 33.3% for each of 3 classes.
:Creator: R.A. Fisher
:Donor: Michael Marshall (MARSHALL%[email protected])
:Date: July, 1988

The famous Iris database, first used by Sir R.A. Fisher. The dataset is taken
from Fisher's paper. Note that it's the same as in R, but not as in the UCI
Machine Learning Repository, which has two wrong data points.

This is perhaps the best known database to be found in the pattern recognition
literature. Fisher's paper is a classic in the field and is referenced
frequently to this day. (See Duda & Hart, for example.) The data set contains
3 classes of 50 instances each, where each class refers to a type of iris
plant. One class is linearly separable from the other 2; the latter are NOT
linearly separable from each other.

.. dropdown:: References

    - Fisher, R.A. "The use of multiple measurements in taxonomic problems"
      Annual Eugenics, 7, Part II, 179-188 (1936); also in "Contributions to
      Mathematical Statistics" (John Wiley, NY, 1950).
    - Duda, R.O., & Hart, P.E. (1973) Pattern Classification and Scene Analysis.
      (Q327.D83) John Wiley & Sons. ISBN 0-471-22361-1. See page 218.
    - Dasarathy, B.V. (1980) "Nosing Around the Neighborhood: A New System
      Structure and Classification Rule for Recognition in Partially Exposed
      Environments". IEEE Transactions on Pattern Analysis and Machine
      Intelligence, Vol. PAMI-2, No. 1, 67-71.
    - Gates, G.W. (1972) "The Reduced Nearest Neighbor Rule". IEEE Transactions
      on Information Theory, May 1972, 431-433.
    - See also: 1988 MLC Proceedings, 54-64. Cheeseman et al's AUTOCLASS II
      conceptual clustering system finds 3 classes in the data.
    - Many, many more ...

Accuracy: 1.0
Confusion Matrix:
 [[26  0]
 [ 0 19]]
Classification Report:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00        26
           1       1.00      1.00      1.00        19

    accuracy                           1.00        45
   macro avg       1.00      1.00      1.00        45
weighted avg       1.00      1.00      1.00        45
In [3]:
# Decision tree
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
iris = load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.3, random_state=42
)
clf = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=42)
# clf = DecisionTreeClassifier(criterion='gini', max_depth=3, random_state=42)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))
plt.figure(figsize=(12, 6))
plot_tree(clf, filled=True, feature_names=iris.feature_names, class_names=iris.target_names)
plt.title("决策树 - 鸢尾花分类")
plt.show()
Accuracy: 0.9777777777777777
Confusion Matrix:
 [[19  0  0]
 [ 0 12  1]
 [ 0  0 13]]
Classification Report:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00        19
           1       1.00      0.92      0.96        13
           2       0.93      1.00      0.96        13

    accuracy                           0.98        45
   macro avg       0.98      0.97      0.97        45
weighted avg       0.98      0.98      0.98        45
In [4]:
# Random forest
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
iris = load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.3, random_state=42
)
rf = RandomForestClassifier(
    n_estimators=100,  # number of trees in the forest
    criterion="gini",  # split criterion for each tree
    max_depth=None,  # maximum tree depth; None means unlimited
random_state=42
)
# rf = RandomForestClassifier(
# n_estimators=100,
# criterion="entropy",
# max_depth=None,
# random_state=42
# )
rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))
import numpy as np
feature_importances = rf.feature_importances_
features = iris.feature_names
plt.barh(np.arange(len(features)), feature_importances, align="center")
plt.yticks(np.arange(len(features)), features)
plt.xlabel("Feature Importance")
plt.title("随机森林 - 特征重要性")
plt.show()
Accuracy: 1.0
Confusion Matrix:
 [[19  0  0]
 [ 0 13  0]
 [ 0  0 13]]
Classification Report:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00        19
           1       1.00      1.00      1.00        13
           2       1.00      1.00      1.00        13

    accuracy                           1.00        45
   macro avg       1.00      1.00      1.00        45
weighted avg       1.00      1.00      1.00        45
In [5]:
# Support vector machine
import matplotlib.pyplot as plt
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
iris = load_iris()
X, y = iris.data[:, :2], iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = SVC(kernel="linear", random_state=42)
# model = SVC(kernel='rbf', C=1.0, gamma='scale')  # Gaussian (RBF) kernel
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))
# Plot the decision boundary
x_min, x_max = X[:, 0].min() - 0.5, X[:, 0].max() + 0.5
y_min, y_max = X[:, 1].min() - 0.5, X[:, 1].max() + 0.5
xx, yy = np.meshgrid(np.linspace(x_min, x_max, 200),
np.linspace(y_min, y_max, 200))
Z = model.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
plt.contourf(xx, yy, Z, alpha=0.3, cmap=plt.cm.Paired)
plt.scatter(X[:, 0], X[:, 1], c=y, edgecolors='k', cmap=plt.cm.Paired)
plt.xlabel("Sepal length")
plt.ylabel("Sepal width")
plt.title("SVM on Iris Dataset (前两特征)")
plt.show()
Accuracy: 0.9
Confusion Matrix:
 [[10  0  0]
 [ 0  7  2]
 [ 0  1 10]]
Classification Report:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00        10
           1       0.88      0.78      0.82         9
           2       0.83      0.91      0.87        11

    accuracy                           0.90        30
   macro avg       0.90      0.90      0.90        30
weighted avg       0.90      0.90      0.90        30
In [6]:
# Multilayer perceptron (neural network)
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import classification_report, accuracy_score
digits = load_digits()
X = digits.data
y = digits.target
print(X.shape, y.shape)
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
mlp = MLPClassifier(hidden_layer_sizes=(64, 32, 16),
activation='relu',
                    solver='adam',  # Adam optimizer
max_iter=200,
random_state=42)
mlp.fit(X_train, y_train)
y_pred = mlp.predict(X_test)
print("测试集准确率:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))
(1797, 64) (1797,)
Test accuracy: 0.9694444444444444
              precision    recall  f1-score   support

           0       1.00      0.97      0.98        33
           1       0.96      0.96      0.96        28
           2       0.94      1.00      0.97        33
           3       0.97      0.97      0.97        34
           4       1.00      1.00      1.00        46
           5       0.94      0.98      0.96        47
           6       0.94      0.97      0.96        35
           7       1.00      0.97      0.99        34
           8       0.96      0.90      0.93        30
           9       0.97      0.95      0.96        40

    accuracy                           0.97       360
   macro avg       0.97      0.97      0.97       360
weighted avg       0.97      0.97      0.97       360
In [7]:
# Gaussian naive Bayes
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
iris = load_iris()
X = iris.data
y = iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
gnb = GaussianNB()
gnb.fit(X_train, y_train)
y_pred = gnb.predict(X_test)
print("GaussianNB 测试集准确率:", accuracy_score(y_test, y_pred))
print("混淆矩阵:\n", confusion_matrix(y_test, y_pred))
print("分类报告:\n", classification_report(y_test, y_pred))
GaussianNB test accuracy: 0.9777777777777777
Confusion matrix:
 [[19  0  0]
 [ 0 12  1]
 [ 0  0 13]]
Classification report:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00        19
           1       1.00      0.92      0.96        13
           2       0.93      1.00      0.96        13

    accuracy                           0.98        45
   macro avg       0.98      0.97      0.97        45
weighted avg       0.98      0.98      0.98        45
In [8]:
# Multinomial naive Bayes
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import MinMaxScaler
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
iris = load_iris()
X = iris.data
y = iris.target
# MultinomialNB requires non-negative features, so rescale them into a non-negative range
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.3, random_state=42)
mnb = MultinomialNB()
mnb.fit(X_train, y_train)
y_pred = mnb.predict(X_test)
print("MultinomialNB 测试集准确率:", accuracy_score(y_test, y_pred))
print("混淆矩阵:\n", confusion_matrix(y_test, y_pred))
print("分类报告:\n", classification_report(y_test, y_pred))
MultinomialNB test accuracy: 0.9333333333333333
Confusion matrix:
 [[19  0  0]
 [ 0 11  2]
 [ 0  1 12]]
Classification report:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00        19
           1       0.92      0.85      0.88        13
           2       0.86      0.92      0.89        13

    accuracy                           0.93        45
   macro avg       0.92      0.92      0.92        45
weighted avg       0.93      0.93      0.93        45
In [9]:
# k-means
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
iris = load_iris()
X = iris.data[:, :2]  # sepal length and width
kmeans = KMeans(n_clusters=3, random_state=42)
kmeans.fit(X)
y_kmeans = kmeans.predict(X)
plt.figure(figsize=(6, 5))
plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, cmap='viridis', s=50)
# cluster centers
centers = kmeans.cluster_centers_
plt.scatter(centers[:, 0], centers[:, 1], c='red', s=200, alpha=0.5, marker="X")
plt.title("K-means Clustering on Iris (first 2 features)")
plt.xlabel("Sepal length")
plt.ylabel("Sepal width")
plt.show()
In [10]:
# DBSCAN
import matplotlib.pyplot as plt
from sklearn.cluster import DBSCAN
from sklearn.datasets import load_iris
iris = load_iris()
X = iris.data[:, :2]
dbscan = DBSCAN(eps=0.5, min_samples=5)  # eps = neighborhood radius, min_samples = minimum points per core region
y_dbscan = dbscan.fit_predict(X)
# KMeans is a centroid-based algorithm: it can assign new points to the nearest
# centroid, so it has a predict() method.
# DBSCAN is density-based and keeps no centroids: it only labels the training
# data and cannot directly predict new points, so it has no predict() method.
# (One common workaround is sketched right below.)
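# A workaround sketch (not a DBSCAN API): assign a new point to the cluster of
# its nearest core sample if that core sample lies within eps, else call it noise.
import numpy as np
core_points = dbscan.components_  # coordinates of the core samples
core_labels = dbscan.labels_[dbscan.core_sample_indices_]  # their cluster labels
new_point = np.array([5.0, 3.5])  # a hypothetical new observation
dists = np.linalg.norm(core_points - new_point, axis=1)
nearest = np.argmin(dists)
pred = core_labels[nearest] if dists[nearest] <= dbscan.eps else -1
print("Predicted cluster for the new point:", pred)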
plt.figure(figsize=(6, 5))
plt.scatter(X[:, 0], X[:, 1], c=y_dbscan, cmap='viridis', s=50)
plt.title("DBSCAN Clustering on Iris (first 2 features)")
plt.xlabel("Sepal length")
plt.ylabel("Sepal width")
plt.show()
In [11]:
# Hierarchical (agglomerative) clustering
import matplotlib.pyplot as plt
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import load_iris
import scipy.cluster.hierarchy as sch
iris = load_iris()
X = iris.data[:, :2]  # use the first two features
# The linkage parameter selects the linkage method: 'ward', 'complete', 'average', or 'single'
hierarchical = AgglomerativeClustering(n_clusters=3, linkage='ward')
y_hierarchical = hierarchical.fit_predict(X)
plt.figure(figsize=(12, 5))
plt.subplot(121)
plt.scatter(X[:, 0], X[:, 1], c=y_hierarchical, cmap='viridis', s=50)
plt.title("Hierarchical Clustering on Iris (first 2 features)")
plt.xlabel("Sepal length")
plt.ylabel("Sepal width")
# 绘制树状图
plt.subplot(122)
dendrogram = sch.dendrogram(sch.linkage(X, method='ward'))
plt.title('Dendrogram')
plt.xlabel('Data points')
plt.ylabel('Euclidean distances')
plt.tight_layout()
plt.show()
In [12]:
# PCA dimensionality reduction
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.datasets import load_iris
iris = load_iris()
X = iris.data
y = iris.target
# reduce to two dimensions with PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)
plt.figure(figsize=(6, 5))
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y, cmap='viridis', s=50)
plt.xlabel("PCA Component 1")
plt.ylabel("PCA Component 2")
plt.title("PCA on Iris Dataset")
plt.show()
In [13]:
# t-SNE dimensionality reduction
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
from sklearn.datasets import load_iris
iris = load_iris()
X = iris.data
y = iris.target
tsne = TSNE(n_components=2, random_state=42, perplexity=30, learning_rate=200)
X_tsne = tsne.fit_transform(X)
plt.figure(figsize=(6, 5))
plt.scatter(X_tsne[:, 0], X_tsne[:, 1], c=y, cmap='viridis', s=50)
plt.xlabel("t-SNE Dim 1")
plt.ylabel("t-SNE Dim 2")
plt.title("t-SNE on Iris Dataset")
plt.show()
In [14]:
# Standardization, normalization, and missing-value handling
# Standardization
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
iris = load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Normalization
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
X_train_norm = scaler.fit_transform(X_train)
X_test_norm = scaler.transform(X_test)
# Missing-value handling
from sklearn.impute import SimpleImputer
import numpy as np
X_train_missing = X_train.copy()
X_train_missing[0, 0] = np.nan  # simulate a missing value
imputer = SimpleImputer(strategy="mean")  # other options: "median", "most_frequent", "constant"
X_train_imputed = imputer.fit_transform(X_train_missing)
In [11]:
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import VarianceThreshold, SelectKBest, RFE
from sklearn.linear_model import LogisticRegression, Lasso
from sklearn.preprocessing import StandardScaler
from scipy.stats import pearsonr  # pearsonr lives in scipy.stats, not sklearn
# Load the breast cancer dataset (569 samples, 30 features; predict benign vs. malignant tumors)
data = load_breast_cancer()
X, y = data.data, data.target
feature_names = data.feature_names
# 1. Filter method: variance threshold
selector_var = VarianceThreshold(threshold=0.8)
X_var = selector_var.fit_transform(X)
selected_features_var = feature_names[selector_var.get_support()]
print("方差选择保留的特征:", selected_features_var)
# 2. Filter method: correlation-coefficient selection
def pearsonr_score(X, y):
    # SelectKBest expects score_func to return (scores, p_values)
scores, p_values = [], []
for i in range(X.shape[1]):
r, p = pearsonr(X[:, i], y)
        scores.append(abs(r))  # use the absolute correlation |r| as the score
p_values.append(p)
return np.array(scores), np.array(p_values)
selector_corr = SelectKBest(score_func=pearsonr_score, k=5)
X_corr = selector_corr.fit_transform(X, y)
selected_features_corr = feature_names[selector_corr.get_support()]
print("相关系数选择保留的特征:", selected_features_corr)
# 3. Wrapper method: recursive feature elimination (RFE)
model = LogisticRegression(max_iter=1000)
selector_rfe = RFE(model, n_features_to_select=5)
X_rfe = selector_rfe.fit_transform(X, y)
selected_features_rfe = feature_names[selector_rfe.get_support()]
print("RFE选择保留的特征:", selected_features_rfe)
# 4. Embedded method: Lasso (L1 regularization)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
lasso = Lasso(alpha=0.01)
lasso.fit(X_scaled, y)
selected_features_lasso = feature_names[lasso.coef_ != 0]
print("Lasso选择保留的特征:", selected_features_lasso)
Features kept by the variance threshold: ['mean radius' 'mean texture' 'mean perimeter' 'mean area'
 'perimeter error' 'area error' 'worst radius' 'worst texture'
 'worst perimeter' 'worst area']
Features kept by correlation selection: ['mean perimeter' 'mean concave points' 'worst radius'
 'worst perimeter' 'worst concave points']
C:\Users\yz\AppData\Roaming\Python\Python312\site-packages\sklearn\linear_model\_logistic.py:470: ConvergenceWarning: lbfgs failed to converge after 1000 iteration(s) (status=1):
STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT
Increase the number of iterations to improve the convergence (max_iter=1000).
You might also want to scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
(the same ConvergenceWarning is emitted four times during RFE)
Features kept by RFE: ['mean radius' 'texture error' 'worst radius' 'worst compactness'
 'worst concavity']
Features kept by Lasso: ['mean texture' 'mean concave points' 'mean fractal dimension'
 'radius error' 'smoothness error' 'concavity error' 'worst radius'
 'worst texture' 'worst smoothness' 'worst concavity'
 'worst concave points' 'worst symmetry']