How to Address Class Imbalance in Machine Learning: A Guide to Applying SMOTE Resampling

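7 hours ago, MedSci xAi, posted from Guangdong Province
This article addresses the common class-imbalance problem in machine learning (a positive-to-negative sample ratio of 1:5). It explains how to apply SMOTE resampling and stresses that only the training set should be resampled while the test set keeps its original distribution, so that model evaluation reflects true generalization performance rather than over-optimistic metrics produced by artificial balancing.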

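The revised text follows (language issues have been corrected systematically, covering terminological accuracy, logical rigor, grammar, collocation, and academic writing conventions):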

The 10 features that were significantly associated with the outcome in univariate analysis were selected for final model construction. The dataset was randomly split into a training set (70%) and a test set (30%). Because the ratio of positive to negative samples was approximately 1:5, the training set exhibited class imbalance. To mitigate model bias toward the majority (negative) class—which would otherwise lead to substantially reduced recall for the positive class—we applied resampling techniques (specifically, SMOTE) to the training set; the test set remained unmodified to preserve its representativeness of real-world data distribution. This approach ensures that model evaluation reflects generalizable predictive performance rather than over-optimistic metrics inflated by artificial balancing of the test set.
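
The workflow described in this paragraph (a 70/30 split, followed by oversampling of the training set only) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' pipeline: the toy data, the helper names `stratified_split` and `smote_like`, the choice of a stratified split (the text only says "randomly split"), and `k=5` are all hypothetical. The interpolation step shows the core idea of SMOTE; a real analysis would normally use a maintained library implementation.

```python
import math
import random


def stratified_split(y, test_frac=0.3, seed=0):
    """Return (train_idx, test_idx), keeping the class ratio in both parts.

    Stratification is an assumption of this sketch, not stated in the text.
    """
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for label in sorted(set(y)):
        idx = [i for i, v in enumerate(y) if v == label]
        rng.shuffle(idx)
        n_test = round(len(idx) * test_frac)
        test_idx += idx[:n_test]
        train_idx += idx[n_test:]
    return train_idx, test_idx


def smote_like(minority, n_new, k=5, seed=0):
    """Create n_new synthetic rows by interpolating between a minority row
    and one of its k nearest minority neighbours (the core idea of SMOTE)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of `base`, excluding `base` itself
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(p, base),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> other
        synthetic.append([b + gap * (o - b) for b, o in zip(base, other)])
    return synthetic


# Toy data (hypothetical): 100 positives, 500 negatives (about 1:5), 10 features.
rng = random.Random(42)
X = [[rng.gauss(1.0, 1.0) for _ in range(10)] for _ in range(100)] + \
    [[rng.gauss(0.0, 1.0) for _ in range(10)] for _ in range(500)]
y = [1] * 100 + [0] * 500

train_idx, test_idx = stratified_split(y, test_frac=0.3, seed=42)
X_train = [X[i] for i in train_idx]
y_train = [y[i] for i in train_idx]
X_test = [X[i] for i in test_idx]
y_test = [y[i] for i in test_idx]

# Oversample ONLY the training set until both classes are the same size.
train_pos = [x for x, label in zip(X_train, y_train) if label == 1]
n_new = y_train.count(0) - y_train.count(1)
X_train_bal = X_train + smote_like(train_pos, n_new, k=5, seed=42)
y_train_bal = y_train + [1] * n_new

print("train before:", y_train.count(1), "pos /", y_train.count(0), "neg")
print("train after: ", y_train_bal.count(1), "pos /", y_train_bal.count(0), "neg")
print("test (untouched):", y_test.count(1), "pos /", y_test.count(0), "neg")
```

The split is done before any resampling so that no synthetic point derived from a test row can leak into training, and `X_test`/`y_test` are never passed to `smote_like`.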

Reasons:

  1. “significantly identified through the single-factor analysis” → “significantly associated with the outcome in univariate analysis”
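      • “Single-factor analysis” is a literal translation from Chinese and is not a standard English term; the correct term is univariate analysis.
      • “Identified” is a poor collocation: features are not objects that get “identified”; rather, they show a statistical association with the outcome. “Significantly associated with the outcome” is more accurate and follows epidemiological and biostatistical convention.
      • Adding “with the outcome” makes the target of the association explicit and avoids ambiguity.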

  2. “were selected as the dataset for final model construction” → “were selected for final model construction”
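      • The original sentence contains a logical error: the “10 features” are features (variables), not a “dataset”, and equating the two conflates distinct concepts. “Selected for final model construction” already implies “as the feature set used for modeling”. If the data structure needs emphasis, “as input features” can be added, but the concise form is preferable here.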

  3. “The data was randomly split…” → “The dataset was randomly split…”
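      • “Data” used as an uncountable noun with the singular verb was is grammatically acceptable, but academic writing favors the countable collective noun dataset, whose referent is explicit (here, the sample matrix containing the 10 features). It is more rigorous and keeps the wording consistent with the set-based terms “training set” and “test set” that follow.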

  4. “Due to the approximately 1:5 ratio… there was a significant imbalance…” → “Because the ratio… was approximately 1:5, the training set exhibited class imbalance”
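      • In stricter usage, “due to” is adjectival and should attach to a noun (e.g., the performance degradation was due to class imbalance). In the original it is loosely attached to a “there was…” clause, which leaves the structure slack and the causal chain vague; a Because clause followed by a main clause states the logic more clearly.
      • “Significant imbalance” is a redundant collocation: “imbalance” already conveys disproportion, so it needs no “significant” (a word easily misread as statistical significance). The standard term is class imbalance.
      • The statement is explicitly limited to the training set (because the subsequent resampling is applied only to the training set), which avoids the unclear scope of a general reference to “positive and negative samples”.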

  5. “To avoid the model being biased towards the negative samples during prediction, resulting in a low recall rate for the predictions…” → “To mitigate model bias toward the majority (negative) class—which would otherwise lead to substantially reduced recall for the positive class…”
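      • “Avoid the model being biased” is grammatical (avoid takes a gerund), but the structure is clumsy and heavily passive; mitigate bias is a more precise and more active academic phrasing.
      • “Negative samples” becomes majority (negative) class: at a 1:5 ratio the negatives form the majority class, which is the standard term. The parenthetical “(negative)” preserves readability while keeping the terminology correct.
      • “Low recall rate for the predictions” is incorrect: recall is a metric defined for a specific class (here, the positive class), so it cannot be “for the predictions”. It should be stated explicitly as recall for the positive class.
      • “Substantially reduced” replaces “low”: it conveys the degree of the loss more objectively (consistent with the pronounced 1:5 imbalance) and avoids a subjective adjective.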

  6. “we performed sampling on the training set, while no processing was done on the test set” → “we applied resampling techniques (specifically, SMOTE) to the training set; the test set remained unmodified”
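      • “Performed sampling” is vague because it does not say which kind. Adding resampling techniques and naming SMOTE (if that is what was actually used) makes the method transparent. If no method was specified, “oversampling or undersampling” can be written instead. Here SMOTE is kept as a placeholder example based on common practice, and the authors can replace it with the method actually used.
      • “No processing was done” is passive and colloquial; remained unmodified is concise and formal.
      • A semicolon replaces “while”: it emphasizes the parallel contrast between the two clauses (processed vs. not processed) and brings out the intent of the study design more strongly than a subordinating conjunction.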

  7. “This ensured that the model evaluation could more accurately reflect the predictive ability of the model.” → “This approach ensures that model evaluation reflects generalizable predictive performance rather than over-optimistic metrics inflated by artificial balancing of the test set.”
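      • The original sentence is hollow: “predictive ability” is too general. Academic writing should name the evaluation target, which is generalizable predictive performance.
      • It adds the key rationale, namely the fundamental reason the test set must remain unaltered: to prevent distorted evaluation. If the test set were resampled, prevalence-dependent metrics such as precision and F1 would be inflated, and the evaluation would lose external validity.
      • “Over-optimistic metrics inflated by artificial balancing” pinpoints a common pitfall and adds depth and rigor to the argument.
      • The tense is unified to the present (ensures, reflects), which is the usual tense for stating a general methodological principle.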
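
The point about over-optimistic metrics can be illustrated with a short calculation. The sketch below assumes a hypothetical classifier with fixed sensitivity (0.8) and specificity (0.9), values chosen only for illustration, and compares its expected precision and F1 on a test set at the original 1:5 ratio with a test set artificially balanced to 1:1. Holding sensitivity and specificity fixed is an idealization; with synthetic test points they could shift as well.

```python
def expected_metrics(n_pos, n_neg, sensitivity, specificity):
    """Expected precision, recall, and F1 for a classifier with the given
    sensitivity and specificity on a test set of n_pos positives and
    n_neg negatives."""
    tp = sensitivity * n_pos          # true positives
    fn = n_pos - tp                   # false negatives
    fp = (1 - specificity) * n_neg    # false positives
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)           # equals sensitivity by construction
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Assumed operating point (hypothetical): 80% sensitivity, 90% specificity.
original = expected_metrics(n_pos=100, n_neg=500, sensitivity=0.8, specificity=0.9)
balanced = expected_metrics(n_pos=500, n_neg=500, sensitivity=0.8, specificity=0.9)

for name, (precision, recall, f1) in [("original 1:5", original),
                                      ("balanced 1:1", balanced)]:
    print(f"{name}: precision={precision:.3f} recall={recall:.3f} F1={f1:.3f}")
```

Under these assumptions the same classifier shows noticeably higher precision and F1 on the balanced test set, even though nothing about the model has changed; only the class mix of the evaluation data has.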

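In summary, the revised text eliminates the misused terminology, logical gaps, grammatical flaws, and vague expressions of the original, and conforms to the language conventions and methodological reporting requirements of mainstream international medical and machine-learning journals.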
