To further screen for core genes related to CKD, we used the random forest algorithm to rank the DEGs by importance. Figure 3A lists the top 10 most important genes associated with CKD onset in Cohort 1, and Figure 3B lists the top 10 most important genes associated with CKD progression in Cohort 2. As shown in Figure 3C, intersecting the positively correlated gene set with the top 10 genes from the random forest analysis of Cohort 1 identified CCL2 as a core positively correlated gene. As shown in Figure 3D, we identified two core negatively correlated genes, SUCLG1 and ACADM. These three genes formed the Minimal gene set (CCL2, SUCLG1, ACADM).
Following a published approach that divided genes into modules using DEGs (PMID: 39402203), we repeated the WGCNA using only the DEGs in Cohort 1 and Cohort 2 of GSE137570. As shown in Supplementary Figures 3A and 3B, the DEGs in Cohort 1 were divided into 4 gene modules, of which module blue (r=0.759, P<0.0001) was the most significantly positively correlated with CKD onset and module brown (r=-0.815, P<0.0001) was the most significantly negatively correlated. In Cohort 2, the module most significantly positively correlated with CKD progression was module turquoise (r=0.75, P=0.00053), and the module most significantly negatively correlated was module grey (r=-0.869, P<0.0001). Compared with Figures 2B and 2D, the DEG-based WGCNA yielded fewer modules with stronger module-trait correlations. Next, we intersected the genes of the most significant modules with the top 10 important genes screened by random forest, as shown in Supplementary Figures 3C and 3D. The core genes most significantly positively correlated with CKD onset were CCL2 and MMP7 (2 genes), while the core genes most significantly negatively correlated were GGT6, PCK2, SFXN2, SLC34A3, ALPL, GLTPD2, ACADM, and SUCLG1 (8 genes).
The core genes most significantly positively correlated with CKD progression were CCL2, CLDN1, SLC34A2, OSMR, and C1RL (5 genes), but no core genes were identified as most significantly negatively correlated with CKD progression. Considering that a gene set of around 10 core genes has better clinical translational feasibility, we formed two further gene sets for subsequent studies: a Maximal gene set (CCL2, MMP7, GGT6, PCK2, SFXN2, SLC34A3, ALPL, GLTPD2, ACADM, SUCLG1) and a Medium gene set (CCL2, GGT6, PCK2, SFXN2, SLC34A3, ALPL, GLTPD2, ACADM, SUCLG1), the latter comprising CCL2 and the eight core genes negatively correlated with CKD onset. By integrating WGCNA and random forest analysis of two gene expression profiles (whole-genome and DEGs), we obtained three gene sets of different sizes related to CKD onset and progression.
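The intersection step and the relationship among the three gene sets can be illustrated with a few set operations. The following is a minimal sketch only: the helper names (`top_n_genes`, `core_genes`) and the toy importance values with placeholder gene names are hypothetical and are not taken from the study, whereas the gene lists in the second half are copied from the text above.

```python
def top_n_genes(importance, n=10):
    """Return the n genes with the highest importance score."""
    ranked = sorted(importance, key=importance.get, reverse=True)
    return set(ranked[:n])


def core_genes(importance, module_genes, n=10):
    """Core genes: top-n genes by importance that also belong to the module."""
    return top_n_genes(importance, n) & set(module_genes)


# Toy illustration: placeholder gene names and made-up importance values.
toy_importance = {"g1": 0.30, "g2": 0.25, "g3": 0.20, "g4": 0.15, "g5": 0.10}
toy_module = {"g2", "g3", "g5", "g9"}
print(sorted(core_genes(toy_importance, toy_module, n=3)))  # ['g2', 'g3']

# Core gene lists as reported in the text.
onset_positive = {"CCL2", "MMP7"}
onset_negative = {"GGT6", "PCK2", "SFXN2", "SLC34A3",
                  "ALPL", "GLTPD2", "ACADM", "SUCLG1"}
progression_positive = {"CCL2", "CLDN1", "SLC34A2", "OSMR", "C1RL"}

minimal = {"CCL2", "SUCLG1", "ACADM"}
medium = {"CCL2"} | onset_negative
maximal = onset_positive | onset_negative

# The three gene sets are nested: Minimal < Medium < Maximal.
assert minimal < medium < maximal
print(len(minimal), len(medium), len(maximal))        # 3 9 10
print(sorted(onset_positive & progression_positive))  # ['CCL2']
print(sorted(maximal - medium))                       # ['MMP7']
```

Read this way, the Maximal set is the union of the onset-related core genes, the Medium set differs from it only by MMP7, and CCL2 is the only core gene shared between the onset and progression lists.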
We scored the three gene sets using four gene set enrichment algorithms (GSVA, ssGSEA, z-score, and PLAGE) and evaluated the classification accuracy of these scores for binary diagnosis using ROC curves. In the internal validation in Cohort 1 of GSE137570 (Figure 3E), all scoring methods had excellent diagnostic performance (AUC > 0.8) except Minimal GSVA (AUC: 0.778) and Maximal ssGSEA (AUC: 0.794), which had moderate diagnostic performance (AUC < 0.8). In the internal validation in Cohort 2 (Figure 3F), all scoring methods had excellent diagnostic performance except Medium ssGSEA (AUC: 0.764), Maximal ssGSEA (AUC: 0.764), Minimal z-score (AUC: 0.708), Medium z-score (AUC: 0.792), and Maximal z-score (AUC: 0.792), which had moderate performance. We then used GSE66494 (n=61, whole-kidney microarray dataset) and GSE180394 (n=59, renal tubule microarray dataset) as external validation datasets to evaluate the diagnostic performance of the different combinations of scoring method and gene set. In the validation in GSE66494 (Figure 3G), only Minimal GSVA (AUC: 0.776), Medium GSVA (AUC: 0.734), Medium PLAGE (AUC: 0.703), and Maximal PLAGE (AUC: 0.767) had moderate diagnostic performance. In the validation in GSE180394 (Figure 3H), Minimal ssGSEA (AUC: 0.718), Medium PLAGE (AUC: 0.716), and Maximal PLAGE (AUC: 0.760) had moderate diagnostic performance for CKD, while Minimal PLAGE (AUC: 0.884) had excellent diagnostic performance. In summary, no single combination of gene set and scoring method performed consistently across the internal and external datasets, so it is challenging to standardize a single gene set and a single scoring method in this study, and further integration with prognostic diagnostic models is required for performance assessment.
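To make the scoring-plus-ROC step concrete, the following minimal sketch computes a combined z-score for a gene set and the corresponding AUC. It assumes the usual combined z-score definition (per-gene z-scores across samples, summed and divided by the square root of the number of genes) and the rank-based definition of the AUC. The function names, placeholder genes, and toy values are illustrative and are not from the study, and the sketch does not address how genes with opposite directions within one set were handled, which the text does not specify.

```python
from math import sqrt
from statistics import mean, stdev


def zscore_gene_set_score(expr, gene_set):
    """Combined z-score per sample: sum of per-gene z-scores / sqrt(k).

    expr maps a gene name to a list of expression values, one per sample.
    Genes missing from expr are ignored.
    """
    genes = [g for g in gene_set if g in expr]
    if not genes:
        raise ValueError("no gene of the set is present in expr")
    n_samples = len(expr[genes[0]])
    totals = [0.0] * n_samples
    for g in genes:
        values = expr[g]
        mu, sd = mean(values), stdev(values)
        for i, v in enumerate(values):
            # A gene with constant expression contributes nothing.
            totals[i] += (v - mu) / sd if sd > 0 else 0.0
    return [t / sqrt(len(genes)) for t in totals]


def auc(scores, labels):
    """ROC AUC: probability that a case (label 1) scores above a control
    (label 0), with ties counted as one half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data (made up): two placeholder genes, four samples, last two are cases.
toy_expr = {"geneA": [1.0, 2.0, 3.0, 4.0], "geneB": [2.0, 1.0, 4.0, 3.0]}
toy_labels = [0, 0, 1, 1]
toy_scores = zscore_gene_set_score(toy_expr, ["geneA", "geneB"])
print(auc(toy_scores, toy_labels))                 # 1.0
print(auc([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1]))    # 0.75
```

An AUC below 0.5 under this definition would indicate that the score is inversely associated with the case label, which is one reason the direction of each gene matters when a set mixes positively and negatively correlated genes.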