Predicción y análisis de la deserción estudiantil en la Facultad de Ingeniería de la Universidad El Bosque mediante modelos avanzados de Machine Learning: Identificación de variables más influyentes, caracterización de graduados y no graduados, y detección temprana
| dc.contributor.advisor | Rodriguez Arango, Emiliano | |
| dc.contributor.author | Fajardo Gil, Olga Patricia | |
| dc.contributor.author | Useche Becerra, Diego Alejandro | |
| dc.contributor.author | Hernández Novoa, Tatiana Milena | |
| dc.date.accessioned | 2025-07-11T21:08:55Z | |
| dc.date.available | 2025-07-11T21:08:55Z | |
| dc.date.issued | 2025-06 | |
| dc.description.abstract | Este estudio presenta un enfoque mixto para predecir y analizar la deserción estudiantil en la Facultad de Ingeniería de la Universidad El Bosque, integrando técnicas supervisadas y no supervisadas de aprendizaje automático. Se utilizaron datos de 6 054 estudiantes (2008–2025) con variables sociodemográficas (género, estrato socioeconómico, jornada, tipo de estudiante y estado civil) y académicas (edad y semestres aprobados). El preprocesamiento incluyó imputación de valores faltantes mediante KNN, codificación ordinal y selección de atributos. Se entrenaron y compararon modelos supervisados (árboles de decisión, Random Forest, SVM, XGBoost y LightGBM) con validación cruzada estratificada. LightGBM obtuvo el mejor desempeño (F1-macro = 0,895; AUC = 0,95), destacándose por su bajo número de falsos negativos, ideal para alertas tempranas. La interpretabilidad del modelo se abordó con SHAP, que identificó como predictores clave los semestres aprobados, estrato, jornada, tipo de estudiante y edad. Complementariamente, se aplicó K-Modes (k = 7) para identificar siete perfiles estudiantiles con trayectorias y condiciones heterogéneas. El Análisis de Correspondencias Múltiples reforzó esta segmentación, visualizando asociaciones significativas entre categorías. Los perfiles resultantes incluyen grupos consolidados, como mujeres avanzadas con alto desempeño, y otros en riesgo, como jóvenes de estrato bajo o adultos en jornada nocturna. En conjunto, este marco metodológico permite anticipar riesgos de deserción, comprender la diversidad estudiantil y orientar decisiones institucionales. Su aplicación puede fortalecer políticas de retención, optimizar recursos y promover una educación más equitativa y basada en datos. | |
| dc.description.abstractenglish | This study presents a mixed approach to predicting and analyzing student dropout in the School of Engineering at Universidad El Bosque, integrating supervised and unsupervised machine learning techniques. The dataset comprises 6,054 students (2008–2025) with sociodemographic variables (gender, socioeconomic stratum, study schedule, student type, and marital status) and academic variables (age and semesters passed). Data preprocessing included missing-value imputation using KNN, ordinal encoding, and feature selection. Supervised models (decision trees, Random Forest, SVM, XGBoost, and LightGBM) were trained and compared using stratified cross-validation. LightGBM achieved the best performance (F1-macro = 0.895; AUC = 0.95), with a notably low number of false negatives, making it suitable for early dropout alert systems. SHAP was used to interpret the model and identified the most influential predictors as the number of semesters passed, socioeconomic stratum, study schedule, student type, and age. In parallel, K-Modes clustering (k = 7) was applied to uncover seven distinct student profiles with heterogeneous academic and demographic trajectories. Multiple Correspondence Analysis (MCA) further supported this segmentation by revealing significant associations among categories. The resulting profiles range from consolidated trajectories, such as high-performing advanced female students, to at-risk groups, such as low-income freshmen or adult night-shift students. Overall, this methodological framework makes it possible to anticipate dropout risk, profile a diverse student body, and support data-driven institutional decisions. Its application can strengthen retention policies, improve resource allocation, and promote a more equitable and evidence-based approach to higher education management. | |
| dc.identifier.uri | https://hdl.handle.net/20.500.12495/14932 | |
| dc.language.iso | es | |
| dc.relation.references | Álvarez-García, M., Arenas-Parra, M., & Ibar-Alonso, R. (2024). Uncovering student profiles: An explainable cluster analysis approach to PISA 2022. Computers & Education, 223 (100), 105166. https://doi.org/10.1016/j.compedu.2024.105166 | |
| dc.relation.references | Anahua, T., Yana, V., & ReynosoD, A. (2025). Machine Learning Algorithms for Predicting Student Dropout in Engineering Programs. Information Management and Big Data: 11th Annual International Conference, SIMBig 2024, Ilo, Peru, November 20–22, 2024, Proceedings, 2496, 68. | |
| dc.relation.references | Barros, T. M., Silva, I., & Guedes, L. A. (2019). Determination of dropout student profile based on correspondence analysis technique. IEEE Latin America Transactions, 17 (9), 1517-1523. https://doi.org/10.1109/TLA.2019.8931146 | |
| dc.relation.references | Boström, H. (2020). MissingPy: Imputation and missing data analysis in Python. | |
| dc.relation.references | Boteju, G., Tang, L., & Brown, M. S. (2024). Support Vector Machine for Predicting Student Dropout Under Different Normalization Methods. 2024 IEEE International Conference on Big Data (BigData), 8633-8636. | |
| dc.relation.references | Chen, L. (2024). Model-driven early alert systems for student retention. Educational Data Science, 2 (1), 1-15. | |
| dc.relation.references | Choque-Soto, V. M., Sosa-Jauregui, V. D., & Ibarra, W. (2025). Characterization of the dropout student profile using data mining techniques. Revista de Gestão Social e Ambiental. https://www.researchgate.net/publication/389099375 | |
| dc.relation.references | Dake, D. K., Bada, G. K., & Techie-Menson, H. (2023). Using Machine Learning to Cluster and Predict the Learning Pattern of University Students. Telematique, 22 (1), 77-84. https://www.researchgate.net/publication/367190613 | |
| dc.relation.references | Dake, D. K., Bada, G. K., & Techie-Menson, H. (2024). Model interpretability on private-safe oriented student dropout prediction. PLOS ONE, 19 (5), e0317726. https://doi.org/10.1371/journal.pone.0317726 | |
| dc.relation.references | Darwis, M., Hasibuan, L. H., & Firmansyah, M. (2025). An analysis of students’ academic performance using K-means clustering algorithm. JISA (Jurnal Informatika dan Sains). https://www.researchgate.net/publication/359547536 | |
| dc.relation.references | Estrada, J. C., & Rodríguez, A. (2022). Using SHAP Values to Explain Dropout Prediction Models in Higher Education [Tesis doctoral, Tecnológico de Monterrey]. https://repositorio.tec.mx/bitstreams/f2e9ebda-ed86-4fd3-a9494ee8b9ba26f2/download | |
| dc.relation.references | Flores-Báez, M. A. (2023). Dificultades en la asignatura de Formación Cívica y Ética como causa de deserción escolar en secundaria. Revista Latinoamericana de Educación, 16 (2), 35-52. https://alumnieditora.com/index.php/ojs/article/view/175 | |
| dc.relation.references | GeeksforGeeks. (2025). K-mode clustering in Python. https://www.geeksforgeeks.org/k-mode-clustering-in-python/ | |
| dc.relation.references | Hanji, S. (2023). Predicting Student Dropout Using Machine Learning. | |
| dc.relation.references | Harris, C. R., Millman, K. J., van der Walt, S. J., Gommers, R., Virtanen, P., & Cournapeau, D. (2020). Array programming with NumPy. Nature, 585 (7825), 357-362. https://doi.org/10.1038/s41586-020-2649-2 | |
| dc.relation.references | Hunter, J. D. (2007). Matplotlib: A 2D graphics environment. Computing in Science & Engineering, 9 (3), 90-95. https://doi.org/10.1109/MCSE.2007.55 | |
| dc.relation.references | Instituto Nacional de Estadística y Geografía (INEGI). (2023). Estadísticas a propósito del día internacional de la educación [Consultado el 23 de mayo de 2025]. https://www.inegi.org.mx | |
| dc.relation.references | Jin, W., Li, X., & Zhang, H. (2024). From Data to Decision: Machine Learning and Explainable AI in Student Dropout Prediction. Journal of Education and Learning. https://ibimapublishing.com/articles/JELHE/2024/246301/ | |
| dc.relation.references | Kabir, R. (2024). Student Dropout Prediction Through Machine Learning Optimization. Scientific Reports, 14, 10032. https://www.nature.com/articles/s41598-025-93918-1 | |
| dc.relation.references | Kaur, H., & Sharma, V. (2023). A Study on Dropout Prediction for University Students Using Machine Learning. | |
| dc.relation.references | Ke, G., Meng, Q., Finley, T., Wang, T., Chen, W., Ma, W., Ye, Q., & Liu, T.-Y. (2017). Lightgbm: A highly efficient gradient boosting decision tree. Advances in neural information processing systems, 30. | |
| dc.relation.references | Khan, R. A. (2023). Supervised Machine Learning Algorithms for Predicting Student Dropout and Academic Success: A Comparative Study. International Journal of Data Science and Analytics, 7 (1), 45-59. https://link.springer.com/article/10.1007/s44163-023-00079-z | |
| dc.relation.references | Liu, Y., Wang, H., & Xu, Y. (2023). Who Will Dropout from University? Academic Risk Prediction Based on Interpretable Machine Learning. | |
| dc.relation.references | Lundberg, S. M., & Lee, S.-I. (2017). A unified approach to interpreting model predictions. Advances in Neural Information Processing Systems, 4768-4777. | |
| dc.relation.references | Malik, S., & Tüfekci, A. (2023). An Explainable Machine Learning Approach to Predicting and Understanding Dropouts in MOOCs. Educational Data Mining Conference. https://open.metu.edu.tr/bitstream/handle/11511/102701/10.24106-kefdergi.1246458-2933723.pdf | |
| dc.relation.references | McKinney, W. (2010). Data structures for statistical computing in Python. Proceedings of the 9th Python in Science Conference, 51-56. https://doi.org/10.25080/Majora-92bf1922-00a | |
| dc.relation.references | Melchor, F., Conejero, J. M., Fernández-García, A. J., Sánchez-Figueroa, F., & Rodríguez Echeverría, R. (2025). An empirical evaluation of clustering processes for early detection of university dropout. https://doi.org/10.21203/rs.3.rs-6146415/v1 | |
| dc.relation.references | Ministerio de Educación del Perú. (2021). Análisis de la deserción escolar en el nivel primario durante la pandemia por COVID-19 [Consultado el 23 de mayo de 2025]. https://www.inei.gob.pe/media/MenuRecursivo/investigaciones/desercion-escolar.pdf | |
| dc.relation.references | Ministerio de Educación Nacional de Colombia. (2022). Indicadores de eficiencia interna del sistema educativo [Consultado el 23 de mayo de 2025]. https://www.mineducacion.gov.co | |
| dc.relation.references | Nguyen, T. (2020). Predicting student dropout using machine learning: A review. IEEE Access, 8, 20036-20054. | |
| dc.relation.references | Nguyen, T. M., Doan, Q. H., & Pham, V. T. (2023). A comparative study of clustering algorithms for categorical data: Evaluation and applications. Expert Systems with Applications, 213, 118957. https://doi.org/10.1016/j.eswa.2022.118957 | |
| dc.relation.references | Olugbara, C. T., Letseka, M., & Olugbara, O. O. (2021). Multiple correspondence analysis of factors influencing student acceptance of massive open online courses. Sustainability, 13 (23), 13451. https://doi.org/10.3390/su132313451 | |
| dc.relation.references | Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., & Blondel, M. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12, 2825-2830. | |
| dc.relation.references | Quinlan, J. R. (1986). Induction of decision trees. Machine learning, 1, 81-106. | |
| dc.relation.references | ResearchGate. (2025). An empirical evaluation of clustering processes for early detection of university dropout. https://www.researchgate.net/publication/389606807 | |
| dc.relation.references | Romero, L., & Ventura, J. (2020). Factores asociados a la deserción en ingeniería. Revista Educación y Desarrollo, 18, 45-60. | |
| dc.relation.references | Sanchez, D., Martínez, G., & Torres, M. (2023). College Dropout Factors: An Analysis with LightGBM and Shapley’s Cooperative Game Theory. | |
| dc.relation.references | Tinto, V. (1993). Leaving College: Rethinking the Causes and Cures of Student Attrition. University of Chicago Press. | |
| dc.relation.references | Torres Acero, N. (2022). Modelos para la predicción de deserción universitaria de estudiantes de Psicología de la Universidad El Bosque [Trabajo de grado]. Universidad El Bosque. | |
| dc.relation.references | UNESCO. (2023). Why are boys at risk of dropping out of school? [Consultado el 23 de mayo de 2025]. https://www.unesco.org/es/genderequality/education/boys | |
| dc.relation.references | Vásquez, J. (2016). Modelo predictivo para estimar la deserción de estudiantes en una institución de educación superior [Tesis de magíster]. Universidad de Chile, Facultad de Economía y Negocios. | |
| dc.relation.references | Villar, A., & de Andrade, C. R. V. (2024). Supervised machine learning algorithms for predicting student dropout and academic success: a comparative study. Discover Artificial Intelligence, 4 (1), 2. | |
| dc.relation.references | Waskom, M. (2021). Seaborn: Statistical data visualization. Journal of Open Source Software, 6 (60), 3021. https://doi.org/10.21105/joss.03021 | |
| dc.rights | Attribution-NoDerivatives 4.0 International | en |
| dc.rights.uri | http://creativecommons.org/licenses/by-nd/4.0/ | |
| dc.subject | Deserción estudiantil | |
| dc.subject | Machine Learning | |
| dc.subject | LightGBM | |
| dc.subject | SHAP | |
| dc.subject | Agrupamiento K-Modes | |
| dc.subject.keywords | Academic Dropout | |
| dc.subject.keywords | Machine Learning | |
| dc.subject.keywords | LightGBM | |
| dc.subject.keywords | SHAP | |
| dc.subject.keywords | K-Modes clustering | |
| dc.title | Predicción y análisis de la deserción estudiantil en la Facultad de Ingeniería de la Universidad El Bosque mediante modelos avanzados de Machine Learning: Identificación de variables más influyentes, caracterización de graduados y no graduados, y detección temprana | |
| dc.title.translated | Prediction and Analysis of Student Dropout Rate at the Faculty of Engineering, Universidad El Bosque Using Advanced Machine Learning Models: Identification of the Most Influential Variables, Characterization of Graduates and Non-Graduates, and Early Detection. |
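
The abstract reports model quality as macro-averaged F1 and stresses a low number of false negatives. As a minimal illustration of how those two quantities are computed (not the authors' code), the sketch below uses only the Python standard library. The toy labels, the coding 1 = dropout and 0 = graduate, and the function names are assumptions made for this example.

```python
def per_class_f1(y_true, y_pred, label):
    """F1 for one class, treating `label` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    denom = 2 * tp + fp + fn
    # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
    return 0.0 if denom == 0 else 2 * tp / denom


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1, so each class counts equally."""
    labels = sorted(set(y_true) | set(y_pred))
    return sum(per_class_f1(y_true, y_pred, c) for c in labels) / len(labels)


# Hypothetical toy data: 1 = dropout, 0 = graduate (assumed coding).
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 1]

# A false negative here is a student who dropped out but was predicted to graduate.
false_negatives = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

score = macro_f1(y_true, y_pred)
print(f"macro F1 = {score:.3f}, false negatives = {false_negatives}")
```

Because macro averaging weights both classes equally, a model cannot score well by favoring the majority class alone, which is one reason the metric suits dropout data where the two outcomes are unevenly represented.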
Files
Original bundle
1 - 1 of 1
- Name:
- Trabajo de grado.pdf
- Size:
- 2.24 MB
- Format:
- Adobe Portable Document Format
License bundle
1 - 3 of 3
- Name:
- license.txt
- Size:
- 1.95 KB
- Format:
- Item-specific license agreed upon to submission
- Description:
- Name:
- Carta de autorizacion.pdf
- Size:
- 583.9 KB
- Format:
- Adobe Portable Document Format
- Description:
- Name:
- Anexo 1 Acta de aprobacion.pdf
- Size:
- 381.8 KB
- Format:
- Adobe Portable Document Format
- Description:
