Language models for generating programming questions with varying difficulty levels



Keywords:

Large Language Models, ChatGPT, Question Generation, Adaptation, Gamification, Python, Difficulty, Pedagogy


Introduction: This study explores the potential of Large Language Models (LLMs), specifically ChatGPT-4, to generate Python programming questions with varying degrees of difficulty. This ability could significantly enhance adaptive educational applications. Methodology: Experiments were conducted with ChatGPT-4 and human participants to evaluate its ability to generate programming questions across various topics and difficulty levels. Results: The results reveal a moderate positive correlation between the difficulty ratings assigned by ChatGPT-4 and the perceived difficulty ratings given by participants. ChatGPT-4 proves effective at generating questions that cover a wide range of difficulty levels. Discussion: The study highlights ChatGPT-4’s potential for use in adaptive educational applications that accommodate different learning competencies and needs. Conclusions: This study presents a prototype of a gamified educational application for teaching Python, which uses ChatGPT to automatically generate questions of varying difficulty levels. Future studies should conduct more exhaustive experiments, explore other programming languages, and address more complex programming concepts.
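As a rough illustration of the kind of comparison described in the Results, the sketch below computes a rank correlation between model-assigned and participant-perceived difficulty ratings. The abstract does not say which correlation coefficient the authors used, so the choice of Spearman's rho here is an assumption, and the ratings are invented placeholder values, not data from the study.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; assumes neither input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def spearman(x, y):
    """Spearman's rho: the Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical ratings on a 1-5 scale (illustrative only, not study data).
model_ratings = [1, 2, 2, 3, 4, 5, 5, 3]
participant_ratings = [2, 1, 3, 3, 3, 4, 5, 2]

rho = spearman(model_ratings, participant_ratings)
print(f"Spearman rho = {rho:.2f}")
```

A rank-based coefficient is a common choice for ordinal difficulty scales because it depends only on the ordering of the ratings, not on the spacing between scale points.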



Author Biographies

Christian Lopez, Lafayette College & Universidad Nacional Pedro Henríquez Ureña (UNPHU)

Christian Lopez is an Assistant Professor of Computer Science with an affiliation in Mechanical Engineering at Lafayette College. His research interests are in the design and optimization of intelligent decision support systems and persuasive technologies to augment human proficiencies. In practice, he designs and builds systems that help people make better decisions and improve task performance by integrating technologies and methods from science and engineering, such as Machine Learning and Virtual Reality. In some cases, these systems also need to motivate individuals; hence the use of persuasive technologies like gamification.

Miles Morrison, Lafayette College

Miles Morrison is pursuing an undergraduate degree in Integrative Engineering with a Robotics Focus at Lafayette College in Easton, PA, and is expected to graduate in 2026. He intends to pursue a graduate degree after obtaining his bachelor’s from Lafayette College to further his expertise. This is his first official contribution to research work, and he expects to contribute to more in the future. His research and professional interests include applications of artificial intelligence, robotics, digital automation, and systems optimization.

Matthew Deacon, Lafayette College

Matthew Deacon is pursuing an undergraduate degree in Mechanical Engineering with a minor in Economics at Lafayette College in Easton, PA, and is expected to graduate in 2026. He intends to pursue an MBA after obtaining his bachelor’s degree. In the summer of 2021, Matthew completed a paper on stroke data for Prof. Guillermo Goldsztein from Georgia Tech as part of the Data Science and Machine Learning course for Horizon Inspires Academic. He also completed an online course called “Programming for Everybody - Getting started with Python” through the University of Michigan. Matthew’s professional interests include the use of engineering to innovate and create new products, applications, or technologies.




How to Cite

Lopez, C., Morrison, M., & Deacon, M. (2024). Language models for generating programming questions with varying difficulty levels. European Public & Social Innovation Review, 9, 1–19.



