Predictive power of semantic information in the use context of multiword terms for their structural disambiguation

Juan Rojas-García

doi:10.31637/epsir-2025-2070

Authors

Juan Rojas-García University of Granada https://orcid.org/0000-0002-7611-1386

Keywords:

multiword-term bracketing, bracketing prediction, random forest model, decision-tree model, verb lexical domain, semantic role, semantic category, semantic relation

Abstract

Introduction: The structural disambiguation of English multiword terms (MWTs) of three or more constituents (e.g., coastal sediment transport), often known as bracketing, involves grouping the dependent components so that the MWT is reduced to its basic modifier+head form, as in coastal [sediment transport], which is a right-bracketed ternary compound. This work presents a study that explored whether the bracketing of a ternary compound, when used as an argument in a sentence, can be predicted from the semantic information encoded in that sentence.

Methodology: A set of 1,694 sentences was analyzed semantically and annotated with the lexical domain of the verbs, the semantic role and category of the arguments, and the semantic relation between the arguments. These semantic variables were then analyzed statistically to determine whether they can predict the bracketing of a ternary compound.

Results: A random forest model that used the lexical domain of the verb together with the semantic role and category of the MWT predicted the bracketing of the ternary compounds used as arguments in a sample of 380 MWTs (100% F₁-score). A decision tree that used only the semantic relation of the MWT to another argument in the same sentence also predicted the bracketing of the ternary compounds in the sample (94.12% F₁-score).

Discussion: Only a subset of three variables was necessary for bracketing prediction with error-free performance, whereas previous research employed a minimum of 12 variables.

Conclusion: The semantic information in a sentence contributed substantially to compound parsing. This suggests a novel research direction: the integration of semantic variables into syntactic parsers and machine-translation applications.
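As an illustration of the two notions the abstract relies on, the following minimal Python sketch renders a ternary compound in its left- or right-bracketed modifier+head form and computes an F₁-score from raw counts. It is not the author's implementation: the class name `TernaryCompound`, the function `f1_score`, and the counts passed to it are hypothetical, and the sketch assumes a single-class F₁ (the abstract does not state how the score was averaged).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TernaryCompound:
    """A three-constituent multiword term (hypothetical helper class)."""
    first: str
    second: str
    third: str

    def bracket(self, side: str) -> str:
        """Return the compound in modifier+head form.

        'right' groups the last two constituents,
        'left' groups the first two.
        """
        if side == "right":
            return f"{self.first} [{self.second} {self.third}]"
        if side == "left":
            return f"[{self.first} {self.second}] {self.third}"
        raise ValueError(f"unknown bracketing: {side!r}")


def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 = 2PR / (P + R) for one class, from true/false positive/negative counts."""
    if tp == 0:
        return 0.0  # no true positives: F1 is defined as 0 here
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Example from the abstract: a right-bracketed ternary compound.
mwt = TernaryCompound("coastal", "sediment", "transport")
print(mwt.bracket("right"))

# Hypothetical counts: a classifier with no errors reaches an F1 of 1.0.
print(f1_score(tp=10, fp=0, fn=0))
```

With no false positives and no false negatives, precision and recall both equal 1, which corresponds to the error-free 100% F₁-score reported for the random forest model.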

Author Biography

Juan Rojas-García, University of Granada

Juan Rojas-García is a Junior Lecturer at the University of Granada. He holds a PhD in Translation and Interpreting from the University of Granada and three Master's degrees: in Teaching Spanish as a Foreign Language, in Teaching Spanish Language and Literature in Secondary Education, and in Data Science. In addition, he holds a degree in Telecommunication Systems Engineering from the University of Malaga. He is a member of the research group LexiCon (University of Granada). He completed a doctoral thesis in the field of Terminology, for which he received a research grant (FPU) from the Spanish Ministry of Economy and Competitiveness. His research areas are terminology, the representation of named entities in terminological knowledge bases, text mining, and the employability of Translation and Interpreting students. He has presented and published research papers in these areas at international conferences and in journals on applied linguistics, natural language processing, and translatology.
