Predictive power of semantic information in the use context of multiword terms for their structural disambiguation
DOI: https://doi.org/10.31637/epsir-2025-2070
Keywords:
multiword-term bracketing, bracketing prediction, random forest model, decision-tree model, verb lexical domain, semantic role, semantic category, semantic relation
Abstract
Introduction: The structural disambiguation of English multiword terms (MWTs) of three or more constituents (e.g., coastal sediment transport), often known as bracketing, involves grouping the dependent components so that the MWT is reduced to its basic form of modifier+head, as in coastal [sediment transport], which is a right-bracketed ternary compound. This work presents a study that explored whether the bracketing of a ternary compound, when used as an argument in a sentence, can be predicted from the semantic information encoded in that sentence. Methodology: A set of 1,694 sentences was analyzed semantically and annotated with the lexical domain of the verbs, the semantic role and category of the arguments, and the semantic relation between the arguments. These semantic variables were then analyzed statistically to determine whether they can predict the bracketing of a ternary compound. Results: A random forest model that used the lexical domain of the verb and the semantic role and category of the MWT was able to predict the bracketing of the ternary compounds used as arguments in a sample of 380 MWTs (100% F1-score). A decision tree that used only the semantic relation of the MWT to another argument in the same sentence was also able to predict the bracketing of the ternary compounds in the sample (94.12% F1-score). Discussion: Only a subset of three variables was necessary for bracketing prediction with error-free performance, whereas previous research employed a minimum of 12 variables. Conclusion: The semantic information in a sentence contributed substantially to compound parsing. This suggests a novel research direction in the integration of semantic variables into syntactic parsers and machine-translation applications.
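The two notions the abstract relies on, bracketing a ternary compound and scoring predictions with the F1 measure, can be illustrated with a minimal Python sketch. This is not the study's code or model: the function names `bracket` and `f1_score`, the left-bracketed example "sea level rise", and the toy label lists are all assumptions made here for illustration only.

```python
from typing import List, Tuple, Union

# A bracketed ternary compound is a binary tree (modifier, head),
# where one of the two slots is itself a two-word pair.
Tree = Tuple[Union[str, Tuple[str, str]], Union[str, Tuple[str, str]]]


def bracket(term: str, side: str) -> Tree:
    """Reduce a three-word term to modifier+head form."""
    words = term.split()
    if len(words) != 3:
        raise ValueError("expected exactly three constituents")
    first, second, third = words
    if side == "right":
        return (first, (second, third))   # coastal [sediment transport]
    if side == "left":
        return ((first, second), third)   # [sea level] rise
    raise ValueError("side must be 'left' or 'right'")


def f1_score(gold: List[str], predicted: List[str], positive: str) -> float:
    """Harmonic mean of precision and recall for one bracketing class."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy labels, not data from the study.
gold = ["right", "left", "right", "left"]
predicted = ["right", "left", "left", "left"]

right_example = bracket("coastal sediment transport", "right")
left_example = bracket("sea level rise", "left")
toy_f1 = f1_score(gold, predicted, positive="right")
```

In this sketch a classifier would only need to output the label "left" or "right"; the reported F1-scores measure how often such labels match the gold bracketing.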
License
Copyright 2025 Juan Rojas-García

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International license.
Authors who publish with this journal agree to the following terms:
- Authors retain copyright and grant the journal right of first publication with the work simultaneously licensed under Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0), which allows others to share the work with an acknowledgement of the work's authorship and initial publication in this journal.
- Authors are able to enter into separate, additional contractual arrangements for the non-exclusive distribution of the journal's published version of the work (e.g., post it to an institutional repository or publish it in a book), with an acknowledgement of its initial publication in this journal.
- Authors are permitted and encouraged to post their work online (e.g., in institutional repositories or on their website) prior to and during the submission process, as it can lead to productive exchanges, as well as earlier and greater citation of published work (See The Effect of Open Access).
Funding data
- Ministerio de Ciencia e Innovación
  Grant numbers: PID2020-118369GB-I00; "Transversal Integration of Culture in a Terminological Knowledge Base on Environment" (TRANSCULTURE)