Beyond Characters: Subword-level Morpheme Segmentation

Peters, B. ; Martins, A.

Beyond Characters: Subword-level Morpheme Segmentation, Proc. Workshop on Computational Research in Phonetics, Phonology, and Morphology (SIGMORPHON), Seattle, United States, July 2022.

This paper presents DeepSPIN's submissions to the SIGMORPHON 2022 Shared Task on Morpheme Segmentation.
We make three submissions, all to the word-level subtask.
First, we show that entmax-based sparse sequence-to-sequence models deliver large improvements over conventional softmax-based models, echoing results from other tasks.
Then, we challenge the assumption that models for morphological tasks should be trained at the character level by building a transformer that generates morphemes as sequences of unigram language model-induced subwords.
This subword transformer outperforms all of our character-level models and wins the word-level subtask.
Although we do not make an official submission to the sentence-level subtask, we show that this subword-based approach is highly effective there as well.
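
To illustrate the contrast between softmax and entmax-style sparse output layers mentioned above, here is a minimal sketch of sparsemax, the alpha = 2 member of the entmax family. The abstract does not state which alpha the submitted models use, so this is an illustration of the general idea rather than the authors' implementation. The function names and the example scores are hypothetical.

```python
import math


def softmax(scores):
    """Dense baseline: every entry gets strictly positive probability."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sparsemax(scores):
    """Sparsemax (entmax with alpha = 2): Euclidean projection of the
    scores onto the probability simplex. Low-scoring entries can receive
    exactly zero probability."""
    z = sorted(scores, reverse=True)
    cumsum = 0.0
    tau = 0.0
    for k, z_k in enumerate(z, start=1):
        cumsum += z_k
        # Entry k stays in the support while 1 + k * z_k > sum of top-k scores.
        if 1 + k * z_k > cumsum:
            tau = (cumsum - 1) / k  # threshold for the current support size
        else:
            break
    return [max(s - tau, 0.0) for s in scores]


# Hypothetical scores for three candidate output symbols.
scores = [1.0, 0.5, -1.0]
dense = softmax(scores)
sparse = sparsemax(scores)
```

With these scores, `softmax` assigns nonzero probability to all three candidates, while `sparsemax` assigns the weakest candidate exactly zero. In a sequence-to-sequence decoder, this kind of sparsity removes implausible continuations from the output distribution entirely.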