Improving Legal Rhetorical Role Labeling Through Additional Data and Efficient Exploitation of Transformer Models
Keywords: sentence encoding, BERT, judgment, sentence classification
Legal AI, the application of Artificial Intelligence (AI) in the legal domain, is a research field that comprises several dimensions and tasks of interest. As in other targeted application domains, one of the desired benefits is task automation, which increases the productivity of legal professionals and makes the law more accessible to the general public. Text is an important data source in the legal domain, and Legal AI therefore takes great interest in advances in Natural Language Processing (NLP). This thesis concerns the automation of Legal Rhetorical Role Labeling (RRL), a task that assigns semantic functions to sentences in legal documents. Legal RRL is a relevant task because it identifies information that is useful both on its own and for downstream tasks such as legal summarization and case law retrieval. Several factors make Legal RRL a non-trivial task, even for humans: the heterogeneity of document sources, the lack of standards, the domain expertise required, and the subjectivity inherent in the task. These complicating factors and the large volume of legal documents justify automating the task. Such automation can be implemented as a sentence classification task, i.e., sentences are fed to a machine learning model that assigns a label or class to each one. Developing such models on the basis of Pre-trained Transformer Language Models (PTLMs) is a natural choice, since PTLMs are the current state of the art for many NLP tasks, including text classification. Nevertheless, in this thesis we highlight two main problems with works that exploit PTLMs to tackle the Legal RRL task. The first is the lack of works that address how to better deal with the idiosyncrasies of legal texts and the typically small size and class imbalance of Legal RRL datasets; almost all related works simply employ the regular fine-tuning strategy to train their models.
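To make the baseline concrete, the sketch below shows the regular fine-tuning strategy mentioned above: a pre-trained encoder with a classification head is trained end to end on labeled sentences. It is a minimal sketch, not the exact setup of any work discussed in this thesis; the model checkpoint, the rhetorical role labels, the toy data, and the hyperparameters are illustrative assumptions.

```python
# Minimal sketch: regular fine-tuning of a PTLM for sentence-level
# rhetorical role classification. Checkpoint, labels, toy data, and
# hyperparameters are illustrative assumptions.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

ROLES = ["Facts", "Argument", "Ruling"]  # hypothetical rhetorical roles

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=len(ROLES)
)

# Toy training data: (sentences, role indices) batches.
train_batches = [
    (["The appellant filed the suit in 2015.",
      "Therefore, the appeal is dismissed."],
     [0, 2]),
]

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()
for sentences, labels in train_batches:
    batch = tokenizer(sentences, padding=True, truncation=True,
                      max_length=128, return_tensors="pt")
    batch["labels"] = torch.tensor(labels)
    loss = model(**batch).loss  # cross-entropy over rhetorical roles
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```

Note that each sentence is encoded in isolation here, which is precisely why this strategy cannot exploit the surrounding document context.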
The second problem is the underutilization of the intrinsic ability of PTLMs to exploit context, which hampers model performance. This thesis aims to advance the state of the art on the Legal RRL task by presenting three approaches devised to overcome these problems. The first approach relies on a data augmentation technique that generates synthetic sentence embeddings, thus increasing the amount of training data. The second approach exploits positional information by combining sentence embeddings with positional embeddings to enrich the training data. The third approach, called Dynamically-filled Contextualized Sentence Chunks (DFCSC), specifies a way to produce efficient sentence embeddings by better exploiting the encoding capabilities of PTLMs. The studies in this thesis show that the first two approaches have a limited impact on model performance. In contrast, models based on the DFCSC approach achieve remarkable results and are the best performers in the respective studies. We conclude that the DFCSC approach is a valuable contribution to the state of the art of the Legal RRL task.
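As an illustration of the second approach, the sketch below enriches PTLM sentence embeddings with an encoding of each sentence's position in the document. The fixed sinusoidal scheme, the embedding dimensions, and the use of concatenation are assumptions for illustration; the thesis may combine the two signals differently.

```python
# Minimal sketch of the second approach: combining sentence embeddings
# with positional embeddings. The sinusoidal scheme and concatenation
# are illustrative assumptions, not necessarily the thesis's formulation.
import numpy as np

def positional_embedding(index: int, dim: int = 16) -> np.ndarray:
    """Fixed sinusoidal encoding of a sentence's position in the document."""
    pos = np.zeros(dim)
    for i in range(0, dim, 2):
        angle = index / (10000 ** (i / dim))
        pos[i] = np.sin(angle)
        pos[i + 1] = np.cos(angle)
    return pos

def enrich(sentence_embeddings: np.ndarray) -> np.ndarray:
    """Concatenate each sentence embedding with its positional embedding."""
    enriched = [
        np.concatenate([emb, positional_embedding(i)])
        for i, emb in enumerate(sentence_embeddings)
    ]
    return np.stack(enriched)

# Usage: 5 sentences already encoded into 768-d embeddings by a PTLM
# (random placeholder values here).
doc_embeddings = np.random.rand(5, 768).astype(np.float32)
print(enrich(doc_embeddings).shape)  # (5, 784)
```

The enriched vectors are then fed to the sentence classifier, letting it use a sentence's location in the judgment (e.g., facts tend to appear early, rulings late) as an additional signal.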