Research Center for Digital Sustainability

Dataset Construction for Legal Multilingual Language Models

This project is available as a Seminar or Bachelor's project and can also be carried out as a group project.


Swiss court decisions are anonymized to protect the privacy of the people involved (parties, victims, etc.). Previous research [1] has shown that, in certain cases, companies involved in court decisions can be re-identified by linking the rulings with external data. Our project goes a step further by building an automated system for re-identifying the people involved in court rulings. This system can then be used to test the anonymization practice of Swiss courts. For more information regarding the overarching research project, please go here.

Since legal texts are often long, conventional transformer-based models are not well suited to this area. Various efficient transformer variants have been proposed [2]. However, none of these models has so far been pretrained on languages other than English. Additionally, models pretrained on legal data are rare and do not yet exist for the Swiss national languages. Pretraining such a model on legal data, i.e. an efficient multilingual transformer that can handle long text input, requires collecting many corpora.

List of possible data sources:

Research Questions

So far, no readily available collection of legal datasets usable for pretraining large language models exists.

RQ1: Is it possible to compile a readily available collection of legal datasets usable for pretraining large language models? 


  1. Identify promising data sources
  2. Analyze the HTML and scrape the documents from the websites (using libraries such as Scrapy or BeautifulSoup)
  3. Extract the text from the documents
  4. Evaluate the results (in the sense of a small descriptive analysis of the content)
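
Steps 3 and 4 above can be sketched as follows. This is only a minimal illustration, not the project's actual pipeline: it uses Python's standard-library html.parser as a dependency-free stand-in for Scrapy or BeautifulSoup, the two sample pages are made up for the example, and the names TextExtractor, extract_text, describe and SAMPLE_PAGES are hypothetical. Real scraping would first download the pages (step 2), which is omitted here.

```python
from collections import Counter
from html.parser import HTMLParser
import re


class TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> content."""

    SKIP_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style elements.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def extract_text(html):
    """Step 3: turn one HTML document into plain text."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def describe(texts):
    """Step 4: a small descriptive summary of the extracted corpus."""
    token_lists = [re.findall(r"\w+", text.lower()) for text in texts]
    lengths = [len(tokens) for tokens in token_lists]
    vocabulary = Counter(tok for tokens in token_lists for tok in tokens)
    return {
        "num_documents": len(texts),
        "total_tokens": sum(lengths),
        "mean_tokens": sum(lengths) / len(lengths) if lengths else 0.0,
        "vocabulary_size": len(vocabulary),
    }


# Made-up sample pages standing in for scraped court decisions (assumption).
SAMPLE_PAGES = [
    "<html><head><style>p {color: red}</style></head>"
    "<body><h1>Urteil</h1><p>Das Gericht weist die Beschwerde ab.</p></body></html>",
    "<html><body><script>var x = 1;</script>"
    "<p>Le tribunal rejette le recours.</p></body></html>",
]

texts = [extract_text(page) for page in SAMPLE_PAGES]
stats = describe(texts)
print(stats)
```

Whitespace-based token counts are a rough proxy only; for a pretraining corpus one would more likely report counts under the tokenizer of the target model, broken down by language and source.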


⬤⬤⬤⬤◯ Programming

⬤⬤◯◯◯ Experimentation

⬤◯◯◯◯ Literature


Good programming skills (preferably in Python).


Joel Niklaus


[1] Vokinger, K.N., Mühlematter, U.J., 2019. Re-Identifikation von Gerichtsurteilen durch «Linkage» von Daten(banken). Jusletter 27.
[2] Tay, Y., Dehghani, M., Bahri, D., Metzler, D., 2020. Efficient Transformers: A Survey. arXiv:2009.06732.