Fine-Tuning Embedding Models for Sustainability Analysis in Public Procurement

This project is available as a Master's Thesis.

Introduction

Public procurement is a key driver for advancing sustainability goals. By embedding sustainability criteria into procurement processes, governments and organizations can influence markets to favor environmentally friendly, socially responsible, and economically viable solutions. However, assessing the importance of sustainability in calls for tenders is a complex task. This project focuses on analyzing thousands of public procurement documents to determine the emphasis placed on sustainability-related aspects within the evaluation criteria for selecting the best offer.

To achieve this, the project will leverage sentence embeddings to represent textual data from procurement documents and sustainability guidelines. These embeddings will then be compared to measure semantic similarity. While generic pre-trained sentence transformers provide a starting point, we hypothesize that fine-tuning these models on domain-specific data will yield significantly better results. Additionally, a fine-tuned cross-encoder model will be developed for re-ranking candidate matches.
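The two-stage idea described above (embedding comparison first, re-ranking second) can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's implementation: the 3-dimensional vectors are hand-written stand-ins for real sentence embeddings, `toy_pair_scorer` is a hypothetical word-overlap placeholder for a fine-tuned cross-encoder, and all function names are made up for this example. If the sentence-transformers library were used, the vectors would presumably come from `SentenceTransformer.encode` and the pair scores from `CrossEncoder.predict`.

```python
import math
import re
from typing import Callable, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(
    query_vec: Sequence[float],
    sentence_vecs: Sequence[Sequence[float]],
    top_k: int,
) -> List[Tuple[int, float]]:
    """Stage 1: rank sentence indices by cosine similarity to the query vector."""
    scored = [(i, cosine_similarity(query_vec, vec)) for i, vec in enumerate(sentence_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


def rerank(
    query: str,
    sentences: Sequence[str],
    candidate_ids: Sequence[int],
    pair_scorer: Callable[[str, str], float],
) -> List[Tuple[int, float]]:
    """Stage 2: rescore each (query, sentence) pair jointly and sort again."""
    rescored = [(i, pair_scorer(query, sentences[i])) for i in candidate_ids]
    rescored.sort(key=lambda pair: pair[1], reverse=True)
    return rescored


def toy_pair_scorer(query: str, sentence: str) -> float:
    """Placeholder for a cross-encoder: Jaccard overlap of lowercase word sets."""
    q = set(re.findall(r"\w+", query.lower()))
    s = set(re.findall(r"\w+", sentence.lower()))
    return len(q & s) / len(q | s) if (q | s) else 0.0


# Illustrative guideline and tender sentences (invented for this sketch).
guideline = "Prefer products with low energy consumption"
sentences = [
    "The offer with the lowest energy consumption receives extra points.",
    "Bids must be submitted by 12 March.",
    "Recycled materials are weighted at 20 percent of the award criteria.",
]

# Hand-written vectors standing in for real sentence embeddings (assumption).
guideline_vec = [0.9, 0.1, 0.0]
sentence_vecs = [[0.8, 0.2, 0.1], [0.0, 0.1, 0.9], [0.6, 0.5, 0.1]]

candidates = retrieve(guideline_vec, sentence_vecs, top_k=2)
reranked = rerank(guideline, sentences, [i for i, _ in candidates], toy_pair_scorer)
```

The split into two stages reflects a common trade-off: comparing precomputed vectors is cheap enough to run over thousands of documents, while scoring each pair jointly is slower and is therefore applied only to the short candidate list.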

Creating a fine-tuning dataset poses a significant challenge, as labeled procurement data is scarce. This project will explore approaches to generating synthetic training data using generative AI and domain expertise. The resulting models will contribute to automated sustainability analysis in procurement and pave the way for further research in this area.

Research Questions

How well do generic pre-trained sentence transformers perform in identifying sustainability-related aspects in public procurement documents?
To what extent can fine-tuning improve the performance of sentence embedding models and cross-encoders in this domain?
What methods can be used to create high-quality fine-tuning datasets from unlabeled procurement data?

Steps

Literature Review
- Study existing approaches to sentence embeddings, cross-encoder models, and fine-tuning techniques.
- Review methods for synthetic data generation and domain adaptation.
Data Preparation
- Collect and preprocess public procurement documents.
- Develop a methodology for splitting documents into sentences and embedding them.
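As a starting point for the sentence-splitting step, a deliberately naive regex-based splitter might look like the sketch below. The function name, the boundary rule, and the `min_chars` threshold are assumptions made for illustration. Real procurement documents contain abbreviations, numbered clauses, and tables that this rule would split incorrectly, so the actual methodology would likely need something more robust.

```python
import re
from typing import List

# Naive boundary: '.', '!' or '?' followed by whitespace and then an
# uppercase letter or a digit. Abbreviations such as "e.g. 5" will be
# split incorrectly; this is a known limitation of the sketch.
_BOUNDARY = re.compile(r"(?<=[.!?])\s+(?=[A-Z0-9])")


def split_sentences(text: str, min_chars: int = 15) -> List[str]:
    """Split text into sentences and drop fragments shorter than min_chars."""
    normalized = re.sub(r"\s+", " ", text).strip()  # collapse line breaks and runs of spaces
    parts = _BOUNDARY.split(normalized)
    return [part.strip() for part in parts if len(part.strip()) >= min_chars]


# Invented sample text for illustration.
sample = (
    "Award criteria are weighted as follows.  Price counts for 60 percent.\n"
    "Environmental impact counts for 40 percent. See above."
)
sample_sentences = split_sentences(sample)
```

The length filter is one simple way to discard fragments such as cross-references ("See above.") that carry little meaning once embedded on their own.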
Synthetic Data Generation
- Use generative AI to create synthetic examples of sustainability-related and unrelated sentences.
- Develop rules or prompts to guide data generation based on domain knowledge.
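One possible shape for this step is sketched below: a prompt template that asks a generative model for related or unrelated sentences, and a helper that turns the returned sentences into (anchor, positive, negative) training triplets. The prompt wording, the `Triplet` structure, and both function names are hypothetical. The call to the generative model itself is intentionally left out, since the source does not name a model or API.

```python
import random
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class Triplet:
    anchor: str    # sustainability guideline text
    positive: str  # generated sentence related to the guideline
    negative: str  # generated sentence unrelated to the guideline


# Hypothetical prompt wording; in practice this would be refined with domain experts.
PROMPT_TEMPLATE = (
    "Write {n} sentences that could appear as award criteria in a call for "
    "tenders and that {relation} the following sustainability guideline:\n"
    "{guideline}"
)


def build_prompt(guideline: str, related: bool, n: int = 5) -> str:
    """Fill the template for either related or unrelated example sentences."""
    relation = "clearly reflect" if related else "have nothing to do with"
    return PROMPT_TEMPLATE.format(n=n, relation=relation, guideline=guideline)


def make_triplets(
    guideline: str,
    related: Sequence[str],
    unrelated: Sequence[str],
    seed: int = 0,
) -> List[Triplet]:
    """Pair every related sentence with one randomly chosen unrelated sentence."""
    if not unrelated:
        raise ValueError("at least one unrelated sentence is required")
    rng = random.Random(seed)  # fixed seed keeps the dataset reproducible
    return [Triplet(guideline, pos, rng.choice(list(unrelated))) for pos in related]


# Invented sentences standing in for generative-model output.
example_guideline = "Prefer suppliers that use renewable energy."
example_related = [
    "Bidders powered by renewable electricity receive additional points.",
    "The share of green energy in production is part of the evaluation.",
]
example_unrelated = [
    "Offers must be submitted in a sealed envelope.",
    "The contract runs for a period of four years.",
]
example_triplets = make_triplets(example_guideline, example_related, example_unrelated)
```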
Model Fine-Tuning
- Fine-tune a sentence embedding model on the generated dataset to optimize similarity matching.
- Fine-tune a cross-encoder model for re-ranking candidate matches to improve accuracy.
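The project description does not fix a training objective. As one example of what "optimizing similarity matching" could mean, a common choice for fine-tuning a sentence embedding model on (anchor, positive) pairs is an in-batch contrastive loss, available in the sentence-transformers library as `MultipleNegativesRankingLoss`. The notation below is an assumption introduced for this sketch: $u_i$ is the embedding of the $i$-th anchor, $v_i$ the embedding of its matching sentence, $B$ the batch size, and $s$ a fixed scale factor.

```latex
\mathcal{L}
  = -\frac{1}{B} \sum_{i=1}^{B}
    \log \frac{\exp\bigl(s \cdot \cos(u_i, v_i)\bigr)}
              {\sum_{j=1}^{B} \exp\bigl(s \cdot \cos(u_i, v_j)\bigr)}
```

Each anchor is pushed toward its own positive and away from the positives of the other anchors in the batch, which serve as negatives without needing to be labeled explicitly.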
Evaluation
- Compare the performance of the fine-tuned models against generic pre-trained models using metrics such as semantic similarity scores and ranking-quality measures (for example, mean reciprocal rank or nDCG).
- Conduct case studies on specific procurement documents to validate real-world applicability.
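To make "ranking quality" concrete, the sketch below computes two widely used ranking metrics, mean reciprocal rank (MRR) and nDCG@k with binary relevance. The choice of these two metrics is an assumption, since the project description does not name specific ones. The rankings in the example are toy values invented to exercise the functions and are not results of any model.

```python
import math
from typing import List, Sequence, Set


def reciprocal_rank(ranked_ids: Sequence[int], relevant: Set[int]) -> float:
    """1 / rank of the first relevant item, or 0.0 if none is retrieved."""
    for rank, item_id in enumerate(ranked_ids, start=1):
        if item_id in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(
    rankings: Sequence[Sequence[int]],
    relevants: Sequence[Set[int]],
) -> float:
    """Average reciprocal rank over all queries (0.0 for an empty input)."""
    pairs = list(zip(rankings, relevants))
    if not pairs:
        return 0.0
    return sum(reciprocal_rank(r, rel) for r, rel in pairs) / len(pairs)


def ndcg_at_k(ranked_ids: Sequence[int], relevant: Set[int], k: int) -> float:
    """nDCG@k with binary relevance: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, item_id in enumerate(ranked_ids[:k], start=1)
        if item_id in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


# Toy data: for each query, the set of sentence ids judged relevant and the
# order in which two hypothetical rankers returned the candidate ids.
relevant_sets: List[Set[int]] = [{1}, {2}]
ranker_a = [[3, 1, 2], [2, 0, 1]]
ranker_b = [[1, 3, 2], [2, 0, 1]]

mrr_a = mean_reciprocal_rank(ranker_a, relevant_sets)
mrr_b = mean_reciprocal_rank(ranker_b, relevant_sets)
ndcg_a_first_query = ndcg_at_k(ranker_a[0], relevant_sets[0], k=3)
```

MRR only looks at the first relevant hit per query, while nDCG@k also rewards placing several relevant sentences near the top, so reporting both gives a fuller picture when a guideline matches more than one sentence in a document.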
Analysis and Reporting
- Analyze the strengths and limitations of the fine-tuned models.
- Document the methodology, results, and potential areas for improvement.

Activities

Programming: 4/5
Experimentation: 5/5
Literature Review: 3/5

Prerequisites

Strong programming skills (preferably in Python).
Experience with deep learning frameworks and libraries (e.g., PyTorch, Hugging Face Transformers).
Familiarity with natural language processing (NLP) techniques and tools.
Interest in sustainability and public procurement processes.

Contact

Luca Rolshoven

Digital Sustainability Group