project · 2023 · Research / Applied NLP

Knowledge Extraction from Investigative Documents

Key-information extraction from multilingual investigative documents (names, organisations, addresses, dates, and amounts), benchmarking NER approaches and fine-tuning BERT and spaCy.

BERT · spaCy · TensorFlow · NLTK · FastAPI · PostgreSQL

This project tackled key-information extraction from investigative documents: pulling structured fields like person names, organisation names, addresses, dates, and monetary amounts out of messy, real-world text.
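As a rough illustration of what "structured fields" means here, the sketch below groups labelled entity mentions into one record per document. It is a minimal sketch, not the project's actual schema: `ExtractedRecord`, `LABEL_TO_FIELD`, `build_record`, and the label names are all hypothetical, chosen only to mirror the five field types listed above.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class ExtractedRecord:
    """Hypothetical container for the fields pulled out of one document."""
    persons: List[str] = field(default_factory=list)
    organisations: List[str] = field(default_factory=list)
    addresses: List[str] = field(default_factory=list)
    dates: List[str] = field(default_factory=list)
    amounts: List[str] = field(default_factory=list)


# Assumed label names; a real NER model may use a different tag set.
LABEL_TO_FIELD = {
    "PERSON": "persons",
    "ORG": "organisations",
    "ADDRESS": "addresses",
    "DATE": "dates",
    "AMOUNT": "amounts",
}


def build_record(entities: List[Tuple[str, str]]) -> ExtractedRecord:
    """Group (label, text) pairs into a record, skipping unknown labels."""
    record = ExtractedRecord()
    for label, text in entities:
        field_name = LABEL_TO_FIELD.get(label)
        if field_name is None:
            continue  # label outside the five target fields
        cleaned = " ".join(text.split())  # collapse stray whitespace
        values = getattr(record, field_name)
        if cleaned and cleaned not in values:
            values.append(cleaned)  # keep first occurrence, drop repeats
    return record
```

Calling `build_record` with the `(label, text)` pairs produced by any NER model gives one flat record per document, which is the kind of shape a downstream service can store or return.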

Finding the method that holds up across languages

Investigative documents come in many languages, and a model that works on one often falls apart on another. I benchmarked several named-entity recognition (NER) approaches against the same performance metrics to find an efficient method that generalised across languages, with fine-tuned BERT and spaCy models at the core.
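Comparing approaches only works if every model is scored the same way. The sketch below shows one common choice, exact-match entity-level precision, recall, and F1, computed per label and micro-averaged. It is an assumption that the benchmark used this particular metric; the function names, the `(start, end, label)` span format, and the label names are illustrative rather than taken from the project.

```python
from collections import Counter
from typing import Dict, List, Tuple

Span = Tuple[int, int, str]  # (start offset, end offset, label)


def _prf(tp: int, fp: int, fn: int) -> Dict[str, float]:
    """Precision, recall, and F1 from raw counts, guarding against 0/0."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"precision": precision, "recall": recall, "f1": f1}


def score_entities(
    gold_docs: List[List[Span]],
    pred_docs: List[List[Span]],
) -> Dict[str, Dict[str, float]]:
    """Score predictions against gold spans using exact span+label matches."""
    tp: Counter = Counter()
    fp: Counter = Counter()
    fn: Counter = Counter()
    for gold, pred in zip(gold_docs, pred_docs):
        gold_set, pred_set = set(gold), set(pred)
        for _, _, label in gold_set & pred_set:
            tp[label] += 1  # same boundaries and same label
        for _, _, label in pred_set - gold_set:
            fp[label] += 1  # predicted but not in gold
        for _, _, label in gold_set - pred_set:
            fn[label] += 1  # in gold but missed
    scores = {
        label: _prf(tp[label], fp[label], fn[label])
        for label in sorted(set(tp) | set(fp) | set(fn))
    }
    scores["micro"] = _prf(
        sum(tp.values()), sum(fp.values()), sum(fn.values())
    )
    return scores
```

Running the same function over each language's test set, for each candidate model, gives directly comparable numbers. Exact matching is strict: a prediction that is off by one character counts as both a false positive and a false negative, which is worth keeping in mind for long fields such as addresses.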

Stack

TensorFlow and Transformers for the deep-learning pieces, spaCy and NLTK for the linguistic plumbing, and a FastAPI service over PostgreSQL to make the extraction usable downstream.
