project · 2023 · Research / Applied NLP
Knowledge Extraction from Investigative Documents
Key-information extraction from multilingual investigative documents (names, organisations, addresses, dates, and amounts), benchmarking NER approaches and fine-tuning BERT and spaCy.
This project tackled key-information extraction from investigative documents: pulling structured fields like person names, organisation names, addresses, dates, and monetary amounts out of messy, real-world text.
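As a rough illustration of what "structured fields" could mean here, the sketch below groups labelled spans into one record per document. The label names (`PERSON`, `ORG`, `ADDRESS`, `DATE`, `MONEY`), the `ExtractedRecord` type, and the `build_record` helper are all assumptions made for this example, not the project's actual schema.

```python
from dataclasses import dataclass, field
from typing import Iterable, List, Tuple


@dataclass
class ExtractedRecord:
    """Hypothetical container for the key fields pulled from one document."""
    persons: List[str] = field(default_factory=list)
    organisations: List[str] = field(default_factory=list)
    addresses: List[str] = field(default_factory=list)
    dates: List[str] = field(default_factory=list)
    amounts: List[str] = field(default_factory=list)


# Assumed mapping from NER labels to record fields; real label sets vary by model.
LABEL_TO_FIELD = {
    "PERSON": "persons",
    "ORG": "organisations",
    "ADDRESS": "addresses",
    "DATE": "dates",
    "MONEY": "amounts",
}


def build_record(entities: Iterable[Tuple[str, str]]) -> ExtractedRecord:
    """Group (label, text) pairs into a record, skipping unknown labels."""
    record = ExtractedRecord()
    for label, text in entities:
        field_name = LABEL_TO_FIELD.get(label)
        if field_name is None:
            continue  # label outside the fields of interest
        values = getattr(record, field_name)
        cleaned = " ".join(text.split())  # collapse stray whitespace
        if cleaned and cleaned not in values:
            values.append(cleaned)  # keep first occurrence, drop duplicates
    return record


# Made-up entities standing in for the output of an NER model.
record = build_record([
    ("PERSON", "Jane  Doe"),
    ("PERSON", "Jane Doe"),
    ("MONEY", "EUR 12,500"),
    ("GPE", "Berlin"),
])
```

Whitespace normalisation and de-duplication are the only clean-up steps shown. Real investigative text would likely need more, such as date and currency normalisation.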
Finding the method that holds up across languages
Investigative documents come in many languages, and a model that works on one often falls apart on another. I benchmarked several named-entity recognition approaches against the same performance metrics to find an efficient method that generalised across languages. Fine-tuned BERT and spaCy models were at the core of the comparison.
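One way to keep such a comparison consistent is to score every approach with the same entity-level precision, recall, and F1, reported per language. The sketch below assumes strict matching (a predicted span counts only if its offsets and label both match the gold span exactly). The function names and the sample data are invented for illustration and are not the project's actual evaluation code or results.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

# (start_char, end_char, label); strict matching requires all three to agree.
Span = Tuple[int, int, str]


def prf(gold: Iterable[Set[Span]], pred: Iterable[Set[Span]]) -> Dict[str, float]:
    """Micro-averaged precision, recall, and F1 over a collection of documents."""
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        tp += len(g & p)   # spans predicted and present in gold
        fp += len(p - g)   # spans predicted but absent from gold
        fn += len(g - p)   # gold spans the model missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


def prf_by_language(
    docs: Iterable[Tuple[str, Set[Span], Set[Span]]]
) -> Dict[str, Dict[str, float]]:
    """Score documents grouped by language; each item is (lang, gold, predicted)."""
    gold: Dict[str, List[Set[Span]]] = defaultdict(list)
    pred: Dict[str, List[Set[Span]]] = defaultdict(list)
    for lang, g, p in docs:
        gold[lang].append(g)
        pred[lang].append(p)
    return {lang: prf(gold[lang], pred[lang]) for lang in gold}


# Made-up example: one English and one German document.
example_docs = [
    ("en", {(0, 8, "PERSON"), (20, 30, "ORG")}, {(0, 8, "PERSON"), (20, 28, "ORG")}),
    ("de", {(5, 15, "DATE")}, {(5, 15, "DATE")}),
]
scores = prf_by_language(example_docs)
```

Per-language scores make it visible when a model that does well on one language degrades on another, which a single pooled score would hide.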
Stack
TensorFlow and Transformers for the deep-learning pieces, spaCy and NLTK for the linguistic plumbing, and a FastAPI service over PostgreSQL to make the extraction usable downstream.