writing · 2023-06-15

Reading documents in a clean room

Doing OCR and entity extraction on health records inside a Trusted Research Environment, where the data can't leave and most of your usual tools can't come in.

At DFKI I worked on extracting structured information from sensitive documents, the kind of health records you are never allowed to copy onto your own machine. The work happened inside a Trusted Research Environment: the data stays put, the analysis comes to it, and nothing leaves except aggregate results someone has checked.

That constraint reshapes the whole problem.

The pipeline is ordinary, the setting is not

The computer-vision part is familiar. Run OCR (I leaned on PaddleOCR, EasyOCR, and PyTesseract depending on the document), then named-entity recognition to pull out the fields that matter. What changes is everything around it. You cannot paste a tricky page into a hosted service to see what it says. You cannot pull a fresh model over the internet on a whim. You debug with the data you are allowed to see, which is often none of it directly.
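
To make the shape of that concrete, here is a minimal sketch of a two-stage pipeline, not the code I actually ran. Everything named here (`process_page`, `PageDiagnostics`, `regex_ner`, `fake_ocr`) is hypothetical. The OCR and NER stages are passed in as plain functions, so the backend can be whichever library is already installed inside the environment, and the only thing meant for human eyes is a set of content-free counts.

```python
import re
from dataclasses import dataclass
from typing import Callable, List, Tuple

# An OCR backend maps raw page bytes to text; an NER backend maps text to
# (label, value) pairs. Both are injected, so the pipeline itself needs no
# network access and is not tied to one library.
OcrBackend = Callable[[bytes], str]
NerBackend = Callable[[str], List[Tuple[str, str]]]


@dataclass(frozen=True)
class PageDiagnostics:
    """Content-free statistics that are safe to look at while debugging."""
    n_chars: int
    n_entities: int
    labels: Tuple[str, ...]


def regex_ner(text: str) -> List[Tuple[str, str]]:
    """Toy stand-in for NER: finds ISO dates only (assumption, not a real model)."""
    return [("DATE", match) for match in re.findall(r"\b\d{4}-\d{2}-\d{2}\b", text)]


def fake_ocr(page: bytes) -> str:
    """Toy stand-in for OCR: treats the bytes as UTF-8 text.

    With PyTesseract available locally, a real backend could wrap
    pytesseract.image_to_string(image) instead.
    """
    return page.decode("utf-8")


def process_page(
    page: bytes, ocr: OcrBackend, ner: NerBackend
) -> Tuple[List[Tuple[str, str]], PageDiagnostics]:
    """Run OCR, then NER, and return the entities plus content-free diagnostics."""
    text = ocr(page)
    entities = ner(text)
    diagnostics = PageDiagnostics(
        n_chars=len(text),
        n_entities=len(entities),
        labels=tuple(sorted({label for label, _ in entities})),
    )
    return entities, diagnostics
```

The entities stay inside the environment. The diagnostics carry only lengths, counts, and label names, which is the kind of signal you can still debug with when the page itself is off limits.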

Build for the audit, not just the result

In a normal project the output is the deliverable. Here the deliverable is the output plus a defensible account of how it was produced. A secure ML architecture for this kind of data is less about a clever model and more about provenance: which version of which model touched which record, and who could see what at each step.
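
One way to picture that kind of provenance is an append-only log in which every entry names the record, the step, the model and its version, and the operator, and is chained to the previous entry by a hash so later edits are detectable. This is a sketch under my own assumptions, not the architecture we built: `ProvenanceEntry`, `ProvenanceLog`, and all field names are hypothetical, and a real deployment would also need access control and secure storage.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from typing import List


@dataclass(frozen=True)
class ProvenanceEntry:
    record_id: str       # pseudonymous identifier, never the record content
    step: str            # e.g. "ocr" or "ner"
    model: str
    model_version: str
    operator: str
    prev_hash: str       # hash of the previous entry, linking the chain
    entry_hash: str = ""


def _digest(fields: dict) -> str:
    """SHA-256 over a canonical JSON encoding of the entry fields."""
    payload = json.dumps(fields, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


class ProvenanceLog:
    GENESIS = "0" * 64  # placeholder hash for the first entry

    def __init__(self) -> None:
        self.entries: List[ProvenanceEntry] = []

    def append(self, record_id: str, step: str, model: str,
               model_version: str, operator: str) -> ProvenanceEntry:
        """Add one entry, chained to the hash of the entry before it."""
        prev = self.entries[-1].entry_hash if self.entries else self.GENESIS
        fields = {
            "record_id": record_id,
            "step": step,
            "model": model,
            "model_version": model_version,
            "operator": operator,
            "prev_hash": prev,
        }
        entry = ProvenanceEntry(**fields, entry_hash=_digest(fields))
        self.entries.append(entry)
        return entry

    def verify(self) -> bool:
        """Recompute every hash and check each link; False if anything changed."""
        prev = self.GENESIS
        for entry in self.entries:
            fields = asdict(entry)
            claimed = fields.pop("entry_hash")
            if entry.prev_hash != prev or _digest(fields) != claimed:
                return False
            prev = claimed
        return True
```

Calling `append` once per step and per record answers "which version of which model touched which record", and `verify` lets a reviewer confirm the account has not been rewritten after the fact.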

It taught me a habit I have kept. Design as if someone careful will one day ask you to prove every step, because in the settings that matter, they will.

#computer-vision #ocr #ner #security
