The rapid growth in newly generated data makes that data increasingly hard to manage. Document management is as crucial for private enterprises as it is for the public sector. The amount of data that governments, both central and local, need to store about each citizen keeps growing.
This growth strains traditional document management solutions, which cannot keep up with the volume. At the same time, public-sector data needs to be more widely available, so that citizens don’t have to bring the same document to two different offices and duplicated effort is avoided.
Our client from the Cordoba municipality in Spain asked us to create a document management system that would optimize business processes for both its staff and its citizens.
We implemented the solution in three areas of document management:
- Classification and processing
- Data extraction
- Advanced security
When machines process documents, unstructured data is the hardest part. With digital text documents such as .docx files, the challenge for the machine is to understand what the text means. With unstructured data such as PDFs, audio files, and printed or handwritten text, it first has to find where the text actually is and what it says.
That is where OCR (Optical Character Recognition) comes in. It works in two stages. The first is text detection, which locates the regions of an image that contain text. This localization is crucial for the second stage, text recognition, which extracts the characters from those regions. Used together, the two stages make it possible to extract text from an image.
Our next step was therefore to teach the machine to recognize text inside images and convert it into an electronic form.
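The two OCR stages can be illustrated with a minimal Python sketch. It is not the system described here: it assumes the image is already a binary grid (1 for ink, 0 for background), uses a simple row-projection heuristic for detection, and leaves recognition to a hypothetical `recognize` callable that a real OCR engine would provide. All function names are illustrative.

```python
from typing import Callable, List, Tuple

Image = List[List[int]]           # binary grid: 1 = ink, 0 = background
Box = Tuple[int, int, int, int]   # (left, top, right, bottom), inclusive


def detect_text_rows(image: Image) -> List[Tuple[int, int]]:
    """Stage 1a: find bands of consecutive rows that contain any ink."""
    bands = []
    start = None
    for y, row in enumerate(image):
        has_ink = any(row)
        if has_ink and start is None:
            start = y                      # a new band begins
        elif not has_ink and start is not None:
            bands.append((start, y - 1))   # the band ended on the previous row
            start = None
    if start is not None:                  # band runs to the last row
        bands.append((start, len(image) - 1))
    return bands


def bounding_box(image: Image, top: int, bottom: int) -> Box:
    """Stage 1b: tighten a row band to the columns that contain ink."""
    cols = [
        x for x in range(len(image[0]))
        if any(image[y][x] for y in range(top, bottom + 1))
    ]
    return (min(cols), top, max(cols), bottom)


def detect_text_boxes(image: Image) -> List[Box]:
    """Text detection: one bounding box per detected text band."""
    return [bounding_box(image, t, b) for t, b in detect_text_rows(image)]


def run_ocr(image: Image, recognize: Callable[[Image, Box], str]) -> List[Tuple[Box, str]]:
    """Text recognition: hand each detected box to a recognizer (assumed)."""
    return [(box, recognize(image, box)) for box in detect_text_boxes(image)]
```

Keeping detection and recognition separate means the recognizer only ever sees regions that are likely to contain text, which is why the localization step matters.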
A big problem with document organization is that communication takes many forms: emails, phone calls, text messages, letters, and so on. None of it arrives as a formatted database, and it takes people a lot of time to organize all this information and pull out the knowledge they need. AI can pull out that information within seconds. The data extraction technique is called named entity extraction. Training the model consisted of three main steps:
- Dataset preparation: First, we had to identify, integrate, and prepare the data for learning. We created a dataset of text documents, loaded it, and performed basic pre-processing. The dataset was then split into training sets (on which the machine learns) and validation sets (used to evaluate whether the learning process was successful).
- Feature engineering: The raw dataset was transformed into flat features that were later used in the machine learning model.
- Model training: The machine learning model was trained on a labeled dataset. The validation sets were then used to check how accurately the model classified text.
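The three steps above can be sketched end to end in plain Python. This is a toy illustration under stated assumptions, not the model used in the project: the features are simple word counts, the classifier is a small multinomial Naive Bayes with add-one smoothing, and every name (`split_dataset`, `extract_features`, `NaiveBayes`, `accuracy`) is hypothetical.

```python
import math
import random
from collections import Counter, defaultdict


def split_dataset(samples, validation_ratio=0.25, seed=0):
    """Dataset preparation: shuffle, then split into train and validation sets."""
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)   # seeded for reproducibility
    cut = int(len(shuffled) * (1 - validation_ratio))
    return shuffled[:cut], shuffled[cut:]


def extract_features(text):
    """Feature engineering: turn raw text into flat word-count features."""
    return Counter(text.lower().split())


class NaiveBayes:
    """Model training: multinomial Naive Bayes with add-one smoothing."""

    def fit(self, samples):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.label_counts = Counter()            # label -> number of documents
        self.vocab = set()
        for text, label in samples:
            feats = extract_features(text)
            self.word_counts[label].update(feats)
            self.label_counts[label] += 1
            self.vocab.update(feats)
        return self

    def predict(self, text):
        feats = extract_features(text)
        total_docs = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.label_counts.items():
            # log prior + sum of smoothed log likelihoods
            score = math.log(n_docs / total_docs)
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            for word, count in feats.items():
                score += count * math.log((self.word_counts[label][word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def accuracy(model, samples):
    """Validation: fraction of samples whose label the model predicts correctly."""
    return sum(model.predict(text) == label for text, label in samples) / len(samples)
```

A typical run would call `split_dataset` on a list of `(text, label)` pairs, fit `NaiveBayes` on the training part, and report `accuracy` on the validation part. Evaluating on held-out data is what shows whether the model has learned something general rather than memorized its training documents.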
Our AI-powered system can also enhance security and protect citizens’ data.
It can detect personally identifiable information (PII), and the automatic classification and processing allow all documents to be kept in secured locations.
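As a rough illustration of PII detection, the sketch below uses regular expressions to find and redact two kinds of identifiers in extracted text. It is an assumption-laden example rather than the system's actual method: the patterns cover only email addresses and Spanish DNI numbers (eight digits plus a check letter), and the function names are hypothetical. A production system would likely combine such rules with a trained model.

```python
import re
from typing import Dict, List

# Check letters for the Spanish DNI: the letter is indexed by number mod 23.
DNI_LETTERS = "TRWAGMYFPDXBNJZSQVHLCKE"

# Deliberately simple patterns, for illustration only.
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")
DNI_RE = re.compile(r"\b(\d{8})([A-Za-z])\b")


def is_valid_dni(number: str, letter: str) -> bool:
    """Verify the DNI check letter so random 8-digit codes are not flagged."""
    return DNI_LETTERS[int(number) % 23] == letter.upper()


def find_pii(text: str) -> Dict[str, List[str]]:
    """Return the emails and valid DNI numbers found in the text."""
    emails = EMAIL_RE.findall(text)
    dnis = [
        m.group(0)
        for m in DNI_RE.finditer(text)
        if is_valid_dni(m.group(1), m.group(2))
    ]
    return {"email": emails, "dni": dnis}


def redact(text: str) -> str:
    """Replace detected PII with placeholders; leave everything else intact."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    return DNI_RE.sub(
        lambda m: "[DNI]" if is_valid_dni(m.group(1), m.group(2)) else m.group(0),
        text,
    )
```

Validating the check letter reduces false positives: a reference code that merely looks like a DNI is left untouched, while a genuine one is flagged and can be redacted before the document is stored or shared.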