Case #1: Training a model for date extraction

Task

Training a model for finding dates in invoices.

In this example we have a batch with around 266 invoices. This is a very small amount for a real training attempt (the more documents, the better the training), but it is enough for the example.

This is also a very manual example, where we create the annotations by hand with the help of the e-tags engine. There will be other ways of converting already verified batch data into NER tags.

1. Preparing valid data for machine learning training

We have processed the whole batch with the e-tags engine, which gives us different suggestions for creating our NER annotations:

In this particular case, we only want to annotate dates for the invoices, so we will be adding only the "dates" values provided by the e-tags engine:

ML tools > NER > View E-Tags suggestions > Convert to a valid annotation:

You can add the annotation by clicking the button in the column or the add button on the image.
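To make it clearer what a date annotation means for token classification, here is a minimal sketch of the common BIO tagging scheme. The function name, the label names (`B-DATE`, `I-DATE`) and the sample tokens are assumptions made for illustration; the annotation format ChronoScan stores internally may differ.

```python
def bio_labels(tokens, date_spans, label="DATE"):
    """Return one BIO label per token.

    date_spans holds (start, end) token indices, end exclusive,
    marking the tokens that were accepted as a date annotation.
    """
    labels = ["O"] * len(tokens)  # "O" = outside any entity
    for start, end in date_spans:
        labels[start] = f"B-{label}"  # first token of the date
        for i in range(start + 1, end):
            labels[i] = f"I-{label}"  # remaining tokens of the date
    return labels


# Hypothetical tokens read from one invoice line.
tokens = ["Invoice", "date:", "12", "March", "2021", "Total", "100.00"]
labels = bio_labels(tokens, [(2, 5)])

for token, tag in zip(tokens, labels):
    print(f"{token:10} {tag}")
```

Every token outside a date gets the label `O`, so the model also learns what is not a date.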

2. Exporting the batch to a ML dataset

Once we have all the "dates" annotated we proceed to export the whole annotated batch into a valid ML dataset for training.

To do so, we go to the Scan/Input tab and click Export, then we add the output module "ML-Training" and configure it for token classification:

Click OK > Export and wait until the dataset is created.
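As an illustration of what a token classification dataset typically contains, the sketch below builds one record with tokens, bounding boxes and labels. The field names, the document id and the page size are hypothetical and are not taken from the ML-Training module; only the 0-1000 box scale is the convention that LayoutLM-style models expect.

```python
import json


def normalize_box(box, page_width, page_height):
    """Scale a pixel box (x0, y0, x1, y1) to the 0-1000 range."""
    x0, y0, x1, y1 = box
    return [
        int(1000 * x0 / page_width),
        int(1000 * y0 / page_height),
        int(1000 * x1 / page_width),
        int(1000 * y1 / page_height),
    ]


# Hypothetical page size in pixels.
PAGE_WIDTH, PAGE_HEIGHT = 1240, 1754

# Hypothetical record layout: one entry per token in each list.
record = {
    "id": "invoice_0001",
    "tokens": ["Date:", "12/03/2021"],
    "bboxes": [
        normalize_box((100, 50, 180, 70), PAGE_WIDTH, PAGE_HEIGHT),
        normalize_box((190, 50, 300, 70), PAGE_WIDTH, PAGE_HEIGHT),
    ],
    "ner_tags": ["O", "B-DATE"],
}

# One JSON object per line is a common way to store such records.
line = json.dumps(record)
print(line)
```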

3. Configure our training

Now we can go back to our batch > ML Tools > and click on the training button:

As we can see, we are going to fine-tune the base model layoutlmv3-base for token classification, and the resulting model will be created under the DATES_MODEL training folder.

Now we can click on training model to preview or choose our training preferences:
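For readers who want to see what fine-tuning a base model for token classification looks like in code, here is a rough sketch using the Hugging Face transformers library. It is an assumption that the base model corresponds to the public `microsoft/layoutlmv3-base` checkpoint, and the label names are illustrative; this is not a description of what ChronoScan runs internally.

```python
# Label set for a single entity type; the names are illustrative.
LABELS = ["O", "B-DATE", "I-DATE"]
id2label = dict(enumerate(LABELS))
label2id = {label: idx for idx, label in id2label.items()}


def load_base_model(checkpoint="microsoft/layoutlmv3-base"):
    """Load the base checkpoint with a new token classification head.

    Needs the transformers and torch packages and downloads the
    checkpoint on first use, so the import stays inside the function.
    """
    from transformers import LayoutLMv3ForTokenClassification

    return LayoutLMv3ForTokenClassification.from_pretrained(
        checkpoint,
        num_labels=len(LABELS),
        id2label=id2label,
        label2id=label2id,
    )
```

Calling `load_base_model()` returns the pretrained network with an untrained classification layer on top; the training then adjusts the weights so that this layer predicts the date labels.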

These are the default training parameters. If you click on a parameter's name, a tooltip appears explaining what it is for.

It is up to the user to choose the parameters they want for their training.

*Training can take a long time, depending on the available hardware and the training configuration.
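To get a feeling for how the parameters affect training time, the number of optimization steps can be estimated from the dataset size. The batch size and epoch count below are example values chosen for illustration, not the ChronoScan defaults, and the batch is assumed to hold exactly 266 documents.

```python
import math


def total_training_steps(num_documents, batch_size, epochs):
    """Estimate how many optimization steps a training run performs."""
    steps_per_epoch = math.ceil(num_documents / batch_size)
    return steps_per_epoch * epochs


# Example values only: 266 invoices, batches of 4, 10 epochs.
steps = total_training_steps(num_documents=266, batch_size=4, epochs=10)
print(steps)
```

Doubling the number of epochs doubles the number of steps, so on the same hardware the training takes roughly twice as long.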

4. Activating our trained model

Once the training is done, the best model is saved under the training folder. To infer from it, we have to activate it so that it becomes a processor model.

5. Running inference with our model

Now we have an activated processor model, called DATES_MODEL, and we can infer from it. We can create another batch or add a document to the current one, preferably one that has not been used in the training, for a better evaluation.
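One simple way to keep unseen documents available for evaluation is to set some aside before training. The sketch below shows the idea; the document ids, the 10% fraction and the fixed seed are assumptions for the example and are not something ChronoScan does for you.

```python
import random


def split_documents(document_ids, eval_fraction=0.1, seed=42):
    """Shuffle the ids and return (training ids, evaluation ids)."""
    ids = list(document_ids)
    random.Random(seed).shuffle(ids)  # fixed seed keeps the split repeatable
    eval_count = int(len(ids) * eval_fraction)
    return ids[eval_count:], ids[:eval_count]


# Hypothetical ids for a batch of 266 invoices.
document_ids = [f"invoice_{n:04d}" for n in range(1, 267)]
train_ids, eval_ids = split_documents(document_ids)
print(len(train_ids), len(eval_ids))
```

Documents in `eval_ids` are never shown to the model during training, so the results on them give a more honest picture of how it behaves on new invoices.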

Click Reprocess/Infer and make sure that the Execute ML-Tags inference option is enabled and that the field Date is mapped to the field/label of the processor model:

After processing we can check the performance of the model for this particular document:
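If you want a number rather than a visual check, the extracted dates can be compared with the expected ones using precision, recall and F1. This is a generic sketch with made-up values; it compares exact strings and is not a ChronoScan report.

```python
def precision_recall_f1(predicted, expected):
    """Compare predicted and expected values by exact match."""
    predicted, expected = set(predicted), set(expected)
    true_positives = len(predicted & expected)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(expected) if expected else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up result: the model found both dates plus one wrong value.
predicted = ["12/03/2021", "01/04/2021", "100.00"]
expected = ["12/03/2021", "01/04/2021"]

precision, recall, f1 = precision_recall_f1(predicted, expected)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

A low precision means the model marks things as dates that are not dates; a low recall means it misses real dates.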

Conclusion

This way we have fine-tuned the base model to look for dates in a specific ChronoScan batch.

Every training is different: the quality of the provided data, the training time and the correct configuration all determine how good the results are.
