Ensuring that updates to the data store meet data quality constraints is an important part of the ALIGNED methodology.
But how do we do this for large datasets? Dr Gavin Mendel-Gleason, of Trinity College Dublin, has been working on this recently, and will be telling us a little about this today.
Before each transaction is committed to the dataset, we need to ensure that applying it will not violate the data quality constraints; that is, we need to check that the dataset, with the changes applied, still satisfies them. The simplest way to do this is to copy the entire dataset, apply the changes, and test the resulting version for compliance. If it passes, we can go ahead and commit the changes. This approach makes sense when a transaction touches a large fraction of the triples, but if the dataset is large and the change to be tested is small, it is overkill.
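To make the naive approach concrete, here is a minimal sketch in Python. It models the RDF store as a plain set of (subject, predicate, object) tuples rather than a real triple store, and the constraint (each subject has at most one name) is a made-up example, not one of the actual ALIGNED constraints.

```python
def violates_constraints(triples):
    """Toy constraint: any subject with a :name triple has exactly one."""
    names = {}
    for s, p, o in triples:
        if p == ":name":
            names.setdefault(s, set()).add(o)
    return any(len(values) != 1 for values in names.values())

def commit_naive(dataset, additions, deletions):
    """Copy the whole dataset, apply the change, test, then commit."""
    # The expensive step: a full copy of the dataset for every transaction.
    candidate = (set(dataset) - set(deletions)) | set(additions)
    if violates_constraints(candidate):
        raise ValueError("change rejected: constraint violation")
    return candidate  # the new committed dataset
```

The cost of the copy is proportional to the size of the whole dataset, not the size of the change, which is exactly the problem the overlay technique below addresses.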
A more elegant solution is to overlay the changes on top of the dataset and then test the resulting combination. Whenever a change is to be committed to the dataset, the transaction management component creates two small graphs. One contains the changes to be made, while the other contains the negation of the data that will be changed in the original graph. This second graph is used to mask the original data, so that the constraint checker ignores it. The constraint checker can then run against these three graphs – the original graph, the changes graph, and the mask – and determine if the criteria are still satisfied. If this check is successful, the changes can be safely committed, knowing that they will not violate the data quality constraints.
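The overlay idea can be sketched as follows, again over plain Python sets of triples. Instead of copying the dataset, we wrap it in a read-only view: the mask graph hides the triples the change deletes, and the changes graph contributes the new ones, so the constraint checker sees the would-be result without the original ever being duplicated. The class and function names here are illustrative, not the actual API of the ALIGNED transaction management component.

```python
class OverlayView:
    """Read-only view: the original graph with the mask hidden
    and the changes graph layered on top."""
    def __init__(self, original, changes, mask):
        self.original = original  # large base graph (never copied)
        self.changes = changes    # small graph of triples to add
        self.mask = mask          # small graph negating removed triples

    def __iter__(self):
        for triple in self.original:
            if triple not in self.mask:  # checker ignores masked data
                yield triple
        yield from self.changes

def commit_overlay(dataset, additions, deletions, violates):
    """Check the overlay of all three graphs; commit only if it passes."""
    view = OverlayView(dataset, set(additions), set(deletions))
    if violates(view):
        raise ValueError("change rejected: constraint violation")
    # Safe to commit: apply the change in place, no full copy made.
    dataset -= set(deletions)
    dataset |= set(additions)
    return dataset
```

The work done per transaction is now proportional to the size of the dataset only during the constraint scan itself; the two auxiliary graphs stay as small as the change being tested.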
For a more detailed description of what the transaction management component does, see Dr Mendel-Gleason’s write-up on his personal blog.