When populating projects that contain a large number of documents, the same file may be uploaded more than once, leading to duplicate files within the system. This article describes how duplicate files are managed in Opus 2 Platform. Some of this is done automatically by the system, whereas other aspects of duplicate management depend on manual user input.
Files are checked for duplicates on upload. If documents already exist in the project and have been successfully processed, files in the new upload batch are checked against all existing files in the project. The check first compares the exact size (in bytes) of each document in the upload dialog against the documents already uploaded to the project. If there is a size match, an MD5Hash value is generated for the uploaded document and compared with the MD5Hash value of the document that already exists in the project. If the MD5Hash values match, the uploaded document is flagged as a duplicate. The uploader sees a pop-up message advising that some of the selected files are already present in the project, and is then given the option to skip uploading or to continue.
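The two-stage check described above (size first, MD5 only on a size match) can be sketched in Python. This is an illustration only and not Opus 2 Platform's implementation: the function names `md5_of`, `build_index` and `is_duplicate`, and the in-memory index keyed by file size, are assumptions made for the example.

```python
import hashlib
from collections import defaultdict


def md5_of(data: bytes) -> str:
    """Return the hex MD5 digest of the given file content."""
    return hashlib.md5(data).hexdigest()


def build_index(existing_files: list[bytes]) -> dict[int, set[str]]:
    """Hypothetical index: size in bytes -> MD5 hashes of processed files of that size."""
    index: dict[int, set[str]] = defaultdict(set)
    for content in existing_files:
        index[len(content)].add(md5_of(content))
    return index


def is_duplicate(upload: bytes, index: dict[int, set[str]]) -> bool:
    """Compare sizes first; compute a hash only when a size match exists."""
    candidates = index.get(len(upload))
    if not candidates:
        return False  # no existing file has this size, so no hash is needed
    return md5_of(upload) in candidates
```

Comparing sizes first is cheap and rules out most files without reading their content, so a hash only needs to be computed for the few uploads whose size matches an existing file.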
Sometimes there are duplicate files within the same upload batch. In this case the duplicates cannot be detected on upload, because the system has not yet processed the documents in the batch.
If a large batch of files is being uploaded, it is sometimes advisable to upload them to the files page first, without publishing them as documents, so that duplicate files can be checked for before publication.
On the files page, there is an option to check either the full list of files or an individual batch of files for 'exact duplicates'. Once this filter is on, the list can be sorted by filename so that duplicate files appear next to each other, allowing the user to delete the redundant copy before publishing the file as a document.
The above duplication check is limited to files that are duplicates in both content and filename. Because it relies on manual detection based on filename, two files with different names but identical content can still be published as documents. If this has happened, duplicate files can be found on the documents page using the MD5Hash value of a document.
On the documents page, it is possible to search for duplicates of a specific document by filtering on the MD5Hash field. This is a system field that displays the MD5 hash value of each document. Documents that are identical in content but have different filenames still have the same MD5Hash value.
To detect identical versions of a specific document, the MD5Hash value of that particular document can be entered as a text string into the 'Find' box at the top of the Documents table, and any duplicate documents will be displayed.
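As a rough model of what this search does, the sketch below filters a list of document records by hash and ignores the filename. The `Document` record and the `find_by_md5` function are hypothetical names used for illustration; they do not represent the platform's data model or API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Document:
    filename: str
    md5_hash: str  # stands in for the MD5Hash system field


def find_by_md5(documents: list[Document], md5_hash: str) -> list[Document]:
    """Return every document whose hash matches, regardless of filename."""
    wanted = md5_hash.strip().lower()
    return [doc for doc in documents if doc.md5_hash.lower() == wanted]
```

Given two records with the same hash but the names `Contract.pdf` and `Contract (final).pdf`, both are returned, which is the case that a filename-based check misses.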
When an email with attachments is uploaded, Opus 2 Platform automatically extracts the attachments as individual documents. These are then linked to the parent email via an automatically created relationship of type 'attachment'. If the attached documents already exist in the platform, then rather than extracting them again and creating duplicates, only the email is uploaded and the existing documents are linked to it using the same 'attachment' relationship type. This is done automatically as part of the file ingestion process, when the email file is unpacked. This information is not available during the initial upload of the email message.
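The linking behaviour can be pictured with the following sketch. It assumes that existing documents are matched by MD5 hash of their content, which the description above does not state; the `Project` store, the `ingest_email` function and the `doc-N` identifiers are likewise hypothetical.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class Project:
    # Hypothetical store: MD5 hash -> document id
    documents: dict[str, str] = field(default_factory=dict)
    # Each entry is (email id, document id, relationship type)
    relationships: list[tuple[str, str, str]] = field(default_factory=list)


def ingest_email(project: Project, email_id: str,
                 attachments: dict[str, bytes]) -> list[str]:
    """Link every attachment to the email; extract only those not already present.

    Returns the ids of the documents that were newly created.
    """
    created: list[str] = []
    for content in attachments.values():
        digest = hashlib.md5(content).hexdigest()  # assumed matching key
        doc_id = project.documents.get(digest)
        if doc_id is None:
            # Attachment is new to the project, so extract it as a document.
            doc_id = f"doc-{len(project.documents) + 1}"
            project.documents[digest] = doc_id
            created.append(doc_id)
        # New or existing, the document is linked to the parent email.
        project.relationships.append((email_id, doc_id, "attachment"))
    return created
```

In this model a second email carrying an attachment that is already in the project creates no new document and only adds an 'attachment' relationship to the existing one.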
Only files that a user has access to are included in the duplicate detection process. For example, if user A uploads a file on the documents page and sets access control on it so that only they can see the document, and user B then uploads the same file with no access restrictions, user B will not see a warning that a duplicate exists within the project. However, if user A returns to the documents page, they will now see that there is a duplicate of their file in the system. In general, users are only alerted to the presence of duplicates if they have access to both documents. This prevents users from using the duplicate detection mechanism to gain information about the project that they would not otherwise have access to.
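The user A and user B scenario can be modelled as a duplicate lookup that is filtered by what the requesting user is allowed to see. The sketch below is a simplified illustration; `StoredFile`, `visible_to` and `duplicates_visible_to` are assumed names, and access control is reduced to an optional set of user names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class StoredFile:
    md5_hash: str
    # None means no access restrictions; otherwise only these users can see the file.
    allowed_users: Optional[frozenset[str]] = None


def visible_to(file: StoredFile, user: str) -> bool:
    """Return True if the user is allowed to see the file."""
    return file.allowed_users is None or user in file.allowed_users


def duplicates_visible_to(user: str, md5_hash: str,
                          files: list[StoredFile]) -> list[StoredFile]:
    """Return matching files, restricted to those the user can access."""
    return [f for f in files if f.md5_hash == md5_hash and visible_to(f, user)]
```

If user A's restricted copy is the only match, the lookup returns nothing for user B, so no warning would be shown. Once user B's unrestricted copy is stored, the same lookup for user A returns both files.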