When populating projects that contain a large number of documents, the same file may be uploaded more than once, leading to duplicate files within the system. This article describes how duplicate files are managed in Opus 2 Platform. Some of this is done automatically by the system, whereas other aspects of duplicate management depend on manual user input.
Files are checked for duplicates on upload. What the files are checked against depends on the user's role, and whether they have the Access file manager capability enabled.
For a role with the Access file manager capability enabled, the files in the upload dialog are checked against all files in Platform, including published documents, unpublished files, and files waiting to be processed.
If the Access file manager capability is disabled, then the files in the upload dialog are only checked against published documents that the user has access to.
Files in the same upload batch are also checked against each other. So, if a user adds multiple copies of the same file to the upload dialog, the first file will show as unique and the rest will show as duplicates.
The duplicate check first compares the exact size (in bytes) of the document in the upload dialog against documents that have already been uploaded to the project. If there is a size match, an MD5Hash value is generated for the uploaded document and compared with the MD5Hash value of the document that already exists in the project. If the MD5Hash values match, the uploaded document is flagged as a duplicate.
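The two-stage check described above (size first, MD5 only on a size match) can be sketched as follows. This is an illustrative sketch only, not Platform's actual implementation: the `StoredFile` record and the `is_duplicate` and `flag_batch` functions are hypothetical names, and files are held in memory for simplicity.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class StoredFile:
    # Hypothetical record for a file; not a Platform data structure.
    name: str
    content: bytes


def md5_hex(data: bytes) -> str:
    # MD5 digest of the file content as a hexadecimal string.
    return hashlib.md5(data).hexdigest()


def is_duplicate(upload: bytes, existing: list[StoredFile]) -> bool:
    # Stage 1: cheap filter on exact size in bytes.
    same_size = [f for f in existing if len(f.content) == len(upload)]
    if not same_size:
        return False  # no size match, so no hash is computed
    # Stage 2: compare MD5 values only against the size-matched files.
    upload_hash = md5_hex(upload)
    return any(md5_hex(f.content) == upload_hash for f in same_size)


def flag_batch(batch: list[StoredFile], existing: list[StoredFile]) -> list[tuple[str, bool]]:
    # Files in one batch are also checked against each other:
    # the first copy is unique, later copies are duplicates.
    seen = list(existing)
    flags = []
    for f in batch:
        flags.append((f.name, is_duplicate(f.content, seen)))
        seen.append(f)
    return flags
```

Comparing sizes first means a hash only has to be computed for files that could plausibly match, which keeps the check cheap for large batches.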
The dialog box contains the following options for handling duplicates.
- Number of duplicates identified
- Only show duplicates: if this option is selected, only the duplicate files are displayed
- Remove duplicates: removes all the duplicates identified
To remove only certain duplicates, select the documents to be removed, click Actions, and select “Remove selected uploads” from the drop-down.
If the user has not taken any action on the duplicate files, they will see a pop-up message advising that some of the files they have selected are already present in the project. They are then presented with the option to skip uploading or to continue.
If a large batch of files is being uploaded, it is sometimes advisable to upload these to the files page without publishing them as documents first, in order to check for duplicate files before publication.
On the files page, there is an option to check the full list of files, or an individual batch of files for 'Exact duplicates'. Once this filter is on, the list can be sorted on filename, and the two duplicate files will appear in the list next to each other, allowing the user to delete one of the files before publishing as a document.
Filename is not considered when performing the duplicate check. If two files have identical content but different filenames, they will still be marked as duplicates.
The Principal Duplicate field shows whether a document has exact duplicates present in the project. The first file uploaded within a duplicate group is identified as the principal file. A user with access to the Files page can declare a different file within the group as the principal file.
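As a rough illustration of how the default principal file follows from upload order, the sketch below groups files by hash and treats the earliest upload in each group as the principal. The function name `principal_files` and the input shape are hypothetical, chosen only for this example, and the sketch does not model a user manually declaring a different principal.

```python
from collections import defaultdict


def principal_files(uploads: list[tuple[str, str]]) -> dict[str, str]:
    """Map each file in a duplicate group to its principal file.

    `uploads` is a list of (filename, md5_hash) pairs in upload order.
    Files with no duplicates are left out of the result.
    """
    groups: dict[str, list[str]] = defaultdict(list)
    for name, digest in uploads:
        groups[digest].append(name)

    result: dict[str, str] = {}
    for names in groups.values():
        if len(names) > 1:
            # The first file uploaded in the group is the principal.
            for name in names:
                result[name] = names[0]
    return result


uploads = [("a.pdf", "h1"), ("b.pdf", "h2"), ("copy-of-a.pdf", "h1")]
mapping = principal_files(uploads)
```

Here `mapping` relates both `a.pdf` and `copy-of-a.pdf` to `a.pdf`, while `b.pdf` has no duplicates and is omitted.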
On the documents page, it is possible to search for duplicates of a specific document by filtering on the MD5Hash field. This is a system field that displays the MD5 hash value for each document. Documents that are identical in content but have differing filenames still have the same MD5Hash value.
To detect identical versions of a specific document, the MD5Hash value of that particular document can be entered as a text string into the 'Find' box at the top of the Documents table, and any duplicate documents will be displayed.
Another way to identify duplicate documents on the Documents page is the Duplicates metadata field. This system metadata field is visible to users who have the "See duplicate metadata" capability. The Duplicates field displays all the duplicate files along with the principal file. The first file uploaded within a duplicate group is identified as the principal file.
When an email with attachments is uploaded, Opus 2 Platform automatically extracts the attachments as individual documents. These are then linked to the parent email via an automatically created relationship of type 'attachment'. If the attached documents already exist in the platform, then rather than extracting the documents again and creating duplicates, only the email is uploaded and the existing documents are linked to the email using the same 'attachment' relationship type. This is done automatically as part of the file ingestion process when the email file is unpacked. This information is not available during the initial upload of the email message.
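The attachment handling described above can be pictured with the following sketch. It assumes that existing documents are matched by MD5 hash, which the article does not state for attachments; the function `ingest_email` and its data shapes are hypothetical and exist only for illustration.

```python
import hashlib


def ingest_email(
    email_name: str,
    attachments: list[tuple[str, bytes]],
    documents: dict[str, str],
    relationships: list[tuple[str, str, str]],
) -> None:
    """Link each attachment to its parent email.

    `documents` maps an MD5 hash to the name of the document already
    stored (assumed lookup key). `relationships` collects
    (parent, child, type) triples.
    """
    for filename, content in attachments:
        digest = hashlib.md5(content).hexdigest()
        if digest not in documents:
            # Not seen before: extract the attachment as a new document.
            documents[digest] = filename
        # In both cases, link the stored document to the email.
        relationships.append((email_name, documents[digest], "attachment"))


docs = {hashlib.md5(b"report").hexdigest(): "report.pdf"}
rels: list[tuple[str, str, str]] = []
ingest_email("mail.msg", [("report-copy.pdf", b"report"), ("new.docx", b"new")], docs, rels)
```

In this example the first attachment matches the existing `report.pdf`, so no new document is created for it, while `new.docx` is extracted as a new document. Both end up linked to `mail.msg`.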
Only files that a user has access to are included in the duplicate detection process. For example, if user A uploads a file on the documents page and sets access control on it so that only they can see the document, and user B then uploads the same file with no access restrictions, user B will not see a warning that a duplicate exists within the project. However, if user A returns to the documents page, they will now see that there is a duplicate of their file in the system. In general, users are only alerted to the presence of duplicates if they have access to both documents. This prevents users from using the duplicate detection mechanism to gain information about the project that they would not otherwise have access to.
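The user A and user B scenario above can be modelled with a small sketch. The `Doc` class, its `allowed` field, and the `visible_duplicates` function are hypothetical; an empty `allowed` set is assumed here to mean the document has no access restrictions.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    name: str
    md5: str
    # Users allowed to see the document; empty means unrestricted (assumption).
    allowed: set[str] = field(default_factory=set)

    def visible_to(self, user: str) -> bool:
        return not self.allowed or user in self.allowed


def visible_duplicates(user: str, md5: str, docs: list[Doc]) -> list[str]:
    # Only documents the user can access take part in the comparison.
    return [d.name for d in docs if d.md5 == md5 and d.visible_to(user)]


# User A uploads a file that only they can see.
docs = [Doc("a-private.pdf", "h1", allowed={"A"})]

# User B uploads the same content: no visible match, so no warning.
seen_by_b = visible_duplicates("B", "h1", docs)

# User B's unrestricted copy is stored; user A can now see both files.
docs.append(Doc("b-public.pdf", "h1"))
seen_by_a = visible_duplicates("A", "h1", docs)
```

`seen_by_b` is empty, so user B learns nothing about the restricted file, while `seen_by_a` lists both copies because user A has access to each of them.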