At the heart of the Opus2 Search engine lies Lucene, a robust search library with powerful indexing and querying capabilities. Using the Lucene Query Language, users can craft precise search queries to filter and retrieve documents with high specificity.
This help guide covers both the basics and more advanced syntax of the Lucene Query Language so that users can construct a range of effective search queries.
Terms
A query is broken up into terms and operators. There are two types of terms: Single Terms and Phrases.
A Single Term is a single word such as hello or world.
A Phrase is a group of words surrounded by double quotes such as "hello world".
Multiple terms can be combined together with Term Modifiers and/or Boolean Operators to form a more complex query.
Fields
When performing a search, the default field used is Text, which searches the text content of each document. You can also specify a field by typing the field name followed by a colon : and then the term you are looking for.
As an example, if you want to find documents containing the text "Contract agreement" that also have the “Review” tag applied, you can enter:
"Contract agreement" AND Tags:Review
Since Text is the default field, the field indicator is not required.
Note: A field applies only to the term that directly follows it, so the query
"Document Name":Contract Agreement
will only look for Contract in the Document Name field. Agreement is searched for in the default Text field, that is, in the text of the document.
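The note above means that a field has to be repeated for every term it should apply to. The following Python sketch shows one way to build such a query string. The helper names quote_if_needed and field_query are hypothetical and are not part of Opus2 or Lucene, and the sketch assumes that field names and terms containing spaces are wrapped in double quotes, as in the examples in this guide. It does not escape special characters.

```python
def quote_if_needed(text: str) -> str:
    """Wrap text in double quotes when it contains whitespace."""
    return f'"{text}"' if any(ch.isspace() for ch in text) else text


def field_query(field: str, *terms: str, operator: str = "AND") -> str:
    """Prefix every term with the field so none falls back to the default field."""
    prefix = quote_if_needed(field)
    clauses = [f"{prefix}:{quote_if_needed(term)}" for term in terms]
    return f" {operator} ".join(clauses)


print(field_query("Document Name", "Contract", "Agreement"))
# "Document Name":Contract AND "Document Name":Agreement
```

Both terms are now looked up in the Document Name field instead of only the first one.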
Operators - Term Modifiers:
Wildcard
Wildcard searches are useful when you want to match terms based on specific patterns, such as terms that begin with, end with, or contain a particular sequence of letters. They are often used when you know part of a word, such as a prefix, but want flexibility in the rest of it.
Single and multiple character wildcard searches are possible within single terms (not within phrase queries).
A single character wildcard search looks for terms that match the pattern with exactly one character in place of the wildcard. To perform a single character wildcard search use the ? symbol.
For example, to search for text or test you can use the search:
te?t
Multiple character wildcard searches look for 0 or more characters. To perform a multiple character wildcard search use the * symbol.
For example, to search for contract, contracts, or contractor, you can use the search:
contract*
You can also do wildcard searches in the middle of a term. For example:
cont*t
would return hits with contract, context, and content.
Note: You cannot use a * or ? symbol as the first character of a search.
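To make the wildcard rules concrete, the following Python sketch translates a wildcard pattern into a regular expression and tests it against a few words. It only illustrates the matching rules described above and is not how the search engine evaluates wildcards internally. The function names are hypothetical, and the comparison is case-sensitive purely for simplicity.

```python
import re


def wildcard_to_regex(pattern: str) -> re.Pattern:
    """Translate ? (exactly one character) and * (zero or more) into a regex."""
    if pattern[:1] in ("*", "?"):
        raise ValueError("a wildcard cannot be the first character of a search")
    parts = []
    for ch in pattern:
        if ch == "?":
            parts.append(".")
        elif ch == "*":
            parts.append(".*")
        else:
            parts.append(re.escape(ch))  # every other character is literal
    return re.compile("".join(parts))


def wildcard_match(pattern: str, term: str) -> bool:
    """Return True when the whole term matches the wildcard pattern."""
    return wildcard_to_regex(pattern).fullmatch(term) is not None


words = ["text", "test", "tet", "contract", "contracts", "contractor", "context", "content"]
for pattern in ("te?t", "contract*", "cont*t"):
    print(pattern, [word for word in words if wildcard_match(pattern, word)])
```

Note that te?t does not match tet, because ? stands for exactly one character, whereas contract* matches contract itself, because * may stand for no characters at all.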
Fuzzy
The Fuzzy query uses the Damerau-Levenshtein distance to find all terms with one or two changes, where a change is the insertion, deletion or substitution of a single character, or transposition of two adjacent characters. It is particularly useful when you want to account for typos or slight variations in a term.
To do a fuzzy search, add the tilde ~ symbol to the end of a Single Term, followed by an edit distance of either 1 or 2.
An edit distance of 1 should be sufficient to catch 80% of all human misspellings. For example, to search for a term similar in spelling to contract use the fuzzy search:
contract~1
This search will find terms like contact and contrast.
For a wider variance of results, set the edit distance to 2. For example:
contract~2
will also find terms like contacts and contrary, which are each two changes away from contract.
Note: Mixing fuzzy and wildcard operators is not supported. When mixed, one of the operators will not be applied.
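To check how many changes separate two words, the distance described above can be computed by hand or with a short script. The Python sketch below uses the optimal string alignment variant of the Damerau-Levenshtein distance. Treating this variant as equivalent to what the search engine uses is an assumption, and the function name edit_distance is hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Count insertions, deletions, substitutions and adjacent swaps between a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete every character of a
    for j in range(cols):
        d[0][j] = j  # insert every character of b
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution (or match)
            )
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[-1][-1]


for candidate in ("contact", "contrast", "contarct"):
    print(candidate, edit_distance("contract", candidate))
```

Each of the three candidates is one change away from contract: a deleted letter, a substituted letter, and two swapped adjacent letters respectively.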
Proximity
The proximity modifier finds words in a phrase that are a specific distance away from one another. To do a proximity search use the tilde ~ symbol at the end of a phrase.
For example, to search for contract and agreement within 5 words of each other in a document use the search:
"contract agreement"~5
The closer the text in a field is to the word order specified in the query string, the more relevant that document is considered to be. For the example query above, a document containing agreement contract would be considered more relevant than one containing agreement draft contract.
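The relevance ordering above can be illustrated with a simplified model of how far a piece of text is from a two-word phrase. The Python sketch below assumes that the distance is the number of positions the second word would have to move to sit directly after the first word. This is a hedged approximation for two-word phrases only, not the engine's actual scoring, and the function name required_distance is hypothetical.

```python
from typing import Optional


def required_distance(text: str, first: str, second: str) -> Optional[int]:
    """Smallest proximity value at which the phrase "first second" would match text.

    Simplified model: the phrase expects `second` one position after `first`,
    so the cost is how far the actual positions deviate from that layout.
    Returns None when either word is missing from the text.
    """
    words = text.lower().split()
    first_positions = [i for i, word in enumerate(words) if word == first]
    second_positions = [i for i, word in enumerate(words) if word == second]
    if not first_positions or not second_positions:
        return None
    return min(abs(q - p - 1) for p in first_positions for q in second_positions)


for sample in ("contract agreement", "agreement contract", "agreement draft contract"):
    print(sample, "->", required_distance(sample, "contract", "agreement"))
```

Under this model the exact phrase needs a distance of 0, agreement contract needs 2, and agreement draft contract needs 3, which is consistent with the ordering described above. All three fall within the limit of 5 used in the example query.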
Range Searches
Range queries search for documents whose field values are between the lower and upper bounds specified by the Range Query. Range Queries can be inclusive or exclusive of the upper and lower bounds. Values are compared lexicographically (character by character) rather than numerically.
Range searches are particularly useful for dates. For example, to search for documents with a Date field value between 01 January 2023 and 01 January 2024, enter the following (dates are indexed in the YYYY-MM-DD format):
Date:[2023-01-01 TO 2024-01-01]
Range Queries can also be done for non-date fields. For example, to search for a subset of documents based on a Bates range one could enter:
"Bates Beg":{BB000123 TO BB000300}
This will find all documents whose Bates Beg values are between BB000123 and BB000300, but not including BB000123 and BB000300.
Inclusive range queries are denoted by square brackets. Exclusive range queries are denoted by curly brackets.
Asterisks can also be used to create unbounded range queries. For example, to search for documents with date fields before 1 January 2024, enter:
Date:{* TO 2024-01-01}
And to search for documents with date fields on or after 1 January 2024 enter:
Date:[2024-01-01 TO *]
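Because range bounds are compared lexicographically, it can help to see how inclusive, exclusive and open-ended bounds behave on plain strings. The Python sketch below is an illustration under that assumption and not the engine's implementation. The function name in_range is hypothetical, and None stands in for the * symbol.

```python
from typing import Optional


def in_range(value: str, lower: Optional[str], upper: Optional[str],
             include_lower: bool = True, include_upper: bool = True) -> bool:
    """Compare strings lexicographically; None stands for the * (unbounded) side."""
    if lower is not None:
        if value < lower or (value == lower and not include_lower):
            return False
    if upper is not None:
        if value > upper or (value == upper and not include_upper):
            return False
    return True


# Date:[2023-01-01 TO 2024-01-01] includes both end dates.
print(in_range("2024-01-01", "2023-01-01", "2024-01-01"))
# "Bates Beg":{BB000123 TO BB000300} excludes both end values.
print(in_range("BB000123", "BB000123", "BB000300", False, False))
# Date:{* TO 2024-01-01} has no lower bound.
print(in_range("1999-12-31", None, "2024-01-01", include_upper=False))
# Without zero padding, BB200 sorts after BB000300 and falls outside the range.
print(in_range("BB200", "BB000123", "BB000300", False, False))
```

The last call shows why consistent formats such as YYYY-MM-DD dates and zero-padded Bates numbers matter: string comparison only agrees with chronological or numeric order when every value has the same layout.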
Boosting
The boost operator is used to make one term more relevant than another. To boost a term use the caret ^ symbol with a boost factor (a number) at the end of the term you are searching. The higher the boost factor, the more relevant the term will be.
For example, if you are searching for purchase contract and you want the term contract to be more relevant, add the ^ symbol and a boost factor directly after the term. You would type:
purchase contract^4
This will make documents with the term contract appear more relevant.
You can also boost phrases as in the example:
"purchase contract"^4 "contract agreement"
By default, the boost factor is 1. Although the boost factor must be positive, it can be less than 1 (e.g. 0.2).
Operators - Boolean Operators:
Boolean operators allow terms to be combined through logic operators. The supported Boolean operators are OR, AND, + and -. The OR and AND operators must be written in ALL CAPS.
By default, all terms are optional, as long as one term matches. A search for contract purchase agreement (without quotes) will find any document that contains one or more of contract, purchase or agreement. To refine the parameters of each term in a query string and define mandatory requirements, Boolean operators can be added.
OR
The OR operator is the default conjunction operator. This means that if there is no Boolean operator between two terms, the OR operator is used. The OR operator links two terms and finds a matching document if either of the terms exists in a document.
To search for documents that contain either "unfair dismissal" or just dismissal use the query:
"unfair dismissal" dismissal
or
"unfair dismissal" OR dismissal
Note: It is possible to escape special characters that are part of the query syntax. The current list of special characters is:
+ - && || ! ( ) { } [ ] ^ " ~ * ? : \
To escape these characters, put a \ before the character. For example, to search for "Foster v Bates [2020]" use the query:
"Foster v Bates \[2020\]"
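When search text comes from another source, such as a list of case names, the escaping can be applied automatically. The Python sketch below adds a backslash before each character from the list above. The function name escape_query_text is hypothetical, and escaping each & and | individually (rather than only the pairs && and ||) is an assumption made for simplicity.

```python
# Characters taken from the list of special characters in this guide.
SPECIAL_CHARACTERS = set('+-&|!(){}[]^"~*?:\\')


def escape_query_text(text: str) -> str:
    """Put a backslash in front of every character that is part of the query syntax."""
    return "".join("\\" + ch if ch in SPECIAL_CHARACTERS else ch for ch in text)


print(escape_query_text("Foster v Bates [2020]"))
# Foster v Bates \[2020\]
```

The escaped text can then be wrapped in double quotes to search for it as a phrase.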
AND
The AND operator matches documents where both terms exist anywhere in the text of a single document.
To search for documents that contain "contract agreement" and "John Smith" use the query:
"contract agreement" AND "John Smith"
+ (Must have)
The + or must have operator requires that the term after the + symbol exists somewhere in a document.
For example, to search for documents that must contain contract and may* contain agreement use the query:
+contract agreement
* While the term is optional, its presence increases the document's relevance.
- (Must not have)
The - or must not have operator excludes documents that contain the term after the - symbol.
To search for documents that contain "contract agreement" but must not contain "John Smith" use the query:
"contract agreement" -"John Smith"
Note: "contract agreement" has no + symbol, but because it is the only optional term in the query, a document must contain it to be found.
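The way the + and - operators interact with optional terms can be summarised in a few lines of logic. The Python sketch below is a simplified model, assuming that a query with no + terms needs at least one optional term to match, in line with the default behaviour described at the start of this section. It ignores relevance ranking and phrases, and the function name matches is hypothetical.

```python
def matches(document_terms: set, must=(), must_not=(), optional=()) -> bool:
    """Apply must (+), must not (-) and optional clauses to one document's terms."""
    if any(term in document_terms for term in must_not):
        return False  # a - term is present, so the document is excluded
    if not all(term in document_terms for term in must):
        return False  # a + term is missing
    if must:
        return True   # optional terms would only affect relevance here
    return any(term in document_terms for term in optional)


document = {"contract", "draft"}
# +contract agreement
print(matches(document, must=["contract"], optional=["agreement"]))
# contract -draft
print(matches(document, optional=["contract"], must_not=["draft"]))
```

The first query finds the document even though agreement is absent, because only contract is mandatory. The second query excludes it because draft is present.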
Grouping
Grouping helps define precedence and logical structure for more complicated search conditions. Parentheses allow you to group query clauses and control how Boolean operators are applied. Without parentheses, Lucene applies a default operator precedence (usually AND has higher precedence than OR).
To search for documents that contain contract and either purchase or sale use the query:
(purchase OR sale) AND contract
This eliminates any confusion and makes sure that contract must exist and that at least one of purchase or sale must also exist.
Field Grouping
Grouping can also be done when using field indicators.
To search for a Document Name that contains both the word contract and the phrase "John Smith" use the query:
"Document Name":(+contract +"John Smith")
Field grouping is also useful for searching a range of Tags. For example, to return documents that must have the tag Hot and must not have the tag To review:
Tags:(+Hot -"To review")
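Field groups follow a regular pattern, so they can be assembled programmatically when many tags or names are involved. The Python sketch below builds a field group from lists of required, excluded and optional terms. The helper names are hypothetical, the quoting rule (double quotes around anything containing a space) is inferred from the examples in this guide, and special characters are not escaped.

```python
def quote_if_needed(text: str) -> str:
    """Wrap text in double quotes when it contains whitespace."""
    return f'"{text}"' if any(ch.isspace() for ch in text) else text


def field_group(field: str, must=(), must_not=(), optional=()) -> str:
    """Build field:(+a -b c) from required, excluded and optional terms."""
    clauses = (
        [f"+{quote_if_needed(term)}" for term in must]
        + [f"-{quote_if_needed(term)}" for term in must_not]
        + [quote_if_needed(term) for term in optional]
    )
    return f"{quote_if_needed(field)}:({' '.join(clauses)})"


print(field_group("Tags", must=["Hot"], must_not=["To review"]))
# Tags:(+Hot -"To review")
print(field_group("Document Name", must=["contract", "John Smith"]))
# "Document Name":(+contract +"John Smith")
```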