2 Basics
2.1 Introduction
Before you start a search, you need to know what you are looking for. This chapter therefore first introduces the key criteria for the quality and efficiency of a search in Recall and precision.
The process of defining and finding relevant documents is then described in Defining relevant results for your research and Working through a set of documents.
With the key criteria and the required results known, a set of documents can be selected, as described in Selecting a set of documents, and you can decide when the search is finished, as described in When to stop.
2.2 Recall and precision
Recall and precision are the key criteria for assessing a search: recall reflects its quality and precision its efficiency. Although recall is difficult to measure, it still helps you to understand the quality of your search.
In practice, 100% recall and 100% precision are not possible at the same time. Performing a search is therefore always a balance between effort and quality.
- Recall (how complete the search is) is defined as the ratio of relevant documents found to all relevant documents.
- Precision is defined as the ratio of relevant documents found to all documents found.
The collections of documents used for both criteria are shown in figure 1.
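The two definitions can be written out as a small calculation. The following is a minimal sketch, assuming that the set of all relevant documents is known, which in a real search it normally is not. The function names and the document identifiers are hypothetical and only serve as illustration.

```python
def recall(found: set[str], relevant: set[str]) -> float:
    """Share of all relevant documents that the search found."""
    if not relevant:
        raise ValueError("recall is undefined when there are no relevant documents")
    return len(found & relevant) / len(relevant)


def precision(found: set[str], relevant: set[str]) -> float:
    """Share of the documents found that are relevant."""
    if not found:
        raise ValueError("precision is undefined when nothing was found")
    return len(found & relevant) / len(found)


# Hypothetical document identifiers, for illustration only.
all_relevant = {"D1", "D2", "D3", "D4"}
found = {"D1", "D2", "D7", "D8", "D9"}

r = recall(found, all_relevant)      # 2 of the 4 relevant documents were found
p = precision(found, all_relevant)   # 2 of the 5 documents found are relevant
print(f"recall = {r:.2f}, precision = {p:.2f}")  # recall = 0.50, precision = 0.40
```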
Good search quality is important in many patent searches. In product development, for example, your questions about the patent information relevant to the development must be answered completely. Missing a relevant document can lead to unnecessary development work or to the inability to sell a product, which can have serious financial consequences.
Recall is thus a good criterion for the quality of a search. However, the achieved recall can normally only be estimated by having another person redo the search, so it is not a criterion that is easy to apply. The following sections describe how to work with the recall criterion.
The efficiency of a search strongly affects its cost. The precision criterion can be used to assess this efficiency. It is easily measured after the search is finished by comparing the number of relevant documents with the total number of documents selected. On its own this figure has almost no value, but it becomes useful when you compare different searches or compare your results with those of peers. The use of this criterion is also described in the following sections.
Experience shows that recall and precision generally work in opposite directions: as recall improves, precision decreases, and as precision improves, recall decreases. During a search you have to choose this balance yourself. The main problem is that you normally do not know the recall of your search. Experience with searching helps you to estimate this balance better, as the following sections should make clear.
For more information about recall and precision, see for example Wikipedia.
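The opposite movement of the two criteria can be illustrated with invented numbers. The sketch below assumes that exactly 10 relevant documents exist and compares three hypothetical searches of increasing breadth. All figures are made up for illustration, and in a real search the total number of relevant documents is unknown.

```python
# Assumption for illustration: 10 relevant documents exist in total.
TOTAL_RELEVANT = 10

# (label, documents found, relevant documents among them) - invented figures
searches = [
    ("narrow", 20, 6),
    ("medium", 100, 8),
    ("broad", 1000, 10),
]

rows = []
for label, n_found, n_relevant_found in searches:
    recall = n_relevant_found / TOTAL_RELEVANT
    precision = n_relevant_found / n_found
    rows.append((label, recall, precision))
    print(f"{label:<7} recall={recall:.2f} precision={precision:.2f}")
```

With these numbers the broad search finds every relevant document, but only 1 in 100 of the documents to be viewed is relevant.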
2.3 Defining relevant results for your research
Before starting a search you should have a clear picture of what the relevant results should be. The relevant results are determined by the questions to be answered in the research.
For a patent examiner, for example, the claims of the patent application determine how relevant the retrieved documents are. This is a relatively clear criterion for selecting documents. If you are researching technical solutions, the selection criterion is less clear, because you are often searching for concepts or for all solutions to a certain problem. In this case it helps to determine one or a few technical features that must be present in a document for it to be relevant. This allows you to quickly select possibly relevant documents.
Another important reason to define the relevant results clearly is that it allows you to determine what not to look for. If certain technical fields or applications can be excluded, the number of documents is reduced (and the precision thus increased).
2.4 Working through a set of documents
Once you have defined what the relevant results are, you can start selecting the relevant documents.
It is often difficult to determine quickly whether a document is really relevant. To do so you normally have to read the document, and there is not enough time to read a large number of documents. Therefore, in a first pass, you select possibly relevant documents from a workable set of documents (see figure 2), using the selection criterion described in the previous section.
In this first pass it is normally not necessary to read the documents. It is often sufficient to look at the figures or to read the abstract (or part of the description) to see whether a document passes or fails your criterion. In this way the first pass yields a selected group of possibly relevant documents.
When the first pass is done, it is time to look more closely at the selected group of documents. By reading them, the really relevant documents can be found (see figure 3). These documents answer the questions of your research.
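The two passes can be pictured as two filters applied one after the other. The sketch below is only an illustration: the `Document` class, the function names and the example documents are hypothetical. The keyword test in the first pass is a crude stand-in for a human glance at figures and abstracts, and the `is_relevant` function stands for the judgement made while reading the full document.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class Document:
    doc_id: str
    abstract: str
    full_text: str


def first_pass(docs: Iterable[Document], required_features: list[str]) -> list[Document]:
    """Quick screen on the abstract only: keep a document if every
    required feature is mentioned."""
    return [
        d for d in docs
        if all(f.lower() in d.abstract.lower() for f in required_features)
    ]


def second_pass(candidates: Iterable[Document],
                is_relevant: Callable[[Document], bool]) -> list[Document]:
    """Careful check of the shortlisted documents only."""
    return [d for d in candidates if is_relevant(d)]


# Hypothetical workable set.
workable_set = [
    Document("D1", "A bicycle brake with a hydraulic piston.", "The piston is sealed by an O-ring."),
    Document("D2", "A hydraulic pump for tractors.", "The pump housing is cast."),
    Document("D3", "Hydraulic brake for a bicycle.", "No seal is described."),
]

candidates = first_pass(workable_set, ["hydraulic", "brake"])
relevant = second_pass(candidates, lambda d: "o-ring" in d.full_text.lower())

print([d.doc_id for d in candidates])  # ['D1', 'D3']
print([d.doc_id for d in relevant])    # ['D1']
```

The expensive step (reading) is applied only to the documents that survive the cheap step.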
2.5 Selecting a set of documents
Selecting a workable set containing all relevant documents is the most important part of a search, because it has a huge influence on the quality of the search and on the amount of work needed to go through the set (recall and precision). See figure 4 for an impression of the number of documents.
The previous section describes how to work through the set to finally select only the really relevant documents. This is “just” a lot of work. Selecting the workable set does not have to take a lot of time, but it is the most difficult part of finding information in patents.
In practice it is very difficult to select all relevant documents (100% recall). A workable goal is therefore to come as close as possible to 100% recall with a workable number of documents (high precision). This section introduces the most commonly used methods to achieve this goal.
The available methods are determined by the information in the databases and by the functionality of the current search engines. These possibilities are explained in the following sections in relation to the goal of selecting a workable set containing the relevant documents.
2.5.1 Text search
The available search engines, for example Espacenet or Google Patents, allow you to search for words in the whole text of patent documents. This so-called full-text search lets you select documents by using words that describe the relevant features.
It is normally not possible to describe concepts in words, but parts or materials can be described. Care has to be taken with possible synonyms. In the chemical field, for example, where compounds are normally described in a standardised way, it is not uncommon for up to 10 or 20 synonyms to be used for one compound.
Patents are published in many languages and may therefore have to be searched using words in different languages. In Espacenet only the three official languages of the EPO can be used for searching. In Google Patents other languages can be used as well. To relieve the searcher of finding synonyms in different languages, several search databases also use machine-translated English texts.
Experience with full-text search shows:
- Certain relevant features can be described properly using words, others cannot. Parts, materials or compounds can often be described properly, but the relations between them often cannot.
- Not all languages have words for certain features. Often a single word for a feature exists in one language, while the same feature is described using multiple words in another language.
- The words and phrases used in patent documents are often descriptive rather than specific, in order to give them a broad scope of meaning. It can therefore be difficult to find the proper words for searching.
- To come close to 100% recall, all synonyms in the different languages have to be used. This is often not easy and requires knowledge of and experience in a technical field.
- Achieving a high recall with full-text search alone generally results in a very low precision. This means that an unworkable set of documents is selected.
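Collecting synonyms per feature quickly leads to long queries. The sketch below shows one way to assemble such a query from lists of synonyms. The helper names are hypothetical and the generic `AND`/`OR` syntax with parentheses is an assumption. The actual query syntax differs per search engine, so check its manual before use.

```python
def any_of(terms: list[str]) -> str:
    """Join the synonyms for one feature with OR; quote multi-word phrases."""
    quoted = [f'"{t}"' if " " in t else t for t in terms]
    return "(" + " OR ".join(quoted) + ")"


def all_of(groups: list[list[str]]) -> str:
    """Require every feature: join the synonym groups with AND."""
    return " AND ".join(any_of(g) for g in groups)


# Hypothetical synonym lists in English, German and French.
features = [
    ["bicycle", "bike", "Fahrrad", "bicyclette"],
    ["brake", "Bremse", "frein"],
]

query = all_of(features)
print(query)
# (bicycle OR bike OR Fahrrad OR bicyclette) AND (brake OR Bremse OR frein)
```

Every synonym added to a group can only enlarge the result set, which is why high recall with full-text search alone tends to come with low precision.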
2.5.2 Search with classification
Patent examiners have been searching for more than a century, for most of that time without computers. To enable searching in paper documentation, patent classifications were developed. The goal of these classifications is to form sets of documents (each with a specific classification code) that come close to the 100% recall needed for search reports on patent applications and at the same time have a high precision. Classification is still very useful and is used extensively in searches today.
The classification systems are thus tuned to the searches performed at patent offices. To keep them tuned, the same people who search (the examiners) also build and maintain the classification systems and classify the documents.
Many patent offices have developed their own classification systems. Some standardisation has occurred since the International Patent Classification (IPC) was created. One of the most widely used classifications next to the IPC is the Cooperative Patent Classification (CPC). The characteristics of the different systems are described in Classification.
Using the correct classification code for your search enables both a high recall and a high precision. It is therefore a preferred method for selecting a workable set. However, attention must be paid to finding the correct classification code, because selecting the wrong code generally results in a recall of zero.
Finding the correct code requires knowledge of the technical field, of the classification codes and of how they are used for classification. General information about classification is given in Classification, which is recommended reading.
With classification, selecting a set thus comes down to selecting the correct classification code.
2.5.3 Specialised databases for specific technical fields
For certain technical fields, full-text or classification search is not sufficient. Specialised search engines and databases exist for these fields.
In the field of chemistry, for example, there are search engines and databases in which a chemical structure can be searched.
In the Chemical Abstracts database of CAS, for example, sequences can be searched.
2.5.4 Queries in search engines
Most of the available search engines can combine different queries in a Boolean way. A Boolean combination means that a logical AND or OR can be used to build a query. With an AND the result is limited to the set of documents that match both queries. With an OR several alternatives or synonyms can be combined.
Thus a set can also be selected by, for example, combining a classification code and words. This increases the precision compared with a pure classification search.
See the manuals and help of the search engines for the possibilities (see also IP databases).
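The effect of AND and OR on the result can be pictured with sets: OR corresponds to a union, AND to an intersection. The sketch below uses invented result sets for three hypothetical queries. The document identifiers and variable names are assumptions for illustration only.

```python
# Hypothetical result sets returned by three separate queries.
by_classification = {"D1", "D2", "D3", "D4", "D5", "D6"}  # one classification code
by_word_brake = {"D2", "D3", "D9"}                        # text query "brake"
by_word_bremse = {"D4", "D8"}                             # text query "Bremse"

# OR: a document matching either synonym is kept (set union).
by_words = by_word_brake | by_word_bremse

# AND: a document must match both the code and the words (set intersection).
combined = by_classification & by_words

print(sorted(by_words))   # ['D2', 'D3', 'D4', 'D8', 'D9']
print(sorted(combined))   # ['D2', 'D3', 'D4']
```

The combined set is never larger than the classification set alone. Whether that raises the precision without losing recall depends on how well the words cover the relevant documents.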
2.5.5 Ranking
Some search engines apply a Google-style ranking to the results, whereby the highest-ranking results are displayed first. This is useful if you quickly want some results, but not for a complete result, because all documents still have to be viewed.
More advanced ranking, such as faceted search (see Wikipedia), is not available at this moment. Something similar can be achieved by repeating Boolean queries, but this is not very user-friendly.
2.5.6 Cited and citing documents
The information from the search reports of patent applications is largely available in the databases of the various search engines. This information is a valuable source of relevant documents:
The search reports of patent applications contain relevant documents for the inventions in those applications.
You should therefore include the documents cited by already found relevant documents in your selection and check them in particular for relevance.
Several search engines also provide the documents that cite a particular document. You should include these documents citing a relevant document in your selection for the same reason that the cited documents are included.
Apart from being relevant themselves, these cited and citing documents can also give hints for new selections: previously unknown relevant classification codes or words can be derived from them.
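Collecting cited and citing documents amounts to following citation links in both directions. The sketch below assumes that the citation data has already been obtained from a search engine and is held in a simple dictionary. The data, the function names and the document identifiers are hypothetical.

```python
# Hypothetical citation data: each key cites the documents in its value.
cites = {
    "D1": {"D5", "D6"},
    "D2": {"D6"},
    "D7": {"D1"},
    "D8": {"D2", "D9"},
}


def cited_documents(doc_id: str) -> set[str]:
    """Documents that the given document cites."""
    return set(cites.get(doc_id, set()))


def citing_documents(doc_id: str) -> set[str]:
    """Documents that cite the given document."""
    return {source for source, targets in cites.items() if doc_id in targets}


def citation_candidates(relevant: set[str]) -> set[str]:
    """Cited and citing documents of the relevant ones, minus those already known."""
    candidates: set[str] = set()
    for doc_id in relevant:
        candidates |= cited_documents(doc_id) | citing_documents(doc_id)
    return candidates - relevant


relevant = {"D1", "D2"}
print(sorted(citation_candidates(relevant)))  # ['D5', 'D6', 'D7', 'D8']
```

The candidates still have to be checked for relevance in the same way as any other selected document.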
2.6 When to stop?
The question of when to stop your search can be answered once the following question has been answered:
Did you select all relevant documents?
This question can only be answered with some uncertainty. In practice it is not possible to check that you have selected all relevant documents (recall = 100%). Instead, some indications can be used to answer the question:
- If you have enough knowledge of the classification and technology of a technical field (similar to a patent examiner), you are able to select one or a limited number of classification codes with a high certainty that most relevant documents will be selected. Selecting a set with these classification codes and working through it is then sufficient, and the search can be stopped afterwards. This is the simplest method.
- The set you have selected should return a significant number of relevant or closely related documents. If it does not, you have probably selected a set without relevant documents (recall is low).
- If looking at the cited and citing documents of the really relevant documents does not yield new relevant documents or technical fields, this is an indication that most relevant documents have been found.
If you conclude that your search is not yet complete enough, a new selection has to be made. The information from the previous results can be used for this, for example new classification codes found on relevant documents.
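The indication based on cited and citing documents can be read as a loop: keep checking the citation neighbours of the relevant documents until a round yields nothing new. The sketch below is one possible way to express this, not a rule from this manual. The link data, the function names and the `is_relevant` judgement are hypothetical, and an empty round is only an indication that most relevant documents have been found, not a proof.

```python
from typing import Callable

# Hypothetical citation links (cited and citing combined) per document.
links = {
    "D1": {"D5", "D7"},
    "D5": {"D1", "D9"},
    "D7": {"D1"},
    "D9": {"D5"},
}

# Stand-in for the judgement made while reading each document.
truly_relevant = {"D1", "D5", "D9"}


def search_until_stable(seed: set[str],
                        neighbours: Callable[[str], set[str]],
                        is_relevant: Callable[[str], bool],
                        max_rounds: int = 10) -> tuple[set[str], int]:
    """Expand via citation links until a round finds no new relevant document."""
    relevant = set(seed)
    checked = set(seed)
    for round_no in range(1, max_rounds + 1):
        to_check: set[str] = set()
        for doc_id in relevant:
            to_check |= neighbours(doc_id)
        to_check -= checked                      # skip documents already viewed
        new_relevant = {d for d in to_check if is_relevant(d)}
        checked |= to_check
        if not new_relevant:                     # nothing new: indication to stop
            return relevant, round_no
        relevant |= new_relevant
    return relevant, max_rounds


found, rounds = search_until_stable(
    {"D1"},
    neighbours=lambda d: links.get(d, set()),
    is_relevant=lambda d: d in truly_relevant,
)
print(sorted(found), rounds)  # ['D1', 'D5', 'D9'] 3
```

If the loop stops with few relevant documents, the starting set was probably poor and a new selection is needed, for example with classification codes found on the relevant documents.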