Testing search with version labels has uncovered a potential problem with the way Alfresco tokenizes version labels in the index.
We have two distinct documents in a repository. One is at version 1.1, and the other at version 11.0. If we search for files with a version label of 11, using a query like this:
We get both the version 1.1 and version 11 documents returned in our search results, despite the fact that only one of the two documents is actually at version 11. If we add the .0 to the query:
Then we get the expected result: only files at version 11.0 are shown.
Looking into the content model, no indexing behavior is defined for the cm:versionLabel property, so it falls back to the default, which is to tokenize the field for indexing. It looks like the tokenization is creating two entries for the versionLabel property in the index: one with the "." character, and one without. This is confirmed by inspecting the index with Luke (screenshot attached).
A deeper inspection of Alfresco revealed that we use a custom Lucene TokenFilter (AlfrescoStandardFilter) that might be the source of the problem. This class is called by the AlfrescoStandardAnalyser, and appears to detect acronyms in dotted format (C.M.I.S, for example). When one is detected, a new token is created for the acronym, stripped of its "." characters (CMIS, in our example case). I think this is happening to the version labels. So a version label of 11.0 is ending up in the index as both 11.0 and 110, and a version label of 1.1 is ending up in the index as both 1.1 and 11. Thus, a search for 11 hits on both documents, incorrectly.
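To make the suspected collision concrete, here is a rough sketch of the behavior described above. It is not the AlfrescoStandardFilter code: the function names are hypothetical, and it assumes the filter treats any dotted token as an acronym and emits both the original and the dot-stripped form. It models only which terms reach the index, not how the query is matched.

```python
def tokenized_terms(label: str) -> list[str]:
    """Hypothetical stand-in for the suspected filter behavior.

    Assumption: a dotted token is kept as-is and a second term is
    emitted with the "." characters removed.
    """
    terms = [label]
    if "." in label:
        terms.append(label.replace(".", ""))
    return terms


def untokenized_terms(label: str) -> list[str]:
    """With tokenization off, the whole label is the only term."""
    return [label]


if __name__ == "__main__":
    for label in ("1.1", "11.0"):
        print(label, "tokenized:", tokenized_terms(label),
              "untokenized:", untokenized_terms(label))
```

Under these assumptions, the term 11 is produced by the document at version 1.1, which matches what Luke shows. Without tokenization, each label contributes only itself.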
I think we can fix this by altering the cm:versionLabel's index behavior so that it is not tokenized.
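As a sketch of what that change might look like, assuming the standard data dictionary index element is used on the property definition. The surrounding property definition and the atomic and stored values are illustrative assumptions, not copied from the shipped content model; only the tokenised setting is the proposed change.

```xml
<property name="cm:versionLabel">
   <type>d:text</type>
   <!-- Assumed index block: index the label as a single untokenized term -->
   <index enabled="true">
      <atomic>true</atomic>
      <stored>false</stored>
      <tokenised>false</tokenised>
   </index>
</property>
```

Documents that are already indexed would presumably keep their old terms until they are reindexed.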