Status: New
Affects Version/s: Community Edition 201605 GA, Community Edition 201612 GA
Fix Version/s: None
Component/s: Search and Indexing (non-UI)
Security Level: external (External user)
Environment: Oracle JDK 1.8.0_112, Tomcat 7.0.47, MariaDB 5.5.50 with utf8mb4, PostgreSQL 9.5 on Windows 10
Alfresco, as a heavily internationalized product, is considered to fully support Unicode (provided the backing database and servlet container are properly configured). It is possible to create a folder in Share using a 4-byte Unicode character / emoji such as the "pile of poo" character (note: JIRA does not allow inclusion of the character here) as the name.
When searching for the folder by name, FTS queries that perform a term query fail to be parsed while FTS queries using a phrase query succeed.
Steps to reproduce:
- Ensure the DB / servlet container is set up to fully support Unicode (note: the MySQL/MariaDB utf8 charset only supports 3-byte UTF-8 sequences, so utf8mb4 is required - see
- Create a folder via the Share UI (e.g. in My Files) with the "pile of poo" emoji as its name
- Open Node Browser via Admin Tools
- Perform a FTS query for =cm:name:pileOfPooEmoji
- Perform a FTS query for =cm:name:"pileOfPooEmoji"
Expectation: Both FTS queries succeed and show the folder
Observation: Only the phrase query succeeds - the term query reports "no viable alternative at character"
Assumption / analysis: The FTSLexer does not correctly handle supplementary characters when looking for tokens. Instead of handling Unicode code points, it may only be handling individual Java chars (UTF-16 code units) without checking for surrogate pairs (high/low surrogates).
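To illustrate the suspected cause, here is a minimal Java sketch of the difference between scanning a string char by char and scanning it by code point. It is not taken from the FTSLexer source; the class and method names (SurrogatePairDemo, scanByChar, scanByCodePoint) are hypothetical and only use standard JDK calls. The emoji is built from its code point U+1F4A9 so the snippet stays ASCII-only.

```java
import java.util.ArrayList;
import java.util.List;

class SurrogatePairDemo {
    // U+1F4A9 (PILE OF POO), built from its code point to keep the source ASCII-only.
    static final String EMOJI = new String(Character.toChars(0x1F4A9));

    // Naive scan: treats every UTF-16 code unit (Java char) as one character.
    // A supplementary character shows up as two separate surrogate values.
    static List<Integer> scanByChar(String s) {
        List<Integer> out = new ArrayList<>();
        for (int i = 0; i < s.length(); i++) {
            out.add((int) s.charAt(i));
        }
        return out;
    }

    // Code-point-aware scan: a high/low surrogate pair is read as one code point.
    static List<Integer> scanByCodePoint(String s) {
        List<Integer> out = new ArrayList<>();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            out.add(cp);
            i += Character.charCount(cp); // advances by 2 for supplementary characters
        }
        return out;
    }

    public static void main(String[] args) {
        String term = "=cm:name:" + EMOJI;
        System.out.println("UTF-16 code units: " + term.length());
        System.out.println("code points: " + term.codePointCount(0, term.length()));
        for (int unit : scanByChar(EMOJI)) {
            System.out.printf("char U+%04X surrogate=%b%n", unit, Character.isSurrogate((char) unit));
        }
        for (int cp : scanByCodePoint(EMOJI)) {
            System.out.printf("code point U+%X%n", cp);
        }
    }
}
```

If the lexer's token rules only cover ranges within the Basic Multilingual Plane and the input is consumed one char at a time, the two surrogate values would not match any term character, which would be consistent with the "no viable alternative at character" error. This remains an assumption until checked against the actual grammar.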
I only found a mention of a SOLR-based limitation to ~32.700 UTF-8 code points in SEARCH-87. In my case I am not running into SOLR limitations, since I am using a transactionally executed query and the error occurs at the FTS parsing stage.