  1. Alfresco
  2. ALF-21846

Cannot use 4-byte unicode characters in FTS as term query

    Details

    • Type: Bug
    • Status: New
    • Priority: Unprioritized
    • Resolution: Unresolved
    • Affects Version/s: Community Edition 201605 GA, Community Edition 201612 GA
    • Fix Version/s: None
    • Security Level: external (External user)
    • Labels:
      None
    • Environment:
      Oracle JDK 1.8.0_112, Tomcat 7.0.47, MariaDB 5.5.50 with utf8mb4, PostgreSQL 9.5 on Windows 10
    • Security Severity:
      None
    • Triage:
      ACE

      Description

      Alfresco, as a heavily internationalized product, is expected to fully support Unicode (provided the backing database and servlet container are properly configured). It is possible to create a folder in Share whose name is a 4-byte Unicode character / emoji such as the "pile of poo" character, U+1F4A9 (note: JIRA does not allow inclusion of the character here).

      When searching for the folder by name, FTS queries that perform a term query fail to be parsed while FTS queries using a phrase query succeed.

      Steps to reproduce:

      1. Ensure DB / servlet container is set up to fully support Unicode (note: the MySQL/MariaDB utf8 character set only supports characters of up to 3 bytes, so utf8mb4 is required - see ACE-773)
      2. Create a folder via Share UI (e.g. in My Files) with name as the "pile of poo" emoji
      3. Open Node Browser via Admin Tools
      4. Perform a FTS query for =cm:name:pileOfPooEmoji
      5. Perform a FTS query for =cm:name:"pileOfPooEmoji"

      Expectation: Both FTS queries succeed and show the folder
      Observation: Only the phrase query succeeds - the term query reports "no viable alternative at character"
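
Because the character cannot be pasted here, the two query strings from steps 4 and 5 can be built programmatically. The following is a minimal sketch; the class and method names are hypothetical and only plain JDK string handling is used, so it does not call any Alfresco API.

```java
final class FtsEmojiQueries {
    // U+1F4A9 PILE OF POO, written as its UTF-16 surrogate pair so the source stays ASCII
    static final String PILE_OF_POO = "\uD83D\uDCA9";

    // Term query form used in step 4 (the one that fails to parse)
    static String termQuery(String name) {
        return "=cm:name:" + name;
    }

    // Phrase query form used in step 5 (the one that succeeds)
    static String phraseQuery(String name) {
        return "=cm:name:\"" + name + "\"";
    }

    public static void main(String[] args) {
        // Paste the printed strings into the Node Browser with FTS selected
        System.out.println(termQuery(PILE_OF_POO));
        System.out.println(phraseQuery(PILE_OF_POO));
    }
}
```

The only difference between the two strings is the pair of double quotes around the name, which is what switches the parser from the term path to the phrase path.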

      Assumption / analysis: The FTSLexer does not correctly handle such characters when looking for tokens. Instead of handling Unicode code points, it may only be handling individual Java chars (UTF-16 code units) without checking for surrogate pairs (high/low surrogates).
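
To illustrate the suspected difference, here is a minimal sketch contrasting a per-char scan with a code-point-aware scan. It is not the FTSLexer code; both methods are hypothetical stand-ins for "does this term consist only of acceptable characters", assuming whitespace and unpaired surrogates are the only rejected input.

```java
final class SurrogateAwareScan {
    // Naive scan: inspects one UTF-16 code unit at a time. A surrogate half is
    // not a valid character on its own, so any supplementary character is rejected.
    static boolean acceptsPerChar(String term) {
        for (int i = 0; i < term.length(); i++) {
            char c = term.charAt(i);
            if (Character.isSurrogate(c) || Character.isWhitespace(c)) {
                return false;
            }
        }
        return !term.isEmpty();
    }

    // Code-point-aware scan: a valid high/low surrogate pair is combined into
    // a single code point before it is checked.
    static boolean acceptsPerCodePoint(String term) {
        int i = 0;
        while (i < term.length()) {
            int cp = term.codePointAt(i);
            // codePointAt returns the bare surrogate value when it is unpaired
            if (cp >= Character.MIN_SURROGATE && cp <= Character.MAX_SURROGATE) {
                return false;
            }
            if (Character.isWhitespace(cp)) {
                return false;
            }
            i += Character.charCount(cp); // advance by 1 or 2 code units
        }
        return !term.isEmpty();
    }
}
```

With the "pile of poo" emoji as input, the string has a length of 2 chars but only 1 code point, so the first method rejects it while the second accepts it.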

      I only found a mention of a SOLR-based limitation of ~32,700 UTF-8 code points in SEARCH-87. In my case I am not running into SOLR limitations, since I am using a transactionally executed query and the error occurs at the FTS parsing stage.

            Activity

            ahind Andrew Hind added a comment -

            We should parse valid unicode characters but exclude punctuation in general.
            I think this falls into a bucket with a bunch of other things that can be addressed with ANTLR 4 or by fixing special language tokens to require a leading space ....
            I will take a look at the allowed characters in the Lexer and revisit the space and special character issue.

            ahind Andrew Hind added a comment -

            Allowing the high and low surrogate ranges through the lexer is insufficient to resolve this issue.
            I am assuming the test data in UTF-8 is loaded correctly.

            ahind Andrew Hind added a comment -

            I am not yet convinced this cannot be fixed up with ANTLR 3....
            I have linked it anyway


              People

              • Assignee:
                searchAndDiscovery Search and Discovery
                Reporter:
                afaust Axel Faust
              • Votes:
                0
                Watchers:
                3
