Service Packs and Hot Fixes / MNT-18383

Sharding restricts the number of results / "allowed" page sizes

    Details

    • Type: Improvement
    • Status: Open
    • Priority: Critical
    • Resolution: Unresolved
    • Affects Version/s: 5.1
    • Fix Version/s: 6.0.N
    • Labels:
      None
    • Environment:
      Tomcat 7.0.47, Oracle JDK 1.8.0_112, Win 10, 4 Cores, 32 GiB RAM
    • ACT Numbers:

      Community, 00900074 Premier, 00931276

    • Premier Customer:
      Yes

      Description

      When using a SOLR 4 setup with sharding, the number of items that can be requested from a result set is capped by the HTTP header size limit of the SOLR connector (see the investigation below), whereas SOLR 4 without sharding has no such restriction.

      Steps to reproduce:
      1) Set up Alfresco and SOLR 4 in a sharded scenario, using a maxHttpHeaderSize of 32768 for the SOLR HTTP connector (no maxHttpHeaderSize is documented for the SOLR setup, but http://docs.alfresco.com/5.1/tasks/configfiles-change-path.html mentions the value 32768)
      2) Create / bootstrap at least 1000 documents in the Repository
      3) Ensure all documents have been indexed
      4) Execute the following query and request 1000 result items (e.g. via the Repository Script API)

      TYPE:"cm:content" AND PROPERTIES:"{http://www.alfresco.org/model/content/1.0}name"
      

      Expectation: Query completes and 1000 results are returned
      Observation:

      • Query fails with HTTP 500 status from SOLR
      • solr.log contains an exception about a parse error while handling the content type of the shard responses:
        Caused by: org.apache.http.ParseException: Invalid content type:
        at org.apache.http.entity.ContentType.parse(ContentType.java:233)
        at org.apache.solr.client.solrj.impl.HttpSolrServer.executeMethod(HttpSolrServer.java:496)
        at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:210)
        at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:206)
        at org.apache.solr.handler.component.AlfrescoHttpShardHandler$1.call(AlfrescoHttpShardHandler.java:182)
        at org.apache.solr.handler.component.AlfrescoHttpShardHandler$1.call(AlfrescoHttpShardHandler.java:137)

      Result of personal investigation:

      The SOLR QueryComponent.createRetrieveDocs() method includes the IDs of all documents found to match the query as a URL parameter in the request that retrieves these documents from the shards. Since the Alfresco SOLR document ID is based on tenant + ACL ID + DB ID, each ID is 43 bytes long. Due to URL encoding and concatenation with a URL-encoded separator, each ID effectively takes up 50 bytes.
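The 43 → 50 byte figure can be reproduced with a small sketch. The ID shape used here (a tenant marker, then a 16-hex-digit ACL ID and a 16-hex-digit DB ID, joined by "!") and the "," separator are assumptions chosen for illustration; the report only states that the ID is built from tenant + ACL ID + DB ID.

```python
from urllib.parse import quote

# Hypothetical ID shape (an assumption, not taken from the report):
# tenant marker + "!" + 16-hex-digit ACL ID + "!" + 16-hex-digit DB ID.
doc_id = "_DEFAULT_!" + format(11, "016x") + "!" + format(26, "016x")

# Assumed separator between IDs inside the URL parameter.
SEPARATOR = ","

# Size of the raw ID in bytes.
raw_len = len(doc_id.encode("utf-8"))

# Size after percent-encoding every reserved character ("!" becomes "%21").
encoded_len = len(quote(doc_id, safe=""))

# Size of the percent-encoded separator ("," becomes "%2C").
separator_len = len(quote(SEPARATOR, safe=""))

print(raw_len, encoded_len + separator_len)  # prints "43 50"
```

Under these assumptions, the two "!" characters add 4 bytes and the encoded separator adds 3, which accounts for the 7-byte difference.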

      A maxHttpHeaderSize of 32768 effectively limits the number of search results to ~600-650 per request (631 in my test). The default value of 8192 (according to https://tomcat.apache.org/tomcat-7.0-doc/config/http.html) only allows for ~135-140 results.
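As a rough cross-check of these numbers, the following sketch assumes a simple linear model: a fixed per-request overhead (request line, other parameters, headers) plus 50 bytes per ID. The overhead is not measured in the report; it is inferred here from the single observation of 631 results at 32768 bytes, so the output is an estimate only.

```python
BYTES_PER_ID = 50        # effective size per ID, as stated in the report
OBSERVED_HEADER = 32768  # maxHttpHeaderSize used in the test
OBSERVED_RESULTS = 631   # number of results that fit in the test

# Assumption: everything not consumed by IDs is fixed overhead.
overhead = OBSERVED_HEADER - OBSERVED_RESULTS * BYTES_PER_ID


def max_results(max_http_header_size, bytes_per_id=BYTES_PER_ID,
                fixed_overhead=overhead):
    """Estimate how many IDs fit into a header of the given size."""
    return max(0, (max_http_header_size - fixed_overhead) // bytes_per_id)


print(overhead)            # 1218
print(max_results(32768))  # 631
print(max_results(8192))   # 139
```

The estimate of 139 for the Tomcat default of 8192 falls within the ~135-140 range given above.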

      In a multi-tenant use case, the name of the tenant is free-form text provided by the administrator during tenant setup. This means that SOLR document IDs may be even longer and allow for even fewer results. Using "techforall.org" (a showcase customer - see https://www.alfresco.com/customers/teach-all) as the tenant name results in a 14% reduction in the number of results per request.

      Ideally, the SOLR document IDs would be transported in the body of a POST request instead of the URL, which would avoid any restriction on the number of results.
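The idea can be illustrated with a minimal sketch that builds (but does not send) such a request. The shard URL, the "ids" parameter name, the "," separator and the ID shape are all assumptions made for illustration; this is not the actual SOLR or Alfresco implementation.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical shard endpoint (an assumption, not taken from the report).
SHARD_URL = "http://localhost:8080/solr4/alfresco/select"


def build_retrieve_request(doc_ids):
    """Build a POST request that carries the document IDs in the body."""
    # "ids" as the parameter name and "," as the separator are assumptions.
    body = urlencode({"ids": ",".join(doc_ids)}).encode("ascii")
    return Request(
        SHARD_URL,
        data=body,
        method="POST",
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )


# 1000 hypothetical IDs in the same assumed shape as a 43-byte default-tenant ID.
ids = ["_DEFAULT_!%016x!%016x" % (11, n) for n in range(1000)]
request = build_retrieve_request(ids)

print(request.get_method())   # POST
print(len(request.full_url))  # URL length does not depend on the number of IDs
print(len(request.data))      # the IDs travel in the body instead
```

With the IDs in the body, the request line stays the same size regardless of how many results are requested, so maxHttpHeaderSize no longer bounds the page size.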

      People

      • Assignee: Search and Discovery (searchAndDiscovery)
      • Reporter: Axel Faust (afaust)
      • Votes: 1
      • Watchers: 10
