MNT-18383 (Service Packs and Hot Fixes)

Sharding restricts the number of results / "allowed" page sizes

    Details

    • Type: Service Pack Request
    • Status: Open
    • Priority: Critical
    • Resolution: Unresolved
    • Affects Version/s: 5.1
    • Fix Version/s: 5.1.N
    • Labels:
      None
    • Environment:
      Tomcat 7.0.47, Oracle JDK 1.8.0_112, Win 10, 4 Cores, 32 GiB RAM
    • ACT Numbers:

      Community, 00900074 Premier

      Description

      When using a SOLR 4 setup with sharding, the number of items that can be requested from a query result is limited by the maximum HTTP header size of the SOLR connector, whereas SOLR 4 without sharding has no such restriction.

      Steps to reproduce:
      1) Set up Alfresco and SOLR 4 in a sharded scenario, using a maxHttpHeaderSize of 32768 for the SOLR HTTP connector (no maxHttpHeaderSize is documented for the SOLR setup, but http://docs.alfresco.com/5.1/tasks/configfiles-change-path.html mentions the value of 32768)
      2) Create / bootstrap at least 1000 documents in the Repository
      3) Ensure all documents have been indexed
      4) Execute the following query and request 1000 result items (e.g. via the Repository Script API)

      TYPE:"cm:content" AND PROPERTIES:"{http://www.alfresco.org/model/content/1.0}name"
      

      Expectation: Query completes and 1000 results are returned
      Observation:

      • Query fails with HTTP 500 status from SOLR
      • The solr.log file includes an exception about a parse error while handling the content type of shard responses:
        Caused by: org.apache.http.ParseException: Invalid content type:
        at org.apache.http.entity.ContentType.parse(ContentType.java:233)
        at org.apache.solr.client.solrj.impl.HttpSolrServer.executeMethod(HttpSolrServer.java:496)
        at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:210)
        at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:206)
        at org.apache.solr.handler.component.AlfrescoHttpShardHandler$1.call(AlfrescoHttpShardHandler.java:182)
        at org.apache.solr.handler.component.AlfrescoHttpShardHandler$1.call(AlfrescoHttpShardHandler.java:137)

      Result of personal investigation:

      The SOLR QueryComponent.createRetrieveDocs() method includes all document IDs found to match the query as a URL parameter in the request that retrieves these documents from the shards. Since the Alfresco SOLR document ID is based on tenant + ACL ID + DB ID, each ID instance is 43 bytes long. Due to URL encoding and concatenation with a URL-encoded separator, each ID effectively takes up 50 bytes.
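
      The 43- and 50-byte figures can be reproduced with a short sketch. The ID layout used here (tenant, ACL ID and DB ID joined by "!", with each numeric ID rendered as 16 hex characters) and the "," separator are assumptions chosen to match the figures above, not something this report confirms; the sample ID values are made up.

```python
from urllib.parse import quote

# Assumed layout: <tenant>!<ACL ID>!<DB ID>, numeric parts as 16 hex chars.
# "_DEFAULT_" as the tenant marker and the sample IDs are illustrative only.
def doc_id(tenant: str, acl_id: int, db_id: int) -> str:
    return f"{tenant}!{acl_id:016x}!{db_id:016x}"

def url_cost(doc: str, separator: str = ",") -> int:
    """Bytes one ID occupies in a URL, including one URL-encoded separator."""
    return len(quote(doc, safe="")) + len(quote(separator, safe=""))

sample = doc_id("_DEFAULT_", 11, 291)
print(len(sample), url_cost(sample))  # prints: 43 50
```

      Each "!" expands to "%21" and the separator to "%2C", which accounts for the 7 extra bytes per ID under these assumptions.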

      A maxHttpHeaderSize of 32768 effectively limits the number of search results to ~600-650 per request (631 in my test). The default value of 8192 (according to https://tomcat.apache.org/tomcat-7.0-doc/config/http.html) only allows for ~135-140 results.
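
      As a rough cross-check of these numbers, the sketch below divides the header budget by the 50 bytes per ID. The fixed overhead of 1218 bytes (request line, remaining parameters and headers) is an assumption back-calculated from the 631 results observed at 32768; it was not measured separately.

```python
BYTES_PER_ID = 50          # per-ID URL cost from the analysis above
ASSUMED_OVERHEAD = 1218    # assumption: 32768 - 631 * 50, not a measured value

def max_results(max_http_header_size: int,
                bytes_per_id: int = BYTES_PER_ID,
                overhead: int = ASSUMED_OVERHEAD) -> int:
    """Estimate how many document IDs fit into one shard request."""
    return max(0, (max_http_header_size - overhead) // bytes_per_id)

for size in (8192, 32768):
    print(size, max_results(size))
```

      Under these assumptions the Tomcat default of 8192 yields 139, which is consistent with the ~135-140 range quoted above.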

      In a multi-tenant use case, the name of the tenant is free-form text provided by the administrator during tenant setup. This means that SOLR document IDs may be even longer and allow for even fewer results. Using "techforall.org" (a showcase customer - see https://www.alfresco.com/customers/teach-all) as the tenant name results in a 14% reduction.

      Ideally, the SOLR document IDs should be transported via the POST body of the request instead of the URL, which would avoid any restriction on the number of results.

            Activity

            Joel Bernstein (jbernstein) added a comment:

            A POST is not used because Alfresco also sends a JSON payload containing the ACL authorities to the shards. The JSON payload is sent as an attached content stream, and a POST body is also sent as a content stream, but the SolrJ client supports only one content stream per request. The POST body and the JSON content stream therefore cannot currently be part of the same request, so a GET is used instead.

            One of the things we can investigate is whether the JSON payload can be removed during the second phase of Solr's distributed search, when the list of docids is sent.

            Andrew Hind (ahind) added a comment (edited):

            This is a non-obvious side effect of us adding a content stream to the shard requests.

            It is still quite likely that faceting and other options can require larger URLs.
            If you fix up the doc fetch stage, you also have to fix up facet refinement, which needs all the query-time information again.

            The best solution would be to map our JSON request data to SOLR parameters, as the SOLR JSON request support now does. This is a pretty big change, one for next time round. With this change SOLRJ would map the parameters to JSON rather than the URL.


              People

              • Assignee: Search and Discovery (searchAndDiscovery)
              • Reporter: Axel Faust (afaust)
              • Votes: 1
              • Watchers: 9