Alfresco / ALF-22091

Potential startup deadlock with ModuleComponent + nested transactions (BatchProcessor)

    Details

    • Type: Bug
    • Status: New
    • Priority: Unprioritized
    • Resolution: Unresolved
    • Affects Version/s: Community Edition 201806 GA
    • Fix Version/s: None
    • Component/s: Repository
    • Security Level: external (External user)
    • Labels:
    • Environment:
      Docker Images:
      - alfresco/alfresco-content-repository-community:6.1.2-ga
      - postgres:11.1
    • Triage:
      To Do

      Description

      In a very specific combination of circumstances, a database deadlock can occur during startup when executing module components, provided all of the following conditions hold:

      • the IP of the server / container in which ACS Repository runs has not yet been recorded in the alf_server table
      • no other modifying operation (e.g. a schema upgrade) is performed during startup before the module component(s) are executed
      • a module component of the first module with components to execute makes (implicit) use of nested transactions, e.g. via BatchProcessor to efficiently process large numbers of nodes (see the sketch after this list)
      • logic in the nested transaction requires the alf_server entry for the IP of the server / container to be present, and (implicitly) creates it if it does not exist
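
      For illustration, a minimal sketch of such a batch-processing module component is shown below. The BatchProcessor, BatchProcessWorkProvider and AbstractModuleComponent APIs are the standard Alfresco ones; the class name, the injected list of nodes and the property update are hypothetical placeholders, not the attached source files.

      import java.util.Collection;
      import java.util.Collections;
      import java.util.List;

      import org.alfresco.model.ContentModel;
      import org.alfresco.repo.batch.BatchProcessWorkProvider;
      import org.alfresco.repo.batch.BatchProcessor;
      import org.alfresco.repo.batch.BatchProcessor.BatchProcessWorkerAdaptor;
      import org.alfresco.repo.module.AbstractModuleComponent;
      import org.alfresco.service.cmr.repository.NodeRef;
      import org.alfresco.service.cmr.repository.NodeService;
      import org.alfresco.service.transaction.TransactionService;

      // Hypothetical module component: touches a list of nodes via BatchProcessor,
      // which opens nested (requiresNew) transactions while the outer transaction
      // of ModuleComponentHelper.startModules() is still open.
      public class TouchNodesModuleComponent extends AbstractModuleComponent
      {
          private TransactionService transactionService;
          private NodeService nodeService;
          private List<NodeRef> nodesToTouch = Collections.emptyList(); // assumed to be injected / looked up elsewhere

          public void setTransactionService(TransactionService transactionService) { this.transactionService = transactionService; }
          public void setNodeService(NodeService nodeService) { this.nodeService = nodeService; }
          public void setNodesToTouch(List<NodeRef> nodesToTouch) { this.nodesToTouch = nodesToTouch; }

          @Override
          protected void executeInternal() throws Throwable
          {
              // hand out all work items once, then report "no more work"
              BatchProcessWorkProvider<NodeRef> workProvider = new BatchProcessWorkProvider<NodeRef>()
              {
                  private boolean done;

                  @Override
                  public int getTotalEstimatedWorkSize() { return nodesToTouch.size(); }

                  @Override
                  public Collection<NodeRef> getNextWork()
                  {
                      if (done) { return Collections.emptyList(); }
                      done = true;
                      return nodesToTouch;
                  }
              };

              BatchProcessor<NodeRef> processor = new BatchProcessor<>("touch-nodes",
                      transactionService.getRetryingTransactionHelper(), workProvider, 2, 20, null, null, 100);

              // splitTxns = true: each batch runs in its own nested transaction
              // while the module startup transaction is still uncommitted
              processor.process(new BatchProcessWorkerAdaptor<NodeRef>()
              {
                  @Override
                  public void process(NodeRef nodeRef) throws Throwable
                  {
                      // any node modification requires an alf_server entry for the current IP
                      nodeService.setProperty(nodeRef, ContentModel.PROP_DESCRIPTION, "touched by module component");
                  }
              }, true);
          }
      }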

      Such a scenario was extremely unlikely in the past with "legacy" server deployments, but is a reasonable expectation with Docker-based deployments, as the internal IP of a container may change on every re-instantiation, e.g. after building a new image with added logic (such as module components acting as patches).

      In this scenario, a database deadlock occurs. The outer transaction (initialised by ModuleComponentHelper.startModules()) modifies nodes in the system registry (system://system store) and thereby implicitly creates an alf_server entry for the IP of the server. Before these changes, and thus the alf_server entry, are committed, the module component is executed, potentially opening nested transactions (via BatchProcessor) which, if they also modify nodes, attempt to create the same alf_server entry. This causes at least PostgreSQL to deadlock on the SQL INSERT, and may also affect other database servers.
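
      The underlying pattern can be reduced to the following sketch (illustrative only; it assumes the imports from the sketch above, and nodeA / nodeB are placeholders for arbitrary nodes being modified). The inner doInTransaction(callback, false, true) call corresponds to the nested, requiresNew transactions that BatchProcessor opens per batch when splitTxns is true:

      // reduction of the pattern: an outer write transaction that has already
      // triggered the (uncommitted) INSERT into alf_server, and a nested
      // requiresNew transaction that needs the same row before the outer one commits
      private void blockedServerEntryInsert(TransactionService transactionService,
              NodeService nodeService, NodeRef nodeA, NodeRef nodeB)
      {
          transactionService.getRetryingTransactionHelper().doInTransaction(() -> {
              // outer transaction: first node write for this server IP
              // => uncommitted INSERT INTO alf_server
              nodeService.setProperty(nodeA, ContentModel.PROP_DESCRIPTION, "outer write");

              // nested transaction (readOnly = false, requiresNew = true)
              transactionService.getRetryingTransactionHelper().doInTransaction(() -> {
                  // also needs the alf_server row for the same IP; its INSERT
                  // waits behind the outer, still uncommitted INSERT
                  nodeService.setProperty(nodeB, ContentModel.PROP_DESCRIPTION, "nested write");
                  return null;
              }, false, true);

              return null;
          }, false, false);
      }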

       

      Steps to reproduce:

      • Start unmodified ACS Repository via docker-compose using a custom network and persistent volumes for PostgreSQL / Content Store (docker-compose up -d)
      • Stop ACS Repository, also removing the network (docker-compose down)
      • Rebuild ACS Repository image with an extension adding a trivial batch-processing module component (source files attached)
      • Start ACS Repository again

      Expected behaviour: ACS Repository starts up, executes the module component and is ready for use

      Observed behaviour:

      • ACS Repository startup hangs while executing the module component
      • The PostgreSQL container shows one waiting INSERT and the outer transaction as idle in transaction
      root@postgres:/# ps aux
      USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
      postgres     1  0.0  0.3 298500 31500 ?        Ss   13:36   0:00 postgres -c max_connections=300 -c log_min_messages=LOG
      postgres    25  0.0  0.0 298500  3632 ?        Ss   13:36   0:00 postgres: checkpointer
      postgres    26  0.0  0.0 298644  6216 ?        Ss   13:36   0:00 postgres: background writer
      postgres    27  0.0  0.0 298500  8496 ?        Ss   13:36   0:00 postgres: walwriter
      postgres    28  0.0  0.0 299112  6260 ?        Ss   13:36   0:00 postgres: autovacuum launcher
      postgres    29  0.0  0.0 143716  3868 ?        Ss   13:36   0:00 postgres: stats collector
      postgres    30  0.0  0.0 298920  6272 ?        Ss   13:36   0:00 postgres: logical replication launcher
      postgres    32  0.0  0.1 299484 11356 ?        Ss   13:37   0:00 postgres: alfresco alfresco 172.19.0.3(33304) idle
      postgres    33  0.0  0.1 299484 12032 ?        Ss   13:37   0:00 postgres: alfresco alfresco 172.19.0.3(33306) idle
      postgres    34  0.0  0.1 299484 12032 ?        Ss   13:37   0:00 postgres: alfresco alfresco 172.19.0.3(33308) idle
      postgres    35  0.0  0.1 299484 11972 ?        Ss   13:37   0:00 postgres: alfresco alfresco 172.19.0.3(33310) idle
      postgres    36  0.0  0.1 299484 12032 ?        Ss   13:37   0:00 postgres: alfresco alfresco 172.19.0.3(33312) idle
      postgres    37  0.0  0.1 299484 12032 ?        Ss   13:37   0:00 postgres: alfresco alfresco 172.19.0.3(33314) idle
      postgres    38  0.0  0.1 299484 12032 ?        Ss   13:37   0:00 postgres: alfresco alfresco 172.19.0.3(33316) idle
      postgres    39  0.0  0.1 299484 11972 ?        Ss   13:37   0:00 postgres: alfresco alfresco 172.19.0.3(33318) idle
      postgres    40  0.0  0.1 300672 19132 ?        Ss   13:37   0:00 postgres: alfresco alfresco 172.19.0.3(33320) INSERT waiting
      postgres    41  0.1  0.2 303360 22980 ?        Ss   13:37   0:00 postgres: alfresco alfresco 172.19.0.3(33322) idle in transaction
      postgres    43  0.0  0.1 299488 12032 ?        Ss   13:37   0:00 postgres: alfresco alfresco 172.19.0.3(33326) idle
      root        67  0.6  0.0  19864  3676 pts/0    Ss   13:39   0:00 /bin/bash
      root        74  0.0  0.0  38308  3296 pts/0    R+   13:39   0:00 ps aux

      Ideally, the alf_server entry should be proactively created as an isolated step in the Repository startup process (unless the server is in read-only mode). Alternatively, the alf_server table might be removed altogether. As far as I am aware, it only exists as a remnant of the old in-process Lucene indexing component, which required the server to be tracked as a reference on alf_transaction so that other members of an Enterprise cluster could asynchronously track changes; it is no longer used for anything.
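
      A rough sketch of what such an isolated startup step could look like is given below. Everything in it is hypothetical: there is currently no DAO-level hook to create the alf_server entry without modifying a node, so the ServerRegistrationDAO interface and the bean itself are placeholders for whatever the actual fix would introduce, and the bean would additionally need to be wired to bootstrap before module components are started (omitted here).

      import org.alfresco.service.transaction.TransactionService;
      import org.alfresco.util.AbstractLifecycleBean;
      import org.springframework.context.ApplicationEvent;

      /** Hypothetical hook - no such API exists in the current code base. */
      interface ServerRegistrationDAO
      {
          void ensureServerEntryForCurrentIp();
      }

      public class ServerEntryBootstrap extends AbstractLifecycleBean
      {
          private TransactionService transactionService;
          private ServerRegistrationDAO serverRegistrationDAO;

          public void setTransactionService(TransactionService transactionService) { this.transactionService = transactionService; }
          public void setServerRegistrationDAO(ServerRegistrationDAO serverRegistrationDAO) { this.serverRegistrationDAO = serverRegistrationDAO; }

          @Override
          protected void onBootstrap(ApplicationEvent event)
          {
              if (transactionService.isReadOnly())
              {
                  return; // nothing to create on a read-only server
              }
              // short, self-contained transaction committed before any module component
              // runs, so the alf_server row for the current IP already exists afterwards
              transactionService.getRetryingTransactionHelper().doInTransaction(() -> {
                  serverRegistrationDAO.ensureServerEntryForCurrentIp();
                  return null;
              }, false, true);
          }

          @Override
          protected void onShutdown(ApplicationEvent event)
          {
              // no-op
          }
      }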

       

      The same behaviour would also occur in a legacy deployment if the ACS instance were moved to another server, or a new server were added, before starting the Repository again with any added module components. I have not personally verified that the behaviour also affects Kubernetes, as k8s will not be relevant for any of my dozen or so active customers for years to come, but the basic premise is almost certainly applicable in that scenario as well.

    People

    • Assignee: Repository Team (repositoryteam)
    • Reporter: Axel Faust (afaust)
    • Votes: 1
    • Watchers: 4
