Status: Closed
Affects Version/s: Community Edition 201806 GA
Fix Version/s: 6.3
Security Level: external (External user)
Resolution Time Custom Field: 12 weeks, 4 days, 2 hours, 3 minutes, 49 seconds
Under a very specific combination of circumstances, a database deadlock can occur during startup when executing module components, provided all of the following conditions hold:
- the IP of the server / container in which ACS Repository runs has not yet been recorded in the alf_server table
- no other modifying operation is performed as part of the startup prior to the module component(s), so no schema upgrades etc.
- a module component of the first module with components to execute includes the (implicit) use of nested transactions (e.g. via use of BatchProcessor to efficiently process large amounts of nodes)
- logic in the nested transaction requires the presence of the alf_server entry for the IP of the server / container, and (implicitly) creates it if not existing
Such a scenario was extremely unlikely in the past with "legacy" server deployments, but is now a reasonable expectation with Docker-based deployments, as the internal IP of containers will / may change on every re-instantiation of the container, e.g. after building a new image with added logic (e.g. module components as patches).
In this scenario, a database deadlock occurs. The outer transaction (initialised by ModuleComponentHelper.startModules()) modifies nodes in the system registry (system://system store), implicitly creating an alf_server entry for the IP of the server. Before committing these changes, and thus the alf_server entry, the module component is executed, creating potential nested transactions (via BatchProcessor), which - if they also modify nodes - attempt to create the same alf_server entry. This causes at least PostgreSQL to deadlock on the SQL INSERT, and may also affect other database servers.
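The blocking pattern described above can be illustrated with a minimal, JDK-only simulation. This is a sketch under stated assumptions, not Alfresco code: the class name, the `runStartup` method and the `serverRowLock` field are hypothetical, and a `ReentrantLock` merely stands in for the lock a database holds on an uncommitted INSERT of a unique row. A timeout is used on the worker side so that the example terminates instead of hanging as the real startup does.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical simulation; no Alfresco or JDBC classes are involved.
class NestedInsertDeadlockSketch {
    // Stands in for the lock held on the uncommitted alf_server INSERT
    private static final ReentrantLock serverRowLock = new ReentrantLock();

    /** Returns true if the nested worker obtained the row within the timeout. */
    static boolean runStartup(long workerTimeoutMillis) throws Exception {
        ExecutorService workers = Executors.newSingleThreadExecutor();
        // Outer transaction: implicitly inserts the server entry, does not commit yet
        serverRowLock.lock();
        try {
            // Nested transaction on a worker thread attempts the same INSERT
            Future<Boolean> worker = workers.submit(() -> {
                if (serverRowLock.tryLock(workerTimeoutMillis, TimeUnit.MILLISECONDS)) {
                    serverRowLock.unlock();
                    return true;
                }
                return false; // still blocked by the outer transaction
            });
            // Outer transaction waits for the batch to finish before it can commit
            return worker.get();
        } finally {
            serverRowLock.unlock(); // "commit" of the outer transaction
            workers.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("worker acquired row: " + runStartup(200));
    }
}
```

In this simulation the worker can never obtain the lock, because the outer thread only releases it after the worker has returned. Without the timeout both sides would wait on each other indefinitely, which corresponds to the hanging startup reported here.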
Steps to reproduce:
- Start unmodified ACS Repository via docker-compose using a custom network and persistent volumes for PostgreSQL / Content Store (docker-compose up -d)
- Stop ACS Repository, also removing the network (docker-compose down)
- Rebuild ACS Repository image with an extension adding a trivial batch-processing module component (source files attached)
- Start ACS Repository again
Expected behaviour: ACS Repository starts up, executes the module component and is ready for use
Observed behaviour:
- ACS Repository startup hangs while executing the module component
- PostgreSQL container shows one waiting transaction, and the outer transaction as idle
Ideally, the alf_server entry should be proactively created as an isolated step in the Repository startup process (unless the server is in read-only mode). Alternatively, the alf_server table might be removed altogether. As far as I am aware, it only exists as a remnant of the old in-process Lucene indexing component, which required the server to be tracked as a reference for alf_transaction in an Enterprise cluster deployment, in order to asynchronously track changes from other cluster members. It is no longer used for anything.
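The first suggestion, creating the entry in its own short transaction before module components run, can be sketched with a small JDK-only simulation. Everything in it is an assumption made for illustration: the class, `registerServer` and `runModuleStartup` are hypothetical names, a concurrent set stands in for committed alf_server rows, and a `ReentrantLock` stands in for the lock held by an uncommitted INSERT. It is not how the Repository implements or fixes this.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical simulation; no Alfresco or JDBC classes are involved.
class ServerEntryPreRegistrationSketch {
    // Committed alf_server rows, visible to every transaction
    private final Set<String> committed = ConcurrentHashMap.newKeySet();
    // Stands in for the lock held by an uncommitted INSERT of a new row
    private final ReentrantLock pendingInsert = new ReentrantLock();

    /** Isolated startup step: insert the entry and commit immediately. */
    void registerServer(String ip) {
        pendingInsert.lock();
        try {
            committed.add(ip);
        } finally {
            pendingInsert.unlock();
        }
    }

    /** Returns true if the nested worker could proceed, false if it timed out. */
    boolean runModuleStartup(String ip, long timeoutMillis) throws Exception {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        boolean outerInserted = !committed.contains(ip);
        if (outerInserted) {
            pendingInsert.lock(); // outer transaction inserts, but has not committed
        }
        try {
            // Nested transaction on a worker thread needs the same entry
            return pool.submit(() ->
                    committed.contains(ip) || tryInsert(ip, timeoutMillis)).get();
        } finally {
            if (outerInserted) {
                committed.add(ip);      // outer transaction commits
                pendingInsert.unlock();
            }
            pool.shutdown();
        }
    }

    private boolean tryInsert(String ip, long timeoutMillis) throws InterruptedException {
        if (!pendingInsert.tryLock(timeoutMillis, TimeUnit.MILLISECONDS)) {
            return false; // blocked behind the uncommitted outer INSERT
        }
        try {
            committed.add(ip);
            return true;
        } finally {
            pendingInsert.unlock();
        }
    }

    public static void main(String[] args) throws Exception {
        ServerEntryPreRegistrationSketch repo = new ServerEntryPreRegistrationSketch();
        repo.registerServer("10.0.0.5"); // example address, chosen arbitrarily
        System.out.println("worker proceeded: " + repo.runModuleStartup("10.0.0.5", 200));
    }
}
```

With the entry registered up front, neither the outer nor the nested transaction needs to insert it, so the worker proceeds. Calling `runModuleStartup` on a fresh instance without `registerServer` makes the worker time out, mirroring the reported hang.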
The same behaviour would also occur in a legacy deployment if you moved your ACS instance to another server, or added a new server, before starting the Repository again with any added module components. I have not personally verified that the behaviour also affects k8s, as k8s will not be relevant for any of my dozen or so active customers for years to come, but the basic premise is almost guaranteed to apply in that scenario as well.