Optimizing MetadataTracker process (SEARCH-2026)

[SEARCH-2028] Improve getting pending transactions from MetadataTracker Created: 23-Dec-19  Updated: 20-Apr-20  Resolved: 10-Mar-20

Status: Closed
Project: Search and Discovery
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Sub-task
Reporter: Angel Borroy Assignee: Unassigned
Resolution: Done Votes: 0
Labels: ACS_6.2.n
Remaining Estimate: 0 minutes
Time Spent: Not Specified
Original Estimate: Not Specified

Sprint: Team Ninja-King - S&I 21, Team Ninja-King - S&I 22, Team Ninja-King - S&I 23, Team Ninja-King - S&I 24, Team Ninja-King - S&I 25
Delivery Team: Search
Template:

 Description   

Currently the transaction list for each tracking iteration is based in

https://github.com/Alfresco/alfresco-repository/blob/master/src/main/resources/alfresco/ibatis/org.alfresco.repo.domain.dialect.Dialect/solr-common-SqlMap.xml#L89

This query is limiting the results to 1 hour, what can be inefficient when repository has been idle for a long time.

Also updates and deletes count is expensive in terms of performance for this query. It should be recommended to guess if that values are required for tracking.



 Comments   
Comment by Angel Borroy [ 27-Dec-19 ]

How to reproduce this problem.

Change the first commit time in transactions table.

alfresco=# update alf_transaction set commit_time_ms = 3 where id = 1;
UPDATE 1

alfresco=# select * from alf_transaction where id = 1;
 id | version | server_id |            change_txn_id             | commit_time_ms
----+---------+-----------+--------------------------------------+----------------
  1 |       1 |         1 | 9db1c369-9f41-4fa6-802c-948611f23d57 |              3
(1 row)

After that, clear SOLR Indexes and start a new indexation.

solr6_1                  | #### Check txnsFound : 0
solr6_1                  | ======= fromCommitTime: 3
solr6_1                  | #### Get txn from commit time: 3
solr6_1                  | #### Check txnsFound : 0
solr6_1                  | ======= fromCommitTime: 3
solr6_1                  | #### Get txn from commit time: 3
solr6_1                  | Scanning transactions ...
solr6_1                  | .... from Transaction [id=1, commitTimeMs=3, updates=4, deletes=0]
solr6_1                  | .... to Transaction [id=1, commitTimeMs=3, updates=4, deletes=0]
solr6_1                  | Scanning transactions ...
solr6_1                  | .... from Transaction [id=1, commitTimeMs=3, updates=4, deletes=0]
solr6_1                  | .... to Transaction [id=1, commitTimeMs=3, updates=4, deletes=0]

solr6_1                  | #### Check txnsFound : 1
solr6_1                  | ======= fromCommitTime: 3
solr6_1                  | #### Get txn from commit time: 3
solr6_1                  | #### Check txnsFound : 1
solr6_1                  | ======= fromCommitTime: 3
solr6_1                  | #### Get txn from commit time: 3

After indexing the first transaction, the tracker starts to add 1 hour to the initial commit time (3 ms in this case) in order to find the next transaction to be indexed. This can take a long time, specially in this sample where the second transaction has a commit time of 1577438553586.

Comment by Angel Borroy [ 07-Jan-20 ]

Finding next transaction commit time from a given commit time can be obtained with a simple query:

explain analyze select min(commit_time_ms) from alf_transaction where commit_time_ms > 3;
                                                                         QUERY PLAN

-----------------------------------------------------------------------------------------------------------
-------------------------------------------------
 Result  (cost=0.33..0.34 rows=1 width=8) (actual time=0.061..0.061 rows=1 loops=1)
   InitPlan 1 (returns $0)
     ->  Limit  (cost=0.29..0.33 rows=1 width=8) (actual time=0.019..0.019 rows=1 loops=1)
           ->  Index Only Scan using idx_alf_txn_ctms on alf_transaction  (cost=0.29..1095.57 rows=26896 wi
dth=8) (actual time=0.018..0.018 rows=1 loops=1)
                 Index Cond: ((commit_time_ms IS NOT NULL) AND (commit_time_ms > 3))
                 Heap Fetches: 1
 Planning Time: 4.135 ms
 Execution Time: 0.089 ms
(8 rows)

Including this request before looping through the transactions can help to skip huge amount of time where the repository is no ingesting new content.

Comment by Angel Borroy [ 07-Jan-20 ]

Proposed changes available in:
https://git.alfresco.com/search_discovery/insightengine/merge_requests/311

Also required repository module available in:
https://github.com/Alfresco/solr-transactions-tracker

Comment by Tom Page [ 08-Jan-20 ]

If this module is being merged into the core ACS code then it would be nice to have an API in ACS to find out what version is deployed. That way SS can determine whether ACS is compatible or not.

Comment by Tom Page [ 08-Jan-20 ]

The module code looks good to me too. In addition it makes a really nice example of a module for learning purposes and IMO would make a good blog post.

Comment by Angel Borroy [ 08-Jan-20 ]

New code to detect if the repo service is available has been added to
https://git.alfresco.com/search_discovery/insightengine/merge_requests/311

Still pending to implement required changes in alfresco-repository and alfresco-remote-api artifacts as they are being updated this week by Repo Team.

Comment by Angel Borroy [ 17-Jan-20 ]

First MR has been provided for Repo artifacts, adding required changes to alfresco-repository master:

https://github.com/Alfresco/alfresco-repository/pull/780

After this change is merged and the new version is released, required changes will be added to alfresco-remote-api.

Once master is updated for these projects, the changes will be cherry picked to HF branches in order to have this feature also available in 6.2

Comment by Angel Borroy [ 17-Jan-20 ]

Now that alfresco-repository:8.86 has been released, MR for alfresco-remote-api is available:

https://github.com/Alfresco/alfresco-remote-api/pull/505

Comment by Angel Borroy [ 17-Jan-20 ]

alfresco-remote-api:8.86 has been released.

Changes are now available in ACS Master.

Comment by Angel Borroy [ 17-Jan-20 ]

Changes are also part of ACS 6.2:

  • alfresco-repository: 7.N > 7.139
  • alfresco-remote-api: 7.N > 7.110
  • acs-packaging: support/SP/6.2.N
Comment by Angel Borroy [ 20-Jan-20 ]

Adding nextTxCommitTime service.

First MR for alfresco-repository master available in:
https://github.com/Alfresco/alfresco-repository/pull/783

Comment by Angel Borroy [ 20-Jan-20 ]

MR for alfresco-remote-api available in:
https://github.com/Alfresco/alfresco-remote-api/pull/508

Comment by Angel Borroy [ 20-Jan-20 ]

Changes are now available in ACS Master:

  • alfresco-repository:8.89
  • alfresco-remote-api:8.88
Comment by Angel Borroy [ 20-Jan-20 ]

Changes are also part of ACS 6.2:

  • alfresco-repository: 7.N > 7.140
  • alfresco-remote-api: 7.N > 7.111
  • acs-packaging: support/SP/6.2.N
Comment by Angel Borroy [ 20-Jan-20 ]

MR for Search Service has been merged to master.

Generated at Thu Jul 09 10:04:36 BST 2020 using JIRA 7.6.3#76005-sha1:8a4e38d34af948780dbf52044e7aafb13a7cae58.