As part of MNT-11225, Tika and its transitive dependencies should be upgraded to include TIKA-1278.
|Library||Current 4.2.N||Tika 1.6-SNAPSHOT||Notes|
|java-libpst||–||0.7||License: Apache 2|
|jdom||–||1.0||Dependency of rome; may not be needed|
|tika-*||1.5-20130720-alfresco-patched||1.6-yyyyMMdd-alfresco-patched||Patched to use asm 3.1; see MNT-9291|
Those that are a little concerning are in bold above.
PDFBox 1.8.4 must be patched again because our changes have not been incorporated into that release.
Tika must also be patched to downgrade asm to 3.1, because changing cglib instead could affect several other dependencies. (See MNT-9291 and the comment on BDE-266.)
Tika has added the ability to parse artifacts embedded in PDFs (TIKA-1268). This additional parsing can be quite resource-intensive for some PDFs, particularly one used in our PDF content transformer tests. A way to disable the parsing of embedded attachments via configuration must be developed, and the parsing of embedded images in PDFs should be disabled by default.
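One possible shape for such a configuration switch is sketched below. It is only an illustration: the class name `PdfEmbeddedParsingSettings` and both property keys are hypothetical, not existing Tika or Alfresco names, and the default of leaving attachment parsing enabled is an assumption. The only default taken from the requirement above is that embedded image parsing is off unless explicitly enabled.

```java
import java.util.Properties;

// Hypothetical settings holder: the class name and property keys are
// illustrative only and are not part of Tika or Alfresco.
final class PdfEmbeddedParsingSettings {
    static final String PARSE_ATTACHMENTS_KEY = "pdf.parseEmbeddedAttachments";
    static final String PARSE_IMAGES_KEY = "pdf.parseEmbeddedImages";

    private final boolean parseEmbeddedAttachments;
    private final boolean parseEmbeddedImages;

    private PdfEmbeddedParsingSettings(boolean parseEmbeddedAttachments,
                                       boolean parseEmbeddedImages) {
        this.parseEmbeddedAttachments = parseEmbeddedAttachments;
        this.parseEmbeddedImages = parseEmbeddedImages;
    }

    // Builds the settings from plain properties, applying defaults for
    // any key that is absent.
    static PdfEmbeddedParsingSettings fromProperties(Properties props) {
        // Assumption: attachments stay enabled unless explicitly turned off.
        boolean attachments = Boolean.parseBoolean(
                props.getProperty(PARSE_ATTACHMENTS_KEY, "true"));
        // Embedded images are disabled unless explicitly turned on.
        boolean images = Boolean.parseBoolean(
                props.getProperty(PARSE_IMAGES_KEY, "false"));
        return new PdfEmbeddedParsingSettings(attachments, images);
    }

    boolean isParseEmbeddedAttachments() {
        return parseEmbeddedAttachments;
    }

    boolean isParseEmbeddedImages() {
        return parseEmbeddedImages;
    }
}
```

A transformer could read these flags once at start-up and pass them on to whatever parser configuration the patched Tika build exposes. How that wiring is done is left open here.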