Service Packs and Hot Fixes / MNT-21022

POI metadata and text extraction for Excel files failing

    Details

    • Bug Priority:
      Category 2
    • Hot Fix Version:
      6.1.1.2
    • ACT Numbers:

      00998597, 00993142, 00999256, 00991904

    • Premier Customer:
      Yes
    • Sprint:
      Repo 68, Repo 69 - Product changes
    • Story Points:
      13

      Description

      Problem Description

      POI metadata and text extraction failing for Excel files

      Steps to Reproduce
      1. Manually install a non-containerized ACS 6.1.1 deployment.
      2. Upload any .xls or .xlsx file to Alfresco.
      3. Wait for indexing to trigger a transformation to text

      Expected Results
      Metadata and text extraction completes without error

      Observed Results
      Metadata extraction fails with

      2019-10-25 10:06:16,106 WARN  [org.alfresco.repo.content.metadata.AbstractMappingMetadataExtracter] [http-nio-8080-exec-9] Metadata extraction failed (turn on DEBUG for full error): 
         Extracter: org.alfresco.repo.content.metadata.PoiMetadataExtracter@5ec7eb3
         Content:   ContentAccessor[ contentUrl=store://2019/10/25/10/6/fac49d97-b5e6-4ebb-81ce-08eac6877fe3.bin, mimetype=application/vnd.openxmlformats-officedocument.spreadsheetml.sheet, size=2893882, encoding=UTF-8, locale=en_US]
         Failure:   Could not initialize class org.apache.poi.ooxml.POIXMLTypeLoadernull
      2019-10-25 10:07:09,245 DEBUG [org.alfresco.repo.content.metadata.AbstractMappingMetadataExtracter] [http-nio-8080-exec-7] Starting metadata extraction: 
         reader: ContentAccessor[ contentUrl=store://2019/10/25/10/7/ccdf4968-ca90-41b4-b67e-0f8450282bcc.bin, mimetype=application/vnd.openxmlformats-officedocument.spreadsheetml.sheet, size=2893882, encoding=UTF-8, locale=en_US]
         extracter: org.alfresco.repo.content.metadata.PoiMetadataExtracter@5ec7eb3
      2019-10-25 10:07:09,245 DEBUG [org.alfresco.repo.content.metadata.AbstractMappingMetadataExtracter] [http-nio-8080-exec-7] Concurrent extractions : 0
      2019-10-25 10:07:09,245 DEBUG [org.alfresco.repo.content.metadata.AbstractMappingMetadataExtracter] [http-nio-8080-exec-7] New extraction accepted. Concurrent extractions : 1
      2019-10-25 10:07:09,360 DEBUG [org.alfresco.repo.content.metadata.AbstractMappingMetadataExtracter] [http-nio-8080-exec-7] Extraction finalized. Remaining concurrent extraction : 0
      2019-10-25 10:07:09,384 DEBUG [org.alfresco.repo.content.metadata.AbstractMappingMetadataExtracter] [http-nio-8080-exec-7] Metadata extraction failed: 
         Extracter: org.alfresco.repo.content.metadata.PoiMetadataExtracter@5ec7eb3
         Content:   ContentAccessor[ contentUrl=store://2019/10/25/10/7/ccdf4968-ca90-41b4-b67e-0f8450282bcc.bin, mimetype=application/vnd.openxmlformats-officedocument.spreadsheetml.sheet, size=2893882, encoding=UTF-8, locale=en_US]null
      java.lang.NoClassDefFoundError: Could not initialize class org.apache.poi.ooxml.POIXMLTypeLoader
      	at org.apache.poi.ooxml.POIXMLProperties.<init>(POIXMLProperties.java:82)
      	at org.apache.poi.xssf.extractor.XSSFEventBasedExcelExtractor.<init>(XSSFEventBasedExcelExtractor.java:80)
      	at org.apache.poi.ooxml.extractor.ExtractorFactory.createExtractor(ExtractorFactory.java:215)
      	at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:161)
      	at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:110)
      	at org.alfresco.repo.content.metadata.TikaPoweredMetadataExtracter.extractRaw(TikaPoweredMetadataExtracter.java:396)
      	at org.alfresco.repo.content.metadata.AbstractMappingMetadataExtracter$ExtractRawCallable.call(AbstractMappingMetadataExtracter.java:2005)
      	at org.alfresco.repo.content.metadata.AbstractMappingMetadataExtracter$ExtractRawCallable.call(AbstractMappingMetadataExtracter.java:1)
      	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
      	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
      	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
      	at java.base/java.lang.Thread.run(Thread.java:834)
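
      Diagnostic note: on the JVM, "Could not initialize class X" is reported on the second and later uses of a class whose static initializer has already failed. The first use throws ExceptionInInitializerError, which carries the underlying cause. So the trace above is probably not the original failure for POIXMLTypeLoader, and an earlier ExceptionInInitializerError in the same log may show the real cause. The sketch below reproduces this generic JVM behavior only; InitFailureDemo and Fragile are hypothetical names and nothing in it uses POI or Alfresco code.

```java
public class InitFailureDemo {
    // Hypothetical stand-in for a class whose static initializer fails
    // (in the ticket, that class is POIXMLTypeLoader; the cause is unknown).
    static class Fragile {
        // Not a compile-time constant, so reading it triggers class initialization.
        static final int VALUE = compute();

        private static int compute() {
            throw new IllegalStateException("simulated static init failure");
        }
    }

    // Cached because the failure state is permanent for the lifetime of the JVM:
    // only the very first access can ever see ExceptionInInitializerError.
    private static String[] cached;

    // Returns the exception class names seen on the first and second access.
    public static synchronized String[] probe() {
        if (cached == null) {
            cached = new String[] { touch(), touch() };
        }
        return cached.clone();
    }

    private static String touch() {
        try {
            return String.valueOf(Fragile.VALUE);
        } catch (Throwable t) {
            return t.getClass().getName();
        }
    }

    public static void main(String[] args) {
        String[] seen = probe();
        System.out.println("first access:  " + seen[0]);
        System.out.println("second access: " + seen[1]);
    }
}
```

      In practice this suggests searching alfresco.log (or catalina.out) from server start for the first ExceptionInInitializerError that mentions POIXMLTypeLoader, rather than relying on the later NoClassDefFoundError entries.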

      Text extraction fails with

      2019-10-25 10:21:30,066 DEBUG [org.alfresco.repo.content.transform.TransformerDebug] [http-nio-8080-exec-5] 13             xlsx txt  221019x2019_10_25-12.37.32.xlsx 2.7 MB -- index -- SolrIndexer
      2019-10-25 10:21:30,066 DEBUG [org.alfresco.repo.content.transform.TransformerDebug] [http-nio-8080-exec-5] 13             workspace://SpacesStore/159fe734-1fd2-4d78-a76f-7f08b4590311 
      2019-10-25 10:21:30,067 DEBUG [org.alfresco.repo.content.transform.TransformerDebug] [http-nio-8080-exec-5] 13             **a) [120] TikaAuto           0 ms
      2019-10-25 10:21:30,067 DEBUG [org.alfresco.repo.content.transform.TransformerDebug] [http-nio-8080-exec-5] 13               b) [130] Poi                0 ms
      2019-10-25 10:21:30,067 DEBUG [org.alfresco.repo.content.transform.TransformerDebug] [http-nio-8080-exec-5] 13               c) [130] OOXML              0 ms
      2019-10-25 10:21:30,070 DEBUG [org.alfresco.repo.content.transform.TransformerDebug] [http-nio-8080-exec-5] 13.1           xlsx txt  221019x2019_10_25-12.37.32.xlsx 2.7 MB TikaAuto
      2019-10-25 10:21:30,605 DEBUG [org.alfresco.repo.content.transform.TransformerDebug] [http-nio-8080-exec-5] 13.1                     Failed Could not initialize class org.apache.poi.ooxml.POIXMLTypeLoader
      2019-10-25 10:21:30,607 DEBUG [org.alfresco.repo.content.transform.TransformerDebug] [http-nio-8080-exec-5] 13             Finished in 542 ms
      

      Workaround
      There is no workaround for metadata extraction that I can find.

      For txt transforms, one could enable JodConverter to handle it, e.g.

      content.transformer.complex.JodConverter.PdfBox.extensions.xlsx.txt.priority=105
      content.transformer.complex.JodConverter.PdfBox.extensions.xlsx.txt.supported=true

      Notes
      I also tried this using the 6.1.1 docker-compose deployment with Transform Service. Extraction and transformation completed without error there, although, unexpectedly, the Tika/POI extraction and txt transform seemed to happen locally rather than in the Transform Service.

    People

    • Assignee:
      closedbugs Closed Bugs (Inactive)
    • Reporter:
      sashcraft Scott Ashcraft