Alfresco / ALF-21970

Parsing a PDF freezes the system due to CPU consumption (Tika related issue)

    Details

    • Resolution Time Custom Field:
      11 weeks, 3 days, 21 hours, 16 minutes, 42 seconds

      Description

      Steps to reproduce the issue

      Drop the file "Tika-Issue-Full-CPU.PDF" (attached to this issue) into any folder.

      When SOLR tries to index the content, CPU usage stays at nearly 100% for a long time. Alfresco eventually stops working because of this CPU consumption.

      The thread dump shows Tika still trying to extract content from the PDF, with repeated nested calls to PDF2XHTML.extractImages:

       

      java.lang.Thread.State: RUNNABLE
       at java.lang.Object.hashCode(Native Method)
       at java.util.HashMap.hash(HashMap.java:338)
       at java.util.HashMap.put(HashMap.java:611)
       at org.apache.pdfbox.pdmodel.PDResources.reverseMap(PDResources.java:658)
       at org.apache.pdfbox.pdmodel.PDResources.setXObjects(PDResources.java:332)
       at org.apache.pdfbox.pdmodel.PDResources.getXObjects(PDResources.java:269)
       at org.apache.tika.parser.pdf.PDF2XHTML.extractImages(PDF2XHTML.java:286)
       at org.apache.tika.parser.pdf.PDF2XHTML.extractImages(PDF2XHTML.java:288)
       at org.apache.tika.parser.pdf.PDF2XHTML.extractImages(PDF2XHTML.java:288)
       at org.apache.tika.parser.pdf.PDF2XHTML.extractImages(PDF2XHTML.java:288)
       at org.apache.tika.parser.pdf.PDF2XHTML.extractImages(PDF2XHTML.java:288)
       at org.apache.tika.parser.pdf.PDF2XHTML.extractImages(PDF2XHTML.java:288)
       at org.apache.tika.parser.pdf.PDF2XHTML.extractImages(PDF2XHTML.java:288)
       at org.apache.tika.parser.pdf.PDF2XHTML.extractImages(PDF2XHTML.java:288)
       at org.apache.tika.parser.pdf.PDF2XHTML.extractImages(PDF2XHTML.java:288)
       at org.apache.tika.parser.pdf.PDF2XHTML.extractImages(PDF2XHTML.java:288)
       at org.apache.tika.parser.pdf.PDF2XHTML.extractImages(PDF2XHTML.java:288)
       at org.apache.tika.parser.pdf.PDF2XHTML.extractImages(PDF2XHTML.java:288)
       at org.apache.tika.parser.pdf.PDF2XHTML.endPage(PDF2XHTML.java:220)
       at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:460)
       at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:383)
       at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:342)
       at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:117)
       at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:150)
       at org.alfresco.repo.content.transform.TikaPoweredContentTransformer.transformInternal(TikaPoweredContentTransformer.java:244)
       at org.alfresco.repo.content.transform.AbstractContentTransformer2.transform(AbstractContentTransformer2.java:250)
       at org.alfresco.repo.content.transform.AbstractContentTransformer2.transform(AbstractContentTransformer2.java:202)
       at org.alfresco.repo.web.scripts.solr.NodeContentGet.execute(NodeContentGet.java:206)
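      The repeated PDF2XHTML.extractImages frames suggest a recursive walk over nested XObject resources. One plausible explanation, which this report does not confirm, is that the PDF shares the same nested resources from many parents, so a naive recursion revisits them an exponential number of times. The sketch below is a self-contained illustration of that effect and of a visited-set guard. It is not Tika or PDFBox code: Node, visitNaive, visitOnce and buildSharedChain are hypothetical names introduced only for this example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Set;

public class XObjectWalk {
    /** Hypothetical stand-in for a PDF XObject that may reference nested XObjects. */
    static final class Node {
        final List<Node> children = new ArrayList<>();
    }

    /** Naive recursion: a shared child is walked again every time it is reached. */
    static long visitNaive(Node node) {
        long visits = 1;
        for (Node child : node.children) {
            visits += visitNaive(child);
        }
        return visits;
    }

    /** Guarded recursion: each node is processed at most once (identity-based check). */
    static long visitOnce(Node node, Set<Node> seen) {
        if (!seen.add(node)) {
            return 0; // already processed, skip the whole subtree
        }
        long visits = 1;
        for (Node child : node.children) {
            visits += visitOnce(child, seen);
        }
        return visits;
    }

    /** Builds a chain of the given depth in which every level references the next level twice. */
    static Node buildSharedChain(int depth) {
        Node current = new Node(); // leaf
        for (int i = 0; i < depth; i++) {
            Node parent = new Node();
            parent.children.add(current);
            parent.children.add(current);
            current = parent;
        }
        return current;
    }

    public static void main(String[] args) {
        Node root = buildSharedChain(20);
        Set<Node> seen = Collections.newSetFromMap(new IdentityHashMap<Node, Boolean>());
        System.out.println("naive visits:   " + visitNaive(root));
        System.out.println("guarded visits: " + visitOnce(root, seen));
    }
}
```

      With only 21 distinct nodes, the naive walk already performs about two million visits, and each extra level doubles that number. If the attached PDF has a similar structure, that would be consistent with the sustained CPU load described above.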
       
      

       

              People

              • Assignee:
                Closed Issues
              • Reporter:
                Angel Borroy
              • Votes:
                1
              • Watchers:
                8
