  1. Service Packs and Hot Fixes
  2. MNT-11062

PDF metadata extractors causing high CPU/memory with PDFs created using 'PDF Xpansion'

    Details

    • Type: Service Pack Request
    • Status: Closed
    • Resolution: Fixed
    • Affects Version/s: 4.1.7.6
    • Fix Version/s: 4.1.10
    • Labels:
      None
    • Environment:
      Alfresco 4.1.7.6 with SOLR / Windows 2008 R2 / JDK 1.6.0_33 / Tomcat 6.0.35 / MySQL 5.5.36

      Description

      [ Problem ]
      Both registered PDF extractors (PDFBox and TikaAuto) struggle to extract text from certain PDFs, causing high memory usage and high, prolonged CPU usage.

      [ Description ]
      Certain PDFs can cause OOM on the system when metadata extraction is run on them. These specific PDFs appear to have been created using 'PDF Xpansion', and when they are uploaded we run either TikaAuto or PDFBox to extract the metadata. The heap can easily be exhausted during metadata extraction when several of these files are uploaded at once. Uploading 10 of these files at once will likely cause a DoS.

      [ Steps to reproduce ]
      Take the attached PDF ('100MB-memory-eating-PDF.pdf', ~156 KB) and upload several copies, say 7 or 8, at the same time.
      Observe memory and CPU usage.

      [ Observations ]
      Metadata extraction eventually fails on these files; however, memory usage increases either to the point where OOM occurs or to the point where so many Full GCs occur that the system is unusable.
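
      One way to put numbers on the heap growth and GC activity described above is to sample the JVM's own management beans. The following is a minimal sketch, not part of the original report: the class name HeapGcSampler is hypothetical, and it only sees the JVM it runs in, so it assumes the sampling code is run inside the same JVM that performs the extraction (for example from a test harness). It reports the total collection count across all collectors and does not separate Full GCs from minor ones.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

// Hypothetical helper: samples heap usage and GC counts of the current JVM.
final class HeapGcSampler {

    /** Heap bytes currently in use, as reported by the MemoryMXBean. */
    static long usedHeapBytes() {
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        return heap.getUsed();
    }

    /** Sum of collection counts over all collectors; -1 (undefined) values are skipped. */
    static long totalGcCount() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long count = gc.getCollectionCount();
            if (count > 0) {
                total += count;
            }
        }
        return total;
    }

    /** One log line combining both readings. */
    static String sampleLine() {
        long usedMb = usedHeapBytes() / (1024L * 1024L);
        return String.format("heapUsedMB=%d gcCount=%d", usedMb, totalGcCount());
    }

    public static void main(String[] args) throws InterruptedException {
        // Number of one-second samples to print; defaults to 5.
        int samples = args.length > 0 ? Integer.parseInt(args[0]) : 5;
        for (int i = 0; i < samples; i++) {
            System.out.println(sampleLine());
            Thread.sleep(1000);
        }
    }
}
```

      A steadily rising gcCount with little or no drop in heapUsedMB between samples would be consistent with the behaviour reported here.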

      [ Analysis ]
      From the snapshots we can see that the following methods take up the highest percentage of time in their respective snapshots:
      org.apache.tika.parser.pdf.PDFParser.parse
      and
      org.apache.tika.parser.AutoDetectParser.parse

      In the heap dump ('heapdump-1396270240921.hprof') we can also see that the largest retained sizes are held by:
      org.apache.tika.parser.pdf.PDF2XHTML [Stack Local] 110182584
      org.apache.pdfbox.pdmodel.PDResources [Stack Local] 51391568
      and
      org.apache.tika.parser.pdf.PDF2XHTML [Stack Local] 110182584

      The text extraction (PDFBox), triggered by NodeContentGet, also takes up a large amount of memory. However, we have worked around this by testing both 'pdfminer' and 'pdftotext'. Neither of these third-party applications causes any full GCs, and both complete in <2s. I have some comparative tests to show this.
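
      The report does not show how the external tools were invoked. As an illustration only, the sketch below runs 'pdftotext' as a separate process with a timeout, so that a problematic PDF consumes memory outside the JVM heap and can be killed if it runs too long. The class name PdfToTextRunner and its methods are hypothetical; the sketch assumes 'pdftotext' is on the PATH and uses Process.waitFor(timeout, unit), which requires Java 8 or later (newer than the JDK 1.6 listed in the environment).

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.TimeUnit;

// Hypothetical wrapper around the external 'pdftotext' command.
final class PdfToTextRunner {

    /** Builds the command line: pdftotext <input.pdf> <output.txt>. */
    static List<String> buildCommand(String pdfPath, String txtPath) {
        List<String> cmd = new ArrayList<String>();
        cmd.add("pdftotext"); // assumption: the binary is on the PATH
        cmd.add(pdfPath);
        cmd.add(txtPath);
        return cmd;
    }

    /**
     * Runs the extraction in a child process.
     * Returns true on exit code 0, false on a non-zero exit or a timeout.
     */
    static boolean extract(String pdfPath, String txtPath, long timeoutSeconds)
            throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder(buildCommand(pdfPath, txtPath));
        pb.inheritIO(); // avoid blocking on unread stdout/stderr pipes
        Process process = pb.start();
        if (!process.waitFor(timeoutSeconds, TimeUnit.SECONDS)) {
            process.destroyForcibly(); // kill the child if it exceeds the timeout
            return false;
        }
        return process.exitValue() == 0;
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("usage: PdfToTextRunner <input.pdf> <output.txt>");
            return;
        }
        // The 10-second limit is an arbitrary illustrative value.
        boolean ok = extract(args[0], args[1], 10);
        System.out.println(ok ? "extracted" : "failed or timed out");
    }
}
```

      Running the tool out of process keeps its memory use separate from the repository's heap, at the cost of process start-up overhead per document.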

      Also tested using the latest PDFBox lib (1.8.4)

              People

              • Assignee: Closed Bugs (Inactive)
              • Reporter: Alex Strachan
              • Votes: 0
              • Watchers: 5

                  Time Tracking

                  • Original Estimate: Not Specified
                  • Remaining Estimate: 0 minutes
                  • Time Spent: 1 week, 5 hours
