Type: Bug
Status: Closed
Priority: Critical
Resolution: Fixed
Affects Version/s: Community Edition 201707 GA
Fix Version/s: Community Edition 201803 EA
Component/s: Tika, POI, and Metadata Extraction
Security Level: external (External user)
Labels:
Steps to reproduce the issue
Drop the "Tika-Issue-Full-CPU.PDF" file (attached to the issue) into any folder.
When SOLR tries to index the content, CPU usage stays near 100% for a long time. Alfresco eventually stops working because of this CPU consumption.
The thread dump shows how Tika is trying to extract content from the PDF:
java.lang.Thread.State: RUNNABLE
    at java.lang.Object.hashCode(Native Method)
    at java.util.HashMap.hash(HashMap.java:338)
    at java.util.HashMap.put(HashMap.java:611)
    at org.apache.pdfbox.pdmodel.PDResources.reverseMap(PDResources.java:658)
    at org.apache.pdfbox.pdmodel.PDResources.setXObjects(PDResources.java:332)
    at org.apache.pdfbox.pdmodel.PDResources.getXObjects(PDResources.java:269)
    at org.apache.tika.parser.pdf.PDF2XHTML.extractImages(PDF2XHTML.java:286)
    at org.apache.tika.parser.pdf.PDF2XHTML.extractImages(PDF2XHTML.java:288)
    at org.apache.tika.parser.pdf.PDF2XHTML.extractImages(PDF2XHTML.java:288)
    at org.apache.tika.parser.pdf.PDF2XHTML.extractImages(PDF2XHTML.java:288)
    at org.apache.tika.parser.pdf.PDF2XHTML.extractImages(PDF2XHTML.java:288)
    at org.apache.tika.parser.pdf.PDF2XHTML.extractImages(PDF2XHTML.java:288)
    at org.apache.tika.parser.pdf.PDF2XHTML.extractImages(PDF2XHTML.java:288)
    at org.apache.tika.parser.pdf.PDF2XHTML.extractImages(PDF2XHTML.java:288)
    at org.apache.tika.parser.pdf.PDF2XHTML.extractImages(PDF2XHTML.java:288)
    at org.apache.tika.parser.pdf.PDF2XHTML.extractImages(PDF2XHTML.java:288)
    at org.apache.tika.parser.pdf.PDF2XHTML.extractImages(PDF2XHTML.java:288)
    at org.apache.tika.parser.pdf.PDF2XHTML.extractImages(PDF2XHTML.java:288)
    at org.apache.tika.parser.pdf.PDF2XHTML.endPage(PDF2XHTML.java:220)
    at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:460)
    at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:383)
    at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:342)
    at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:117)
    at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:150)
    at org.alfresco.repo.content.transform.TikaPoweredContentTransformer.transformInternal(TikaPoweredContentTransformer.java:244)
    at org.alfresco.repo.content.transform.AbstractContentTransformer2.transform(AbstractContentTransformer2.java:250)
    at org.alfresco.repo.content.transform.AbstractContentTransformer2.transform(AbstractContentTransformer2.java:202)
    at org.alfresco.repo.web.scripts.solr.NodeContentGet.execute(NodeContentGet.java:206)
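When reading a dump like the one above, it can help to count how often each frame repeats, since a frame that appears many times in a row (here PDF2XHTML.extractImages) points at deep recursion. The following is a minimal sketch using only the Java standard library. The class name FrameCounter and the method countFrames are hypothetical helpers written for this illustration, not part of Alfresco, Tika, or PDFBox, and the sketch assumes the pasted trace begins with the thread-state header and separates frames with " at ".

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration only; not part of Alfresco, Tika, or PDFBox.
final class FrameCounter {

    /**
     * Splits a pasted stack trace on the " at " separators and counts how
     * often each frame occurs, preserving first-seen order.
     * Assumes the text starts with a header such as
     * "java.lang.Thread.State: RUNNABLE", which is skipped.
     */
    public static Map<String, Integer> countFrames(String trace) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        String[] parts = trace.trim().split("\\s+at\\s+");
        // parts[0] is the thread-state header, not a frame.
        for (int i = 1; i < parts.length; i++) {
            String frame = parts[i].trim();
            if (!frame.isEmpty()) {
                counts.merge(frame, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Abbreviated excerpt of the dump, shortened for the example.
        String trace = "java.lang.Thread.State: RUNNABLE"
                + " at org.apache.pdfbox.pdmodel.PDResources.getXObjects(PDResources.java:269)"
                + " at org.apache.tika.parser.pdf.PDF2XHTML.extractImages(PDF2XHTML.java:286)"
                + " at org.apache.tika.parser.pdf.PDF2XHTML.extractImages(PDF2XHTML.java:288)"
                + " at org.apache.tika.parser.pdf.PDF2XHTML.extractImages(PDF2XHTML.java:288)"
                + " at org.apache.tika.parser.pdf.PDF2XHTML.extractImages(PDF2XHTML.java:288)"
                + " at org.apache.tika.parser.pdf.PDF2XHTML.endPage(PDF2XHTML.java:220)";

        // Print only frames that repeat, which are the recursion candidates.
        countFrames(trace).forEach((frame, n) -> {
            if (n > 1) {
                System.out.println(n + "x " + frame);
            }
        });
    }
}
```

Frames are keyed on the full text including the line number, so the call at PDF2XHTML.java:286 and the recursive calls at PDF2XHTML.java:288 are counted separately.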