[ALF-21970] Parsing a PDF freezes the system due to CPU consumption (Tika related issue) Created: 18-Dec-17 Updated: 22-Mar-18 Resolved: 09-Mar-18 |
|
Status: | Closed |
Project: | Alfresco |
Component/s: | Tika, POI, and Metadata Extraction |
Affects Version/s: | Community Edition 201707 GA |
Fix Version/s: | Community Edition 201803 EA |
Security Level: | external (External user) |
Type: | Bug | Priority: | Critical |
Reporter: | Angel Borroy (Inactive) | Assignee: | Closed Issues |
Resolution: | Fixed | Votes: | 1 |
Labels: | PatchAttached | ||
Remaining Estimate: | Not Specified | ||
Time Spent: | Not Specified | ||
Original Estimate: | Not Specified |
Description |
Steps to reproduce the issue: Drop the "Tika-Issue-Full-CPU.PDF" file (attached to the issue) into any folder. When SOLR tries to index the content, CPU consumption stays near 100% for a long time. Alfresco stops working after a while because of this CPU consumption. The thread dump shows Tika trying to extract content from the PDF:
java.lang.Thread.State: RUNNABLE
    at java.lang.Object.hashCode(Native Method)
    at java.util.HashMap.hash(HashMap.java:338)
    at java.util.HashMap.put(HashMap.java:611)
    at org.apache.pdfbox.pdmodel.PDResources.reverseMap(PDResources.java:658)
    at org.apache.pdfbox.pdmodel.PDResources.setXObjects(PDResources.java:332)
    at org.apache.pdfbox.pdmodel.PDResources.getXObjects(PDResources.java:269)
    at org.apache.tika.parser.pdf.PDF2XHTML.extractImages(PDF2XHTML.java:286)
    at org.apache.tika.parser.pdf.PDF2XHTML.extractImages(PDF2XHTML.java:288)
    at org.apache.tika.parser.pdf.PDF2XHTML.extractImages(PDF2XHTML.java:288)
    at org.apache.tika.parser.pdf.PDF2XHTML.extractImages(PDF2XHTML.java:288)
    at org.apache.tika.parser.pdf.PDF2XHTML.extractImages(PDF2XHTML.java:288)
    at org.apache.tika.parser.pdf.PDF2XHTML.extractImages(PDF2XHTML.java:288)
    at org.apache.tika.parser.pdf.PDF2XHTML.extractImages(PDF2XHTML.java:288)
    at org.apache.tika.parser.pdf.PDF2XHTML.extractImages(PDF2XHTML.java:288)
    at org.apache.tika.parser.pdf.PDF2XHTML.extractImages(PDF2XHTML.java:288)
    at org.apache.tika.parser.pdf.PDF2XHTML.extractImages(PDF2XHTML.java:288)
    at org.apache.tika.parser.pdf.PDF2XHTML.extractImages(PDF2XHTML.java:288)
    at org.apache.tika.parser.pdf.PDF2XHTML.extractImages(PDF2XHTML.java:288)
    at org.apache.tika.parser.pdf.PDF2XHTML.endPage(PDF2XHTML.java:220)
    at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:460)
    at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:383)
    at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:342)
    at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:117)
    at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:150)
    at org.alfresco.repo.content.transform.TikaPoweredContentTransformer.transformInternal(TikaPoweredContentTransformer.java:244)
    at org.alfresco.repo.content.transform.AbstractContentTransformer2.transform(AbstractContentTransformer2.java:250)
    at org.alfresco.repo.content.transform.AbstractContentTransformer2.transform(AbstractContentTransformer2.java:202)
    at org.alfresco.repo.web.scripts.solr.NodeContentGet.execute(NodeContentGet.java:206)
|
Comments |
Comment by Angel Borroy (Inactive) [ 18-Dec-17 ] |
It looks like it's related to TIKA-1742 (https://issues.apache.org/jira/browse/TIKA-1742). I've backported part of https://github.com/apache/tika/blob/master/tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDF2XHTML.java#L170 to the Alfresco-patched Tika, and the issue seems to be solved. I'm attaching the patched class for your consideration. Is there any public repository for the Tika 1.6 source code patched by Alfresco? |
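The thread dump in the description shows extractImages calling itself many levels deep, which is the pattern you get when nested resources are shared (or refer back to one another) and every path to them is walked again. A common guard against this is to remember which resources have already been visited and skip repeats. The following is only a minimal sketch of that general technique, assuming the fix is of this kind; it does not use the PDFBox or Tika APIs, and every name in it (VisitedGuardSketch, Resource, visitAll, buildSharedChain) is hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only: not the Tika or PDFBox code.
class VisitedGuardSketch {

    // Stand-in for a PDF resource that can contain nested resources.
    static final class Resource {
        final String name;
        final List<Resource> children = new ArrayList<>();

        Resource(String name) {
            this.name = name;
        }
    }

    // Walks the resource graph, processing each distinct resource once.
    // Returns how many distinct resources were processed.
    static int visitAll(Resource root) {
        // Identity-based set: two resources are "the same" only if they are the same object.
        Set<Resource> seen = Collections.newSetFromMap(new IdentityHashMap<Resource, Boolean>());
        return visit(root, seen);
    }

    private static int visit(Resource resource, Set<Resource> seen) {
        if (!seen.add(resource)) {
            return 0; // already processed: skip instead of descending again
        }
        int count = 1;
        for (Resource child : resource.children) {
            count += visit(child, seen);
        }
        return count;
    }

    // Builds depth + 1 resources where each one references the next one twice,
    // and the last one references the first (a cycle). Without the guard, the
    // number of paths doubles at every level and the cycle never terminates.
    static Resource buildSharedChain(int depth) {
        Resource first = new Resource("r0");
        Resource current = first;
        for (int i = 1; i <= depth; i++) {
            Resource next = new Resource("r" + i);
            current.children.add(next);
            current.children.add(next);
            current = next;
        }
        current.children.add(first);
        return first;
    }

    public static void main(String[] args) {
        Resource root = buildSharedChain(40);
        System.out.println("visited " + visitAll(root) + " resources"); // prints "visited 41 resources"
    }
}
```

With the guard, the walk is linear in the number of distinct resources; without it, the same 41 resources would be reached through roughly 2^40 paths before the cycle is even considered.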
Comment by Angel Borroy (Inactive) [ 18-Dec-17 ] |
Workaround provided at https://github.com/keensoft/alf-21970-repo |
Comment by Younes REGAIEG (Inactive) [ 19-Dec-17 ] |
@angel.borroy I'm not sure if this is the right repo, but you might want to open a PR here: https://github.com/Alfresco/tika |
Comment by Richard Esplin [X] (Inactive) [ 04-Jan-18 ] |
Thanks for the issue report and the patch. If you are inclined to submit a PR, I think this is the right branch of the Alfresco/tika repository: |
Comment by Angel Borroy (Inactive) [ 05-Jan-18 ] |
I have no permissions to create a pull request on that repository. |
Comment by Derek Hulley [X] (Inactive) [ 11-Jan-18 ] |
I am just getting admin access to that repo and will enable PRs or access for you soon. |
Comment by Alex Mukha [ 09-Mar-18 ] |
The latest community version uses the patched Tika 1.17 (https://github.com/Alfresco/alfresco-tika/tree/1.17-alfresco-patched). I merged the fix into the patched 1.6 so that it is included in the 5.X service packs, see |