[ALF-21970] Parsing a PDF freezes the system due to CPU consumption (Tika related issue) Created: 18-Dec-17  Updated: 22-Mar-18  Resolved: 09-Mar-18

Status: Closed
Project: Alfresco
Component/s: Tika, POI, and Metadata Extraction
Affects Version/s: Community Edition 201707 GA
Fix Version/s: Community Edition 201803 EA
Security Level: external (External user)

Type: Bug Priority: Critical
Reporter: Angel Borroy (Inactive) Assignee: Closed Issues
Resolution: Fixed Votes: 1
Labels: PatchAttached
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: PDF2XHTML.java (Java source file), Tika-Issue-Full-CPU.PDF (PDF file)
Issue Links:
Related
is related to by MNT-19079 High CPU triggered by org.apache.tika... Closed
Requires
requires REPO-1066 Upgrade Apache Tika - 1.17 Done

 Description   

Steps to reproduce the issue

Drop the "Tika-Issue-Full-CPU.PDF" file (attached to this issue) into any folder.

When SOLR tries to index the content, CPU usage stays near 100% for a long time. Alfresco eventually stops working because of this CPU load.

The thread dump shows Tika recursing through PDF2XHTML.extractImages while it tries to extract content from the PDF:

 

java.lang.Thread.State: RUNNABLE
 at java.lang.Object.hashCode(Native Method)
 at java.util.HashMap.hash(HashMap.java:338)
 at java.util.HashMap.put(HashMap.java:611)
 at org.apache.pdfbox.pdmodel.PDResources.reverseMap(PDResources.java:658)
 at org.apache.pdfbox.pdmodel.PDResources.setXObjects(PDResources.java:332)
 at org.apache.pdfbox.pdmodel.PDResources.getXObjects(PDResources.java:269)
 at org.apache.tika.parser.pdf.PDF2XHTML.extractImages(PDF2XHTML.java:286)
 at org.apache.tika.parser.pdf.PDF2XHTML.extractImages(PDF2XHTML.java:288)
 at org.apache.tika.parser.pdf.PDF2XHTML.extractImages(PDF2XHTML.java:288)
 at org.apache.tika.parser.pdf.PDF2XHTML.extractImages(PDF2XHTML.java:288)
 at org.apache.tika.parser.pdf.PDF2XHTML.extractImages(PDF2XHTML.java:288)
 at org.apache.tika.parser.pdf.PDF2XHTML.extractImages(PDF2XHTML.java:288)
 at org.apache.tika.parser.pdf.PDF2XHTML.extractImages(PDF2XHTML.java:288)
 at org.apache.tika.parser.pdf.PDF2XHTML.extractImages(PDF2XHTML.java:288)
 at org.apache.tika.parser.pdf.PDF2XHTML.extractImages(PDF2XHTML.java:288)
 at org.apache.tika.parser.pdf.PDF2XHTML.extractImages(PDF2XHTML.java:288)
 at org.apache.tika.parser.pdf.PDF2XHTML.extractImages(PDF2XHTML.java:288)
 at org.apache.tika.parser.pdf.PDF2XHTML.extractImages(PDF2XHTML.java:288)
 at org.apache.tika.parser.pdf.PDF2XHTML.endPage(PDF2XHTML.java:220)
 at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:460)
 at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:383)
 at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:342)
 at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:117)
 at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:150)
 at org.alfresco.repo.content.transform.TikaPoweredContentTransformer.transformInternal(TikaPoweredContentTransformer.java:244)
 at org.alfresco.repo.content.transform.AbstractContentTransformer2.transform(AbstractContentTransformer2.java:250)
 at org.alfresco.repo.content.transform.AbstractContentTransformer2.transform(AbstractContentTransformer2.java:202)
 at org.alfresco.repo.web.scripts.solr.NodeContentGet.execute(NodeContentGet.java:206)
 

 



 Comments   
Comment by Angel Borroy (Inactive) [ 18-Dec-17 ]

It looks like this is related to https://issues.apache.org/jira/browse/TIKA-1742

I've backported part of https://github.com/apache/tika/blob/master/tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDF2XHTML.java#L170 to the Alfresco-patched Tika, and the issue appears to be solved.

I'm attaching the patched class for your consideration.

Is there a public repository for the Tika 1.6 source code patched by Alfresco?
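
For readers following the TIKA-1742 reference: the repeated extractImages frames in the thread dump are consistent with nested resources being walked again and again. A common guard for this kind of recursion is a set of already-seen resources passed down the call chain. The sketch below shows that idea using only the JDK. The names Resource, walkNaive, walkGuarded, buildSharedChain and newSeenSet are hypothetical and exist only for illustration; they are not Tika or PDFBox APIs, and the attached PDF2XHTML.java may differ in detail.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Set;

public class SeenResourcesSketch {

    /** Hypothetical stand-in for a PDF resource dictionary with nested XObjects. */
    static final class Resource {
        final String name;
        final List<Resource> children = new ArrayList<>();

        Resource(String name) {
            this.name = name;
        }
    }

    /** Naive walk: a shared child is processed again each time it is reached. */
    static long walkNaive(Resource r) {
        long visits = 1;
        for (Resource child : r.children) {
            visits += walkNaive(child);
        }
        return visits;
    }

    /** Guarded walk: a resource that was already seen is skipped. */
    static long walkGuarded(Resource r, Set<Resource> seen) {
        if (!seen.add(r)) {
            return 0; // already processed, do not descend again
        }
        long visits = 1;
        for (Resource child : r.children) {
            visits += walkGuarded(child, seen);
        }
        return visits;
    }

    /** Builds a chain in which every level references the next level twice. */
    static Resource buildSharedChain(int depth) {
        Resource current = new Resource("level-" + depth);
        for (int i = depth - 1; i >= 0; i--) {
            Resource parent = new Resource("level-" + i);
            parent.children.add(current);
            parent.children.add(current);
            current = parent;
        }
        return current;
    }

    /** Identity-based set, so two distinct but equal-looking objects stay distinct. */
    static Set<Resource> newSeenSet() {
        return Collections.newSetFromMap(new IdentityHashMap<Resource, Boolean>());
    }

    public static void main(String[] args) {
        Resource root = buildSharedChain(12);
        System.out.println("naive visits:   " + walkNaive(root));
        System.out.println("guarded visits: " + walkGuarded(root, newSeenSet()));
    }
}
```

In the naive walk the number of visits doubles with every level of sharing, whereas the guarded walk touches each resource once and also terminates if the resources reference each other in a cycle.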

Comment by Angel Borroy (Inactive) [ 18-Dec-17 ]

Workaround provided at https://github.com/keensoft/alf-21970-repo

Comment by Younes REGAIEG (Inactive) [ 19-Dec-17 ]

@angel.borroy I'm not sure this is the right repo, but you might want to open a PR here: https://github.com/Alfresco/tika

Comment by Richard Esplin [X] (Inactive) [ 04-Jan-18 ]

Thanks for the issue report, and the patch.

If you are inclined to submit a PR, I think this is the right branch of the Alfresco/tika repository:
https://github.com/Alfresco/tika/tree/1.6-alfresco-patched

Comment by Angel Borroy (Inactive) [ 05-Jan-18 ]

I have no permissions to create a pull request on that repository.

Comment by Derek Hulley [X] (Inactive) [ 11-Jan-18 ]

I am just getting admin access to that repo and will enable PRs or access for you soon.

Comment by Alex Mukha [ 09-Mar-18 ]

The latest community version uses a patched Tika 1.17 (https://github.com/Alfresco/alfresco-tika/tree/1.17-alfresco-patched).

I merged the fix into the patched 1.6 so that it is included in the 5.X service packs; see MNT-19079.
