The transformer for RFC822 messages, EMLTransformer.java, has a severe bug that impacts performance for anyone who stores a lot of emails.
The transformation of multipart emails always returns the entire email, including the base64 text of attachments.
- For indexing, this means the plain text of base64-encoded attachments gets indexed. A client of mine with 100,000+ emails could enter pretty much any character combination and get a hit. The index file size grew to 300+ GB.
- Previews of EML files can run to 300+ pages in the PdfJS viewer, since the attachment's base64 text is displayed.
How to reproduce
- Create an email with an html body and at least one attachment.
- Create a folder with a rule that transforms content to plain text.
- Transfer the email to Alfresco as an EML file and drop it into the folder above.
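
For the first step, a test message can also be generated rather than exported from a mail client. The following is only a sketch using plain Java: the class name SampleEml, the boundary string, the addresses and the file name sample.eml are all made up for illustration, and the message is a bare-minimum multipart/mixed email with one html part and one base64 attachment.

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Base64;
import java.util.Random;

// Hypothetical helper, not part of Alfresco: builds a minimal multipart EML for testing.
final class SampleEml {
    // Arbitrary boundary chosen for this sketch.
    static final String BOUNDARY = "sample-boundary-1";

    /** Returns an RFC822 message with an html body and one base64 attachment. */
    static String build(byte[] attachment) {
        // MIME-style base64: lines wrapped at 76 characters with CRLF separators.
        String encoded = Base64
                .getMimeEncoder(76, "\r\n".getBytes(StandardCharsets.US_ASCII))
                .encodeToString(attachment);
        return String.join("\r\n",
                "From: sender@example.com",
                "To: recipient@example.com",
                "Subject: EML transform test",
                "MIME-Version: 1.0",
                "Content-Type: multipart/mixed; boundary=\"" + BOUNDARY + "\"",
                "",
                "--" + BOUNDARY,
                "Content-Type: text/html; charset=UTF-8",
                "",
                "<html><body><p>Hello from the body</p></body></html>",
                "--" + BOUNDARY,
                "Content-Type: application/octet-stream; name=\"data.bin\"",
                "Content-Transfer-Encoding: base64",
                "Content-Disposition: attachment; filename=\"data.bin\"",
                "",
                encoded,
                "--" + BOUNDARY + "--",
                "");
    }

    public static void main(String[] args) throws Exception {
        // Random bytes stand in for a real attachment.
        byte[] attachment = new byte[4096];
        new Random(42).nextBytes(attachment);
        Files.write(Paths.get("sample.eml"),
                build(attachment).getBytes(StandardCharsets.US_ASCII));
    }
}
```

Running main writes sample.eml to the working directory; that file can then be dropped into the folder with the transformation rule.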
Expected: Only text should show up
Actual: Text and encoding keys are present. The attachment is visible as base64.
Note: A long-outstanding issue is that the html part of the email is included when transforming to plain text. So you will probably see html as part of the transformation.
What is the cause?
In EMLTransformer.java, rows 85-90, the mimetype is set to text/plain on the message. This destroys the message's actual multipart type, so when getContent is called the result is always a String and never an instanceof Multipart.
Just remove that and it works. It may have been needed with javax.mail 1.4.x, but it seems it is no longer needed with 1.5.x.
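
To make the effect concrete, here is a small dependency-free toy model of the behaviour described above. It is not the javax.mail API and not the actual EMLTransformer code: the class ContentTypeOverrideDemo and the method extractText are invented for illustration, and the parsing is deliberately naive. It only shows that once the declared type is text/plain the whole raw body is treated as text, whereas with the multipart type left in place the parts can be walked and the attachment skipped.

```java
import java.util.Locale;
import java.util.regex.Pattern;

// Hypothetical toy model; real code would rely on javax.mail to parse the message.
final class ContentTypeOverrideDemo {

    /** Returns the text a naive transformer would emit for the given declared type. */
    static String extractText(String declaredType, String body, String boundary) {
        if (!declaredType.toLowerCase(Locale.ROOT).startsWith("multipart/")) {
            // Declared as a single part: the raw body, base64 included, is the "text".
            return body;
        }
        StringBuilder text = new StringBuilder();
        for (String part : body.split("--" + Pattern.quote(boundary))) {
            int headerEnd = part.indexOf("\r\n\r\n");
            if (headerEnd < 0) {
                continue; // preamble or the closing "--" marker
            }
            String headers = part.substring(0, headerEnd).toLowerCase(Locale.ROOT);
            // Keep text parts only (html included, as in the note above); skip attachments.
            if (headers.contains("content-type: text/")) {
                text.append(part.substring(headerEnd + 4).trim()).append('\n');
            }
        }
        return text.toString();
    }

    public static void main(String[] args) {
        String body = "--b\r\nContent-Type: text/html\r\n\r\n<p>Hello</p>\r\n"
                + "--b\r\nContent-Type: application/octet-stream\r\n"
                + "Content-Transfer-Encoding: base64\r\n\r\nAQID\r\n--b--\r\n";
        // Type left as multipart: only the text part is returned.
        System.out.print(extractText("multipart/mixed", body, "b"));
        // Type overridden to text/plain: the base64 attachment leaks into the output.
        System.out.print(extractText("text/plain", body, "b"));
    }
}
```

In the real transformer the equivalent of the first branch is what happens after the mimetype is forced to text/plain, which is why removing that override restores the multipart handling.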
I will also look at making sure that a plain text transformation does not include the html part of the message, and at creating a transformer that can pick out the html part and use it if available.
Setting this as a regression as it used to work with 4.2.