Type: Hot Fix Request
Affects Version/s: 126.96.36.199, 5.2.2
Environment:
Index subsystem: noindex
Java version: 1.7.0_76
OS: Linux (Description: Red Hat Enterprise Linux Server release 5.5 (Tikanga))
Database type: Oracle
Java details: Java HotSpot(TM) 64-Bit Server VM
Alfresco version: 5.0.0 (.12 r104855-b57)
Database version: Oracle Database 11g Enterprise Edition Release 188.8.131.52.0 - 64bit Production
Application Server - Tomcat
Summary of the issue
The Title property (cm:title) contains garbled characters when an HTML document with Japanese characters in its <title> element is uploaded to Alfresco. The charset declared in the HTML document is "Shift_JIS" (all the content within the HTML document is in Japanese). The issue also occurs if no charset is declared in the document. If auditing is enabled, a new entry for the new document's Title is automatically created in the alf_prop_string_value table, and its string_value and string_end_lower columns both contain garbled characters. When looking at the audit data using the Audit web script - http://localhost:8080/alfresco/s/api/audit/query/alfresco-access?verbose=true&forward=false - we can see that the garbled characters are displayed something like - "
The issue can be reproduced internally on out-of-the-box installations of versions 220.127.116.11 and 5.2.2. Screenshots and sample files are attached.
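For illustration only, the sketch below shows one common way this kind of garbling can arise: bytes written in Shift_JIS are decoded with a different charset. Whether this is what actually happens inside Alfresco during upload is not confirmed by this report. The class name TitleMojibakeDemo and the sample title are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not Alfresco code.
class TitleMojibakeDemo {

    // Decodes raw bytes using the given charset.
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        Charset shiftJis = Charset.forName("Shift_JIS");

        // Sample Japanese title (katakana "tesuto"), written with escapes so the
        // source file encoding does not matter.
        String title = "\u30c6\u30b9\u30c8";
        byte[] raw = title.getBytes(shiftJis);

        // Decoding with the charset the bytes were written in round-trips cleanly.
        System.out.println(decode(raw, shiftJis).equals(title)); // true

        // Decoding the same bytes with another charset does not reproduce the title.
        System.out.println(decode(raw, StandardCharsets.ISO_8859_1).equals(title)); // false
        System.out.println(decode(raw, StandardCharsets.UTF_8).equals(title)); // false
    }
}
```

Comparing the stored cm:title value against the output of such mismatched decodings may help narrow down which charset was assumed during upload.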
Steps to reproduce
Optional step: Enable auditing by setting audit.enabled=true and audit.alfresco-access.enabled=true in alfresco-global.properties (the issue occurs with or without auditing enabled).
1) Create an HTML document with some Japanese content and some Japanese characters in the <title> element (use the attached sample files Test1.htm and Test2.htm).
2) Upload the HTML document to Alfresco Share (using either drag and drop or the Upload button).
3) The document is uploaded successfully, but the Title property (cm:title) is saved with garbled characters instead of the actual Japanese characters.
4) If auditing is enabled, a few entries are created automatically in the alf_prop_string_value table. One of these entries corresponds to the Title property value of the new document and contains garbled characters in the "string_value" and "string_end_lower" columns.
5) Check the audit data associated with this document using the API - http://localhost:8080/alfresco/s/api/audit/query/alfresco-access?verbose=true&forward=false. The title information will be displayed something like this - "
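If the attached sample files are not at hand, a comparable test document for step 1 could be generated along the following lines. This is a minimal sketch: the class name ShiftJisSampleWriter, the file names, and the sample title are hypothetical, and the markup is not claimed to match Test1.htm or Test2.htm.

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper for producing Shift_JIS test documents.
class ShiftJisSampleWriter {

    static final Charset SHIFT_JIS = Charset.forName("Shift_JIS");

    // Builds a small HTML page with the title in <title> and in the body.
    // The charset declaration is optional because the report says the issue
    // also occurs when no charset is declared.
    static String buildHtml(String title, boolean declareCharset) {
        StringBuilder html = new StringBuilder();
        html.append("<html>\n<head>\n");
        if (declareCharset) {
            html.append("<meta http-equiv=\"Content-Type\" ")
                .append("content=\"text/html; charset=Shift_JIS\">\n");
        }
        html.append("<title>").append(title).append("</title>\n");
        html.append("</head>\n<body>\n<p>").append(title).append("</p>\n");
        html.append("</body>\n</html>\n");
        return html.toString();
    }

    // Writes the page to disk encoded as Shift_JIS.
    static Path write(Path target, String title, boolean declareCharset) throws IOException {
        return Files.write(target, buildHtml(title, declareCharset).getBytes(SHIFT_JIS));
    }

    public static void main(String[] args) throws IOException {
        // Sample Japanese title, written with escapes to avoid source encoding issues.
        String title = "\u30c6\u30b9\u30c8\u6587\u66f8";
        write(Paths.get("sample-with-charset.htm"), title, true);
        write(Paths.get("sample-no-charset.htm"), title, false);
    }
}
```

Uploading both generated files covers the two cases described in the summary: a document that declares Shift_JIS and one that declares no charset.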
NOTE: A further problem arises when they try to migrate from one database to another (Oracle -> Postgres): the migration fails with 'ERROR: Invalid byte sequence with encoding method "UTF 8": 0xed 0xb0 0x8b incompatibility' because of the invalid values stored in the database. Once they change the string_end_lower and string_value columns to valid values, the migration succeeds.
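Reading the bytes in the error message as 0xED 0xB0 0x8B, they are the three-byte encoding of U+DC0B, a lone low surrogate, which a strict UTF-8 decoder rejects. The sketch below shows how exported values could be screened for this kind of data before a migration. It is an assumption-laden illustration: the class and method names are hypothetical, and it does not query the alf_prop_string_value table itself.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical screening helper, not part of Alfresco or any migration tool.
class Utf8ValidityCheck {

    // Returns true only if the bytes decode as well-formed UTF-8.
    static boolean isStrictUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    // Returns true if the string contains a surrogate code unit that is not
    // part of a valid high/low pair.
    static boolean hasUnpairedSurrogate(String value) {
        for (int i = 0; i < value.length(); i++) {
            char c = value.charAt(i);
            if (Character.isHighSurrogate(c)) {
                if (i + 1 < value.length() && Character.isLowSurrogate(value.charAt(i + 1))) {
                    i++; // valid pair, skip the low half
                } else {
                    return true;
                }
            } else if (Character.isLowSurrogate(c)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        byte[] reported = {(byte) 0xED, (byte) 0xB0, (byte) 0x8B};
        System.out.println(isStrictUtf8(reported)); // false
        System.out.println(hasUnpairedSurrogate("\uDC0B")); // true
    }
}
```

Values read from the string_value and string_end_lower columns over JDBC could be passed to hasUnpairedSurrogate to list the rows that need correcting before the migration is retried.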
Expected behavior
When an HTML document is uploaded, the Title property should contain the original Japanese characters, not garbled characters. If auditing is enabled, the entry created should contain the correct Japanese characters in the string_value and string_end_lower columns, not garbled characters.
Observed behavior
Garbled characters are saved in the Title property and in the string_value and string_end_lower columns of the alf_prop_string_value table.