1) Presentation

When migrating from SunOne DS to RedHat DS, the two servers may use different charsets. RedHat DS uses UTF-8, so the ldif import files it reads have to be UTF-8 as well. This matters most for binary (base64-encoded) values, whose underlying bytes must also be UTF-8 text.

2) How to find the charset in use

The locale command shows the charset used on your platform. It is recorded in the LANG environment variable:

locale
LANG=fr_FR.UTF-8

The discrepancy comes from the following:

1) The SunOne charset is iso_8859 (1 byte per character)
2) RH-DS only accepts UTF-8 (1 to 4 bytes per character; an accented letter such as ç takes 2 bytes)
3) Binary values are encoded on SunOne using the iso_8859 charset.

When reading an ldif import file coming from SunOne, the RH-DS import skips the affected entries and reports a syntax violation, with a message such as "violates attribute syntax".
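
To make the mismatch concrete, here is a minimal Java sketch. It assumes the SunOne data is ISO-8859-1, and the class and method names (CharsetMismatchDemo, isValidUtf8) are illustrative only. It encodes the same name in both charsets and runs a strict UTF-8 decoder over the ISO-8859-1 bytes. The decoder rejects them, and a comparable validity check is presumably what leads RH-DS to report the syntax violation.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class CharsetMismatchDemo {

    /** Returns true if the bytes form a well-formed UTF-8 sequence. */
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String name = "Fran\u00e7ois Rivat"; // \u00e7 is the c with cedilla
        byte[] latin1 = name.getBytes(StandardCharsets.ISO_8859_1);
        byte[] utf8 = name.getBytes(StandardCharsets.UTF_8);

        // The cedilla is one byte (0xE7) in ISO-8859-1 and two bytes (0xC3 0xA7) in UTF-8.
        System.out.println("ISO-8859-1 length: " + latin1.length);
        System.out.println("UTF-8 length:      " + utf8.length);
        System.out.println("ISO-8859-1 bytes valid as UTF-8: " + isValidUtf8(latin1));
        System.out.println("UTF-8 bytes valid as UTF-8:      " + isValidUtf8(utf8));
    }
}
```

The single byte 0xE7 looks like the start of a multi-byte UTF-8 sequence, but the byte that follows it is plain ASCII, so the sequence is malformed.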

3) How to fix it

The fix is fairly tricky. For each binary value, it consists of the following operations:
a) read the encoded binary value (using the iso8859 charset)
b) decode the binary value that has been read
c) re-encode the value in binary using UTF8
At a higher level, the binary attribute values and binary aci values that were encoded with the iso_8859 charset are replaced in the ldif file by their UTF8 equivalents.
With this transformation, the ldif import behaves correctly with RH-DS.

You can write such a transformation in Java, for example.
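
The three operations can be sketched for a single value as follows. This is an illustration only, not the parser used in this article: the class and method names (Base64Transcoder, latin1ToUtf8Base64) are made up, and it assumes the source charset is exactly ISO-8859-1. It relies only on java.util.Base64 and the standard charsets of the JDK.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class Base64Transcoder {

    /**
     * Takes a base64 value whose decoded bytes are ISO-8859-1 text and
     * returns the base64 form of the same text encoded as UTF-8.
     */
    static String latin1ToUtf8Base64(String base64Latin1) {
        // a) read the encoded value: base64 -> raw ISO-8859-1 bytes
        byte[] latin1Bytes = Base64.getDecoder().decode(base64Latin1);
        // b) decode the bytes into characters using the source charset
        String text = new String(latin1Bytes, StandardCharsets.ISO_8859_1);
        // c) re-encode the characters as UTF-8, then back to base64
        byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);
        return Base64.getEncoder().encodeToString(utf8Bytes);
    }

    public static void main(String[] args) {
        // Sample cn value taken from the example in this article
        System.out.println("cn:: " + latin1ToUtf8Base64("RnJhbudvaXMgUml2YXQ="));
        // prints: cn:: RnJhbsOnb2lzIFJpdmF0
    }
}
```

Note that this must only be applied to values that really contain text. Transcoding a genuinely binary value (a photo or a certificate, for instance) would corrupt it.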

Example
=========


1) ISO_8859 encoding
===================== 
The cn value is François Rivat. Encoded in iso_8859, its binary (base64) value is:
cn:: RnJhbudvaXMgUml2YXQ=


This corresponds to the following entry:

# entry-id: 10
dn: uid=frivat,ou=People,dc=example,dc=com
uid: frivat
givenName: francois
objectClass: top
objectClass: person
objectClass: organizationalPerson
objectClass: inetorgperson
sn: rivat
cn:: RnJhbudvaXMgUml2YXQ=
creatorsName: uid=admin,ou=Administrators,ou=TopologyManagement,o=NetscapeRoot
modifiersName: uid=admin,ou=Administrators,ou=TopologyManagement,o=NetscapeRoot
createTimestamp: 20181211091901Z
modifyTimestamp: 20181211091901Z
nsUniqueId: cb3a1a01-fd2511e8-ad1fc1a5-ec63facb


2) ldif import with ISO_8859 encoding 
======================================
The entry is rejected:
[13/Dec/2018:16:38:42.460516110 +0100] - WARN - import_producer - import userRoot: Skipping entry "uid=frivat,ou=People,dc=ovh,dc=net" which violates attribute syntax, ending line 168 of file "/tmp/test2_8859.ldif"

ldif2db -Z host-2389 -n userRoot -i /tmp/test2_8859.ldif
importing data ...
[13/Dec/2018:16:38:42.090160723 +0100] - INFO - ldbm_instance_config_cachememsize_set - force a minimal value 512000
[13/Dec/2018:16:38:42.109035769 +0100] - INFO - dblayer_instance_start - Import is running with nsslapd-db-private-import-mem on; No other process is allowed to access the database
[13/Dec/2018:16:38:42.113566568 +0100] - INFO - check_and_set_import_cache - pagesize: 4096, available bytes 6758842368, process usage 37781504 
[13/Dec/2018:16:38:42.121801510 +0100] - INFO - check_and_set_import_cache - Import allocates 2640172KB import cache.
[13/Dec/2018:16:38:42.251224607 +0100] - INFO - import_main_offline - import userRoot: Beginning import job...
[13/Dec/2018:16:38:42.254042367 +0100] - INFO - import_main_offline - import userRoot: Index buffering enabled with bucket size 100
[13/Dec/2018:16:38:42.456513364 +0100] - INFO - import_producer - import userRoot: Processing file "/tmp/test2_8859.ldif"
[13/Dec/2018:16:38:42.460516110 +0100] - WARN - import_producer - import userRoot: Skipping entry "uid=frivat,ou=People,dc=ovh,dc=net" which violates attribute syntax, ending line 168 of file "/tmp/test2_8859.ldif"
[13/Dec/2018:16:38:42.463281395 +0100] - INFO - import_producer - import userRoot: Finished scanning file "/tmp/test2_8859.ldif" (9 entries)
[13/Dec/2018:16:38:42.959570547 +0100] - INFO - import_monitor_threads - import userRoot: Workers finished; cleaning up...
[13/Dec/2018:16:38:43.162926039 +0100] - INFO - import_monitor_threads - import userRoot: Workers cleaned up.
[13/Dec/2018:16:38:43.165695678 +0100] - INFO - import_main_offline - import userRoot: Cleaning up producer thread...
[13/Dec/2018:16:38:43.168191021 +0100] - INFO - import_main_offline - import userRoot: Indexing complete. Post-processing...
[13/Dec/2018:16:38:43.170430668 +0100] - INFO - import_main_offline - import userRoot: Generating numsubordinates (this may take several minutes to complete)...
[13/Dec/2018:16:38:43.176801092 +0100] - INFO - import_main_offline - import userRoot: Generating numSubordinates complete.
[13/Dec/2018:16:38:43.179294550 +0100] - INFO - ldbm_get_nonleaf_ids - import userRoot: Gathering ancestorid non-leaf IDs...
[13/Dec/2018:16:38:43.181680973 +0100] - INFO - ldbm_get_nonleaf_ids - import userRoot: Finished gathering ancestorid non-leaf IDs.
[13/Dec/2018:16:38:43.190312574 +0100] - INFO - ldbm_ancestorid_new_idl_create_index - import userRoot: Creating ancestorid index (new idl)...
[13/Dec/2018:16:38:43.193173380 +0100] - INFO - ldbm_ancestorid_new_idl_create_index - import userRoot: Created ancestorid index (new idl).
[13/Dec/2018:16:38:43.195769472 +0100] - INFO - import_main_offline - import userRoot: Flushing caches...
[13/Dec/2018:16:38:43.198300399 +0100] - INFO - import_main_offline - import userRoot: Closing files...
[13/Dec/2018:16:38:43.246529362 +0100] - INFO - dblayer_pre_close - All database threads now stopped
[13/Dec/2018:16:38:43.248973794 +0100] - INFO - import_main_offline - import userRoot: Import complete. Processed 9 entries (1 were skipped) in 1 seconds. (9.00 entries/sec)


3) Transforming ldif Iso_8859 format to UTF8 format
===================================================
We run the Java parser to transform the values.
Binary values are converted from the iso_8859 charset to the UTF8 charset.


The binary encoding for François Rivat is transformed as follows:
ISO_8859
cn:: RnJhbudvaXMgUml2YXQ=

UTF8 
cn:: RnJhbsOnb2lzIFJpdmF0

4) Running the parser
=====================
The binary parser is run as follows:

java parsebinaryldif test2_8859.ldif test2_utf8_bin.ldif
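
The source of parsebinaryldif is not shown here. As a rough idea of what a file-level pass could look like, here is a hypothetical sketch. Everything in it is an assumption made for illustration: the class name LdifTranscodeSketch, the TEXT_ATTRS allow-list of attributes treated as ISO-8859-1 text, and the choice to unfold continuation lines without re-folding them. It does not handle attribute options (such as cn;lang-fr) or aci values.

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Base64;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class LdifTranscodeSketch {

    // Assumption: only these attributes hold ISO-8859-1 text. Truly binary
    // attributes (jpegPhoto, userCertificate, ...) must not be transcoded.
    static final Set<String> TEXT_ATTRS = Set.of("cn", "sn", "givenname", "description");

    /** Joins LDIF continuation lines (starting with one space) onto the previous line. */
    static List<String> unfold(List<String> lines) {
        List<String> out = new ArrayList<>();
        for (String line : lines) {
            if (line.startsWith(" ") && !out.isEmpty()) {
                int last = out.size() - 1;
                out.set(last, out.get(last) + line.substring(1));
            } else {
                out.add(line);
            }
        }
        return out;
    }

    /** Rewrites "attr:: base64" lines for the listed attributes; other lines pass through. */
    static List<String> transcode(List<String> lines) {
        List<String> out = new ArrayList<>();
        for (String line : unfold(lines)) {
            int colon = line.indexOf(':');
            boolean isBase64 = colon > 0 && line.startsWith("::", colon);
            String attr = colon > 0 ? line.substring(0, colon).toLowerCase(Locale.ROOT) : "";
            if (isBase64 && TEXT_ATTRS.contains(attr)) {
                byte[] latin1 = Base64.getDecoder().decode(line.substring(colon + 2).trim());
                String text = new String(latin1, StandardCharsets.ISO_8859_1);
                byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
                out.add(line.substring(0, colon) + ":: " + Base64.getEncoder().encodeToString(utf8));
            } else {
                out.add(line);
            }
        }
        return out;
    }

    public static void main(String[] args) throws Exception {
        // args[0] = input ldif (read as ISO-8859-1), args[1] = output ldif (written as UTF-8)
        List<String> in = Files.readAllLines(Path.of(args[0]), StandardCharsets.ISO_8859_1);
        Files.write(Path.of(args[1]), transcode(in), StandardCharsets.UTF_8);
    }
}
```

With Java 11 or later, such a sketch could be run directly from source, for example: java LdifTranscodeSketch.java test2_8859.ldif test2_utf8_bin.ldif. An allow-list is used rather than transcoding every base64 value because base64 in ldif can also carry real binary data, which would be damaged by a charset conversion.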

5) Ldif File (UTF8 format)
==========================
After running the parser, the binary cn value has been updated with the new encoding:


# entry-id: 10
dn: uid=frivat,ou=People,dc=example,dc=com
uid: frivat
givenName: francois
objectClass: top
objectClass: person
objectClass: organizationalPerson
objectClass: inetorgperson
sn: rivat
cn:: RnJhbsOnb2lzIFJpdmF0
creatorsName: uid=admin,ou=Administrators,ou=TopologyManagement,o=NetscapeRoot
modifiersName: uid=admin,ou=Administrators,ou=TopologyManagement,o=NetscapeRoot
createTimestamp: 20181211091901Z
modifyTimestamp: 20181211091901Z
nsUniqueId: cb3a1a01-fd2511e8-ad1fc1a5-ec63facb

6) Successful ldif import - UTF8 format
========================================
Now that the binary value has been fixed, the import completes without skipping any entry:

ldif2db -Z host-2389 -n userRoot -i /tmp/test2_utf8_bin.ldif
importing data ...
[13/Dec/2018:16:40:26.885890518 +0100] - INFO - ldbm_instance_config_cachememsize_set - force a minimal value 512000
[13/Dec/2018:16:40:26.909416917 +0100] - INFO - dblayer_instance_start - Import is running with nsslapd-db-private-import-mem on; No other process is allowed to access the database
[13/Dec/2018:16:40:26.913494377 +0100] - INFO - check_and_set_import_cache - pagesize: 4096, available bytes 6745919488, process usage 38993920 
[13/Dec/2018:16:40:26.915705239 +0100] - INFO - check_and_set_import_cache - Import allocates 2635124KB import cache.
[13/Dec/2018:16:40:27.038299523 +0100] - INFO - import_main_offline - import userRoot: Beginning import job...
[13/Dec/2018:16:40:27.040931875 +0100] - INFO - import_main_offline - import userRoot: Index buffering enabled with bucket size 100
[13/Dec/2018:16:40:27.242589479 +0100] - INFO - import_producer - import userRoot: Processing file "/tmp/test2_utf8_bin.ldif"
[13/Dec/2018:16:40:27.247774026 +0100] - INFO - import_producer - import userRoot: Finished scanning file "/tmp/test2_utf8_bin.ldif" (10 entries)
[13/Dec/2018:16:40:27.745407174 +0100] - INFO - import_monitor_threads - import userRoot: Workers finished; cleaning up...
[13/Dec/2018:16:40:27.950110230 +0100] - INFO - import_monitor_threads - import userRoot: Workers cleaned up.
[13/Dec/2018:16:40:27.954113625 +0100] - INFO - import_main_offline - import userRoot: Cleaning up producer thread...
[13/Dec/2018:16:40:27.957192674 +0100] - INFO - import_main_offline - import userRoot: Indexing complete. Post-processing...
[13/Dec/2018:16:40:27.959937709 +0100] - INFO - import_main_offline - import userRoot: Generating numsubordinates (this may take several minutes to complete)...
[13/Dec/2018:16:40:27.967695638 +0100] - INFO - import_main_offline - import userRoot: Generating numSubordinates complete.
[13/Dec/2018:16:40:27.971176394 +0100] - INFO - ldbm_get_nonleaf_ids - import userRoot: Gathering ancestorid non-leaf IDs...
[13/Dec/2018:16:40:27.973927796 +0100] - INFO - ldbm_get_nonleaf_ids - import userRoot: Finished gathering ancestorid non-leaf IDs.
[13/Dec/2018:16:40:27.986313595 +0100] - INFO - ldbm_ancestorid_new_idl_create_index - import userRoot: Creating ancestorid index (new idl)...
[13/Dec/2018:16:40:27.990063161 +0100] - INFO - ldbm_ancestorid_new_idl_create_index - import userRoot: Created ancestorid index (new idl).
[13/Dec/2018:16:40:27.992770265 +0100] - INFO - import_main_offline - import userRoot: Flushing caches...
[13/Dec/2018:16:40:27.995290425 +0100] - INFO - import_main_offline - import userRoot: Closing files...
[13/Dec/2018:16:40:28.062282718 +0100] - INFO - dblayer_pre_close - All database threads now stopped
[13/Dec/2018:16:40:28.065680271 +0100] - INFO - import_main_offline - import userRoot: Import complete. Processed 10 entries in 1 seconds. (10.00 entries/sec)

4) Using a parser

We developed a parser in-house to deal with these charset issues. If you face this problem and would like assistance deploying the parser, don't hesitate to contact us.

janua