Innodata Isogen I18N Support Library

Last Updated: 5 Oct 2004

Copyright (©) 2004, Innodata Isogen

Document Language: en



Overview

The Innodata Isogen Internationalization (I18N) Support Library is a collection of Java classes that provide fundamental services to document processors for localizing and internationalizing the rendered form of XML documents.

The services provided include:

  • Language- and locale-specific static ("generated") text strings. Processors only need to know which database key and language to ask for; the system then returns the appropriate language-specific string. Database keys can be element type names (e.g., "Chapter") or arbitrary strings (e.g., "#index").
  • Language-specific comparators for language- and locale-appropriate lexical sorting of strings (for example, with the xsl:sort instruction through Saxon). The generic "getComparator" functions can be bound to any implementation of the Java Comparator interface. The default Comparator implementation is the one provided by the ICU4J package (http://oss.software.ibm.com/icu4j/).

  • Back-of-the-book index configuration management, making it easy to define and use language-specific index grouping and sorting.
  • Functionality exposed as XSLT extension functions through Saxon's extension API.
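As a rough illustration of what locale-appropriate sorting means, the following sketch uses the JDK's own java.text.Collator. It does not use the library's getComparator functions or ICU4J (see the Java API docs for those), and the class and method names are hypothetical.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical helper for illustration; not part of the I18N Support Library.
class LocaleSortSketch {

    // Returns a sorted copy of the terms, ordered by the JDK collation rules for the locale.
    static List<String> sortForLocale(List<String> terms, Locale locale) {
        Collator collator = Collator.getInstance(locale);
        List<String> sorted = new ArrayList<>(terms);
        sorted.sort(collator); // Collator implements Comparator<Object>
        return sorted;
    }

    public static void main(String[] args) {
        List<String> terms = Arrays.asList("Zoo", "apple", "Banana");
        // A plain String sort would put the uppercase-initial words first;
        // the collator orders by letter before considering case.
        System.out.println(sortForLocale(terms, Locale.ENGLISH));
    }
}
```

Any object that implements Comparator could be substituted for the Collator here, which is the same flexibility the "getComparator" binding is described as offering.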

The core functions (I18nService) are processor independent and can be bound to any specific processor through a relatively thin binding layer, as demonstrated by the provided Saxoni18nService class. For example, the I18nService can be bound to Epic Editor through its Java API, to other Java-based XSLT processors, to Java-based user interfaces, or to DOM-based XML processors.


Configuration Files

The I18N Support Library uses two configuration files, one for static text and one for index configuration. Both are XML documents. As far as the core library is concerned, these files can be anywhere. However, the Saxon extension class requires that the files be in specific locations relative to the root of the "i18n home" directory (which is set using the "com.innodata.i18n.home" Java system property).

For the Saxon extensions, the configuration files must be in the following directories:

static_text_database.xml
{com.innodata.i18n.home}/config/static_text/
botb_index_rules.xml
{com.innodata.i18n.home}/config/botb_index_rules/

This restriction is a side effect of the fact that there's no direct way to pass parameters to the Saxon extension library (except through Java system properties set on the Java command line). If more flexibility is needed, it would be possible to define additional system properties for specifying the exact locations of these configuration files.
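The directory layout above can be sketched in Java as follows. This is only an illustration of how the two paths derive from the "com.innodata.i18n.home" property; the class and method names are hypothetical, and the Saxon extension class may resolve the files differently.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper for illustration; not part of the I18N Support Library.
class ConfigLocationsSketch {

    static final String HOME_PROPERTY = "com.innodata.i18n.home";

    // {home}/config/static_text/static_text_database.xml
    static Path staticTextDatabase(String home) {
        return Paths.get(home, "config", "static_text", "static_text_database.xml");
    }

    // {home}/config/botb_index_rules/botb_index_rules.xml
    static Path botbIndexRules(String home) {
        return Paths.get(home, "config", "botb_index_rules", "botb_index_rules.xml");
    }

    public static void main(String[] args) {
        String home = System.getProperty(HOME_PROPERTY);
        if (home == null) {
            System.err.println("Set -D" + HOME_PROPERTY + "=<i18n home directory> on the Java command line");
            return;
        }
        System.out.println(staticTextDatabase(home));
        System.out.println(botbIndexRules(home));
    }
}
```

The property would be supplied on the Java command line with the standard -D option, for example -Dcom.innodata.i18n.home=/path/to/i18n_home (the path shown is a placeholder).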

Static Text Database

The static text database document consists of two main parts: the "contexts" and the "attribute maps". The contexts are primarily intended to map element types to the text generated before them and, if needed, the text generated after them. However, the contexts can also include entries with arbitrary string keys, for example, for strings that have no associated element type. The attribute maps map values of enumerated attributes to specific strings.

The static text database configuration vocabulary is bound to the XML namespace URI "http://www.innodata-isogen.com/vocabularies/i18n_support/static_text_database".

Contexts

The <contexts_common> element contains the context entries, consisting of one or more <context> elements. Each <context> element has a <lookup_key>, which contains the string by which the context is looked up. This can be anything, but values that are the same as element type names can be accessed using the getGeneratedText functions that take an element as one of their arguments. By convention, non-element-type name keys are prefixed with "#" to ensure that they do not conflict with any element type names (XML names cannot start with "#").

Following <lookup_key> is one <text_before> and one <text_after> element. Each of these is either empty or has a <default_item> element and zero or more <item> elements.

The <default_item> element defines the default value to be used when there is no item for a specific language. This can either be a useful value or a string like "{toc not translated}", which provides a clear visual indicator of a missing translation.

Each <item> element provides the translation for a single language, specified using the xml:lang attribute.

A typical context element is:

<context>
  <lookup_key>#full_stop</lookup_key>
  <text_before>
    <default_item>.</default_item>
    <item xml:lang="zh-CN">&#x3002;</item>
    <item xml:lang="zh-HK">&#x3002;</item>
    <item xml:lang="zh-TW">&#x3002;</item>
  </text_before>
  <text_after/>
</context>

This example defines the character to use for full stop (period) in various languages. This might be used in constructing cross reference strings, for example.
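The lookup behavior described above (use the <item> for the requested language, otherwise fall back to <default_item>) can be sketched with the JDK's DOM parser. This is not the library's implementation: the class and method names are hypothetical, the sample is an abbreviated copy of the context shown above, and the parser is left non-namespace-aware for brevity.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper for illustration; not part of the I18N Support Library.
class StaticTextLookupSketch {

    // Abbreviated copy of the "#full_stop" context.
    static final String SAMPLE =
          "<context>"
        + "<lookup_key>#full_stop</lookup_key>"
        + "<text_before>"
        + "<default_item>.</default_item>"
        + "<item xml:lang=\"zh-CN\">&#x3002;</item>"
        + "</text_before>"
        + "<text_after/>"
        + "</context>";

    // Returns the text-before string for the key and language, the default item when
    // the language has no entry, or null when no context has the given lookup key.
    static String textBefore(String xml, String key, String lang) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            NodeList contexts = doc.getElementsByTagName("context");
            for (int i = 0; i < contexts.getLength(); i++) {
                Element context = (Element) contexts.item(i);
                Element lookupKey = (Element) context.getElementsByTagName("lookup_key").item(0);
                if (lookupKey == null || !key.equals(lookupKey.getTextContent().trim())) {
                    continue;
                }
                Element before = (Element) context.getElementsByTagName("text_before").item(0);
                if (before == null) {
                    return "";
                }
                NodeList items = before.getElementsByTagName("item");
                for (int j = 0; j < items.getLength(); j++) {
                    Element item = (Element) items.item(j);
                    if (lang.equals(item.getAttribute("xml:lang"))) {
                        return item.getTextContent();
                    }
                }
                NodeList defaults = before.getElementsByTagName("default_item");
                return defaults.getLength() > 0 ? defaults.item(0).getTextContent() : "";
            }
            return null;
        } catch (Exception e) {
            throw new IllegalStateException("Could not read static text XML", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(textBefore(SAMPLE, "#full_stop", "zh-CN")); // ideographic full stop
        System.out.println(textBefore(SAMPLE, "#full_stop", "fr"));    // falls back to "."
    }
}
```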

Attribute Maps

TBD: document the attribute_map elements.

Back-of-the-book Index Rules

The back-of-the-book (botb) index rules configuration file lets you define the alphabetic groups for each language, as well as defining the collation (sorting) rules for the language, if necessary. Grouping rules can be defined by enumerating each character or character sequence for each group or, for languages with lots of characters, such as ideographic languages, you can define groups by specifying the first member of each group (and the last member of the last group).

The back-of-the-book index configuration vocabulary is bound to the XML namespace URI "http://www.innodata-isogen.com/vocabularies/i18n_support/botb_index_config".

The element types involved are:

botb_index_rules
The root of the overall configuration file.
metadata
Contains descriptive metadata for the configuration file itself (author, dates, revision history, description, etc.).
index_config
Defines the index configuration for a single language. It does not matter what order the index configurations occur within the overall botb index rules document.
national_language
Specifies the language and locale the index configuration applies to. The content of the element is a language code. This value will be matched against language codes used in documents, so the rules for the language code syntax are determined by the rules for the documents to which the configuration applies. Normal practice is to use ISO 639 two-character language codes, possibly with an ISO 3166 country code, e.g. "pt" or "zh-CN".
description
Contains a description of the index configuration.
collation_spec
Defines the specific character-by-character collation rules, if necessary. The collation rule specifications use the Java-defined RuleBasedCollator syntax.
sort_method
Controls how the membership of each group is determined (by members explicitly listed or by sorting between keys). Also controls how English words are sorted relative to the non-English words.
group_definitions
Contains the definitions of each of the groups within the index (e.g., "A", "B", "C"), each given by a term_group element. Each group must have at least a group key. If the sort method is "group by members", it must also contain an explicit list of group member characters (group_members). Groups can also have a group label that is different from the group key and, if necessary, a group sort key that is different from either the group label or group key. If only the group key is specified, it is also used as the group label and group sort key. If a group label is specified, it is used as the group sort key if no explicit sort key is defined. Note that any character that does not sort into one of the defined groups will be grouped into the "Symbol/Numeric" group (group key "#NUMERIC").
term_group
Defines a single group within the index.
group_key
Contains a single character (or a character sequence that is treated as a single character, such as "CH" in Spanish) that acts as the key for the group. The group key is used to distinguish each group within the index and must be unique. For alphabetic languages the group key is usually the same character used as the group label (e.g., "A" for the "A" group). For ideographic or syllabic languages, such as Chinese and Korean, the group key is the first character that sorts into that group.
group_label
Optional specification of the display label to use for the group if it is different from the group key. For most alphabetic languages, the group key and group label are the same, but for most ideographic and syllabic languages the group key and group label are different. For example, in Traditional Chinese (zh-TW), the group label is the Chinese characters for "n-stroke characters", where "n" is the number of strokes in the character ("one-stroke characters", "two-stroke characters", etc.). In Simplified Chinese (zh-CN), characters are grouped by their Pin-Yin transliteration, which uses Latin characters, so the group labels are "A", "B", "C", etc.
group_sort_key
Defines the group's sort key if it is different from the group key. For example, in Simplified Chinese, groups are sorted alphabetically using the Latin alphabet but the group keys are the actual characters that start each group.
group_members
Contains one or more char_or_seq elements to enumerate the characters within the group. The group_members element should not be used if the sort method is "sort between keys", except for the last group, which must specify the last_member element to indicate the last member of the last group.
char_or_seq
Contains one or more characters that are to be sorted as a single unit. For example, in English each char_or_seq element would contain one character, one element for each lowercase and uppercase letter. For languages like Spanish, where two or more characters are treated as a single character for sorting and grouping, you would specify multiple characters within a single char_or_seq element, e.g., <char_or_seq>ch</char_or_seq>.
last_member
Within group_members, identifies the last member of the last group for indexes that use the "sort between keys" sort method (e.g., the ideographic languages).
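The defaulting rules for group labels and group sort keys given under group_definitions can be summarized in a few lines of Java. This is a sketch of the rules as stated in this document, not the library's own data model; the class name and its members are hypothetical.

```java
// Hypothetical value class for illustration; not part of the I18N Support Library.
class TermGroupSketch {

    private final String groupKey;     // required
    private final String groupLabel;   // optional, may be null
    private final String groupSortKey; // optional, may be null

    TermGroupSketch(String groupKey, String groupLabel, String groupSortKey) {
        if (groupKey == null) {
            throw new IllegalArgumentException("Each group must have at least a group key");
        }
        this.groupKey = groupKey;
        this.groupLabel = groupLabel;
        this.groupSortKey = groupSortKey;
    }

    String key() {
        return groupKey;
    }

    // The group key doubles as the label when no label is specified.
    String label() {
        return groupLabel != null ? groupLabel : groupKey;
    }

    // An explicit sort key wins; otherwise the label (or, failing that, the key) is used.
    String sortKey() {
        return groupSortKey != null ? groupSortKey : label();
    }
}
```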

The sample index configuration document provides examples of index configurations for alphabetic, syllabic (Korean), and ideographic languages, showing how to configure each type of language. The configurations for these languages are discussed in more detail below.

NOTE: The index configuration mechanism has been implemented to use a single XML document instance to hold the configurations for all the languages needed. If you find it convenient to put each language's configuration in a separate file, you can use normal XML external parsed entities to do this. While it hasn't been done, it would not be difficult to implement an XInclude-style inclusion mechanism if there is a strong requirement for it.

English (en) Index Configuration

The English index configuration is the simplest configuration: it requires nothing more than a set of groups, each consisting of two single-character char_or_seq elements, one for the lowercase form of a letter and one for the uppercase form. There is no special collation specification or sorting method. The English index configuration must always be present. It is used as the fallback configuration for any language for which no explicit configuration is found, and for grouping and sorting English words. The current code base assumes that words not in the document's base national language are in English; it does not provide for, say, a Chinese document that contains Spanish words that need to sort according to the Spanish index rules.
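The fallback behavior can be pictured as a simple map lookup. The sketch below is an assumption about the general shape of that logic, not the library's code; the class and method names are hypothetical, and the map values stand in for parsed index_config elements.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper for illustration; not part of the I18N Support Library.
class IndexConfigFallbackSketch {

    static final String FALLBACK_LANGUAGE = "en";

    // Returns the configuration for the exact language code, or the English one if none exists.
    static <T> T configFor(Map<String, T> configsByLanguage, String lang) {
        T config = configsByLanguage.get(lang);
        if (config != null) {
            return config;
        }
        T fallback = configsByLanguage.get(FALLBACK_LANGUAGE);
        if (fallback == null) {
            throw new IllegalStateException("The English (en) index configuration must always be present");
        }
        return fallback;
    }

    public static void main(String[] args) {
        Map<String, String> configs = new HashMap<>();
        configs.put("en", "English rules");
        configs.put("es", "Spanish rules");
        System.out.println(configFor(configs, "es")); // Spanish rules
        System.out.println(configFor(configs, "fr")); // English rules (fallback)
    }
}
```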

The English index configuration can be used as the base for any other Latin-based language: copy the index_config element, change the national_language value, and adjust the groups as necessary.

Spanish (es) Index Configuration

The Spanish index configuration demonstrates using char_or_seq to define a group as having a multi-character sequence as a member. In Spanish, "ch" is treated as a single character for the purposes of grouping and sorting, so the Spanish configuration differs from the English in having this additional entry:

<term_group>
  <group_key>CH</group_key>
  <group_members>
    <char_or_seq>ch</char_or_seq>
    <char_or_seq>CH</char_or_seq>
  </group_members>
</term_group>

Note that it is not necessary to define all the possible case combinations of the character group (e.g., "Ch", "cH"), just the all-lowercase and all-uppercase versions.

For grouping and sorting, this definition causes all words starting with "ch" to be grouped and sorted after all words starting with "c" followed by any character other than "h".

Note also that this treatment of "ch" must be defined in the Java collation rules for the language. In the case of Spanish (and all or most other European and East European languages), the appropriate collation rules are provided by the standard Java distribution.
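To make the "group by members" idea concrete, the sketch below assigns a term to the group whose member is the longest prefix of the term, so "ch" wins over "c". It is an illustration under assumptions, not the library's algorithm: the class and method names are hypothetical, the group table is a three-group excerpt, and case-insensitive matching is assumed because only the all-lowercase and all-uppercase forms need to be listed.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper for illustration; not part of the I18N Support Library.
class GroupByMembersSketch {

    static final String SYMBOL_GROUP = "#NUMERIC";

    // Excerpt of a Spanish-style configuration: group key -> char_or_seq members.
    static Map<String, List<String>> spanishExcerpt() {
        Map<String, List<String>> groups = new LinkedHashMap<>();
        groups.put("C", Arrays.asList("c", "C"));
        groups.put("CH", Arrays.asList("ch", "CH"));
        groups.put("D", Arrays.asList("d", "D"));
        return groups;
    }

    // Picks the group with the longest member that prefixes the term (ignoring case).
    // Terms matching no member fall into the Symbol/Numeric group.
    static String groupFor(String term, Map<String, List<String>> groups) {
        String lowered = term.toLowerCase(Locale.ROOT);
        String bestKey = SYMBOL_GROUP;
        int bestLength = 0;
        for (Map.Entry<String, List<String>> group : groups.entrySet()) {
            for (String member : group.getValue()) {
                String candidate = member.toLowerCase(Locale.ROOT);
                if (candidate.length() > bestLength && lowered.startsWith(candidate)) {
                    bestKey = group.getKey();
                    bestLength = candidate.length();
                }
            }
        }
        return bestKey;
    }

    public static void main(String[] args) {
        Map<String, List<String>> groups = spanishExcerpt();
        System.out.println(groupFor("casa", groups));    // C
        System.out.println(groupFor("Chile", groups));   // CH
        System.out.println(groupFor("42", groups));      // #NUMERIC
    }
}
```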

Simplified Chinese (zh-CN) Index Configuration

The Simplified Chinese index configuration demonstrates several features. Simplified Chinese, as an ideographic language, uses at least 40,000 characters, grouped and sorted alphabetically according to their Pin-Yin transliteration. For example, the character for "horse" is transliterated as "ma" (ignoring tone indicators) in Pin-Yin. Thus, words starting with this character will be grouped under "M" and sorted before any character that transliterates as "mi".

Because of the large number of characters it would be impractical (but not impossible) and inefficient to enumerate the members of each group. Instead, Chinese (and all the other ideographic languages) use the "sort between keys" sort strategy, as indicated by the <sort_between_keys> element within the <sort_method> element.

In addition, the editorial style for Simplified Chinese is that English words sort before Chinese words, so that the English word "math" would sort before all Chinese characters within the "M" group. This is indicated by the <sort_english_before> element within <sort_method>. In most non-Latin languages English words are sorted after the words in the main language, so that is the default.

Each group has a group key, which is the first Chinese character within that group, and a group label, which is the Latin character label for that group ("A", "B", "C", etc.). Because the group label is used as the group sort key when no explicit sort key is defined, there is no need to specify a separate group sort key.

Each group has a <group_members> element, but it is empty for all but the last group. For the last group, the <group_members> element contains a <last_member> element that contains the last Chinese character member of the last group. Without this specification, any characters that are defined as sorting after the ideographs would also be sorted into the last group.
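The "sort between keys" strategy can be sketched as follows: a term belongs to the last group whose key does not sort after the term's first character, and anything before the first key or after the last member falls into the Symbol/Numeric group. This is an assumed reading of the mechanism, not the library's code. The names are hypothetical, and the usage example substitutes Latin placeholder keys with natural string order so that it runs without custom collation rules.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper for illustration; not part of the I18N Support Library.
class SortBetweenKeysSketch {

    static final String SYMBOL_GROUP = "#NUMERIC";

    // groupKeys holds the first member of each group, already in collation order;
    // lastMember is the last member of the last group.
    static String groupFor(String term, List<String> groupKeys, String lastMember,
                           Comparator<String> collation) {
        if (term.isEmpty() || groupKeys.isEmpty()) {
            return SYMBOL_GROUP;
        }
        // Compare on the first character (code point) of the term.
        String first = term.substring(0, term.offsetByCodePoints(0, 1));
        if (collation.compare(first, groupKeys.get(0)) < 0
                || collation.compare(first, lastMember) > 0) {
            return SYMBOL_GROUP;
        }
        String match = groupKeys.get(0);
        for (String key : groupKeys) {
            if (collation.compare(key, first) <= 0) {
                match = key; // this key still sorts at or before the character
            } else {
                break;       // later keys sort after the character
            }
        }
        return match;
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList("a", "h", "p"); // placeholder group keys
        Comparator<String> order = Comparator.naturalOrder();
        System.out.println(groupFor("kiwi", keys, "z", order)); // h
        System.out.println(groupFor("1st", keys, "z", order));  // #NUMERIC
    }
}
```

For a real ideographic configuration, the Comparator would be the collator for the language and the keys would be the group_key characters from the configuration.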

Finally, the built-in Java collation rules for Simplified Chinese in Java 1.3 and 1.4 are not correct. Therefore, custom collation rules are used, as specified with the <java_collation_spec> element within the <collation_spec> element. The <java_collation_spec> element contains an <include_collation_spec> element, whose content is a path to a file containing a Java RuleBasedCollator collation rule specification. If this path is relative, it is relative to the location of the index configuration document.
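As a small illustration of the two mechanics involved, the sketch below resolves an include path relative to the index configuration document and builds a java.text.RuleBasedCollator from a rule string. The class and method names are hypothetical, the rule string and file names are toy placeholders rather than the shipped Simplified Chinese rules, and how the library itself reads the rules file is not shown.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.text.ParseException;
import java.text.RuleBasedCollator;

// Hypothetical helper for illustration; not part of the I18N Support Library.
class CollationSpecSketch {

    // A relative include path is resolved against the directory of the configuration
    // document; an absolute include path is returned unchanged.
    static Path resolveRulesPath(Path indexConfigDocument, String includePath) {
        return indexConfigDocument.resolveSibling(includePath);
    }

    // Builds a collator from a rule string in RuleBasedCollator syntax.
    static RuleBasedCollator collatorFromRules(String rules) {
        try {
            return new RuleBasedCollator(rules);
        } catch (ParseException e) {
            throw new IllegalArgumentException("Invalid collation rules", e);
        }
    }

    public static void main(String[] args) {
        Path config = Paths.get("config", "botb_index_rules", "botb_index_rules.xml");
        System.out.println(resolveRulesPath(config, "custom_rules.txt")); // placeholder file name

        // Toy rules that reverse the usual order of three letters.
        RuleBasedCollator reversed = collatorFromRules("< c < b < a");
        System.out.println(reversed.compare("c", "a") < 0); // true
    }
}
```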

The Simplified Chinese collation rules provided with the I18N Support package were created using the Unicode database, which provides the Pin-Yin transliteration for most characters (the "unihan.txt" file, available from the Unicode Consortium Web site). However, there is no single authority for transliterations, so different readings or authorities may result in different collation rules. The most precise collation rules would require the use of an agreed-upon, authoritative Simplified Chinese dictionary and significant human effort to develop and verify.

Traditional Chinese (zh-TW) Index Configuration

Traditional Chinese indexes are sorted and grouped by character stroke count and then by radical (the base graphical element within a character). The group labels are the characters for "one-stroke characters", "two-stroke characters", and so on. Thus, where for Simplified Chinese the group label and sort key are the same, here the group label and sort key are different. The sort key is the same as the group key, so there is no need to specify a separate group sort key.


Installation

To install the I18N Support Library, simply unpack the package, creating the subdirectories. The i18n_support.jar file includes a manifest that automatically adds the third-party libraries in the lib/ directory to the Java class path. As long as the relative directory layout is maintained, you do not need to set or extend the Java CLASSPATH environment variable or command-line parameter to include the third-party jars, only the i18n_support.jar itself.

The configuration files can be in any location, although the Saxon extension class (Saxoni18nService) expects them to be in config/ below the root of the distribution (the com.innodata.xml.i18nhome Java system property). If you change the organization of the configuration files you must update the Java source to reflect those changes.


Using the Saxon Extensions

To use the Saxon extensions you must declare a namespace prefix for the extension functions and bind it to the com.isogen.saxoni18n.Saxoni18nService class, e.g.:

<xsl:stylesheet
   xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
   xmlns:isoi18n="java:com.isogen.saxoni18n.Saxoni18nService"
>

You can then use the static methods defined in the Saxoni18nService class as XSLT extension functions, e.g.:

<xsl:value-of select="isoi18n:getGeneratedTextForKeyBefore('#toc', $currentLang)"/>

See the Java API docs for the details of the extension functions provided.

