
Defining Custom Text Transforms: the @text-transform rule

The general form of an @text-transform at-rule is:

@text-transform <transform-name>
{ [ descriptor: value; ]+ }

The descriptors express the conversion from certain characters to other characters, using different mechanisms to specify the source and target characters. If several descriptors are used, the resulting transform is obtained by applying them all successively, in the order in which they appear in the @text-transform rule.

ISSUE: should “@text-transform foo {…}” be used as “text-transform: custom(foo);” or as “text-transform: foo;”? I would rather do the same as counter-styles in lists, but let's discuss.

Example:

The following two transforms are identical.

@text-transform abcdef1
{
    convert: "abc" to "def";
}

@text-transform abcdef2
{
    convert: "a" to "d";
    convert: "b" to "e";
    convert: "c" to "f";
}
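
The equivalence of these two transforms can be checked with a small model of the successive-application rule. The following Python sketch is only an illustration: the helper names convert and make_transform are hypothetical, and characters are approximated by Unicode code points, which the draft leaves open.

```python
def convert(src: str, dst: str):
    """Model one `convert` descriptor as a function on strings.

    Assumption: a "character" is a single code point.
    """
    table = {ord(s): d for s, d in zip(src, dst)}
    return lambda text: text.translate(table)


def make_transform(*descriptors):
    """Model an @text-transform rule: apply each descriptor in order."""
    def transform(text: str) -> str:
        for step in descriptors:
            text = step(text)
        return text
    return transform


abcdef1 = make_transform(convert("abc", "def"))
abcdef2 = make_transform(convert("a", "d"), convert("b", "e"), convert("c", "f"))

# Under this reading, the output of one descriptor feeds the next one.
chained = make_transform(convert("a", "b"), convert("b", "c"))
```

Under this reading, descriptors whose targets are also sources of later descriptors interact: chained maps both "a" and "b" to "c". The two transforms above are identical only because none of "d", "e", "f" is converted again.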

The convert descriptor

Name: convert
Value: <string> to <string>
default: N/A

This descriptor creates a 1 to 1 mapping from the characters in the first string to the characters in the second string.

ISSUE: how should we define character here? Legacy or extended grapheme cluster?

Both strings should be of equal length. If they are not, the longer one is truncated to the length of the shorter one.

ISSUE: define length properly in terms of grapheme clusters
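
The truncation rule, and the reason the definition of length matters, can be sketched in Python. This is an illustration only: convert_table is a hypothetical helper, and it measures length in code points, which is exactly the choice the issues above leave open.

```python
def convert_table(src: str, dst: str) -> dict:
    """Build the 1 to 1 mapping of a `convert` descriptor.

    zip() stops at the end of the shorter string, which has the same
    effect as truncating the longer one.
    Assumption: length is counted in code points.
    """
    return {ord(s): d for s, d in zip(src, dst)}


# "abcd" to "xy" behaves like "ab" to "xy": "c" and "d" are left alone.
table = convert_table("abcd", "xy")
result = "abcd".translate(table)

# The same visible letter can have different lengths in code points:
precomposed = "\u00e9"   # é as one code point
decomposed = "e\u0301"   # e followed by a combining acute accent
```

With code points as the unit, the decomposed form counts as two characters, so a mapping written with decomposed letters would pair up differently than the author probably intended.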

The convert-range descriptor

Name: convert-range
Value: <string>,<string> to <string>,<string>
default: N/A

Using the convert descriptor can be tedious when the list of characters is long. When the characters' Unicode code points form a contiguous sequence, convert-range offers a shorter notation.

Each pair of strings defines a range of Unicode characters, inclusive of the ones listed. Each of the four strings must contain a single Unicode character.

NOTE: Here, grapheme clusters don't make sense

The numerical code point value of the character in the first (resp. third) string must be less than the one in the second (resp. fourth) string. If it is not, the descriptor must be ignored. Both ranges should be of equal length. If they are not, the longer one is truncated to the length of the shorter one. The ranges may overlap.

Example:

@text-transform latin-only-uppercase
{
    convert-range: "a","z" to "A","Z";
}
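
The rules for convert-range can be modelled as follows. This Python sketch is an illustration under assumptions: the function name convert_range is hypothetical, an ignored descriptor is represented by an empty mapping, and a string that does not contain exactly one code point is treated as making the descriptor ignored, which the text above does not specify.

```python
def convert_range(src_first: str, src_last: str,
                  dst_first: str, dst_last: str) -> dict:
    """Build the mapping described by one `convert-range` descriptor."""
    bounds = (src_first, src_last, dst_first, dst_last)
    # Assumption: anything other than one code point per string is an error.
    if any(len(s) != 1 for s in bounds):
        return {}
    a, b, c, d = (ord(s) for s in bounds)
    # The first bound of each range must be strictly below the second one;
    # otherwise the descriptor is ignored.
    if not (a < b and c < d):
        return {}
    # Ranges are inclusive; the longer one is truncated to the shorter one.
    size = min(b - a, d - c) + 1
    return {a + i: chr(c + i) for i in range(size)}


latin_only_uppercase = convert_range("a", "z", "A", "Z")
shouted = "héllo wörld".translate(latin_only_uppercase)
```

Because only the code points from "a" to "z" are mapped, accented letters such as "é" and "ö" pass through unchanged, which is what the name latin-only-uppercase suggests.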

The convert-predefined descriptor

Name: convert-predefined
Value: <text-transform>
default: N/A

This descriptor makes it possible to refer to existing text transforms, either predefined by CSS or defined by the author. While an @text-transform using only this descriptor is not very useful, combining it with other descriptors allows authors to extend existing transforms or define variants of them. convert-predefined cannot refer to the text-transform whose definition it is part of.

ISSUE: Should we combine the 3 descriptors into one, with the following syntax?

 [<string> [, <string>]? to <string> [, <string>]?] | <text-transform>

ISSUE: do we also need one more descriptor along these lines:

Name: applies-to
Value: all | initial
Default: all

It would let people define customized versions of text-transform:capitalize;

Use cases

Single-language use cases

The following use cases only apply to a single language. Defining all the possibly useful text-transforms for all languages would go beyond the capacity and expertise of the CSS WG. Having the generic mechanism allows authors to solve their specific problem.

Full-size kana

In Japanese, small kana appearing within ruby are sometimes replaced by the equivalent full-size kana. The following transform defines this conversion:

@text-transform full-size-kana
{
    convert: "ぁぃぅぇぉゕゖっゃゅょゎ" to "あいうえおかけつやゆよわ";
    convert: "ァィゥェォヵㇰヶㇱㇲッㇳㇴㇵㇶㇷㇸㇹㇺャュョㇻㇼㇽㇾㇿヮ" to "アイウエオカクケシスツトヌハヒフヘホムヤユヨラリルレロワ";
    convert: "ｧｨｩｪｫｯｬｭｮ" to "ｱｲｳｴｵﾂﾔﾕﾖ";
}

German ß

As discussed in this thread, ß (U+00DF) is traditionally considered a lowercase letter without an uppercase equivalent. text-transform:uppercase leaves it unchanged. Unicode 5.1 introduced ẞ (U+1E9E), an uppercase version of it, but without making it the result of toupper() for ß.

This letter being rather new, authors are bound to disagree on whether it is a proper uppercase variant of U+00DF. Those who think it is not may use text-transform:uppercase; and text-transform:lowercase;. Those who think it is could use the following.

@text-transform german-uppercase
{
    convert-predefined: uppercase;
    convert: "ß" to "ẞ";
}

@text-transform german-lowercase
{
    convert-predefined: lowercase;
    convert: "ẞ" to "ß";
}

Turkish i/ı

http://en.wikipedia.org/wiki/Dotted_and_dotless_I

In Turkish and a few related languages, dotted and dotless i are distinct letters, both in upper and lower case.

The uppercasing and lowercasing algorithms defined for the text-transform property only preserve this distinction when the content language of the element is known.

Someone, for example in a user style sheet, may want to apply an uppercase or lowercase transform to a document where language is insufficiently marked up, but known to the author of the style sheet to be Turkish. In this case, the generic uppercase and lowercase transforms would fail, but the following would work.

@text-transform turkic-uppercase
{
    convert: "i" to "İ";
    convert-predefined: uppercase;
}

@text-transform turkic-lowercase
{
    convert: "I" to "ı";
    convert-predefined: lowercase;
}
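
The effect of placing convert before convert-predefined can be illustrated in Python. This sketch rests on a loud assumption: Python's built-in str.upper() and str.lower() stand in for the predefined uppercase and lowercase transforms, and they are not guaranteed to match what CSS specifies. The function names are hypothetical.

```python
def turkic_uppercase(text: str) -> str:
    # Step 1 models `convert: "i" to "İ"`.
    text = text.translate({ord("i"): "İ"})
    # Step 2 models `convert-predefined: uppercase`.
    # Assumption: str.upper() approximates the generic uppercase transform.
    return text.upper()


def turkic_lowercase(text: str) -> str:
    # Step 1 models `convert: "I" to "ı"`.
    text = text.translate({ord("I"): "ı"})
    # Step 2 models `convert-predefined: lowercase` (same assumption).
    return text.lower()


generic = "ırmak istanbul".upper()
turkic = turkic_uppercase("ırmak istanbul")
```

The generic transform merges both letters into "I", while the custom one keeps the dotted capital. The order of the descriptors matters here: if the predefined transform ran first, no "i" would be left for the convert step to act on.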

Georgian upper/lower case

http://en.wikipedia.org/wiki/Letter_case#Other_forms_of_case http://en.wikipedia.org/wiki/Georgian_alphabet

The Georgian language has used three different unicameral alphabets through history: Asomtavruli, Nuskhuri, and Mkhedruli. Recently, some authors have been using Asomtavruli letters in an otherwise Mkhedruli text, in a way that resembles a bicameral alphabet. One may assume that they would find the following transform useful.

@text-transform Mkhedruli-to-Asomtavruli
{
    convert-range: "ა","ჵ" to "Ⴀ","Ⴥ";
}

@text-transform Asomtavruli-to-Mkhedruli
{
    convert-range: "Ⴀ","Ⴥ" to "ა","ჵ";
}

Cross-language use cases

The following use cases apply to several languages, but are rare enough that they are better addressed by authors when needed than by the CSS WG.

Long s

http://en.wikipedia.org/wiki/Long_s http://www.fileformat.info/info/unicode/char/17f/index.htm

In old (18th century and earlier) European texts, the letter s was written ſ (U+017F) when it occurred at the beginning or in the middle of a word. An s occurring at the end of a word was written as the modern s is.

Modern readers are often unfamiliar with this letter form, and for readability reasons, one may want to convert from one to the other. The following transform would accomplish this.

@text-transform modernize-s
{
    convert: "ſ" to "s";
}

Miscellaneous

Here are some more examples of how the generic mechanism may be used.

Comic book vikings

In the “Asterix and the Great Crossing” comic book, the Viking characters are supposed to speak a foreign language unintelligible to the main characters, but still understandable to the readers. This is represented by writing down their speech normally, except that some letters are replaced by similar-looking letters found in Scandinavian languages.

This effect could be obtained by the following transform:

@text-transform fake-norse
{
    convert: "aoAO" to "åøÅØ";
}
