This is an early draft for a possible generic mechanism to allow authors to define custom text-transforms.

Defining Custom Text Transforms: the @text-transform rule

The general form of an @text-transform at-rule is:

@text-transform <transform-name>
{ [ descriptor: value; ]+ }

<transform-name> may be any valid identifier other than none, inherit and initial.

A text transform created using this at-rule may be used simply by using <transform-name> as the value of the text-transform property. If <transform-name> conflicts with an existing CSS keyword, the conflict is resolved in favor of the name introduced using @text-transform.

Each @text-transform rule specifies a value for every text-transform descriptor, either implicitly or explicitly. Those not given an explicit value in the rule take the initial value listed with each descriptor in this specification. These descriptors apply solely within the context of the @text-transform rule in which they are defined, and do not apply to document language elements. There is no notion of which elements the descriptors apply to or whether the values are inherited by child elements. When a given descriptor occurs multiple times in a given @text-transform rule, only the last specified value is used; all prior values for that descriptor must be ignored.

The transformation descriptor

Name: transformation
Value: <conversion>#
Default: N/A
 
<conversion> = [<char-list> to <char-list>] | <'text-transform'>
<char-list> = <enumeration> | <range>
<range> = <urange> | <string>
<enumeration> = <string>

This descriptor defines which character will be replaced by which, by listing a series of conversions, to be applied in the same order as they appear in the descriptor.

Conversions may refer to existing text transforms, either predefined by CSS or defined by the author. While a transformation using only a single such conversion is not very useful, combining it with other conversions allows authors to extend or define variants of existing transforms. Referring to the text-transform currently being defined is not allowed, and makes the whole descriptor invalid.

Conversions may also define a new mapping from one <char-list> to another.

When defined using a <urange>, the <char-list> is composed of each individual Unicode code point designated by the <urange>.

A <range> may also be defined as a string made of a single Unicode character, followed by a hyphen (U+002D), followed by another single Unicode character. The semantics are identical to the <urange> U+XXXXXX-YYYYYY where XXXXXX is the code point of the first character and YYYYYY the code point of the second character.

If defined by an <enumeration>, it is composed of each character in the string, where what a character is depends on the character-type descriptor. The same character may not appear twice in the <char-list> defining the source of the mapping, otherwise the whole descriptor is invalid.

In addition to the usual CSS rules of character escaping, a hyphen (U+002D) needs to be escaped to appear in a string in a conversion.

In a <conversion>, if the source <char-list> is longer than the target <char-list>, then the last item of the target list is used for all remaining items in the source list.
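The expansion and padding rules above can be sketched in Python. This is only an illustration of one possible reading of the draft, not a proposed implementation: the function names are made up for this sketch, CSS escaping and <urange> parsing are left out, and every character is treated as a single code point (as with character-type: single).

```python
def expand_char_list(spec):
    """Expand a <char-list> string into a list of single code points.

    Assumption: a three-character string with a hyphen in the middle
    ("a-z") is a range; anything else is an enumeration. Escaping is
    not handled in this sketch.
    """
    if len(spec) == 3 and spec[1] == "-":
        first, last = ord(spec[0]), ord(spec[2])
        return [chr(cp) for cp in range(first, last + 1)]
    return list(spec)


def build_mapping(source, target):
    """Pair source and target items, padding with the last target item."""
    src = expand_char_list(source)
    dst = expand_char_list(target)
    if len(set(src)) != len(src):
        # The draft makes the whole descriptor invalid in this case.
        raise ValueError("a character appears twice in the source list")
    if not dst:
        # An empty target is an open question in the draft (ISSUE 2).
        raise ValueError("empty target list")
    # When the source is longer, reuse the last target item.
    return {char: dst[min(i, len(dst) - 1)] for i, char in enumerate(src)}


def apply_mapping(mapping, text):
    """Replace every mapped code point; leave the others untouched."""
    return "".join(mapping.get(char, char) for char in text)


latin_only_uppercase = build_mapping("a-z", "A-Z")
print(apply_mapping(latin_only_uppercase, "héllo"))
```

Under these assumptions, "héllo" keeps its "é" because that character is outside the a-z range, and a conversion such as "abc" to "x" maps all three source characters to "x".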

ISSUE 1: Should we allow spaces and other collapsible characters in the target? Since text-transform is applied after white space collapsing, what are the implications of generating runs of collapsible white space that won't be collapsed? It has been proposed that we should allow them, and trigger a second white space collapsing if they are actually used. If we do want to allow them we should consider how that interacts with the proposed 'spaced' value of the character-type descriptor, and whether escaping is needed.
ISSUE 2: Should we allow an empty <char-list> as the target? It has been suggested that this be used to delete text. I am not sure I like the idea that text-transform could be able to make some non-empty element empty.
ISSUE 3: It has been suggested that it should be possible to write text-transforms that behave differently on different languages. This can probably be achieved by adding some optional part at the beginning of each <conversion>, although I am not sure what the syntax should be.

Examples:

@text-transform latin-only-uppercase 
{
    transformation: "a-z" to "A-Z";
}

The following two transforms are identical.

@text-transform abcdef1
{
    transformation: "abc" to "def";
}
@text-transform abcdef2
{
    transformation: "a" to "d",
                    "b" to "e",
                    "c" to "f";
}

The character-type descriptor

Name: character-type
Value: extended | legacy | single | spaced
Default: extended
ISSUE 4: extended is proposed as the default, but is this the right choice? Maybe single would be more suitable.

This definition affects what is meant by character processing in two different contexts:

  • strings used as an <enumeration> in the transformation descriptor
  • the text to which the text-transform will be applied.

In an <enumeration>, the possible values have the following meanings:

  • extended: characters are extended grapheme clusters, as defined in UAX29
  • legacy: characters are legacy grapheme clusters, as defined in UAX29
  • single: characters are single Unicode code points.
  • spaced: characters are space separated sequences of Unicode code points.

How the text to which the text-transform property is applied must be processed also depends on the value of this descriptor.

When the value is 'extended', 'legacy', or 'single', each character (defined respectively as an extended grapheme cluster, a legacy grapheme cluster, or a single Unicode code point) in the text is processed individually.

Example:

@text-transform foo {
  character-type: extended;
  transformation: "e" to "a";
}

If the text to which the above text transform is applied contains the U+65 U+301 sequence ('é'), it will not be transformed, because the transform applies to whole grapheme clusters.

On the other hand, the following text transform would transform that same sequence into U+61 U+301 ('á'), as it would consider each Unicode code point individually.

@text-transform foo {
  character-type: single;
  transformation: "e" to "a";
}
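The difference between the two examples above can be sketched in Python. The sketch is hedged in one important way: the standard library has no UAX29 segmentation, so split_clusters below is a rough stand-in that only attaches combining marks to the preceding base character. The function names are invented for this illustration.

```python
import unicodedata


def split_single(text):
    """character-type: single -> one unit per Unicode code point."""
    return list(text)


def split_clusters(text):
    """Rough stand-in for grapheme clusters (NOT full UAX29).

    Assumption: a cluster is a base character followed by any number
    of combining marks.
    """
    clusters = []
    for char in text:
        if clusters and unicodedata.combining(char) != 0:
            clusters[-1] += char
        else:
            clusters.append(char)
    return clusters


def transform(text, mapping, character_type):
    """Apply the mapping to each unit produced by the chosen splitter."""
    split = split_single if character_type == "single" else split_clusters
    return "".join(mapping.get(unit, unit) for unit in split(text))


e_acute = "e\u0301"  # U+0065 U+0301
print(transform(e_acute, {"e": "a"}, "extended") == e_acute)
print(transform(e_acute, {"e": "a"}, "single") == "a\u0301")
```

With the cluster-based splitter the whole sequence U+0065 U+0301 is one unit that does not match "e", so it is left alone; with the code point splitter only the base letter is replaced and the combining accent stays in place.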

ISSUE 5: Define the processing model on the text for the 'spaced' value, to decide what happens to a piece of text like “aaaaa” when a transform like the following is applied to it:

@text-transform foo {
  character-type: spaced;
  transformation: "aa ca a" to "c a ca";
}

ISSUE 6: Can all Unicode code point sequences be decomposed into valid grapheme clusters, or are some combinations invalid? If invalid sequences exist, what happens when we run into them and either extended or legacy is specified? Skip the invalid character, make the conversion invalid, or make the whole descriptor invalid?
ISSUE 7: Define what happens when a text-transform refers via its transformation descriptor to another text-transform which has a different character-type. My guess: the original character-type applies to processing the enumerations, while the character-type in the including text-transform applies to the text that will be transformed. Or maybe this means this descriptor should be split in two.

The scope descriptor

Name: scope
Value: all | [initial || medial || final]
Default: all

This descriptor makes it possible to restrict which characters in the source text are affected by the transform.

  • 'all' lets the transform apply to any character
  • 'initial' lets the transform apply at the beginning of a word
  • 'final' lets the transform apply at the end of a word
  • 'medial' lets the transform apply to characters within a word other than at the beginning and the end.
ISSUE 8: More fancy values could be added here in the future to support things like title case, or to match only the base character, or only the diacritics.

The definition of “word” is UA-dependent; UAX29 is suggested (but not required) for determining such word boundaries.

The transformation descriptor may be used to refer to existing text-transforms in the definition of a new one. If the text-transforms referred to have a different scope than the scope specified in the text-transform that refers to them, they apply at the intersection of the two scopes.

Example:

@text-transform latin-only-uppercase
{
    transformation: "a-z" to "A-Z";
}
@text-transform latin-only-capitalize
{
    transformation: latin-only-uppercase;
    scope: initial;
}
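The scope rules and the intersection behaviour used by latin-only-capitalize can be sketched in Python as follows. Several things here are assumptions of the sketch rather than part of the draft: words are found with the regular expression \w+ (the draft leaves the definition of "word" to the UA), a one-letter word counts as both initial and final, and scopes are represented as Python sets of keywords.

```python
import re

WORD = re.compile(r"\w+")  # assumption: a simple stand-in for word boundaries


def labels(index, length):
    """Return the positions (initial/medial/final) a character occupies."""
    found = set()
    if index == 0:
        found.add("initial")
    if index == length - 1:
        found.add("final")
    return found or {"medial"}


def intersect(outer, inner):
    """Scope of a referenced transform combined with the referring one."""
    if "all" in outer:
        return inner
    if "all" in inner:
        return outer
    return outer & inner


def apply_scoped(text, mapping, scope):
    """Apply the mapping only to characters whose position is in scope."""
    if "all" in scope:
        return "".join(mapping.get(char, char) for char in text)

    def convert_word(match):
        word = match.group(0)
        return "".join(
            mapping.get(char, char) if labels(i, len(word)) & scope else char
            for i, char in enumerate(word)
        )

    return WORD.sub(convert_word, text)


# latin-only-uppercase has scope 'all'; latin-only-capitalize narrows it.
uppercase = {chr(cp): chr(cp - 32) for cp in range(ord("a"), ord("z") + 1)}
capitalize_scope = intersect({"initial"}, {"all"})
print(apply_scoped("hello world", uppercase, capitalize_scope))
```

Here the referenced transform applies everywhere, so the intersection is simply 'initial' and only the first letter of each word is uppercased.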

DOM interaction

Custom text transform values defined within @text-transform rules are accessible via the following modifications to the CSS Object Model.

Interface CSSRule

The following additional rule type is added to the CSSRule interface.

IDL Definition

interface CSSRule {
...
const unsigned short TEXT_TRANSFORM_RULE = 1000;
...
};

Interface CSSTextTransformRule

The CSSTextTransformRule interface represents a single @text-transform rule and the descriptors it contains.

IDL Definition

  interface CSSTextTransformRule : CSSRule {
      attribute          DOMString   name;
      readonly attribute CSSStyleDeclaration style;
  };
  

Attributes

name of type DOMString

This attribute is the name of the transform, used by the text-transform property.

style of type CSSStyleDeclaration

This attribute represents all the descriptors associated with this text-transform.

Use cases

Single-language use cases

The following use cases only apply to a single language. Defining all the possibly useful text-transforms for all languages would go beyond the capacity and expertise of the CSS WG. Having the generic mechanism allows authors to solve their specific problem.

Full-size kana

In Japanese, small kana appearing within ruby are sometimes replaced by the equivalent full-size kana. The following transform defines this conversion:

@text-transform full-size-kana
{
    transformation: "ぁぃぅぇぉゕゖっゃゅょゎ" to "あいうえおかけつやゆよわ",
                    "ァィゥェォヵㇰヶㇱㇲッㇳㇴㇵㇶㇷㇸㇹㇺャュョㇻㇼㇽㇾㇿヮ" to "アイウエオカクケシスツトヌハヒフヘホムヤユヨラリルレロワ",
                    "ァィゥェォャュョ" to "アイウエオツヤユヨ";
}

German ß

As discussed in this thread, ß (aka &szlig; or U+00DF) is traditionally considered a lowercase letter without an uppercase equivalent. text-transform:uppercase leaves it unchanged. Unicode 5.1 introduced ẞ (U+1E9E), an uppercase version of it, but without making it the target of toupper().

This letter being rather new, authors are bound to disagree on whether it is a proper uppercase variant of U+00DF. Those who think it is not may use text-transform:uppercase and text-transform:lowercase. Those who think it is could use the following.

@text-transform german-uppercase
{
    transformation: U+00DF to U+1E9E, uppercase;
}
 
@text-transform german-lowercase
{
    transformation: U+1E9E to U+00DF, lowercase;
}
ISSUE 9: It has been suggested that overloading existing values with a language descriptor or selector would be better:

@text-transform uppercase
{
    transformation: U+00DF to U+1E9E;
    language: de;
}
@text-transform uppercase:lang(de)
{
    transformation: U+00DF to U+1E9E;
}

Turkish i/ı

http://en.wikipedia.org/wiki/Dotted_and_dotless_I

In Turkish and a few related languages, dotted and dotless i are distinct letters, both in upper and lower case.

The uppercasing and lowercasing algorithm defined for the text-transform property only preserves this distinction when the content language of the element is known.

Someone, for example in a user style sheet, may want to apply an uppercase or lowercase transform to a document where language is insufficiently marked up, but known to the author of the style sheet to be Turkish. In this case, the generic uppercase and lowercase transforms would fail, but the following would work.

@text-transform turkic-uppercase
{
    transformation: "i" to "İ", uppercase;
}
 
@text-transform turkic-lowercase
{
    transformation: "I" to "ı", lowercase;
}
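The way these two transforms chain an author-defined conversion with an existing transform can be sketched in Python. This is a hedged illustration: make_transform is a name invented here, conversions are assumed to run one after the other on the whole text, and Python's str.upper and str.lower merely stand in for the CSS uppercase and lowercase transforms, whose exact behaviour may differ.

```python
def make_transform(*conversions):
    """Chain conversions in the order listed in the descriptor.

    Each conversion is either a dict of single-character replacements
    or a callable standing in for an existing text transform.
    """
    def run(text):
        for conversion in conversions:
            if callable(conversion):
                text = conversion(text)
            else:
                text = "".join(conversion.get(char, char) for char in text)
        return text

    return run


# "i" to "İ" (U+0130), then the generic uppercase.
turkic_uppercase = make_transform({"i": "\u0130"}, str.upper)
# "I" to "ı" (U+0131), then the generic lowercase.
turkic_lowercase = make_transform({"I": "\u0131"}, str.lower)

print(turkic_uppercase("istanbul"))
print(turkic_lowercase("ISPARTA"))
```

Order matters in this reading: if the generic uppercase ran first, "i" would already have become "I" and the author-defined conversion would never match.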

Georgian upper/lower case

http://en.wikipedia.org/wiki/Letter_case#Other_forms_of_case http://en.wikipedia.org/wiki/Georgian_alphabet

The Georgian language has used three different unicameral alphabets through history: Asomtavruli, Nuskhuri, and Mkhedruli. Recently, some authors have been using Asomtavruli letters in an otherwise Mkhedruli text, in a way that resembles a bicameral alphabet. One may assume that they would find the following transform useful.

@text-transform Mkhedruli-to-Asomtavruli
{
    transformation: "ა-ჵ" to "Ⴀ-Ⴥ";
}
 
@text-transform Asomtavruli-to-Mkhedruli
{
    transformation: "Ⴀ-Ⴥ" to "ა-ჵ";
}

Cross-language use cases

The following use cases apply to several languages, but are rare enough that they are better addressed by authors when needed than by the CSS WG.

Long s

http://en.wikipedia.org/wiki/Long_s http://www.fileformat.info/info/unicode/char/17f/index.htm

In old (18th century and earlier) European texts, the letter s, when at the middle or beginning of a word, was written ſ (U+017F). An s occurring at the end of a word would be written as the modern s is.

Modern readers are often unfamiliar with this letter form, and for readability reasons, one may want to convert from one to the other. The following transform would accomplish this.

@text-transform modernize-s
{
    transformation: "ſ" to "s";
}

This does the opposite transform:

@text-transform long-s
{
    transformation: "s" to "ſ" ;
    scope: initial medial;
}
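As a rough check of how the two transforms above relate, here is a small Python sketch. It assumes that words are matched by \w+ and that a one-letter word counts as initial (and is therefore in scope); neither point is settled by this draft, and the function names are invented for the illustration.

```python
import re

LONG_S = "\u017f"  # ſ


def long_s(text):
    """'s' to 'ſ' with scope 'initial medial': skip the word-final letter."""
    def convert(match):
        word = match.group(0)
        if len(word) == 1:
            # Assumption: a lone letter is treated as initial.
            return word.replace("s", LONG_S)
        return word[:-1].replace("s", LONG_S) + word[-1]

    return re.sub(r"\w+", convert, text)


def modernize_s(text):
    """'ſ' to 's' with the default scope 'all'."""
    return text.replace(LONG_S, "s")


sample = "success is sweet"
converted = long_s(sample)
print(converted)
print(modernize_s(converted) == sample)
```

Because modernize-s has no scope restriction, it undoes long-s on any text that did not contain ſ to begin with.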

Miscellaneous

Here are some more examples of how the generic mechanism may be used.

Transliteration

Most writing systems of the world have at least one common transliteration scheme into the roman script.

romanization.css
@text-transform romanization 
{
    character-type: spaced;
 /* ISO 9 (Cyrillic) */
    transformation: "А а Ӑ ӑ Ӓ ӓ Ә ә Б б В в Г г Ґ ґ Ҕ ҕ Ғ ғ Д д Ђ ђ Ѓ ѓ Е е Ё ё Ӗ ӗ Є є Ҽ ҽ Ҿ ҿ
                     Ж ж Ӂ ӂ Ӝ ӝ Җ җ З з Ӟ ӟ Ѕ ѕ Ӡ ӡ И и Ӥ ӥ І і Ї ї Й й Ј ј К к Қ қ Ҟ ҟ Л л Љ љ
                     М м Н н Њ њ Ҥ ҥ Ң ң О о Ӧ ӧ Ө ө П п Ҧ ҧ Р р С с Ҫ ҫ Т т Ҭ ҭ Ћ ћ Ќ ќ
                     У у У́ у́ Ў ў Ӱ ӱ Ӳ ӳ Ү ү Ф ф Х х Ҳ ҳ Һ һ Ц ц Ҵ ҵ Ч ч Ӵ ӵ Ҷ ҷ Џ џ Ш ш Щ щ
                     Ъ ъ ’ Ы ы Ӹ ӹ Ь ь Э э Ю ю Я я Ѣ ѣ Ѫ ѫ Ѳ ѳ Ѵ ѵ Ҩ ҩ"
                to  "A a Ă ă Ä ä A̋ a̋ B b V v G g G̀ g̀ Ğ ğ Ġ ġ D d Đ đ Ǵ ǵ E e Ë ë Ĕ ĕ Ê ê C̆ c̆ Ç̆ ç̆
                     Ž ž Z̆ z̆ Z̄ z̄ Ž̦ ž̧ Z z Z̈ z̈ Ẑ ẑ Ź ź I i Î î Ì ì Ï ï J j J̌ ǰ K k Ķ ķ K̄ k̄ L l L̂ l̂
                     M m N n N̂ n̂ Ṅ ṅ Ṇ ṇ O o Ö ö Ô ô P p Ṕ ṕ R r S s Ç ç T t Ţ ţ Ć ć Ḱ ḱ
                     U u Ú ú Ŭ ŭ Ü ü Ű ű Ù ù F f H h Ḩ ḩ Ḥ ḥ C c C̄ c̄ Č č C̈ c̈ Ç ç D̂ d̂ Š š Ŝ ŝ
                     ʺ ʺ ‵ Y y Ÿ ÿ ʹ ʹ È è Û û Â â Ě ě Ǎ ǎ F̀ f̀ Ỳ ỳ Ò ò",
 /* ISO 843 (Greek) */
                    "Α α Ά ά Β β Γ γ Δ δ Ε ε Έ έ Ζ ζ Η η Ή ή Θ  θ  Ι ι Ί ί Ϊ ϊ ΐ Κ κ Λ λ Μ μ
                     Ν ν Ξ ξ Ο ο Ό ό Π π Ρ ρ Σ σ ς Τ τ Υ υ Ύ ύ Ϋ ϋ Φ φ Χ  χ  Ψ  ψ  Ω ω Ώ ώ"
                to  "A a Á á V v G g D d E e É é Z z Ī ī Ī́ ī́ Th th I i Í í Ï ï ḯ K k L l M m
                     N n X x O o Ó ó P p R r S s s T t Y y Ý ý Ÿ ÿ F f Ch ch Ps ps Ō ō Ṓ ṓ";
}

Comic book vikings

In the “Asterix and the Great Crossing” comic book, the Viking characters are supposed to speak a foreign language unintelligible to the main characters, but still understandable to the readers. This is represented by writing down their speech normally, except that some letters are replaced by similarly looking letters found in Scandinavian languages.

This effect could be obtained by the following transform:

@text-transform fake-norse
{
    transformation: "aoAO" to "åøÅØ";
}

Leet speak

In Internet, hacker and gamer culture, it is quite common to replace characters with other characters or character sequences that have a somewhat similar glyphic appearance. Although no single agreed convention exists, and the mappings are sometimes neither injective nor surjective, one could simulate this playful style with a transform like the following:

@text-transform leet-speak
{
    transformation: "A-Z" to "48©)3F6H1!K£MN0¶9®57UVW*¥2";
}
 
ideas/at-text-transform.txt · Last modified: 2012/04/24 03:03 by florian