This is an early draft for a possible generic mechanism to allow authors to define custom text-transforms.
The general form of an @text-transform at-rule is:
@text-transform <transform-name> { [ descriptor: value; ]+ }
<transform-name> may be any valid identifier other than none, inherit and initial.
A text transform created using this at-rule may be used simply by using <transform-name> as the value of the text-transform property. If <transform-name> conflicts with an existing CSS keyword, the conflict is resolved in favor of the name introduced using @text-transform.
Each @text-transform rule specifies a value for every text-transform descriptor, either implicitly or explicitly. Those not given an explicit value in the rule take the initial value listed with each descriptor in this specification. These descriptors apply solely within the context of the @text-transform rule in which they are defined, and do not apply to document language elements. There is no notion of which elements the descriptors apply to or whether the values are inherited by child elements. When a given descriptor occurs multiple times in a given @text-transform rule, only the last specified value is used; all prior values for that descriptor must be ignored.
Name: transformation
Value: <conversion>#
Default: N/A
<conversion> = [<char-list> to <char-list>] | <'text-transform'>
<char-list> = <enumeration> | <range>
<range> = <urange> | <string>
<enumeration> = <string>
This descriptor defines which character will be replaced by which, by listing a series of conversions, to be applied in the same order as they appear in the descriptor.
Conversions may refer to existing text transforms, either predefined by CSS or defined by the author. While a transformation using only a single such conversion is not very useful, combining it with other conversions allows authors to extend or define variants of existing transforms. Referring to the text-transform currently being defined is not allowed, and makes the whole descriptor invalid.
Conversions may also define a new mapping from one <char-list> to another.
When defined using a <urange>, the <char-list> is composed of each individual Unicode code point designated by the <urange>.
A <range> may also be defined as a string made of a single Unicode character, followed by a hyphen (U+002D), followed by another single Unicode character. The semantics are identical to the <urange> U+XXXXXX-YYYYYY where XXXXXX is the code point of the first character and YYYYYY the code point of the second character.
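The string form of a <range> can be illustrated with a minimal Python sketch. The function name expand_range is hypothetical, each character is taken to be a single code point, and rejecting a reversed range is an assumption, since the draft does not say how such a range is handled.

```python
def expand_range(spec: str) -> list[str]:
    """Expand a range string such as "a-z" into the characters it designates."""
    chars = list(spec)
    # A range string is exactly: one character, a hyphen (U+002D), one character.
    if len(chars) != 3 or chars[1] != "-":
        raise ValueError(f"not a range string: {spec!r}")
    first, last = ord(chars[0]), ord(chars[2])
    # Assumption: a reversed range is an error (not specified in the draft).
    if first > last:
        raise ValueError("range start is after range end")
    # Same semantics as the <urange> U+first-last: every code point, inclusive.
    return [chr(cp) for cp in range(first, last + 1)]


print("".join(expand_range("a-e")))  # abcde
```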
If defined by an <enumeration>, it is composed of each character in the string, where the definition of a character depends on the character-type descriptor. The same character may not appear twice in the <char-list> defining the source of the mapping, otherwise the whole descriptor is invalid.
In addition to the usual CSS rules of character escaping, a hyphen (U+002D) needs to be escaped to appear in a string in a conversion.
In a <conversion>, if the source <char-list> is longer than the target <char-list>, then the last item of the target list is used for all remaining items in the source list.
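As a rough illustration of the two rules above (no duplicate source characters, and reuse of the last target item when the source list is longer), here is a minimal Python sketch. The names build_mapping and apply_mapping are hypothetical, characters are treated as single code points, and ignoring surplus target items is an assumption because the draft does not cover that case.

```python
def build_mapping(source: str, target: str) -> dict[str, str]:
    """Pair each source character with a target character."""
    # A repeated source character makes the whole descriptor invalid.
    if len(set(source)) != len(source):
        raise ValueError("duplicate character in source list")
    if not target:
        raise ValueError("empty target list")
    # Pad the target with its last item when the source is longer.
    padded = list(target) + [target[-1]] * (len(source) - len(target))
    # Assumption: surplus target items are silently ignored by zip().
    return dict(zip(source, padded))


def apply_mapping(text: str, mapping: dict[str, str]) -> str:
    """Replace mapped characters and leave all others unchanged."""
    return "".join(mapping.get(ch, ch) for ch in text)


mapping = build_mapping("abc", "de")
print(mapping)                          # {'a': 'd', 'b': 'e', 'c': 'e'}
print(apply_mapping("cabbage", mapping))  # edeedge
```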
Examples:
@text-transform latin-only-uppercase { transformation: "a-z" to "A-Z"; }
The following two transforms are identical.
@text-transform abcdef1 { transformation: "abc" to "def"; }
@text-transform abcdef2 { transformation: "a" to "d", "b" to "e", "c" to "f"; }
Name: character-type
Value: extended | legacy | single | spaced
Default: extended
This descriptor affects what is meant by a character in two different contexts:
In an <enumeration>, the possible values have the following meanings:
How the text to which the text-transform property is applied must be processed also depends on the value of this descriptor.
When the value is 'extended', 'legacy', or 'single', each character (defined respectively as an extended grapheme cluster, a legacy grapheme cluster, or a single Unicode code point) in the text is processed individually.
@text-transform foo { character-type: extended; transformation: "e" to "a"; }
If the text to which the above text transform is applied contains the U+0065 U+0301 sequence ('é'), it will not be transformed, because the transform applies to whole grapheme clusters.
On the other hand, the following text transform would transform that same sequence into U+0061 U+0301 ('á'), as it would consider each Unicode code point individually.
@text-transform foo { character-type: single; transformation: "e" to "a"; }
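The difference between the two foo transforms can be sketched in Python. This is only an approximation: the clusters helper is hypothetical and groups a base code point with the combining marks that follow it, which is simpler than the full extended grapheme cluster rules of UAX29.

```python
import unicodedata


def clusters(text: str) -> list[str]:
    """Approximate grapheme clusters: a base code point plus trailing combining marks."""
    out: list[str] = []
    for ch in text:
        if out and unicodedata.combining(ch):
            out[-1] += ch  # attach the combining mark to the previous cluster
        else:
            out.append(ch)
    return out


def transform(text: str, mapping: dict[str, str], character_type: str) -> str:
    """Apply the mapping per code point ('single') or per approximate cluster."""
    units = list(text) if character_type == "single" else clusters(text)
    return "".join(mapping.get(unit, unit) for unit in units)


text = "e\u0301"  # U+0065 U+0301
e_to_a = {"e": "a"}
print(transform(text, e_to_a, "extended") == "e\u0301")  # True: cluster left untouched
print(transform(text, e_to_a, "single") == "a\u0301")    # True: base letter replaced
```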
@text-transform foo { character-type: spaced; transformation: "aa ca a" to "c a ca"; }
Name: scope
Value: all | [initial || medial || final]
Default: all
This descriptor makes it possible to restrict which characters in the source text are affected by the transform.
The definition of “word” is UA-dependent; UAX29 is suggested (but not required) for determining such word boundaries.
The transformation descriptor may be used to refer to existing text-transforms in the definition of a new one. If the text-transforms referred to have a different scope than the scope specified in the text-transform that refers to them, they apply at the intersection of the two scopes.
Example:
@text-transform latin-only-uppercase { transformation: "a-z" to "A-Z"; }
@text-transform latin-only-capitalize { transformation: latin-only-uppercase; scope: initial; }
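The scope intersection used by latin-only-capitalize can be sketched in Python under a few stated assumptions: "initial" is taken to mean the first character of a word, "final" the last, and "medial" everything in between; a one-letter word counts as initial; and words are found with the regular expression \w+ rather than UAX29. All function names are hypothetical.

```python
import re

ALL = {"initial", "medial", "final"}


def effective_scope(referring: set[str], referred: set[str]) -> set[str]:
    """A referenced transform applies at the intersection of the two scopes."""
    return referring & referred


def apply_scoped(text: str, mapping: dict[str, str], scope: set[str]) -> str:
    """Apply the mapping only to characters whose position in a word is in scope."""
    def convert_word(match: re.Match) -> str:
        word = match.group(0)
        last = len(word) - 1
        out = []
        for i, ch in enumerate(word):
            # Assumption: a one-letter word is classified as "initial".
            position = "initial" if i == 0 else "final" if i == last else "medial"
            out.append(mapping.get(ch, ch) if position in scope else ch)
        return "".join(out)

    return re.sub(r"\w+", convert_word, text)


# latin-only-uppercase: "a-z" to "A-Z", with the default scope (all).
latin_only_uppercase = {chr(c): chr(c - 32) for c in range(ord("a"), ord("z") + 1)}
# latin-only-capitalize refers to it with scope: initial.
scope = effective_scope({"initial"}, ALL)
print(apply_scoped("hello wide world", latin_only_uppercase, scope))  # Hello Wide World
```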
Custom text transform values defined within @text-transform rules are accessible via the following modifications to the CSS Object Model.
The following additional rule type is added to the CSSRule interface.
interface CSSRule {
  ...
  const unsigned short TEXT_TRANSFORM_RULE = 1000;
  ...
};
The CSSTextTransformRule interface represents a single @text-transform rule.
interface CSSTextTransformRule : CSSRule {
  attribute DOMString name;
  readonly attribute CSSStyleDeclaration style;
};
This attribute is the name of the transform, used by the text-transform property.
This attribute represents all the descriptors associated with this text-transform.
The following use cases only apply to a single language. Defining all the possibly useful text-transforms for all languages would go beyond the capacity and expertise of the CSS WG. Having the generic mechanism allows authors to solve their specific problem.
In Japanese, small kana appearing within ruby are sometimes replaced by the equivalent full-size kana. The following transform defines this conversion:
@text-transform full-size-kana { transformation: "ぁぃぅぇぉゕゖっゃゅょゎ" to "あいうえおかけつやゆよわ", "ァィゥェォヵㇰヶㇱㇲッㇳㇴㇵㇶㇷㇸㇹㇺャュョㇻㇼㇽㇾㇿヮ" to "アイウエオカクケシスツトヌハヒフヘホムヤユヨラリルレロワ", "ァィゥェォッャュョ" to "アイウエオツヤユヨ"; }
As discussed in this thread, ß (U+00DF) is traditionally considered a lowercase letter without an uppercase equivalent. text-transform:uppercase leaves it unchanged. Unicode 5.1 introduced ẞ (U+1E9E), an uppercase version of it, but without making it the target of toupper().
This letter being rather new, authors are bound to disagree on whether it is a proper uppercase variant of U+00DF. Those who think it is not may use text-transform:uppercase and text-transform:lowercase. Those who think it is could use the following.
@text-transform german-uppercase { transformation: U+00DF to U+1E9E, uppercase; }
@text-transform german-lowercase { transformation: U+1E9E to U+00DF, lowercase; }
@text-transform uppercase { transformation: U+00DF to U+1E9E; language: de; }
@text-transform uppercase:lang(de) { transformation: U+00DF to U+1E9E; }
http://en.wikipedia.org/wiki/Dotted_and_dotless_I
In Turkish and a few related languages, dotted and dotless i are distinct letters, both in upper and lower case.
The uppercasing and lowercasing algorithms defined for the text-transform property only preserve this distinction when the content language of the element is known.
Someone, for example in a user style sheet, may want to apply an uppercase or lowercase transform to a document where language is insufficiently marked up, but known to the author of the style sheet to be Turkish. In this case, the generic uppercase and lowercase transforms would fail, but the following would work.
@text-transform turkic-uppercase { transformation: "i" to "İ", uppercase; }
@text-transform turkic-lowercase { transformation: "I" to "ı", lowercase; }
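Because conversions are applied in the order they are listed, the explicit "i" to "İ" mapping runs before the generic uppercase transform sees the text. A minimal Python sketch of that ordering follows; it assumes Python's locale-independent str.upper() and str.lower() as stand-ins for the CSS uppercase and lowercase transforms, and the function names are hypothetical.

```python
def turkic_uppercase(text: str) -> str:
    # Conversion 1: "i" to "İ" (U+0130); conversion 2: generic uppercase.
    return text.replace("i", "\u0130").upper()


def turkic_lowercase(text: str) -> str:
    # Conversion 1: "I" to "ı" (U+0131); conversion 2: generic lowercase.
    return text.replace("I", "\u0131").lower()


print(turkic_uppercase("istanbul") == "\u0130STANBUL")  # True
print(turkic_lowercase("ISPARTA") == "\u0131sparta")    # True
# Reversing the order loses the distinction: uppercase has already produced a plain "I".
print("istanbul".upper().replace("i", "\u0130"))        # ISTANBUL
```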
http://en.wikipedia.org/wiki/Letter_case#Other_forms_of_case http://en.wikipedia.org/wiki/Georgian_alphabet
The Georgian language has used three different unicameral alphabets through history: Asomtavruli, Nuskhuri, and Mkhedruli. Recently, some authors have been using Asomtavruli letters in an otherwise Mkhedruli text, in a way that resembles a bicameral alphabet. One may assume that they would find the following transform useful.
@text-transform Mkhedruli-to-Asomtavruli { transformation: "ა-ჵ" to "Ⴀ-Ⴥ"; }
@text-transform Asomtavruli-to-Mkhedruli { transformation: "Ⴀ-Ⴥ" to "ა-ჵ"; }
The following cases are examples of cases useful in several languages, but rare enough that they are better addressed by authors when needed than by the CSS WG.
http://en.wikipedia.org/wiki/Long_s http://www.fileformat.info/info/unicode/char/17f/index.htm
In old (18th century and earlier) European texts, the letter s, when at the middle or beginning of a word, was written ſ (U+017F). An s occurring at the end of a word was written as the modern s is.
Modern readers are often unfamiliar with this letter form, and for readability reasons, one may want to convert from one to the other. The following transform would accomplish this.
@text-transform modernize-s { transformation: "ſ" to "s"; }
This does the opposite transform:
@text-transform long-s { transformation: "s" to "ſ" ; scope: initial medial; }
Here are some more examples of how the generic mechanism may be used.
Most writing systems of the world have at least one common transliteration scheme into the roman script.
@text-transform romanization { character-type: spaced; /* ISO 9 (Cyrillic) */ transformation: "А а Ӑ ӑ Ӓ ӓ Ә ә Б б В в Г г Ґ ґ Ҕ ҕ Ғ ғ Д д Ђ ђ Ѓ ѓ Е е Ё ё Ӗ ӗ Є є Ҽ ҽ Ҿ ҿ Ж ж Ӂ ӂ Ӝ ӝ Җ җ З з Ӟ ӟ Ѕ ѕ Ӡ ӡ И и Ӥ ӥ І і Ї ї Й й Ј ј К к Қ қ Ҟ ҟ Л л Љ љ М м Н н Њ њ Ҥ ҥ Ң ң О о Ӧ ӧ Ө ө П п Ҧ ҧ Р р С с Ҫ ҫ Т т Ҭ ҭ Ћ ћ Ќ ќ У у У́ у́ Ў ў Ӱ ӱ Ӳ ӳ Ү ү Ф ф Х х Ҳ ҳ Һ һ Ц ц Ҵ ҵ Ч ч Ӵ ӵ Ҷ ҷ Џ џ Ш ш Щ щ Ъ ъ ’ Ы ы Ӹ ӹ Ь ь Э э Ю ю Я я Ѣ ѣ Ѫ ѫ Ѳ ѳ Ѵ ѵ Ҩ ҩ" to "A a Ă ă Ä ä A̋ a̋ B b V v G g G̀ g̀ Ğ ğ Ġ ġ D d Đ đ Ǵ ǵ E e Ë ë Ĕ ĕ Ê ê C̆ c̆ Ç̆ ç̆ Ž ž Z̆ z̆ Z̄ z̄ Ž̦ ž̧ Z z Z̈ z̈ Ẑ ẑ Ź ź I i Î î Ì ì Ï ï J j J̌ ǰ K k Ķ ķ K̄ k̄ L l L̂ l̂ M m N n N̂ n̂ Ṅ ṅ Ṇ ṇ O o Ö ö Ô ô P p Ṕ ṕ R r S s Ç ç T t Ţ ţ Ć ć Ḱ ḱ U u Ú ú Ŭ ŭ Ü ü Ű ű Ù ù F f H h Ḩ ḩ Ḥ ḥ C c C̄ c̄ Č č C̈ c̈ Ç ç D̂ d̂ Š š Ŝ ŝ ʺ ʺ ‵ Y y Ÿ ÿ ʹ ʹ È è Û û Â â Ě ě Ǎ ǎ F̀ f̀ Ỳ ỳ Ò ò", /* ISO 843 (Greek) */ "Α α Ά ά Β β Γ γ Δ δ Ε ε Έ έ Ζ ζ Η η Ή ή Θ θ Ι ι Ί ί Ϊ ϊ ΐ Κ κ Λ λ Μ μ Ν ν Ξ ξ Ο ο Ό ό Π π Ρ ρ Σ σ ς Τ τ Υ υ Ύ ύ Ϋ ϋ Φ φ Χ χ Ψ ψ Ω ω Ώ ώ" to "A a Á á V v G g D d E e É é Z z Ī ī Ī́ ī́ Th th I i Í í Ï ï ḯ K k L l M m N n X x O o Ó ó P p R r S s s T t Y y Ý ý Ÿ ÿ F f Ch ch Ps ps Ō ō Ṓ ṓ"; }
In the “Asterix and the Great Crossing” comic book, the Viking characters are supposed to speak a foreign language unintelligible to the main characters, but still understandable to the readers. This is represented by writing down their speech normally, except that some letters are replaced by similar-looking letters found in Scandinavian languages.
This effect could be obtained by the following transform:
@text-transform fake-norse { transformation: "aoAO" to "åøÅØ"; }
In Internet, hacker and gamer culture, it is quite common to replace characters with other characters or character sequences that have a somewhat similar glyphic appearance. Although no single agreed convention exists, and mappings are sometimes neither injective nor surjective, one could simulate this playful style with a transform like the following:
@text-transform leet-speak { transformation: "A-Z" to "48©)3F6H1!K£MN0¶9®57UVW*¥2"; }