2021.12.19 11:10

Unicode CLDR version 35 download

Unicode CLDR 35 provides an update to the key building blocks for software supporting the world's languages. CLDR data is used by all major software systems for their software internationalization and localization, adapting software to the conventions of different languages for such common software tasks. The following summarizes the changes in the release. A dot release, version Aside from documenting additional structure, there have been important modifications to the following areas of LDML:.

Part 1: Core. Section 3 Unicode Language and Locale Identifiers. Section 4. Section 5. Adds transliteration of 9 Indic languages to Urdu and Arabic. The following shows the growth of locale-data over time. This graph does not include data outside of the main and annotations directories, such as sorting order, transliterations, validity data, and so forth.

The following gives the total overview of the change in data items in CLDR. The measurement of the number of items reflects the different ways that the information is represented. A single data field (element or attribute value) may result in multiple data items. For example, plural rules may be shared by multiple languages, and a single data field contains all the languages to which those rules apply. For more details, see the Delta Data charts. The Survey Tool is now hosted on Microsoft Azure.

The Survey Tool now shows a link to where inherited data is aliased from. Voting resolution has been changed within organizations so that the latest vote wins. Documented the Unicode locale extension key "em" to control emoji presentation.

Noted that number patterns may contain bidi controls. Described how to handle fractional seconds S when matching skeletons and adjusting corresponding patterns. Added a specification for synthesizing emoji ZWJ sequence names. For details, see Modifications. This contains a list of elements that provide the user-translated names for language codes, as described in Section 3, Unicode Language and Locale Identifiers.

There should be no expectation that the list of languages with translated names be complete: there are thousands of languages that could have translated names. For debugging purposes or comparison, when a language display name is missing, the Description field of the language subtag registry can be used to supply a fallback English user-readable name.

The type can actually be any locale ID as specified above. The set of locale IDs that have their own translations is not fixed, and depends on the locale. For example, in one language one could translate the following locale IDs, and in another, fall back on the normal composition. Thus when a complete locale ID is formed by composition, the longest match in the language type is used, and the remaining fields (if any) are added using composition. Alternate short forms may be provided for some languages (and for territories and other display names), for example.
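The longest-match composition described above can be sketched as follows. This is a minimal illustration, not the CLDR algorithm itself: the function name `display_name`, the two name tables, and the "{0} ({1})" combining pattern are all assumptions made for the example.

```python
def display_name(locale_id, language_names, other_names,
                 pattern="{0} ({1})", separator=", "):
    """Compose a display name from the longest translated prefix.

    language_names: hypothetical table of translated language-type entries,
                    which may themselves be full locale IDs such as "en-US".
    other_names:    hypothetical table for the remaining subtags.
    """
    subtags = locale_id.split("-")
    # Try the longest prefix first, then progressively shorter ones.
    for length in range(len(subtags), 0, -1):
        prefix = "-".join(subtags[:length])
        if prefix in language_names:
            base = language_names[prefix]
            # Remaining fields are added by composition; unknown subtags
            # fall back to their raw code.
            rest = [other_names.get(s, s) for s in subtags[length:]]
            return pattern.format(base, separator.join(rest)) if rest else base
    return locale_id  # nothing matched: fall back to the identifier itself


language_names = {"en": "English", "en-US": "American English"}
other_names = {"GB": "United Kingdom", "US": "United States", "Latn": "Latin"}

print(display_name("en-US", language_names, other_names))
print(display_name("en-GB", language_names, other_names))
```

A locale that translates "en-US" directly gets that name, while one that does not falls back to composing "en" with the territory name.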

This element can contain any number of script elements. For example, in the language of this locale, the name for the Latin script might be "Romana", and for the Cyrillic script "Kyrillica".

That would be expressed with the following. When a script name requires a different form for stand-alone use, this can be specified using the "stand-alone" alternate:

This contains a list of elements that provide the user-translated names for territory codes, as described in Section 3, Unicode Language and Locale Identifiers.

This contains a list of elements that provide the user-translated names for the key values described in Section 3, Unicode Language and Locale Identifiers. Note that the type values may use aliases. Thus if the locale u-extension key "co" does not match, then the aliases have to be tried, using the bcp47 XML data:

This contains a list of elements that provide the user-translated names for the type values described in Section 3, Unicode Language and Locale Identifiers.

Since the translation of an option name may depend on the key it is used with, the latter is optionally supplied. Note that the key and type values may use aliases. Thus if the locale u-extension key "co" does not match, then the aliases have to be tried, using the bcp47 XML data.

This contains a list of elements that provide the user-translated names for systems of measurement. The types currently supported are "US", "metric", and "UK". Note: In the future, we may need to add display names for the particular measurement units (millimeter versus millimetre versus whatever the Greek, Russian, etc. are), and a message format for positioning those with respect to numbers.

The type values are the fully qualified subdivision names. For example:

See also Part 6 Section 2. This top-level element specifies general layout features. The lineOrder and characterOrder elements specify the default general ordering of lines within a page, and characters within a line. The possible values are:

If the value of lineOrder is one of the vertical values, then the value of characterOrder must be one of the horizontal values, and vice versa.

For example, for English the lines are top-to-bottom, and the characters are left-to-right. For Mongolian (in the Mongolian script) the lines are right-to-left, and the characters are top-to-bottom.
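The constraint between lineOrder and characterOrder can be checked mechanically. The sketch below assumes the four direction values "top-to-bottom", "bottom-to-top", "left-to-right", and "right-to-left"; "bottom-to-top" does not appear in the text above and is included here as an assumption, and the function name is hypothetical.

```python
# Assumed value sets; only three of these values are named in the text above.
VERTICAL = {"top-to-bottom", "bottom-to-top"}
HORIZONTAL = {"left-to-right", "right-to-left"}


def is_valid_orientation(line_order, character_order):
    """Return True if one order is vertical and the other is horizontal."""
    if line_order in VERTICAL:
        return character_order in HORIZONTAL
    if line_order in HORIZONTAL:
        return character_order in VERTICAL
    return False  # unknown lineOrder value


# English: lines run top-to-bottom, characters left-to-right.
print(is_valid_orientation("top-to-bottom", "left-to-right"))
# Mongolian as described above: lines right-to-left, characters top-to-bottom.
print(is_valid_orientation("right-to-left", "top-to-bottom"))
```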

This does not override the ordering behavior of bidirectional text; it does, however, supply the paragraph direction for that text (for more information, see UAX 9: The Bidirectional Algorithm [ UAX9 ]). For dates, times, and other data to appear in the right order, the display for them should be set to the orientation of the locale.

This element controls whether display names (language, territory, etc.) are title cased in GUI menu lists and the like. It is only used in languages where the normal display is lower case, but title case is used in lists.

There are two options:

In both cases, the title case operation is the default title case function defined by Chapter 3 of [ Unicode ]. In the second case, only the first word (using the word boundaries for that locale) will be title cased. This element indicates the casing of the data in the category identified by the inText type attribute, when that data is written in text or how it would appear in a dictionary.

The possible values and their meanings are:

It may also be used to help reduce confusability issues: see [ UTR39 ]. The stopwords are an experimental feature, and should not be used. Exemplars are characters used by a language, separated into different categories.

The following table provides a summary, with more details below. The basic exemplar character sets (main and auxiliary) contain the commonly used letters for a given modern form of a language, which can be used for testing and for determining the appropriate repertoire of letters for charset conversion or collation.

It is not a complete set of letters used for a language, nor should it be considered to apply to multiple languages in a particular country. Punctuation and other symbols should not be included in the main and auxiliary sets. In particular, format characters like CGJ are not included. There are five sets altogether: main, auxiliary, punctuation, numbers, and index. The main set should contain the minimal set required for users of the language, while the auxiliary exemplar set is designed to encompass additional characters: those non-native or historical characters that would customarily occur in common publications, dictionaries, and so on.

Major style guidelines are good references for the auxiliary set. Thus English has the following:

For a given language, there are a few factors that help for determining whether a character belongs in the auxiliary set, instead of the main set:

For example, the exemplar character set for en (English) is the set [a-z].

The main set typically includes those letters commonly considered part of the "alphabet". The punctuation set consists of common punctuation characters that are used with the language (corresponding to main and auxiliary). For example, English would have something like the following:

The numbers exemplars do not currently include lesser-used characters: exponential notation 3. It may contain some special formatting characters like the RLM.

The digits used in each numbering system are accessed in numberingSystems. When determining the character repertoire needed to support a language, a reasonable initial set would include at least the characters in the main and punctuation exemplar sets, along with the digits and common symbols associated with the numberSystems supported for the locale (see Numbering Systems).

The index characters are a set of characters for use as a UI "index", that is, a list of clickable characters or character sequences that allow the user to see a segment of a larger "target" list.

The index set may only contain characters whose lowercase versions are in the main and auxiliary exemplar sets, though for cased languages the index exemplars are typically in uppercase.
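The rule above lends itself to a simple consistency check. The following is a minimal sketch with a hypothetical function name; it treats each exemplar as a single character and uses Python's locale-independent `str.lower()`, which is an assumption that would not hold for languages with special casing rules such as Turkish.

```python
def invalid_index_chars(index, main, auxiliary):
    """Return index characters whose lowercase form is in neither set."""
    allowed = set(main) | set(auxiliary)
    return sorted(ch for ch in index if ch.lower() not in allowed)


# "Ñ" is rejected until its lowercase form appears in the auxiliary set.
print(invalid_index_chars("ABÑ", main="ab", auxiliary=""))
print(invalid_index_chars("ABÑ", main="ab", auxiliary="ñ"))
```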

Characters from the auxiliary exemplar set may be necessary in the index set if it needs to properly handle items such as names which may require characters not included in the main exemplar set. The display of the index characters can be modified with the indexLabels elements, discussed in Section 3. In all of the exemplar characters, the list of characters is in the Unicode Set format, which normally allows boolean combinations of sets of letters and Unicode properties. The characters should be in normalized form (NFC).

Where combining marks are used generatively, and apply to a large number of base characters (such as in Indic scripts), the individual combining marks should be included. Where they are used with only a few base characters, the specific combinations should be included. Wherever there is not a precomposed character (for example, a single codepoint) for a given combination, that must be included within braces. For example, to include sequences from the Where is my Character? The language would probably have plain 'z' in the auxiliary set, for use in foreign words.

If combining characters can be used productively in combination with a large number of others (such as, say, Indic matras), then they are not listed in all the possible combinations, but separately, such as:

The exemplar character set for Han characters is composed somewhat differently. It is even harder to draw a clear line for Han characters, since usage is more like a frequency curve that slowly trails off to the right in terms of decreasing frequency. So for this case, the exemplar characters simply contain a set of reasonably frequent characters for the language. The ordering of the characters in the set is irrelevant, but for readability in the XML file the characters should be in sorted order according to the locale's conventions.

The main and auxiliary sets should only contain lower case characters (except for the special case of Turkish and similar languages, where the dotted capital I should be included); the upper case letters are to be mechanically added when the set is used.
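The mechanical addition of upper case letters might look like the sketch below. It uses Python's default, locale-independent `str.upper()`, which is an assumption of this example; that default maps "i" to "I" rather than to the dotted capital "İ", which illustrates why the Turkish dotted capital has to be listed explicitly.

```python
def with_uppercase(exemplars):
    """Return the exemplar set plus the default uppercase of each member."""
    result = set(exemplars)
    for item in exemplars:
        # str.upper() applies default (non-tailored) case mapping.
        result.add(item.upper())
    return result


english = with_uppercase({"a", "b", "c"})
turkish_without_dotted = with_uppercase({"i", "ı"})
turkish_with_dotted = with_uppercase({"i", "ı", "İ"})

print(sorted(english))
print("İ" in turkish_without_dotted, "İ" in turkish_with_dotted)
```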

For more information on casing, see the discussion of Special Casing in the Unicode Character Database. This element has been deprecated.

For information on its structure and how it was intended to specify locale-specific preferred encodings for various purposes (e-mail, web), see the Mapping section from the CLDR 27 version of the LDML Specification. This element and its subelements have been deprecated. For information on its structure and how it was intended to provide data for a compressed display of index exemplar characters where space is limited, see the Index Labels section from the CLDR 27 version of the LDML Specification.

The ellipsis element provides patterns for use when truncating strings. There are three versions: initial for removing an initial part of the string (leaving final characters); medial for removing from the center of the string (leaving initial and final characters); and final for removing a final part of the string (leaving initial characters).

For example, the following uses the ellipsis character in all three cases (although some languages may have different characters for different positions).
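The three ellipsis variants can be applied as in the sketch below. The pattern strings with `{0}` and `{1}` placeholders, the `truncate` function, and its `keep` parameter are assumptions made for this example; the sketch also counts code points rather than user-perceived characters, which a real implementation would need to handle more carefully.

```python
# Assumed patterns: the ellipsis character in all three positions.
ELLIPSIS = {"initial": "…{0}", "medial": "{0}…{1}", "final": "{0}…"}


def truncate(text, keep, position="final"):
    """Shorten text to `keep` code points plus an ellipsis."""
    if len(text) <= keep:
        return text  # nothing to remove
    pattern = ELLIPSIS[position]
    if position == "initial":
        # Remove the initial part, leaving the final characters.
        return pattern.format(text[len(text) - keep:])
    if position == "final":
        # Remove the final part, leaving the initial characters.
        return pattern.format(text[:keep])
    # Medial: remove the center, leaving initial and final characters.
    head = (keep + 1) // 2
    tail = keep - head
    return pattern.format(text[:head], text[len(text) - tail:])


word = "internationalization"
print(truncate(word, 5, "initial"))
print(truncate(word, 6, "medial"))
print(truncate(word, 5, "final"))
```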

There are alternatives for cases where the breaks are on a word boundary, where some languages include a space. For example, such a case would be:

The moreInformation string is one that can be displayed in an interface to indicate that more information is available. The parseLenient elements are used to indicate that characters within a particular UnicodeSet are normally to be treated as equivalent when doing a lenient parse.

The scope attribute value defines where the lenient sets are intended for use. The level attribute value is included for future expansion; currently the only value is "lenient". The sample attribute value is a paradigm element of that UnicodeSet, but the only reason for pulling it out separately is so that different classes of characters are separated, and to enable inheritance overriding.

The first version of this data is populated with the data used for lenient parsing from ICU. The delimiters supply common delimiters for bracketing quotations. The quotation marks are used with simple quoted text, such as:

When quotations are nested, the quotation marks and alternate marks are used in an alternating fashion:

The delimiter data can be used for language-specific tailoring of linebreak behavior, as suggested in the description of linebreak class QU: Quotation in [ UAX14 ].

This is an example of tailoring type 1 (from that same document), changing the line breaking class assignment for some characters. Some characters with multiple uses should generally be excluded from this linebreak class remapping, such as:

The values are "metric", "US", or "UK"; others may be added over time. In some cases, it may be common to use different measurement systems for different categories of measurements. For example, the following indicates that for the category of temperature, in the regions LR and MM, it is more common to use metric units than US units.
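The alternation of quotation marks at successive nesting levels can be sketched as follows. The English-style marks used as defaults and the `quote` helper are assumptions for the example; a real implementation would read the marks from the locale's delimiter data.

```python
def quote(text, depth=0, marks=("“", "”"), alt_marks=("‘", "’")):
    """Wrap text in the marks appropriate for its nesting depth.

    Even depths use the primary marks and odd depths the alternate marks,
    so the two pairs alternate as quotations nest.
    """
    start, end = marks if depth % 2 == 0 else alt_marks
    return f"{start}{text}{end}"


inner = quote("Ignore that", depth=1)
outer = quote(f"He said, {inner}", depth=0)
print(outer)
```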

The paperSize attribute gives the height and width of paper used for normal business letters. The values are "A4" and "US-Letter". For both measurementSystem entries and paperSize entries, later entries for specific territories such as "US" will override the value assigned to that territory by earlier entries for more inclusive territories such as "". The measurement information was formerly in the main LDML file, and had a somewhat different format.
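The override behavior for measurementSystem and paperSize entries amounts to a most-specific-territory lookup. The sketch below is illustrative only: the containment table, the use of "001" as the all-inclusive territory, and the sample assignments are assumptions made for the example, not data taken from CLDR.

```python
# Hypothetical containment: each territory maps to a more inclusive one.
CONTAINMENT = {"US": "001", "DE": "001", "LR": "001"}

# Hypothetical entries: an inclusive default plus a specific override.
MEASUREMENT_SYSTEM = {"001": "metric", "US": "US"}
PAPER_SIZE = {"001": "A4", "US": "US-Letter"}


def lookup(territory, table):
    """Return the value for the most specific territory that has an entry."""
    while territory is not None:
        if territory in table:
            return table[territory]
        territory = CONTAINMENT.get(territory)  # widen to the parent
    return None


print(lookup("US", MEASUREMENT_SYSTEM), lookup("US", PAPER_SIZE))
print(lookup("DE", MEASUREMENT_SYSTEM), lookup("DE", PAPER_SIZE))
```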

Again, for finer-grained detail about specific units for various usages, see Part 6: Supplemental: Section 2. The measurement element is deprecated in the main LDML files, because the data is more appropriately organized as connected to territories, not to linguistic data. Instead, the measurementData element in the supplemental data file should be used. These elements specify the localized way of formatting quantities of units such as years, months, days, hours, minutes and seconds; for example, in English, "1 day" or "3 days".

The German rules are more complicated, because German has both gender and case. They thus have additional information, as illustrated below. Note that if there is no case attribute, for backwards compatibility the implied case is nominative. The possible values for case are listed in the grammaticalFeatures element.

These follow the inheritance specified in Part 1, Section 4. Units, like other values with a count attribute, use a special inheritance. See Part 1: Core: Section 4. The displayName is used for labels, such as in a UI.

It is typically lowercased and as neutral a plural form as possible, and then uses the casing context for the proper display. For example, for English in a UI it would appear as titlecase:

This is more fine-grained than merely a preference for metric versus US or UK measurement systems. For example, one locale may use meters alone, while another may use centimeters alone or a combination of meters and centimeters; a third may use inches alone, or informally a combination of feet and inches.

The unit preference and conversion data allows formatting functions to pick the right measurement units for the locale and usage, and convert input measurement into those units. For example, a program or database could use 1. The size of the measurement can also be taken into account, so that an infant can have a height as 18 inches, and an adult the height as 6 foot 2 inches.

Units of measurement, such as meter, have defined programmatic identifiers as described in this section. Yet while the user's list of desired languages really doesn't tell us the priority ranking among their languages, normally the fall-off between the user's languages is substantially greater than regional variants. The base language subtag "und" is a special case.

Suppose we have the following situation:

Part of this is because 'und' has a special function in BCP 47; it stands in for 'no supplied base language'. To prevent this from happening, if the desired base language is und, the language matcher should not apply likely subtags to it.

For example, suppose that nn-DE and nb-FR are being compared. The list is searched. The languages are truncated to nn-Latn and nb-Latn, then to nn and nb. Note that language matching is orthogonal to how closely two languages are related linguistically. For example, Breton is more closely related to Welsh than to French, but French is the better match because it is more likely that a Breton reader will understand French than Welsh.

This also illustrates that the matches are often asymmetric: it is not likely that a French reader will understand Breton. The results may be more understandable by users. Looking for en-SK, for example, should fall back to something within Europe (e.g. en-GB) in preference to something far away and unrelated (e.g. en-SG).

Such a closeness metric does not need to be exact; a small amount of data can be used to give an approximate distance between any two regions. The enhanced format for language matching adds structure to enable better matching of languages. The extended structure allows matching to take into account broad similarities that would give better results. Each region in that cluster should be closer to each other than to any other region.

And a region outside the cluster should be closer to another region outside that cluster than to one inside. Note that we use for all of the Americas in the variables above, because en-US should be in the same cluster as es and its contents. In the rules, the percent value These new variables and rules divide up the world into clusters, where items in the same clusters for specific languages get the normal regional difference, and items in different clusters get different weights.

Each cluster can have one or more associated paradigmLocales. These are locales that are preferred within a cluster. It would be possible to express this in rules, but using this mechanism handles these very common cases without bulking up the tables. The paradigmLocales also allow matching to macroregions. But es-MX should match more closely to es than to any of the other es sublocales.

There are two kinds of data that can be expressed in LDML: language-dependent data and supplementary data. In either case, data can be split across multiple files, which can be in multiple directory trees. The status of the data is the same, whether or not data is split.

That is, for the purpose of validation and lookup, all of the data for the above ja. The file name must match the identity element. Supplemental data can have different root elements, currently: ldmlBCP47 , supplementalData , keyboard , and platform. Keyboard and platform files are considered distinct. The ldmlBCP47 files and supplementalData files that have the same root are all logically part of the same file; they are simply split into separate files for convenience.

Implementations may split the files in different ways, also for their convenience. The following sections describe the structure of the XML format for language-dependent data. The more precise syntax is in the ldml. The XML structure is stable over releases. Elements and attributes may be deprecated: they are retained in the DTD but their usage is strongly discouraged.

In most cases, an alternate structure is provided for expressing the information. There is only one exception: newer DTDs cannot be used with version 1. In general, all translatable text in this format is in element contents, while attributes are reserved for types and non-translated information such as numbers or dates. The reason that attributes are not used for translatable text is that spaces are not preserved, and we cannot predict where spaces may be significant in translated material.

There are two kinds of elements in LDML: rule elements and structure elements. For structure elements, there are restrictions to allow for effective inheritance and processing:

Rule elements do not have this restriction, but also do not inherit, except as an entire block.

The rule elements are listed in serialElements in the supplemental metadata. See also Section 4. For more technical details, see Updating-DTDs. Note that the data in examples given below is purely illustrative, and does not match any particular language. For a more detailed example of this format, see [ Example ]. There is also a DTD for this format, but remember that the DTD alone is not sufficient to understand the semantics, the constraints, nor the interrelationships between the different elements and attributes.

You may wish to have copies of each of these to hand as you proceed through the rest of this document. In particular, all elements allow for draft versions to coexist in the file at the same time. Thus most elements are marked in the DTD as allowing multiple instances. However, unless an element is listed as a serialElement, or has a distinguishing attribute, it can only occur once as a subelement of a given element.

Thus, for example, the following is illegal even though allowed by the DTD:

There must be only one instance of these per parent, unless there are other distinguishing attributes such as an alt attribute.

Thus LDML documents must not be normalized as a whole. Lists, such as singleCountries, are space-delimited. That means that they are separated by one or more XML whitespace characters. This element is designed to allow for arbitrary additional annotation and data that is product-specific. It has one required attribute xmlns, which specifies the XML namespace of the special data.

For example, the following used the version 1. The elements in this section are not part of the Locale Data Markup Language 1. Instead, they are special elements used for application-specific data to be stored in the Common Locale Repository. They may change or be removed in future versions of this document, and are present here more as examples of how to extend the format.

Some of these items may move into a future version of the Locale Data Markup Language specification. The above examples are old versions: consult the documentation for the specific application to see which should be used. These DTDs use namespaces and the special element. To include one or more, use the following pattern to import the special DTDs that are used in the file:. That element has been withdrawn, pending further investigation, since is a Type 1 TR: "when the required support cannot be obtained for the publication of an International Standard, despite repeated effort".

See the ballot comments on Comments for details on the defects. For example, most of these patterns make little provision for substantial changes in format when elements are empty, so are not particularly useful in practice. Compare, for example, the mail-merge capabilities of production software such as Microsoft Word or OpenOffice.

Note: While the CLDR specification guarantees backwards compatibility, the definition of specials is up to other organizations. Any assurance of backwards compatibility is up to those organizations. A number of the elements above can have extra information for openoffice.

The contents of any element in root can be replaced by an alias, which points to the path where the data can be found. If not found there, then the resource bundle at "de" will be searched, and so on. If the path attribute is present, then its value is an [ XPath ] that points to a different node in the tree.

The default value if the path is not present is the same position in the tree. All of the attributes in the [ XPath ] must be distinguishing elements. For more details, see Section 4. This special value is equivalent to the locale being resolved. For example, consider the following example, where locale data for 'de' is being resolved:

The alias in root is logically replaced not by the elements in root itself, but by elements in the 'target' locale.

For more details on data resolution, see Section 4. Aliases must be resolved recursively. An alias may point to another path that results in another alias being found, and so on.

For example, looking up Thai buddhist abbreviated months for the locale xx-YY may result in the following chain of aliases being followed:

It is an error to have a circular chain of aliases. That is, a collection of LDML XML documents must not have situations where a sequence of alias lookups (including inheritance and lateral inheritance) can be followed indefinitely without terminating.

Many elements can have a display name. This is a translated name that can be presented to users when discussing the particular service. For example, a number format, used to format numbers using the conventions of that locale, can have a translated name for presentation in GUIs. Where present, the display names must be unique; that is, two distinct codes would not get the same display name.
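Recursive alias resolution with detection of circular chains can be sketched as below. This is a simplified model: paths are plain strings, aliases are a flat mapping from one path to another, and locale inheritance is not modeled at all. The path names, the sample value, and the `AliasCycleError` class are hypothetical.

```python
class AliasCycleError(Exception):
    """Raised when a chain of aliases loops back on itself."""


def resolve(path, data, aliases):
    """Follow aliases from `path` until a non-alias path is reached."""
    seen = []
    while path in aliases:
        if path in seen:
            raise AliasCycleError(" -> ".join(seen + [path]))
        seen.append(path)
        path = aliases[path]  # an alias may point at another alias
    return data.get(path)


# Hypothetical two-step chain ending at real data.
aliases = {
    "buddhist/months/abbreviated": "gregorian/months/abbreviated",
    "gregorian/months/abbreviated": "gregorian/months/wide",
}
data = {"gregorian/months/wide": ["M01", "M02"]}
print(resolve("buddhist/months/abbreviated", data, aliases))

# A circular chain is an error rather than an infinite loop.
try:
    resolve("a", {}, {"a": "b", "b": "a"})
except AliasCycleError as error:
    print("cycle:", error)
```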

There is one exception to this: in time zones, where parsing results would give the same GMT offset, the standard and daylight display names can be the same across different time zone IDs. Any translations should follow customary practice for the locale in question. For more information, see [ Data Formats ]. Unfortunately, XML does not have the capability to contain all Unicode code points. Due to this, in certain instances extra syntax is required to represent those code points that cannot be otherwise represented in element content.

The escaping syntax is only defined on a few types of elements, such as in collation or exemplar sets, and uses the appropriate syntax for that type. If this attribute is present, it indicates the status of all the data in this element and any subelements (unless they have a contrary draft value), as per the following:

The draft attribute should only occur on "leaf" elements, and is deprecated elsewhere. For a more formal description of how elements are inherited, and what their draft status is, see Section 4. This attribute labels an alternative value for an element. The value is a descriptor that indicates what kind of alternative it is, and takes one of the following.

It indicates that the data is proposed replacement data that has been added provisionally until the differences between it and the other data can be vetted. For example, suppose that the translation for September for some language is "Settembru", and a bug report is filed that that should be "Settembro".

Now assume another bug report comes in, saying that the correct form is actually "Settembre". Another alternative can be added:

The values for variantname at this time include "variant", "list", "email", "www", "short", and "secondary". For a more complete description of how draft applies to data, see Section 4. The value of this attribute is a token representing a reference for the information in the element, including standards that it may conform to.

In older versions of CLDR, the value of the attribute was freeform text. That format is deprecated. The reference element may be inherited. When attributes specify date ranges, it is usually done with attributes from and to. The from attribute specifies the starting point, and the to attribute specifies the end point.

The deprecated time attribute was formerly used to specify time with the deprecated weekEndStart and weekEndEnd elements, which were themselves inherently from or to. The data format is a restricted ISO format, restricted to the fields year , month , day , hour , minute , and second in that order, with "-" used as a separator between date fields, a space used as the separator between the date and the time fields, and : used as a separator between the time fields.

If the minute or minute and second are absent, they are interpreted as zero. If the hour is also missing, then it is interpreted based on whether the attribute is from or to. That is, Friday at is the same time as Saturday at Thus when the hour is missing, the from and to are interpreted inclusively: the range includes all of the day mentioned. If the from element is missing, it is assumed to be as far backwards in time as there is data for; if the to element is missing, then it is from this point onwards, with no known end point.
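The interpretation of from and to values can be sketched as follows. The sketch assumes that the year, month, and day are always present, that only trailing time fields may be omitted, and that the end of an inclusive to day can be represented as the start of the following day; the function name `parse_bound` is hypothetical.

```python
from datetime import datetime, timedelta


def parse_bound(value, kind):
    """Parse a restricted date string for a `from` or `to` attribute.

    kind is "from" or "to". A missing minute or second is read as zero.
    A missing hour makes the bound cover the whole day: `from` starts at
    the beginning of the day, `to` ends where the next day begins.
    """
    date_part, _, time_part = value.strip().partition(" ")
    year, month, day = (int(field) for field in date_part.split("-"))
    if time_part:
        fields = [int(field) for field in time_part.split(":")]
        fields += [0] * (3 - len(fields))  # absent minute/second become zero
        return datetime(year, month, day, *fields)
    start_of_day = datetime(year, month, day)
    if kind == "from":
        return start_of_day
    return start_of_day + timedelta(days=1)


print(parse_bound("2005-06-15", "from"))
print(parse_bound("2005-06-15", "to"))
print(parse_bound("2005-06-15 10:30", "from"))
```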

The dates and times are specified in local time, unless otherwise noted. The content of certain elements, such as date or number formats, may consist of several sub-elements with an inherent order (for example, the year, month, and day for dates).

In some cases, the order of these sub-elements may be changed depending on the bidirectional context in which the element is embedded. For example, short date formats in languages such as Arabic may contain neutral or weak characters at the beginning or end of the element content. In such a case, the overall order of the sub-elements may change depending on the surrounding text.

Some attribute values or element contents use UnicodeSet notation. A UnicodeSet represents a finite set of Unicode code points and strings, and is defined by lists of code points and strings, Unicode property sets, and set operators, all bounded by square brackets. In this context, a code point means a string consisting of exactly one code point. Note however that it may deviate from the syntax provided in [ UTS18 ], which is illustrative rather than a requirement.

There is one exception to the supported semantics, Section RL2. In such a case, the specification may specify a subset of the syntax provided here. Notably, property names and values are restricted to those supported by the implementation, and have additional constraints imposed by [ UAX44 ]. In addition, quoted values that resolve to more than one code point are disallowed in ranges of the form char '-' char.

Lists are a sequence of strings that may include ranges, which are indicated by a '-' between two code points, as in "a-z". The sequence start-end specifies the range of all code points from the start to end, inclusive, in Unicode order. For example, [a c d-f m] is equivalent to [a c d e f m].
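A minimal sketch of range expansion for the simple list form is shown below. It handles only single code points, ranges, and ignorable whitespace inside one pair of brackets; escapes, quoted strings, braces, properties, and nested sets are out of scope, and the function name is hypothetical.

```python
def expand_list(body):
    """Expand the text between [ and ] into a set of characters."""
    chars = [ch for ch in body if not ch.isspace()]
    result = set()
    i = 0
    while i < len(chars):
        if i + 2 < len(chars) and chars[i + 1] == "-":
            start, end = ord(chars[i]), ord(chars[i + 2])
            if start > end:
                raise ValueError(f"invalid range {chars[i]}-{chars[i + 2]}")
            # Ranges are inclusive and follow code point order.
            result.update(chr(cp) for cp in range(start, end + 1))
            i += 3
        else:
            result.add(chars[i])
            i += 1
    return result


print(sorted(expand_list("a c d-f m")))
print(expand_list("a c d-f m") == expand_list("a c d e f m"))
```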

Whitespace can be freely used for clarity, as [a c d-f m] means the same as [acd-fm]. It can be used with the range notation, as described in Section 5. There is an additional restriction on string ranges in a UnicodeSet: the number of codepoints in the first string of the range must be identical to the number in the second.

Outside of single quotes, certain backslashed code point sequences can be used to quote code points:

Anything else following a backslash is mapped to itself, except the property syntax described below, or in an environment where it is defined to have some special meaning.

Any code point formed as the result of a backslash escape loses any special meaning and is treated as a literal. In contrast, Java treats Unicode escapes as just a way to represent arbitrary code points in an ASCII source file, and any resulting code points are not tagged as literals. The property names are defined by the PropertyAliases.

For more information, see [ UAX44 ]. If the property value is omitted, it is assumed to represent a boolean property with the value "true". Also, the table shows the "Negative" version, which is a property that excludes all code points of a given kind. The low-level lists or properties can then be freely combined with the normal set operations (union, inverse, difference, and intersection). Another example is the set [[ace][bdf] - [abc][def]], which is not the empty set, but instead equal to [[[[ace] [bdf]] - [abc]] [def]], which equals [[[abcdef] - [abc]] [def]], which equals [[def] [def]], which equals [def].
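The step-by-step reduction of [[ace][bdf] - [abc][def]] can be reproduced with ordinary Python sets. This is only a sketch of the left-to-right reading shown in the example, using `|` for union and `-` for difference; the variable names are illustrative.

```python
ace, bdf, abc, def_ = set("ace"), set("bdf"), set("abc"), set("def")

# [[ace][bdf] - [abc][def]] read left to right, one operation at a time:
step1 = ace | bdf      # [[ace][bdf]]        gives [abcdef]
step2 = step1 - abc    # [[abcdef] - [abc]]  gives [def]
result = step2 | def_  # [[def] [def]]       gives [def]

# Reading the pattern as "one union minus another union" would instead
# give the empty set, which is the interpretation the example rules out.
grouped = (ace | bdf) - (abc | def_)
```

Here `result` is the set of d, e, and f, while `grouped` is empty, which is why the grouping matters.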

That is, they must be immediately preceded and immediately followed by a set. For example, the pattern [[:Lu:]-A] is illegal, since it is interpreted as the set [:Lu:] followed by the incomplete range -A. To specify the set of upper case letters except for 'A', enclose the 'A' in brackets: [[:Lu:]-[A]]. There may be additional, domain-specific requirements for validity of the expansion of the string range. The identity element contains information identifying the target locale for this data, and general information about the version of this data.

The version element provides, in an attribute, the version of this file. The contents of the element can contain textual notes about the changes between this version and the last. This is not to be confused with the version attribute on the ldml element, which tracks the dtd version. The generation element is now deprecated. It was used to contain the last modified date for the data. The language code is the primary part of the specification of the locale id, with values as described above.

The script code may be used in the identification of written languages, with values described above. The territory code is a common part of the specification of the locale id, with values as described above. The variant code is the tertiary part of the specification of the locale id, with values as described above.

When combined according to the rules described in Section 3, Unicode Language and Locale Identifiers , the language element, along with any of the optional script , territory , and variant elements, must identify a known, stable locale identifier.

Otherwise, it is an error. The following are restrictions on the format of LDML files to allow for easier parsing and comparison of files. Peer elements have consistent order. That is, if the DTD or this specification requires the following order in an element foo: Note that there was one case that had to be corrected in order to make this true. For that reason, pattern occurs twice under currency.

XML files can vary widely in textual form while representing precisely the same data. Putting the LDML files in the repository into a canonical form allows us to use the simple, widely used diff tools (including those in CVS) to detect differences when vetting changes, without those tools being confused. This is not a requirement on other uses of LDML; it is simply a way to manage repository data more easily. That is, new IDs are added, but existing ones keep the original form.

The TZ timezone database keeps a set of equivalences in the "backward" file. These are used to map other tzids to the canonical form. An element is ordered first by the element name, and then if the element names are identical, by the sorted set of attribute-value pairs. For the latter, compare the first pair in each in sorted order by attribute pair. If not identical, go to the second pair, and so on. Elements and attributes are ordered according to their order in the respective DTDs.
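The element ordering described above can be sketched as a sort key. Everything named here is hypothetical: `ELEMENT_ORDER` and `ATTRIBUTE_ORDER` stand in for the declaration order in a DTD, and attribute values are compared as plain strings, which is a simplification of the real attribute value ordering.

```python
# Hypothetical order tables standing in for DTD declaration order.
ELEMENT_ORDER = ["language", "script", "territory", "variant"]
ATTRIBUTE_ORDER = ["type", "alt", "draft"]


def sort_key(element):
    """Build a sort key for an element given as (name, attributes_dict).

    Elements compare first by name, then pair by pair over their
    attribute-value pairs, taken in attribute order.
    """
    name, attrs = element
    pairs = sorted(
        ((ATTRIBUTE_ORDER.index(attr), value) for attr, value in attrs.items())
    )
    return (ELEMENT_ORDER.index(name), pairs)


elements = [
    ("territory", {"type": "US"}),
    ("language", {"type": "fr"}),
    ("language", {"alt": "short", "type": "en"}),
    ("language", {"type": "en"}),
]
ordered = sorted(elements, key=sort_key)
```

Because Python compares lists of tuples lexicographically, two elements with the same name are decided by their first differing attribute-value pair, and an element whose pairs are a prefix of another's sorts first.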

Attribute value comparison is a bit more complicated, and may depend on the attribute and type. This is currently done with specific ordering tables. Any future additions to the DTD must be structured so as to allow compatibility with this ordering.

See also Section 5. To make up for that, DTD annotations are added. These are of the form. The current annotations are:. There is additional information in the attributeValueValidity. For example, the following line indicates that the 'currency' element in the ldml dtd must have values from the bcp47 'cu' type.

The element values may be literals, regular expressions, or variables (some of which are set programmatically according to other CLDR data, such as the above). However, the information at this point does not cover all attribute values, is used only for testing, and should not be used in implementations since the structure may change without notice.

The following are constraints on the attribute values. That is because the data is more likely to be parsed by implementations that already parse UCD data. This file provides general information about scripts that may be useful to implementations processing text. The information is the best currently available, and may change between versions of CLDR. The format is similar to Unicode Character Database property file, and is documented in the header of the data file.

As of Emoji version This file provides general information about associations of labels to characters that may be useful to implementations of character-picking applications. Initially, the contents are focused on emoji, but may be expanded in the future to other types of characters.

Note that a character may have multiple labels. There are also specific test files for the supported Indic scripts in the unittest directory. User input is frequently messy. Attempting to parse it by matching it exactly against a pattern is likely to be unsuccessful, even when the meaning of the input is clear to a human being. The goal of lenient parsing is to accept user input whenever it is possible to decipher what the user intended. Doing so requires using patterns as data to guide the parsing process, rather than an exact template that must be matched.

This informative section suggests some heuristics that may be useful for lenient parsing of dates, times, and numbers. Loose matching ignores attributes of the strings being compared that are not important to matching. It involves the following steps:. Loose matching involves logically applying the above transform to both the input text and to each of the field elements used in matching, before applying the specific heuristics below. For example, if the input number text is " - NA f. The currency signs are also transformed, so "NA f.

As with other Unicode algorithms, this is a logical statement of the process; actual implementations can optimize, such as by applying the transform incrementally during matching. The recommended behavior for handling such an invalid pattern field is:.

The remainder of this section describes selected cases of deprecated structure that were present in previous versions of CLDR. The fallback element is deprecated. Implementations should use instead the information in Section 4.

Virginia Powell's Ownd
