[Mirrored from: http://www.textuality.com/xml/typing.html, May 15, 1997. See the updated (May 21) version.]

Adding Strong Data Typing to SGML and XML

Tim Bray
May 3, 1997
This draft is intended for public discussion.

Table of Contents

1. Introduction
2. SQL Data Typing
3. Associating Types with XML Elements
    3.1 Types: The XML-SQLTYPE and XML-SQLSIZE Attributes
    3.2 Value Ranges: the XML-SQLMIN and XML-SQLMAX Attributes
    3.3 #PCDATA Only
4. Type Semantics
    4.1 The Meaning of Type Declarations
    4.2 Counting Characters
    4.3 XML-SQLTYPE="CHAR"
    4.4 XML-SQLTYPE="VARCHAR"
    4.5 XML-SQLTYPE="INTEGER"
    4.6 XML-SQLTYPE="DECIMAL"
    4.7 XML-SQLTYPE="FLOAT"
    4.8 XML-SQLTYPE="DATE"
    4.9 XML-SQLTYPE="TIME"
    4.10 XML-SQLTYPE="TIMESTAMP"
5. Examples

1. Introduction

SGML and XML ("XML" refers to both from here on in) provide facilities for declaring document structures. However, there is very limited support for data typing as a database person would see it. This is an obvious deficiency whose seriousness will increase as XML is used increasingly for electronic data interchange and database-related applications.

This note proposes a mechanism to attach strong type declarations to XML elements using reserved attributes. While this is similar to HyTime's "architectural form" mechanism, this note neither assumes an understanding of that mechanism nor provides any discussion of it.

2. SQL Data Typing

SQL, as defined in International Standard ISO/IEC 9075:1992, is a language designed for use in defining and accessing structured data repositories. It includes a comprehensive selection of data types: see 6.1 <data type>. This selection has been proven effective in practice.

This note provides XML mechanisms for declaring elements to be one of a subset of these SQL types, and for restricting the range of allowed values for numeric types.

3. Associating Types with XML Elements

3.1 Types: The XML-SQLTYPE and XML-SQLSIZE Attributes

SQL data types may be associated with XML elements through two reserved attributes, XML-SQLTYPE and XML-SQLSIZE. The value of the XML-SQLTYPE attribute corresponds to the name of a SQL data type. The value of the XML-SQLSIZE attribute is a number, which parameterizes the type. Its interpretation depends on the value of the XML-SQLTYPE attribute. Although this attribute conflates several distinct SQL concepts, its usage is unambiguous in the context of the declared data type.

The SGML declaration for these attributes is given below; we use an SGML rather than an XML declaration as a convenient method of expressing the constraint that the value of XML-SQLSIZE must be numeric.

<!ELEMENT AnyElement (#PCDATA)>
<!ATTLIST AnyElement
          XML-SQLTYPE ( CHAR|VARCHAR
                       |INTEGER|DECIMAL|FLOAT
                       |DATE|TIME|TIMESTAMP)  #IMPLIED
          XML-SQLSIZE NUMBER                  "0" >

In the example above, XML-SQLTYPE is #IMPLIED; in practice, one would expect this to be given a #FIXED default in the DTD, so that all instances of some element would have the same type. When the XML-SQLTYPE attribute is not provided for some element, this simply means that no assertion is made concerning the data type of that element.

If XML-SQLTYPE is provided but XML-SQLSIZE is not, the value is assumed to be zero, as declared above. This value makes sense in the one case (TIME) where it is desirable to have a default, and is sufficiently unusable for character and numeric types as to require that it be provided, without having to write that requirement into this document.

3.2 Value Ranges: the XML-SQLMIN and XML-SQLMAX Attributes

Elements for which the type is constrained with XML-SQLTYPE may have ranges of validity declared using the attributes XML-SQLMIN and XML-SQLMAX. These have no defaults of any kind; if not provided, no range constraint is placed on the content.

<!ATTLIST AnyElement
          XML-SQLMIN CDATA #IMPLIED
          XML-SQLMAX CDATA #IMPLIED >

In all cases, the value of XML-SQLMIN and XML-SQLMAX must meet the constraints expressed by XML-SQLTYPE and XML-SQLSIZE.

For numeric, date, and time data types, the ordering is unambiguous and the interpretation of XML-SQLMIN and XML-SQLMAX is obvious. For the CHAR and VARCHAR data types, the lexical ordering of strings is often implementation dependent. While the ordering of strings made up of characters from the ASCII and ISO-Latin character sets is well-understood, this is not the case with Unicode characters representing the glyphs of many Asian languages.

Confusion is also possible due to the fact that the direction in which characters are visually rendered into strings varies from language to language (Arabic runs right-to-left) and even within languages (Chinese may be rendered validly in many different directions).

To avoid ambiguity, for XML-SQLMIN and XML-SQLMAX range checking of CHAR and VARCHAR elements, lexical comparison of strings must always be done using the numeric values of the Unicode encoding of the characters in the string, in increasing order of the address at which they are stored.
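As a minimal sketch of the comparison rule above, the following Python compares strings by the numeric value of each character, taken in storage order. It assumes that "the numeric values of the Unicode encoding" can be read as Unicode code points (what Python's ord returns); the names codepoints and in_lexical_range are hypothetical and not part of this note.

```python
from typing import List, Optional


def codepoints(s: str) -> List[int]:
    """Return the numeric Unicode value of each character, in storage order."""
    return [ord(ch) for ch in s]


def in_lexical_range(content: str,
                     minimum: Optional[str] = None,
                     maximum: Optional[str] = None) -> bool:
    """Range-check content against optional XML-SQLMIN / XML-SQLMAX values.

    Assumption: comparison is by Unicode code point, first stored character
    first. No locale collation or rendering direction is involved.
    """
    key = codepoints(content)
    if minimum is not None and key < codepoints(minimum):
        return False
    if maximum is not None and key > codepoints(maximum):
        return False
    return True
```

Under this rule a lowercase "a" falls outside the range "A" to "F", because its numeric value (97) is greater than that of "F" (70).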

3.3 #PCDATA Only

The attributes described in this note may only be attached to elements with #PCDATA content; that is, those which have no child elements.

4. Type Semantics

In this section, the term content refers to the character data contained in an element.

4.1 The Meaning of Type Declarations

A type declaration of the form described in this note asserts that the content of some element should meet the constraints (described herein) expressed by that declaration.

4.2 Counting Characters

Several of the type declarations constrain the allowed length of the content. In this case, the length is in characters, and should be evaluated after all entity and character references have been processed; i.e. the count applies to the content as received by an application, not as encoded in the containing entity.
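The counting rule above can be illustrated with a short Python sketch that measures the content only after the parser has expanded references. It uses the standard xml.etree.ElementTree module; the function name content_length and the NAME element are hypothetical, and the sketch assumes one Python string character per counted character.

```python
import xml.etree.ElementTree as ET


def content_length(element_source: str) -> int:
    """Length, in characters, of an element's #PCDATA content as an
    application receives it (after entity and character references
    have been processed by the parser)."""
    root = ET.fromstring(element_source)
    # root.text is None for an empty element.
    return len(root.text or "")
```

For instance, the content of <NAME>&#65;&amp;B</NAME> is received as the three characters A&B, even though it occupies eleven characters in the containing entity.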

4.3 XML-SQLTYPE="CHAR"

The content is fixed in length.

  • The length of the content must be the number of characters given in XML-SQLSIZE.
  • If XML-SQLMIN is provided, it must be a character string of the length given by XML-SQLSIZE. The content must be lexically greater than or equal to the value of XML-SQLMIN.
  • If XML-SQLMAX is provided, it must be a character string of the length given by XML-SQLSIZE. The content must be lexically less than or equal to the value of XML-SQLMAX.

4.4 XML-SQLTYPE="VARCHAR"

The content is variable in length up to a fixed maximum.

  • The length of the content must be less than or equal to the number of characters given in XML-SQLSIZE.
  • If XML-SQLMIN is provided, it must be a character string of length less than or equal to that given by XML-SQLSIZE. The content must be lexically greater than or equal to the value of XML-SQLMIN.
  • If XML-SQLMAX is provided, it must be a character string of length less than or equal to that given by XML-SQLSIZE. The content must be lexically less than or equal to the value of XML-SQLMAX.
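The CHAR and VARCHAR rules differ only in the length test, so one hedged Python sketch can cover both. The function name check_string is hypothetical. The sketch assumes that lengths are counted in Python string characters and that lexical comparison is by Unicode code point, which is how Python compares strings.

```python
from typing import Optional


def check_string(content: str, sqltype: str, size: int,
                 minimum: Optional[str] = None,
                 maximum: Optional[str] = None) -> bool:
    """Check content against XML-SQLTYPE="CHAR" or "VARCHAR"."""

    def length_ok(s: str) -> bool:
        if sqltype == "CHAR":
            return len(s) == size       # fixed length
        if sqltype == "VARCHAR":
            return len(s) <= size       # variable, up to a maximum
        raise ValueError("not a character type: " + sqltype)

    # The bounds themselves must satisfy the declared type and size.
    for bound in (minimum, maximum):
        if bound is not None and not length_ok(bound):
            raise ValueError("bound violates XML-SQLSIZE: " + repr(bound))

    if not length_ok(content):
        return False
    if minimum is not None and content < minimum:
        return False
    if maximum is not None and content > maximum:
        return False
    return True
```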

4.5 XML-SQLTYPE="INTEGER"

The content represents a decimal integer number.

  • XML-SQLSIZE is ignored.
  • The content must match: ('+'|'-')? digit+
  • If XML-SQLMIN is provided, it must be an integer. The content, interpreted as a number, must be greater than or equal to the value of XML-SQLMIN.
  • If XML-SQLMAX is provided, it must be an integer. The content, interpreted as a number, must be less than or equal to the value of XML-SQLMAX.
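A minimal Python sketch of the INTEGER rules follows. The names INTEGER_PATTERN and check_integer are hypothetical; the sketch assumes "digit" means the ASCII digits 0 through 9 and that no surrounding white space is permitted.

```python
import re
from typing import Optional

# ('+'|'-')? digit+
INTEGER_PATTERN = re.compile(r"[+-]?[0-9]+")


def check_integer(content: str,
                  minimum: Optional[str] = None,
                  maximum: Optional[str] = None) -> bool:
    """Check content against XML-SQLTYPE="INTEGER"; XML-SQLSIZE is ignored."""
    # The bounds must themselves be integers.
    for bound in (minimum, maximum):
        if bound is not None and not INTEGER_PATTERN.fullmatch(bound):
            raise ValueError("bound is not an integer: " + repr(bound))

    if not INTEGER_PATTERN.fullmatch(content):
        return False
    value = int(content)
    if minimum is not None and value < int(minimum):
        return False
    if maximum is not None and value > int(maximum):
        return False
    return True
```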

4.6 XML-SQLTYPE="DECIMAL"

The content represents a fixed-point decimal number with a fixed number of digits after the decimal point.

  • The content must match: ('+'|'-')? digit+ '.' digit*
  • XML-SQLSIZE provides the SQL scale of the content's numeric value; the number of digits following the decimal point must be equal to that scale.
  • If XML-SQLMIN is provided, it must be a decimal number with the same scale as the content. The content, interpreted as a number, must be greater than or equal to the value of XML-SQLMIN.
  • If XML-SQLMAX is provided, it must be a decimal number with the same scale as the content. The content, interpreted as a number, must be less than or equal to the value of XML-SQLMAX.
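The DECIMAL rules can be sketched in Python with the standard decimal module, which avoids binary rounding when comparing against the bounds. The names DECIMAL_PATTERN, has_scale, and check_decimal are hypothetical, and "digit" is assumed to mean the ASCII digits.

```python
import re
from decimal import Decimal
from typing import Optional

# ('+'|'-')? digit+ '.' digit*   (the fraction digits are captured)
DECIMAL_PATTERN = re.compile(r"[+-]?[0-9]+\.([0-9]*)")


def has_scale(text: str, scale: int) -> bool:
    """True if text matches the pattern with exactly `scale` fraction digits."""
    m = DECIMAL_PATTERN.fullmatch(text)
    return m is not None and len(m.group(1)) == scale


def check_decimal(content: str, scale: int,
                  minimum: Optional[str] = None,
                  maximum: Optional[str] = None) -> bool:
    """Check content against XML-SQLTYPE="DECIMAL" with XML-SQLSIZE=scale."""
    # The bounds must have the same scale as the content.
    for bound in (minimum, maximum):
        if bound is not None and not has_scale(bound, scale):
            raise ValueError("bound has the wrong scale: " + repr(bound))

    if not has_scale(content, scale):
        return False
    value = Decimal(content)
    if minimum is not None and value < Decimal(minimum):
        return False
    if maximum is not None and value > Decimal(maximum):
        return False
    return True
```

Note that the pattern as written requires the decimal point even when the scale is zero, so "12." is accepted at scale 0 while "12" is not.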

4.7 XML-SQLTYPE="FLOAT"

The content represents a floating-point number.

  • XML-SQLSIZE is ignored.
  • The content must match: ('+'|'-')? digit+ '.' digit*
  • If XML-SQLMIN is provided, it must be a floating-point number. The content, interpreted as a number, must be greater than or equal to the value of XML-SQLMIN.
  • If XML-SQLMAX is provided, it must be a floating-point number. The content, interpreted as a number, must be less than or equal to the value of XML-SQLMAX.
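A short Python sketch of the FLOAT rules follows. The names FLOAT_PATTERN and check_float are hypothetical. It applies the pattern exactly as given above, so exponent notation such as "1e5" is rejected, and it assumes comparison in double-precision floating point.

```python
import re
from typing import Optional

# ('+'|'-')? digit+ '.' digit*
FLOAT_PATTERN = re.compile(r"[+-]?[0-9]+\.[0-9]*")


def check_float(content: str,
                minimum: Optional[str] = None,
                maximum: Optional[str] = None) -> bool:
    """Check content against XML-SQLTYPE="FLOAT"; XML-SQLSIZE is ignored."""
    # The bounds must themselves be floating-point numbers.
    for bound in (minimum, maximum):
        if bound is not None and not FLOAT_PATTERN.fullmatch(bound):
            raise ValueError("bound is not a float: " + repr(bound))

    if not FLOAT_PATTERN.fullmatch(content):
        return False
    value = float(content)
    if minimum is not None and value < float(minimum):
        return False
    if maximum is not None and value > float(maximum):
        return False
    return True
```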

4.8 XML-SQLTYPE="DATE"

The content represents a date, provided in the order Year, Month, Day.

  • XML-SQLSIZE is ignored.
  • The content must match: digit digit digit digit '/' digit digit '/' digit digit
  • The first four digits represent the year; all values between 0000 and 9999 inclusive are allowed.
  • The second two digits represent the month; only values between 01 and 12 inclusive are allowed.
  • The final two digits represent the day of the month; only values between 01 and 31 inclusive are allowed.
  • The values of year, month, and day must be congruent according to the rules for well-formed dates in the Gregorian calendar.
  • If XML-SQLMIN is provided, it must represent a date as described immediately above. The content, interpreted as a date, must be later than or equal to the value of XML-SQLMIN.
  • If XML-SQLMAX is provided, it must represent a date as described immediately above. The content, interpreted as a date, must be earlier than or equal to the value of XML-SQLMAX.
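The DATE rules above can be sketched in Python as follows. The names DATE_PATTERN, parse_date, and check_date are hypothetical. The sketch assumes that Gregorian leap-year rules are applied uniformly to every year from 0000 to 9999, and it represents a date as a (year, month, day) tuple so that tuple comparison gives chronological order.

```python
import calendar
import re
from typing import Optional, Tuple

# digit digit digit digit '/' digit digit '/' digit digit
DATE_PATTERN = re.compile(r"([0-9]{4})/([0-9]{2})/([0-9]{2})")
DAYS_IN_MONTH = (31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31)


def parse_date(text: str) -> Optional[Tuple[int, int, int]]:
    """Return (year, month, day) for a well-formed date, else None."""
    m = DATE_PATTERN.fullmatch(text)
    if m is None:
        return None
    year, month, day = (int(g) for g in m.groups())
    if not 1 <= month <= 12:
        return None
    limit = DAYS_IN_MONTH[month - 1]
    if month == 2 and calendar.isleap(year):
        limit = 29
    if not 1 <= day <= limit:
        return None
    return (year, month, day)


def check_date(content: str,
               minimum: Optional[str] = None,
               maximum: Optional[str] = None) -> bool:
    """Check content against XML-SQLTYPE="DATE"; XML-SQLSIZE is ignored."""
    value = parse_date(content)
    if value is None:
        return False
    for bound, too_far in ((minimum, lambda b: value < b),
                           (maximum, lambda b: value > b)):
        if bound is None:
            continue
        parsed = parse_date(bound)
        if parsed is None:
            raise ValueError("bound is not a date: " + repr(bound))
        if too_far(parsed):
            return False
    return True
```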

4.9 XML-SQLTYPE="TIME"

Describes a time of day.

  • If XML-SQLSIZE is provided with any value other than the default 0, this signals that the timezone value, described below, must be provided in the content. In SQL terms, a nonzero value is equivalent to the declaration fragment WITH TIME ZONE.
  • The content must match: digit digit ':' digit digit ':' digit digit ('.' digit+)? (('+'|'-') digit digit ('.' digit digit)? )?
  • The first two digits represent the hour of the day; only values between 00 and 23 inclusive are allowed.
  • The second two digits represent the minute within the hour; only values between 00 and 59 inclusive are allowed.
  • The third two digits, optionally with a trailing decimal fraction, represent the seconds within the minute. For the two digits themselves, only values between 00 and 59 inclusive are allowed. The trailing decimal point and digits, if provided, represent a portion of a second as a decimal fraction.
  • The trailing timezone value may be provided only if XML-SQLSIZE has a nonzero value.
  • The first character of the timezone indicator must be a + or - sign, signaling the direction of displacement from Co-ordinated Universal Time.
  • The first two digits of the timezone value represent the hours of displacement; only values between 00 and 13 inclusive are allowed.
  • The two digits following the decimal point in the timezone value represent minutes of displacement; only values between 00 and 59 inclusive are allowed.
  • If XML-SQLMIN is provided, it must represent a time as described immediately above. The content, interpreted as a time of day, must be later than or equal to that given in XML-SQLMIN.
  • If XML-SQLMAX is provided, it must represent a time as described immediately above. The content, interpreted as a time of day, must be earlier than or equal to that given in XML-SQLMAX.
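The lexical rules for TIME can be sketched in Python as below. The names TIME_PATTERN and check_time are hypothetical. The sketch covers only the format and the field limits; it deliberately leaves out XML-SQLMIN and XML-SQLMAX, since comparing times that carry different timezone displacements would need a decision this note does not spell out.

```python
import re

# digit digit ':' digit digit ':' digit digit ('.' digit+)?
#   (('+'|'-') digit digit ('.' digit digit)? )?
TIME_PATTERN = re.compile(
    r"(?P<hour>[0-9]{2}):(?P<minute>[0-9]{2}):(?P<second>[0-9]{2})"
    r"(?:\.[0-9]+)?"
    r"(?P<zone>[+-](?P<zhour>[0-9]{2})(?:\.(?P<zminute>[0-9]{2}))?)?"
)


def check_time(content: str, size: int = 0) -> bool:
    """Check content against XML-SQLTYPE="TIME".

    size is the XML-SQLSIZE value: nonzero means the timezone value
    must be present, zero means it must be absent.
    """
    m = TIME_PATTERN.fullmatch(content)
    if m is None:
        return False
    if int(m.group("hour")) > 23:
        return False
    if int(m.group("minute")) > 59 or int(m.group("second")) > 59:
        return False

    has_zone = m.group("zone") is not None
    if has_zone != (size != 0):
        return False
    if has_zone:
        if int(m.group("zhour")) > 13:
            return False
        # Minutes of displacement are optional within the timezone value.
        if m.group("zminute") is not None and int(m.group("zminute")) > 59:
            return False
    return True
```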

4.10 XML-SQLTYPE="TIMESTAMP"

Describes a timestamp, including both date and time.

  • The interpretation of XML-SQLSIZE is identical to that for XML-SQLTYPE="TIME".
  • The content must be a string which would be valid as content of an element with XML-SQLTYPE="DATE", followed by a space, followed by a string which would be valid as content of an element with XML-SQLTYPE="TIME".
  • If XML-SQLMIN is provided, it must represent a timestamp as described immediately above. The content, interpreted as a timestamp, must be later than or equal to that given in XML-SQLMIN.
  • If XML-SQLMAX is provided, it must represent a timestamp as described immediately above. The content, interpreted as a timestamp, must be earlier than or equal to that given in XML-SQLMAX.
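Since a timestamp is defined as a composition of the DATE and TIME forms, a sketch in Python only needs to split the content and delegate. The function name check_timestamp is hypothetical; the is_date and is_time parameters stand for whatever DATE and TIME validators an implementation supplies. The sketch assumes "a space" means exactly one U+0020 character.

```python
from typing import Callable


def check_timestamp(content: str,
                    is_date: Callable[[str], bool],
                    is_time: Callable[[str], bool]) -> bool:
    """Check the lexical form of XML-SQLTYPE="TIMESTAMP" content.

    A valid date never contains a space, so the first space in the
    content must be the separator between the two parts.
    """
    date_part, separator, time_part = content.partition(" ")
    if separator != " ":
        return False    # no space at all
    return is_date(date_part) and is_time(time_part)
```

Any extra space ends up at the start of the time part, where the TIME validator is expected to reject it.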

5. Examples

For a bank loan: balance, interest rate, and maturity date:

<!ELEMENT BALANCE  (#PCDATA) >
<!ATTLIST BALANCE  XML-SQLTYPE CDATA #FIXED "DECIMAL"
                   XML-SQLSIZE CDATA #FIXED "2" 
                   XML-SQLMIN  CDATA #FIXED "0.00" >
<!ELEMENT INTEREST (#PCDATA)>
<!ATTLIST INTEREST XML-SQLTYPE CDATA #FIXED "FLOAT" 
                   XML-SQLMIN  CDATA #FIXED "0.0" >
<!ELEMENT MATURITY (#PCDATA)>
<!ATTLIST MATURITY XML-SQLTYPE CDATA #FIXED "DATE" >

For an airline departure: passenger name, seat number, and departure time:

<!ELEMENT LAST-NAME (#PCDATA)>
<!ATTLIST LAST-NAME XML-SQLTYPE CDATA #FIXED "VARCHAR"
                    XML-SQLSIZE CDATA #FIXED "20" >
<!ELEMENT FIRST-INITIAL (#PCDATA)>
<!ATTLIST FIRST-INITIAL XML-SQLTYPE CDATA #FIXED "CHAR"
                        XML-SQLSIZE CDATA #FIXED "1" >
<!ELEMENT SEAT-ROW (#PCDATA)>
<!ATTLIST SEAT-ROW XML-SQLTYPE CDATA #FIXED "INTEGER"
                   XML-SQLMIN  CDATA #FIXED "1"
                   XML-SQLMAX  CDATA #FIXED "36" >
<!ELEMENT SEAT-LETTER (#PCDATA)>
<!ATTLIST SEAT-LETTER XML-SQLTYPE CDATA #FIXED "CHAR"
                      XML-SQLSIZE CDATA #FIXED "1" 
                      XML-SQLMIN  CDATA #FIXED "A"
                      XML-SQLMAX  CDATA #FIXED "F" >
<!ELEMENT DEPARTURE (#PCDATA)>
<!ATTLIST DEPARTURE XML-SQLTYPE CDATA #FIXED "TIMESTAMP" >
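To show how an application might see these declarations, the following Python sketch parses a loan document whose internal DTD subset repeats the bank loan declarations above. The LOAN wrapper element, the instance values, and the function name typed_fields are all hypothetical. The sketch relies on the parser applying #FIXED attribute defaults from the internal subset, which the expat-based parser behind xml.etree.ElementTree is assumed to do.

```python
import xml.etree.ElementTree as ET

# Hypothetical instance; the declarations are copied from the example above.
LOAN_DOCUMENT = """<!DOCTYPE LOAN [
<!ELEMENT LOAN     (BALANCE, INTEREST, MATURITY)>
<!ELEMENT BALANCE  (#PCDATA)>
<!ATTLIST BALANCE  XML-SQLTYPE CDATA #FIXED "DECIMAL"
                   XML-SQLSIZE CDATA #FIXED "2"
                   XML-SQLMIN  CDATA #FIXED "0.00">
<!ELEMENT INTEREST (#PCDATA)>
<!ATTLIST INTEREST XML-SQLTYPE CDATA #FIXED "FLOAT"
                   XML-SQLMIN  CDATA #FIXED "0.0">
<!ELEMENT MATURITY (#PCDATA)>
<!ATTLIST MATURITY XML-SQLTYPE CDATA #FIXED "DATE">
]>
<LOAN>
  <BALANCE>1250.00</BALANCE>
  <INTEREST>7.25</INTEREST>
  <MATURITY>2002/05/03</MATURITY>
</LOAN>"""


def typed_fields(source: str) -> dict:
    """Map each child element name to (declared type, content).

    The declared type is None when no XML-SQLTYPE attribute is present,
    meaning no assertion is made about that element's data type.
    """
    root = ET.fromstring(source)
    return {child.tag: (child.attrib.get("XML-SQLTYPE"), child.text)
            for child in root}
```

Because the attributes are #FIXED in the DTD, the instance itself carries no type markup; the application receives the type of each element along with its content.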