XML stands for Extensible Markup Language. It focuses on describing data: what the data actually is. HTML is also a markup language, but it deals with how data looks and is displayed. XML tags are not predefined like HTML tags; you must invent your own. XML doesn’t do anything by itself; rather, it helps to structure, store, and send information between different information systems in a simple plain-text format that each of them can read.
Understanding the Syntax Rules of XML
***Please note, we have added an * asterisk inside the < > so that the tags are not interpreted as markup and remain viewable.***
The first line of an XML document, called the XML declaration, is optional. It gives the version of XML currently being used (either 1.0 or 1.1, although 1.0 is the most common), as well as the character encoding.
<*?xml version="1.0" encoding="ISO-8859-1"?*>
The above example declares version 1.0 of XML and the ISO-8859-1 character encoding (one of many potential choices).
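As a rough illustration of what the encoding part of the declaration does, here is a minimal sketch using Python’s standard xml.etree.ElementTree module. The choice of Python and the sample rule element are assumptions made for illustration. The asterisks are left out because the parser needs real markup.

```python
import xml.etree.ElementTree as ET

# The byte 0xE9 is the letter "é" in ISO-8859-1. The declaration tells the
# parser which encoding to use when turning the bytes into text.
doc = b'<?xml version="1.0" encoding="ISO-8859-1"?><rule>caf\xe9</rule>'

root = ET.fromstring(doc)
decoded_text = root.text
print(decoded_text)
```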
The rest of an XML document is made up of elements, which may be nested inside one another. Each element is marked by one pair of tags, called a start tag and an end tag. The start tag is formed by putting a name in angle brackets. The end tag is formed in the same way as the start tag, using the same name, except this time there is a slash directly after the first angle bracket and before the name.
example start tag: <*rule*> example end tag: <*/rule*>
Everything in between the start and end tags is called the content.
<*rule*>Everything in between the start and end tags is called the content.<*/rule*>
Everything in between the <*rule*> and <*/rule*> start and end tags in the example above is considered the content. A full element has a start tag, content, and an end tag, just like the example.
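To see the three parts of an element from a program’s point of view, the following sketch parses the rule example with Python’s standard xml.etree.ElementTree module (Python is an assumption here; any XML parser would do). The variable names are illustrative only.

```python
import xml.etree.ElementTree as ET

element = ET.fromstring(
    "<rule>Everything in between the start and end tags is called the content.</rule>"
)

element_name = element.tag      # the name used in the start and end tags
element_content = element.text  # the text between the start and end tags

print(element_name)
print(element_content)
```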
Besides text content, an XML element may also include attributes. An attribute is a name and a value paired together, placed in the start tag directly after the element name.
<*term number="1" type="technical"*>Attribute<*/term*>
In the above example, the term element has 2 attributes: number="1" is an attribute, and so is type="technical". They are both included in the start tag right after the element name (term). In the number="1" attribute, the name number has the value 1. In the type="technical" attribute, the name type has the value technical. The complete XML element describes the text it contains: the term is labeled with the number 1, and its type is technical.
**Keep in mind that although 1 looks like a quantity and technical looks like a quality, in XML attribute values are simply text that labels the content; XML itself attaches no meaning to them.**
The values of attributes must be put in either single or double quotes. In the above example, the "1" and "technical" attribute values have been correctly placed in quotes. Each attribute name may only be used once in any given element. In the previous example, the attribute names number and type have each been used only once.
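The attribute rules can be checked with a short sketch, again assuming Python’s standard xml.etree.ElementTree module. It reads the attributes of the term example as name and value pairs, then tries a hypothetical element that repeats the number attribute, which the parser rejects.

```python
import xml.etree.ElementTree as ET

term = ET.fromstring('<term number="1" type="technical">Attribute</term>')
attributes = dict(term.attrib)  # attribute names mapped to their (string) values
print(attributes)

# Single quotes around a value are accepted as well.
single_quoted = ET.fromstring("<term number='1'>Attribute</term>")

# Using the same attribute name twice in one start tag is an error.
try:
    ET.fromstring('<term number="1" number="2">Attribute</term>')
    duplicate_accepted = True
except ET.ParseError as err:
    duplicate_accepted = False
    print("rejected:", err)
```

Note that the value of number comes back as the string "1", not as a number.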
Elements can include other elements inside of them.
<*termlist*>
<*term*>Element<*/term*>
<*term*>Attribute<*/term*>
<*term*>Name<*/term*>
<*term*>Value<*/term*>
<*/termlist*>
In this example, the element termlist contains four term elements. The element termlist is also known as the top-level root element, or document element. XML that does not have a single top-level root element is not well-formed.
Incorrect XML Example:
<*term*>Element<*/term*>
<*term*>Attribute<*/term*>
<*term*>Name<*/term*>
<*term*>Value<*/term*>
Without the top-level root element termlist, the term elements do not form a well-formed XML document.
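The difference between the two examples shows up as soon as a parser reads them. This is a minimal sketch, assuming Python’s standard xml.etree.ElementTree module; the variable names are illustrative.

```python
import xml.etree.ElementTree as ET

with_root = (
    "<termlist>"
    "<term>Element</term><term>Attribute</term>"
    "<term>Name</term><term>Value</term>"
    "</termlist>"
)
without_root = (
    "<term>Element</term><term>Attribute</term>"
    "<term>Name</term><term>Value</term>"
)

# With a root element, the document parses and its children can be listed.
root = ET.fromstring(with_root)
terms = [child.text for child in root]
print(root.tag, terms)

# Without a root element, the parser stops after the first term element.
try:
    ET.fromstring(without_root)
    rootless_parsed = True
except ET.ParseError as err:
    rootless_parsed = False
    print("not well-formed:", err)
```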
More Differences Between XML and HTML
XML differs from HTML in several subtle but crucial ways, so some tasks are better suited to XML than to HTML, and vice versa. For instance, because XML element names describe the data they contain, it is usually simpler to pull specific information out of an XML document than out of an HTML document, where the tags mostly describe presentation.
Some HTML elements, such as paragraphs, don’t have to have a closing tag, but every XML element does (the XML declaration is not considered an element, so this rule doesn’t apply to it).
example of correct HTML: <*p*>A correct HTML paragraph doesn’t have to have a closing tag
<*p*>New paragraphs can start without old paragraphs having a closing tag.
This is incorrect XML, however.
example of correct XML: <*p*>A correct XML paragraph has a closing tag<*/p*>
<*p*>If XML doesn’t have a closing tag, then it is wrongly constructed<*/p*>
HTML tag names aren’t case sensitive, but XML tag names are.
example of correct HTML: <*Rule*>The capitalization in ‘rule’ here is inconsistent, but fine for HTML<*/rule*>
This type of mixed capitalization works for HTML, but is considered incorrect XML.
example of correct XML: <*rule*>The capitalization in ‘rule’ here is the same in both the start and end tag<*/rule*>
HTML tags can be closed in a different order than they were opened (nested improperly) and browsers will still display the page, but XML tags must be nested properly, without overlapping: the most recently opened element has to be closed first.
example of correct HTML: <*b*><*i*>Go to the store, Jimmy!<*/b*><*/i*>
This is fine for HTML, but the <*i*> tags overlap with the <*b*> tags, so as XML, the above markup fails.
example of correct XML: <*b*><*i*>Go to the store, Jimmy!<*/i*><*/b*>
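The closing-tag, capitalization, and nesting rules above can all be tested the same way: hand the markup to an XML parser and see whether it is accepted. The sketch below assumes Python’s standard xml.etree.ElementTree module; is_well_formed is a hypothetical helper written for this illustration, and the sample strings are shortened versions of the examples above.

```python
import xml.etree.ElementTree as ET


def is_well_formed(markup: str) -> bool:
    """Return True if the parser accepts the markup as well-formed XML."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


samples = {
    "missing end tag": "<p>A paragraph without a closing tag",
    "matching end tag": "<p>A paragraph with a closing tag</p>",
    "mixed capitalization": "<Rule>Inconsistent capitalization</rule>",
    "same capitalization": "<rule>Consistent capitalization</rule>",
    "overlapping tags": "<b><i>Go to the store, Jimmy!</b></i>",
    "properly nested tags": "<b><i>Go to the store, Jimmy!</i></b>",
}

results = {label: is_well_formed(markup) for label, markup in samples.items()}
for label, ok in results.items():
    print(f"{label}: {'well-formed' if ok else 'not well-formed'}")
```

This only checks XML well-formedness. It says nothing about how a browser would treat the same markup as HTML.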
HTML collapses any runs of white space purposely included in a document when it is displayed, whereas an XML parser preserves all white space in an element’s content.
Original text: Don’t use HTML to do the following thing: preserve      space
HTML version: Don’t use HTML to do the following thing: preserve space
XML version: Don’t use HTML to do the following thing: preserve      space
In the above example, extra white space was intentionally included. The HTML version collapses it to a single space when displayed, while the XML version keeps it exactly as written.
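A short sketch makes the white space behavior concrete, assuming Python’s standard xml.etree.ElementTree module. The note element is a made-up example, and the collapsing step only imitates what a browser does when it displays HTML; it is not an HTML parser.

```python
import xml.etree.ElementTree as ET

# Six spaces are deliberately placed between the two words.
element = ET.fromstring("<note>preserve      space</note>")

xml_text = element.text                  # the parser keeps the spaces as written
html_like = " ".join(xml_text.split())   # imitation of browser white space collapsing

print(repr(xml_text))
print(repr(html_like))
```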
With these basic tenets of the syntax of XML and its differences from HTML under your belt, you should have a firm idea of how to create and use basic, well-formed XML.