XML Parser: Usage Overview

TXmlParser is a free native XML Parser for Borland Delphi. This document gives an overview of how the parser works.

↓ Download
↓ Licence
↓ Scanning an XML Document
↓ Loading the XML Document
↓ Character Encodings
↓ Differences between Version 1 and 2
↓ Buffer Handling
↓ Whitespace Handling
↓ Non-XML-Conforming
↓ Version Numbers, Change History

Downloading TXmlParser

The official site to get this parser is http://www.destructor.de

Licence

Usage and Distribution of this Source Code are governed by the "Destructor.de Source code Licence" (DSL) which comes with the parser files.

The DSL says in short: This source code is free. You use it completely at your own risk.

Scanning an XML document

TXmlParser is designed to scan an XML document rather than load it into an Object Model in memory. You always start scanning at the beginning of the XML document. You can cancel scanning at any point and you can scan the same document as often as you want.

TXmlParser's main task is to break the XML document up into its parts (Start Tags, End Tags, Text Content between them, CDATA sections, Empty-Element Tags, the DTD, Processing Instructions (PIs) and Comments). As you scan through the document, TXmlParser tells you what it has found.

In short, the process looks like this:

// Create an Instance of TXmlParser

VAR
  Parser : TXmlParser;
BEGIN
  Parser := TXmlParser.Create;          // Create Parser Instance

// Tell the parser if you want content to be normalized
// (i.e. pack and trim off Whitespace)

  Parser.Normalize := TRUE;

// Load the XML Document

  Parser.LoadFromFile (XmlFilename);    // Load XML file

// Initialize for Scanning through the Document

  Parser.StartScan;                     // Prepare Scanning through the document

// Scanning is done with this loop

  WHILE Parser.Scan DO 
    CASE Parser.CurPartType OF          // Here the parser tells you what it has found
      ptStartTag,
      ptEmptyTag : ; // Process Parser.CurName and Parser.CurAttr (see below) fields here
      ptContent,
      ptCData    : ; // Process Parser.CurContent field here
      ptEndTag   : ; // Process End-Tag here (Parser.CurName)
      ptPI       : ; // Process PI here (Parser.CurName is the target, Parser.CurContent the content)
      END;
  Parser.Free;
END;

In Start Tags (ptStartTag) and Empty-Element Tags (ptEmptyTag) you can access the attributes through CurAttr.

You can access the attributes by name:

Value := Parser.CurAttr.Value ('name');   // where 'name' is the attribute's name

Or you can scan through the whole list of attributes:

FOR i := 0 TO Parser.CurAttr.Count-1 DO BEGIN
  Name  := Parser.CurAttr.Name  (i);
  Value := Parser.CurAttr.Value (i);
    // or
  Value := Parser.CurAttr.Value (Name);   // Same result, but slower
  END;

Loading the XML document

An XML document must be loaded completely into RAM before TXmlParser can scan it, so the maximum document size is limited by the available RAM. However, as TXmlParser does not build an Object Model, the RAM needed is essentially just the size of the document itself.

You can load the XML document from a file with the "LoadFromFile" method. It is beyond the scope of TXmlParser to perform HTTP or FTP accesses. If you want your application to handle such requests (URLs), you can load the XML via HTTP or FTP or whatever protocol and hand over the data buffer using the "LoadFromBuffer" or "SetBuffer" method.

"LoadFromBuffer" loads the internal buffer of TXmlParser with the given null-terminated string, thereby creating a copy of that buffer.

"SetBuffer" just takes the pointer to another buffer, which means that the given buffer pointer must be valid while the document is accessed via TXmlParser.
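The difference between the two methods can be sketched as follows. This is a minimal example, assuming that both methods take a PAnsiChar pointing to a null-terminated 8-Bit buffer; FirstContent is a hypothetical helper written for this example and not part of TXmlParser.

```pascal
PROGRAM BufferDemo;
{$APPTYPE CONSOLE}

USES
  LibXmlParserU;                        // unit name as of Version 2

// Returns the text of the first ptContent part, or '' if there is none.
// FirstContent is a hypothetical helper, not part of TXmlParser.
FUNCTION FirstContent (Parser : TXmlParser) : STRING;
BEGIN
  Result := '';
  Parser.StartScan;
  WHILE Parser.Scan DO
    IF Parser.CurPartType = ptContent THEN BEGIN
      Result := Parser.CurContent;
      Exit;
      END;
END;

VAR
  Parser : TXmlParser;
  Doc    : AnsiString;                  // 8-Bit data, null-terminated when cast to PAnsiChar
BEGIN
  Doc    := '<?xml version="1.0" encoding="UTF-8"?><greeting>Hello</greeting>';
  Parser := TXmlParser.Create;
  TRY
    // LoadFromBuffer copies the data, so the parser no longer depends on Doc
    Parser.LoadFromBuffer (PAnsiChar (Doc));
    WriteLn (FirstContent (Parser));

    // SetBuffer only stores the pointer, so Doc must stay alive and unchanged
    // for as long as the parser works on it
    Parser.SetBuffer (PAnsiChar (Doc));
    WriteLn (FirstContent (Parser));
  FINALLY
    Parser.Free;
  END;
END.
```

"SetBuffer" avoids the copy, which can matter for large documents, at the price of having to manage the lifetime of the buffer yourself.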

Character Encodings

TXmlParser "understands" all common 8-Bit encodings like

UTF-8
ISO 8859-1, ISO 8859-2, ISO 8859-3, ISO 8859-4, ISO 8859-5, ISO 8859-6, ISO 8859-8, ISO 8859-9, ISO 8859-15
Windows 1250, Windows 1251, Windows 1252, Windows 1253, Windows 1254, Windows 1255, Windows 1256, Windows 1257, Windows 1258
US-ASCII, IBM850

No flavor of 16-Bit encoding, such as UTF-16, is supported. Sorry.

The parser expects and delivers all strings in the usual Delphi UTF-16 UnicodeString format.

Differences between Version 1 and 2

Version 2 introduces Unicode support. It is source compatible with Version 1 apart from a few changes you should be aware of:

The Unit name is now LibXmlParserU
The "TranslateEncoding" and "TranslateCharacter" methods are completely gone. All character set conversions are done automatically. So if you have overridden them, you can delete the overrides.
All "string" fields and parameters now expect and deliver the usual Unicode Delphi string, which effectively is the UTF-16 UnicodeString type. All conversions between the 8-Bit XML and the UTF-16 strings are done automatically.
The parser will only read files that are based on 8-Bit character sets. If you want to read UTF-16 files, you must convert them first (however, I assume the number of UTF-16 XML files out there is very low ...).
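The conversion mentioned in the last point could be done with the Delphi RTL alone. The following is a minimal sketch using TEncoding from System.SysUtils; Utf16ToUtf8 is a hypothetical helper and not part of TXmlParser, and input without a byte order mark is assumed to be little-endian UTF-16.

```pascal
PROGRAM Utf16ToUtf8Demo;
{$APPTYPE CONSOLE}

USES
  System.SysUtils;                      // TEncoding, TBytes

// Converts UTF-16 encoded bytes (with or without byte order mark) to UTF-8.
// Utf16ToUtf8 is a hypothetical helper, not part of TXmlParser.
FUNCTION Utf16ToUtf8 (CONST Raw : TBytes) : TBytes;
VAR
  Encoding : TEncoding;
  BomLen   : INTEGER;
BEGIN
  Encoding := NIL;                      // NIL = let GetBufferEncoding look for a BOM
  BomLen   := TEncoding.GetBufferEncoding (Raw, Encoding, TEncoding.Unicode);
  // Decode everything after the BOM, then encode it again as UTF-8 (no BOM added)
  Result   := TEncoding.UTF8.GetBytes (
                Encoding.GetString (Raw, BomLen, Length (Raw) - BomLen));
END;

VAR
  Utf8 : TBytes;
BEGIN
  // 'A' followed by U+00E4 as UTF-16 little endian, preceded by the FF FE BOM
  Utf8 := Utf16ToUtf8 (TBytes.Create ($FF, $FE, $41, $00, $E4, $00));
  WriteLn (Length (Utf8), ' bytes of UTF-8');
END.
```

Before the result is handed to "LoadFromBuffer", a terminating zero byte has to be appended. Note also that an encoding declaration in the document that still says UTF-16 no longer matches the converted data; this sketch does not touch the declaration.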

Buffer Handling

The document must be loaded completely into a piece of RAM
All character positions are referenced by PAnsiChar pointers
The TXmlParser instance can either "own" the buffer itself (then, FBufferSize is > 0) or reference the buffer of another instance or object (then, FBufferSize is 0 and FBuffer is not NIL)
The Property DocBuffer passes back a pointer to the first byte of the document. If there is no document stored (FBuffer is NIL), DocBuffer returns a pointer to a NULL character.

Whitespace Handling, Normalization

The TXmlParser property "Normalize" determines how Whitespace is returned in Text Content:

When Normalize is TRUE, all leading and trailing whitespace characters are trimmed off, all whitespace is converted to Space (#x20) characters, and contiguous whitespace characters are compressed to one.
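To make the rule concrete, here is a small stand-alone function that applies the same three steps to a Delphi string. It only illustrates the behaviour described above; NormalizeWhitespace is a hypothetical helper and not the code TXmlParser uses internally.

```pascal
PROGRAM NormalizeDemo;
{$APPTYPE CONSOLE}

USES
  System.SysUtils;                      // CharInSet

// Illustrates the rule only; NormalizeWhitespace is a hypothetical helper
// and not the code TXmlParser uses internally.
FUNCTION NormalizeWhitespace (CONST S : STRING) : STRING;
VAR
  Ch           : CHAR;
  PendingSpace : BOOLEAN;
BEGIN
  Result       := '';
  PendingSpace := FALSE;
  FOR Ch IN S DO
    IF CharInSet (Ch, [#9, #10, #13, #32]) THEN
      // Remember the gap, but only once text has been emitted (trims the start)
      PendingSpace := Result <> ''
    ELSE BEGIN
      // A run of whitespace between two pieces of text becomes one space
      IF PendingSpace THEN
        Result := Result + ' ';
      Result       := Result + Ch;
      PendingSpace := FALSE;
      END;
END;

BEGIN
  WriteLn ('[', NormalizeWhitespace ('  Hello,'#13#10#9'world  '), ']');
END.
```

Trailing whitespace is dropped because a pending space is only written when more text follows.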

If the "Scan" method reports a ptContent part, the application can get the original text with all whitespace characters by extracting the characters from "CurStart" to "CurFinal".
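Extracting the original text might look like the sketch below. It assumes that CurStart and CurFinal are PAnsiChar pointers to the first and the last byte of the current part (both inclusive); RawPart is a hypothetical helper and not part of TXmlParser. The result holds the bytes as they appear in the document, in the document's own 8-Bit encoding.

```pascal
PROGRAM RawContentDemo;
{$APPTYPE CONSOLE}

USES
  LibXmlParserU;                        // unit name as of Version 2

// Copies the bytes of the current part exactly as they appear in the document.
// RawPart is a hypothetical helper. Assumption: CurStart and CurFinal point
// to the first and the last byte of the part (inclusive).
FUNCTION RawPart (Parser : TXmlParser) : RawByteString;
BEGIN
  SetString (Result, Parser.CurStart, Parser.CurFinal - Parser.CurStart + 1);
END;

VAR
  Parser : TXmlParser;
  Doc    : AnsiString;
BEGIN
  Doc    := '<p>  two   words  </p>';
  Parser := TXmlParser.Create;
  TRY
    Parser.Normalize := TRUE;
    Parser.LoadFromBuffer (PAnsiChar (Doc));
    Parser.StartScan;
    WHILE Parser.Scan DO
      IF Parser.CurPartType = ptContent THEN BEGIN
        WriteLn ('[', Parser.CurContent, ']');   // normalized text
        WriteLn ('[', RawPart (Parser), ']');    // original bytes, whitespace intact
        END;
  FINALLY
    Parser.Free;
  END;
END.
```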

If the application detects an xml:space attribute, it can set "Normalize" accordingly or use CurStart/CurFinal.

Please note that TXmlParser does not normalize Line Breaks to single LineFeed characters as the XmlSpec requires (XmlSpec 2.11). The xml:space attribute is not handled by TXmlParser either. Both are left to the application.

Non-XML-Conforming

TXmlParser does not conform fully to the XmlSpec:

UTF-16 is not supported (XmlSpec 2.2)
(Workaround: Convert UTF-16 to UTF-8 and hand the buffer over to TXmlParser)
As the parser only works with single-byte strings, Unicode characters > 255 cannot be handled.
Line breaks are not normalized to single Linefeed #x0A characters (XmlSpec 2.11)
(Workaround: The Application can access the text contents on its own [CurStart, CurFinal], thereby applying any normalization it wishes)
See also the code parts marked with three consecutive exclamation marks. These are parts which are not finished in the current code release.

This list may be incomplete, so it may grow as I learn of further points. As work on the parser proceeds, this list may also shrink.
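For the line break workaround in the list above, the application can apply the rule from XmlSpec 2.11 itself: every CR LF pair and every remaining lone CR becomes a single LF. A minimal sketch, working on an already decoded Delphi string; NormalizeLineBreaks is a hypothetical helper and not part of TXmlParser.

```pascal
PROGRAM LineBreakDemo;
{$APPTYPE CONSOLE}

USES
  System.SysUtils;                      // StringReplace

// NormalizeLineBreaks is a hypothetical helper, not part of TXmlParser.
FUNCTION NormalizeLineBreaks (CONST S : STRING) : STRING;
BEGIN
  Result := StringReplace (S,      #13#10, #10, [rfReplaceAll]);   // CR LF   -> LF
  Result := StringReplace (Result, #13,    #10, [rfReplaceAll]);   // lone CR -> LF
END;

BEGIN
  WriteLn (Length (NormalizeLineBreaks ('a'#13#10'b'#13'c'#10'd')), ' characters');
END.
```

The order of the two replacements matters: handling lone CR characters first would turn each CR LF pair into two line feeds.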

Version Numbers, Change History

There is a history at the beginning of the LibXmlParser unit's source code. This history is the only valid history. All other histories may be erroneous or out of date.