XML Parser: Usage Overview

TXmlParser is a free native XML Parser for Borland Delphi. This document gives an overview of how the parser works.

Contents

Download
Licence
Scanning an XML Document
Loading the XML Document
Character Encodings
Buffer Handling
Whitespace Handling
Non-XML-Conforming
Version Numbers, Change History


Downloading TXmlParser

The official site to get this parser is http://www.destructor.de


Licence

The use and distribution of this source code are governed by the "Destructor.de Source code Licence" (DSL), which comes with the parser files.

The DSL says in short: This source code is free. You use it entirely at your own risk.


Scanning an XML document

TXmlParser is designed to scan an XML document rather than load it into an in-memory Object Model. You always start scanning at the beginning of the XML document. You can stop scanning at any point, and you can scan the same document as often as you like.

TXmlParser's main task is to break the XML document down into its parts (Start Tags, End Tags, Text Content between them, CDATA sections, Empty-Element Tags, the DTD, Processing Instructions (PIs) and Comments). As you scan through the document, TXmlParser tells you what it has found.

In short, the process looks like this:

VAR
  Parser : TXmlParser;
BEGIN
  Parser := TXmlParser.Create;            // Create the parser instance
  TRY
    Parser.Normalize := TRUE;             // Normalize content (pack and trim off whitespace)
    Parser.LoadFromFile (XmlFilename);    // Load the XML file
    Parser.StartScan;                     // Prepare scanning through the document
    WHILE Parser.Scan DO                  // Scanning is done with this loop
      CASE Parser.CurPartType OF          // Here the parser tells you what it has found
        ptStartTag,
        ptEmptyTag : ;   // Process Parser.CurName and Parser.CurAttr (see below) here
        ptContent,
        ptCData    : ;   // Process Parser.CurContent here
        ptEndTag   : ;   // Process the End Tag here (Parser.CurName)
        ptPI       : ;   // Process the PI here (Parser.CurName is the target, Parser.CurContent the content)
        END;
  FINALLY
    Parser.Free;                          // Release the parser even if an exception occurs
    END;
END;

In Start Tags (ptStartTag) and Empty-Element Tags (ptEmptyTag) you can access the attributes through CurAttr.

You can access the attributes by name:

Value := Parser.CurAttr.Value ('name');   // where 'name' is the attribute's name

Or you can scan through the whole list of attributes:

FOR i := 0 TO Parser.CurAttr.Count-1 DO BEGIN
  Name  := Parser.CurAttr.Name  (i);
  Value := Parser.CurAttr.Value (i);
    // or
  Value := Parser.CurAttr.Value (Name);   // Same result, but slower
  END;

Loading the XML document

An XML Document must be loaded completely into RAM before TXmlParser can scan it, so the maximum document size is limited by the available RAM. However, as TXmlParser does not build an Object Model, the buffer needs only as much RAM as the document itself.

You can load the XML document from a file with the "LoadFromFile" method. Performing HTTP or FTP accesses is beyond the scope of TXmlParser. If you want your application to handle such requests (URLs), you can load the XML via HTTP, FTP or any other protocol yourself and hand over the data buffer using the "LoadFromBuffer" or "SetBuffer" method.

"LoadFromBuffer" loads the internal buffer of TXmlParser with the given null-terminated string, thereby creating a copy of that buffer.

"SetBuffer" just takes the pointer to another buffer, which means that the given buffer pointer must be valid while the document is accessed via TXmlParser.

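As a rough sketch of the two methods side by side: the example below assumes that both methods accept a PChar to a null-terminated buffer, that the unit is named LibXmlParser (as the history note in this document suggests), and that STRING and PChar in your Delphi version match the parser's buffer type. The sample XML string is made up for illustration.

```pascal
PROGRAM BufferDemo;
{$APPTYPE CONSOLE}
USES
  LibXmlParser;                    // Assumed unit name, see the history note

VAR
  Parser : TXmlParser;
  Xml    : STRING;
BEGIN
  // Stand-in for data the application has fetched itself, e.g. via HTTP
  Xml := '<greeting lang="en">Hello</greeting>';
  Parser := TXmlParser.Create;
  TRY
    // Variant 1: the parser makes its own copy of the data,
    // so Xml may be changed or released afterwards.
    Parser.LoadFromBuffer (PChar (Xml));      // Assumption: parameter is a PChar

    // Variant 2: the parser only keeps the pointer,
    // so Xml must stay valid and unchanged while scanning.
    // Parser.SetBuffer (PChar (Xml));        // Assumption: parameter is a PChar

    Parser.StartScan;
    WHILE Parser.Scan DO
      IF Parser.CurPartType = ptContent THEN
        WriteLn (Parser.CurContent);
  FINALLY
    Parser.Free;
    END;
END.
```

"LoadFromBuffer" is the safer default. "SetBuffer" avoids the copy, but it ties the lifetime of the application's buffer to the scan.
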

Character Encodings

TXmlParser "understands" the following encodings

Multi-byte characters of any flavor and UTF-16 are not supported. Sorry.

Every string that is passed to the application first goes through the virtual method "TranslateEncoding", which translates the string from the current encoding (stored in "CurEncoding") into the encoding the application wishes to receive.

The "TranslateEncoding" method that is built into TXmlParser assumes that the application wants to receive Windows ANSI (about the same as ISO-8859-1) and is able to convert UTF-8 and ISO-8859-1 encodings.

For other source and target encodings, you will have to override "TranslateEncoding" yourself.
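
One hedged illustration of what such a translation might involve: the function below maps ISO-8859-15 (Latin-9) text to Windows-1252, which is assumed here to be the "Windows ANSI" code page in use. Latin9ToWinAnsi is a hypothetical helper and not part of TXmlParser. An overridden "TranslateEncoding" could call it when "CurEncoding" names that encoding, but the exact method signature is not shown in this document and is therefore left out.

```pascal
PROGRAM Latin9Demo;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

// Hypothetical helper: ISO-8859-15 differs from ISO-8859-1 in eight
// positions. Each of them is mapped to its Windows-1252 code here;
// all other characters are passed through unchanged.
FUNCTION Latin9ToWinAnsi (CONST S : AnsiString) : AnsiString;
VAR
  i : INTEGER;
BEGIN
  Result := S;
  FOR i := 1 TO Length (Result) DO
    CASE Ord (Result [i]) OF
      $A4 : Result [i] := AnsiChar ($80);   // Euro sign
      $A6 : Result [i] := AnsiChar ($8A);   // S with caron
      $A8 : Result [i] := AnsiChar ($9A);   // s with caron
      $B4 : Result [i] := AnsiChar ($8E);   // Z with caron
      $B8 : Result [i] := AnsiChar ($9E);   // z with caron
      $BC : Result [i] := AnsiChar ($8C);   // OE ligature
      $BD : Result [i] := AnsiChar ($9C);   // oe ligature
      $BE : Result [i] := AnsiChar ($9F);   // Y with diaeresis
      END;
END;

VAR
  Price, Translated : AnsiString;
BEGIN
  Price      := AnsiString ('5 ') + AnsiChar ($A4);   // "5 " plus the Latin-9 Euro sign
  Translated := Latin9ToWinAnsi (Price);
  WriteLn (Ord (Translated [3]));                     // Code of the translated Euro sign
END.
```
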


Buffer Handling


Whitespace Handling, Normalization

The TXmlParser property "Normalize" determines how Whitespace is returned in Text Content:

While Normalize is true, all leading and trailing whitespace characters are trimmed off, all Whitespace is converted to Space #x20 characters and contiguous Whitespace characters are compressed to one.
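
The rule can be sketched as a stand-alone function. This is an illustrative re-implementation, not the parser's own code, and it assumes that "Whitespace" means the four XML whitespace characters (#x20, #x9, #xD, #xA). NormalizeSpace is a hypothetical name.

```pascal
PROGRAM NormalizeDemo;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

// Hypothetical helper: trims leading and trailing whitespace and
// replaces every run of whitespace in between by one space character.
FUNCTION NormalizeSpace (CONST S : STRING) : STRING;
VAR
  i, Len    : INTEGER;
  PendingWs : BOOLEAN;
BEGIN
  SetLength (Result, Length (S));        // The result can only get shorter
  Len       := 0;
  PendingWs := FALSE;
  FOR i := 1 TO Length (S) DO
    IF (S [i] = #32) OR (S [i] = #9) OR (S [i] = #10) OR (S [i] = #13) THEN
      PendingWs := Len > 0               // Leading whitespace is dropped
    ELSE BEGIN
      IF PendingWs THEN BEGIN            // Emit one space for the whole run
        Inc (Len);
        Result [Len] := #32;
        PendingWs := FALSE;
        END;
      Inc (Len);
      Result [Len] := S [i];
      END;
  SetLength (Result, Len);               // Trailing whitespace was never emitted
END;

BEGIN
  WriteLn ('[' + NormalizeSpace ('  two'#13#10'  words  ') + ']');   // prints [two words]
END.
```
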

If the "Scan" method reports a ptContent part, the application can get the original text with all whitespace characters by extracting the characters from "CurStart" to "CurFinal".
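
A minimal sketch of that extraction, assuming that "CurStart" and "CurFinal" are PChar pointers into the document buffer and that "CurFinal" points at the last character of the part rather than one past it. RawText is a hypothetical helper, and the constant Doc stands in for the parser's buffer so that the example runs without the parser.

```pascal
PROGRAM RawTextDemo;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

// Hypothetical helper: copies the characters from Start to Final.
// Assumption: Final is inclusive, i.e. it points AT the last character.
FUNCTION RawText (Start, Final : PChar) : STRING;
BEGIN
  IF (Start = NIL) OR (Final = NIL) OR (Final < Start) THEN
    Result := ''
  ELSE
    SetString (Result, Start, Final - Start + 1);
END;

CONST
  // Stand-in for the document buffer; the text content occupies
  // the characters at offsets 3 to 16.
  Doc : PChar = '<a>  two'#13#10'lines  </a>';

BEGIN
  // With the parser this would read:
  //   Original := RawText (Parser.CurStart, Parser.CurFinal);
  WriteLn (Length (RawText (Doc + 3, Doc + 16)));
END.
```

Because the text is copied straight from the buffer, it keeps its original line breaks and spacing regardless of the "Normalize" setting.
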

If the application detects an xml:space attribute, it can set "Normalize" accordingly or use CurStart/CurFinal.

Please note that TXmlParser does not normalize Line Breaks to single LineFeed characters as the XmlSpec requires (XmlSpec 2.11). The xml:space attribute is not handled by TXmlParser either. Both are left to the application.
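
If the application needs spec-conforming line breaks, it can normalize them itself. A minimal sketch, with NormalizeLineBreaks as a hypothetical helper that is not part of TXmlParser: it turns each CR+LF pair and each lone CR into a single LF, as XmlSpec 2.11 describes.

```pascal
PROGRAM LineBreakDemo;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

// Hypothetical helper: converts CR+LF pairs and lone CR characters
// to a single LF character. Existing LF characters are kept.
FUNCTION NormalizeLineBreaks (CONST S : STRING) : STRING;
VAR
  i, Len : INTEGER;
BEGIN
  SetLength (Result, Length (S));        // The result can only get shorter
  Len := 0;
  i   := 1;
  WHILE i <= Length (S) DO BEGIN
    Inc (Len);
    IF S [i] = #13 THEN BEGIN
      Result [Len] := #10;               // CR becomes LF
      IF (i < Length (S)) AND (S [i+1] = #10) THEN
        Inc (i);                         // Skip the LF of a CR+LF pair
      END
    ELSE
      Result [Len] := S [i];
    Inc (i);
    END;
  SetLength (Result, Len);
END;

BEGIN
  WriteLn (Length (NormalizeLineBreaks ('one'#13#10'two'#13'three')));
END.
```
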


Non-XML-Conforming

TXmlParser does not conform 100 % exactly to the XmlSpec:

This list may be incomplete, so it may grow if I get to know any other points. As work on the parser proceeds, this list may also shrink.


Version Numbers, Change History

There is a history at the beginning of the LibXmlParser unit's source code. This history is the only valid history. All other histories may be erroneous or out of date.