TXmlParser is a free native XML parser for Borland Delphi. This document gives an overview of how the parser works.
↓ Download
↓ Licence
↓ Scanning an XML Document
↓ Loading the XML Document
↓ Character Encodings
↓ Differences between Version 1 and 2
↓ Buffer Handling
↓ Whitespace Handling
↓ Non XML conforming
↓ Version Numbers, Change History
The official site to get this parser is http://www.destructor.de
Usage and distribution of this source code are governed by the "Destructor.de Source code Licence" (DSL), which comes with the parser files.
The DSL says in short: This source code is free. You use it entirely at your own risk.
TXmlParser is designed to scan an XML document rather than load it into an Object Model in memory. You always start scanning at the beginning of the XML document. You can cancel scanning at any point, and you can scan the same document as often as you want.
TXmlParser's main task is to break the XML document up into its parts (Start Tags, End Tags, Text Content between them, CDATA sections, Empty-Element Tags, the DTD, Processing Instructions (PIs) and Comments). As you scan through the document, TXmlParser tells you what it has found.
In short, the process looks like this:
VAR
  Parser : TXmlParser;
BEGIN
  // Create an instance of TXmlParser
  Parser := TXmlParser.Create;

  // Tell the parser if you want content to be normalized
  // (i.e. pack and trim off whitespace)
  Parser.Normalize := TRUE;

  // Load the XML document
  Parser.LoadFromFile (XmlFilename);

  // Initialize for scanning through the document
  Parser.StartScan;

  // Scanning is done with this loop
  WHILE Parser.Scan DO
    CASE Parser.CurPartType OF   // Here the parser tells you what it has found
      ptStartTag,
      ptEmptyTag : ;   // Process Parser.CurName and Parser.CurAttr (see below) here
      ptContent,
      ptCData    : ;   // Process Parser.CurContent here
      ptEndTag   : ;   // Process End-Tag here (Parser.CurName)
      ptPI       : ;   // Process PI here (Parser.CurName is the target, Parser.CurContent)
    END;

  Parser.Free;
END;
In Start Tags (ptStartTag) and Empty-Element Tags (ptEmptyTag) you can access the attributes through CurAttr.
You can access the attributes by name:
Value := Parser.CurAttr.Value ('name'); // where 'name' is the attribute's name
Or you can scan through the whole list of attributes:
FOR i := 0 TO Parser.CurAttr.Count-1 DO BEGIN
  Name  := Parser.CurAttr.Name (i);
  Value := Parser.CurAttr.Value (i);
  // or
  Value := Parser.CurAttr.Value (Name);   // Same result, but slower
END;
An XML document must be loaded completely into RAM before TXmlParser can scan it, so the maximum document size is limited by the available RAM. However, as TXmlParser does not build an Object Model, the RAM needed is the same as the size of the document.
You can load the XML document from a file with the "LoadFromFile" method. It is beyond the scope of TXmlParser to perform HTTP or FTP accesses. If you want your application to handle such requests (URLs), you can load the XML via HTTP or FTP or whatever protocol and hand over the data buffer using the "LoadFromBuffer" or "SetBuffer" method.
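As a minimal sketch of handing over data that the application has fetched itself, the following procedure passes an in-memory string to the parser and prints the names of all Start Tags. It assumes that LoadFromBuffer accepts a PAnsiChar pointing to a null-terminated 8-bit buffer (please check the declaration in your copy of LibXmlParser). ScanFromMemory and XmlData are hypothetical names used only for illustration.

```pascal
uses
  LibXmlParser;

// Hypothetical helper: scans an XML document that is already in memory.
procedure ScanFromMemory (const XmlData : AnsiString);
var
  Parser : TXmlParser;
begin
  Parser := TXmlParser.Create;
  try
    // Assumption: the parameter type is PAnsiChar (null-terminated 8-bit data)
    Parser.LoadFromBuffer (PAnsiChar (XmlData));
    Parser.StartScan;
    while Parser.Scan do
      if Parser.CurPartType = ptStartTag then
        WriteLn (Parser.CurName);   // print the name of every Start Tag
  finally
    Parser.Free;                    // release the parser even if scanning raises
  end;
end;
```

How XmlData gets filled (HTTP, FTP, a database field) is entirely up to the application.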
"LoadFromBuffer" loads the internal buffer of TXmlParser with the given null-terminated string, thereby creating a copy of that buffer.
"SetBuffer" just takes the pointer to another buffer, which means that the given buffer pointer must be valid while the document is accessed via TXmlParser.
TXmlParser "understands" all common 8-Bit encodings like
16-Bit encodings such as UTF-16 are not supported in any flavor. Sorry.
The parser expects and delivers all strings in the usual Delphi UTF-16 UnicodeString format.
Version 2 introduces Unicode support. It is fully source compatible; there are only a few tweaks you should be aware of:
The TXmlParser property "Normalize" determines how Whitespace is returned in Text Content:
While Normalize is true, all leading and trailing whitespace characters are trimmed off, all Whitespace is converted to Space #x20 characters and contiguous Whitespace characters are compressed to one.
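The rule can be illustrated with a small stand-alone function. This is only a sketch of the behaviour described here, not the parser's actual code. NormalizeWhitespace is a hypothetical name, and the sketch assumes that "whitespace" means the four XML whitespace characters #x20, #x9, #xA and #xD.

```pascal
// Hypothetical illustration of the normalization rule, independent of TXmlParser.
function NormalizeWhitespace (const S : string) : string;
var
  i            : Integer;
  PendingSpace : Boolean;
begin
  Result := '';
  PendingSpace := False;
  for i := 1 to Length (S) do
    if (S [i] = #32) or (S [i] = #9) or (S [i] = #10) or (S [i] = #13) then
      // Remember the whitespace run, but only once some text has been emitted,
      // so that leading whitespace is dropped
      PendingSpace := Result <> ''
    else begin
      // Emit one single space for the whole run that preceded this character
      if PendingSpace then
        Result := Result + ' ';
      PendingSpace := False;
      Result := Result + S [i];
    end;
  // A run that is still pending here was trailing whitespace and is dropped
end;
```

For example, passing a string consisting of two spaces, "a", a space, a tab, a CR LF pair, a space, "b" and two more spaces yields "a b".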
If the "Scan" method reports a ptContent part, the application can get the original text with all whitespace characters by extracting the characters from "CurStart" to "CurFinal".
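A minimal sketch of that extraction follows. It assumes that CurStart and CurFinal are PAnsiChar pointers into the document buffer and that CurFinal points at the last character of the part (inclusive). RawCurrentPart is a hypothetical helper name, not part of the library.

```pascal
uses
  LibXmlParser;

// Hypothetical helper: returns the current part exactly as it appears in the buffer.
function RawCurrentPart (Parser : TXmlParser) : AnsiString;
begin
  // Assumption: both pointers are PAnsiChar and CurFinal is inclusive, hence the +1
  SetString (Result, Parser.CurStart, Parser.CurFinal - Parser.CurStart + 1);
end;
```

The result holds the bytes as they are stored in the buffer, so under these assumptions it is still in the document's own encoding and any entity references are still unresolved.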
If the application detects an xml:space attribute, it can set "Normalize" accordingly or use CurStart/CurFinal.
Please note that TXmlParser does not normalize Line Breaks to single LineFeed characters as the XmlSpec requires (XmlSpec 2.11). The xml:space attribute is not handled by TXmlParser either. Both are left to the application.
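An application that needs the line break handling of XmlSpec 2.11 can apply it to the text it gets from the parser. The following stand-alone function is a sketch of that rule (every CR LF pair and every CR not followed by LF becomes a single LF). NormalizeLineBreaks is a hypothetical name and not part of TXmlParser.

```pascal
// Hypothetical helper: translates CR LF and lone CR to a single LF (XmlSpec 2.11).
function NormalizeLineBreaks (const S : string) : string;
var
  i : Integer;
begin
  Result := '';
  i := 1;
  while i <= Length (S) do begin
    if S [i] = #13 then begin
      Result := Result + #10;
      // Skip the LF of a CR LF pair so that the pair yields only one LF
      if (i < Length (S)) and (S [i + 1] = #10) then
        Inc (i);
    end
    else
      Result := Result + S [i];
    Inc (i);
  end;
end;
```

It would typically be called on content obtained with "Normalize" set to false, because normalized content no longer contains any line breaks.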
TXmlParser does not conform 100 % exactly to the XmlSpec:
This list may be incomplete, so it may grow if I get to know any other points. As work on the parser proceeds, this list may also shrink.
There is a history at the beginning of the LibXmlParser unit's source code. This history is the only valid history. All other histories may be erroneous or out of date.