TXmlParser Reference

This document gives in-depth information about the TXmlParser XML Parser for Borland Delphi. You can download the latest version of the parser at www.destructor.de. I assume that you have an understanding of the terms of XML. The first letter of XML terms is capitalized in this document.

General

TXmlParser is a Delphi CLASS which is responsible for loading an XML document into a buffer and scanning it part by part.

In the terms of the XML specification, TXmlParser is the "XML Processor". The code that uses TXmlParser is the "Application".

The common container class for TXmlParser is the TObjectList that comes with Delphi 5. If you have a version below 5, you can easily derive one from TList. TObjectList assumes that the objects added to the list 'belong' to the list; so calls to "Delete", "Clear" or "Free" also destroy the objects.

It is out of the scope of TXmlParser to load documents from HTTP or FTP servers. If you want to implement an application which is able to do this, you will have to use your own HTTP or FTP (or whatever) network client and hand over the loaded document to the TXmlParser using the LoadFromBuffer or SetBuffer methods.

Optimized for Speed

TXmlParser was developed with speed in mind. You will notice this when you compare the speed with other XML parsers. The idea is that there is a PChar pointer running through the document, analyzing what's at its position.

For this reason, the entire document must be a null terminated string.
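The pointer-walking idea can be illustrated with a small standalone sketch. It does not use TXmlParser and is not taken from its source; CountMarkupStarts is a hypothetical helper that only shows how a PChar can run through a null-terminated buffer until it reaches the terminating #0. It assumes that the pointer passed in is never NIL.

```pascal
program PCharWalk;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

// Hypothetical helper, not part of LibXmlParser: counts '<' characters
// by walking a PChar through a null-terminated buffer.
function CountMarkupStarts (Doc : PChar) : Integer;
var
  P : PChar;
begin
  Result := 0;
  P := Doc;
  while P^ <> #0 do begin        // the terminating #0 ends the walk
    if P^ = '<' then
      Inc (Result);
    Inc (P);                     // advance to the next character
  end;
end;

begin
  WriteLn (CountMarkupStarts ('<a><b/></a>'));
end.
```

Because the loop condition is the only end-of-data check, a buffer without a terminating #0 would make such a walk run past the end of the document.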

Public Interface

Properties

Note: Properties with the prefix "Cur" hold data concerning the current "Scan" step. All other properties are more or less independent of scanning.

XmlVersion

Type: AnsiString, read-only

This string contains the XML version number which is declared in the document's XML Prolog. The string is filled when the Prolog is scanned. Before that, it has the value '1.0'.

Encoding

Type: AnsiString, read-only

The name of the character encoding. Default for XML is 'UTF-8'. Another widely used encoding is 'ISO-8859-1', which is about the same as the Windows "ANSI" 1252 character set (which would be 'windows-1252').

The Encoding property holds the encoding name for the XML Document. If you want to determine the encoding of the current content while scanning, you must use "CurEncoding" (the encoding can change during the document when the parser reads an external entity which has a different encoding than the root document).

You can read the Encoding property at any time, but it will be reset when the XML Prolog is scanned (PartType = ptXmlProlog).

Standalone

Type: Boolean, read-only

If the XML Prolog says "standalone='yes'", then the Standalone property is TRUE, else it is FALSE.

RootName

Type: AnsiString, read-only

The name of the Root Element. This is determined from the DOCTYPE declaration. Until the DOCTYPE declaration is found in the document, "RootName" is empty. It remains set after a "StartScan". Note that RootName will not be set by the start tag of the root element.

Normalize

Type: Boolean, read/write

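Set this property to TRUE if you want Element Content to be normalized. Among other things, this means that Whitespace-only content is not returned (there will be no ptContent part for it).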

You can set Normalize at any time. The next "Scan" call will immediately work accordingly. So if you find a Start Tag with an "xml:space" attribute, you can (and must) set "Normalize" yourself.

Note: Normalization of Attribute Values is completely governed by the XML Spec.

Note: The XML spec requires that all line breaks be changed to single linefeed (#x0A) characters. TXmlParser doesn't change line break characters, so the content will normally contain CR+LF sequences, depending on what's in your XML file.

Source

Type: AnsiString, read-only

This is not the Document source itself but instead the name of the source you got the document from. After LoadFromFile it holds the name of the file; after LoadFromBuffer it has the value '<MEM>'.

DocBuffer

Type: PChar, read-only

Returns a pointer to the first character of the document. If there is no document loaded, "DocBuffer" returns a pointer to a null (#x00) character. So you always have a valid pointer and never NIL.

Elements

Type: TElemList (derived from TObjectList), Attribute

This list contains all Element declarations which have been found in the DTD. Every Element definition is stored in a TElemDef object.

Entities

Type: TNvpList (derived from TObjectList), Attribute

This list contains all General Entity declarations which have been found in the DTD. Every Entity definition is stored in a TEntityDef object.

ParEntities

Type: TNvpList (derived from TObjectList), Attribute

This list contains all Parameter Entity declarations which have been found in the DTD. Every Parameter Entity definition is also stored in a TEntityDef object.

Notations

Type: TNvpList (derived from TObjectList), Attribute

This list contains all Notation declarations which have been found in the DTD. Every Notation definition is stored in a TNotationDef object.

CurPartType

Type: TPartType (Enumeration Type), Attribute

Every time the "Scan" method returns, CurPartType holds the type of the current part which has been found by "Scan". This can be one of the following part types:

CurPartType: ptNone
Meaning: This should never be returned. If it is, there must be an error in the XML document (or in my code ;-)
CurName: Undefined
CurContent: Undefined

CurPartType: ptXmlProlog
Meaning: The XML Prolog has been read in. Now you can read the properties XmlVersion, Encoding, and Standalone.
CurName: Undefined
CurContent: Undefined

CurPartType: ptComment
Meaning: A comment has been found. You can retrieve the comment by extracting the buffer part from CurStart to CurFinal.
CurName: Undefined
CurContent: Untouched

CurPartType: ptPI
Meaning: A Processing Instruction has been found. If it has "pseudo attributes", you can find these in the CurAttr list.
CurName: Target name
CurContent: PI content

CurPartType: ptStartTag
Meaning: A Start Tag has been found. You can find the Attributes in the CurAttr list.
CurName: Element name
CurContent: Untouched

CurPartType: ptEmptyTag
Meaning: An Empty-Element Tag has been found. You can find the Attributes in the CurAttr list. NOTE: TXmlParser distinguishes between Empty-Element Tags and a Start Tag directly followed by an End Tag. So <BR/> will be returned as an Empty-Element Tag (ptEmptyTag) and <BR></BR> will be returned as a Start Tag followed by an End Tag.
CurName: Tag name
CurContent: Untouched

CurPartType: ptEndTag
Meaning: An End Tag has been found.
CurName: Tag name
CurContent: Untouched

CurPartType: ptContent
Meaning: Text Content (the part between Tags) has been found. General Entities have been resolved. If "Normalize" is TRUE, the content is already normalized, and Whitespace-only content is not returned (i.e. there will be no ptContent part for it). The Encoding has been translated by the "TranslateEncoding" method.
CurName: Untouched
CurContent: Content

CurPartType: ptCData
Meaning: A CDATA section has been found. The Encoding has been translated by the "TranslateEncoding" method. Whitespace is unchanged.
CurName: Empty
CurContent: Content

CurName

Type: AnsiString, Attribute

The Name of the last part which has a name (e.g. start tags or PIs have a name, comments or text contents don't have a name). If there is a part without a name, the CurName attribute stays untouched. So when you have a ptContent part, you (usually) know the name of the last tag by looking at CurName.

CurContent

Type: AnsiString, Attribute

The last Content (from a ptContent, ptCData or ptPI). Like CurName, CurContent is not overwritten by parts which have no content (like Tags or Comments).

CurStart, CurFinal

Type: PChar, Attribute

A pointer to the first (CurStart) and last (CurFinal) character of the current part returned by the Scan method. You can use these pointers to retrieve the exact part string. 

Example: To extract the contents of a comment (which is not done by the Scan method), you can use CurStart and CurFinal:

SetString (MyComment, CurStart, CurFinal - CurStart + 1);

or you could use the SetStringSF procedure which is exported by the LibXmlParser unit:

SetStringSF (MyComment, CurStart, CurFinal);

CurAttr

Type: TAttrList, Attribute

This is a list of TAttr Objects. Every TAttr has a Name and a Value field, which contain the name and value of one attribute. The ValueType field of TAttr tells you where the value comes from:

vtNormal: The attribute has been specified completely in the tag.
vtImplied: The attribute value is undefined; the attribute is defined as #IMPLIED by the DTD. Your application must know how to handle this attribute.
vtFixed: The attribute is defined as #FIXED in the DTD. If there was an attribute value in the tag, it has been overwritten by the attribute default value from the DTD.
vtDefault: The attribute was not specified in the tag; instead, it was added because it was defined in the DTD. The attribute value is the default value from the DTD's ATTLIST definition.

The AttrType field tells you the type of the Attribute. It is copied from the TAttrDef object which was created when the attribute was declared in the DTD.

CurEncoding

Type: AnsiString, read-only property

This is the name of the current Encoding. Encoding can change in the middle of the document if the parser has to parse an External Parsed Entity which has a different Encoding than the main document. This value is mainly used by the TranslateEncoding method. But it can also be used by the application.


Methods

PROCEDURE LoadFromFile (Filename : STRING);

Loads the File into the internal Buffer of the TXmlParser instance. If this is successful, the Source property holds the name of the file.

PROCEDURE LoadFromBuffer (Buffer : PChar);

Loads the null-terminated string given by Buffer into the internal Buffer of the TXmlParser instance. The Source property has the value '<MEM>' after this step.

PROCEDURE SetBuffer (Buffer : PChar);

If you already have the XML Document loaded into memory and you don't want TXmlParser to keep the entire document in its own piece of memory, you can use the SetBuffer method. SetBuffer will not allocate memory but instead will let the internal FBuffer attribute point to your Buffer.

Note: The XML document must be null-terminated (i.e. there is a null (#x00) character at the end).

You must not free your buffer before you free your TXmlParser instance or call the Clear method. This would cause access violations.
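The lifetime rule can be sketched as follows. This is an illustration under assumptions, not code from the library: CountParts is a hypothetical helper, TXmlParser is assumed to have a parameterless Create constructor, and PChar is assumed to match the parser's buffer type (as in the Delphi versions this document targets). The string passed in stands for a document your own code has already loaded, for example through a network client.

```pascal
program SetBufferLifetime;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  LibXmlParser;   // the unit that declares TXmlParser

// Hypothetical helper: scans a document that stays owned by the caller
// and returns the number of parts reported by Scan.
function CountParts (const Doc : string) : Integer;
var
  Parser : TXmlParser;
begin
  Result := 0;
  Parser := TXmlParser.Create;        // assumption: parameterless constructor
  try
    Parser.SetBuffer (PChar (Doc));   // no copy is made; Doc must stay alive
    Parser.StartScan;
    while Parser.Scan do
      Inc (Result);
  finally
    Parser.Free;                      // the parser goes first ...
  end;
end;                                  // ... the caller's string is released later

begin
  WriteLn (CountParts ('<note>hello</note>'));
end.
```

The point of the sketch is the order of release: the parser is freed inside the function, while the string that backs the buffer is still alive in the caller.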

PROCEDURE Clear;

Clears all internal variables and deallocates all buffer space previously allocated by the TXmlParser instance. After this, the TXmlParser is prepared for loading a new buffer. Clear is automatically called by the loading methods like LoadFromFile, LoadFromBuffer, or SetBuffer.

PROCEDURE StartScan;

While you scan through your document (using the Scan method), there is always a pointer pointing to the current part of the document (in fact, it's two pointers: CurStart and CurFinal). StartScan initializes all pointers and all Cur* attributes in order to prepare for a new scan from the beginning of the Document.

You may call StartScan as often as you want and at any time.

FUNCTION Scan : BOOLEAN;

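This is where scanning through the XML document really happens. Each call to Scan moves on to the next part of the document and sets the Cur* properties (CurPartType, CurName, CurContent, CurStart, CurFinal, CurAttr) for that part.

After that, Scan returns a boolean value which is TRUE if another part was found and FALSE if the end of the document has been reached.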

This behaviour of the Scan method was chosen so that you can write a WHILE loop for scanning through the document.
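Such a loop might look like the following sketch. It uses only the methods and properties described in this document, but the surrounding program is an assumption: CountElements is a hypothetical helper, TXmlParser is assumed to have a parameterless Create constructor, and PChar is assumed to match the parser's buffer type.

```pascal
program ScanLoop;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  LibXmlParser;   // the unit that declares TXmlParser

// Hypothetical helper: prints elements and text, and returns the
// number of Start Tags and Empty-Element Tags found.
function CountElements (Doc : PChar) : Integer;
var
  Parser : TXmlParser;
begin
  Result := 0;
  Parser := TXmlParser.Create;          // assumption: parameterless constructor
  try
    Parser.LoadFromBuffer (Doc);        // Doc must be null-terminated
    Parser.StartScan;
    while Parser.Scan do                // one iteration per document part
      case Parser.CurPartType of
        ptStartTag,
        ptEmptyTag : begin
                       Inc (Result);    // state lives in local variables
                       WriteLn ('Element: ', Parser.CurName);
                     end;
        ptContent,
        ptCData    : WriteLn ('Text: ', Parser.CurContent);
      end;
  finally
    Parser.Free;
  end;
end;

begin
  WriteLn (CountElements ('<greeting><to>World</to><br/></greeting>'));
end.
```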

With this WHILE loop, you can handle everything that you need in local variables of the procedure/function/method which analyzes your XML Document. In an event centric model, there would be a procedure call for every Document part and so you would have to handle everything in more or less global variables.

Virtual Methods

With the virtual methods of TXmlParser you can modify the behaviour of TXmlParser. Just override them in a class descendant of your own.

FUNCTION LoadExternalEntity (SystemId, PublicId, Notation : STRING) : TXmlParser; VIRTUAL;

Override this method if you have implemented a special mechanism for loading documents, or if you want to process PUBLIC IDs or Notations.

LoadExternalEntity is called for every External Entity (be it Parsed or Unparsed). The known System and/or Public IDs and the Notation are passed as strings to this method. It has to create a new TXmlParser instance and load the desired Entity into the buffer of that instance.

LoadExternalEntity is also called when the External DTD Subset is to be loaded.
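An override could be sketched like this. The class TBaseDirParser and its BaseDir field are hypothetical and exist only for illustration; the method signature is copied from the declaration above, and a parameterless Create constructor is assumed. The sketch ignores PublicId and Notation and simply treats the SYSTEM id as a file name below a base directory.

```pascal
program ExternalEntityOverride;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  SysUtils, LibXmlParser;

type
  // Hypothetical descendant: resolves SYSTEM ids relative to BaseDir.
  TBaseDirParser = class (TXmlParser)
  public
    BaseDir : string;   // assumption: set by the caller, ends with a path delimiter
    function LoadExternalEntity (SystemId, PublicId, Notation : string) : TXmlParser; override;
  end;

function TBaseDirParser.LoadExternalEntity (SystemId, PublicId, Notation : string) : TXmlParser;
begin
  // PublicId and Notation are ignored in this sketch.
  Result := TXmlParser.Create;                 // new instance for the entity
  Result.LoadFromFile (BaseDir + SystemId);    // load the entity into its buffer
end;

var
  Parser : TBaseDirParser;
begin
  Parser := TBaseDirParser.Create;
  try
    Parser.BaseDir := ExtractFilePath (ParamStr (0));
    WriteLn ('External entities are loaded from ', Parser.BaseDir);
  finally
    Parser.Free;
  end;
end.
```

A descendant that fetches entities over a network would follow the same pattern, filling the new instance with LoadFromBuffer instead of LoadFromFile.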

FUNCTION TranslateEncoding (CONST Source : STRING) : STRING; VIRTUAL;

NOTE: This function has been dropped in Version 2. All conversions are now done automatically.

For Version 1: The XML Specification states that every XML parser must be able to handle UTF-8 and UTF-16 documents. Besides these, parsers should be able to handle other Encodings. The encoding for a document is defined in the XML Prolog (for entire XML Documents) or in a Text Declaration at the beginning of Parsed External Entities or External DTD subsets.

So there is a source Encoding (the Encoding of the Document and its external parts) and a destination encoding (the encoding your application wishes to process). For every content string which is passed to your application (Text Content between Tags, CDATA sections, Attribute values) the TranslateEncoding method is called. It retrieves the current source encoding by looking at the CurEncoding property and translates the passed "Source" string into the desired destination encoding.

The TranslateEncoding method that is built into TXmlParser assumes that the destination encoding is the Windows ANSI encoding used in Windows apps. It can handle UTF-8 and ISO-8859-1 as source encodings. Note: It is assumed here that ISO-8859-1 and "Windows ANSI" are the same, which is not exactly true for some characters. But for the largest part of documents this should be true. 

UTF-8 is correctly translated into the single-byte ANSI Windows-1252 format.

At the time of this writing, TXmlParser is not able to handle multi-byte character strings. This is likely to change in the future.

PROCEDURE DtdElementFound (DtdElementRec : TDtdElementRec); VIRTUAL;

The Scan method only tells you that it has scanned the DTD declaration. It doesn't tell you anything about what it found in the DTD. (You could scan the lists Elements, Entities, ParEntities and Notations, but then you would know nothing about comments or PIs inside the DTD.)

If you want to build a validating parser or a tool which presents the elements of the DTD of your XML document, or you want to handle comments or PIs inside the DTD, you can override the DtdElementFound virtual method. It is called every time a DTD element is found during the scan of the Document Type Declaration.

DtdElementFound gets passed a TDtdElementRec, which is a variant record with the following declaration:

  TDtdElementRec = RECORD    // --- This Record is returned by the DTD parser callback function
                     CASE ElementType : TDtdElemType OF
                       deElement,
                       deAttList  : (ElemDef      : TElemDef);
                       deEntity   : (EntityDef    : TEntityDef);
                       deNotation : (NotationDef  : TNotationDef);
                       dePI       : (Target       : PChar;
                                     Content      : PChar;
                                     AttrList     : TAttrList);
                       deComment  : (Start, Final : PChar);
                       deError    : (Pos          : PChar);
                   END;

The ElementType field tells you which type of DTD element the parser has just found. Depending on this field, you can find out what has been found:

deElement, deAttList: An <!ELEMENT> or <!ATTLIST> declaration has been found. The ElemDef field points to the TElemDef instance created (for deElement) or filled with Attribute definitions (for deAttList).
deEntity: An <!ENTITY> declaration has been found. The EntityDef field points to the TEntityDef instance created.
deNotation: A <!NOTATION> declaration has been found. The NotationDef field points to the TNotationDef instance created.
dePI: A Processing Instruction (PI) has been found inside the DTD. Target points to a null-terminated string containing the PI target; Content points to the part between the target and ?> in the PI. AttrList is the list of pseudo attributes in the PI.
deComment: A comment has been found inside the DTD. Start points to the opening '<!--' and Final points to the closing '>' of the comment.
deError: There is an error inside the DTD. Pos points to the position of the error.

Note that all pointers are only valid when DtdElementFound is called. Don't keep them for later use.
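The following sketch shows one way such an override might look. TDtdLister and its ElementCount field are hypothetical; the record fields are the ones from the declaration above, and a parameterless Create constructor is assumed. The comment text is copied into a string immediately because the pointers may not be kept.

```pascal
program DtdLister;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  LibXmlParser;

type
  // Hypothetical descendant that lists what the DTD contains.
  TDtdLister = class (TXmlParser)
  public
    ElementCount : Integer;
    procedure DtdElementFound (DtdElementRec : TDtdElementRec); override;
  end;

procedure TDtdLister.DtdElementFound (DtdElementRec : TDtdElementRec);
var
  Comment : string;
begin
  case DtdElementRec.ElementType of
    deElement  : begin
                   Inc (ElementCount);
                   WriteLn ('ELEMENT  ', DtdElementRec.ElemDef.Name);
                 end;
    deEntity   : WriteLn ('ENTITY   ', DtdElementRec.EntityDef.Name);
    deNotation : WriteLn ('NOTATION ', DtdElementRec.NotationDef.Name);
    deComment  : begin
                   // Copy now: Start and Final are only valid during this call.
                   SetString (Comment, DtdElementRec.Start,
                              DtdElementRec.Final - DtdElementRec.Start + 1);
                   WriteLn ('COMMENT  ', Comment);
                 end;
    deError    : WriteLn ('Error inside the DTD');
  end;
end;

var
  Lister : TDtdLister;
begin
  Lister := TDtdLister.Create;          // assumption: parameterless constructor
  try
    Lister.LoadFromBuffer ('<!DOCTYPE note [<!ELEMENT note (#PCDATA)>]><note>hi</note>');
    Lister.StartScan;
    while Lister.Scan do ;              // the callback fires while the DTD is scanned
    WriteLn ('Element declarations seen: ', Lister.ElementCount);
  finally
    Lister.Free;
  end;
end.
```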


Other Classes

TNvpNode, TNvpList

Name/Value Pairs (NVPs) are not handled using a TStringList (and its Names and Values properties). Instead, they are handled in a TNvpNode with the fields Name and Value and a special list for such nodes, TNvpList, which has special methods to get elements from the list. This concept was introduced because there are nodes derived from TNvpNode (like TAttr) which have additional fields.

PROCEDURE Add (Node : TNvpNode);

Adds a new node to the list. Nodes are always sorted by name so the Node method can use a binary search.

FUNCTION Node (Name : STRING) : TNvpNode;

Retrieves the node instance with the given name. If the node cannot be found, NIL is returned.

FUNCTION Node (Index : INTEGER) : TNvpNode;

Retrieves the node instance at the given position in the list. Index is zero-based, i.e. for Index=1 the second element is returned. If Index is smaller than zero or larger than the number of nodes in the list (minus 1), NIL is returned.

FUNCTION Value (Name : STRING) : STRING;

Retrieves the string value of the given name. If there is no node with the name, an empty string is returned.

FUNCTION Value (Index : INTEGER) : STRING;

Retrieves the string value at the given position in the list. Index is zero-based, i.e. for Index=1 the second element is returned. If Index is smaller than zero or larger than the number of nodes in the list (minus 1), an empty string is returned.

FUNCTION Name (Index : INTEGER) : STRING;

Retrieves the name of the node at the given position in the list. Index is zero-based, i.e. for Index=1 the second element is returned. If Index is smaller than zero or larger than the number of nodes in the list (minus 1), an empty string is returned.
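As an illustration of these accessors, the sketch below reads the attributes of the first tag of a document through CurAttr, which is a TAttrList and therefore a TNvpList. FirstTagAttr is a hypothetical helper; a parameterless Create constructor and the Count property inherited from TObjectList are assumed.

```pascal
program AttrLookup;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  LibXmlParser;

// Hypothetical helper: prints all attributes of the first tag and
// returns the value of the attribute called AttrName ('' if absent).
function FirstTagAttr (Doc : PChar; const AttrName : string) : string;
var
  Parser : TXmlParser;
  I      : Integer;
begin
  Result := '';
  Parser := TXmlParser.Create;          // assumption: parameterless constructor
  try
    Parser.LoadFromBuffer (Doc);
    Parser.StartScan;
    while Parser.Scan do
      if Parser.CurPartType in [ptStartTag, ptEmptyTag] then begin
        // Access by zero-based index
        for I := 0 to Parser.CurAttr.Count - 1 do
          WriteLn (Parser.CurAttr.Name (I), ' = ', Parser.CurAttr.Value (I));
        // Access by name
        Result := Parser.CurAttr.Value (AttrName);
        Break;
      end;
  finally
    Parser.Free;
  end;
end;

begin
  WriteLn (FirstTagAttr ('<item id="7" lang="en"/>', 'id'));
end.
```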

TAttr, TAttrList

Derived from TNvpNode and TNvpList. Used for passing back Tag attributes to the application.

TAttr has two additional fields to hold information about the Attribute, ValueType and AttrType:

ValueType: See the description of the CurAttr property.
AttrType: The type of the attribute, as declared in the DTD:

  atUnknown: Unknown type
  atCData: Character data only
  atID: An ID (unique attribute value)
  atIdRef: An ID reference
  atIdRefs: Several ID references, separated by Whitespace
  atEntity: Name of an unparsed Entity
  atEntities: Several unparsed Entity names, separated by Whitespace
  atNmToken: Name Token
  atNmTokens: Several Name Tokens, separated by Whitespace
  atNotation: A selection of Notation names (Unparsed Entity), separated by pipe symbols. You can find these in the Notations field of the TAttrDef definition to which the Attribute belongs.
  atEnumeration: Enumeration (possible values, separated by pipe symbols). You can find the enumeration definition in the TypeDef field of the TAttrDef definition to which the Attribute belongs.

TEntityStack, TEntityStackNode

When the parser scans through the document, it can find a reference to a parsed entity, internal or external. In this case, the current position pointer is pushed to a stack (the EntityStack) and set to the first character of the entity replacement text. After the entity is scanned, the old pointer is popped off the stack and processing of the original document continues. As Entity references may nest, this has to be organized as a stack.

TAttrDef

Every <!ATTLIST> declaration is transferred into a TAttrDef instance, which is inserted into the TElemDef to which the Attribute definition belongs. TAttrDef has the following fields:

Name: Name of the Attribute
Value: Default value
TypeDef: The Type definition from the <!ATTLIST> declaration
Notations: The listing of notations, if it is a NOTATION attribute. The notation names are separated by pipe symbols.
AttrType: Type of the Attribute
DefaultType: Type of the default value declaration of the Attribute (normal default value, #REQUIRED, #IMPLIED, #FIXED)

TElemDef, TElemList

TElemDef holds the data of an <!ELEMENT> definition:

Name: Name of the Element
ElemType: Type of the element:

  etEmpty: Element is always an Empty Element
  etAny: Element can have any mixture of PCDATA and any elements
  etChildren: Element must contain only elements, no PCDATA
  etMixed: Mixed PCDATA and elements. The Definition field holds the exact definition as specified in the DTD.

Definition: The exact definition of the element from the DTD

As TElemDef is both a node and a list, there is a special TElemList, which has almost the same code as TNvpList.

TEntityDef

TEntityDef holds the data of an <!ENTITY> definition. Depending on the type (General or Parameter Entity), the TEntityDef node is added to the Entities or ParEntities list.

Name: Name of the entity
Value: The replacement text of the entity
SystemId: For External Entities, this field contains the SYSTEM id
PublicId: For External Entities, this field contains the PUBLIC id. This field may be empty.
NotationName: For NDATA Unparsed External Entities, this field contains the Notation Name.

TNotationDef

TNotationDef holds the data of a <!NOTATION> definition:

Name: Name of the notation
Value: SYSTEM id
PublicId: PUBLIC id

Standalone Procedures and Functions

FUNCTION ConvertWs (Source: STRING; PackWs: BOOLEAN) : STRING;

Converts all Whitespace characters (Space, Tab, Carriage Return, Linefeed) in the String to Space #x20 characters. If the PackWs parameter is true, contiguous whitespace characters will be packed to one space character.
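The described behaviour can be mimicked with a few lines of standalone code. ConvertWsSketch below is a hypothetical stand-in written from the description above, not the implementation in LibXmlParser; it is only meant to make the effect of the PackWs parameter concrete.

```pascal
program ConvertWsDemo;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

// Hypothetical stand-in for ConvertWs, written from its description:
// every Space, Tab, CR or LF becomes a space; with PackWs = True a run
// of contiguous whitespace collapses into a single space.
function ConvertWsSketch (const Source : string; PackWs : Boolean) : string;
var
  I         : Integer;
  LastWasWs : Boolean;
begin
  Result := '';
  LastWasWs := False;
  for I := 1 to Length (Source) do
    if Source [I] in [#9, #10, #13, #32] then begin
      if not (PackWs and LastWasWs) then
        Result := Result + ' ';
      LastWasWs := True;
    end
    else begin
      Result := Result + Source [I];
      LastWasWs := False;
    end;
end;

begin
  WriteLn ('[', ConvertWsSketch ('a'#9#13#10'b', True), ']');
  WriteLn ('[', ConvertWsSketch ('a'#9#13#10'b', False), ']');
end.
```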

PROCEDURE SetStringSF (VAR S : STRING; BufferStart, BufferFinal : PChar);

The same as System.SetString, except that the last parameter denotes the position of the last character to transfer into the string, not the length.

FUNCTION StrSFPas (Start, Finish : PChar) : STRING;

Same as SysUtils.StrPas. In addition to the start of the string, the last character is also passed (Finish).

FUNCTION TrimWs (Source : STRING) : STRING;

Trims all whitespace characters off the beginning and end of the Source string.

FUNCTION AnsiToUtf8 (Source : ANSISTRING) : STRING;

Converts the Windows 1252 ANSI Source string to a UTF-8 string.

FUNCTION Utf8ToAnsi (Source : STRING; UnknownChar : CHAR = '¿') : ANSISTRING;

Converts the given UTF-8 string to Windows ANSI. Unicode characters which don't fit into the Windows-1252 range are converted to the "UnknownChar" character, which defaults to an inverted question mark.