SGML or XML files are loaded through the common predicate load_structure/3. This is a predicate with many options. For simplicity a number of commonly used shorthands are provided: load_sgml_file/2, load_xml_file/2, and load_html_file/2.
Source is either a term of the format stream(StreamHandle) or a file-name. Options is a list of options controlling the conversion process.
A proper XML document contains only a single toplevel element whose name matches the document type. Nevertheless, a list is returned for consistency with the representation of element content. The ListOfContent consists of the following types:
Atoms are used to represent CDATA. Note this is possible in SWI-Prolog, as there is no length-limit on atoms and atom garbage collection is provided.
ListOfAttributes is a list of Name=Value pairs for attributes. Attributes of type CDATA are returned literal. Multi-valued attributes (NAMES, etc.) are returned as a list of atoms. Handling attributes of the types NUMBER and NUMBERS depends on the setting of the number(NumberMode) option. By default they are returned as atoms, but automatic conversion to Prolog integers is supported. ListOfContent defines the content for the element.
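sdata(Text): If an entity with declared content-type SDATA is encountered, this term is returned holding the data in Text.
ndata(Text): If an entity with declared content-type NDATA is encountered, this term is returned holding the data in Text.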
pi(Text): If a processing instruction is encountered (<?...?>), Text holds the text of the processing instruction. Please note that the <?xml ...?> instruction is handled internally.
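As a minimal sketch of the returned structure, the fragment below parses XML held in memory rather than in a file. The helper name parse_xml_text/2 is hypothetical and not part of the library; it assumes open_string/2 is available to turn text into an input stream.

```prolog
:- use_module(library(sgml)).

% parse_xml_text(+Text, -Content)
% Hypothetical helper: parse XML text from memory using the
% stream(StreamHandle) form of the first argument of load_structure/3.
parse_xml_text(Text, Content) :-
    setup_call_cleanup(
        open_string(Text, In),
        load_structure(stream(In), Content, [dialect(xml)]),
        close(In)).
```

For example, the query ?- parse_xml_text('<note lang="en">Hello</note>', C). binds C to [element(note, [lang=en], ['Hello'])]: a list holding one element term with a Name=Value attribute list and an atom for the CDATA.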
The Options list controls the conversion process. Currently defined options are below. Other options are passed to sgml_parse/2.
<!DOCTYPE ...> declaration is ignored and the document is parsed and validated against the provided DTD. If provided as a variable, the created DTD is returned. See section 3.5.
xmlns. See the option dialect of set_sgml_parser/2 for details.
Without SHORTTAG, a / is accepted with warning as part of an unquoted attribute-value, though /> still closes the element-tag in XML mode. It may be set to false for parsing HTML documents to allow for unquoted URLs containing /.
xml:space. See section 3.2.
NUMBERS are handled. If token (default) they are passed as an atom. If integer the parser attempts to convert the value to an integer. If successful, the attribute is passed as a Prolog integer. Otherwise it is still passed as an atom. Note that SGML defines a numeric attribute to be a sequence of digits. The - sign is not allowed and 1 is different from 01. For this reason the default is to handle numeric attributes as tokens. If conversion to integer is enabled, negative values are silently accepted.
true for XML and false for SGML and HTML dialects.
false. Setting this option sets the case_sensitive_attributes to the same value. This option was added to support HTML quasi quotations and most likely has little value in other contexts.
false, only the attributes occurring in the source are emitted.
CDATA entities can be specified with this construct. Multiple entity options are allowed.
max_memory(0) (the default) means no resource limit will be enforced.
string. The choice is not obvious. Strings are allocated on the Prolog stacks and subject to normal stack garbage collection. They are quicker to create and avoid memory fragmentation. But, multiple copies of the same string are stored multiple times, while the text is shared if atoms are used. Strings are also useful for security sensitive information as they are invisible to other threads and cannot be enumerated using, e.g., current_atom/1. Finally, using strings allows for resource usage limits using the global stack limit (see set_prolog_stack/2).
string. See above for the advantages and disadvantages of using strings.
true, xmlns namespaces with prefixes are returned as ns(Prefix, URI) terms. If false (default), the prefix is ignored and the xmlns namespace is returned as just the URI.
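The choice between atoms and strings discussed above can be sketched as follows. The option names cdata(string) and attribute_value(string) are assumptions here, as the option headers are not shown in this text, and parse_with_strings/2 is a hypothetical helper.

```prolog
:- use_module(library(sgml)).

% parse_with_strings(+Text, -Content)
% Hypothetical helper: request strings instead of atoms for both
% CDATA and attribute values. The option names cdata/1 and
% attribute_value/1 are assumed, not taken from the text above.
parse_with_strings(Text, Content) :-
    setup_call_cleanup(
        open_string(Text, In),
        load_structure(stream(In), Content,
                       [ dialect(xml),
                         cdata(string),
                         attribute_value(string)
                       ]),
        close(In)).
```

Under these assumptions, parsing '<a x="1">hi</a>' should yield an element whose attribute value and content are Prolog strings rather than atoms, which can be checked with string/1.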
SGML2PL has four modes for handling white-space. The initial mode can be switched using the space(SpaceMode) option to load_structure/3 and set_sgml_parser/2. In XML mode, the mode is further controlled by the xml:space attribute, which may be specified both in the DTD and in the document. The defined modes are:
In addition to sgml space-mode, all consecutive white-space is reduced to a single space-character. This mode canonicalises all white space.
In addition to default, all leading and trailing white-space is removed from CDATA objects. If, as a result, the CDATA becomes empty, nothing is passed to the application. This mode is especially handy for processing 'data-oriented' documents, such as RDF. It is not suitable for normal text documents. Consider the HTML fragment below. When processed in this mode, the spaces between the three modified words are lost. This mode is not part of any standard; XML 1.0 allows only default and preserve.
Consider adjacent <b>bold</b> <ul>and</ul> <it>italic</it> words.
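To see the effect of a white-space mode on such a fragment, the sketch below parses the same text under a chosen mode so the results can be compared. parse_with_space/3 is a hypothetical helper; the mode names passed to space/1 (for example preserve and remove) are those named in this section.

```prolog
:- use_module(library(sgml)).

% parse_with_space(+Mode, +Text, -Content)
% Hypothetical helper: parse Text as XML with the initial
% white-space mode set to Mode through the space/1 option.
parse_with_space(Mode, Text, Content) :-
    setup_call_cleanup(
        open_string(Text, In),
        load_structure(stream(In), Content,
                       [ dialect(xml),
                         space(Mode)
                       ]),
        close(In)).
```

Calling it twice on '<p>Consider adjacent <b>bold</b> <i>italic</i> words.</p>', once with preserve and once with remove, and comparing the two content lists shows which white-space-only CDATA is dropped.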
The parser can operate in two modes: sgml mode and xml mode, as defined by the dialect(Dialect) option. Regardless of this option, if the first line of the document reads as below, the parser is switched automatically into XML mode.
<?xml ... ?>
Currently switching to XML mode implies:
<element [attribute...] /> is recognised as an empty element.
In XML mode, underscores (_) and colon (:) are allowed in names.
preserve. In addition to setting white-space handling at the toplevel the XML reserved attribute xml:space is honoured. It may appear both in the document and the DTD. The remove extension is honoured as xml:space value. For example, the DTD statement below ensures that the pre element preserves space, regardless of the default processing mode.
<!ATTLIST pre xml:space nmtoken #fixed preserve>
Using the dialect xmlns, the parser will interpret XML namespaces. In this case, the names of elements are returned as a term of the format URL:LocalName. If an identifier has no namespace and there is no default namespace it is returned as a simple atom. If an identifier has a namespace but this namespace is undeclared, the namespace name rather than the related URL is returned.
Attributes declaring namespaces (xmlns:<ns>=<url>) are reported as if xmlns were not a defined resource.
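A small sketch of namespace-aware parsing follows. parse_xmlns/2 is a hypothetical helper and the namespace URL is an arbitrary example value.

```prolog
:- use_module(library(sgml)).

% parse_xmlns(+Text, -Content)
% Hypothetical helper: parse Text with the xmlns dialect so that
% element names are returned as URL:LocalName terms.
parse_xmlns(Text, Content) :-
    setup_call_cleanup(
        open_string(Text, In),
        load_structure(stream(In), Content, [dialect(xmlns)]),
        close(In)).
```

With '<doc xmlns="http://example.org/ns"><item/></doc>' as input, both the doc and the item element are expected to be named 'http://example.org/ns':Name, since the default namespace applies to nested elements.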
In many cases, getting attribute-names as url:name is not desirable. Such terms are hard to unify and sometimes multiple URLs may be mapped to the same identifier. This may happen due to poor version management, poor standardisation or because the application doesn't care too much about versions. This package defines two call-backs that can be set using set_sgml_parser/2 to deal with this problem.
The call-back xmlns is called as XML namespaces are noticed. It can be used to extend a canonical mapping for later use by the urlns call-back. The following illustrates this behaviour. Any namespace containing rdf-syntax in its URL or that is used as rdf namespace is canonicalised to rdf. This implies that any attribute and element name from the RDF namespace appears as rdf:<name>.
    :- dynamic
        xmlns/3.

    on_xmlns(rdf, URL, _Parser) :- !,
        asserta(xmlns(URL, rdf, _)).
    on_xmlns(_, URL, _Parser) :-
        sub_atom(URL, _, _, _, 'rdf-syntax'), !,
        asserta(xmlns(URL, rdf, _)).

    load_rdf_xml(File, Term) :-
        load_structure(File, Term,
                       [ dialect(xmlns),
                         call(xmlns, on_xmlns),
                         call(urlns, xmlns)
                       ]).
The library provides iri_xml_namespace/3 to break down an IRI into its namespace and localname:
Note, however, that this can produce unexpected results. E.g., in the example below, one might expect the namespace to be http://example.com/images#, but an XML name cannot start with a digit.
    ?- iri_xml_namespace('http://example.com/images#12345', NS, L).
    NS = 'http://example.com/images#12345',
    L = ''.
As we see from the example above, the Localname can be the empty atom. Similarly, Namespace can be the empty atom if IRI is an XML name. Applications will often have to check for either or both these conditions. We decided against failing in these conditions because the application typically wants to know which of the two conditions (empty namespace or empty localname) holds. This predicate is often used for generating RDF/XML from an RDF graph.
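Because either part may come back empty, a caller usually has to branch on the result. The sketch below shows one possible way to do this; iri_split_kind/2 and the result terms split/2, no_localname/1 and no_namespace/1 are hypothetical names chosen for illustration.

```prolog
:- use_module(library(sgml)).

% iri_split_kind(+IRI, -Kind)
% Hypothetical helper: classify the result of iri_xml_namespace/3 so
% the caller can tell an empty localname from an empty namespace.
iri_split_kind(IRI, Kind) :-
    iri_xml_namespace(IRI, Namespace, Localname),
    (   Localname == ''
    ->  Kind = no_localname(Namespace)
    ;   Namespace == ''
    ->  Kind = no_namespace(Localname)
    ;   Kind = split(Namespace, Localname)
    ).
```

For 'http://example.com/ns#name' this gives split('http://example.com/ns#', name), while the IRI from the example above gives no_localname/1 holding the whole IRI.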
The DTD (Document Type Definition) is a separate entity in sgml2pl, that can be created, freed, defined and inspected. Like the parser itself, it is filled by opening it as a Prolog output stream and sending data to it. This section summarises the predicates for handling the DTD.
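As a rough sketch of filling a DTD through an output stream, the helper below creates a DTD object, writes declarations to it and lists the elements it then knows. elements_declared_in/2 is a hypothetical name and the doctype doc is an arbitrary choice; the sketch assumes new_dtd/2, open_dtd/3, dtd_property/2 and free_dtd/1 behave as summarised in this section.

```prolog
:- use_module(library(sgml)).

% elements_declared_in(+Text, -Elements)
% Hypothetical helper: load DTD declarations from Text into a fresh
% DTD object and return the sorted list of element names it contains.
elements_declared_in(Text, Elements) :-
    new_dtd(doc, DTD),
    call_cleanup(
        (   setup_call_cleanup(
                open_dtd(DTD, [], Out),     % the DTD as an output stream
                write(Out, Text),
                close(Out)),
            dtd_property(DTD, elements(Unsorted)),
            msort(Unsorted, Elements)
        ),
        free_dtd(DTD)).
```

Given the text '<!ELEMENT x - - (y|#PCDATA)*> <!ELEMENT y - - (#PCDATA)>', Elements should be [x, y].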
dialect option from open_dtd/3 and the encoding option from open/4. Notably the dialect option must match the dialect used for subsequent parsing using this DTD.
xmlns processes the DTD case-sensitive.
dtd using the call:
    ...,
    absolute_file_name(dtd(Type),
                       [ extensions([dtd]),
                         access(read)
                       ],
                       DtdFile),
    ...
Note that DTD objects may be modified while processing erroneous
documents. For example, loading an SGML document starting with
<?xml ...?> switches the DTD to XML mode and
encountering unknown elements adds these elements to the DTD object.
Re-using a DTD object to parse multiple documents should be restricted
to situations where the documents processed are known to be error-free.
The DocType html is handled separately. The Prolog flag html_dialect specifies the default html dialect, which is either html4 or html5. Note that HTML5 has no DTD. The loaded DTD is an informal DTD that includes most of the HTML5 extensions (http://www.cs.tut.fi/~jkorpela/html5-dtd.html). In addition, the parser sets the dialect flag of the DTD object. This is used by the parser to accept HTML extensions. Next, the corresponding DTD is loaded.
omit(OmitOpen, OmitClose), where both arguments are booleans (true or false) representing whether the open- or close-tag may be omitted. Content is the content-model of the element represented as a Prolog term. This term takes the following form:
cdata, but entity-references are expanded.
nutoken. For DTD types that allow for a list, the notation list(Type) is used. Finally, the DTD construct (a|b|...) is mapped to the term nameof(ListOfValues).
Default describes the sgml default. It is one of required, current, conref or implied. If a real default is present, it is one of default(Value) or fixed(Value).
As this parser allows for processing partial documents and processes the DTD separately, the DOCTYPE declaration plays a special role.
If a document has no DOCTYPE declaration, the parser returns a list holding all elements and CDATA found. If the document has a DOCTYPE declaration, the parser will open the element defined in the DOCTYPE as soon as the first real data is encountered.
Some documents have no DTD. One of the neat facilities of this
library is that it builds a DTD while parsing a document with an
implicit DTD. The resulting DTD contains all elements
encountered in the document. For each element the content model is a
disjunction of elements and possibly
#PCDATA that can be
repeated. Thus, if we found element
y and CDATA in element
x, the model is:
<!ELEMENT x - - (y|#PCDATA)*>
Any encountered attribute is added to the attribute list with the type CDATA and default #IMPLIED.
The example below extracts the elements used in an unknown XML document.
    elements_in_xml_document(File, Elements) :-
        load_structure(File, _,
                       [ dialect(xml),
                         dtd(DTD)
                       ]),
        dtd_property(DTD, elements(Elements)),
        free_dtd(DTD).
sgml, but implies shorttag(false) and accepts XML empty element declarations (e.g.,
html, accept attributes named data- without warning. This value initialises the charset to UTF-8.
xhtml5 accepts attributes named
<?xml ...> is encountered. See section 3.3 for details.
xmlns) mode. Default and standard compliant is not to qualify such elements. If
true, such attributes are qualified with the namespace of the element they appear in. This option is for backward compatibility as this is the behaviour of older versions. In addition, the namespace document suggests unqualified attributes are often interpreted in the namespace of their element.
token (default), attributes of type number are passed as a Prolog atom. If integer, such attributes are translated into Prolog integers. If the conversion fails (e.g. due to overflow) a warning is issued and the value is passed as an atom.
encoding= attribute in the header. Explicit use of this option is only required to parse non-conforming documents. Currently accepted values are
<!DOCTYPE declaration has been parsed, the default is the defined doctype. The parser can be instructed to accept the first element encountered as the toplevel using doctype(_). This feature is especially useful when parsing part of a document (see the parse option to sgml_parse/2).
on_begin, etc. callbacks from sgml_parse/2.
end) is caused by an element written down using the shorttag notation (
#pcdata is part of Elements. If no element is open, the doctype is returned.
This option is intended to support syntax-sensitive editors. Such an editor should load the DTD, find an appropriate starting point and then feed all data between the starting point and the caret into the parser. Next it can use this option to determine the elements allowed at this point. Below is a code fragment illustrating this use given a parser with loaded DTD, an input stream and a start-location.
    ...,
    seek(In, Start, bof, _),
    set_sgml_parser(Parser, charpos(Start)),
    set_sgml_parser(Parser, doctype(_)),
    Len is Caret - Start,
    sgml_parse(Parser,
               [ source(In),
                 content_length(Len),
                 parse(input)   % do not complete document
               ]),
    get_sgml_parser(Parser, allowed(Allowed)),
    ...
Input is a stream. A full description of the option-list is below.
string. See load_structure/3 for details.
source(Stream), this implies reading is stopped as soon as the element is complete, and another call may be issued on the same stream to read the next element.
element but assumes the element has already been opened. It may be used in a call-back from call(begin, Handler) to parse individual elements after validating their headers.
allowed(Elements) option of get_sgml_parser/2. It disables the parser's default to complete the parse-tree by closing all open elements.
max_errors(-1) makes the parser continue, no matter how many errors it encounters.
error(limit_exceeded(max_errors, Max), _)
quiet, the error is suppressed. Can be used together with call(urlns, Closure) to provide external expansion of namespaces. See also section 3.3.1.
Handler(+Tag, +Attributes, +Parser).
Handler(+CDATA, +Parser), where CDATA is an atom representing the data.
Handler(+Text, +Parser), where Text is the text of the processing instruction.
<!...>) has been read. The named handler is called with two arguments:
Handler(+Text, +Parser), where Text is the text of the declaration with comments removed.
This option is especially useful for highlighting declarations and comments in editor support, where the location of the declaration is extracted using get_sgml_parser/2.
Handler(+Severity, +Message, +Parser), where Severity is one of warning or error and Message is an atom representing the diagnostic message. The location of the error can be determined using get_sgml_parser/2.
If this option is present, errors and warnings are not reported using print_message/3.
xmlns mode, a new namespace declaration is pushed on the environment. The named handler is called with three arguments: Handler(+NameSpace, +URL, +Parser). See section 3.3.1 for details.
xmlns mode, this predicate can be used to map a url into either a canonical URL for this namespace or another internal identifier. See section 3.3.1 for details.
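The call-back interface can be illustrated with a small sketch that records every open-tag seen during a parse. The names collect_tags/2, on_begin_tag/3 and seen_tag/1 are hypothetical; the handler follows the Handler(+Tag, +Attributes, +Parser) signature given above, and the sketch assumes the begin event is also raised for empty elements.

```prolog
:- use_module(library(sgml)).

:- dynamic seen_tag/1.

% collect_tags(+Text, -Tags)
% Hypothetical helper: parse Text in XML mode and return the names of
% all elements in the order in which their open-tags are encountered.
collect_tags(Text, Tags) :-
    retractall(seen_tag(_)),
    new_sgml_parser(Parser, []),
    call_cleanup(
        (   set_sgml_parser(Parser, dialect(xml)),
            setup_call_cleanup(
                open_string(Text, In),
                sgml_parse(Parser,
                           [ source(In),
                             call(begin, on_begin_tag)
                           ]),
                close(In))
        ),
        free_sgml_parser(Parser)),
    findall(Tag, seen_tag(Tag), Tags).

% Called by the parser for each open-tag.
on_begin_tag(Tag, _Attributes, _Parser) :-
    assertz(seen_tag(Tag)).
```

For example, ?- collect_tags('<a><b/><c/></a>', Tags). should give Tags = [a, b, c] without building a term for the whole document.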
In some cases, part of a document needs to be parsed. One option is
to use load_structure/3
or one of its variations and extract the desired elements from the
returned structure. This is a clean solution, especially on small and
medium-sized documents. It however is unsuitable for parsing really big
documents. Such documents can only be handled with the call-back output
interface realised by the
call(Event, Action) option of sgml_parse/2.
Event-driven processing is not very natural in Prolog.
The SGML2PL library allows for a mixed approach. Consider the case where we want to process all descriptions from RDF elements in a document. The code below calls process_rdf_description(Element) on each element that is directly inside an RDF element.
    :- dynamic
        in_rdf/0.

    load_rdf(File) :-
        retractall(in_rdf),
        open(File, read, In),
        new_sgml_parser(Parser, []),
        set_sgml_parser(Parser, file(File)),
        set_sgml_parser(Parser, dialect(xml)),
        sgml_parse(Parser,
                   [ source(In),
                     call(begin, on_begin),
                     call(end, on_end)
                   ]),
        close(In).

    on_end('RDF', _) :-
        retractall(in_rdf).

    on_begin('RDF', _, _) :-
        assert(in_rdf).
    on_begin(Tag, Attr, Parser) :-
        in_rdf, !,
        % parse the content of the already opened element
        sgml_parse(Parser,
                   [ document(Content),
                     parse(content)
                   ]),
        process_rdf_description(element(Tag, Attr, Content)).