Unicode Prolog source
The ISO standard specifies the Prolog syntax in ASCII characters. As SWI-Prolog supports Unicode in source files, the syntax must be extended. This section describes the implications for source files; writing international source files is described in section 3.1.3.
The SWI-Prolog Unicode character classification follows the Unicode
release reported by the read-only Prolog flag unicode_syntax_version.
Note that char_type/2
and friends, intended for processing arbitrary text rather than Prolog
source code, are based on the C library locale-based classification
routines, and that the predicates in
library(unicode) report the version of the bundled utf8proc
data, which may differ from the syntax classifier's version (see
unicode_version/1).
The escape sequences \uXXXX and \UXXXXXXXX (see
section 2.15.1.3)
were introduced to specify Unicode code points in ASCII files.
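As a small illustration of these escapes, the sketch below keeps the source file in plain ASCII while still producing an atom that contains U+00E9. The predicate names escaped_atom/1 and check_escape/0 are hypothetical and exist only for this example.

```prolog
% The source stays 7-bit ASCII; \u00E9 denotes U+00E9 (e with acute).
% escaped_atom/1 is a hypothetical name used only for this sketch.
escaped_atom('caf\u00E9').

% Succeeds if the escaped atom consists of the four expected code
% points: c (99), a (97), f (102) and U+00E9 (233).
check_escape :-
    escaped_atom(A),
    atom_codes(A, Codes),
    Codes == [99, 97, 102, 0xE9],
    atom_length(A, 4).
```

Running `?- check_escape.` should succeed regardless of the encoding used to load the file, because the file itself contains no non-ASCII bytes.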
Identifiers are built from the Unicode XID_Start and XID_Continue
sets: an identifier starts with an XID_Start code point
followed by zero or more XID_Continue code points. As a
profile addition, the superscript digits (⁰-⁹, i.e. U+00B2,
U+00B3, U+00B9, U+2070, U+2074..U+2079) and the subscript digits
(₀-₉, U+2080..U+2089) are also accepted as XID_Continue,
allowing variables such as X² and X₁. Such a sequence is handled as a
single token. The token is a variable iff it starts with an
underscore (_) or with a code point in general category
Lu (uppercase letter). Otherwise it is an atom. Note that
titlecase letters (general category Lt, e.g. Dž) start an
atom, not a variable; this differs from earlier releases that used the
broader derived Uppercase property. Many languages do not
have the notion of character case; in such languages variables
must be written as _name.
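The variable-versus-atom rule can be explored with a small helper that reads a piece of text and reports what kind of token it produced. This is a minimal sketch: token_kind/2 is a hypothetical name, not a library predicate, and the test inputs use \uXXXX escapes so that the example does not depend on the file encoding.

```prolog
% token_kind(+Text, -Kind): read Text as a term and classify it as
% variable, atom, other, or syntax_error.
% Hypothetical helper for this sketch, not part of any library.
token_kind(Text, Kind) :-
    catch(term_to_atom(Term, Text), error(syntax_error(_), _), fail),
    !,
    (   var(Term)  -> Kind = variable
    ;   atom(Term) -> Kind = atom
    ;   Kind = other
    ).
token_kind(_, syntax_error).

% Queries to try (results follow the rules described above):
% ?- token_kind('\u00C9t\u00E9', K).   % starts with U+00C9, category Lu
% ?- token_kind('\u00E9t\u00E9', K).   % starts with U+00E9, category Ll
% ?- token_kind('_name', K).           % starts with an underscore
% ?- token_kind('\u01C5', K).          % titlecase letter, category Lt
```

Under the rules in the text, the first and third queries should report variable and the second atom. The titlecase query is the case whose answer differs between releases, so it is a convenient probe for which classification a given installation uses.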
In source code (read_term/2),
numeric literals use ASCII digits
0-9 only. Conversion via atom_number/2, number_codes/2,
and number_chars/2
additionally accepts digits from any Unicode Nd block for integers,
rational numbers (see section
2.15.1.6) and floating point numbers. Within a single number all digits
must come from the same block: if the numerator of a rational uses
digits of an Indian script, the denominator must too. The sign, rational separator,
decimal point, and floating point exponent
are always ASCII.
Layout (white space) is the Pattern_White_Space set defined by UAX #31:
U+0009..U+000D, U+0020, U+0085, U+200E, U+200F, U+2028, and U+2029. NBSP
(U+00A0) is deliberately excluded: when it
appears outside quoted material it raises a stray-character syntax
error. Programs pasted from word processors occasionally
contain NBSP in the wrong place, and reporting it explicitly is
preferable to silently treating it as a separator.
The Prolog end-of-line characters terminate % line comments, drive the source-position
line counter, and act as the newline for backslash-newline
continuation inside quoted strings (\<EOL> followed
by zero or more blanks is consumed). The same set is exposed to user
code as the
prolog_end_of_line type in code_type/2
and char_type/2;
the unprefixed end_of_line stays restricted to the four
ISO/POSIX control codes (LF, VT, FF, CR), and the eleven-member
Pattern_White_Space set itself is available as prolog_layout.
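The difference between these character types can be inspected from user code. The sketch below is written under assumptions: candidates/1 and codes_of_type/2 are hypothetical names, and the candidate list is simply a set of code points commonly treated as line terminators, not a statement of which ones prolog_end_of_line contains.

```prolog
% Candidate line terminators (an assumption for this sketch):
% LF, VT, FF, CR, NEL, LINE SEPARATOR, PARAGRAPH SEPARATOR.
candidates([0x0A, 0x0B, 0x0C, 0x0D, 0x85, 0x2028, 0x2029]).

% codes_of_type(+Type, -Codes): the candidates that code_type/2
% classifies as Type. If code_type/2 raises an error, for example
% because the running release does not know Type, Codes is unified
% with the atom `unsupported` instead.
codes_of_type(Type, Codes) :-
    candidates(Cs),
    catch(findall(C, (member(C, Cs), code_type(C, Type)), Codes),
          error(_, _),
          Codes = unsupported).

% ?- codes_of_type(end_of_line, Cs).
% ?- codes_of_type(prolog_end_of_line, Cs).
```

Following the text, the first query should report only the four control codes 10 to 13, while the second shows the wider set used by the reader on releases that provide it.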
Any other code point that appears outside quoted material raises
syntax_error(illegal_character). This includes the C0 and
C1 control ranges (except for the members of Pattern_White_Space), unassigned and noncharacter code points, surrogates,
the Zs / Zl / Zp separator
classes that are not in Pattern_White_Space (NBSP, OGHAM
SPACE MARK, NARROW NO-BREAK SPACE, IDEOGRAPHIC SPACE, ...),
Cf format characters that are in neither
Pattern_White_Space nor Other_ID_Continue
(SOFT HYPHEN, ZERO WIDTH SPACE, ...), enclosing combining marks (Me),
and other-number characters (No, e.g. the vulgar fractions
U+00BC..U+00BE) that are not in the explicit
super- or subscript-digit profile.
Combining marks (Mn, Mc) are
likewise rejected at token-start position, since they cannot start an
identifier, but they are in XID_Continue and are therefore
absorbed into a preceding identifier: the sequence U+0061 followed by
U+0300 COMBINING GRAVE ACCENT reads as a single-token identifier two code
points long.
Inside quoted atoms ('...'), double-quoted strings ("..."),
back-quoted text (`...`), Unicode quote pairs (described in this section), %
comments, and /* ... */ comments,
any Unicode scalar value (U+0000 to U+10FFFF, excluding
surrogates, which UTF-8 cannot encode) is accepted verbatim. The escape
sequences \uXXXX and \UXXXXXXXX (section
2.15.1.3) remain available for portability and explicit clarity, but
are never required. The single exception is the bidirectional override / isolate
ranges (U+202A..U+202E and U+2066..U+2069), which are rejected as a
Trojan-source defense (see
unicode_atoms).
Quoted writing (e.g. writeq/1)
of an atom or string that contains a control or zero-width code point
causes the atom or string to be quoted and the offending code points to
be written using an escape sequence; see section
2.15.1.3.
Sequences of ASCII symbol characters glue together into a single atom, as in ==, =.., :-
and so on. Beyond ASCII, all Unicode symbol characters (general
categories
Sm, Sc, Sk, So) and
the connector, dash, and other-punctuation classes (Pc, Pd,
Po) are treated as solo: each forms an atom on its
own and does not glue with adjacent symbols. This is a deliberate change
from earlier releases, in which Unicode symbols glued into compound
atoms in the same way as ASCII symbols; the change ensures that
characters such as ≤, €, and · keep their per-character meaning.
Operators built from Unicode symbols must be declared explicitly with op/3.
Numeric characters of other type (general category No, e.g. fractions
and circled digits) are not part of the identifier set; only the
explicitly listed super- and subscript digits extend identifiers.
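Declaring a Unicode symbol as an operator uses the ordinary op/3 directive. The following is a minimal sketch, assuming the file is saved as UTF-8; the priority and type are copied from the built-in =</2, and the clause that defines ≤/2 is a hypothetical example rather than anything the system provides.

```prolog
:- encoding(utf8).          % this file contains non-ASCII source text

% Declare U+2264 as a non-associative infix operator with the same
% priority and type as the built-in =</2.
:- op(700, xfx, ≤).

% Hypothetical definition for this sketch: delegate to =</2.
X ≤ Y :- X =< Y.

% ?- 1 ≤ 2.      % succeeds
% ?- 3 ≤ 2.      % fails
```

Because each such symbol is a token of its own, a single op/3 declaration per character is enough; no care is needed to keep it from merging with a neighbouring symbol.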
Characters in general categories Ps and Pe
form bracket pairs: an opening character followed by a
Prolog term and the matching closing character reads as a unary
compound whose functor is the two delimiter characters joined. This is
the same shape as {Term} becoming '{}'(Term),
generalised to the full Unicode Ps/Pe set (64
pairs, sourced from Unicode BidiMirroring.txt filtered by
general category). Operators inside brackets are honoured; nesting works
as expected. Mismatched or stray closes raise
syntax_error.
Characters in general categories Pi and Pf form
quote pairs: an opening character followed by literal text and
the matching closing character reads as a unary compound whose functor
is the two delimiter characters joined and whose argument is the
contained text in the form selected by
double_quotes
(string by default; also atom, codes, or chars). The contained text is not
parsed as a Prolog term; escape sequences (\n, \uXXXX,
...) are processed as in ASCII quoted strings. For example, with
double_quotes
set to string, the source text «hello, world» reads as a
compound whose functor is the two-character atom «» and whose single
argument is the string
"hello, world". The quote pairs come from
Pi/Pf entries of BidiMirroring.txt
(8 pairs) plus the standard left/right curly quote pairs U+2018/U+2019
and U+201C/U+201D, which are absent from BidiMirroring.txt.
Mismatched closes and unmatched opens raise syntax_error.
The features above let source text contain Unicode without escapes:
?- X² = 4. % superscript variable profile
X² = 4.
?- atom_number('१२३', N). % Devanagari Nd via atom_number/2
N = 123.
?- atom_codes(≤, Cs). % Unicode symbol stays solo
Cs = [8804].
?- term_to_atom(T, '⟨a, b⟩'). % bracket pair (Ps/Pe)
T = '⟨⟩'((a,b)).
?- term_to_atom(T, '«hello, world»'). % quote pair (Pi/Pf)
T = '«»'("hello, world").