Pack logtalk -- logtalk-3.85.0/tests/prolog/unicode/NOTES.md
This file is part of Logtalk https://logtalk.org/
SPDX-FileCopyrightText: 1998-2023 Paulo Moura <pmoura@logtalk.org>
SPDX-License-Identifier: Apache-2.0
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

This directory contains work-in-progress test sets for Prolog Unicode
support. Currently, three test sets are provided: `builtins` (for flags,
built-in predicates, and stream properties), `encodings` (for UTF-8,
UTF-16, and UTF-32 encodings, with and without a BOM), and `syntax` (for
the `\uXXXX` and `\UXXXXXXXX` escape sequences). The `encodings` test set
is only enabled for backends supporting all the above encodings (currently,
CxProlog, XVM, SICStus Prolog, SWI-Prolog, and Trealla Prolog).

The tests are based on an extended version of the October 5, 2009 WG17 ISO
Prolog Core revision standardization proposal, which specifies the following
minimal language features:

An `encoding` Prolog flag, allowing applications to query the default
encoding for opening streams. When the Prolog system supports multiple
encodings, the default encoding can be changed by setting this flag to a
supported encoding.

Encodings are represented by atoms that follow the character set names
registered by IANA:

http://www.iana.org/assignments/character-sets

For example, `'UTF-8'`, `'UTF-16LE'`, or `'UTF-32'`.
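
The flag described above can be exercised with the standard flag predicates. The following is a minimal sketch, assuming a backend that implements the `encoding` flag as proposed; the helper names `default_encoding/1` and `try_set_default_encoding/1` are hypothetical and not part of the test sets:

```prolog
% default_encoding(-Encoding)
% Queries the default encoding used when opening text streams.
% Assumption: the backend provides the `encoding` flag.
default_encoding(Encoding) :-
    current_prolog_flag(encoding, Encoding).

% try_set_default_encoding(+Encoding)
% Succeeds if the default encoding could be changed to Encoding.
% Fails quietly if the backend rejects the value or the flag is
% read-only (any error raised by set_prolog_flag/2 is swallowed).
try_set_default_encoding(Encoding) :-
    catch(set_prolog_flag(encoding, Encoding), _, fail).
```

A call such as `try_set_default_encoding('UTF-8')` only succeeds on backends that accept the IANA-style atom for this flag, which is not guaranteed for every system.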

Two `open/4` options, `encoding(Atom)` and `bom(Boolean)`.
The handling of these options depends on the mode argument, only applies to
text files, and follows from the Unicode standard guidelines and current
practice:

- `write` mode: If an `encoding/1` option is present, use the specified
  encoding; otherwise use the default encoding (which can be queried using
  the `encoding` flag). If a `bom(true)` option is present, write a BOM if the
  encoding is a Unicode encoding. If no `bom/1` option is used, write a BOM
  if the encoding is `UTF-16` or `UTF-32` but not if the encoding is `UTF-8`,
  `UTF-16LE`, `UTF-16BE`, `UTF-32LE`, or `UTF-32BE`. If the encoding is
  `UTF-16` or `UTF-32`, write the data big-endian.
- `append` mode: If an `encoding/1` option is present, use that encoding;
  otherwise use the default encoding (which can be queried using the
  `encoding` flag). Ignore the `bom/1` option if present and never write a BOM.
- `read` mode: The default is `bom(true)`, i.e. perform BOM detection and use
  the corresponding encoding if a BOM is found. If no BOM is detected, use
  the `encoding/1` option if present and the default encoding otherwise. When a
  `bom(false)` option is present, no BOM detection is performed, an `encoding/1`
  option is required if the file encoding is different from the default
  encoding, and a BOM at the beginning of the stream is interpreted as a
  ZERO WIDTH NON-BREAKING SPACE (ZWNBSP).

The `bom/1` option is ignored when not using a Unicode encoding. The `bom/1`
and `encoding/1` options are ignored when a `type(binary)` option is present.
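
The option handling rules above can be restated as a small decision procedure. The sketch below is plain Prolog that models the rules only; it does not open any stream. All predicate names (`unicode_encoding/1`, `writes_bom/3`, `detect_bom/2`, `read_encoding/4`, `list_member/2`) are hypothetical. The byte sequences are the standard Unicode BOM signatures, while reporting a detected BOM as an endianness-specific atom such as `'UTF-16LE'` is an assumption of this sketch, not something these notes specify:

```prolog
% list_member(?Element, +List): first match only (local helper so the
% sketch does not depend on a list library being loaded).
list_member(X, [X|_]) :- !.
list_member(X, [_|T]) :- list_member(X, T).

% Unicode encodings named in these notes.
unicode_encoding('UTF-8').
unicode_encoding('UTF-16').
unicode_encoding('UTF-16LE').
unicode_encoding('UTF-16BE').
unicode_encoding('UTF-32').
unicode_encoding('UTF-32LE').
unicode_encoding('UTF-32BE').

% writes_bom(+Mode, +Encoding, +Options)
% Succeeds when a BOM is written at the start of a text stream.
% There is no clause for append mode: it never writes a BOM.
writes_bom(write, Encoding, Options) :-
    unicode_encoding(Encoding),          % bom/1 is ignored otherwise
    (   list_member(bom(Boolean), Options) ->
        Boolean == true
    ;   % no bom/1 option: only the endianness-neutral names get a BOM
        list_member(Encoding, ['UTF-16', 'UTF-32'])
    ).

% detect_bom(+Bytes, -Encoding)
% Maps the leading bytes of a file to the encoding implied by its BOM,
% or to `none`. UTF-32LE is tested before UTF-16LE because both
% signatures start with the bytes FF FE.
detect_bom([0xEF, 0xBB, 0xBF|_], E)       :- !, E = 'UTF-8'.
detect_bom([0xFF, 0xFE, 0x00, 0x00|_], E) :- !, E = 'UTF-32LE'.
detect_bom([0x00, 0x00, 0xFE, 0xFF|_], E) :- !, E = 'UTF-32BE'.
detect_bom([0xFF, 0xFE|_], E)             :- !, E = 'UTF-16LE'.
detect_bom([0xFE, 0xFF|_], E)             :- !, E = 'UTF-16BE'.
detect_bom(_, none).

% read_encoding(+Detected, +Options, +Default, -Encoding)
% Detected is the result of detect_bom/2 for the file being opened.
read_encoding(Detected, Options, Default, Encoding) :-
    (   \+ list_member(bom(false), Options),
        Detected \== none ->
        Encoding = Detected              % a detected BOM wins
    ;   list_member(encoding(Explicit), Options) ->
        Encoding = Explicit              % else the encoding/1 option
    ;   Encoding = Default               % else the default encoding
    ).
```

For instance, `writes_bom(write, 'UTF-16', [])` succeeds, while `writes_bom(write, 'UTF-8', [])` and `writes_bom(append, 'UTF-16', [bom(true)])` both fail, matching the `write` and `append` rules.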

When no `encoding/1` or `bom/1` option is given, opening a text file in `read`
mode performs BOM detection and uses the corresponding encoding if a BOM is
found. Otherwise the default encoding is used (which can be queried using the
`encoding` flag). In `write` mode, a BOM is written if the default encoding is
`UTF-16` or `UTF-32` but not if the encoding is `UTF-8`, `UTF-16LE`,
`UTF-16BE`, `UTF-32LE`, or `UTF-32BE`. If the encoding is `UTF-16` or `UTF-32`,
the data is written big-endian. In `append` mode, no BOM is written and the
default encoding is used.

Two stream properties, `encoding(Atom)` and `bom(Boolean)`, set from
the `open/3-4` calls and the default values as described above, that can be
queried using the standard `stream_property/2` predicate.

Unicode support in the following built-in predicates:

- `get_char/1-2`
- `get_code/1-2`
- `open/3-4`
- `peek_char/1-2`
- `peek_code/1-2`
- `put_char/1-2`
- `put_code/1-2`
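
As an illustration of how the stream options, the stream properties, and the character I/O predicates fit together, here is a hedged round-trip sketch. It assumes a backend that accepts the `encoding/1` and `bom/1` options with the IANA-style atoms used in these notes, which is not true of every Prolog system. The predicate name `round_trip/4` and the `File` argument (a scratch file path chosen by the caller) are hypothetical:

```prolog
% round_trip(+File, -Char, -Encoding, -BOM)
% Writes U+00E9 to File as UTF-16 with a BOM, then reopens the file
% with open/3 so that BOM detection selects the encoding.
round_trip(File, Char, Encoding, BOM) :-
    char_code(Written, 0x00E9),        % avoids depending on source encoding
    open(File, write, Output, [encoding('UTF-16'), bom(true)]),
    put_char(Output, Written),
    close(Output),
    open(File, read, Input),           % no options: BOM detection applies
    stream_property(Input, encoding(Encoding)),
    stream_property(Input, bom(BOM)),
    get_char(Input, Char),
    close(Input).
```

On a conforming backend, `Char` should be the character that was written. The exact atom reported for `Encoding` after BOM detection is left unspecified here, since these notes do not state it.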

The `\uXXXX` and `\UXXXXXXXX` escape sequences. The `\uXXXX` escape sequence,
using four hexadecimal digits, covers the Basic Multilingual Plane (BMP). The
`\UXXXXXXXX` escape sequence, using eight hexadecimal digits, covers the full
Unicode code point space. The use of code points makes these escape sequences
independent of both the chosen Unicode text encoding and the Prolog system
internal character set (thus providing better portability than the ISO Prolog
Core standard octal and hexadecimal escape sequences).
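
The two escape forms can be checked against their code points directly. This is a minimal sketch, assuming a backend whose reader supports both escapes in quoted atoms; the predicate names are hypothetical:

```prolog
% U+00E9 (LATIN SMALL LETTER E WITH ACUTE) is in the BMP, so the
% four-digit form is enough.
bmp_escape_ok :-
    atom_codes('\u00E9', [0x00E9]).

% U+1F600 lies outside the BMP and needs the eight-digit form.
full_range_escape_ok :-
    atom_codes('\U0001F600', [0x1F600]).

% The same BMP character written with both forms denotes the same atom.
escape_forms_agree :-
    '\u00E9' == '\U000000E9'.
```

Each predicate succeeds when the reader maps the escape sequence to the expected code point, regardless of the encoding of the source file itself.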