This module provides persistency for rdf_db.pl, using the rdf_monitor/2
predicate to track changes to the repository. Where previous versions
auto-saved the whole database in the quick-load format of rdf_db, this
version maintains a quick-load file per source (the 4th argument of
rdf/4) together with a journal for edit operations. The result is safe,
avoids the frequent small changes to large files that make
synchronisation and backup expensive, and avoids long disruptions of the
server while the auto-save runs. Only loading large files disrupts
service for some time.
The persistent backup of the database is realised in a directory, using
a lock file to avoid corruption due to concurrent access. Each source is
represented by two files, the latest snapshot and a journal. The state
is restored by loading the snapshot and replaying the journal. The
predicate rdf_flush_journals/1 can be used to create fresh snapshots and
delete the journals.
- See also
- - rdf_edit.pl
- To be done
- - If there is a complete `.new' snapshot and no journal, we should
move the .new to the plain snapshot name as a means of recovery.
- - Backup of each graph using one or two files is very costly if there
are many graphs. Although the currently used subdirectories avoid
hitting OS limits early, this is still not ideal. Probably we
should collect (small, older?) files and combine them into a single
quick-load file. We could call this (similar to Git) a `pack'.
rdf_attach_db(+Directory, +Options) is det- Start persistent operations using Directory as the place to store
files. There are several cases:
- Empty DB, existing directory
Load the DB from the existing directory
- Full DB, empty directory
Create snapshots for all sources in directory
Options:
- access(+AccessMode)
- One of auto (default), read_write or read_only. Read-only access
implies that the RDF store is not locked. It is read at startup and
all modifications to the data are temporary. The default auto mode
is read_write if the directory is writeable and the lock can be
acquired. Otherwise it reverts to read_only.
- concurrency(+Jobs)
- Number of threads to use for loading the initial database. If not
provided it is the number of CPUs as obtained from the flag
cpu_count.
- max_open_journals(+Count)
- Maximum number of journals kept open. If not provided, the default
is 10. See limit_fd_pool/0.
- directory_levels(+Count)
- Number of levels of intermediate directories for storing the graph
files. Default is 2.
- silent(+BoolOrBrief)
- If true (default false), do not print informational messages.
Finally, if brief it will show minimal feedback.
- log_nested_transactions(+Boolean)
- If true, nested log transactions are added to the journal
information. By default (false), no log-term is added for nested
transactions.
- Errors
- - existence_error(source_sink, Directory)
- - permission_error(write, directory, Directory)
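For example, a server start-up goal might attach the store using the
options above; the directory name and option values below are purely
illustrative:

    :- use_module(library(semweb/rdf_db)).
    :- use_module(library(semweb/rdf_persistency)).

    %  Illustrative start-up goal: attach the persistent store in the
    %  default auto access mode, load the initial state with 4 threads
    %  and print only brief progress messages.
    start_store :-
        rdf_attach_db('RDF-store',
                      [ access(auto),
                        concurrency(4),
                        silent(brief)
                      ]).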
rdf_attach_db_ro(+Directory, +Options)[private]- Open an RDF database in read-only mode.
rdf_persistency_property(?Property) is nondet- True when Property is a property of the current persistent database.
Exposes the properties that can be passed as options to
rdf_attach_db/2. Specifically,
rdf_persistency_property(access(read_only))
is true iff the database
is mounted in read-only mode. In addition, the following property is
supported:
- directory(Dir)
- The directory in which the database resides.
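For instance, code that should not attempt modifications can check the
mount mode first; a minimal sketch:

    %  Succeeds only when the store is not mounted read-only.
    store_is_writable :-
        \+ rdf_persistency_property(access(read_only)).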
no_agc(:Goal)[private]- Run Goal with atom garbage collection disabled. Loading an RDF
database creates a large number of atoms that we know are not
garbage.
rdf_detach_db is det- Detach from the current database. Succeeds silently if no
database is attached. Normally called at the end of the program
through at_halt/1.
rdf_current_db(?Dir)- True if Dir is the current RDF persistent database.
rdf_flush_journals(+Options)- Flush dirty journals. Options:
- min_size(+KB)
- Only flush if journal is over KB in size.
- graph(+Graph)
- Only flush the journal of Graph
- To be done
- - Provide a default for min_size?
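A periodic maintenance goal might, for example, only rewrite snapshots
for graphs whose journal has grown beyond 256 Kb; the threshold is
illustrative:

    %  Create fresh snapshots for all graphs whose journal exceeds
    %  256 Kb and delete those journals.
    flush_large_journals :-
        rdf_flush_journals([min_size(256)]).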
load_db is det[private]- Reload the database from the directory specified by rdf_directory/1.
First we find all named graphs using find_dbs/4 and then we load
them.
make_goals(+DBs, +Silent, +Index, +Total, -Goals)[private]
concurrency(-Jobs)[private]- Number of jobs to run concurrently.
find_dbs(+Dir, -Graphs, -SnapBySize, -JournalBySize) is det[private]- Scan the persistent database and return a list of snapshots and
journals, both sorted by file-size. Each term is of the form
db(Size, Ext, DB, DBFile, Depth)
scan_db_files(+Files, +Dir, +Prefix, +Depth)// is det[private]- Produces a list of
db(DB, Size, File)
for all recognised RDF
database files. File is relative to the database directory Dir.
attach_graph(+Graph, +Options) is det[private]- Load triples and reload journal from the indicated snapshot
file.
load_journal(+File:atom, +DB:atom) is det[private]- Process transactions from the RDF journal File, applying them to
the given named graph DB.
rdf_persistency(+DB, Bool)- Specify whether a database is persistent. Switching to false kills
the persistent state. Switching to true creates it.
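For example, a scratch graph that must live only in memory can be
excluded from the store as follows; the graph name is illustrative:

    %  Keep the (illustrative) graph 'scratch' purely in memory; any
    %  existing snapshot and journal for it are removed.
    :- rdf_persistency(scratch, false).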
rdf_db:property_of_graph(?Property, +Graph) is nondet[multifile]- Extend rdf_graph_property/2 with new properties.
start_monitor is det[private]
stop_monitor is det[private]- Start/stop monitoring the RDF database for changes and update
the journal.
monitor(+Term) is semidet[private]- Handle an rdf_monitor/2 callback to deal with persistency. Note
that the monitor calls that come from rdf_db.pl that deal with
database changes are serialized. They do come from different
threads though.
check_nested(+Level) is semidet[private]- True if we must log this transaction. This is always the case
for toplevel transactions. Nested transactions are only logged if
log_nested_transactions(true) is defined.
open_transaction(+DB, +Fd) is det[private]- Add a begin(Id, Level, Time, Message) term if a transaction
involves DB. Id is an incremental integer, where each database has
its own counter. Level is the nesting level, Time is a floating
point timestamp and Message is the message provided as argument to
the log message.
next_transaction_id(+DB, -Id) is det[private]- Id is the number to use for the next logged transaction on DB.
Transactions in each named graph are numbered in sequence. The Id of
the last transaction is found by the 2nd clause, which starts
reading 1Kb from the end of the journal and doubles this offset on
each failure.
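A minimal sketch of the doubling-offset idea, using a hypothetical
helper that only generates the offsets to try (the real clause
additionally reads terms from the journal at each offset):

    %  Hypothetical helper: candidate offsets from the end of a journal
    %  of Size bytes, starting at 1 Kb and doubling until the whole
    %  file is covered.
    candidate_offset(Size, Offset) :-
        between(0, 30, N),
        Offset0 is 1024 << N,
        (   Offset0 >= Size
        ->  !,
            Offset = Size
        ;   Offset = Offset0
        ).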
- end_transactions(+DBs:list(atom:id)) is det[private]
- End a transaction that affected the given list of databases. We
write the list of other affected databases as an argument to the
end-term to facilitate fast finding of the related transactions.
In each database, the transaction is ended with a term
end(Id, Nesting, Others), where Id and Nesting are the transaction
identifier and nesting (see open_transaction/2) and Others is a list
of DB:Id, indicating other databases affected by the transaction.
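Put together, a journal fragment for a single toplevel transaction
could look as follows; the identifier, time stamp and message are
purely illustrative and the triple operations between the two terms are
elided:

    %  Illustrative fragment, not actual library output.
    begin(7, 0, 1700000000.0, 'example message').
    %  ... logged triple operations of this transaction ...
    end(7, 0, []).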
sync_loaded_graphs(+Graphs)[private]- Called after a binary triple file has been loaded that added
triples to the given graphs.
journal_fd(+DB, -Stream) is det[private]- Get an open stream to a journal. If the journal is not open, old
journals are closed to satisfy the max_open_journals option. Then
the journal is opened in append mode. Journal files are always
encoded as UTF-8 for portability as well as to ensure full coverage
of Unicode.
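A minimal sketch of the open step only, assuming the journal file name
has already been derived (see db_files/3); pool administration and
error handling are omitted:

    %  Open a journal in append mode with UTF-8 encoding, as described
    %  above.
    open_journal_stream(JournalFile, Stream) :-
        open(JournalFile, append, Stream, [encoding(utf8)]).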
limit_fd_pool is det[private]- Limit the number of open journals to max_open_journals (10).
Note that calls from rdf_monitor/2 are issued in different
threads, but as they are part of write operations they are fully
synchronised.
sync_journal(+DB, +Fd)[private]- Sync journal represented by database and stream. If the DB is
involved in a transaction there is no point flushing until the
end of the transaction.
close_journal(+DB) is det[private]- Close the journal associated with DB if it is open.
close_journals[private]- Close all open journals.
create_db(+Graph)[private]- Create a saved version of Graph in corresponding file, close and
delete journals.
delete_db(+DB)[private]- Remove snapshot and journal file for DB.
lock_db(+Dir)[private]- Lock the database directory Dir.
unlock_db(+Dir) is det[private]
unlock_db(+Stream, +File) is det[private]
dir_levels(+File, +Levels, ?Segments, ?Tail) is det[private]- Create a list of intermediate directory names for File. Each
directory consists of two hexadecimal digits.
db_files(+DB, -Snapshot, -Journal)[private]
- db_files(-DB, +Snapshot, -Journal)[private]
- db_files(-DB, -Snapshot, +Journal)[private]
- True if named graph DB is represented by the files Snapshot and
Journal. The filenames are local to the directory representing
the store.
rdf_journal_file(+Graph, -File) is semidet
- rdf_journal_file(-Graph, -File) is nondet
- True if File is the name of the existing journal file for Graph.
rdf_snapshot_file(+Graph, -File) is semidet
- rdf_snapshot_file(-Graph, -File) is nondet
- True if File is the name of the existing snapshot file for Graph.
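Building on the two predicates above, a hypothetical maintenance helper
could collect every graph that currently has a journal, e.g. as
candidates for rdf_flush_journals/1:

    %  Hypothetical helper: all graphs for which a journal file exists.
    graphs_with_journal(Graphs) :-
        findall(Graph, rdf_journal_file(Graph, _File), Graphs).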
rdf_db_to_file(+DB, -File) is det
- rdf_db_to_file(-DB, +File) is det
- Translate between the database name (often a file name or URL) and
the name we store in the directory. We keep a cache for two reasons:
speed, but much more importantly, the raw --> encoded mapping
provided by www_form_encode/2 is not guaranteed to be unique by the
W3C standards.
url_to_filename(+URL, -FileName) is det[private]
- url_to_filename(-URL, +FileName) is det[private]
- Turn a valid URL into a filename. Earlier versions used
www_form_encode/2, but this can produce characters that are not
valid in filenames. We will use the same encoding as
www_form_encode/2, but using our own rules for allowed
characters. The only requirement is that we avoid any filename
special character in use. The current encoding uses US-ASCII
alphanumeric characters, _ and %.
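A minimal one-way sketch of such an encoding, using hypothetical helper
names and, as a simplification, only handling character codes up to
0xff (the library's own url_to_filename/2 is bidirectional):

    %  Keep US-ASCII alphanumerics and '_'; %-encode everything else.
    url_to_filename_sketch(URL, FileName) :-
        atom_codes(URL, Codes),
        phrase(encode(Codes), FileCodes),
        atom_codes(FileName, FileCodes).

    encode([]) --> [].
    encode([C|T]) -->
        (   { keep_code(C) }
        ->  [C]
        ;   { High is (C >> 4) /\ 0xf,      % simplification: C =< 0xff
              Low  is C /\ 0xf,
              hex_digit(High, H),
              hex_digit(Low, L)
            },
            [0'%, H, L]
        ),
        encode(T).

    keep_code(0'_) :- !.
    keep_code(C) :- C < 0x80, code_type(C, alnum).

    hex_digit(D, C) :- D < 10, !, C is 0'0 + D.
    hex_digit(D, C) :- C is 0'a + D - 10.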
reindex_db(+Dir, +Levels)[private]- Reindex the database by creating intermediate directories.
load_prefixes(+RDFDBDir) is det[private]- If the file RDFDBDir/prefixes.db exists, load the prefixes. The
prefixes are registered using rdf_register_ns/3. Possible errors
because the prefix definitions have changed are printed as
warnings, retaining the old definition. Note that changing
prefixes generally requires reloading all RDF from the source.
mkdir(+Directory)[private]- Create a directory if it does not already exist.
time_stamp(-Integer)[private]- Return time-stamp rounded to integer.