Copyright © 2010-2024 World Wide Web Consortium . W3C ® liability , trademark and permissive document license rules apply.
RDF [ RDF11-CONCEPTS ] describes a graph-based data model for making claims about the world and provides the foundation for reasoning upon that graph of information. At times, it becomes necessary to compare the differences between sets of graphs, digitally sign them, or generate short identifiers for graphs via hashing algorithms. This document outlines an algorithm for normalizing RDF datasets such that these operations can be performed.
This section describes the status of this document at the time of its publication. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at https://www.w3.org/TR/.
This document describes the RDFC-1.0 algorithm for canonicalizing RDF datasets, which was the input from the W3C Credentials Community Group published as [ CCG-RDC-FINAL ].
At the time of publication, [ RDF11-CONCEPTS ] is the most recent recommendation defining RDF datasets and [ N-QUADS ], however work on an updated specification is ongoing within the W3C RDF-star Working Group . Some dependencies from relevant updated specifications are provided normatively in this specification with the expectation that a future update to this specification will replace those with normative references to updated RDF specifications.
To successfully advance to Proposed Recommendation, each test in the test suite MUST be passed by at least three independent implementations. Two independent implementations MUST pass all tests.
This document was published by the RDF Dataset Canonicalization and Hash Working Group as an Editor's Draft.
Publication as an Editor's Draft does not imply endorsement by W3C and its Members.
This is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to cite this document as other than work in progress. Future updates to this specification may incorporate new features .
This document was produced by a group operating under the W3C Patent Policy . W3C maintains a public list of any patent disclosures made in connection with the deliverables of the group; that page also includes instructions for disclosing a patent. An individual who has actual knowledge of a patent which the individual believes contains Essential Claim(s) must disclose the information in accordance with section 6 of the W3C Patent Policy .
This document is governed by the 03 November 2023 W3C Process Document .
This section is non-normative.
When data scientists discuss canonicalization, they do so in the context of achieving a particular set of goals. Since the same information may sometimes be expressed in a variety of different ways, it often becomes necessary to transform each of these different ways into a single, standard representation. With a standard representation, the differences between two different sets of data can be easily determined, a cryptographically-strong hash identifier can be generated for a particular set of data, and a particular set of data may be digitally-signed for later verification.
In particular, this specification is about normalizing RDF datasets , which are collections of graphs. Since a directed graph can express the same information in more than one way, it requires canonicalization to achieve the aforementioned goals and any others that may arise via serendipity.
Most RDF datasets can be canonicalized fairly quickly, in terms of algorithmic time complexity. However, those that contain nodes that do not have globally unique identifiers pose a greater challenge. Normalizing these datasets presents the graph isomorphism problem, a problem that is believed to be difficult to solve quickly in the worst case. Fortunately, existing real world data is rarely, if ever, modeled in a way that manifests as the worst case and new data can be modeled to avoid it. In fact, software systems that detect a problematic dataset (see 7.1 Dataset Poisoning ) can choose to assume it's an attempted denial of service attack, rather than a real input, and abort.
This document outlines an algorithm for generating a canonical serialization of an RDF dataset given an RDF dataset as input. The algorithm is called the RDF Canonicalization algorithm version 1.0 or RDFC-1.0 .
RDF
1.1
Concepts
and
Abstract
Syntax
[
RDF11-CONCEPTS
]
lacks
clarity
on
the
representation
of
language-tagged
strings
,
where
language
tags
of
the
form
xx-YY
are
treated
as
being
case
insensitive.
Implementations
might
represent
language
tags
using
all
lower
case
in
the
form
xx-yy
,
retain
the
original
representation
xx-YY
,
or
use
[
BCP47
]
formatting
conventions
,
leading
to
different
canonical
forms,
and
therefore,
different
hashed
values.
xx-YY
is
case
insensitive,
which
might
lead
to
different
canonicalizations
if
the
user
is
not
aware
of
this
problem.
See B. URDNA2015 for a comparison with the version of the algorithm published in RDF Dataset Canonicalization [ CCG-RDC-FINAL ].
There are different use cases where graph or dataset canonicalization are important:
A canonicalization algorithm is necessary, but not necessarily sufficient, to handle many of these use cases. The use of blank nodes in RDF graphs and datasets has a long history and creates inevitable complexities. Blank nodes are used for different purposes:
Furthermore, RDF semantics dictate that deserializing an RDF document results in the creation of unique blank nodes , unless it can be determined that on each occasion, the blank node identifies the same resource. This is due to the fact that blank node identifiers are an aspect of a concrete RDF syntax and are not intended to be persistent or portable. Within the abstract RDF model, blank nodes do not have identifiers (although some RDF store implementations may use stable identifiers and may choose to make them portable). See Blank Nodes in [ RDF11-CONCEPTS ] for more information.
RDF does have a provision for allowing blank nodes to be published in an externally identifiable way through the use of Skolem IRIs , which allow a given RDF store to replace the use of blank nodes in a concrete syntax with IRIs , which then serve to repeatably identify that blank node within that particular RDF store; however, this is not generally useful for talking about the same graph in different RDF stores, or other concrete representations. In any case, a stable blank node identifier defined for one RDF store or serialization is arbitrary, and typically not relatable to the context within which it is used.
This specification defines an algorithm for creating stable blank node identifiers repeatably for different serializations possibly using individualized blank node identifiers of the same RDF graph (dataset) by grounding each blank node through the nodes to which it is connected. As a result, a graph signature can be obtained by hashing a canonical serialization of the resulting canonicalized dataset , allowing for the isomorphism and digital signing use cases. This specification does not define such a graph signature.
As blank node identifiers can be stable even with other changes to a graph (dataset), in some cases it is possible to compute the difference between two graphs (datasets), for example if changes are made only to ground triples, or if new blank nodes are introduced which do not create an automorphic confusion with other existing blank nodes. If any information which would change the generated blank node identifier, a resulting diff might indicate a greater set of changes than actually exists. Additionally, if the starting dataset is an N-Quads document, it may be possible to correlate the original blank node identifiers used within that N-Quads document with those issued in the canonicalized dataset .
Although alternative hash algorithms might be used with this specification, applications ought to carefully weigh the advantages and disadvantages of using an alternative hash function. This is the case, in particular, for any representation of the canonical n-quads form or issued identifiers map that does not identify the associated hash algorithm. Any use case that requires reproduction of the same output is expected to unequivocally express or communicate the internal hash algorithm that was used when generating the canonical n-quads form .
This document is a detailed specification for an RDF dataset canonicalization algorithm. The document is primarily intended for the following audiences:
To understand the basics in this specification you must be familiar with basic RDF concepts [ RDF11-CONCEPTS ]. A working knowledge of graph theory and graph isomorphism is also recommended.
This section is non-normative.
Cannot
GET
/uploads/liJRiS/spec/common/typographical-conventions.html
/uploads/8h9OC5/spec/common/typographical-conventions.html
As well as sections marked as non-normative, all authoring guidelines, diagrams, examples, and notes in this specification are non-normative. Everything else in this specification is normative.
The key words MUST , MUST NOT , and SHOULD in this document are to be interpreted as described in BCP 14 [ RFC2119 ] [ RFC8174 ] when, and only when, they appear in all capitals, as shown here.
A conforming processor is a system which can generate the canonical n-quads form of an input dataset consistent with the algorithms defined in this specification.
The algorithms in this specification are normative, because to consistently reproduce the same canonical identifiers, implementations MUST strictly conform to the steps outlined in these algorithms.
Implementers can partially check their level of conformance with this specification by successfully passing the test cases of the RDF Dataset Canonicalization test suite . Note, however, that passing all the tests in the test suite does not imply complete conformance to this specification. It only implies that the implementation conforms to the aspects tested by the test suite.
LF
(line
feed,
code
point
U+000A
).
Implementations MUST support a parameter to define the hash algorithm , MUST support SHA-256 and SHA-384 [ FIPS-180-4 ], and SHOULD support the ability to specify other hash algorithms. Using a different hash algorithm will generally result in different output than using the default.
There is no expectation that the default hash algorithm will also be used by any application creating a hash digest of the canonical N-Quads result.
_:
that
is
used
as
an
identifier
for
a
blank
node
.
Blank
node
identifiers
are
typically
implementation-specific
local
identifiers;
this
document
specifies
an
algorithm
for
deterministically
specifying
them.
_:
string
to
differentiate
them
from
other
nodes
in
the
graph.
This
affects
the
canonicalization
algorithm,
which
is
based
on
calculating
a
hash
over
the
representations
of
quads
in
this
format.
true
and
false
A
and
B
),
using
Unicode
Codepoint
Collation
,
as
defined
in
[
XPATH-FUNCTIONS
],
which
defines
a
total
ordering
of
strings
comparing
code
points.
Note
that
for
UTF-8
encoded
strings,
comparing
the
byte
sequences
gives
the
same
result
as
code
point
order
.
Canonicalization is the process of transforming an input dataset to its serialized canonical form . That is, any two input datasets that contain the same information, regardless of their arrangement, will be transformed into the same serialized canonical form . The problem requires directed graphs to be deterministically ordered into sets of nodes and edges. This is easy to do when all of the nodes have globally-unique identifiers, but can be difficult to do when some of the nodes do not. Any nodes without globally-unique identifiers must be issued deterministic identifiers.
This specification defines a canonicalized dataset to include stable identifiers for blank nodes, practical uses of which will always generate a canonical serialization of such a dataset.
In time, there may be more than one canonicalization algorithm and, therefore, for identification purposes, this algorithm is named the "RDF Canonicalization algorithm version 1.0" ( RDFC-1.0 ).
Figure 1 provides an overview of RDFC-1.0 , with steps 1 through 7 corresponding to the various steps described in 4.4.3 Algorithm .
This section is non-normative.
To determine a canonical labeling, RDFC-1.0 considers the information connected to each blank node. Nodes with unique first degree information can immediately be issued a canonical identifier via the Issue Identifier algorithm . When a node has non-unique first degree information, it is necessary to determine all information that is transitively connected to it throughout the entire dataset. 4.6 Hash First Degree Quads defines a node’s first degree information via its first degree hash.
Hashes are computed from the information of each blank node. These hashes encode the mentions incident to each blank node. The hash of a string s , is the lower-case, hexadecimal representation of the result of passing s through a cryptographic hash function. By default, RDFC-1.0 uses the SHA-256 hash algorithm [ FIPS-180-4 ].
The "degree" terminology is used within this specification as colloquial way of describing the eccentricity or radius of any two nodes within a dataset. This concept is also related to "degrees of separation", as in, for example, " six degrees of separation ". Nodes with unique first degree information can be considered nodes with a radius of one.
When performing the steps required by the canonicalization algorithm, it is helpful to track state in a data structure called the canonicalization state . The information contained in the canonicalization state is described below.
c14n
(short
for
canonicalization),
for
issuing
canonical
blank
node
identifiers
.
The canonicalization algorithm issues identifiers to blank nodes . The Issue Identifier algorithm uses an identifier issuer to accomplish this task. The information an identifier issuer needs to keep track of is described below.
c14n
is
a
proper
initial
value
for
the
identifier
prefix
that
would
produce
blank
node
identifiers
like
c14n1
.
0
.
The canonicalization algorithm converts an input dataset into a canonicalized dataset or raises an error if the input dataset is determined to be overly complex. This algorithm will assign deterministic identifiers to any blank nodes in the input dataset .
This section is non-normative.
RDFC-1.0
canonically
labels
an
RDF
dataset
by
assigning
each
blank
node
a
canonical
identifier.
In
RDFC-1.0
,
an
RDF
dataset
D
is
represented
as
a
set
of
quads
of
the
form
<
s,
p,
o,
g
>
where
the
graph
component
g
is
empty
if
and
only
if
the
triple
<
s,
p,
o
>
is
in
the
default
graph
.
It
is
expected
that,
for
two
RDF
datasets,
RDFC-1.0
returns
the
same
canonically
labeled
list
of
quads
if
and
only
if
the
two
datasets
are
isomorphic
(i.e.,
the
same
modulo
blank
node
identifiers).
RDFC-1.0 consists of several sub-algorithms. These sub-algorithms are introduced in the following sub-sections. First, we give a high level summary of RDFC-1.0 .
This section is non-normative.
The following algorithm will run with a minimal number of iterations in each step for typical input datasets . In some extreme cases, the algorithm can behave poorly, particularly in Step 5 . Implementations MUST defend against potential denial-of-service attacks by raising suitable exceptions and terminating early. See 7.1 Dataset Poisoning for further information.
Implementations can consider placing limits on the number of calls to 4.8 Hash N-Degree Quads based on the number of blank nodes in the hash to blank nodes map . For most typical datasets, more than a couple of iterations on 4.8 Hash N-Degree Quads per blank node would be unusual.
This has the effect of initializing the blank node to quads map , and the hash to blank nodes map , as well as instantiating a new canonical issuer .
After this algorithm completes, the input blank node identifier map state and canonical issuer may be used to correlate blank nodes used in the input dataset with both their original identifiers, and associated canonical identifiers.
This establishes the blank node to quads map , relating each blank node with the set of quads of which it is a component, via the map for each blank node in the input dataset to its assigned identifier.
Literal
components
of
quads
are
not
subject
to
any
normalization.
As
noted
in
Section
3.3
of
[
RDF11-CONCEPTS
],
literal
term
equality
is
based
on
the
lexical
form
,
rather
than
the
literal
value
,
so
two
literals
"01"^^xsd:integer
and
"1"^^xsd:integer
are
treated
as
distinct
resources.
Log the state of the blank node to quads map :
# Blank node to quads map for unique hashes example
ca:
log point: Entering the canonicalization function (4.4.3).
ca.2:
log point: Extract quads for each bnode (4.4.3 (2)).
Bnode to quads:
e0:
- <http://example.com/#p> <http://example.com/#q> _:e0 .
- _:e0 <http://example.com/#s> <http://example.com/#u> .
e1:
- <http://example.com/#p> <http://example.com/#r> _:e1 .
- _:e1 <http://example.com/#t> <http://example.com/#u> .
...
This step creates a hash for every blank node in the input document. Some blank nodes will lead to a unique hash, while other blank nodes may share a common hash.
Log the results from the Hash First Degree Quads algorithm .
# First degree hashes for unique hashes example ca: ... ca.3: log point: Calculated first degree hashes (4.4.3 (3)). with: - identifier: e0 h1dq: log point: Hash First Degree Quads function (4.6.3). nquads: - <http://example.com/#p> <http://example.com/#q> _:a . - _:a <http://example.com/#s> <http://example.com/#u> . hash: 21d1dd5ba21f3dee9d76c0c00c260fa6f5d5d65315099e553026f4828d0dc77a - identifier: e1 h1dq: log point: Hash First Degree Quads function (4.6.3). nquads: - <http://example.com/#p> <http://example.com/#r> _:a . - _:a <http://example.com/#t> <http://example.com/#u> . hash: 6fa0b9bdb376852b5743ff39ca4cbf7ea14d34966b2828478fbf222e7c764473 ...
This step establishes the canonical identifier for blank nodes having a unique hash, which are recorded in the canonical issuer .
Log the assigned canonical identifiers.
# Assigned canonical identifiers for shared hashes example ca: ... ca.4: log point: Create canonical replacements for hashes mapping to a single node (4.4.3 (4)). with: - identifier: e2 hash: 15973d39de079913dac841ac4fa8c4781c0febfba5e83e5c6e250869587f8659 canonical label: c14n0 - identifier: e3 hash: 7e790a99273eed1dc57e43205d37ce232252c85b26ca4a6ff74ff3b5aea7bccd canonical label: c14n1 ...
This step establishes the canonical identifier for blank nodes having a shared hash. This is done by creating unique blank node identifiers for all blank nodes traversed by the Hash N-Degree Quads algorithm , running through each blank node without a canonical identifier in the order of the hashes established in the previous step.
Log hash and identifier list for this iteration.
# Hash and Identifier List for each iteration of step 5 using shared hashes example ca: ... ca.5: log point: Calculate hashes for identifiers with shared hashes (4.4.3 (5)). with: - hash: 3b26142829b8887d011d779079a243bd61ab53c3990d550320a17b59ade6ba36 identifier list: [ "e0", "e1"] ... ...
This list will be populated in step 5.2 , and will establish an order for those blank nodes sharing a common first-degree hash.
b
.
Include logs for each call to Hash N-Degree Quads algorithm .
# Logs from calls to Hash N-Degree Quads algorithm for shared hashes example ca: ... ca.5: log point: Calculate hashes for identifiers with shared hashes (4.4.3 (5)). with: - hash: 3b26142829b8887d011d779079a243bd61ab53c3990d550320a17b59ade6ba36 identifier list: [ "e0", "e1"] ca.5.2: log point: Calculate hashes for identifiers with shared hashes (4.4.3 (5.2)). with: - identifier: e0 hndq: log point: Hash N-Degree Quads function (4.8.3). identifier: e0 issuer: {e0: b0} ... ... ... ...
The previous step created temporary identifiers for the blank nodes sharing a common first degree hash, which is now used to generate their canonical identifiers.
In Step 5.2 , hash path list was created with an ordered set of results. Each result contained a temporary issuer which recorded temporary identifiers associated with a particular blank node identifier in identifier list . This step processes each returned temporary issuer , in order, and allocates canonical identifiers to the temporary identifier mappings contained within each temporary issuer , creating a full order on the remaining blank nodes with unissued canonical identifiers.
Log newly issued canonical identifiers.
# Newly issued canonical identifiers from step 5.3 for shared hashes example ca: ... ca.5: log point: Calculate hashes for identifiers with shared hashes (4.4.3 (5)). with: - hash: 3b26142829b8887d011d779079a243bd61ab53c3990d550320a17b59ade6ba36 identifier list: [ "e0", "e1"] ... ca.5.3: log point: Canonical identifiers for temporary identifiers (4.4.3 (5.3)). issuer: - blank node: e1 canonical identifier: c14n2 - blank node: e0 canonical identifier: c14n3 ...
This step adds the issued identifiers map from the canonical issuer to the canonicalized dataset , the keys in the issued identifiers map are map entries in the input blank node identifier map .
Log the state of the canonical issuer at the completion of the algorithm.
# Canonical issuer state after step 6 for shared hashes example
ca:
...
ca.6:
log point: Issued identifiers map (4.4.3 (6)).
issued
identifiers
map:
{e2:
c14n0,
e3:
c14n1,
e1:
c14n2,
e0:
c14n3}
Technically speaking, one implementation might return a canonicalized dataset that maps particular blank nodes to different identifiers than another implementation, however, this only occurs when there are isomorphisms in the dataset such that a canonically serialized expression of the dataset would appear the same from either implementation.
The serialized canonical form is an N-Quads document where the blank node identifiers are taken from the canonical identifiers associated with each blank node.
The canonicalized dataset is composed of the original input dataset , the input blank node identifier map , containing identifiers for each blank node in the input dataset , and the canonical issuer , containing an issued identifiers map mapping the identifiers in the input blank node identifier map to their canonical identifiers.
This algorithm issues a new blank node identifier for a given existing blank node identifier . It also updates state information that tracks the order in which new blank node identifiers were issued. The order of issuance is important for canonically labeling blank nodes that are isomorphic to others in the dataset.
The
algorithm
maintains
an
issued
identifiers
map
to
relate
an
existing
blank
node
identifier
from
the
input
dataset
to
a
new
blank
node
identifier
using
a
given
identifier
prefix
(
c14n
)
with
new
identifiers
issued
by
appending
an
incrementing
number.
For
example,
when
called
for
a
blank
node
identifier
such
as
e3
,
it
might
result
in
a
issued
identifier
of
c14n1
.
The algorithm takes an identifier issuer I and an existing identifier as inputs. The output is a new issued identifier . The steps of the algorithm are:
This algorithm calculates a hash for a given blank node across the quads in a dataset in which that blank node is a component. If the hash uniquely identifies that blank node, no further examination is necessary. Otherwise, a hash will be created for the blank node using the algorithm in 4.8 Hash N-Degree Quads invoked via 4.4 Canonicalization Algorithm .
This section is non-normative.
To determine whether the first degree information of a node n is unique, a hash is assigned to its mention set , Q n . The first degree hash of a blank node n , denoted h f (n) , is the hash that results from 4.6 Hash First Degree Quads when passing n . Nodes with unique first degree hashes have unique first degree information.
For
consistency,
blank
node
identifiers
used
in
Q
n
are
replaced
with
placeholders
in
a
canonical
n-quads
serialization
of
that
quad.
Every
blank
node
component
is
replaced
with
either
a
or
z
,
depending
on
if
that
component
is
n
or
not.
The resulting serialized quads are then code point ordered , concatenated, and hashed. This hash is the first degree hash of n , h f (n) .
This section is non-normative.
This algorithm takes the canonicalization state and a reference blank node identifier as inputs.
a
,
otherwise,
use
the
blank
node
identifier
z
.
Log the inputs and result of running this algorithm.
# Inputs and hash result for the Hash First Degree Hash algorithm for unique hashes example h1dq: log point: Hash First Degree Quads function (4.6.3). nquads: - <http://example.com/#p> <http://example.com/#q> _:a . - _:a <http://example.com/#s> <http://example.com/#u> . hash: 21d1dd5ba21f3dee9d76c0c00c260fa6f5d5d65315099e553026f4828d0dc77a
This algorithm calculates a hash for a given blank node across the quads in a dataset in which that blank node is a component for which the hash does not uniquely identify that blank node. This is done by expanding the search from quads directly referencing that blank node (the mention set ), to those quads which contain nodes which are also components of quads in the mention set, called the gossip path . This process proceeds in every greater degrees of indirection until a unique hash is obtained.
This section is non-normative.
Usually, when trying to determine if two nodes in a graph are equivalent, you simply compare their identifiers. However, what if the nodes don't have identifiers? Then you must determine if the two nodes have equivalent connections to equivalent nodes all throughout the whole graph. This is called the graph isomorphism problem. This algorithm approaches this problem by considering how one might draw a graph on paper. You can test to see if two nodes are equivalent by drawing the graph twice. The first time you draw the graph the first node is drawn in the center of the page. If you can draw the graph a second time such that it looks just like the first, except the second node is in the center of the page, then the nodes are equivalent. This algorithm essentially defines a deterministic way to draw a graph where, if you begin with a particular node, the graph will always be drawn the same way. If two graphs are drawn the same way with two different nodes, then the nodes are equivalent. A hash is used to indicate a particular way that the graph has been drawn and can be used to compare nodes.
When two blank nodes have the same first degree hash, extra steps must be taken to detect global, or N -degree, distinctions. All information that is in any way connected to the blank node n through other blank nodes, even transitively, must be considered.
To consider all transitive information, the algorithm traverses and encodes all possible paths of incident mentions emanating from n , called gossip paths , that reach every unlabeled blank node connected to n . Each unlabeled blank node is assigned a temporary identifier in the order in which it is reached in the gossip path being explored. The mentions that are traversed to reach connected blank nodes are encoded in these paths via related hashes. This provides a deterministic way to order all paths coming from n that reach all blank nodes connected to n without relying on input blank node identifiers.
This algorithm works in concert with the main canonicalization algorithm to produce a unique, deterministic identifier for a particular blank node. This hash incorporates all of the information that is connected to the blank node as well as how it is connected. It does this by creating deterministic paths that emanate out from the blank node through any other adjacent blank nodes.
Ultimately, the algorithm selects the shortest gossip path (based on its encoding as a string), distributing canonical identifiers to the unlabeled blank nodes in the order in which they appear in this path. The hash of this encoded shortest path, called the N-degree hash of n , distinguishes n from other blank nodes in the dataset.
For clarity, we consider a gossip path encoded via the string s to be shortest provided that:
For example, abc is shorter than bbc , whereas abcd is longer than bcd .
The following provides a high level outline for how the N-degree hash of n is computed along the shortest gossip path . Note that the full algorithm considers all gossip paths, ultimately returning the hash of the shortest encoded path.
As described above in step 2.3 , HN recurses on each unlabeled blank node when it is first reached along the gossip path being explored. This recursion can be visualized as moving along the path from n to the blank node n i that is receiving a temporary identifier. If, when recursing on n i , another unlabeled blank node n j is discovered, the algorithm again recurses. Such a recursion traces out the gossip path from n to n j via n i .
The recursive hash r(i) is the hash returned from the completed recursion on the node n i when computing h N (n) . Just as h N (n) is the hash of D n , we denote the data to hash in the recursion on n i as D i . So, r(i) = h(D i ) . For each related hash x ∈ H n , R n (x) is called the recursion list on which the algorithm recurses.
This section is non-normative.
The inputs to this algorithm are the canonicalization state , the identifier for the blank node to recursively hash quads for, and path identifier issuer which is an identifier issuer that issues temporary blank node identifier s. The output from this algorithm will be a hash and the identifier issuer used to help generate it.
Log the inputs to the algorithm.
# Inputs for the Hash N-Degree Quads algorithm for double circle example
hndq:
log point: Hash N-Degree Quads function (4.8.3).
identifier: e0
issuer: {e0: b0}
...
quads is the mention set of identifier .
Log the quads from the mention set of identifier .
# Inputs for the Hash N-Degree Quads algorithm for double circle example
hndq:
identifier: e0
log point: Hash N-Degree Quads function (4.8.3).
issuer: {e0: b0}
hndq.2:
log point: Quads for identifier (4.8.3 (2)).
quads:
- _:e0 <http://example.org/vocab#next> _:e1 .
- _:e0 <http://example.org/vocab#prev> _:e1 .
- _:e1 <http://example.org/vocab#next> _:e0 .
- _:e1 <http://example.org/vocab#prev> _:e0 .
...
This loop calculates the related hash H n for other blank nodes within the mention set of identifier .
s
,
o
,
or
g
based
on
whether
component
is
a
subject
,
object
,
graph
name
,
respectively.
Include the logs for each iteration of the Hash Related Blank Node algorithm and the resulting H n .
# Step 3 of Hash N-Degree Quads using double circle example hndq: identifier: e0 log point: Hash N-Degree Quads function (4.8.3). issuer: {e0: b0} ... hndq.3: log point: Hash N-Degree Quads function (4.8.3 (3)). with: - quad: _:e0 <http://example.org/vocab#next> _:e1 . hndq.3.1: log point: Hash related bnode component (4.8.3 (3.1)) with: - position: o related: e1 h1dq: log point: Hash First Degree Quads function (4.6.3). nquads: - _:z <http://example.org/vocab#next> _:a . - _:z <http://example.org/vocab#prev> _:a . - _:a <http://example.org/vocab#next> _:z . - _:a <http://example.org/vocab#prev> _:z . hash: 60dc8fc7b5481014b6ea38efb05455676d1e93e19b99119ab294941dacc16b3b input: "o<http://example.org/vocab#next>60dc8fc7b5481014b6ea38efb05455676d1e93e19b99119ab294941dacc16b3b" hash: 20bb08971220a5382a9a06ba2977c5fb859e63192e0b2015a378af89e453f25e - quad: _:e0 <http://example.org/vocab#prev> _:e1 . ... Hash to bnodes: 20bb08971220a5382a9a06ba2977c5fb859e63192e0b2015a378af89e453f25e: - e1 1e4e55ba02b8b0b527c32e2343fbcfee2e2bd9c1972c67cc01f85fabde7bc42d: - e1 56d0774755aaf8d9cf4da8af3728e5589f94e5cd7d9aee86f0c5a7bc1d71c7ca: - e1 2a5dd448b9467a08479008a5350829441868b7f913343cd500fe8619e047cff4: - e1 ...
This loop explores the gossip paths for each related blank node sharing a common hash to identifier finding the shortest such path ( chosen path ). This determines how canonical identifiers for otherwise commonly hashed blank nodes are chosen.
Each path is represented by the concatenation of the identifiers for each related blank node – either the issued identifier, or a temporary identifier created using a copy of issuer . Those for which temporary identifiers were issued are later recursed over using this algorithm.
Log the value of related hash and state of data to hash .
# Log related hash and data to hash in each iteration of step 5 for double circle example. hndq: log point: Hash N-Degree Quads function (4.8.3). identifier: e0 issuer: {e0: b0} ... hndq.5: log point: Hash N-Degree Quads function (4.8.3 (5)), entering loop. with: - related_hash: 1e4e55ba02b8b0b527c32e2343fbcfee2e2bd9c1972c67cc01f85fabde7bc42d data_to_hash: "" ...
Log each permutation p .
# Log each permutation of step 5.4 using double circle example. hndq: log point: Hash N-Degree Quads function (4.8.3). identifier: e0 issuer: {e0: b0} ... hndq.5: log point: Hash N-Degree Quads function (4.8.3 (5)), entering loop. with: - related_hash: 1e4e55ba02b8b0b527c32e2343fbcfee2e2bd9c1972c67cc01f85fabde7bc42d data_to_hash: "" hndq.5.4: log point: Hash N-Degree Quads function (4.8.3 (5.4)), entering loop. with: - perm: [ "e1"] ...
_:
,
followed
by
the
canonical
identifier
for
related
,
to
path
.
A canonical identifier may have been generated before calling this algorithm, if it was issued from an earlier call to Hash First Degree Quads algorithm . There is no reason to recurse and apply the algorithm to any related blank node that has already been assigned a canonical identifier. Furthermore, using the canonical identifier also further distinguishes it from any temporary identifier, allowing for even greater efficiency in finding the chosen path.
Temporarily labeled nodes have identifiers recorded in issuer copy , which is later used to recursively call this algorithm, so that eventually all nodes are given canonical identifiers.
_:
,
followed
by
the
result,
to
path
.
If path is already longer than the prospective chosen path , we can terminate this iteration early.
path
is
used
to
generate
a
hash
at
a
later
step;
in
this
respect,
it
is
similar
to
the
Hash
First
Degree
Quads
algorithm
which
uses
the
serialization
of
quads
in
nquads
for
hashing.
For
the
sake
of
consistency,
the
nquad
representation
of
blank
node
identifiers
is
used
in
these
steps,
hence
the
usage
of
the
_:
string.
Log related and path .
# Log related and path of step 5.4.4 using double circle example. hndq: log point: Hash N-Degree Quads function (4.8.3). identifier: e0 issuer: {e0: b0} ... hndq.5: log point: Hash N-Degree Quads function (4.8.3 (5)), entering loop. with: - related_hash: 1e4e55ba02b8b0b527c32e2343fbcfee2e2bd9c1972c67cc01f85fabde7bc42d data_to_hash: "" hndq.5.4: log point: Hash N-Degree Quads function (4.8.3 (5.4)), entering loop. with: - perm: [ "e1"] hndq.5.4.4: log point: Hash N-Degree Quads function (4.8.3 (5.4.4)), entering loop. with: - related: e1 path: "" ...
The prospective path is extended with the hash resulting from recursively calling this algorithm on each related blank node issued a temporary identifier.
Log recursion list and path .
# Log related and path of step 5.4.5 using double circle example. hndq: log point: Hash N-Degree Quads function (4.8.3). identifier: e0 issuer: {e0: b0} ... hndq.5: log point: Hash N-Degree Quads function (4.8.3 (5)), entering loop. with: - related_hash: 1e4e55ba02b8b0b527c32e2343fbcfee2e2bd9c1972c67cc01f85fabde7bc42d data_to_hash: "" hndq.5.4: log point: Hash N-Degree Quads function (4.8.3 (5.4)), entering loop. with: - perm: [ "e1"] ... hndq.5.4.5: log point: Hash N-Degree Quads function (4.8.3 (5.4.5)), before possible recursion. recursion list: [ "e1"] path: "_:b1" ...
Log related and include logs for each recursive call to Hash N-Degree Quads algorithm .
# Log related and path of step 5.4.5.1 using double circle example. hndq: log point: Hash N-Degree Quads function (4.8.3). identifier: e0 issuer: {e0: b0} ... hndq.5: log point: Hash N-Degree Quads function (4.8.3 (5)), entering loop. with: - related_hash: 1e4e55ba02b8b0b527c32e2343fbcfee2e2bd9c1972c67cc01f85fabde7bc42d data_to_hash: "" hndq.5.4: log point: Hash N-Degree Quads function (4.8.3 (5.4)), entering loop. with: - perm: [ "e1"] ... hndq.5.4.5: log point: Hash N-Degree Quads function (4.8.3 (5.4.5)), before possible recursion. recursion list: [ "e1"] path: "_:b1" with: - related: e1 hndq: ...
_:
,
followed
by
the
result,
to
path
.
<
,
the
hash
in
result
,
and
>
to
path
.
If path is already longer than the prospective chosen path , we can terminate this iteration early.
Log chosen path and data to hash .
# Log chosen path and data to hash logs of step 5.5 using double circle example. hndq: log point: Hash N-Degree Quads function (4.8.3). identifier: e0 issuer: {e0: b0} ... hndq.5: log point: Hash N-Degree Quads function (4.8.3 (5)), entering loop. with: - related_hash: 1e4e55ba02b8b0b527c32e2343fbcfee2e2bd9c1972c67cc01f85fabde7bc42d data_to_hash: "" ... hndq.5.5: log point: Hash N-Degree Quads function (4.8.3 (5.5). End of current loop with Hn hashes. chosen path: "_:b1_:b1<1ae899f76e760eb7caf6656437aaef845b50887aff7baeb3531add85ec02ed35>" data to hash: "1e4e55ba02b8b0b527c32e2343fbcfee2e2bd9c1972c67cc01f85fabde7bc42d_:b1_:b1<1ae899f76e760eb7caf6656437aaef845b50887aff7baeb3531add85ec02ed35>" ...
Log issuer and results from passing data to hash through the hash algorithm .
# Log issuer and resulting hash of step 6 using double circle example.
hndq:
log point: Hash N-Degree Quads function (4.8.3).
identifier: e0
issuer: {e0: b0}
...
hndq.6:
log point: Leaving Hash N-Degree Quads function (4.8.3).
hash: e332b4b59e1c4794ee72a4df0f63723326ffb6d6a5c0d0cb4d2dd8d8d5ebf5a4
issuer:
{e0:
b0,
e1:
b1}
This section describes the process of creating a serialized [ N-Quads ] representation of a canonicalized dataset .
The
serialized
canonical
form
of
a
canonicalized
dataset
is
an
N-Quads
document
[
N-QUADS
]
created
by
representing
each
quad
from
the
canonicalized
dataset
in
canonical
n-quads
form
,
sorting
them
into
code
point
order
,
and
concatenating
them.
(Note
that
each
canonical
N-Quads
statement
ends
with
a
new
line,
so
no
additional
separators
are
needed
in
the
concatenation.)
The
resulting
document
has
a
media
type
of
application/n-quads
,
as
described
in
C.
N-Quads
Internet
Media
Type,
File
Extension
and
Macintosh
File
Type
of
[
N-QUADS
].
When serializing quads in canonical n-quads form , components which are blank nodes MUST be serialized using the canonical label associated with each blank node from the issued identifiers map component of the canonicalized dataset .
This section is non-normative.
The nature of the canonicalization algorithm inherently correlates its output, i.e., the canonical labels and the sorted order of quads, with the input dataset. This could pose issues, particularly when dealing with datasets containing personal information. For example, even if certain information is removed from the canonicalized dataset for some privacy-respecting reason, there remains the possibility that a third party could infer the omitted data by analyzing the canonicalized dataset. If it is necessary to decouple the canonicalization algorithm's input and output, some suitable post-processing methods for the output of the canonicalization should be performed. This specification has been designed to help make additional processing easier, but other specifications that build on top of this one are responsible for providing any specific details. See Selective Disclosure in Verifiable Credential Data Integrity 1.0 [ VC-DATA-INTEGRITY ] for more details about such post-processing methods.
This section is non-normative.
This section is non-normative.
The canonicalization algorithm examines every difference in the information connected to blank nodes in order to ensure that each will properly receive its own canonical identifier. This process can be exploited by attackers to construct datasets which are known to take large amounts of computing time to canonicalize, but that do not express useful information or express it using unnecessary complexity. Implementers of the algorithm are expected to add mitigations that will, by default, abort canonicalizing problematic inputs.
Suggested mitigations include, but are not limited to:
Additionally, software that uses implementations of the algorithm can employ best-practice schema validation to reject data that does not meet application requirements, thereby preventing useless poison datasets from being processed. However, such mitigations are application specific and not directly applicable to implementers of the canonicalization algorithm itself.
This section is non-normative.
It is possible that the default hash algorithm used by RDFC-1.0 might become insecure at some point in the future. To mitigate this, this algorithm and implementations of it can be parameterized to use a different hash function, without the need to make any changes to the canonicalization algorithm itself. However, using a different hash algorithm will generally lead to different results; applications making use of this specification should carefully weigh the advantages and disadvantages of using an alternative hash function.
The possible implications of the default hash algorithm becoming insecure are mitigated by that fact that no internal hash values are revealed, and the canonicalization algorithm is designed to cope with first-degree hash collisions.
This section is non-normative.
The use cases that have driven the development of the RDF Dataset Canonicalization algorithm are documented in a separate document. It includes further background and explanations for the design decisions taken [ RCH-EXPLAINER ].
This section is non-normative.
This example illustrates a more complicated example where the same paths through blank nodes are duplicated in a graph, but use different blank node identifiers .
_:e0 :p1 _:e1 .
_:e1 :p2 "Foo" .
_:e2 :p1 _:e3 .
_:e3
:p2
"Foo"
.
The following is a summary of the more detailed execution log found here .
This example illustrates another complicated example of nodes that are doubly connected in opposite directions.
_:e0 :next _:e1 .
_:e0 :prev _:e1 .
_:e1 :next _:e0 .
_:e1
:prev
_:e0
.
The example is not explored in detail, but the execution log found here shows examples of more complicated pathways through the algorithm
This example illustrates an example of a dataset , where one graph is named using a blank node , which is also the object of a triple in the default graph .
_:e0 :p1 _:e1 .
_:e1 :p2 "Foo" .
_:e1 :p3 _:g0 .
_:e0 :p1 _:e1 _:g0 .
_:e1
:p2
"Bar"
_:g0
.
The following is a summary of the more detailed execution log found here .
This section defines a canonical form of N-Quads which has a completely specified layout. The grammar for the language remains unchanged.
Canonical
N-Quads
updates
and
extends
Canonical
N-Triples
in
[
N-TRIPLES
]
to
include
graphLabel
.
While
the
N-Quads
syntax
[
N-QUADS
]
allows
choices
for
the
representation
and
layout
of
RDF
data,
the
canonical
form
of
N-Quads
provides
a
unique
syntactic
representation
of
any
quad.
Each
code
point
can
be
represented
by
only
one
of
UCHAR
,
ECHAR
,
or
unencoded
character,
where
the
relevant
production
allows
for
a
choice
in
representation.
Each
quad
is
represented
entirely
on
a
single
line
with
specified
white
space.
Canonical N-Quads has the following additional constraints on layout:
subject
,
predicate
,
object
,
and
graphLabel
,
each
of
which
MUST
be
a
single
space
(code
point
U+0020
).
http://www.w3.org/2001/XMLSchema#string
MUST
NOT
use
the
datatype
IRI
part
of
the
literal
,
and
are
represented
using
only
STRING_LITERAL_QUOTE
.
HEX
MUST
use
only
digits
(
[0-9]
)
and
uppercase
letters
(
[A-F]
).
BS
(backspace,
code
point
U+0008
),
HT
(horizontal
tab,
code
point
U+0009
),
LF
(line
feed,
code
point
U+000A
),
FF
(form
feed,
code
point
U+000C
),
CR
(carriage
return,
code
point
U+000D
),
"
(quotation
mark,
code
point
U+0022
),
and
\
(backslash,
code
point
U+005C
)
MUST
be
encoded
using
ECHAR
.
U+0000
to
U+0007
,
VT
(vertical
tab,
code
point
U+000B
),
characters
in
the
range
from
U+000E
to
U+001F
,
DEL
(delete,
code
point
U+007F
),
and
characters
not
matching
the
Char
production
from
[
XML11
]
MUST
be
represented
by
UCHAR
using
a
lowercase
\u
with
4
HEX
es.
ECHAR
or
UCHAR
MUST
be
represented
by
their
native
[
UNICODE
]
representation.
EOL
MUST
be
a
single
LF
(line
feed,
code
point
U+000A
).
EOL
MUST
be
provided.
This section is non-normative.
RDF
Dataset
Canonicalization
[
CCG-RDC-FINAL
]
describes
"Universal
RDF
Dataset
Normalization
Algorithm
2015"
(
URDNA2015
),
essentially
the
same
algorithm
as
RDFC-1.0
,
and
generally
implementations
implementing
URDNA2015
should
be
compatible
with
this
specification.
The
minor
change
is
in
the
canonical
n-quads
form
where
some
control
characters
were
previously
represented
without
escaping.
The
version
of
the
algorithm
defined
in
A.
A
Canonical
form
of
N-Quads
clarifies
the
representation
of
simple
literals
and
the
characters
within
STRING_LITERAL_QUOTE
that
are
encoded
using
ECHAR
.
This section is non-normative.
A previous version of this algorithm has light deployment. For purposes of identification, the algorithm is called the "Universal RDF Graph Canonicalization Algorithm 2012" ( URGNA2012 ), and differs from the stated algorithm in the following ways:
g
,
instead
of
z
.
<
and
>
;
there
were
no
delimiters.
p
,
for
property,
when
the
related
blank
node
was
a
subject
and
the
value
r
,
for
reverse
or
reference,
when
the
related
blank
node
was
an
object
.
Since
URGNA2012
only
canonicalized
graphs,
not
datasets,
there
was
no
use
of
the
graph
name
position.
p
as
position
.
r
as
position
.
map
)
map
)
This section is non-normative.
xyz
format
for
blank
node
identifiers,
instead
of
_:xyz
.
See
Issue
46
for
the
discussion.
simple
flag,
which
was
unused
in
existing
implementations.
The
original
design
of
the
algorithm
was
to
use
the
assigned
canonical
blank
node
identifier
,
if
available,
instead
of
_:a
or
_:z
,
similar
to
how
it
is
used
in
the
related
hash
algorithm,
but
this
text
never
made
it
into
the
spec
before
implementations
moved
forward.
Therefore,
the
hashes
never
change,
making
the
loop
based
on
the
simple
flag
that
calls
this
algorithm
unnecessary.
See
Issue
23
for
the
discussion.
_:
which
is
a
serialization
artifact.
This
is
still
required
in
the
algorithms,
but
the
distinction
between
what
is
an
identifier,
and
the
serialization
form
is
clarified.
This section is non-normative.
This section is non-normative.
The editors would like to thank Jeremy Carroll for his work on the graph canonicalization problem, Andy Seaborne and Gavin Carothers for providing valuable feedback and testing input for the algorithm defined in this specification, Sir Tim Berners-Lee for his thoughts on graph canonicalization over the years, Jesús Arias Fisteus for his work on a similar algorithm, and Aiden Hogan, whose publication [ Hogan-Canonical-RDF ] provided an important contemporary analysis of the canonicalization problem and served as an independent justification of the development of RDFC-1.0 .
Cannot
GET
/uploads/liJRiS/spec/common/participants.html
/uploads/8h9OC5/spec/common/participants.html
This specification is based on work done in the W3C Credentials Community Group published as [ CCG-RDC-FINAL ]. Contributors to the Community Group Final Report include: Blake Regalia, Dave Longley, David Lehn, David Lozano Jarque, Gregg Kellogg, Manu Sporny, Markus Sabadello, Matt Collier, and Sebastian Schmittner.
Portions of the work on this specification have been funded by the European Union's StandICT.eu 2023 program under sub-grantee contract numbers No. 08/12 and 09/25. The content of this specification does not necessarily reflect the position or the policy of the European Union and no official endorsement should be inferred.
Portions of the work on this specification have also been funded by the U.S. Department of Homeland Security's Silicon Valley Innovation Program under contracts 70RSAT21T00000020 and 70RSAT23T00000006. The content of this specification does not necessarily reflect the position or the policy of the U.S. Government and no official endorsement should be inferred.
Referenced in:
Referenced in:
Referenced in:
Referenced in:
Referenced in:
Referenced in:
Referenced in:
Referenced in:
Referenced in:
Referenced in:
Referenced in:
Referenced in:
Referenced in:
Referenced in:
Referenced in:
Referenced in:
Referenced in:
Referenced in:
Referenced in:
Referenced in:
Referenced in:
Referenced in:
Referenced in:
Referenced in:
Referenced in:
Referenced in:
Referenced in:
Referenced in:
Referenced in:
Referenced in:
Referenced in: