Copyright © 2021 W3C® (MIT, ERCIM, Keio, Beihang). W3C liability, trademark and permissive document license rules apply.
Privacy is an essential part of the Web [ETHICAL-WEB]. This document provides definitions for privacy and related concepts that are applicable worldwide. It also provides a set of privacy principles that should guide the development of the Web as a trustworthy platform. Users of the Web would benefit from a stronger relationship between technology and policy, and this document is written to work with both.
This document was published as a Draft Finding by the Web Privacy Principles Task Force, which was convened by the Technical Architecture Group (TAG) and the Privacy Interest Group (PING). Publication as a Draft Finding does not imply endorsement by the TAG or by the W3C Membership. This draft does not yet reflect the consensus of the task force and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to cite this document as anything other than a work in progress.
It does not contain any normative content.
This document will continue to evolve and the task force will issue updates as often as needed. At the conclusion of the task force, it is intended to be adopted by the TAG as a Finding.
Privacy is essential to trust, and trust is a cornerstone value of the Web [RFC8890]. In much of everyday life, people have little difficulty assessing whether a given flow of information constitutes a violation of privacy or not [NYT-PRIVACY]. However, in the digital space, users struggle to understand how their data may flow between contexts and how such flows may affect them, not just immediately but at a much later time and in completely different situations. Some actors then seize upon this confusion in order to extract and exploit personal data at unprecedented scale.
The goal of this document is to define the terms that may prove useful in developing technology and policy relating to privacy and personal data. It additionally provides a toolbox for three common needs: privacy threat modelling, the recurring debates over consent, and the under-developed set of privacy issues that are collective and relational in nature.
Personal data is a regulated object, and this document naturally recognises the jurisdictional primacy of existing data protection regimes. However, the global nature of the Web means that, as we develop technology, we benefit from shared concepts that guide the evolution of the Web as a system built for its users [RFC8890]. A clear and well-defined view of privacy on the Web, grounded in an up-to-date understanding of the state of the art, can hopefully help the Web's constituencies thrive across jurisdictional disparity, with the shared understanding that the law is a floor, not a ceiling.
This section provides a number of elementary building blocks from which to establish a shared understanding of privacy. Some of the definitions below build atop the work in Tracking Preference Expression (DNT) [tracking-dnt].
A user (also person or data subject) is any natural person.
We define personal data as any information relating to a person such that:
Data is permanently de-identified when there exists a high level of confidence that no human subject of the data can be identified, directly or indirectly (e.g., via association with an identifier, user agent, or device), by that data alone or in combination with other retained or available information, including as being part of a group. Note that further considerations relating to groups are covered in the Collective Issues in Privacy section.
Data is pseudonymous when:
the identifiers used in the data are under the direct and exclusive control of the first party; and
when these identifiers are shared with a third party, they are made unique to that third party such that, if they are shared with more than one third party, those third parties cannot match them up with one another (one such derivation is sketched below); and
there is a strong level of confidence that no third party can match them to any data other than that obtained through interactions with the first party; and
any third party receiving such identifiers is barred (e.g. by legal terms) from sharing them or the related data further; and
technical measures exist to prevent re-identification or the joining of different data sets involving these identifiers, notably against timing or k-anonymity attacks; and
there exist contractual terms between the first party and third party describing the limited purpose for which the data is being shared.
This can ensure that pseudonymous data is used in a manner that provides a minimum degree of governance such that technical and procedural means to guarantee the maintenance of pseudonymity are preserved. Note that pseudonymity, on its own, is not sufficient to render data processing appropriate.
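The requirement above that identifiers be made unique to each third party can be met, for instance, by deriving a distinct identifier per recipient with a keyed hash kept under the first party's direct and exclusive control. The sketch below is a non-normative illustration under those assumptions; the function and parameter names are hypothetical, and such a derivation is only one of the measures listed above.

```ts
import { createHmac } from "node:crypto";

// Hypothetical sketch: derive a pseudonymous identifier that is unique to a
// given third party, so that two third parties cannot match their identifiers
// up with one another. The secret stays with the first party.
function pseudonymousIdFor(
  firstPartyUserId: string,
  thirdPartyName: string,
  serverSecret: string
): string {
  return createHmac("sha256", serverSecret)
    .update(`${thirdPartyName}:${firstPartyUserId}`)
    .digest("hex");
}

// The same user yields unrelated identifiers for different recipients:
// pseudonymousIdFor("user-42", "analytics.example", secret) !==
// pseudonymousIdFor("user-42", "ads.example", secret)
```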
A vulnerable person is a person who, at least in the context of the processing being discussed, is unable to exercise sufficient self-determination for any consent they may provide to be receivable. This includes, for example, children, employees with respect to their employers, people in some situations of intellectual or psychological impairment, and refugees.
A party is an entity that a person can reasonably understand as a single "thing" they're interacting with. Uses of this document in a particular domain are expected to describe how the core concepts of that domain combine into a user-comprehensible party, and those refined definitions are likely to differ between domains.
The first party is a party with which the user intends to interact. Merely hovering over, muting, pausing, or closing a given piece of content does not constitute a user's intent to interact with another party, nor does the simple fact of loading a party embedded in the one with which the user intends to interact. In cases of clear and conspicuous joint branding, there can be multiple first parties. The first party is necessarily a data controller of the data processing that takes place as a consequence of a user interacting with it.
A third party is any party other than the user, the first party, or a service provider acting on behalf of either the user or the first party.
A service provider or data processor is considered to be the same party as the entity contracting it to perform the relevant processing if it:
A data controller is a party that determines the means and purposes of data processing. Any party that is not a service provider is a data controller.
The Vegas Rule is a simple implementation of privacy in which "what happens with the first party stays with the first party." Put differently, it describes a situation in which the first party is the only data controller. Note that, while enforcing the Vegas Rule provides a rule of thumb describing a necessary baseline for appropriate data processing, it is not always sufficient to guarantee appropriate processing since the first party can process data inappropriately.
A party processes data if it carries out operations on personal data, whether or not by automated means, such as collection, recording, organisation, structuring, storage, adaptation or alteration, retrieval, consultation, use, disclosure by transmission, sharing, dissemination or otherwise making available, selling, alignment or combination, restriction, erasure or destruction.
A party shares data if it provides it to any other party. Note that, under this definition, a party that provides data to its own service providers is not sharing it.
A party sells data when it shares it in exchange for consideration, monetary or otherwise.
The purpose of a given processing of data is an anticipated, intended, or planned outcome of this processing which is achieved or aimed for within a given context. A purpose, when described, should be specific enough to be actionable by someone familiar with the relevant context (i.e. they could independently determine means that reasonably correspond to an implementation of the purpose).
The means are the general method of data processing through which a given purpose is implemented, in a given context, considered at a relatively abstract level and not necessarily all the way down to implementation details. Example: the user will have their preferences restored (purpose) by looking up their identifier in a preferences store (means).
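The worked example above can be pictured, at the level of means rather than implementation detail, roughly as follows. This is a non-normative sketch; the store and function names are hypothetical.

```ts
// Hypothetical sketch of the example means: restore the user's preferences
// (purpose) by looking up their identifier in a preferences store (means).
interface Preferences {
  language: string;
  darkMode: boolean;
}

const preferencesStore = new Map<string, Preferences>();

function restorePreferences(userId: string): Preferences {
  // Default preferences apply when the user has none stored.
  return preferencesStore.get(userId) ?? { language: "en", darkMode: false };
}
```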
A context is a physical or digital environment that a person interacts with for a purpose of their own (a purpose that they typically share with other people who interact with the same environment).
A context can be further described through:
A context carries context-relative informational norms that determine whether a given data processing is appropriate (if the norms are adhered to) or inappropriate (when the norms are violated). A norm violation can be for instance the exfiltration of personal data from a context or the lack of respect for transmission principles. When norms are respected in a given context, we can say that contextual integrity is maintained; otherwise that it is violated ([PRIVACY-IN-CONTEXT], [PRIVACY-AS-CI]).
We define privacy as a right to appropriate data processing. A privacy violation is, correspondingly, inappropriate data processing [PRIVACY-IN-CONTEXT].
Note that a first party can be comprised of multiple contexts if it is large enough that people would interact with it for more than one purpose. Sharing personal data across contexts is, in the overwhelming majority of cases, inappropriate.
Your cute little pup uses Poodle Naps to find comfortable places to snooze, and Poodle Fetch to locate the best sticks. Napping and fetching are different contexts with different norms, and sharing data between these contexts is a privacy violation despite the shared ownership of Naps and Fetch by the Poodle conglomerate.
Colloquially, tracking is understood to be any kind of inappropriate data collection.
Additionally, privacy labour is the practice of having a person carry out the work of ensuring that the data processing of which they are the subject is appropriate, instead of having the parties that process the data be responsible for that work, as would be more respectful of the person.
The user agent acts as an intermediary between a user and the web. The user agent is not a context in that it is expected to coincide with the subject and operate exclusively in the subject's interest. It is not the first party. The user agent serves the user in a relationship of fiduciary agency: it always puts the user's interest first, up to and including, on occasion, protecting the user from themselves by preventing them from carrying out a harmful decision, or at the very least by speed-bumping it [FIDUCIARY-UA]. For example, the user agent will make it difficult for the user to connect to a site the authenticity of which is hard to ascertain, will double-check that the user really intends to expose a sensitive device capability, or will prevent the user from consenting to permanent monitoring of their behaviour. Its fiduciary duties include [TAKING-TRUST-SERIOUSLY]:
These duties ensure that the user agent will care for the user. It is important to note the subtle difference between care and data paternalism: the latter purports to help in part by removing agency ("don't worry about it, so long as your data is with us it's safe, you don't need to know what we do with it, it's all good because we're good people"), whereas care aims to support people by enhancing their agency and sovereignty.
A person's identity is the set of characteristics that define them. Their identity in a context is the set of characteristics they present to that context. People frequently present different identities to different contexts, and also frequently share an identity among several contexts.
Cross-context recognition is the act of recognising that an identity in one context is the same person as an identity in another context. Cross-context recognition can at times be appropriate but anyone who does it needs to be careful not to apply the norms of one context in ways that violate the norms around use of information acquired in a different context. (For example, if you meet your therapist at a cocktail party, you expect them to have rather different discussion topics with you than they usually would, and possibly even to pretend they do not know you.) This is particularly true for vulnerable people as recognising them in different contexts may force their vulnerability into the open.
In computer systems and on the Web, an identity seen by a particular website is typically assigned an identifier of some type, which makes it easier for an automated system to store data about that user.
To do this, user agents have to make some assumptions about the borders between contexts. By default, user agents define a machine-enforceable context or partition as:
Even though this is the default, user agents are free to restrict this context as their users need. For example, some user agents may help their users present different identities to subdivisions of a single site.
There is disagreement about whether user agents may also widen their machine-enforceable contexts. For example, some user agents might want to help their users present a single identity to multiple sites that the user understands represent a single party, or to a site across multiple installations.
User agents should prevent their user from being recognized across machine-enforceable contexts unless the user intends to be recognized. This is a "should" rather than a "must" because there are many cases where the user agent isn't powerful enough to prevent recognition. For example if two services that a user needs to use insist that the user share a difficult-to-forge piece of their identity in order to use the services, it's the services behaving inappropriately rather than the user agent.
If a site includes multiple contexts whose norms indicate that it's inappropriate to share data between the contexts, the fact that those distinct contexts fall inside a single machine-enforceable context doesn't make sharing data or recognizing identities any less inappropriate.
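One way to picture a machine-enforceable context is as a partition key that the user agent attaches to every piece of stored state, so that the same embedded origin sees unrelated storage under different top-level sites. The sketch below assumes, for illustration only, a partition keyed by top-level site and embedded origin; actual user agents differ in how they compute partition keys, and the names used here are hypothetical.

```ts
// Hypothetical sketch: storage partitioned by (top-level site, embedded origin),
// so an embedded widget cannot recognise the user across top-level sites.
type PartitionKey = string;

function partitionKey(topLevelSite: string, embeddedOrigin: string): PartitionKey {
  return `${topLevelSite}^${embeddedOrigin}`;
}

const partitionedStorage = new Map<PartitionKey, Map<string, string>>();

function setItem(
  topLevelSite: string,
  embeddedOrigin: string,
  key: string,
  value: string
): void {
  const pk = partitionKey(topLevelSite, embeddedOrigin);
  let bucket = partitionedStorage.get(pk);
  if (!bucket) {
    bucket = new Map();
    partitionedStorage.set(pk, bucket);
  }
  bucket.set(key, value);
}

// widget.example embedded in news.example and in shop.example sees two
// unrelated buckets, so a stored identifier does not follow the user across
// machine-enforceable contexts.
```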
A person's autonomy is their ability to make decisions of their own volition, without undue influence from other parties. People have limited intellectual resources and time with which to weigh decisions, and by necessity rely on shortcuts when making decisions. This makes their privacy preferences malleable [PRIVACY-BEHAVIOR] and susceptible to manipulation [DIGITAL-MARKET-MANIPULATION]. A person's autonomy is enhanced by a system or device when that system offers a shortcut that aligns more with what that person would have decided given arbitrary amounts of time and relatively unfettered intellectual ability; and autonomy is decreased when a similar shortcut goes against decisions made under ideal conditions.
Affordances and interactions that decrease autonomy are known as dark patterns. A dark pattern does not have to be intentional; the deceptive effect is sufficient to define it [DARK-PATTERNS], [DARK-PATTERN-DARK].
Because we are all subject to motivated reasoning, the design of defaults and affordances that may impact user autonomy should be the subject of independent scrutiny. Implementers are enjoined to be particularly cautious to avoid slipping into data paternalism.
Given the sheer volume of potential data-related decisions in today's data economy, complete informational self-determination is impossible. This fact, however, should not be confused with the contention that privacy is dead. Careful design of our technological infrastructure can ensure that users' autonomy as pertaining to their own data is enhanced through appropriate defaults and choice architectures.
In the 1970s, the Fair Information Practices or FIPs were elaborated in support of individual autonomy in the face of growing concerns with databases. The FIPs assume that there is sufficiently little data processing taking place that any person will be able to carry out sufficient diligence to enable autonomy in their decision-making. Since they entirely offload the privacy labour to users and assume perfect, unfettered autonomy, the FIPs do not forbid specific types of data processing but only place them under different procedural requirements. Such an approach is appropriate for parties that are processing data in the 1970s.
One notable issue with procedural approaches to privacy is that they tend to impose the same requirements in situations where the user finds themselves in a significant asymmetry of power with a party — for instance the user of an essential service provided by a monopolistic platform — and in situations where user and parties are very much on equal footing, or even where the user may have greater power, as is the case with small businesses operating in a competitive environment. Such approaches further do not consider cases in which one party may coerce other parties into facilitating its inappropriate practices, as is often the case with dominant players in advertising [CONSENT-LACKEYS] or in content aggregation [CAT].
Reference to the FIPs survives to this day. They are often referenced as transparency and choice, which, in today's digital environment, is often a strong indication that inappropriate processing is being described.
Different procedural mechanisms exist to enable people to control the processing done to their data. Mechanisms that increase the number of purposes for which their data is being processed are referred to as opt-in or consent; mechanisms that decrease this number of purposes are known as opt-out.
When deployed thoughtfully, these mechanisms can enhance people's autonomy. Often, however, they are used as a way to avoid putting in the difficult work of deciding which types of processing are appropriate and which are not, offloading privacy labour to the user.
Privacy regulatory regimes are often anchored at extremes: either they default to allowing only very few strictly essential purposes such that many parties will have to resort to consent, habituating people to ignore legal prompts and incentivising dark patterns, or, conversely, they default to forbidding only very few, particularly egregious purposes, such that people will have to perform the privacy labour to opt out in every context in order to produce appropriate processing.
An approach that is more aligned with the expectation that the Web should provide a trustworthy, person-centric environment is to establish a regime consisting of three privacy tiers:
When an opt-out mechanism exists, it should preferably be complemented by a global opt-out mechanism. The function of a global opt-out mechanism is to rectify the automation asymmetry whereby service providers can automate data processing but people have to take manual action. A good example of a global opt-out mechanism is the Global Privacy Control [GPC].
Conceptually, a global opt-out mechanism is an automaton operating as part of the user agent, which is to say that it is equivalent to a robot that would carry out the user's bidding by pressing an opt-out button with every interaction that the user has with a site, or more generally conveys an expression of the user's rights in a relevant jurisdiction. (For instance, under [GDPR], the user may be conveying objections to processing based on legitimate interest or the withdrawal of consent to specific purposes.) It should be noted that, since a global opt-out signal is reaffirmed automatically with every user interaction, it will take precedence in terms of specificity over any manner of blanket consent that a site may obtain, unless that consent is directly attached to an interaction (e.g. terms specified on a form upon submission).
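As an illustration, the Global Privacy Control proposal [GPC] conveys the signal as a Sec-GPC request header and exposes it to script as navigator.globalPrivacyControl. The sketch below shows how a site might treat the signal as taking precedence over blanket consent; it is illustrative only, the function names are hypothetical, and the precise obligations that the signal triggers depend on the applicable jurisdiction.

```ts
// Hypothetical server-side sketch: the Global Privacy Control signal, being
// reaffirmed with every request, overrides any earlier blanket consent.
function mayShareDataForThisRequest(
  headers: Map<string, string>,
  hasBlanketConsentOnRecord: boolean
): boolean {
  const globalOptOut = headers.get("sec-gpc") === "1";
  if (globalOptOut) return false; // standing opt-out takes precedence
  return hasBlanketConsentOnRecord;
}

// In the browser, the GPC proposal exposes the same signal to script:
// if (navigator.globalPrivacyControl) { /* treat as a standing opt-out */ }
```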
When designing Web technology, we naturally pay attention to potential impacts on the person using the Web through their user agent. In addition to potential individual harms we also pay heed to collective effects that emerge from the accumulation of individual actions as influenced by entities and the structure of technology.
Note that in evaluating impact, we deliberately ignore what implementers or specifiers may have intended and only focus on outcomes. This framing is known as POSIWID, or "the Purpose Of a System Is What It Does".
The collective problem of privacy is known as legibility. Legibility concerns population-level data processing that may impact populations or individuals, including in ways that people could not control even under the optimistic assumptions of the FIPs. For example, based on population-level analysis, a company may know that site.example is predominantly visited by people of a given race or gender, and decide not to run its job ads there. Visitors to that page are implicitly having their data processed in inappropriate ways, with no way to discover the discrimination or seek relief [DEMOCRATIC-DATA].
What we consider is therefore not just the relation between the people who expose themselves and the entities that invite that disclosure [RELATIONAL-TURN], but also between the people who expose themselves and those who do not but may find themselves recognised as such indirectly anyway. One key understanding here is that such relations may persist even when data is permanently de-identified.
Legibility practices can be legitimate or illegitimate depending on the context and on the norms that apply in that context. Typically, a legibility practice may be legitimate if it is managed through an acceptable process of collective governance. For example, it is often considered legitimate for a government, under the control of its citizens, to maintain a database of license plates for the purpose of enforcing the rules of the road. It would be illegitimate to observe the same license plates near places of worship to build a database of religious identity.
Legibility is often used to order information about the world. This can notably create problems of reflexivity and of autonomy.
Problems of reflexivity occur when the ordering of information about the world used to produce legibility finds itself changing the way in which the world operates. This can produce self-reinforcing loops that can have deleterious effects both individual and collective [SEEING-LIKE-A-STATE].
Issues of autonomy occur depending on the manner in which legibility is implemented. When legibility is used to order the world following rules set by the user or following methods subject to public scrutiny and governance models with strong checks and balances (such as a newspaper's editorial decisions), then it will enhance user autonomy and tend to be legitimate. When it is done in the user's stead and without governance, it decreases user autonomy and tends to be illegitimate.
Data governance refers to the rules and processes for how data is processed in any given context. How data is governed describes who has power to make decisions over data and how [DATA-FUTURES-GLOSSARY].
In general, collective issues in data require collective solutions. The proper goal of data governance at the standards-setting level is the development of structural controls in user agents and the provision of institutions that can handle population-level problems in data. Governance will often struggle to achieve its goals if it works primarily by increasing individual control over data. A collective approach reduces the cost of control.
Collecting data at large scales can have significant pro-social outcomes. Problems tend to emerge when entities take part in dual-use collection, in which data is processed for collective benefit but also for self-dealing purposes that may degrade welfare. The self-dealing purposes are typically justified as bankrolling the pro-social outcomes, but, absent collective oversight, such justification cannot support claims that the legibility is legitimate. It is vital for standards-setting organisations to establish not just purely technical devices but techno-social systems that can govern data at scale.
User agents should attempt to defend their users from a variety of high-level threats or attacker goals, described in this section.
These threats are an extension of the ones discussed by [RFC6973].
These threats combine into the particular concrete threats we want web specifications to defend against, described in subsections here:
Contributes to surveillance, correlation, and identification.
Users of most instantiations of the web platform expect that if they visit a site on one day, and then visit again the next day, the site will be able to recognize that they're the same user. This allows sites to save the user's preferences, shopping carts, etc. The web platform offers many mechanisms that are either intended to accomplish this recognition or that can be trivially used for it, including cookies, localStorage, indexedDB, CacheStorage, and other forms of storage.
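For instance, a site can trivially recognise a returning visitor by storing a random identifier on the first visit and reading it back on later visits. The following is a minimal, illustrative sketch with a hypothetical storage key.

```ts
// Minimal sketch of intended same-site recognition: store a random identifier
// on the first visit, read it back on subsequent visits. Clearing site data or
// using a separate machine-enforceable context breaks the association.
function getOrCreateVisitorId(): string {
  const key = "visitor-id"; // hypothetical storage key
  let id = localStorage.getItem(key);
  if (id === null) {
    id = crypto.randomUUID();
    localStorage.setItem(key, id);
  }
  return id;
}
```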
A privacy harm only occurs if the user wants to break the association between two visits, but the site can still determine with high probability that the two visits came from the same user.
A user might expect that their two visits won't be associated if they:
This recognition is generally accomplished by either "supercookies" or browser fingerprinting.
Supercookies occur when a browser stores data for a site but makes that data more difficult to clear than other cookies or storage. Fingerprinting Guidance § Clearing all local state discusses how specifications can help browsers avoid this mistake.
Fingerprinting consists of using attributes of the user's browser and platform that are consistent between the two visits and probabilistically unique to the user.
The attributes can be exposed as information about the user's device that is otherwise benign (as opposed to § 3.3 Sensitive information disclosure). For example:
See [fingerprinting-guidance] for how to mitigate this threat.
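To illustrate the threat (and not to endorse the practice), the sketch below combines a handful of commonly exposed attributes into a single probabilistic identifier that survives clearing storage; real fingerprinters draw on many more sources of entropy, which is why [fingerprinting-guidance] asks specifications to minimise the entropy they expose.

```ts
// Illustration of the fingerprinting threat only: a handful of stable,
// widely exposed attributes are hashed into a probabilistic identifier
// that persists even after cookies and storage are cleared.
async function naiveFingerprint(): Promise<string> {
  const attributes = [
    navigator.userAgent,
    navigator.language,
    String(navigator.hardwareConcurrency),
    `${screen.width}x${screen.height}x${screen.colorDepth}`,
    Intl.DateTimeFormat().resolvedOptions().timeZone,
  ].join("|");
  const digest = await crypto.subtle.digest(
    "SHA-256",
    new TextEncoder().encode(attributes)
  );
  return Array.from(new Uint8Array(digest))
    .map((b) => b.toString(16).padStart(2, "0"))
    .join("");
}
```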
Contributes to surveillance, correlation, and identification, usually more significantly than § 3.1 Unwanted same-site recognition.
This occurs if a site can determine with high probability that a visit to that site comes from the same user as another visit to a different site.
Contributes to correlation, identification, secondary use, and disclosure.
Many pieces of information about a user could cause privacy harms if disclosed. For example:
A particular piece of information may have different sensitivity for different users. Language preferences, for example, might typically seem innocent, but also can be an indicator of belonging to an ethnic minority. Precise location information can be extremely sensitive (because it's identifying, because it allows for in-person intrusions, because it can reveal detailed information about a person's life) but it might also be public and not sensitive at all, or it might be low-enough granularity that it is much less sensitive for many users.
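As an illustration of the granularity point, coordinates can be coarsened before they are stored or shared; rounding to one decimal degree, for example, reduces precision to roughly 11 km of latitude. The rounding sketch below is illustrative only.

```ts
// Illustrative sketch: coarsen a location to reduce its sensitivity.
// One decimal degree of latitude corresponds to roughly 11 km.
function coarsen(
  latitude: number,
  longitude: number,
  decimals = 1
): { latitude: number; longitude: number } {
  const factor = 10 ** decimals;
  return {
    latitude: Math.round(latitude * factor) / factor,
    longitude: Math.round(longitude * factor) / factor,
  };
}

// coarsen(48.85837, 2.29448) -> { latitude: 48.9, longitude: 2.3 }
```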
When considering whether a class of information is likely to be sensitive to users, consider at least these factors:
Issue(16): This description of what makes information sensitive still needs to be refined.
See intrusion.
Privacy harms don't always come from a site learning things. For example, it is intrusive for a site to
if the user doesn't intend for it to do so.
Contributes to misattribution.
For example, a site that sends SMS without the user's intent could cause them to be blamed for things they didn't intend.