Copyright © 2021 W3C® (MIT, ERCIM, Keio, Beihang). W3C liability, trademark and permissive document license rules apply.
Privacy is an essential part of the Web [ETHICAL-WEB]. This document provides definitions for privacy and related concepts that are applicable worldwide. It also provides a set of privacy principles that should guide the development of the Web as a trustworthy platform. Users of the Web would benefit from a stronger relationship between technology and policy, and this document is written to work with both.
This document was published as a Draft Finding by the Web Privacy Principles Task Force, which was convened by the Technical Architecture Group (TAG) and the Privacy Interest Group (PING). Publication as a Draft Finding does not imply endorsement by the TAG or by the W3C Membership. This draft does not yet reflect the consensus of the task force and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to cite this document as anything other than a work in progress.
It does not contain any normative content.
This document will continue to evolve and the task force will issue updates as often as needed. At the conclusion of the task force, it is intended to be adopted by the TAG as a Finding.
Privacy is essential to trust, and trust is a cornerstone value of the Web [RFC8890]. In much of everyday life, people have little difficulty assessing whether a given flow of information constitutes a violation of privacy or not [NYT-PRIVACY]. However, in the digital space, users struggle to understand how their data may flow between contexts and how such flows may affect them, not just immediately but at a much later time and in completely different situations. Some actors then seize upon this confusion in order to extract and exploit personal data at unprecedented scale.
The goal of this document is to define the terms that may prove useful in developing technology and policy relating to privacy and personal data. It additionally provides a toolbox for three common needs: privacy threat modelling, the recurring debates over consent, and the under-developed set of privacy issues that are collective and relational in nature.
Personal data is a regulated object, and this document naturally recognises the jurisdictional primacy of existing data protection regimes. However, the global nature of the Web means that, as we develop technology, we benefit from shared concepts that guide the evolution of the Web as a system built for its users [RFC8890]. A clear and well-defined view of privacy on the Web, grounded in an up-to-date understanding of the state of the art, can hopefully help the Web's constituencies thrive across jurisdictional disparity, with the shared understanding that the law is a floor, not a ceiling.
This section provides a number of elementary building blocks from which to establish a shared understanding of privacy. Some of the definitions below build atop the work in Tracking Preference Expression (DNT) [tracking-dnt].
A user (also person or data subject) is any natural person.
We define personal data as any information relating to a person such that:
Data is permanently de-identified when there exists a high level of confidence that no human subject of the data can be identified, directly or indirectly (e.g., via association with an identifier, user agent, or device), by that data alone or in combination with other retained or available information, including as being part of a group. Note that further considerations relating to groups are covered in the Collective Issues in Privacy section.
Data is pseudonymous when:
the identifiers used in the data are under the direct and exclusive control of the first party; and
when these identifiers are shared with a third party, they are made unique to that third party such that, if they are shared with more than one third party, those third parties cannot match them up with one another (one such derivation is sketched below); and
there is a strong level of confidence that no third party can match them to any data other than that obtained through interactions with the first party; and
any third party receiving such identifiers is barred (e.g. by legal terms) from sharing them or the related data further; and
technical measures exist to prevent re-identification or the joining of different data sets involving these identifiers, notably against timing or k-anonymity attacks; and
there exist contractual terms between the first party and third party describing the limited purpose for which the data is being shared.
This can ensure that pseudonymous data is used in a manner that provides a minimum degree of governance such that technical and procedural means to guarantee the maintenance of pseudonymity are preserved. Note that pseudonymity, on its own, is not sufficient to render data processing appropriate.
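The requirement above that identifiers be made unique to each third party can be met, for instance, by deriving a distinct identifier per recipient with a keyed hash kept under the first party's direct and exclusive control. The sketch below is a non-normative illustration under those assumptions; the function and parameter names are hypothetical, and such a derivation is only one of the measures listed above.

```ts
import { createHmac } from "node:crypto";

// Hypothetical sketch: derive a pseudonymous identifier that is unique to a
// given third party, so that two third parties cannot match their identifiers
// up with one another. The secret stays with the first party.
function pseudonymousIdFor(
  firstPartyUserId: string,
  thirdPartyName: string,
  serverSecret: string
): string {
  return createHmac("sha256", serverSecret)
    .update(`${thirdPartyName}:${firstPartyUserId}`)
    .digest("hex");
}

// The same user yields unrelated identifiers for different recipients:
// pseudonymousIdFor("user-42", "analytics.example", secret) !==
// pseudonymousIdFor("user-42", "ads.example", secret)
```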
A vulnerable person is a person who, at least in the context of the processing being discussed, is unable to exercise sufficient self-determination for any consent they may provide to be receivable. This includes, for example, children, employees with respect to their employers, people in some situations of intellectual or psychological impairment, and refugees.
A party is an entity that a person can reasonably understand as a single "thing" they're interacting with. Uses of this document in a particular domain are expected to describe how the core concepts of that domain combine into a user-comprehensible party, and those refined definitions are likely to differ between domains.
The first party is a party with which the user intends to interact. Merely hovering over, muting, pausing, or closing a given piece of content does not constitute a user's intent to interact with another party, nor does the simple fact of loading a party embedded in the one with which the user intends to interact. In cases of clear and conspicuous joint branding, there can be multiple first parties. The first party is necessarily a data controller of the data processing that takes place as a consequence of a user interacting with it.
A third party is any party other than the user, the first party, or a service provider acting on behalf of either the user or the first party.
A service provider or data processor is considered to be the same party as the entity contracting it to perform the relevant processing if it:
A data controller is a party that determines the means and purposes of data processing. Any party that is not a service provider is a data controller.
The Vegas Rule is a simple implementation of privacy in which "what happens with the first party stays with the first party." Put differently, it describes a situation in which the first party is the only data controller. Note that, while enforcing the Vegas Rule provides a rule of thumb describing a necessary baseline for appropriate data processing, it is not always sufficient to guarantee appropriate processing since the first party can process data inappropriately.
A party processes data if it carries out operations on personal data, whether or not by automated means, such as collection, recording, organisation, structuring, storage, adaptation or alteration, retrieval, consultation, use, disclosure by transmission, sharing, dissemination or otherwise making available, selling, alignment or combination, restriction, erasure or destruction.
A party shares data if it provides it to any other party. Note that, under this definition, a party that provides data to its own service providers is not sharing it.
A party sells data when it shares it in exchange for consideration, monetary or otherwise.
The purpose of a given processing of data is an anticipated, intended, or planned outcome of this processing which is achieved or aimed for within a given context. A purpose, when described, should be specific enough to be actionable by someone familiar with the relevant context (i.e. they could independently determine means that reasonably correspond to an implementation of the purpose).
The means are the general method of data processing through which a given purpose is implemented, in a given context, considered at a relatively abstract level and not necessarily all the way down to implementation details. Example: the user will have their preferences restored (purpose) by looking up their identifier in a preferences store (means).
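The worked example above can be pictured, at the level of means rather than implementation detail, roughly as follows. This is a non-normative sketch; the store and function names are hypothetical.

```ts
// Hypothetical sketch of the example means: restore the user's preferences
// (purpose) by looking up their identifier in a preferences store (means).
interface Preferences {
  language: string;
  darkMode: boolean;
}

const preferencesStore = new Map<string, Preferences>();

function restorePreferences(userId: string): Preferences {
  // Default preferences apply when the user has none stored.
  return preferencesStore.get(userId) ?? { language: "en", darkMode: false };
}
```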
A context is a physical or digital environment that a person interacts with for a purpose of their own (a purpose that they typically share with other people who interact with the same environment).
A context can be further described through:
A context carries context-relative informational norms that determine whether a given data processing is appropriate (if the norms are adhered to) or inappropriate (when the norms are violated). A norm violation can be for instance the exfiltration of personal data from a context or the lack of respect for transmission principles. When norms are respected in a given context, we can say that contextual integrity is maintained; otherwise that it is violated ([PRIVACY-IN-CONTEXT], [PRIVACY-AS-CI]).
We define privacy as a right to appropriate data processing. A privacy violation is, correspondingly, inappropriate data processing [PRIVACY-IN-CONTEXT].
Note that a first party can be comprised of multiple contexts if it is large enough that people would interact with it for more than one purpose. Sharing personal data across contexts is, in the overwhelming majority of cases, inappropriate.
Your cute little pup uses Poodle Naps to find comfortable places to snooze, and Poodle Fetch to locate the best sticks. Napping and fetching are different contexts with different norms, and sharing data between these contexts is a privacy violation despite the shared ownership of Naps and Fetch by the Poodle conglomerate.
Colloquially, tracking is understood to be any kind of inappropriate data collection.
Additionally, privacy labour is the practice of having a person carry out the work of ensuring that the data processing of which they are the subject is appropriate, instead of having the parties that process the data be responsible for that work, as would be more respectful of the person.
The user agent acts as an intermediary between a user and the web. The user agent is not a context in that it is expected to coincide with the subject and operate exclusively in the subject's interest. It is not the first party. The user agent serves the user in a relationship of fiduciary agency: it always puts the user's interest first, up to and including, on occasion, protecting the user from themselves by preventing them from carrying out a harmful decision, or at the very least by speed-bumping it [FIDUCIARY-UA]. For example, the user agent will make it difficult for the user to connect to a site the authenticity of which is hard to ascertain, will double-check that the user really intends to expose a sensitive device capability, or will prevent the user from consenting to permanent monitoring of their behaviour. Its fiduciary duties include [TAKING-TRUST-SERIOUSLY]:
These duties ensure that the user agent will care for the user. It is important to note the subtle difference between care and data paternalism: the latter purports to help in part by removing agency ("don't worry about it, so long as your data is with us it's safe, you don't need to know what we do with it, it's all good because we're good people"), whereas care aims to support people by enhancing their agency and sovereignty.
A person's identity is the set of characteristics that define them. Their identity in a context is the set of characteristics they present to that context. People frequently present different identities to different contexts, and also frequently share an identity among several contexts.
Cross-context recognition is the act of recognising that an identity in one context is the same person as an identity in another context. Cross-context recognition can at times be appropriate but anyone who does it needs to be careful not to apply the norms of one context in ways that violate the norms around use of information acquired in a different context. (For example, if you meet your therapist at a cocktail party, you expect them to have rather different discussion topics with you than they usually would, and possibly even to pretend they do not know you.) This is particularly true for vulnerable people as recognising them in different contexts may force their vulnerability into the open.
In computer systems and on the Web, an identity seen by a particular website is typically assigned an identifier of some type, which makes it easier for an automated system to store data about that user.
To do this, user agents have to make some assumptions about the borders between contexts. By default, user agents define a machine-enforceable context or partition as:
Even though this is the default, user agents are free to restrict this context as their users need. For example, some user agents may help their users present different identities to subdivisions of a single site.
There is disagreement about whether user agents may also widen their machine-enforceable contexts. For example, some user agents might want to help their users present a single identity to multiple sites that the user understands represent a single party, or to a site across multiple installations.
User agents should prevent their user from being recognized across machine-enforceable contexts unless the user intends to be recognized. This is a "should" rather than a "must" because there are many cases where the user agent isn't powerful enough to prevent recognition. For example if two services that a user needs to use insist that the user share a difficult-to-forge piece of their identity in order to use the services, it's the services behaving inappropriately rather than the user agent.
If a site includes multiple contexts whose norms indicate that it's inappropriate to share data between the contexts, the fact that those distinct contexts fall inside a single machine-enforceable context doesn't make sharing data or recognizing identities any less inappropriate.
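One way to picture a machine-enforceable context is as a partition key that the user agent attaches to every piece of stored state, so that the same embedded origin sees unrelated storage under different top-level sites. The sketch below assumes, for illustration only, a partition keyed by top-level site and embedded origin; actual user agents differ in how they compute partition keys, and the names used here are hypothetical.

```ts
// Hypothetical sketch: storage partitioned by (top-level site, embedded origin),
// so an embedded widget cannot recognise the user across top-level sites.
type PartitionKey = string;

function partitionKey(topLevelSite: string, embeddedOrigin: string): PartitionKey {
  return `${topLevelSite}^${embeddedOrigin}`;
}

const partitionedStorage = new Map<PartitionKey, Map<string, string>>();

function setItem(
  topLevelSite: string,
  embeddedOrigin: string,
  key: string,
  value: string
): void {
  const pk = partitionKey(topLevelSite, embeddedOrigin);
  let bucket = partitionedStorage.get(pk);
  if (!bucket) {
    bucket = new Map();
    partitionedStorage.set(pk, bucket);
  }
  bucket.set(key, value);
}

// widget.example embedded in news.example and in shop.example sees two
// unrelated buckets, so a stored identifier does not follow the user across
// machine-enforceable contexts.
```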
A person's autonomy is their ability to make decisions of their own volition, without undue influence from other parties. People have limited intellectual resources and time with which to weigh decisions, and by necessity rely on shortcuts when making decisions. This makes their privacy preferences malleable [PRIVACY-BEHAVIOR] and susceptible to manipulation [DIGITAL-MARKET-MANIPULATION]. A person's autonomy is enhanced by a system or device when that system offers a shortcut that aligns more with what that person would have decided given arbitrary amounts of time and relatively unfettered intellectual ability; and autonomy is decreased when a similar shortcut goes against decisions made under ideal conditions.
Affordances and interactions that decrease autonomy are known as dark patterns. A dark pattern does not have to be intentional; the deceptive effect is sufficient to define it [DARK-PATTERNS], [DARK-PATTERN-DARK].
Because we are all subject to motivated reasoning, the design of defaults and affordances that may impact user autonomy should be the subject of independent scrutiny. Implementers are enjoined to be particularly cautious to avoid slipping into data paternalism.
Given the sheer volume of potential data-related decisions in today's data economy, complete informational self-determination is impossible. This fact, however, should not be confused with the contention that privacy is dead. Careful design of our technological infrastructure can ensure that users' autonomy as pertaining to their own data is enhanced through appropriate defaults and choice architectures.
In the 1970s, the Fair Information Practices or FIPs were elaborated in support of individual autonomy in the face of growing concerns with databases. The FIPs assume that there is sufficiently little data processing taking place that any person will be able to carry out sufficient diligence to enable autonomy in their decision-making. Since they entirely offload the privacy labour to users and assume perfect, unfettered autonomy, the FIPs do not forbid specific types of data processing but only place them under different procedural requirements. Such an approach is appropriate for parties that are processing data in the 1970s.
One notable issue with procedural approaches to privacy is that they tend to impose the same requirements in situations where the user finds themselves in a significant asymmetry of power with a party — for instance the user of an essential service provided by a monopolistic platform — and in situations where user and parties are very much on equal footing, or even where the user may have greater power, as is the case with small businesses operating in a competitive environment. Such approaches further do not consider cases in which one party may coerce other parties into facilitating its inappropriate practices, as is often the case with dominant players in advertising [CONSENT-LACKEYS] or in content aggregation [CAT].
Reference to the FIPs survives to this day. They are often referenced as transparency and choice, which, in today's digital environment, is often a strong indication that inappropriate processing is being described.
Different procedural mechanisms exist to enable people to control the processing done to their data. Mechanisms that increase the number of purposes for which their data is being processed are referred to as opt-in or consent; mechanisms that decrease this number of purposes are known as opt-out.
When deployed thoughtfully, these mechanisms can enhance people's autonomy. Often, however, they are used as a way to avoid putting in the difficult work of deciding which types of processing are appropriate and which are not, offloading privacy labour to the user.
Privacy regulatory regimes are often anchored at extremes: either they default to allowing only very few strictly essential purposes such that many parties will have to resort to consent, habituating people to ignore legal prompts and incentivising dark patterns, or, conversely, they default to forbidding only very few, particularly egregious purposes, such that people will have to perform the privacy labour to opt out in every context in order to produce appropriate processing.
An approach that is more aligned with the expectation that the Web should provide a trustworthy, person-centric environment is to establish a regime consisting of three privacy tiers:
When an opt-out mechanism exists, it should preferably be complemented by a global opt-out mechanism. The function of a global opt-out mechanism is to rectify the automation asymmetry whereby service providers can automate data processing but people have to take manual action. A good example of a global opt-out mechanism is the Global Privacy Control [GPC].
Conceptually, a global opt-out mechanism is an automaton operating as part of the user agent, which is to say that it is equivalent to a robot that would carry out the user's bidding by pressing an opt-out button with every interaction that the user has with a site, or more generally conveys an expression of the user's rights in a relevant jurisdiction. (For instance, under [GDPR], the user may be conveying objections to processing based on legitimate interest or the withdrawal of consent to specific purposes.) It should be noted that, since a global opt-out signal is reaffirmed automatically with every user interaction, it will take precedence in terms of specificity over any manner of blanket consent that a site may obtain, unless that consent is directly attached to an interaction (e.g. terms specified on a form upon submission).
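As an illustration, the Global Privacy Control proposal [GPC] conveys the signal as a Sec-GPC request header and exposes it to script as navigator.globalPrivacyControl. The sketch below shows how a site might treat the signal as taking precedence over blanket consent; it is illustrative only, the function names are hypothetical, and the precise obligations that the signal triggers depend on the applicable jurisdiction.

```ts
// Hypothetical server-side sketch: the Global Privacy Control signal, being
// reaffirmed with every request, overrides any earlier blanket consent.
function mayShareDataForThisRequest(
  headers: Map<string, string>,
  hasBlanketConsentOnRecord: boolean
): boolean {
  const globalOptOut = headers.get("sec-gpc") === "1";
  if (globalOptOut) return false; // standing opt-out takes precedence
  return hasBlanketConsentOnRecord;
}

// In the browser, the GPC proposal exposes the same signal to script:
// if (navigator.globalPrivacyControl) { /* treat as a standing opt-out */ }
```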
When designing Web technology, we naturally pay attention to potential impacts on the person using the Web through their user agent. In addition to potential individual harms we also pay heed to collective effects that emerge from the accumulation of individual actions as influenced by entities and the structure of technology.
Note that in evaluating impact, we deliberately ignore what implementers or specifiers may have intended and only focus on outcomes. This framing is known as POSIWID, or "the Purpose Of a System Is What It Does".
The collective problem of privacy is known as legibility. Legibility concerns population-level data processing that may impact populations or individuals, including in ways that people could not control even under the optimistic assumptions of the FIPs. For example, based on population-level analysis, a company may know that site.example is predominantly visited by people of a given race or gender, and decide not to run its job ads there. Visitors to that page are implicitly having their data processed in inappropriate ways, with no way to discover the discrimination or seek relief [DEMOCRATIC-DATA].
What we consider is therefore not just the relation between the people who expose themselves and the entities that invite that disclosure [RELATIONAL-TURN], but also between the people who expose themselves and those who do not but may find themselves recognised as such indirectly anyway. One key understanding here is that such relations may persist even when data is permanently de-identified.
Legibility practices can be legitimate or illegitimate depending on the context and on the norms that apply in that context. Typically, a legibility practice may be legitimate if it is managed through an acceptable process of collective governance. For example, it is often considered legitimate for a government, under the control of its citizens, to maintain a database of license plates for the purpose of enforcing the rules of the road. It would be illegitimate to observe the same license plates near places of worship to build a database of religious identity.
Legibility is often used to order information about the world. This can notably create problems of reflexivity and of autonomy.
Problems of reflexivity occur when the ordering of information about the world used to produce legibility finds itself changing the way in which the world operates. This can produce self-reinforcing loops that can have deleterious effects both individual and collective [SEEING-LIKE-A-STATE].
Issues of autonomy occur depending on the manner in which legibility is implemented. When legibility is used to order the world following rules set by the user or following methods subject to public scrutiny and governance models with strong checks and balances (such as a newspaper's editorial decisions), then it will enhance user autonomy and tend to be legitimate. When it is done in the user's stead and without governance, it decreases user autonomy and tends to be illegitimate.
Data governance refers to the rules and processes for how data is processed in any given context. How data is governed describes who has power to make decisions over data and how [DATA-FUTURES-GLOSSARY].
In general, collective issues in data require collective solutions. The proper goal of data governance at the standards-setting level is the development of structural controls in user agents and the provision of institutions that can handle population-level problems in data. Governance will often struggle to achieve its goals if it works primarily by increasing individual control over data. A collective approach reduces the cost of control.
Collecting data at large scales can have significant pro-social outcomes. Problems tend to emerge when entities take part in dual-use collection, in which data is processed for collective benefit but also for self-dealing purposes that may degrade welfare. The self-dealing purposes are typically justified as bankrolling the pro-social outcomes, but, absent collective oversight, such justification cannot support claims that the legibility is legitimate. It is vital for standards-setting organisations to establish not just purely technical devices but techno-social systems that can govern data at scale.
User agents should attempt to defend their users from a variety of high-level threats or attacker goals, described in this section.
These threats are an extension of the ones discussed by [RFC6973].
These threats combine into the particular concrete threats we want web specifications to defend against, described in subsections here:
Contributes to surveillance, correlation, and identification.
Users of most instantiations of the web platform expect that if they visit a site on one day, and then visit again the next day, the site will be able to recognize that they're the same user. This allows sites to save the user's preferences, shopping carts, etc. The web platform offers many mechanisms that are either intended to accomplish this recognition or that can be trivially used for it, including cookies, localStorage, indexedDB, CacheStorage, and other forms of storage.
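For instance, a site can trivially recognise a returning visitor by storing a random identifier on the first visit and reading it back on later visits. The following is a minimal, illustrative sketch with a hypothetical storage key.

```ts
// Minimal sketch of intended same-site recognition: store a random identifier
// on the first visit, read it back on subsequent visits. Clearing site data or
// using a separate machine-enforceable context breaks the association.
function getOrCreateVisitorId(): string {
  const key = "visitor-id"; // hypothetical storage key
  let id = localStorage.getItem(key);
  if (id === null) {
    id = crypto.randomUUID();
    localStorage.setItem(key, id);
  }
  return id;
}
```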
A privacy harm only occurs if the user wants to break the association between two visits, but the site can still determine with high probability that the two visits came from the same user.
A user might expect that their two visits won't be associated if they:
This recognition is generally accomplished by either "supercookies" or browser fingerprinting.
Supercookies occur when a browser stores data for a site but makes that data more difficult to clear than other cookies or storage. Fingerprinting Guidance § Clearing all local state discusses how specifications can help browsers avoid this mistake.
Fingerprinting consists of using attributes of the user's browser and platform that are consistent between the two visits and probabilistically unique to the user.
The attributes can be exposed as information about the user's device that is otherwise benign (as opposed to § 3.3 Sensitive information disclosure). For example:
See [fingerprinting-guidance] for how to mitigate this threat.
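To illustrate the threat (and not to endorse the practice), the sketch below combines a handful of commonly exposed attributes into a single probabilistic identifier that survives clearing storage; real fingerprinters draw on many more sources of entropy, which is why [fingerprinting-guidance] asks specifications to minimise the entropy they expose.

```ts
// Illustration of the fingerprinting threat only: a handful of stable,
// widely exposed attributes are hashed into a probabilistic identifier
// that persists even after cookies and storage are cleared.
async function naiveFingerprint(): Promise<string> {
  const attributes = [
    navigator.userAgent,
    navigator.language,
    String(navigator.hardwareConcurrency),
    `${screen.width}x${screen.height}x${screen.colorDepth}`,
    Intl.DateTimeFormat().resolvedOptions().timeZone,
  ].join("|");
  const digest = await crypto.subtle.digest(
    "SHA-256",
    new TextEncoder().encode(attributes)
  );
  return Array.from(new Uint8Array(digest))
    .map((b) => b.toString(16).padStart(2, "0"))
    .join("");
}
```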
Contributes to surveillance, correlation, and identification, usually more significantly than § 3.1 Unwanted same-site recognition.
This occurs if a site can determine with high probability that a visit to that site comes from the same user as another visit to a different site.
Contributes to correlation, identification, secondary use, and disclosure.
Many pieces of information about a user could cause privacy harms if disclosed. For example:
A particular piece of information may have different sensitivity for different users. Language preferences, for example, might typically seem innocent, but also can be an indicator of belonging to an ethnic minority. Precise location information can be extremely sensitive (because it's identifying, because it allows for in-person intrusions, because it can reveal detailed information about a person's life) but it might also be public and not sensitive at all, or it might be low-enough granularity that it is much less sensitive for many users.
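As an illustration of the granularity point, coordinates can be coarsened before they are stored or shared; rounding to one decimal degree, for example, reduces precision to roughly 11 km of latitude. The rounding sketch below is illustrative only.

```ts
// Illustrative sketch: coarsen a location to reduce its sensitivity.
// One decimal degree of latitude corresponds to roughly 11 km.
function coarsen(
  latitude: number,
  longitude: number,
  decimals = 1
): { latitude: number; longitude: number } {
  const factor = 10 ** decimals;
  return {
    latitude: Math.round(latitude * factor) / factor,
    longitude: Math.round(longitude * factor) / factor,
  };
}

// coarsen(48.85837, 2.29448) -> { latitude: 48.9, longitude: 2.3 }
```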
When considering whether a class of information is likely to be sensitive to users, consider at least these factors:
Issue(16): This description of what makes information sensitive still needs to be refined.
See intrusion.
Privacy harms don't always come from a site learning things. For example, it is intrusive for a site to
if the user doesn't intend for it to do so.
Contributes to misattribution.
For example, a site that sends SMS without the user's intent could cause them to be blamed for things they didn't intend.