Last updated 18 Mar 2026
This piece accompanies the draft technical standard proposal itself, available here.
Introduction
To run a social tech product is to be in the business of content moderation. The social tech industry exists to spread user-generated content (UGC) and to monetize its consumption. The management of information is an essential business function, and a major part of that information management is termed “content moderation.” That is, “the act of reviewing user-generated content to detect, identify or address reports of content or conduct that may violate applicable laws or a digital service’s content policies or terms of service.” (Digital Trust & Safety Partnership Glossary)
Different people want different things when they’re on the internet. That means content moderation is a normative exercise: there is no “correct” rule for what content the platform decides to allow. For example, some social tech platforms allow sexualized nudity, while others prohibit it. The first point of Mike Masnick’s Impossibility Theorem is “any moderation is likely to end up pissing off those who are moderated.” While there is no correct, universal answer to what content a platform should decide to allow, there are greater and lesser degrees of competence to how the platform handles the inevitable task of content moderation. There are better and worse practices. This piece proposes that some of the better attributes of Trust & Safety systems be codified as semantic technical standards.
Variation, Standardization, and Maturity
Businesses may approach content moderation differently because of differences in the ways their product works, their revenue models, or the cultural norms they promote among their users. The freedom to operate independently on each of these dimensions creates a complicated relationship between varied social tech businesses and the adoption of standardized methods. On the one hand, technical standards can provide reliability and speed, by resolving known problems with known and straightforwardly replicable solutions. On the other hand, if technical standards introduce requirements that challenge the business’s product or strategy, they can come at the cost of innovation.
The tension between innovation and standardization is well reflected in the evolution from Web 1.0 to Web 2.0. Darcy DiNucci contemplated this in the 1999 article Fragmented Future, writing:
For practical and competitive reasons (the tendency of any company to differentiate among its own products), we’re bound to see a proliferation of new Web publishing formats… developers will cleave to existing standards when practical, and strike out on their own when it provides competitive advantages.
Technical standards have yet to arrive on the content moderation scene. In the meantime, other quasi-standardizing currents have appeared. A new industry and academic field of “Trust and Safety” emerged with the professionalization of content moderation services. Industry associations such as the Digital Trust and Safety Partnership (DTSP), the Trust and Safety Professional Association (TSPA), and the Integrity Institute (II) codify “best practices” with strategic frameworks and analysis, curricula, assessment instruments, and cookbooks. In recent years, governments have begun passing laws expressly about content moderation, most notably the Digital Services Act in the EU and the Online Safety Act in the UK. And in July 2025, the International Organization for Standardization adopted ISO/IEC 25389, developed by DTSP. This is not a technical standard; rather, DTSP describes it as an “overall framework and set of aims for what constitutes a responsible approach to managing content- and conduct-related risks, to which digital services can then map their specific practices.”
Further evidence of T&S industry maturity is the rapid growth of the content moderation market. Social tech businesses routinely contract off-the-shelf solutions to known problems from vendors like Hive (“offers turnkey software products”), Cinder (“providing solutions that are not only effective now but also scale to meet future challenges”), or SafetyKit (“Ship products, add features, and expand to new regions without risk and compliance slowing you down”). An open-source T&S tooling consortium, ROOST, launched in 2025, further reflecting a level of industry-wide maturity in T&S practices.
These modes of standardization are broad. Industry recommendations from practitioners on the one hand, and new zones of liability from regulators on the other, point the design and operation of T&S systems in a general direction, while leaving much of the technical detail alone. Industry and policymakers alike have trodden lightly around businesses’ need for flexibility and innovation, and for the most part have had very little to say about the specific design of technical systems.
However, large parts of social tech infrastructure are not unique, and rather follow one another in a well-trod path worn by shared business needs. For instance:
- They all operate on the internet, a well-established system of shared hardware and of stable technical standards (the web).
- They virtually all conform to a narrow set of cybersecurity protocols.
- They virtually all support two mobile operating systems (Android and iOS), in their various versions, as well as a small set of desktop operating systems and web browsers.
Beyond these deeper parts of the stack, all social tech also confronts basic “read” and “write” requests from clients: all platforms disallow certain kinds of activity on their network. The most rote of these rules simply concerns the size or speed of inbound traffic. Platforms’ terms of service uniformly prohibit activity that could amount to a Distributed Denial of Service (DDoS) attack, for example. Since 1994, the robots.txt web standard has helped implement anti-scraping Terms of Service in a programmatic way. The further we delve into content moderation strategies, the more we see that many attributes of content moderation systems are in fact universal across social tech platforms.
The origins of these “best practices”–peer-generated, convergent norms cosigned by some critical mass of practitioners–fit the anatomy of how technical specifications, protocols, and ultimately standards develop in an industry.
On Technical Standards
Technical standards excel where “best practices” fall short.
- First, the primary audience of technical standards is often software engineers. If Lawrence Lessig’s adage, “code is law,” is true, then the software engineer is the ultimate policymaker - what Michael Lipsky might call a “street-level bureaucrat.” By equipping the engineer directly with a set of parameters, the trust and safety field matures dramatically: we can move past the “PDF Report” and “risk assessment memo” step of many best practices, and directly into minimum architecture and code requirements.
- Second, technical standards are not squishy: while they may be extensible and configurable, they are precise methods for meeting known needs. If the current corpus of T&S “best practices” forgoes specificity for the sake of flexibility, technical standards advance both.
- Third, technical standards for content moderation are not just for the benefit of the Trust and Safety industry; they also facilitate the growth of the greater social tech industry itself. One new frontier of social tech is federated, interoperable applications. The EU’s Digital Markets Act sets out new requirements for interoperability. Even Meta, in its launch of Threads on the ActivityPub protocol, explained the move as “giv[ing] people more choice.” From APIs that enable middleware content moderation to directly compatible messaging systems, social tech is rapidly evolving beyond the proprietary methods of its origins. And yet, the documentation of the two leading fediverse protocols offer only general guidance about moderation methods:
- ActivityPub (e.g. Mastodon, Threads, WordPress): “it is recommended that servers filter incoming content both by local untrusted users and any remote users through some sort of spam filter” and “Servers should implement protections against denial-of-service attacks from other, federated servers. This can be done using, for example, some kind of ratelimiting mechanism.” (emphasis added)
- ATProto (Bluesky): “Moderation Primitives: The com.atproto.admin.* routes for handling moderation reports and doing infrastructure-level take-downs is specified in Lexicons but should also be described in more detail.” (emphasis added)
- Finally, coding agents dramatically lower the barriers to adoption of semantic technical standards. A developer can import the technical standard into their directory, and use Claude to create roadmap items that comply with the standard; or incorporate it into instruction markdown documents, almost like a local, internal robots.txt for ensuring implementation of these standards during development.
The social tech industry is ready for a standardized set of technical methods for a large majority of the content moderation system. Although platforms may vary considerably in their rules and enforcement for user expression–resulting from their product design, business strategy, and norms–this variation is a “last mile” delivery problem in their communication services. The well-established approaches and collective wisdom in the many steps upstream of what a user experiences are a good fit for technical standardization.
What should these technical standards be? Let’s look at what should constitute this safety stack: an abstracted sequence of operational protocols and technical methods that ensure that a controller of social tech is best positioned to steward the enormous volume of user-generated content on its network.
The Safety Stack
There are four layers constituting a stack for building social tech safely. This stack specifies operational protocols and technical methods to product teams for building user-facing products and features in a way that is “safe by design,” as well as standardized and extensible. In terms of organizing personnel, the stack is occupation-agnostic: each layer comprises work across all functional roles in the company (engineering, product management, product design, content moderation, data science, policy & legal, etc.).
How It Works
The attached spreadsheet has the following columns.
- Layer: Indicates which of the four layers (discussed below) the item is in.
- Group: A short qualitative categorization intended for improving comprehension, but with little bearing on substance.
- Standard Identifier: A handle for each item, following a simple convention.
- Function: A designation of whether the item is “technical” or “operational” in nature.
- Method: An imperative sentence containing the substantive, specific, technical “best practice.”
- Not E2EE: The requirements this standard sets out for platforms that are not end-to-end encrypted.
- Remarks-Not E2EE: Additional commentary on the prior column.
- E2EE: The requirements this standard sets out for platforms that are end-to-end encrypted.
- Remarks-E2EE: Additional commentary on the prior column.
The requirements themselves are one of the following:
- Must
- Should
- May
- Should Not
- Must Not
I borrowed this schema from early internet protocols, such as IETF’s RFC 1122, which sets requirements for hosts implementing TCP/IP. For precise definitions of these keywords, see https://datatracker.ietf.org/doc/html/rfc2119.
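Putting the column schema and the requirement levels together, a single row of the spreadsheet might be modeled like this sketch in Python. The identifier “D.1,” the group name, and the method text are hypothetical examples, not items from the actual standard:

```python
from dataclasses import dataclass
from enum import Enum

class Requirement(Enum):
    """RFC 2119-style requirement levels."""
    MUST = "Must"
    SHOULD = "Should"
    MAY = "May"
    SHOULD_NOT = "Should Not"
    MUST_NOT = "Must Not"

@dataclass
class StandardItem:
    layer: str             # "Detection" | "Review" | "Enforcement" | "Outcomes"
    group: str             # short qualitative categorization
    identifier: str        # handle, e.g. "D.1" (hypothetical)
    function: str          # "technical" | "operational"
    method: str            # imperative sentence stating the practice
    not_e2ee: Requirement  # requirement level for non-E2EE platforms
    e2ee: Requirement      # requirement level for E2EE platforms

# Hypothetical example row
item = StandardItem(
    layer="Detection",
    group="Logging",
    identifier="D.1",
    function="technical",
    method="Log every user-generated object with a stable identifier.",
    not_e2ee=Requirement.MUST,
    e2ee=Requirement.SHOULD,
)
```

Note that the same item can carry different requirement levels in the two privacy contexts, which is the point of the paired E2EE columns.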
Where Does this Come From?
As a conceptual framework, this 4-layer model reflects to a certain extent the OODA Loop - Observe, Orient, Decide, Act - developed by U.S. Air Force Colonel John Boyd in the 1970s. More recently, in 2018 the spam team at Meta conceptualized a “design pattern” for fighting spam as Observe, Classify, Respond (OCR), which further shaped the concepts of this Safety Stack.
The substance comes from the experiences and observations of T&S practitioners building and operating dozens of social tech platforms over the past decade. Its sources include roadmaps and workplans, crises, edge-cases, transparency reports, conference proceedings, white papers, academic papers, and press coverage. See also Four Functional Quadrants for Trust & Safety Tools: Detection, Investigation, Review & Enforcement (DIRE) in “Trust, Safety, and the Internet We Share: Multistakeholder Insights” (forthcoming), Camille Francois, Juliet Shen, Yoel Roth, Samantha Lai, Mariel Povolny; 30 July 2025. Retrieved 17 March 2026 from https://papers.ssrn.com/sol3/papers.cfm?abstract_id=5369158
The version documented here is a first draft attempt, and I intend and expect it to be developed considerably by input of many more practitioners.
Layer I: Detection
What can the platform operators “see”?
The Detection layer is concerned with the collection, storage, wrangling, and presentation of data. There are two sides of the Detection layer:
- Collection
- Presentation
Collection
This is where the platform logs or ingests data, and to what specific locations. The three other layers depend entirely on design decisions made here.
This is where the universe of entities, media, and object types is defined (“entities” such as user accounts; “media” such as text, image, video, live video, etc.; and “objects” such as post, comment, group, channel, etc.); where data is logged and stored (content or “objects” as well as behavior or “events”); and where data is exposed to other internal locations and tools. It is necessary but not sufficient to decide “we log this to the server v. to the client” - we must also say “…and specifically to data domain X, which is accessible to staff with permissions Y,” and even “and it renders correctly in Moderation Tool Z.”
User interactions with T&S systems through in-app reporting are a Detection-layer design concern. In addition, the Detection layer also concerns off-platform information collection, through channels like law enforcement notices and intelligence-gathering operations.
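As a rough illustration of the Collection side, here is a minimal sketch of an event record in Python. The media and object types follow the examples above, while the field names and the `data_domain` value are hypothetical; the point is that a log record must name not only what happened, but also the specific internal location it is written to:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum

class Media(Enum):
    TEXT = "text"
    IMAGE = "image"
    VIDEO = "video"
    LIVE_VIDEO = "live_video"

class ObjectType(Enum):
    POST = "post"
    COMMENT = "comment"
    GROUP = "group"
    CHANNEL = "channel"

@dataclass
class Event:
    """A behavioral log record: which entity did what, to which object, when."""
    actor_id: str             # the entity (user account) performing the action
    verb: str                 # e.g. "create", "report", "share"
    object_id: str
    object_type: ObjectType
    media: Media
    data_domain: str          # the internal data domain this record is written to
    timestamp: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc))

evt = Event("user:123", "create", "post:456", ObjectType.POST,
            Media.TEXT, data_domain="ugc_events")
```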
Presentation
To automated services
For automated content moderation, the Detection layer is where we ensure the right data is sent to the right services; event-handling systems, such as rules engines and classifiers, are examples of this. This also entails making strategic decisions about the quantity of events or content that we can support. Transforming content into a hash is a Detection-layer step.
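A minimal sketch of that hashing step, using a cryptographic hash for simplicity (media platforms more often use perceptual hashes, but the Detection-layer role is the same: fingerprint the content without judging it):

```python
import hashlib

def content_hash(content: bytes) -> str:
    """Fingerprint content without judging it (a Detection-layer step).

    A cryptographic hash is used here for illustration; the output is a
    stable identifier that downstream layers can match against."""
    return hashlib.sha256(content).hexdigest()

digest = content_hash(b"example user-generated content")
```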
To human-readable applications
For manual content moderation, for both reactive review of inbound reports and proactive investigations, the Detection layer is where review tooling lives. The design of this layer ensures that the necessary content and fact patterns are accessible to the appropriate members of staff.
On Policies
The Detection layer is unopinionated with respect to Content Policy. The platform does not make value judgments about UGC being “acceptable” or “unacceptable.” The only relevant policies here are the Terms of Service, the Privacy Policy, data retention policies, and other back-end access considerations; not “Content Policy” (variably named Community Standards, Community Guidelines, etc).
In the attached technical standard, Detection standards begin with the letter D.
Layer II: Review
What normative judgments does the platform make about what it sees?
The Review layer is concerned with what is permissible or impermissible under the platform’s policies and relevant laws. It consists of rules and tradeoff thresholds.
Rules
The Review layer is where the platform decides its rules for content and behavior. It is opinionated. It is defined first by the Content Policies - both the public-facing documentation as well as more detailed, internal guidelines and operational protocols. Additional content policies - such as what content is allowed to be monetized by users; what apps are allowed to be built by third-party developers on a public API; and so forth - also live within the Review layer.
Internal protocols for considering user appeals and for evaluating moderation performance (Quality Assurance), as well as the definition of performance thresholds for these operational workflows, all live within the Review layer.
Decisions and Designations
This layer is also where thresholds (e.g. rate limits) are specified and tradeoff decisions (e.g. tolerance for False Positive and False Negative risk) are made. The system responses at these thresholds are implemented in the Enforcement layer, below.
This is the realm of labeling: violating, borderline-violating, non-violating; allowances, exceptions; spirit of the law, letter of the law; quality assurance; prevalence measurement. This is also where precision, recall and other metric concepts related to “accuracy” are established and maintained.
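To make the tradeoff concrete, here is a toy precision/recall computation at a given classifier threshold. The scores, labels, and threshold value are all hypothetical; the choice of threshold is the Review-layer judgment:

```python
def precision_recall(scores, labels, threshold):
    """Precision and recall for a hypothetical classifier at a decision
    threshold. Raising the threshold generally reduces False Positives at
    the cost of more False Negatives; the chosen value is the tradeoff."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and not y)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y)
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    return precision, recall
```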
High-touch content moderation - both proactive investigations and reactive escalations to policy experts or corporate executives, requiring some “case management” - all live within the Review layer.
Matching a hash of content is a Review-layer step, because it entails making a normative judgment (e.g. setting a quantitative threshold) about what counts as “similar enough.” Because hashing is an unopinionated, Detection-layer operation, the action of “hash matching” - an industry-standard method for quickly (often immediately, or even preventively) identifying policy-violating content by comparing it to a database of known violating content - straddles the Detection and Review layers.
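A sketch of the normative half of hash matching, assuming 64-bit perceptual-style hashes. The threshold value is purely illustrative; setting it is precisely the Review-layer judgment described above:

```python
def hamming_distance(a: int, b: int) -> int:
    """Count of differing bits between two perceptual hashes."""
    return bin(a ^ b).count("1")

# The threshold is the Review-layer judgment: how many differing bits
# still count as "similar enough"? The value 8 is purely illustrative.
SIMILARITY_THRESHOLD = 8

def matches_known_violating(candidate: int, known_hashes: set) -> bool:
    """Compare a candidate hash against a database of hashes of known
    violating content (the Detection layer supplies both sides)."""
    return any(hamming_distance(candidate, known) <= SIMILARITY_THRESHOLD
               for known in known_hashes)
```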
The designation of certain entities (e.g. users; pages) as special classes (e.g. managed partners, verified celebrities) through things like UI elements (a “verified checkmark”) or internal controls (“shielding” to prevent false-positive enforcement on high profile entities) is a Review-layer consideration.
In the attached technical standard, Review standards begin with the letter R.
Layer III: Enforcement
What actions does the platform take?
The Enforcement layer is concerned with the consequences that follow when content or behavior is deemed impermissible in the Review layer.
Enforcement actions, sometimes called “interventions,” are material changes to the product experience that a platform makes in response to a determination–made in the Review layer–that content or behavior violates the platform’s content policies. In product parlance, an enforcement action is known as a “feature limit.” Here are some common examples of the arrows in the enforcement quiver, increasing in severity:
- demote: cause content to appear less frequently in a feed or in search results.
- blur: place a visual effect over content so a user must click additional times in order to see it.
- interstitial or modal: place a message (such as a warning) over content so that users must pass through additional UX before accessing it.
- captcha/verification: require a user to move through a UX flow that verifies their identity to some extent.
- timeout/cooldown: disable certain functionalities available to a user for a period of time.
- log-out: end the user’s session and require that they re-enter account credentials to access the platform again.
- delete: remove UGC from the platform.
- ban: remove a user from the platform.
- bulk action: apply a feature limit, such as those listed, to many entities at once through a script.
- demonetize: remove an actor’s ability to collect revenue on the platform.
The Enforcement layer is where signal-sharing with off-platform entities, such as law enforcement and other clearinghouses (NCMEC; GIFCT; StopNCII), is implemented.
Treatment of Entities
While the designation of “shielded” entities (for False Positive error prevention) is a Review-layer consideration, the implementation of specific shielding methods (e.g. adding an entity to a hard-coded list; implementing an automatic heuristic) exists in the Enforcement layer. Relatedly, the application or removal of a front-end verification badge (e.g. checkmark icon) is an Enforcement-layer action. This layer is where onboarding entities to and suspending entities from special levels of access or featuresets - such as monetization or advertising programs or API access - occurs. In addition, internal labeling or profiling–such as adding a flag to a user account for a reviewer to consider them “suspicious”–is also an Enforcement-layer activity.
The construction of new entities (e.g. clusters of accounts) or signal patterns is an Enforcement-layer exercise, but–interestingly–is also a handshake back to the Detection layer. For example, a platform might define a “person” in terms of an IP address, a device ID, a pattern in the names of registration email addresses associated with multiple user accounts, and so forth. In so doing, it “constructs” the entity of “person” and may apply preemptive enforcements, such as a captcha or an outright ban, if it suspects this is a repeat offender of content policies. In defining new parameters that instantiate an entity (“person”) this way, it also reflects the “defining entities” faculty of the Detection layer. That creates a new sort of “sieve” for collecting and organizing information, and is an example of how these layers interact.
Strikes
The Enforcement layer is where a “strikes system” is housed - an internal accounting ledger that keeps a record of what content policies each user has violated, what penalties those violations carry, and how feature limits escalate, up to the user being banned from the platform.
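A minimal sketch of such a ledger. The strike thresholds and penalty names are illustrative, not prescribed by the standard:

```python
from collections import defaultdict

# Illustrative escalation ladder: thresholds and penalty names are
# hypothetical. Feature limits grow in severity as strikes accumulate.
ESCALATION = [(1, "warn"), (3, "timeout"), (5, "demonetize"), (7, "ban")]

class StrikeLedger:
    """Internal accounting of each user's recorded policy violations."""

    def __init__(self):
        self._strikes = defaultdict(list)  # user_id -> violated policy names

    def record(self, user_id: str, policy: str) -> str:
        """Record a violation and return the resulting feature limit."""
        self._strikes[user_id].append(policy)
        count = len(self._strikes[user_id])
        action = "none"
        for threshold, penalty in ESCALATION:
            if count >= threshold:
                action = penalty
        return action
```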
In the attached technical standard, Enforcement standards begin with the letter E.
Layer IV: Outcomes
How does the platform hold itself accountable for good content moderation?
Metrics
The Outcomes layer is where the platform defines success in terms of specific metric compositions. Its internal incentives - reducing “Bad” things and maximizing “Good” things - are established here. For example, a platform might define Good MAU (users who have no or few recorded violations of the content policies) as MAU minus Bad MAU (users who have some or many recorded violations), report Good MAU as the relevant topline goal to executives and shareholders, and reward staff through performance evaluations tied to Good MAU. Relatedly, attributing increases in “bad” metrics to specific product surfaces and feature launches through the construction of data pipelines and metric trees is an exercise of the Outcomes layer.
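As a toy illustration of the Good MAU arithmetic, where the `tolerance` knob is a hypothetical stand-in for the “no or few violations” definition:

```python
def good_mau(active_users: set, violations: dict, tolerance: int = 0) -> int:
    """Good MAU = MAU minus Bad MAU. A 'bad' user is an active user with
    more recorded policy violations than the hypothetical tolerance allows."""
    bad_mau = {u for u in active_users if violations.get(u, 0) > tolerance}
    return len(active_users) - len(bad_mau)
```

Note that non-active users with violations do not affect the metric; Bad MAU is a subset of MAU by construction.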
Transparency
The creation of transparency reports–both for regulatory compliance (e.g. DSA Article 15) and in support of its own voluntary marketing strategies–lives in the Outcomes layer.
The Outcomes layer contains in-product transparency features, such as an account-level “Status” dashboard where the platform exposes directly to each user the same accounting records that it structures in the Enforcement layer. Push notifications or emails informing users of policy violations are simpler versions of transparency methods in the Outcomes layer.
Business Management
KPIs and business insights relating to operational capacity management live in the Outcomes layer. This includes things like the portfolio management of human content moderators (given metrics like Average Handling Time and parameters like SLA and price) and engineering resources for automated methods like ML classifiers. Business management decisions like “build v. buy” of content moderation capabilities also are part of the Outcomes layer.
Communications
Finally, public marketing copy (blog posts, tweets, and help center articles) and public representations made to users, press, regulators, legislators, and courts are all Outcomes layer activities. For example, a platform saying “We review content that may violate our policies…” leaves ambiguity: all or some such content? Reviewed by humans or machines? Proactively or reactive to a user-generated report? For what harm types, in what prioritization? In what jurisdiction or languages?
The adoption of technical standards is useful even in merely standardizing public dialogue concerning a platform’s efforts at content moderation. It is hard for a platform to know what precisely a regulator’s requirements may mean, and hard for a regulator to know if a platform is making complete or obfuscated representations; and harder still for the regulator to fact-find. It is more efficient and exhaustive for all parties to say “The platform has implemented standard XYZ.123.”
Assumptions & Considerations
This proposed 4-layer standard has notable limitations, some of which I’ll name here.
Network
I assume a classic server-client model, with a centralized controller receiving requests from many clients. Note that this standard is applicable to fediverse (ActivityPub, ATProto) networks, because those are server-client architectures. It is not applicable to mesh (peer-to-peer; blockchain) networks.
Encryption
Today, a large majority of social tech platforms are not end-to-end encrypted, which accommodates much of the “Detection”-layer data collection specified in this standard. However, my personal view is that end-to-end encryption is a best practice, and if we intend this Safety Stack to reflect best practices, it must also accommodate encryption. In the attached documentation, I’ve created two columns to address this–“Not E2EE” and “E2EE”–and I believe a T&S technical standard should have different requirements in those different privacy contexts. Most of these differences are in the Detection layer.
Out of Scope
These 4 layers are oriented towards classic, centralized, platform-owned system design for a controller. There are many other facets to Trust & Safety and content moderation that are outside of this scope. I want to acknowledge some examples of the capabilities not addressed by these standards:
- User controls–equipping users with the ability to block or filter (share customized lists and labels)
- Middleware and public API support for third-party moderation solutions
- Hierarchical content moderation (e.g. community moderators; oversight boards and consultative councils)
- Detailed investigation methods (e.g. network analysis tools and procedures)
- Oversight by public interest (e.g. research APIs, additional transparency reports, etc)
Some of these approaches may prove to be standardizable in the future, while others are “last mile” product decisions that should probably remain unique to different platforms’ idiosyncrasies.
Next Steps
In September 2025, the Stanford T&S Research Conference hosted a workshop (Towards Technical Standards: Safety By Design & Open Source) that some colleagues and I developed on this topic. My hope is that T&S practitioners continue to engage with the proposed standards and provide rigorous feedback. I will seek input, and I would love to help create a more formal survey of practitioners in the future.
I welcome collaborators, co-authors, and good-faith detractors galore! Please add comments to this doc and to the accompanying spreadsheet.
18 March 2026:
- Shared to Integrity Institute Slack/General channel
- Shared to ROOST Discord Server/General channel