Lost in moderation: How commercial content moderation APIs over- and under-moderate group-targeted hate speech and linguistic variations

dc.contributor.author: Hartmann, David
dc.contributor.author: Oueslati, Amin
dc.contributor.author: Staufer, Dimitri
dc.contributor.author: Pohlmann, Lena
dc.contributor.author: Munzert, Simon
dc.contributor.author: Heuer, Hendrik
dc.date.accessioned: 2025-06-02T08:44:17Z
dc.date.available: 2025-06-02T08:44:17Z
dc.date.issued: 2025
dc.description.abstract: Commercial content moderation APIs are marketed as scalable solutions to combat online hate speech. However, the reliance on these APIs risks both silencing legitimate speech, called over-moderation, and failing to protect online platforms from harmful speech, known as under-moderation. To assess such risks, this paper introduces a framework for auditing black-box NLP systems. Using the framework, we systematically evaluate five widely used commercial content moderation APIs. Analyzing five million queries based on four datasets, we find that APIs frequently rely on group identity terms, such as “black”, to predict hate speech. While OpenAI’s and Amazon’s services perform slightly better, all providers under-moderate implicit hate speech, which uses codified messages, especially against LGBTQIA+ individuals. Simultaneously, they over-moderate counter-speech, reclaimed slurs, and content related to Black, LGBTQIA+, Jewish, and Muslim people. We recommend that API providers offer better guidance on API implementation and threshold setting and more transparency on their APIs’ limitations. Warning: This paper contains offensive and hateful terms and concepts. We have chosen to reproduce these terms for reasons of transparency.
dc.identifier.citation: Hartmann, D., Oueslati, A., Staufer, D., Pohlmann, L., Munzert, S., & Heuer, H. (2025). Lost in moderation: How commercial content moderation APIs over- and under-moderate group-targeted hate speech and linguistic variations. Proceedings of the 2025 CHI conference on human factors in computing systems. https://doi.org/10.1145/3706598.3713998
dc.identifier.doi: 10.1145/3706598.3713998
dc.identifier.isbn: 979-8-4007-1394-1
dc.identifier.uri: https://www.weizenbaum-library.de/handle/id/901
dc.language.iso: eng
dc.publisher: Association for Computing Machinery
dc.rights: open access
dc.rights.uri: https://creativecommons.org/licenses/by/4.0/
dc.subject: content moderation APIs
dc.subject: audit
dc.subject: AI transparency and accountability
dc.subject: human-AI interaction in content moderation
dc.subject: algorithmic bias in hate speech detection
dc.title: Lost in moderation: How commercial content moderation APIs over- and under-moderate group-targeted hate speech and linguistic variations
dc.type: ConferencePaper
dc.type.status: publishedVersion
dcmi.type: Text
dcterms.bibliographicCitation.url: https://doi.org/10.1145/3706598.3713998
local.researchgroup: Daten, algorithmische Systeme und Ethik
local.researchtopic: Digitale Technologien in der Gesellschaft
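As context for the abstract's recommendation on threshold setting, the following is a minimal illustrative sketch of how a single text could be scored against one of the audited services, OpenAI's moderation endpoint, via the openai Python SDK. The model name, the choice of the hate category, and the 0.5 decision threshold are assumptions made for illustration only and are not taken from the paper's audit framework.

# Illustrative sketch only: score one text with a commercial moderation API
# (here OpenAI's moderation endpoint) and apply a caller-chosen threshold.
# The model name and the 0.5 threshold are assumptions for illustration.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def flag_as_hate(text: str, threshold: float = 0.5) -> bool:
    """Return True if the hate-category score meets or exceeds the threshold."""
    response = client.moderations.create(
        model="omni-moderation-latest",  # assumed model name
        input=text,
    )
    result = response.results[0]
    hate_score = result.category_scores.hate  # score between 0 and 1
    return hate_score >= threshold


if __name__ == "__main__":
    print(flag_as_hate("example text to check"))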
Files
Original bundle
Name: Hartmann_ea_Lost-in-Moderation.pdf
Size: 1.1 MB
Format: Adobe Portable Document Format