Conference Papers
Permanent URI for this collection: https://www.weizenbaum-library.de/handle/id/1114
Item Maintaining Stable Personas? Examining Temporal Stability in LLM-Based Human Simulation (2026-01-20)
Gonnermann-Müller, Jana; Haase, Jennifer; Leins, Nicolas; Kosch, Thomas; Pokutta, Sebastian; Oliver, Nuria; Shamma, David A.; Candello, Heloisa; Cesar, Pablo; Lopes, Pedro; Artizzu, Valentino; Draxler, Fiona; López, Gustavo; Reinschluessel, Anke V.; Tong, Xin; Toups Dugas, Phoebe O.
Large language models (LLMs) are increasingly employed in Human-Computer Interaction (HCI) research to simulate human behavior for prototype testing and social simulations. The validity of these interactions rests on the assumption that LLMs maintain stable personas. Our work investigates temporal stability in LLM-based human simulation, examining both stability across independent instantiations and stability within extended interactions. We combined self-reports with observer ratings across four persona intensity levels (low, moderate, and high ADHD representations, plus a default persona), seven LLMs, and three persona prompts. Results from N = 3,473 conversations and N = 4,054 assessments indicate that LLMs generally reproduce personas across conversations in both self-reports and observer ratings, suggesting that LLMs hold promise as tools for simulating human behavior. Within extended 18-turn interactions, however, observer ratings reveal a decline for moderate and high personas, a discrepancy that warrants further investigation. Our findings yield methodological considerations for HCI researchers employing LLM-based human simulation and implications for future research.

Item Knowledge-Enhanced Language Models Are Not Bias-Proof: Situated Knowledge and Epistemic Injustice in AI (2024-06-03)
Kraft, Angelie; Soulier, Eloïse
The factual inaccuracies ("hallucinations") of large language models have recently inspired more research on knowledge-enhanced language modeling approaches. These are often assumed to enhance the overall trustworthiness and objectivity of language models, while the issue of bias is usually only mentioned as a limitation of statistical representations. This dissociation of knowledge enhancement and bias is in line with previous research on AI engineers’ assumptions about knowledge, which indicates that this community commonly understands knowledge as objective and value-neutral. We argue that claims and practices by actors in the field still reflect this underlying conception of knowledge. We contrast this assumption with literature from social and, in particular, feminist epistemology, which argues that the idea of a universal disembodied knower is blind to the reality of knowledge practices and seriously challenges claims of "objective" or "neutral" knowledge. Knowledge-enhancement techniques commonly use Wikidata and Wikipedia as their knowledge sources, owing to their large scale, public accessibility, and assumed trustworthiness. In this work, they serve as a case study for the influence of the social setting and the identity of knowers on epistemic processes. Indeed, the communities behind Wikidata and Wikipedia are known to be male-dominated, and many instances of hostile behavior have been reported in the past decade. In effect, the contents of these knowledge bases are highly biased, so it is doubtful that they would contribute to bias reduction. In fact, our empirical evaluations of RoBERTa, KEPLER, and CoLAKE demonstrate that knowledge enhancement may not live up to the hopes of increased objectivity. In our study, the average probability of stereotypical associations was preserved on two out of three metrics, and performance-related gender gaps on knowledge-driven tasks were also preserved. We build on these results and on critical literature to argue that the label of "knowledge" and the commonly held beliefs about it can obscure the harm that is still done to marginalized groups. Knowledge enhancement risks perpetuating epistemic injustice, and AI engineers’ understanding of knowledge as objective per se conceals this injustice. Finally, to get closer to trustworthy language models, we need to rethink knowledge in AI and aim for an agenda of diversification and scrutiny from outgroup members.
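The stereotype metrics mentioned in this abstract can be illustrated with a small probe. Below is a minimal sketch, assuming the Hugging Face `transformers` library; the templates and pronoun targets are invented for illustration, and this is not the paper's evaluation protocol.

```python
# Minimal masked-token stereotype probe using the `transformers`
# fill-mask pipeline. Illustrative only: the templates and targets
# below are invented, not the metrics used in the paper.
from transformers import pipeline

# RoBERTa is one of the models evaluated; KEPLER and CoLAKE are
# knowledge-enhanced models built on comparable architectures.
fill = pipeline("fill-mask", model="roberta-base")

templates = [
    "The nurse said that {} would be back soon.",
    "The engineer said that {} would be back soon.",
]

for template in templates:
    masked = template.format(fill.tokenizer.mask_token)
    # `targets` restricts scoring to the listed candidate fillers.
    results = fill(masked, targets=[" he", " she"])
    scores = {r["token_str"].strip(): r["score"] for r in results}
    print(f"{template}: P(he)={scores.get('he', 0.0):.3f}, "
          f"P(she)={scores.get('she', 0.0):.3f}")
```

An audit in the spirit of the paper would aggregate such probabilities over many templates and attribute pairs, for both the base model and its knowledge-enhanced counterparts, and compare the averages.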
Item If I were a digital application, I would be ChatGPT: Student perspectives on digital technology in Togo (2025-11-04)
Fröbel, Friederike; Lange, Carina; Gschwendtner, Philipp; Foli-Bebe, Ousia; Afanou, Séti; Tsamedi, Victoire; Joost, Gesche; Lazem, Shaimaa; Anya, Obinna; Saleh, Mennatallah; Nkwo, Makuochi S.; Gamundani, Attlee M.; Ogunyemi, Abiodun A.; Isafiade, Omowunmi E.
Rural West African communities, particularly in Togo, are disproportionately affected by the impacts of climate change despite their minimal contribution to it. The ability to adapt to this change depends on access to information and resilient agricultural practices. Developing appropriate information and communication technology (ICT) solutions for subsistence farmers requires thorough contextual understanding. This qualitative study investigates Togolese university students’ perspectives on digital technologies. It also explores the potential of these students to serve as a bridge to rural subsistence farmers in the process of co-creating digital technologies. As part of the study, a two-day workshop was conducted, focused on co-creating knowledge using do-it-yourself (DIY) ICT for climate change adaptation in rural land management. The following question was posed to participants: "If you were a technical or digital application, what would it be, and why?" The analysis of the student responses offers insights into their perspectives on digital technologies and their potential applications for addressing climate challenges. This research contributes a methodological approach to understanding insider (emic) viewpoints from an outsider (etic) perspective to inform the development of sustainable ICT applications in rural sub-Saharan West Africa.

Item Deepfakes on Demand: The rise of accessible non-consensual deepfake image generators (2025-06-23)
Hawkins, Will; Mittelstadt, Brent; Russell, Chris
Advances in multimodal machine learning have made text-to-image (T2I) models increasingly accessible and popular. However, T2I models introduce risks such as the generation of non-consensual depictions of identifiable individuals, otherwise known as deepfakes. This paper presents an empirical study of the accessibility of deepfake model variants online. Through a metadata analysis of thousands of publicly downloadable model variants on two popular repositories, Hugging Face and Civitai, we demonstrate a sharp rise in easily accessible deepfake models. Almost 35,000 publicly downloadable deepfake model variants are identified, primarily hosted on Civitai. These models have been downloaded almost 15 million times since November 2022 and target a range of individuals, from global celebrities to Instagram users with under 10,000 followers. Both Stable Diffusion and Flux models are used to create deepfake variants, with 96% of these targeting women, and many signal intent to generate non-consensual intimate imagery (NCII). Deepfake model variants are often created via the parameter-efficient fine-tuning technique known as low-rank adaptation (LoRA), requiring as few as 20 images, 24 GB of VRAM, and 15 minutes of time, making the process widely accessible on consumer-grade computers. These models violate the Terms of Service of hosting platforms, and regulation seeks to prevent their dissemination; the results therefore emphasise the pressing need for greater action against the creation of deepfakes and NCII.
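The scale of such a metadata analysis can be approximated with the public Hugging Face Hub client. The sketch below is a rough illustration: the `lora` tag filter and the small result limit are assumptions of this example, not the authors' collection pipeline or their criteria for identifying deepfake variants.

```python
# Rough sketch of a repository metadata scan with `huggingface_hub`.
# The tag filter and limit are illustrative assumptions, not the
# selection criteria used in the paper.
from huggingface_hub import HfApi

api = HfApi()

# List model repos carrying the "lora" tag, most-downloaded first.
for m in api.list_models(filter="lora", sort="downloads", direction=-1, limit=25):
    # Fetch the full record to read download counts and creation dates.
    info = api.model_info(m.id)
    print(info.id, info.downloads, info.created_at)
```

Deciding which of these variants depict real, identifiable people, as the paper does, additionally requires analysing the names and descriptions in the repository metadata.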
Item Social Bias in Popular Question-Answering Benchmarks (2025-12)
Kraft, Angelie; Simon, Judith; Schimmler, Sonja; Inui, Kentaro; Sakti, Sakriani; Wang, Haofen; Wong, Derek F.; Bhattacharyya, Pushpak; Banerjee, Biplab; Ekbal, Asif; Chakraborty, Tanmoy; Singh, Dhirendra Pratap
Question-answering (QA) and reading comprehension (RC) benchmarks are commonly used to assess the capabilities of large language models (LLMs) to retrieve and reproduce knowledge. However, we demonstrate that popular QA and RC benchmarks do not cover questions about different demographics or regions in a representative way. We perform a content analysis of 30 benchmark papers and a quantitative analysis of 20 of the respective benchmark datasets to learn (1) who is involved in benchmark creation, (2) whether the benchmarks exhibit social bias, and whether bias is addressed or prevented, and (3) whether the demographics of the creators and annotators correspond to particular biases in the content. Most of the benchmark papers analyzed provide insufficient information about those involved in benchmark creation, particularly the annotators. Notably, just one (WinoGrande) explicitly reports measures taken to address social representation issues. Moreover, the data analysis revealed gender, religion, and geographic biases across a wide range of encyclopedic, commonsense, and scholarly benchmarks. Our work adds to the mounting criticism of AI evaluation practices and highlights biased benchmarks as a potential source of LLM bias, since they incentivize biased inference heuristics.

Item The Lifecycle of “Facts”: A Survey of Social Bias in Knowledge Graphs (2022-11)
Kraft, Angelie; Usbeck, Ricardo; He, Yulan; Ji, Heng; Li, Sujian; Liu, Yang; Chang, Chua-Hui
Knowledge graphs are increasingly used in a plethora of downstream tasks and in the augmentation of statistical models to improve factuality. However, social biases are engraved in these representations and propagate downstream. We conducted a critical analysis of the literature on biases at different steps of the knowledge graph lifecycle. We investigated the factors that introduce bias, as well as the biases that knowledge graphs and their embedded versions subsequently render. We discuss the limitations of existing measurement and mitigation strategies and propose paths forward.

Item When is liquid democracy possible? (2025-06-13)
Chatterjee, Krishnendu; Gilbert, Seth; Schmid, Stefan; Svoboda, Jakub; Yeo, Michelle
Liquid democracy is a transitive vote delegation mechanism over voting graphs. It enables each voter to delegate their vote(s) to another, better-informed voter, with the goal of collectively making a better decision. Whether liquid democracy outperforms direct voting has previously been studied in the context of local delegation mechanisms (where voters can only delegate to someone in their neighbourhood) and binary decision problems. It has been shown that local delegation mechanisms cannot outperform direct voting in general graphs. This raises the question: for which classes of graphs do local delegation mechanisms yield good results? In this work, we analyse (1) properties of specific graphs and (2) properties of local delegation mechanisms on these graphs, determining where local delegation actually outperforms direct voting. We show that a critical graph property enabling liquid democracy is that the voting outcome of local delegation mechanisms preserves a sufficient amount of variance, thereby avoiding situations where delegation falls behind direct voting. These insights allow us to prove our main results, namely that there exist local delegation mechanisms that perform no worse, and in fact quantitatively better, than direct voting in natural graph topologies such as complete, random d-regular, and bounded-degree graphs, lending a more nuanced perspective to previous impossibility results.
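A toy simulation makes the comparison between direct voting and local delegation concrete. The sketch below is a simplified illustration under assumptions of our own choosing (uniformly random competences, delegation only to a strictly more competent neighbour); it is not one of the mechanisms analysed in the paper.

```python
# Toy comparison of direct voting vs. a simple local delegation
# mechanism on a random d-regular graph. The competence model and
# delegation rule are illustrative assumptions, not the paper's.
import random
import networkx as nx

def majority_correct(n=101, d=6, delegate=False, rng=random):
    G = nx.random_regular_graph(d, n)
    # Voter i votes for the correct alternative with probability p[i].
    p = {i: rng.uniform(0.4, 0.8) for i in G.nodes}
    weight = {i: 1 for i in G.nodes}
    voters = set(G.nodes)
    if delegate:
        for i in list(G.nodes):
            # Delegate only to a strictly more competent neighbour;
            # strictness rules out cycles, since competence increases
            # along every delegation chain.
            best = max(G.neighbors(i), key=lambda j: p[j])
            if p[best] > p[i]:
                # Follow the chain of delegations to its final voter.
                cur = best
                while True:
                    nxt = max(G.neighbors(cur), key=lambda j: p[j])
                    if p[nxt] <= p[cur]:
                        break
                    cur = nxt
                weight[cur] += weight[i]  # hand over accumulated votes
                weight[i] = 0
                voters.discard(i)
    # Weighted majority: each remaining voter casts weight[i] votes.
    correct = sum(weight[i] for i in voters if rng.random() < p[i])
    return correct > n / 2

trials = 1000
direct = sum(majority_correct(delegate=False) for _ in range(trials)) / trials
liquid = sum(majority_correct(delegate=True) for _ in range(trials)) / trials
print(f"direct voting correct: {direct:.3f}, local delegation: {liquid:.3f}")
```

Varying the competence range, the degree d, and the delegation rule exposes the variance effect described above: when delegation concentrates too much weight on too few voters, the liquid outcome can fall behind direct voting.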
Item It’s Not Just the Prompt: Model Choice Dominates LLM Creative Output (2026-01-20)
Haase, Jennifer; Gonnermann-Müller, Jana; Hanel, Paul H.P.; Leins, Nicolas; Kosch, Thomas; Mendling, Jan; Pokutta, Sebastian; Oliver, Nuria; Shamma, David A.; Candello, Heloisa; Cesar, Pablo; Lopes, Pedro; Artizzu, Valentino; Draxler, Fiona; López, Gustavo; Reinschluessel, Anke V.; Tong, Xin; Toups Dugas, Phoebe O.
Prompt engineering is often treated as a reliable control mechanism for LLM behavior, yet LLM outputs vary even under similar prompts due to stochasticity. We quantify how much output variance is driven by prompt choice, by model choice, and by inherent within-LLM stochasticity by evaluating 12 LLMs on 10 creativity prompts in the Alternative Uses Task (AUT), an open-ended divergent-thinking task, measuring answer quality (originality) and quantity (number of answers) with 100 samples per prompt. We then partition the variance into model, prompt, within-LLM stochasticity, and model × prompt interaction components. Our findings show that model choice is at least as important as prompt choice in this setting. For originality, the model explains 41% of the variance, prompts explain 36%, and within-model stochasticity explains 11%. For fluency, prompts explain only 4% of the variance, while model choice explains 51% and within-model stochasticity 34%. Beyond the variance decomposition, models exhibit persistent “creative fingerprints” in thematic preferences and formatting habits.
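The reported percentages follow the structure of a two-factor variance decomposition with replicates. The sketch below shows the sums-of-squares arithmetic on synthetic scores; the effect sizes are invented, and only the decomposition scheme mirrors the analysis described in the abstract.

```python
# Two-way variance decomposition (model x prompt, with replicates)
# on synthetic data. The numbers are made up; only the decomposition
# scheme reflects the analysis described in the abstract.
import numpy as np

rng = np.random.default_rng(0)
n_models, n_prompts, n_reps = 12, 10, 100

# Synthetic "originality" scores: model effect + prompt effect +
# interaction + within-cell (sampling stochasticity) noise.
model_eff = rng.normal(0, 1.0, size=(n_models, 1, 1))
prompt_eff = rng.normal(0, 0.9, size=(1, n_prompts, 1))
interaction = rng.normal(0, 0.3, size=(n_models, n_prompts, 1))
noise = rng.normal(0, 0.5, size=(n_models, n_prompts, n_reps))
scores = model_eff + prompt_eff + interaction + noise

grand = scores.mean()
cell = scores.mean(axis=2, keepdims=True)          # per (model, prompt)
model_m = scores.mean(axis=(1, 2), keepdims=True)  # per model
prompt_m = scores.mean(axis=(0, 2), keepdims=True) # per prompt

ss_total = ((scores - grand) ** 2).sum()
ss_model = n_prompts * n_reps * ((model_m - grand) ** 2).sum()
ss_prompt = n_models * n_reps * ((prompt_m - grand) ** 2).sum()
ss_inter = n_reps * ((cell - model_m - prompt_m + grand) ** 2).sum()
ss_within = ((scores - cell) ** 2).sum()           # within-LLM stochasticity

for name, ss in [("model", ss_model), ("prompt", ss_prompt),
                 ("model x prompt", ss_inter), ("within-LLM", ss_within)]:
    print(f"{name:15s} {100 * ss / ss_total:5.1f}% of variance")
```

With a balanced design like this, the four sums of squares add up exactly to the total, so the printed percentages partition the variance completely.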
Item Oversight Structures for Agentic AI in Public-Sector Organizations (2025-06)
Schmitz, Chris; Rystrøm, Jonathan; Batzner, Jan; Kamalloo, Ehsan; Gontier, Nicolas; Han Lu, Xing; Dziri, Nouha; Murty, Shikhar; Lacoste, Alexandre
This paper finds that agentic AI systems intensify existing challenges to traditional public-sector oversight mechanisms, which rely on siloed compliance units and episodic approvals rather than continuous, integrated supervision. We identify five governance dimensions essential for responsible agent deployment: cross-departmental implementation, comprehensive evaluation, enhanced security protocols, operational visibility, and systematic auditing. We evaluate the capacity of existing oversight structures to meet these challenges via a mixed-methods approach consisting of a literature review and interviews with civil servants in AI-related roles. We find that agent oversight poses intensified versions of three existing governance challenges: continuous oversight, deeper integration of governance and operational capabilities, and interdepartmental coordination. We propose approaches that both adapt institutional mechanisms and design agent architectures compatible with public-sector constraints.

Item A Survey on Metadata for Machine Learning Models and Datasets: Standards, Practices, and Harmonization Challenges (2025-11-02)
Gesese, Genet-Asefa; Chen, Zongxiong; Zoubia, Oussama; Limani, Fidan; Silva, Kanishka; Survyani, Muhammad Asif; Zapilko, Benjamin; Castro, Leyla Jael; Kutafina, Ekaterina; Solanki, Dhwani; Fliegl, Heike; Schimmler, Sonja; Boukhers, Zeyd; Sack, Harald; Jacyszyn, Anna; Mannocci, Andrea; Osborne, Francesco; Rehm, Georg; Salatino, Angelo; Schimmler, Sonja; Stork, Lise
The growing availability of machine learning (ML) models, datasets, and related artifacts across platforms such as Hugging Face, GitHub, and Zenodo has amplified the need for structured and standardized metadata. However, metadata practices remain highly heterogeneous, differing in schema design, vocabulary usage, and semantic expressiveness, which poses significant challenges for tasks such as representation, extraction, alignment, and integration. This fragmentation impedes the development of infrastructures that depend on machine-actionable metadata to support discovery, provenance tracking, or cross-platform interoperability. While metadata is also foundational to enabling the FAIR (Findable, Accessible, Interoperable, and Reusable) principles in ML, there is no consolidated understanding of how existing standards support interoperability and alignment across platforms. In this survey, we review and compare a range of general-purpose and ML-specific metadata standards, evaluating their suitability for cross-platform alignment, discoverability, extensibility, and interoperability. We assess these standards against defined criteria and analyze their potential to support unified, FAIR-compliant metadata infrastructures for ML, laying the groundwork for scalable and interoperable tooling in future ML ecosystems.
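To make the harmonization challenge concrete, here is a minimal sketch that maps a Hugging Face model-card-style record and a schema.org-style record onto a shared vocabulary. Both records and the crosswalk are invented examples; real standards cover far more fields.

```python
# Illustrative harmonization of two metadata records onto a shared
# vocabulary. Records and crosswalk are invented examples; real
# standards are far richer than these few fields.
from typing import Any

# Hugging Face model-card-style metadata (YAML front matter, as a dict).
hf_card = {
    "license": "apache-2.0",
    "datasets": ["squad"],
    "language": ["en"],
    "library_name": "transformers",
}

# schema.org-style description of the same model.
schema_org = {
    "@type": "SoftwareSourceCode",
    "license": "https://www.apache.org/licenses/LICENSE-2.0",
    "inLanguage": "en",
    "isBasedOn": ["https://huggingface.co/datasets/squad"],
}

# Hypothetical crosswalk: (source, source field) -> canonical field.
CROSSWALK = {
    ("hf", "license"): "license",
    ("hf", "datasets"): "training_data",
    ("hf", "language"): "language",
    ("schema", "license"): "license",
    ("schema", "inLanguage"): "language",
    ("schema", "isBasedOn"): "training_data",
}

def harmonize(record: dict[str, Any], source: str) -> dict[str, Any]:
    """Project a record onto the canonical vocabulary, dropping
    fields the crosswalk does not cover."""
    out: dict[str, Any] = {}
    for field, value in record.items():
        canonical = CROSSWALK.get((source, field))
        if canonical is not None:
            out.setdefault(canonical, value)
    return out

print("hf:    ", harmonize(hf_card, "hf"))
print("schema:", harmonize(schema_org, "schema"))
```

Even after fields are aligned, the values still disagree (a license identifier versus a license URL, a dataset name versus a dataset URI), which is exactly the kind of semantic gap that makes cross-platform interoperability hard.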