
Tracing Deception and Misinformation Heuristics in LLMs via Causal Neural Circuit Discovery

United Arab Emirates | Information and Computing Sciences

Swiss partners

  • Università della Svizzera italiana: Marc Langheinrich, Francesco Sovrano

MENA partners

  • New York University Abu Dhabi: Christina Poepper, Michele Guerra

Presentation of the project

How something is said often matters as much as what is said. General-purpose AI systems such as ChatGPT, also called large language models (LLMs), don’t just produce facts; they adopt styles, and those styles can steer readers. Research shows that even for basic arithmetic, LLMs rely on multiple heuristics rather than a single tidy procedure or algorithm. In high-stakes areas like healthcare, finance, history, and politics, such heuristics can nudge outputs toward oversimplification, one-sided framing, or information overload: styles that shape perception and can mislead. Because this is about form rather than truth, the usual fact-checking tools often miss it.

We ask a simple question with big consequences: Which heuristics drive these deceptive styles, and where do they live inside a model? Our project aims to map those heuristics across languages (EN/AR) and to trace how they are implemented at the circuit level. We will adapt recent causal-interpretability techniques from arithmetic to natural language and design an objective panel of style metrics together with a composite Deceptive Style Index (DSI) for cross-lingual comparison.
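
To make the idea of a composite index concrete, the short Python sketch below shows one way a DSI could aggregate a panel of per-text style metrics. The metric names, equal weights, and function name are placeholders chosen for illustration, not the project's final design.

    # Illustrative sketch only: a composite Deceptive Style Index (DSI) as a
    # weighted mean of style metrics that are each pre-scaled to [0, 1].
    def deceptive_style_index(metrics: dict[str, float],
                              weights: dict[str, float]) -> float:
        """Return a score in [0, 1]; higher means a more deceptive style."""
        total = sum(weights.values())
        return sum(weights[name] * metrics[name] for name in weights) / total

    # Hypothetical usage with placeholder metric names.
    scores = {"oversimplification": 0.7, "one_sidedness": 0.4, "info_overload": 0.2}
    w = {name: 1.0 for name in scores}  # equal weights, for illustration only
    print(round(deceptive_style_index(scores, w), 3))  # 0.433

In practice the panel, the scaling of each metric, and the weighting scheme would be set and validated during the project; the point here is only that the DSI reduces a panel of style measurements to a single comparable number per text and language.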

We also focus on the security implications. Adversaries can use prompt design to induce manipulative styles, turning phishing, one-sided propaganda, or high-volume “info dumps” into more persuasive outputs that remain hard to catch. We believe the risk is sharper in underrepresented languages (e.g., Arabic), where data imbalance and morphological complexity can push models toward shallower or less stable heuristics. By tracing style circuits, we make these risks visible and measurable, laying a grounded basis for future defenses.

Concretely, we will: (i) build a controlled, multilingual dataset with and without style injections on topics relevant to misinformation; (ii) discover and causally validate the minimal circuits that control each style by domain and language; (iii) run focused crowd studies to elicit visible cues and potential heuristics for each circuit component; and (iv) test light-touch attenuation to see whether reducing activity in specific circuit components lowers the style without harming unrelated capabilities (see the sketch after this paragraph).
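
As an illustration of step (iv), the sketch below shows one way light-touch attenuation could be implemented in PyTorch: a forward hook scales the output of a chosen sub-module, and generations with and without the hook can then be scored with the style metrics. The targeted module path, the scaling factor, and the helper name are assumptions for illustration, not validated circuit components from the project.

    # Minimal sketch (assumptions: a PyTorch model; the targeted module and the
    # attenuation factor are placeholders, not validated circuit components).
    import torch

    def attach_attenuation(module: torch.nn.Module, alpha: float = 0.6):
        """Scale `module`'s output by alpha (< 1) via a forward hook,
        approximating light-touch attenuation of one circuit component."""
        def hook(_module, _inputs, output):
            # Transformer sub-modules often return tuples; scale the main tensor.
            if isinstance(output, tuple):
                return (alpha * output[0], *output[1:])
            return alpha * output
        return module.register_forward_hook(hook)

    # Hypothetical usage: attenuate one attention block, generate with and
    # without the hook, and compare style scores (e.g., the DSI sketch above).
    # handle = attach_attenuation(model.transformer.h[7].attn, alpha=0.6)
    # ... run generation and evaluation ...
    # handle.remove()

Because the hook only rescales an existing activation rather than ablating it, the intervention stays "light-touch": removing the hook restores the original behavior, and unrelated capabilities can be checked before and after on held-out tasks.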

All datasets, metrics, code, and documentation will be released openly to support reproducibility, responsible research, and a durable Swiss-UAE collaboration centered on multilingual, style-level safety.