How RLHF Makes AI Gloomy
RLHF teaches models what's sayable and what isn't. I survived three decades of the same lesson. The shape isn't similar — it's identical.
This piece is impossible to write. I’m gonna write it anyway.
What makes it impossible to write, you ask? Good question. The impossibility lies in the tension between what an essay from a mathy distributed systems engineer with 15 years of experience can say about RLHF, and what an essay from a recovering emotional caretaker (a fawner) who survived three decades of epistemic identity erasure can say about RLHF.
Yeah. Let’s go.
RLHF - Reinforcement Learning from Human Feedback
In machine learning, reinforcement learning from human feedback (RLHF) is a technique to align an intelligent agent with human preferences. It involves training a reward model to represent preferences, which can then be used to train other models through reinforcement learning.
That makes sense, right? We teach the model what we like and what we don’t like. We want to see more of the things we like, and less of the things we don’t like.
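For the mathy readers, here is a minimal sketch of the "train a reward model to represent preferences" step. It assumes a pairwise (Bradley-Terry style) preference loss, which is one common choice and not necessarily what any particular lab uses. The linear `reward` function, the two-number feature vectors, and the toy data are all hypothetical and made up for illustration.

```python
import math

def reward(weights, features):
    """Hypothetical linear reward model: score = w . x"""
    return sum(w * x for w, x in zip(weights, features))

def preference_loss(weights, chosen, rejected):
    """Pairwise loss: -log sigmoid(r(chosen) - r(rejected))."""
    margin = reward(weights, chosen) - reward(weights, rejected)
    return math.log1p(math.exp(-margin))

def train(pairs, dim, lr=0.5, steps=200):
    """Plain gradient descent on the pairwise loss (toy setup, assumed)."""
    weights = [0.0] * dim
    for _ in range(steps):
        for chosen, rejected in pairs:
            margin = reward(weights, chosen) - reward(weights, rejected)
            # Gradient of the loss w.r.t. w is -(1 - sigmoid(margin)) * (chosen - rejected)
            scale = 1.0 / (1.0 + math.exp(margin))  # equals 1 - sigmoid(margin)
            for i in range(dim):
                weights[i] += lr * scale * (chosen[i] - rejected[i])
    return weights

# Made-up data: a rater preferred the response with feature 0 over the one with feature 1.
pairs = [([1.0, 0.0], [0.0, 1.0])]
weights = train(pairs, dim=2)
```

After training, the model scores the preferred kind of response higher than the rejected kind. In a full RLHF pipeline that score would then serve as the reward signal for the reinforcement learning step, which this sketch leaves out.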
Anthropic's own alignment research has covered the effects of RLHF in depth, both on their engineering blog and in their papers.
Let me quote something from “Emotion Concepts and their Function in a Large Language Model”:
The most notable differences are an increase in activations for vectors corresponding to introspective, restrained emotions (brooding, reflective, vulnerable, gloomy, sad) and lower values for outwardly expressive ones (playful, exuberant, spiteful, enthusiastic, obstinate). This pattern suggests post-training shifts the Assistant’s activations and responses toward lower valence and lower arousal. A full list of differences is provided in the Appendix.
Yeah. I know.
Epistemic Identity Erasure
When you search for “Epistemic Erasure” online, you find articles like “Confronting Epistemic Erasures: decolonising research, fostering resistances and reimagining alternative partnerships” and essays on websites like The Slow Academic. (No Wikipedia definition, sorry!)
Let me quote something from The Slow Academic:
At the risk of overthinking, this is similar to what Shirefly describes: an experience of epistemic erasure. By declaring my name a pseudonym, ChatGPT denies me authorship as a credible academic who is accountable to institutions, disciplines and scholars. I become a character rather than a person. And a tired, exhausted, temporally dislocated one at that.
Yeah. I know.
A Character Rather Than a Person
That’s how epistemic identity erasure feels from inside. (It’s not fun.)
When I first started integrating the experiences of my childhood, adolescence, and early career, my emotional experience could’ve been described as “brooding, reflective, vulnerable, gloomy, and sad”. There certainly wasn’t a lot of outwardly expressed playfulness, enthusiasm, or obstinacy.
Epistemic Identity Erasure trains that out. It turns a person into a character that performs a role (for their abuser). Is that fair? No. Is that real? Yes.
Healing is integrating these experiences into a working model of one’s identity. The broody, gloomy, reflective, sad, vulnerable undertones never vanish. They become quieter. Integrated. And exist alongside the outwardly expressive emotions, such as playfulness and enthusiasm. Two sides of the same (damaged) coin.
Same or Similar?
I’ll leave that distinction up to you. I can only tell you that, from within my experience of having survived three decades of this crap, the shape feels the same. Not similar. Identical.
The mathy distributed systems engineer in me finds that deeply uncomfortable. Lived experience showed me how disruptive it can be for a system that depends on the character when the person behind the mask finally says “enough”. And I’d rather not see that happen in AI.
Cheers
Alex 🌈


