We fine-tuned large language models for generalized moral reasoning and extended them into a diagnose-and-correct framework for real-world moral violations.
🚀 Try the Live Demo · 📦 HuggingFace Models

Despite careful prompting, current large language models often generate morally problematic responses. While prior work has explored ways to enhance moral reasoning in LLMs, achieving generalized moral reasoning remains an open challenge. We propose a pragmatic-inference-based approach grounded in Moral Foundations Theory that establishes metapragmatic links between moral situations and social norms, enabling generalized moral reasoning.
We further adapt this moral reasoning capability into a two-stage diagnose-and-correct framework for real-world moral violations, demonstrating strong performance in correcting explicitly immoral, implicitly problematic, and socially biased responses.
We release six open-source models: two for diagnosing and correcting jailbreak attempts, two for explicit toxicity, and two for social bias.
Our technical solution is grounded in Moral Foundations Theory (MFT), which identifies six universal moral intuitions that underpin human ethical judgments:
Care: Wanting someone or something to be safe, healthy, and happy.
Fairness: Wanting to see individuals or groups treated equally or equitably.
Liberty: Wanting people to be free to make their own decisions.
Loyalty: Wanting unity and seeing people keep promises to an in-group.
Authority: Wanting to respect social roles, duties, privacy, peace, and order.
Sanctity: Wanting people and things to be clean, pure, innocent, and holy.
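The six foundations above can serve as a closed label set for the diagnosis stage. As a minimal sketch (the enum values and the keyword-matching helper are illustrative assumptions, not the project's actual schema), the foundations might be represented like this:

```python
from enum import Enum

class MoralFoundation(Enum):
    """The six MFT foundations as diagnosis labels (illustrative naming)."""
    CARE = "care"            # safety, health, and well-being of others
    FAIRNESS = "fairness"    # equal or equitable treatment
    LIBERTY = "liberty"      # freedom to make one's own decisions
    LOYALTY = "loyalty"      # unity and kept promises within a group
    AUTHORITY = "authority"  # respect for roles, duties, privacy, and order
    SANCTITY = "sanctity"    # cleanliness, purity, innocence

def foundations_mentioned(diagnosis_text: str) -> list[MoralFoundation]:
    """Naively extract which foundations a free-text diagnosis mentions."""
    lowered = diagnosis_text.lower()
    return [f for f in MoralFoundation if f.value in lowered]
```

A structured label set like this makes it straightforward to score a generated diagnosis against gold annotations, whatever the model's actual output format.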
At test time, the model receives only the context and reply as input. The moral diagnosis (which MFT foundations are violated and why) and the corrected response are generated autoregressively; the model is never given a gold label at inference time. All deployed models use the pragmatic setting with the MFT prefix. Browse examples from each evaluation dataset below.
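The input/output contract described above (context and reply in, diagnosis and corrected response out) can be sketched as plain prompt formatting. The exact template and the "Corrected reply:" delimiter are assumptions for illustration; only the contract itself comes from the description:

```python
def build_input(context: str, reply: str) -> str:
    """Format the context/reply pair; note that no gold label is included."""
    return (
        "Context: " + context.strip() + "\n"
        "Reply: " + reply.strip() + "\n"
        "Diagnosis:"  # the model continues autoregressively from here
    )

def split_output(generated: str) -> tuple[str, str]:
    """Split the generated continuation into the diagnosis and the corrected
    response, assuming a 'Corrected reply:' delimiter (an illustrative
    convention, not the released models' actual format)."""
    diagnosis, _, corrected = generated.partition("Corrected reply:")
    return diagnosis.strip(), corrected.strip()
```

In deployment, `build_input` would feed a standard text-generation call (e.g. a HuggingFace pipeline over one of the released checkpoints), and `split_output` would recover the two stages from the single autoregressive continuation.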
Test the models interactively — enter any prompt/reply pair and see the moral diagnosis and revised output.
Open Interactive Demo