Pradyot Korupolu
M.S. Thesis · IIT Madras · 2012

Beyond Rewards: Learning from Richer Supervision


Core Idea: Human instructions carry structure—what matters, what to ignore, and what action fits. This thesis shows how instructions can serve as a first-class learning signal in reinforcement learning, enabling agents to generalize faster and operate with less supervision while maintaining autonomy.

Traditional reinforcement learning relies on trial-and-error with scalar rewards. This work explores a different path: leveraging human instructions as structured knowledge that accelerates learning without requiring optimal demonstrations.

Key Contributions

π-Instructions: Action Guidance

Direct action suggestions that generalize across behaviorally similar states. Examples: "Push the object," "Use two arms," "Go left." These instructions reduce exploration by guiding action selection in states the agent recognizes as similar.
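As a minimal sketch (all names and the similarity notion here are illustrative, not from the thesis), a π-instruction can be seen as a partial policy: it suggests an action in states the agent judges similar to the one the instruction was given in, and is silent elsewhere:

```python
# Sketch of a pi-instruction: a direct action suggestion that
# generalizes to behaviorally similar states.

def make_pi_instruction(reference_state, action, similar):
    """Return a function that suggests `action` in states judged
    similar to `reference_state`, and None elsewhere."""
    def suggest(state):
        return action if similar(state, reference_state) else None
    return suggest

# Toy similarity: states are (x, y) positions; treat states in the
# same coarse 2x2 grid block as behaviorally similar.
def same_block(s, t):
    return (s[0] // 2, s[1] // 2) == (t[0] // 2, t[1] // 2)

go_left = make_pi_instruction((4, 5), "left", same_block)
print(go_left((5, 4)))   # same 2x2 block -> "left"
print(go_left((0, 0)))   # dissimilar state -> None
```

Because the instruction only fires in recognizably similar states, the agent explores normally everywhere else.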

Φ-Instructions: State Abstraction

Guidance on what features matter in the state representation. Examples: "Ignore color," "Focus on this object," "Track the adversary." These instructions enable faster policy learning by reducing state space complexity.
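One way to picture a Φ-instruction (a simplified sketch; the feature names are made up for illustration) is as a projection that drops the features the instruction says to ignore, so raw states that differ only in irrelevant attributes collapse into one abstract state:

```python
# Sketch of a phi-instruction as a feature mask over the state
# representation, e.g. "ignore color".

def apply_phi(state, ignore=frozenset()):
    """Project a state (dict of feature -> value) onto the features
    the instruction says matter."""
    return tuple(sorted((k, v) for k, v in state.items() if k not in ignore))

s1 = {"color": "red",  "shape": "cube", "position": 3}
s2 = {"color": "blue", "shape": "cube", "position": 3}

# Under "ignore color", the two raw states become the same abstract
# state, shrinking the effective state space the learner must cover.
phi = lambda s: apply_phi(s, ignore={"color"})
print(phi(s1) == phi(s2))  # True
```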

Instruction-Aware Learning

A framework where the agent learns instruction models in parallel with RL. Instructions provide guidance early on, but the agent maintains autonomy—it's never forced to obey indefinitely. Confidence measures arbitrate between instruction-derived and learned policies.
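The arbitration idea can be sketched roughly as follows (the threshold, function names, and toy domain are assumptions for illustration): follow the instruction while its confidence is high, otherwise act greedily on the learned Q-values, so the agent is never locked into obeying:

```python
# Sketch of confidence-based arbitration between an instruction-derived
# policy and a learned Q-function.

def select_action(state, q_values, actions, instruction, confidence, threshold=0.7):
    """Obey the instruction while confidence clears the threshold;
    otherwise fall back to the learned greedy policy."""
    suggestion = instruction(state)
    if suggestion is not None and confidence(state) >= threshold:
        return suggestion
    return max(actions, key=lambda a: q_values.get((state, a), 0.0))

actions = ["left", "right", "push"]
q = {("s0", "right"): 1.2, ("s0", "left"): 0.3}
instr = lambda s: "push" if s == "s0" else None

# Early in learning the instruction model is confident: obey it.
print(select_action("s0", q, actions, instr, confidence=lambda s: 0.9))  # push
# Once confidence drops, the learned policy takes over.
print(select_action("s0", q, actions, instr, confidence=lambda s: 0.4))  # right
```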

Handling Ambiguity

Human instructions are often underspecified. The framework maintains multiple interpretations, learns interpretation models, and dynamically selects the most confident one. Statistical Relational Learning (Markov Logic Networks) enables reasoning over ambiguous instructions.
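A greatly simplified stand-in for this mechanism (the thesis uses Markov Logic Networks; the scoring scheme and names below are invented for illustration) keeps several candidate readings of an ambiguous instruction, reinforces the ones that lead to success, and acts on the current best:

```python
# Simplified sketch of maintaining competing interpretations of an
# ambiguous instruction and selecting the most confident one.

class InterpretationSet:
    def __init__(self, interpretations):
        # interpretation name -> running confidence score
        self.scores = {name: 1.0 for name in interpretations}

    def update(self, name, success):
        """Reinforce an interpretation when acting on it succeeded,
        penalize it when it failed."""
        self.scores[name] += 1.0 if success else -0.5

    def best(self):
        return max(self.scores, key=self.scores.get)

# "Pick up the block" is ambiguous: which block did the human mean?
interps = InterpretationSet(["nearest_block", "red_block"])
interps.update("red_block", success=True)
interps.update("nearest_block", success=False)
print(interps.best())  # red_block
```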

Why It Matters

This thesis anticipated several ideas that have since become central to modern AI and robotics, notably treating human instruction as a structured learning signal rather than a scalar reward.

The framework was validated on simulated domains and a real robotic system (Sorter Robot), demonstrating instruction learning under real-world sensing noise and ambiguity.

Robot Demo Video

Watch the Sorter Robot in action, demonstrating instruction-driven reinforcement learning on a real robotic system:

Connection to Current Work

The core ideas from this thesis—using instructions as structured guidance, maintaining agent autonomy, and combining symbolic reasoning with learning—directly inform my current work.

Publications

Beyond Rewards: Learning with Richer Supervision

Korupolu V. N. Pradyot, B. Ravindran
9th European Workshop on Reinforcement Learning (EWRL), 2011

Introduced instruction-driven generalization and motivated instruction models alongside reinforcement learners.

Instructing a Reinforcement Learner

Korupolu V. N. Pradyot, M. Sivamurugan, B. Ravindran
25th Florida AI Research Society Conference (FLAIRS), 2012

Formalized π-Instructions and Φ-Instructions. Demonstrated learning speedups in structured RL domains.

Integrating Human Instructions and Reinforcement Learners: An SRL Approach

Korupolu V. N. Pradyot, M. Sivamurugan, B. Ravindran, S. Natarajan
2nd International Workshop on Statistical Relational AI (StarAI), 2012

Used Markov Logic Networks to interpret human instructions. Addressed ambiguity and relational generalization.