Robot Learning: A First-Principles Guide

Target audience: ML engineers preparing for robotics interviews. This guide builds from foundational robotics concepts to modern learning-based methods, defining every term with concrete examples before building on it.


1. Robot Kinematics

Kinematics is the study of motion without considering forces. Before a robot can learn anything, we need to describe where its body is and how it moves.

1.1 Configuration Space vs Task Space

A configuration is the complete specification of every joint in a robot. For a robot arm with 6 revolute joints, the configuration is a vector of 6 angles: q = (θ₁, θ₂, θ₃, θ₄, θ₅, θ₆). The set of all valid configurations is the configuration space (C-space). A 6-joint arm lives in a 6-dimensional C-space.

Task space (also called operational space or Cartesian space) is the space we actually care about — the position and orientation of the robot’s hand (the end-effector). For a robot placing a cup on a table, task space is the 3D position (x, y, z) and 3D orientation (roll, pitch, yaw) of the gripper — 6 dimensions total, called a pose.

Concrete example: imagine a desk lamp with two hinged segments. The configuration is two angles (θ₁, θ₂). The task space is where the lamp head points in 2D. Many different (θ₁, θ₂) pairs can aim the lamp at the same spot — this many-to-one mapping is the root of why inverse kinematics is hard.

1.2 Forward Kinematics

Forward kinematics (FK) answers: given joint angles, where is the end-effector?

For a chain of rigid links, each joint applies a transformation to the next link. We compose these transformations using 4×4 homogeneous transformation matrices:

T = T₁ · T₂ · ... · Tₙ

Each Tᵢ encodes a rotation and translation:

Tᵢ = | Rᵢ  dᵢ |
     | 0    1  |

where Rᵢ is a 3×3 rotation matrix and dᵢ is a 3×1 translation vector.

2-link planar arm example. Two links of lengths L₁ and L₂ with joint angles θ₁ and θ₂ (both revolute, rotating in the plane):

x = L₁·cos(θ₁) + L₂·cos(θ₁ + θ₂)
y = L₁·sin(θ₁) + L₂·sin(θ₁ + θ₂)

This is a direct geometric calculation — FK always has a unique, closed-form solution.
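The two equations above translate directly into code (a minimal sketch; the function name and unit-length defaults are illustrative):

```python
import math

def fk_2link(theta1, theta2, L1=1.0, L2=1.0):
    """Planar 2-link forward kinematics: joint angles -> end-effector (x, y)."""
    x = L1 * math.cos(theta1) + L2 * math.cos(theta1 + theta2)
    y = L1 * math.sin(theta1) + L2 * math.sin(theta1 + theta2)
    return x, y
```

With both joints at zero the arm lies along the x-axis at (L1 + L2, 0); folding the elbow 90° places the tip at (L1, L2).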

[Figure: Forward and inverse kinematics of a 2-link planar robot arm]

1.3 DH (Denavit-Hartenberg) Parameters

Rather than deriving transforms from scratch for every robot, the Denavit-Hartenberg convention provides a systematic 4-parameter description of each link-joint pair:

Parameter | Symbol | Meaning
Link length | a | Distance between joint axes along the common normal
Link twist | α | Angle between joint axes about the common normal
Link offset | d | Distance along the joint axis (variable for prismatic joints)
Joint angle | θ | Rotation about the joint axis (variable for revolute joints)

Each joint’s transform is:

Tᵢ = Rot_z(θᵢ) · Trans_z(dᵢ) · Trans_x(aᵢ) · Rot_x(αᵢ)

2-link planar arm DH table:

Joint | θ | d | a | α
1 | θ₁ | 0 | L₁ | 0
2 | θ₂ | 0 | L₂ | 0

Both joints are revolute (θ is the joint variable) and planar, so every twist α and offset d is zero. DH parameters are the standard way to specify robot geometry in textbooks and manufacturer datasheets. Given a DH table, you can mechanically compute the full FK — no geometric intuition needed.
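As a sketch, the per-joint DH transform can be written straight from the formula above (function names are illustrative); composing the 2-link table reproduces the planar FK:

```python
import numpy as np

def dh_transform(theta, d, a, alpha):
    """Homogeneous transform for one DH row: Rot_z(theta)·Trans_z(d)·Trans_x(a)·Rot_x(alpha)."""
    ct, st = np.cos(theta), np.sin(theta)
    ca, sa = np.cos(alpha), np.sin(alpha)
    return np.array([
        [ct, -st * ca,  st * sa, a * ct],
        [st,  ct * ca, -ct * sa, a * st],
        [0.0,      sa,       ca,      d],
        [0.0,     0.0,      0.0,    1.0],
    ])

def fk_from_dh(dh_rows):
    """Compose per-joint transforms: T = T1 · T2 · ... · Tn."""
    T = np.eye(4)
    for row in dh_rows:
        T = T @ dh_transform(*row)
    return T
```

For the 2-link table with L₁ = L₂ = 1 and θ₁ = θ₂ = 0, the end-effector position `T[:3, 3]` comes out to (2, 0, 0), matching the planar FK equations.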

1.4 Inverse Kinematics

Inverse kinematics (IK) answers the opposite question: given a desired end-effector pose, what joint angles achieve it?

Why it’s harder than FK:

  - Non-uniqueness: many configurations can reach the same pose (elbow-up vs elbow-down), and a redundant arm has infinitely many solutions
  - Unreachability: the desired pose may lie outside the workspace, so no solution exists
  - No general closed form: only certain geometries admit analytical solutions; most require numerical methods
  - Singularities: configurations where the arm instantaneously loses the ability to move in some task-space direction

Jacobian-based iterative methods solve IK numerically. The Jacobian J relates small joint-space motions to small task-space motions:

Δx = J(q) · Δq

where Δx is the task-space error and Δq is the joint correction. To find Δq:

Δq = J⁻¹ · Δx        (if J is square and non-singular)
Δq = J† · Δx          (pseudoinverse, if J is non-square)
Δq = Jᵀ · Δx          (transpose method, simpler but slower convergence)

The algorithm: start from an initial guess q₀, compute the task-space error Δx = x_desired - FK(q₀), compute Δq, update q ← q + Δq, repeat until convergence. This is essentially Newton’s method applied to the FK equations.

Damped least squares (Levenberg-Marquardt) adds regularization to handle singularities:

Δq = Jᵀ(J·Jᵀ + λ²I)⁻¹ · Δx

The damping factor λ trades off accuracy for stability near singular configurations.
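The iterative scheme can be sketched for the 2-link arm, using a finite-difference Jacobian and the damped least squares update (function names, λ, and iteration counts are illustrative):

```python
import numpy as np

def fk(q, L1=1.0, L2=1.0):
    """Planar 2-link FK as a vector function of joint angles q = (theta1, theta2)."""
    return np.array([
        L1 * np.cos(q[0]) + L2 * np.cos(q[0] + q[1]),
        L1 * np.sin(q[0]) + L2 * np.sin(q[0] + q[1]),
    ])

def jacobian(q, eps=1e-6):
    """Central finite-difference Jacobian: J[i, j] = d x_i / d q_j."""
    J = np.zeros((2, 2))
    for j in range(2):
        dq = np.zeros(2)
        dq[j] = eps
        J[:, j] = (fk(q + dq) - fk(q - dq)) / (2 * eps)
    return J

def ik_dls(x_desired, q0, lam=0.1, iters=200):
    """Damped least squares IK: q <- q + J^T (J J^T + lam^2 I)^(-1) dx."""
    q = np.array(q0, dtype=float)
    for _ in range(iters):
        dx = x_desired - fk(q)           # task-space error
        if np.linalg.norm(dx) < 1e-8:
            break
        J = jacobian(q)
        q += J.T @ np.linalg.solve(J @ J.T + lam**2 * np.eye(2), dx)
    return q
```

Swapping in the analytic Jacobian makes this faster; finite differences are shown because they generalize to any FK you can evaluate.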

1.5 Joint Spaces

Robots use two fundamental joint types:

  - Revolute joints rotate about an axis; the joint variable is an angle θ
  - Prismatic joints translate along an axis; the joint variable is a displacement d

Joint limits constrain the C-space to a subset. A 6-DOF arm with joint limits has a C-space that’s a hyper-rectangle in ℝ⁶ rather than all of ℝ⁶. The workspace is the image of this constrained C-space through FK — an irregularly shaped region in task space. Workspace analysis determines what a robot can reach, and is critical for cell layout and task feasibility.

Degrees of freedom (DOF): The number of independent joint variables. A task in 3D (position + orientation) requires 6 DOF. Robots with more than 6 joints are redundant — they have extra freedom that can be used for obstacle avoidance, singularity avoidance, or optimizing secondary objectives (e.g., minimizing joint torques).


2. Dynamics and Control

Kinematics tells us where the robot goes. Dynamics tells us what forces make it go there. Control is the art of computing those forces in real time.

The equation of motion for a robot arm is:

M(q)·q̈ + C(q, q̇)·q̇ + g(q) = τ

where M is the inertia matrix, C captures Coriolis and centrifugal effects, g is gravity, and τ is the vector of joint torques we command. This is the manipulator equation — it’s nonlinear, coupled, and configuration-dependent.

2.1 PID Control

PID control is the most widely deployed control law in industrial robotics. It computes the control signal u(t) as a sum of three terms:

u(t) = Kp·e(t) + Ki·∫e(t')dt' + Kd·(de/dt)

where e(t) = q_desired - q_actual is the tracking error.

Term | Role | Concrete example (controlling a joint angle)
P (proportional) | Pushes toward the target proportionally to how far away you are | Joint is 10° off → apply torque proportional to 10°
I (integral) | Eliminates steady-state error by accumulating past error | Gravity pulls the arm down 0.5° persistently → I term builds up and compensates
D (derivative) | Damps oscillation by opposing rapid error changes | Joint is approaching target quickly → D term brakes to prevent overshoot

Tuning tradeoffs:

  - Raising Kp speeds up the response but causes overshoot and oscillation
  - Raising Ki removes steady-state error but risks integrator windup after large disturbances
  - Raising Kd damps oscillation but amplifies sensor noise in the derivative

In practice, each joint runs an independent PID loop at 1 kHz+. PID works well when the dynamics are mild (slow motions, light payloads) but struggles with fast, dynamic motions where joint coupling matters — that’s where model-based control helps.
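A single-joint PID loop fits in a few lines (a minimal discrete-time sketch, not a production controller; the gains below are illustrative):

```python
class PID:
    """Discrete PID controller for a single joint, run at fixed timestep dt."""
    def __init__(self, kp, ki, kd, dt):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.integral = 0.0
        self.prev_error = None

    def update(self, desired, actual):
        error = desired - actual
        self.integral += error * self.dt
        # Derivative from finite differences; zero on the first call.
        deriv = 0.0 if self.prev_error is None else (error - self.prev_error) / self.dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * deriv
```

Driving a toy velocity-controlled joint (pos += u·dt) toward a setpoint with modest gains settles it within a few seconds of simulated time.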

[Figure: PID control loop block diagram]

2.2 Model Predictive Control (MPC)

Model Predictive Control solves an optimization problem at every timestep:

minimize  Σ_{t=0}^{H} [ (x_t - x_ref)ᵀ Q (x_t - x_ref) + u_tᵀ R u_t ]
subject to  x_{t+1} = f(x_t, u_t)        (dynamics model)
            u_min ≤ u_t ≤ u_max           (actuator limits)
            x_t ∈ X_safe                   (state constraints)

The key idea: optimize over a finite horizon H, but only execute the first action, then re-solve at the next timestep with updated state. This receding horizon approach provides feedback and handles model error.
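The receding-horizon loop can be illustrated with the simplest possible optimizer, random shooting on a toy double integrator (everything below is a pedagogical sketch, not a real-time solver):

```python
import numpy as np

def dynamics(x, u, dt=0.1):
    """Double integrator: state = (position, velocity), input = acceleration."""
    return np.array([x[0] + x[1] * dt, x[1] + u * dt])

def mpc_action(x0, x_ref, horizon=10, samples=500, u_max=1.0):
    """Random-shooting MPC: sample action sequences, roll out, keep the best first action."""
    rng = np.random.default_rng(0)
    U = rng.uniform(-u_max, u_max, size=(samples, horizon))
    best_cost, best_u0 = np.inf, 0.0
    for seq in U:
        x, cost = x0, 0.0
        for u in seq:
            x = dynamics(x, u)
            # Quadratic stage cost: position error + velocity + effort penalties.
            cost += (x[0] - x_ref) ** 2 + 0.1 * x[1] ** 2 + 0.01 * u ** 2
        if cost < best_cost:
            best_cost, best_u0 = cost, seq[0]
    return best_u0  # receding horizon: execute only the first action

# Closed loop: re-solve at every timestep with the updated state.
x = np.array([0.0, 0.0])
for _ in range(60):
    x = dynamics(x, mpc_action(x, x_ref=1.0))
```

Real robotic MPC replaces the sampler with a structured QP or nonlinear solver, but the re-solve-and-execute-first-action pattern is exactly this loop.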

Why MPC is powerful for robotics:

  - Constraints are first-class citizens: actuator limits and safety bounds enter the optimization directly
  - It plans over the horizon, anticipating rather than merely reacting
  - It uses a dynamics model, so it handles the coupled, nonlinear dynamics that defeat independent per-joint PID
  - Re-solving at every timestep provides feedback that absorbs disturbances and model error

The downside: solving an optimization problem at every timestep is computationally expensive. Real-time MPC on robots typically runs at 50–500 Hz, depending on the horizon length and solver.

2.3 Impedance Control

Classical control tracks a trajectory: “be at position x_d at time t.” But what if the robot touches something? A stiff position controller will either push with enormous force (dangerous) or lose contact.

Impedance control specifies a desired relationship between force and motion, modeled as a virtual spring-damper:

F = K·(x_d - x) + D·(ẋ_d - ẋ)

where K is the stiffness matrix and D is the damping matrix. The robot behaves like a spring attached to the desired position: high stiffness tracks the trajectory accurately but pushes hard against obstacles, while low stiffness yields gentle, compliant contact.

Concrete example: a robot polishing a surface. In the normal direction, you want low stiffness (comply with surface shape) and moderate damping (smooth motion). In the tangential directions, you want the robot to track the polishing trajectory. Impedance control achieves this by setting different K and D values per axis.
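The per-axis stiffness idea can be sketched in a few lines (the gains below are illustrative, not tuned for any real robot):

```python
import numpy as np

# Per-axis stiffness and damping for the polishing example: compliant in z
# (surface normal), stiff in x and y (tangential tracking).
K = np.diag([1000.0, 1000.0, 50.0])   # N/m
D = np.diag([50.0, 50.0, 20.0])       # N·s/m

def impedance_force(x_d, x, xdot_d, xdot):
    """F = K·(x_d - x) + D·(ẋ_d - ẋ): virtual spring-damper toward the desired pose."""
    return K @ (x_d - x) + D @ (xdot_d - xdot)
```

A 1 mm tangential error produces a firm 1 N restoring force, while a 10 mm normal deflection (the surface pushing the tool up) produces only 0.5 N, so the tool rides the surface instead of fighting it.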

Admittance control is the dual: given a force input (from a force/torque sensor), compute a motion output. Used when the robot has stiff position-controlled joints but needs compliant behavior.

Impedance control is essential for contact-rich tasks: assembly, insertion, collaborative robots (cobots) working alongside humans, and any manipulation that involves sustained contact.

2.4 Computed Torque Control

Computed torque (also called inverse dynamics control) uses the full dynamics model to cancel nonlinearities:

τ = M(q)·(q̈_d + Kp·e + Kd·ė) + C(q, q̇)·q̇ + g(q)

Substituting into the manipulator equation, the nonlinear terms cancel, leaving:

ë + Kd·ė + Kp·e = 0

This is a linear second-order system — easily tuned with standard linear control theory.

The catch: computed torque requires an accurate dynamics model (M, C, g). Model errors lead to imperfect cancellation and degraded performance. In practice, it’s combined with robust or adaptive control to handle uncertainties.


3. Imitation Learning

Now we enter the learning-based approaches. Instead of manually programming control laws, can the robot learn from watching an expert?

3.1 The Core Idea

Imitation learning (IL) learns a policy π(a|s) from a dataset of expert demonstrations:
D = {(s₁, a₁), (s₂, a₂), ..., (sₙ, aₙ)}

where sᵢ is a state (e.g., joint angles + camera image) and aᵢ is the action the expert took (e.g., joint velocity command). The goal: find π that mimics the expert.

This avoids the reward-engineering problem of RL — the expert shows what to do rather than specifying a scalar reward.

3.2 Behavioral Cloning (BC)

Behavioral cloning is the simplest approach: treat imitation as supervised learning.

minimize  Σᵢ L(π(sᵢ), aᵢ)

Train a neural network to predict the expert’s action given the state. For continuous actions, L is often MSE; for discrete actions, cross-entropy.
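The supervised-learning structure of BC fits in a few lines. As a sketch, a linear policy is fit to synthetic "expert" data by least squares; a real system would use a neural network, but the objective is the same:

```python
import numpy as np

# Minimal behavioral cloning: fit a linear policy pi(s) = s @ W to expert
# state-action pairs by minimizing MSE. W_expert plays the role of the
# (unknown) expert; everything here is synthetic and illustrative.
rng = np.random.default_rng(0)
W_expert = np.array([[0.5, -1.0], [2.0, 0.3]])
S = rng.normal(size=(1000, 2))                          # states
A = S @ W_expert.T + 0.01 * rng.normal(size=(1000, 2))  # noisy expert actions

W_bc, *_ = np.linalg.lstsq(S, A, rcond=None)  # minimize ||S W - A||^2
policy = lambda s: s @ W_bc
```

The fit recovers the expert mapping on the training distribution; the distribution shift problem below is about what happens on states this dataset never covered.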

The distribution shift problem. BC works on the expert’s state distribution. But at test time, the learned policy makes small errors → visits slightly different states → makes bigger errors → cascades into states the expert never demonstrated. Over a trajectory of length T, the error compounds quadratically: total error ∝ O(T²).

Concrete example: a self-driving car trained with BC drifts slightly left. The expert data never shows recovery from this situation. The car drifts further left. This is why naive BC fails on long-horizon tasks.

Despite this, BC with modern architectures and large datasets can work surprisingly well. The key is data diversity — if the training data covers enough of the state space (including recovery behaviors), the compounding effect is mitigated. Diffusion Policy (Chi et al., 2023) uses a diffusion model as the policy class for BC, generating multi-modal action distributions that handle the one-to-many nature of demonstrations.

3.3 DAgger (Dataset Aggregation)

DAgger (Ross et al., 2011) directly addresses distribution shift:

  1. Train initial policy π₁ on expert data D
  2. Roll out π₁ in the environment, collecting states {s’₁, s’₂, …}
  3. Query the expert for the correct actions at these new states: {(s’₁, a₁), (s’₂, a₂), …}
  4. Aggregate: D ← D ∪ {new data}
  5. Retrain π₂ on the aggregated dataset
  6. Repeat

By training on states the learned policy actually visits (not just the expert’s states), DAgger breaks the distribution shift loop. Theoretical guarantee: after N rounds, the policy’s performance approaches the expert’s.
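The loop above can be sketched on a toy 1-D task with a scripted expert (the scalar-gain policy class and all constants are illustrative):

```python
import numpy as np

# DAgger sketch: the (scripted) expert's action is -state (drive to zero),
# and the learned policy is a single gain fit by least squares.
rng = np.random.default_rng(0)
expert = lambda s: -s

def rollout(gain, s0=1.0, steps=20):
    """Run the current policy and return the states it actually visits."""
    states, s = [], s0
    for _ in range(steps):
        states.append(s)
        s = s + 0.1 * gain * s   # apply policy action a = gain * s
    return states

D_s, D_a = [], []
gain = 0.0                       # initial policy does nothing
for _ in range(5):               # DAgger rounds
    visited = rollout(gain)
    D_s += visited                          # states the *policy* visits
    D_a += [expert(s) for s in visited]     # expert labels at those states
    S, A = np.array(D_s), np.array(D_a)
    gain = float(S @ A / (S @ S))           # least-squares fit of a = gain * s
```

The key structural point: labels are queried at the learner's own states, so the training distribution tracks the deployment distribution round by round.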

The practical limitation: DAgger requires an interactive expert who can provide corrections during rollouts. This is feasible in simulation or with a human teleoperator, but expensive in general.

[Figure: Behavioral Cloning vs DAgger imitation learning approaches]

3.4 Beyond Behavioral Cloning

Inverse reinforcement learning (IRL) takes a different angle: instead of learning the policy directly, infer the expert’s reward function R(s, a) from demonstrations, then optimize a policy for that reward. The idea is that the reward function is a more compact, transferable representation of the task than raw state-action pairs. Maximum entropy IRL (Ziebart et al., 2008) is a foundational algorithm. Adversarial imitation methods like GAIL (Ho & Ermon, 2016) train a discriminator to distinguish expert from policy behavior — the policy is rewarded for fooling the discriminator.

One-shot imitation learning aims to learn a new task from a single demonstration by leveraging prior experience across many tasks. The model learns a task-conditioned policy: given one demo of a new task, generalize to new initial conditions (Duan et al., 2017).

3.5 Connection to Autonomous Driving

Imitation learning has found major real-world application in self-driving: production driving stacks train policies on large fleets of logged human driving, from early end-to-end steering networks to mid-level planners learned from demonstrations. These systems show that with sufficient data scale and augmentation (including synthesized perturbations that teach recovery from drift), BC-style approaches can work on long-horizon, safety-critical tasks.


4. Sim-to-Real Transfer

Training robots in the real world is slow (wall-clock time), expensive (hardware damage), and dangerous. Simulation is fast, cheap, and safe. But policies trained in simulation often fail when deployed on real hardware.

4.1 The Reality Gap

The reality gap has three main components:

Gap type | Example
Visual | Simulated images look different (textures, lighting, reflections, noise)
Physics | Simulated contact dynamics are approximate (friction, deformation, soft contacts)
Dynamics | Simulated motor response differs from real actuators (latency, backlash, gear flexibility)

A policy that achieves 95% success in simulation might achieve 5% success on the real robot — not because the algorithm is wrong, but because the policy has overfit to simulator-specific artifacts.

[Figure: Sim-to-real transfer pipeline and approaches to bridge the reality gap]

4.2 Domain Randomization

Domain randomization (DR) takes a brute-force approach: if you don’t know the exact real-world parameters, randomize them over a wide range during training. The policy must succeed across all randomizations, so it learns to be robust to variation.

What gets randomized:

Canonical example: OpenAI’s Rubik’s Cube (OpenAI, 2019). A dexterous Shadow Hand solved a Rubik’s Cube in the real world after training entirely in simulation with massive domain randomization — thousands of physics parameters randomized simultaneously. The policy learned to be robust to any specific parameter setting because it had to succeed across all of them. This is sometimes called Automatic Domain Randomization (ADR): start with narrow randomization ranges and automatically widen them as the policy improves.
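In code, DR often amounts to a sampler invoked at every episode reset (parameter names and ranges below are illustrative, not taken from any real setup):

```python
import random

def sample_randomization(rng):
    """Draw a fresh set of simulator parameters for one training episode."""
    return {
        "friction":        rng.uniform(0.5, 1.5),
        "object_mass_kg":  rng.uniform(0.05, 0.5),
        "motor_gain":      rng.uniform(0.8, 1.2),
        "latency_ms":      rng.uniform(0.0, 40.0),
        "light_intensity": rng.uniform(0.3, 3.0),
    }

rng = random.Random(0)
params = sample_randomization(rng)   # applied to the simulator before each episode
```

ADR would wrap this sampler and widen each (low, high) pair automatically whenever the policy's success rate at the current ranges exceeds a threshold.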

4.3 System Identification

System identification (sysid) takes the opposite approach: instead of randomizing, measure the real-world parameters and tune the simulator to match.

Methods:

  - Excite the robot with rich trajectories and fit dynamic parameters (masses, inertias, friction coefficients) by least squares
  - Optimize simulator parameters with black-box search (e.g., CMA-ES or Bayesian optimization) until simulated rollouts match recorded real trajectories
  - Calibrate sensors and kinematics (camera intrinsics/extrinsics, joint offsets) directly

Sysid works well when the simulation is fundamentally capable of matching reality (the sim has the right structure but wrong parameters). It breaks down when the sim is missing important phenomena entirely (e.g., cable dynamics, deformable contacts).

4.4 Domain Adaptation

Domain adaptation uses representation learning to bridge the sim-real gap:

  - Adversarial feature alignment: train features so a discriminator cannot tell whether an observation came from sim or real
  - Image-to-image translation: map sim renders toward realistic images (or real images toward a canonical sim style) before feeding the policy
  - Shared latent spaces: train encoders on both domains so the policy consumes domain-invariant features

4.5 Progressive Transfer

Rather than a binary sim → real transfer, progressive approaches bridge the gap gradually:

  1. Train base policy in simulation with domain randomization
  2. Deploy on real robot, collect a small dataset (tens to hundreds of trajectories)
  3. Fine-tune the policy on real data, using sim data as regularization

This combines the sample efficiency of sim training with the accuracy of real-world fine-tuning. Residual policy learning is a related idea: train a base policy in sim, then learn a small residual correction on the real robot that compensates for the sim-real gap.


5. Reward Shaping and Curriculum Learning

RL for robotics faces a fundamental challenge: designing reward functions that are both informative enough for learning and aligned with the true objective.

5.1 The Sparse Reward Problem

Many robotics tasks have natural sparse rewards: the robot gets +1 if it completes the task and 0 otherwise. Picking up an object: reward = 1 if object is grasped, 0 otherwise.

The problem: a random policy almost never achieves the goal, so it receives reward 0 on virtually every trajectory. There’s no learning signal to improve from. For a task like inserting a peg into a hole with <1mm tolerance, a random policy might succeed once in 10⁸ trials.

5.2 Reward Shaping

Reward shaping adds intermediate reward signals to guide learning:

r_shaped(s, a, s') = r_sparse(s, a, s') + F(s, s')

Examples of shaping terms F:

  - Negative distance to the goal: reward increases as the gripper approaches the object
  - Milestone bonuses: reward touching the object, then lifting it, then placing it
  - Regularizers: penalties on joint velocities, jerky actions, or excessive contact forces

The risk: poorly designed shaping can create local optima. A robot rewarded for moving its hand close to an object might learn to hover near it without ever grasping — the shaped reward is maximized, but the actual task isn’t completed.

5.3 Potential-Based Reward Shaping

Potential-based reward shaping (Ng et al., 1999) provides a theoretical guarantee: if the shaping function has the form:

F(s, s') = γ·Φ(s') - Φ(s)

for some potential function Φ(s), then the optimal policy is invariant — shaping doesn’t change what the optimal policy is, only how fast the agent finds it. This is the safe way to add shaping: define a potential function (e.g., Φ(s) = -‖x - x_goal‖) and derive F from it.
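A sketch of potential-based shaping on a 1-D chain, using Φ(s) = -|s - goal|; the telescoping structure means the discounted sum of shaping bonuses along any trajectory depends only on the endpoint potentials:

```python
# Potential-based shaping bonus F = gamma*Phi(s') - Phi(s): positive for steps
# toward the goal, negative for steps away. Constants are illustrative.
gamma, goal = 0.99, 10.0
phi = lambda s: -abs(s - goal)

def shaping(s, s_next):
    return gamma * phi(s_next) - phi(s)
```

Summing γᵗ·F over a trajectory telescopes to γᵀ·Φ(s_T) - Φ(s₀), which is why the shaped and unshaped problems rank policies identically.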

5.4 Curriculum Learning

Curriculum learning starts with easy task variants and progressively increases difficulty:

  - Start with the goal close to the initial state, then move it farther away
  - Begin with loose tolerances (a large hole, a light object) and tighten them over training
  - For locomotion, train on flat ground before stairs and rubble
  - Widen domain randomization ranges as the policy improves, as in ADR

5.5 Hindsight Experience Replay (HER)

Hindsight Experience Replay (Andrychowicz et al., 2017) is an elegant solution to sparse rewards in goal-conditioned settings.

The insight: a failed trajectory still reaches some final state. Instead of discarding it, relabel the trajectory with the achieved state as the intended goal.

Original:  goal = "stack block on red block"  →  failed  →  reward = 0
Relabeled: goal = "place block at position (0.3, 0.2, 0.1)"  →  succeeded!  →  reward = 1

Every trajectory becomes a successful demonstration of something. The policy learns the structure of the task (how to reach arbitrary positions) and gradually generalizes to harder goals. HER converts a sparse-reward problem into a dense-reward one without manual shaping.
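Relabeling is a small transformation on stored trajectories. A sketch with scalar states and goals (the dictionary schema is illustrative):

```python
# HER relabeling: store a failed trajectory a second time with its goal
# replaced by the state it actually reached, so its reward becomes informative.
def relabel(trajectory, reward_fn):
    achieved_goal = trajectory[-1]["next_state"]
    return [
        {**step,
         "goal": achieved_goal,
         "reward": reward_fn(step["next_state"], achieved_goal)}
        for step in trajectory
    ]

reward_fn = lambda state, goal: 1.0 if abs(state - goal) < 1e-6 else 0.0

traj = [{"state": 0.0, "action": 0.3, "next_state": 0.3, "goal": 1.0, "reward": 0.0},
        {"state": 0.3, "action": 0.2, "next_state": 0.5, "goal": 1.0, "reward": 0.0}]
relabeled = relabel(traj, reward_fn)
```

The original transitions stay in the replay buffer too; the relabeled copies are added alongside them, so the agent sees both the intended goal (failure) and the achieved goal (success).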


6. Manipulation

Manipulation — grasping, placing, assembling, tool use — is the central capability that makes robots useful in unstructured environments.

6.1 Grasping

Analytical grasping reasons about physics:

  - Force closure: the contact forces the fingers can apply span every possible wrench, so any disturbance can be resisted
  - Form closure: the object is geometrically caged by the contacts, independent of friction
  - Friction cones: each contact force must stay inside a cone whose opening is set by the friction coefficient

Data-driven grasping trains neural networks to predict grasp success:

  - Train on large datasets of grasp attempts, simulated or collected on real robots
  - Input an image or depth map plus a candidate grasp pose; output a predicted success probability
  - Sidestep the precise object models that analytical methods require

Modern grasping pipelines combine perception (object detection/segmentation), grasp prediction, motion planning (collision-free path to the grasp pose), and control (compliant approach and force-controlled closure).

6.2 Dexterous Manipulation

Dexterous manipulation uses multi-fingered hands (3–5 fingers, 15–25 DOF) to manipulate objects with human-like skill.

Why it’s hard:

In-hand reorientation — rotating a held object to a target orientation without putting it down — is a benchmark task. OpenAI’s Rubik’s Cube work (2019) and NVIDIA’s DeXtreme (2023) demonstrated that RL with sim-to-real transfer can achieve dexterous manipulation, though reliability remains below human level.

6.3 Contact-Rich Tasks

Contact-rich manipulation involves sustained, controlled contact: peg insertion, screwing, polishing, surface wiping.

Key challenges:

  - Tolerances are often tighter than perception accuracy, so vision alone cannot finish the task
  - Forces must be regulated alongside positions; pure position control fights the constraint
  - Dynamics switch discontinuously at contact events, complicating both modeling and learning

Industrial assembly often combines:

  1. Vision for coarse alignment (get the peg near the hole)
  2. Force-guided search (spiral search pattern with force feedback to find the hole)
  3. Impedance control for insertion (compliant in the insertion direction)

RL approaches to assembly (e.g., Luo et al., 2021) learn these strategies end-to-end from force and proprioceptive feedback, often outperforming hand-coded search patterns.


7. Locomotion

While manipulation concerns what a robot does with its hands, locomotion concerns how it moves through the world.

7.1 Why Legs?

Wheels are simple and efficient on flat ground. Legs exist for terrain traversal: stairs, rubble, gaps, uneven ground. Biological systems overwhelmingly use legs for terrestrial locomotion. The tradeoff: legs are mechanically complex, dynamically unstable, and harder to control — but they can go almost anywhere.

The fundamental locomotion challenge: maintain balance while moving. A wheeled robot is statically stable (it won’t fall over if you stop the motors). A legged robot must actively balance — it’s an inherently dynamic system.

7.2 Quadruped Locomotion

Quadruped robots (Boston Dynamics Spot, ANYmal, Unitree Go1/Go2) have become the most commercially successful legged platforms.

Gait patterns describe which legs move when:

Gait | Pattern | Speed | Stability
Walk | 3 legs on ground at all times, move one at a time | Slow | Statically stable
Trot | Diagonal pairs move together (front-left + rear-right, then front-right + rear-left) | Medium | Dynamically stable
Gallop | Front pair then rear pair (or rotary variant) | Fast | Dynamically stable, brief flight phases

Central Pattern Generators (CPGs) are bio-inspired controllers: networks of coupled oscillators that produce rhythmic patterns without sensory feedback. Each oscillator drives one leg, and coupling between oscillators determines the gait. By modulating oscillator frequency and phase relationships, a CPG can transition between gaits. CPGs provide a low-dimensional parameterization of locomotion that RL can learn to modulate.
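A minimal CPG sketch: four coupled phase oscillators whose fixed phase offsets encode a trot. The frequency, coupling strength, and Euler integration below are illustrative:

```python
import numpy as np

freq = 2.0 * np.pi * 1.5                       # shared 1.5 Hz stride frequency
offsets = np.array([0.0, np.pi, np.pi, 0.0])   # FL, FR, RL, RR: diagonal pairs in phase
k = 4.0                                        # coupling strength

def step(phases, dt=0.001):
    """One Euler step: each oscillator advances at freq and is pulled toward
    its target phase offset relative to every other oscillator."""
    dphi = np.full(4, freq)
    for i in range(4):
        for j in range(4):
            dphi[i] += k * np.sin(phases[j] - phases[i] - (offsets[j] - offsets[i]))
    return phases + dphi * dt

phases = np.array([0.1, 0.5, 2.0, 3.0])        # arbitrary initial phases
for _ in range(8000):                           # 8 s of simulated time
    phases = step(phases)
```

After a transient, the network locks into the trot pattern (FL with RR, FR with RL, the pairs half a cycle apart); changing `offsets` switches the gait without touching anything else.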

7.3 Humanoid Balance

Bipedal locomotion is fundamentally harder: two legs provide a smaller support polygon, and walking involves controlled falling.

Inverted pendulum model: The simplest model of bipedal balance. The robot’s center of mass (CoM) is a point mass on a massless rigid leg pivoting at the foot. The Linear Inverted Pendulum Model (LIPM) constrains the CoM to a horizontal plane, yielding analytically tractable dynamics.

Zero Moment Point (ZMP) is the point on the ground where the net moment of all forces (gravity + inertia) is zero. The ZMP stability criterion: if the ZMP stays within the support polygon (convex hull of foot contacts), the robot won’t tip over. Traditional humanoid walking (Honda ASIMO, early Atlas) planned footsteps to keep the ZMP inside the support polygon.

Limitation: ZMP-based walking is conservative — it can’t handle running, jumping, or dynamic motions where the ZMP leaves the support polygon. Modern approaches use full-body dynamics and optimization-based control (MPC) to plan through dynamic maneuvers.

7.4 Modern RL-Based Locomotion

The dominant modern approach: train locomotion policies with RL in simulation, transfer to real hardware.

Workflow:

  1. Simulate the robot with a physics engine (MuJoCo, Isaac Sim)
  2. Train an RL policy (typically PPO) to walk, using reward = forward velocity - energy penalty - stability penalty
  3. Apply domain randomization to terrain, dynamics, sensor noise
  4. Deploy on the real robot

Landmark results:

  - ANYmal traversing rough alpine terrain with policies trained entirely in simulation (Lee et al., 2020; Miki et al., 2022)
  - Massively parallel GPU simulation training quadruped locomotion in minutes of wall-clock time (Rudin et al., 2022)
  - Learned controllers for bipeds such as Cassie demonstrating robust walking on real hardware

Key insight from locomotion research: proprioceptive policies (using only joint angles, velocities, and IMU — no vision) can handle remarkably complex terrain by learning to “feel” the ground through foot contacts. Vision is added for anticipatory behavior (seeing a staircase before reaching it).


8. Safety

A robot that learns by trial and error can break itself, damage its environment, or injure people. Safety is not optional — it’s the primary constraint.

8.1 Safe Exploration

Safe exploration asks: how can an agent explore and learn without visiting catastrophic states?

Approaches:

8.2 Constrained RL (CMDPs)

A Constrained Markov Decision Process (CMDP) adds safety constraints to the standard MDP:

maximize  E[Σ γᵗ r(sₜ, aₜ)]          (return)
subject to E[Σ γᵗ cᵢ(sₜ, aₜ)] ≤ dᵢ    (constraint i)

where cᵢ is a cost function (e.g., collision indicator) and dᵢ is the maximum allowable cumulative cost.

Lagrangian relaxation is the standard solution: convert the constrained problem to an unconstrained one by introducing Lagrange multipliers:

L(π, λ) = E[Σ γᵗ r] - Σᵢ λᵢ · (E[Σ γᵗ cᵢ] - dᵢ)

Optimize π to maximize L (the primal), and optimize λ to minimize L (the dual). The multipliers λᵢ are learned alongside the policy — they automatically increase when constraints are violated and decrease when satisfied. Algorithms: CPO (Constrained Policy Optimization, Achiam et al., 2017), PPO-Lagrangian.
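The dual update on λ is a one-line gradient step. A sketch with a made-up sequence of measured constraint costs (the learning rate and numbers are illustrative):

```python
# Dual ascent on the Lagrange multiplier: lambda rises while the measured
# constraint cost exceeds its budget d, and decays (never below zero) once
# the constraint is satisfied.
def update_multiplier(lmbda, avg_cost, budget, lr=0.05):
    return max(0.0, lmbda + lr * (avg_cost - budget))

lmbda, budget = 0.0, 0.1
for avg_cost in [0.5, 0.4, 0.3, 0.15, 0.08, 0.05]:   # costs fall as the policy adapts
    lmbda = update_multiplier(lmbda, avg_cost, budget)
```

In a full algorithm this update runs alongside policy optimization: the policy maximizes r - λ·c, so a growing λ makes constraint violations progressively more expensive.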

8.3 Risk-Aware Planning

Standard RL optimizes expected return. But for safety, we care about worst-case or tail outcomes.

CVaR (Conditional Value at Risk): Instead of optimizing the mean return, optimize the expected return in the worst α% of outcomes. CVaR₀.₀₅ optimizes the average return in the bottom 5% of trajectories — focusing on catastrophic scenarios.
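Empirical CVaR over a batch of trajectory returns is a few lines (a sketch; `alpha` selects the worst fraction):

```python
import numpy as np

def cvar(returns, alpha=0.05):
    """Mean return over the worst alpha-fraction of trajectories."""
    returns = np.sort(np.asarray(returns))
    k = max(1, int(np.ceil(alpha * len(returns))))
    return returns[:k].mean()

returns = np.array([10.0, 9.0, 8.0, 7.0, -50.0])   # one catastrophic rollout
```

Here the mean return is -3.2, but `cvar(returns, alpha=0.2)` is -50: the tail statistic surfaces the catastrophe that the expectation averages away.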

Worst-case optimization: Robust MDPs model uncertainty as an adversary that chooses the worst-case transition dynamics within a set. The policy is optimized against this adversary, guaranteeing performance under any dynamics in the uncertainty set.

Chance constraints: Require that safety violations occur with probability less than ε:

P(unsafe event) ≤ ε

This is a probabilistic version of hard constraints, appropriate when deterministic guarantees are impossible but statistical guarantees are sufficient.

8.4 Connection to Autonomous Driving Safety

Autonomous driving is the highest-stakes application of robot learning, with mature safety frameworks:

  - ISO 26262 (functional safety): rigorous engineering processes for hardware and software faults
  - SOTIF (ISO 21448, Safety of the Intended Functionality): hazards that occur without any fault, such as perception limits in rare conditions
  - Responsibility-Sensitive Safety (RSS): formalized rules (safe following distances, right-of-way) that bound planner behavior
  - Redundancy and fallback: independent subsystems that can bring the vehicle to a safe stop


Summary: Key Concepts for Interviews

Topic | Core question | Key concepts
Kinematics | Where is the robot? | FK (always solvable), IK (hard: non-unique, singularities), Jacobian
Control | How to move it? | PID (simple, per-joint), MPC (optimal, handles constraints), impedance (for contact)
Imitation learning | Learn from demos? | BC (simple, distribution shift), DAgger (interactive correction), IRL (infer reward)
Sim-to-real | Sim → real? | Domain randomization (brute-force robustness), sysid (match the sim), domain adaptation (invariant features)
Reward shaping | Sparse reward? | Potential-based shaping (preserves optimality), HER (relabel goals), curriculum (progressive difficulty)
Manipulation | Grasp and use? | Force/form closure, data-driven grasping, dexterous hands (high DOF, contact discontinuities)
Locomotion | Walk and balance? | Gaits, CPGs, ZMP (stability criterion), RL + sim-to-real (modern approach)
Safety | Don’t break things? | CMDPs (constraints in RL), CVaR (worst-case), safety filters, SOTIF (system-level)
