Tippy Project

Interactive Intelligent Agents

Summary: Autonomous intelligent agent simulations are abundant, but users can rarely interact with the trained agent. The Tippy Project is my attempt to simulate robots that users can play with.

Site Overview

  1. Optimizer Demo - What is CMA-ES and how does it work?
  2. Complex Limbs - Watch a robot learn to run
  3. Interactive Agent - Control a robot in the browser

Goals

Interaction

An intelligent agent is simple: it perceives its environment, takes actions, and receives rewards based on those actions. To Tippy, the human user is just another aspect of its environment; from Tippy's perspective, user inputs are no different from the height of the ground or its own body angle. What distinguishes user inputs is that I designed the reward (or objective) function to depend on them: the reward function evaluates the agent's actions relative to whatever command the user is sending.
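As a rough illustration, a tracking reward of this kind can be as simple as a negated squared error between the commanded and actual motion. The sketch below is hypothetical Python; the names commanded_velocity and actual_velocity are mine, not taken from the project's code:

    def step_reward(commanded_velocity, actual_velocity):
        # Hypothetical sketch: penalize deviation from the user's command.
        # A higher (less negative) value means Tippy is tracking the command better.
        error = commanded_velocity - actual_velocity
        return -error ** 2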

Because training takes place over hundreds of episodes, I simulated random user inputs, and the quality of each set of neural network weights was evaluated by how accurately Tippy followed those simulated commands. Because some set of weights might happen to randomly excel at one particular sequence of commands (causing overfitting), each solution was scored as the mean accuracy across a set of random command sequences that exhibited a range of "twitchiness". Random variation between training episodes reduces overfitting but also slows the learning process. To increase consistency in the reward function, each episode also includes a sequence of commands to remain motionless and to move right as quickly as possible.
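A minimal sketch of that evaluation scheme might look like the following Python, assuming a rollout(weights, commands) helper that runs one episode and returns the mean tracking accuracy (all names and parameter values here are hypothetical):

    import numpy as np

    def random_command_sequence(length, twitchiness, rng):
        # Random walk of velocity commands; higher twitchiness means faster changes.
        steps = rng.normal(0.0, twitchiness, size=length)
        return np.clip(np.cumsum(steps), -1.0, 1.0)

    def evaluate_weights(weights, rollout, rng, n_random=4, length=500):
        # Mean accuracy over several random command sequences spanning a range
        # of twitchiness, plus fixed reference commands (remain motionless,
        # move right as fast as possible), treated here as separate sequences
        # for simplicity.
        sequences = [random_command_sequence(length, t, rng)
                     for t in np.linspace(0.05, 0.5, n_random)]
        sequences.append(np.zeros(length))
        sequences.append(np.ones(length))
        return float(np.mean([rollout(weights, cmds) for cmds in sequences]))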

One trick that dramatically increased the usability of the controls was forcing symmetry on the control system. Positive and negative directional inputs with the same magnitude are identical to Tippy; the other x-axis inputs are simply mirrored (the sign of the x velocity or angular velocity is flipped, while the order of the sensor inputs is reversed, right to left or left to right). This forced symmetry accelerates training by reducing the space of behaviors Tippy must learn, and it increases the predictability of Tippy's behavior. The only drawback is a tendency toward a discontinuity in behavior when inputs transition from positive to negative: if Tippy's control system responds very differently to the "flipped" inputs (x velocity, etc.), its behavior can change abruptly as the user input passes through zero.
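Concretely, this kind of symmetry can be enforced with a thin wrapper around the policy: negative commands are mirrored into positive ones before the network sees them, and the resulting action is mirrored back. The observation layout below (x velocity, angular velocity, body angle, then ground sensors) and the single signed action are assumptions made for the sake of the sketch, not the project's actual interface:

    def symmetric_policy(policy, obs, command):
        # Evaluate the network only on non-negative commands by mirroring
        # the world whenever the command is negative.
        if command >= 0:
            return policy(obs, command)
        mirrored = list(obs)
        mirrored[0] = -mirrored[0]          # flip x velocity
        mirrored[1] = -mirrored[1]          # flip angular velocity
        mirrored[3:] = mirrored[3:][::-1]   # reverse the sensor order, right to left
        return -policy(mirrored, -command)  # mirror the action back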

Speaking of controls, the user inputs were restricted so that Tippy would be capable of accurately following a random set of commands. It is not physically realistic for Tippy to change direction instantly, so the controls are restricted to enforce smooth transitions. Tippy is an inverted pendulum, so its acceleration is proportional to the sine of its body angle. I therefore first modeled the user inputs as a request for a given body angle, which forces the user to think in terms of accelerating Tippy rather than requesting a particular velocity. This was effective for early training because Tippy's target behavior directly corresponds to one of the inputs it receives (body angle); it was easy for a simple neural network to tend toward that input, almost thermostatically.

Unfortunately, acceleration-based controls have no natural upper limit, so users could ask Tippy to keep accelerating to unstable speeds. In addition, when the user requests zero acceleration (no body angle), the squared error can be small (a nearly-zero body angle) while the impact on physical behavior, a small "drifting" acceleration, is extremely distracting. As a human user, I find velocity-based behavior much more natural than acceleration-based. (Ice skaters may feel differently.)
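One simple way to keep velocity-based controls physically plausible is to rate-limit the user's request so Tippy is never asked to reverse direction instantly. The following smoothing step is a hypothetical illustration of that restriction (max_delta is an invented parameter), not the project's actual implementation:

    def smooth_command(raw_input, previous_command, max_delta=0.02):
        # Clamp how fast the commanded velocity may change per timestep,
        # enforcing smooth transitions instead of instant reversals.
        delta = raw_input - previous_command
        delta = max(-max_delta, min(max_delta, delta))
        return previous_command + delta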