# Anna Pawlicka

Programmer. Hiker. Cook. Always looking for interesting problems to solve.

# Q-Learning for Aibo

07 Feb 2013 | Aibo, artificial curiosity, reinforcement learning, robotics, sensor, Webots

I’ve been looking more into motor actions and their effect on the servos. The initial position is always zero, and each time a new action is applied to a motor (wb_servo_set_position()), the value is interpreted as an absolute position. To emulate relative positioning, I’ll have to store the last value passed to the method and add it to the newly computed one. I’m not sure this will improve my algorithm, as relative positioning seems likely to end up maxing out the joint position.
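A minimal Python sketch of that bookkeeping, kept independent of the simulator. The `set_position` callable and the `min_pos`/`max_pos` joint limits are assumptions made up for illustration, not part of the Webots API. Clamping is one possible way to deal with the joint maxing out.

```python
class RelativeServo:
    """Emulates relative positioning on top of an absolute-position setter."""

    def __init__(self, set_position, min_pos, max_pos):
        # set_position: any callable taking one absolute angle (a hypothetical
        # stand-in for the simulator's servo call).
        # min_pos / max_pos: assumed joint limits.
        self._set_position = set_position
        self._min = min_pos
        self._max = max_pos
        self._last = 0.0  # the initial position is always zero

    def move_by(self, delta):
        # Add the new value to the last commanded one, then clamp so that
        # repeated moves in one direction saturate at the joint limit.
        target = max(self._min, min(self._max, self._last + delta))
        self._last = target
        self._set_position(target)
        return target
```

With limits of -1.0 and 1.0, three calls to `move_by(0.4)` command 0.4, 0.8 and then 1.0, so the joint does max out but never goes past its limit.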

I’m back to the Q-Learning algorithm. The design is as follows:
1. Goal: Approach red ball. The last distance reading minus the current distance reading determines the reward (greater decrease = greater reward). So if the previous distance reading was 56 and the current distance reading is 53, it receives a reward of +3.
2. Goal: Avoid Obstacles. If one of the bumpers is pressed, the reward is -2.

3. Goal: Avoid Staying Still. If the distance reading hasn’t changed in the last five steps, it receives a negative reward of -2. Presumably, if Aibo is receiving identical distance readings for five or more steps in a row, it is hung up or not moving.

So how will the actual Q-Values be calculated? Basically we just need an equation that increases the Q-Value when a reward is positive, decreases the value when it is negative, and holds the value at equilibrium when the Q-Values are optimal. The equation is as follows:

$$Q(a,i) \leftarrow Q(a,i) + \beta \bigl( R(i) + Q(a_1,j) - Q(a,i) \bigr)$$

where the following is true:

$Q$ – a table of Q-values
$a$ – previous action
$i$ – previous state
$j$ – the new state that resulted from the previous action
$a_1$ – the action that will produce the maximum Q value
$\beta$ – the learning rate (between 0 and 1)
$R$ – the reward function

This calculation must occur after an action has taken place, so that the robot can determine how successful the action was (which is why the previous action and previous state are used).
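As a rough sketch, a single update could be written like this in Python. It follows the equation above literally, so there is no discount factor. The table layout (a dictionary keyed by `(action, state)`), the function name `q_update` and the state labels `"s0"`/`"s1"` are my own assumptions for illustration.

```python
from collections import defaultdict


def q_update(q, action, state, reward, new_state, actions, beta=0.5):
    """One update: Q(a,i) <- Q(a,i) + beta * (R + max over a1 of Q(a1,j) - Q(a,i))."""
    # Best value available from the new state j (unseen entries default to 0).
    best_next = max(q[(a, new_state)] for a in actions)
    old = q[(action, state)]
    q[(action, state)] = old + beta * (reward + best_next - old)
    return q[(action, state)]


q = defaultdict(float)  # all Q-values start at zero
actions = ["NORTH", "SOUTH", "EAST", "WEST"]

# Previous state "s0", previous action "NORTH", reward +3, new state "s1":
# 0 + 0.5 * (3 + 0 - 0) = 1.5
q_update(q, "NORTH", "s0", 3.0, "s1", actions)
```

Repeating the same update with the same reward moves the value to 2.25, so it keeps creeping towards the target rather than jumping to it, which is what the learning rate is for.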

In order to implement this algorithm, all movement by the robot must be divided into steps. Each step consists of reading the percepts, choosing an action, and evaluating how well the action performed. All Q-values are equal to zero at the first step, but during the next step (when the algorithm is invoked), Q(a,i) is set to a value derived from the reward received for the last action. So as the robot moves along, the Q-values are recalculated repeatedly, gradually becoming more refined (and more accurate).

I am going to try this out now – but with very simplified actions (joint angles will be replaced by movements NORTH, SOUTH, etc.).
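Here is a small sketch of what such a simplified experiment might look like, with the step loop and the three reward rules on a toy grid. Everything about the world is an assumption for illustration: the 5x5 grid, the ball position, Manhattan distance standing in for the sensor reading, walls standing in for the bumpers, summing the three rewards, and the occasional random action (`epsilon`) for exploration.

```python
import random

# Simplified actions as (dx, dy) moves on a grid.
ACTIONS = {"NORTH": (0, 1), "SOUTH": (0, -1), "EAST": (1, 0), "WEST": (-1, 0)}
SIZE = 5       # assumed grid size (hypothetical)
BALL = (4, 4)  # assumed ball position (hypothetical)


def distance(pos):
    # Manhattan distance stands in for the distance sensor reading.
    return abs(BALL[0] - pos[0]) + abs(BALL[1] - pos[1])


def step(pos, action):
    """Apply an action; return (new_pos, bumped). Hitting a wall counts as a bumper press."""
    dx, dy = ACTIONS[action]
    x, y = pos[0] + dx, pos[1] + dy
    if 0 <= x < SIZE and 0 <= y < SIZE:
        return (x, y), False
    return pos, True


def reward(prev_dist, cur_dist, bumped, recent):
    r = prev_dist - cur_dist  # goal 1: getting closer is rewarded
    if bumped:
        r -= 2                # goal 2: avoid obstacles
    if len(recent) == 5 and len(set(recent)) == 1:
        r -= 2                # goal 3: five identical readings = stuck
    return r


def train(episodes=200, beta=0.5, epsilon=0.1, max_steps=50, seed=0):
    rng = random.Random(seed)
    q = {}  # (action, state) -> value; missing entries count as zero
    for _ in range(episodes):
        pos, recent = (0, 0), []
        for _ in range(max_steps):
            # Choose an action: mostly greedy, sometimes random.
            if rng.random() < epsilon:
                action = rng.choice(list(ACTIONS))
            else:
                action = max(ACTIONS, key=lambda a: q.get((a, pos), 0.0))
            # Act, then read the percepts.
            prev_dist = distance(pos)
            new_pos, bumped = step(pos, action)
            recent = (recent + [distance(new_pos)])[-5:]
            # Evaluate the previous action with the update rule (no discount).
            r = reward(prev_dist, distance(new_pos), bumped, recent)
            best_next = max(q.get((a, new_pos), 0.0) for a in ACTIONS)
            old = q.get((action, pos), 0.0)
            q[(action, pos)] = old + beta * (r + best_next - old)
            pos = new_pos
            if pos == BALL:
                break
    return q
```

Calling `train()` returns the learned table, and an entry such as `q[("NORTH", (0, 0))]` can then be compared against the other actions in the same state to see which ones the rewards have favoured.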

## Standing position of Bazinga as a new fitness function.

It turns out that the infrared sensor does not work as I expected. First of all, it interprets lighter shades of various colours as red (e.g. yellow!). I’ve tried to modify the lookup table, but could not find the correct settings. Second, it does not work in fast mode (vital for evolution!). To sum up, it’s not a reliable sensor to use in a fitness function. I will definitely implement the red ball seeking behaviour, but instead of using the infrared sensor, I’ll do some simple camera image processing. However, before I jump into that, I’d like to make the dog move and not jerk in a random manner!