Artificial Intelligence: 6. Q-Learning and SARSA on a Maze in OpenAI

William Scott
4 min read · Jan 27, 2019

Code: GitRepo

Hyperparameters Used:

- Learning Rate

- Discount Rate

- Exploration Rate

Learning Rate [0, 1]:

The learning rate is the amount of information from the current iteration that we blend into the Q-table, i.e. how far the current Q-value is moved toward the new estimate.

Discount Rate [0, 1]:

The discount rate is the weight given to the future value q’(s’, a’) in the update rule.

Exploration Rate [0, 1]:

Normally we would take the maximum-valued action in a given state. But to avoid converging to a local maximum instead of the global maximum, we occasionally take a random action even though it is not the max action. This helps diversify the search and visit every place possible, and the amount of exploration is decided by the exploration rate.
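As a rough illustration, an epsilon-greedy action choice could look like the sketch below (names such as `q_table`, `state`, and `n_actions` are assumptions for illustration, not taken from the repo):

```python
import numpy as np

def choose_action(q_table, state, n_actions, exploration_rate):
    # With probability `exploration_rate`, explore: pick a random action.
    if np.random.rand() < exploration_rate:
        return np.random.randint(n_actions)
    # Otherwise exploit: pick the action with the highest Q-value for this state.
    return int(np.argmax(q_table[state]))
```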

Q-Learning:

Q[s, a] += learning_rate * (reward + discount * q’ - Q[s, a])

Here we maintain a Q-table that helps us decide the action to take from a given state. Each cell of the table holds a Q-value q(s, a), where s is the state and a is the action, so the Q-table has size (number of states x number of actions).
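A minimal sketch of this update in Python (the table shape and variable names are assumptions; the actual repo may differ):

```python
import numpy as np

n_states, n_actions = 25, 4               # e.g. a 5x5 maze with 4 possible moves
q_table = np.zeros((n_states, n_actions))

def q_learning_update(state, action, reward, next_state,
                      learning_rate=0.3, discount=0.8):
    # q' is the best value reachable from the next state (off-policy target).
    q_next = np.max(q_table[next_state])
    # Move Q[s, a] a fraction `learning_rate` toward the TD target.
    q_table[state, action] += learning_rate * (
        reward + discount * q_next - q_table[state, action]
    )
```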

Note:

- All hyperparameter changes are evaluated on the same environment, so the changes in results are comparable.

- The global maximum corresponds to the shortest possible path through the maze; a local maximum is a short path, but not the shortest.

Convergence Condition:

- Convergence is assumed when we have the same reward for 8 consecutive iterations.
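One simple way to express that check (a sketch; the repo may implement it differently):

```python
def has_converged(episode_rewards, window=8):
    # Converged when the last `window` episode rewards are all identical.
    if len(episode_rewards) < window:
        return False
    recent = episode_rewards[-window:]
    return all(r == recent[0] for r in recent)
```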

Part 3:

Environment: 5x5

Observation:

- As the exploration rate is high, random actions are taken instead of the better-suited actions.

Observation:

- With both the learning rate and the exploration rate high, the same issue as above occurs: the model learns, but still takes a random action 80% of the time.

Observation:

- With a low learning rate and a low exploration rate, learning takes time and is not constant, as the model tends to be under-confident.

Observation:

- Without a discount factor the model simply doesn’t work, and the update to the Q-values no longer makes sense.

Observation:

- With a low exploration rate, an 80% learning rate, and a full discount rate, the learning wobbles a bit but becomes constant in the end.

Observation:

- With the learning rate at full and the discount at 80%, the model tends to find a good path, but it differs a little in each iteration; a constant path is not maintained.

Observation:

- With a 0 exploration rate, the model reaches a good short path very quickly and then constantly follows that path only.

- This is a drawback, as that path might be a near-shortest path instead of the shortest path available.

- Premature convergence is highly likely.

Best Hyperparameters:

Exploration Rate: 0.1 | Learning Rate: 0.3 | Discount: 0.8

Best method to run with hyperparameters:

- We can set the hyperparameters manually and check the performance, but the best way is to change them gradually according to the number of iterations (see the sketch after this list).

- In the beginning the exploration rate should be a little high, and then it should reduce to 0.01.

- The learning rate should also decrease, from 0.9 to 0.4.

- Discount rate can stay constant.
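One way such a schedule could look (a sketch; the decay constants are illustrative, not taken from the repo):

```python
def schedule(iteration, total_iterations):
    # Exploration: start a little high, decay toward 0.01.
    exploration_rate = max(0.01, 0.5 * (1 - iteration / total_iterations))
    # Learning rate: decay from 0.9 toward 0.4.
    learning_rate = max(0.4, 0.9 - 0.5 * iteration / total_iterations)
    # Discount stays constant.
    discount = 0.8
    return exploration_rate, learning_rate, discount
```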

Note:

The time taken is directly proportional to the number of iterations, so it is not reported separately.

Part 3c:

- Even if all actions have equal probability in the initial state, the learning is not adversely affected and convergence takes almost the same number of iterations, give or take a few.

Part 3e:

- Observing the Q-values of two separate intermediate states, the values increase for states that are desirable and decrease for states that are not. Essentially, the positive and negative rewards come into effect here.

Part 4:

- Extracted the Q-values from the previously trained model.

- Applied those Q-values to a new environment.

Code to apply the Q-values to the new environment:
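The original snippet is in the repo; a minimal sketch of reusing a saved Q-table greedily in a new maze environment might look like this (the environment id and file name are hypothetical, and the pre-0.26 Gym step API is assumed):

```python
import gym
import numpy as np

# Q-table saved from the model trained on the previous maze (hypothetical file).
q_table = np.load("q_table_old_maze.npy")

env = gym.make("NewMaze-v0")  # hypothetical id for the new maze environment
state = env.reset()
done = False

while not done:
    # Act greedily with respect to the old Q-values; no further learning.
    action = int(np.argmax(q_table[state]))
    state, reward, done, info = env.step(action)
```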

Output:

Observation:

- This definitely won’t work, as the model was trained on a different maze.

- The pointer in the maze gets stuck at different positions, because the Q-values were not learned for this maze.

SARSA:

a’ = get_second_action()

q’ = Q(s’, a’)

Q(s, a) += learning_rate * (reward + discount * q’ - Q(s, a))

In SARSA, after taking the first action we also choose the next action a’ from the new state s’, and we use the Q-value of that new state-action pair in the update.
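Reusing the `q_table` from the Q-learning sketch above, the SARSA update could be written as follows (names are assumptions, as before):

```python
def sarsa_update(state, action, reward, next_state, next_action,
                 learning_rate=0.3, discount=0.8):
    # q' comes from the action actually chosen in the next state (on-policy),
    # not from the maximum over actions as in Q-learning.
    q_next = q_table[next_state, next_action]
    q_table[state, action] += learning_rate * (
        reward + discount * q_next - q_table[state, action]
    )
```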

The maze below is solved with both the Q-learning and SARSA algorithms to compare their performance.

Q-Learning:

Converged in 60 Iterations

SARSA:

Converged in 33 Iterations

Observation:

- SARSA is an on-policy method: the value it relies on comes from the action a’ actually chosen by the current policy in the next state.

- Q-learning directly learns the optimal value, while SARSA learns a near-optimal policy that accounts for a little exploration.

- Q-learning takes a few more iterations to converge than SARSA here (60 vs. 33).

- On further analysis, Q-learning occasionally does better than SARSA; the variance may be due to the change in maze.
