Continuous Learning Series

Reinforcement Learning: A Quick Overview

Part 1 of the Reinforcement Learning Series

Mohit Pilkhan
Published in Building Fynd · 5 min read · May 11, 2020


When you wish to instruct, be brief. — Cicero

I was in the third year of college when I bought the book Artificial Intelligence: A Modern Approach. I was deeply impressed by the book's cover, which shows a mechanical hand playing chess. This book might be one of the many reasons why I love working in this field.

Cover page of Artificial Intelligence: A Modern Approach

Artificial Intelligence, or AI, is where science, technology, and the future overlap. At Fynd, we apply AI to craft smarter solutions for the retail and e-commerce industry.

As a company we believe in continuous self-improvement; it is one of our core values. We are always eager to learn and use new technologies to make our products better. The ML team has been researching Reinforcement Learning (RL) and its system of rewards and penalties for some time. We believe that exploring together creates better opportunities for everyone, hence this blog.

This series will help you discover RL and keep you updated with our research and progress. In this blog, we will look at the basic concepts in RL. So, let's get started.

What is Reinforcement Learning?

Reinforcement Learning (RL) is learning "what to do": how to map situations to actions so as to maximise a numerical reward signal. Unlike unsupervised learning, the goal is not to find hidden structure in the data, although such structure can still be useful for maximising the reward.

If you are interested in deep learning, you may have noticed that many recent research papers combine reinforcement learning, genetic algorithms, and deep learning techniques to solve various problems. The problem of neural architecture search, in particular, is being addressed with RL.

Those of you familiar with RL may already know the Multi-Armed Bandit problem, a classic problem that demonstrates the exploration vs exploitation dilemma. We will talk about it in detail in our next blog post.

Another interesting exploration is the monthly AWS DeepRacer races. For our team, they have turned out to be a fun way to experiment with reinforcement learning.

RL Resources and Examples

  • Alibaba Luban: Luban is an automatic banner design tool. It uses RL to learn which kinds of designs users prefer.
  • Designing Neural Networks: This paper is an excellent read showing how the process of neural network design can be automated using RL.
  • AlphaGo: Go is a very complex game. While discussing its complexity, Demis Hassabis said: “There are more possible Go positions than there are atoms in the universe”. This may seem like an exaggeration, but DeepMind’s AlphaGo did beat the world champion Lee Sedol in four of the five games they played. AlphaGo used deep RL techniques.

Some Basic RL Concepts and Terminologies

Below are some important concepts in RL. These terms will come up again in our upcoming blogs, where we will talk about implementing solutions for various problems.

1. Evaluative vs Instructive Feedback

Purely evaluative feedback indicates how good the action taken was, but not whether it was the best or the worst action possible. Purely instructive feedback, on the other hand, indicates the correct action to take, independently of the action actually taken. [1]

The main difference between reinforcement and supervised learning is whether the feedback received is evaluative or instructive.

In reinforcement learning, we take an action and evaluate the reward that comes back. In supervised learning, we take an action, for example assigning class labels in an image classification task, and then compare it with the correct action that should have been taken. This comparison is done using an appropriate loss function, which is then used to instruct the learning algorithm (through optimisers such as Adam) towards the correct action.
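
To make the contrast concrete, here is a minimal Python sketch (the arm means, labels, and numbers are made up for illustration): with evaluative feedback the agent only sees a reward for the action it actually took, while with instructive feedback the learner is told the correct answer and measures its error against it.

```python
import random

# --- Evaluative feedback (reinforcement learning style) ---
# The agent picks an action and only observes a reward for that action.
# It is never told which action would have been best.
def pull_arm(action):
    true_means = [1.0, 2.5, 0.3]          # hidden from the agent
    return random.gauss(true_means[action], 1.0)

action = random.randrange(3)
reward = pull_arm(action)                  # answers "how good was this action?"

# --- Instructive feedback (supervised learning style) ---
# The learner is told the correct answer and measures its error against it.
def squared_loss(prediction, correct_label):
    return (prediction - correct_label) ** 2

prediction, correct_label = 0.7, 1.0
loss = squared_loss(prediction, correct_label)  # answers "what should it have been?"

print(reward, loss)
```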

2. Associative vs Non-Associative task

An associative task is one in which the environment has more than one state, so you need to associate an action with each state. If there is only a single state, the task is non-associative.
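
As a tiny illustration (the state names and actions here are hypothetical), a non-associative policy ignores the state entirely, whereas an associative policy maps each state to its own action:

```python
# Non-associative: a single state, so the policy is just one action choice.
best_single_action = "LEFT"

def non_associative_policy(_state=None):
    return best_single_action            # same action regardless of context

# Associative: several states, so the policy maps each state to an action.
policy_table = {
    "intersection_A": "LEFT",
    "intersection_B": "STRAIGHT",
    "intersection_C": "RIGHT",
}

def associative_policy(state):
    return policy_table[state]

print(non_associative_policy(), associative_policy("intersection_B"))
```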

3. Stationary vs Non-Stationary task

In a stationary problem setting, the true reward of an action does not change. Suppose you have 10 levers to pull in a 10-armed bandit problem. If the true reward of each lever stays the same throughout the play, we say the problem setting is stationary. For such tasks, a fair number of repeated plays is enough to form a good estimate of each action's true reward.

In a non-stationary problem setting, the true rewards change over time. Each time you pull a lever, the underlying mechanism (the process responsible for assigning true rewards) may change the true rewards of some or all of the arms based on some criteria. So even if you keep pulling the same arm, you cannot expect the same reward. This setting is non-stationary.
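
One simple way to simulate the difference (a rough sketch, not the exact testbed we will build in the next post) is to let every arm's true reward drift by a small random walk after each pull; setting the drift to zero recovers the stationary case.

```python
import random

NUM_ARMS = 10
# Initial true reward for each arm, hidden from the agent.
true_rewards = [random.gauss(0.0, 1.0) for _ in range(NUM_ARMS)]

def pull(arm, drift=0.01):
    """Return a noisy reward for `arm`, then let every arm's true value drift.

    With drift=0.0 the problem is stationary; with drift>0 it is non-stationary.
    """
    reward = random.gauss(true_rewards[arm], 1.0)
    for i in range(NUM_ARMS):
        true_rewards[i] += random.gauss(0.0, drift)   # slow random walk
    return reward

for step in range(5):
    print(pull(arm=3))
```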

4. Exploitation vs Exploration

Suppose you are driving a car, and every intersection you come across offers four choices: [LEFT, RIGHT, STRAIGHT, TURN BACK]. The first time you came to an intersection, you chose an action at random. Let's say you chose LEFT and received a reward of 100 dollars.

Now, what will you do at the next intersection?

You can keep taking a LEFT turn and earn 100 dollars each time (assuming the problem is stationary), i.e. exploiting the information you already have. However, if you take the RIGHT turn, you may be rewarded with 1000 dollars, or with just 10 dollars. In the worst case, a turn may even carry a negative reward, meaning that instead of receiving money you would be paying it. The dilemma is that, with incomplete information, you are left wondering whether better rewards are waiting behind another turn. With exploration, we take some risk in order to collect information about the unknown options.
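
A common way to balance the two is the epsilon-greedy strategy, which we will cover properly in the next post; the sketch below (with made-up intersection rewards) exploits the best-known action most of the time and explores a random one with a small probability epsilon.

```python
import random

actions = ["LEFT", "RIGHT", "STRAIGHT", "TURN BACK"]
estimated_reward = {a: 0.0 for a in actions}   # running average per action
pull_counts = {a: 0 for a in actions}
EPSILON = 0.1                                   # fraction of the time we explore

def choose_action():
    if random.random() < EPSILON:
        return random.choice(actions)                        # explore
    return max(actions, key=lambda a: estimated_reward[a])   # exploit

def update(action, reward):
    pull_counts[action] += 1
    # Incremental sample-average update of the reward estimate.
    estimated_reward[action] += (reward - estimated_reward[action]) / pull_counts[action]

# Simulate a few intersections with illustrative (made-up) average payouts.
payout = {"LEFT": 100, "RIGHT": 10, "STRAIGHT": -5, "TURN BACK": 0}
for _ in range(100):
    a = choose_action()
    update(a, payout[a] + random.gauss(0.0, 1.0))

print(estimated_reward)
```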

Conclusion

We now have a foundation of reinforcement learning concepts and are ready to explore the Multi-Armed Bandit problem. In the next blog, we will set up a k-armed bandit testbed, as shown in the figure below, and then explore various methods to solve the k-armed bandit problem.

10-armed Bandit Testbed

WHO IS THIS GUY?

Hi, I am Mohit Pilkhan. I am a Software Engineer and currently a member of the Machine Learning team at Fynd. At Fynd, we continuously leverage the latest trends in AI to make our products better.

References

  1. Sutton, R. S. and Barto, A. G. Reinforcement Learning: An Introduction
  2. https://github.com/mahakal001/reinforcement-learning
  3. Designing Neural Network Architectures Using Reinforcement Learning. https://arxiv.org/pdf/1611.02167.pdf
  4. AI visual design is already here — and it won’t hesitate to take over your petty design job
  5. Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm. https://arxiv.org/pdf/1712.01815.pdf

To give feedback, please leave a comment. You can learn more about our current research at https://research.fynd.com/.

If you notice any discrepancy in this article please leave a comment below so that I can address the issue as soon as possible.
