a team of researchers from Microsoft proposes to imbue reinforcement learning, an AI training technique that employs rewards to stimulate systems towards positively positive goals, which claim that it could boost useful exploration to gather critical experiences to The learning.
As the researchers explain, reinforcement learning is commonly implemented through specific policy rewards designed for a predefined goal. Problematically, these extrinsic Rewards are limited in scope and can be difficult to define, unlike intrinsic Rewards that are independent of the task and quickly indicate success or failure.
In the search for an intrinsic policy, the researchers developed a framework that includes mechanisms motivated by human affection, one that motivates agents through impulses such as delight. Using a computer vision system that models the reward and another system that uses data to solve multiple tasks, measures human smiles as a positive effect.
The framework encourages agents to explore virtual or real-world environments without entering dangerous situations, and has the advantage of being independent of any specific application of artificial intelligence. A positive intrinsic reward mechanism predicts human smile responses as exploration evolves, while a sequential decision-making framework learns a generalizable policy. As for the positive intrinsic affect model, it changes the selection of actions so that it biases the actions that provide better intrinsic rewards, and a final component uses the data collected during the agent's exploration to create representations for visual recognition and understanding of Tasks.
To test the framework, the researchers collected data from five subjects responsible for exploring a three-dimensional digital maze with a vehicle, as well as synchronized images of each of their faces. (Each person drove for 11 minutes each, providing a total of 64,000 pictures). Participants were told to explore the environment, but were not given additional instructions on other objectives, and their smile responses were calculated and recorded using an open source algorithm.
The intrinsic motivation model based on affection was trained using the data of the subjects, with image frames of the vehicle dashboard that serves as input and the probability of smile as output. The results of other experiments show that the framework improved safe exploration and at the same time allowed efficient learning; Compared to the baselines, the researchers' intrinsic reward policy covered 46% more space in the maze and collided with obstacles 29% less time.
That accumulated experiences can help us learn general representations to solve tasks, including depth estimation, scene segmentation and translation of sketch to image .
