Learning Hierarchical Representation in infoGAN




Seung Hee Yoon.

Department of Computer Science

The University of Southern California

Los Angeles, CA, 90089

yoon@usc.edu

April 2019






Fig 1. Desire 0 (left) sets a state that gives the agent the intention to cook 'steak', which requires ingredients 1 and 2, while desire 1 (right) is for cooking 'pasta', which requires ingredients 0 and 3. When choosing among ingredients, the chef must work out which is the best deal in terms of ingredient quality.


1. Problem formulation/Modeling/Implementation/Experiment Design

What is an intention code in the IRL setting?

When people think of their actions, they usually do not think about tiny muscle movements and angular forces. Instead, they tend to care about higher-level intentions such as 'walk', 'run', 'move the arm to the object', or 'grab the apple'. In the field of Inverse Reinforcement Learning, prior work such as [1] has shown that it is possible to encode such intentions, where the code represents the unlabeled intentions of an expert.

In that paper, the authors claim that their system can categorize unlabeled intention codes corresponding to expert trajectories under the GAIL setting; in other words, it assumes sub-policies within the demonstrated expert policy and learns intention codes that most plausibly represent those unlabeled sub-policies. The authors showed that the objective function for optimizing intention codes is equivalent to that of infoGAN under the GAIL setting, and used infoGAN to encode a disentangled representation of an agent's policy. The objective function is the following.




$$
\min_{\pi_\theta,\,Q}\;\max_{D}\;\;
\mathbb{E}_{\pi_\theta}\!\left[\log D(s,a)\right]
+\mathbb{E}_{\pi_E}\!\left[\log\bigl(1-D(s,a)\bigr)\right]
-\lambda_1\,\mathbb{E}_{i\sim p(i),\,(s,a)\sim\pi_\theta}\!\left[\log Q(i\mid s,a)\right]
-\lambda_2\,H(\pi_\theta)
$$
— Eq. 1


where $H(\pi_\theta)$ denotes the causal entropy of the policy and the $\log Q$ expectation is the variational lower bound on the mutual information between the intention code and the generated behavior, $I(i;G)=H(\pi_\theta)-H(\pi_\theta\mid i)\geq\mathbb{E}\!\left[\log Q(i\mid s,a)\right]$. Jonathan Ho and Stefano Ermon showed that this Jensen-Shannon divergence between the expert policy and the agent policy is equivalent to the expected total reward the agent collects over its trajectories [2]. In the trajectories under the above setting, the reward depends on three factors: the state, the action, and the intention. This means that the more closely the agent matches the expert's joint distribution over intention, action, and state, the more reward it receives. In other words, the expert holds an intention for each action in each state, and the agent tries to find the policies that most plausibly generate that distribution.
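To make Eq. 1 concrete, here is a minimal PyTorch-style sketch (not the code used in this work) of how the discriminator signal and the mutual-information bonus can be combined into a per-step surrogate reward for the policy; the interfaces `disc` and `posterior` and the weight `lambda_info` are assumptions.

```python
# Minimal sketch, assuming `disc(s, a)` returns D(s, a) in (0, 1) under the
# convention that D is the probability the pair was generated by the agent,
# and `posterior(s, a)` returns log Q(. | s, a) over the intention codes.
import torch

def surrogate_reward(disc, posterior, state, action, intention, lambda_info=1.0):
    """Per-step reward: -log D(s, a) + lambda_info * log Q(i | s, a)."""
    with torch.no_grad():
        d = disc(state, action)                        # shape [batch, 1]
        log_q = posterior(state, action)               # shape [batch, n_codes]
        log_q_i = log_q.gather(1, intention.view(-1, 1)).squeeze(1)
        return -torch.log(d + 1e-8).squeeze(-1) + lambda_info * log_q_i
```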




Hierarchical Intentions 







Fig 2. The Bayesian graph of the state transitions in single-intention GAIL (top). The hierarchical version forms multiple intention layers associated with the state on the same timeline (bottom). Since each reward depends on states, actions, and their paired intentions, terms for those three factors appear in the expectations of the JS-divergence terms of Eq. 1.


In the bottom graph, on the other hand, the reward is generated from the states, the actions, and the multiple intentions that influence one another along the graph. Thus, all of these factors must additionally be added to the expectation terms of Eq. 1, as in Eq. 2.





$$
\min_{\pi_\theta,\,Q}\;\max_{D}\;\;
\mathbb{E}_{(s,a,\,i_0,\ldots,i_M)\sim\pi_\theta}\!\left[\log D(s,a)\right]
+\mathbb{E}_{(s,a,\,i_0,\ldots,i_M)\sim\pi_E}\!\left[\log\bigl(1-D(s,a)\bigr)\right]
-\lambda_1\,\mathbb{E}_{(s,a,\,i_0,\ldots,i_M)\sim\pi_\theta}\!\left[\log Q(i_0,\ldots,i_M\mid s,a)\right]
-\lambda_2\,H(\pi_\theta)
$$
— Eq. 2



Extending the previous work, which has a single intention layer, I generalized the idea to multiple layers of intentions, where P(I) = P(i0, i1, ..., iM). One may consider a higher intention above 'walk' and 'run'. For instance, such a higher intention could be 'reach a certain place to grab an object'; that intention then influences and triggers multiple lower intentions that accomplish it.

Yet I found that, without an association between the states and each intention code, the extended formula collapses back to the original work, because 'I' can then be treated as a single intention layer that is simply a bit larger. Thus, an association between the state and its intentions was added, which makes the agent 'hold' its own intentions in every state it faces along each timeline.
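As an illustration of this state-conditioned hierarchy, the sketch below samples the top-level code i1 from a fixed prior and the sub-intention i0 from a small network conditioned on both i1 and the current state; all module names and dimensions are assumptions, not the actual HGAIL code.

```python
# Illustrative sketch of a state-conditioned intention hierarchy: i1 comes from
# a uniform prior, i0 is sampled conditioned on (i1, state). Names and sizes
# are assumptions for the example only.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.distributions import Categorical

class SubIntentionSampler(nn.Module):
    def __init__(self, state_dim, n_i1=2, n_i0=4, hidden=64):
        super().__init__()
        self.n_i1 = n_i1
        self.net = nn.Sequential(
            nn.Linear(state_dim + n_i1, hidden), nn.LeakyReLU(0.2),
            nn.Linear(hidden, n_i0),
        )

    def forward(self, state):
        i1 = torch.randint(self.n_i1, (state.size(0),))        # top-level code from the prior
        i1_onehot = F.one_hot(i1, self.n_i1).float()
        logits = self.net(torch.cat([state, i1_onehot], dim=1))
        i0 = Categorical(logits=logits).sample()                # sub-intention depends on (i1, state)
        return i1, i0
```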


Collectively, the original intention-conditioned GAIL objective function can be extended into the form below.







The whole objective function can then be written as below. Note that, for simplicity, this is the case where the number of intention layers is 2.






The additional term for hierarchical intentions consists of maximizing the mutual-information terms when approximating their auxiliary distributions with decoders (the terms with a minus prefix), and minimizing the mutual information (alternatively, maximizing the entropy) when each information term is computed without conditioning on the intentions.
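One plausible reading of this additional term, written as a loss sketch under my own assumptions about the decoders and weights: each decoder's log-likelihood on its code (the minus-prefixed terms) is maximized as a lower bound on the corresponding mutual information, while the entropy of the intention-unconditioned code distribution is kept high.

```python
# Hedged sketch of the hierarchical information terms; the decoder logits,
# weights, and entropy estimate are assumptions, not the report's exact loss.
import torch
import torch.nn.functional as F

def marginal_entropy(logits):
    p = F.softmax(logits, dim=1).mean(dim=0)       # marginal code distribution over the batch
    return -(p * torch.log(p + 1e-8)).sum()

def hierarchical_info_loss(logits_q0, logits_q1, i0, i1, lambda_info=1.0, lambda_ent=0.1):
    # -E[log Q0(i0|s,a)] - E[log Q1(i1|s,a)]: minimizing these maximizes the MI lower bounds
    nll = F.cross_entropy(logits_q0, i0) + F.cross_entropy(logits_q1, i1)
    # entropy of the intention-unconditioned code distributions, encouraged to stay high
    ent = marginal_entropy(logits_q0) + marginal_entropy(logits_q1)
    return lambda_info * nll - lambda_ent * ent
```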


HGAIL Architecture



Fig 3. The architecture of the HGAIL network. This structure represents the original distribution of each intention by feeding an encoding network into the policy generator, while decoding the auxiliary distributions at the end of the discriminator of the GAIL architecture. I used a single fully connected layer as the encoder and decoder for each intention.

Let us call this network HGAIL. The discriminator consists of a tanh function, a 32-channel convolution, batch norm, leaky ReLU, a 64-channel convolution, and again batch norm and leaky ReLU. Each convolution layer uses a stride of 1, since the experiments below rely on fine-grained features of the state.
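Below is a sketch of such a discriminator, assuming 3x3 kernels, a generic number of input channels, and a simple pooled sigmoid scoring head, none of which are specified in the text.

```python
# Discriminator sketch matching the layer list above (tanh, 32-ch conv, BN,
# LeakyReLU, 64-ch conv, BN, LeakyReLU, all stride 1). Kernel size, input
# channels, and the scoring head are assumptions.
import torch
import torch.nn as nn

class Discriminator(nn.Module):
    def __init__(self, in_channels=4):
        super().__init__()
        self.features = nn.Sequential(
            nn.Tanh(),
            nn.Conv2d(in_channels, 32, kernel_size=3, stride=1, padding=1),
            nn.BatchNorm2d(32), nn.LeakyReLU(0.2),
            nn.Conv2d(32, 64, kernel_size=3, stride=1, padding=1),
            nn.BatchNorm2d(64), nn.LeakyReLU(0.2),
        )
        self.head = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, 1), nn.Sigmoid())

    def forward(self, x):
        return self.head(self.features(x))
```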


The generator consists of a fully connected layer, batch norm, leaky ReLU, a 32-channel convolution, batch norm, leaky ReLU, a 16-channel convolution, batch norm, leaky ReLU, and the same for an 8-channel convolution layer. At the bottom of the generator there is another fully connected layer followed by a sigmoid.
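A corresponding generator sketch is shown below; the spatial size, kernel sizes, input dimension, and output dimension are assumptions made only so that the example runs.

```python
# Generator (policy network) sketch following the layer list above: FC, BN,
# LeakyReLU, then 32/16/8-channel convolutions each with BN and LeakyReLU,
# and a final FC + sigmoid. Sizes are illustrative assumptions.
import torch
import torch.nn as nn

class Generator(nn.Module):
    def __init__(self, in_dim=64, grid=8, out_dim=6):
        super().__init__()
        self.grid = grid
        self.fc_in = nn.Sequential(
            nn.Linear(in_dim, 32 * grid * grid),
            nn.BatchNorm1d(32 * grid * grid), nn.LeakyReLU(0.2))
        self.convs = nn.Sequential(
            nn.Conv2d(32, 32, 3, 1, 1), nn.BatchNorm2d(32), nn.LeakyReLU(0.2),
            nn.Conv2d(32, 16, 3, 1, 1), nn.BatchNorm2d(16), nn.LeakyReLU(0.2),
            nn.Conv2d(16, 8, 3, 1, 1), nn.BatchNorm2d(8), nn.LeakyReLU(0.2),
        )
        self.fc_out = nn.Sequential(
            nn.Flatten(), nn.Linear(8 * grid * grid, out_dim), nn.Sigmoid())

    def forward(self, z):
        h = self.fc_in(z).view(-1, 32, self.grid, self.grid)
        return self.fc_out(self.convs(h))
```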



Cooking scenario

This can be demonstrated in a cooking scenario. In this setting, the highest-level intention code should represent 'what to cook'. Below it, for example, a code represents 'grab a tomato' or 'take some beef'; and at the lowest level, the system must learn expert action trajectories whose actions are the basic moves 'up', 'down', 'left', 'right', 'get', and 'cook' in a non-deterministic environment. The action 'cook' moves the agent into a terminal state and resets the kitchen with another desire; see Fig 1.

Since a quality factor is attached to each ingredient, even among ingredients of the same kind, and that information is unknown until the agent reaches them, both the expert and the agent must scan almost all of the nodes spread over the map before 'get'-ting them; the quality is also hidden in the state representation until the node is touched. Yet, one does not have to check ingredients other than those needed for the dish one wants to make. To the agent, the goal might appear to be simply touching and taking whatever it encounters. However, as Fig 1 shows, the expert clearly has hidden, hierarchical intentions when taking an action. The purpose of the cooking agent is therefore to distinguish the intentions behind the expert's behaviors and to collect the 'right ingredients' of maximum quality for the dish to be cooked. For instance, the agent must not choose ingredient 0 or 3 when its desire is 0. As a sub-goal, the agent should choose, between two ingredients of the same kind, the one with the higher quality factor. Hence the rules of this game are not so simple that the agent can easily be trained to understand them.
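The following hypothetical skeleton of the cooking grid world is included only to make these rules concrete; every name, size, and value in it is illustrative rather than taken from the actual environment.

```python
# Hypothetical skeleton of the cooking grid world: six primitive actions,
# ingredient quality hidden until the agent stands on a node, and 'cook'
# terminating the episode so the kitchen resets with a new desire.
import random

ACTIONS = ["up", "down", "left", "right", "get", "cook"]
RECIPES = {0: {1, 2},   # desire 0: 'steak' needs ingredient categories 1 and 2
           1: {0, 3}}   # desire 1: 'pasta' needs ingredient categories 0 and 3

class CookingEnv:
    def __init__(self, size=8):
        self.size = size
        self.reset()

    def reset(self):
        self.pos = [0, 0]
        self.desire = random.choice(list(RECIPES))
        # each ingredient category appears twice, with a hidden quality factor
        self.items = [{"cat": c, "quality": random.randint(1, 10),
                       "xy": [random.randrange(self.size), random.randrange(self.size)]}
                      for c in range(4) for _ in range(2)]
        self.inventory = []
        return self._state()

    def _state(self):
        # quality is only revealed for the node the agent currently stands on
        outer = [{"cat": it["cat"], "xy": tuple(it["xy"]),
                  "quality": it["quality"] if it["xy"] == self.pos else None}
                 for it in self.items]
        return {"pos": tuple(self.pos), "desire": self.desire,
                "inventory": [it["cat"] for it in self.inventory], "outer": outer}

    def step(self, action):
        if action == "cook":
            return self._state(), 0.0, True               # terminal; caller resets
        if action == "get":
            for it in self.items:
                if it["xy"] == self.pos:
                    self.inventory.append(it)             # wrong picks are penalized elsewhere
        else:
            dx, dy = {"up": (0, -1), "down": (0, 1), "left": (-1, 0), "right": (1, 0)}[action]
            self.pos = [max(0, min(self.size - 1, self.pos[0] + dx)),
                        max(0, min(self.size - 1, self.pos[1] + dy))]
        return self._state(), 0.0, False
```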




Fig 4. The movement in the gif on the right shows the expert trajectories following certain reward criteria. I collected over 10K rows of expert trajectories. The state and action spaces are as follows (a code sketch of this layout follows the list):

State = [agent coordinate, outer state, internal state]
Outer state = [product-0 on the map, ..., product-M on the map]
Internal state = [inventory, current desire]

Product-i on the map = [product category, product index, quality factor, product coordinate on the map]

Product category ∈ {0, 1, 2, 3}
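The sketch below shows one way the listed state components could be laid out in code; the field names mirror the list, while the number of products and the flattening order are assumptions.

```python
# Sketch of the state layout listed above. Only the structure is taken from
# the list; sizes and flattening order are illustrative assumptions.
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Product:
    category: int            # product category, {0 .. 3}
    index: int               # product index within its category
    quality: int             # hidden quality factor (revealed on contact)
    coord: Tuple[int, int]   # coordinate on the map

@dataclass
class State:
    agent_coord: Tuple[int, int]   # agent coordinate
    outer: List[Product]           # product-0 .. product-M on the map
    inventory: List[int]           # internal state: collected product indices
    current_desire: int            # internal state: what to cook

    def flatten(self) -> List[float]:
        vec = [*self.agent_coord, self.current_desire, *self.inventory]
        for p in self.outer:
            vec += [p.category, p.index, p.quality, *p.coord]
        return [float(v) for v in vec]
```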


Ideally, the current desire should directly determine the value of intention 1, and intention 0, as a sub-intention, should be determined by the value of intention 1 and the current state.



2. Results

Training


                 
Fig 5. The loss curves from GAIL (right) and HGAIL (left). The upper line shows the discriminator error, while the lower line shows the generator error.

One of the most time-consuming tasks was determining hyperparameters such as the models' constants, the number of epochs, and the batch size; I found that a larger batch size tends to improve overall performance, and that smaller learning rates and a large number of epochs help as well. I set the batch size to 2K and the number of epochs to 15. As simulated trajectories go into the 'generator buffer', I shuffled that buffer and drew every batch from it, in order to remove correlation between trajectories and encourage the agent to learn the 'general rule' behind the expert's trajectories.
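The buffer handling can be sketched as follows, with the 2K batch size from above and otherwise assumed names.

```python
# Minimal sketch of the generator-buffer shuffling described above: pooled
# transitions are shuffled before 2K-sized mini-batches are drawn, so that
# consecutive (correlated) trajectory steps are split apart.
import random

def shuffled_batches(generator_buffer, batch_size=2000):
    """generator_buffer: list of (state, action, intention) transitions."""
    indices = list(range(len(generator_buffer)))
    random.shuffle(indices)                      # break within-trajectory correlation
    for start in range(0, len(indices), batch_size):
        yield [generator_buffer[i] for i in indices[start:start + batch_size]]
```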

To evaluate the model's performance, I ran two kinds of tests. The first is a reward test comparing GAIL and HGAIL; the second is intention generation, to see whether the model breaks the intentions into values that separate into distinct modes. The details are described below.

Reward Test

 

Fig 6. The generated trajectories from GAIL (left) and HGAIL (right). 


I tested both GAIL and HGAIL under the same conditions across many trials (10 trials, over 100 iterations per trial), collecting the rewards they obtained. I manually designed two kinds of reward functions to represent the criteria of this game. The first is 'reward-distance', whose value reflects how close the agent is to a 'proper ingredient'. The other is 'reward-got', which reflects whether the agent actually got the desired ingredients. On every trial there are two 'most desired' ingredients and two 'less desired but acceptable' ones. I penalized wrong ingredients that do not fit the cooking objective. The following graphs result from over 15 trials (in total 300 x 15 = 4500 iterations).
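A hedged sketch of these two rewards is given below; the exact scaling, bonus, and penalty values are not stated in the text, so the constants are assumptions.

```python
# Sketch of the two hand-designed evaluation rewards: 'reward-distance' grows
# as the agent approaches an ingredient that fits the current desire, and
# 'reward-got' is granted when such an ingredient is collected, with a penalty
# for wrong ingredients. All constants are illustrative assumptions.
def reward_distance(agent_xy, desired_item_coords):
    dists = [abs(agent_xy[0] - x) + abs(agent_xy[1] - y) for (x, y) in desired_item_coords]
    return -min(dists) if dists else 0.0          # closer to a proper ingredient = higher

def reward_got(picked_category, desire, recipes, bonus=5.0, penalty=-5.0):
    return bonus if picked_category in recipes[desire] else penalty
```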



Fig 7. The values of reward-distance and reward-got for GAIL (top) and HGAIL (bottom). The spiky lines in the upper part of each graph represent reward-distance, while the lines at the bottom show reward-got.

                                     


See Fig 7. In terms of reward-distance, GAIL and HGAIL do not show a large difference. In terms of reward-got, however, HGAIL beats GAIL in overall value across iterations. Note that the maximum reward-got per trial is 20. HGAIL reached that maximum many times, while naive GAIL could not. As seen in Fig 6, GAIL tends to take whatever ingredient it reaches, which is penalized. HGAIL, meanwhile, tends to hesitate before taking an ingredient. Although that tendency sometimes kept the HGAIL agent from a higher reward-got, it eventually improves the overall reward, beating the GAIL agent. The reward-got value can thus serve as a measure of how well an agent understands the hierarchical intentions behind the expert's movements.

Hierarchical Intention Generation


Now let us see whether the model can decompose the intentions, or the actions, under a master intention. There are two experiments. First, feed an i1 value of 0 or 1 into the model; the i0 value should then be affected by i1 and show certain tendencies depending on it.
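The sketch below illustrates that first experiment: fix i1, roll the agent out, and count which i0 codes appear, which is how histograms like those in Fig 8 can be produced; `env` and `sampler` are assumed interfaces, not the report's code.

```python
# Sketch of the fixed-i1 experiment. `env.reset()` returns a state and
# `sampler.sample_i0(state, i1)` is an assumed method returning a sub-intention.
from collections import Counter

def i0_histogram(env, sampler, fixed_i1, episodes=100):
    counts = Counter()
    for _ in range(episodes):
        state = env.reset()
        i0 = sampler.sample_i0(state, fixed_i1)   # sub-intention conditioned on (i1, state)
        counts[int(i0)] += 1
    return counts
```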







Fig 8. Testing the agent while feeding a fixed value of i1 (bottom). The histograms of the i0 values when i1 = 0 (middle) and i1 = 1 (top).



See the histograms in Fig 8. They show that the model tends to exert certain sub-intentions depending on the i1 value. Yet there still seems to be a long way to go. Since the model's subordinate intention depends not only on the master intention but also on the states, it is hard to control the HGAIL agent completely by feeding the i1 value alone. To control the agent completely, both the states and the master intention would have to be controlled deliberately.


It is time to test the i0 modes. I initially intended i0 to represent the 'target ingredient' in the current situation. For instance, when the agent has to get 'green', then ideally i0 would be 0, or some other specific i0 code.


 

Fig 9. The i0 mode test (right) and its reward-got histogram (left).


See Fig 9. I collected the reward-got value for every i0 value over 20K iterations. Every 5K iterations, I shifted the ingredients on the map to see which ingredient is the most 'preferred' one assigned to each i0 value. The figure shows the test movie and the histograms for i0 = 2, 0, 3, 1, in that order.
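The collection procedure can be sketched as follows; `run_episode` and `env.shift_ingredients` are assumed helpers for this illustration, not functions from the report.

```python
# Sketch of the i0-mode test: iterate with each fixed i0 value, logging
# reward-got, and re-shuffle the ingredients every 5K iterations to see which
# ingredient each i0 code "prefers". Helper interfaces are assumptions.
def i0_mode_test(env, run_episode, n_iters=20000, shift_every=5000, n_i0=4):
    log = {i0: [] for i0 in range(n_i0)}
    for it in range(n_iters):
        if it % shift_every == 0:
            env.shift_ingredients()               # assumed helper: move items on the map
        for i0 in range(n_i0):
            log[i0].append(run_episode(env, fixed_i0=i0)["reward_got"])
    return log
```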

Below are the separate histograms for i0 values from 0 to 3.


                                         











Fig 10. Reward-got histogram when i0 = 0 (left) and i0 = 1 (right).










Fig 11. Reward-got histogram when i0 = 2 (left) and i0 = 3 (right).




During testing, I found that the intention to perform a sequence of actions to get a specific object may consist of a combination of different intention codes rather than a single one. The agent also seems to learn intentions that I did not initially expect it to learn. For example, a certain intention value such as i0 = 2 tends to perform 'scouting', checking the quality factor of every ingredient on the map. In other words, the agent learned to break the task down in its own way, one I had not anticipated, but one that may still serve its desire. Even so, the separated i0 codes do tend to show their own preferences toward certain ingredients.


Conclusion


It is hard to predict and interpret what an agent learns from an expert. The agent might learn what we expect, or it might learn something that exceeds our expectations. In other words, the agent might interpret the expert's policies in the way we intend, or it might settle on a different interpretation. Nevertheless, I found that if we offer the GAIL model proper containers for the intentions, we improve the chance that the agent learns an aspect of the expert's hierarchical intentions instead of misconstruing the intentions behind a situation, as shown in the experimental results above.




References

[1] Karol Hausman, Yevgen Chebotar, Stefan Schaal, Gaurav Sukhatme, and Joseph Lim. Multi-modal imitation learning from unstructured demonstrations using generative adversarial nets. CoRR, abs/1705.10479, 2017.

[2] Jonathan Ho and Stefano Ermon. Generative adversarial imitation learning. CoRR, abs/1606.03476, 2016