What does 'acting greedily' mean?












2












$begingroup$


I wanted to clarify the term 'acting greedily'. What does it mean? Does it correspond to the immediate reward, future reward or both combined? I want to know the actions that will be taken in 2 cases:





  • $v_pi(s)$ is known and $R_s$ is also known (only).


  • $q_{pi}(s, a)$ is known and $R_s^a$ is also known (only).










share|improve this question











$endgroup$












  • $begingroup$
    Note: Since I am a beginner the question might look incoherent due to incomplete understanding. You can point it out and if i think thats the case i'll correct it.
    $endgroup$
    – DuttaA
    yesterday
















2












$begingroup$


I wanted to clarify the term 'acting greedily'. What does it mean? Does it correspond to the immediate reward, future reward or both combined? I want to know the actions that will be taken in 2 cases:





  • $v_pi(s)$ is known and $R_s$ is also known (only).


  • $q_{pi}(s, a)$ is known and $R_s^a$ is also known (only).










share|improve this question











$endgroup$












  • $begingroup$
    Note: Since I am a beginner the question might look incoherent due to incomplete understanding. You can point it out and if i think thats the case i'll correct it.
    $endgroup$
    – DuttaA
    yesterday














2












2








2





$begingroup$


I wanted to clarify the term 'acting greedily'. What does it mean? Does it correspond to the immediate reward, future reward or both combined? I want to know the actions that will be taken in 2 cases:





  • $v_pi(s)$ is known and $R_s$ is also known (only).


  • $q_{pi}(s, a)$ is known and $R_s^a$ is also known (only).










share|improve this question











$endgroup$




I wanted to clarify the term 'acting greedily'. What does it mean? Does it correspond to the immediate reward, future reward or both combined? I want to know the actions that will be taken in 2 cases:





  • $v_pi(s)$ is known and $R_s$ is also known (only).


  • $q_{pi}(s, a)$ is known and $R_s^a$ is also known (only).







reinforcement-learning greedy-ai






share|improve this question















share|improve this question













share|improve this question




share|improve this question








edited yesterday









nbro

1,484621




1,484621










asked yesterday









DuttaADuttaA

2,1851831




2,1851831












  • $begingroup$
    Note: Since I am a beginner the question might look incoherent due to incomplete understanding. You can point it out and if i think thats the case i'll correct it.
    $endgroup$
    – DuttaA
    yesterday


















  • $begingroup$
    Note: Since I am a beginner the question might look incoherent due to incomplete understanding. You can point it out and if i think thats the case i'll correct it.
    $endgroup$
    – DuttaA
    yesterday
















$begingroup$
Note: Since I am a beginner the question might look incoherent due to incomplete understanding. You can point it out and if i think thats the case i'll correct it.
$endgroup$
– DuttaA
yesterday




$begingroup$
Note: Since I am a beginner the question might look incoherent due to incomplete understanding. You can point it out and if i think thats the case i'll correct it.
$endgroup$
– DuttaA
yesterday










3 Answers
3






active

oldest

votes


















2












$begingroup$

In RL, the phrase "acting greedily" is usually short for "acting greedily with respect to the value function". Greedy local optimisation turns up in other contexts, and it is common to specify what metric is being maximised or minimised. The value function is most often the discounted sum of expected future reward, and also the metric used when defining a policy as "acting greedily".



It is possible to define a policy as acting greedily with respect to immediate expected reward, but not the norm. As a special case, when the discount factor $gamma = 0$ then it is the same as acting greedily with respect to the value function.



When rewards are sparse (a common situation in RL), acting greedily with respect to expected immediate reward is not very useful. There is not enough information to make a decision.




I want to know the actions that will be taken in 2 cases:





  • $v_pi(s)$ is known and $R_s$ is also known (only).




To act greedily in RL, you would use the value function $v_pi(s')$ - the value function of the next states. To do so, you need to know the environment dynamics - to go with the notation $R_s$ you should also know the transition matrix $P_{ss'}^a$ - the probability of transitioning from $s$ to $s'$ whilst taking action $a$:



$$pi'(s) = text{argmax}_a sum_{s'} P_{ss'}^a(R_{s} + gamma v_{pi}(s'))$$



Notes:




  • This assumes $R_{s}$ is your immediate reward for leaving state $s$. Substitute $R_{s'}$ if the reward matrix is for entering state $s'$ instead.


  • The policy gained from acting greedily with respect to $v_pi(s)$ is not $pi$, it is (a usually improved) policy $pi'$






  • $q_{pi}(s, a)$ is known and $R_s^a$ is also known (only).




To act greedily in RL, you would use the value function $q_{pi}(s, a)$, and this is much simpler:



$$pi'(s) = text{argmax}_a q_{pi}(s, a)$$



Again, the policy gained from acting greedily with respect to $q_pi(s,a)$ is not $pi$, it is (a usually improved) policy $pi'$






share|improve this answer









$endgroup$













  • $begingroup$
    Shouldn't $R_s$ be outside the brackets (not multiplied with probability) in case the reward is for leaving the state while inside the bracket for ($R_s'$)
    $endgroup$
    – DuttaA
    yesterday






  • 1




    $begingroup$
    @DuttA: Yes it can be outside the brackets (and the whole sum) for efficiency, but actually it does not matter to the value. I think I should leave it in for this case, so I don't have extra complication of two formulae.
    $endgroup$
    – Neil Slater
    yesterday





















2












$begingroup$

In general, a greedy "action" is an action that would lead to an immediate "benefit". For example, the Dijkstra's algorithm can be considered a greedy algorithm because at every step it selects the node with the smallest "estimate" to the initial (or starting) node. In reinforcement learning, a greedy action often refers to an action that would lead to the immediate highest reward (disregarding possible future rewards). However, a greedy action can also mean the action that would lead to the highest possible return (that is, the greedy action can also be considered an action that takes into account not just immediate rewards but also future ones).



In your case, I think that the "greedy action" can mean different things, depending on weather you use the reward function or the value functions, that is, you can act greedily with respect to the reward function or the value functions.



I would like to note that you are using a different notation for the reward function for each of the two value functions, but this does not need to be the case. So, your reward function might be expressed as $R_s^a$ even if you use $v_pi(s)$. I will use the notation $R_s^a$ for simplicity of the explanations.



So, if you have access to the reward function for a given state and action, $R^a_s = r(s, a)$, then the greedy action (with respect to the reward function $r$) would just be the action from state $s$ with the highest reward. So, formally, we can define it as $a_text{greedy} = arg max_a r(s, a)$ (both in the case of the state or state-action value functions: it does not matter if you have one or the other value function). In other words, if you have access to the reward function (in that form), you can act greedily from any state without needing to access the value functions: you have a "model" of the rewards that you will obtain.



If you have $q_pi(s, a)$ (that is, the state-action value function for a fixed policy $pi$), then, at time step $t$, the greedy action (with respect to $q_pi(s, a)$) from state $s$ is $a_text{greedy} = arg max_{a}q_pi(s, a)$. If you then take action $a_text{greedy}$ in the environment, you would obtain the highest discounted future reward (that is, the return), according the $q_pi(s, a)$, which might actually not be the highest possible return from $s$, because $q_pi(s, a)$ might not be the optimal state-action value function. If $q_pi(s, a) = q_{pi^*}(s, a)$ (that is, if you have the optimal state-action value function), then, if you execute $a_text{greedy}$ in the environment, you will theoretically obtain the highest possible return from $s$.



If you had the optimal value function (the value function associated with the optimal policy to act in your environment), then the following equation holds $v_*(s) = max_{a} q_{pi^*}(s, a)$. So, in that case, $a_text{greedy} = arg max_{a}q_{pi^*}(s, a)$ would also be the greedy action if you had $v_*(s)$. If you only have $v_pi(s)$ (without e.g. the Q function), I don't think you can act greedily (that is, there is no way of knowing which action is the greedy action from $s$ by just having the value of state $s$: this is actually why we often estimate the Q functions for "control", i.e. acting in the environment).






share|improve this answer











$endgroup$





















    0












    $begingroup$

    Acting greedily means that the search is not forward thinking and limits its decisions solely on immediate return. It is not quite the same as what is meant in human social contexts in that greed in that context can involve forward thinking strategies that sacrifice short term losses for long term gain. In the typical machine search lingo, greed is myopic (short-sighted).






    share|improve this answer









    $endgroup$













      Your Answer





      StackExchange.ifUsing("editor", function () {
      return StackExchange.using("mathjaxEditing", function () {
      StackExchange.MarkdownEditor.creationCallbacks.add(function (editor, postfix) {
      StackExchange.mathjaxEditing.prepareWmdForMathJax(editor, postfix, [["$", "$"], ["\\(","\\)"]]);
      });
      });
      }, "mathjax-editing");

      StackExchange.ready(function() {
      var channelOptions = {
      tags: "".split(" "),
      id: "658"
      };
      initTagRenderer("".split(" "), "".split(" "), channelOptions);

      StackExchange.using("externalEditor", function() {
      // Have to fire editor after snippets, if snippets enabled
      if (StackExchange.settings.snippets.snippetsEnabled) {
      StackExchange.using("snippets", function() {
      createEditor();
      });
      }
      else {
      createEditor();
      }
      });

      function createEditor() {
      StackExchange.prepareEditor({
      heartbeatType: 'answer',
      autoActivateHeartbeat: false,
      convertImagesToLinks: false,
      noModals: true,
      showLowRepImageUploadWarning: true,
      reputationToPostImages: null,
      bindNavPrevention: true,
      postfix: "",
      imageUploader: {
      brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
      contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
      allowUrls: true
      },
      noCode: true, onDemand: true,
      discardSelector: ".discard-answer"
      ,immediatelyShowMarkdownHelp:true
      });


      }
      });














      draft saved

      draft discarded


















      StackExchange.ready(
      function () {
      StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fai.stackexchange.com%2fquestions%2f11016%2fwhat-does-acting-greedily-mean%23new-answer', 'question_page');
      }
      );

      Post as a guest















      Required, but never shown

























      3 Answers
      3






      active

      oldest

      votes








      3 Answers
      3






      active

      oldest

      votes









      active

      oldest

      votes






      active

      oldest

      votes









      2












      $begingroup$

      In RL, the phrase "acting greedily" is usually short for "acting greedily with respect to the value function". Greedy local optimisation turns up in other contexts, and it is common to specify what metric is being maximised or minimised. The value function is most often the discounted sum of expected future reward, and also the metric used when defining a policy as "acting greedily".



      It is possible to define a policy as acting greedily with respect to immediate expected reward, but not the norm. As a special case, when the discount factor $gamma = 0$ then it is the same as acting greedily with respect to the value function.



      When rewards are sparse (a common situation in RL), acting greedily with respect to expected immediate reward is not very useful. There is not enough information to make a decision.




      I want to know the actions that will be taken in 2 cases:





      • $v_pi(s)$ is known and $R_s$ is also known (only).




      To act greedily in RL, you would use the value function $v_pi(s')$ - the value function of the next states. To do so, you need to know the environment dynamics - to go with the notation $R_s$ you should also know the transition matrix $P_{ss'}^a$ - the probability of transitioning from $s$ to $s'$ whilst taking action $a$:



      $$pi'(s) = text{argmax}_a sum_{s'} P_{ss'}^a(R_{s} + gamma v_{pi}(s'))$$



      Notes:




      • This assumes $R_{s}$ is your immediate reward for leaving state $s$. Substitute $R_{s'}$ if the reward matrix is for entering state $s'$ instead.


      • The policy gained from acting greedily with respect to $v_pi(s)$ is not $pi$, it is (a usually improved) policy $pi'$






      • $q_{pi}(s, a)$ is known and $R_s^a$ is also known (only).




      To act greedily in RL, you would use the value function $q_{pi}(s, a)$, and this is much simpler:



      $$pi'(s) = text{argmax}_a q_{pi}(s, a)$$



      Again, the policy gained from acting greedily with respect to $q_pi(s,a)$ is not $pi$, it is (a usually improved) policy $pi'$






      share|improve this answer









      $endgroup$













      • $begingroup$
        Shouldn't $R_s$ be outside the brackets (not multiplied with probability) in case the reward is for leaving the state while inside the bracket for ($R_s'$)
        $endgroup$
        – DuttaA
        yesterday






      • 1




        $begingroup$
        @DuttA: Yes it can be outside the brackets (and the whole sum) for efficiency, but actually it does not matter to the value. I think I should leave it in for this case, so I don't have extra complication of two formulae.
        $endgroup$
        – Neil Slater
        yesterday


















      2












      $begingroup$

      In RL, the phrase "acting greedily" is usually short for "acting greedily with respect to the value function". Greedy local optimisation turns up in other contexts, and it is common to specify what metric is being maximised or minimised. The value function is most often the discounted sum of expected future reward, and also the metric used when defining a policy as "acting greedily".



      It is possible to define a policy as acting greedily with respect to immediate expected reward, but not the norm. As a special case, when the discount factor $gamma = 0$ then it is the same as acting greedily with respect to the value function.



      When rewards are sparse (a common situation in RL), acting greedily with respect to expected immediate reward is not very useful. There is not enough information to make a decision.




      I want to know the actions that will be taken in 2 cases:





      • $v_pi(s)$ is known and $R_s$ is also known (only).




      To act greedily in RL, you would use the value function $v_pi(s')$ - the value function of the next states. To do so, you need to know the environment dynamics - to go with the notation $R_s$ you should also know the transition matrix $P_{ss'}^a$ - the probability of transitioning from $s$ to $s'$ whilst taking action $a$:



      $$pi'(s) = text{argmax}_a sum_{s'} P_{ss'}^a(R_{s} + gamma v_{pi}(s'))$$



      Notes:




      • This assumes $R_{s}$ is your immediate reward for leaving state $s$. Substitute $R_{s'}$ if the reward matrix is for entering state $s'$ instead.


      • The policy gained from acting greedily with respect to $v_pi(s)$ is not $pi$, it is (a usually improved) policy $pi'$






      • $q_{pi}(s, a)$ is known and $R_s^a$ is also known (only).




      To act greedily in RL, you would use the value function $q_{pi}(s, a)$, and this is much simpler:



      $$pi'(s) = text{argmax}_a q_{pi}(s, a)$$



      Again, the policy gained from acting greedily with respect to $q_pi(s,a)$ is not $pi$, it is (a usually improved) policy $pi'$






      share|improve this answer









      $endgroup$













      • $begingroup$
        Shouldn't $R_s$ be outside the brackets (not multiplied with probability) in case the reward is for leaving the state while inside the bracket for ($R_s'$)
        $endgroup$
        – DuttaA
        yesterday






      • 1




        $begingroup$
        @DuttA: Yes it can be outside the brackets (and the whole sum) for efficiency, but actually it does not matter to the value. I think I should leave it in for this case, so I don't have extra complication of two formulae.
        $endgroup$
        – Neil Slater
        yesterday
















      2












      2








      2





      $begingroup$

      In RL, the phrase "acting greedily" is usually short for "acting greedily with respect to the value function". Greedy local optimisation turns up in other contexts, and it is common to specify what metric is being maximised or minimised. The value function is most often the discounted sum of expected future reward, and also the metric used when defining a policy as "acting greedily".



      It is possible to define a policy as acting greedily with respect to immediate expected reward, but not the norm. As a special case, when the discount factor $gamma = 0$ then it is the same as acting greedily with respect to the value function.



      When rewards are sparse (a common situation in RL), acting greedily with respect to expected immediate reward is not very useful. There is not enough information to make a decision.




      I want to know the actions that will be taken in 2 cases:





      • $v_pi(s)$ is known and $R_s$ is also known (only).




      To act greedily in RL, you would use the value function $v_pi(s')$ - the value function of the next states. To do so, you need to know the environment dynamics - to go with the notation $R_s$ you should also know the transition matrix $P_{ss'}^a$ - the probability of transitioning from $s$ to $s'$ whilst taking action $a$:



      $$pi'(s) = text{argmax}_a sum_{s'} P_{ss'}^a(R_{s} + gamma v_{pi}(s'))$$



      Notes:




      • This assumes $R_{s}$ is your immediate reward for leaving state $s$. Substitute $R_{s'}$ if the reward matrix is for entering state $s'$ instead.


      • The policy gained from acting greedily with respect to $v_pi(s)$ is not $pi$, it is (a usually improved) policy $pi'$






      • $q_{pi}(s, a)$ is known and $R_s^a$ is also known (only).




      To act greedily in RL, you would use the value function $q_{pi}(s, a)$, and this is much simpler:



      $$pi'(s) = text{argmax}_a q_{pi}(s, a)$$



      Again, the policy gained from acting greedily with respect to $q_pi(s,a)$ is not $pi$, it is (a usually improved) policy $pi'$






      share|improve this answer









      $endgroup$



      In RL, the phrase "acting greedily" is usually short for "acting greedily with respect to the value function". Greedy local optimisation turns up in other contexts, and it is common to specify what metric is being maximised or minimised. The value function is most often the discounted sum of expected future reward, and also the metric used when defining a policy as "acting greedily".



      It is possible to define a policy as acting greedily with respect to immediate expected reward, but not the norm. As a special case, when the discount factor $gamma = 0$ then it is the same as acting greedily with respect to the value function.



      When rewards are sparse (a common situation in RL), acting greedily with respect to expected immediate reward is not very useful. There is not enough information to make a decision.




      I want to know the actions that will be taken in 2 cases:





      • $v_pi(s)$ is known and $R_s$ is also known (only).




      To act greedily in RL, you would use the value function $v_pi(s')$ - the value function of the next states. To do so, you need to know the environment dynamics - to go with the notation $R_s$ you should also know the transition matrix $P_{ss'}^a$ - the probability of transitioning from $s$ to $s'$ whilst taking action $a$:



      $$pi'(s) = text{argmax}_a sum_{s'} P_{ss'}^a(R_{s} + gamma v_{pi}(s'))$$



      Notes:




      • This assumes $R_{s}$ is your immediate reward for leaving state $s$. Substitute $R_{s'}$ if the reward matrix is for entering state $s'$ instead.


      • The policy gained from acting greedily with respect to $v_pi(s)$ is not $pi$, it is (a usually improved) policy $pi'$






      • $q_{pi}(s, a)$ is known and $R_s^a$ is also known (only).




      To act greedily in RL, you would use the value function $q_{pi}(s, a)$, and this is much simpler:



      $$pi'(s) = text{argmax}_a q_{pi}(s, a)$$



      Again, the policy gained from acting greedily with respect to $q_pi(s,a)$ is not $pi$, it is (a usually improved) policy $pi'$







      share|improve this answer












      share|improve this answer



      share|improve this answer










      answered yesterday









      Neil SlaterNeil Slater

      5,7831416




      5,7831416












      • $begingroup$
        Shouldn't $R_s$ be outside the brackets (not multiplied with probability) in case the reward is for leaving the state while inside the bracket for ($R_s'$)
        $endgroup$
        – DuttaA
        yesterday






      • 1




        $begingroup$
        @DuttA: Yes it can be outside the brackets (and the whole sum) for efficiency, but actually it does not matter to the value. I think I should leave it in for this case, so I don't have extra complication of two formulae.
        $endgroup$
        – Neil Slater
        yesterday




















      • $begingroup$
        Shouldn't $R_s$ be outside the brackets (not multiplied with probability) in case the reward is for leaving the state while inside the bracket for ($R_s'$)
        $endgroup$
        – DuttaA
        yesterday






      • 1




        $begingroup$
        @DuttA: Yes it can be outside the brackets (and the whole sum) for efficiency, but actually it does not matter to the value. I think I should leave it in for this case, so I don't have extra complication of two formulae.
        $endgroup$
        – Neil Slater
        yesterday


















      $begingroup$
      Shouldn't $R_s$ be outside the brackets (not multiplied with probability) in case the reward is for leaving the state while inside the bracket for ($R_s'$)
      $endgroup$
      – DuttaA
      yesterday




      $begingroup$
      Shouldn't $R_s$ be outside the brackets (not multiplied with probability) in case the reward is for leaving the state while inside the bracket for ($R_s'$)
      $endgroup$
      – DuttaA
      yesterday




      1




      1




      $begingroup$
      @DuttA: Yes it can be outside the brackets (and the whole sum) for efficiency, but actually it does not matter to the value. I think I should leave it in for this case, so I don't have extra complication of two formulae.
      $endgroup$
      – Neil Slater
      yesterday






      $begingroup$
      @DuttA: Yes it can be outside the brackets (and the whole sum) for efficiency, but actually it does not matter to the value. I think I should leave it in for this case, so I don't have extra complication of two formulae.
      $endgroup$
      – Neil Slater
      yesterday















      2












      $begingroup$

      In general, a greedy "action" is an action that would lead to an immediate "benefit". For example, the Dijkstra's algorithm can be considered a greedy algorithm because at every step it selects the node with the smallest "estimate" to the initial (or starting) node. In reinforcement learning, a greedy action often refers to an action that would lead to the immediate highest reward (disregarding possible future rewards). However, a greedy action can also mean the action that would lead to the highest possible return (that is, the greedy action can also be considered an action that takes into account not just immediate rewards but also future ones).



      In your case, I think that the "greedy action" can mean different things, depending on weather you use the reward function or the value functions, that is, you can act greedily with respect to the reward function or the value functions.



      I would like to note that you are using a different notation for the reward function for each of the two value functions, but this does not need to be the case. So, your reward function might be expressed as $R_s^a$ even if you use $v_pi(s)$. I will use the notation $R_s^a$ for simplicity of the explanations.



      So, if you have access to the reward function for a given state and action, $R^a_s = r(s, a)$, then the greedy action (with respect to the reward function $r$) would just be the action from state $s$ with the highest reward. So, formally, we can define it as $a_text{greedy} = arg max_a r(s, a)$ (both in the case of the state or state-action value functions: it does not matter if you have one or the other value function). In other words, if you have access to the reward function (in that form), you can act greedily from any state without needing to access the value functions: you have a "model" of the rewards that you will obtain.



      If you have $q_pi(s, a)$ (that is, the state-action value function for a fixed policy $pi$), then, at time step $t$, the greedy action (with respect to $q_pi(s, a)$) from state $s$ is $a_text{greedy} = arg max_{a}q_pi(s, a)$. If you then take action $a_text{greedy}$ in the environment, you would obtain the highest discounted future reward (that is, the return), according the $q_pi(s, a)$, which might actually not be the highest possible return from $s$, because $q_pi(s, a)$ might not be the optimal state-action value function. If $q_pi(s, a) = q_{pi^*}(s, a)$ (that is, if you have the optimal state-action value function), then, if you execute $a_text{greedy}$ in the environment, you will theoretically obtain the highest possible return from $s$.



      If you had the optimal value function (the value function associated with the optimal policy to act in your environment), then the following equation holds $v_*(s) = max_{a} q_{pi^*}(s, a)$. So, in that case, $a_text{greedy} = arg max_{a}q_{pi^*}(s, a)$ would also be the greedy action if you had $v_*(s)$. If you only have $v_pi(s)$ (without e.g. the Q function), I don't think you can act greedily (that is, there is no way of knowing which action is the greedy action from $s$ by just having the value of state $s$: this is actually why we often estimate the Q functions for "control", i.e. acting in the environment).






      share|improve this answer











      $endgroup$


















        2












        $begingroup$

        In general, a greedy "action" is an action that would lead to an immediate "benefit". For example, the Dijkstra's algorithm can be considered a greedy algorithm because at every step it selects the node with the smallest "estimate" to the initial (or starting) node. In reinforcement learning, a greedy action often refers to an action that would lead to the immediate highest reward (disregarding possible future rewards). However, a greedy action can also mean the action that would lead to the highest possible return (that is, the greedy action can also be considered an action that takes into account not just immediate rewards but also future ones).



        In your case, I think that the "greedy action" can mean different things, depending on weather you use the reward function or the value functions, that is, you can act greedily with respect to the reward function or the value functions.



        I would like to note that you are using a different notation for the reward function for each of the two value functions, but this does not need to be the case. So, your reward function might be expressed as $R_s^a$ even if you use $v_pi(s)$. I will use the notation $R_s^a$ for simplicity of the explanations.



        So, if you have access to the reward function for a given state and action, $R^a_s = r(s, a)$, then the greedy action (with respect to the reward function $r$) would just be the action from state $s$ with the highest reward. So, formally, we can define it as $a_text{greedy} = arg max_a r(s, a)$ (both in the case of the state or state-action value functions: it does not matter if you have one or the other value function). In other words, if you have access to the reward function (in that form), you can act greedily from any state without needing to access the value functions: you have a "model" of the rewards that you will obtain.



        If you have $q_pi(s, a)$ (that is, the state-action value function for a fixed policy $pi$), then, at time step $t$, the greedy action (with respect to $q_pi(s, a)$) from state $s$ is $a_text{greedy} = arg max_{a}q_pi(s, a)$. If you then take action $a_text{greedy}$ in the environment, you would obtain the highest discounted future reward (that is, the return), according the $q_pi(s, a)$, which might actually not be the highest possible return from $s$, because $q_pi(s, a)$ might not be the optimal state-action value function. If $q_pi(s, a) = q_{pi^*}(s, a)$ (that is, if you have the optimal state-action value function), then, if you execute $a_text{greedy}$ in the environment, you will theoretically obtain the highest possible return from $s$.



        If you had the optimal value function (the value function associated with the optimal policy to act in your environment), then the following equation holds $v_*(s) = max_{a} q_{pi^*}(s, a)$. So, in that case, $a_text{greedy} = arg max_{a}q_{pi^*}(s, a)$ would also be the greedy action if you had $v_*(s)$. If you only have $v_pi(s)$ (without e.g. the Q function), I don't think you can act greedily (that is, there is no way of knowing which action is the greedy action from $s$ by just having the value of state $s$: this is actually why we often estimate the Q functions for "control", i.e. acting in the environment).






        share|improve this answer











        $endgroup$
















          2












          2








          2





          $begingroup$

          In general, a greedy "action" is an action that would lead to an immediate "benefit". For example, the Dijkstra's algorithm can be considered a greedy algorithm because at every step it selects the node with the smallest "estimate" to the initial (or starting) node. In reinforcement learning, a greedy action often refers to an action that would lead to the immediate highest reward (disregarding possible future rewards). However, a greedy action can also mean the action that would lead to the highest possible return (that is, the greedy action can also be considered an action that takes into account not just immediate rewards but also future ones).



          In your case, I think that the "greedy action" can mean different things, depending on weather you use the reward function or the value functions, that is, you can act greedily with respect to the reward function or the value functions.



          I would like to note that you are using a different notation for the reward function for each of the two value functions, but this does not need to be the case. So, your reward function might be expressed as $R_s^a$ even if you use $v_pi(s)$. I will use the notation $R_s^a$ for simplicity of the explanations.



          So, if you have access to the reward function for a given state and action, $R^a_s = r(s, a)$, then the greedy action (with respect to the reward function $r$) would just be the action from state $s$ with the highest reward. So, formally, we can define it as $a_text{greedy} = arg max_a r(s, a)$ (both in the case of the state or state-action value functions: it does not matter if you have one or the other value function). In other words, if you have access to the reward function (in that form), you can act greedily from any state without needing to access the value functions: you have a "model" of the rewards that you will obtain.



          If you have $q_pi(s, a)$ (that is, the state-action value function for a fixed policy $pi$), then, at time step $t$, the greedy action (with respect to $q_pi(s, a)$) from state $s$ is $a_text{greedy} = arg max_{a}q_pi(s, a)$. If you then take action $a_text{greedy}$ in the environment, you would obtain the highest discounted future reward (that is, the return), according the $q_pi(s, a)$, which might actually not be the highest possible return from $s$, because $q_pi(s, a)$ might not be the optimal state-action value function. If $q_pi(s, a) = q_{pi^*}(s, a)$ (that is, if you have the optimal state-action value function), then, if you execute $a_text{greedy}$ in the environment, you will theoretically obtain the highest possible return from $s$.



          If you had the optimal value function (the value function associated with the optimal policy to act in your environment), then the following equation holds $v_*(s) = max_{a} q_{pi^*}(s, a)$. So, in that case, $a_text{greedy} = arg max_{a}q_{pi^*}(s, a)$ would also be the greedy action if you had $v_*(s)$. If you only have $v_pi(s)$ (without e.g. the Q function), I don't think you can act greedily (that is, there is no way of knowing which action is the greedy action from $s$ by just having the value of state $s$: this is actually why we often estimate the Q functions for "control", i.e. acting in the environment).






          share|improve this answer











          $endgroup$



          In general, a greedy "action" is an action that would lead to an immediate "benefit". For example, the Dijkstra's algorithm can be considered a greedy algorithm because at every step it selects the node with the smallest "estimate" to the initial (or starting) node. In reinforcement learning, a greedy action often refers to an action that would lead to the immediate highest reward (disregarding possible future rewards). However, a greedy action can also mean the action that would lead to the highest possible return (that is, the greedy action can also be considered an action that takes into account not just immediate rewards but also future ones).



          In your case, I think that the "greedy action" can mean different things, depending on weather you use the reward function or the value functions, that is, you can act greedily with respect to the reward function or the value functions.



          I would like to note that you are using a different notation for the reward function for each of the two value functions, but this does not need to be the case. So, your reward function might be expressed as $R_s^a$ even if you use $v_pi(s)$. I will use the notation $R_s^a$ for simplicity of the explanations.



          So, if you have access to the reward function for a given state and action, $R^a_s = r(s, a)$, then the greedy action (with respect to the reward function $r$) would just be the action from state $s$ with the highest reward. So, formally, we can define it as $a_text{greedy} = arg max_a r(s, a)$ (both in the case of the state or state-action value functions: it does not matter if you have one or the other value function). In other words, if you have access to the reward function (in that form), you can act greedily from any state without needing to access the value functions: you have a "model" of the rewards that you will obtain.



          If you have $q_pi(s, a)$ (that is, the state-action value function for a fixed policy $pi$), then, at time step $t$, the greedy action (with respect to $q_pi(s, a)$) from state $s$ is $a_text{greedy} = arg max_{a}q_pi(s, a)$. If you then take action $a_text{greedy}$ in the environment, you would obtain the highest discounted future reward (that is, the return), according the $q_pi(s, a)$, which might actually not be the highest possible return from $s$, because $q_pi(s, a)$ might not be the optimal state-action value function. If $q_pi(s, a) = q_{pi^*}(s, a)$ (that is, if you have the optimal state-action value function), then, if you execute $a_text{greedy}$ in the environment, you will theoretically obtain the highest possible return from $s$.



          If you had the optimal value function (the value function associated with the optimal policy to act in your environment), then the following equation holds $v_*(s) = max_{a} q_{pi^*}(s, a)$. So, in that case, $a_text{greedy} = arg max_{a}q_{pi^*}(s, a)$ would also be the greedy action if you had $v_*(s)$. If you only have $v_pi(s)$ (without e.g. the Q function), I don't think you can act greedily (that is, there is no way of knowing which action is the greedy action from $s$ by just having the value of state $s$: this is actually why we often estimate the Q functions for "control", i.e. acting in the environment).







          share|improve this answer














          share|improve this answer



          share|improve this answer








          edited yesterday

























          answered yesterday









          nbronbro

          1,484621




          1,484621























              0












              $begingroup$

              Acting greedily means that the search is not forward thinking and limits its decisions solely on immediate return. It is not quite the same as what is meant in human social contexts in that greed in that context can involve forward thinking strategies that sacrifice short term losses for long term gain. In the typical machine search lingo, greed is myopic (short-sighted).






              share|improve this answer









              $endgroup$


















                0












                $begingroup$

                Acting greedily means that the search is not forward thinking and limits its decisions solely on immediate return. It is not quite the same as what is meant in human social contexts in that greed in that context can involve forward thinking strategies that sacrifice short term losses for long term gain. In the typical machine search lingo, greed is myopic (short-sighted).






                share|improve this answer









                $endgroup$
















                  0












                  0








                  0





                  $begingroup$

                  Acting greedily means that the search is not forward thinking and limits its decisions solely on immediate return. It is not quite the same as what is meant in human social contexts in that greed in that context can involve forward thinking strategies that sacrifice short term losses for long term gain. In the typical machine search lingo, greed is myopic (short-sighted).






                  share|improve this answer









                  $endgroup$



                  Acting greedily means that the search is not forward thinking and limits its decisions solely on immediate return. It is not quite the same as what is meant in human social contexts in that greed in that context can involve forward thinking strategies that sacrifice short term losses for long term gain. In the typical machine search lingo, greed is myopic (short-sighted).







                  share|improve this answer












                  share|improve this answer



                  share|improve this answer










                  answered yesterday









                  Douglas DaseecoDouglas Daseeco

                  4,88811040




                  4,88811040






























                      draft saved

                      draft discarded




















































                      Thanks for contributing an answer to Artificial Intelligence Stack Exchange!


                      • Please be sure to answer the question. Provide details and share your research!

                      But avoid



                      • Asking for help, clarification, or responding to other answers.

                      • Making statements based on opinion; back them up with references or personal experience.


                      Use MathJax to format equations. MathJax reference.


                      To learn more, see our tips on writing great answers.




                      draft saved


                      draft discarded














                      StackExchange.ready(
                      function () {
                      StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fai.stackexchange.com%2fquestions%2f11016%2fwhat-does-acting-greedily-mean%23new-answer', 'question_page');
                      }
                      );

                      Post as a guest















                      Required, but never shown





















































                      Required, but never shown














                      Required, but never shown












                      Required, but never shown







                      Required, but never shown

































                      Required, but never shown














                      Required, but never shown












                      Required, but never shown







                      Required, but never shown







                      Popular posts from this blog

                      How to label and detect the document text images

                      Tabula Rosettana

                      Aureus (color)