What does 'acting greedily' mean?
I wanted to clarify the term "acting greedily". What does it mean? Does it correspond to the immediate reward, the future reward, or both combined? I want to know the actions that will be taken in two cases:
$v_\pi(s)$ is known and $R_s$ is also known (only).
$q_\pi(s, a)$ is known and $R_s^a$ is also known (only).
reinforcement-learning greedy-ai
Note: Since I am a beginner, the question might look incoherent due to my incomplete understanding. You can point that out, and if I think that's the case, I'll correct it.
– DuttaA
3 Answers
In RL, the phrase "acting greedily" is usually short for "acting greedily with respect to the value function". Greedy local optimisation turns up in other contexts, and it is common to specify which metric is being maximised or minimised. The value function is most often the discounted sum of expected future reward, and it is also the metric used when defining a policy as "acting greedily".
It is possible to define a policy as acting greedily with respect to the immediate expected reward, but that is not the norm. As a special case, when the discount factor $\gamma = 0$, the two coincide: acting greedily with respect to the immediate expected reward is then the same as acting greedily with respect to the value function.
When rewards are sparse (a common situation in RL), acting greedily with respect to the expected immediate reward is not very useful: there is not enough information to make a decision.
I want to know the actions that will be taken in two cases:
$v_\pi(s)$ is known and $R_s$ is also known (only).
To act greedily in RL, you would use the value function $v_\pi(s')$ of the next states. To do so, you need to know the environment dynamics: to go with the notation $R_s$, you should also know the transition matrix $P_{ss'}^a$, the probability of transitioning from $s$ to $s'$ whilst taking action $a$:
$$\pi'(s) = \text{argmax}_a \sum_{s'} P_{ss'}^a \left(R_{s} + \gamma v_{\pi}(s')\right)$$
Notes:
This assumes $R_{s}$ is your immediate reward for leaving state $s$. Substitute $R_{s'}$ if the reward matrix is for entering state $s'$ instead.
The policy gained from acting greedily with respect to $v_\pi(s)$ is not $\pi$; it is a (usually improved) policy $\pi'$.
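Not part of the original answer, but here is a minimal NumPy sketch of that one-step lookahead for a tabular problem. The names, shapes and numbers (`P`, `R`, `v`, `gamma`) are my own assumptions for illustration:

```python
import numpy as np

def greedy_from_v(P, R, v, gamma=0.9):
    """Greedy (improved) policy pi' from v_pi via one-step lookahead.

    P : (S, A, S) array, P[s, a, s2] = probability of moving from s to s2 under a
    R : (S,) array, immediate reward for leaving state s
    v : (S,) array, v_pi(s) for each state
    Returns an (S,) array with the greedy action for every state.
    """
    # Each row P[s, a, :] sums to 1, so
    # sum_{s'} P[s,a,s'] * (R[s] + gamma*v[s']) = R[s] + gamma * sum_{s'} P[s,a,s'] * v[s'].
    lookahead = R[:, None] + gamma * (P @ v)   # shape (S, A)
    return lookahead.argmax(axis=1)            # pi'(s) = argmax_a of the lookahead

# Tiny 2-state, 2-action example with arbitrary numbers:
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([0.0, 1.0])
v = np.array([2.0, 5.0])
print(greedy_from_v(P, R, v))   # -> [1 1]
```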
$q_\pi(s, a)$ is known and $R_s^a$ is also known (only).
To act greedily in RL, you would use the value function $q_{\pi}(s, a)$, and this is much simpler:
$$\pi'(s) = \text{argmax}_a \, q_{\pi}(s, a)$$
Again, the policy gained from acting greedily with respect to $q_\pi(s,a)$ is not $\pi$; it is a (usually improved) policy $\pi'$.
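Again as a sketch of my own (not from the answer), the tabular version of this, assuming `q` is an (S, A) array:

```python
import numpy as np

def greedy_from_q(q):
    """Greedy policy from a tabular q_pi: pi'(s) = argmax_a q_pi(s, a).

    q : (S, A) array of action values under the current policy pi.
    """
    return q.argmax(axis=1)   # one greedy action per state

q = np.array([[0.1, 0.7],     # arbitrary values for 2 states, 2 actions
              [0.4, 0.2]])
print(greedy_from_q(q))       # -> [1 0]
```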
Shouldn't $R_s$ be outside the brackets (not multiplied by the probability) when the reward is for leaving the state, while $R_{s'}$ stays inside the brackets?
– DuttaA
@DuttaA: Yes, it can be taken outside the brackets (and the whole sum) for efficiency, but it does not actually change the value. I think I will leave it inside in this case, so I don't have the extra complication of two formulae.
– Neil Slater
In general, a greedy "action" is an action that would lead to an immediate "benefit". For example, Dijkstra's algorithm can be considered a greedy algorithm because at every step it selects the node with the smallest distance "estimate" to the initial (or starting) node. In reinforcement learning, a greedy action often refers to an action that would lead to the highest immediate reward (disregarding possible future rewards). However, a greedy action can also mean the action that would lead to the highest possible return (that is, the greedy action can also be considered an action that takes into account not just immediate rewards but also future ones).
In your case, I think that the "greedy action" can mean different things, depending on whether you use the reward function or the value functions; that is, you can act greedily with respect to the reward function or with respect to the value functions.
I would like to note that you are using a different notation for the reward function for each of the two value functions, but this does not need to be the case. Your reward function might be expressed as $R_s^a$ even if you use $v_\pi(s)$. I will use the notation $R_s^a$ for simplicity of the explanations.
So, if you have access to the reward function for a given state and action, $R^a_s = r(s, a)$, then the greedy action (with respect to the reward function $r$) would just be the action from state $s$ with the highest reward. Formally, we can define it as $a_\text{greedy} = \arg\max_a r(s, a)$ (both in the case of the state and the state-action value functions: it does not matter which of the two value functions you have). In other words, if you have access to the reward function (in that form), you can act greedily from any state without needing to access the value functions: you have a "model" of the rewards that you will obtain.
If you have $q_\pi(s, a)$ (that is, the state-action value function for a fixed policy $\pi$), then, at time step $t$, the greedy action (with respect to $q_\pi(s, a)$) from state $s$ is $a_\text{greedy} = \arg\max_{a} q_\pi(s, a)$. If you then take action $a_\text{greedy}$ in the environment, you would obtain the highest discounted future reward (that is, the return) according to $q_\pi(s, a)$, which might actually not be the highest possible return from $s$, because $q_\pi(s, a)$ might not be the optimal state-action value function. If $q_\pi(s, a) = q_{\pi^*}(s, a)$ (that is, if you have the optimal state-action value function), then, if you execute $a_\text{greedy}$ in the environment, you will theoretically obtain the highest possible return from $s$.
If you had the optimal value function (the value function associated with the optimal policy for acting in your environment), then the following equation holds: $v_*(s) = \max_{a} q_{\pi^*}(s, a)$. So, in that case, $a_\text{greedy} = \arg\max_{a} q_{\pi^*}(s, a)$ would also be the greedy action if you had $v_*(s)$. If you only have $v_\pi(s)$ (without e.g. the Q function), I don't think you can act greedily (that is, there is no way of knowing which action is the greedy action from $s$ by just having the value of state $s$: this is actually why we often estimate the Q functions for "control", i.e. acting in the environment).
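A toy numerical illustration of that distinction (not from the answer; the numbers are invented purely for illustration): greedy with respect to $r(s, a)$ and greedy with respect to $q_\pi(s, a)$ can pick different actions.

```python
import numpy as np

# Made-up 2-state, 2-action example: r(s, a) is the immediate reward and
# q_pi(s, a) the action value under some fixed policy pi. In state 0,
# action 1 has a lower immediate reward but a higher return.
r = np.array([[1.0, 0.0],
              [0.5, 0.2]])
q = np.array([[1.0, 4.0],
              [0.5, 0.3]])

s = 0
print(r[s].argmax())   # greedy w.r.t. the reward function  -> 0
print(q[s].argmax())   # greedy w.r.t. q_pi (the return)    -> 1
```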
Acting greedily means that the search is not forward-thinking and bases its decisions solely on the immediate return. It is not quite the same as what is meant in human social contexts, where greed can involve forward-thinking strategies that accept short-term losses for long-term gain. In typical machine-search lingo, greed is myopic (short-sighted).