What does 'acting greedily' mean?
I wanted to clarify the term "acting greedily". What does it mean? Does it correspond to the immediate reward, the future reward, or both combined? I want to know the actions that will be taken in two cases:
$v_\pi(s)$ is known and $R_s$ is also known (only).
$q_\pi(s, a)$ is known and $R_s^a$ is also known (only).
reinforcement-learning greedy-ai
Note: Since I am a beginner, the question might look incoherent due to my incomplete understanding. You can point that out, and if I think that's the case, I'll correct it.
– DuttaA
3 Answers
In RL, the phrase "acting greedily" is usually short for "acting greedily with respect to the value function". Greedy local optimisation turns up in other contexts, and it is common to specify which metric is being maximised or minimised. The value function is most often the discounted sum of expected future reward, and it is also the metric used when defining a policy as "acting greedily".
It is possible to define a policy as acting greedily with respect to the immediate expected reward, but that is not the norm. As a special case, when the discount factor $\gamma = 0$, the two coincide: acting greedily with respect to the immediate expected reward is then the same as acting greedily with respect to the value function.
When rewards are sparse (a common situation in RL), acting greedily with respect to the expected immediate reward is not very useful: there is not enough information to make a decision.
I want to know the actions that will be taken in two cases:
$v_\pi(s)$ is known and $R_s$ is also known (only).
To act greedily in RL, you would use the value function $v_\pi(s')$ of the next states. To do so, you need to know the environment dynamics: to go with the notation $R_s$, you should also know the transition matrix $P_{ss'}^a$, the probability of transitioning from $s$ to $s'$ whilst taking action $a$:
$$\pi'(s) = \text{argmax}_a \sum_{s'} P_{ss'}^a \left(R_{s} + \gamma v_{\pi}(s')\right)$$
Notes:
This assumes $R_{s}$ is your immediate reward for leaving state $s$. Substitute $R_{s'}$ if the reward matrix is for entering state $s'$ instead.
The policy gained from acting greedily with respect to $v_\pi(s)$ is not $\pi$; it is a (usually improved) policy $\pi'$.
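Not part of the original answer, but here is a minimal NumPy sketch of that one-step lookahead for a tabular problem. The names, shapes and numbers (`P`, `R`, `v`, `gamma`) are my own assumptions for illustration:

```python
import numpy as np

def greedy_from_v(P, R, v, gamma=0.9):
    """Greedy (improved) policy pi' from v_pi via one-step lookahead.

    P : (S, A, S) array, P[s, a, s2] = probability of moving from s to s2 under a
    R : (S,) array, immediate reward for leaving state s
    v : (S,) array, v_pi(s) for each state
    Returns an (S,) array with the greedy action for every state.
    """
    # Each row P[s, a, :] sums to 1, so
    # sum_{s'} P[s,a,s'] * (R[s] + gamma*v[s']) = R[s] + gamma * sum_{s'} P[s,a,s'] * v[s'].
    lookahead = R[:, None] + gamma * (P @ v)   # shape (S, A)
    return lookahead.argmax(axis=1)            # pi'(s) = argmax_a of the lookahead

# Tiny 2-state, 2-action example with arbitrary numbers:
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([0.0, 1.0])
v = np.array([2.0, 5.0])
print(greedy_from_v(P, R, v))   # -> [1 1]
```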
$q_\pi(s, a)$ is known and $R_s^a$ is also known (only).
To act greedily in RL, you would use the value function $q_{\pi}(s, a)$, and this is much simpler:
$$\pi'(s) = \text{argmax}_a \, q_{\pi}(s, a)$$
Again, the policy gained from acting greedily with respect to $q_\pi(s,a)$ is not $\pi$; it is a (usually improved) policy $\pi'$.
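Again as a sketch of my own (not from the answer), the tabular version of this, assuming `q` is an (S, A) array:

```python
import numpy as np

def greedy_from_q(q):
    """Greedy policy from a tabular q_pi: pi'(s) = argmax_a q_pi(s, a).

    q : (S, A) array of action values under the current policy pi.
    """
    return q.argmax(axis=1)   # one greedy action per state

q = np.array([[0.1, 0.7],     # arbitrary values for 2 states, 2 actions
              [0.4, 0.2]])
print(greedy_from_q(q))       # -> [1 0]
```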
Shouldn't $R_s$ be outside the brackets (not multiplied by the probability) when the reward is for leaving the state, while $R_{s'}$ stays inside the brackets?
– DuttaA
@DuttaA: Yes, it can be taken outside the brackets (and the whole sum) for efficiency, but it does not actually change the value. I think I will leave it inside in this case, so I don't have the extra complication of two formulae.
– Neil Slater
In general, a greedy "action" is an action that would lead to an immediate "benefit". For example, Dijkstra's algorithm can be considered a greedy algorithm because at every step it selects the node with the smallest distance "estimate" to the initial (or starting) node. In reinforcement learning, a greedy action often refers to an action that would lead to the highest immediate reward (disregarding possible future rewards). However, a greedy action can also mean the action that would lead to the highest possible return (that is, the greedy action can also be considered an action that takes into account not just immediate rewards but also future ones).
In your case, I think that the "greedy action" can mean different things, depending on whether you use the reward function or the value functions; that is, you can act greedily with respect to the reward function or with respect to the value functions.
I would like to note that you are using a different notation for the reward function for each of the two value functions, but this does not need to be the case. Your reward function might be expressed as $R_s^a$ even if you use $v_\pi(s)$. I will use the notation $R_s^a$ for simplicity of the explanations.
So, if you have access to the reward function for a given state and action, $R^a_s = r(s, a)$, then the greedy action (with respect to the reward function $r$) would just be the action from state $s$ with the highest reward. Formally, we can define it as $a_\text{greedy} = \arg\max_a r(s, a)$ (both in the case of the state and the state-action value functions: it does not matter which of the two value functions you have). In other words, if you have access to the reward function (in that form), you can act greedily from any state without needing to access the value functions: you have a "model" of the rewards that you will obtain.
If you have $q_\pi(s, a)$ (that is, the state-action value function for a fixed policy $\pi$), then, at time step $t$, the greedy action (with respect to $q_\pi(s, a)$) from state $s$ is $a_\text{greedy} = \arg\max_{a} q_\pi(s, a)$. If you then take action $a_\text{greedy}$ in the environment, you would obtain the highest discounted future reward (that is, the return) according to $q_\pi(s, a)$, which might actually not be the highest possible return from $s$, because $q_\pi(s, a)$ might not be the optimal state-action value function. If $q_\pi(s, a) = q_{\pi^*}(s, a)$ (that is, if you have the optimal state-action value function), then, if you execute $a_\text{greedy}$ in the environment, you will theoretically obtain the highest possible return from $s$.
If you had the optimal value function (the value function associated with the optimal policy for acting in your environment), then the following equation holds: $v_*(s) = \max_{a} q_{\pi^*}(s, a)$. So, in that case, $a_\text{greedy} = \arg\max_{a} q_{\pi^*}(s, a)$ would also be the greedy action if you had $v_*(s)$. If you only have $v_\pi(s)$ (without e.g. the Q function), I don't think you can act greedily (that is, there is no way of knowing which action is the greedy action from $s$ by just having the value of state $s$: this is actually why we often estimate the Q functions for "control", i.e. acting in the environment).
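A toy numerical illustration of that distinction (not from the answer; the numbers are invented purely for illustration): greedy with respect to $r(s, a)$ and greedy with respect to $q_\pi(s, a)$ can pick different actions.

```python
import numpy as np

# Made-up 2-state, 2-action example: r(s, a) is the immediate reward and
# q_pi(s, a) the action value under some fixed policy pi. In state 0,
# action 1 has a lower immediate reward but a higher return.
r = np.array([[1.0, 0.0],
              [0.5, 0.2]])
q = np.array([[1.0, 4.0],
              [0.5, 0.3]])

s = 0
print(r[s].argmax())   # greedy w.r.t. the reward function  -> 0
print(q[s].argmax())   # greedy w.r.t. q_pi (the return)    -> 1
```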
Acting greedily means that the search is not forward-thinking and bases its decisions solely on the immediate return. It is not quite the same as what is meant in human social contexts, where greed can involve forward-thinking strategies that accept short-term losses for long-term gain. In typical machine-search lingo, greed is myopic (short-sighted).