One-step Dynamics MDP












4












$begingroup$


I am reading the 2018 book by Sutton & Barto on Reinforcement Learning and I am wondering the benefit of defining the one-step dynamics of an MDP as
$$
p(s',r|s,a) = Pr(S_{t+1},R_{t+1}|S_t=s, A_t=a)
$$

where $S_t$ is the state and $A_t$ the action at time $t$. $R_t$ is the reward.



This formulation would be useful if we were to allow different rewards when transitioning from $s$ to $s'$ by taking an action $a$, but that not not make sense. I am used to the definition based on $p(s'|s,a)$ and $r(s,a,s')$, which of course can be derived from the one-step dynamics above.



Clearly, I am missing something. Any enlightenment would be really helpful. Thx!










share|improve this question







New contributor




RLSelfStudy is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.







$endgroup$












  • $begingroup$
    Could you explain why, to you, that "allow different rewards when transitioning from 𝑠 to 𝑠′ by taking an action 𝑎" does not make sense? It makes sense to me, but I cannot explain it to you, unless you give more details about what is wrong with the idea to you
    $endgroup$
    – Neil Slater
    yesterday










  • $begingroup$
    My understanding is that given a starting state and a target state, reachable by applying action $a$, there is only a single reward. If we have multiple rewards, then we are allowing the Markov Chain model (thought as a graph) being a multi-graph where we can go from $s$ to $s'$ (with $a$) over an edge with reward $r$ and another with reward $r'$. I thought this is not the right model ... but again ... I might be wrong ...
    $endgroup$
    – RLSelfStudy
    yesterday
















4












$begingroup$


I am reading the 2018 book by Sutton & Barto on Reinforcement Learning and I am wondering the benefit of defining the one-step dynamics of an MDP as
$$
p(s',r|s,a) = Pr(S_{t+1},R_{t+1}|S_t=s, A_t=a)
$$

where $S_t$ is the state and $A_t$ the action at time $t$. $R_t$ is the reward.



This formulation would be useful if we were to allow different rewards when transitioning from $s$ to $s'$ by taking an action $a$, but that not not make sense. I am used to the definition based on $p(s'|s,a)$ and $r(s,a,s')$, which of course can be derived from the one-step dynamics above.



Clearly, I am missing something. Any enlightenment would be really helpful. Thx!










share|improve this question







New contributor




RLSelfStudy is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.







$endgroup$












  • $begingroup$
    Could you explain why, to you, that "allow different rewards when transitioning from 𝑠 to 𝑠′ by taking an action 𝑎" does not make sense? It makes sense to me, but I cannot explain it to you, unless you give more details about what is wrong with the idea to you
    $endgroup$
    – Neil Slater
    yesterday










  • $begingroup$
    My understanding is that given a starting state and a target state, reachable by applying action $a$, there is only a single reward. If we have multiple rewards, then we are allowing the Markov Chain model (thought as a graph) being a multi-graph where we can go from $s$ to $s'$ (with $a$) over an edge with reward $r$ and another with reward $r'$. I thought this is not the right model ... but again ... I might be wrong ...
    $endgroup$
    – RLSelfStudy
    yesterday














4












4








4





$begingroup$


I am reading the 2018 book by Sutton & Barto on Reinforcement Learning and I am wondering the benefit of defining the one-step dynamics of an MDP as
$$
p(s',r|s,a) = Pr(S_{t+1},R_{t+1}|S_t=s, A_t=a)
$$

where $S_t$ is the state and $A_t$ the action at time $t$. $R_t$ is the reward.



This formulation would be useful if we were to allow different rewards when transitioning from $s$ to $s'$ by taking an action $a$, but that not not make sense. I am used to the definition based on $p(s'|s,a)$ and $r(s,a,s')$, which of course can be derived from the one-step dynamics above.



Clearly, I am missing something. Any enlightenment would be really helpful. Thx!










share|improve this question







New contributor




RLSelfStudy is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.







$endgroup$




I am reading the 2018 book by Sutton & Barto on Reinforcement Learning and I am wondering the benefit of defining the one-step dynamics of an MDP as
$$
p(s',r|s,a) = Pr(S_{t+1},R_{t+1}|S_t=s, A_t=a)
$$

where $S_t$ is the state and $A_t$ the action at time $t$. $R_t$ is the reward.



This formulation would be useful if we were to allow different rewards when transitioning from $s$ to $s'$ by taking an action $a$, but that not not make sense. I am used to the definition based on $p(s'|s,a)$ and $r(s,a,s')$, which of course can be derived from the one-step dynamics above.



Clearly, I am missing something. Any enlightenment would be really helpful. Thx!







machine-learning reinforcement-learning






share|improve this question







New contributor




RLSelfStudy is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.











share|improve this question







New contributor




RLSelfStudy is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.









share|improve this question




share|improve this question






New contributor




RLSelfStudy is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.









asked yesterday









RLSelfStudyRLSelfStudy

233




233




New contributor




RLSelfStudy is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.





New contributor





RLSelfStudy is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.






RLSelfStudy is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.












  • $begingroup$
    Could you explain why, to you, that "allow different rewards when transitioning from 𝑠 to 𝑠′ by taking an action 𝑎" does not make sense? It makes sense to me, but I cannot explain it to you, unless you give more details about what is wrong with the idea to you
    $endgroup$
    – Neil Slater
    yesterday










  • $begingroup$
    My understanding is that given a starting state and a target state, reachable by applying action $a$, there is only a single reward. If we have multiple rewards, then we are allowing the Markov Chain model (thought as a graph) being a multi-graph where we can go from $s$ to $s'$ (with $a$) over an edge with reward $r$ and another with reward $r'$. I thought this is not the right model ... but again ... I might be wrong ...
    $endgroup$
    – RLSelfStudy
    yesterday


















  • $begingroup$
    Could you explain why, to you, that "allow different rewards when transitioning from 𝑠 to 𝑠′ by taking an action 𝑎" does not make sense? It makes sense to me, but I cannot explain it to you, unless you give more details about what is wrong with the idea to you
    $endgroup$
    – Neil Slater
    yesterday










  • $begingroup$
    My understanding is that given a starting state and a target state, reachable by applying action $a$, there is only a single reward. If we have multiple rewards, then we are allowing the Markov Chain model (thought as a graph) being a multi-graph where we can go from $s$ to $s'$ (with $a$) over an edge with reward $r$ and another with reward $r'$. I thought this is not the right model ... but again ... I might be wrong ...
    $endgroup$
    – RLSelfStudy
    yesterday
















$begingroup$
Could you explain why, to you, that "allow different rewards when transitioning from 𝑠 to 𝑠′ by taking an action 𝑎" does not make sense? It makes sense to me, but I cannot explain it to you, unless you give more details about what is wrong with the idea to you
$endgroup$
– Neil Slater
yesterday




$begingroup$
Could you explain why, to you, that "allow different rewards when transitioning from 𝑠 to 𝑠′ by taking an action 𝑎" does not make sense? It makes sense to me, but I cannot explain it to you, unless you give more details about what is wrong with the idea to you
$endgroup$
– Neil Slater
yesterday












$begingroup$
My understanding is that given a starting state and a target state, reachable by applying action $a$, there is only a single reward. If we have multiple rewards, then we are allowing the Markov Chain model (thought as a graph) being a multi-graph where we can go from $s$ to $s'$ (with $a$) over an edge with reward $r$ and another with reward $r'$. I thought this is not the right model ... but again ... I might be wrong ...
$endgroup$
– RLSelfStudy
yesterday




$begingroup$
My understanding is that given a starting state and a target state, reachable by applying action $a$, there is only a single reward. If we have multiple rewards, then we are allowing the Markov Chain model (thought as a graph) being a multi-graph where we can go from $s$ to $s'$ (with $a$) over an edge with reward $r$ and another with reward $r'$. I thought this is not the right model ... but again ... I might be wrong ...
$endgroup$
– RLSelfStudy
yesterday










1 Answer
1






active

oldest

votes


















2












$begingroup$

In general, $R_{t+1}$ is is a random variable with conditional probability distribution $Pr(R_{t+1}=r|S_t=s,A_t=a)$. So it can potentially take on a different value each time action $a$ is taken in state $s$.



Some problems don't require any randomness in their reward function. Using the expected reward $r(s,a,s')$ is simpler in this case, since we don't have to worry about the reward's distribution. However, some problems do require randomness in their reward function. Consider the classic multi-armed bandit problem, for example. The payoff from a machine isn't generally deterministic.



As the basis for RL, we want the MDP to be as general as possible. We model reward in MDPs as a random variable because it gives us that generality. And because it is useful to do so.






share|improve this answer








New contributor




Philip Raeisghasem is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.






$endgroup$













    Your Answer





    StackExchange.ifUsing("editor", function () {
    return StackExchange.using("mathjaxEditing", function () {
    StackExchange.MarkdownEditor.creationCallbacks.add(function (editor, postfix) {
    StackExchange.mathjaxEditing.prepareWmdForMathJax(editor, postfix, [["$", "$"], ["\\(","\\)"]]);
    });
    });
    }, "mathjax-editing");

    StackExchange.ready(function() {
    var channelOptions = {
    tags: "".split(" "),
    id: "557"
    };
    initTagRenderer("".split(" "), "".split(" "), channelOptions);

    StackExchange.using("externalEditor", function() {
    // Have to fire editor after snippets, if snippets enabled
    if (StackExchange.settings.snippets.snippetsEnabled) {
    StackExchange.using("snippets", function() {
    createEditor();
    });
    }
    else {
    createEditor();
    }
    });

    function createEditor() {
    StackExchange.prepareEditor({
    heartbeatType: 'answer',
    autoActivateHeartbeat: false,
    convertImagesToLinks: false,
    noModals: true,
    showLowRepImageUploadWarning: true,
    reputationToPostImages: null,
    bindNavPrevention: true,
    postfix: "",
    imageUploader: {
    brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
    contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
    allowUrls: true
    },
    onDemand: true,
    discardSelector: ".discard-answer"
    ,immediatelyShowMarkdownHelp:true
    });


    }
    });






    RLSelfStudy is a new contributor. Be nice, and check out our Code of Conduct.










    draft saved

    draft discarded


















    StackExchange.ready(
    function () {
    StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fdatascience.stackexchange.com%2fquestions%2f47436%2fone-step-dynamics-mdp%23new-answer', 'question_page');
    }
    );

    Post as a guest















    Required, but never shown

























    1 Answer
    1






    active

    oldest

    votes








    1 Answer
    1






    active

    oldest

    votes









    active

    oldest

    votes






    active

    oldest

    votes









    2












    $begingroup$

    In general, $R_{t+1}$ is is a random variable with conditional probability distribution $Pr(R_{t+1}=r|S_t=s,A_t=a)$. So it can potentially take on a different value each time action $a$ is taken in state $s$.



    Some problems don't require any randomness in their reward function. Using the expected reward $r(s,a,s')$ is simpler in this case, since we don't have to worry about the reward's distribution. However, some problems do require randomness in their reward function. Consider the classic multi-armed bandit problem, for example. The payoff from a machine isn't generally deterministic.



    As the basis for RL, we want the MDP to be as general as possible. We model reward in MDPs as a random variable because it gives us that generality. And because it is useful to do so.






    share|improve this answer








    New contributor




    Philip Raeisghasem is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
    Check out our Code of Conduct.






    $endgroup$


















      2












      $begingroup$

      In general, $R_{t+1}$ is is a random variable with conditional probability distribution $Pr(R_{t+1}=r|S_t=s,A_t=a)$. So it can potentially take on a different value each time action $a$ is taken in state $s$.



      Some problems don't require any randomness in their reward function. Using the expected reward $r(s,a,s')$ is simpler in this case, since we don't have to worry about the reward's distribution. However, some problems do require randomness in their reward function. Consider the classic multi-armed bandit problem, for example. The payoff from a machine isn't generally deterministic.



      As the basis for RL, we want the MDP to be as general as possible. We model reward in MDPs as a random variable because it gives us that generality. And because it is useful to do so.






      share|improve this answer








      New contributor




      Philip Raeisghasem is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
      Check out our Code of Conduct.






      $endgroup$
















        2












        2








        2





        $begingroup$

        In general, $R_{t+1}$ is is a random variable with conditional probability distribution $Pr(R_{t+1}=r|S_t=s,A_t=a)$. So it can potentially take on a different value each time action $a$ is taken in state $s$.



        Some problems don't require any randomness in their reward function. Using the expected reward $r(s,a,s')$ is simpler in this case, since we don't have to worry about the reward's distribution. However, some problems do require randomness in their reward function. Consider the classic multi-armed bandit problem, for example. The payoff from a machine isn't generally deterministic.



        As the basis for RL, we want the MDP to be as general as possible. We model reward in MDPs as a random variable because it gives us that generality. And because it is useful to do so.






        share|improve this answer








        New contributor




        Philip Raeisghasem is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
        Check out our Code of Conduct.






        $endgroup$



        In general, $R_{t+1}$ is is a random variable with conditional probability distribution $Pr(R_{t+1}=r|S_t=s,A_t=a)$. So it can potentially take on a different value each time action $a$ is taken in state $s$.



        Some problems don't require any randomness in their reward function. Using the expected reward $r(s,a,s')$ is simpler in this case, since we don't have to worry about the reward's distribution. However, some problems do require randomness in their reward function. Consider the classic multi-armed bandit problem, for example. The payoff from a machine isn't generally deterministic.



        As the basis for RL, we want the MDP to be as general as possible. We model reward in MDPs as a random variable because it gives us that generality. And because it is useful to do so.







        share|improve this answer








        New contributor




        Philip Raeisghasem is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
        Check out our Code of Conduct.









        share|improve this answer



        share|improve this answer






        New contributor




        Philip Raeisghasem is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
        Check out our Code of Conduct.









        answered yesterday









        Philip RaeisghasemPhilip Raeisghasem

        1735




        1735




        New contributor




        Philip Raeisghasem is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
        Check out our Code of Conduct.





        New contributor





        Philip Raeisghasem is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
        Check out our Code of Conduct.






        Philip Raeisghasem is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
        Check out our Code of Conduct.






















            RLSelfStudy is a new contributor. Be nice, and check out our Code of Conduct.










            draft saved

            draft discarded


















            RLSelfStudy is a new contributor. Be nice, and check out our Code of Conduct.













            RLSelfStudy is a new contributor. Be nice, and check out our Code of Conduct.












            RLSelfStudy is a new contributor. Be nice, and check out our Code of Conduct.
















            Thanks for contributing an answer to Data Science Stack Exchange!


            • Please be sure to answer the question. Provide details and share your research!

            But avoid



            • Asking for help, clarification, or responding to other answers.

            • Making statements based on opinion; back them up with references or personal experience.


            Use MathJax to format equations. MathJax reference.


            To learn more, see our tips on writing great answers.




            draft saved


            draft discarded














            StackExchange.ready(
            function () {
            StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fdatascience.stackexchange.com%2fquestions%2f47436%2fone-step-dynamics-mdp%23new-answer', 'question_page');
            }
            );

            Post as a guest















            Required, but never shown





















































            Required, but never shown














            Required, but never shown












            Required, but never shown







            Required, but never shown

































            Required, but never shown














            Required, but never shown












            Required, but never shown







            Required, but never shown







            Popular posts from this blog

            Callistus I

            Tabula Rosettana

            How to label and detect the document text images