What is the purpose of partial derivatives in loss calculation (linear regression)?












I am studying ML and data science from scratch. As part of the course, I am studying how the models are derived, and for most of them, starting with the simplest one, linear regression, we take partial derivatives. I understand the implementation part; however, I am a bit confused about why we need to take partial derivatives there.



Is there any specific reason behind it? Can we use any other method to minimise the linear regression loss function?










Tags: machine-learning, linear-regression, machine-learning-model, loss-function






asked 2 days ago by aB9




          2 Answers

The underlying idea behind machine learning is to come up with a more or less complicated algorithm such that, given a set of input data, one is able to produce some sort of output; this output in turn depends on some parameters (which specify the model). The objective is to choose those parameters so that the algorithm's output is as close as possible to the actual result. Now let $y_i$ and $f(x_i, \beta)$ be the actual and the predicted value, respectively, corresponding to the input $x_i$: the previous sentence translates into choosing the parameters $\beta$ (whatever they represent) such that the error you commit is smallest, namely such that $L(y \mid x, \beta)$ is minimised, where the function $L$ is whatever way we decide to measure the error between predictions and actuals. In the literature $L$ is referred to as the loss function, and for most practical purposes, especially for polynomial models like linear regression, it reduces to the sum of squares
$$
L(y \mid x) = \sum_{i=1}^{N} \bigl(y_i - f(x_i, \beta)\bigr)^2.
$$

For each choice of parameters $\beta$ and function $f$ the above takes different values; once we decide to fix the form of $f$, we are looking for the set of $\beta$ for which the above is smallest. Assuming the loss function is differentiable, at a local minimum all of the partial derivatives with respect to the variables in question (in this case the components of $\beta$) must vanish; so at the end of the day one essentially ends up taking derivatives and equating them to zero.
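
For concreteness, the vanishing-derivative condition written out for each component $\beta_j$ of the parameter vector is
$$
\frac{\partial L}{\partial \beta_j} = -2 \sum_{i=1}^{N} \bigl(y_i - f(x_i, \beta)\bigr)\,\frac{\partial f(x_i, \beta)}{\partial \beta_j} = 0,
$$
one equation per parameter, which is why one needs partial derivatives (one per $\beta_j$) rather than a single ordinary derivative.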



In the case of linear regression one assumes $f(x_i, \beta) = \sum_{j=1}^{M} x_i^j \beta_j$: plugging this expression into the loss function and taking derivatives gives back the familiar expressions for the coefficients that one learns in school. Likewise for more complicated models: the form of the function $f$ may be more involved, there may be analytical problems that force the loss to be minimised numerically, and there may be many more parameters connected to each other in complex ways (for instance in neural networks), but the underlying argument still holds.
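
To make this concrete, here is a minimal NumPy sketch (my own illustration with made-up data, not part of the original answer). It fits a one-variable linear regression in two ways: by solving the normal equations obtained from setting the partial derivatives to zero, and by following those same partial derivatives downhill with gradient descent.

```python
import numpy as np

# Illustrative synthetic data: y ≈ 1 + 2x plus noise (values assumed for the example)
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=100)

# Design matrix with an intercept column, so f(x_i, beta) = X_i @ beta
X = np.column_stack([np.ones_like(x), x])

# Setting dL/dbeta_j = 0 for the sum-of-squares loss gives the normal equations
#   X^T X beta = X^T y
beta_closed_form = np.linalg.solve(X.T @ X, X.T @ y)

# The same partial derivatives, stacked into a gradient vector, drive gradient descent
beta = np.zeros(2)
lr = 1e-3
for _ in range(20_000):
    grad = -2.0 * X.T @ (y - X @ beta) / len(y)   # dL/dbeta_j, averaged over samples
    beta -= lr * grad

print(beta_closed_form)  # roughly [1, 2]
print(beta)              # converges to essentially the same coefficients
```

Both routes use exactly the same ingredients, the partial derivatives $\partial L / \partial \beta_j$; linear regression just happens to be simple enough that the resulting equations can be solved in closed form.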






answered 2 days ago by gented

It all comes down to how backward propagation works. Ultimately you need to know how much each part of the equation contributed to the final error, and then modify the values of that part accordingly.

In the case of linear regression this is fairly simple, but when you start stacking one linear regression after another (which is essentially a neural network) you need to know how much each coefficient of each layer contributes to the final error. If you used a full derivative instead of a partial one, you could not determine each coefficient's error contribution with that fine-grained precision.

I realise this is not a rigorous mathematical explanation, but it should give you an intuition of why you need partial derivatives.
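
To illustrate that intuition (a toy sketch with assumed example values, not part of the original answer), here is a two-layer linear model where the chain rule yields one partial derivative per weight, i.e. each weight's individual contribution to the error:

```python
import numpy as np

# Toy "stacked linear regression": two linear layers, no activation.
# All numbers are made up; the point is one partial derivative per weight.
x = np.array([1.0, 2.0])          # single input example
y = 3.0                           # target
W1 = np.array([[0.1, -0.2],
               [0.4,  0.3]])      # first-layer weights
w2 = np.array([0.5, -0.1])        # second-layer weights

h = W1 @ x                        # hidden layer
y_hat = w2 @ h                    # prediction
err = y_hat - y                   # residual
loss = err ** 2                   # squared error

# Chain rule (backward propagation): one partial derivative per parameter
dL_dw2 = 2 * err * h                   # dL/dw2_k  = 2 * err * h_k
dL_dW1 = 2 * err * np.outer(w2, x)     # dL/dW1_kj = 2 * err * w2_k * x_j

print(dL_dw2)   # how much each second-layer weight contributed to the error
print(dL_dW1)   # how much each first-layer weight contributed to the error
```

A single overall derivative would collapse all of this into one number, so you could not tell the individual coefficients apart when updating them.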






answered 2 days ago by Juan Antonio Gomez Moriano



