What is the purpose of partial derivatives in loss calculation (linear regression)?
I am studying ML and data science from scratch. As part of the course, I am studying how the models are derived, and for most of them, starting with the simplest, linear regression, we take partial derivatives. I understand the implementation part, but I am a bit confused about why we need to take partial derivatives there.
Is there a specific reason behind it? Can we use any other method to minimise the linear regression loss function?
machine-learning linear-regression machine-learning-model loss-function
asked 2 days ago by aB9
2 Answers
The underlying idea behind machine learning is to come up with more or less complicated algorithms that, given a set of input data, produce some sort of output; this output in turn depends on some parameters (which specify the model). The objective is to choose those parameters so that the algorithm's output is as close as possible to the actual result. Let $y_i$ and $f(x_i, \beta)$ be the actual and the predicted value, respectively, for the input $x_i$: the previous sentence translates into choosing the parameters $\beta$ (whatever they represent) so that the error you commit is as small as possible, namely so that $L(y \mid x, \beta)$ is minimised, where $L$ is whatever function we decide to use to measure the error between predictions and actuals. In the literature $L$ is referred to as the loss function, and for most practical purposes, especially for polynomial models like linear regression, it reduces to the sum of squares
$$
L(y \mid x, \beta) = \sum_{i=1}^N \bigl(y_i - f(x_i, \beta)\bigr)^2.
$$
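In code, the loss above is just a function of the parameters. Here is a minimal sketch (names and numbers are made up for illustration) showing that different choices of $\beta$ give different loss values:

```python
import numpy as np

def sum_of_squares_loss(beta, X, y, f):
    """Sum of squared errors between actuals y and predictions f(X, beta)."""
    return np.sum((y - f(X, beta)) ** 2)

# Example with a linear model f(x, beta) = x @ beta:
linear = lambda X, beta: X @ beta
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])   # design matrix with intercept column
y = np.array([0.1, 1.9, 4.1])
print(sum_of_squares_loss(np.array([0.0, 2.0]), X, y, linear))  # small loss for this beta
print(sum_of_squares_loss(np.array([0.5, 1.0]), X, y, linear))  # much larger loss for this one
```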
For each choice of parameters $\beta$ and function $f(x_i, \beta)$ the loss above takes different values; once we fix the form of $f$, we are looking for the set of $\beta$ for which it is smallest. Assuming the loss function is differentiable, at a local minimum all partial derivatives with respect to the variables in question (here, the components of $\beta$) must vanish; so, at the end of the day, finding the minimum essentially comes down to taking derivatives and equating them to zero.
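Concretely, differentiating the sum-of-squares loss with respect to a single coefficient $\beta_k$ and setting the result to zero gives one equation per parameter:
$$
\frac{\partial L}{\partial \beta_k}
  = -2 \sum_{i=1}^N \bigl(y_i - f(x_i, \beta)\bigr)\,
    \frac{\partial f(x_i, \beta)}{\partial \beta_k} = 0,
\qquad k = 1, \dots, M.
$$
Solving this system (analytically when possible, numerically otherwise) yields the parameters that minimise the loss.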
In the case of linear regression one assumes $f(x_i, \beta) = \sum_{j=1}^M x_i^j \beta_j$: plugging this expression into the loss function and taking derivatives gives back the familiar expressions for the coefficients that one learns in school. Likewise for more complicated models: the form of $f$ may be more complicated, there may be analytical problems in minimising the loss computationally, and there may be many more parameters connected to each other in complex ways (for instance in neural networks), but the underlying argument still holds.
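To see this numerically, here is a minimal sketch (NumPy, synthetic data that is not part of the answer) that solves the normal equations obtained by setting the partial derivatives to zero, and checks that the gradient indeed vanishes at the solution:

```python
import numpy as np

# Synthetic data: N samples, M features, for illustration only.
rng = np.random.default_rng(0)
N, M = 100, 3
X = rng.normal(size=(N, M))
true_beta = np.array([2.0, -1.0, 0.5])
y = X @ true_beta + 0.1 * rng.normal(size=N)

# Setting dL/d(beta) = -2 X^T (y - X beta) = 0 gives the normal equations
#   X^T X beta = X^T y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# The gradient of the sum-of-squares loss at beta_hat should be numerically zero.
gradient = -2 * X.T @ (y - X @ beta_hat)
print("estimated beta:", beta_hat)
print("gradient at the minimum:", gradient)  # ~ [0, 0, 0] up to floating-point error
```

The closed form exists for linear regression precisely because the stationarity equations are linear in $\beta$; an iterative method such as gradient descent would converge to the same minimiser.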
answered 2 days ago by gented
It all comes down to how backward propagation works. Ultimately you need to know how much each part of the equation contributed to the final error, and then adjust the values of that part accordingly.
In the case of linear regression this is fairly simple, but when you start stacking one linear regression after another (which is essentially a neural network), you need to know how much each coefficient of each layer contributes to the final error. If you were to use a single overall derivative instead of the partial derivatives, you could not attribute the error to individual coefficients with that level of precision.
I understand this is not a rigorous mathematical explanation, but it should give you an intuition of why you need partial derivatives.
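As a rough sketch of that intuition (synthetic data, not from the answer), gradient descent updates each coefficient using its own partial derivative of the loss:

```python
import numpy as np

# Synthetic data for illustration.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = X @ np.array([1.5, -3.0]) + 0.1 * rng.normal(size=200)

beta = np.zeros(2)          # one coefficient per feature
lr = 0.05                   # learning rate

for _ in range(500):
    residual = y - X @ beta
    # Partial derivative of the mean squared error w.r.t. each coefficient:
    # each entry of `grad` measures how much that single coefficient
    # contributes to the error, which is exactly what the update needs.
    grad = -2 * X.T @ residual / len(y)
    beta -= lr * grad

print(beta)  # should approach [1.5, -3.0]
```

In a neural network the same idea is applied layer by layer via the chain rule, so every weight still receives its own partial derivative.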
answered 2 days ago by Juan Antonio Gomez Moriano