Understanding general approach to updating optimization function parameters
This question is not related to a specific method or technique; rather, there is a broader concept that I'm struggling to see clearly.
Introduction
In machine learning, we have loss functions that we're trying to minimize. Gradient descent is a general method for minimizing these loss functions. I understand the basic idea there, but it's the details that I'm getting stuck on.
Suppose that I have some loss function $J_\Theta(x)$, where $x$ is some input and $\Theta$ is just some matrix of arbitrary parameters.
In many examples, I see $\Theta$ written something like:
$$\Theta =
\begin{bmatrix}
\hat{x}_0 \\
\hat{x}_1 \\
\hat{x}_2 \\
\vdots
\end{bmatrix}$$
and each row of $\Theta$ is actually a vector.
Note that some concrete examples that I've run across include updating word embeddings in word2vec, or updating a softmax layer. I'm happy to elaborate if my explanation is too abstract.
In the examples of gradient descent that I've seen, the derivative of $J$ is typically taken w.r.t. each row of $\Theta$, not w.r.t. individual elements. So we compute something like $\frac{dJ}{d\hat{x}_0}$, and each vector in $\Theta$ is updated according to the output of this gradient function.
Now for my point of confusion:
Suppose the parameters are initialized to zero, which is sometimes done in practice. Wouldn't the update then be the same for each element of a given $\hat{x}_i$ in $\Theta$? Wouldn't that lead to every vector having the same value in each element? I know that's the wrong conclusion, but I'm not able to see how each dimension would, in the end, take on a different value. I assume (hope) I'm missing something simple.
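To make this concrete, here is a minimal NumPy sketch of the kind of situation I have in mind (my own toy setup, not from any particular tutorial), where exactly this symmetry seems to happen:

```python
# Toy setup: a tiny linear model with Theta initialized to zero.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=3)            # one input vector
y = 1.0                           # a scalar target
Theta = np.zeros((4, 3))          # 4 rows; each row is one vector x_hat_i

# Forward pass: each row of Theta produces one prediction.
pred = Theta @ x                  # shape (4,)
loss = np.sum((pred - y) ** 2)

# Gradient of the loss w.r.t. each row of Theta.
grad = 2 * np.outer(pred - y, x)  # shape (4, 3), one gradient per row

Theta -= 0.1 * grad
print(Theta)  # every row is identical -- this is the symmetry I'm asking about
```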
optimization gradient-descent
asked 2 days ago by wheresmycookie
1 Answer
I reached the same conclusion for a general case below. However, in practice, at least in the case of neural networks, weights are initialized randomly; initializing to the same value is avoided.
A general problematic case
Suppose data is $x=(x_0, x_1)$, where $x_0$ is the vector of features and $x_1$ is the outcome. Let the loss function be:
$$J_{\theta}(x) = (x_1 - f_{\theta}(x_0))^2$$
Only one data point $x$ is considered for simplicity (for a set of points, the conclusion would be the same, since the argument applies to each term in the summation).
Each parameter $\hat{x}_i$ could be a vector or a single scalar. The gradient of the loss would be:
$$\nabla_{\hat{x}_i}J_{\theta}(x) = -2\,\nabla_{\hat{x}_i}f_{\theta}(x_0)\,(x_1 - f_{\theta}(x_0))$$
According to this gradient, if two parameters $i$ and $j$ have symmetric roles in $f$, i.e. $$\nabla_{\hat{x}_i}f_{\theta} = \nabla_{\hat{x}_j}f_{\theta},$$ their corresponding loss gradients will also be the same, since the remaining components $x_1$ and $f_{\theta}(x_0)$ are shared. A concrete example would be a neural network with equal weights and a mean squared loss function: all weights between two specific layers would play the same role in the network, so they would remain equal after each update. However, in practice, the weights of neural networks are initialized randomly, which breaks this role symmetry between the weights.
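As a rough sketch of this symmetry (a toy network of my own construction, with a made-up learning rate of 0.1): two hidden units with equal weights under a squared loss receive identical gradients, so their rows stay equal after the update.

```python
# Toy two-layer network: both hidden rows equal -> symmetric roles in f.
import numpy as np

x0 = np.array([0.5, -1.0])       # features
x1 = 2.0                         # outcome
W1 = np.full((2, 2), 0.3)        # both hidden rows equal
w2 = np.array([0.7, 0.7])        # equal output weights

h = np.tanh(W1 @ x0)             # hidden activations (identical by symmetry)
f = w2 @ h                       # prediction
err = x1 - f

# Backprop for J = (x1 - f)^2; each hidden row gets the same gradient.
grad_W1 = np.outer(-2 * err * w2 * (1 - h**2), x0)

W1 -= 0.1 * grad_W1
print(W1)                        # the two rows are still identical
```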
A specific counterexample
For example, consider 2D data $x=(x_0, x_1)$ and two 1D parameters $\theta=[\hat{x}_0, \hat{x}_1]$, and let the loss function be:
$$J_{\theta}(x) = \hat{x}_0 x_0 + \hat{x}_1 x_1,$$
for a batch of one point $x$ (this is for simplicity to avoid a summation over points in the batch).
The gradient w.r.t. parameters is:
$$\frac{\partial J_{\theta}(x)}{\partial \hat{x}_0} = x_0 \quad \text{and} \quad \frac{\partial J_{\theta}(x)}{\partial \hat{x}_1} = x_1.$$
This illustrates the dependence of the gradient on the data $x=(x_0, x_1)$. Now, even if both parameters are zero, i.e. $\hat{x}_0=\hat{x}_1=0$, the gradients are still different and nonzero (as long as $x_0 \neq x_1$). More specifically, supposing the learning rate is $\lambda$, the next values of the parameters would be:
$$\hat{x}'_0 = \hat{x}_0 - \lambda\frac{\partial J_{\theta}(x)}{\partial \hat{x}_0} = 0 - \lambda x_0 \neq 0 - \lambda x_1 = \hat{x}_1 - \lambda\frac{\partial J_{\theta}(x)}{\partial \hat{x}_1} = \hat{x}'_1$$
But what if $x_0=x_1$?
In this case, the parameters always remain equal if we always use that specific data point $x$ to update them. However, this case is pathological (unlikely): for the equality to keep holding, any other data point $y$ that we pick would have to satisfy $y_0=y_1$ too. So in this example, the problem is unlikely to happen.
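A quick numeric check of this counterexample (toy numbers of my own choosing):

```python
# J = x0_hat * x0 + x1_hat * x1, both parameters initialized to zero.
import numpy as np

x = np.array([1.0, 3.0])    # data point with x0 != x1
theta = np.zeros(2)         # x0_hat = x1_hat = 0
lam = 0.1                   # learning rate

grad = x                    # dJ/dx0_hat = x0, dJ/dx1_hat = x1
theta = theta - lam * grad
print(theta)                # [-0.1, -0.3]: already different after one step
```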
answered 2 days ago, edited yesterday by Esmailian
Thanks for the response! I'll take a closer look at this when I've got some time tonight!
– wheresmycookie, 2 days ago
In your example, $\hat{x}_0$ and $\hat{x}_1$ are both 1D. But I'm wondering what happens when they aren't. Suppose they are now 2D vectors. If I understand correctly, even when $x_0$ and $x_1$ are different, $\hat{x}'_0$ and $\hat{x}'_1$ would be different from one another, but the numbers within the vectors have the same operations being performed on them (in your example, $0 - \lambda x_0$ and $0 - \lambda x_1$, respectively).
– wheresmycookie, yesterday
@wheresmycookie Your points led me to come up with a general problematic case, which means the problem is not as unlikely as I thought.
– Esmailian, yesterday