Understanding the general approach to updating optimization function parameters


























This question is not related to a specific method or technique; rather, there is a broader concept that I'm struggling to see clearly.



Introduction



In machine learning, we have loss functions that we're trying to minimize. Gradient descent is a general method for minimizing the output of these loss functions. I understand the basic idea there, but it's the details that I'm getting stuck on.



Suppose that I have some loss function $J_\Theta(x)$, where $x$ is some input and $\Theta$ is just some matrix of arbitrary parameters.



In many examples, I see $\Theta$ written something like:



$$\Theta =
\begin{bmatrix}
\hat{x_0} \\
\hat{x_1} \\
\hat{x_2} \\
\vdots
\end{bmatrix}$$



and each row of $\Theta$ is actually a vector.



Note that some concrete examples I've run across include updating word embeddings in word2vec and updating a softmax layer. I'm happy to elaborate if my explanation is too abstract.



In the examples of gradient descent that I've seen, the derivative of $J$ is typically taken w.r.t. each row of $\Theta$, not w.r.t. individual elements.



So something like $\frac{dJ}{d\hat{x_0}}$, where each vector in $\Theta$ is updated according to the output of this gradient function.
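
Concretely, the update I have in mind is the standard gradient-descent step with some learning rate $\lambda$ (writing it out mostly to check my own understanding):

$$\hat{x_i} \leftarrow \hat{x_i} - \lambda \frac{dJ}{d\hat{x_i}}$$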



Now for my point of confusion:



Suppose the parameters are initialized to zero, which is sometimes done. Wouldn't the updates then be the same for each element of $\hat{x_i}$ in $\Theta$ during the update step? Wouldn't that lead to each vector having the same number in every element? I know that's the wrong conclusion, but I'm not able to see how the individual dimensions would, in the end, take on different values. I assume (hope) I'm missing something simple.
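
To make the confusion concrete, here is a minimal NumPy sketch of the kind of setup I mean (the model, targets, and numbers are made up purely for illustration):

```python
import numpy as np

# Theta holds 3 row vectors, each 2-dimensional, all initialized to zero.
Theta = np.zeros((3, 2))

x = np.array([1.0, 2.0])        # a single input
y = np.array([0.5, -1.0, 2.0])  # a made-up target for each row's score
lam = 0.1                       # learning rate

# Squared loss: J = sum_k (y_k - Theta[k] . x)^2
scores = Theta @ x
# dJ/dTheta[k] = -2 * (y_k - scores[k]) * x
grad = -2.0 * (y - scores)[:, None] * x[None, :]

Theta -= lam * grad
print(Theta)  # each row is a different multiple of x after one step
```

(In this toy case the rows do end up different, since each row's update is a different multiple of $x$; what I can't see is how this plays out in general.)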










optimization gradient-descent

asked 2 days ago
wheresmycookie

1 Answer

I reached the same conclusion for the general case below. However, in practice, at least in the case of neural networks, weights are initialized randomly; initializing them all to the same value is avoided.
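
(As a minimal sketch of that practice, with the shape and scale here being arbitrary choices for illustration:)

```python
import numpy as np

rng = np.random.default_rng(0)
# Small random values break the symmetry that an all-zero
# (or any all-equal) initialization would create.
W = rng.normal(loc=0.0, scale=0.01, size=(4, 3))
```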



A general problematic case



Suppose the data is $x=(x_0, x_1)$, where $x_0$ is the vector of features and $x_1$ is the outcome. Let the loss function be:



$$J_{\theta}(x) = (x_1 - f_{\theta}(x_0))^2$$



Only one data point $x$ is considered for simplicity (for a set of points, the conclusion would be the same, since the argument applies to each term in the summation).



Each parameter $\hat{x}_i$ could be a vector or a single scalar. The gradient of the loss would be:



$$\nabla_{\hat{x}_i}J_{\theta}(x) = -2\,\nabla_{\hat{x}_i}f_{\theta}(x_0)\,(x_1 - f_{\theta}(x_0))$$



According to this gradient, if two parameters $i$ and $j$ have symmetric roles in $f$, i.e. $$\nabla_{\hat{x}_i}f_{\theta} = \nabla_{\hat{x}_j}f_{\theta},$$ their corresponding loss gradients will also be the same, since the components $\nabla f_{\theta}(x_0)$, $x_1$, and $f_{\theta}(x_0)$ are all the same. A concrete example would be a neural network with equal weights and a mean squared loss function. All weights between two specific layers would have the same role in the network, so they would remain equal after each update. However, in practice, the weights of neural networks are initialized randomly, which breaks this role symmetry between the weights.
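
As a rough illustration of this symmetry argument, here is a tiny hand-rolled network (two hidden units, tanh activation, scalar output, squared loss; all values are made up) in which the two hidden-unit weight rows receive identical gradients and therefore stay equal after an update:

```python
import numpy as np

x  = np.array([0.3, -0.7])   # features (the x_0 above)
y  = 1.2                     # outcome  (the x_1 above)
W1 = np.full((2, 2), 0.5)    # all first-layer weights equal
w2 = np.full(2, 0.5)         # all second-layer weights equal
lam = 0.1

h     = np.tanh(W1 @ x)      # both hidden activations are identical
y_hat = w2 @ h
err   = y - y_hat

# Backprop by hand: dJ/dW1[i] = -2 * err * w2[i] * (1 - h[i]^2) * x
grad_W1 = (-2.0 * err * w2 * (1.0 - h**2))[:, None] * x[None, :]

W1 -= lam * grad_W1
assert np.allclose(W1[0], W1[1])  # the two rows are still identical
```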



A specific counterexample



For example, consider 2D data $x=(x_0, x_1)$ and two 1D parameters $\theta=[\hat{x}_0, \hat{x}_1]$, and let the loss function be:
$$J_{\theta}(x)= \hat{x}_0 x_0 + \hat{x}_1 x_1,$$
for a batch of one point $x$ (this is for simplicity, to avoid a summation over points in the batch).



The gradient w.r.t. the parameters is:
$$\frac{\partial J_{\theta}(x)}{\partial \hat{x}_0} = x_0, \quad \mbox{and} \quad \frac{\partial J_{\theta}(x)}{\partial \hat{x}_1} = x_1,$$



This illustrates the dependency of the gradient on the data $x=(x_0, x_1)$. Now, even if both parameters are zero, i.e. $\hat{x}_0=\hat{x}_1=0$, the gradients are still different and nonzero. More specifically, if the learning rate is $\lambda$, the next values of the parameters would be:
$$\hat{x}'_0 = \hat{x}_0 - \lambda\frac{\partial J_{\theta}(x)}{\partial \hat{x}_0} = 0 - \lambda x_0 \neq 0 - \lambda x_1 = \hat{x}_1 - \lambda\frac{\partial J_{\theta}(x)}{\partial \hat{x}_1} = \hat{x}'_1$$
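
A quick numeric check of this step (plugging in made-up values $x = (2, 3)$ and $\lambda = 0.1$):

```python
x0, x1 = 2.0, 3.0   # data point
p0, p1 = 0.0, 0.0   # both parameters initialized to zero
lam = 0.1

p0 -= lam * x0      # gradient w.r.t. the first parameter is x0
p1 -= lam * x1      # gradient w.r.t. the second parameter is x1
print(p0, p1)       # -0.2 vs -0.3: same start, different updates
```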



But what if $x_0=x_1$?

In this case, the parameters always remain equal if we always use the specific data point $x$ to update them. However, this case is pathological (unlikely): for the equality to keep holding, any other data point $y$ that we pick must satisfy $y_0=y_1$ too. So in this example, the problem is unlikely to happen.






answered 2 days ago, edited yesterday
Esmailian
Thanks for the response! I'll take a closer look at this when I've got some time tonight!
– wheresmycookie, 2 days ago






In your example, $\hat{x_0}$ and $\hat{x_1}$ are both 1D. But I'm wondering what happens when they aren't. Suppose they are now 2D vectors. If I understand correctly, even when $x_0$ and $x_1$ are different, $\hat{x}'_0$ and $\hat{x}'_1$ would be different from one another, but the numbers within the vectors have the same operations performed on them (in your example, $0 - \lambda x_0$ and $0 - \lambda x_1$, respectively).
– wheresmycookie, yesterday












@wheresmycookie your points led me to come up with a general problematic case, which means the problem is not as unlikely as I thought.
– Esmailian, yesterday










