Understanding the general approach to updating optimization function parameters


























This question is not related to a specific method or technique; rather, there is a broader concept that I'm struggling to see clearly.



Introduction



In machine learning, we have loss functions that we're trying to minimize. Gradient descent is a general method for minimizing the output of these loss functions. I understand the basic idea there, but it's the details that I'm getting stuck on.



Suppose that I have some loss function $J_\Theta(x)$, where $x$ is some input and $\Theta$ is just some matrix of arbitrary parameters.



In many examples, I see $\Theta$ written something like:



$$\Theta =
\begin{bmatrix}
\hat{x_0} \\
\hat{x_1} \\
\hat{x_2} \\
\vdots
\end{bmatrix}$$



and each row of $\Theta$ is actually a vector.



Note that some concrete examples I've run across include updating word embeddings in word2vec and updating a softmax layer. I'm happy to elaborate if my explanation is too abstract.



In the examples of gradient descent that I've seen, the derivative of $J$ is typically taken w.r.t. each row of $\Theta$, not w.r.t. individual elements.



So something like $\frac{dJ}{d\hat{x_0}}$, where each vector in $\Theta$ is updated according to the output of this gradient function.
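
Concretely, the update I have in mind is the standard gradient-descent step with some learning rate $\lambda$ (writing it out mostly to check my own understanding):

$$\hat{x_i} \leftarrow \hat{x_i} - \lambda \frac{dJ}{d\hat{x_i}}$$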



Now for my point of confusion:



Suppose the parameters are initialized to zero, which is sometimes done. Wouldn't the updates then be the same for each element of $\hat{x_i}$ in $\Theta$ during the update step? Wouldn't that lead to each vector having the same number in every element? I know that's the wrong conclusion, but I'm not able to see how the individual dimensions would, in the end, take on different values. I assume (hope) I'm missing something simple.
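
To make the confusion concrete, here is a minimal NumPy sketch of the kind of setup I mean (the model, targets, and numbers are made up purely for illustration):

```python
import numpy as np

# Theta holds 3 row vectors, each 2-dimensional, all initialized to zero.
Theta = np.zeros((3, 2))

x = np.array([1.0, 2.0])        # a single input
y = np.array([0.5, -1.0, 2.0])  # a made-up target for each row's score
lam = 0.1                       # learning rate

# Squared loss: J = sum_k (y_k - Theta[k] . x)^2
scores = Theta @ x
# dJ/dTheta[k] = -2 * (y_k - scores[k]) * x
grad = -2.0 * (y - scores)[:, None] * x[None, :]

Theta -= lam * grad
print(Theta)  # each row is a different multiple of x after one step
```

(In this toy case the rows do end up different, since each row's update is a different multiple of $x$; what I can't see is how this plays out in general.)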










optimization gradient-descent

asked 2 days ago
wheresmycookie

1 Answer

I reached the same conclusion for the general case below. However, in practice, at least in the case of neural networks, weights are initialized randomly; initializing them all to the same value is avoided.
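
(As a minimal sketch of that practice, with the shape and scale here being arbitrary choices for illustration:)

```python
import numpy as np

rng = np.random.default_rng(0)
# Small random values break the symmetry that an all-zero
# (or any all-equal) initialization would create.
W = rng.normal(loc=0.0, scale=0.01, size=(4, 3))
```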



A general problematic case



Suppose the data is $x=(x_0, x_1)$, where $x_0$ is the vector of features and $x_1$ is the outcome. Let the loss function be:



$$J_{\theta}(x) = (x_1 - f_{\theta}(x_0))^2$$



Only one data point $x$ is considered for simplicity (for a set of points, the conclusion would be the same, since the argument applies to each term in the summation).



Each parameter $\hat{x}_i$ could be a vector or a single scalar. The gradient of the loss would be:



$$\nabla_{\hat{x}_i}J_{\theta}(x) = -2\,\nabla_{\hat{x}_i}f_{\theta}(x_0)\,(x_1 - f_{\theta}(x_0))$$



According to this gradient, if two parameters $i$ and $j$ have symmetric roles in $f$, i.e. $$\nabla_{\hat{x}_i}f_{\theta} = \nabla_{\hat{x}_j}f_{\theta},$$ their corresponding loss gradients will also be the same, since the components $\nabla f_{\theta}(x_0)$, $x_1$, and $f_{\theta}(x_0)$ are all the same. A concrete example would be a neural network with equal weights and a mean squared loss function. All weights between two specific layers would have the same role in the network, so they would remain equal after each update. However, in practice, the weights of neural networks are initialized randomly, which breaks this role symmetry between the weights.
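
As a rough illustration of this symmetry argument, here is a tiny hand-rolled network (two hidden units, tanh activation, scalar output, squared loss; all values are made up) in which the two hidden-unit weight rows receive identical gradients and therefore stay equal after an update:

```python
import numpy as np

x  = np.array([0.3, -0.7])   # features (the x_0 above)
y  = 1.2                     # outcome  (the x_1 above)
W1 = np.full((2, 2), 0.5)    # all first-layer weights equal
w2 = np.full(2, 0.5)         # all second-layer weights equal
lam = 0.1

h     = np.tanh(W1 @ x)      # both hidden activations are identical
y_hat = w2 @ h
err   = y - y_hat

# Backprop by hand: dJ/dW1[i] = -2 * err * w2[i] * (1 - h[i]^2) * x
grad_W1 = (-2.0 * err * w2 * (1.0 - h**2))[:, None] * x[None, :]

W1 -= lam * grad_W1
assert np.allclose(W1[0], W1[1])  # the two rows are still identical
```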



A specific counterexample



For example, consider 2D data $x=(x_0, x_1)$ and two 1D parameters $\theta=[\hat{x}_0, \hat{x}_1]$, and let the loss function be:
$$J_{\theta}(x)= \hat{x}_0 x_0 + \hat{x}_1 x_1,$$
for a batch of one point $x$ (this is for simplicity, to avoid a summation over points in the batch).



The gradient w.r.t. the parameters is:
$$\frac{\partial J_{\theta}(x)}{\partial \hat{x}_0} = x_0, \quad \mbox{and} \quad \frac{\partial J_{\theta}(x)}{\partial \hat{x}_1} = x_1,$$



This illustrates the dependency of the gradient on the data $x=(x_0, x_1)$. Now, even if both parameters are zero, i.e. $\hat{x}_0=\hat{x}_1=0$, the gradients are still different and nonzero. More specifically, if the learning rate is $\lambda$, the next values of the parameters would be:
$$\hat{x}'_0 = \hat{x}_0 - \lambda\frac{\partial J_{\theta}(x)}{\partial \hat{x}_0} = 0 - \lambda x_0 \neq 0 - \lambda x_1 = \hat{x}_1 - \lambda\frac{\partial J_{\theta}(x)}{\partial \hat{x}_1} = \hat{x}'_1$$
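
A quick numeric check of this step (plugging in made-up values $x = (2, 3)$ and $\lambda = 0.1$):

```python
x0, x1 = 2.0, 3.0   # data point
p0, p1 = 0.0, 0.0   # both parameters initialized to zero
lam = 0.1

p0 -= lam * x0      # gradient w.r.t. the first parameter is x0
p1 -= lam * x1      # gradient w.r.t. the second parameter is x1
print(p0, p1)       # -0.2 vs -0.3: same start, different updates
```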



But what if $x_0=x_1$?

In this case, the parameters always remain equal if we always use the specific data point $x$ to update them. However, this case is pathological (unlikely): for the equality to keep holding, any other data point $y$ that we pick must satisfy $y_0=y_1$ too. So in this example, the problem is unlikely to happen.






answered 2 days ago, edited yesterday
Esmailian
Thanks for the response! I'll take a closer look at this when I've got some time tonight!
– wheresmycookie, 2 days ago






In your example, $\hat{x_0}$ and $\hat{x_1}$ are both 1D. But I'm wondering what happens when they aren't. Suppose they are now 2D vectors. If I understand correctly, even when $x_0$ and $x_1$ are different, $\hat{x}'_0$ and $\hat{x}'_1$ would be different from one another, but the numbers within the vectors have the same operations performed on them (in your example, $0 - \lambda x_0$ and $0 - \lambda x_1$, respectively).
– wheresmycookie, yesterday












@wheresmycookie your points led me to come up with a general problematic case, which means the problem is not as unlikely as I thought.
– Esmailian, yesterday










