LSTM sequence prediction: 3d input to 2d output












1












$begingroup$


I have this LSTM model



from keras.models import Sequential
from keras.layers import Masking, LSTM, Dense

model = Sequential()
model.add(Masking(mask_value=0, input_shape=(timesteps, features)))
model.add(LSTM(100, dropout=0.2, recurrent_dropout=0.2, return_sequences=False))
model.add(Dense(features, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])


and shapes X_train (21, 11, 5), y_train (21, 5).



Each timestep is represented by 5 features, and return_sequences is set to False because I want to predict a single 5-element vector (the next timestep) for each input sequence of 11 timesteps.
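
The framing described here (11 timesteps in, the next timestep out) can be sketched with plain Python lists. This is only an illustration under assumptions: `make_windows` is a hypothetical helper, the sequence is random 0/1 data, and the real dataset may be built differently (for example from 21 separate sequences rather than one long one).

```python
import random


def make_windows(sequence, timesteps):
    """Split one long sequence into (input window, next timestep) pairs."""
    X, y = [], []
    for start in range(len(sequence) - timesteps):
        X.append(sequence[start:start + timesteps])  # `timesteps` consecutive steps
        y.append(sequence[start + timesteps])        # the step right after them
    return X, y


random.seed(0)
features = 5
# Hypothetical data: 32 timesteps, each a vector of five 0/1 values.
sequence = [[random.randint(0, 1) for _ in range(features)] for _ in range(32)]

X, y = make_windows(sequence, timesteps=11)
print(len(X), len(X[0]), len(X[0][0]))  # 21 11 5
print(len(y), len(y[0]))                # 21 5
```

Wrapping `X` and `y` in `numpy.array` would give arrays of shape (21, 11, 5) and (21, 5), the same shapes as in the question.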



I get the error




ValueError: y_true and y_pred have different number of output (5!=1)




If I instead reshape the targets, so that X_train is (21, 11, 5) and y_train is (21, 1, 5), I get the error




ValueError: Invalid shape for y: (14, 1, 5)




Note: the value 14 is due to the fact that I'm using cross validation.



What should I do?



Edit



I changed the model to



model = Sequential()
model.add(Masking(mask_value=0, input_shape=(timesteps, features)))
model.add(LSTM(100, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(features, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])


and used the same shapes as before.
Here model.summary() gives



_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
masking_1 (Masking)          (None, 11, 5)             0
_________________________________________________________________
lstm_1 (LSTM)                (None, 100)               42400
_________________________________________________________________
dense_1 (Dense)              (None, 5)                 505
=================================================================



The idea is to produce a multilabel classification. After training the model, I evaluate it on the test data and this is what I get:




X[0] = [[0 0 0 0 0],[1 0 0 1 0], ...,[0 0 1 0 0],[0 0 1 0 0]]



y_true[0] = [0 0 1 0 0]



y_pred[0] = 2




which is not what I want. How can I get a prediction with the same shape as y_true, so that I can treat it as a multilabel classification?
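
To make the mismatch concrete: a single integer such as 2 is what an argmax over the five outputs yields, whereas a multilabel prediction keeps one 0/1 decision per output. The sketch below assumes `probs` stands in for one row of sigmoid outputs (the numbers are made up) and that 0.5 is an acceptable threshold; neither comes from the original post.

```python
# Hypothetical sigmoid outputs for one sample (made-up values, one per label).
probs = [0.10, 0.62, 0.91, 0.05, 0.30]

# Argmax collapses the row to a single class index.
class_index = max(range(len(probs)), key=lambda i: probs[i])

# Thresholding keeps one independent 0/1 decision per label.
threshold = 0.5  # assumed cut-off; tune it for your data
multilabel = [1 if p >= threshold else 0 for p in probs]

print(class_index)  # 2
print(multilabel)   # [0, 1, 1, 0, 0]
```

If predictions come back as class indices, it may be worth checking whether they are produced by an argmax-style helper (older Keras versions offered `predict_classes` on `Sequential` models, which behaves this way for multi-unit outputs) rather than read from the raw per-label probabilities.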










$endgroup$

















      lstm multilabel-classification recurrent-neural-net






      asked Feb 22 at 9:14 by ginevracoal; edited 9 hours ago by ginevracoal


























          1 Answer


















          1












          $begingroup$

          What are your features like? A Dense layer with a softmax of size 5 implies that you want to predict a single categorical feature with 5 possible values.

          If that is not the case, more information about the features is needed to help here.

          Your Y-variable for each feature should have shape (num_samples, time_step_len, num_categories_of_feature). You need to one-hot encode each categorical feature separately, which gives the size of the last dimension, num_categories_of_feature. Currently your Y_train has shape (num_samples, features), so the network is only supervised on the end result of each sequence rather than on every step. Instead, create Y_train so that it holds the true value of the next time-step for every time-step, hence (num_samples, time_step_len, num_categories_of_feature).

          Side note: I've only worked with LSTMs/RNNs on one problem, and this is how I did it. I cared about learning the sequences in their entirety, because my inputs at prediction time are variable. If you always have 11 time-steps and always just want the next time-step prediction, this might not apply. To be honest, I really don't know.
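
The per-timestep target described above (the true next value at every step) amounts to shifting the sequence by one. A minimal sketch with made-up data; `shift_targets` is a hypothetical helper, not part of Keras:

```python
def shift_targets(sequence):
    """Pair every timestep with the timestep that follows it."""
    inputs = sequence[:-1]   # steps 0 .. T-2
    targets = sequence[1:]   # steps 1 .. T-1, i.e. the "next step" at each position
    return inputs, targets


# Hypothetical one-hot sequence for a single feature with 3 categories.
sequence = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 1, 0]]

inputs, targets = shift_targets(sequence)
print(inputs)   # [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
print(targets)  # [[0, 1, 0], [0, 0, 1], [0, 1, 0]]
```

Stacking such pairs over all samples gives targets of shape (num_samples, time_step_len, num_categories_of_feature). With targets like these, the LSTM would need `return_sequences=True` so that it emits one prediction per timestep.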



          I'm not totally sure this is the only way to do it, but the way I think about predicting 5 categorical variables is that you need one softmax output per variable. A softmax activation of size "features", as you have it here, estimates a single probability distribution over 5 values, which implies your Y variable is one categorical feature with 5 possible values. So you will need to set up your network with 5 outputs, each with its own softmax whose size equals the number of categories of that variable. A single softmax should only be used to estimate a distribution over a single class variable. 10 options for feat1? Softmax of size 10, and so on.
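
As an illustration of one softmax per categorical variable, the sketch below applies a separate softmax to each group of raw scores. The group sizes and scores are invented for the example, and `softmax` is written out by hand rather than taken from a library:

```python
import math


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical raw scores: feat1 has 3 categories, feat2 has 2.
raw_scores = {"feat1_output": [2.0, 0.5, -1.0], "feat2_output": [0.1, 0.3]}

# One independent probability distribution per variable.
probabilities = {name: softmax(scores) for name, scores in raw_scores.items()}

for name, probs in probabilities.items():
    print(name, [round(p, 3) for p in probs], "sum =", round(sum(probs), 6))
```

Each group sums to 1 on its own, which is the property a single softmax over all five concatenated positions cannot provide.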



          from keras.optimizers import Adam

          # One loss and one metric per named output layer (feat1_output ... feat5_output).
          output_names = ["feat%d_output" % i for i in range(1, 6)]
          losses = {name: "categorical_crossentropy" for name in output_names}
          loss_weights = {name: 1.0 for name in output_names}  # optional when all weights are equal
          metrics = {name: "categorical_accuracy" for name in output_names}
          # init_lr and num_epochs are your own hyperparameters; newer Keras releases spell lr as learning_rate.
          opt = Adam(lr=init_lr, decay=init_lr / num_epochs)
          model.compile(loss=losses, loss_weights=loss_weights, optimizer=opt, metrics=metrics)


          Now you will be optimizing 5 loss functions at the same time, one per categorical prediction. You therefore need 5 Y datasets, each of shape (num_samples, time_step_len, num_categories_of_feature), and you pass them to the fit function as a list, in the same order as the model's outputs. For the dictionary keys above to work, the output layers must be given matching names (feat1_output, feat2_output, and so on) in the model definition.















          $endgroup$













          • $begingroup$
            My data is shaped in the way I explained in this other question: datascience.stackexchange.com/questions/45867/…, but with size 5 instead of 3. The idea was to predict the one-hot encoded categorical vectors of size 5 for the next timestep at the same time, given the list of all the encodings for the previous timesteps (11 timesteps in the example).
            $endgroup$
            – ginevracoal
            yesterday












          • $begingroup$
            So each of the 5 features can only take the value 0 or 1, right? Each one occupies a single position in the vector, so that seems to be what you have. If these variables were one-hot encoded, a timestep would look like [(1,0),(0,1),(1,0)] instead of [0,1,0]. One-hot encoding means that each feature is represented by as many positions as it has classes. If a feature can take 6 values, it takes five 0s and a single 1 (in the "column" for that class) to represent which class it is.
            $endgroup$
            – kylec123
            yesterday
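
A small sketch of the one-hot encoding described in the comment above, using the convention that class 0 maps to (1, 0) and class 1 to (0, 1). `one_hot` is a hypothetical helper written for this example:

```python
def one_hot(value, num_classes):
    """Return a tuple with a single 1 in the position of `value`."""
    return tuple(1 if i == value else 0 for i in range(num_classes))


# Three binary features given as raw class ids.
timestep = [0, 1, 0]
encoded = [one_hot(v, num_classes=2) for v in timestep]
print(encoded)  # [(1, 0), (0, 1), (1, 0)]

# A feature with 6 possible values needs five 0s and a single 1.
print(one_hot(4, num_classes=6))  # (0, 0, 0, 0, 1, 0)
```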










          • $begingroup$
            If you reshape your y-data as I describe above, you can then have 5 separate softmax activations, one for each categorical variable.
            $endgroup$
            – kylec123
            yesterday












          • $begingroup$
            Actually it is slightly different. I concatenated the different encodings into the same 5-element vector because otherwise I would have a 4-dimensional input, but I don't know if that is a reasonable thing to do.
            $endgroup$
            – ginevracoal
            yesterday








          • 1




            $begingroup$
            If you are trying to use a Softmax activation, you will only ever get a probability distribution the size of the "features". I believe you need to reevaluate how you represent your Y-data.
            $endgroup$
            – kylec123
            yesterday











          Your Answer





          StackExchange.ifUsing("editor", function () {
          return StackExchange.using("mathjaxEditing", function () {
          StackExchange.MarkdownEditor.creationCallbacks.add(function (editor, postfix) {
          StackExchange.mathjaxEditing.prepareWmdForMathJax(editor, postfix, [["$", "$"], ["\\(","\\)"]]);
          });
          });
          }, "mathjax-editing");

          StackExchange.ready(function() {
          var channelOptions = {
          tags: "".split(" "),
          id: "557"
          };
          initTagRenderer("".split(" "), "".split(" "), channelOptions);

          StackExchange.using("externalEditor", function() {
          // Have to fire editor after snippets, if snippets enabled
          if (StackExchange.settings.snippets.snippetsEnabled) {
          StackExchange.using("snippets", function() {
          createEditor();
          });
          }
          else {
          createEditor();
          }
          });

          function createEditor() {
          StackExchange.prepareEditor({
          heartbeatType: 'answer',
          autoActivateHeartbeat: false,
          convertImagesToLinks: false,
          noModals: true,
          showLowRepImageUploadWarning: true,
          reputationToPostImages: null,
          bindNavPrevention: true,
          postfix: "",
          imageUploader: {
          brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
          contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
          allowUrls: true
          },
          onDemand: true,
          discardSelector: ".discard-answer"
          ,immediatelyShowMarkdownHelp:true
          });


          }
          });






          ginevracoal is a new contributor. Be nice, and check out our Code of Conduct.










          draft saved

          draft discarded


















          StackExchange.ready(
          function () {
          StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fdatascience.stackexchange.com%2fquestions%2f46006%2flstm-sequence-prediction-3d-input-to-2d-output%23new-answer', 'question_page');
          }
          );

          Post as a guest















          Required, but never shown

























          1 Answer
          1






          active

          oldest

          votes








          1 Answer
          1






          active

          oldest

          votes









          active

          oldest

          votes






          active

          oldest

          votes









          1












          $begingroup$

          What are your features like? Given that you have a Dense layer outputting a softmax of size 5, this implies that all you want to predict is 1 feature, a categorical feature with 5 options.



          If this is not true, more information about the features is needed to help here.



          Your Y-variable for each feature should be of size (num_samples, time_step_len, num_categories_of_feature). You need to one-hot-encode each categorical feature separately, which gives the last dimension size, num_categories_of_feature. As you have it currently, the Y_train size is (num_samples, features). So, as you have the problem framed, the network has no way to learn the sequence patterns, as you only give it the end result. You should create your Y_train data to be the true value for the next time-step, for every time-step. Hence, (num_samples, time_step_len, num_categories_of_feature). Side note: I've only worked with LSTMs/RNN's on one problem, and this is how I did it. I cared about learning the sequences in it's entirety, because my inputs at prediction time are variable. If you always have 11 time-steps and always just want the next time-step prediction, this might not apply. I really don't know to be honest.



          This is where I'm not totally sure if this is the only way to do this, but the way I think of this problem for wanting to predict 5 categorical variables, you need a way to output softmaxs for each variable. A softmax activation of size "features", like you have it here, is estimating a probability distribution of size 5, which implies your Y variable is only 1 categorical feature that has 5 potential values. So, you will need to set up your network to have 5 outputs with independent softmax outputs the size equal to the number of categories for each variable. A single softmax should only be used to estimate a distribution over a single class variable. 10 options for feat1? Softmax of size 10. etc.



          losses = {"feat1_output": "categorical_crossentropy", "feat2_output": "categorical_crossentropy", "feat3_output": "categorical_crossentropy", "feat4_output": "categorical_crossentropy", "feat5_output": "categorical_crossentropy"}
          lossWeights = {"feat1_output": 1.0, "feat2_output": 1.0, ... , ...}# if everything is equal, dont worry about specifying loss weights.
          metrics = {"feat1_output": "categorical_accuracy", "feat2_output": "categorical_accuracy", "feat3_output": "categorical_accuracy", "feat4_output": "categorical_accuracy", "feat5_output": "categorical_accuracy"}
          opt = Adam(lr=init_lr,decay=init_lr / num_epochs)
          model.compile(loss = losses, loss_weights = lossWeights, optimizer=opt, metrics=metrics)


          Now, you will be optimizing 5 loss functions at the same time, one for each categorical prediction. You must now have 5 Y-variable datasets, each of size (num_samples, time_step_len, num_categories_of_feature). You will then give 5 y datasets for the outputs in the fit function, as a list. However, to properly name the output layers, you will need to specify the names for the output layers in the model definition.






          share|improve this answer









          $endgroup$













          • $begingroup$
            My data is shaped in the way I explained in this other question: datascience.stackexchange.com/questions/45867/…, but with size 5 instead of 3. The idea was to predict the one-hot encoded categorical vectors of size 5 for the next timestep at the same time, given the list of all the encodings for the previous timesteps (11 timesteps in the example).
            $endgroup$
            – ginevracoal
            yesterday












          • $begingroup$
            So, each of the 5 features can only take a value of 0 or 1, right? They only have a single dimension, so this seems like what you have. If these variables were "one-hot-encoded", you would have a timestep that looks like this: [(1,0),(0,1),(0,1)] instead of [0,1,0]. One-Hot-Encoding means that each feature is represented as however many classes the feature can take. If a feature can take 6 values, it takes 5 0's and a single 1 (in the "column" representing that class) to represent which class it is.
            $endgroup$
            – kylec123
            yesterday










          • $begingroup$
            If you reshape your y-data to be like I mention above, you can then have 5 separate softmax activations, once for each categorical variable.
            $endgroup$
            – kylec123
            yesterday












          • $begingroup$
            Actually it is slightly different. I concatenated different encodings into the same 5D array because otherwise I would have a 4 dimensional input, but I don't know if it's a reasonable thing to do.
            $endgroup$
            – ginevracoal
            yesterday








          • 1




            $begingroup$
            If you are trying to use a Softmax activation, you will only ever get a probability distribution the size of the "features". I believe you need to reevaluate how you represent your Y-data.
            $endgroup$
            – kylec123
            yesterday
















          1












          $begingroup$

          What are your features like? Given that you have a Dense layer outputting a softmax of size 5, this implies that all you want to predict is 1 feature, a categorical feature with 5 options.



          If this is not true, more information about the features is needed to help here.



          Your Y-variable for each feature should be of size (num_samples, time_step_len, num_categories_of_feature). You need to one-hot-encode each categorical feature separately, which gives the last dimension size, num_categories_of_feature. As you have it currently, the Y_train size is (num_samples, features). So, as you have the problem framed, the network has no way to learn the sequence patterns, as you only give it the end result. You should create your Y_train data to be the true value for the next time-step, for every time-step. Hence, (num_samples, time_step_len, num_categories_of_feature). Side note: I've only worked with LSTMs/RNN's on one problem, and this is how I did it. I cared about learning the sequences in it's entirety, because my inputs at prediction time are variable. If you always have 11 time-steps and always just want the next time-step prediction, this might not apply. I really don't know to be honest.



          This is where I'm not totally sure if this is the only way to do this, but the way I think of this problem for wanting to predict 5 categorical variables, you need a way to output softmaxs for each variable. A softmax activation of size "features", like you have it here, is estimating a probability distribution of size 5, which implies your Y variable is only 1 categorical feature that has 5 potential values. So, you will need to set up your network to have 5 outputs with independent softmax outputs the size equal to the number of categories for each variable. A single softmax should only be used to estimate a distribution over a single class variable. 10 options for feat1? Softmax of size 10. etc.



          losses = {"feat1_output": "categorical_crossentropy", "feat2_output": "categorical_crossentropy", "feat3_output": "categorical_crossentropy", "feat4_output": "categorical_crossentropy", "feat5_output": "categorical_crossentropy"}
          lossWeights = {"feat1_output": 1.0, "feat2_output": 1.0, ... , ...}# if everything is equal, dont worry about specifying loss weights.
          metrics = {"feat1_output": "categorical_accuracy", "feat2_output": "categorical_accuracy", "feat3_output": "categorical_accuracy", "feat4_output": "categorical_accuracy", "feat5_output": "categorical_accuracy"}
          opt = Adam(lr=init_lr,decay=init_lr / num_epochs)
          model.compile(loss = losses, loss_weights = lossWeights, optimizer=opt, metrics=metrics)


          Now, you will be optimizing 5 loss functions at the same time, one for each categorical prediction. You must now have 5 Y-variable datasets, each of size (num_samples, time_step_len, num_categories_of_feature). You will then give 5 y datasets for the outputs in the fit function, as a list. However, to properly name the output layers, you will need to specify the names for the output layers in the model definition.






          share|improve this answer









          $endgroup$













          • $begingroup$
            My data is shaped in the way I explained in this other question: datascience.stackexchange.com/questions/45867/…, but with size 5 instead of 3. The idea was to predict the one-hot encoded categorical vectors of size 5 for the next timestep at the same time, given the list of all the encodings for the previous timesteps (11 timesteps in the example).
            $endgroup$
            – ginevracoal
            yesterday












          • $begingroup$
            So, each of the 5 features can only take a value of 0 or 1, right? They only have a single dimension, so this seems like what you have. If these variables were "one-hot-encoded", you would have a timestep that looks like this: [(1,0),(0,1),(0,1)] instead of [0,1,0]. One-Hot-Encoding means that each feature is represented as however many classes the feature can take. If a feature can take 6 values, it takes 5 0's and a single 1 (in the "column" representing that class) to represent which class it is.
            $endgroup$
            – kylec123
            yesterday










          • $begingroup$
            If you reshape your y-data to be like I mention above, you can then have 5 separate softmax activations, once for each categorical variable.
            $endgroup$
            – kylec123
            yesterday












          • $begingroup$
            Actually it is slightly different. I concatenated different encodings into the same 5D array because otherwise I would have a 4 dimensional input, but I don't know if it's a reasonable thing to do.
            $endgroup$
            – ginevracoal
            yesterday








          • 1




            $begingroup$
            If you are trying to use a Softmax activation, you will only ever get a probability distribution the size of the "features". I believe you need to reevaluate how you represent your Y-data.
            $endgroup$
            – kylec123
            yesterday














          1












          1








          1





          $begingroup$

          What are your features like? Given that you have a Dense layer outputting a softmax of size 5, this implies that all you want to predict is 1 feature, a categorical feature with 5 options.



          If this is not true, more information about the features is needed to help here.



          Your Y-variable for each feature should be of size (num_samples, time_step_len, num_categories_of_feature). You need to one-hot-encode each categorical feature separately, which gives the last dimension size, num_categories_of_feature. As you have it currently, the Y_train size is (num_samples, features). So, as you have the problem framed, the network has no way to learn the sequence patterns, as you only give it the end result. You should create your Y_train data to be the true value for the next time-step, for every time-step. Hence, (num_samples, time_step_len, num_categories_of_feature). Side note: I've only worked with LSTMs/RNN's on one problem, and this is how I did it. I cared about learning the sequences in it's entirety, because my inputs at prediction time are variable. If you always have 11 time-steps and always just want the next time-step prediction, this might not apply. I really don't know to be honest.



          This is where I'm not totally sure if this is the only way to do this, but the way I think of this problem for wanting to predict 5 categorical variables, you need a way to output softmaxs for each variable. A softmax activation of size "features", like you have it here, is estimating a probability distribution of size 5, which implies your Y variable is only 1 categorical feature that has 5 potential values. So, you will need to set up your network to have 5 outputs with independent softmax outputs the size equal to the number of categories for each variable. A single softmax should only be used to estimate a distribution over a single class variable. 10 options for feat1? Softmax of size 10. etc.



          from keras.optimizers import Adam

          # init_lr and num_epochs are assumed to be defined earlier.
          losses = {"feat1_output": "categorical_crossentropy", "feat2_output": "categorical_crossentropy", "feat3_output": "categorical_crossentropy", "feat4_output": "categorical_crossentropy", "feat5_output": "categorical_crossentropy"}
          loss_weights = {"feat1_output": 1.0, "feat2_output": 1.0, ... , ...}  # if all weights are equal, you don't need to specify loss weights
          metrics = {"feat1_output": "categorical_accuracy", "feat2_output": "categorical_accuracy", "feat3_output": "categorical_accuracy", "feat4_output": "categorical_accuracy", "feat5_output": "categorical_accuracy"}
          opt = Adam(lr=init_lr, decay=init_lr / num_epochs)
          model.compile(loss=losses, loss_weights=loss_weights, optimizer=opt, metrics=metrics)


          Now you will be optimizing 5 loss functions at the same time, one for each categorical prediction. You therefore need 5 Y-variable datasets, each of shape (num_samples, time_step_len, num_categories_of_feature), and you pass those 5 y datasets to the fit function as a list, in the same order as the outputs. For the dictionary keys above to match, you must give each output layer the corresponding name in the model definition, which means building the model with multiple named outputs rather than a single Dense layer.
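If the targets are currently stored as one concatenated array, the per-output Y datasets could be produced roughly as follows. This is a sketch under assumptions: split_targets is a hypothetical helper, the group sizes (2 and 3) are invented for the example, and only the output names follow the feat1_output pattern used above.

```python
import numpy as np

def split_targets(y, group_sizes, names):
    """Split a concatenated target array along its last axis
    into one array per named output."""
    y = np.asarray(y)
    if len(names) != len(group_sizes):
        raise ValueError("need exactly one name per group")
    if y.shape[-1] != sum(group_sizes):
        raise ValueError("group sizes do not add up to the last dimension of y")
    split_points = np.cumsum(group_sizes)[:-1]
    parts = np.split(y, split_points, axis=-1)
    return dict(zip(names, parts))

# Hypothetical targets: a 2-category feature followed by a 3-category feature.
y_train = np.array([
    [1, 0, 0, 0, 1],
    [0, 1, 0, 1, 0],
    [1, 0, 1, 0, 0],
])

targets = split_targets(y_train, [2, 3], ["feat1_output", "feat2_output"])
for name, arr in targets.items():
    print(name, arr.shape)
```

Because the split runs along the last axis, the same helper works for targets with a time dimension. The resulting dictionary, or its values as a list in output order, is the form of y that a model with correspondingly named output layers would be fitted on.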















          $endgroup$













          answered yesterday









          kylec123

          718
















          • $begingroup$
            My data is shaped in the way I explained in this other question: datascience.stackexchange.com/questions/45867/…, but with size 5 instead of 3. The idea was to predict the one-hot encoded categorical vectors of size 5 for the next timestep at the same time, given the list of all the encodings for the previous timesteps (11 timesteps in the example).
            $endgroup$
            – ginevracoal
            yesterday












          • $begingroup$
            So each of the 5 features can only take a value of 0 or 1, right? They only have a single dimension, so that seems to be what you have. If these variables were one-hot encoded, a timestep would look like [(1,0),(0,1),(1,0)] instead of [0,1,0]. One-hot encoding means that each feature is represented by as many positions as it has classes. If a feature can take 6 values, it takes five 0s and a single 1 (in the "column" for that class) to represent which class it is.
            $endgroup$
            – kylec123
            yesterday










          • $begingroup$
            If you reshape your y-data as I describe above, you can then have 5 separate softmax activations, one for each categorical variable.
            $endgroup$
            – kylec123
            yesterday












          • $begingroup$
            Actually it is slightly different. I concatenated different encodings into the same 5D array because otherwise I would have a 4 dimensional input, but I don't know if it's a reasonable thing to do.
            $endgroup$
            – ginevracoal
            yesterday








          • 1




            $begingroup$
            If you are trying to use a Softmax activation, you will only ever get a probability distribution the size of the "features". I believe you need to reevaluate how you represent your Y-data.
            $endgroup$
            – kylec123
            yesterday
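The one-hot encoding described in the comments can be illustrated with a short NumPy sketch. The helper one_hot is hypothetical, and the mapping of class 0 to (1,0) and class 1 to (0,1) is an assumed convention.

```python
import numpy as np

def one_hot(labels, num_classes):
    """Map integer class labels to one-hot rows."""
    labels = np.asarray(labels)
    return np.eye(num_classes, dtype=int)[labels]

# A timestep with three binary features, stored as plain class labels.
timestep = [0, 1, 0]
encoded = one_hot(timestep, num_classes=2)
print(encoded.tolist())

# A feature with 6 possible values: five 0s and a single 1.
six_way = one_hot([3], num_classes=6)
print(six_way.tolist())
```

Each feature ends up with as many columns as it has classes, which is the size each per-feature softmax output needs.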



























