Input data of variable length - two scenarios
















I'm trying to figure out how to train a neural network on inputs of variable length. The issue comes up in the following two scenarios.

Scenario 1:
I have a long list of running distances for various runners, with 3 columns: runner, date, distance.
Some runners have many entries and others have few. I'm trying to predict the number of miles a given runner will run next. I'm guessing I need to transform my data to have one row per runner, which gives me variable-length features. How can I deal with this in an ML application?

Scenario 2:
I'd like to take various strings ("teststring", "P@ssword", "NotAPassword123", etc.) and classify each one as a password or not. I'm trying to figure out a) how to convert strings to numbers to train on and b) how to deal with the fact that they have variable length.

Thanks for reading this far...







Tags: text, mlp, features















      asked 2 days ago









Joe R























          2 Answers











































Scenario 1: It seems like you're dealing with columns that may lack data. You have a few options for assigning values to rows that have no information in certain columns, and each has advantages and drawbacks that depend on your dataset. One example is to replace NaN entries with the mean or median of that column, which has the drawback of reducing the variance in your data. Here's an article on the topic that should help you.
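
One way to read this suggestion for the runner data is sketched below, using only the Python standard library. It keeps the most recent `width` distances per runner, marks missing history as `None`, and fills those gaps with the column median. The function names, the ISO-formatted date strings, and the choice of `width` are assumptions made for illustration, not part of the question.

```python
from statistics import median


def to_fixed_width(rows, width):
    """Group (runner, date, distance) rows into one fixed-width row per runner.

    Assumes dates are ISO strings (YYYY-MM-DD) so they sort chronologically.
    Only the `width` most recent distances are kept; shorter histories are
    left-padded with None so the latest run is always the last column.
    """
    by_runner = {}
    for runner, _date, distance in sorted(rows, key=lambda r: (r[0], r[1])):
        by_runner.setdefault(runner, []).append(distance)
    table = {}
    for runner, distances in by_runner.items():
        recent = distances[-width:]
        table[runner] = [None] * (width - len(recent)) + recent
    return table


def impute_median(table):
    """Replace None entries with the median of the observed values in that column."""
    if not table:
        return {}
    width = len(next(iter(table.values())))
    fills = []
    for col in range(width):
        observed = [row[col] for row in table.values() if row[col] is not None]
        # Fall back to 0.0 when no runner has a value in this column (an assumption).
        fills.append(median(observed) if observed else 0.0)
    return {
        runner: [fills[i] if value is None else value for i, value in enumerate(row)]
        for runner, row in table.items()
    }


rows = [
    ("ann", "2019-01-01", 3.0),
    ("ann", "2019-01-03", 5.0),
    ("ann", "2019-01-05", 4.0),
    ("bob", "2019-01-02", 6.0),
]
table = to_fixed_width(rows, 3)   # bob becomes [None, None, 6.0]
filled = impute_median(table)     # bob becomes [3.0, 5.0, 6.0]
```

As the answer warns, the filled values pull short histories toward the typical runner, so it may be worth adding a feature that records how many real entries each runner had.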



Scenario 2: For part "b", a common approach is to pick a length that should be "big enough" and pad the sequences (or, in your case, strings) that are "too short". For part "a", a very simple approach would be to apply bag of words at the character level. Alternatively, you could experiment with training a character embedding model on your password text; such a model would create a vectorized representation of your text that you can feed to whatever model you use for password classification.
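
A minimal sketch of both ideas, padding to a fixed length and a character-level bag of words, might look like the following. The alphabet, the reserved ids for padding and unknown characters, and the function names are assumptions chosen for illustration; strings longer than `max_len` are simply truncated here.

```python
import string

# Assumed alphabet: ASCII letters, digits, and punctuation.
ALPHABET = string.ascii_letters + string.digits + string.punctuation
PAD, UNK = 0, 1  # reserved ids: padding and "character not in ALPHABET"
CHAR_TO_ID = {ch: i + 2 for i, ch in enumerate(ALPHABET)}


def encode_padded(text, max_len):
    """Map each character to an integer id, truncate to max_len, right-pad with PAD."""
    ids = [CHAR_TO_ID.get(ch, UNK) for ch in text[:max_len]]
    return ids + [PAD] * (max_len - len(ids))


def bag_of_chars(text):
    """Count how often each known character occurs; character order is discarded."""
    counts = [0] * (len(ALPHABET) + 2)
    for ch in text:
        counts[CHAR_TO_ID.get(ch, UNK)] += 1
    return counts


padded = encode_padded("P@ssword", 12)   # 8 character ids followed by 4 PAD ids
counts = bag_of_chars("P@ssword")        # fixed-length vector regardless of input length
```

The padded id sequence is the kind of input an embedding layer would consume, while the count vector can go straight into a plain feed-forward network.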















                  edited yesterday






























                  answered 2 days ago









Andrei Ungur



























For scenario 1, if you have enough data for each runner, you could build a separate model for each of them; otherwise, you can add the runner as a categorical variable by one-hot encoding your runners and then try out your model.
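
As a rough illustration of the one-hot option, the sketch below turns a list of runner names into indicator vectors using plain Python. The helper name and the sample names are hypothetical; in practice a library encoder would likely do the same job.

```python
def one_hot_runners(runners):
    """Return the sorted category list and one indicator vector per input entry."""
    categories = sorted(set(runners))
    index = {name: i for i, name in enumerate(categories)}
    vectors = []
    for name in runners:
        row = [0] * len(categories)
        row[index[name]] = 1  # exactly one position is set per entry
        vectors.append(row)
    return categories, vectors


categories, vectors = one_hot_runners(["ann", "bob", "ann"])
# categories == ["ann", "bob"]
# vectors == [[1, 0], [0, 1], [1, 0]]
```

Each vector can be concatenated with the other per-run features, so a single model sees which runner a row belongs to.
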
For scenario 2, you can create fixed-size vectors for each string by handcrafting features, such as the count of consonants, the count of vowels, the presence of special characters, etc.
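
A small sketch of such handcrafted features is shown below. The particular feature set (length, vowels, consonants, digits, and a special-character flag) and the function name are assumptions for illustration; any fixed list of per-string statistics would work the same way.

```python
VOWELS = set("aeiouAEIOU")


def handcrafted_features(text):
    """Return [length, vowels, consonants, digits, has_special] for a string.

    Every alphabetic character that is not an ASCII vowel is counted as a
    consonant, which is a simplification for non-English letters.
    """
    letters = [ch for ch in text if ch.isalpha()]
    vowels = sum(ch in VOWELS for ch in letters)
    digits = sum(ch.isdigit() for ch in text)
    specials = sum(not ch.isalnum() for ch in text)
    return [len(text), vowels, len(letters) - vowels, digits, int(specials > 0)]


handcrafted_features("P@ssword")         # [8, 1, 6, 0, 1]
handcrafted_features("NotAPassword123")  # [15, 4, 8, 3, 0]
```

Because every string maps to a vector of the same size, the variable-length problem disappears before the data reaches the network.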

















                          answered yesterday









Atif Hassan






















