What kind of a fit would be suitable for this?

Below is a scatter plot of the data set I am dealing with. The X axis is the total number of words per essay for a particular individual, and the Y axis is the number of unique words. In principle, the number of unique words should approach the individual's vocabulary.

I am attempting to estimate that individual's vocabulary from the data below, but I don't know what kind of fit would work. A logarithm has no upper limit, and a quadratic fit doesn't make sense (the gradient should remain non-negative over the entire domain).

In short, I am looking for a decent model to fit the data below, and don't know where to start.

Thank you.

[Image: scatter plot of data set]
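One family of curves with the properties asked for here (non-negative gradient everywhere, finite upper limit) is the saturating form y = V * x / (K + x), where V is the asymptote. The sketch below is only an illustration under stated assumptions: the functional form, the names `fit_saturating`, `V` and `K`, and the synthetic data are all hypothetical and do not come from the plot. It fits the curve by ordinary least squares on the double-reciprocal form 1/y = 1/V + (K/V) * (1/x).

```python
def fit_saturating(xs, ys):
    """Fit y = V * x / (K + x) via linear regression on (1/x, 1/y).

    1/y = 1/V + (K/V) * (1/x) is a straight line in the reciprocals,
    so the intercept estimates 1/V and the slope estimates K/V.
    """
    u = [1.0 / x for x in xs]
    w = [1.0 / y for y in ys]
    n = len(u)
    mean_u = sum(u) / n
    mean_w = sum(w) / n
    sxx = sum((ui - mean_u) ** 2 for ui in u)
    sxy = sum((ui - mean_u) * (wi - mean_w) for ui, wi in zip(u, w))
    slope = sxy / sxx
    intercept = mean_w - slope * mean_u
    V = 1.0 / intercept  # asymptote (the "vocabulary" in this toy model)
    K = slope * V        # essay length at which y reaches V / 2
    return V, K


# Synthetic, noise-free data generated from ASSUMED values V=3000, K=5000.
true_V, true_K = 3000.0, 5000.0
essay_lengths = [500, 1000, 2000, 4000, 8000, 12000]
unique_counts = [true_V * x / (true_K + x) for x in essay_lengths]

V_hat, K_hat = fit_saturating(essay_lengths, unique_counts)
print(V_hat, K_hat)  # should be close to the assumed 3000 and 5000
```

On real, noisy counts the reciprocal transform gives the shortest essays the most weight, so a nonlinear least-squares fit on the original scale is probably the better choice. The fitted V is also only meaningful if the data visibly bend toward a plateau.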

















Tags: python, scikit-learn, regression, linear-regression, model-selection






Asked 2 days ago by Mir




1 Answer
In my opinion, this estimate cannot be obtained from this plot alone, because:

1. From 4,000 words onward, the number of unique words grows linearly, by about 250 per 2K words: (4K, 1.25K), (6K, 1.5K), (8K, 1.75K), (10K, 2K), (12K, 2.25K). There is therefore not enough evidence to hypothesize an upper bound for this linear growth.

2. On average, an adult knows 20K-35K unique words, but this plot only goes up to about 2K, which is far below the expected final value. Extrapolating from 2K to 20K is unreliable.

Vocabulary of Shakespeare

Estimating a person's vocabulary is quite complicated. The paper below estimates the vocabulary of Shakespeare. He used 31K unique words in all of his writings. The paper estimates that he knew at least 35K more words that he did not use (a vocabulary of at least 66K). As you can see, the estimated vocabulary is only about twice the observed count, which highlights how unreliable it is to extrapolate from 2K to 20K and beyond.

Efron and Thisted (1976), "Estimating the number of unseen species: How many words did Shakespeare know?"
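Point 1 can be checked directly from the coordinates quoted in the list. The short sketch below uses only those five (total words, unique words) pairs, which are approximate readings from the plot rather than the raw data, and computes the slope between consecutive points. The variable names are chosen here for illustration.

```python
# (total words, unique words) pairs as quoted in point 1 of the answer.
points = [(4000, 1250), (6000, 1500), (8000, 1750), (10000, 2000), (12000, 2250)]

# Slope between each pair of consecutive points: new unique words per word written.
slopes = [
    (y1 - y0) / (x1 - x0)
    for (x0, y0), (x1, y1) in zip(points, points[1:])
]
print(slopes)  # [0.125, 0.125, 0.125, 0.125]

# A curve approaching an upper limit would show slopes that shrink with x.
slope_changes = [b - a for a, b in zip(slopes, slopes[1:])]
print(slope_changes)  # [0.0, 0.0, 0.0]
```

Because the slope does not decrease anywhere in this range, these points carry no information about where (or whether) the curve levels off, so any fitted asymptote would be determined by the choice of model rather than by the data.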









Answered 2 days ago by Esmailian








• That paper was immensely helpful! And I agree with you: this estimation is not as trivial as I initially thought. The good thing is I have some more data on the way. Thank you very much! – Mir, 16 hours ago













