How to extract technical skill words from the most frequent words












$begingroup$


I've scraped 30 job descriptions from the web and stored them in a list called job_desc, where each item is a job description.



import nltk
from nltk.corpus import stopwords

# each item is a list of tokens from one job description
tok = [nltk.word_tokenize(job.lower()) for job in job_desc]

# ignore stop words, bullets, etc., and put everything into one flat list
stop = set(stopwords.words('english'))

def clean_token(what_to_clean):
    cleaned_tok = []
    for lists in what_to_clean:
        for item in lists:
            # keep tokens longer than 2 characters that are not stop words
            if len(item) > 2 and item not in stop:
                cleaned_tok.append(item)
    return cleaned_tok


After cleaning the job descriptions, I found the most frequent words using:



freq = nltk.FreqDist(clean_token(tok))
most_freq_words = freq.most_common(100)


Which outputs:



[('data', 211),
 ('experience', 78),
 ('learning', 70),
 ('business', 65),
 ('team', 53),
 ('science', 51),
 ('machine', 48),.....


From here, I only want to extract technical skill words such as machine, python, or C+. How can I go about it?
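To make the goal concrete, here is a minimal sketch of one possible approach: filtering the frequency list against a hand-maintained set of skill words. The names SKILL_VOCAB and filter_skills are hypothetical, the vocabulary is an illustrative assumption rather than a real skills list, and the count for "python" is made up for the example.

```python
# Hypothetical, hand-maintained vocabulary (an assumption for illustration);
# a real list would be much longer.
SKILL_VOCAB = {"python", "sql", "machine", "spark", "tensorflow"}

def filter_skills(freq_pairs, vocab=SKILL_VOCAB):
    """Keep only (word, count) pairs whose word is in the skill vocabulary."""
    return [(word, count) for word, count in freq_pairs if word in vocab]

# Sample pairs in the same shape as freq.most_common(100);
# the ("python", 30) entry is invented for this example.
sample = [("data", 211), ("experience", 78), ("machine", 48), ("python", 30)]
skills = filter_skills(sample)
print(skills)
```

With the real data, freq.most_common(100) could be passed in place of sample. The obvious weakness is that the vocabulary has to be built by hand, which is the part I would like to avoid.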



Also, the word "machine" shows up 48 times, and I am not sure whether it refers to machine learning. How can I check this? I know that if I wanted to make predictions, I could have used CountVectorizer and n-grams.
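One way to check what "machine" refers to might be to count adjacent word pairs and look at what follows it. The sketch below uses only the standard library; bigram_counts is a hypothetical helper and the token lists are made up. It pairs tokens within each description separately (as with tok above, before flattening), so pairs never span two job descriptions.

```python
from collections import Counter

def bigram_counts(token_lists):
    """Count adjacent word pairs within each document's token list."""
    counts = Counter()
    for tokens in token_lists:
        # zip(tokens, tokens[1:]) yields each consecutive pair of tokens
        counts.update(zip(tokens, tokens[1:]))
    return counts

# Made-up token lists standing in for the tokenized job descriptions
docs = [
    ["experience", "with", "machine", "learning", "models"],
    ["machine", "learning", "and", "virtual", "machine", "images"],
]
pairs = bigram_counts(docs)

# Words that directly follow "machine", with how often each occurs
after_machine = {b: n for (a, b), n in pairs.items() if a == "machine"}
print(after_machine)
```

If most of the 48 occurrences are followed by "learning", the count is probably about machine learning. Note that removing stop words before pairing would change which words count as adjacent.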










$endgroup$

















      python nltk






      asked 12 mins ago
      h_musk





















