tensorflow categorical data with vocabulary list - Expected binary or Unicode string, got [0,1,2,…]












$begingroup$


I'm brand new to machine learning (having just completed the Google Machine Learning Crash Course) and thought a Kaggle competition would be a good first attempt at some real problem solving. I'm using TensorFlow and Python 3, both up to date (in the Kaggle online Jupyter notebook).



The data is formatted in a dataframe like the one below:



|Identity | Cuisine | Ingredients                |
|---------|---------|----------------------------|
|1 | italian | [beans, milk,..., tomatoes]|
|2 | indian | [chicken, curry leaf,...] |


I wrote a vocabulary generator that builds a vocabulary set and replaces each word in the ingredients array with the index of that ingredient in the vocabulary set, so my data now looks like this:



|Identity | Cuisine | Ingredients |
|---------|---------|-------------|
|1 | italian |[0, 1,..., 4]|
|2 | indian |[5, 6,...] |
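
For illustration, the index replacement described above might look roughly like the sketch below. This is not the asker's actual generator (that code is only in the linked paste); the function names `build_vocabulary` and `encode` and the sample rows are hypothetical, and indices are assumed to be assigned in first-seen order.

```python
# Hypothetical sketch: build an ingredient vocabulary and replace each
# ingredient name with its index. Sample rows are made up for illustration.

def build_vocabulary(ingredient_lists):
    """Assign each distinct ingredient an index in first-seen order."""
    vocabulary = {}
    for ingredients in ingredient_lists:
        for name in ingredients:
            # setdefault only inserts the name if it is not already present
            vocabulary.setdefault(name, len(vocabulary))
    return vocabulary


def encode(ingredient_lists, vocabulary):
    """Replace every ingredient name with its vocabulary index."""
    return [[vocabulary[name] for name in ingredients]
            for ingredients in ingredient_lists]


sample = [
    ["beans", "milk", "tomatoes"],
    ["chicken", "curry leaf", "milk"],
]
vocabulary = build_vocabulary(sample)
encoded = encode(sample, vocabulary)
print(encoded)  # [[0, 1, 2], [3, 4, 1]]
```

Note that each row is still a variable-length list, which is the shape of data that the input function further down has to cope with.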


I separate the labels (cuisine) and the features (ingredients) into two separate dataframes for ease, and I am using a tf.feature_column.categorical_column_with_vocabulary_list followed by a tf.feature_column.indicator_column for the ingredients array.



However, my model cannot read the ingredients column, and I get this error:



TypeError: Expected binary or unicode string, got [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]



My input function is as follows:



    def input_fn(features, labels, batch_size, num_epochs=None, shuffle=True):
        ds = Dataset.from_tensor_slices((features, labels))
        ds = ds.batch(batch_size).repeat(num_epochs)

        if shuffle:
            ds = ds.shuffle(10000)

        feature_batch, label_batch = ds.make_one_shot_iterator().get_next()
        return feature_batch, label_batch


It is passed to the training call like this:



    training_func = lambda: input_fn(training_example, training_target, batch_size)
    validati_func = lambda: input_fn(validation_example, validation_target, batch_size)

    optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate)
    optimizer = tf.contrib.estimator.clip_gradients_by_norm(optimizer, 5.0)

    classifier.train(
        input_fn=training_func,
        steps=steps_per_period
    )


My urgent question is: how do I fix this TypeError?



In addition, I would like to know whether there is a best practice for handling data in this format, and whether there is any built-in functionality for it.










$endgroup$




    $begingroup$
    Since this might be a code-heavy question, I added my entire code to an online Pastebin paste so you can check it out. The dataset I am using is from the Kaggle What's Cooking competition.
    $endgroup$
    – Byren Higgin
    Aug 9 '18 at 3:07


















python tensorflow dataset linear-regression categorical-data
















asked Aug 9 '18 at 3:04









Byren Higgin





1 Answer
$begingroup$

I'm not completely familiar with the TF API, but here's what I think is happening.



The error says the library expected a binary or Unicode string, but each cell of your ingredients column holds a whole list. Converting the ingredient labels to integers does not help, because all of a recipe's ingredients are still packed into a single column.



You can instead create one column per possible ingredient and set it to 1 if that ingredient is present and 0 if it is absent. For example, many Italian cuisine records will have the column for tomatoes or garlic set to 1.
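
As a rough sketch of that idea in plain Python (the helper name `multi_hot` and the sample recipes are made up for illustration, and columns are assumed to be ordered alphabetically by ingredient name):

```python
# Hypothetical sketch: one 0/1 column per ingredient (multi-hot encoding).
# Sample recipes are invented; column order is alphabetical by assumption.

def multi_hot(recipes):
    """Return (columns, rows): rows[i][j] is 1 if recipe i uses columns[j]."""
    columns = sorted({name for ingredients in recipes for name in ingredients})
    position = {name: j for j, name in enumerate(columns)}

    rows = []
    for ingredients in recipes:
        row = [0] * len(columns)
        for name in ingredients:
            row[position[name]] = 1  # mark the ingredient as present
        rows.append(row)
    return columns, rows


recipes = [
    ["tomatoes", "garlic", "milk"],   # e.g. an Italian record
    ["chicken", "garlic"],            # e.g. an Indian record
]
columns, rows = multi_hot(recipes)
print(columns)  # ['chicken', 'garlic', 'milk', 'tomatoes']
print(rows)     # [[0, 1, 1, 1], [1, 1, 0, 0]]
```

Every row now has the same fixed length, so the result is an ordinary numeric table rather than a column of lists.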



You can read more about the get_dummies function in the pandas library. If the original ingredient list comes in the form of text, you can also read up on the text feature extraction / bag-of-words APIs in the scikit-learn library.






$endgroup$














answered Aug 9 '18 at 15:11

hssay





























