tensorflow categorical data with vocabulary list - Expected binary or Unicode string, got [0,1,2,…]

I'm brand new to machine learning (having just completed the Google Machine Learning Crash Course) and thought a Kaggle competition would be a good first attempt at some real problem solving. I'm using TensorFlow and Python 3, both up to date (in the Kaggle online Jupyter notebook).

The data is formatted in a dataframe like below

| Identity | Cuisine | Ingredients                 |
|----------|---------|-----------------------------|
| 1        | italian | [beans, milk,..., tomatoes] |
| 2        | indian  | [chicken, curry leaf,...]   |

I have written a vocabulary list generator that builds a vocabulary set and replaces each word in the ingredients array with the index of that ingredient in the vocabulary set, so my data now looks like this:

| Identity | Cuisine | Ingredients   |
|----------|---------|---------------|
| 1        | italian | [0, 1,..., 4] |
| 2        | indian  | [5, 6,...]    |
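For readers following along, the vocabulary-indexing step described above could be sketched roughly as follows. This is a minimal illustration and not the code from the question: the function names `build_vocabulary` and `encode`, and the sample recipes, are assumptions made up for the example.

```python
def build_vocabulary(recipes):
    """Map each distinct ingredient to an index, in order of first appearance."""
    vocabulary = {}
    for ingredients in recipes:
        for ingredient in ingredients:
            if ingredient not in vocabulary:
                vocabulary[ingredient] = len(vocabulary)
    return vocabulary


def encode(recipes, vocabulary):
    """Replace every ingredient name with its vocabulary index."""
    return [[vocabulary[ingredient] for ingredient in ingredients]
            for ingredients in recipes]


# Hypothetical sample data standing in for the Ingredients column.
recipes = [
    ["beans", "milk", "tomatoes"],
    ["chicken", "curry leaf", "milk"],
]

vocabulary = build_vocabulary(recipes)
encoded = encode(recipes, vocabulary)
print(vocabulary)
print(encoded)
```

The encoded rows keep the length of the original ingredient lists, so in general they will not all be the same length.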

I separate the labels (cuisine) and the features (ingredients) into two separate dataframes for ease, and I am using a tf.feature_column.categorical_column_with_vocabulary_list followed by a tf.feature_column.indicator_column for the ingredients array.

However, my model is now unable to read the ingredients column, and I get the error

TypeError: Expected binary or unicode string, got [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

My input function is as follows:

def input_fn(features, labels, batch_size, num_epochs=None, shuffle=True):
    ds = Dataset.from_tensor_slices((features, labels))
    ds = ds.batch(batch_size).repeat(num_epochs)

    if shuffle:
        ds = ds.shuffle(10000)

    feature_batch, label_batch = ds.make_one_shot_iterator().get_next()
    return feature_batch, label_batch

It is then fed into a simple training setup, as below:

training_func = lambda: input_fn(training_example, training_target, batch_size)
validati_func = lambda: input_fn(validation_example, validation_target, batch_size)

optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate)
optimizer = tf.contrib.estimator.clip_gradients_by_norm(optimizer, 5.0)

classifier.train(
    input_fn=training_func,
    steps=steps_per_period
)

My urgent question is: how do I fix this TypeError?

In addition, I would like to know whether there is a best practice for handling data in this format, and whether there is any built-in functionality for it.

asked Aug 9 '18 at 3:04

Byren Higgin

Since this might be a code-heavy question, I added my entire code to an online Pastebin paste so you can check it out. The dataset I am using is from the Kaggle What's Cooking competition.
– Byren Higgin
Aug 9 '18 at 3:07

python tensorflow dataset linear-regression categorical-data


1 Answer

I'm not completely familiar with the TF API, but here's what I think is happening.

The error message says the library expected a binary (byte) string or a Unicode string, but you have all the ingredients of a recipe listed in a single column, so each cell holds a whole list. Converting the ingredient labels to integers therefore does not help.

You can instead create one column per possible ingredient and set it to 1 if that ingredient is present and 0 if it is absent. For example, Italian cuisine will have the columns for tomatoes or garlic set to 1 in many records.
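As a rough illustration of that one-column-per-ingredient idea, here is a minimal sketch in plain Python. The helper name `multi_hot`, the small vocabulary and the two sample recipes are all assumptions invented for the example; they are not taken from the question's code or dataset.

```python
def multi_hot(recipes, vocabulary):
    """Return one 0/1 row per recipe, with one column per vocabulary entry."""
    column_of = {ingredient: i for i, ingredient in enumerate(vocabulary)}
    rows = []
    for ingredients in recipes:
        row = [0] * len(vocabulary)
        for ingredient in ingredients:
            row[column_of[ingredient]] = 1  # mark the ingredient as present
        rows.append(row)
    return rows


# Hypothetical vocabulary and recipes, for illustration only.
vocabulary = ["beans", "chicken", "garlic", "milk", "tomatoes"]
recipes = [
    ["tomatoes", "garlic", "beans"],
    ["chicken", "garlic"],
]

encoded = multi_hot(recipes, vocabulary)
for row in encoded:
    print(row)
```

Every row has the same width (the vocabulary size) regardless of how many ingredients the recipe lists, which makes the result a regular numeric table.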

You can read more about the get_dummies function in the pandas library. If the original ingredient list comes in the form of text, you can read up on the text feature extraction / bag-of-words APIs in the scikit-learn library.
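To show the shape of a bag-of-words result when the ingredients arrive as free text, here is a hand-rolled sketch using only the standard library. It is not scikit-learn's implementation: the tokenisation rule (lowercase, runs of letters), the helper names `tokenize` and `bag_of_words`, and the sample strings are assumptions chosen for the example.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into runs of letters (an assumed rule)."""
    return re.findall(r"[a-z]+", text.lower())


def bag_of_words(documents):
    """Return a sorted vocabulary and one row of token counts per document."""
    counts = [Counter(tokenize(document)) for document in documents]
    vocabulary = sorted(set().union(*counts))
    matrix = [[count[token] for token in vocabulary] for count in counts]
    return vocabulary, matrix


# Hypothetical raw-text ingredient strings, for illustration only.
documents = ["Tomatoes, garlic, tomatoes", "chicken, garlic"]

vocabulary, matrix = bag_of_words(documents)
print(vocabulary)
print(matrix)
```

Unlike a 0/1 indicator table, each cell here is a count, so a token that appears twice in one document gets the value 2.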

answered Aug 9 '18 at 15:11

hssay
