To remove Chinese characters as features

I have created a document-term matrix using TfidfVectorizer, but just noticed that the features contain Chinese characters. Is it possible to remove them using Python's regex?

I believe these characters are one of the reasons for the lower prediction accuracy of my model.

Currently I use the following to pre-process my data:

    import re

    # Pre-processing the data
    def text_preprocess(data):
        # Changing to lower case
        data = data.lower()
        # Removing digits and special characters
        data = re.sub(r"(\d|\W)+", " ", data)
        return data

Also, please note that I used stop_words='english' in my TfidfVectorizer.

Please let me know if any more information is required. (New here, still learning.)

asked 2 days ago

ranit.b

427

machine-learning python feature-extraction


1 Answer

If you want to remove non-English characters, this regex will work. It selects characters outside a given ASCII range (0 to 122 here; you can adjust the upper bound, since this range still allows some special characters):

    [^\x00-\x7A]+

So, to remove those characters (here using the full ASCII range, 0 to 127):

    data = re.sub(r"[^\x00-\x7F]+", " ", data)
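A minimal sketch of how this substitution could be folded into the asker's `text_preprocess` function. It assumes Python 3, where `\W` is Unicode-aware for `str` patterns, so Chinese characters count as word characters and survive the original `(\d|\W)+` step. The sample string is made up for illustration, and the final `strip()` is an addition that the original function does not have.

```python
import re

def text_preprocess(data):
    # Lower-case the text
    data = data.lower()
    # Replace every run of non-ASCII characters (code points above 0x7F) with a space
    data = re.sub(r"[^\x00-\x7F]+", " ", data)
    # Replace runs of digits and non-word characters with a single space
    data = re.sub(r"(\d|\W)+", " ", data)
    # Drop leading/trailing spaces left behind by the substitutions (added here)
    return data.strip()

sample = "Great phone 手机 很好, 5 stars!"  # hypothetical input
cleaned = text_preprocess(sample)
print(cleaned)
```

Running the non-ASCII filter before the `\W` step means the Chinese characters never reach the tokenizer, so they should no longer appear as features.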

answered yesterday

Dan Carter

6451215

Perfect. I was thinking along the same lines, i.e. excluding all non-keyboard characters, but then realised that someone might have Chinese characters on their keyboard. :) You rightly pointed to the ASCII codes. Thanks.
– ranit.b yesterday
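One caveat with the ASCII approach is that it also drops accented Latin letters. If only Chinese characters should go, a narrower pattern is possible. The sketch below assumes that "Chinese characters" means the CJK Unified Ideographs block (U+4E00 to U+9FFF); the names `CJK_PATTERN` and `remove_cjk` are hypothetical, and rarer ideographs in the extension blocks are not covered.

```python
import re

# Assumption: target only the CJK Unified Ideographs block (U+4E00 to U+9FFF).
CJK_PATTERN = re.compile(r"[\u4e00-\u9fff]+")

def remove_cjk(text):
    # Replace each run of CJK ideographs with a single space,
    # leaving other non-ASCII characters (e.g. accented letters) intact
    return CJK_PATTERN.sub(" ", text)

print(remove_cjk("café手机naïve"))
```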
