Memory problem due to large file
I'm new to Python and I'm sure I'm making some mistake somewhere. Here is my problem; thanks in advance for your help.

I have two files, one exported from Hive and the other a CSV, and I am merging them. My workstation has 64 GB of memory, and I believe the merged CSV I create is around 25+ GB.

My problem is that when I connect remotely I can see memory usage hit 100%, and after that I can no longer connect to my workstation at all; it needs a hard reboot.

What I'm thinking is: while merging these two tables, I'd like to write a batch (say 100,000 rows) out to the CSV, free that batch from memory, continue with the next 100,000 rows, append them to the file, and so on.

I'm not sure how to do this. Most of what I found through Google is about reading large files, but not about how to write every 100K-row chunk to a CSV and free it from memory after (or during) the merge.

Any suggestions will help.

Thanks,
Avi

python pandas
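For what it's worth, here is a minimal sketch of the chunked approach described above, assuming the Hive extract is the smaller of the two tables and fits in memory on its own, and that the tables join on a single key column. The file names and the column name "id" are placeholders, not taken from the question:

```python
import pandas as pd

# Placeholder file names and join key -- adjust to the real data.
HIVE_EXPORT = "hive_table.csv"   # assumed: the smaller table, loaded once
BIG_CSV = "big_file.csv"         # the large file, streamed in chunks
OUTPUT = "merged.csv"
KEY = "id"                       # hypothetical join column

# Load the smaller table fully (keeping only needed columns saves memory).
small_df = pd.read_csv(HIVE_EXPORT)

first_chunk = True
# Stream the large csv 100,000 rows at a time instead of loading it whole.
for chunk in pd.read_csv(BIG_CSV, chunksize=100_000):
    merged = chunk.merge(small_df, on=KEY, how="inner")
    # Write the header only for the first chunk, then append.
    merged.to_csv(OUTPUT, mode="w" if first_chunk else "a",
                  header=first_chunk, index=False)
    first_chunk = False
    # chunk and merged are replaced on the next iteration, so only about
    # 100K rows (plus the smaller table) stay in memory at any time.
```

This only works when one of the two tables fits comfortably in memory; if both are huge, the Dask or SQL route suggested in the answer below is a better fit.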
I think this is an X-Y problem (meta.stackexchange.com/questions/66377/what-is-the-xy-problem): you aren't really trying to read a lot of data, you need to read a lot of data in order to do some task. What is that task? Maybe we can find a way to do it without reading all the data at once.

– Mohammad Athar, Dec 12 '18 at 18:48
1 Answer
I guess you are trying to use pandas; if so, don't, because it can't do what you want here. Your operation presumably needs all the data loaded into memory, which is not possible in your case. Try Dask or some kind of SQL instead. I'm not sure whether you need operations such as group-by, which require all the data to be available at once. If you can do some extra coding and your operations don't require loading everything (for example, finding the minimum of a column), you can use generators and specify the chunksize argument of pandas' read_csv, but that means extra work. In general, SQL and Dask are better suited to operations at this scale.

– Vaalizaadeh, answered Nov 11 '18 at 22:10
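As an illustration of the Dask route this answer suggests, a rough sketch might look like the following; the file names and the join column "id" are assumptions, not from the original post:

```python
import dask.dataframe as dd

# Assumed paths and join key -- replace with your own.
left = dd.read_csv("hive_table.csv", blocksize="64MB")
right = dd.read_csv("big_file.csv", blocksize="64MB")

# Dask only builds a lazy task graph here; nothing is loaded into memory yet.
merged = left.merge(right, on="id", how="inner")

# By default to_csv writes one file per partition ("*" is replaced by the
# partition number), so the full result never has to sit in memory at once.
merged.to_csv("merged-*.csv", index=False)
```

The join still involves a shuffle under the hood, so it can be slow, but memory use stays bounded by the partition size rather than the total data size.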