Efficiently training big models on big dataframes with big samples, with cross-validation and shuffling, and…












Context



There is a CSV of a ~10 GB time series (2 GB compressed), with shape (629145481, 2). I can load it entirely into RAM as float32, but then very little memory is left for other operations (such as training deep learning models with many parameters), and/or I am limited to relatively small batches.



As additional information: each sample's input must consist of 150,000 data points of the first feature, and its output is the value of the second feature at the last index of that window. Also, the batch size would ideally be at least 1024. Furthermore, we aim to use cross-validation (number of folds >= 3).



Without cross-validation it would be enough to use pandas read_csv with chunksize=150000; however, cross-validation implies accessing different parts of the CSV in each epoch (reloading the CSV and skipping rows in each fold would be far too time- and CPU-consuming).
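For reference, the no-cross-validation baseline would look roughly like this (a minimal sketch; the file name and column positions are assumptions):

```python
import pandas as pd

# Stream the CSV in windows of 150,000 rows instead of loading it whole.
# (The last chunk may be shorter than a full window.)
reader = pd.read_csv("data.csv", dtype="float32", chunksize=150_000)
for window in reader:
    x = window.iloc[:, 0].to_numpy()   # input: the 150,000 points of the first feature
    y = window.iloc[-1, 1]             # target: second feature at the window's last index
    # ... collect (x, y) into batches and feed the model ...
```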



My research



I figured out a way to do cross-validation with such a big CSV. I first load it entirely as a float32 dataframe, then build a heavily optimized batch generator that works only with indices, and finally reshape only the data that is actually needed. This left enough memory for models with relatively many parameters. However, the model overfit badly because of the absence of shuffling.
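Roughly, the idea is something like this (a simplified sketch, not my exact code; the names are illustrative):

```python
import numpy as np

WINDOW = 150_000

def batch_generator(data, start_indices, batch_size=1024):
    """Yield (X, y) batches from an in-memory (n_rows, 2) float32 array.

    Each start index selects a 150,000-point window of the first feature;
    the target is the second feature at that window's last index.
    """
    for i in range(0, len(start_indices), batch_size):
        starts = start_indices[i:i + batch_size]
        X = np.stack([data[s:s + WINDOW, 0] for s in starts])
        y = np.asarray([data[s + WINDOW - 1, 1] for s in starts])
        yield X, y
```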



Shuffling the windows implies creating a permutation that overflows my RAM, so it is not viable. One solution would be to use Python's sqlite3 and work against a database instead of keeping everything in memory, but I want to find alternative solutions.
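As a back-of-the-envelope check of why the full permutation does not fit:

```python
import numpy as np

N_ROWS, WINDOW = 629_145_481, 150_000
n_windows = N_ROWS - WINDOW + 1

# ~4.7 GiB for an int64 permutation, ~2.3 GiB even as int32,
# on top of the ~4.7 GiB already taken by the float32 data itself.
print(n_windows * 8 / 2**30, n_windows * 4 / 2**30)

# perm = np.random.permutation(n_windows)   # the allocation that blows up
```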



I then figured out two ways of getting windows "semi-randomly" with batch generators (both are sketched in code after the list):




  1. Iterating over the time series with some stride (stride=150000 or stride=75000 is fine, but a lower stride generates too many windows to fit in the available memory) and adding a random offset to each step. This creates good variation in the dataset, but I am still limited to that stride and it is not a real shuffle. Additionally, I can't use models with more than 3 million parameters.


  2. Generating random indices directly within the folds available for training / validation. The disadvantage is that the same data can be reused within the same epoch.
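Roughly, the two sampling schemes look like this (a minimal sketch; the jitter range and variable names are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
N_ROWS, WINDOW = 629_145_481, 150_000

def jittered_starts(stride=150_000, max_jitter=75_000):
    """Option 1: strided window starts, each shifted by a random offset."""
    starts = np.arange(0, N_ROWS - WINDOW, stride, dtype=np.int64)
    jitter = rng.integers(0, max_jitter, size=len(starts))
    return np.minimum(starts + jitter, N_ROWS - WINDOW)

def random_starts_in_fold(fold_lo, fold_hi, n_samples):
    """Option 2: uniform random starts within a fold's row range.

    Draws are independent, so the same window can appear twice in an epoch.
    """
    return rng.integers(fold_lo, fold_hi - WINDOW, size=n_samples, dtype=np.int64)
```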



I finally found a library that seems useful for this type of problem: Dask. It splits the dataframe into multiple partitions and does not load data that is not needed, so it could be useful for me. However, in order to slice the time series as a single series instead of a partitioned one, we have to set a proper index (e.g. to use .loc[…]); otherwise duplicated data is retrieved, because the default index is repeated in each partition. This means an index should be pre-generated for the CSV. I tried to do this, but I also got a MemoryError when reading the dataframe and adding an index set to range(629145481). Maybe a solution would be to read the dataframe in chunks and dump it sequentially.
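One way to realize that last idea (read in chunks, dump sequentially) might look like this; it is a sketch I have not run at this scale, and the file names, chunk size and Parquet output are assumptions:

```python
import numpy as np
import pandas as pd

offset = 0
for i, chunk in enumerate(pd.read_csv("data.csv", dtype="float32",
                                      chunksize=10_000_000)):
    # Store the global row position explicitly so the partitions can later
    # be treated as one continuous, sliceable series.
    chunk["row_id"] = np.arange(offset, offset + len(chunk))
    offset += len(chunk)
    chunk.to_parquet(f"part_{i:04d}.parquet", index=False)

# Later, something like (not verified end-to-end):
# import dask.dataframe as dd
# ddf = dd.read_parquet("part_*.parquet").set_index("row_id", sorted=True)
```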



Desired answer



I am looking for improvements to the solutions I mentioned, or for different ones, that efficiently handle this dataframe with 13 GB of free RAM while allowing cross-validation, shuffling, and models with many parameters. Is there any solution better than using a database?










time-series pandas preprocessing csv






asked 2 days ago by freesoul












  • As you mentioned, the most efficient way to do data preparation is in a database. For a toy problem one can expect to load the whole dataset into memory; most real-world problems require dealing with terabytes of data. – Shamit Verma, 2 days ago

















