Efficiently training big models on big dataframes with big samples, with cross-validation and shuffling, and…












Context



There is a CSV of a ~10 GB time series (2 GB compressed), with shape (629145481, 2). I can load it entirely into RAM as float32, but then very little memory is left for other operations (such as training deep learning models with many parameters), and/or I am limited to relatively small batches.



As additional information: each sample's input must consist of 150,000 data points of the first feature, and its output is the value of the second feature at the last index of that window. Also, the batch size would ideally be at least 1024. Furthermore, we aim to use cross-validation (number of folds >= 3).



Without cross-validation it would be enough to use pandas read_csv with chunksize=150000; however, cross-validation implies accessing different parts of the CSV in each epoch (reloading the CSV and skipping rows in each fold would be far too time- and CPU-consuming).
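For reference, the no-cross-validation baseline would look roughly like this (a minimal sketch; the file name and column positions are assumptions):

```python
import pandas as pd

# Stream the CSV in windows of 150,000 rows instead of loading it whole.
# (The last chunk may be shorter than a full window.)
reader = pd.read_csv("data.csv", dtype="float32", chunksize=150_000)
for window in reader:
    x = window.iloc[:, 0].to_numpy()   # input: the 150,000 points of the first feature
    y = window.iloc[-1, 1]             # target: second feature at the window's last index
    # ... collect (x, y) into batches and feed the model ...
```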



My research



I figured out a way to do cross-validation with such a big CSV. I first load it entirely as a float32 dataframe, then build a heavily optimized batch generator that works only with indices, and finally reshape only the data that is actually needed. This left enough memory for models with relatively many parameters. However, the model overfit badly because of the absence of shuffling.
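Roughly, the idea is something like this (a simplified sketch, not my exact code; the names are illustrative):

```python
import numpy as np

WINDOW = 150_000

def batch_generator(data, start_indices, batch_size=1024):
    """Yield (X, y) batches from an in-memory (n_rows, 2) float32 array.

    Each start index selects a 150,000-point window of the first feature;
    the target is the second feature at that window's last index.
    """
    for i in range(0, len(start_indices), batch_size):
        starts = start_indices[i:i + batch_size]
        X = np.stack([data[s:s + WINDOW, 0] for s in starts])
        y = np.asarray([data[s + WINDOW - 1, 1] for s in starts])
        yield X, y
```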



Shuffling the windows implies creating a permutation that overflows my RAM, so it is not viable. One solution would be to use Python's sqlite3 and work against a database instead of keeping everything in memory, but I want to find alternative solutions.
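As a back-of-the-envelope check of why the full permutation does not fit:

```python
import numpy as np

N_ROWS, WINDOW = 629_145_481, 150_000
n_windows = N_ROWS - WINDOW + 1

# ~4.7 GiB for an int64 permutation, ~2.3 GiB even as int32,
# on top of the ~4.7 GiB already taken by the float32 data itself.
print(n_windows * 8 / 2**30, n_windows * 4 / 2**30)

# perm = np.random.permutation(n_windows)   # the allocation that blows up
```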



I then figured out two ways of getting windows "semi-randomly" with batch generators (both are sketched in code after the list):




  1. Iterating over the time series with some stride (stride=150000 or stride=75000 is fine, but a lower stride generates too many windows to fit in the available memory) and adding a random offset to each step. This creates good variation in the dataset, but I am still limited to that stride and it is not a real shuffle. Additionally, I can't use models with more than 3 million parameters.


  2. Generating random indices directly within the folds available for training / validation. The disadvantage is that the same data can be reused within the same epoch.
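Roughly, the two sampling schemes look like this (a minimal sketch; the jitter range and variable names are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
N_ROWS, WINDOW = 629_145_481, 150_000

def jittered_starts(stride=150_000, max_jitter=75_000):
    """Option 1: strided window starts, each shifted by a random offset."""
    starts = np.arange(0, N_ROWS - WINDOW, stride, dtype=np.int64)
    jitter = rng.integers(0, max_jitter, size=len(starts))
    return np.minimum(starts + jitter, N_ROWS - WINDOW)

def random_starts_in_fold(fold_lo, fold_hi, n_samples):
    """Option 2: uniform random starts within a fold's row range.

    Draws are independent, so the same window can appear twice in an epoch.
    """
    return rng.integers(fold_lo, fold_hi - WINDOW, size=n_samples, dtype=np.int64)
```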



I finally found a library that seems useful for this type of problem: Dask. It splits the dataframe into multiple partitions and does not load data that is not needed, so it could be useful for me. However, in order to slice the time series as a single series instead of a partitioned one, we have to set a proper index (e.g. to use .loc[…]); otherwise duplicated data is retrieved, because the default index is repeated in each partition. This means an index should be pre-generated for the CSV. I tried to do this, but I also got a MemoryError when reading the dataframe and adding an index set to range(629145481). Maybe a solution would be to read the dataframe in chunks and dump it sequentially.
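One way to realize that last idea (read in chunks, dump sequentially) might look like this; it is a sketch I have not run at this scale, and the file names, chunk size and Parquet output are assumptions:

```python
import numpy as np
import pandas as pd

offset = 0
for i, chunk in enumerate(pd.read_csv("data.csv", dtype="float32",
                                      chunksize=10_000_000)):
    # Store the global row position explicitly so the partitions can later
    # be treated as one continuous, sliceable series.
    chunk["row_id"] = np.arange(offset, offset + len(chunk))
    offset += len(chunk)
    chunk.to_parquet(f"part_{i:04d}.parquet", index=False)

# Later, something like (not verified end-to-end):
# import dask.dataframe as dd
# ddf = dd.read_parquet("part_*.parquet").set_index("row_id", sorted=True)
```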



Desired answer



I am looking for improvements to the solutions I mentioned, or for different ones, that efficiently handle this dataframe with 13 GB of free RAM while allowing cross-validation, shuffling, and models with many parameters. Is there any solution better than using a database?










time-series pandas preprocessing csv






asked 2 days ago by freesoul












  • As you mentioned, the most efficient way to do data preparation is in a database. For a toy problem one can expect to load the whole dataset into memory; most real-world problems require dealing with terabytes of data. – Shamit Verma, 2 days ago

















