Is the R language suitable for Big Data?
R has many libraries aimed at data analysis (e.g. JAGS, BUGS, arules), and it is mentioned in popular textbooks such as J. Kruschke, "Doing Bayesian Data Analysis" and B. Lantz, "Machine Learning with R".
I've seen a guideline of 5 TB for a dataset to be considered Big Data.
My question is: Is R suitable for the amount of data typically seen in Big Data problems?
Are there strategies to be employed when using R with datasets of this size?
Tags: bigdata, r
asked May 14 '14 at 11:15 by akellyirl; edited May 14 '14 at 13:06 by Konstantin V. Salikhov
In addition to the answers below, a good thing to remember is that most of what you need from R for Big Data can be done with summary datasets that are very small compared to the raw logs. Sampling from the raw log also provides a seamless way to use R for analysis without the headache of parsing line after line of a raw log. For example, for a common modelling task at work I routinely use MapReduce to summarize 32 GB of raw logs into 28 MB of user data for modelling.
– cwharland
May 14 '14 at 17:45
9 Answers
Actually, this is coming around. The book "R in a Nutshell" even has a section on using R with Hadoop for big data processing. Some workarounds are needed because R does all its work in memory, so you are basically limited to the amount of RAM you have available.
A mature project for R and Hadoop is RHadoop. RHadoop has been divided into several sub-projects: rhdfs, rhbase, rmr2, plyrmr, and quickcheck (wiki).
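As a rough sketch of what this looks like in practice, here is a minimal word-count job written with rmr2, assuming a working Hadoop installation and the RHadoop packages; the HDFS input path is hypothetical:

    library(rmr2)

    # Minimal word count; the HDFS input path below is hypothetical.
    counts <- mapreduce(
      input        = "/data/raw/lines.txt",
      input.format = "text",
      map = function(k, lines) {
        words <- unlist(strsplit(lines, "\\s+"))
        keyval(words, 1L)                      # emit (word, 1) pairs
      },
      reduce = function(word, ones) keyval(word, sum(ones))
    )

    out <- from.dfs(counts)                    # pull the (small) result back into R
    head(values(out))

The point is that the heavy lifting happens on the Hadoop cluster; R only sees the keys and values handed to each map and reduce call, plus the final aggregated output.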
answered May 14 '14 at 11:24 by MCP_infiltrator
But does using R with Hadoop overcome this limitation (having to do computations in memory)?
– Felipe Almeida
Jun 9 '14 at 23:07
RHadoop does overcome this limitation. The tutorial here, github.com/RevolutionAnalytics/rmr2/blob/master/docs/…, spells it out clearly. You need to shift into a MapReduce mindset, but it does bring the power of R to the Hadoop environment.
– Steve Kallestad♦
Jun 11 '14 at 6:34
Two newer alternatives that are worth mentioning are SparkR (databricks.com/blog/2015/06/09/…) and H2O (h2o.ai/product), both well suited for big data.
– wacax
Dec 5 '15 at 6:03
The main problem with using R for large datasets is the RAM constraint. The reason for keeping all the data in RAM is that it provides much faster access and manipulation than storage on disk would. If you are willing to take a hit on performance, then yes, it is quite practical to work with large datasets in R:

- RODBC: allows connecting to an external database from R to retrieve and handle data. Hence, only the data currently being manipulated needs to fit in RAM; the overall dataset can be much larger.
- ff: allows using larger-than-RAM datasets by utilising memory-mapped pages.
- biglm: builds generalized linear models on big data by loading the data into memory in chunks.
- bigmemory: an R package which allows powerful and memory-efficient parallel analyses and data mining of massive datasets. It permits storing large objects (matrices etc.) in memory (RAM) using external pointer objects to refer to them.
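To make the chunked approach concrete, here is a sketch of fitting a linear model incrementally with biglm; the file name and the columns y, x1, x2 are hypothetical:

    library(biglm)

    # Hypothetical CSV with columns y, x1, x2; fit incrementally in 100k-row chunks.
    chunk_size <- 1e5
    con <- file("large_data.csv", open = "r")
    col_names <- strsplit(readLines(con, n = 1), ",")[[1]]  # consume the header
    fit <- NULL
    repeat {
      chunk <- tryCatch(
        read.csv(con, header = FALSE, col.names = col_names, nrows = chunk_size),
        error = function(e) NULL)   # read.csv errors once the connection is drained
      if (is.null(chunk) || nrow(chunk) == 0) break
      fit <- if (is.null(fit)) biglm(y ~ x1 + x2, data = chunk)
             else update(fit, moredata = chunk)
    }
    close(con)
    summary(fit)  # coefficients estimated without ever holding the full file in RAM

Only one chunk is resident at a time; biglm keeps sufficient statistics between updates, so memory use stays flat regardless of file size.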
answered May 14 '14 at 12:39 by asheeshr
Another package is distributedR, which allows you to work with distributed files in RAM.
– adesantos
Jun 25 '14 at 7:03
Some good answers here. I would like to join the discussion by adding the following three notes:

The question's emphasis on the volume of data when referring to Big Data is certainly understandable and valid, especially considering that data volume growth is outpacing the exponential growth of technological capacities per Moore's Law (http://en.wikipedia.org/wiki/Moore%27s_law).

Having said that, it is important to remember the other aspects of the big data concept. Based on Gartner's definition (emphasis mine - AB): "Big data is high volume, high velocity, and/or high variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization" (usually referred to as the "3Vs model"). I mention this because it forces data scientists and other analysts to look for and use R packages that focus on aspects of big data other than volume (enabled by the richness of the enormous R ecosystem).

While existing answers mention some R packages related to big data, for more comprehensive coverage I'd recommend referring to the CRAN Task View "High-Performance and Parallel Computing with R" (http://cran.r-project.org/web/views/HighPerformanceComputing.html), in particular the sections "Parallel computing: Hadoop" and "Large memory and out-of-memory data".
answered Jul 19 '14 at 2:19 by Aleksandr Blekh
R is great for "big data"! However, you need a workflow, since R is limited (with some simplification) by the amount of RAM in the operating system. The approach I take is to interact with a relational database (see the RSQLite package for creating and interacting with a SQLite database), run SQL-style queries to understand the structure of the data, and then extract particular subsets of the data for computationally intensive statistical analysis.
This is just one approach, however: there are packages that allow you to interact with other databases (e.g., MonetDB) or run analyses in R with fewer memory limitations (e.g., see pbdR).
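A minimal sketch of that workflow with DBI and RSQLite; the database file, table, and column names are all hypothetical:

    library(DBI)

    # Hypothetical on-disk database with a large 'events' table.
    con <- dbConnect(RSQLite::SQLite(), "events.sqlite")

    # Explore structure and size without pulling the data into R.
    dbListTables(con)
    dbGetQuery(con, "SELECT COUNT(*) AS n FROM events")

    # Extract only the subset needed for the memory-intensive analysis.
    recent <- dbGetQuery(con, "SELECT value, group_id FROM events WHERE year = 2014")
    fit <- lm(value ~ factor(group_id), data = recent)

    dbDisconnect(con)

The database does the filtering on disk; R only ever holds the subset that the statistical step actually needs.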
answered May 18 '14 at 19:22 by statsRus
Considering other criteria, I think that in some cases using Python may be much superior to R for Big Data. I know R is widely used in data science educational materials and good data analysis libraries are available for it, but sometimes it just depends on the team.

In my experience, for people already familiar with programming, using Python provides much more flexibility and a productivity boost compared to a language like R, which is not as well designed and powerful as Python as a programming language. As evidence, in a data mining course at my university, the best final project was written in Python, although the others had access to R's rich data analysis libraries. That is, sometimes the overall productivity (considering learning materials, documentation, etc.) with Python may be better than with R, even in the absence of special-purpose data analysis libraries for Python. Also, there are some good articles explaining the fast pace of Python in data science: Python Displacing R and Rich Scientific Data Structures in Python, which may soon fill the gap in libraries available for R.

Another important reason for not using R is that, when working with real-world Big Data problems (as opposed to purely academic problems), there is much need for other tools and techniques, like data parsing, cleaning, visualization, web scraping, and a lot of others that are much easier in a general-purpose programming language. This may be why the default language used in many Hadoop courses (including Udacity's online course) is Python.

Edit:

Recently DARPA also invested $3 million to help fund Python's data processing and visualization capabilities for big data jobs, which is clearly a sign of Python's future in Big Data. (details)
answered May 18 '14 at 12:30 by Amir Ali Akbari; edited May 19 '14 at 8:13
R is a pleasure to work with for data manipulation (reshape2, plyr, and now dplyr) and I don't think you can do better than ggplot2/ggvis for visualization.
– organic agave
May 18 '14 at 21:52
@pearpies As I said at the beginning of my answer, I acknowledge the good libraries available for R, but as a whole, when considering all the areas needed for big data (a few of which I mentioned in the answer), R is no match for the mature and huge libraries available for Python.
– Amir Ali Akbari
May 19 '14 at 8:08
Peter from Continuum Analytics (one of the companies on the DARPA project referenced above) is working on some very impressive open-source code for data visualization that simply does things that other sets of code are not able to do.
– blunders
May 20 '14 at 18:46
This answer seems to be wholly anecdotal and hardly shows anywhere where R is weak relative to Python.
– stanekam
Jun 10 '14 at 20:31
Oh my goodness! "As an evidence, in a data mining course in my university, the best final project was written in Python, although the others has access to R's rich data analysis library." And you want to have readers respect your analysis? wow. Could there be any other factors involved in the best project being a python project other than the language it was written in? really....
– Shawn Mehan
Dec 5 '15 at 18:35
R is great for a lot of analysis. As mentioned above, there are newer adaptations for big data like MapR, RHadoop, and scalable versions of RStudio.
However, if your concern is libraries, keep your eye on Spark. Spark was created for big data and is MUCH faster than Hadoop alone. It has rapidly growing machine learning, SQL, streaming, and graph libraries, allowing much if not all of the analysis to be done within the framework (with multiple language APIs; I prefer Scala) without having to shuffle between languages/tools.
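For staying in R specifically, here is a minimal SparkR sketch of this idea, assuming Spark 2.x with its bundled SparkR package; the file path and column names are hypothetical:

    library(SparkR)
    sparkR.session(master = "local[*]")   # or a cluster master URL

    # Hypothetical CSV too large for a single in-memory data.frame.
    df <- read.df("hdfs:///data/events.csv", source = "csv", header = "true")

    # The aggregation runs on the Spark executors; only the small result is collected.
    by_user <- summarize(groupBy(df, df$user_id), n = count(df$user_id))
    head(collect(by_user))

    sparkR.session.stop()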
answered Jan 29 '15 at 12:58 by Climbs_lika_Spyder
As other answers have noted, R can be used along with Hadoop and other distributed computing platforms to scale it up to the "Big Data" level. However, if you're not wedded to R specifically but are willing to use an "R-like" environment, Incanter is a project that might work well for you, as it is native to the JVM (based on Clojure) and doesn't have the "impedance mismatch" between itself and Hadoop that R has. That is to say, from Incanter you can invoke the native Java Hadoop/HDFS APIs without needing to go through a JNI bridge or anything.
answered Jan 29 '15 at 21:03 by mindcrime
I am far from an expert, but my understanding of the subject tells me that R (superb for statistics) and e.g. Python (superb at several of those things where R is lacking) complement each other quite well (as pointed out by previous posts).
answered Jul 18 '14 at 15:24 by Stenemo
I think there is actually a plethora of tools for working with big data in R.
sparklyr will be a great player in that field. sparklyr is an R interface to Apache Spark that allows connecting to local and remote clusters and provides a dplyr back-end. One can also rely on Apache Spark's machine learning libraries.
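A minimal sketch of the sparklyr workflow; nycflights13::flights merely stands in for a genuinely large table here:

    library(sparklyr)
    library(dplyr)

    sc <- spark_connect(master = "local")   # swap in a cluster master for real use

    # Copy (or in practice, read) a table into Spark.
    flights_tbl <- copy_to(sc, nycflights13::flights, "flights")

    # dplyr verbs are translated to Spark SQL and executed on the cluster.
    flights_tbl %>%
      group_by(carrier) %>%
      summarise(mean_delay = mean(dep_delay, na.rm = TRUE)) %>%
      collect()

    spark_disconnect(sc)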
Furthermore, parallel processing is possible with several packages, such as Rmpi and snow (user controlled) or doMC/foreach (system based).
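For example, a sketch of an embarrassingly parallel loop with foreach; the chunk files and the columns y, x1, x2 are hypothetical (doParallel is used here since doMC is Unix-only):

    library(doParallel)   # attaches foreach; doMC offers the same on Unix-alikes
    registerDoParallel(cores = 4)

    # Hypothetical per-chunk CSV files; fit one model per chunk in parallel.
    files <- list.files("chunks", pattern = "\\.csv$", full.names = TRUE)
    results <- foreach(f = files, .combine = rbind) %dopar% {
      chunk <- read.csv(f)
      coef(lm(y ~ x1 + x2, data = chunk))   # one coefficient row per chunk
    }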
answered Nov 8 '18 at 15:35 by paoloeusebi