Import csv file contents into pyspark dataframes












How can I import a .csv file into PySpark DataFrames? Also, how can I import an .xlsx file?

I tried reading the CSV file with pandas and then converting it to a Spark DataFrame using createDataFrame, but it fails with this error:

"Py4JJavaError" An error occurred while calling o28.applySchemaToPythonRDD. : java.lang.RuntimeException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient

My code was:

from pyspark import SparkContext
from pyspark.sql import SQLContext
import pandas as pd
sqlc=SQLContext(sc)
df=pd.read_csv(r'D:BestBuytrain.csv')
sdf=sqlc.createDataFrame(df)








  • If you have an error message, you should post it; it most likely contains important information for debugging the situation. – j.a.gartner, Aug 1 '16 at 15:55

  • I'm trying to import CSV contents into a pandas DataFrame and then convert it into a Spark DataFrame, but it shows an error like: "Py4JJavaError" An error occurred while calling o28.applySchemaToPythonRDD. : java.lang.RuntimeException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient – neha, Aug 2 '16 at 4:59

  • And my code was--> from pyspark import SparkContext from pyspark.sql import SQLContext import pandas as pd sqlc=SQLContext(sc) df=pd.read_csv(r'D:BestBuytrain.csv') sdf=sqlc.createDataFrame(df) ----> Error – neha, Aug 2 '16 at 5:01

  • Welcome to DataScience.SE! Please edit your original post instead of adding comments. – Emre, Aug 2 '16 at 7:54

  • The file path must be in HDFS; only then can you run on the data. – Prakash Reddy, 2 days ago




































pyspark














edited Aug 2 '16 at 18:04 – Emre

asked Aug 1 '16 at 11:21 – neha


















3 Answers









"How can I import a .csv file into pyspark dataframes?" -- There are many ways to do this; the simplest is to start pyspark with Databricks' spark-csv package:

pyspark --packages com.databricks:spark-csv_2.10:1.4.0

Then load the file as follows:

from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)

df = sqlContext.read.format('com.databricks.spark.csv').options(header='true', inferschema='true').load('cars.csv')


The other method is to read the text file in as an RDD:

myrdd = sc.textFile("yourfile.csv").map(lambda line: line.split(","))

Then transform your data so that every item has the correct type for the schema (ints, strings, floats, etc.). Once you have such an RDD (called rdd below), you can build the DataFrame either from Row objects or with an explicit schema, as in this example from the Spark documentation:

>>> rdd = sc.parallelize([('Alice', 1)])  # stand-in for your transformed RDD
>>> from pyspark.sql import Row
>>> Person = Row('name', 'age')
>>> person = rdd.map(lambda r: Person(*r))
>>> df2 = sqlContext.createDataFrame(person)
>>> df2.collect()
[Row(name=u'Alice', age=1)]
>>> from pyspark.sql.types import *
>>> schema = StructType([
...     StructField("name", StringType(), True),
...     StructField("age", IntegerType(), True)])
>>> df3 = sqlContext.createDataFrame(rdd, schema)
>>> df3.collect()
[Row(name=u'Alice', age=1)]

Reference: http://spark.apache.org/docs/1.6.1/api/python/pyspark.sql.html#pyspark.sql.Row
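
The "transform your data" step above could be sketched roughly as follows. This is only an illustration under assumptions: the two-column (name, age) layout mirrors the documentation example, the file is assumed to have no header row, and parse_line and to_dataframe are hypothetical helper names, not part of PySpark. It uses Python's csv module rather than line.split(",") so that quoted fields containing commas survive.

```python
import csv


def parse_line(line):
    """Turn one CSV line into a (name, age) tuple with age cast to int."""
    # csv.reader handles quoted fields such as "Smith, Bob" correctly,
    # which a plain line.split(",") would break apart.
    name, age = next(csv.reader([line]))
    return name.strip(), int(age)


def to_dataframe(sc, sqlContext, path):
    """Build a DataFrame from a headerless two-column CSV (assumed layout)."""
    # Imported here so parse_line can be tried without PySpark installed.
    from pyspark.sql.types import StructType, StructField, StringType, IntegerType

    schema = StructType([
        StructField("name", StringType(), True),
        StructField("age", IntegerType(), True),
    ])
    rdd = sc.textFile(path).map(parse_line)
    return sqlContext.createDataFrame(rdd, schema)


print(parse_line('"Smith, Bob",42'))  # ('Smith, Bob', 42)
```

In the pyspark shell, something like to_dataframe(sc, sqlContext, "yourfile.csv") would then return the DataFrame.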



"Also, please tell me how can I import xlsx file?" -- Excel files are not typically used in "Big Data"; Spark is meant for large files or databases. If you have an Excel file that is 50GB in size, you're doing things wrong. Excel wouldn't even be able to open a file that size; in my experience, Excel dies on anything above 20MB.
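
For a spreadsheet small enough to fit in memory, one possible workaround is to read it with pandas and pass the result to createDataFrame, the same route the question takes for CSV. A minimal sketch, assuming pandas and an Excel engine such as openpyxl are installed on the driver; excel_to_spark is a hypothetical helper name, and the whole sheet is loaded on the driver before Spark sees it.

```python
def excel_to_spark(sql_context, path, sheet_name=0):
    """Load one sheet of an Excel file into a Spark DataFrame via pandas."""
    # Imported lazily: pandas (plus an Excel engine such as openpyxl)
    # is only needed when the function is actually called.
    import pandas as pd

    # Read the sheet into a pandas DataFrame held in driver memory.
    pdf = pd.read_excel(path, sheet_name=sheet_name)
    # Hand the pandas DataFrame to Spark, which distributes it.
    return sql_context.createDataFrame(pdf)
```

In the pyspark shell this might be called as excel_to_spark(sqlContext, "book.xlsx"), where "book.xlsx" is a placeholder file name.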







answered Aug 8 '16 at 17:27 – Jon


























I have a file 'temp.csv' in my local directory. From there, using a local Spark instance, I do the following:

>>> from pyspark.sql import SQLContext, Row
>>> sql_c = SQLContext(sc)
>>> d0 = sc.textFile('./temp.csv')
>>> d0.collect()
[u'a,1,.2390', u'b,2,.4390', u'c,3,.2323']
>>> d1 = d0.map(lambda x: x.split(',')).map(lambda x: Row(label=x[0], number=int(x[1]), value=float(x[2])))
>>> d1.take(1)
[Row(label=u'a', number=1, value=0.239)]
>>> df = sql_c.createDataFrame(d1)
>>> df_cut = df[df.number > 1]
>>> df_cut.select('label', 'value').collect()
[Row(label=u'b', value=0.439), Row(label=u'c', value=0.2323)]

So d0 is the raw text file loaded into a Spark RDD. To make a DataFrame, you split each CSV line apart and turn every entry into a Row, as I do when creating d1. The last step is to create the DataFrame from the RDD.
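
Real files often carry a header line or a few malformed rows, which would make int() or float() raise inside the map. One way to guard against that, sketched here in plain Python so it can be tried without Spark: parse_row is a hypothetical helper, and the header and "oops" lines are made-up sample data added to the three rows from temp.csv.

```python
def parse_row(line):
    """Return (label, number, value), or None if the line cannot be parsed."""
    parts = line.split(",")
    if len(parts) != 3:
        return None
    try:
        return parts[0], int(parts[1]), float(parts[2])
    except ValueError:
        # Header rows and rows with non-numeric fields end up here.
        return None


lines = ["label,number,value", "a,1,.2390", "b,2,.4390", "oops", "c,3,.2323"]
rows = [r for r in map(parse_row, lines) if r is not None]
print(rows)  # [('a', 1, 0.239), ('b', 2, 0.439), ('c', 3, 0.2323)]

# With Spark, the same function can be passed to the RDD, for example:
#   d1 = d0.map(parse_row).filter(lambda r: r is not None)
# followed by a map to Row(label=..., number=..., value=...) as shown above.
```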







answered Aug 1 '16 at 16:24 – j.a.gartner


























You can use the spark-csv package by Databricks, which does a lot of things for you automatically, such as handling the header, escape characters, and schema inference. Starting with Spark 2.0, there is a built-in function for reading CSVs.
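
For Spark 2.0 and later, the built-in reader is available as spark.read.csv on a SparkSession. A minimal sketch, assuming PySpark 2.0+ is installed; load_csv and demo are hypothetical helper names, and 'cars.csv' is just the placeholder file name used in the answer above.

```python
def load_csv(spark, path):
    """Read a CSV file with Spark's built-in reader (Spark 2.0+).

    `spark` is expected to be a pyspark.sql.SparkSession.
    """
    # header=True: use the first line as column names.
    # inferSchema=True: take an extra pass over the data to guess column types.
    return spark.read.csv(path, header=True, inferSchema=True)


def demo(path="cars.csv"):
    """Hypothetical driver; needs PySpark 2.0+ installed and a file at `path`."""
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[*]").appName("csv-demo").getOrCreate()
    df = load_csv(spark, path)
    df.printSchema()
    df.show(5)
    spark.stop()
```

Calling demo() prints the inferred schema and the first few rows; no --packages flag is needed because the reader ships with Spark itself.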







answered Aug 2 '16 at 20:39 – Jan van der Vegt





























