Rhyzx:
This question already has an answer here:
- How to know which count query is the fastest? (2 answers)
I have a short question regarding DataFrames and RDDs in Spark (2.1.0).
When I load a table from a CSV file and then execute a very simple count operation like this:
// DataFrame
spark.table("lineitem").count();
// RDD
spark.table("lineitem").rdd().count();
the execution time for the DataFrame is 7.991649 s, while the RDD API needs 19.384267 s.
I get the same difference if I execute a reduce or an aggregate operation to sum a column.
Is there any way to execute a count operation on RDDs (using just the RDD API) so that it is on par with or faster than the DataFrame implementation?
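For reference, a minimal sketch of how the two counts could be timed side by side in a spark-shell session. This assumes the "lineitem" table is already registered, as in the question; the `time` helper is a hypothetical utility, not part of the Spark API.

```scala
// Hypothetical timing helper; assumes a spark-shell session where
// `spark` (SparkSession) is in scope and "lineitem" is a registered table.
def time[T](label: String)(block: => T): T = {
  val start = System.nanoTime()
  val result = block
  println(f"$label: ${(System.nanoTime() - start) / 1e9}%.3f s")
  result
}

// DataFrame count: planned and executed by Catalyst/Tungsten.
time("DataFrame count") { spark.table("lineitem").count() }

// RDD count: .rdd converts the Dataset to an RDD[Row], which forces
// deserialization of every row before counting.
time("RDD count") { spark.table("lineitem").rdd.count() }
```

Note that `.rdd` materializes each row as a `Row` object, which is a plausible source of the overhead the timings show.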