
SOLVED: Tune Spark, set executors and driver memory for reading a large CSV file

Alg_D:

I am wondering how to choose the best settings to tune my Spark job. Basically I am just reading a big CSV file into a DataFrame and counting some string occurrences.

The input file is over 500 GB, and the Spark job seems too slow.
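For context, here is a simplified Scala sketch of roughly what the job does; the input path, column name, and search string below are placeholders, not the real ones:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object CsvStringCount {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("csv-string-count")
      .getOrCreate()

    // Read the large CSV into a DataFrame (path and header option are placeholders).
    val df = spark.read
      .option("header", "true")
      .csv("hdfs:///data/big_file.csv")

    // Count the rows whose (placeholder) "text" column contains the search string.
    val matches = df.filter(col("text").contains("some_string")).count()

    println(s"Matching rows: $matches")
    spark.stop()
  }
}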

Terminal progress bar:


[Stage1:=======> (4174 + 50) / 18500]

NumberCompletedTasks (4174): completing these took around one hour.

NumberActiveTasks (50): I believe I can control this with --conf spark.dynamicAllocation.maxExecutors=50 (I tried different values).

TotalNumberOfTasks (18500): why is this number fixed? What does it mean? Is it related only to the file size? Since I am just reading a CSV with very little logic, how can I optimize the Spark job?
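From what I understand, the task count of this read stage matches the number of input partitions, which for file sources depends on the input size and on spark.sql.files.maxPartitionBytes. A quick sketch to check this, reusing the assumed session and placeholder path from the sketch above:

// The read stage runs one task per input partition.
val df = spark.read.option("header", "true").csv("hdfs:///data/big_file.csv")
println(df.rdd.getNumPartitions)

// Planner split size for file scans (128 MB by default); shown only to
// illustrate which setting drives the partition count.
println(spark.conf.get("spark.sql.files.maxPartitionBytes"))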

I also tried changing:


--executor-memory 10g
--driver-memory 12g
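
In case it matters, my understanding is that the executor-side settings could also be applied when the session is built; the rough sketch below just mirrors the flags above and is not a tuning recommendation. Driver memory still has to go on the spark-submit line, since the driver JVM is already running by the time this code executes (at least in client mode):

import org.apache.spark.sql.SparkSession

// Illustrative only: mirrors the spark-submit flags above.
val spark = SparkSession.builder()
  .appName("csv-string-count")
  .config("spark.executor.memory", "10g")
  .config("spark.dynamicAllocation.enabled", "true")
  .config("spark.dynamicAllocation.maxExecutors", "50")
  // spark.driver.memory would be ignored here in client mode, so keep
  // --driver-memory on the spark-submit command line.
  .getOrCreate()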


