I am wondering how to choose the best settings to tune my Spark job. Basically, I am just reading a big CSV file into a DataFrame and counting some string occurrences. The input file is over 500 GB, and the Spark job seems too slow.
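For context, the job is essentially equivalent to this minimal sketch (the path, column name, and search string are placeholders, not my real values):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object CountOccurrences {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("csv-count")
      .getOrCreate()

    // Read the ~500 GB CSV into a DataFrame.
    val df = spark.read
      .option("header", "true")
      .csv("/path/to/input.csv")

    // Count rows whose "text" column contains the target string.
    val matches = df.filter(col("text").contains("some-string")).count()
    println(s"matches: $matches")

    spark.stop()
  }
}
```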
Terminal progress bar:
[Stage 1:=======> (4174 + 50) / 18500]
NumberCompletedTasks: (4174), which took around one hour.
NumberActiveTasks: (50); I believe I can control this with --conf spark.dynamicAllocation.maxExecutors=50 (I tried different values; see the sketch below for how I apply it).
TotalNumberOfTasks: (18500). Why is this fixed? What does it mean? Is it only related to the file size?

Since I am just reading a CSV with very little logic on top, how can I optimize the Spark job?
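For reference, here is roughly how I apply that setting in code; it should be equivalent to passing the --conf flag to spark-submit (the values shown are just from my latest attempt):

```scala
import org.apache.spark.sql.SparkSession

// In-code equivalent of the --conf flags passed to spark-submit.
// Note: dynamic allocation typically also requires the external shuffle
// service (or shuffle tracking) to be enabled on the cluster.
val spark = SparkSession.builder()
  .appName("csv-count")
  .config("spark.dynamicAllocation.enabled", "true")
  .config("spark.dynamicAllocation.maxExecutors", "50")
  .getOrCreate()
```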
I also tried changing: