1.collect()
The collect() action is the simplest and most common operation; it returns the entire content of the RDD to the driver program.
from pyspark import SparkContext,SparkConf
FileA=sc.textFile("/FileStore/tables/file1.txt")
FileA.collect()
2.count()
The count() action returns the number of elements in the RDD.
from pyspark import SparkContext,SparkConf
FileA=sc.textFile("/FileStore/tables/file1.txt")
#count the occurrences of the word 'Data' in the file
File2=FileA.flatMap(lambda x:x.split(" ")).filter(lambda x:x=='Data')
File2.count()
3.take()
The take(n) action returns the first n elements of the RDD. It tries to access as few partitions as possible, so the result may be a biased collection rather than a representative sample.
from pyspark import SparkContext,SparkConf
FileA=sc.textFile("/FileStore/tables/file1.txt")
#split the file into words
File2=FileA.flatMap(lambda x:x.split(" "))
File2.take(5)
4.first()
The first() action returns the first element of the RDD.
from pyspark import SparkContext,SparkConf
FileA=sc.textFile("/FileStore/tables/file1.txt")
#split the file into words
File2=FileA.flatMap(lambda x:x.split(" "))
File2.first()
5.reduce()
The reduce() action aggregates all the elements of the RDD by repeatedly applying a pairwise user-defined function.
from pyspark import SparkContext,SparkConf
#find out sum of all numbers in RDD
List1=sc.parallelize(range(1,50),3)
Sum1=List1.reduce(lambda x,y :x+y)
print('Sum is :' ,Sum1)
6.countByKey()
countByKey() works on an RDD of key-value pairs and returns a dictionary of keys and the counts of their occurrences.
from pyspark import SparkContext,SparkConf
#Count common words in the file
FileA=sc.textFile("/FileStore/tables/file1.txt")
#split lines into words on spaces, periods, and commas
File1=FileA.flatMap(lambda x:x.split(" ")).flatMap(lambda x:x.split(".")).flatMap(lambda x:x.split(","))
#create a (word, 1) pair for each word
File2=File1.map(lambda x:(x,1))
#use countByKey to count the occurrences of each word
File3=File2.countByKey()
print(File3)
7.saveAsTextFile(path)
Save the RDD as a text file at the given path. This works only on an RDD, not on a plain Python collection.
For example, countByKey returns its result as a Python dictionary on the driver, so saveAsTextFile cannot be called on it.
In such a case we use reduceByKey instead, which returns an RDD that can be saved in text format.
from pyspark import SparkContext,SparkConf
#Count common words in file
FileA=sc.textFile("/FileStore/tables/file1.txt")
#split lines into words on spaces, periods, and commas
File1=FileA.flatMap(lambda x:x.split(" ")).flatMap(lambda x:x.split(".")).flatMap(lambda x:x.split(","))
#create a (word, 1) pair for each word
File2=File1.map(lambda x:(x,1))
#countByKey would return a dictionary, which cannot be saved with saveAsTextFile
#print(File2.countByKey())
File3=File2.reduceByKey(lambda a,b:a+b)
#File3.collect()
#save result in text file
File3.saveAsTextFile("/FileStore/tables/countByKey2.txt")
print("File has been saved")
When we read the saved output back, it contains the (word, count) pairs.
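As a quick check, here is a minimal sketch of reading the output back, reusing the path from the example above (note that saveAsTextFile writes a directory of part files; the Result variable name is just illustrative):
#read the saved (word, count) pairs back as an RDD of strings
Result=sc.textFile("/FileStore/tables/countByKey2.txt")
print(Result.collect())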
8.takeSample()
Return an array with a random sample of num elements of the dataset, with or without replacement, optionally pre-specifying a random number generator seed.
from pyspark import SparkContext,SparkConf
List1=range(1,20)
Rdd1=sc.parallelize(List1)
print(Rdd1.takeSample(False,10,2))
9.takeOrdered()
Return the first n elements of the RDD using either their natural order or a custom comparator.
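A minimal sketch of takeOrdered, using a small in-memory RDD (the numbers and the Rdd1 name below are illustrative, not from the original examples):
from pyspark import SparkContext,SparkConf
#create a small RDD of numbers
Rdd1=sc.parallelize([7,2,9,1,5])
#take the 3 smallest elements using the natural ordering
print(Rdd1.takeOrdered(3))
#take the 3 largest elements by supplying a custom key
print(Rdd1.takeOrdered(3,key=lambda x:-x))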
10.foreach()
In Scala, foreach(println) works fine, but in PySpark the function passed to foreach runs on the executors, so its output does not appear in the driver console.
To see the elements on the driver, we use actions like collect, first, or take and then print, as in the sketch below.
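A minimal sketch of this difference (the RDD and variable names below are illustrative):
from pyspark import SparkContext,SparkConf
Rdd1=sc.parallelize(range(1,6))
#foreach runs the function on the executors; any printed output goes to the
#executor logs, not to the driver console
Rdd1.foreach(lambda x:print(x))
#to see the elements on the driver, collect them first and then print
for element in Rdd1.collect():
    print(element)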