map() and flatMap() are similar in that each takes an element of the input RDD (here, a line of text) and applies a function to it. The key difference is that map() returns exactly one output element for each input element, while flatMap() can return zero or more elements for each input, all of which are flattened into the output RDD.
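The difference in the shape of the results can be sketched in plain Python, without Spark. The helpers `map_like` and `flat_map_like` and the sample lines below are hypothetical and only mimic the behaviour on an ordinary list; they are not part of the PySpark API.

```python
from itertools import chain


def map_like(func, items):
    """Mimic RDD.map on a plain list: exactly one output per input."""
    return [func(item) for item in items]


def flat_map_like(func, items):
    """Mimic RDD.flatMap on a plain list: concatenate the iterables func returns."""
    return list(chain.from_iterable(func(item) for item in items))


lines = ["hello world", "map and flatMap"]  # made-up sample lines

mapped = map_like(lambda x: x.split(" "), lines)
flattened = flat_map_like(lambda x: x.split(" "), lines)

print(mapped)     # [['hello', 'world'], ['map', 'and', 'flatMap']]
print(flattened)  # ['hello', 'world', 'map', 'and', 'flatMap']
```

Note that `mapped` has as many elements as there are input lines, while `flattened` has as many elements as there are words.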
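1. map()
The map() transformation takes any function and applies it to every element of the RDD (here, every line), producing a new RDD with exactly one output element per input element. In the example below, each line is split on spaces, so each output element is the list of words from one line.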
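from pyspark import SparkContext
# Reuse the SparkContext the notebook already provides, or create one if none exists.
sc = SparkContext.getOrCreate()
# Load three text files; each RDD element is one line of text.
FileA = sc.textFile("/FileStore/tables/file1.txt")
FileB = sc.textFile("/FileStore/tables/file2.txt")
FileC = sc.textFile("/FileStore/tables/file3.txt")
# Combine the three RDDs into a single RDD of lines.
FileAB = FileA.union(FileB)
FileABC = FileAB.union(FileC)
# map(): one output element (a list of words) per input line.
FileABCmapped = FileABC.map(lambda x: x.split(" "))
FileABCmapped.collect()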
Here you can see that each line has been split on spaces into a list of words, so the result is a list of lists, with one inner list per line.
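2. flatMap()
The flatMap() transformation also applies a function to every input element, but the function may return several elements (or none) for each input, and all of them are flattened into a single output RDD. The simplest use of flatMap() is to split each input string into words.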
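from pyspark import SparkContext
# Reuse the SparkContext the notebook already provides, or create one if none exists.
sc = SparkContext.getOrCreate()
# Load three text files; each RDD element is one line of text.
FileA = sc.textFile("/FileStore/tables/file1.txt")
FileB = sc.textFile("/FileStore/tables/file2.txt")
FileC = sc.textFile("/FileStore/tables/file3.txt")
# Combine the three RDDs into a single RDD of lines.
FileAB = FileA.union(FileB)
FileABC = FileAB.union(FileC)
# flatMap(): the words of every line are flattened into one RDD of words.
FileABCflatten = FileABC.flatMap(lambda x: x.split(" "))
FileABCflatten.collect()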
Here, the per-line lists have been flattened: the result is a single list of words with no nesting. All words are split on spaces.
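One detail about the lambda used in both examples is worth knowing: in Python, `split(" ")` splits on every single space, so an empty line or a run of consecutive spaces produces empty strings, whereas `split()` with no argument splits on any run of whitespace and drops them. With flatMap(), those empty strings would end up as elements of the result. A small plain-Python sketch with made-up input, assuming the files may contain blank lines or double spaces:

```python
lines = ["spark  rdd", "", "flat map"]  # made-up lines: a double space and an empty line

# split(" ") keeps an empty string for the extra space and for the empty line.
with_space = [word for line in lines for word in line.split(" ")]

# split() treats a run of whitespace as one separator; an empty line yields no words.
no_arg = [word for line in lines for word in line.split()]

print(with_space)  # ['spark', '', 'rdd', '', 'flat', 'map']
print(no_arg)      # ['spark', 'rdd', 'flat', 'map']
```

If empty strings are not wanted in the output, `lambda x: x.split()` is one option to consider in place of `lambda x: x.split(" ")`.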