RDD functions only but not DataFrames
In PySpark, some functions are specific to RDDs (Resilient Distributed Datasets) and do not work with DataFrames.
RDDs are the fundamental building blocks of PySpark and expose a lower-level API than DataFrames, which offer higher-level, SQL-like abstractions.
Below is a list of PySpark functions that apply to RDDs but not, or not in the same form, to DataFrames:
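All snippets below assume a running SparkContext named sc and, for the DataFrame comparisons, a SparkSession named spark. A minimal local setup sketch:
from pyspark.sql import SparkSession
# Build (or reuse) a local SparkSession; sc is its underlying SparkContext
spark = SparkSession.builder.master("local[*]").appName("rdd-vs-df").getOrCreate()
sc = spark.sparkContext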
-----------------------------------------------------------------------------------------------
1. map()
RDD: Transforms each element of the RDD by applying a given function.
Not for DataFrames: DataFrames use select or withColumn for similar transformations.
rdd = sc.parallelize([1, 2, 3, 4])
rdd.map(lambda x: x * 2).collect() # [2, 4, 6, 8]
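For comparison, a rough DataFrame equivalent (a sketch; the column name "value" is illustrative):
df = spark.createDataFrame([(1,), (2,), (3,), (4,)], ["value"])
df.select((df.value * 2).alias("doubled")).collect()  # doubled: 2, 4, 6, 8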
-----------------------------------------------------------------------------------------------
2. flatMap()
RDD: Similar to map(), but it allows returning multiple items for each element, flattening the result.
Not for DataFrames: There is no one-step equivalent; the closest DataFrame idiom is building an array column and flattening it with explode().
rdd = sc.parallelize([1, 2, 3])
rdd.flatMap(lambda x: [x, x * 2]).collect() # [1, 2, 2, 4, 3, 6]
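A rough DataFrame analogue via explode() (a sketch; names are illustrative):
from pyspark.sql.functions import array, col, explode
df = spark.createDataFrame([(1,), (2,), (3,)], ["x"])
df.select(explode(array(col("x"), col("x") * 2)).alias("x")).collect()  # 1, 2, 2, 4, 3, 6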
-----------------------------------------------------------------------------------------------
3. reduce()
RDD: Aggregates the elements of the RDD using a function (e.g., sum).
Not for DataFrames: In DataFrames, you would use groupBy with aggregation functions (e.g., sum(), avg()).
rdd = sc.parallelize([1, 2, 3, 4])
rdd.reduce(lambda x, y: x + y) # 10
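The DataFrame counterpart aggregates over a column (a sketch):
from pyspark.sql.functions import sum as spark_sum
df = spark.createDataFrame([(1,), (2,), (3,), (4,)], ["x"])
df.agg(spark_sum("x")).collect()[0][0]  # 10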
-----------------------------------------------------------------------------------------------
4. reduceByKey()
RDD: Aggregates values by key in key-value pairs.
Not for DataFrames: DataFrames use groupBy for similar key-based aggregations.
rdd = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])
rdd.reduceByKey(lambda x, y: x + y).collect() # [('a', 4), ('b', 2)]
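Equivalent key-based aggregation on a DataFrame (a sketch):
df = spark.createDataFrame([("a", 1), ("b", 2), ("a", 3)], ["key", "value"])
df.groupBy("key").sum("value").collect()  # 'a' -> 4, 'b' -> 2 (row order not guaranteed)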
-----------------------------------------------------------------------------------------------
5. filter()
RDD: Filters elements of the RDD that satisfy a condition.
Not for DataFrames: DataFrames also have filter() (alias where()), but it takes a Column expression or SQL string rather than an arbitrary Python function, unlike the RDD version.
rdd = sc.parallelize([1, 2, 3, 4])
rdd.filter(lambda x: x % 2 == 0).collect() # [2, 4]
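The DataFrame version takes a column expression instead of a lambda (a sketch):
df = spark.createDataFrame([(1,), (2,), (3,), (4,)], ["x"])
df.filter(df.x % 2 == 0).collect()  # rows with x = 2 and x = 4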
-----------------------------------------------------------------------------------------------
6. groupByKey()
RDD: Groups the values for each key in a key-value RDD.
Not for DataFrames: DataFrames use groupBy() with an aggregation such as collect_list(); there is no direct groupByKey(), which is just as well, since groupByKey() shuffles every value and is notoriously inefficient on large datasets.
rdd = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])
rdd.groupByKey().mapValues(list).collect() # [('a', [1, 3]), ('b', [2])]
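A DataFrame sketch of the same grouping, using collect_list():
from pyspark.sql.functions import collect_list
df = spark.createDataFrame([("a", 1), ("b", 2), ("a", 3)], ["key", "value"])
df.groupBy("key").agg(collect_list("value")).collect()  # 'a' -> [1, 3], 'b' -> [2]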
-----------------------------------------------------------------------------------------------
7. partitionBy()
RDD: Partitions the RDD by key using a specified number of partitions.
Not for DataFrames: DataFrames use repartition() (which can hash-partition by one or more columns) or coalesce(); the separate DataFrameWriter.partitionBy() only controls the directory layout when writing files.
rdd = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])
rdd.partitionBy(2).glom().collect()
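A DataFrame sketch that hash-partitions by a column via repartition():
df = spark.createDataFrame([("a", 1), ("b", 2), ("a", 3)], ["key", "value"])
df.repartition(2, "key").rdd.getNumPartitions()  # 2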
-----------------------------------------------------------------------------------------------
8. coalesce() (for RDDs)
RDD: Reduces the number of partitions in an RDD.
For DataFrames: coalesce() exists for DataFrames too, and in both cases it reduces the number of partitions by merging them without a full shuffle. (The unrelated pyspark.sql.functions.coalesce() instead returns the first non-null value among its column arguments.)
rdd = sc.parallelize([1, 2, 3, 4], 4)
rdd.coalesce(2).getNumPartitions() # 2
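The DataFrame counterpart (a sketch):
df = spark.range(4).repartition(4)
df.coalesce(2).rdd.getNumPartitions()  # 2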
-----------------------------------------------------------------------------------------------
9. glom()
RDD: Converts each partition of the RDD into a list.
Not for DataFrames: There is no equivalent for this in DataFrames.
rdd = sc.parallelize([1, 2, 3, 4], 2)
rdd.glom().collect() # [[1, 2], [3, 4]]
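DataFrames expose no glom(), but you can drop to the underlying RDD to inspect partitions (a sketch):
df = spark.range(4).repartition(2)
df.rdd.glom().collect()  # one list of Rows per partition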
-----------------------------------------------------------------------------------------------
10. keyBy()
RDD: Transforms an RDD into key-value pairs by applying a function to each element to extract the key.
Not for DataFrames: DataFrames do not work with key-value pairs in the same way.
rdd = sc.parallelize([1, 2, 3])
rdd.keyBy(lambda x: x % 2).collect() # [(1, 1), (0, 2), (1, 3)]
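In a DataFrame, the analogue is simply deriving a key column rather than building (key, value) pairs (a sketch):
df = spark.createDataFrame([(1,), (2,), (3,)], ["x"])
df.withColumn("key", df.x % 2).collect()  # [Row(x=1, key=1), Row(x=2, key=0), Row(x=3, key=1)]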
-----------------------------------------------------------------------------------------------
11. aggregate()
RDD: A more general reduce(): it takes a zero value, a function that folds each element into a per-partition accumulator (seqOp), and a function that merges accumulators across partitions (combOp).
Not for DataFrames: DataFrames do not have a direct equivalent; for DataFrames, you would typically use agg() in conjunction with groupBy().
rdd = sc.parallelize([1, 2, 3, 4])
rdd.aggregate(0, (lambda x, y: x + y), (lambda x, y: x + y)) # 10
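The closest DataFrame idiom is agg(), optionally with several aggregations at once (a sketch):
from pyspark.sql.functions import count, sum as spark_sum
df = spark.createDataFrame([(1,), (2,), (3,), (4,)], ["x"])
df.agg(spark_sum("x"), count("x")).collect()  # [Row(sum(x)=10, count(x)=4)]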
-----------------------------------------------------------------------------------------------
12. cartesian()
RDD: Produces the Cartesian product (cross product) of two RDDs.
Not for DataFrames: DataFrames do not have cartesian(); the equivalent operation is crossJoin().
rdd1 = sc.parallelize([1, 2])
rdd2 = sc.parallelize([3, 4])
rdd1.cartesian(rdd2).collect() # [(1, 3), (1, 4), (2, 3), (2, 4)]
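The DataFrame counterpart is crossJoin() (a sketch):
df1 = spark.createDataFrame([(1,), (2,)], ["a"])
df2 = spark.createDataFrame([(3,), (4,)], ["b"])
df1.crossJoin(df2).collect()  # 4 rows: (1,3), (1,4), (2,3), (2,4)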
-----------------------------------------------------------------------------------------------
13. subtract()
RDD: Returns an RDD that contains the elements of the first RDD that are not in the second RDD.
Not for DataFrames: DataFrames offer subtract() (SQL EXCEPT DISTINCT semantics, deduplicating the result) and, since Spark 2.4, exceptAll() (which keeps duplicates).
rdd1 = sc.parallelize([1, 2, 3])
rdd2 = sc.parallelize([2, 3, 4])
rdd1.subtract(rdd2).collect() # [1]
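A DataFrame sketch using exceptAll():
df1 = spark.createDataFrame([(1,), (2,), (3,)], ["x"])
df2 = spark.createDataFrame([(2,), (3,), (4,)], ["x"])
df1.exceptAll(df2).collect()  # [Row(x=1)]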
-----------------------------------------------------------------------------------------------
14. intersection()
RDD: Returns an RDD that contains only the elements present in both RDDs.
Not for DataFrames: DataFrames have intersect() (and, since Spark 2.4, intersectAll()), which compare whole rows under SQL set semantics; intersect() deduplicates the result.
rdd1 = sc.parallelize([1, 2, 3])
rdd2 = sc.parallelize([2, 3, 4])
rdd1.intersection(rdd2).collect() # [2, 3]
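A DataFrame sketch using intersect():
df1 = spark.createDataFrame([(1,), (2,), (3,)], ["x"])
df2 = spark.createDataFrame([(2,), (3,), (4,)], ["x"])
df1.intersect(df2).collect()  # rows with x = 2 and x = 3 (order not guaranteed)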
-----------------------------------------------------------------------------------------------
15. sample()
RDD: Returns a sampled subset of an RDD.
For DataFrames: DataFrames also have sample(), with essentially the same signature (withReplacement, fraction, seed). In both cases fraction is a per-element probability, not an exact size, so the sample size is only approximately fraction * count.
rdd = sc.parallelize([1, 2, 3, 4, 5])
rdd.sample(False, 0.5).collect() # Random subset
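The DataFrame version (a sketch):
df = spark.range(1, 6)
df.sample(withReplacement=False, fraction=0.5, seed=42).collect()  # random subset of the rows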
-----------------------------------------------------------------------------------------------
16. getNumPartitions()
RDD: Returns the number of partitions in an RDD.
Not for DataFrames: DataFrames have no getNumPartitions() method; you reach it through the underlying RDD, as shown below.
df_repartitioned = df.repartition(2)
num_partitions_repartitioned = df_repartitioned.rdd.getNumPartitions()
print(f"Number of partitions after repartitioning: {num_partitions_repartitioned}")
Summary:
The functions listed above are RDD-specific because RDDs expose a lower-level, functional-style API, while DataFrames operate at a higher, declarative level with SQL-like operations. DataFrames are optimized by Spark's query planner and impose structure on the data; RDDs are more flexible but forgo those optimizations.
If you're working on low-level transformations and need fine control over data processing, RDD functions can be powerful. However, for most high-level operations, especially those related to structured data, DataFrames are generally preferred due to their performance optimizations and ease of use.