PySpark - Interview Questions
What are filters in PySpark?
In PySpark, filters are used to select a subset of data from a DataFrame or RDD based on specified criteria. Filters allow you to conditionally include or exclude rows of data based on values in one or more columns. Filters are commonly used in data processing pipelines to perform data cleansing, data wrangling, and data analysis tasks.

Here's how filters work in PySpark:

Filtering DataFrames:
* When working with DataFrames in PySpark, you can use the filter() method to apply a filter condition to the DataFrame.
* The filter() method takes a boolean condition, expressed either as a Column expression or as a SQL-style string, that evaluates to true or false for each row in the DataFrame. Rows for which the condition evaluates to true are retained, while rows for which it evaluates to false are filtered out.
* The condition typically involves comparisons or logical operations on column values. For example, you can filter rows where a specific column meets a certain condition, such as df.filter(df['age'] > 30) to retain rows where the 'age' column is greater than 30. A sketch combining multiple conditions follows this list.
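As a minimal sketch of a compound DataFrame filter (the DataFrame, the 'name' and 'age' columns, and the sample data here are illustrative, not part of the original example):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("CompoundFilterExample").getOrCreate()

# Hypothetical sample data for illustration
df = spark.createDataFrame([("Alice", 34), ("Bob", 28), ("Amy", 41)], ["name", "age"])

# Combine conditions with & (and) / | (or); each condition needs its own
# parentheses because Python's bitwise operators bind more tightly than
# comparisons.
filtered = df.filter((df['age'] > 30) & (F.col('name').startswith('A')))

# where() is an alias for filter(); a SQL-style string condition also works.
same_result = df.where("age > 30 AND name LIKE 'A%'")

filtered.show()
spark.stop()

Note that where() and filter() are interchangeable; the SQL-string form is often convenient when the condition comes from configuration or user input.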

Filtering RDDs:
* When working with Resilient Distributed Datasets (RDDs) in PySpark, you can use the filter() transformation to create a new RDD containing only the elements that satisfy a given predicate function.
* Similar to DataFrames, the predicate function used with filter() evaluates to true or false for each element in the RDD. Elements for which the predicate function returns true are retained, while elements for which it returns false are filtered out.
* For example, you can filter an RDD of integers to retain only even numbers using rdd.filter(lambda x: x % 2 == 0), as in the runnable sketch after this list.
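A minimal runnable sketch of RDD filtering (the app name and sample data are illustrative):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("RDDFilterExample").getOrCreate()
sc = spark.sparkContext

# Create an RDD of integers and keep only the even numbers
numbers = sc.parallelize(range(10))
evens = numbers.filter(lambda x: x % 2 == 0)

print(evens.collect())  # [0, 2, 4, 6, 8]

spark.stop()

Unlike DataFrame filter(), the RDD filter() transformation really does take a Python function, which is applied to each element; it is lazy, so no work happens until an action such as collect() is called.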

Here's a simple example demonstrating how to use filters in PySpark with DataFrames:
from pyspark.sql import SparkSession

# Initialize SparkSession
spark = SparkSession.builder \
    .appName("FilterExample") \
    .getOrCreate()

# Create a DataFrame
data = [("Alice", 25), ("Bob", 30), ("Charlie", 35)]
df = spark.createDataFrame(data, ["name", "age"])

# Apply a filter to retain rows where age is greater than 30
filtered_df = df.filter(df['age'] > 30)

# Show the filtered DataFrame
filtered_df.show()

# Stop SparkSession
spark.stop()

In this example, the DataFrame df is filtered to retain rows where the 'age' column is greater than 30 using the filter() method. The resulting DataFrame filtered_df contains only the rows that satisfy the filter condition.