decentriq_util.spark
Functions
spark_session
def spark_session(
temp_dir: str = '/scratch',
input_files: list[str] = None,
**kwargs,
)
Create a Spark session and configure it according to the enclave environment.
Parameters:
temp_dir
: Where to store temporary data such as persisted data frames or shuffle data.
input_files
: A list of input files on the basis of which the partition size is determined.
Example:
import decentriq_util as dq
# Path to a potentially very large file
input_csv_path = "/input/my_file.csv"
# Automatically create and configure a spark session and
# make sure it's being stopped at the end.
with dq.spark.spark_session(input_files=[input_csv_path]) as ss:
    # Read from a CSV file
    df = ss.read.csv(input_csv_path, header=False).cache()
    # Perform any pyspark transformations
    print(f"Original number of rows: {df.count()}")
    result_df = df.limit(100)
    # Write the result to an output file
    result_df.write.parquet("/output/my_file.parquet")
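The temp_dir parameter can also be set explicitly when the default /scratch location is not suitable. The snippet below is a minimal sketch following the same /input and /output layout as the example above; the concrete file paths are illustrative and not part of the library.
import decentriq_util as dq
# Illustrative paths, following the layout of the example above.
input_csv_path = "/input/my_file.csv"
output_parquet_path = "/output/result.parquet"
# Point the session at an explicit scratch directory for shuffle data
# and persisted data frames (the default is "/scratch").
with dq.spark.spark_session(
    temp_dir="/scratch",
    input_files=[input_csv_path],
) as ss:
    df = ss.read.csv(input_csv_path, header=False)
    # Keep only distinct rows as an example transformation.
    result_df = df.distinct()
    result_df.write.mode("overwrite").parquet(output_parquet_path)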