decentriq_util.spark

Functions

spark_session

def spark_session(
temp_dir: str = '/scratch',
input_files: list[str] = None,
**kwargs,
)

Create a Spark session and configure it according to the enclave environment.

Parameters:

  • temp_dir: Where to store temporary data such as persisted data frames or shuffle data.
  • input_files: A list of input files used to determine the partition size.

Example:

import decentriq_util as dq

# Path to a potentially very large file
input_csv_path = "/input/my_file.csv"

# Automatically create and configure a spark session and
# make sure it's being stopped at the end.
with dq.spark.spark_session(input_files=[input_csv_path]) as ss:
    # Read from a CSV file
    df = ss.read.csv(input_csv_path, header=False).cache()

    # Perform any pyspark transformations
    print(f"Original number of rows: {df.count()}")
    result_df = df.limit(100)

    # Write the result to an output file
    result_df.write.parquet("/output/my_file.parquet")
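
As a further illustration, the sketch below overrides temp_dir so that temporary data (persisted data frames, shuffle files) is written to a custom directory. The specific directory, input file, and the grouping column are placeholder assumptions for this sketch, not part of the documented API.

import decentriq_util as dq

# Hypothetical scratch directory; the default is "/scratch"
custom_temp_dir = "/scratch/my_job"

input_csv_path = "/input/my_file.csv"

# Redirect temporary data to the custom directory while still
# sizing partitions based on the input file.
with dq.spark.spark_session(
    temp_dir=custom_temp_dir,
    input_files=[input_csv_path],
) as ss:
    df = ss.read.csv(input_csv_path, header=True)

    # Placeholder transformation: count rows per value of the first column
    counts_df = df.groupBy(df.columns[0]).count()

    # Write the aggregated result as Parquet
    counts_df.write.parquet("/output/counts.parquet")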