decentriq_util.spark
Functions
spark_session
def spark_session(
temp_dir: str = '/scratch',
input_files: list[str] = None,
**kwargs,
)
Create a Spark session and configure it according to the enclave environment.
Parameters:
temp_dir
: Where to store temporary data such as persisted data frames or shuffle data.
input_files
: A list of input files on the basis of which the partition size is determined.
Example:
import decentriq_util as dq
# Path to a potentially very large file
input_csv_path = "/input/my_file.csv"
# Automatically create and configure a spark session and
# make sure it's being stopped at the end.
with dq.spark.spark_session(input_files=[input_csv_path]) as ss:
    # Read from a CSV file
    df = ss.read.csv(input_csv_path, header=False).cache()
    # Perform any pyspark transformations
    print(f"Original number of rows: {df.count()}")
    result_df = df.limit(100)
    # Write the result to an output file
    result_df.write.parquet("/output/my_file.parquet")
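The temp_dir parameter can also be set explicitly when the default /scratch location is not suitable. The snippet below is a minimal sketch following the same /input and /output layout as the example above; the concrete file paths are illustrative and not part of the library.
import decentriq_util as dq
# Illustrative paths, following the layout of the example above.
input_csv_path = "/input/my_file.csv"
output_parquet_path = "/output/result.parquet"
# Point the session at an explicit scratch directory for shuffle data
# and persisted data frames (the default is "/scratch").
with dq.spark.spark_session(
    temp_dir="/scratch",
    input_files=[input_csv_path],
) as ss:
    df = ss.read.csv(input_csv_path, header=False)
    # Keep only distinct rows as an example transformation.
    result_df = df.distinct()
    result_df.write.mode("overwrite").parquet(output_parquet_path)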