decentriq_util.spark

Functions

create_spark_session

def create_spark_session(
name: str = 'local_spark_session',
parallelism: int = 8,
heap_size: Optional[int] = None,
heap_size_extra_room: int = 536870912,
) -> pyspark.sql.session.SparkSession

:param name: The name of the Spark session.
:param parallelism: The size of the executor pool used for internal Spark tasks.
:param heap_size: If set, this determines the JVM heap size. Note that not all of the heap is used by Spark.
:param heap_size_extra_room: If heap_size is unset, the heap size is derived from the memory available in the current cgroup, with heap_size_extra_room subtracted to arrive at the final value. This amount should account for the Python runtime itself as well as other structures that may be needed during Spark computation.
:return: The Spark session.
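
For reference, a minimal usage sketch. The parameter values shown here are illustrative assumptions, not recommendations; the returned object is a standard pyspark SparkSession and is used as such:

    from decentriq_util.spark import create_spark_session

    # Create a local session; heap_size is left unset so it is derived from
    # the cgroup memory limit minus heap_size_extra_room (512 MiB by default).
    spark = create_spark_session(
        name="example_session",
        parallelism=4,
    )

    try:
        # Standard pyspark usage on the returned session.
        df = spark.createDataFrame(
            [(1, "a"), (2, "b")],
            schema=["id", "value"],
        )
        df.show()
    finally:
        spark.stop()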