pyspark.pipelines.table
- pyspark.pipelines.table(query_function=None, *, name=None, comment=None, spark_conf=None, table_properties=None, partition_cols=None, cluster_by=None, schema=None, format=None)
Decorator to define a table in the pipeline and mark a function as the table’s query function.
@table can be used with or without parameters. If called without parameters, Python implicitly passes the decorated query function as the query_function parameter. If called with parameters, @table returns a decorator that is applied to the decorated query function. Both forms are shown in the example sketch below.
- Parameters
query_function – The table’s query function. Users should not pass this parameter explicitly; Python passes it implicitly when the decorator is used without parameters.
name – The name of the dataset. If unspecified, the query function’s name will be used.
comment – Description of the dataset.
spark_conf – A dict whose keys are the conf names and values are the conf values. These confs are set when the dataset’s query is executed, and they can override confs set for the pipeline or on the cluster.
table_properties – A dict where the keys are the property names and the values are the property values. These properties will be set on the table.
partition_cols – A list of column names to partition the table by.
cluster_by – A list of column names to cluster the table by.
schema – Explicit Spark SQL schema to materialize this table with. Supports either a PySpark StructType or a SQL DDL string, such as “a INT, b STRING”.
format – The format of the table, e.g. “parquet”.
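A minimal usage sketch covering both forms of the decorator. It assumes the module is imported under the alias dp, that the pipeline runtime provides an active SparkSession for the query functions, and that the input path, table names, properties, and columns are purely illustrative.

from pyspark.sql import SparkSession
from pyspark import pipelines as dp

# Assumption: the pipeline runtime has already created the active session.
spark = SparkSession.getActiveSession()

# Without parameters: the function name ("raw_orders") becomes the table name.
@dp.table
def raw_orders():
    # Hypothetical input path, for illustration only.
    return spark.read.format("json").load("/data/orders/")

# With parameters: @table returns a decorator that is applied to the query function.
@dp.table(
    name="orders_by_region",
    comment="Order counts aggregated per region.",
    partition_cols=["region"],
    table_properties={"quality": "silver"},
    schema="region STRING, order_count BIGINT",
    format="parquet",
)
def orders_aggregated():
    return (
        spark.read.table("raw_orders")
        .groupBy("region")
        .count()
        .withColumnRenamed("count", "order_count")
    )

In the first form the decorated function’s name is used as the table name; in the second, the keyword arguments configure the table, and the explicit DDL schema matches the columns produced by the query.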