fugue_dask 

fillna(df, value, subset=None)[source]

Fill NULL, NAN, NAT values in a dataframe

Parameters:

df (DataFrame) – DataFrame
value (Any) – if a scalar, fills all columns with the same value. If a dictionary, fills NA using the keys as column names and the values as the replacement values.
subset (List[str] | None) – list of columns to operate on. ignored if value is a dictionary

Returns:

DataFrame with NA records filled
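The interaction between value and subset can be illustrated on plain Python rows. This is a minimal sketch of the documented semantics, not the engine's implementation; fillna_rows and is_null are hypothetical helpers, and NaT handling is left out.

```python
import math
from typing import Any, Dict, List, Optional

Row = Dict[str, Any]


def is_null(v: Any) -> bool:
    """Treat None and float NaN as missing (NaT is omitted in this sketch)."""
    return v is None or (isinstance(v, float) and math.isnan(v))


def fillna_rows(rows: List[Row], value: Any, subset: Optional[List[str]] = None) -> List[Row]:
    """Hypothetical helper mirroring the documented value/subset rules."""
    if isinstance(value, dict):
        # Dictionary: keys are column names, and subset is ignored.
        replacements = dict(value)
    else:
        # Scalar: applies to every column, or only to the columns in subset.
        columns = subset if subset is not None else sorted({c for r in rows for c in r})
        replacements = {c: value for c in columns}
    return [
        {c: replacements[c] if c in replacements and is_null(v) else v for c, v in r.items()}
        for r in rows
    ]
```

With a dictionary value, passing subset has no effect: only the columns named in the dictionary are filled.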

Return type:

get_current_parallelism()[source]

Get the current degree of parallelism of this engine

Return type: int

intersect(df1, df2, distinct=True)[source]

Intersect df1 and df2

Parameters:

df1 (DataFrame) – the first dataframe
df2 (DataFrame) – the second dataframe
distinct (bool) – true for INTERSECT (== INTERSECT DISTINCT), false for INTERSECT ALL

Returns:

the intersected dataframe

Return type:

Note

Currently, the schema of df1 and df2 must be identical, or an exception will be thrown.
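The difference between INTERSECT DISTINCT and INTERSECT ALL can be sketched with multisets of row tuples. This only illustrates the SQL semantics that the distinct flag refers to; intersect_rows is a hypothetical helper, and whether a given engine supports the ALL variant is not stated here.

```python
from collections import Counter
from typing import List, Tuple


def intersect_rows(rows1: List[Tuple], rows2: List[Tuple], distinct: bool = True) -> List[Tuple]:
    """Hypothetical helper: rows are tuples with identical layout in both inputs."""
    # Counter & Counter keeps each row with the minimum of its two counts.
    common = Counter(rows1) & Counter(rows2)
    if distinct:
        # INTERSECT DISTINCT: each common row appears once.
        return sorted(common)
    # INTERSECT ALL: each common row appears min(count1, count2) times.
    return sorted(common.elements())
```

The results are sorted only to make the output deterministic; SQL set operations do not guarantee row order.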

property is_distributed: bool: Whether this engine is a distributed engine

join(df1, df2, how, on=None)[source]

Join two dataframes

Parameters:

df1 (DataFrame) – the first dataframe
df2 (DataFrame) – the second dataframe
how (str) – can accept semi, left_semi, anti, left_anti, inner, left_outer, right_outer, full_outer, cross
on (List[str] | None) – the join keys can always be inferred, but if you provide them, they will be validated against the inferred keys.

Returns:

the joined dataframe

Return type:

Note

Please read get_join_schemas()
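As an illustration of key inference and of the semi and anti join types, here is a minimal sketch on plain Python rows. It assumes the inferred keys are the column names the two inputs share; infer_join_keys and semi_or_anti are hypothetical helpers, not part of the engine.

```python
from typing import Any, Dict, List, Optional

Row = Dict[str, Any]


def infer_join_keys(cols1: List[str], cols2: List[str], on: Optional[List[str]] = None) -> List[str]:
    """Assumption: the join keys are the columns common to both inputs."""
    inferred = [c for c in cols1 if c in cols2]
    if on is not None and sorted(on) != sorted(inferred):
        # A provided `on` is only validated against the inferred keys.
        raise ValueError(f"on={on} does not match inferred keys {inferred}")
    return inferred


def semi_or_anti(rows1: List[Row], rows2: List[Row], keys: List[str], anti: bool = False) -> List[Row]:
    """Keep rows of rows1 that have (semi) or lack (anti) a key match in rows2."""
    seen = {tuple(r[k] for k in keys) for r in rows2}
    return [r for r in rows1 if (tuple(r[k] for k in keys) in seen) != anti]
```

Both join types return only columns from the first input, which is why they are also named left_semi and left_anti.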

load_df(path, format_hint=None, columns=None, **kwargs)[source]

Load dataframe from persistent storage

Parameters:

path (str | List[str]) – the path to the dataframe
format_hint (Any | None) – can accept parquet, csv, json, defaults to None, meaning to infer
columns (Any | None) – list of columns or a Schema like object, defaults to None
kwargs (Any) – parameters to pass to the underlying framework

Returns:

an engine compatible dataframe

Return type:

For more details and examples, read Load & Save.
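When format_hint is None the format has to be inferred. One plausible way to do that, shown here purely as an assumption, is to look at the file extension; infer_format is a hypothetical helper and the engine's actual inference rules may differ.

```python
import os
from typing import Optional

# The three formats named in the parameter description above.
_KNOWN_FORMATS = {".parquet": "parquet", ".csv": "csv", ".json": "json"}


def infer_format(path: str, format_hint: Optional[str] = None) -> str:
    """Hypothetical helper: an explicit hint wins, otherwise use the suffix."""
    if format_hint is not None:
        if format_hint not in _KNOWN_FORMATS.values():
            raise ValueError(f"unsupported format hint: {format_hint}")
        return format_hint
    ext = os.path.splitext(path)[1].lower()
    if ext not in _KNOWN_FORMATS:
        raise ValueError(f"cannot infer format from path: {path}")
    return _KNOWN_FORMATS[ext]
```

Passing an explicit hint is the safer choice when the path has no recognizable suffix.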

property log: Logger: Logger of this engine instance

persist(df, lazy=False, **kwargs)[source]

Force materializing and caching the dataframe

Parameters:

df (DataFrame) – the input dataframe
lazy (bool) – True: first usage of the output will trigger persisting to happen; False (eager): persist is forced to happen immediately. Defaults to False
kwargs (Any) – parameters to pass to the underlying persist implementation

Returns:

the persisted dataframe

Return type:

Note

persist can only guarantee that the persisted dataframe will be computed only once. However, this doesn’t mean the backend really breaks up the execution dependency at the persisting point. Commonly, it doesn’t cause any issue, but if your execution graph is long, it may cause unexpected problems, for example, stack overflow.
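The lazy and eager modes, and the compute-only-once guarantee, can be pictured with a small caching wrapper. This is a conceptual sketch only; PersistedValue is a hypothetical class and does not reflect how Dask or the engine caches data.

```python
from typing import Any, Callable


class PersistedValue:
    """Hypothetical wrapper: computes a value at most once, eagerly or lazily."""

    def __init__(self, compute: Callable[[], Any], lazy: bool = False):
        self._compute = compute
        self._done = False
        self._value: Any = None
        if not lazy:
            # Eager (lazy=False): materialize immediately.
            self.get()

    def get(self) -> Any:
        # Lazy: the first use triggers the computation; later uses hit the cache.
        if not self._done:
            self._value = self._compute()
            self._done = True
        return self._value
```

In both modes the computation runs a single time; the modes differ only in when it runs.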

property pl_utils: DaskUtils: Pandas-like dataframe utils

repartition(df, partition_spec)[source]

Partition the input dataframe using partition_spec.

Parameters:

df (DataFrame) – input dataframe
partition_spec (PartitionSpec) – how you want to partition the dataframe

Returns:

repartitioned dataframe

Return type:

Note

Before implementing please read the Partition Tutorial

sample(df, n=None, frac=None, replace=False, seed=None)[source]

Sample dataframe by number of rows or by fraction

Parameters:

df (DataFrame) – DataFrame
n (int | None) – number of rows to sample, one and only one of n and frac must be set
frac (float | None) – fraction [0,1] to sample, one and only one of n and frac must be set
replace (bool) – whether replacement is allowed. With replacement, there may be duplicated rows in the result, defaults to False
seed (int | None) – seed for randomness, defaults to None

Returns:

sampled dataframe
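The rule that exactly one of n and frac must be set, together with the roles of replace and seed, can be sketched with the standard library. sample_rows is a hypothetical helper; it treats frac as an exact row count, which is a simplification, since distributed engines may sample probabilistically.

```python
import random
from typing import Any, List, Optional


def sample_rows(
    rows: List[Any],
    n: Optional[int] = None,
    frac: Optional[float] = None,
    replace: bool = False,
    seed: Optional[int] = None,
) -> List[Any]:
    """Hypothetical helper illustrating the n/frac/replace/seed parameters."""
    if (n is None) == (frac is None):
        raise ValueError("one and only one of n and frac must be set")
    rng = random.Random(seed)  # a fixed seed makes the sample reproducible
    k = n if n is not None else int(round(frac * len(rows)))
    if replace:
        # With replacement, the same row may be drawn more than once.
        return [rng.choice(rows) for _ in range(k)]
    # Without replacement, each row is drawn at most once.
    return rng.sample(rows, min(k, len(rows)))
```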

Return type:

save_df(df, path, format_hint=None, mode='overwrite', partition_spec=None, force_single=False, **kwargs)[source]

Save dataframe to a persistent storage

Parameters:

df (DataFrame) – input dataframe
path (str) – output path
format_hint (Any | None) – can accept parquet, csv, json, defaults to None, meaning to infer
mode (str) – can accept overwrite, append, error, defaults to “overwrite”
partition_spec (PartitionSpec | None) – how to partition the dataframe before saving, defaults to empty
force_single (bool) – force the output as a single file, defaults to False
kwargs (Any) – parameters to pass to the underlying framework

Return type:

None

For more details and examples, read Load & Save.

subtract(df1, df2, distinct=True)[source]

df1 - df2

Parameters:

df1 (DataFrame) – the first dataframe
df2 (DataFrame) – the second dataframe
distinct (bool) – true for EXCEPT (== EXCEPT DISTINCT), false for EXCEPT ALL

Returns:

the dataframe of rows in df1 that are not in df2

Return type:

Note

Currently, the schema of df1 and df2 must be identical, or an exception will be thrown.
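EXCEPT DISTINCT and EXCEPT ALL differ in how duplicates are counted, which a multiset sketch makes concrete. subtract_rows is a hypothetical helper showing the SQL semantics only; whether a given engine supports the ALL variant is not stated here.

```python
from collections import Counter
from typing import List, Tuple


def subtract_rows(rows1: List[Tuple], rows2: List[Tuple], distinct: bool = True) -> List[Tuple]:
    """Hypothetical helper: rows are tuples with identical layout in both inputs."""
    if distinct:
        # EXCEPT DISTINCT: distinct rows of rows1 that never occur in rows2.
        excluded = set(rows2)
        return sorted({r for r in rows1 if r not in excluded})
    # EXCEPT ALL: each occurrence in rows2 cancels one occurrence in rows1.
    remaining = Counter(rows1) - Counter(rows2)
    return sorted(remaining.elements())
```

For example, three copies of a row in df1 and one copy in df2 leave no copies under EXCEPT DISTINCT but two copies under EXCEPT ALL.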

take(df, n, presort, na_position='last', partition_spec=None)[source]

Get the first n rows of a DataFrame per partition. If a presort is defined, use the presort before applying take. presort overrides partition_spec.presort. The Fugue implementation of the presort follows Pandas convention of specifying NULLs first or NULLs last. This is different from the Spark and SQL convention of NULLs as the smallest value.

Parameters:

df (DataFrame) – DataFrame
n (int) – number of rows to return
presort (str) – presort expression similar to partition presort
na_position (str) – position of null values during the presort. can accept first or last
partition_spec (PartitionSpec | None) – PartitionSpec to apply the take operation

Returns:

n rows of DataFrame per partition
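The per-partition behavior and the na_position rule can be sketched on plain Python rows. take_rows is a hypothetical helper: it takes a single sort column and an ascending flag instead of a presort expression, and a list of partition columns instead of a PartitionSpec.

```python
from typing import Any, Dict, List, Optional, Tuple

Row = Dict[str, Any]


def take_rows(
    rows: List[Row],
    n: int,
    sort_col: str,
    ascending: bool = True,
    na_position: str = "last",
    partition_by: Optional[List[str]] = None,
) -> List[Row]:
    """Hypothetical helper: first n rows per partition after a presort."""
    if na_position not in ("first", "last"):
        raise ValueError("na_position must be 'first' or 'last'")
    # Group rows into logical partitions; no keys means one partition.
    groups: Dict[Tuple, List[Row]] = {}
    for r in rows:
        key = tuple(r[c] for c in (partition_by or []))
        groups.setdefault(key, []).append(r)
    result: List[Row] = []
    for part in groups.values():
        # NULLs are placed explicitly rather than compared as smallest values.
        nulls = [r for r in part if r[sort_col] is None]
        values = sorted(
            (r for r in part if r[sort_col] is not None),
            key=lambda r: r[sort_col],
            reverse=not ascending,
        )
        ordered = nulls + values if na_position == "first" else values + nulls
        result.extend(ordered[:n])
    return result
```

Note that na_position applies regardless of sort direction, which is the Pandas convention described above.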

Return type:

to_df(df, schema=None)[source]

Convert a data structure to DaskDataFrame

Parameters:

df (Any) – DataFrame, dask.dataframe.DataFrame, pandas DataFrame or list or iterable of arrays
schema (Any | None) – Schema like object, defaults to None.

Returns:

engine compatible dataframe

Return type:

Note

If the input is already DaskDataFrame, it should return itself
For list or iterable of arrays, schema must be specified
When schema is not None, a potential type cast may happen to ensure the dataframe’s schema.
All other methods in the engine can take arbitrary dataframes and call this method to convert before doing anything

union(df1, df2, distinct=True)[source]

Union two dataframes

Parameters:

df1 (DataFrame) – the first dataframe
df2 (DataFrame) – the second dataframe
distinct (bool) – true for UNION (== UNION DISTINCT), false for UNION ALL

Returns:

the unioned dataframe

Return type:

Note

Currently, the schema of df1 and df2 must be identical, or an exception will be thrown.
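For completeness, UNION versus UNION ALL on row tuples is sketched below. union_rows is a hypothetical helper that shows the SQL semantics only.

```python
from typing import List, Tuple


def union_rows(rows1: List[Tuple], rows2: List[Tuple], distinct: bool = True) -> List[Tuple]:
    """Hypothetical helper: rows are tuples with identical layout in both inputs."""
    combined = list(rows1) + list(rows2)
    if distinct:
        # UNION: drop duplicates (first-seen order is kept here for readability;
        # SQL itself does not guarantee any order).
        return list(dict.fromkeys(combined))
    # UNION ALL: plain concatenation, duplicates included.
    return combined
```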

class fugue_dask.execution_engine.DaskMapEngine(execution_engine)[source]

Bases: MapEngine

Parameters: execution_engine (ExecutionEngine) –

property execution_engine_constraint: Type[ExecutionEngine]

This defines the required ExecutionEngine type of this facet

Returns: a subtype of ExecutionEngine

property is_distributed: bool: Whether this engine is a distributed engine

map_dataframe(df, map_func, output_schema, partition_spec, on_init=None, map_func_format_hint=None)[source]

Apply a function to each partition after you partition the dataframe in a specified way.

Parameters:

df (DataFrame) – input dataframe
map_func (Callable[[PartitionCursor, LocalDataFrame], LocalDataFrame]) – the function to apply on every logical partition
output_schema (Any) – Schema like object that can’t be None. Please also understand why we need this
partition_spec (PartitionSpec) – partition specification
on_init (Callable[[int, DataFrame], Any] | None) – callback function when the physical partition is initializing, defaults to None
map_func_format_hint (str | None) – the preferred data format for map_func, it can be pandas, pyarrow, etc, defaults to None. Certain engines can provide the most efficient map operations based on the hint.

Returns:

the dataframe after the map operation

Return type:

Note

Before implementing, you must read this to understand what map is used for and how it should work.
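The idea of applying a function to each logical partition can be sketched on plain Python rows. map_partitions is a hypothetical helper: it passes a partition number and a list of rows to map_func, standing in for the PartitionCursor and LocalDataFrame arguments of the real interface, and it ignores output_schema and on_init.

```python
from typing import Any, Callable, Dict, List, Tuple

Row = Dict[str, Any]


def map_partitions(
    rows: List[Row],
    map_func: Callable[[int, List[Row]], List[Row]],
    partition_by: List[str],
) -> List[Row]:
    """Hypothetical helper: group rows by key, then map each group."""
    # Rows sharing the same key values form one logical partition.
    groups: Dict[Tuple, List[Row]] = {}
    for r in rows:
        groups.setdefault(tuple(r[c] for c in partition_by), []).append(r)
    output: List[Row] = []
    for partition_no, part in enumerate(groups.values()):
        # map_func sees one whole logical partition at a time.
        output.extend(map_func(partition_no, part))
    return output
```

A map_func that returns one summary row per partition yields one output row per distinct key.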

class fugue_dask.execution_engine.DaskSQLEngine(execution_engine)[source]

Bases: SQLEngine

Dask-sql implementation.

Parameters: execution_engine (ExecutionEngine) –

property dialect: str | None

property is_distributed: bool: Whether this engine is a distributed engine

select(dfs, statement)[source]

Execute select statement on the sql engine.

Parameters:

dfs (DataFrames) – a collection of dataframes that must have keys
statement (StructuredRawSQL) – the SELECT statement using the dfs keys as tables.

Returns:

result of the SELECT statement

Return type:

Examples

dfs = DataFrames(a=df1, b=df2)
sql_engine.select(
    dfs,
    [(False, "SELECT * FROM "),
     (True,"a"),
     (False," UNION SELECT * FROM "),
     (True,"b")])

Note

There can be tables that are not in dfs. For example, you may want to select from hive without input DataFrames:

>>> sql_engine.select(DataFrames(), "SELECT * FROM hive.a.table")
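The list of (bool, str) pairs in the example above reads as a statement split into plain SQL fragments and dataframe keys. Assuming the boolean marks a fragment as a key of dfs, an engine could render the statement by substituting each key with a backend table name. render_statement and the temporary table names are hypothetical and are not part of the library.

```python
from typing import Dict, List, Tuple


def render_statement(parts: List[Tuple[bool, str]], table_names: Dict[str, str]) -> str:
    """Hypothetical helper: replace dataframe keys with backend table names."""
    return "".join(
        # True marks a dataframe key (assumption); False marks raw SQL text.
        table_names[text] if is_table else text
        for is_table, text in parts
    )


parts = [
    (False, "SELECT * FROM "),
    (True, "a"),
    (False, " UNION SELECT * FROM "),
    (True, "b"),
]
sql = render_statement(parts, {"a": "tmp_a", "b": "tmp_b"})
# sql == "SELECT * FROM tmp_a UNION SELECT * FROM tmp_b"
```

Keeping table references separate from raw text lets the engine rename tables without parsing the SQL.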

to_df(df, schema=None)[source]

Convert a data structure to this engine compatible DataFrame

Parameters:

df (AnyDataFrame) – DataFrame, pandas DataFrame or list or iterable of arrays or others that are supported by certain engine implementations
schema (Any | None) – Schema like object, defaults to None

Returns:

engine compatible dataframe

Return type:

Note

There are certain conventions to follow for a new implementation:

if the input is already in compatible dataframe type, it should return itself
all other methods in the engine interface should take arbitrary dataframes and call this method to convert before doing anything

fugue_dask.execution_engine.to_dask_engine_df(df, schema=None)[source]

Convert a data structure to DaskDataFrame

Parameters:

df (Any) – DataFrame, dask.dataframe.DataFrame, pandas DataFrame or list or iterable of arrays
schema (Any | None) – Schema like object, defaults to None.

Returns:

engine compatible dataframe

Return type: