fugue.dataframe

fugue.dataframe.api

fugue.dataframe.api.get_native_as_df(df)[source]

Return the dataframe form of any dataframe. If df is a DataFrame, then call the native_as_df(), otherwise, it depends on whether there is a correspondent function handling it.

Parameters:

df (AnyDataFrame)

Return type:

AnyDataFrame

fugue.dataframe.api.normalize_column_names(df)[source]

A generic function to normalize any dataframe’s column names to follow Fugue naming rules

Note

This is a temporary solution before Schema can take arbitrary names

Examples

  • [0,1] => {"_0":0, "_1":1}

  • ["1a","2b"] => {"_1a":"1a", "_2b":"2b"}

  • ["*a","-a"] => {"_a":"*a", "_a_1":"-a"}

Parameters:

df (AnyDataFrame) – a dataframe object

Returns:

the renamed dataframe and the rename operations as a dict that can undo the change

Return type:

Tuple[AnyDataFrame, Dict[str, Any]]

fugue.dataframe.array_dataframe

class fugue.dataframe.array_dataframe.ArrayDataFrame(df=None, schema=None)[source]

Bases: LocalBoundedDataFrame

DataFrame that wraps native python 2-dimensional arrays. Please read the DataFrame Tutorial to understand the concept

Parameters:
  • df (Any) – 2-dimensional array, iterable of arrays, or DataFrame

  • schema (Any) – Schema like object

Examples

>>> a = ArrayDataFrame([[0,'a'],[1,'b']],"a:int,b:str")
>>> b = ArrayDataFrame(a)
alter_columns(columns)[source]

Change column types

Parameters:

columns (Any) – Schema like object, all columns should be contained by the dataframe schema

Returns:

a new dataframe with altered columns, the order of the original schema will not change

Return type:

DataFrame

as_array(columns=None, type_safe=False)[source]

Convert to 2-dimensional native python array

Parameters:
  • columns (List[str] | None) – columns to extract, defaults to None

  • type_safe (bool) – whether to ensure output conforms with its schema, defaults to False

Returns:

2-dimensional native python array

Return type:

List[Any]

Note

If type_safe is False, then the returned values are ‘raw’ values.

as_array_iterable(columns=None, type_safe=False)[source]

Convert to iterable of native python arrays

Parameters:
  • columns (List[str] | None) – columns to extract, defaults to None

  • type_safe (bool) – whether to ensure output conforms with its schema, defaults to False

Returns:

iterable of native python arrays

Return type:

Iterable[Any]

Note

If type_safe is False, then the returned values are ‘raw’ values.

count()[source]

Get number of rows of this dataframe

Return type:

int

property empty: bool

Whether this dataframe is empty

head(n, columns=None)[source]

Get first n rows of the dataframe as a new local bounded dataframe

Parameters:
  • n (int) – number of rows

  • columns (List[str] | None) – selected columns, defaults to None (all columns)

Returns:

a local bounded dataframe

Return type:

LocalBoundedDataFrame

property native: List[Any]

2-dimensional native python array

peek_array()[source]

Peek the first row of the dataframe as array

Raises:

FugueDatasetEmptyError – if it is empty

Return type:

List[Any]

rename(columns)[source]

Rename the dataframe using a mapping dict

Parameters:

columns (Dict[str, str]) – key: the original column name, value: the new name

Returns:

a new dataframe with the new names

Return type:

DataFrame

fugue.dataframe.arrow_dataframe

class fugue.dataframe.arrow_dataframe.ArrowDataFrame(df=None, schema=None)[source]

Bases: LocalBoundedDataFrame

DataFrame that wraps pyarrow.Table. Please also read the DataFrame Tutorial to understand this Fugue concept

Parameters:
  • df (Any) – 2-dimensional array, iterable of arrays, pyarrow.Table or pandas DataFrame

  • schema (Any) – Schema like object

Examples

>>> ArrowDataFrame([[0,'a'],[1,'b']],"a:int,b:str")
>>> ArrowDataFrame(schema = "a:int,b:int")  # empty dataframe
>>> ArrowDataFrame(pd.DataFrame([[0]],columns=["a"]))
>>> ArrowDataFrame(ArrayDataFrame([[0]],"a:int).as_arrow())
alter_columns(columns)[source]

Change column types

Parameters:

columns (Any) – Schema like object, all columns should be contained by the dataframe schema

Returns:

a new dataframe with altered columns, the order of the original schema will not change

Return type:

DataFrame

as_array(columns=None, type_safe=False)[source]

Convert to 2-dimensional native python array

Parameters:
  • columns (List[str] | None) – columns to extract, defaults to None

  • type_safe (bool) – whether to ensure output conforms with its schema, defaults to False

Returns:

2-dimensional native python array

Return type:

List[Any]

Note

If type_safe is False, then the returned values are ‘raw’ values.

as_array_iterable(columns=None, type_safe=False)[source]

Convert to iterable of native python arrays

Parameters:
  • columns (List[str] | None) – columns to extract, defaults to None

  • type_safe (bool) – whether to ensure output conforms with its schema, defaults to False

Returns:

iterable of native python arrays

Return type:

Iterable[Any]

Note

If type_safe is False, then the returned values are ‘raw’ values.

as_arrow(type_safe=False)[source]

Convert to pyArrow DataFrame

Parameters:

type_safe (bool)

Return type:

Table

as_dict_iterable(columns=None)[source]

Convert to iterable of python dicts

Parameters:

columns (List[str] | None) – columns to extract, defaults to None

Returns:

iterable of python dicts

Return type:

Iterable[Dict[str, Any]]

Note

The default implementation enforces type_safe True

as_dicts(columns=None)[source]

Convert to a list of python dicts

Parameters:

columns (List[str] | None) – columns to extract, defaults to None

Returns:

a list of python dicts

Return type:

List[Dict[str, Any]]

Note

The default implementation enforces type_safe True

as_pandas()[source]

Convert to pandas DataFrame

Return type:

DataFrame

count()[source]

Get number of rows of this dataframe

Return type:

int

property empty: bool

Whether this dataframe is empty

head(n, columns=None)[source]

Get first n rows of the dataframe as a new local bounded dataframe

Parameters:
  • n (int) – number of rows

  • columns (List[str] | None) – selected columns, defaults to None (all columns)

Returns:

a local bounded dataframe

Return type:

LocalBoundedDataFrame

property native: Table

pyarrow.Table

native_as_df()[source]

The dataframe form of the native object this Dataset class wraps. Dataframe form means the object contains schema information. For example the native an ArrayDataFrame is a python array, it doesn’t contain schema information, and its native_as_df should be either a pandas dataframe or an arrow dataframe.

Return type:

Table

peek_array()[source]

Peek the first row of the dataframe as array

Raises:

FugueDatasetEmptyError – if it is empty

Return type:

List[Any]

peek_dict()[source]

Peek the first row of the dataframe as dict

Raises:

FugueDatasetEmptyError – if it is empty

Return type:

Dict[str, Any]

rename(columns)[source]

Rename the dataframe using a mapping dict

Parameters:

columns (Dict[str, str]) – key: the original column name, value: the new name

Returns:

a new dataframe with the new names

Return type:

DataFrame

fugue.dataframe.dataframe

class fugue.dataframe.dataframe.DataFrame(schema=None)[source]

Bases: Dataset

Base class of Fugue DataFrame. Please read the DataFrame Tutorial to understand the concept

Parameters:

schema (Any) – Schema like object

Note

This is an abstract class, and normally you don’t construct it by yourself unless you are implementing a new ExecutionEngine

abstract alter_columns(columns)[source]

Change column types

Parameters:

columns (Any) – Schema like object, all columns should be contained by the dataframe schema

Returns:

a new dataframe with altered columns, the order of the original schema will not change

Return type:

DataFrame

abstract as_array(columns=None, type_safe=False)[source]

Convert to 2-dimensional native python array

Parameters:
  • columns (List[str] | None) – columns to extract, defaults to None

  • type_safe (bool) – whether to ensure output conforms with its schema, defaults to False

Returns:

2-dimensional native python array

Return type:

List[Any]

Note

If type_safe is False, then the returned values are ‘raw’ values.

abstract as_array_iterable(columns=None, type_safe=False)[source]

Convert to iterable of native python arrays

Parameters:
  • columns (List[str] | None) – columns to extract, defaults to None

  • type_safe (bool) – whether to ensure output conforms with its schema, defaults to False

Returns:

iterable of native python arrays

Return type:

Iterable[Any]

Note

If type_safe is False, then the returned values are ‘raw’ values.

as_arrow(type_safe=False)[source]

Convert to pyArrow DataFrame

Parameters:

type_safe (bool)

Return type:

Table

as_dict_iterable(columns=None)[source]

Convert to iterable of python dicts

Parameters:

columns (List[str] | None) – columns to extract, defaults to None

Returns:

iterable of python dicts

Return type:

Iterable[Dict[str, Any]]

Note

The default implementation enforces type_safe True

as_dicts(columns=None)[source]

Convert to a list of python dicts

Parameters:

columns (List[str] | None) – columns to extract, defaults to None

Returns:

a list of python dicts

Return type:

List[Dict[str, Any]]

Note

The default implementation enforces type_safe True

as_local()[source]

Convert this dataframe to a LocalDataFrame

Return type:

LocalDataFrame

abstract as_local_bounded()[source]

Convert this dataframe to a LocalBoundedDataFrame

Return type:

LocalBoundedDataFrame

as_pandas()[source]

Convert to pandas DataFrame

Return type:

DataFrame

property columns: List[str]

The column names of the dataframe

drop(columns)[source]

Drop certain columns and return a new dataframe

Parameters:

columns (List[str]) – columns to drop

Raises:

FugueDataFrameOperationError – if columns are not strictly contained by this dataframe, or it is the entire dataframe columns

Returns:

a new dataframe removing the columns

Return type:

DataFrame

get_info_str()[source]

Get dataframe information (schema, type, metadata) as json string

Returns:

json string

Return type:

str

abstract head(n, columns=None)[source]

Get first n rows of the dataframe as a new local bounded dataframe

Parameters:
  • n (int) – number of rows

  • columns (List[str] | None) – selected columns, defaults to None (all columns)

Returns:

a local bounded dataframe

Return type:

LocalBoundedDataFrame

abstract native_as_df()[source]

The dataframe form of the native object this Dataset class wraps. Dataframe form means the object contains schema information. For example the native an ArrayDataFrame is a python array, it doesn’t contain schema information, and its native_as_df should be either a pandas dataframe or an arrow dataframe.

Return type:

AnyDataFrame

abstract peek_array()[source]

Peek the first row of the dataframe as array

Raises:

FugueDatasetEmptyError – if it is empty

Return type:

List[Any]

peek_dict()[source]

Peek the first row of the dataframe as dict

Raises:

FugueDatasetEmptyError – if it is empty

Return type:

Dict[str, Any]

abstract rename(columns)[source]

Rename the dataframe using a mapping dict

Parameters:

columns (Dict[str, str]) – key: the original column name, value: the new name

Returns:

a new dataframe with the new names

Return type:

DataFrame

property schema: Schema

The schema of the dataframe

property schema_discovered: Schema

Whether the schema has been discovered or still a lambda

class fugue.dataframe.dataframe.DataFrameDisplay(ds)[source]

Bases: DatasetDisplay

DataFrame plain display class

Parameters:

ds (Dataset)

property df: DataFrame

The target DataFrame

show(n=10, with_count=False, title=None)[source]

Show the Dataset

Parameters:
  • n (int) – top n items to display, defaults to 10

  • with_count (bool) – whether to display the total count, defaults to False

  • title (str | None) – title to display, defaults to None

Return type:

None

class fugue.dataframe.dataframe.LocalBoundedDataFrame(schema=None)[source]

Bases: LocalDataFrame

Base class of all local bounded dataframes. Please read this to understand the concept

Parameters:

schema (Any) – Schema like object

Note

This is an abstract class, and normally you don’t construct it by yourself unless you are implementing a new ExecutionEngine

as_local_bounded()[source]

Always True because it’s a bounded dataframe

Return type:

LocalBoundedDataFrame

property is_bounded: bool

Always True because it’s a bounded dataframe

class fugue.dataframe.dataframe.LocalDataFrame(schema=None)[source]

Bases: DataFrame

Base class of all local dataframes. Please read this to understand the concept

Parameters:

schema (Any) – a schema-like object

Note

This is an abstract class, and normally you don’t construct it by yourself unless you are implementing a new ExecutionEngine

property is_local: bool

Always True because it’s a LocalDataFrame

native_as_df()[source]

The dataframe form of the native object this Dataset class wraps. Dataframe form means the object contains schema information. For example the native an ArrayDataFrame is a python array, it doesn’t contain schema information, and its native_as_df should be either a pandas dataframe or an arrow dataframe.

Return type:

AnyDataFrame

property num_partitions: int

Always 1 because it’s a LocalDataFrame

class fugue.dataframe.dataframe.LocalUnboundedDataFrame(schema=None)[source]

Bases: LocalDataFrame

Base class of all local unbounded dataframes. Read this <https://fugue-tutorials.readthedocs.io/ en/latest/tutorials/advanced/schema_dataframes.html#DataFrame>`_ to understand the concept

Parameters:

schema (Any) – Schema like object

Note

This is an abstract class, and normally you don’t construct it by yourself unless you are implementing a new ExecutionEngine

as_local()[source]

Convert this dataframe to a LocalDataFrame

Return type:

LocalDataFrame

count()[source]
Raises:

InvalidOperationError – You can’t count an unbounded dataframe

Return type:

int

property is_bounded

Always False because it’s an unbounded dataframe

class fugue.dataframe.dataframe.YieldedDataFrame(yid)[source]

Bases: Yielded

Yielded dataframe from FugueWorkflow. Users shouldn’t create this object directly.

Parameters:

yid (str) – unique id for determinism

property is_set: bool

Whether the value is set. It can be false if the parent workflow has not been executed.

property result: DataFrame

The yielded dataframe, it will be set after the parent workflow is computed

set_value(df)[source]

Set the yielded dataframe after compute. Users should not call it.

Parameters:
Return type:

None

fugue.dataframe.dataframe.as_fugue_df(df, **kwargs)[source]

Wrap the object as a Fugue DataFrame.

Parameters:
  • df (AnyDataFrame) – the object to wrap

  • kwargs (Any)

Return type:

DataFrame

fugue.dataframe.dataframe_iterable_dataframe

class fugue.dataframe.dataframe_iterable_dataframe.IterableArrowDataFrame(df=None, schema=None)[source]

Bases: LocalDataFrameIterableDataFrame

Parameters:
  • df (Any)

  • schema (Any)

class fugue.dataframe.dataframe_iterable_dataframe.IterablePandasDataFrame(df=None, schema=None)[source]

Bases: LocalDataFrameIterableDataFrame

Parameters:
  • df (Any)

  • schema (Any)

as_local_bounded()[source]

Convert this dataframe to a LocalBoundedDataFrame

Return type:

LocalBoundedDataFrame

class fugue.dataframe.dataframe_iterable_dataframe.LocalDataFrameIterableDataFrame(df=None, schema=None)[source]

Bases: LocalUnboundedDataFrame

DataFrame that wraps an iterable of local dataframes

Parameters:
  • df (Any) – an iterable of DataFrame. If any is not local, they will be converted to LocalDataFrame by as_local()

  • schema (Any) – Schema like object, if it is provided, it must match the schema of the dataframes

Examples

def get_dfs(seq):
    yield IterableDataFrame([], "a:int,b:int")
    yield IterableDataFrame([[1, 10]], "a:int,b:int")
    yield ArrayDataFrame([], "a:int,b:str")

df = LocalDataFrameIterableDataFrame(get_dfs())
for subdf in df.native:
    subdf.show()

Note

It’s ok to peek the dataframe, it will not affect the iteration, but it’s invalid to count.

schema can be used when the iterable contains no dataframe. But if there is any dataframe, schema must match the schema of the dataframes.

For the iterable of dataframes, if there is any empty dataframe, they will be skipped and their schema will not matter. However, if all dataframes in the interable are empty, then the last empty dataframe will be used to set the schema.

alter_columns(columns)[source]

Change column types

Parameters:

columns (Any) – Schema like object, all columns should be contained by the dataframe schema

Returns:

a new dataframe with altered columns, the order of the original schema will not change

Return type:

DataFrame

as_array(columns=None, type_safe=False)[source]

Convert to 2-dimensional native python array

Parameters:
  • columns (List[str] | None) – columns to extract, defaults to None

  • type_safe (bool) – whether to ensure output conforms with its schema, defaults to False

Returns:

2-dimensional native python array

Return type:

List[Any]

Note

If type_safe is False, then the returned values are ‘raw’ values.

as_array_iterable(columns=None, type_safe=False)[source]

Convert to iterable of native python arrays

Parameters:
  • columns (List[str] | None) – columns to extract, defaults to None

  • type_safe (bool) – whether to ensure output conforms with its schema, defaults to False

Returns:

iterable of native python arrays

Return type:

Iterable[Any]

Note

If type_safe is False, then the returned values are ‘raw’ values.

as_arrow(type_safe=False)[source]

Convert to pyArrow DataFrame

Parameters:

type_safe (bool)

Return type:

Table

as_local_bounded()[source]

Convert this dataframe to a LocalBoundedDataFrame

Return type:

LocalBoundedDataFrame

as_pandas()[source]

Convert to pandas DataFrame

Return type:

DataFrame

property empty: bool

Whether this dataframe is empty

head(n, columns=None)[source]

Get first n rows of the dataframe as a new local bounded dataframe

Parameters:
  • n (int) – number of rows

  • columns (List[str] | None) – selected columns, defaults to None (all columns)

Returns:

a local bounded dataframe

Return type:

LocalBoundedDataFrame

property native: EmptyAwareIterable[LocalDataFrame]

Iterable of dataframes

peek_array()[source]

Peek the first row of the dataframe as array

Raises:

FugueDatasetEmptyError – if it is empty

Return type:

List[Any]

rename(columns)[source]

Rename the dataframe using a mapping dict

Parameters:

columns (Dict[str, str]) – key: the original column name, value: the new name

Returns:

a new dataframe with the new names

Return type:

DataFrame

fugue.dataframe.dataframes

class fugue.dataframe.dataframes.DataFrames(*args, **kwargs)[source]

Bases: IndexedOrderedDict[str, DataFrame]

Ordered dictionary of DataFrames. There are two modes: with keys and without keys. If without key _<n> will be used as the key for each dataframe, and it will be treated as an array in Fugue framework.

It’s a subclass of dict, so it supports all dict operations. It’s also ordered, so you can trust the order of keys and values.

The initialization is flexible

>>> df1 = ArrayDataFrame([[0]],"a:int")
>>> df2 = ArrayDataFrame([[1]],"a:int")
>>> dfs = DataFrames(df1,df2)  # init as [df1, df2]
>>> assert not dfs.has_key
>>> assert df1 is dfs[0] and df2 is dfs[1]
>>> dfs_array = list(dfs.values())
>>> dfs = DataFrames(a=df1,b=df2)  # init as {a:df1, b:df2}
>>> assert dfs.has_key
>>> assert df1 is dfs[0] and df2 is dfs[1]  # order is guaranteed
>>> df3 = ArrayDataFrame([[1]],"b:int")
>>> dfs2 = DataFrames(dfs, c=df3)  # {a:df1, b:df2, c:df3}
>>> dfs2 = DataFrames(dfs, df3)  # invalid, because dfs has key, df3 doesn't
>>> dfs2 = DataFrames(dict(a=df1,b=df2))  # init as {a:df1, b:df2}
>>> dfs2 = DataFrames([df1,df2],df3)  # init as [df1, df2, df3]
Parameters:
  • args (Any)

  • kwargs (Any)

convert(func)[source]

Create another DataFrames with the same structure, but all converted by func

Returns:

the new DataFrames

Parameters:

func (Callable[[DataFrame], DataFrame])

Return type:

DataFrames

Examples

>>> dfs2 = dfs.convert(lambda df: df.as_local()) # convert all to local
property has_key

If this collection has key (dict-like) or not (list-like)

fugue.dataframe.function_wrapper

class fugue.dataframe.function_wrapper.DataFrameFunctionWrapper(func, params_re='.*', return_re='.*')[source]

Bases: FunctionWrapper

Parameters:
  • func (Callable)

  • params_re (str)

  • return_re (str)

get_format_hint()[source]
Return type:

str | None

property need_output_schema: bool | None
run(args, kwargs, ignore_unknown=False, output_schema=None, output=True, ctx=None)[source]
Parameters:
  • args (List[Any])

  • kwargs (Dict[str, Any])

  • ignore_unknown (bool)

  • output_schema (Any | None)

  • output (bool)

  • ctx (Any | None)

Return type:

Any

class fugue.dataframe.function_wrapper.DataFrameParam(param)[source]

Bases: _DataFrameParamBase

Parameters:

param (Parameter | None)

count(df)[source]
Parameters:

df (Any)

Return type:

int

to_input_data(df, ctx)[source]
Parameters:
Return type:

Any

to_output_df(output, schema, ctx)[source]
Parameters:
  • output (Any)

  • schema (Any)

  • ctx (Any)

Return type:

DataFrame

class fugue.dataframe.function_wrapper.DictParam(param)[source]

Bases: RowParam

Parameters:

param (Parameter | None)

iterable_to_output_df(dfs, schema, ctx)[source]
Parameters:
  • dfs (Iterable[Dict[str, Any]])

  • schema (Any)

  • ctx (Any)

Return type:

DataFrame

to_input_rows(df, ctx)[source]
Parameters:
Return type:

Iterable[Any]

to_output_df(output, schema, ctx)[source]
Parameters:
  • output (Dict[str, Any])

  • schema (Any)

  • ctx (Any)

Return type:

DataFrame

class fugue.dataframe.function_wrapper.LocalDataFrameParam(param)[source]

Bases: DataFrameParam

Parameters:

param (Parameter | None)

count(df)[source]
Parameters:

df (LocalDataFrame)

Return type:

int

iterable_to_output_df(dfs, schema, ctx)[source]
Parameters:
  • dfs (Iterable[Any])

  • schema (Any)

  • ctx (Any)

Return type:

DataFrame

to_input_data(df, ctx)[source]
Parameters:
Return type:

LocalDataFrame

to_output_df(output, schema, ctx)[source]
Parameters:
Return type:

DataFrame

class fugue.dataframe.function_wrapper.RowParam(param)[source]

Bases: _DataFrameParamBase

Parameters:

param (Parameter | None)

count(df)[source]
Parameters:

df (Any)

Return type:

int

property is_per_row: bool

fugue.dataframe.iterable_dataframe

class fugue.dataframe.iterable_dataframe.IterableDataFrame(df=None, schema=None)[source]

Bases: LocalUnboundedDataFrame

DataFrame that wraps native python iterable of arrays. Please read the DataFrame Tutorial to understand the concept

Parameters:
  • df (Any) – 2-dimensional array, iterable of arrays, or DataFrame

  • schema (Any) – Schema like object

Examples

>>> a = IterableDataFrame([[0,'a'],[1,'b']],"a:int,b:str")
>>> b = IterableDataFrame(a)

Note

It’s ok to peek the dataframe, it will not affect the iteration, but it’s invalid operation to count

alter_columns(columns)[source]

Change column types

Parameters:

columns (Any) – Schema like object, all columns should be contained by the dataframe schema

Returns:

a new dataframe with altered columns, the order of the original schema will not change

Return type:

DataFrame

as_array(columns=None, type_safe=False)[source]

Convert to 2-dimensional native python array

Parameters:
  • columns (List[str] | None) – columns to extract, defaults to None

  • type_safe (bool) – whether to ensure output conforms with its schema, defaults to False

Returns:

2-dimensional native python array

Return type:

List[Any]

Note

If type_safe is False, then the returned values are ‘raw’ values.

as_array_iterable(columns=None, type_safe=False)[source]

Convert to iterable of native python arrays

Parameters:
  • columns (List[str] | None) – columns to extract, defaults to None

  • type_safe (bool) – whether to ensure output conforms with its schema, defaults to False

Returns:

iterable of native python arrays

Return type:

Iterable[Any]

Note

If type_safe is False, then the returned values are ‘raw’ values.

as_dicts(columns=None)[source]

Convert to a list of python dicts

Parameters:

columns (List[str] | None) – columns to extract, defaults to None

Returns:

a list of python dicts

Return type:

List[Dict[str, Any]]

Note

The default implementation enforces type_safe True

as_local_bounded()[source]

Convert this dataframe to a LocalBoundedDataFrame

Return type:

LocalBoundedDataFrame

property empty: bool

Whether this dataframe is empty

head(n, columns=None)[source]

Get first n rows of the dataframe as a new local bounded dataframe

Parameters:
  • n (int) – number of rows

  • columns (List[str] | None) – selected columns, defaults to None (all columns)

Returns:

a local bounded dataframe

Return type:

LocalBoundedDataFrame

property native: EmptyAwareIterable[Any]

Iterable of native python arrays

peek_array()[source]

Peek the first row of the dataframe as array

Raises:

FugueDatasetEmptyError – if it is empty

Return type:

List[Any]

rename(columns)[source]

Rename the dataframe using a mapping dict

Parameters:

columns (Dict[str, str]) – key: the original column name, value: the new name

Returns:

a new dataframe with the new names

Return type:

DataFrame

fugue.dataframe.pandas_dataframe

class fugue.dataframe.pandas_dataframe.PandasDataFrame(df=None, schema=None, pandas_df_wrapper=False)[source]

Bases: LocalBoundedDataFrame

DataFrame that wraps pandas DataFrame. Please also read the DataFrame Tutorial to understand this Fugue concept

Parameters:
  • df (Any) – 2-dimensional array, iterable of arrays or pandas DataFrame

  • schema (Any) – Schema like object

  • pandas_df_wrapper (bool) – if this is a simple wrapper, default False

Examples

>>> PandasDataFrame([[0,'a'],[1,'b']],"a:int,b:str")
>>> PandasDataFrame(schema = "a:int,b:int")  # empty dataframe
>>> PandasDataFrame(pd.DataFrame([[0]],columns=["a"]))
>>> PandasDataFrame(ArrayDataFrame([[0]],"a:int).as_pandas())

Note

If pandas_df_wrapper is True, then the constructor will not do any type check otherwise, it will enforce type according to the input schema after the construction

alter_columns(columns)[source]

Change column types

Parameters:

columns (Any) – Schema like object, all columns should be contained by the dataframe schema

Returns:

a new dataframe with altered columns, the order of the original schema will not change

Return type:

DataFrame

as_array(columns=None, type_safe=False)[source]

Convert to 2-dimensional native python array

Parameters:
  • columns (List[str] | None) – columns to extract, defaults to None

  • type_safe (bool) – whether to ensure output conforms with its schema, defaults to False

Returns:

2-dimensional native python array

Return type:

List[Any]

Note

If type_safe is False, then the returned values are ‘raw’ values.

as_array_iterable(columns=None, type_safe=False)[source]

Convert to iterable of native python arrays

Parameters:
  • columns (List[str] | None) – columns to extract, defaults to None

  • type_safe (bool) – whether to ensure output conforms with its schema, defaults to False

Returns:

iterable of native python arrays

Return type:

Iterable[Any]

Note

If type_safe is False, then the returned values are ‘raw’ values.

as_arrow(type_safe=False)[source]

Convert to pyArrow DataFrame

Parameters:

type_safe (bool)

Return type:

Table

as_dict_iterable(columns=None)[source]

Convert to iterable of python dicts

Parameters:

columns (List[str] | None) – columns to extract, defaults to None

Returns:

iterable of python dicts

Return type:

Iterable[Dict[str, Any]]

Note

The default implementation enforces type_safe True

as_dicts(columns=None)[source]

Convert to a list of python dicts

Parameters:

columns (List[str] | None) – columns to extract, defaults to None

Returns:

a list of python dicts

Return type:

List[Dict[str, Any]]

Note

The default implementation enforces type_safe True

as_pandas()[source]

Convert to pandas DataFrame

Return type:

DataFrame

count()[source]

Get number of rows of this dataframe

Return type:

int

property empty: bool

Whether this dataframe is empty

head(n, columns=None)[source]

Get first n rows of the dataframe as a new local bounded dataframe

Parameters:
  • n (int) – number of rows

  • columns (List[str] | None) – selected columns, defaults to None (all columns)

Returns:

a local bounded dataframe

Return type:

LocalBoundedDataFrame

property native: DataFrame

Pandas DataFrame

native_as_df()[source]

The dataframe form of the native object this Dataset class wraps. Dataframe form means the object contains schema information. For example the native an ArrayDataFrame is a python array, it doesn’t contain schema information, and its native_as_df should be either a pandas dataframe or an arrow dataframe.

Return type:

DataFrame

peek_array()[source]

Peek the first row of the dataframe as array

Raises:

FugueDatasetEmptyError – if it is empty

Return type:

List[Any]

rename(columns)[source]

Rename the dataframe using a mapping dict

Parameters:

columns (Dict[str, str]) – key: the original column name, value: the new name

Returns:

a new dataframe with the new names

Return type:

DataFrame

fugue.dataframe.utils

fugue.dataframe.utils.deserialize_df(data, fs=None)[source]

Deserialize json string to LocalBoundedDataFrame

Parameters:
  • json_str – json string containing the base64 data or a file path

  • fs (AbstractFileSystem | None) – the file system to use, defaults to None

  • data (bytes | None)

Raises:

ValueError – if the json string is invalid, not generated from serialize_df()

Returns:

LocalBoundedDataFrame if json_str contains a dataframe or None if its valid but contains no data

Return type:

LocalBoundedDataFrame | None

fugue.dataframe.utils.get_join_schemas(df1, df2, how, on)[source]

Get Schema object after joining df1 and df2. If on is not empty, it’s mainly for validation purpose.

Parameters:
  • df1 (DataFrame) – first dataframe

  • df2 (DataFrame) – second dataframe

  • how (str) – can accept semi, left_semi, anti, left_anti, inner, left_outer, right_outer, full_outer, cross

  • on (Iterable[str] | None) – it can always be inferred, but if you provide, it will be validated agained the inferred keys.

Returns:

the pair key schema and schema after join

Return type:

Tuple[Schema, Schema]

Note

In Fugue, joined schema can always be inferred because it always uses the input dataframes’ common keys as the join keys. So you must make sure to rename() to input dataframes so they follow this rule.

fugue.dataframe.utils.pa_table_as_array(df, columns=None)[source]

Convert a pyarrow table to a list of list

Parameters:
  • df (Table) – pyarrow table

  • columns (List[str] | None) – if not None, only these columns will be returned, defaults to None

Returns:

a list of list

Return type:

List[List[List[Any]]]

fugue.dataframe.utils.pa_table_as_array_iterable(df, columns=None)[source]

Convert a pyarrow table to an iterable of list

Parameters:
  • df (Table) – pyarrow table

  • columns (List[str] | None) – if not None, only these columns will be returned, defaults to None

Returns:

an iterable of list

Return type:

Iterable[List[List[Any]]]

fugue.dataframe.utils.pa_table_as_dict_iterable(df, columns=None)[source]

Convert a pyarrow table to an iterable of dict

Parameters:
  • df (Table) – pyarrow table

  • columns (List[str] | None) – if not None, only these columns will be returned, defaults to None

Returns:

an iterable of dict

Return type:

Iterable[Dict[str, Any]]

fugue.dataframe.utils.pa_table_as_dicts(df, columns=None)[source]

Convert a pyarrow table to a list of dict

Parameters:
  • df (Table) – pyarrow table

  • columns (List[str] | None) – if not None, only these columns will be returned, defaults to None

Returns:

a list of dict

Return type:

List[Dict[str, Any]]

fugue.dataframe.utils.serialize_df(df, threshold=-1, file_path=None)[source]

Serialize input dataframe to base64 string or to file if it’s larger than threshold

Parameters:
  • df (DataFrame | None) – input DataFrame

  • threshold (int) – file byte size threshold, defaults to -1

  • file_path (str | None) – file path to store the data (used only if the serialized data is larger than threshold), defaults to None

Raises:

InvalidOperationError – if file is large but file_path is not provided

Returns:

a pickled blob either containing the data or the file path

Return type:

bytes | None