fugue.dataframe#

fugue.dataframe.api#

fugue.dataframe.api.get_native_as_df(df)[source]#

Return the dataframe form of the input df. If df is a DataFrame, then call the native_as_df(), otherwise, it depends on whether there is a correspondent function handling it.

Parameters

df (AnyDataFrame) –

Return type

AnyDataFrame

fugue.dataframe.api.normalize_column_names(df)[source]#

A generic function to normalize any dataframe’s column names to follow Fugue naming rules

Note

This is a temporary solution before Schema can take arbitrary names

Examples

  • [0,1] => {"_0":0, "_1":1}

  • ["1a","2b"] => {"_1a":"1a", "_2b":"2b"}

  • ["*a","-a"] => {"_a":"*a", "_a_1":"-a"}

Parameters

df (AnyDataFrame) – a dataframe object

Returns

the renamed dataframe and the rename operations as a dict that can undo the change

Return type

Tuple[AnyDataFrame, Dict[str, Any]]

fugue.dataframe.array_dataframe#

class fugue.dataframe.array_dataframe.ArrayDataFrame(df=None, schema=None)[source]#

Bases: LocalBoundedDataFrame

DataFrame that wraps native python 2-dimensional arrays. Please read the DataFrame Tutorial to understand the concept

Parameters
  • df (Any) – 2-dimensional array, iterable of arrays, or DataFrame

  • schema (Any) – Schema like object

Examples

>>> a = ArrayDataFrame([[0,'a'],[1,'b']],"a:int,b:str")
>>> b = ArrayDataFrame(a)
alter_columns(columns)[source]#

Change column types

Parameters

columns (Any) – Schema like object, all columns should be contained by the dataframe schema

Returns

a new dataframe with altered columns, the order of the original schema will not change

Return type

DataFrame

as_array(columns=None, type_safe=False)[source]#

Convert to 2-dimensional native python array

Parameters
  • columns (Optional[List[str]]) – columns to extract, defaults to None

  • type_safe (bool) – whether to ensure output conforms with its schema, defaults to False

Returns

2-dimensional native python array

Return type

List[Any]

Note

If type_safe is False, then the returned values are ‘raw’ values.

as_array_iterable(columns=None, type_safe=False)[source]#

Convert to iterable of native python arrays

Parameters
  • columns (Optional[List[str]]) – columns to extract, defaults to None

  • type_safe (bool) – whether to ensure output conforms with its schema, defaults to False

Returns

iterable of native python arrays

Return type

Iterable[Any]

Note

If type_safe is False, then the returned values are ‘raw’ values.

count()[source]#

Get number of rows of this dataframe

Return type

int

property empty: bool#

Whether this dataframe is empty

head(n, columns=None)[source]#

Get first n rows of the dataframe as a new local bounded dataframe

Parameters
  • n (int) – number of rows

  • columns (Optional[List[str]]) – selected columns, defaults to None (all columns)

Returns

a local bounded dataframe

Return type

LocalBoundedDataFrame

property native: List[Any]#

2-dimensional native python array

peek_array()[source]#

Peek the first row of the dataframe as array

Raises

FugueDatasetEmptyError – if it is empty

Return type

List[Any]

rename(columns)[source]#

Rename the dataframe using a mapping dict

Parameters

columns (Dict[str, str]) – key: the original column name, value: the new name

Returns

a new dataframe with the new names

Return type

DataFrame

fugue.dataframe.arrow_dataframe#

class fugue.dataframe.arrow_dataframe.ArrowDataFrame(df=None, schema=None, pandas_df_wrapper=False)[source]#

Bases: LocalBoundedDataFrame

DataFrame that wraps pyarrow.Table. Please also read the DataFrame Tutorial to understand this Fugue concept

Parameters
  • df (Any) – 2-dimensional array, iterable of arrays, pyarrow.Table or pandas DataFrame

  • schema (Any) – Schema like object

  • pandas_df_wrapper (bool) –

Examples

>>> ArrowDataFrame([[0,'a'],[1,'b']],"a:int,b:str")
>>> ArrowDataFrame(schema = "a:int,b:int")  # empty dataframe
>>> ArrowDataFrame(pd.DataFrame([[0]],columns=["a"]))
>>> ArrowDataFrame(ArrayDataFrame([[0]],"a:int).as_arrow())
alter_columns(columns)[source]#

Change column types

Parameters

columns (Any) – Schema like object, all columns should be contained by the dataframe schema

Returns

a new dataframe with altered columns, the order of the original schema will not change

Return type

DataFrame

as_array(columns=None, type_safe=False)[source]#

Convert to 2-dimensional native python array

Parameters
  • columns (Optional[List[str]]) – columns to extract, defaults to None

  • type_safe (bool) – whether to ensure output conforms with its schema, defaults to False

Returns

2-dimensional native python array

Return type

List[Any]

Note

If type_safe is False, then the returned values are ‘raw’ values.

as_array_iterable(columns=None, type_safe=False)[source]#

Convert to iterable of native python arrays

Parameters
  • columns (Optional[List[str]]) – columns to extract, defaults to None

  • type_safe (bool) – whether to ensure output conforms with its schema, defaults to False

Returns

iterable of native python arrays

Return type

Iterable[Any]

Note

If type_safe is False, then the returned values are ‘raw’ values.

as_arrow(type_safe=False)[source]#

Convert to pyArrow DataFrame

Parameters

type_safe (bool) –

Return type

Table

as_pandas()[source]#

Convert to pandas DataFrame

Return type

DataFrame

count()[source]#

Get number of rows of this dataframe

Return type

int

property empty: bool#

Whether this dataframe is empty

head(n, columns=None)[source]#

Get first n rows of the dataframe as a new local bounded dataframe

Parameters
  • n (int) – number of rows

  • columns (Optional[List[str]]) – selected columns, defaults to None (all columns)

Returns

a local bounded dataframe

Return type

LocalBoundedDataFrame

property native: Table#

pyarrow.Table

native_as_df()[source]#

The dataframe form of the native object this Dataset class wraps. Dataframe form means the object contains schema information. For example the native an ArrayDataFrame is a python array, it doesn’t contain schema information, and its native_as_df should be either a pandas dataframe or an arrow dataframe.

Return type

Table

peek_array()[source]#

Peek the first row of the dataframe as array

Raises

FugueDatasetEmptyError – if it is empty

Return type

List[Any]

peek_dict()[source]#

Peek the first row of the dataframe as dict

Raises

FugueDatasetEmptyError – if it is empty

Return type

Dict[str, Any]

rename(columns)[source]#

Rename the dataframe using a mapping dict

Parameters

columns (Dict[str, str]) – key: the original column name, value: the new name

Returns

a new dataframe with the new names

Return type

DataFrame

fugue.dataframe.dataframe#

class fugue.dataframe.dataframe.DataFrame(schema=None)[source]#

Bases: Dataset

Base class of Fugue DataFrame. Please read the DataFrame Tutorial to understand the concept

Parameters

schema (Any) – Schema like object

Note

This is an abstract class, and normally you don’t construct it by yourself unless you are implementing a new ExecutionEngine

abstract alter_columns(columns)[source]#

Change column types

Parameters

columns (Any) – Schema like object, all columns should be contained by the dataframe schema

Returns

a new dataframe with altered columns, the order of the original schema will not change

Return type

DataFrame

abstract as_array(columns=None, type_safe=False)[source]#

Convert to 2-dimensional native python array

Parameters
  • columns (Optional[List[str]]) – columns to extract, defaults to None

  • type_safe (bool) – whether to ensure output conforms with its schema, defaults to False

Returns

2-dimensional native python array

Return type

List[Any]

Note

If type_safe is False, then the returned values are ‘raw’ values.

abstract as_array_iterable(columns=None, type_safe=False)[source]#

Convert to iterable of native python arrays

Parameters
  • columns (Optional[List[str]]) – columns to extract, defaults to None

  • type_safe (bool) – whether to ensure output conforms with its schema, defaults to False

Returns

iterable of native python arrays

Return type

Iterable[Any]

Note

If type_safe is False, then the returned values are ‘raw’ values.

as_arrow(type_safe=False)[source]#

Convert to pyArrow DataFrame

Parameters

type_safe (bool) –

Return type

Table

as_dict_iterable(columns=None)[source]#

Convert to iterable of native python dicts

Parameters

columns (Optional[List[str]]) – columns to extract, defaults to None

Returns

iterable of native python dicts

Return type

Iterable[Dict[str, Any]]

Note

The default implementation enforces type_safe True

abstract as_local()[source]#

Convert this dataframe to a LocalDataFrame

Return type

LocalDataFrame

as_pandas()[source]#

Convert to pandas DataFrame

Return type

DataFrame

property columns: List[str]#

The column names of the dataframe

drop(columns)[source]#

Drop certain columns and return a new dataframe

Parameters

columns (List[str]) – columns to drop

Raises

FugueDataFrameOperationError – if columns are not strictly contained by this dataframe, or it is the entire dataframe columns

Returns

a new dataframe removing the columns

Return type

DataFrame

get_info_str()[source]#

Get dataframe information (schema, type, metadata) as json string

Returns

json string

Return type

str

abstract head(n, columns=None)[source]#

Get first n rows of the dataframe as a new local bounded dataframe

Parameters
  • n (int) – number of rows

  • columns (Optional[List[str]]) – selected columns, defaults to None (all columns)

Returns

a local bounded dataframe

Return type

LocalBoundedDataFrame

abstract native_as_df()[source]#

The dataframe form of the native object this Dataset class wraps. Dataframe form means the object contains schema information. For example the native an ArrayDataFrame is a python array, it doesn’t contain schema information, and its native_as_df should be either a pandas dataframe or an arrow dataframe.

Return type

AnyDataFrame

abstract peek_array()[source]#

Peek the first row of the dataframe as array

Raises

FugueDatasetEmptyError – if it is empty

Return type

List[Any]

peek_dict()[source]#

Peek the first row of the dataframe as dict

Raises

FugueDatasetEmptyError – if it is empty

Return type

Dict[str, Any]

abstract rename(columns)[source]#

Rename the dataframe using a mapping dict

Parameters

columns (Dict[str, str]) – key: the original column name, value: the new name

Returns

a new dataframe with the new names

Return type

DataFrame

property schema: Schema#

The schema of the dataframe

property schema_discovered: Schema#

Whether the schema has been discovered or still a lambda

class fugue.dataframe.dataframe.DataFrameDisplay(ds)[source]#

Bases: DatasetDisplay

DataFrame plain display class

Parameters

ds (Dataset) –

property df: DataFrame#

The target DataFrame

show(n=10, with_count=False, title=None)[source]#

Show the Dataset

Parameters
  • n (int) – top n items to display, defaults to 10

  • with_count (bool) – whether to display the total count, defaults to False

  • title (Optional[str]) – title to display, defaults to None

Return type

None

class fugue.dataframe.dataframe.LocalBoundedDataFrame(schema=None)[source]#

Bases: LocalDataFrame

Base class of all local bounded dataframes. Please read this to understand the concept

Parameters

schema (Any) – Schema like object

Note

This is an abstract class, and normally you don’t construct it by yourself unless you are implementing a new ExecutionEngine

property is_bounded: bool#

Always True because it’s a bounded dataframe

class fugue.dataframe.dataframe.LocalDataFrame(schema=None)[source]#

Bases: DataFrame

Base class of all local dataframes. Please read this to understand the concept

Parameters

schema (Any) – a schema-like object

Note

This is an abstract class, and normally you don’t construct it by yourself unless you are implementing a new ExecutionEngine

as_local()[source]#

Always return self, because it’s a LocalDataFrame

Return type

LocalDataFrame

property is_local: bool#

Always True because it’s a LocalDataFrame

native_as_df()[source]#

The dataframe form of the native object this Dataset class wraps. Dataframe form means the object contains schema information. For example the native an ArrayDataFrame is a python array, it doesn’t contain schema information, and its native_as_df should be either a pandas dataframe or an arrow dataframe.

Return type

AnyDataFrame

property num_partitions: int#

Always 1 because it’s a LocalDataFrame

class fugue.dataframe.dataframe.LocalUnboundedDataFrame(schema=None)[source]#

Bases: LocalDataFrame

Base class of all local unbounded dataframes. Read this <https://fugue-tutorials.readthedocs.io/ en/latest/tutorials/advanced/schema_dataframes.html#DataFrame>`_ to understand the concept

Parameters

schema (Any) – Schema like object

Note

This is an abstract class, and normally you don’t construct it by yourself unless you are implementing a new ExecutionEngine

count()[source]#
Raises

InvalidOperationError – You can’t count an unbounded dataframe

Return type

int

property is_bounded#

Always False because it’s an unbounded dataframe

class fugue.dataframe.dataframe.YieldedDataFrame(yid)[source]#

Bases: Yielded

Yielded dataframe from FugueWorkflow. Users shouldn’t create this object directly.

Parameters

yid (str) – unique id for determinism

property is_set: bool#

Whether the value is set. It can be false if the parent workflow has not been executed.

property result: DataFrame#

The yielded dataframe, it will be set after the parent workflow is computed

set_value(df)[source]#

Set the yielded dataframe after compute. Users should not call it.

Parameters
Return type

None

fugue.dataframe.dataframe.as_fugue_df(df, **kwargs)[source]#

Wrap the object as a Fugue DataFrame.

Parameters
  • df (AnyDataFrame) – the object to wrap

  • kwargs (Any) –

Return type

DataFrame

fugue.dataframe.dataframe_iterable_dataframe#

class fugue.dataframe.dataframe_iterable_dataframe.LocalDataFrameIterableDataFrame(df=None, schema=None)[source]#

Bases: LocalUnboundedDataFrame

DataFrame that wraps an iterable of local dataframes

Parameters
  • df (Any) – an iterable of DataFrame. If any is not local, they will be converted to LocalDataFrame by as_local()

  • schema (Any) – Schema like object, if it is provided, it must match the schema of the dataframes

Examples

def get_dfs(seq):
    yield IterableDataFrame([], "a:int,b:int")
    yield IterableDataFrame([[1, 10]], "a:int,b:int")
    yield ArrayDataFrame([], "a:int,b:str")

df = LocalDataFrameIterableDataFrame(get_dfs())
for subdf in df.native:
    subdf.show()

Note

It’s ok to peek the dataframe, it will not affect the iteration, but it’s invalid to count.

schema can be used when the iterable contains no dataframe. But if there is any dataframe, schema must match the schema of the dataframes.

For the iterable of dataframes, if there is any empty dataframe, they will be skipped and their schema will not matter. However, if all dataframes in the interable are empty, then the last empty dataframe will be used to set the schema.

alter_columns(columns)[source]#

Change column types

Parameters

columns (Any) – Schema like object, all columns should be contained by the dataframe schema

Returns

a new dataframe with altered columns, the order of the original schema will not change

Return type

DataFrame

as_array(columns=None, type_safe=False)[source]#

Convert to 2-dimensional native python array

Parameters
  • columns (Optional[List[str]]) – columns to extract, defaults to None

  • type_safe (bool) – whether to ensure output conforms with its schema, defaults to False

Returns

2-dimensional native python array

Return type

List[Any]

Note

If type_safe is False, then the returned values are ‘raw’ values.

as_array_iterable(columns=None, type_safe=False)[source]#

Convert to iterable of native python arrays

Parameters
  • columns (Optional[List[str]]) – columns to extract, defaults to None

  • type_safe (bool) – whether to ensure output conforms with its schema, defaults to False

Returns

iterable of native python arrays

Return type

Iterable[Any]

Note

If type_safe is False, then the returned values are ‘raw’ values.

as_arrow(type_safe=False)[source]#

Convert to pyArrow DataFrame

Parameters

type_safe (bool) –

Return type

Table

as_pandas()[source]#

Convert to pandas DataFrame

Return type

DataFrame

property empty: bool#

Whether this dataframe is empty

head(n, columns=None)[source]#

Get first n rows of the dataframe as a new local bounded dataframe

Parameters
  • n (int) – number of rows

  • columns (Optional[List[str]]) – selected columns, defaults to None (all columns)

Returns

a local bounded dataframe

Return type

LocalBoundedDataFrame

property native: EmptyAwareIterable[LocalDataFrame]#

Iterable of dataframes

peek_array()[source]#

Peek the first row of the dataframe as array

Raises

FugueDatasetEmptyError – if it is empty

Return type

List[Any]

rename(columns)[source]#

Rename the dataframe using a mapping dict

Parameters

columns (Dict[str, str]) – key: the original column name, value: the new name

Returns

a new dataframe with the new names

Return type

DataFrame

fugue.dataframe.dataframes#

class fugue.dataframe.dataframes.DataFrames(*args, **kwargs)[source]#

Bases: IndexedOrderedDict[str, DataFrame]

Ordered dictionary of DataFrames. There are two modes: with keys and without keys. If without key _<n> will be used as the key for each dataframe, and it will be treated as an array in Fugue framework.

It’s a subclass of dict, so it supports all dict operations. It’s also ordered, so you can trust the order of keys and values.

The initialization is flexible

>>> df1 = ArrayDataFrame([[0]],"a:int")
>>> df2 = ArrayDataFrame([[1]],"a:int")
>>> dfs = DataFrames(df1,df2)  # init as [df1, df2]
>>> assert not dfs.has_key
>>> assert df1 is dfs[0] and df2 is dfs[1]
>>> dfs_array = list(dfs.values())
>>> dfs = DataFrames(a=df1,b=df2)  # init as {a:df1, b:df2}
>>> assert dfs.has_key
>>> assert df1 is dfs[0] and df2 is dfs[1]  # order is guaranteed
>>> df3 = ArrayDataFrame([[1]],"b:int")
>>> dfs2 = DataFrames(dfs, c=df3)  # {a:df1, b:df2, c:df3}
>>> dfs2 = DataFrames(dfs, df3)  # invalid, because dfs has key, df3 doesn't
>>> dfs2 = DataFrames(dict(a=df1,b=df2))  # init as {a:df1, b:df2}
>>> dfs2 = DataFrames([df1,df2],df3)  # init as [df1, df2, df3]
Parameters
  • args (Any) –

  • kwargs (Any) –

convert(func)[source]#

Create another DataFrames with the same structure, but all converted by func

Returns

the new DataFrames

Parameters

func (Callable[[DataFrame], DataFrame]) –

Return type

DataFrames

Examples

>>> dfs2 = dfs.convert(lambda df: df.as_local()) # convert all to local
property has_key#

If this collection has key (dict-like) or not (list-like)

fugue.dataframe.iterable_dataframe#

class fugue.dataframe.iterable_dataframe.IterableDataFrame(df=None, schema=None)[source]#

Bases: LocalUnboundedDataFrame

DataFrame that wraps native python iterable of arrays. Please read the DataFrame Tutorial to understand the concept

Parameters
  • df (Any) – 2-dimensional array, iterable of arrays, or DataFrame

  • schema (Any) – Schema like object

Examples

>>> a = IterableDataFrame([[0,'a'],[1,'b']],"a:int,b:str")
>>> b = IterableDataFrame(a)

Note

It’s ok to peek the dataframe, it will not affect the iteration, but it’s invalid operation to count

alter_columns(columns)[source]#

Change column types

Parameters

columns (Any) – Schema like object, all columns should be contained by the dataframe schema

Returns

a new dataframe with altered columns, the order of the original schema will not change

Return type

DataFrame

as_array(columns=None, type_safe=False)[source]#

Convert to 2-dimensional native python array

Parameters
  • columns (Optional[List[str]]) – columns to extract, defaults to None

  • type_safe (bool) – whether to ensure output conforms with its schema, defaults to False

Returns

2-dimensional native python array

Return type

List[Any]

Note

If type_safe is False, then the returned values are ‘raw’ values.

as_array_iterable(columns=None, type_safe=False)[source]#

Convert to iterable of native python arrays

Parameters
  • columns (Optional[List[str]]) – columns to extract, defaults to None

  • type_safe (bool) – whether to ensure output conforms with its schema, defaults to False

Returns

iterable of native python arrays

Return type

Iterable[Any]

Note

If type_safe is False, then the returned values are ‘raw’ values.

property empty: bool#

Whether this dataframe is empty

head(n, columns=None)[source]#

Get first n rows of the dataframe as a new local bounded dataframe

Parameters
  • n (int) – number of rows

  • columns (Optional[List[str]]) – selected columns, defaults to None (all columns)

Returns

a local bounded dataframe

Return type

LocalBoundedDataFrame

property native: EmptyAwareIterable[Any]#

Iterable of native python arrays

peek_array()[source]#

Peek the first row of the dataframe as array

Raises

FugueDatasetEmptyError – if it is empty

Return type

List[Any]

rename(columns)[source]#

Rename the dataframe using a mapping dict

Parameters

columns (Dict[str, str]) – key: the original column name, value: the new name

Returns

a new dataframe with the new names

Return type

DataFrame

fugue.dataframe.pandas_dataframe#

class fugue.dataframe.pandas_dataframe.PandasDataFrame(df=None, schema=None, pandas_df_wrapper=False)[source]#

Bases: LocalBoundedDataFrame

DataFrame that wraps pandas DataFrame. Please also read the DataFrame Tutorial to understand this Fugue concept

Parameters
  • df (Any) – 2-dimensional array, iterable of arrays or pandas DataFrame

  • schema (Any) – Schema like object

  • pandas_df_wrapper (bool) – if this is a simple wrapper, default False

Examples

>>> PandasDataFrame([[0,'a'],[1,'b']],"a:int,b:str")
>>> PandasDataFrame(schema = "a:int,b:int")  # empty dataframe
>>> PandasDataFrame(pd.DataFrame([[0]],columns=["a"]))
>>> PandasDataFrame(ArrayDataFrame([[0]],"a:int).as_pandas())

Note

If pandas_df_wrapper is True, then the constructor will not do any type check otherwise, it will enforce type according to the input schema after the construction

alter_columns(columns)[source]#

Change column types

Parameters

columns (Any) – Schema like object, all columns should be contained by the dataframe schema

Returns

a new dataframe with altered columns, the order of the original schema will not change

Return type

DataFrame

as_array(columns=None, type_safe=False)[source]#

Convert to 2-dimensional native python array

Parameters
  • columns (Optional[List[str]]) – columns to extract, defaults to None

  • type_safe (bool) – whether to ensure output conforms with its schema, defaults to False

Returns

2-dimensional native python array

Return type

List[Any]

Note

If type_safe is False, then the returned values are ‘raw’ values.

as_array_iterable(columns=None, type_safe=False)[source]#

Convert to iterable of native python arrays

Parameters
  • columns (Optional[List[str]]) – columns to extract, defaults to None

  • type_safe (bool) – whether to ensure output conforms with its schema, defaults to False

Returns

iterable of native python arrays

Return type

Iterable[Any]

Note

If type_safe is False, then the returned values are ‘raw’ values.

as_pandas()[source]#

Convert to pandas DataFrame

Return type

DataFrame

count()[source]#

Get number of rows of this dataframe

Return type

int

property empty: bool#

Whether this dataframe is empty

head(n, columns=None)[source]#

Get first n rows of the dataframe as a new local bounded dataframe

Parameters
  • n (int) – number of rows

  • columns (Optional[List[str]]) – selected columns, defaults to None (all columns)

Returns

a local bounded dataframe

Return type

LocalBoundedDataFrame

property native: DataFrame#

Pandas DataFrame

native_as_df()[source]#

The dataframe form of the native object this Dataset class wraps. Dataframe form means the object contains schema information. For example the native an ArrayDataFrame is a python array, it doesn’t contain schema information, and its native_as_df should be either a pandas dataframe or an arrow dataframe.

Return type

DataFrame

peek_array()[source]#

Peek the first row of the dataframe as array

Raises

FugueDatasetEmptyError – if it is empty

Return type

List[Any]

rename(columns)[source]#

Rename the dataframe using a mapping dict

Parameters

columns (Dict[str, str]) – key: the original column name, value: the new name

Returns

a new dataframe with the new names

Return type

DataFrame

fugue.dataframe.utils#

fugue.dataframe.utils.deserialize_df(json_str, fs=None)[source]#

Deserialize json string to LocalBoundedDataFrame

Parameters
  • json_str (str) – json string containing the base64 data or a file path

  • fs (Optional[FileSystem]) – FileSystem, defaults to None

Raises

ValueError – if the json string is invalid, not generated from serialize_df()

Returns

LocalBoundedDataFrame if json_str contains a dataframe or None if its valid but contains no data

Return type

Optional[LocalBoundedDataFrame]

fugue.dataframe.utils.get_join_schemas(df1, df2, how, on)[source]#

Get Schema object after joining df1 and df2. If on is not empty, it’s mainly for validation purpose.

Parameters
  • df1 (DataFrame) – first dataframe

  • df2 (DataFrame) – second dataframe

  • how (str) – can accept semi, left_semi, anti, left_anti, inner, left_outer, right_outer, full_outer, cross

  • on (Optional[Iterable[str]]) – it can always be inferred, but if you provide, it will be validated agained the inferred keys.

Returns

the pair key schema and schema after join

Return type

Tuple[Schema, Schema]

Note

In Fugue, joined schema can always be inferred because it always uses the input dataframes’ common keys as the join keys. So you must make sure to rename() to input dataframes so they follow this rule.

fugue.dataframe.utils.pickle_df(df)[source]#

Pickles a dataframe to bytes array. It firstly converts the dataframe using to_local_bounded_df(), and then serialize the underlying data.

Parameters

df (DataFrame) – input DataFrame

Returns

pickled binary data

Return type

bytes

Note

Be careful to use on large dataframes or non-local, un-materialized dataframes, it can be slow. You should always use unpickle_df() to deserialize.

fugue.dataframe.utils.serialize_df(df, threshold=-1, file_path=None, fs=None)[source]#

Serialize input dataframe to base64 string or to file if it’s larger than threshold

Parameters
  • df (Optional[DataFrame]) – input DataFrame

  • threshold (int) – file byte size threshold, defaults to -1

  • file_path (Optional[str]) – file path to store the data (used only if the serialized data is larger than threshold), defaults to None

  • fs (Optional[FileSystem]) – FileSystem, defaults to None

Raises

InvalidOperationError – if file is large but file_path is not provided

Returns

a json string either containing the base64 data or the file path

Return type

str

Note

If fs is not provided but it needs to write to disk, then it will use open_fs() to try to open the file to write.

fugue.dataframe.utils.to_local_bounded_df(df, schema=None)[source]#

Convert a data structure to LocalBoundedDataFrame

Parameters
  • df (Any) – DataFrame, pandas DataFramme and list or iterable of arrays

  • schema (Optional[Any]) – Schema like object, defaults to None, it should not be set for DataFrame type

Raises
  • ValueError – if df is DataFrame but you set schema

  • TypeError – if df is not compatible

Returns

the dataframe itself if it’s LocalBoundedDataFrame else a converted one

Return type

LocalBoundedDataFrame

Examples

>>> a = IterableDataFrame([[0,'a'],[1,'b']],"a:int,b:str")
>>> assert isinstance(to_local_bounded_df(a), LocalBoundedDataFrame)
>>> to_local_bounded_df(SparkDataFrame([[0,'a'],[1,'b']],"a:int,b:str"))

Note

Compared to to_local_df(), this function makes sure the dataframe is also bounded, so IterableDataFrame will be converted although it’s local.

fugue.dataframe.utils.to_local_df(df, schema=None)[source]#

Convert a data structure to LocalDataFrame

Parameters
  • df (Any) – DataFrame, pandas DataFramme and list or iterable of arrays

  • schema (Optional[Any]) – Schema like object, defaults to None, it should not be set for DataFrame type

Raises
  • ValueError – if df is DataFrame but you set schema

  • TypeError – if df is not compatible

Returns

the dataframe itself if it’s LocalDataFrame else a converted one

Return type

LocalDataFrame

Examples

>>> a = to_local_df([[0,'a'],[1,'b']],"a:int,b:str")
>>> assert to_local_df(a) is a
>>> to_local_df(SparkDataFrame([[0,'a'],[1,'b']],"a:int,b:str"))
fugue.dataframe.utils.unpickle_df(stream)[source]#

Unpickles a dataframe from bytes array.

Parameters

stream (bytes) – binary data

Returns

unpickled dataframe

Return type

LocalBoundedDataFrame

Note

The data must be serialized by pickle_df() to deserialize.