fugue.dataframe#
fugue.dataframe.api#
- fugue.dataframe.api.get_native_as_df(df)[source]#
Return the dataframe form of the input df. If df is a DataFrame, then call its native_as_df(); otherwise, it depends on whether there is a corresponding function handling it.
- Parameters
df (AnyDataFrame) –
- Return type
AnyDataFrame
- fugue.dataframe.api.normalize_column_names(df)[source]#
A generic function to normalize any dataframe’s column names to follow Fugue naming rules
Note
This is a temporary solution before Schema can take arbitrary names
Examples
[0,1] => {"_0":0, "_1":1}
["1a","2b"] => {"_1a":"1a", "_2b":"2b"}
["*a","-a"] => {"_a":"*a", "_a_1":"-a"}
- Parameters
df (AnyDataFrame) – a dataframe object
- Returns
the renamed dataframe and the rename operations as a dict that can undo the change
- Return type
Tuple[AnyDataFrame, Dict[str, Any]]
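The renaming shown in the examples can be approximated in plain Python. This is a minimal sketch, not Fugue's implementation: the helper name sketch_normalize_names and its three rules (replace characters outside [A-Za-z0-9_] with an underscore, prefix an underscore when the name is empty or starts with a digit, append _<n> on collisions) are assumptions chosen only to reproduce the three examples above.

```python
import re
from typing import Any, Dict, List


def sketch_normalize_names(names: List[Any]) -> Dict[str, Any]:
    """Hypothetical helper: map each new, valid name to its original name."""
    result: Dict[str, Any] = {}
    for original in names:
        # Assumed rule 1: replace every character outside [A-Za-z0-9_] with "_".
        candidate = re.sub(r"[^0-9A-Za-z_]", "_", str(original))
        # Assumed rule 2: a name must not be empty or start with a digit.
        if candidate == "" or candidate[0].isdigit():
            candidate = "_" + candidate
        # Assumed rule 3: resolve collisions with a numeric suffix.
        unique, n = candidate, 0
        while unique in result:
            n += 1
            unique = f"{candidate}_{n}"
        result[unique] = original
    return result


print(sketch_normalize_names([0, 1]))        # {'_0': 0, '_1': 1}
print(sketch_normalize_names(["1a", "2b"]))  # {'_1a': '1a', '_2b': '2b'}
print(sketch_normalize_names(["*a", "-a"]))  # {'_a': '*a', '_a_1': '-a'}
```

The returned dict maps new names back to the original ones, which is the shape needed to undo the change.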
fugue.dataframe.array_dataframe#
- class fugue.dataframe.array_dataframe.ArrayDataFrame(df=None, schema=None)[source]#
Bases:
LocalBoundedDataFrame
DataFrame that wraps native python 2-dimensional arrays. Please read the DataFrame Tutorial to understand the concept
- Parameters
df (Any) – 2-dimensional array, iterable of arrays, or DataFrame
schema (Any) – Schema like object
Examples
>>> a = ArrayDataFrame([[0,'a'],[1,'b']],"a:int,b:str")
>>> b = ArrayDataFrame(a)
- alter_columns(columns)[source]#
Change column types
- Parameters
columns (Any) – Schema like object, all columns should be contained by the dataframe schema
- Returns
a new dataframe with altered columns, the order of the original schema will not change
- Return type
- as_array(columns=None, type_safe=False)[source]#
Convert to 2-dimensional native python array
- Parameters
columns (Optional[List[str]]) – columns to extract, defaults to None
type_safe (bool) – whether to ensure output conforms with its schema, defaults to False
- Returns
2-dimensional native python array
- Return type
List[Any]
Note
If type_safe is False, then the returned values are ‘raw’ values.
- as_array_iterable(columns=None, type_safe=False)[source]#
Convert to iterable of native python arrays
- Parameters
columns (Optional[List[str]]) – columns to extract, defaults to None
type_safe (bool) – whether to ensure output conforms with its schema, defaults to False
- Returns
iterable of native python arrays
- Return type
Iterable[Any]
Note
If type_safe is False, then the returned values are ‘raw’ values.
- property empty: bool#
Whether this dataframe is empty
- head(n, columns=None)[source]#
Get first n rows of the dataframe as a new local bounded dataframe
- Parameters
n (int) – number of rows
columns (Optional[List[str]]) – selected columns, defaults to None (all columns)
- Returns
a local bounded dataframe
- Return type
- property native: List[Any]#
2-dimensional native python array
- peek_array()[source]#
Peek the first row of the dataframe as array
- Raises
FugueDatasetEmptyError – if it is empty
- Return type
List[Any]
fugue.dataframe.arrow_dataframe#
- class fugue.dataframe.arrow_dataframe.ArrowDataFrame(df=None, schema=None)[source]#
Bases:
LocalBoundedDataFrame
DataFrame that wraps pyarrow.Table. Please also read the DataFrame Tutorial to understand this Fugue concept
- Parameters
df (Any) – 2-dimensional array, iterable of arrays, pyarrow.Table or pandas DataFrame
schema (Any) – Schema like object
Examples
>>> ArrowDataFrame([[0,'a'],[1,'b']],"a:int,b:str")
>>> ArrowDataFrame(schema = "a:int,b:int") # empty dataframe
>>> ArrowDataFrame(pd.DataFrame([[0]],columns=["a"]))
>>> ArrowDataFrame(ArrayDataFrame([[0]],"a:int").as_arrow())
- alter_columns(columns)[source]#
Change column types
- Parameters
columns (Any) – Schema like object, all columns should be contained by the dataframe schema
- Returns
a new dataframe with altered columns, the order of the original schema will not change
- Return type
- as_array(columns=None, type_safe=False)[source]#
Convert to 2-dimensional native python array
- Parameters
columns (Optional[List[str]]) – columns to extract, defaults to None
type_safe (bool) – whether to ensure output conforms with its schema, defaults to False
- Returns
2-dimensional native python array
- Return type
List[Any]
Note
If type_safe is False, then the returned values are ‘raw’ values.
- as_array_iterable(columns=None, type_safe=False)[source]#
Convert to iterable of native python arrays
- Parameters
columns (Optional[List[str]]) – columns to extract, defaults to None
type_safe (bool) – whether to ensure output conforms with its schema, defaults to False
- Returns
iterable of native python arrays
- Return type
Iterable[Any]
Note
If type_safe is False, then the returned values are ‘raw’ values.
- as_arrow(type_safe=False)[source]#
Convert to pyArrow DataFrame
- Parameters
type_safe (bool) –
- Return type
- property empty: bool#
Whether this dataframe is empty
- head(n, columns=None)[source]#
Get first n rows of the dataframe as a new local bounded dataframe
- Parameters
n (int) – number of rows
columns (Optional[List[str]]) – selected columns, defaults to None (all columns)
- Returns
a local bounded dataframe
- Return type
- native_as_df()[source]#
The dataframe form of the native object this Dataset class wraps. Dataframe form means the object contains schema information. For example, the native object of an ArrayDataFrame is a python array, which doesn’t contain schema information, so its native_as_df should be either a pandas dataframe or an arrow dataframe.
- Return type
- peek_array()[source]#
Peek the first row of the dataframe as array
- Raises
FugueDatasetEmptyError – if it is empty
- Return type
List[Any]
- peek_dict()[source]#
Peek the first row of the dataframe as dict
- Raises
FugueDatasetEmptyError – if it is empty
- Return type
Dict[str, Any]
fugue.dataframe.dataframe#
- class fugue.dataframe.dataframe.DataFrame(schema=None)[source]#
Bases:
Dataset
Base class of Fugue DataFrame. Please read the DataFrame Tutorial to understand the concept
- Parameters
schema (Any) – Schema like object
Note
This is an abstract class, and normally you don’t construct it by yourself unless you are implementing a new ExecutionEngine
- abstract alter_columns(columns)[source]#
Change column types
- Parameters
columns (Any) – Schema like object, all columns should be contained by the dataframe schema
- Returns
a new dataframe with altered columns, the order of the original schema will not change
- Return type
- abstract as_array(columns=None, type_safe=False)[source]#
Convert to 2-dimensional native python array
- Parameters
columns (Optional[List[str]]) – columns to extract, defaults to None
type_safe (bool) – whether to ensure output conforms with its schema, defaults to False
- Returns
2-dimensional native python array
- Return type
List[Any]
Note
If type_safe is False, then the returned values are ‘raw’ values.
- abstract as_array_iterable(columns=None, type_safe=False)[source]#
Convert to iterable of native python arrays
- Parameters
columns (Optional[List[str]]) – columns to extract, defaults to None
type_safe (bool) – whether to ensure output conforms with its schema, defaults to False
- Returns
iterable of native python arrays
- Return type
Iterable[Any]
Note
If type_safe is False, then the returned values are ‘raw’ values.
- as_arrow(type_safe=False)[source]#
Convert to pyArrow DataFrame
- Parameters
type_safe (bool) –
- Return type
- as_dict_iterable(columns=None)[source]#
Convert to iterable of native python dicts
- Parameters
columns (Optional[List[str]]) – columns to extract, defaults to None
- Returns
iterable of native python dicts
- Return type
Iterable[Dict[str, Any]]
Note
The default implementation enforces type_safe as True
- as_local()[source]#
Convert this dataframe to a LocalDataFrame
- Return type
- abstract as_local_bounded()[source]#
Convert this dataframe to a LocalBoundedDataFrame
- Return type
- property columns: List[str]#
The column names of the dataframe
- drop(columns)[source]#
Drop certain columns and return a new dataframe
- Parameters
columns (List[str]) – columns to drop
- Raises
FugueDataFrameOperationError – if columns are not strictly contained by this dataframe, or if they are all of the dataframe’s columns
- Returns
a new dataframe removing the columns
- Return type
- get_info_str()[source]#
Get dataframe information (schema, type, metadata) as json string
- Returns
json string
- Return type
str
- abstract head(n, columns=None)[source]#
Get first n rows of the dataframe as a new local bounded dataframe
- Parameters
n (int) – number of rows
columns (Optional[List[str]]) – selected columns, defaults to None (all columns)
- Returns
a local bounded dataframe
- Return type
- abstract native_as_df()[source]#
The dataframe form of the native object this Dataset class wraps. Dataframe form means the object contains schema information. For example, the native object of an ArrayDataFrame is a python array, which doesn’t contain schema information, so its native_as_df should be either a pandas dataframe or an arrow dataframe.
- Return type
AnyDataFrame
- abstract peek_array()[source]#
Peek the first row of the dataframe as array
- Raises
FugueDatasetEmptyError – if it is empty
- Return type
List[Any]
- peek_dict()[source]#
Peek the first row of the dataframe as dict
- Raises
FugueDatasetEmptyError – if it is empty
- Return type
Dict[str, Any]
- class fugue.dataframe.dataframe.DataFrameDisplay(ds)[source]#
Bases:
DatasetDisplay
DataFrame plain display class
- Parameters
ds (Dataset) –
- class fugue.dataframe.dataframe.LocalBoundedDataFrame(schema=None)[source]#
Bases:
LocalDataFrame
Base class of all local bounded dataframes. Please read the DataFrame Tutorial to understand the concept
- Parameters
schema (Any) – Schema like object
Note
This is an abstract class, and normally you don’t construct it by yourself unless you are implementing a new ExecutionEngine
- property is_bounded: bool#
Always True because it’s a bounded dataframe
- class fugue.dataframe.dataframe.LocalDataFrame(schema=None)[source]#
Bases:
DataFrame
Base class of all local dataframes. Please read the DataFrame Tutorial to understand the concept
- Parameters
schema (Any) – a schema-like object
Note
This is an abstract class, and normally you don’t construct it by yourself unless you are implementing a new ExecutionEngine
- property is_local: bool#
Always True because it’s a LocalDataFrame
- native_as_df()[source]#
The dataframe form of the native object this Dataset class wraps. Dataframe form means the object contains schema information. For example, the native object of an ArrayDataFrame is a python array, which doesn’t contain schema information, so its native_as_df should be either a pandas dataframe or an arrow dataframe.
- Return type
AnyDataFrame
- property num_partitions: int#
Always 1 because it’s a LocalDataFrame
- class fugue.dataframe.dataframe.LocalUnboundedDataFrame(schema=None)[source]#
Bases:
LocalDataFrame
Base class of all local unbounded dataframes. Please read the DataFrame Tutorial (https://fugue-tutorials.readthedocs.io/en/latest/tutorials/advanced/schema_dataframes.html#DataFrame) to understand the concept
- Parameters
schema (Any) – Schema like object
Note
This is an abstract class, and normally you don’t construct it by yourself unless you are implementing a new ExecutionEngine
- as_local()[source]#
Convert this dataframe to a LocalDataFrame
- Return type
- count()[source]#
- Raises
InvalidOperationError – You can’t count an unbounded dataframe
- Return type
int
- property is_bounded#
Always False because it’s an unbounded dataframe
- class fugue.dataframe.dataframe.YieldedDataFrame(yid)[source]#
Bases:
Yielded
Yielded dataframe from FugueWorkflow. Users shouldn’t create this object directly.
- Parameters
yid (str) – unique id for determinism
- property is_set: bool#
Whether the value is set. It can be false if the parent workflow has not been executed.
fugue.dataframe.dataframe_iterable_dataframe#
- class fugue.dataframe.dataframe_iterable_dataframe.IterableArrowDataFrame(df=None, schema=None)[source]#
Bases:
LocalDataFrameIterableDataFrame
- Parameters
df (Any) –
schema (Any) –
- class fugue.dataframe.dataframe_iterable_dataframe.IterablePandasDataFrame(df=None, schema=None)[source]#
Bases:
LocalDataFrameIterableDataFrame
- Parameters
df (Any) –
schema (Any) –
- as_local_bounded()[source]#
Convert this dataframe to a LocalBoundedDataFrame
- Return type
- class fugue.dataframe.dataframe_iterable_dataframe.LocalDataFrameIterableDataFrame(df=None, schema=None)[source]#
Bases:
LocalUnboundedDataFrame
DataFrame that wraps an iterable of local dataframes
- Parameters
df (Any) – an iterable of DataFrame. Any dataframe that is not local will be converted to LocalDataFrame by as_local()
schema (Any) – Schema like object, if it is provided, it must match the schema of the dataframes
Examples
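from fugue.dataframe.array_dataframe import ArrayDataFrame
from fugue.dataframe.dataframe_iterable_dataframe import LocalDataFrameIterableDataFrame
from fugue.dataframe.iterable_dataframe import IterableDataFrame

def get_dfs():
    # empty dataframes are skipped, so their schemas don't matter
    yield IterableDataFrame([], "a:int,b:int")
    yield IterableDataFrame([[1, 10]], "a:int,b:int")
    yield ArrayDataFrame([], "a:int,b:str")

df = LocalDataFrameIterableDataFrame(get_dfs())
for subdf in df.native:
    subdf.show()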
Note
It’s ok to peek the dataframe, it will not affect the iteration, but it’s invalid to count.
schema can be used when the iterable contains no dataframe. But if there is any dataframe, schema must match the schema of the dataframes.
For the iterable of dataframes, any empty dataframes are skipped and their schemas do not matter. However, if all dataframes in the iterable are empty, then the last empty dataframe will be used to set the schema.
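The schema rule in the note above can be sketched without Fugue. In this minimal sketch each dataframe is represented by a (schema string, rows) tuple, and resolve_schema is a hypothetical helper, not part of the library. It only inspects dataframes up to the first non-empty one, and it assumes that when every dataframe is empty, the last empty dataframe takes precedence over an explicit schema.

```python
from typing import Iterable, List, Optional, Tuple

# Stand-in for a local dataframe: (schema string, list of rows).
Frame = Tuple[str, List[list]]


def resolve_schema(frames: Iterable[Frame], schema: Optional[str] = None) -> Optional[str]:
    """Hypothetical helper mirroring the rule described in the note."""
    last_empty_schema: Optional[str] = None
    for frame_schema, rows in frames:
        if not rows:
            # Empty dataframes are skipped; remember the last one as a fallback.
            last_empty_schema = frame_schema
            continue
        # The first non-empty dataframe decides the schema.
        if schema is not None and schema != frame_schema:
            raise ValueError(f"{schema} does not match {frame_schema}")
        return frame_schema
    # Assumption: with no non-empty dataframe, the last empty one sets the
    # schema; the explicit schema is used only when there is no dataframe.
    return last_empty_schema if last_empty_schema is not None else schema


frames = [("a:int,b:int", []), ("a:int,b:int", [[1, 10]]), ("a:int,b:str", [])]
print(resolve_schema(frames))  # a:int,b:int
```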
- alter_columns(columns)[source]#
Change column types
- Parameters
columns (Any) – Schema like object, all columns should be contained by the dataframe schema
- Returns
a new dataframe with altered columns, the order of the original schema will not change
- Return type
- as_array(columns=None, type_safe=False)[source]#
Convert to 2-dimensional native python array
- Parameters
columns (Optional[List[str]]) – columns to extract, defaults to None
type_safe (bool) – whether to ensure output conforms with its schema, defaults to False
- Returns
2-dimensional native python array
- Return type
List[Any]
Note
If type_safe is False, then the returned values are ‘raw’ values.
- as_array_iterable(columns=None, type_safe=False)[source]#
Convert to iterable of native python arrays
- Parameters
columns (Optional[List[str]]) – columns to extract, defaults to None
type_safe (bool) – whether to ensure output conforms with its schema, defaults to False
- Returns
iterable of native python arrays
- Return type
Iterable[Any]
Note
If type_safe is False, then the returned values are ‘raw’ values.
- as_arrow(type_safe=False)[source]#
Convert to pyArrow DataFrame
- Parameters
type_safe (bool) –
- Return type
- as_local_bounded()[source]#
Convert this dataframe to a LocalBoundedDataFrame
- Return type
- property empty: bool#
Whether this dataframe is empty
- head(n, columns=None)[source]#
Get first n rows of the dataframe as a new local bounded dataframe
- Parameters
n (int) – number of rows
columns (Optional[List[str]]) – selected columns, defaults to None (all columns)
- Returns
a local bounded dataframe
- Return type
- property native: EmptyAwareIterable[LocalDataFrame]#
Iterable of dataframes
- peek_array()[source]#
Peek the first row of the dataframe as array
- Raises
FugueDatasetEmptyError – if it is empty
- Return type
List[Any]
fugue.dataframe.dataframes#
- class fugue.dataframe.dataframes.DataFrames(*args, **kwargs)[source]#
Bases:
IndexedOrderedDict[str, DataFrame]
Ordered dictionary of DataFrames. There are two modes: with keys and without keys. Without keys, _<n> will be used as the key for each dataframe, and the collection will be treated as an array in the Fugue framework.
It’s a subclass of dict, so it supports all dict operations. It’s also ordered, so you can trust the order of keys and values.
The initialization is flexible
>>> df1 = ArrayDataFrame([[0]],"a:int")
>>> df2 = ArrayDataFrame([[1]],"a:int")
>>> dfs = DataFrames(df1,df2) # init as [df1, df2]
>>> assert not dfs.has_key
>>> assert df1 is dfs[0] and df2 is dfs[1]
>>> dfs_array = list(dfs.values())
>>> dfs = DataFrames(a=df1,b=df2) # init as {a:df1, b:df2}
>>> assert dfs.has_key
>>> assert df1 is dfs[0] and df2 is dfs[1] # order is guaranteed
>>> df3 = ArrayDataFrame([[1]],"b:int")
>>> dfs2 = DataFrames(dfs, c=df3) # {a:df1, b:df2, c:df3}
>>> dfs2 = DataFrames(dfs, df3) # invalid, because dfs has key, df3 doesn't
>>> dfs2 = DataFrames(dict(a=df1,b=df2)) # init as {a:df1, b:df2}
>>> dfs2 = DataFrames([df1,df2],df3) # init as [df1, df2, df3]
- Parameters
args (Any) –
kwargs (Any) –
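The keyed and unkeyed modes can be illustrated with a plain dict. This is a minimal sketch under stated assumptions, not the DataFrames implementation: sketch_collect is a hypothetical helper, strings stand in for dataframes, generated keys are assumed to start at _0, and any dict argument is treated as keyed.

```python
from typing import Any, Dict, Optional, Tuple


def sketch_collect(*args: Any, **kwargs: Any) -> Tuple[Dict[str, Any], bool]:
    """Hypothetical helper: returns (ordered mapping, has_key)."""
    items: Dict[str, Any] = {}  # dicts preserve insertion order
    has_key: Optional[bool] = None  # unknown until the first element is added

    def add(key: Optional[str], value: Any) -> None:
        nonlocal has_key
        keyed = key is not None
        if has_key is None:
            has_key = keyed
        elif has_key != keyed:
            raise ValueError("can't mix keyed and unkeyed elements")
        # Without keys, _<n> is generated from the current position.
        items[key if keyed else f"_{len(items)}"] = value

    for arg in args:
        if isinstance(arg, dict):
            for k, v in arg.items():
                add(k, v)
        elif isinstance(arg, (list, tuple)):
            for v in arg:
                add(None, v)
        else:
            add(None, arg)
    for k, v in kwargs.items():
        add(k, v)
    return items, bool(has_key)


print(sketch_collect("df1", "df2"))      # ({'_0': 'df1', '_1': 'df2'}, False)
print(sketch_collect(a="df1", b="df2"))  # ({'a': 'df1', 'b': 'df2'}, True)
```

Mixing the two modes, for example sketch_collect({"a": "df1"}, "df3"), raises ValueError in this sketch, which corresponds to the invalid case in the examples above.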
- convert(func)[source]#
Create another DataFrames with the same structure, but all converted by func
- Returns
the new DataFrames
- Parameters
- Return type
Examples
>>> dfs2 = dfs.convert(lambda df: df.as_local()) # convert all to local
- property has_key#
If this collection has key (dict-like) or not (list-like)
fugue.dataframe.function_wrapper#
- class fugue.dataframe.function_wrapper.DataFrameFunctionWrapper(func, params_re='.*', return_re='.*')[source]#
Bases:
FunctionWrapper
- Parameters
func (Callable) –
params_re (str) –
return_re (str) –
- property need_output_schema: Optional[bool]#
- class fugue.dataframe.function_wrapper.DataFrameParam(param)[source]#
Bases:
_DataFrameParamBase
- Parameters
param (Optional[Parameter]) –
- class fugue.dataframe.function_wrapper.LocalDataFrameParam(param)[source]#
Bases:
DataFrameParam
- Parameters
param (Optional[Parameter]) –
- count(df)[source]#
- Parameters
df (LocalDataFrame) –
- Return type
int
- to_output_df(output, schema, ctx)[source]#
- Parameters
output (LocalDataFrame) –
schema (Any) –
ctx (Any) –
- Return type
fugue.dataframe.iterable_dataframe#
- class fugue.dataframe.iterable_dataframe.IterableDataFrame(df=None, schema=None)[source]#
Bases:
LocalUnboundedDataFrame
DataFrame that wraps native python iterable of arrays. Please read the DataFrame Tutorial to understand the concept
- Parameters
df (Any) – 2-dimensional array, iterable of arrays, or DataFrame
schema (Any) – Schema like object
Examples
>>> a = IterableDataFrame([[0,'a'],[1,'b']],"a:int,b:str")
>>> b = IterableDataFrame(a)
Note
It’s ok to peek the dataframe, it will not affect the iteration, but it’s invalid to count.
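Peeking without affecting the iteration can be done by buffering the first row. The following is a minimal sketch of that idea, not Fugue's EmptyAwareIterable: the class name PeekableRows is hypothetical, and ValueError stands in for FugueDatasetEmptyError.

```python
from typing import Any, Iterable, Iterator


class PeekableRows:
    """Hypothetical stand-in for an 'empty aware' iterable of rows."""

    _MISSING = object()  # sentinel meaning "no buffered row"

    def __init__(self, rows: Iterable[Any]):
        self._it: Iterator[Any] = iter(rows)
        self._head: Any = self._MISSING

    def _fill(self) -> None:
        # Fetch one row into the buffer if the buffer is empty.
        if self._head is self._MISSING:
            self._head = next(self._it, self._MISSING)

    @property
    def empty(self) -> bool:
        self._fill()
        return self._head is self._MISSING

    def peek(self) -> Any:
        if self.empty:
            raise ValueError("no rows to peek")
        return self._head  # the row stays buffered for the iteration

    def __iter__(self) -> Iterator[Any]:
        self._fill()
        while self._head is not self._MISSING:
            row, self._head = self._head, self._MISSING
            yield row
            self._fill()


rows = PeekableRows(iter([[0, "a"], [1, "b"]]))
print(rows.peek())  # [0, 'a']
print(list(rows))   # [[0, 'a'], [1, 'b']]
```

Counting is not supported here for the same reason it is invalid on an unbounded dataframe: it would consume the underlying iterator.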
- alter_columns(columns)[source]#
Change column types
- Parameters
columns (Any) – Schema like object, all columns should be contained by the dataframe schema
- Returns
a new dataframe with altered columns, the order of the original schema will not change
- Return type
- as_array(columns=None, type_safe=False)[source]#
Convert to 2-dimensional native python array
- Parameters
columns (Optional[List[str]]) – columns to extract, defaults to None
type_safe (bool) – whether to ensure output conforms with its schema, defaults to False
- Returns
2-dimensional native python array
- Return type
List[Any]
Note
If type_safe is False, then the returned values are ‘raw’ values.
- as_array_iterable(columns=None, type_safe=False)[source]#
Convert to iterable of native python arrays
- Parameters
columns (Optional[List[str]]) – columns to extract, defaults to None
type_safe (bool) – whether to ensure output conforms with its schema, defaults to False
- Returns
iterable of native python arrays
- Return type
Iterable[Any]
Note
If type_safe is False, then the returned values are ‘raw’ values.
- as_local_bounded()[source]#
Convert this dataframe to a LocalBoundedDataFrame
- Return type
- property empty: bool#
Whether this dataframe is empty
- head(n, columns=None)[source]#
Get first n rows of the dataframe as a new local bounded dataframe
- Parameters
n (int) – number of rows
columns (Optional[List[str]]) – selected columns, defaults to None (all columns)
- Returns
a local bounded dataframe
- Return type
- property native: EmptyAwareIterable[Any]#
Iterable of native python arrays
- peek_array()[source]#
Peek the first row of the dataframe as array
- Raises
FugueDatasetEmptyError – if it is empty
- Return type
List[Any]
fugue.dataframe.pandas_dataframe#
- class fugue.dataframe.pandas_dataframe.PandasDataFrame(df=None, schema=None, pandas_df_wrapper=False)[source]#
Bases:
LocalBoundedDataFrame
DataFrame that wraps pandas DataFrame. Please also read the DataFrame Tutorial to understand this Fugue concept
- Parameters
df (Any) – 2-dimensional array, iterable of arrays or pandas DataFrame
schema (Any) – Schema like object
pandas_df_wrapper (bool) – if this is a simple wrapper, default False
Examples
>>> PandasDataFrame([[0,'a'],[1,'b']],"a:int,b:str")
>>> PandasDataFrame(schema = "a:int,b:int") # empty dataframe
>>> PandasDataFrame(pd.DataFrame([[0]],columns=["a"]))
>>> PandasDataFrame(ArrayDataFrame([[0]],"a:int").as_pandas())
Note
If pandas_df_wrapper is True, then the constructor will not do any type check; otherwise, it will enforce types according to the input schema after construction.
- alter_columns(columns)[source]#
Change column types
- Parameters
columns (Any) – Schema like object, all columns should be contained by the dataframe schema
- Returns
a new dataframe with altered columns, the order of the original schema will not change
- Return type
- as_array(columns=None, type_safe=False)[source]#
Convert to 2-dimensional native python array
- Parameters
columns (Optional[List[str]]) – columns to extract, defaults to None
type_safe (bool) – whether to ensure output conforms with its schema, defaults to False
- Returns
2-dimensional native python array
- Return type
List[Any]
Note
If type_safe is False, then the returned values are ‘raw’ values.
- as_array_iterable(columns=None, type_safe=False)[source]#
Convert to iterable of native python arrays
- Parameters
columns (Optional[List[str]]) – columns to extract, defaults to None
type_safe (bool) – whether to ensure output conforms with its schema, defaults to False
- Returns
iterable of native python arrays
- Return type
Iterable[Any]
Note
If type_safe is False, then the returned values are ‘raw’ values.
- property empty: bool#
Whether this dataframe is empty
- head(n, columns=None)[source]#
Get first n rows of the dataframe as a new local bounded dataframe
- Parameters
n (int) – number of rows
columns (Optional[List[str]]) – selected columns, defaults to None (all columns)
- Returns
a local bounded dataframe
- Return type
- native_as_df()[source]#
The dataframe form of the native object this Dataset class wraps. Dataframe form means the object contains schema information. For example, the native object of an ArrayDataFrame is a python array, which doesn’t contain schema information, so its native_as_df should be either a pandas dataframe or an arrow dataframe.
- Return type
- peek_array()[source]#
Peek the first row of the dataframe as array
- Raises
FugueDatasetEmptyError – if it is empty
- Return type
List[Any]
fugue.dataframe.utils#
- fugue.dataframe.utils.deserialize_df(data, fs=None)[source]#
Deserialize json string to LocalBoundedDataFrame
- Parameters
json_str – json string containing the base64 data or a file path
fs (Optional[FileSystem]) – FileSystem, defaults to None
data (Optional[bytes]) –
- Raises
ValueError – if the json string is invalid, not generated from serialize_df()
- Returns
LocalBoundedDataFrame if json_str contains a dataframe or None if it’s valid but contains no data
- Return type
Optional[LocalBoundedDataFrame]
- fugue.dataframe.utils.get_join_schemas(df1, df2, how, on)[source]#
Get Schema object after joining df1 and df2. If on is not empty, it’s mainly for validation purpose.
- Parameters
df1 (DataFrame) – first dataframe
df2 (DataFrame) – second dataframe
how (str) – can accept semi, left_semi, anti, left_anti, inner, left_outer, right_outer, full_outer, cross
on (Optional[Iterable[str]]) – it can always be inferred, but if you provide it, it will be validated against the inferred keys.
- Returns
the pair of key schema and schema after join
- Return type
Note
In Fugue, joined schema can always be inferred because it always uses the input dataframes’ common keys as the join keys. So you must make sure to rename() the input dataframes so they follow this rule.
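The key inference described in the note can be sketched with plain lists. This is a minimal sketch, not the get_join_schemas implementation: sketch_join_schemas is a hypothetical helper, (name, type) pairs stand in for Schema, key types are not compared, and it is an assumption that semi and anti joins keep only the columns of df1.

```python
from typing import Iterable, List, Optional, Tuple

Columns = List[Tuple[str, str]]  # (name, type) pairs standing in for a Schema


def sketch_join_schemas(
    cols1: Columns, cols2: Columns, how: str, on: Optional[Iterable[str]] = None
) -> Tuple[Columns, Columns]:
    """Hypothetical helper: returns (key columns, columns after the join)."""
    names2 = {name for name, _ in cols2}
    # Join keys are inferred as the columns common to both inputs.
    keys = [(name, tp) for name, tp in cols1 if name in names2]
    provided = list(on) if on is not None else []
    # A non-empty `on` is only used to validate the inferred keys.
    if provided and sorted(provided) != sorted(name for name, _ in keys):
        raise ValueError(f"{provided} does not match the inferred keys")
    if how in ("semi", "left_semi", "anti", "left_anti"):
        # Assumption: semi and anti joins keep only the left side's columns.
        return keys, list(cols1)
    key_names = {name for name, _ in keys}
    joined = list(cols1) + [(n, t) for n, t in cols2 if n not in key_names]
    return keys, joined


left = [("a", "int"), ("b", "str")]
right = [("a", "int"), ("c", "double")]
print(sketch_join_schemas(left, right, "inner"))
# ([('a', 'int')], [('a', 'int'), ('b', 'str'), ('c', 'double')])
```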
- fugue.dataframe.utils.serialize_df(df, threshold=-1, file_path=None, fs=None)[source]#
Serialize input dataframe to base64 string or to file if it’s larger than threshold
- Parameters
df (Optional[DataFrame]) – input DataFrame
threshold (int) – file byte size threshold, defaults to -1
file_path (Optional[str]) – file path to store the data (used only if the serialized data is larger than threshold), defaults to None
fs (Optional[FileSystem]) – FileSystem, defaults to None
- Raises
InvalidOperationError – if file is large but file_path is not provided
- Returns
a pickled blob either containing the data or the file path
- Return type
Optional[bytes]
Note
If fs is not provided but it needs to write to disk, then it will use open_fs() to try to open the file to write.
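The threshold behavior can be sketched with the standard library only. This is a minimal sketch, not Fugue's serialize_df or deserialize_df: sketch_serialize and sketch_deserialize are hypothetical helpers, a list of rows stands in for the dataframe, the blob layout is invented for illustration, and treating a negative threshold as "always inline" is an assumption.

```python
import base64
import os
import pickle
import tempfile
from typing import Any, Optional


def sketch_serialize(rows: Any, threshold: int = -1, file_path: Optional[str] = None) -> bytes:
    """Hypothetical helper: embed a small payload, or spill a large one to a file."""
    data = pickle.dumps(rows)
    # Assumption: a negative threshold means the data is always embedded.
    if threshold < 0 or len(data) <= threshold:
        return pickle.dumps({"data": base64.b64encode(data).decode("ascii")})
    if file_path is None:
        raise ValueError("payload exceeds threshold but file_path is not provided")
    with open(file_path, "wb") as f:
        f.write(data)
    # Only the path travels in the blob; the data lives in the file.
    return pickle.dumps({"path": file_path})


def sketch_deserialize(blob: bytes) -> Any:
    """Hypothetical inverse of sketch_serialize."""
    info = pickle.loads(blob)
    if "data" in info:
        return pickle.loads(base64.b64decode(info["data"]))
    with open(info["path"], "rb") as f:
        return pickle.load(f)


rows = [[0, "a"], [1, "b"]]
inline = sketch_serialize(rows)
with tempfile.TemporaryDirectory() as tmp:
    spilled = sketch_serialize(rows, threshold=0, file_path=os.path.join(tmp, "df.bin"))
    assert sketch_deserialize(inline) == sketch_deserialize(spilled) == rows
```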