fugue.dataframe
fugue.dataframe.api
- fugue.dataframe.api.get_native_as_df(df)[source]
Return the dataframe form of any dataframe. If df is a DataFrame, this calls its native_as_df(); otherwise, the result depends on whether there is a corresponding function that handles it.
- Parameters:
df (AnyDataFrame)
- Return type:
AnyDataFrame
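The dispatch behaviour described above can be pictured with a small standard-library sketch. It is only an illustration under stated assumptions: `SketchFrame` and `sketch_native_as_df` are hypothetical names, and `functools.singledispatch` is swapped in for whatever dispatch mechanism Fugue actually uses.

```python
from functools import singledispatch
from typing import Any, Dict, List


class SketchFrame:
    """Hypothetical stand-in for a Fugue DataFrame wrapper (not Fugue's class)."""

    def __init__(self, rows: List[List[Any]], columns: List[str]):
        self.rows = rows
        self.columns = columns

    def native_as_df(self) -> Dict[str, Any]:
        # The "dataframe form" carries column information along with the rows.
        return {"columns": self.columns, "rows": self.rows}


@singledispatch
def sketch_native_as_df(df: Any) -> Any:
    # Fallback: no handler is available for this type.
    raise NotImplementedError(f"no handler for {type(df).__name__}")


@sketch_native_as_df.register
def _(df: SketchFrame) -> Any:
    # A wrapper object delegates to its own native_as_df().
    return df.native_as_df()


@sketch_native_as_df.register
def _(df: dict) -> Any:
    # In this sketch a dict already is the dataframe form, so return it as-is.
    return df
```

In this sketch an unsupported input such as a plain list raises NotImplementedError, which is one possible way to handle a type that has no corresponding function.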
- fugue.dataframe.api.normalize_column_names(df)[source]
A generic function to normalize any dataframe’s column names to follow Fugue naming rules
Note
This is a temporary solution before Schema can take arbitrary names
Examples
[0,1] => {"_0":0, "_1":1}
["1a","2b"] => {"_1a":"1a", "_2b":"2b"}
["*a","-a"] => {"_a":"*a", "_a_1":"-a"}
- Parameters:
df (AnyDataFrame) – a dataframe object
- Returns:
the renamed dataframe and the rename operations as a dict that can undo the change
- Return type:
Tuple[AnyDataFrame, Dict[str, Any]]
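The three examples above are consistent with a simple rule: replace invalid characters with an underscore, prefix names that start with a digit, and add a numeric suffix on collisions. The following is a minimal sketch of such a rule, not Fugue's implementation; `sketch_normalize` is a hypothetical name, and the choice to leave already-valid names out of the undo dict is an assumption.

```python
import re
from typing import Any, Dict, List


def sketch_normalize(names: List[Any]) -> Dict[str, Any]:
    """Return {new_name: original_name} for every name that had to change."""
    used = set()
    undo: Dict[str, Any] = {}
    for original in names:
        # Replace every character that is not a letter, digit or underscore.
        candidate = re.sub(r"[^0-9A-Za-z_]", "_", str(original))
        # A name may not be empty or start with a digit.
        if candidate == "" or candidate[0].isdigit():
            candidate = "_" + candidate
        # Resolve collisions with a numeric suffix: _a, _a_1, _a_2, ...
        unique, n = candidate, 0
        while unique in used:
            n += 1
            unique = f"{candidate}_{n}"
        used.add(unique)
        if unique != original:  # assumption: unchanged names are not recorded
            undo[unique] = original
    return undo
```

Applied to the inputs listed above, the sketch reproduces the documented mappings, for example `sketch_normalize(["*a", "-a"])` gives `{"_a": "*a", "_a_1": "-a"}`.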
See also
fugue.dataframe.array_dataframe
- class fugue.dataframe.array_dataframe.ArrayDataFrame(df=None, schema=None)[source]
Bases:
LocalBoundedDataFrame
DataFrame that wraps native python 2-dimensional arrays. Please read the DataFrame Tutorial to understand the concept
- Parameters:
df (Any) – 2-dimensional array, iterable of arrays, or DataFrame
schema (Any) – Schema like object
Examples
>>> a = ArrayDataFrame([[0,'a'],[1,'b']],"a:int,b:str")
>>> b = ArrayDataFrame(a)
- alter_columns(columns)[source]
Change column types
- Parameters:
columns (Any) – Schema like object, all columns should be contained by the dataframe schema
- Returns:
a new dataframe with altered columns, the order of the original schema will not change
- Return type:
- as_array(columns=None, type_safe=False)[source]
Convert to 2-dimensional native python array
- Parameters:
columns (List[str] | None) – columns to extract, defaults to None
type_safe (bool) – whether to ensure output conforms with its schema, defaults to False
- Returns:
2-dimensional native python array
- Return type:
List[Any]
Note
If type_safe is False, then the returned values are ‘raw’ values.
- as_array_iterable(columns=None, type_safe=False)[source]
Convert to iterable of native python arrays
- Parameters:
columns (List[str] | None) – columns to extract, defaults to None
type_safe (bool) – whether to ensure output conforms with its schema, defaults to False
- Returns:
iterable of native python arrays
- Return type:
Iterable[Any]
Note
If type_safe is False, then the returned values are ‘raw’ values.
- property empty: bool
Whether this dataframe is empty
- head(n, columns=None)[source]
Get first n rows of the dataframe as a new local bounded dataframe
- Parameters:
n (int) – number of rows
columns (List[str] | None) – selected columns, defaults to None (all columns)
- Returns:
a local bounded dataframe
- Return type:
- property native: List[Any]
2-dimensional native python array
- peek_array()[source]
Peek the first row of the dataframe as array
- Raises:
FugueDatasetEmptyError – if it is empty
- Return type:
List[Any]
fugue.dataframe.arrow_dataframe
- class fugue.dataframe.arrow_dataframe.ArrowDataFrame(df=None, schema=None)[source]
Bases:
LocalBoundedDataFrame
DataFrame that wraps pyarrow.Table. Please also read the DataFrame Tutorial to understand this Fugue concept
- Parameters:
df (Any) – 2-dimensional array, iterable of arrays, pyarrow.Table or pandas DataFrame
schema (Any) – Schema like object
Examples
>>> ArrowDataFrame([[0,'a'],[1,'b']],"a:int,b:str")
>>> ArrowDataFrame(schema = "a:int,b:int") # empty dataframe
>>> ArrowDataFrame(pd.DataFrame([[0]],columns=["a"]))
>>> ArrowDataFrame(ArrayDataFrame([[0]],"a:int").as_arrow())
- alter_columns(columns)[source]
Change column types
- Parameters:
columns (Any) – Schema like object, all columns should be contained by the dataframe schema
- Returns:
a new dataframe with altered columns, the order of the original schema will not change
- Return type:
- as_array(columns=None, type_safe=False)[source]
Convert to 2-dimensional native python array
- Parameters:
columns (List[str] | None) – columns to extract, defaults to None
type_safe (bool) – whether to ensure output conforms with its schema, defaults to False
- Returns:
2-dimensional native python array
- Return type:
List[Any]
Note
If type_safe is False, then the returned values are ‘raw’ values.
- as_array_iterable(columns=None, type_safe=False)[source]
Convert to iterable of native python arrays
- Parameters:
columns (List[str] | None) – columns to extract, defaults to None
type_safe (bool) – whether to ensure output conforms with its schema, defaults to False
- Returns:
iterable of native python arrays
- Return type:
Iterable[Any]
Note
If type_safe is False, then the returned values are ‘raw’ values.
- as_arrow(type_safe=False)[source]
Convert to a pyarrow.Table
- Parameters:
type_safe (bool)
- Return type:
- as_dict_iterable(columns=None)[source]
Convert to iterable of python dicts
- Parameters:
columns (List[str] | None) – columns to extract, defaults to None
- Returns:
iterable of python dicts
- Return type:
Iterable[Dict[str, Any]]
Note
The default implementation enforces type_safe=True
- as_dicts(columns=None)[source]
Convert to a list of python dicts
- Parameters:
columns (List[str] | None) – columns to extract, defaults to None
- Returns:
a list of python dicts
- Return type:
List[Dict[str, Any]]
Note
The default implementation enforces type_safe=True
- property empty: bool
Whether this dataframe is empty
- head(n, columns=None)[source]
Get first n rows of the dataframe as a new local bounded dataframe
- Parameters:
n (int) – number of rows
columns (List[str] | None) – selected columns, defaults to None (all columns)
- Returns:
a local bounded dataframe
- Return type:
- native_as_df()[source]
The dataframe form of the native object this Dataset class wraps. Dataframe form means the object contains schema information. For example, the native object of an ArrayDataFrame is a python array, which doesn’t contain schema information, so its native_as_df should be either a pandas dataframe or an arrow dataframe.
- Return type:
- peek_array()[source]
Peek the first row of the dataframe as array
- Raises:
FugueDatasetEmptyError – if it is empty
- Return type:
List[Any]
- peek_dict()[source]
Peek the first row of the dataframe as dict
- Raises:
FugueDatasetEmptyError – if it is empty
- Return type:
Dict[str, Any]
fugue.dataframe.dataframe
- class fugue.dataframe.dataframe.DataFrame(schema=None)[source]
Bases:
Dataset
Base class of Fugue DataFrame. Please read the DataFrame Tutorial to understand the concept
- Parameters:
schema (Any) – Schema like object
Note
This is an abstract class, and normally you don’t construct it by yourself unless you are implementing a new ExecutionEngine
- abstract alter_columns(columns)[source]
Change column types
- Parameters:
columns (Any) – Schema like object, all columns should be contained by the dataframe schema
- Returns:
a new dataframe with altered columns, the order of the original schema will not change
- Return type:
- abstract as_array(columns=None, type_safe=False)[source]
Convert to 2-dimensional native python array
- Parameters:
columns (List[str] | None) – columns to extract, defaults to None
type_safe (bool) – whether to ensure output conforms with its schema, defaults to False
- Returns:
2-dimensional native python array
- Return type:
List[Any]
Note
If type_safe is False, then the returned values are ‘raw’ values.
- abstract as_array_iterable(columns=None, type_safe=False)[source]
Convert to iterable of native python arrays
- Parameters:
columns (List[str] | None) – columns to extract, defaults to None
type_safe (bool) – whether to ensure output conforms with its schema, defaults to False
- Returns:
iterable of native python arrays
- Return type:
Iterable[Any]
Note
If type_safe is False, then the returned values are ‘raw’ values.
- as_arrow(type_safe=False)[source]
Convert to a pyarrow.Table
- Parameters:
type_safe (bool)
- Return type:
- as_dict_iterable(columns=None)[source]
Convert to iterable of python dicts
- Parameters:
columns (List[str] | None) – columns to extract, defaults to None
- Returns:
iterable of python dicts
- Return type:
Iterable[Dict[str, Any]]
Note
The default implementation enforces type_safe=True
- as_dicts(columns=None)[source]
Convert to a list of python dicts
- Parameters:
columns (List[str] | None) – columns to extract, defaults to None
- Returns:
a list of python dicts
- Return type:
List[Dict[str, Any]]
Note
The default implementation enforces type_safe=True
- as_local()[source]
Convert this dataframe to a LocalDataFrame
- Return type:
- abstract as_local_bounded()[source]
Convert this dataframe to a LocalBoundedDataFrame
- Return type:
- property columns: List[str]
The column names of the dataframe
- drop(columns)[source]
Drop certain columns and return a new dataframe
- Parameters:
columns (List[str]) – columns to drop
- Raises:
FugueDataFrameOperationError – if columns are not strictly contained by this dataframe, or if they are all of the dataframe’s columns
- Returns:
a new dataframe removing the columns
- Return type:
- get_info_str()[source]
Get dataframe information (schema, type, metadata) as json string
- Returns:
json string
- Return type:
str
- abstract head(n, columns=None)[source]
Get first n rows of the dataframe as a new local bounded dataframe
- Parameters:
n (int) – number of rows
columns (List[str] | None) – selected columns, defaults to None (all columns)
- Returns:
a local bounded dataframe
- Return type:
- abstract native_as_df()[source]
The dataframe form of the native object this Dataset class wraps. Dataframe form means the object contains schema information. For example, the native object of an ArrayDataFrame is a python array, which doesn’t contain schema information, so its native_as_df should be either a pandas dataframe or an arrow dataframe.
- Return type:
AnyDataFrame
- abstract peek_array()[source]
Peek the first row of the dataframe as array
- Raises:
FugueDatasetEmptyError – if it is empty
- Return type:
List[Any]
- peek_dict()[source]
Peek the first row of the dataframe as dict
- Raises:
FugueDatasetEmptyError – if it is empty
- Return type:
Dict[str, Any]
- class fugue.dataframe.dataframe.DataFrameDisplay(ds)[source]
Bases:
DatasetDisplay
DataFrame plain display class
- Parameters:
ds (Dataset)
- class fugue.dataframe.dataframe.LocalBoundedDataFrame(schema=None)[source]
Bases:
LocalDataFrame
Base class of all local bounded dataframes. Please read the DataFrame Tutorial to understand the concept
- Parameters:
schema (Any) – Schema like object
Note
This is an abstract class, and normally you don’t construct it by yourself unless you are implementing a new ExecutionEngine
- property is_bounded: bool
Always True because it’s a bounded dataframe
- class fugue.dataframe.dataframe.LocalDataFrame(schema=None)[source]
Bases:
DataFrame
Base class of all local dataframes. Please read the DataFrame Tutorial to understand the concept
- Parameters:
schema (Any) – a schema-like object
Note
This is an abstract class, and normally you don’t construct it by yourself unless you are implementing a new ExecutionEngine
- property is_local: bool
Always True because it’s a LocalDataFrame
- native_as_df()[source]
The dataframe form of the native object this Dataset class wraps. Dataframe form means the object contains schema information. For example, the native object of an ArrayDataFrame is a python array, which doesn’t contain schema information, so its native_as_df should be either a pandas dataframe or an arrow dataframe.
- Return type:
AnyDataFrame
- property num_partitions: int
Always 1 because it’s a LocalDataFrame
- class fugue.dataframe.dataframe.LocalUnboundedDataFrame(schema=None)[source]
Bases:
LocalDataFrame
Base class of all local unbounded dataframes. Please read the DataFrame Tutorial <https://fugue-tutorials.readthedocs.io/en/latest/tutorials/advanced/schema_dataframes.html#DataFrame> to understand the concept
- Parameters:
schema (Any) – Schema like object
Note
This is an abstract class, and normally you don’t construct it by yourself unless you are implementing a new ExecutionEngine
- as_local()[source]
Convert this dataframe to a LocalDataFrame
- Return type:
- count()[source]
- Raises:
InvalidOperationError – You can’t count an unbounded dataframe
- Return type:
int
- property is_bounded
Always False because it’s an unbounded dataframe
- class fugue.dataframe.dataframe.YieldedDataFrame(yid)[source]
Bases:
Yielded
Yielded dataframe from FugueWorkflow. Users shouldn’t create this object directly.
- Parameters:
yid (str) – unique id for determinism
- property is_set: bool
Whether the value is set. It can be false if the parent workflow has not been executed.
fugue.dataframe.dataframe_iterable_dataframe
- class fugue.dataframe.dataframe_iterable_dataframe.IterableArrowDataFrame(df=None, schema=None)[source]
Bases:
LocalDataFrameIterableDataFrame
- Parameters:
df (Any)
schema (Any)
- class fugue.dataframe.dataframe_iterable_dataframe.IterablePandasDataFrame(df=None, schema=None)[source]
Bases:
LocalDataFrameIterableDataFrame
- Parameters:
df (Any)
schema (Any)
- as_local_bounded()[source]
Convert this dataframe to a LocalBoundedDataFrame
- Return type:
- class fugue.dataframe.dataframe_iterable_dataframe.LocalDataFrameIterableDataFrame(df=None, schema=None)[source]
Bases:
LocalUnboundedDataFrame
DataFrame that wraps an iterable of local dataframes
- Parameters:
df (Any) – an iterable of DataFrame. If any of them is not local, it will be converted to LocalDataFrame by as_local()
schema (Any) – Schema like object, if it is provided, it must match the schema of the dataframes
Examples
def get_dfs():
    yield IterableDataFrame([], "a:int,b:int")
    yield IterableDataFrame([[1, 10]], "a:int,b:int")
    # empty, so it is skipped and its different schema does not matter
    yield ArrayDataFrame([], "a:int,b:str")

df = LocalDataFrameIterableDataFrame(get_dfs())
for subdf in df.native:
    subdf.show()
Note
It’s ok to peek the dataframe, it will not affect the iteration, but it’s invalid to count.
schema can be used when the iterable contains no dataframe. But if there is any dataframe, schema must match the schema of the dataframes.
For the iterable of dataframes, if there are any empty dataframes, they will be skipped and their schemas will not matter. However, if all dataframes in the iterable are empty, then the last empty dataframe will be used to set the schema.
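The schema rules in this note can be sketched in plain Python. This is an illustration rather than Fugue's code: each dataframe is modelled as a hypothetical `(schema, rows)` tuple, `sketch_resolve_schema` is an invented name, and preferring an explicitly given schema when every dataframe is empty is an assumption.

```python
from typing import Any, List, Optional, Sequence, Tuple

Frame = Tuple[str, List[List[Any]]]  # hypothetical (schema expression, rows) pair


def sketch_resolve_schema(frames: Sequence[Frame], schema: Optional[str] = None) -> str:
    last_empty: Optional[str] = None
    for frame_schema, rows in frames:
        if rows:
            # The first non-empty dataframe decides; a given schema must match it.
            if schema is not None and schema != frame_schema:
                raise ValueError(f"{schema} does not match {frame_schema}")
            return frame_schema
        # Empty dataframes are skipped, but the last one is remembered.
        last_empty = frame_schema
    if schema is not None:  # assumption: an explicit schema wins if all are empty
        return schema
    if last_empty is not None:
        return last_empty
    raise ValueError("schema is required when the iterable contains no dataframe")
```

With the three dataframes from the example above, the sketch returns `"a:int,b:int"`, because the trailing empty dataframe with `"a:int,b:str"` is skipped.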
- alter_columns(columns)[source]
Change column types
- Parameters:
columns (Any) – Schema like object, all columns should be contained by the dataframe schema
- Returns:
a new dataframe with altered columns, the order of the original schema will not change
- Return type:
- as_array(columns=None, type_safe=False)[source]
Convert to 2-dimensional native python array
- Parameters:
columns (List[str] | None) – columns to extract, defaults to None
type_safe (bool) – whether to ensure output conforms with its schema, defaults to False
- Returns:
2-dimensional native python array
- Return type:
List[Any]
Note
If type_safe is False, then the returned values are ‘raw’ values.
- as_array_iterable(columns=None, type_safe=False)[source]
Convert to iterable of native python arrays
- Parameters:
columns (List[str] | None) – columns to extract, defaults to None
type_safe (bool) – whether to ensure output conforms with its schema, defaults to False
- Returns:
iterable of native python arrays
- Return type:
Iterable[Any]
Note
If type_safe is False, then the returned values are ‘raw’ values.
- as_arrow(type_safe=False)[source]
Convert to a pyarrow.Table
- Parameters:
type_safe (bool)
- Return type:
- as_local_bounded()[source]
Convert this dataframe to a LocalBoundedDataFrame
- Return type:
- property empty: bool
Whether this dataframe is empty
- head(n, columns=None)[source]
Get first n rows of the dataframe as a new local bounded dataframe
- Parameters:
n (int) – number of rows
columns (List[str] | None) – selected columns, defaults to None (all columns)
- Returns:
a local bounded dataframe
- Return type:
- property native: EmptyAwareIterable[LocalDataFrame]
Iterable of dataframes
- peek_array()[source]
Peek the first row of the dataframe as array
- Raises:
FugueDatasetEmptyError – if it is empty
- Return type:
List[Any]
fugue.dataframe.dataframes
- class fugue.dataframe.dataframes.DataFrames(*args, **kwargs)[source]
Bases:
IndexedOrderedDict[str, DataFrame]
Ordered dictionary of DataFrames. There are two modes: with keys and without keys. Without keys, _<n> will be used as the key for each dataframe, and the collection will be treated as an array in the Fugue framework.
It’s a subclass of dict, so it supports all dict operations. It’s also ordered, so you can trust the order of keys and values.
The initialization is flexible
>>> df1 = ArrayDataFrame([[0]],"a:int")
>>> df2 = ArrayDataFrame([[1]],"a:int")
>>> dfs = DataFrames(df1,df2) # init as [df1, df2]
>>> assert not dfs.has_key
>>> assert df1 is dfs[0] and df2 is dfs[1]
>>> dfs_array = list(dfs.values())
>>> dfs = DataFrames(a=df1,b=df2) # init as {a:df1, b:df2}
>>> assert dfs.has_key
>>> assert df1 is dfs[0] and df2 is dfs[1] # order is guaranteed
>>> df3 = ArrayDataFrame([[1]],"b:int")
>>> dfs2 = DataFrames(dfs, c=df3) # {a:df1, b:df2, c:df3}
>>> dfs2 = DataFrames(dfs, df3) # invalid, because dfs has key, df3 doesn't
>>> dfs2 = DataFrames(dict(a=df1,b=df2)) # init as {a:df1, b:df2}
>>> dfs2 = DataFrames([df1,df2],df3) # init as [df1, df2, df3]
- Parameters:
args (Any)
kwargs (Any)
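The two modes can be pictured with an ordinary ordered dictionary. The sketch below is a simplified illustration, not the DataFrames class: `sketch_collect` is a hypothetical helper that only covers the two pure cases (all positional or all keyword) and uses plain strings in place of dataframes.

```python
from collections import OrderedDict
from typing import Any, Tuple


def sketch_collect(*args: Any, **kwargs: Any) -> Tuple["OrderedDict[str, Any]", bool]:
    """Return (ordered mapping, has_key) for positional or keyword items."""
    result: "OrderedDict[str, Any]" = OrderedDict()
    if kwargs:
        # Keyed mode: the caller's names are kept, in the order given.
        result.update(kwargs)
        return result, True
    # Keyless mode: positions become the generated keys _0, _1, ...
    for i, item in enumerate(args):
        result[f"_{i}"] = item
    return result, False


items, has_key = sketch_collect("df1", "df2")
named, named_has_key = sketch_collect(a="df1", b="df2")
```

Here `list(items)` is `["_0", "_1"]` while `list(named)` is `["a", "b"]`; in both cases the values can still be read by position through `list(mapping.values())`.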
- convert(func)[source]
Create another DataFrames with the same structure, but all converted by func
- Returns:
the new DataFrames
- Parameters:
- Return type:
Examples
>>> dfs2 = dfs.convert(lambda df: df.as_local()) # convert all to local
- property has_key
If this collection has key (dict-like) or not (list-like)
fugue.dataframe.function_wrapper
- class fugue.dataframe.function_wrapper.DataFrameFunctionWrapper(func, params_re='.*', return_re='.*')[source]
Bases:
FunctionWrapper
- Parameters:
func (Callable)
params_re (str)
return_re (str)
- property need_output_schema: bool | None
- class fugue.dataframe.function_wrapper.DataFrameParam(param)[source]
Bases:
_DataFrameParamBase
- Parameters:
param (Parameter | None)
- class fugue.dataframe.function_wrapper.DictParam(param)[source]
Bases:
RowParam
- Parameters:
param (Parameter | None)
- class fugue.dataframe.function_wrapper.LocalDataFrameParam(param)[source]
Bases:
DataFrameParam
- Parameters:
param (Parameter | None)
- count(df)[source]
- Parameters:
df (LocalDataFrame)
- Return type:
int
- iterable_to_output_df(dfs, schema, ctx)[source]
- Parameters:
dfs (Iterable[Any])
schema (Any)
ctx (Any)
- Return type:
- to_output_df(output, schema, ctx)[source]
- Parameters:
output (LocalDataFrame)
schema (Any)
ctx (Any)
- Return type:
fugue.dataframe.iterable_dataframe
- class fugue.dataframe.iterable_dataframe.IterableDataFrame(df=None, schema=None)[source]
Bases:
LocalUnboundedDataFrame
DataFrame that wraps native python iterable of arrays. Please read the DataFrame Tutorial to understand the concept
- Parameters:
df (Any) – 2-dimensional array, iterable of arrays, or DataFrame
schema (Any) – Schema like object
Examples
>>> a = IterableDataFrame([[0,'a'],[1,'b']],"a:int,b:str")
>>> b = IterableDataFrame(a)
Note
It’s ok to peek the dataframe, it will not affect the iteration, but counting is an invalid operation
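Why peeking is safe while counting is not can be shown with a small buffered iterator. This is a minimal sketch of the general idea behind an empty-aware iterable, assuming a one-row buffer; `PeekableRows` is a hypothetical class and not the one Fugue uses.

```python
from typing import Any, Iterable, Iterator, List


class PeekableRows:
    """One-pass row stream that can report emptiness and show its first row."""

    def __init__(self, rows: Iterable[List[Any]]):
        self._it = iter(rows)
        self._buffer: List[List[Any]] = []  # holds at most one prefetched row

    @property
    def empty(self) -> bool:
        if not self._buffer:
            try:
                # Prefetch one row and keep it, so iteration still sees it later.
                self._buffer.append(next(self._it))
            except StopIteration:
                return True
        return False

    def peek(self) -> List[Any]:
        if self.empty:
            raise ValueError("no rows to peek")
        return self._buffer[0]

    def __iter__(self) -> Iterator[List[Any]]:
        # Hand out the prefetched row first, then the rest of the stream.
        while self._buffer:
            yield self._buffer.pop(0)
        yield from self._it
```

Peeking only fills the buffer, so a later loop still receives every row. Counting, by contrast, would have to consume the whole stream, after which nothing is left to iterate.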
- alter_columns(columns)[source]
Change column types
- Parameters:
columns (Any) – Schema like object, all columns should be contained by the dataframe schema
- Returns:
a new dataframe with altered columns, the order of the original schema will not change
- Return type:
- as_array(columns=None, type_safe=False)[source]
Convert to 2-dimensional native python array
- Parameters:
columns (List[str] | None) – columns to extract, defaults to None
type_safe (bool) – whether to ensure output conforms with its schema, defaults to False
- Returns:
2-dimensional native python array
- Return type:
List[Any]
Note
If type_safe is False, then the returned values are ‘raw’ values.
- as_array_iterable(columns=None, type_safe=False)[source]
Convert to iterable of native python arrays
- Parameters:
columns (List[str] | None) – columns to extract, defaults to None
type_safe (bool) – whether to ensure output conforms with its schema, defaults to False
- Returns:
iterable of native python arrays
- Return type:
Iterable[Any]
Note
If type_safe is False, then the returned values are ‘raw’ values.
- as_dicts(columns=None)[source]
Convert to a list of python dicts
- Parameters:
columns (List[str] | None) – columns to extract, defaults to None
- Returns:
a list of python dicts
- Return type:
List[Dict[str, Any]]
Note
The default implementation enforces type_safe=True
- as_local_bounded()[source]
Convert this dataframe to a LocalBoundedDataFrame
- Return type:
- property empty: bool
Whether this dataframe is empty
- head(n, columns=None)[source]
Get first n rows of the dataframe as a new local bounded dataframe
- Parameters:
n (int) – number of rows
columns (List[str] | None) – selected columns, defaults to None (all columns)
- Returns:
a local bounded dataframe
- Return type:
- property native: EmptyAwareIterable[Any]
Iterable of native python arrays
- peek_array()[source]
Peek the first row of the dataframe as array
- Raises:
FugueDatasetEmptyError – if it is empty
- Return type:
List[Any]
fugue.dataframe.pandas_dataframe
- class fugue.dataframe.pandas_dataframe.PandasDataFrame(df=None, schema=None, pandas_df_wrapper=False)[source]
Bases:
LocalBoundedDataFrame
DataFrame that wraps pandas DataFrame. Please also read the DataFrame Tutorial to understand this Fugue concept
- Parameters:
df (Any) – 2-dimensional array, iterable of arrays or pandas DataFrame
schema (Any) – Schema like object
pandas_df_wrapper (bool) – if this is a simple wrapper, default False
Examples
>>> PandasDataFrame([[0,'a'],[1,'b']],"a:int,b:str")
>>> PandasDataFrame(schema = "a:int,b:int") # empty dataframe
>>> PandasDataFrame(pd.DataFrame([[0]],columns=["a"]))
>>> PandasDataFrame(ArrayDataFrame([[0]],"a:int").as_pandas())
Note
If pandas_df_wrapper is True, then the constructor will not do any type check; otherwise, it will enforce types according to the input schema after the construction
- alter_columns(columns)[source]
Change column types
- Parameters:
columns (Any) – Schema like object, all columns should be contained by the dataframe schema
- Returns:
a new dataframe with altered columns, the order of the original schema will not change
- Return type:
- as_array(columns=None, type_safe=False)[source]
Convert to 2-dimensional native python array
- Parameters:
columns (List[str] | None) – columns to extract, defaults to None
type_safe (bool) – whether to ensure output conforms with its schema, defaults to False
- Returns:
2-dimensional native python array
- Return type:
List[Any]
Note
If type_safe is False, then the returned values are ‘raw’ values.
- as_array_iterable(columns=None, type_safe=False)[source]
Convert to iterable of native python arrays
- Parameters:
columns (List[str] | None) – columns to extract, defaults to None
type_safe (bool) – whether to ensure output conforms with its schema, defaults to False
- Returns:
iterable of native python arrays
- Return type:
Iterable[Any]
Note
If type_safe is False, then the returned values are ‘raw’ values.
- as_arrow(type_safe=False)[source]
Convert to a pyarrow.Table
- Parameters:
type_safe (bool)
- Return type:
- as_dict_iterable(columns=None)[source]
Convert to iterable of python dicts
- Parameters:
columns (List[str] | None) – columns to extract, defaults to None
- Returns:
iterable of python dicts
- Return type:
Iterable[Dict[str, Any]]
Note
The default implementation enforces type_safe=True
- as_dicts(columns=None)[source]
Convert to a list of python dicts
- Parameters:
columns (List[str] | None) – columns to extract, defaults to None
- Returns:
a list of python dicts
- Return type:
List[Dict[str, Any]]
Note
The default implementation enforces type_safe=True
- property empty: bool
Whether this dataframe is empty
- head(n, columns=None)[source]
Get first n rows of the dataframe as a new local bounded dataframe
- Parameters:
n (int) – number of rows
columns (List[str] | None) – selected columns, defaults to None (all columns)
- Returns:
a local bounded dataframe
- Return type:
- native_as_df()[source]
The dataframe form of the native object this Dataset class wraps. Dataframe form means the object contains schema information. For example, the native object of an ArrayDataFrame is a python array, which doesn’t contain schema information, so its native_as_df should be either a pandas dataframe or an arrow dataframe.
- Return type:
- peek_array()[source]
Peek the first row of the dataframe as array
- Raises:
FugueDatasetEmptyError – if it is empty
- Return type:
List[Any]
fugue.dataframe.utils
- fugue.dataframe.utils.deserialize_df(data, fs=None)[source]
Deserialize json string to LocalBoundedDataFrame
- Parameters:
json_str – json string containing the base64 data or a file path
fs (AbstractFileSystem | None) – the file system to use, defaults to None
data (bytes | None)
- Raises:
ValueError – if the json string is invalid, i.e. not generated from serialize_df()
- Returns:
LocalBoundedDataFrame if json_str contains a dataframe, or None if it’s valid but contains no data
- Return type:
LocalBoundedDataFrame | None
- fugue.dataframe.utils.get_join_schemas(df1, df2, how, on)[source]
Get Schema object after joining df1 and df2. If on is not empty, it’s mainly for validation purposes.
- Parameters:
- Returns:
the pair of the key schema and the schema after join
- Return type:
Note
In Fugue, the joined schema can always be inferred because it always uses the input dataframes’ common keys as the join keys. So you must make sure to rename() the input dataframes so they follow this rule.
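The common-key rule can be illustrated with schemas written as lists of `(name, type)` pairs. This is a rough sketch for an inner join only, not Fugue's implementation: `sketch_join_schemas` is a hypothetical function, and both the column order of the result and the way `on` is validated are assumptions.

```python
from typing import Iterable, List, Tuple

Columns = List[Tuple[str, str]]  # hypothetical schema form: (name, type) pairs


def sketch_join_schemas(
    schema1: Columns, schema2: Columns, on: Iterable[str] = ()
) -> Tuple[Columns, Columns]:
    names2 = {name for name, _ in schema2}
    # Join keys are the columns the two inputs have in common.
    keys = [(name, tp) for name, tp in schema1 if name in names2]
    key_names = {name for name, _ in keys}
    on = list(on)
    # If `on` is given, it only validates what was inferred.
    if on and set(on) != key_names:
        raise ValueError(f"{on} does not match the common keys {sorted(key_names)}")
    # Assumed layout: all left columns, then the right columns that are not keys.
    joined = list(schema1) + [(n, t) for n, t in schema2 if n not in key_names]
    return keys, joined


keys, joined = sketch_join_schemas(
    [("a", "int"), ("b", "str")], [("a", "int"), ("c", "double")]
)
```

In this example `keys` is `[("a", "int")]` and `joined` has the columns a, b and c. If the two inputs used different names for the same key, no common column would be found, which is why renaming beforehand matters.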
- fugue.dataframe.utils.pa_table_as_array(df, columns=None)[source]
Convert a pyarrow table to a list of list
- Parameters:
df (Table) – pyarrow table
columns (List[str] | None) – if not None, only these columns will be returned, defaults to None
- Returns:
a list of list
- Return type:
List[List[List[Any]]]
- fugue.dataframe.utils.pa_table_as_array_iterable(df, columns=None)[source]
Convert a pyarrow table to an iterable of list
- Parameters:
df (Table) – pyarrow table
columns (List[str] | None) – if not None, only these columns will be returned, defaults to None
- Returns:
an iterable of list
- Return type:
Iterable[List[List[Any]]]
- fugue.dataframe.utils.pa_table_as_dict_iterable(df, columns=None)[source]
Convert a pyarrow table to an iterable of dict
- Parameters:
df (Table) – pyarrow table
columns (List[str] | None) – if not None, only these columns will be returned, defaults to None
- Returns:
an iterable of dict
- Return type:
Iterable[Dict[str, Any]]
- fugue.dataframe.utils.pa_table_as_dicts(df, columns=None)[source]
Convert a pyarrow table to a list of dict
- Parameters:
df (Table) – pyarrow table
columns (List[str] | None) – if not None, only these columns will be returned, defaults to None
- Returns:
a list of dict
- Return type:
List[Dict[str, Any]]
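The four helpers above all turn column-oriented data into row-oriented data. A dependency-free sketch of that reshaping is shown here, with a plain dict of column lists standing in for a pyarrow table; `columns_to_rows` and `columns_to_dicts` are hypothetical names and do not call pyarrow.

```python
from typing import Any, Dict, Iterator, List, Optional

ColumnData = Dict[str, List[Any]]  # stand-in for a table: column name -> values


def columns_to_rows(
    table: ColumnData, columns: Optional[List[str]] = None
) -> Iterator[List[Any]]:
    # Keep only the requested columns, or all of them when columns is None.
    names = list(table) if columns is None else columns
    # zip(*columns) walks the columns in parallel, producing one row at a time.
    for values in zip(*(table[name] for name in names)):
        yield list(values)


def columns_to_dicts(
    table: ColumnData, columns: Optional[List[str]] = None
) -> Iterator[Dict[str, Any]]:
    names = list(table) if columns is None else columns
    for row in columns_to_rows(table, names):
        yield dict(zip(names, row))


table = {"a": [0, 1], "b": ["x", "y"]}
rows = list(columns_to_rows(table))
```

Here `rows` is `[[0, "x"], [1, "y"]]`. Wrapping the generator in `list` gives the list form, and leaving it unwrapped gives the iterable form.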
- fugue.dataframe.utils.serialize_df(df, threshold=-1, file_path=None)[source]
Serialize input dataframe to base64 string or to file if it’s larger than threshold
- Parameters:
df (DataFrame | None) – input DataFrame
threshold (int) – file byte size threshold, defaults to -1
file_path (str | None) – file path to store the data (used only if the serialized data is larger than threshold), defaults to None
- Raises:
InvalidOperationError – if file is large but file_path is not provided
- Returns:
a pickled blob either containing the data or the file path
- Return type:
bytes | None
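The threshold behaviour can be sketched with the standard pickle module. This is a guess at the general shape and not Fugue's code: `sketch_serialize` and `sketch_deserialize` are hypothetical, the tagged-tuple blob layout is invented for the illustration, and treating a negative threshold as "always inline" is an assumption.

```python
import pickle
from typing import Any, Optional


def sketch_serialize(obj: Any, threshold: int = -1, file_path: Optional[str] = None) -> bytes:
    payload = pickle.dumps(obj)
    # Assumption: a negative threshold means the data always travels inline.
    if threshold < 0 or len(payload) <= threshold:
        return pickle.dumps(("inline", payload))
    if file_path is None:
        raise ValueError("data exceeds threshold but file_path is not provided")
    # Large payloads go to the file; the blob then only carries the path.
    with open(file_path, "wb") as f:
        f.write(payload)
    return pickle.dumps(("file", file_path))


def sketch_deserialize(blob: bytes) -> Any:
    kind, value = pickle.loads(blob)
    if kind == "inline":
        return pickle.loads(value)
    with open(value, "rb") as f:
        return pickle.loads(f.read())
```

Either way the caller receives a small bytes object, and the reader decides from its content whether to unpack the data directly or load it from the file.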