fugue.column
fugue.column.expressions
- class fugue.column.expressions.ColumnExpr[source]
Bases:
object
Fugue column expression class. It is inspired from
pyspark.sql.Column
and it is working in progress.New Since
0.6.0
Caution
This is a base class of different column classes, and users are not supposed to construct this class directly. Use
col()
andlit()
instead.- alias(as_name)[source]
Assign or remove alias of a column. To remove, set
as_name
to empty- Returns:
a new column with the alias value
- Parameters:
as_name (str) –
- Return type:
Examples
assert "b" == col("a").alias("b").as_name assert "x" == (col("a") * 2).alias("x").as_name assert "" == col("a").alias("b").alias("").as_name
- property as_name: str
The name assigned by
alias()
- Returns:
the alias
Examples
assert "" == col("a").as_name assert "b" == col("a").alias("b").as_name assert "x" == (col("a") * 2).alias("x").as_name
- property as_type: DataType | None
The type assigned by
cast()
- Returns:
the pyarrow datatype if
cast()
was called otherwise None
Examples
import pyarrow as pa assert col("a").as_type is None assert pa.int64() == col("a").cast(int).as_type assert pa.string() == (col("a") * 2).cast(str).as_type
- property body_str: str
The string expression of this column without cast type and alias. This is only used for debug purpose. It is not SQL expression.
- Returns:
the string expression
- cast(data_type)[source]
Cast the column to a new data type
- Parameters:
data_type (Any) – It can be string expressions, python primitive types, python datetime.datetime and pyarrow types. For details read Fugue Data Types
- Returns:
a new column instance with the assigned data type
- Return type:
Caution
Currently, casting to struct or list type has undefined behavior.
Examples
import pyarrow as pa assert pa.int64() == col("a").cast(int).as_type assert pa.string() == col("a").cast(str).as_type assert pa.float64() == col("a").cast(float).as_type assert pa._bool() == col("a").cast(bool).as_type # string follows the type expression of Triad Schema assert pa.int32() == col("a").cast("int").as_type assert pa.int32() == col("a").cast("int32").as_type assert pa.int32() == col("a").cast(pa.int32()).as_type
- infer_alias()[source]
Infer alias of a column. If the column’s
output_name()
is not empty then it returns itself without change. Otherwise it tries to infer alias from the underlying columns.- Returns:
a column instance with inferred alias
- Return type:
Caution
Users should not use it directly.
Examples
import fugue.column.functions as f assert "a" == col("a").infer_alias().output_name assert "" == (col("a") * 2).infer_alias().output_name assert "a" == col("a").is_null().infer_alias().output_name assert "a" == f.max(col("a").is_null()).infer_alias().output_name
- infer_type(schema)[source]
Infer data type of this column given the input schema
- Parameters:
schema (Schema) – the schema instance to infer from
- Returns:
a pyarrow datatype or None if failed to infer
- Return type:
DataType | None
Caution
Users should not use it directly.
Examples
import pyarrow as pa from triad import Schema import fugue.column.functions as f schema = Schema("a:int,b:str") assert pa.int32() == col("a").infer_schema(schema) assert pa.int32() == (-col("a")).infer_schema(schema) # due to overflow risk, can't infer certain operations assert (col("a")+1).infer_schema(schema) is None assert (col("a")+col("a")).infer_schema(schema) is None assert pa.int32() == f.max(col("a")).infer_schema(schema) assert pa.int32() == f.min(col("a")).infer_schema(schema) assert f.sum(col("a")).infer_schema(schema) is None
- is_null()[source]
Same as SQL
<col> IS NULL
.- Returns:
a new column with the boolean values
- Return type:
- property name: str
The original name of this column, default empty
- Returns:
the name
Examples
assert "a" == col("a").name assert "b" == col("a").alias("b").name assert "" == lit(1).name assert "" == (col("a") * 2).name
- fugue.column.expressions.col(obj, alias='')[source]
Convert the
obj
to aColumnExpr
object- Parameters:
obj (str | ColumnExpr) – a string representing a column name or a
ColumnExpr
objectalias (str) – the alias of this column, defaults to “” (no alias)
- Returns:
a literal column expression
- Return type:
New Since
0.6.0
Examples
import fugue.column import col import fugue.column.functions as f col("a") col("a").alias("x") col("a", "x") # unary operations -col("a") # negative value of a ~col("a") # NOT a col("a").is_null() # a IS NULL col("a").not_null() # a IS NOT NULL # binary operations col("a") + 1 # col("a") + lit(1) 1 - col("a") # lit(1) - col("a") col("a") * col("b") col("a") / col("b") # binary boolean expressions col("a") == 1 # col("a") == lit(1) 2 != col("a") # col("a") != lit(2) col("a") < 5 col("a") > 5 col("a") <= 5 col("a") >= 5 (col("a") < col("b")) & (col("b") > 1) | col("c").is_null() # with functions f.max(col("a")) f.max(col("a")+col("b")) f.max(col("a")) + f.min(col("b")) f.count_distinct(col("a")).alias("dcount")
- fugue.column.expressions.function(name, *args, arg_distinct=False, **kwargs)[source]
Construct a function expression
- Parameters:
name (str) – the name of the function
arg_distinct (bool) – whether to add
DISTINCT
before all arguments, defaults to Falseargs (Any) –
- Returns:
the function expression
- Return type:
Caution
Users should not use this directly
- fugue.column.expressions.lit(obj, alias='')[source]
Convert the
obj
to a literal column. Currentlyobj
must beint
,bool
,float
orstr
, otherwise an exception will be raised- Parameters:
obj (Any) – an arbitrary value
alias (str) – the alias of this literal column, defaults to “” (no alias)
- Returns:
a literal column expression
- Return type:
New Since
0.6.0
Examples
import fugue.column import lit lit("abc") lit(100).alias("x") lit(100, "x")
fugue.column.functions
- fugue.column.functions.avg(col)[source]
SQL
AVG
function (aggregation)- Parameters:
col (ColumnExpr) – the column to find average
- Return type:
Note
this function cannot infer type from
col
typethis function can infer alias from
col
’s inferred alias
New Since
0.6.0
Examples
import fugue.column.functions as f f.avg(col("a")) # AVG(a) AS a # you can specify explicitly # CAST(AVG(a) AS double) AS a f.avg(col("a")).cast(float)
- fugue.column.functions.coalesce(*args)[source]
SQL
COALESCE
function- Parameters:
args (Any) – If a value is not
ColumnExpr
then it’s converted to a literal column bycol()
- Return type:
Note
this function can infer neither type nor alias
New Since
0.6.0
Examples
import fugue.column.functions as f f.coalesce(col("a"), col("b")+col("c"), 1)
- fugue.column.functions.count(col)[source]
SQL
COUNT
function (aggregation)- Parameters:
col (ColumnExpr) – the column to find count
- Return type:
Note
this function cannot infer type from
col
typethis function can infer alias from
col
’s inferred alias
New Since
0.6.0
Examples
import fugue.column.functions as f f.count(all_cols()) # COUNT(*) f.count(col("a")) # COUNT(a) AS a # you can specify explicitly # CAST(COUNT(a) AS double) AS a f.count(col("a")).cast(float)
- fugue.column.functions.count_distinct(col)[source]
SQL
COUNT DISTINCT
function (aggregation)- Parameters:
col (ColumnExpr) – the column to find distinct element count
- Return type:
Note
this function cannot infer type from
col
typethis function can infer alias from
col
’s inferred alias
New Since
0.6.0
Examples
import fugue.column.functions as f f.count_distinct(all_cols()) # COUNT(DISTINCT *) f.count_distinct(col("a")) # COUNT(DISTINCT a) AS a # you can specify explicitly # CAST(COUNT(DISTINCT a) AS double) AS a f.count_distinct(col("a")).cast(float)
- fugue.column.functions.first(col)[source]
SQL
FIRST
function (aggregation)- Parameters:
col (ColumnExpr) – the column to find first
- Return type:
Note
this function can infer type from
col
typethis function can infer alias from
col
’s inferred alias
New Since
0.6.0
Examples
import fugue.column.functions as f # assume col a has type double f.first(col("a")) # CAST(FIRST(a) AS double) AS a f.first(-col("a")) # CAST(FIRST(-a) AS double) AS a # neither type nor alias can be inferred in the following cases f.first(col("a")+1) f.first(col("a")+col("b")) # you can specify explicitly # CAST(FIRST(a+b) AS int) AS x f.first(col("a")+col("b")).cast(int).alias("x")
- fugue.column.functions.is_agg(column)[source]
Check if a column contains aggregation operation
- Parameters:
col – the column to check
column (Any) –
- Returns:
whether the column is
ColumnExpr
and contains aggregation operations- Return type:
bool
New Since
0.6.0
Examples
import fugue.column.functions as f assert not f.is_agg(1) assert not f.is_agg(col("a")) assert not f.is_agg(col("a")+lit(1)) assert f.is_agg(f.max(col("a"))) assert f.is_agg(-f.max(col("a"))) assert f.is_agg(f.max(col("a")+1)) assert f.is_agg(f.max(col("a"))+f.min(col("a"))))
- fugue.column.functions.last(col)[source]
SQL
LAST
function (aggregation)- Parameters:
col (ColumnExpr) – the column to find last
- Return type:
Note
this function can infer type from
col
typethis function can infer alias from
col
’s inferred alias
New Since
0.6.0
Examples
import fugue.column.functions as f # assume col a has type double f.last(col("a")) # CAST(LAST(a) AS double) AS a f.last(-col("a")) # CAST(LAST(-a) AS double) AS a # neither type nor alias can be inferred in the following cases f.last(col("a")+1) f.last(col("a")+col("b")) # you can specify explicitly # CAST(LAST(a+b) AS int) AS x f.last(col("a")+col("b")).cast(int).alias("x")
- fugue.column.functions.max(col)[source]
SQL
MAX
function (aggregation)- Parameters:
col (ColumnExpr) – the column to find max
- Return type:
Note
this function can infer type from
col
typethis function can infer alias from
col
’s inferred alias
New Since
0.6.0
Examples
import fugue.column.functions as f # assume col a has type double f.max(col("a")) # CAST(MAX(a) AS double) AS a f.max(-col("a")) # CAST(MAX(-a) AS double) AS a # neither type nor alias can be inferred in the following cases f.max(col("a")+1) f.max(col("a")+col("b")) # you can specify explicitly # CAST(MAX(a+b) AS int) AS x f.max(col("a")+col("b")).cast(int).alias("x")
- fugue.column.functions.min(col)[source]
SQL
MIN
function (aggregation)- Parameters:
col (ColumnExpr) – the column to find min
- Return type:
Note
this function can infer type from
col
typethis function can infer alias from
col
’s inferred alias
New Since
0.6.0
Examples
import fugue.column.functions as f # assume col a has type double f.min(col("a")) # CAST(MIN(a) AS double) AS a f.min(-col("a")) # CAST(MIN(-a) AS double) AS a # neither type nor alias can be inferred in the following cases f.min(col("a")+1) f.min(col("a")+col("b")) # you can specify explicitly # CAST(MIN(a+b) AS int) AS x f.min(col("a")+col("b")).cast(int).alias("x")
- fugue.column.functions.sum(col)[source]
SQL
SUM
function (aggregation)- Parameters:
col (ColumnExpr) – the column to find sum
- Return type:
Note
this function cannot infer type from
col
typethis function can infer alias from
col
’s inferred alias
New Since
0.6.0
Examples
import fugue.column.functions as f f.sum(col("a")) # SUM(a) AS a # you can specify explicitly # CAST(SUM(a) AS double) AS a f.sum(col("a")).cast(float)
fugue.column.sql
- class fugue.column.sql.SQLExpressionGenerator(enable_cast=True)[source]
Bases:
object
SQL generator for
SelectColumns
- Parameters:
enable_cast (bool) – whether convert
cast
into the statement, defaults to True
New Since
0.6.0
- add_func_handler(name, handler)[source]
Add special function handler.
- Parameters:
name (str) – name of the function
handler (Callable[[_FuncExpr], Iterable[str]]) – the function to convert the function expression to SQL clause
- Returns:
the instance itself
- Return type:
Caution
Users should not use this directly
- correct_select_schema(input_schema, select, output_schema)[source]
Do partial schema inference from
input_schema
andselect
columns, then compare with the SQL output dataframe schema, and return the different part as a new schema, or None if there is no difference- Parameters:
input_schema (Schema) – input dataframe schema for the select statement
select (SelectColumns) – the collection of select columns
output_schema (Schema) – schema of the output dataframe after executing the SQL
- Returns:
the difference as a new schema or None if no difference
- Return type:
Schema | None
Tip
This is particularly useful when the SQL engine messed up the schema of the output. For example,
SELECT *
should return a dataframe with the same schema of the input. However, for example a columna:int
could becomea:long
in the output dataframe because of information loss. This function is designed to make corrections on column types when they can be inferred. This may not be perfect but it can solve major discrepancies.
- generate(expr)[source]
Convert
ColumnExpr
to SQL clause- Parameters:
expr (ColumnExpr) – the column expression to convert
- Returns:
the SQL clause for this expression
- Return type:
str
- select(columns, table, where=None, having=None)[source]
Construct the full
SELECT
statement on a single table- Parameters:
columns (SelectColumns) – columns to select, it may contain aggregations, if so, the group keys are inferred. See
group_keys()
table (str) – table name to select from
where (ColumnExpr | None) –
WHERE
condition, defaults to Nonehaving (ColumnExpr | None) –
HAVING
condition, defaults to None. It is used only when there is aggregation
- Returns:
the full
SELECT
statement- Return type:
Iterable[Tuple[bool, str]]
- where(condition, table)[source]
Generate a
SELECT *
statement with the given where clause- Parameters:
condition (ColumnExpr) – column expression for
WHERE
table (str) – table name for
FROM
- Returns:
the SQL statement
- Raises:
ValueError – if
condition
contains aggregation- Return type:
Iterable[Tuple[bool, str]]
Examples
gen = SQLExpressionGenerator(enable_cast=False) # SELECT * FROM tb WHERE a>1 AND b IS NULL gen.where((col("a")>1) & col("b").is_null(), "tb")
- class fugue.column.sql.SelectColumns(*cols, arg_distinct=False)[source]
Bases:
object
SQL
SELECT
columns collection.- Parameters:
cols (ColumnExpr) – collection of
ColumnExpr
arg_distinct (bool) – whether this is
SELECT DISTINCT
, defaults to False
New Since
0.6.0
- property agg_funcs: List[ColumnExpr]
All columns with aggregation operations
- property all_cols: List[ColumnExpr]
All columns (with inferred aliases)
- assert_all_with_names()[source]
Assert every column have explicit alias or the alias can be inferred (non empty value). It will also validate there is no duplicated aliases
- Raises:
ValueError – if there are columns without alias, or there are duplicated names.
- Returns:
the instance itself
- Return type:
- assert_no_agg()[source]
Assert there is no aggregation operation on any column.
- Raises:
AssertionError – if there is any aggregation in the collection.
- Returns:
the instance itself
- Return type:
See also
Go to
is_agg()
to see how the aggregations are detected.
- assert_no_wildcard()[source]
Assert there is no
*
on first level columns- Raises:
AssertionError – if
all_cols()
exists- Returns:
the instance itself
- Return type:
- property group_keys: List[ColumnExpr]
Group keys inferred from the columns.
Note
if there is no aggregation, the result will be empty
it is
simple_cols()
plusnon_agg_funcs()
- property has_agg: bool
Whether this select is an aggregation
- property has_literals: bool
Whether this select contains literal columns
- property is_distinct: bool
Whether this is a
SELECT DISTINCT
- property literals: List[ColumnExpr]
All literal columns
- property non_agg_funcs: List[ColumnExpr]
All columns with non-aggregation operations
- replace_wildcard(schema)[source]
Replace wildcard
*
with explicit column names- Parameters:
schema (Schema) – the schema used to parse the wildcard
- Returns:
a new instance containing only explicit columns
- Return type:
Note
It only replaces the top level
*
. For examplecount_distinct(all_cols())
will not be transformed because this*
is not first level.
- property simple: bool
Whether this select contains only simple column representations
- property simple_cols: List[ColumnExpr]
All columns directly representing column names