Package pandas (2.29.0)

API documentation for pandas package.

Classes

ArrowDtype

An ExtensionDtype for PyArrow data types.
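
A minimal sketch of constructing one from a PyArrow type, assuming the pandas-style constructor that wraps a pyarrow.DataType:

>>> import pandas as pd
>>> import pyarrow as pa

>>> pd.ArrowDtype(pa.int64())
int64[pyarrow]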

BooleanDtype

Extension dtype for boolean data.

Examples:

>>> pd.BooleanDtype()
BooleanDtype

DataFrame

Two-dimensional, size-mutable, potentially heterogeneous tabular data.

Data structure also contains labeled axes (rows and columns). Arithmetic operations align on both row and column labels. Can be thought of as a dict-like container for Series objects. The primary pandas data structure.
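
For illustration, a minimal sketch of building a small DataFrame from a dict of columns (the column names and values are arbitrary):

>>> import bigframes.pandas as bpd
>>> bpd.options.display.progress_bar = None

>>> df = bpd.DataFrame({'col1': [1, 2], 'col2': [3, 4]})
>>> df
   col1  col2
0     1     3
1     2     4
<BLANKLINE>
[2 rows x 2 columns]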

DatetimeIndex

Immutable sequence used for indexing and alignment with datetime-like values.

Float64Dtype

An ExtensionDtype for float64 data.

This dtype uses pd.NA as its missing value indicator.

For Float32Dtype:

>>> ser = pd.Series([2.25, pd.NA], dtype=pd.Float32Dtype())
>>> ser.dtype
Float32Dtype()

For Float64Dtype:

>>> ser = pd.Series([2.25, pd.NA], dtype=pd.Float64Dtype())
>>> ser.dtype
Float64Dtype()

Index

Immutable sequence used for indexing and alignment.

The basic object storing axis labels for all objects.
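
A minimal sketch of constructing an Index from a list (the values and name are arbitrary):

>>> import bigframes.pandas as bpd
>>> idx = bpd.Index([1, 2, 3], name="my_index")  # doctest: +SKIP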

Int64Dtype

An ExtensionDtype for int64 integer data.

Uses pandas.NA as its missing value, rather than numpy.nan.

For Int8Dtype:

>>> ser = pd.Series([2, pd.NA], dtype=pd.Int8Dtype())
>>> ser.dtype
Int8Dtype()

For Int16Dtype:

>>> ser = pd.Series([2, pd.NA], dtype=pd.Int16Dtype())
>>> ser.dtype
Int16Dtype()

For Int32Dtype:

>>> ser = pd.Series([2, pd.NA], dtype=pd.Int32Dtype())
>>> ser.dtype
Int32Dtype()

For Int64Dtype:

>>> ser = pd.Series([2, pd.NA], dtype=pd.Int64Dtype())
>>> ser.dtype
Int64Dtype()

For UInt8Dtype:

>>> ser = pd.Series([2, pd.NA], dtype=pd.UInt8Dtype())
>>> ser.dtype
UInt8Dtype()

For UInt16Dtype:

>>> ser = pd.Series([2, pd.NA], dtype=pd.UInt16Dtype())
>>> ser.dtype
UInt16Dtype()

For UInt32Dtype:

>>> ser = pd.Series([2, pd.NA], dtype=pd.UInt32Dtype())
>>> ser.dtype
UInt32Dtype()

For UInt64Dtype:

>>> ser = pd.Series([2, pd.NA], dtype=pd.UInt64Dtype())
>>> ser.dtype
UInt64Dtype()

MultiIndex

A multi-level, or hierarchical, index object for pandas objects.
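
A minimal sketch, assuming the pandas-style from_tuples constructor (the tuples and level names are arbitrary):

>>> import bigframes.pandas as bpd
>>> mi = bpd.MultiIndex.from_tuples(  # doctest: +SKIP
...     [(1, 'red'), (1, 'blue'), (2, 'red')],
...     names=['number', 'color'],
... )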

NamedAgg

API documentation for NamedAgg class.
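
A minimal sketch, assuming pandas-style named aggregation in groupby().agg() (the frame and column names are placeholders):

>>> import bigframes.pandas as bpd
>>> df = bpd.DataFrame({'kind': ['cat', 'dog', 'cat'], 'height': [9.1, 6.0, 9.5]})
>>> df.groupby('kind').agg(  # doctest: +SKIP
...     max_height=bpd.NamedAgg(column='height', aggfunc='max'),
... )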

Series

API documentation for Series class.

StringDtype

Extension dtype for string data.
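
A minimal sketch of requesting the string dtype for a Series (the values are arbitrary):

>>> import bigframes.pandas as bpd
>>> ser = bpd.Series(['a', 'b'], dtype='string')  # doctest: +SKIP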

option_context

Context manager to temporarily set thread-local options in the with statement context.

You need to invoke as option_context(pat, val, [(pat, val), ...]).

>>> import bigframes

>>> with bigframes.option_context('display.max_rows', 10, 'display.max_columns', 5):
...     pass

Functions

clean_up_by_session_id

clean_up_by_session_id(
    session_id: str,
    location: typing.Optional[str] = None,
    project: typing.Optional[str] = None,
) -> None

Searches through BigQuery tables and routines and deletes the ones created during the session with the given session id. The match is determined by having the session id present in the resource name or metadata. Any Cloud Functions serving the deleted routines are also cleaned up.

This could be useful if the session object has been lost. Calling session.close() or <xref uid="bigframes.pandas.close_session">bigframes.pandas.close_session</xref>() is preferred in most cases.
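
A minimal usage sketch; the session id is whatever id was recorded for the lost session, shown here as obtained from get_default_session_id():

>>> import bigframes.pandas as bpd
>>> session_id = bpd.get_default_session_id()  # doctest: +SKIP
>>> bpd.clean_up_by_session_id(session_id)  # doctest: +SKIP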

close_session

close_session()

Start a fresh session the next time a function requires a session.

Closes the current session if it was already started, deleting any temporary tables that were created.
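
A minimal usage sketch:

>>> import bigframes.pandas as bpd
>>> bpd.close_session()  # doctest: +SKIP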

concat

concat(
    objs: typing.Iterable[
        typing.Union[bigframes.dataframe.DataFrame, bigframes.series.Series]
    ],
    *,
    axis: typing.Union[str, int] = 0,
    join: typing.Literal["inner", "outer"] = "outer",
    ignore_index: bool = False
) -> typing.Union[bigframes.dataframe.DataFrame, bigframes.series.Series]

Concatenate BigQuery DataFrames objects along a particular axis.

Allows optional set logic along the other axes.

Can also add a layer of hierarchical indexing on the concatenation axis, which may be useful if the labels are the same (or overlapping) on the passed axis number.

Examples:

>>> import bigframes.pandas as pd
>>> pd.options.display.progress_bar = None

Combine two Series.

>>> s1 = pd.Series(['a', 'b'])
>>> s2 = pd.Series(['c', 'd'])
>>> pd.concat([s1, s2])
0    a
1    b
0    c
1    d
dtype: string

Clear the existing index and reset it in the result by setting the ignore_index option to True.

>>> pd.concat([s1, s2], ignore_index=True)
0    a
1    b
2    c
3    d
dtype: string

Combine two DataFrame objects with identical columns.

>>> df1 = pd.DataFrame([['a', 1], ['b', 2]],
...                    columns=['letter', 'number'])
>>> df1
  letter  number
0      a       1
1      b       2
<BLANKLINE>
[2 rows x 2 columns]
>>> df2 = pd.DataFrame([['c', 3], ['d', 4]],
...                    columns=['letter', 'number'])
>>> df2
  letter  number
0      c       3
1      d       4
<BLANKLINE>
[2 rows x 2 columns]
>>> pd.concat([df1, df2])
  letter  number
0      a       1
1      b       2
0      c       3
1      d       4
<BLANKLINE>
[4 rows x 2 columns]

Combine DataFrame objects with overlapping columns and return everything. Columns outside the intersection will be filled with NaN values.

>>> df3 = pd.DataFrame([['c', 3, 'cat'], ['d', 4, 'dog']],
...                    columns=['letter', 'number', 'animal'])
>>> df3
  letter  number animal
0      c       3    cat
1      d       4    dog
<BLANKLINE>
[2 rows x 3 columns]
>>> pd.concat([df1, df3])
  letter  number animal
0      a       1   <NA>
1      b       2   <NA>
0      c       3    cat
1      d       4    dog
<BLANKLINE>
[4 rows x 3 columns]

Combine DataFrame objects with overlapping columns and return only those that are shared by passing inner to the join keyword argument.

>>> pd.concat([df1, df3], join="inner")
  letter  number
0      a       1
1      b       2
0      c       3
1      d       4
<BLANKLINE>
[4 rows x 2 columns]

cut

cut(
    x,
    bins: typing.Union[
        int, pandas.core.indexes.interval.IntervalIndex, typing.Iterable
    ],
    *,
    right: typing.Optional[bool] = True,
    labels: typing.Optional[typing.Union[typing.Iterable[str], bool]] = None,
    session: typing.Optional[bigframes.session.Session] = None
) -> bigframes.series.Series

Bin values into discrete intervals.

Use cut when you need to segment and sort data values into bins. This function is also useful for going from a continuous variable to a categorical variable. For example, cut could convert ages to groups of age ranges. Supports binning into an equal number of bins, or a pre-specified array of bins.

Examples:

>>> import bigframes.pandas as bpd
>>> s = bpd.Series([0, 1, 5, 10])
>>> s
0     0
1     1
2     5
3    10
dtype: Int64

Cut with an integer (equal-width bins):

>>> bpd.cut(s, bins=4)
0    {'left_exclusive': -0.01, 'right_inclusive': 2.5}
1    {'left_exclusive': -0.01, 'right_inclusive': 2.5}
2      {'left_exclusive': 2.5, 'right_inclusive': 5.0}
3     {'left_exclusive': 7.5, 'right_inclusive': 10.0}
dtype: struct<left_exclusive: double, right_inclusive: double>[pyarrow]

Cut with the same bins, but assign them specific labels:

>>> bpd.cut(s, bins=3, labels=["bad", "medium", "good"])
0    bad
1    bad
2    medium
3    good
dtype: string

labels=False returns the integer identifiers of the bins instead of labels:

>>> bpd.cut(s, bins=4, labels=False)
0    0
1    0
2    1
3    3
dtype: Int64

Cut with a pd.IntervalIndex; this requires importing pandas for the IntervalIndex:

>>> interval_index = pd.IntervalIndex.from_tuples([(0, 1), (1, 5), (5, 20)])
>>> bpd.cut(s, bins=interval_index)
0                                            <NA>
1     {'left_exclusive': 0, 'right_inclusive': 1}
2     {'left_exclusive': 1, 'right_inclusive': 5}
3    {'left_exclusive': 5, 'right_inclusive': 20}
dtype: struct<left_exclusive: int64, right_inclusive: int64>[pyarrow]

Cut with an iterable of tuples:

>>> bins_tuples = [(0, 1), (1, 4), (5, 20)]
>>> bpd.cut(s, bins=bins_tuples)
0                                            <NA>
1     {'left_exclusive': 0, 'right_inclusive': 1}
2                                            <NA>
3    {'left_exclusive': 5, 'right_inclusive': 20}
dtype: struct<left_exclusive: int64, right_inclusive: int64>[pyarrow]

Cut with an iterable of ints:

>>> bins_ints = [0, 1, 5, 20]
>>> bpd.cut(s, bins=bins_ints)
0                                            <NA>
1     {'left_exclusive': 0, 'right_inclusive': 1}
2     {'left_exclusive': 1, 'right_inclusive': 5}
3    {'left_exclusive': 5, 'right_inclusive': 20}
dtype: struct<left_exclusive: int64, right_inclusive: int64>[pyarrow]

Cut with an iterable of ints, where intervals are left-inclusive and right-exclusive:

>>> bins_ints = [0, 1, 5, 20]
>>> bpd.cut(s, bins=bins_ints, right=False)
0     {'left_inclusive': 0, 'right_exclusive': 1}
1     {'left_inclusive': 1, 'right_exclusive': 5}
2    {'left_inclusive': 5, 'right_exclusive': 20}
3    {'left_inclusive': 5, 'right_exclusive': 20}
dtype: struct<left_inclusive: int64, right_exclusive: int64>[pyarrow]

deploy_remote_function

deploy_remote_function(func, **kwargs)

Orchestrates the creation of a BigQuery remote function that deploys immediately.

This method ensures that the remote function is created and available for use in BigQuery as soon as this call is made.
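
A minimal sketch, assuming the keyword arguments are forwarded to the same options accepted by remote_function (the function body and service account value are placeholders):

>>> import bigframes.pandas as bpd

>>> def square(x: int) -> int:
...     return x * x

>>> square_remote = bpd.deploy_remote_function(  # doctest: +SKIP
...     square,
...     cloud_function_service_account="default",
... )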

deploy_udf

deploy_udf(func, **kwargs)

Orchestrates the creation of a BigQuery UDF that deploys immediately.

This method ensures that the UDF is created and available for use in BigQuery as soon as this call is made.
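
A minimal sketch, assuming the keyword arguments are forwarded to the same options accepted by udf (the dataset and name values are placeholders):

>>> import bigframes.pandas as bpd

>>> def to_upper(s: str) -> str:
...     return s.upper()

>>> to_upper_udf = bpd.deploy_udf(  # doctest: +SKIP
...     to_upper,
...     dataset="my_dataset",
...     name="to_upper_udf",
... )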

from_glob_path

from_glob_path(
    path: str,
    *,
    connection: typing.Optional[str] = None,
    name: typing.Optional[str] = None
) -> bigframes.dataframe.DataFrame

Create a BigFrames DataFrame that contains a BigFrames Blob column from a glob wildcard path. This operation creates a temporary BQ Object Table under the hood and requires the bigquery.connections.delegate permission or the BigQuery Connection Admin role. If you have an existing BQ Object Table, use read_gbq_object_table().
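
A minimal usage sketch; the Cloud Storage path and column name below are placeholders:

>>> import bigframes.pandas as bpd
>>> df = bpd.from_glob_path(  # doctest: +SKIP
...     "gs://my-bucket/images/*.png",
...     name="image_blob",
... )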

get_default_session_id

get_default_session_id() -> str

Gets the session id that is used whenever a custom session has not been provided.

It is the session id of the default global session. It is prefixed to the table id of all temporary tables created in the global session.

get_dummies

get_dummies(
    data: typing.Union[bigframes.dataframe.DataFrame, bigframes.series.Series],
    prefix: typing.Optional[typing.Union[typing.List, dict, str]] = None,
    prefix_sep: typing.Optional[typing.Union[typing.List, dict, str]] = "_",
    dummy_na: bool = False,
    columns: typing.Optional[typing.List] = None,
    drop_first: bool = False,
    dtype: typing.Any = None,
) -> bigframes.dataframe.DataFrame

Convert categorical variable into dummy/indicator variables.

Each variable is converted into as many 0/1 variables as there are different values. Columns in the output are each named after a value; if the input is a DataFrame, the name of the original variable is prepended to the value.

Examples:

>>> import bigframes.pandas as pd
>>> pd.options.display.progress_bar = None
>>> s = pd.Series(list('abca'))
>>> pd.get_dummies(s)
       a      b      c
0   True  False  False
1  False   True  False
2  False  False   True
3   True  False  False
<BLANKLINE>
[4 rows x 3 columns]

>>> s1 = pd.Series(['a', 'b', None])
>>> pd.get_dummies(s1)
       a      b
0   True  False
1  False   True
2  False  False
<BLANKLINE>
[3 rows x 2 columns]

>>> pd.get_dummies(s1, dummy_na=True)
       a      b   <NA>
0   True  False  False
1  False   True  False
2  False  False   True
<BLANKLINE>
[3 rows x 3 columns]

>>> df = pd.DataFrame({'A': ['a', 'b', 'a'], 'B': ['b', 'a', 'c'], 'C': [1, 2, 3]})
>>> pd.get_dummies(df, prefix=['col1', 'col2'])
   C  col1_a  col1_b  col2_a  col2_b  col2_c
0  1    True   False   False    True   False
1  2   False    True    True   False   False
2  3    True   False   False   False    True
<BLANKLINE>
[3 rows x 6 columns]

>>> pd.get_dummies(pd.Series(list('abcaa')))
       a      b      c
0   True  False  False
1  False   True  False
2  False  False   True
3   True  False  False
4   True  False  False
<BLANKLINE>
[5 rows x 3 columns]

>>> pd.get_dummies(pd.Series(list('abcaa')), drop_first=True)
       b      c
0  False  False
1   True  False
2  False   True
3  False  False
4  False  False
<BLANKLINE>
[5 rows x 2 columns]

get_global_session

get_global_session()

Gets the global session.

Creates the global session if it does not exist.
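
A minimal usage sketch:

>>> import bigframes.pandas as bpd
>>> session = bpd.get_global_session()  # doctest: +SKIP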

merge

merge(
    left: bigframes.dataframe.DataFrame,
    right: bigframes.dataframe.DataFrame,
    how: typing.Literal["inner", "left", "outer", "right", "cross"] = "inner",
    on: typing.Optional[
        typing.Union[typing.Sequence[typing.Hashable], typing.Hashable]
    ] = None,
    *,
    left_on: typing.Optional[
        typing.Union[typing.Sequence[typing.Hashable], typing.Hashable]
    ] = None,
    right_on: typing.Optional[
        typing.Union[typing.Sequence[typing.Hashable], typing.Hashable]
    ] = None,
    left_index: bool = False,
    right_index: bool = False,
    sort: bool = False,
    suffixes: tuple[str, str] = ("_x", "_y")
) -> bigframes.dataframe.DataFrame

Merge DataFrame objects with a database-style join.

The join is done on columns or indexes. If joining columns on columns, the DataFrame indexes will be ignored. Otherwise if joining indexes on indexes or indexes on a column or columns, the index will be passed on. When performing a cross merge, no column specifications to merge on are allowed.
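
A minimal sketch of a column-on-column inner join (the frames and key values are arbitrary):

>>> import bigframes.pandas as bpd
>>> left = bpd.DataFrame({'key': ['a', 'b', 'c'], 'lvalue': [1, 2, 3]})
>>> right = bpd.DataFrame({'key': ['a', 'b', 'd'], 'rvalue': [4, 5, 6]})
>>> bpd.merge(left, right, how='inner', on='key')  # doctest: +SKIP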

qcut

qcut(
    x: bigframes.series.Series,
    q: typing.Union[int, typing.Sequence[float]],
    *,
    labels: typing.Optional[bool] = None,
    duplicates: typing.Literal["drop", "error"] = "error"
) -> bigframes.series.Series

Quantile-based discretization function.

Discretize a variable into equal-sized buckets based on rank or on sample quantiles. For example, 1000 values for 10 quantiles would produce a Categorical object indicating quantile membership for each data point.
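
A minimal sketch of bucketing a Series into quartiles and returning integer bucket identifiers (labels=False mirrors the behavior shown for cut above):

>>> import bigframes.pandas as bpd
>>> s = bpd.Series([1, 2, 3, 4, 5, 6, 7, 8])
>>> bpd.qcut(s, q=4, labels=False)  # doctest: +SKIP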

read_arrow

read_arrow(pa_table: pyarrow.lib.Table) -> bigframes.dataframe.DataFrame

Load a PyArrow Table to a BigQuery DataFrames DataFrame.
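
A minimal sketch; pa.table builds a small in-memory Table to load (the column names and values are arbitrary):

>>> import pyarrow as pa
>>> import bigframes.pandas as bpd

>>> pa_table = pa.table({'id': [1, 2], 'name': ['Alice', 'Bob']})
>>> df = bpd.read_arrow(pa_table)  # doctest: +SKIP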

read_csv

read_csv(
    filepath_or_buffer: typing.Union[str, typing.IO[bytes]],
    *,
    sep: typing.Optional[str] = ",",
    header: typing.Optional[int] = 0,
    names: typing.Optional[
        typing.Union[
            typing.MutableSequence[typing.Any],
            numpy.ndarray[typing.Any, typing.Any],
            typing.Tuple[typing.Any, ...],
            range,
        ]
    ] = None,
    index_col: typing.Optional[
        typing.Union[
            int,
            str,
            typing.Sequence[typing.Union[str, int]],
            bigframes.enums.DefaultIndexKind,
            typing.Literal[False],
        ]
    ] = None,
    usecols: typing.Optional[
        typing.Union[
            typing.MutableSequence[str],
            typing.Tuple[str, ...],
            typing.Sequence[int],
            pandas.core.series.Series,
            pandas.core.indexes.base.Index,
            numpy.ndarray[typing.Any, typing.Any],
            typing.Callable[[typing.Any], bool],
        ]
    ] = None,
    dtype: typing.Optional[typing.Dict] = None,
    engine: typing.Optional[
        typing.Literal["c", "python", "pyarrow", "python-fwf", "bigquery"]
    ] = None,
    encoding: typing.Optional[str] = None,
    write_engine: typing.Literal[
        "default",
        "bigquery_inline",
        "bigquery_load",
        "bigquery_streaming",
        "bigquery_write",
        "_deferred",
    ] = "default",
    **kwargs
) -> bigframes.dataframe.DataFrame

Loads data from a comma-separated values (csv) file into a DataFrame.

The CSV file data will be persisted as a temporary BigQuery table, which can be automatically recycled after the Session is closed.

Examples:

>>> import bigframes.pandas as bpd

>>> gcs_path = "gs://cloud-samples-data/bigquery/us-states/us-states.csv"
>>> df = bpd.read_csv(filepath_or_buffer=gcs_path)
>>> df.head(2)
      name post_abbr
0  Alabama        AL
1   Alaska        AK
<BLANKLINE>
[2 rows x 2 columns]

read_gbq

read_gbq(
    query_or_table: str,
    *,
    index_col: typing.Union[
        typing.Iterable[str], str, bigframes.enums.DefaultIndexKind
    ] = (),
    columns: typing.Iterable[str] = (),
    configuration: typing.Optional[typing.Dict] = None,
    max_results: typing.Optional[int] = None,
    filters: typing.Union[
        typing.Iterable[
            typing.Tuple[
                str,
                typing.Literal[
                    "in", "not in", "<",><=", "="=" ,"="" "!=", ">=", ">", "LIKE"
                ],
                typing.Any,
            ]
        ],
        typing.Iterable[
            typing.Iterable[
                typing.Tuple[
                    str,
                    typing.Literal[
                        "in", "not in", "<",><=", "="=" ,"="" "!=", ">=", ">", "LIKE"
                    ],
                    typing.Any,
                ]
            ]
        ],
    ] = (),
    use_cache: typing.Optional[bool] = None,
    col_order: typing.Iterable[str] = (),
    dry_run: bool = False,
    allow_large_results: typing.Optional[bool] = None
) -> bigframes.dataframe.DataFrame | pandas.core.series.Series

Loads a DataFrame from BigQuery.

BigQuery tables are an unordered, unindexed data source. To add pandas compatibility, the following indexing options are supported via the index_col parameter:

  • (Empty iterable, default) A default index. Behavior may change. Explicitly set index_col if your application makes use of specific index values.

    If a table has primary key(s), those are used as the index, otherwise a sequential index is generated.

  • (<xref uid="bigframes.enums.DefaultIndexKind.SEQUENTIAL_INT64">bigframes.enums.DefaultIndexKind.SEQUENTIAL_INT64</xref>) Add an arbitrary sequential index and ordering. Warning This uses an analytic windowed operation that prevents filtering push down. Avoid using on large clustered or partitioned tables.
  • (Recommended) Set the index_col argument to one or more columns. Unique values for the row labels are recommended. Duplicate labels are possible, but note that joins on a non-unique index can duplicate rows via pandas-compatible outer join behavior.

If your query doesn't have an ordering, select GENERATE_UUID() AS rowindex in your SQL and set index_col='rowindex' for the best performance.

Examples:

>>> import bigframes.pandas as bpd

If the input is a table ID:

>>> df = bpd.read_gbq("bigquery-public-data.ml_datasets.penguins")

Read table path with wildcard suffix and filters:

>>> df = bpd.read_gbq_table("bigquery-public-data.noaa_gsod.gsod19*", filters=[("_table_suffix", ">=", "30"), ("_table_suffix", "<=", "39")])

Preserve ordering in a query input.

>>> df = bpd.read_gbq('''
...    SELECT
...       -- Instead of an ORDER BY clause on the query, use
...       -- ROW_NUMBER() to create an ordered DataFrame.
...       ROW_NUMBER() OVER (ORDER BY AVG(pitchSpeed) DESC)
...         AS rowindex,
...
...       pitcherFirstName,
...       pitcherLastName,
...       AVG(pitchSpeed) AS averagePitchSpeed
...     FROM `bigquery-public-data.baseball.games_wide`
...     WHERE year = 2016
...     GROUP BY pitcherFirstName, pitcherLastName
... ''', index_col="rowindex")
>>> df.head(2)
         pitcherFirstName pitcherLastName  averagePitchSpeed
rowindex
1                Albertin         Chapman          96.514113
2                 Zachary         Britton          94.591039
<BLANKLINE>
[2 rows x 3 columns]

Reading data with columns and filters parameters:

>>> columns = ['pitcherFirstName', 'pitcherLastName', 'year', 'pitchSpeed']
>>> filters = [('year', '==', 2016), ('pitcherFirstName', 'in', ['John', 'Doe']), ('pitcherLastName', 'in', ['Gant']), ('pitchSpeed', '>', 94)]
>>> df = bpd.read_gbq(
...             "bigquery-public-data.baseball.games_wide",
...             columns=columns,
...             filters=filters,
...         )
>>> df.head(1)
  pitcherFirstName pitcherLastName  year  pitchSpeed
0             John            Gant  2016          95
<BLANKLINE>
[1 rows x 4 columns]

read_gbq_function

read_gbq_function(function_name: str, is_row_processor: bool = False)

Loads a BigQuery function from BigQuery.

Then it can be applied to a DataFrame or Series.

BigQuery Utils provides many public functions under the bqutil project on Google Cloud Platform (see https://github.com/GoogleCloudPlatform/bigquery-utils/tree/master/udfs#using-the-udfs). You can check out the Community UDFs to use community-contributed functions (see https://github.com/GoogleCloudPlatform/bigquery-utils/tree/master/udfs/community#community-udfs).

Examples:

Use the cw_lower_case_ascii_only function from Community UDFs.

>>> import bigframes.pandas as bpd
>>> func = bpd.read_gbq_function("bqutil.fn.cw_lower_case_ascii_only")

You can run it on scalar input. Usually you would do so to verify that it works as expected before applying to all values in a Series.

>>> func('AURÉLIE')
'aurÉlie'

You can apply it to a BigQuery DataFrames Series.

>>> df = bpd.DataFrame({'id': [1, 2, 3], 'name': ['AURÉLIE', 'CÉLESTINE', 'DAPHNÉ']})
>>> df
   id       name
0   1    AURÉLIE
1   2  CÉLESTINE
2   3     DAPHNÉ
<BLANKLINE>
[3 rows x 2 columns]

>>> df1 = df.assign(new_name=df['name'].apply(func))
>>> df1
   id       name   new_name
0   1    AURÉLIE    aurÉlie
1   2  CÉLESTINE  cÉlestine
2   3     DAPHNÉ     daphnÉ
<BLANKLINE>
[3 rows x 3 columns]

You can even use a function with multiple inputs. For example, cw_regexp_replace_5 from Community UDFs.

>>> func = bpd.read_gbq_function("bqutil.fn.cw_regexp_replace_5")
>>> func('TestStr123456', 'Str', 'Cad$', 1, 1)
'TestCad$123456'

>>> df = bpd.DataFrame({
...     "haystack" : ["TestStr123456", "TestStr123456Str", "TestStr123456Str"],
...     "regexp" : ["Str", "Str", "Str"],
...     "replacement" : ["Cad$", "Cad$", "Cad$"],
...     "offset" : [1, 1, 1],
...     "occurrence" : [1, 2, 1]
... })
>>> df
           haystack regexp replacement  offset  occurrence
0     TestStr123456    Str        Cad$       1           1
1  TestStr123456Str    Str        Cad$       1           2
2  TestStr123456Str    Str        Cad$       1           1
<BLANKLINE>
[3 rows x 5 columns]
>>> df.apply(func, axis=1)
0       TestCad$123456
1    TestStr123456Cad$
2    TestCad$123456Str
dtype: string

Another use case is to define your own remote function and use it later. For example, define the remote function:

>>> @bpd.remote_function(cloud_function_service_account="default")  # doctest: +SKIP
... def tenfold(num: int) -> float:
...     return num * 10

Then, read back the deployed BQ remote function:

>>> tenfold_ref = bpd.read_gbq_function(  # doctest: +SKIP
...     tenfold.bigframes_remote_function,
... )

>>> df = bpd.DataFrame({'a': [1, 2], 'b': [3, 4], 'c': [5, 6]})
>>> df
    a   b   c
0   1   3   5
1   2   4   6
<BLANKLINE>
[2 rows x 3 columns]

>>> df['a'].apply(tenfold_ref)  # doctest: +SKIP
0    10.0
1    20.0
Name: a, dtype: Float64

It also supports row processing via is_row_processor=True. Note that a row processor implies the function has only one input parameter.

>>> @bpd.remote_function(cloud_function_service_account="default")  # doctest: +SKIP
... def row_sum(s: pd.Series) -> float:
...     return s['a'] + s['b'] + s['c']

>>> row_sum_ref = bpd.read_gbq_function(  # doctest: +SKIP
...     row_sum.bigframes_remote_function,
...     is_row_processor=True,
... )

>>> df = bpd.DataFrame({'a': [1, 2], 'b': [3, 4], 'c': [5, 6]})
>>> df
    a   b   c
0   1   3   5
1   2   4   6
<BLANKLINE>
[2 rows x 3 columns]

>>> df.apply(row_sum_ref, axis=1)  # doctest: +SKIP
0     9.0
1    12.0
dtype: Float64

read_gbq_model

read_gbq_model(model_name: str)

Loads a BigQuery ML model from BigQuery.

Examples:

Read an existing BigQuery ML model.

>>> import bigframes.pandas as bpd
>>> model_name = "bigframes-dev.bqml_tutorial.penguins_model"
>>> model = bpd.read_gbq_model(model_name)

read_gbq_object_table

read_gbq_object_table(
    object_table: str, *, name: typing.Optional[str] = None
) -> bigframes.dataframe.DataFrame

Read an existing object table to create a BigFrames Blob DataFrame. The object table's connection is used as the connection of the blob column. This function doesn't retrieve the object table data. If you want to read the data, use read_gbq() instead.
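
A minimal usage sketch; the object table ID below is a placeholder:

>>> import bigframes.pandas as bpd
>>> df = bpd.read_gbq_object_table(  # doctest: +SKIP
...     "my-project.my_dataset.my_object_table",
...     name="blob_col",
... )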

read_gbq_query

read_gbq_query(
    query: str,
    *,
    index_col: typing.Union[
        typing.Iterable[str], str, bigframes.enums.DefaultIndexKind
    ] = (),
    columns: typing.Iterable[str] = (),
    configuration: typing.Optional[typing.Dict] = None,
    max_results: typing.Optional[int] = None,
    use_cache: typing.Optional[bool] = None,
    col_order: typing.Iterable[str] = (),
    filters: typing.Union[
        typing.Iterable[
            typing.Tuple[
                str,
                typing.Literal[
                    "in", "not in", "<",><=", "="=" ,"="" "!=", ">=", ">", "LIKE"
                ],
                typing.Any,
            ]
        ],
        typing.Iterable[
            typing.Iterable[
                typing.Tuple[
                    str,
                    typing.Literal[
                        "in", "not in", "<",><=", "="=" ,"="" "!=", ">=", ">", "LIKE"
                    ],
                    typing.Any,
                ]
            ]
        ],
    ] = (),
    dry_run: bool = False,
    allow_large_results: typing.Optional[bool] = None
) -> bigframes.dataframe.DataFrame | pandas.core.series.Series

Turn a SQL query into a DataFrame.

Note: Because the results are written to a temporary table, ordering from an ORDER BY clause is not preserved. A unique index_col is recommended. Use ROW_NUMBER() OVER () if there is no natural unique index or you want to preserve ordering.

Examples:

Simple query input:

>>> import bigframes.pandas as bpd
>>> df = bpd.read_gbq_query('''
...    SELECT
...       pitcherFirstName,
...       pitcherLastName,
...       pitchSpeed,
...    FROM `bigquery-public-data.baseball.games_wide`
... ''')

Preserve ordering in a query input.

>>> df = bpd.read_gbq_query('''
...    SELECT
...       -- Instead of an ORDER BY clause on the query, use
...       -- ROW_NUMBER() to create an ordered DataFrame.
...       ROW_NUMBER() OVER (ORDER BY AVG(pitchSpeed) DESC)
...         AS rowindex,
...
...       pitcherFirstName,
...       pitcherLastName,
...       AVG(pitchSpeed) AS averagePitchSpeed
...     FROM `bigquery-public-data.baseball.games_wide`
...     WHERE year = 2016
...     GROUP BY pitcherFirstName, pitcherLastName
... ''', index_col="rowindex")
>>> df.head(2)
         pitcherFirstName pitcherLastName  averagePitchSpeed
rowindex
1                Albertin         Chapman          96.514113
2                 Zachary         Britton          94.591039
<BLANKLINE>
[2 rows x 3 columns]

See also: Session.read_gbq.

read_gbq_table

read_gbq_table(
    query: str,
    *,
    index_col: typing.Union[
        typing.Iterable[str], str, bigframes.enums.DefaultIndexKind
    ] = (),
    columns: typing.Iterable[str] = (),
    max_results: typing.Optional[int] = None,
    filters: typing.Union[
        typing.Iterable[
            typing.Tuple[
                str,
                typing.Literal[
                    "in", "not in", "<",><=", "="=" ,"="" "!=", ">=", ">", "LIKE"
                ],
                typing.Any,
            ]
        ],
        typing.Iterable[
            typing.Iterable[
                typing.Tuple[
                    str,
                    typing.Literal[
                        "in", "not in", "<",><=", "="=" ,"="" "!=", ">=", ">", "LIKE"
                    ],
                    typing.Any,
                ]
            ]
        ],
    ] = (),
    use_cache: bool = True,
    col_order: typing.Iterable[str] = (),
    dry_run: bool = False
) -> bigframes.dataframe.DataFrame | pandas.core.series.Series

Turn a BigQuery table into a DataFrame.

Examples:

Read a whole table, with arbitrary ordering or ordering corresponding to the primary key(s).

>>> import bigframes.pandas as bpd
>>> df = bpd.read_gbq_table("bigquery-public-data.ml_datasets.penguins")

See also: Session.read_gbq.

read_json

read_json(
    path_or_buf: typing.Union[str, typing.IO[bytes]],
    *,
    orient: typing.Literal[
        "split", "records", "index", "columns", "values", "table"
    ] = "columns",
    dtype: typing.Optional[typing.Dict] = None,
    encoding: typing.Optional[str] = None,
    lines: bool = False,
    engine: typing.Literal["ujson", "pyarrow", "bigquery"] = "ujson",
    write_engine: typing.Literal[
        "default",
        "bigquery_inline",
        "bigquery_load",
        "bigquery_streaming",
        "bigquery_write",
        "_deferred",
    ] = "default",
    **kwargs
) -> bigframes.dataframe.DataFrame

Convert a JSON string to DataFrame object.

Examples:

>>> import bigframes.pandas as bpd

>>> gcs_path = "gs://bigframes-dev-testing/sample1.json"
>>> df = bpd.read_json(path_or_buf=gcs_path, lines=True, orient="records")
>>> df.head(2)
   id   name
0   1  Alice
1   2    Bob
<BLANKLINE>
[2 rows x 2 columns]

read_pandas

read_pandas(
    pandas_dataframe: typing.Union[
        pandas.core.frame.DataFrame,
        pandas.core.series.Series,
        pandas.core.indexes.base.Index,
    ],
    *,
    write_engine: typing.Literal[
        "default",
        "bigquery_inline",
        "bigquery_load",
        "bigquery_streaming",
        "bigquery_write",
        "_deferred",
    ] = "default"
)

Loads a DataFrame from a pandas DataFrame.

The pandas DataFrame will be persisted as a temporary BigQuery table, which can be automatically recycled after the Session is closed.

Examples:

>>> import bigframes.pandas as bpd
>>> import pandas as pd

>>> d = {'col1': [1, 2], 'col2': [3, 4]}
>>> pandas_df = pd.DataFrame(data=d)
>>> df = bpd.read_pandas(pandas_df)
>>> df
   col1  col2
0     1     3
1     2     4
<BLANKLINE>
[2 rows x 2 columns]

read_parquet

read_parquet(
    path: typing.Union[str, typing.IO[bytes]],
    *,
    engine: str = "auto",
    write_engine: typing.Literal[
        "default",
        "bigquery_inline",
        "bigquery_load",
        "bigquery_streaming",
        "bigquery_write",
        "_deferred",
    ] = "default"
) -> bigframes.dataframe.DataFrame

Load a Parquet object from the file path (local or Cloud Storage), returning a DataFrame.

Examples:

>>> import bigframes.pandas as bpd

>>> gcs_path = "gs://cloud-samples-data/bigquery/us-states/us-states.parquet"
>>> df = bpd.read_parquet(path=gcs_path, engine="bigquery")

read_pickle

read_pickle(
    filepath_or_buffer: FilePath | ReadPickleBuffer,
    compression: CompressionOptions = "infer",
    storage_options: StorageOptions = None,
    *,
    write_engine: constants.WriteEngineType = "default"
)

Load pickled BigFrames object (or any object) from file.

Examples:

>>> import bigframes.pandas as bpd

>>> gcs_path = "gs://bigframes-dev-testing/test_pickle.pkl"
>>> df = bpd.read_pickle(filepath_or_buffer=gcs_path)

remote_function

remote_function(
    input_types: typing.Union[None, type, typing.Sequence[type]] = None,
    output_type: typing.Optional[type] = None,
    dataset: typing.Optional[str] = None,
    *,
    bigquery_connection: typing.Optional[str] = None,
    reuse: bool = True,
    name: typing.Optional[str] = None,
    packages: typing.Optional[typing.Sequence[str]] = None,
    cloud_function_service_account: str,
    cloud_function_kms_key_name: typing.Optional[str] = None,
    cloud_function_docker_repository: typing.Optional[str] = None,
    max_batching_rows: typing.Optional[int] = 1000,
    cloud_function_timeout: typing.Optional[int] = 600,
    cloud_function_max_instances: typing.Optional[int] = None,
    cloud_function_vpc_connector: typing.Optional[str] = None,
    cloud_function_vpc_connector_egress_settings: typing.Optional[
        typing.Literal["all", "private-ranges-only", "unspecified"]
    ] = None,
    cloud_function_memory_mib: typing.Optional[int] = 1024,
    cloud_function_ingress_settings: typing.Literal[
        "all", "internal-only", "internal-and-gclb"
    ] = "internal-only",
    cloud_build_service_account: typing.Optional[str] = None
)

Decorator to turn a user defined function into a BigQuery remote function. Check out the code samples at https://cloud.google.com/bigquery/docs/remote-functions#bigquery-dataframes. A minimal usage sketch follows the setup steps below.

For details on the identity used by the underlying Cloud Function, see https://cloud.google.com/functions/docs/securing/function-identity. Before using this decorator, make sure the following is set up:

  1. Have the below APIs enabled for your project:

    • BigQuery Connection API
    • Cloud Functions API
    • Cloud Run API
    • Cloud Build API
    • Artifact Registry API
    • Cloud Resource Manager API

    This can be done from the cloud console (change PROJECT_ID to yours): https://console.cloud.google.com/apis/enableflow?apiid=bigqueryconnection.googleapis.com,cloudfunctions.googleapis.com,run.googleapis.com,cloudbuild.googleapis.com,artifactregistry.googleapis.com,cloudresourcemanager.googleapis.com&project=PROJECT_ID

    Or from the gcloud CLI:

    $ gcloud services enable bigqueryconnection.googleapis.com cloudfunctions.googleapis.com run.googleapis.com cloudbuild.googleapis.com artifactregistry.googleapis.com cloudresourcemanager.googleapis.com

  2. Have the following IAM roles granted to you:

    • BigQuery Data Editor (roles/bigquery.dataEditor)
    • BigQuery Connection Admin (roles/bigquery.connectionAdmin)
    • Cloud Functions Developer (roles/cloudfunctions.developer)
    • Service Account User (roles/iam.serviceAccountUser) on the service account PROJECT_NUMBER-compute@developer.gserviceaccount.com
    • Storage Object Viewer (roles/storage.objectViewer)
    • Project IAM Admin (roles/resourcemanager.projectIamAdmin) (Only required if the bigquery connection being used is not pre-created and is created dynamically with user credentials.)
  3. Either the user has the setIamPolicy privilege on the project, or a BigQuery connection is pre-created with the necessary IAM role set:

    1. To create a connection, follow https://cloud.google.com/bigquery/docs/reference/standard-sql/remote-functions#create_a_connection
    2. To set up IAM, follow https://cloud.google.com/bigquery/docs/reference/standard-sql/remote-functions#grant_permission_on_function

      Alternatively, the IAM binding can also be set up via the gcloud CLI:

      $ gcloud projects add-iam-policy-binding PROJECT_ID --member="serviceAccount:CONNECTION_SERVICE_ACCOUNT_ID" --role="roles/run.invoker"
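
With the setup above in place, a minimal usage sketch (mirroring the tenfold example shown earlier under read_gbq_function):

>>> import bigframes.pandas as bpd

>>> @bpd.remote_function(cloud_function_service_account="default")  # doctest: +SKIP
... def tenfold(num: int) -> float:
...     return num * 10

>>> s = bpd.Series([1, 2, 3])
>>> s.apply(tenfold)  # doctest: +SKIP
0    10.0
1    20.0
2    30.0
dtype: Float64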

reset_session

reset_session()

Start a fresh session the next time a function requires a session.

Closes the current session if it was already started, deleting any temporary tables that were created.

to_datetime

This function converts a scalar, array-like or Series to a datetime object.

Examples:

>>> import bigframes.pandas as bpd

Converting a Scalar to datetime:

>>> scalar = 123456.789
>>> bpd.to_datetime(scalar, unit = 's')
Timestamp('1970-01-02 10:17:36.789000')

Converting a List of Strings without Timezone Information:

>>> list_str = ["01-31-2021 14:30", "02-28-2021 15:45"]
>>> bpd.to_datetime(list_str, format="%m-%d-%Y %H:%M", utc=True)
0    2021-01-31 14:30:00+00:00
1    2021-02-28 15:45:00+00:00
dtype: timestamp[us, tz=UTC][pyarrow]

Converting a Series of Strings with Timezone Information:

>>> series_str = bpd.Series(["01-31-2021 14:30+08:00", "02-28-2021 15:45+00:00"])
>>> bpd.to_datetime(series_str, format="%m-%d-%Y %H:%M%Z", utc=True)
0    2021-01-31 06:30:00+00:00
1    2021-02-28 15:45:00+00:00
dtype: timestamp[us, tz=UTC][pyarrow]

to_timedelta

to_timedelta(
    arg,
    unit: typing.Optional[
        typing.Literal[
            "W",
            "w",
            "D",
            "d",
            "days",
            "day",
            "hours",
            "hour",
            "hr",
            "h",
            "m",
            "minute",
            "min",
            "minutes",
            "s",
            "seconds",
            "sec",
            "second",
            "ms",
            "milliseconds",
            "millisecond",
            "milli",
            "millis",
            "us",
            "microseconds",
            "microsecond",
            "µs",
            "micro",
            "micros",
        ]
    ] = None,
    *,
    session: typing.Optional[bigframes.session.Session] = None
)

Converts a scalar or Series to a timedelta object.

Examples:

Converting a Scalar to timedelta

>>> import bigframes.pandas as bpd
>>> scalar = 2
>>> bpd.to_timedelta(scalar, unit='s')
Timedelta('0 days 00:00:02')

Converting a Series of integers to a Series of timedeltas

>>> int_series = bpd.Series([1,2,3])
>>> bpd.to_timedelta(int_series, unit='s')
0    0 days 00:00:01
1    0 days 00:00:02
2    0 days 00:00:03
dtype: duration[us][pyarrow]

udf

udf(
    *,
    input_types: typing.Union[None, type, typing.Sequence[type]] = None,
    output_type: typing.Optional[type] = None,
    dataset: str,
    bigquery_connection: typing.Optional[str] = None,
    name: str,
    packages: typing.Optional[typing.Sequence[str]] = None,
    max_batching_rows: typing.Optional[int] = None,
    container_cpu: typing.Optional[float] = None,
    container_memory: typing.Optional[str] = None
)

Decorator to turn a Python user defined function (udf) into a BigQuery managed user-defined function.

Examples:

>>> import datetime
>>> import bigframes.pandas as bpd

Turning an arbitrary Python function into a BigQuery managed Python udf:

>>> bq_name = datetime.datetime.now().strftime("bigframes_%Y%m%d%H%M%S%f")
>>> @bpd.udf(dataset="bigframes_testing", name=bq_name)  # doctest: +SKIP
... def minutes_to_hours(x: int) -> float:
...     return x/60

>>> minutes = bpd.Series([0, 30, 60, 90, 120])
>>> minutes
0      0
1     30
2     60
3     90
4    120
dtype: Int64

>>> hours = minutes.apply(minutes_to_hours)  # doctest: +SKIP
>>> hours  # doctest: +SKIP
0    0.0
1    0.5
2    1.0
3    1.5
4    2.0
dtype: Float64

To turn a user defined function with external package dependencies into a BigQuery managed Python udf, provide the names of the packages (optionally with the package version) via the packages parameter.

>>> bq_name = datetime.datetime.now().strftime("bigframes_%Y%m%d%H%M%S%f")
>>> @bpd.udf(  # doctest: +SKIP
...     dataset="bigfranes_testing",
...     name=bq_name,
...     packages=["cryptography"]
... )
... def get_hash(input: str) -> str:
...     from cryptography.fernet import Fernet
...
...     # handle missing value
...     if input is None:
...         input = ""
...
...     key = Fernet.generate_key()
...     f = Fernet(key)
...     return f.encrypt(input.encode()).decode()

>>> names = bpd.Series(["Alice", "Bob"])
>>> hashes = names.apply(get_hash)  # doctest: +SKIP

You can clean up the BigQuery functions created above using the BigQuery client from the BigQuery DataFrames session:

>>> session = bpd.get_global_session()  # doctest: +SKIP
>>> session.bqclient.delete_routine(minutes_to_hours.bigframes_bigquery_function)  # doctest: +SKIP
>>> session.bqclient.delete_routine(get_hash.bigframes_bigquery_function)  # doctest: +SKIP

Parameters
input_types type or sequence(type), Optional

For a scalar user defined function, it should be the input type or sequence of input types. The supported scalar input types are bool, bytes, float, int, and str.

output_type type, Optional

Data type of the output in the user defined function. If the user defined function returns an array, then list[type] should be specified. The supported output types are bool, bytes, float, int, str, list[bool], list[float], list[int] and list[str].

dataset str

Dataset in which to create a BigQuery managed function. It should be in <project_id>.<dataset_name> or <dataset_name> format.

bigquery_connection str, Optional

Name of the BigQuery connection. It is used to provide an identity to the serverless instances running the user code. It helps BigQuery manage and track the resources used by the udf. This connection is required for internet access and for interacting with other GCP services. To access GCP services, the appropriate IAM permissions must also be granted to the connection's Service Account. When it defaults to None, the udf will be created without any connection. A udf without a connection has no internet access and no access to other GCP services.

name str

Explicit name of the persisted BigQuery managed function. Use it with caution, because more than one user working in the same project and dataset could overwrite each other's managed functions if they use the same persistent name. Please note that any session-specific clean up (bigframes.session.Session.close / bigframes.pandas.close_session / bigframes.pandas.reset_session / bigframes.pandas.clean_up_by_session_id) does not clean up this function, and leaves it for the user to manage directly.

packages str[], Optional

Explicit name of the external package dependencies. Each dependency is added to the requirements.txt as is, and can be of the form supported in https://pip.pypa.io/en/stable/reference/requirements-file-format/.

max_batching_rows int, Optional

The maximum number of rows in each batch. If you specify max_batching_rows, BigQuery determines the number of rows in a batch, up to the max_batching_rows limit. If max_batching_rows is not specified, the number of rows to batch is determined automatically.

container_cpu float, Optional

The CPU limits for containers that run Python UDFs. By default, the CPU allocated is 0.33 vCPU. See details at https://cloud.google.com/bigquery/docs/user-defined-functions-python#configure-container-limits.

container_memory str, Optional

The memory limits for containers that run Python UDFs. By default, the memory allocated to each container instance is 512 MiB. See details at https://cloud.google.com/bigquery/docs/user-defined-functions-python#configure-container-limits.