使用舊版 SQL 查詢巢狀與重複的欄位
本文件詳細說明如何使用舊版 SQL 查詢語法查詢巢狀與重複的資料。我們建議使用的 BigQuery 查詢語法是 GoogleSQL。如要瞭解如何在 GoogleSQL 中處理巢狀與重複的資料,請參閱 GoogleSQL 遷移指南。
BigQuery 支援以 JSON 與 Avro 檔案格式載入及匯出巢狀與重複的資料。BigQuery 可為許多舊版 SQL 查詢自動整併資料。舉例來說,許多 SELECT
陳述式皆可在維持資料結構的情況下,擷取巢狀或重複的欄位;而 WHERE
子句則可以在維持資料結構的情況下篩選資料。相反地,ORDER BY
和 GROUP BY
子句會自動整併查詢資料。針對無法自動整併資料的環境 (例如使用舊版 SQL 查詢多個重複欄位時),您可以使用 FLATTEN
與 WITHIN
SQL 函式查詢資料。
FLATTEN
在查詢巢狀資料時,BigQuery 會自動為您整併資料表的資料。請參閱以下個人資料的結構定義範例:
Last modified Schema Total Rows Total Bytes Expiration ----------------- ----------------------------------- ------------ ------------- ------------ 27 Sep 10:01:06 |- kind: string 4 794 |- fullName: string (required) |- age: integer |- gender: string +- phoneNumber: record | |- areaCode: integer | |- number: integer +- children: record (repeated) | |- name: string | |- gender: string | |- age: integer +- citiesLived: record (repeated) | |- place: string | +- yearsLived: integer (repeated)
請注意,有些是重複和巢狀的欄位。如果您像以下這樣對個人資料表執行舊版 SQL 查詢:
SELECT fullName AS name, age, gender, citiesLived.place, citiesLived.yearsLived FROM [dataset.tableId]
BigQuery 會將您的資料以整併的形式輸出:
+---------------+-----+--------+-------------------+------------------------+ | name | age | gender | citiesLived_place | citiesLived_yearsLived | +---------------+-----+--------+-------------------+------------------------+ | John Doe | 22 | Male | Seattle | 1995 | | John Doe | 22 | Male | Stockholm | 2005 | | Mike Jones | 35 | Male | Los Angeles | 1989 | | Mike Jones | 35 | Male | Los Angeles | 1993 | | Mike Jones | 35 | Male | Los Angeles | 1998 | | Mike Jones | 35 | Male | Los Angeles | 2002 | | Mike Jones | 35 | Male | Washington DC | 1990 | | Mike Jones | 35 | Male | Washington DC | 1993 | | Mike Jones | 35 | Male | Washington DC | 1998 | | Mike Jones | 35 | Male | Washington DC | 2008 | | Mike Jones | 35 | Male | Portland | 1993 | | Mike Jones | 35 | Male | Portland | 1998 | | Mike Jones | 35 | Male | Portland | 2003 | | Mike Jones | 35 | Male | Portland | 2005 | | Mike Jones | 35 | Male | Austin | 1973 | | Mike Jones | 35 | Male | Austin | 1998 | | Mike Jones | 35 | Male | Austin | 2001 | | Mike Jones | 35 | Male | Austin | 2005 | | Anna Karenina | 45 | Female | Stockholm | 1992 | | Anna Karenina | 45 | Female | Stockholm | 1998 | | Anna Karenina | 45 | Female | Stockholm | 2000 | | Anna Karenina | 45 | Female | Stockholm | 2010 | | Anna Karenina | 45 | Female | Moscow | 1998 | | Anna Karenina | 45 | Female | Moscow | 2001 | | Anna Karenina | 45 | Female | Moscow | 2005 | | Anna Karenina | 45 | Female | Austin | 1995 | | Anna Karenina | 45 | Female | Austin | 1999 | +---------------+-----+--------+-------------------+------------------------+
在這個範例中,citiesLived.place
現為 citiesLived_place
,citiesLived.yearsLived
則為 citiesLived_yearsLived
。
雖然 BigQuery 能自動整併巢狀欄位,不過您在處理超過多個重複欄位時,可能仍需明確呼叫 FLATTEN
。例如,如果您嘗試執行與下列內容相似的舊版 SQL 查詢:
SELECT fullName, age FROM [dataset.tableId] WHERE (citiesLived.yearsLived > 1995 ) AND (children.age > 3)
BigQuery 會傳回如下所示的錯誤:
Cannot query the cross product of repeated fields children.age and citiesLived.yearsLived
若要查詢超過一個重複的欄位,您就必須整併其中一個欄位:
SELECT fullName, age, gender, citiesLived.place FROM (FLATTEN([dataset.tableId], children)) WHERE (citiesLived.yearsLived > 1995) AND (children.age > 3) GROUP BY fullName, age, gender, citiesLived.place
它會傳回:
+------------+-----+--------+-------------------+ | fullName | age | gender | citiesLived_place | +------------+-----+--------+-------------------+ | John Doe | 22 | Male | Stockholm | | Mike Jones | 35 | Male | Los Angeles | | Mike Jones | 35 | Male | Washington DC | | Mike Jones | 35 | Male | Portland | | Mike Jones | 35 | Male | Austin | +------------+-----+--------+-------------------+
WITHIN 子句
WITHIN
關鍵字專門搭配匯總函式使用,可在記錄與巢狀欄位中跨子女欄位與重複欄位進行匯總。在指定 WITHIN
關鍵字時,您需要指定要匯總的範圍:
WITHIN RECORD
:匯總記錄中重複值的資料。WITHIN node_name
:匯總指定節點中重複值的資料,其中節點是匯總函式中欄位的父項節點。
假設您想知道上一個範例內每個人有幾個小孩。您可以計算每筆記錄擁有的 children.name 數量:
SELECT fullName, COUNT(children.name) WITHIN RECORD AS numberOfChildren FROM [dataset.tableId];
結果如下:
+---------------+------------------+ | fullName | numberOfChildren | +---------------+------------------+ | John Doe | 2 | | Jane Austen | 2 | | Mike Jones | 3 | | Anna Karenina | 0 | +---------------+------------------+
您可以列出所有孩童的姓名來進行比較:
SELECT fullName, children.name FROM [dataset.tableId]
+---------------+---------------+ | fullName | children_name | +---------------+---------------+ | John Doe | Jane | | John Doe | John | | Jane Austen | Josh | | Jane Austen | Jim | | Mike Jones | Earl | | Mike Jones | Sam | | Mike Jones | Kit | | Anna Karenina | None | +---------------+---------------+
這與我們的 WITHIN RECORD
查詢結果相符:John Doe 的確有兩個名為 Jane 和 John 的小孩;Jane Austen 有兩個名為 Josh 和 Jim 的小孩;Mike Jones 有三個名為 Earl、Sam 和 Kit 的小孩;而 Anna Karenina 則沒有小孩。
現在,假設您想要知道某個人在不同地方居住過的次數,就可以使用 WITHIN
子句來彙整特定節點:
SELECT fullName, COUNT(citiesLived.place) WITHIN RECORD AS numberOfPlacesLived, citiesLived.place, COUNT(citiesLived.yearsLived) WITHIN citiesLived AS numberOfTimesInEachCity, FROM [dataset.tableId];
+---------------+---------------------+-------------------+-------------------------+ | fullName | numberOfPlacesLived | citiesLived_place | numberOfTimesInEachCity | +---------------+---------------------+-------------------+-------------------------+ | John Doe | 2 | Seattle | 1 | | John Doe | 2 | Stockholm | 1 | | Mike Jones | 4 | Los Angeles | 4 | | Mike Jones | 4 | Washington DC | 4 | | Mike Jones | 4 | Portland | 4 | | Mike Jones | 4 | Austin | 4 | | Anna Karenina | 3 | Stockholm | 4 | | Anna Karenina | 3 | Moscow | 3 | | Anna Karenina | 3 | Austin | 2 | +---------------+---------------------+-------------------+-------------------------+
這個查詢會執行以下動作:
- 對
citiesLived.place
執行WITHIN RECORD
,並計算每個人分別居住過幾個地方 - 對
citiesLived.yearsLived
執行WITHIN
,並計算每個人分別在各個城市的居住次數 (只計算citiesLived
)。
其中一項 BigQuery 具有的強大功能就是可對巢狀與重複的欄位使用範圍匯總,而這通常能避免查詢中出現成本昂貴的彙整。