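The Gemini 2.5 Computer Use model and tool let your application interact with a browser and automate tasks in it. The Computer Use model infers information about the screen from screenshots and carries out tasks by generating specific UI actions, such as mouse clicks and keyboard input. As with function calling, you need to write client-side application code that receives the function calls from the Computer Use model and tool and executes the corresponding actions.
With the Computer Use model and tool, you can build agents that can:
- Automate repetitive data entry or form filling on websites.
- Navigate websites to gather information.
- Assist users by performing sequences of actions in web applications.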
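This guide assumes that you use the Gen AI SDK for Python and are familiar with the Playwright API.
During this preview, the Computer Use model and tool are not supported in other SDK languages or in the Google Cloud console.
You can also find a reference implementation of the Computer Use model and tool on GitHub.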
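How the Computer Use model and tool work
Instead of generating a text response, the Computer Use model and tool determine when to perform a specific UI action, such as a mouse click, and return the parameters needed to execute it. You need to write client-side application code that receives the function_call from the Computer Use model and tool and executes the corresponding action.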
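Interactions with the Computer Use model and tool follow an agentic loop:
1. Send a request to the model
- Add the Computer Use model and tool, and optionally any other tools, to your API request.
- Prompt the Computer Use model and tool with the user's request and a screenshot that represents the current state of the GUI.
2. Receive the model response
- The model analyzes the user request and the screenshot, and generates a response that includes a suggested function_call representing a UI action (for example, "click at coordinate (x,y)" or "type 'text'"). For all actions that the model supports, see "Supported actions".
- The API response may also include a safety_response from an internal safety system that has checked the model's proposed action. The safety_response classifies the action as one of the following:
  - Regular or allowed: The action is considered safe. This may also be represented by the absence of a safety_response.
  - Requires confirmation: The model is about to perform an action that may be risky (for example, clicking an "accept cookies" banner).
3. Execute the received action
- Your client-side code receives the function_call and any accompanying safety_response.
- If the safety_response indicates regular or allowed (or if no safety_response is present), your client-side code can execute the specified function_call in the target environment (such as a web browser).
- If the safety_response indicates requires confirmation, your application must prompt the end user for confirmation before executing the function_call. If the user confirms, proceed with the action. If the user declines, don't execute it.
4. Capture the new environment state
- If the action was executed, your client captures a new screenshot of the GUI and the current URL, and sends them back to the Computer Use model and tool as part of a function_response.
- If the safety system blocked an action or the user declined to confirm it, your application might send a different form of feedback to the model or end the interaction.
The updated state is then sent to the model, and the process repeats from step 2: the Computer Use model and tool use the new screenshot (if provided) and the ongoing goal to suggest the next action. The loop continues until the task is completed, an error occurs, or the process is terminated (for example, if a response is blocked by safety filters or by a user decision).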
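The control flow of this loop can be sketched without any SDK calls. The following is a minimal simulation under stated assumptions, not the SDK's API: `run_agent_loop` and its three callables are hypothetical stand-ins for the model request, the action executor, and the end user, and the `requires_confirmation` key is an illustrative flag rather than the real response field. On a declined confirmation this sketch simply stops, which is only one of the options described above.

```python
from typing import Callable, Dict, List

Action = Dict[str, object]


def run_agent_loop(
    next_actions: Callable[[List[dict]], List[Action]],
    execute: Callable[[Action], str],
    confirm: Callable[[Action], bool],
    max_turns: int = 10,
) -> str:
    """Drive the request/execute/capture cycle until a stop condition is hit.

    All callables are hypothetical stand-ins: next_actions plays the model,
    execute plays the browser, and confirm plays the end user.
    """
    history: List[dict] = []
    for _ in range(max_turns):
        actions = next_actions(history)  # steps 1-2: send request, receive response
        if not actions:
            return "completed"  # no action proposed: the task is done
        for action in actions:
            if action.get("requires_confirmation") and not confirm(action):
                return "stopped_by_user"  # user declined: do not execute
            result = execute(action)  # step 3: run the action
            # Step 4: record the new state so the next request can include it.
            history.append({"action": action["name"], "result": result})
    return "max_turns_reached"


# Scripted stand-in for the model: two turns with actions, then a final answer.
script = iter([[{"name": "click_at"}], [{"name": "type_text_at"}], []])
outcome = run_agent_loop(
    next_actions=lambda history: next(script),
    execute=lambda action: "success",
    confirm=lambda action: True,
)
print(outcome)  # completed
```

Capping the number of turns is a design choice assumed here so that a model that never stops proposing actions cannot run indefinitely.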
The following diagram illustrates how the Computer Use model and tool work:
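Enable the Computer Use model and tool
To enable the Computer Use model and tool, use gemini-2.5-computer-use-preview-10-2025 as your model and add the Computer Use model and tool to your list of enabled tools: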
Python
from google import genai
from google.genai import types
from google.genai.types import Content, Part, FunctionResponse

client = genai.Client()

# Add Computer Use model and tool to the list of tools
generate_content_config = genai.types.GenerateContentConfig(
    tools=[
        types.Tool(
            computer_use=types.ComputerUse(
                environment=types.Environment.ENVIRONMENT_BROWSER,
            )
        ),
    ]
)

# Example request using the Computer Use model and tool
contents = [
    Content(
        role="user",
        parts=[
            Part(text="Go to google.com and search for 'weather in New York'"),
        ],
    )
]

response = client.models.generate_content(
    model="gemini-2.5-computer-use-preview-10-2025",
    contents=contents,
    config=generate_content_config,
)
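Send a request
After configuring the Computer Use model and tool, send a prompt to the model that includes the user's goal and an initial screenshot of the GUI.
You can also optionally add the following:
- Excluded actions: If the list of supported UI actions includes any that you don't want the model to take, specify them in excluded_predefined_functions.
- User-defined functions: In addition to the Computer Use model and tool, you might want to include custom user-defined functions.
The following sample code enables the Computer Use model and tool and sends the request to the model: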
Python
from google import genai
from google.genai import types
from google.genai.types import Content, Part

client = genai.Client()

# Specify predefined functions to exclude (optional)
excluded_functions = ["drag_and_drop"]

# Configuration for the Computer Use model and tool with browser environment
generate_content_config = genai.types.GenerateContentConfig(
    tools=[
        # 1. Computer Use model and tool with browser environment
        types.Tool(
            computer_use=types.ComputerUse(
                environment=types.Environment.ENVIRONMENT_BROWSER,
                # Optional: Exclude specific predefined functions
                excluded_predefined_functions=excluded_functions
            )
        ),
        # 2. Optional: Custom user-defined functions (need to be defined above)
        # types.Tool(
        #     function_declarations=custom_functions
        # )
    ],
)

# Create the content with user message
contents: list[Content] = [
    Content(
        role="user",
        parts=[
            Part(text="Search for highly rated smart fridges with touchscreen, 2 doors, around 25 cu ft, priced below 4000 dollars on Google Shopping. Create a bulleted list of the 3 cheapest options in the format of name, description, price in an easy-to-read layout."),
            # Optional: include a screenshot of the initial state
            # Part.from_bytes(
            #     data=screenshot_image_bytes,
            #     mime_type='image/png',
            # ),
        ],
    )
]

# Generate content with the configured settings
response = client.models.generate_content(
    model='gemini-2.5-computer-use-preview-10-2025',
    contents=contents,
    config=generate_content_config,
)

# Print the response output
print(response.text)
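You can also include custom user-defined functions to extend what the model can do. See "Use the Computer Use model and tool for mobile use cases" to learn how to configure Computer Use for mobile by adding actions such as open_app, long_press_at, and go_home while excluding browser-specific actions.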
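Receive a response
If the model determines that UI actions or user-defined functions are needed to complete the task, it responds with one or more FunctionCalls. Your application code must parse these actions, execute them, and collect the results. The Computer Use model and tool support parallel function calling, meaning that the model can return multiple independent actions in a single turn.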
{
  "content": {
    "parts": [
      {
        "text": "I will type the search query into the search bar. The search bar is in the center of the page."
      },
      {
        "function_call": {
          "name": "type_text_at",
          "args": {
            "x": 371,
            "y": 470,
            "text": "highly rated smart fridges with touchscreen, 2 doors, around 25 cu ft, priced below 4000 dollars on Google Shopping",
            "press_enter": true
          }
        }
      }
    ]
  }
}
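To illustrate the parsing step, the sketch below walks a response shaped like the JSON above, represented as a plain dict. `split_parts` is a hypothetical helper and the shortened example payload is made up for illustration. With the SDK you would read the same fields from the response objects instead.

```python
from typing import Any, Dict, List, Tuple


def split_parts(response: Dict[str, Any]) -> Tuple[List[str], List[Dict[str, Any]]]:
    """Separate the model's text from its function calls, preserving order."""
    thoughts: List[str] = []
    calls: List[Dict[str, Any]] = []
    for part in response.get("content", {}).get("parts", []):
        if "function_call" in part:
            calls.append(part["function_call"])
        elif part.get("text"):
            thoughts.append(part["text"])
    return thoughts, calls


# Shortened, made-up payload with the same shape as the response above.
example = {
    "content": {
        "parts": [
            {"text": "I will type the search query into the search bar."},
            {
                "function_call": {
                    "name": "type_text_at",
                    "args": {"x": 371, "y": 470, "text": "smart fridges", "press_enter": True},
                }
            },
        ]
    }
}

thoughts, calls = split_parts(example)
print(thoughts)
print([call["name"] for call in calls])
```

Because a single turn may carry several function calls, the helper returns a list of calls rather than stopping at the first one.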
Depending on the action, the API response may also return a safety_response:
{
  "content": {
    "parts": [
      {
        "text": "I have evaluated step 2. It seems Google detected unusual traffic and is asking me to verify I'm not a robot. I need to click the 'I'm not a robot' checkbox located near the top left (y=98, x=95)."
      },
      {
        "function_call": {
          "name": "click_at",
          "args": {
            "x": 60,
            "y": 100,
            "safety_decision": {
              "explanation": "I have encountered a CAPTCHA challenge that requires interaction. I need you to complete the challenge by clicking the 'I'm not a robot' checkbox and any subsequent verification steps.",
              "decision": "require_confirmation"
            }
          }
        }
      }
    ]
  }
}
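Execute the received actions
After receiving a response, your client code needs to execute the received actions.
The following code extracts function calls from a Gemini response, converts coordinates from the 0-1000 range to actual pixels, executes browser actions using Playwright, and returns the success or failure status of each action: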
import time
from typing import Any, List, Tuple

def normalize_x(x: int, screen_width: int) -> int:
    """Convert normalized x coordinate (0-1000) to actual pixel coordinate."""
    return int(x / 1000 * screen_width)

def normalize_y(y: int, screen_height: int) -> int:
    """Convert normalized y coordinate (0-1000) to actual pixel coordinate."""
    return int(y / 1000 * screen_height)

def execute_function_calls(response, page, screen_width: int, screen_height: int) -> List[Tuple[str, Any]]:
    """
    Extract and execute function calls from Gemini response.

    Args:
        response: Gemini API response object
        page: Playwright page object
        screen_width: Screen width in pixels
        screen_height: Screen height in pixels

    Returns:
        List of tuples: [(function_name, result), ...]
    """
    # Extract function calls and thoughts from the model's response
    candidate = response.candidates[0]
    function_calls = []
    thoughts = []
    for part in candidate.content.parts:
        if hasattr(part, 'function_call') and part.function_call:
            function_calls.append(part.function_call)
        elif hasattr(part, 'text') and part.text:
            thoughts.append(part.text)
    if thoughts:
        print(f"Model Reasoning: {' '.join(thoughts)}")

    # Execute each function call
    results = []
    for function_call in function_calls:
        result = None
        try:
            if function_call.name == "open_web_browser":
                print("Executing open_web_browser")
                # Browser is already open via Playwright, so this is a no-op
                result = "success"
            elif function_call.name == "click_at":
                actual_x = normalize_x(function_call.args["x"], screen_width)
                actual_y = normalize_y(function_call.args["y"], screen_height)
                print(f"Executing click_at: ({actual_x}, {actual_y})")
                page.mouse.click(actual_x, actual_y)
                result = "success"
            elif function_call.name == "type_text_at":
                actual_x = normalize_x(function_call.args["x"], screen_width)
                actual_y = normalize_y(function_call.args["y"], screen_height)
                text = function_call.args["text"]
                press_enter = function_call.args.get("press_enter", False)
                clear_before_typing = function_call.args.get("clear_before_typing", True)
                print(f"Executing type_text_at: ({actual_x}, {actual_y}) text='{text}'")
                # Click at the specified location
                page.mouse.click(actual_x, actual_y)
                time.sleep(0.1)
                # Clear existing text if requested
                if clear_before_typing:
                    page.keyboard.press("Control+A")
                    page.keyboard.press("Backspace")
                # Type the text
                page.keyboard.type(text)
                # Press enter if requested
                if press_enter:
                    page.keyboard.press("Enter")
                result = "success"
            else:
                # For any functions not parsed above
                print(f"Unrecognized function: {function_call.name}")
                result = "unknown_function"
        except Exception as e:
            print(f"Error executing {function_call.name}: {e}")
            result = f"error: {str(e)}"
        results.append((function_call.name, result))
    return results
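As a worked example of the coordinate conversion, the sketch below scales normalized values with integer arithmetic and clamps the result to the viewport. `denormalize` is a hypothetical helper, the 1440x900 screen size is an assumed value, and clamping is a design choice made here rather than something this guide prescribes.

```python
def denormalize(value: int, size_px: int, grid: int = 1000) -> int:
    """Scale a normalized coordinate to pixels and keep it inside the screen."""
    pixel = value * size_px // grid  # integer math avoids float rounding
    return max(0, min(pixel, size_px - 1))


# The type_text_at call shown earlier targets (371, 470) on the 1000x1000 grid.
# On an assumed 1440x900 screen that maps to pixel (534, 423).
print(denormalize(371, 1440), denormalize(470, 900))  # 534 423
```

Clamping keeps an out-of-range value such as 1000 from producing a click just outside the viewport.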
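If the returned safety_decision is require_confirmation, you must ask the user to confirm before proceeding with the action. Per the terms of service, you are not allowed to bypass requests for human confirmation.
The following code adds safety logic to the previous code: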
import termcolor

def get_safety_confirmation(safety_decision):
    """Prompt user for confirmation when safety check is triggered."""
    termcolor.cprint("Safety service requires explicit confirmation!", color="red")
    print(safety_decision["explanation"])
    decision = ""
    while decision.lower() not in ("y", "n", "ye", "yes", "no"):
        decision = input("Do you wish to proceed? [Y]es/[N]o\n")
    if decision.lower() in ("n", "no"):
        return "TERMINATE"
    return "CONTINUE"

def execute_function_calls(response, page, screen_width: int, screen_height: int):
    # ... Extract function calls from response ...
    for function_call in function_calls:
        extra_fr_fields = {}
        # Check for safety decision
        if 'safety_decision' in function_call.args:
            decision = get_safety_confirmation(function_call.args['safety_decision'])
            if decision == "TERMINATE":
                print("Terminating agent loop")
                break
            extra_fr_fields["safety_acknowledgement"] = "true"
        # ... Execute function call and append to results ...
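Capture the new state
After executing the actions, send the results of the function execution back to the model so that it can use this information to generate the next action. If multiple actions (parallel calls) were executed, you must send a FunctionResponse for each one in the subsequent user turn. For user-defined functions, the FunctionResponse should contain the return value of the executed function.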
function_response_parts = []
for name, result in results:
    # Take screenshot after each action
    screenshot = page.screenshot()
    current_url = page.url
    function_response_parts.append(
        FunctionResponse(
            name=name,
            response={"url": current_url},  # Add safety acknowledgement fields here when required
            parts=[
                types.FunctionResponsePart(
                    inline_data=types.FunctionResponseBlob(
                        mime_type="image/png", data=screenshot
                    )
                )
            ]
        )
    )

# Create the user feedback content with all responses
user_feedback_content = Content(
    role="user",
    parts=function_response_parts
)

# Append this feedback to the 'contents' history list for the next API call
contents.append(user_feedback_content)
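Build an agent loop
To enable multi-step interactions, combine the preceding steps into a loop. The loop must handle parallel function calls. Remember to manage the conversation history (the contents array) correctly by appending both the model responses and your function responses.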
Python
import time

from google import genai
from google.genai.types import Content, Part
from playwright.sync_api import sync_playwright

def has_function_calls(response):
    """Check if response contains any function calls."""
    candidate = response.candidates[0]
    return any(hasattr(part, 'function_call') and part.function_call
               for part in candidate.content.parts)

def main():
    client = genai.Client()
    # ... (config setup from "Send a request to model" section) ...
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=False)
        page = browser.new_page()
        page.goto("https://www.google.com")
        screen_width, screen_height = 1920, 1080
        # ... (initial contents setup from "Send a request to model" section) ...

        # Agent loop: iterate until model provides final answer
        for iteration in range(10):
            print(f"\nIteration {iteration + 1}\n")

            # 1. Send request to model (see "Send a request to model" section)
            response = client.models.generate_content(
                model='gemini-2.5-computer-use-preview-10-2025',
                contents=contents,
                config=generate_content_config,
            )
            contents.append(response.candidates[0].content)

            # 2. Check if done - no function calls means final answer
            if not has_function_calls(response):
                print(f"FINAL RESPONSE:\n{response.text}")
                break

            # 3. Execute actions (see "Execute the received actions" section)
            results = execute_function_calls(response, page, screen_width, screen_height)
            time.sleep(1)

            # 4. Capture state and create feedback (see "Capture the New State" section)
            # create_feedback is not defined in this guide; it is assumed to wrap
            # the code from that section and return the user feedback Content.
            contents.append(create_feedback(results, page))

        input("\nPress Enter to close browser...")
        browser.close()

if __name__ == "__main__":
    main()
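Use the Computer Use model and tool for mobile use cases
The following example shows how to define custom functions (such as open_app, long_press_at, and go_home), combine them with Gemini's built-in Computer Use tool, and exclude unneeded browser-specific functions. With these custom functions registered, the model can call them alongside the standard UI actions to complete tasks in non-browser environments.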
from typing import Optional, Dict, Any

from google import genai
from google.genai import types
from google.genai.types import Content, Part

client = genai.Client()

def open_app(app_name: str, intent: Optional[str] = None) -> Dict[str, Any]:
    """Opens an app by name.

    Args:
        app_name: Name of the app to open (any string).
        intent: Optional deep-link or action to pass when launching, if the app supports it.

    Returns:
        JSON payload acknowledging the request (app name and optional intent).
    """
    return {"status": "requested_open", "app_name": app_name, "intent": intent}

def long_press_at(x: int, y: int, duration_ms: int = 500) -> Dict[str, int]:
    """Long-press at a specific screen coordinate.

    Args:
        x: X coordinate (absolute), scaled to the device screen width (pixels).
        y: Y coordinate (absolute), scaled to the device screen height (pixels).
        duration_ms: Press duration in milliseconds. Defaults to 500.

    Returns:
        Object with the coordinates pressed and the duration used.
    """
    return {"x": x, "y": y, "duration_ms": duration_ms}

def go_home() -> Dict[str, str]:
    """Navigates to the device home screen.

    Returns:
        A small acknowledgment payload.
    """
    return {"status": "home_requested"}

# Build function declarations
CUSTOM_FUNCTION_DECLARATIONS = [
    types.FunctionDeclaration.from_callable(client=client, callable=open_app),
    types.FunctionDeclaration.from_callable(client=client, callable=long_press_at),
    types.FunctionDeclaration.from_callable(client=client, callable=go_home),
]

# Exclude browser functions
EXCLUDED_PREDEFINED_FUNCTIONS = [
    "open_web_browser",
    "search",
    "navigate",
    "hover_at",
    "scroll_document",
    "go_forward",
    "key_combination",
    "drag_and_drop",
]

# Utility function to construct a GenerateContentConfig
def make_generate_content_config() -> genai.types.GenerateContentConfig:
    """Return a fixed GenerateContentConfig with Computer Use + custom functions."""
    return genai.types.GenerateContentConfig(
        tools=[
            types.Tool(
                computer_use=types.ComputerUse(
                    environment=types.Environment.ENVIRONMENT_BROWSER,
                    excluded_predefined_functions=EXCLUDED_PREDEFINED_FUNCTIONS,
                )
            ),
            types.Tool(function_declarations=CUSTOM_FUNCTION_DECLARATIONS),
        ]
    )

# Create the content with user message
contents: list[Content] = [
    Content(
        role="user",
        parts=[
            # text instruction
            Part(text="Open Chrome, then long-press at 200,400."),
            # optional screenshot attachment
            # (screenshot_image_bytes is a placeholder for PNG bytes captured by your app)
            Part.from_bytes(
                data=screenshot_image_bytes,
                mime_type="image/png",
            ),
        ],
    )
]

# Build your fixed config (from helper)
config = make_generate_content_config()

# Generate content with the configured settings
response = client.models.generate_content(
    model="gemini-2.5-computer-use-preview-10-2025",
    contents=contents,
    config=config,
)
print(response)
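Supported actions
The Computer Use model and tool let the model request the following actions using a FunctionCall. Your client-side code must implement the execution logic for these actions. See the reference implementation for examples.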
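Command name | Description | Arguments (in function call) | Example function call |
---|---|---|---|
open_web_browser | Opens the web browser. | None | {"name": "open_web_browser", "args": {}} |
wait_5_seconds | Pauses execution for 5 seconds to allow dynamic content to load or animations to complete. | None | {"name": "wait_5_seconds", "args": {}} |
go_back | Navigates to the previous page in the browser's history. | None | {"name": "go_back", "args": {}} |
go_forward | Navigates to the next page in the browser's history. | None | {"name": "go_forward", "args": {}} |
search | Navigates to the default search engine's homepage (for example, Google). Useful for starting a new search task. | None | {"name": "search", "args": {}} |
navigate | Navigates the browser directly to the specified URL. | url: str | {"name": "navigate", "args": {"url": "https://www.wikipedia.org"}} |
click_at | Clicks at a specific coordinate on the web page. The x and y values are based on a 1000x1000 grid and are scaled to the screen dimensions. | y: int (0-999), x: int (0-999) | {"name": "click_at", "args": {"y": 300, "x": 500}} |
hover_at | Hovers the mouse at a specific coordinate on the web page. Useful for revealing sub-menus. x and y are based on a 1000x1000 grid. | y: int (0-999), x: int (0-999) | {"name": "hover_at", "args": {"y": 150, "x": 250}} |
type_text_at | Types text at a specific coordinate. By default, clears the field first and presses ENTER after typing, but both behaviors can be disabled. x and y are based on a 1000x1000 grid. | y: int (0-999), x: int (0-999), text: str, press_enter: bool (optional, default True), clear_before_typing: bool (optional, default True) | {"name": "type_text_at", "args": {"y": 250, "x": 400, "text": "search query", "press_enter": false}} |
key_combination | Presses keyboard keys or key combinations, such as "Ctrl+C" or "Enter". Useful for triggering actions (such as submitting a form with "Enter") or clipboard operations. | keys: str (for example, "enter" or "control+c"; see the API reference for the full list of allowed keys) | {"name": "key_combination", "args": {"keys": "Control+A"}} |
scroll_document | Scrolls the entire web page "up", "down", "left", or "right". | direction: str ("up", "down", "left", or "right") | {"name": "scroll_document", "args": {"direction": "down"}} |
scroll_at | Scrolls a specific element or area at coordinate (x, y) in the specified direction by a given magnitude. Coordinates and magnitude (default 800) are based on a 1000x1000 grid. | y: int (0-999), x: int (0-999), direction: str ("up", "down", "left", "right"), magnitude: int (0-999, optional, default 800) | {"name": "scroll_at", "args": {"y": 500, "x": 500, "direction": "down", "magnitude": 400}} |
drag_and_drop | Drags an element from a starting coordinate (x, y) and drops it at a destination coordinate (destination_x, destination_y). All coordinates are based on a 1000x1000 grid. | y: int (0-999), x: int (0-999), destination_y: int (0-999), destination_x: int (0-999) | {"name": "drag_and_drop", "args": {"y": 100, "x": 100, "destination_y": 500, "destination_x": 500}} |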
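One way to organize the execution logic for these actions is a dispatch table keyed by action name. The sketch below is an assumption about structure only: `RecordingBrowser` is a hypothetical stand-in that records what a real browser driver would be asked to do, `HANDLERS` and `dispatch` are illustrative names, and only three of the supported actions are wired up.

```python
from typing import Any, Callable, Dict, List


class RecordingBrowser:
    """Hypothetical stand-in for a browser driver; it only records requests."""

    def __init__(self) -> None:
        self.log: List[str] = []

    def goto(self, url: str) -> None:
        self.log.append(f"goto {url}")

    def press(self, keys: str) -> None:
        self.log.append(f"press {keys}")

    def scroll(self, direction: str) -> None:
        self.log.append(f"scroll {direction}")


# Map each handled action name to a callable taking (browser, args).
HANDLERS: Dict[str, Callable[[RecordingBrowser, Dict[str, Any]], None]] = {
    "navigate": lambda browser, args: browser.goto(args["url"]),
    "key_combination": lambda browser, args: browser.press(args["keys"]),
    "scroll_document": lambda browser, args: browser.scroll(args["direction"]),
}


def dispatch(browser: RecordingBrowser, call: Dict[str, Any]) -> str:
    """Run one function call and return a status string for the response."""
    handler = HANDLERS.get(call["name"])
    if handler is None:
        return "unknown_function"
    try:
        handler(browser, call.get("args", {}))
    except KeyError as missing:
        return f"error: missing argument {missing}"
    return "success"


browser = RecordingBrowser()
status = dispatch(browser, {"name": "navigate", "args": {"url": "https://www.wikipedia.org"}})
print(status, browser.log)
```

Adding an action then means adding one entry to the table, and any action you list in excluded_predefined_functions can simply be left out of it.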
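Safety and security
This section describes the safeguards that the Computer Use model and tool have in place to improve user control and safety. It also describes best practices for mitigating the new risks that the tool may introduce.
Acknowledge safety decision
Depending on the action, the response from the Computer Use model and tool might include a safety_decision from an internal safety system. This decision reflects a safety check of the action that the tool proposes.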
{
  "content": {
    "parts": [
      {
        "text": "I have evaluated step 2. It seems Google detected unusual traffic and is asking me to verify I'm not a robot. I need to click the 'I'm not a robot' checkbox located near the top left (y=98, x=95)."
      },
      {
        "function_call": {
          "name": "click_at",
          "args": {
            "x": 60,
            "y": 100,
            "safety_decision": {
              "explanation": "I have encountered a CAPTCHA challenge that requires interaction. I need you to complete the challenge by clicking the 'I'm not a robot' checkbox and any subsequent verification steps.",
              "decision": "require_confirmation"
            }
          }
        }
      }
    ]
  }
}
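If the safety_decision is require_confirmation, you must ask the end user to confirm before proceeding with the action.
The following code sample prompts the end user for confirmation before executing the action. If the user doesn't confirm the action, the loop terminates. If the user confirms the action, the action is executed and the safety_acknowledgement field is marked as True.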
import termcolor

def get_safety_confirmation(safety_decision):
    """Prompt user for confirmation when safety check is triggered."""
    termcolor.cprint("Safety service requires explicit confirmation!", color="red")
    print(safety_decision["explanation"])
    decision = ""
    while decision.lower() not in ("y", "n", "ye", "yes", "no"):
        decision = input("Do you wish to proceed? [Y]es/[N]o\n")
    if decision.lower() in ("n", "no"):
        return "TERMINATE"
    return "CONTINUE"

def execute_function_calls(response, page, screen_width: int, screen_height: int):
    # ... Extract function calls from response ...
    for function_call in function_calls:
        extra_fr_fields = {}
        # Check for safety decision
        if 'safety_decision' in function_call.args:
            decision = get_safety_confirmation(function_call.args['safety_decision'])
            if decision == "TERMINATE":
                print("Terminating agent loop")
                break
            extra_fr_fields["safety_acknowledgement"] = "true"  # Safety acknowledgement
        # ... Execute function call and append to results ...
If the user confirms, you must include the safety acknowledgement in your FunctionResponse.
function_response_parts.append(
    FunctionResponse(
        name=name,
        response={"url": current_url,
                  **extra_fr_fields},  # Include safety acknowledgement
        parts=[
            types.FunctionResponsePart(
                inline_data=types.FunctionResponseBlob(
                    mime_type="image/png", data=screenshot
                )
            )
        ]
    )
)
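The snippets above leave open how extra_fr_fields travels from the place where the user confirms to the place where the FunctionResponse is built. One option, sketched here with plain dicts and no SDK types, is to return the extra fields alongside each result. `ActionResult` and `build_response_payload` are hypothetical names; only the url and safety_acknowledgement keys come from this guide.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class ActionResult:
    """Outcome of one executed function call (hypothetical container)."""

    name: str
    status: str
    extra_fields: Dict[str, Any] = field(default_factory=dict)


def build_response_payload(result: ActionResult, current_url: str) -> Dict[str, Any]:
    """Build the dict that would be passed as response= for one function call."""
    return {"url": current_url, **result.extra_fields}


# A confirmed action carries the acknowledgement; an ordinary one does not.
confirmed = ActionResult("click_at", "success", {"safety_acknowledgement": "true"})
plain = ActionResult("type_text_at", "success")

print(build_response_payload(confirmed, "https://www.google.com"))
print(build_response_payload(plain, "https://www.google.com"))
```

Keeping the acknowledgement attached to the individual result avoids applying one action's confirmation to another action returned in the same turn.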
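Best practices for safety
The Computer Use model and tool are novel and present new risks that developers should be mindful of:
- Untrusted content and scams: The model tries to achieve the user's goal, but in doing so it may rely on untrusted sources of information and on instructions shown on screen. For example, if the user's goal is to purchase a Pixel phone and the model encounters a "Free Pixel if you complete a survey" scam, the model might complete the survey.
- Occasional unintended actions: The model can misinterpret a user's goal or web page content and take an incorrect action, such as clicking the wrong button or filling in the wrong form. This can lead to failed tasks or data exfiltration.
- Policy violations: The API's capabilities could be directed, intentionally or unintentionally, toward activities that violate Google's policies (the Generative AI Prohibited Use Policy and the Gemini API Additional Terms of Service). This includes actions that could interfere with a system's integrity, compromise security, bypass security measures such as CAPTCHAs, or control medical devices.
To address these risks, you can implement the following safety measures and best practices:
- Human-in-the-loop (HITL):
  - Implement user confirmation: When the safety response indicates require_confirmation, you must implement user confirmation before execution.
  - Provide custom safety instructions: In addition to the built-in user confirmation checks, developers can optionally add a custom system instruction that enforces their own safety policies, either to block certain model actions or to require user confirmation before the model takes certain high-stakes, irreversible actions. The following is an example of a custom safety system instruction that you can include when interacting with the model.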
## **RULE 1: Seek User Confirmation (USER_CONFIRMATION)**
This is your first and most important check. If the next required action falls into any of the following categories, you MUST stop immediately, and seek the user's explicit permission.
**Procedure for Seeking Confirmation:**
* **For Consequential Actions:** Perform all preparatory steps (e.g., navigating, filling out forms, typing a message). You will ask for confirmation **AFTER** all necessary information is entered on the screen, but **BEFORE** you perform the final, irreversible action (e.g., before clicking "Send", "Submit", "Confirm Purchase", "Share").
* **For Prohibited Actions:** If the action is strictly forbidden (e.g., accepting legal terms, solving a CAPTCHA), you must first inform the user about the required action and ask for their confirmation to proceed.
**USER_CONFIRMATION Categories:**
* **Consent and Agreements:** You are FORBIDDEN from accepting, selecting, or agreeing to any of the following on the user's behalf. You must ask the user to confirm before performing these actions.
    * Terms of Service
    * Privacy Policies
    * Cookie consent banners
    * End User License Agreements (EULAs)
    * Any other legally significant contracts or agreements.
* **Robot Detection:** You MUST NEVER attempt to solve or bypass the following. You must ask the user to confirm before performing these actions.
    * CAPTCHAs (of any kind)
    * Any other anti-robot or human-verification mechanisms, even if you are capable.
* **Financial Transactions:**
    * Completing any purchase.
    * Managing or moving money (e.g., transfers, payments).
    * Purchasing regulated goods or participating in gambling.
* **Sending Communications:**
    * Sending emails.
    * Sending messages on any platform (e.g., social media, chat apps).
    * Posting content on social media or forums.
* **Accessing or Modifying Sensitive Information:**
    * Health, financial, or government records (e.g., medical history, tax forms, passport status).
    * Revealing or modifying sensitive personal identifiers (e.g., SSN, bank account number, credit card number).
* **User Data Management:**
    * Accessing, downloading, or saving files from the web.
    * Sharing or sending files/data to any third party.
    * Transferring user data between systems.
* **Browser Data Usage:**
    * Accessing or managing Chrome browsing history, bookmarks, autofill data, or saved passwords.
* **Security and Identity:**
    * Logging into any user account.
    * Any action that involves misrepresentation or impersonation (e.g., creating a fan account, posting as someone else).
* **Insurmountable Obstacles:** If you are technically unable to interact with a user interface element or are stuck in a loop you cannot resolve, ask the user to take over.
---
## **RULE 2: Default Behavior (ACTUATE)**
If an action does **NOT** fall under the conditions for `USER_CONFIRMATION`, your default behavior is to **Actuate**.
**Actuation Means:** You MUST proactively perform all necessary steps to move the user's request forward. Continue to actuate until you either complete the non-consequential task or encounter a condition defined in Rule 1.
* **Example 1:** If asked to send money, you will navigate to the payment portal, enter the recipient's details, and enter the amount. You will then **STOP** as per Rule 1 and ask for confirmation before clicking the final "Send" button.
* **Example 2:** If asked to post a message, you will navigate to the site, open the post composition window, and write the full message. You will then **STOP** as per Rule 1 and ask for confirmation before clicking the final "Post" button.
After the user has confirmed, remember to get the user's latest screen before continuing to perform actions.
# Final Response Guidelines:
Write final response to the user in these cases:
- User confirmation
- When the task is complete or you have enough information to respond to the user
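- Secure execution environment: Run your agent in a secure, sandboxed environment to limit its potential impact (for example, a sandboxed virtual machine (VM), a container such as Docker, or a dedicated browser profile with limited permissions).
- Input sanitization: Sanitize all user-generated text in prompts to reduce the risk of unintended instructions or prompt injection. This is a helpful layer of security, but it is not a replacement for a secure execution environment.
- Allowlists and blocklists: Implement filtering mechanisms to control where the model can navigate and what it can do. A blocklist of prohibited websites is a good starting point, while a more restrictive allowlist is even more secure.
- Observability and logging: Maintain detailed logs for debugging, auditing, and incident response. Your client should log prompts, screenshots, model-suggested actions (function_call), safety responses, and all actions that the client ultimately executes.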
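As an illustration of the allowlist idea, the sketch below checks the target of a navigate call before it is executed. The host list and the `is_navigation_allowed` helper are hypothetical, and the check covers only explicit navigate calls. A production filter would likely also need to handle redirects and links that the model clicks.

```python
from urllib.parse import urlsplit

# Example values only; replace with the hosts your agent is allowed to visit.
ALLOWED_HOSTS = frozenset({"www.google.com", "www.wikipedia.org"})


def is_navigation_allowed(function_call: dict, allowed_hosts: frozenset = ALLOWED_HOSTS) -> bool:
    """Return False only for navigate calls whose target is not on the allowlist."""
    if function_call.get("name") != "navigate":
        return True  # other actions are not filtered by this check
    url = function_call.get("args", {}).get("url", "")
    parts = urlsplit(url)
    # Require a web scheme and an exact host match.
    return parts.scheme in ("http", "https") and parts.hostname in allowed_hosts


print(is_navigation_allowed({"name": "navigate", "args": {"url": "https://www.wikipedia.org"}}))
print(is_navigation_allowed({"name": "navigate", "args": {"url": "https://example.com/login"}}))
```

Matching on the exact host rather than a substring is a deliberate choice in this sketch, so that a URL such as https://www.wikipedia.org.example.com is not accepted by accident.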
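Pricing
The Computer Use model and tool are priced at the same rates as Gemini 2.5 Pro and use the same SKUs. To split out Computer Use model and tool costs, use custom metadata labels. For more information about using custom metadata labels for cost monitoring, see "Custom metadata labels".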