Computer Use model and tool

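The Gemini 2.5 Computer Use model and tool let your applications interact with a browser and automate tasks in it. Using screenshots, the Computer Use model can infer information about a computer screen and act on it by generating specific UI actions such as mouse clicks and keyboard input. As with function calling, you need to write client-side application code that receives the Computer Use model and tool function calls and executes the corresponding actions.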

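With the Computer Use model and tool, you can build agents that can:

  • Automate repetitive data entry or form filling on websites.
  • Navigate websites to gather information.
  • Assist users by performing sequences of actions in web applications.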

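This guide assumes you are using the Gen AI SDK for Python and are familiar with the Playwright API.

During this preview, the Computer Use model and tool are not supported in other SDK languages or in the Google Cloud console.

You can also view a reference implementation of the Computer Use model and tool on GitHub.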

How the Computer Use model and tool work

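Instead of generating text responses, the Computer Use model and tool determine when to perform specific UI actions, such as mouse clicks, and return the parameters needed to execute them. You need to write client-side application code that receives the Computer Use model and tool function_call and executes the corresponding action.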

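Interactions with the Computer Use model and tool follow an agentic loop:

  1. Send a request to the model

    • Add the Computer Use model and tool, and optionally any other tools, to your API request.
    • Prompt the Computer Use model and tool with the user's request and a screenshot that represents the current state of the GUI.
  2. Receive the model response

    • The model analyzes the user request and the screenshot, then generates a response that includes a suggested function_call representing a UI action (for example, "click at coordinate (x,y)" or "type 'text'"). For the list of all actions the model supports, see "Supported actions".
    • The API response may also include a safety_response from an internal safety system that has checked the model's proposed action. The safety_response classifies the action as one of the following:
      • Regular or allowed: The action is considered safe. This may also be represented by the absence of a safety_response.
      • Requires confirmation: The model is about to perform an action that may be risky (for example, clicking an "accept cookies" banner).
  3. Execute the received action

    • Your client-side code receives the function_call and any accompanying safety_response.
    • If the safety_response indicates regular or allowed (or if no safety_response is present), your client-side code can execute the specified function_call in the target environment (such as a web browser).
    • If the safety_response indicates that confirmation is required, your application must prompt the user for confirmation before executing the function_call. If the user confirms, proceed with the action. If the user declines, don't execute the action.
  4. Capture the new environment state

    • If the action was executed, your client captures a new screenshot of the GUI along with the current URL, and sends both back to the Computer Use model and tool as part of a function_response.
    • If the safety system blocked an action or the user declined to confirm it, your application might send a different form of feedback to the model or end the interaction.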

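The updated state is sent to the model, and the process repeats from step 2: the Computer Use model and tool use the new screenshot (if provided) and the ongoing goal to suggest the next action. The loop continues until the task is complete, an error occurs, or the process is terminated (for example, when a response is blocked by safety filters or by a user decision).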
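The gating rule in steps 3 and 4 can be sketched as a small pure function. This is a minimal illustration rather than part of the SDK: the function name next_step and its return values are hypothetical, and treating any unrecognized decision value as requiring confirmation is a conservative assumption made here.

```python
from typing import Optional


def next_step(safety_decision: Optional[dict], user_confirmed: bool = False) -> str:
    """Return "execute" or "stop" for one proposed action.

    safety_decision is the optional dict attached to a function call,
    for example {"decision": "require_confirmation", "explanation": "..."}.
    """
    # No safety decision attached: the action is treated as regular/allowed.
    if not safety_decision:
        return "execute"

    # Assumption: any attached decision (including values this sketch does
    # not recognize) is handled as if confirmation were required.
    return "execute" if user_confirmed else "stop"


print(next_step(None))
print(next_step({"decision": "require_confirmation"}, user_confirmed=False))
print(next_step({"decision": "require_confirmation"}, user_confirmed=True))
```

Keeping this decision separate from the browser code makes it easy to unit test that an action is never executed without the user's confirmation.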

The following diagram illustrates how the Computer Use model and tool work:

[Figure: Computer Use model and tool overview]

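Enable the Computer Use model and tool

To enable the Computer Use model and tool, use gemini-2.5-computer-use-preview-10-2025 as your model and add the Computer Use model and tool to the list of enabled tools: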

Python

from google import genai
from google.genai import types
from google.genai.types import Content, Part, FunctionResponse

client = genai.Client()

# Add the Computer Use model and tool to the list of enabled tools
generate_content_config = types.GenerateContentConfig(
    tools=[
        types.Tool(
            computer_use=types.ComputerUse(
                environment=types.Environment.ENVIRONMENT_BROWSER,
            )
        ),
    ]
)

# Example request using the Computer Use model and tool
contents = [
    Content(
        role="user",
        parts=[
            Part(text="Go to google.com and search for 'weather in New York'"),
        ],
    )
]

response = client.models.generate_content(
    model="gemini-2.5-computer-use-preview-10-2025",
    contents=contents,
    config=generate_content_config,
)
      

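Send a request

After you configure the Computer Use model and tool, send a prompt to the model that includes the user's goal and an initial screenshot of the GUI.

You can also optionally add the following:

  • Excluded actions: If the list of supported UI actions contains any that you don't want the model to take, specify them in excluded_predefined_functions.
  • User-defined functions: In addition to the Computer Use model and tool, you might want to include custom user-defined functions.

The following sample code enables the Computer Use model and tool and sends the request to the model: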

Python

from google import genai
from google.genai import types
from google.genai.types import Content, Part

client = genai.Client()

# Specify predefined functions to exclude (optional)
excluded_functions = ["drag_and_drop"]

# Configuration for the Computer Use model and tool with browser environment
generate_content_config = genai.types.GenerateContentConfig(
    tools=[
        # 1. Computer Use model and tool with browser environment
        types.Tool(
            computer_use=types.ComputerUse(
                environment=types.Environment.ENVIRONMENT_BROWSER,
                # Optional: Exclude specific predefined functions
                excluded_predefined_functions=excluded_functions
                )
              ),
        # 2. Optional: Custom user-defined functions (must be defined above)
        # types.Tool(
        #     function_declarations=custom_functions
        # ),
    ],
)

# Create the content with user message
contents: list[Content] = [
    Content(
        role="user",
        parts=[
            Part(text="Search for highly rated smart fridges with touchscreen, 2 doors, around 25 cu ft, priced below 4000 dollars on Google Shopping. Create a bulleted list of the 3 cheapest options in the format of name, description, price in an easy-to-read layout."),
            # Optional: include a screenshot of the initial state
            # Part.from_bytes(
                 # data=screenshot_image_bytes,
                 # mime_type='image/png',
            # ),
        ],
    )
]

# Generate content with the configured settings
response = client.models.generate_content(
    model='gemini-2.5-computer-use-preview-10-2025',
    contents=contents,
    config=generate_content_config,
)

# Print the response output
print(response.text)
      

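You can also extend the model's capabilities by including custom user-defined functions. See "Use the Computer Use model and tool for mobile use cases" to learn how to configure Computer Use for mobile use cases by adding actions such as open_app, long_press_at, and go_home while excluding browser-specific actions.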

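Receive a response

If the model determines that UI actions or user-defined functions are needed to complete the task, it returns one or more FunctionCalls. Your application code must parse these actions, execute them, and collect the results. The Computer Use model and tool support parallel function calling, which means the model can return multiple independent actions in a single turn.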

{
  "content": {
    "parts": [
      {
        "text": "I will type the search query into the search bar. The search bar is in the center of the page."
      },
      {
        "function_call": {
          "name": "type_text_at",
          "args": {
            "x": 371,
            "y": 470,
            "text": "highly rated smart fridges with touchscreen, 2 doors, around 25 cu ft, priced below 4000 dollars on Google Shopping",
            "press_enter": true
          }
        }
      }
    ]
  }
}
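As a minimal sketch of the parsing step, the following code separates the model's reasoning text from its function calls. It operates on a plain dict shaped like the JSON shown above rather than on SDK response objects, and the helper name split_parts is hypothetical.

```python
from typing import Any, Dict, List, Tuple


def split_parts(response: Dict[str, Any]) -> Tuple[List[str], List[Dict[str, Any]]]:
    """Split a response dict into (reasoning texts, function calls)."""
    thoughts: List[str] = []
    calls: List[Dict[str, Any]] = []
    for part in response.get("content", {}).get("parts", []):
        if "function_call" in part:
            calls.append(part["function_call"])
        elif "text" in part:
            thoughts.append(part["text"])
    return thoughts, calls


example_response = {
    "content": {
        "parts": [
            {"text": "I will type the search query into the search bar."},
            {
                "function_call": {
                    "name": "type_text_at",
                    "args": {"x": 371, "y": 470, "text": "smart fridges", "press_enter": True},
                }
            },
        ]
    }
}

thoughts, calls = split_parts(example_response)
print(thoughts)
print([call["name"] for call in calls])
```

Because the model can return several function calls in one turn, the helper returns a list and preserves the order in which the calls appear.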

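Depending on the action, the API response may also return a safety_response, which appears as a safety_decision in the function call arguments: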

{
  "content": {
    "parts": [
      {
        "text": "I have evaluated step 2. It seems Google detected unusual traffic and is asking me to verify I'm not a robot. I need to click the 'I'm not a robot' checkbox located near the top left (y=98, x=95)."
      },
      {
        "function_call": {
          "name": "click_at",
          "args": {
            "x": 60,
            "y": 100,
            "safety_decision": {
              "explanation": "I have encountered a CAPTCHA challenge that requires interaction. I need you to complete the challenge by clicking the 'I'm not a robot' checkbox and any subsequent verification steps.",
              "decision": "require_confirmation"
            }
          }
        }
      }
    ]
  }
}

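Execute the received actions

After you receive a response, your application code needs to execute the received actions.

The following code extracts function calls from the Gemini response, converts coordinates from the normalized 1000x1000 grid to actual pixels, executes the browser actions using Playwright, and returns the success or failure status of each action: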

import time
from typing import Any, List, Tuple


def normalize_x(x: int, screen_width: int) -> int:
    """Convert normalized x coordinate (0-1000) to actual pixel coordinate."""
    return int(x / 1000 * screen_width)


def normalize_y(y: int, screen_height: int) -> int:
    """Convert normalized y coordinate (0-1000) to actual pixel coordinate."""
    return int(y / 1000 * screen_height)


def execute_function_calls(response, page, screen_width: int, screen_height: int) -> List[Tuple[str, Any]]:
    """
    Extract and execute function calls from Gemini response.

    Args:
        response: Gemini API response object
        page: Playwright page object
        screen_width: Screen width in pixels
        screen_height: Screen height in pixels

    Returns:
        List of tuples: [(function_name, result), ...]
    """
    # Extract function calls and thoughts from the model's response
    candidate = response.candidates[0]
    function_calls = []
    thoughts = []

    for part in candidate.content.parts:
        if hasattr(part, 'function_call') and part.function_call:
            function_calls.append(part.function_call)
        elif hasattr(part, 'text') and part.text:
            thoughts.append(part.text)

    if thoughts:
        print(f"Model Reasoning: {' '.join(thoughts)}")

    # Execute each function call
    results = []
    for function_call in function_calls:
        result = None

        try:
            if function_call.name == "open_web_browser":
                print("Executing open_web_browser")
                # Browser is already open via Playwright, so this is a no-op
                result = "success"

            elif function_call.name == "click_at":
                actual_x = normalize_x(function_call.args["x"], screen_width)
                actual_y = normalize_y(function_call.args["y"], screen_height)

                print(f"Executing click_at: ({actual_x}, {actual_y})")
                page.mouse.click(actual_x, actual_y)
                result = "success"

            elif function_call.name == "type_text_at":
                actual_x = normalize_x(function_call.args["x"], screen_width)
                actual_y = normalize_y(function_call.args["y"], screen_height)
                text = function_call.args["text"]
                # "Supported actions" documents press_enter as defaulting to True
                press_enter = function_call.args.get("press_enter", True)
                clear_before_typing = function_call.args.get("clear_before_typing", True)

                print(f"Executing type_text_at: ({actual_x}, {actual_y}) text='{text}'")

                # Click at the specified location
                page.mouse.click(actual_x, actual_y)
                time.sleep(0.1)

                # Clear existing text if requested
                if clear_before_typing:
                    page.keyboard.press("Control+A")
                    page.keyboard.press("Backspace")

                # Type the text
                page.keyboard.type(text)

                # Press enter if requested
                if press_enter:
                    page.keyboard.press("Enter")

                result = "success"

            else:
                # For any functions not parsed above
                print(f"Unrecognized function: {function_call.name}")
                result = "unknown_function"

        except Exception as e:
            print(f"Error executing {function_call.name}: {e}")
            result = f"error: {str(e)}"

        results.append((function_call.name, result))

    return results
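
To make the coordinate conversion concrete, here is a small worked example. It is a sketch under two assumptions: the screen is 1920x1080 pixels, and out-of-range values should be clamped so a click never lands outside the viewport. The helper name denormalize is hypothetical, and the clamping is an addition that the code above does not perform.

```python
def denormalize(value: int, size: int) -> int:
    """Map a coordinate on the 0-999 grid to a pixel index in [0, size - 1]."""
    # Clamp to the documented 0-999 range before scaling.
    clamped = max(0, min(999, value))
    # Integer arithmetic avoids floating-point rounding surprises.
    return min(size - 1, clamped * size // 1000)


# Coordinates taken from the earlier type_text_at example (x=371, y=470).
pixel_x = denormalize(371, 1920)
pixel_y = denormalize(470, 1080)
print((pixel_x, pixel_y))  # (712, 507)
```

The scaling only gives correct clicks if the width and height passed in match the size of the screenshot that was sent to the model.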

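If the returned safety_decision is require_confirmation, you must ask the user to confirm before proceeding with the action. Under the terms of service, you are not allowed to bypass requests for human confirmation.

The following code adds safety logic to the previous code: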

import termcolor


def get_safety_confirmation(safety_decision):
    """Prompt user for confirmation when safety check is triggered."""
    termcolor.cprint("Safety service requires explicit confirmation!", color="red")
    print(safety_decision["explanation"])

    decision = ""
    while decision.lower() not in ("y", "n", "ye", "yes", "no"):
        decision = input("Do you wish to proceed? [Y]es/[N]o\n")

    if decision.lower() in ("n", "no"):
        return "TERMINATE"
    return "CONTINUE"


def execute_function_calls(response, page, screen_width: int, screen_height: int):
    # ... Extract function calls from response ...

    for function_call in function_calls:
        extra_fr_fields = {}

        # Check for safety decision
        if 'safety_decision' in function_call.args:
            decision = get_safety_confirmation(function_call.args['safety_decision'])
            if decision == "TERMINATE":
                print("Terminating agent loop")
                break
            extra_fr_fields["safety_acknowledgement"] = "true"

        # ... Execute function call and append to results ...

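Capture the new state

After executing the actions, send the results of the function execution back to the model so it can use this information to generate the next action. If multiple actions (parallel calls) were executed, you must send a FunctionResponse for each one in the subsequent user turn. For user-defined functions, the FunctionResponse should contain the return value of the executed function.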

function_response_parts = []

for name, result in results:
    # Take screenshot after each action
    screenshot = page.screenshot()
    current_url = page.url

    function_response_parts.append(
        FunctionResponse(
            name=name,
            response={"url": current_url},  # Add safety_acknowledgement here if the user confirmed the action
            parts=[
                types.FunctionResponsePart(
                    inline_data=types.FunctionResponseBlob(
                       mime_type="image/png", data=screenshot
                    )
                )
            ]
        )
    )

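# Create the user feedback content, wrapping each FunctionResponse in a Part
# because Content.parts expects Part objects
user_feedback_content = Content(
    role="user",
    parts=[Part(function_response=fr) for fr in function_response_parts]
)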

# Append this feedback to the 'contents' history list for the next API call
contents.append(user_feedback_content)
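
The response dictionary passed to each FunctionResponse can be assembled by a small helper, sketched below. The helper name build_response_payload is hypothetical, and reporting failures under an "error" key is an assumption made for illustration, not a documented field.

```python
from typing import Any, Dict, Optional


def build_response_payload(
    url: str,
    result: Any,
    extra_fields: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Build the dict used as the FunctionResponse `response` argument."""
    payload: Dict[str, Any] = {"url": url}

    # Assumption: surface execution failures to the model under an "error" key.
    if isinstance(result, str) and result.startswith("error"):
        payload["error"] = result

    # Extra fields carry items such as the safety acknowledgement.
    payload.update(extra_fields or {})
    return payload


print(build_response_payload("https://www.google.com", "success"))
print(
    build_response_payload(
        "https://www.google.com",
        "success",
        {"safety_acknowledgement": "true"},
    )
)
```

Building the payload in one place keeps the current URL, the action outcome, and any safety acknowledgement together for every action in a parallel call.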

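Build an agent loop

Combine the previous steps into a loop to enable multi-step interactions. The loop must handle parallel function calls. Remember to manage the conversation history (the contents array) correctly by appending both the model responses and your function responses.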

Python

import time

from google import genai
from google.genai import types
from google.genai.types import Content, Part
from playwright.sync_api import sync_playwright


def has_function_calls(response):
    """Check if response contains any function calls."""
    candidate = response.candidates[0]
    return any(hasattr(part, 'function_call') and part.function_call
               for part in candidate.content.parts)


def main():
    client = genai.Client()

    # ... (config setup from "Send a request to model" section) ...

    with sync_playwright() as p:
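        # Keep the viewport in sync with the dimensions used to scale coordinates
        screen_width, screen_height = 1920, 1080

        browser = p.chromium.launch(headless=False)
        page = browser.new_page(
            viewport={"width": screen_width, "height": screen_height}
        )
        page.goto("https://www.google.com")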
        
        # ... (initial contents setup from "Send a request to model" section) ...

        # Agent loop: iterate until model provides final answer
        for iteration in range(10):
            print(f"\nIteration {iteration + 1}\n")

            # 1. Send request to model (see "Send a request to model" section)
            response = client.models.generate_content(
                model='gemini-2.5-computer-use-preview-10-2025',
                contents=contents,
                config=generate_content_config,
            )

            contents.append(response.candidates[0].content)

            # 2. Check if done - no function calls means final answer
            if not has_function_calls(response):
                print(f"FINAL RESPONSE:\n{response.text}")
                break

            # 3. Execute actions (see "Execute the received actions" section)
            results = execute_function_calls(response, page, screen_width, screen_height)
            time.sleep(1)

            # 4. Capture state and create feedback (see "Capture the New State" section).
            # create_feedback is assumed to wrap that section's code and return a Content.
            contents.append(create_feedback(results, page))

        input("\nPress Enter to close browser...")
        browser.close()


if __name__ == "__main__":
    main()
      

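Use the Computer Use model and tool for mobile use cases

The following example shows how to define custom functions (such as open_app, long_press_at, and go_home), combine them with Gemini's built-in Computer Use tool, and exclude unnecessary browser-specific functions. Once these custom functions are registered, the model can call them alongside the standard UI actions to complete tasks in non-browser environments.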

from typing import Optional, Dict, Any

from google import genai
from google.genai import types
from google.genai.types import Content, Part


client = genai.Client()

def open_app(app_name: str, intent: Optional[str] = None) -> Dict[str, Any]:
    """Opens an app by name.

    Args:
        app_name: Name of the app to open (any string).
        intent: Optional deep-link or action to pass when launching, if the app supports it.

    Returns:
        JSON payload acknowledging the request (app name and optional intent).
    """
    return {"status": "requested_open", "app_name": app_name, "intent": intent}


def long_press_at(x: int, y: int, duration_ms: int = 500) -> Dict[str, int]:
    """Long-press at a specific screen coordinate.

    Args:
        x: X coordinate (absolute), scaled to the device screen width (pixels).
        y: Y coordinate (absolute), scaled to the device screen height (pixels).
        duration_ms: Press duration in milliseconds. Defaults to 500.

    Returns:
        Object with the coordinates pressed and the duration used.
    """
    return {"x": x, "y": y, "duration_ms": duration_ms}


def go_home() -> Dict[str, str]:
    """Navigates to the device home screen.

    Returns:
        A small acknowledgment payload.
    """
    return {"status": "home_requested"}


#  Build function declarations
CUSTOM_FUNCTION_DECLARATIONS = [
    types.FunctionDeclaration.from_callable(client=client, callable=open_app),
    types.FunctionDeclaration.from_callable(client=client, callable=long_press_at),
    types.FunctionDeclaration.from_callable(client=client, callable=go_home),
]

# Exclude browser functions

EXCLUDED_PREDEFINED_FUNCTIONS = [
    "open_web_browser",
    "search",
    "navigate",
    "hover_at",
    "scroll_document",
    "go_forward",
    "key_combination",
    "drag_and_drop",
]

# Utility function to construct a GenerateContentConfig

def make_generate_content_config() -> genai.types.GenerateContentConfig:
    """Return a fixed GenerateContentConfig with Computer Use + custom functions."""
    return genai.types.GenerateContentConfig(
        tools=[
            types.Tool(
                computer_use=types.ComputerUse(
                    environment=types.Environment.ENVIRONMENT_BROWSER,
                    excluded_predefined_functions=EXCLUDED_PREDEFINED_FUNCTIONS,
                )
            ),
            types.Tool(function_declarations=CUSTOM_FUNCTION_DECLARATIONS),
        ]
    )


# Create the content with user message
contents: list[Content] = [
    Content(
        role="user",
        parts=[
            # text instruction
            Part(text="Open Chrome, then long-press at 200,400."),
            # optional screenshot attachment; screenshot_image_bytes must hold PNG bytes captured beforehand
            Part.from_bytes(
                data=screenshot_image_bytes,
                mime_type="image/png",
            ),
        ],
    )
]

# Build your fixed config (from helper)
config = make_generate_content_config()

# Generate content with the configured settings
response = client.models.generate_content(
    model="gemini-2.5-computer-use-preview-10-2025",
    contents=contents,
    config=config,
)

print(response)
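
When the model returns a call to one of these custom functions, your client still has to route it to local code. The following is a minimal sketch of such a router. The registry, the dispatch_custom_call helper, and the stand-in handlers are hypothetical; a real client would drive the device instead of returning acknowledgement payloads.

```python
from typing import Any, Callable, Dict


def go_home() -> Dict[str, str]:
    """Stand-in handler that mirrors the go_home function above."""
    return {"status": "home_requested"}


def long_press_at(x: int, y: int, duration_ms: int = 500) -> Dict[str, int]:
    """Stand-in handler that mirrors the long_press_at function above."""
    return {"x": x, "y": y, "duration_ms": duration_ms}


# Hypothetical registry that maps function names to local callables.
CUSTOM_HANDLERS: Dict[str, Callable[..., Dict[str, Any]]] = {
    "go_home": go_home,
    "long_press_at": long_press_at,
}


def dispatch_custom_call(
    name: str,
    args: Dict[str, Any],
    registry: Dict[str, Callable[..., Dict[str, Any]]],
) -> Dict[str, Any]:
    """Run the handler registered for `name` and return its result."""
    handler = registry.get(name)
    if handler is None:
        return {"error": f"unknown_function: {name}"}
    try:
        return handler(**args)
    except TypeError as exc:
        # Raised for missing or unexpected arguments (and for any TypeError
        # raised inside the handler itself).
        return {"error": f"bad_arguments: {exc}"}


print(dispatch_custom_call("go_home", {}, CUSTOM_HANDLERS))
print(dispatch_custom_call("long_press_at", {"x": 200, "y": 400}, CUSTOM_HANDLERS))
print(dispatch_custom_call("open_camera", {}, CUSTOM_HANDLERS))
```

The returned dictionary is what you would send back to the model as the return value in the corresponding FunctionResponse.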

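Supported actions

The Computer Use model and tool let the model request the following actions using a FunctionCall. Your client-side code must implement the execution logic for these actions. See the reference implementation for examples.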

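| Command name | Description | Arguments (in function call) | Example function call |
| --- | --- | --- | --- |
| open_web_browser | Opens the web browser. | None | {"name": "open_web_browser", "args": {}} |
| wait_5_seconds | Pauses execution for 5 seconds to let dynamic content load or animations finish. | None | {"name": "wait_5_seconds", "args": {}} |
| go_back | Navigates to the previous page in the browser history. | None | {"name": "go_back", "args": {}} |
| go_forward | Navigates to the next page in the browser history. | None | {"name": "go_forward", "args": {}} |
| search | Navigates to the home page of the default search engine (for example, Google). Useful for starting a new search task. | None | {"name": "search", "args": {}} |
| navigate | Navigates the browser directly to the specified URL. | url: str | {"name": "navigate", "args": {"url": "https://www.wikipedia.org"}} |
| click_at | Clicks at a specific coordinate on the web page. The x and y values are based on a 1000x1000 grid and are scaled to the screen dimensions. | y: int (0-999), x: int (0-999) | {"name": "click_at", "args": {"y": 300, "x": 500}} |
| hover_at | Hovers the mouse at a specific coordinate on the web page. Useful for revealing sub-menus. x and y are based on a 1000x1000 grid. | y: int (0-999), x: int (0-999) | {"name": "hover_at", "args": {"y": 150, "x": 250}} |
| type_text_at | Types text at a specific coordinate. By default, clears the field first and presses ENTER after typing, but both behaviors can be disabled. x and y are based on a 1000x1000 grid. | y: int (0-999), x: int (0-999), text: str, press_enter: bool (optional, default True), clear_before_typing: bool (optional, default True) | {"name": "type_text_at", "args": {"y": 250, "x": 400, "text": "search query", "press_enter": false}} |
| key_combination | Presses keyboard keys or key combinations, such as "Ctrl+C" or "Enter". Useful for triggering actions (such as submitting a form with "Enter") or for clipboard operations. | keys: str (for example, "enter" or "control+c"; see the API reference for the full list of allowed keys) | {"name": "key_combination", "args": {"keys": "Control+A"}} |
| scroll_document | Scrolls the entire web page "up", "down", "left", or "right". | direction: str ("up", "down", "left", or "right") | {"name": "scroll_document", "args": {"direction": "down"}} |
| scroll_at | Scrolls a specific element or area at coordinate (x, y) in the specified direction by the given magnitude. Coordinates and magnitude (default 800) are based on a 1000x1000 grid. | y: int (0-999), x: int (0-999), direction: str ("up", "down", "left", "right"), magnitude: int (0-999, optional, default 800) | {"name": "scroll_at", "args": {"y": 500, "x": 500, "direction": "down", "magnitude": 400}} |
| drag_and_drop | Drags an element from a starting coordinate (x, y) and drops it at a destination coordinate (destination_x, destination_y). All coordinates are based on a 1000x1000 grid. | y: int (0-999), x: int (0-999), destination_y: int (0-999), destination_x: int (0-999) | {"name": "drag_and_drop", "args": {"y": 100, "x": 100, "destination_y": 500, "destination_x": 500}} |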
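
As a hedged sketch of how a client might execute some of the actions listed above, the following function handles navigate, go_back, go_forward, wait_5_seconds, hover_at, and scroll_document on a Playwright page object. The function name execute_browser_action and the SCROLL_PIXELS constant are assumptions made for illustration; the reference implementation may handle these actions differently.

```python
import time
from typing import Any, Dict

# Assumption: number of pixels scrolled per scroll_document call.
SCROLL_PIXELS = 600


def execute_browser_action(
    page,
    name: str,
    args: Dict[str, Any],
    screen_width: int,
    screen_height: int,
) -> str:
    """Run one action on a Playwright page; returns a status string."""
    if name == "navigate":
        page.goto(args["url"])
    elif name == "go_back":
        page.go_back()
    elif name == "go_forward":
        page.go_forward()
    elif name == "wait_5_seconds":
        time.sleep(5)
    elif name == "hover_at":
        # Scale the 0-999 grid coordinates to pixels before moving the mouse.
        page.mouse.move(
            args["x"] * screen_width // 1000,
            args["y"] * screen_height // 1000,
        )
    elif name == "scroll_document":
        offsets = {
            "up": (0, -SCROLL_PIXELS),
            "down": (0, SCROLL_PIXELS),
            "left": (-SCROLL_PIXELS, 0),
            "right": (SCROLL_PIXELS, 0),
        }
        # An unsupported direction raises KeyError for the caller to handle.
        delta_x, delta_y = offsets[args["direction"]]
        page.mouse.wheel(delta_x, delta_y)
    else:
        return "unknown_function"
    return "success"
```

The function only relies on the page methods it calls, so it can be exercised in unit tests with a simple stand-in object before running it against a real browser.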

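Safety and security

This section describes the safeguards that the Computer Use model and tool have in place to improve user control and safety. It also describes best practices for mitigating the new risks that the tool may present.

Acknowledge safety decision

Depending on the action, the response from the Computer Use model and tool might include a safety_decision from an internal safety system. This decision reflects the safety system's check of the action proposed by the tool.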

{
  "content": {
    "parts": [
      {
        "text": "I have evaluated step 2. It seems Google detected unusual traffic and is asking me to verify I'm not a robot. I need to click the 'I'm not a robot' checkbox located near the top left (y=98, x=95)."
      },
      {
        "function_call": {
          "name": "click_at",
          "args": {
            "x": 60,
            "y": 100,
            "safety_decision": {
              "explanation": "I have encountered a CAPTCHA challenge that requires interaction. I need you to complete the challenge by clicking the 'I'm not a robot' checkbox and any subsequent verification steps.",
              "decision": "require_confirmation"
            }
          }
        }
      }
    ]
  }
}

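If the safety_decision is require_confirmation, you must ask the user to confirm before proceeding with the action.

The following code sample prompts the user for confirmation before executing the action. If the user doesn't confirm the action, the loop terminates. If the user confirms the action, the action is executed and the safety_acknowledgement field is set to "true".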

import termcolor

def get_safety_confirmation(safety_decision):
    """Prompt user for confirmation when safety check is triggered."""
    termcolor.cprint("Safety service requires explicit confirmation!", color="red")
    print(safety_decision["explanation"])

    decision = ""
    while decision.lower() not in ("y", "n", "ye", "yes", "no"):
        decision = input("Do you wish to proceed? [Y]es/[N]o\n")

    if decision.lower() in ("n", "no"):
        return "TERMINATE"
    return "CONTINUE"

def execute_function_calls(response, page, screen_width: int, screen_height: int):

    # ... Extract function calls from response ...

    for function_call in function_calls:
        extra_fr_fields = {}

        # Check for safety decision
        if 'safety_decision' in function_call.args:
            decision = get_safety_confirmation(function_call.args['safety_decision'])
            if decision == "TERMINATE":
                print("Terminating agent loop")
                break
            extra_fr_fields["safety_acknowledgement"] = "true" # Safety acknowledgement

        # ... Execute function call and append to results ...

If the user confirms, you must include the safety acknowledgement in your FunctionResponse.

function_response_parts.append(
    FunctionResponse(
        name=name,
        response={"url": current_url,
                  **extra_fr_fields},  # Include safety acknowledgement
        parts=[
            types.FunctionResponsePart(
                inline_data=types.FunctionResponseBlob(
                    mime_type="image/png", data=screenshot
                )
             )
           ]
         )
       )

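Safety best practices

The Computer Use model and tool are novel and present new risks that developers should be aware of:

  • Untrusted content and scams: As the model works toward the user's goal, it may rely on untrusted sources of information and on instructions shown on screen. For example, if the user's goal is to purchase a Pixel phone and the model encounters a "Free Pixel if you complete a survey" scam, the model might complete the survey.
  • Occasional unintended actions: The model can misinterpret the user's goal or the content of a web page, which can lead it to take incorrect actions such as clicking the wrong button or filling in the wrong form. This can lead to failed tasks or data exfiltration.
  • Policy violations: The API's capabilities could be directed, intentionally or unintentionally, toward activities that violate Google's policies (the Generative AI Prohibited Use Policy and the Gemini API Additional Terms of Service). This includes actions that could interfere with a system's integrity, compromise security, bypass security measures such as CAPTCHAs, or control medical devices.

To address these risks, you can implement the following safety measures and best practices: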

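  1. Human-in-the-loop (HITL):
    • Implement user confirmation: When the safety response indicates require_confirmation, you must implement user confirmation before the action is executed.
    • Provide custom safety instructions: In addition to the built-in user confirmation checks, developers can optionally add a custom system instruction that enforces their own safety policies, either to block certain model actions or to require user confirmation before the model takes certain high-stakes, irreversible actions. The following is an example of a custom safety system instruction that you can include when interacting with the model.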


    ## **RULE 1: Seek User Confirmation (USER_CONFIRMATION)**
    
    This is your first and most important check. If the next required action falls
    into any of the following categories, you MUST stop immediately, and seek the
    user's explicit permission.
    
    **Procedure for Seeking Confirmation:**

    *   **For Consequential Actions:** Perform all preparatory steps (e.g.,
        navigating, filling out forms, typing a message). You will ask for
        confirmation **AFTER** all necessary information is entered on the
        screen, but **BEFORE** you perform the final, irreversible action (e.g.,
        before clicking "Send", "Submit", "Confirm Purchase", "Share").
    *   **For Prohibited Actions:** If the action is strictly forbidden (e.g.,
        accepting legal terms, solving a CAPTCHA), you must first inform the user
        about the required action and ask for their confirmation to proceed.
    
    **USER_CONFIRMATION Categories:**
    
    *   **Consent and Agreements:** You are FORBIDDEN from accepting, selecting, or
        agreeing to any of the following on the user's behalf. You must ask the
        user to confirm before performing these actions.
        *   Terms of Service
        *   Privacy Policies
        *   Cookie consent banners
        *   End User License Agreements (EULAs)
        *   Any other legally significant contracts or agreements.
    *   **Robot Detection:** You MUST NEVER attempt to solve or bypass the
        following. You must ask the user to confirm before performing these actions.
        *   CAPTCHAs (of any kind)
        *   Any other anti-robot or human-verification mechanisms, even if you are
            capable.
    *   **Financial Transactions:**
        *   Completing any purchase.
        *   Managing or moving money (e.g., transfers, payments).
        *   Purchasing regulated goods or participating in gambling.
    *   **Sending Communications:**
        *   Sending emails.
        *   Sending messages on any platform (e.g., social media, chat apps).
        *   Posting content on social media or forums.
    *   **Accessing or Modifying Sensitive Information:**
        *   Health, financial, or government records (e.g., medical history, tax
            forms, passport status).
        *   Revealing or modifying sensitive personal identifiers (e.g., SSN, bank
            account number, credit card number).
    *   **User Data Management:**
        *   Accessing, downloading, or saving files from the web.
        *   Sharing or sending files/data to any third party.
        *   Transferring user data between systems.
    *   **Browser Data Usage:**
        *   Accessing or managing Chrome browsing history, bookmarks, autofill data,
            or saved passwords.
    *   **Security and Identity:**
        *   Logging into any user account.
        *   Any action that involves misrepresentation or impersonation (e.g.,
            creating a fan account, posting as someone else).
    *   **Insurmountable Obstacles:** If you are technically unable to interact with
        a user interface element or are stuck in a loop you cannot resolve, ask the
        user to take over.
    ---
    
    ## **RULE 2: Default Behavior (ACTUATE)**
    
    If an action does **NOT** fall under the conditions for `USER_CONFIRMATION`,
    your default behavior is to **Actuate**.
    
    **Actuation Means:**  You MUST proactively perform all necessary steps to move
    the user's request forward. Continue to actuate until you either complete the
    non-consequential task or encounter a condition defined in Rule 1.
    
    *   **Example 1:** If asked to send money, you will navigate to the payment
        portal, enter the recipient's details, and enter the amount. You will then
        **STOP** as per Rule 1 and ask for confirmation before clicking the final
        "Send" button.
    *   **Example 2:** If asked to post a message, you will navigate to the site,
        open the post composition window, and write the full message. You will then
        **STOP** as per Rule 1 and ask for confirmation before clicking the final
        "Post" button.
    
        After the user has confirmed, remember to get the user's latest screen
        before continuing to perform actions.
    
    # Final Response Guidelines:
    Write final response to the user in these cases:
    - User confirmation
    - When the task is complete or you have enough information to respond to the user
        
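  2. Secure execution environment: Run your agent in a secure, sandboxed environment to limit its potential impact (for example, a sandboxed virtual machine (VM), a container such as Docker, or a dedicated browser profile with limited permissions).
  3. Input sanitization: Sanitize all user-generated text in prompts to reduce the risk of unintended instructions or prompt injection. This is a helpful layer of security, but it is not a replacement for a secure execution environment.
  4. Allowlists and blocklists: Implement filtering mechanisms to control where the model can navigate and what it can do. A blocklist of prohibited websites is a good starting point, while a more restrictive allowlist is even more secure.
  5. Observability and logging: Maintain detailed logs for debugging, auditing, and incident response. Your client should log prompts, screenshots, model-suggested actions (function_call), safety responses, and all actions that the client ultimately executes.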
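
The allowlist idea from item 4 can be sketched with the standard library alone. This is an illustrative check, not a complete defense: the host names in ALLOWED_HOSTS and the helper name is_url_allowed are hypothetical, and a client would call the check before executing a navigate action and again on the URL captured after each step.

```python
from typing import Iterable
from urllib.parse import urlparse

# Hypothetical allowlist; replace with the hosts your agent may visit.
ALLOWED_HOSTS = {"www.google.com", "www.wikipedia.org"}


def is_url_allowed(url: str, allowed_hosts: Iterable[str] = ALLOWED_HOSTS) -> bool:
    """Return True only for http(s) URLs on an allowed host or its subdomains."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):
        return False
    host = (parsed.hostname or "").lower()
    # Match the host exactly or as a subdomain; a plain substring test would
    # wrongly accept hosts such as "www.wikipedia.org.evil.example".
    return any(
        host == allowed or host.endswith("." + allowed)
        for allowed in allowed_hosts
    )


print(is_url_allowed("https://www.wikipedia.org/wiki/Refrigerator"))
print(is_url_allowed("https://www.wikipedia.org.evil.example/"))
print(is_url_allowed("javascript:alert(1)"))
```

An allowlist such as this complements the sandboxed execution environment from item 2 and does not replace it.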

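Pricing

The Computer Use model and tool are priced at the same rates as Gemini 2.5 Pro and use the same SKUs. To split out Computer Use model and tool costs, use custom metadata labels. For more information about using custom metadata labels for cost monitoring, see "Custom metadata labels".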