Collect URLScan IO logs

Supported in:

Google SecOps SIEM

This document explains how to ingest URLScan IO logs to Google Security Operations using Amazon S3.

Before you begin

Make sure you have the following prerequisites:

A Google SecOps instance
Privileged access to a URLScan IO tenant
Privileged access to AWS (S3, IAM, Lambda, EventBridge)

Get URLScan IO prerequisites

Sign in to URLScan IO.
Click your profile icon.
Select API Key from the menu.
If you don't have an API key yet:
- Click the Create API Key button.
- Enter a description for the API key (for example, Google SecOps Integration).
- Select the permissions for the key (for read-only access, select Read permissions).
- Click Generate API Key.
Copy the following details and save them in a secure location:
- API_KEY: The generated API key string (format: xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx)
- API Base URL: https://urlscan.io/api/v1 (this is constant for all users)
Note your API quota limits:
- Free accounts: Limited to 1000 API calls per day, 60 per minute
- Pro accounts: Higher limits based on subscription tier
If you need to restrict searches to your organization's scans only, note down:
- User identifier: Your username or email (for use with user: search filter)
- Team identifier: If using teams feature (for use with team: search filter)
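The values collected above can be combined into a Search API request. The following is a minimal sketch, not part of the collector itself: the helper name build_search_request is hypothetical, and joining the filters with AND assumes the Search API accepts that boolean operator in the q parameter. It only builds the URL and headers, so you can inspect them before sending anything.

```python
from urllib.parse import urlencode

API_BASE = "https://urlscan.io/api/v1"  # constant for all users, per the steps above


def build_search_request(api_key, hours=1, user=None, team=None, size=100):
    """Return (url, headers) for a Search API call limited to recent scans.

    Hypothetical helper: the user:/team: filters are the ones noted above;
    combining them with AND is an assumption about the query syntax.
    """
    terms = [f"date:>now-{hours}h"]
    if user:
        terms.append(f"user:{user}")
    if team:
        terms.append(f"team:{team}")
    query = " AND ".join(terms)
    url = f"{API_BASE}/search/?{urlencode({'q': query, 'size': size})}"
    headers = {"API-Key": api_key}  # same header the Lambda function sends
    return url, headers


url, headers = build_search_request("xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx", user="me")
print(url)
```

The resulting query string can be used as the SEARCH_QUERY value later in this document if you want the collector restricted in the same way.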

Configure AWS S3 bucket and IAM for Google SecOps

Create an Amazon S3 bucket following this user guide: Creating a bucket.
Save the bucket Name (for example, urlscan-logs-bucket) and Region for future reference.
Create a User following this user guide: Creating an IAM user.
Select the created User.
Select the Security credentials tab.
Click Create Access Key in the Access Keys section.
Select Third-party service as the Use case.
Click Next.
Optional: Add a description tag.
Click Create access key.
Click Download CSV file to save the Access Key and Secret Access Key for future reference.
Click Done.
Select the Permissions tab.
Click Add permissions in the Permissions policies section.
Select Add permissions.
Select Attach policies directly.
Search for the AmazonS3FullAccess policy.
Select the policy.
Click Next.
Click Add permissions.

Configure the IAM policy and role for S3 uploads

In the AWS console, go to IAM > Policies.
Click Create policy > JSON tab.

Enter the following policy:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowPutObjects",
      "Effect": "Allow",
      "Action": "s3:PutObject",
      "Resource": "arn:aws:s3:::urlscan-logs-bucket/*"
    },
    {
      "Sid": "AllowGetStateObject",
      "Effect": "Allow",
      "Action": "s3:GetObject",
      "Resource": "arn:aws:s3:::urlscan-logs-bucket/urlscan/state.json"
    }
  ]
}

Replace urlscan-logs-bucket if you entered a different bucket name.

Click Next > Create policy.
Go to IAM > Roles > Create role > AWS service > Lambda.
Attach the newly created policy.
Name the role urlscan-lambda-role and click Create role.
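If your bucket name or state object key differs from the examples, the upload policy from the steps above has to be adjusted in two places. The following sketch generates the same policy document from parameters; the function name build_upload_policy is hypothetical, and the output is meant to be pasted into the JSON tab rather than applied automatically.

```python
import json


def build_upload_policy(bucket, state_key="urlscan/state.json"):
    """Return the Lambda upload policy for a given bucket and state object key."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowPutObjects",
                "Effect": "Allow",
                "Action": "s3:PutObject",
                "Resource": f"arn:aws:s3:::{bucket}/*",
            },
            {
                "Sid": "AllowGetStateObject",
                "Effect": "Allow",
                "Action": "s3:GetObject",
                # Must match the STATE_KEY value configured on the function
                "Resource": f"arn:aws:s3:::{bucket}/{state_key}",
            },
        ],
    }


# "my-urlscan-bucket" is a placeholder bucket name
print(json.dumps(build_upload_policy("my-urlscan-bucket"), indent=2))
```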

Create the Lambda function

In the AWS Console, go to Lambda > Functions > Create function.
Click Author from scratch.
Provide the following configuration details:

Setting	Value
Name	`urlscan-collector`
Runtime	Python 3.13
Architecture	x86_64
Execution role	`urlscan-lambda-role`

After the function is created, open the Code tab, delete the stub and enter the following code (urlscan-collector.py):

import json
import os
from datetime import datetime

import boto3
import urllib3

s3 = boto3.client('s3')
http = urllib3.PoolManager()

def lambda_handler(event, context):
    # Environment variables
    bucket = os.environ['S3_BUCKET']
    prefix = os.environ['S3_PREFIX']
    state_key = os.environ['STATE_KEY']
    api_key = os.environ['API_KEY']
    api_base = os.environ['API_BASE']
    search_query = os.environ.get('SEARCH_QUERY', 'date:>now-1h')
    page_size = int(os.environ.get('PAGE_SIZE', '100'))
    max_pages = int(os.environ.get('MAX_PAGES', '10'))

    # Load state
    state = load_state(bucket, state_key)
    last_run = state.get('last_run')

    # Prepare search query
    if last_run:
        # Adjust search query based on last run
        search_time = datetime.fromisoformat(last_run)
        time_diff = datetime.utcnow() - search_time
        hours = int(time_diff.total_seconds() / 3600) + 1
        search_query = f'date:>now-{hours}h'

    # Search for scans
    headers = {'API-Key': api_key}
    all_results = []

    # The Search API pages with search_after (the sort value of the last
    # result received) rather than a numeric offset.
    search_url = f"{api_base}/search/"
    search_after = None

    for _ in range(max_pages):
        params = {
            'q': search_query,
            'size': page_size
        }
        if search_after:
            params['search_after'] = search_after

        # Make search request
        response = http.request(
            'GET',
            search_url,
            fields=params,
            headers=headers
        )

        if response.status != 200:
            print(f"Search failed: {response.status}")
            break

        search_data = json.loads(response.data.decode('utf-8'))
        results = search_data.get('results', [])

        if not results:
            break

        # Fetch full result for each scan
        for result in results:
            uuid = result.get('task', {}).get('uuid')
            if uuid:
                result_url = f"{api_base}/result/{uuid}/"
                result_response = http.request(
                    'GET',
                    result_url,
                    headers=headers
                )

                if result_response.status == 200:
                    full_result = json.loads(result_response.data.decode('utf-8'))
                    all_results.append(full_result)
                else:
                    print(f"Failed to fetch result for {uuid}: {result_response.status}")

        # Stop on a short page or when no cursor is available for the next one
        sort_value = results[-1].get('sort')
        if len(results) < page_size or not sort_value:
            break
        search_after = ','.join(str(v) for v in sort_value)

    # Write results to S3
    if all_results:
        now = datetime.utcnow()
        file_key = f"{prefix}year={now.year}/month={now.month:02d}/day={now.day:02d}/hour={now.hour:02d}/urlscan_{now.strftime('%Y%m%d_%H%M%S')}.json"

        # Create NDJSON content
        ndjson_content = '\n'.join([json.dumps(r, separators=(',', ':')) for r in all_results])

        # Upload to S3
        s3.put_object(
            Bucket=bucket,
            Key=file_key,
            Body=ndjson_content.encode('utf-8'),
            ContentType='application/x-ndjson'
        )

        print(f"Uploaded {len(all_results)} results to s3://{bucket}/{file_key}")

    # Update state
    state['last_run'] = datetime.utcnow().isoformat()
    save_state(bucket, state_key, state)

    return {
        'statusCode': 200,
        'body': json.dumps({
            'message': f'Processed {len(all_results)} scan results',
            'location': f"s3://{bucket}/{prefix}"
        })
    }

def load_state(bucket, key):
    try:
        response = s3.get_object(Bucket=bucket, Key=key)
        return json.loads(response['Body'].read())
    except s3.exceptions.NoSuchKey:
        return {}
    except Exception as e:
        print(f"Error loading state: {e}")
        return {}

def save_state(bucket, key, state):
    try:
        s3.put_object(
            Bucket=bucket,
            Key=key,
            Body=json.dumps(state),
            ContentType='application/json'
        )
    except Exception as e:
        print(f"Error saving state: {e}")

Go to Configuration > Environment variables.
Click Edit > Add new environment variable.

Enter the following environment variables, replacing with your values:

Key	Example value
`S3_BUCKET`	`urlscan-logs-bucket`
`S3_PREFIX`	`urlscan/`
`STATE_KEY`	`urlscan/state.json`
`API_KEY`	`<your-api-key>`
`API_BASE`	`https://urlscan.io/api/v1`
`SEARCH_QUERY`	`date:>now-1h`
`PAGE_SIZE`	`100`
`MAX_PAGES`	`10`
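PAGE_SIZE and MAX_PAGES together determine how many API calls a single run can make, because the function issues one search call per page and one result call per scan returned. The sketch below works through that arithmetic against the quota figures noted earlier. The helper names are hypothetical, and it assumes every call counts equally against the daily quota, which may not match how URLScan IO meters your account.

```python
def calls_per_run(page_size, max_pages):
    """Worst case: one search call per page plus one result call per scan."""
    return max_pages + page_size * max_pages


def calls_per_day(page_size, max_pages, runs_per_day):
    """Worst-case daily total for a fixed number of scheduled runs."""
    return calls_per_run(page_size, max_pages) * runs_per_day


def fits_daily_quota(page_size, max_pages, runs_per_day, daily_quota):
    return calls_per_day(page_size, max_pages, runs_per_day) <= daily_quota


# Example values from the table above, run once per hour
print(calls_per_run(100, 10))               # 1010
print(calls_per_day(100, 10, 24))           # 24240
print(fits_daily_quota(100, 10, 24, 1000))  # False

# A much smaller configuration, for comparison
print(calls_per_day(20, 1, 24))             # 504
print(fits_daily_quota(20, 1, 24, 1000))    # True
```

With the example values, a single full run can already exceed a 1000-call daily quota, so accounts on the free tier would likely need smaller values or a narrower SEARCH_QUERY.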

After the function is created, stay on its page (or open Lambda > Functions > your-function).
Select the Configuration tab.
In the General configuration panel click Edit.
Change Timeout to 5 minutes (300 seconds) and click Save.

Note: The timeout is a hard upper limit. If the function is still running after 5 minutes, AWS Lambda terminates it. Shorter executions finish normally, and you are billed only for the actual run time.
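Because a terminated invocation never reaches the upload and state-saving steps, one option is to stop fetching before the timeout is reached. The following sketch uses the get_remaining_time_in_millis() method that Lambda provides on the context object. The 30-second margin, the has_time_left helper, and the FakeContext class are assumptions for illustration; the function code above does not include this check.

```python
SAFETY_MARGIN_MS = 30_000  # assumed margin reserved for the S3 upload and state save


def has_time_left(context, margin_ms=SAFETY_MARGIN_MS):
    """Return True while enough execution time remains to keep fetching."""
    return context.get_remaining_time_in_millis() > margin_ms


class FakeContext:
    """Local stand-in for the Lambda context object, for testing only."""

    def __init__(self, remaining_ms):
        self.remaining_ms = remaining_ms

    def get_remaining_time_in_millis(self):
        return self.remaining_ms


print(has_time_left(FakeContext(120_000)))  # True
print(has_time_left(FakeContext(10_000)))   # False
```

Inside lambda_handler, a check such as `if not has_time_left(context): break` at the top of the page loop would let the run exit early and still write what it has collected.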

Create an EventBridge schedule

Go to Amazon EventBridge > Scheduler > Create schedule.
Provide the following configuration details:
- Recurring schedule: Rate (1 hour).
- Target: your Lambda function urlscan-collector.
- Name: urlscan-collector-1h.
Click Create schedule.
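The schedule interval interacts with the look-back window that the function computes from its saved state. The function rounds the elapsed time down to whole hours and adds one, so consecutive hourly runs search overlapping windows and some scans can be written to S3 more than once. The sketch below reproduces that calculation with fixed timestamps; the helper name lookback_hours is hypothetical.

```python
from datetime import datetime, timedelta


def lookback_hours(last_run, now):
    """Same arithmetic as the handler: whole hours elapsed, plus one."""
    return int((now - last_run).total_seconds() / 3600) + 1


now = datetime(2025, 1, 1, 12, 0, 0)

# Run fired slightly early: just under an hour since the last run
print(lookback_hours(now - timedelta(minutes=59, seconds=58), now))  # 1

# Run fired slightly late: just over an hour since the last run
print(lookback_hours(now - timedelta(hours=1, seconds=2), now))      # 2

# Three missed hours: the window widens to cover the gap
print(lookback_hours(now - timedelta(hours=3), now))                 # 4
```

The overlap avoids gaps when a run is delayed, at the cost of possible duplicate records and extra API calls.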

Optional: Create read-only IAM user & keys for Google SecOps

Go to AWS Console > IAM > Users.
Click Add users.
Provide the following configuration details:
- User: Enter secops-reader.
- Access type: Select Access key – Programmatic access.
Click Create user.
Attach a minimal custom read policy: go to Users > secops-reader > Permissions > Add permissions > Attach policies directly > Create policy.

In the JSON editor, enter the following policy:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject"],
      "Resource": "arn:aws:s3:::urlscan-logs-bucket/*"
    },
    {
      "Effect": "Allow",
      "Action": ["s3:ListBucket"],
      "Resource": "arn:aws:s3:::urlscan-logs-bucket"
    }
  ]
}

Set the name to secops-reader-policy.
Click Create policy, search for and select secops-reader-policy, and then click Next > Add permissions.
Go to Security credentials > Access keys > Create access key.
Download the CSV file. You enter these values when you configure the feed.

Configure a feed in Google SecOps to ingest URLScan IO logs

Go to SIEM Settings > Feeds.
Click Add New Feed.
In the Feed name field, enter a name for the feed (for example, URLScan IO logs).
Select Amazon S3 V2 as the Source type.
Select URLScan IO as the Log type.
Click Next.
Specify values for the following input parameters:
- S3 URI: s3://urlscan-logs-bucket/urlscan/
- Source deletion options: Select deletion option according to your preference.
- Maximum File Age: Includes files modified within the specified number of days. The default is 180 days.
- Access Key ID: User access key with access to the S3 bucket.
- Secret Access Key: User secret key with access to the S3 bucket.
- Asset namespace: The asset namespace.
- Ingestion labels: The label applied to the events from this feed.
Click Next.
Review your new feed configuration in the Finalize screen, and then click Submit.
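The S3 URI in the feed should match the S3_BUCKET and S3_PREFIX values set on the Lambda function. The sketch below shows which object keys fall under a given URI by simple prefix matching. The helper names are hypothetical, and prefix matching is an assumption about how the feed scopes objects. Note that with the example values, the state object urlscan/state.json sits under the same prefix as the log files.

```python
def split_s3_uri(s3_uri):
    """Split s3://bucket/prefix into (bucket, prefix)."""
    if not s3_uri.startswith("s3://"):
        raise ValueError(f"Not an S3 URI: {s3_uri}")
    bucket, _, prefix = s3_uri[len("s3://"):].partition("/")
    return bucket, prefix


def keys_in_scope(s3_uri, keys):
    """Return the keys that start with the prefix part of the URI."""
    _, prefix = split_s3_uri(s3_uri)
    return [k for k in keys if k.startswith(prefix)]


# Illustrative keys only
keys = [
    "urlscan/year=2025/month=01/day=02/hour=03/urlscan_20250102_030405.json",
    "urlscan/state.json",
    "other/file.json",
]

print(split_s3_uri("s3://urlscan-logs-bucket/urlscan/"))
# ('urlscan-logs-bucket', 'urlscan/')
print(keys_in_scope("s3://urlscan-logs-bucket/urlscan/", keys))
```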
