Monitor and troubleshoot data health
This document describes the Data Health Monitoring and Troubleshooting dashboard.
The Data Health Monitoring and Troubleshooting dashboard functions as a central location in Google SecOps for you to monitor the status and health of all of your data sources.
The Data Health Monitoring and Troubleshooting dashboard includes information about the following:
- Ingestion volumes and ingestion health.
- Parsing volumes from raw logs to Unified Data Model (UDM) events.
- Context, with links to interfaces that provide additional relevant information and functionality.
- Irregular and failed sources and log types.

The Data Health Monitoring and Troubleshooting dashboard detects irregularities on a per-customer basis. It uses statistical methods with a 30-day lookback period to analyze ingestion data. Items that are marked irregular identify surges or drops in the data being ingested and processed by Google SecOps.
Key benefits
You can use the Data Health Monitoring and Troubleshooting dashboard to do the following:
- Monitor overall data health at a glance. View the core health status and associated metrics for each feed, data source, log type, and source (that is, the feed ID).
- Monitor aggregated data-health metrics for the following:
  - Ingestion and parsing over time, with highlighted events (not necessarily irregularities) that link to filtered dashboards.
  - Irregularities, both current and over time.
- Access related dashboards, filtered by time range, log type, or feed.
- Access the feed configuration to edit it and remediate a problem.
- Access the parser configuration to edit it and remediate a problem.
- Click a link to open the Cloud Monitoring interface, and from there, configure custom API-based alerts using the Status and Log Volume metrics.
Key questions
This section refers to Data Health Monitoring and Troubleshooting dashboard components and parameters, which are described in the Interface section.
You can use the Data Health Monitoring and Troubleshooting dashboard to answer the following typical questions about your data pipeline:
Are my logs reaching the SIEM system?
You can verify whether logs are reaching the SIEM system by using the Last Successful Ingestion and Last Collection metrics. These metrics confirm the last time data was successfully delivered. Additionally, the Ingestion Volume metrics (per source and per log type) show you the amount of data being ingested.
Are my logs being parsed correctly?
To confirm correct parsing, check the Last Normalization Time metric. This metric indicates when the last successful transformation from raw log into a UDM event occurred.
Why is ingestion or parsing not happening?
The text in the Issue column identifies specific problems and helps you distinguish actionable errors from non-actionable errors. The text Forbidden 403: Permission denied is an example of an actionable error, where the authentication account provided in the feed configuration lacks the required permissions. The text Internal_error is an example of a non-actionable error, where the recommended action is to open a support case with Google SecOps.
Are there significant changes in the number of ingested and parsed logs?
The Status field shows your data's health (from Healthy to Failed), based on data volume. You can also view the Ingestion Volume graphs (per source and per log type) to identify sudden or sustained surges or drops.
How can I get alerted if my sources are failing?
The Data Health Monitoring and Troubleshooting dashboard feeds the Status and Log Volume metrics into Cloud Monitoring. In one of the Data Health Monitoring and Troubleshooting dashboard tables, click the relevant Alerts link to open the Cloud Monitoring interface. There, you can configure custom API-based alerts using Status and Log Volume metrics.
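The alert configuration itself happens in Cloud Monitoring. The following is a minimal sketch of creating an alert policy programmatically with the google-cloud-monitoring Python client. The project ID, metric filter, threshold, and duration values are placeholders; substitute the exact Status or Log Volume metric names that appear in your Cloud Monitoring interface.

```python
# Sketch: create a Cloud Monitoring alert policy for a Google SecOps ingestion metric.
# The metric type in the filter is a placeholder; use the Status or Log Volume metric
# shown in your Cloud Monitoring interface.
from google.cloud import monitoring_v3
from google.protobuf import duration_pb2

project_id = "your-project-id"  # placeholder project
client = monitoring_v3.AlertPolicyServiceClient()

policy = monitoring_v3.AlertPolicy(
    display_name="SecOps ingestion volume drop",
    combiner=monitoring_v3.AlertPolicy.ConditionCombinerType.AND,
    conditions=[
        monitoring_v3.AlertPolicy.Condition(
            display_name="Log volume below expected level",
            condition_threshold=monitoring_v3.AlertPolicy.Condition.MetricThreshold(
                # Placeholder filter: substitute the Log Volume metric for your tenant.
                filter='metric.type = "chronicle.googleapis.com/ingestion/log/record_count"',
                comparison=monitoring_v3.ComparisonType.COMPARISON_LT,
                threshold_value=1000,  # example threshold; tune for your sources
                duration=duration_pb2.Duration(seconds=3600),  # sustained for one hour
                aggregations=[
                    monitoring_v3.Aggregation(
                        alignment_period=duration_pb2.Duration(seconds=300),
                        per_series_aligner=monitoring_v3.Aggregation.Aligner.ALIGN_SUM,
                    )
                ],
            ),
        )
    ],
)

created = client.create_alert_policy(name=f"projects/{project_id}", alert_policy=policy)
print(f"Created alert policy: {created.name}")
```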
How do I infer a delay in a log-type ingestion?
A delay is indicated when the Latest Event Time is significantly behind the Last Ingestion Time. The Data Health Monitoring and Troubleshooting dashboard exposes, per log type, the 95th percentile of the delta between the Last Ingestion Time and the Latest Event Time. A high value suggests a latency problem within the Google SecOps pipeline, whereas a normal value might indicate that the source is pushing old data.
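To make the delta concrete, the following minimal sketch computes a 95th-percentile ingestion delay from hypothetical (ingestion time, event time) pairs for a single log type. The timestamps and the per-event pairing are illustrative assumptions, not the dashboard's internal implementation.

```python
# Sketch: estimate the 95th percentile of (ingestion time - event time) for one
# log type, mirroring the delay metric the dashboard exposes. The timestamp
# pairs below are hypothetical.
from datetime import datetime

# (last ingestion time, latest event time) pairs observed for a single log type
samples = [
    (datetime(2025, 1, 1, 12, 10), datetime(2025, 1, 1, 12, 0)),   # 10 min behind
    (datetime(2025, 1, 1, 12, 20), datetime(2025, 1, 1, 12, 5)),   # 15 min behind
    (datetime(2025, 1, 1, 12, 30), datetime(2025, 1, 1, 12, 18)),  # 12 min behind
    (datetime(2025, 1, 1, 12, 40), datetime(2025, 1, 1, 9, 0)),    # 220 min behind
]

# Delta in minutes between ingestion and the event's own timestamp.
deltas = sorted(
    (ingested - event).total_seconds() / 60 for ingested, event in samples
)

# Simple percentile estimate: index 95% of the way through the sorted deltas.
p95_delay = deltas[min(len(deltas) - 1, int(round(0.95 * (len(deltas) - 1))))]
print(f"p95 ingestion delay: {p95_delay:.0f} minutes")
```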
Have any recent changes in my configuration caused feed failures?
If the Last Modified timestamp is close to the Last Successful Ingestion timestamp, it suggests that a recent configuration update may be the cause of a failure. This correlation helps in root cause analysis.
How has the health of ingestion and parsing been trending over time?
The Total Ingestion, Ingestion Health, and Parsers Health graphs show the historical trend of your data's health, letting you observe long-term patterns and irregularities.
Interface
The Data Health Monitoring and Troubleshooting dashboard displays the following widgets:
Big number widgets for parsing:
- Healthy Parsing: The number of log types parsed successfully. That is, the number of parsing statuses that aren't failed or irregular.
- Irregular Parsing: The number of irregular parsing statuses.
- Failed Parsing: The number of failed parsing statuses.
Total Ingested and Total Parsed: A line graph showing the Parsed Data and Incoming Data logs-per-minute curves over time.
Data Source Health Overview graph: A line graph showing the Failed and Irregular issues-per-day curves over time for data ingestion.
Parsing Health Overview: A line graph showing the Failed and Irregular issues-per-day curves over time for parsers.
Big number widgets for sources (feeds):
- Healthy Sources: The number of sources with no irregularities.
- Critical Sources: The number of sources with critical irregularities.
- Source Warnings: The number of sources with warnings.
Ingestion Health table, which includes the following columns:
- Status: The cumulative status of the feed (for example, Healthy or Irregular), derived from data volume, configuration errors, and API errors.
- Name: The feed name.
- Mechanism: The type of ingestion mechanism—for example, Ingestion API, Native Workspace Ingestion, or Azure Event Hub Feeds.
- Log Type: The log type.
- Issue Details: The problem, if one exists—for example, Failed parsing logs, Config credential issue, or Normalization issue. The stated issue can be actionable (for example, Incorrect Auth) or non-actionable (for example, Internal_error). When the Status is Healthy, the value is empty.
- Issue Duration: The number of days that the data source has been in an irregular or failed state. When the Status is Healthy, the value is empty.
- Config Last Updated: The timestamp of the last change to the feed configuration. Use this value to correlate configuration updates with observed irregularities, helping you determine the root cause of ingestion or parsing problems.
- Last Collected: The timestamp of the last data collection.
- Last Ingested: The timestamp of the last successful ingestion. Use this metric to identify whether your logs are reaching Google SecOps.
- View Ingestion Details: A link that opens a new tab with another dashboard, which contains additional, historical information—for deeper analysis.
- Edit Data Source: A link that opens a new tab with the corresponding feed configuration—where you can fix configuration-related irregularities.
- Set Up Alerts: A link that opens a new tab with the corresponding Cloud Monitoring interface.
Parsing Health table that includes the following columns:
- Name: The log type—for example, DNS, USER, GENERIC, GCP SECURITYCENTER THREAT, or WEBPROXY.
- Status: The cumulative status of the log type (for example, Healthy or Failed), derived from the normalization ratio.
- Issue Details: The parsing problem or problems, if any exist—for example, Failed parsing logs, Config credential issue, or Normalization issue. The stated issue can be actionable (for example, Incorrect Auth) or non-actionable (for example, Internal_error). When the Status is Healthy, no value is displayed.
- Issue Duration: The number of days that the data source has been in an irregular or failed state. When the Status is Healthy, no value is displayed.
- Last Ingested: The timestamp of the last successful ingestion. You can use this metric to determine whether logs are reaching Google SecOps.
- Last Event: The event timestamp of the last normalized log.
- Last Parsed: The timestamp of the last parsing and normalization action for the log type. You can use this metric to determine whether raw logs are successfully transformed into UDM events.
- View Parsing Details: A link that opens a new tab with another dashboard, which contains additional, historical information—for deeper analysis.
- Edit Parser: A link that opens a new tab with the corresponding parser configuration—where you can fix configuration-related irregularities.
- Set Up Alert: A link that opens a new tab with the corresponding Cloud Monitoring interface.
Irregularity-detection engine
The Data Health Monitoring and Troubleshooting dashboard uses the Google SecOps irregularity-detection engine to automatically identify significant changes in your data, letting you quickly detect and address potential problems.
Data ingestion irregularity-detection
The irregularity-detection engine uses the following calculations to detect unusual surges or drops in your data ingestion. Google SecOps analyzes daily volume changes while considering normal weekly patterns:
- Daily and weekly comparisons: Google SecOps calculates the difference in ingestion volume between the current day and the previous day, and also the difference between the current day and the average volume over the past week.
- Standardization: To understand the significance of these changes, Google SecOps standardizes them using the following z-score formula:

  z = (x_i − x_bar) / stdev

  where:

  - z: The standardized score (z-score) for an individual difference.
  - x_i: An individual difference value.
  - x_bar: The mean of the differences.
  - stdev: The standard deviation of the differences.
- Irregularity flagging: Google SecOps flags an irregularity if both the daily and weekly standardized changes are statistically significant (a minimal sketch of this check follows the list). Specifically, Google SecOps looks for the following:
- Drops: Both the daily and weekly standardized differences are less than -1.645.
- Surges: Both the daily and weekly standardized differences are greater than 1.645.
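The following is a minimal Python sketch of the surge-and-drop check described above. It assumes the latest daily and weekly differences are standardized against the history of prior differences; the sample volumes, window sizes, and that interpretation of the formula are illustrative assumptions, not the exact engine implementation.

```python
# Sketch of the ±1.645 z-score check described above, applied to daily ingestion
# volumes. The input series and window sizes are illustrative assumptions.
from statistics import mean, stdev

Z_THRESHOLD = 1.645  # ~95th percentile of the standard normal distribution

def zscore(value, history):
    """Standardize a value against the mean and standard deviation of history."""
    return (value - mean(history)) / stdev(history)

def detect_irregularity(daily_volumes):
    """daily_volumes: oldest-to-newest ingested log counts, at least 10 days
    (the dashboard itself uses a 30-day lookback)."""
    # Day-over-day differences across the lookback window.
    daily_diffs = [b - a for a, b in zip(daily_volumes, daily_volumes[1:])]
    # Difference between each day and the average of the preceding 7 days.
    weekly_diffs = [
        daily_volumes[i] - mean(daily_volumes[i - 7:i])
        for i in range(7, len(daily_volumes))
    ]

    # Standardize the latest difference against the history of prior differences.
    z_daily = zscore(daily_diffs[-1], daily_diffs[:-1])
    z_weekly = zscore(weekly_diffs[-1], weekly_diffs[:-1])

    if z_daily < -Z_THRESHOLD and z_weekly < -Z_THRESHOLD:
        return "drop"
    if z_daily > Z_THRESHOLD and z_weekly > Z_THRESHOLD:
        return "surge"
    return "normal"

# Example: a steady source whose volume collapses on the most recent day.
volumes = [1000, 980, 1020, 1010, 990, 1005, 995, 1015, 1000, 200]
print(detect_irregularity(volumes))  # drop
```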
Normalization ratio
When calculating the ratio of ingested events to normalized events, the irregularity-detection engine uses a combined approach to ensure that only significant drops in normalization rates are flagged. The irregularity-detection engine generates an alert only when the following two conditions are met (a minimal sketch follows the list):
- There is a statistically significant drop in the normalization ratio compared to the previous day.
- The drop is also significant in absolute terms, with a magnitude of 0.05 or greater.
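The following minimal sketch applies both conditions to a series of daily normalization ratios. The document doesn't define how statistical significance is computed for this check, so the sketch reuses a z-score against the prior days' changes with a 1.645 cut-off as an assumption; the 0.05 absolute threshold comes from the rule above.

```python
# Sketch of the combined normalization-ratio check: flag a day only when the drop
# versus the previous day is both statistically unusual and at least 0.05 in
# absolute terms. The significance test and the sample history are illustrative.
from statistics import mean, stdev

MIN_ABSOLUTE_DROP = 0.05   # absolute threshold from the rule above
Z_THRESHOLD = 1.645        # assumed cut-off for "statistically significant"

def normalization_ratio_alert(daily_ratios):
    """daily_ratios: oldest-to-newest normalized/ingested ratios (0.0 to 1.0)."""
    # Day-over-day drops in the ratio (positive value = the ratio fell).
    drops = [a - b for a, b in zip(daily_ratios, daily_ratios[1:])]
    latest_drop, history = drops[-1], drops[:-1]
    if len(history) < 2 or stdev(history) == 0:
        return False
    statistically_significant = (latest_drop - mean(history)) / stdev(history) > Z_THRESHOLD
    return statistically_significant and latest_drop >= MIN_ABSOLUTE_DROP

# Example: the ratio slips from ~0.97 to 0.85 on the latest day.
print(normalization_ratio_alert([0.97, 0.96, 0.98, 0.97, 0.96, 0.97, 0.85]))  # True
```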
Parsing error irregularity detection
For errors that occur during data parsing, the irregularity-detection engine uses a ratio-based method. The irregularity-detection engine triggers an alert if the proportion of parser errors relative to the total number of ingested events increases by 5 percentage points or more compared to the previous day.
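Because this rule is a straight percentage-point comparison, it can be expressed directly; the counts below are hypothetical.

```python
# Sketch of the parser-error ratio check: alert when the error share of all
# ingested events rises by 5 percentage points or more versus the previous day.
def parser_error_alert(errors_yesterday, total_yesterday, errors_today, total_today):
    """Return True when the error ratio grows by >= 5 percentage points day over day."""
    ratio_yesterday = errors_yesterday / total_yesterday
    ratio_today = errors_today / total_today
    return (ratio_today - ratio_yesterday) >= 0.05

# Example: errors jump from 2% of events to 9% of events.
print(parser_error_alert(200, 10_000, 900, 10_000))  # True
```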
What's next
- Learn more about dashboards
- Learn how to create a custom dashboard
- Use Cloud Monitoring for ingestion notifications