Monitor and troubleshoot data health
This document describes the Data Health Monitoring and Troubleshooting dashboard.
The Data Health Monitoring and Troubleshooting dashboard functions as a central location in Google SecOps for you to monitor the status and health of all of your data sources.
The Data Health Monitoring and Troubleshooting dashboard includes information about the following:
- Ingestion volumes and ingestion health.
- Parsing volumes from raw logs to Unified Data Model (UDM) events.
- Context, including links to interfaces with additional relevant information and functionality.
- Irregular and failed sources and log types. The Data Health Monitoring and Troubleshooting dashboard detects irregularities on a per-customer basis. It uses statistical methods with a 30-day lookback period to analyze ingestion data. Items that are marked irregular identify surges or drops in data being ingested and processed by Google SecOps.
Key benefits
You can use the Data Health Monitoring and Troubleshooting dashboard to do the following:
- Monitor overall data health at a glance. View the core health status and associated metrics for each feed, data source, log type, and source (that is, the feed ID).
- Monitor aggregated data-health metrics for:
- Ingestion and parsing over time with highlighted events (not necessarily irregularities) that link to filtered dashboards.
- Irregularities, both current and over time.
- Access related dashboards, filtered by time range, log type, or feed.
- Access the feed configuration to edit and fix or remediate a problem.
- Access the parser configuration to edit and fix or remediate a problem.
- Click a link to open the Cloud Monitoring interface, and from there, configure custom API-based alerts using Status and Log Volume metrics.
Key questions
This section refers to the Data Health Monitoring and Troubleshooting dashboard components and parameters, which are described in the Interface section.
You can use the Data Health Monitoring and Troubleshooting dashboard to answer the following key questions about your data pipeline:
- Are my logs reaching the SIEM system?
You can verify whether logs are reaching the SIEM system by using the Last Successful Ingestion and Last Collection metrics. These metrics confirm the last time data was successfully delivered. Additionally, the Ingestion Volume metrics (per source and per log type) show you the amount of data being ingested.
- Are my logs being parsed correctly?
To confirm correct parsing, check the Last Normalization Time metric. This metric indicates when the last successful transformation from raw log into a UDM event occurred.
- Why is ingestion or parsing not happening?
The text in the Issue column identifies specific problems and helps you distinguish actionable errors from non-actionable errors. The text Forbidden 403: Permission denied is an example of an actionable error, where the auth account provided in the feed configuration lacks required permissions. The text Internal_error is an example of a non-actionable error, where the recommended action is to open a support case with Google SecOps.
- Are there significant changes in the number of ingested and parsed logs?
The Status field shows your data's health (from Healthy to Failed), based on data volume. You can also view the Ingestion Volume graphs (per source and per log type) to identify sudden or sustained surges or drops.
- How can I get alerted if my sources are failing?
The Data Health Monitoring and Troubleshooting dashboard feeds the Status and Log Volume metrics into Cloud Monitoring. In one of the Data Health Monitoring and Troubleshooting dashboard tables, click the relevant Alerts link to open the Cloud Monitoring interface. There, you can configure custom API-based alerts using Status and Log Volume metrics.
- How do I infer a delay in a log-type ingestion?
A delay is indicated when the Latest Event Time is significantly behind the Last Ingestion Time. The Data Health Monitoring and Troubleshooting dashboard exposes the 95th percentile of the Last Ingestion Time minus Latest Event Time delta, per log type (see the example after this list). A high value suggests a latency problem within the Google SecOps pipeline, whereas a normal value might indicate the source is pushing old data.
- Have any recent changes in my configuration caused feed failures?
If the Last Modified timestamp is close to the Last Successful Ingestion timestamp, it suggests that a recent configuration update may be the cause of a failure. This correlation helps in root cause analysis.
- How has the health of ingestion and parsing been trending over time?
The Total Ingestion, Ingestion Health, and Parsers Health graphs show the historical trend of your data's health, letting you observe long-term patterns and irregularities.
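For illustration only, the following is a minimal Python sketch of how the delta between ingestion time and event time, and its 95th percentile, could be computed for the logs of one log type. The function name, its signature, and the nearest-rank percentile method are assumptions for the example, not the Google SecOps implementation.

```python
from datetime import datetime, timedelta

def p95_ingestion_delay(ingestion_times: list[datetime],
                        event_times: list[datetime]) -> timedelta:
    """Return the 95th percentile of (ingestion time - event time).

    ingestion_times[i] and event_times[i] describe the same log entry
    for a single log type.
    """
    deltas = sorted(
        ingested - occurred
        for ingested, occurred in zip(ingestion_times, event_times)
    )
    # Nearest-rank 95th percentile.
    index = max(0, int(round(0.95 * len(deltas))) - 1)
    return deltas[index]
```

A persistently large result points at pipeline latency, while a normal result combined with a lagging Latest Event Time suggests the source is sending old data.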
Interface
The Data Health Monitoring and Troubleshooting dashboard displays the following widgets:
- Big number widgets for parsing:
- Healthy Parsing: The number of log types parsed successfully. That is, the number of parsing statuses that aren't failed or irregular.
- Irregular Parsing: The number of irregular parsing statuses.
- Failed Parsing: The number of failed parsing statuses.
- Total Ingested and Total Parsed: A line graph showing the Parsed Data and Incoming Data logs-per-minute curves over time.
- Data Source Health Overview graph: A line graph showing the Failed and Irregular issues-per-day curves over time for data ingestion.
- Parsing Health Overview: A line graph showing the Failed and Irregular issues-per-day curves over time for parsers.
- Big number widgets for sources (feeds):
- Healthy Sources: The number of sources with no irregularities.
- Critical Sources: The number of sources with critical irregularities.
- Source Warnings: The number of sources with warnings.
- Ingestion Health table, which includes the following columns:
- Status: The cumulative status of the feed (for example, Healthy or Irregular), derived from data volume, configuration errors, and API errors.
- Name: The feed name.
- Mechanism: The type of ingestion mechanism—for example, Ingestion API, Native Workspace Ingestion, or Azure Event Hub Feeds.
- Log Type: The log type.
- Issue Details: The problem, if one exists—for example, Failed parsing logs, Config credential issue, or Normalization issue. The stated issue can be actionable (for example, Incorrect Auth) or non-actionable (for example, Internal_error). When the Status is Healthy, the value is empty.
- Issue Duration: The number of days that the data source has been in an irregular or failed state. When the Status is Healthy, the value is empty.
- Config Last Updated: The timestamp of the last configuration change. Use this value to correlate configuration updates with observed irregularities, helping you determine the root cause of ingestion or parsing problems.
- Last Collected: The timestamp of the last data collection.
- Last Ingested: The timestamp of the last successful ingestion. Use this metric to identify whether your logs are reaching Google SecOps.
- View Ingestion Details: A link that opens another dashboard in a new tab, with additional historical information for deeper analysis.
- Edit Data Source: A link that opens the corresponding feed configuration in a new tab, where you can fix configuration-related irregularities.
- Set Up Alerts: A link that opens the corresponding Cloud Monitoring interface in a new tab.
- Parsing Health table, which includes the following columns:
- Name: The log type—for example, DNS, USER, GENERIC, GCP SECURITYCENTER THREAT, or WEBPROXY.
- Status: The cumulative status of the log type (for example, Healthy or Failed), derived from the normalization ratio.
- Issue Details: The parsing problem or problems, if any exist—for example, Failed parsing logs, Config credential issue, or Normalization issue. The stated issue can be actionable (for example, Incorrect Auth) or non-actionable (for example, Internal_error). When the Status is Healthy, no value is displayed.
- Issue Duration: The number of days that the data source has been in an irregular or failed state. When the Status is Healthy, no value is displayed.
- Last Ingested: The timestamp of the last successful ingestion. You can use this metric to determine whether logs are reaching Google SecOps.
- Last Event: The event timestamp of the last normalized log.
- Last Parsed: The timestamp of the last parsing and normalization action for the log type. You can use this metric to determine whether raw logs are successfully transformed into UDM events.
- View Parsing Details: A link that opens another dashboard in a new tab, with additional historical information for deeper analysis.
- Edit Parser: A link that opens the corresponding parser configuration in a new tab, where you can fix configuration-related irregularities.
- Set Up Alert: A link that opens the corresponding Cloud Monitoring interface in a new tab.
Irregularity-detection engine
The Data Health Monitoring and Troubleshooting dashboard uses the Google SecOps irregularity-detection engine to automatically identify significant changes in your data, letting you quickly detect and address potential problems.
Data ingestion irregularity detection
To detect unusual surges or drops in your data ingestion, Google SecOps analyzes daily volume changes while accounting for normal weekly patterns. The irregularity-detection engine uses the following calculations:
- Daily and weekly comparisons: Google SecOps calculates the difference in ingestion volume between the current day and the previous day, and also the difference between the current day and the average volume over the past week.
- Standardization: To understand the significance of these changes, Google SecOps standardizes them using the following z-score formula:

  z = (xi − x_bar) / stdev

  where:

  - z: The standardized score (z-score) for an individual difference.
  - xi: An individual difference value.
  - x_bar: The mean of the differences.
  - stdev: The standard deviation of the differences.
- Irregularity flagging: Google SecOps flags an irregularity if both the daily and weekly standardized changes are statistically significant. Specifically, Google SecOps searches for:
- Drops: Both the daily and weekly standardized differences are less than -1.645.
- Surges: Both the daily and weekly standardized differences are greater than 1.645.
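As a rough illustration of this logic, here is a minimal Python sketch that flags surges and drops from a series of daily ingestion volumes. The function names, the exact windowing, and the way the difference history is built are assumptions for the example; only the 1.645 threshold and the dual daily/weekly condition come from the description above.

```python
import statistics

Z_THRESHOLD = 1.645  # one-sided significance threshold described above

def standardized(diffs: list[float]) -> float:
    """Z-score of the most recent difference relative to the history of differences."""
    if len(diffs) < 2:
        return 0.0
    stdev = statistics.stdev(diffs)
    if stdev == 0:
        return 0.0
    return (diffs[-1] - statistics.mean(diffs)) / stdev

def detect_ingestion_irregularity(daily_volumes: list[float]) -> str | None:
    """Return "surge", "drop", or None for the most recent day.

    daily_volumes is ordered oldest to newest, for example a 30-day lookback.
    """
    if len(daily_volumes) < 9:
        return None  # not enough history for daily and weekly comparisons

    # Difference between each day and the previous day.
    daily_diffs = [b - a for a, b in zip(daily_volumes, daily_volumes[1:])]

    # Difference between each day and the average of the preceding 7 days.
    weekly_diffs = [
        daily_volumes[i] - statistics.mean(daily_volumes[i - 7:i])
        for i in range(7, len(daily_volumes))
    ]

    z_daily = standardized(daily_diffs)
    z_weekly = standardized(weekly_diffs)

    if z_daily < -Z_THRESHOLD and z_weekly < -Z_THRESHOLD:
        return "drop"
    if z_daily > Z_THRESHOLD and z_weekly > Z_THRESHOLD:
        return "surge"
    return None
```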
Normalization ratio
When calculating the ratio of ingested events to normalized events, the irregularity-detection engine uses a combined approach to ensure that only significant drops in normalization rates are flagged. The irregularity-detection engine generates an alert only when the following two conditions are met:
- There is a statistically significant drop in the normalization ratio compared to the previous day.
- The drop is also significant in absolute terms, with a magnitude of 0.05 or greater.
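The following is a minimal sketch of this combined check, under the assumption that the normalization ratio is the share of ingested events that were normalized and that the significance test mirrors the z-score approach used for ingestion volumes; the names and signature are illustrative only.

```python
import statistics

Z_THRESHOLD = 1.645             # assumed significance threshold
ABSOLUTE_DROP_THRESHOLD = 0.05  # minimum absolute drop, from the description above

def normalization_ratio_alert(daily_ratios: list[float]) -> bool:
    """daily_ratios: per-day normalization ratio (normalized / ingested), oldest to newest."""
    if len(daily_ratios) < 4:
        return False

    # Day-over-day changes in the ratio; the last entry is today vs. yesterday.
    diffs = [b - a for a, b in zip(daily_ratios, daily_ratios[1:])]
    latest_change = diffs[-1]

    stdev = statistics.stdev(diffs)
    z = (latest_change - statistics.mean(diffs)) / stdev if stdev else 0.0

    # Alert only when the drop is statistically significant
    # and at least 0.05 in absolute terms.
    return z < -Z_THRESHOLD and latest_change <= -ABSOLUTE_DROP_THRESHOLD
```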
Parsing error irregularity detection
For errors that occur during data parsing, the irregularity-detection engine uses a ratio-based method. The irregularity-detection engine triggers an alert if the proportion of parser errors relative to the total number of ingested events increases by 5 percentage points or more compared to the previous day.
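For example, the check could look like the following minimal Python sketch; the function and argument names are illustrative assumptions, but the 5-percentage-point day-over-day threshold matches the description above.

```python
ERROR_RATE_INCREASE_THRESHOLD = 0.05  # 5 percentage points

def parsing_error_alert(errors_today: int, events_today: int,
                        errors_yesterday: int, events_yesterday: int) -> bool:
    """Alert when the parser-error rate rises by 5 percentage points
    or more compared to the previous day."""
    if events_today == 0 or events_yesterday == 0:
        return False
    rate_today = errors_today / events_today
    rate_yesterday = errors_yesterday / events_yesterday
    return (rate_today - rate_yesterday) >= ERROR_RATE_INCREASE_THRESHOLD
```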
What's next
- Learn more about dashboards
- Learn how to create a custom dashboard
- Use Cloud Monitoring for ingestion notifications

