AlloyDB has an architecture that separates compute and storage, allowing each to scale independently. While the primary and read pool instances share the same underlying storage, replication is still a crucial process for maintaining data consistency and freshness across the read replicas. In an AlloyDB cluster, writes are performed on the primary instance and then recorded in the Write-Ahead Log (WAL) on shared storage. These changes are then replicated to the read pool instances' in-memory caches. Understanding the two main steps of this replication process is key to troubleshooting any issues:
- WAL flush: the WAL, which contains the changes to the database, is sent from the primary to the replica, which immediately persists it to disk.
- WAL apply: the persisted WAL is then replayed on the replica, meaning the changes are applied to the replica's local data.
Delays in either of these steps contribute to what is known as replication lag. However, this term might be ambiguous. To be more precise, we can break down replication lag into the following two components:
- Flush or network lag: this is the delay in the WAL flush step. It's the time it takes for the WAL to be sent from the primary and persisted on the replica.
- Replay lag: this is the delay in the WAL apply step. It's the time it takes for the replica to apply the changes from the WAL.
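If you prefer to inspect these components from the database itself, standard PostgreSQL exposes them in the pg_stat_replication view on the primary. The following is a minimal sketch, assuming psycopg2 and placeholder connection details; how closely AlloyDB's shared-storage replication maps onto these columns can differ from vanilla PostgreSQL, so treat it as illustrative and prefer the monitoring options described in the next section.

```python
# Minimal sketch: inspect per-replica lag from the primary using the standard
# PostgreSQL pg_stat_replication view. Connection details are placeholders.
import psycopg2

conn = psycopg2.connect(host="10.0.0.5", dbname="postgres",
                        user="postgres", password="secret")  # placeholders

with conn, conn.cursor() as cur:
    cur.execute("""
        SELECT application_name,
               write_lag,    -- time until the WAL was written on the replica
               flush_lag,    -- time until the WAL was persisted (flush lag)
               replay_lag    -- time until the WAL was applied (replay lag)
        FROM pg_stat_replication;
    """)
    for name, write_lag, flush_lag, replay_lag in cur.fetchall():
        print(f"{name}: flush_lag={flush_lag} replay_lag={replay_lag}")

conn.close()
```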
Whether flush lag or replay lag is the greater concern depends on your use case:
- If you are concerned about data loss, for example, with cross-region replicas, then you must pay close attention to flush lag. If the WAL isn't yet persisted on the replica and the primary crashes, the changes are lost from the perspective of the replica.
- If you are concerned about the freshness of data on your read replicas, then you must pay close attention to replay lag. A high replay lag means that the data on your read replicas is stale.
Check for replication lag
You can monitor the replication lag of your read pool instances in the Google Cloud console. For more information, see Monitor instances. You can also monitor read pool replication lag and receive alerts when it exceeds a specified threshold; for more information, see Create metric-threshold alerting policies.
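As an alternative to the console workflow, the following sketch creates a metric-threshold alerting policy with the google-cloud-monitoring Python client. The project ID, threshold, and especially the replication lag metric type are placeholders rather than confirmed names; look up the exact metric in the System insights metrics reference before using this.

```python
# Hedged sketch: create a metric-threshold alerting policy for replication
# lag. PROJECT_ID and LAG_METRIC_TYPE are placeholders; substitute the real
# metric type from the System insights metrics reference.
from google.cloud import monitoring_v3
from google.protobuf import duration_pb2

PROJECT_ID = "my-project"                                            # placeholder
LAG_METRIC_TYPE = "alloydb.googleapis.com/REPLACE_WITH_LAG_METRIC"   # placeholder

client = monitoring_v3.AlertPolicyServiceClient()

policy = monitoring_v3.AlertPolicy(
    display_name="AlloyDB read pool replication lag",
    combiner=monitoring_v3.AlertPolicy.ConditionCombinerType.OR,
    conditions=[
        monitoring_v3.AlertPolicy.Condition(
            display_name="Replication lag above threshold",
            condition_threshold=monitoring_v3.AlertPolicy.Condition.MetricThreshold(
                filter=f'metric.type = "{LAG_METRIC_TYPE}"',
                comparison=monitoring_v3.ComparisonType.COMPARISON_GT,
                threshold_value=10.0,                          # e.g. 10 seconds of lag
                duration=duration_pb2.Duration(seconds=300),   # sustained for 5 minutes
                aggregations=[
                    monitoring_v3.Aggregation(
                        alignment_period=duration_pb2.Duration(seconds=300),
                        per_series_aligner=monitoring_v3.Aggregation.Aligner.ALIGN_MEAN,
                    )
                ],
            ),
        )
    ],
)

created = client.create_alert_policy(
    name=f"projects/{PROJECT_ID}", alert_policy=policy
)
print("Created alerting policy:", created.name)
```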
Common causes of replication lag
Here are some common causes of replication lag and how to address them.
Replication conflicts
Read queries can sometimes block the replication process because they hold resources that the replication process is waiting for. For example, if a read query holds a pin on a buffer that the replay process needs to modify, replication is blocked until the query releases it. These are known as buffer pin conflicts.
You can identify these conflicts by looking for "canceling statement due to conflict with recovery" messages in the postgres.log file in the Logs Explorer.
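For example, the following sketch searches for those messages with the google-cloud-logging Python client. The project ID is a placeholder, and the log filter is an assumption that you might need to adjust to match how your cluster's logs appear in the Logs Explorer.

```python
# Hedged sketch: search Cloud Logging for recovery-conflict cancellations.
# The project ID is a placeholder and the resource filter is an assumption;
# adjust both to match your environment.
from google.cloud import logging

client = logging.Client(project="my-project")  # placeholder

log_filter = (
    'resource.type="alloydb.googleapis.com/Instance" AND '
    'textPayload:"canceling statement due to conflict with recovery"'
)

# Print the 20 most recent matching entries.
for entry in client.list_entries(
    filter_=log_filter, order_by=logging.DESCENDING, max_results=20
):
    print(entry.timestamp, entry.payload)
```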
To mitigate replication conflicts, you can do the following:
- Reduce max_standby_streaming_delay: this parameter determines how long the replication process waits before canceling queries that are blocking it. The default value is 30 seconds. Reducing this value can help to reduce replication lag, but it might also cause more read queries to be canceled. You can tune this parameter to find the best balance for your application (the sketch after this list shows how to check the current value on a read node).
- Avoid long-running queries: long-running queries on read pools can increase the likelihood of replication conflicts. Consider moving long-running queries to a different read pool where low replication lag is not as critical.
- Enable alloydb.promote_cancel_to_terminate: this flag, which is on by default, lets AlloyDB forcefully terminate query backends that are unresponsive to cancellation requests. This can help to prevent unresponsive backends from blocking replication for extended periods.
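As a starting point, the following sketch connects to a read pool instance and checks the standard PostgreSQL conflict counters along with the current max_standby_streaming_delay setting. It assumes psycopg2 and placeholder connection details, and that these standard views and settings behave on AlloyDB as they do in PostgreSQL.

```python
# Minimal sketch: check recovery-conflict counters and the standby delay
# setting on a read pool instance. Connection details are placeholders.
import psycopg2

conn = psycopg2.connect(host="10.0.0.6", dbname="postgres",
                        user="postgres", password="secret")  # placeholders

with conn, conn.cursor() as cur:
    # Per-database counts of queries canceled by each conflict type,
    # including buffer pin conflicts (confl_bufferpin).
    cur.execute("""
        SELECT datname, confl_lock, confl_snapshot, confl_bufferpin
        FROM pg_stat_database_conflicts
        ORDER BY datname;
    """)
    for row in cur.fetchall():
        print(row)

    # How long replay waits before canceling conflicting queries.
    cur.execute("SHOW max_standby_streaming_delay;")
    print("max_standby_streaming_delay =", cur.fetchone()[0])

conn.close()
```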
Resource contention
Replication might also be slowed down by contention for system resources, such as CPU and memory.
- CPU and memory pressure: a heavy read workload on a read pool instance can compete with the replication process for CPU and memory resources. You can check the CPU and memory usage of your instances in the Google Cloud console. If you see high resource utilization, you might need to scale up or scale out your read pool instances.
- Read pool node size: if your primary instance is much larger than your read pool nodes, it might generate replication logs faster than the read nodes can process them. In such cases, it is recommended to use larger-sized read nodes to give the replicas more resources.
Temporary read query throttling
AlloyDB also gives you control over whether to enable lag-based throttling of read queries on read nodes using the google_storage.log_replay_throttle_read_transactions flag. If the parameter is set to its default value of on, then when replication lag exceeds one second, read queries are throttled for up to one minute: new queries are paused from starting and existing queries are prevented from reading new buffers. This feature trades read latency for replication lag: replay gets more resources and catches up faster, at the cost of potentially higher read latency. If your application is not sensitive to replication lag, you can prioritize read query latency by setting google_storage.log_replay_throttle_read_transactions to off.
You can monitor the impact of query throttling using the following methods:
- Cloud Monitoring: use the alloydb.googleapis.com/instance/postgresql/wait_count metric to see how many queries have been throttled. To do this, filter the metric by the wait_event_name and look for WAIT_EVENT_HIGH_LAG_THROTTLE. To see the total time queries were throttled, you can use the alloydb.googleapis.com/instance/postgresql/wait_time metric with the same filter. For more information, see the System insights metrics reference. The sketch after this list shows one way to query this metric programmatically.
- Query insights: in the Query insights dashboard, the Active queries view shows the WAIT_EVENT_HIGH_LAG_THROTTLE wait event in the Wait event column when a query is being throttled due to replication lag. For more details, see Monitor active queries.
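The following sketch reads the wait_count metric over the last hour with the google-cloud-monitoring Python client, filtered to the WAIT_EVENT_HIGH_LAG_THROTTLE wait event. The project ID is a placeholder, and the exact label path for the wait event name is an assumption; verify it against the System insights metrics reference.

```python
# Hedged sketch: count throttled queries over the last hour by reading the
# wait_count metric. PROJECT_ID is a placeholder; the label path
# metric.labels.wait_event_name is an assumption to verify against the
# System insights metrics reference.
import time
from google.cloud import monitoring_v3

PROJECT_ID = "my-project"  # placeholder

client = monitoring_v3.MetricServiceClient()
now = int(time.time())
interval = monitoring_v3.TimeInterval(
    {"end_time": {"seconds": now}, "start_time": {"seconds": now - 3600}}
)

results = client.list_time_series(
    request={
        "name": f"projects/{PROJECT_ID}",
        "filter": (
            'metric.type = "alloydb.googleapis.com/instance/postgresql/wait_count" '
            'AND metric.labels.wait_event_name = "WAIT_EVENT_HIGH_LAG_THROTTLE"'
        ),
        "interval": interval,
        "view": monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
    }
)

# Each time series corresponds to one monitored resource; print its labels
# and the raw data points.
for series in results:
    print(dict(series.resource.labels))
    for point in series.points:
        print(" ", point.interval.end_time, point.value.int64_value)
```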
Heavy workload
A sudden increase in the write workload on the primary instance can generate a large amount of replication logs, which can overwhelm the read pool instances and cause replication lag. You can monitor the write traffic on your primary instance in the Google Cloud console.
Large transactions
Large transactions, such as a COMMIT or ABORT that affects a large number of rows, can take a long time to replicate to the read pool instances. In PostgreSQL 14 and 15, a long-running transaction that holds a long list of exclusive locks might cause the memory usage of the read replica to grow, which might eventually lead to the read pool instance crashing.
To mitigate this issue, you can terminate the long-running transaction on the primary instance.
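To find candidate transactions, the following sketch lists long-running transactions on the primary using the standard pg_stat_activity view and shows how to terminate one with pg_terminate_backend(). It assumes psycopg2; the connection details and the 30-minute cutoff are placeholders. Terminating a backend rolls back its transaction, so use it with care.

```python
# Minimal sketch: list long-running transactions on the primary and
# (optionally) terminate one. Connection details and the cutoff are
# placeholders.
import psycopg2

conn = psycopg2.connect(host="10.0.0.5", dbname="postgres",
                        user="postgres", password="secret")  # placeholders

with conn, conn.cursor() as cur:
    cur.execute("""
        SELECT pid, usename, state, xact_start, query
        FROM pg_stat_activity
        WHERE xact_start IS NOT NULL
          AND now() - xact_start > interval '30 minutes'
        ORDER BY xact_start;
    """)
    for pid, user, state, xact_start, query in cur.fetchall():
        print(pid, user, state, xact_start, query[:80])

    # To terminate a specific offender, pass its PID (example only):
    # cur.execute("SELECT pg_terminate_backend(%s);", (offending_pid,))

conn.close()
```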
Troubleshoot issues that prevent replication
Before you can have replication lag, you must have a functioning read pool. The following issues might prevent replication from occurring altogether, either by preventing the creation of a read pool or by causing a read replica to crash.
Read pool creation issues
If read pool creation fails, you might see a Failed to create read pool message in the AlloyDB logs in Cloud Logging. This might happen if the cluster has reached its maximum storage limit, preventing the primary instance from allocating more space. Although AlloyDB automatically scales storage, you might need to investigate what is consuming storage and delete unnecessary data, or contact support to request an increase to your storage quota.
Read pool instance crashes
In PostgreSQL 14 and 15, a long-running transaction on the primary instance that holds a long list of exclusive locks might cause the memory usage of a read replica to grow, which might eventually lead to the read pool instance crashing.
To mitigate this issue, you can terminate the long-running transaction on the primary instance.
Impact of instance resizing on replication lag
AlloyDB's storage architecture ensures that read pool flush lag is not affected by instance resizing. The same doesn't apply to replay, because the replica's ability to replay depends on its workload. If you update your instance configuration, for example by resizing it, the replica might not have a fully warmed cache when the operation completes, depending on the workload. Replaying or processing records that aren't cached yet is slower, so replay lag might increase temporarily.