Instances are the basic building blocks of App Engine, providing all the resources needed to successfully host your application. At any given time, your application can be running on one or many instances with requests being spread across all of them. Each instance includes a security layer to ensure that instances cannot inadvertently affect each other.
App Engine can automatically create and shut down instances as traffic fluctuates, or you can specify a number of instances to run regardless of the amount of traffic. To determine how and when new instances are created, you specify a scaling type for your app. The scaling settings are applied at the App Engine version level as part of the app.yaml file.
Scaling types
App Engine supports the following scaling types, which control how and when instances are created:
- Automatic (default)
- Basic
- Manual
You specify the scaling type in your app's
app.yaml.
By default, your app uses automatic scaling, which means App Engine will 
manage the number of idle instances.
- Automatic scaling
- Automatic scaling creates instances based on request rate, response latencies,
and other application metrics. You can specify thresholds for each of these
metrics, as well as a minimum number of instances to keep running at all times,
by configuring the automatic_scaling element.
- Basic scaling
- Basic scaling creates instances when your application receives requests. Each instance will be shut down when the application becomes idle. Basic scaling is ideal for work that is intermittent or driven by user activity.
- Manual scaling
- Manual scaling specifies the number of instances that continuously run regardless of the load level. This suits tasks such as complex initializations and applications that rely on in-memory state over time.
| Feature | Automatic scaling | Basic scaling | Manual scaling | 
|---|---|---|---|
| Request timeout | 10 minutes for HTTP requests and task queue tasks. If your app doesn't return a response within this time limit, App Engine interrupts the request handler and emits an error for your code to handle. For legacy runtimes (Java 8, PHP 5, and Python 2): | 24 hours for HTTP requests and task queue tasks. If your app doesn't return a response within this time limit, App Engine interrupts the request handler and emits an error for your code to handle. A basic-scaled instance can choose to handle | Same as basic scaling. |
| Background threads | Not allowed | Allowed | Allowed | 
| Residence | Instances are shut down based on usage patterns. | Instances are shut down based on the idle_timeout parameter. If an instance has been idle, for example it has not received a request for more than idle_timeout, then the instance is shut down. | Instances remain in memory and state is preserved across requests. When instances are stopped, an /_ah/stop request appears in the logs. If there is an /_ah/stop handler or a registered shutdown hook, it has 30 seconds to complete before shutdown occurs. |
| Startup and shutdown | Instances are created on demand to handle requests and automatically turned down when idle. | Instances are created on demand to handle requests and automatically shut down when idle, based on the idle_timeout configuration parameter. An instance that is manually stopped has 30 seconds to finish handling requests before it is forcibly terminated. | Instances are sent a start request automatically by App Engine in the form of an empty GET request to /_ah/start. As with basic scaling, an instance that is manually stopped has 30 seconds to finish handling requests before it is forcibly terminated. |
| Instance addressability | Instances are anonymous. | Instance "i" of version "v" of service "s" is addressable at the URL: https://i-dot-v-dot-s-dot-app_id.REGION_ID.r.appspot.com. If you have set up a wildcard subdomain mapping for a custom domain, you can also address a service or any of its instances via a URL of the form https://s.domain.com or https://i.s.domain.com. You can reliably cache state in each instance and retrieve it in subsequent requests. | Same as basic scaling. |
| Scaling | App Engine scales the number of instances automatically in response to processing volume. This scaling factors in the automatic_scaling settings that are provided on a per-version basis in the configuration file. | A service with basic scaling is configured by setting the maximum number of instances in the max_instances parameter of the basic_scaling setting. The number of live instances scales with the processing volume. | You configure the number of instances of each version in that service's configuration file. The number of instances usually corresponds to the size of a dataset being held in memory or the desired throughput for offline work. You can adjust the number of instances of a manually-scaled version very quickly, without stopping instances that are currently running, using the Modules API set_num_instances function. |
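The addressing pattern in the Instance addressability row can be sketched as a small helper. This is an illustration only: the function name `instance_url` and the sample values are assumptions, not part of any App Engine API.

```python
def instance_url(instance: int, version: str, service: str,
                 app_id: str, region_id: str) -> str:
    """Build the URL of one instance of a basic- or manual-scaled service.

    Follows the pattern from the table:
    https://i-dot-v-dot-s-dot-app_id.REGION_ID.r.appspot.com
    """
    host = f"{instance}-dot-{version}-dot-{service}-dot-{app_id}"
    return f"https://{host}.{region_id}.r.appspot.com"


# Hypothetical values, for illustration only.
print(instance_url(0, "v1", "worker", "my-app", "uc"))
```

Automatic scaling instances are anonymous, so a helper like this only applies to basic and manual scaling.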
Scaling dynamic instances
App Engine applications that use basic or automatic scaling are powered by any number of dynamic instances at a given time, depending on the volume of incoming requests. As requests for your application increase, the number of dynamic instances may increase as well.
Apps with basic scaling
If you use basic scaling, App Engine attempts to keep your cost low, even though that may result in higher latency as the volume of incoming requests increases.
When none of the existing instances are available to serve an incoming request, App Engine starts a new instance. Even after starting a new instance, some requests may need to be queued until the new instance completes its startup process. If you require the lowest latency possible, consider using automatic scaling, which creates new instances preemptively to minimize latency.
Apps with automatic scaling
If you use automatic scaling, each instance in your app has its own queue for incoming requests. Before the queues become long enough to have a noticeable effect on your app's latency, App Engine automatically creates one or more new instances to handle the increasing load.
You can configure the settings for automatic scaling to achieve a trade-off between the performance you want and the cost you can incur. The following table describes these settings.
| Automatic scaling settings | Description | 
|---|---|
| Target CPU Utilization | Sets the CPU utilization threshold at which more instances will be started to handle traffic. |
| Target Throughput Utilization | Sets the throughput threshold for the number of concurrent requests after which more instances will be started to handle traffic. | 
| Max Concurrent Requests | Sets the max concurrent requests an instance can accept before the scheduler spawns a new instance. | 
Watch the App Engine Scheduler Settings video to see the effects of these settings.
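As a rough illustration of how the last two settings interact: the app.yaml reference describes the scheduler as trying to start a new instance once the concurrent requests on an instance reach the maximum concurrent requests multiplied by the target throughput utilization. The sketch below covers that arithmetic only. The function names and sample values are assumptions, and the real scheduler also weighs CPU utilization and latency.

```python
def concurrency_threshold(max_concurrent_requests: int,
                          target_throughput_utilization: float) -> float:
    """Concurrent requests per instance at which a new instance is requested."""
    return max_concurrent_requests * target_throughput_utilization


def wants_new_instance(concurrent_requests: int,
                       max_concurrent_requests: int,
                       target_throughput_utilization: float) -> bool:
    """Toy check: has this instance reached the concurrency threshold?"""
    threshold = concurrency_threshold(max_concurrent_requests,
                                      target_throughput_utilization)
    return concurrent_requests >= threshold


# Illustrative values only.
print(concurrency_threshold(10, 0.6))
print(wants_new_instance(4, 10, 0.6))
```

Lowering either setting makes the scheduler add instances sooner, which trades higher cost for lower latency.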
Scaling down
When request volumes decrease, App Engine reduces the number of instances. This downward scaling helps ensure that all of your application's current instances are being used for optimal efficiency and cost effectiveness.
When an application is not being used at all, App Engine turns off its associated dynamic instances, but readily reloads them as soon as they are needed. Reloading instances can result in loading requests and additional latency for users.
You can specify a minimum number of idle instances. Setting an appropriate number of idle instances for your application based on request volume allows your application to serve every request with little latency, unless you are experiencing abnormally high request volume.
Scaling down in automatic scaling
If your app uses automatic scaling, it takes approximately 15 minutes of
inactivity for the idle instances to start shutting down. To keep one or more
idle instances running, set the value of min_idle_instances
to 1 or higher.
Scaling and batches of requests
If you are sending batches of requests to your services, for example, to a task queue for processing, a large number of instances will be created quickly. We recommend controlling this by rate limiting the number of requests sent per second, if possible. For example, if you use Cloud Tasks, you can control the rate at which tasks are pushed.
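When the sender is your own code, one way to apply this advice is to pace requests on the client side. The following is a minimal sketch under that assumption. The `paced` helper is hypothetical, and the `print` call stands in for whatever actually sends the request.

```python
import time
from typing import Callable, Iterable, Iterator, TypeVar

T = TypeVar("T")


def paced(items: Iterable[T], max_per_second: float,
          sleep: Callable[[float], None] = time.sleep,
          clock: Callable[[], float] = time.monotonic) -> Iterator[T]:
    """Yield items no faster than max_per_second."""
    if max_per_second <= 0:
        raise ValueError("max_per_second must be positive")
    interval = 1.0 / max_per_second
    next_allowed = clock()
    for item in items:
        delay = next_allowed - clock()
        if delay > 0:
            sleep(delay)  # wait until the next send slot opens
        next_allowed = max(next_allowed, clock()) + interval
        yield item


# Hypothetical usage: send at most 50 requests per second.
for request_id in paced(range(5), max_per_second=50):
    print("sending request", request_id)
```

The `sleep` and `clock` parameters exist only so the pacing can be exercised without real delays. A managed queue with a configured dispatch rate achieves the same effect without custom code.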
Instance life cycle
Instance states
An instance of an auto-scaled service is always running. However, an instance of a manual or basic scaled service can be either running or stopped. All instances of the same service and version share the same state. You change the state of your instances by managing your versions. You can:
- Use the Versions page in the Google Cloud console
- Use gcloud app versions start and gcloud app versions stop commands
- Use the Modules API
Startup
Each service instance is created in response to a start request, which is an
empty HTTP GET request to /_ah/start. App Engine sends this request
to bring an instance into existence; users cannot send a request to
/_ah/start. Manual and basic scaling instances must respond to the start
request before they can handle another request. The start request can be used
for two purposes:
- To start a program that runs indefinitely, without accepting further requests.
- To initialize an instance before it receives additional traffic.
Manual, basic, and automatically scaling instances start up differently. When you
start a manual scaling instance, App Engine immediately sends a
/_ah/start request to each instance. When you start an instance of a basic
scaling service, App Engine allows it to accept traffic, but the
/_ah/start request is not sent to an instance until it receives its first user
request. Multiple basic scaling instances are only started as necessary, in
order to handle increased traffic. Automatically scaling instances do not
receive any /_ah/start request.
When an instance responds to the /_ah/start request with an HTTP status code
of 200–299 or 404, it is considered to have successfully started and can
handle additional requests. Otherwise, App Engine terminates the
instance. Manual scaling instances are restarted immediately, while basic
scaling instances are restarted only when needed for serving traffic.
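To make the start request concrete, here is a minimal sketch that assumes a Python runtime serving a plain WSGI callable. The names `app`, `initialize`, `STATE`, and `call` are hypothetical, and the `call` helper only simulates a request locally.

```python
from wsgiref.util import setup_testing_defaults

# Hypothetical in-memory state prepared during the start request.
STATE = {"ready": False}


def initialize() -> None:
    """Placeholder for one-time setup, such as loading data into memory."""
    STATE["ready"] = True


def app(environ, start_response):
    """WSGI callable that treats /_ah/start as an initialization hook."""
    headers = [("Content-Type", "text/plain")]
    if environ.get("PATH_INFO") == "/_ah/start":
        initialize()
        # A 200-299 status (or 404) marks the instance as started.
        start_response("200 OK", headers)
        return [b"started"]
    start_response("200 OK", headers)
    return [b"ready" if STATE["ready"] else b"not initialized"]


def call(path: str):
    """Invoke the app locally with a synthetic GET request."""
    environ = {}
    setup_testing_defaults(environ)
    environ["PATH_INFO"] = path
    captured = []
    body = app(environ, lambda status, hdrs: captured.append(status))
    return captured[0], b"".join(body)


print(call("/_ah/start"))
```

Because automatic scaling instances never receive /_ah/start, an app that might run under any scaling type should not rely on this handler alone for initialization.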
Shutdown
The shutdown process might be triggered by a variety of planned and unplanned events, such as:
- There are too many instances and not enough app requests (traffic).
- You manually stop an instance.
- You deploy an updated version to the service.
- The instance exceeds the maximum memory for its configured
instance_class.
- Your application runs out of Instance Hours quota.
- Your instance is moved to a different machine, either because the current machine that is running the instance is restarted, or App Engine moved your instance to improve load distribution.
One of the benefits of the App Engine standard environment's "pay for only what you need" platform, as described earlier in Scaling down, is that the system autoscales the number of instances down to zero when there is no traffic. This helps make App Engine a cost-effective solution for small applications that don't receive continuous requests. When an instance needs to be shut down, new incoming requests are routed to other instances (if any) and requests that are currently being processed are given time to complete.
When an instance needs to be shut down, App Engine sends a KILL
(SIGKILL) signal, terminating the instance.
Loading requests
When App Engine creates a new instance for your application, the instance must first load any libraries and resources required to handle the request. This happens during the first request to the instance, called a loading request. During a loading request, your application undergoes initialization, which causes the request to take longer.
The following best practices allow you to reduce the duration of loading requests:
- Load only the code needed for startup.
- Access the disk as little as possible.
- In some cases, loading code from a zip or jar file is faster than loading from many separate files.
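One way to follow the first two practices is to defer expensive setup until a request actually needs it. The sketch below assumes a Python app. `load_lookup_table` and `handle_request` are hypothetical names, and the dictionary stands in for a resource that is costly to build or read from disk.

```python
import functools


@functools.lru_cache(maxsize=None)
def load_lookup_table() -> dict:
    """Build the expensive resource on first use, then reuse the cached copy."""
    # Stand-in for reading a large file or building an index.
    return {str(n): n * n for n in range(1000)}


def handle_request(key: str):
    """Only requests that need the table pay the cost of loading it."""
    table = load_lookup_table()
    return table.get(key)


print(handle_request("12"))
```

The trade-off is that the first request to use the resource absorbs the loading cost instead of the loading request.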
Warmup requests
Warmup requests are a specific type of loading request that load application
code into an instance ahead of time, before any live requests are made.
Manual or basic scaling instances do not receive an /_ah/warmup request.
Instance uptime
App Engine attempts to keep manual and basic scaling instances running indefinitely. However, at this time there is no guaranteed uptime for manual and basic scaling instances. Hardware and software failures that cause early termination or frequent restarts can occur without prior warning and can take considerable time to resolve; thus, you should construct your application in a way that tolerates these failures.
Here are some good strategies for avoiding downtime due to instance restarts:
- Reduce the amount of time it takes for your instances to restart or for new ones to start.
- For long-running computations, periodically create checkpoints so that you can resume from that state.
- Your app should be "stateless" so that nothing is stored on the instance.
- Use queues for performing asynchronous task execution.
- If you configure your instances to use manual scaling:
  - Use load balancing across multiple instances.
  - Configure more instances than required to handle normal traffic.
  - Write fall-back logic that uses cached results when a manual scaling instance is unavailable.
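The checkpointing strategy in the list above can be sketched as follows. This is an assumption-laden illustration: `sum_with_checkpoints` is a hypothetical function, and the plain dictionary stands in for durable storage outside the instance, which is where a real checkpoint would need to live.

```python
from typing import MutableMapping, Sequence


def sum_with_checkpoints(values: Sequence[int],
                         store: MutableMapping,
                         every: int = 100) -> int:
    """Sum values, saving progress so that a restarted instance can resume."""
    # Resume from the last saved position, or start from the beginning.
    index, total = store.get("checkpoint", (0, 0))
    for i in range(index, len(values)):
        total += values[i]
        if (i + 1) % every == 0:
            # Position and partial result are saved together as one value.
            store["checkpoint"] = (i + 1, total)
    store["checkpoint"] = (len(values), total)
    return total


# The dict is a stand-in for external storage, for illustration only.
store = {}
print(sum_with_checkpoints(list(range(1, 251)), store, every=100))
```

If the instance is terminated partway through, calling the function again with the same store repeats only the work done since the last checkpoint.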
 
NTP with App Engine standard environment
The App Engine standard environment has network time protocol (NTP) services which use Google NTP servers. However, the NTP service is not editable.