TL;DR: Uptime is the percentage of time your app is up and working. "The nines" — 99%, 99.9%, 99.99% — describe how reliable a system is. 99.9% ("three nines") sounds great but means 8.76 hours of acceptable downtime per year. Your AI will automatically generate health check endpoints, uptime monitors, and status pages when you deploy. You need to understand the numbers so you know what you're promising — and what your hosting provider is actually guaranteeing you.

The Short Version

Every service on the internet goes down sometimes. The question isn't if — it's how often and for how long. Uptime is the metric that answers this.

When someone says "we have 99.9% uptime," they mean: out of every 1,000 hours the service runs, it's expected to be unavailable for less than 1 hour. That translates to about 8.76 hours of downtime per year. It's expressed as a percentage because percentages scale cleanly whether you're talking about a single server or a global platform handling billions of requests.

Availability is the slightly broader concept — the percentage of time a system is not just running, but genuinely usable. A system that's technically "up" but returning 500 errors to every user has good uptime but terrible availability. Most engineers use the terms interchangeably in everyday conversation, but availability is the one that actually matters to your users.

The reason this concept matters specifically to you — a builder using AI as your primary coding partner — is that the moment you deploy an app, uptime becomes your problem. Your AI will help you write health checks, configure monitors, and set up status pages. But it won't explain what "three nines" actually costs you in real downtime hours, or why your $10/month VPS probably can't deliver four nines no matter how well you configure it.

Why This Matters: The GitHub Story

In early 2026, a Hacker News thread titled "GitHub appears to be struggling with measly three nines availability" hit 356 upvotes and 188 comments. The discussion was pointed: GitHub — one of the most critical pieces of infrastructure in the software world, used by millions of developers daily — was delivering 99.9% uptime. Three nines. A number that software engineers associate with "good, but not great."

The comments ranged from sardonic ("three nines for a tool developers depend on for CI pipelines is... not inspiring") to more nuanced takes about the actual engineering complexity of running at GitHub's scale. But the core complaint was this: when GitHub goes down, developers can't push code, CI pipelines fail, deploys halt, and teams across the globe are blocked. Nearly nine hours of downtime per year sounds small until it's 2am and your hotfix can't push because github.com isn't resolving.

This is the thing about uptime numbers: they're averages. Three nines doesn't mean you get 8.76 hours of downtime spread evenly across the 525,960 minutes of the year. It means you might get zero downtime for eleven months and then one catastrophic incident that eats your entire annual budget in a single afternoon.

Key Insight

Uptime percentages describe averages over time — not guarantees of even distribution. A service can hit its SLA target for the year while having one truly awful incident that makes engineers' lives miserable for several hours straight.

For GitHub specifically, the engineering challenge is enormous: they serve hundreds of millions of repositories, run CI/CD for countless pipelines, handle authentication for half the internet's OAuth flows, and maintain a global CDN for package distribution. Three nines at that scale with that feature complexity is genuinely hard. But it's still three nines — and the thread made clear that the developer community expected more.

For your app? Three nines is almost certainly fine. Probably better than fine. The lesson from the GitHub thread isn't "four nines or you're a failure" — it's "understand what you're promising, measure it honestly, and communicate clearly when things go wrong."

The Nines Explained: What Each Percentage Actually Costs You

Here's the table you'll reference every time someone throws a percentage at you. The math is simple: subtract the uptime percentage from 100% to get the downtime percentage, then multiply by the time period.

| Availability | Nickname | Downtime / Year | Downtime / Month | Downtime / Week |
|--------------|----------|-----------------|------------------|-----------------|
| 99% | Two nines | 3.65 days | 7.3 hours | 1.68 hours |
| 99.9% | Three nines | 8.76 hours | 43.8 minutes | 10.1 minutes |
| 99.95% | Three and a half nines | 4.38 hours | 21.9 minutes | 5.04 minutes |
| 99.99% | Four nines | 52.6 minutes | 4.38 minutes | 1.01 minutes |
| 99.999% | Five nines | 5.26 minutes | 26.3 seconds | 6.05 seconds |

Let that sink in for a moment. Going from three nines to four nines means reducing your acceptable annual downtime from 8.76 hours to 52 minutes. That tenth of a percent difference costs most companies millions of dollars in redundant infrastructure, automated failover systems, and on-call engineering time.

Going from four nines to five nines — another tenth — means your entire annual downtime budget is 5 minutes and change. You'd need to resolve any incident, including one that woke someone up at 3am, in under 5 minutes for the entire year. That's the level of reliability you see at payment processors, hospital systems, and air traffic control software. The engineering required is completely different from what you need for a SaaS side project.
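The math behind the table is worth internalizing, and it's trivial to reproduce. A minimal sketch in Python, using the same 8,760-hour year as the table:

```python
def downtime_budget_hours(availability_pct, period_hours=8760):
    """Allowed downtime, in hours, for a given availability percentage."""
    return period_hours * (1 - availability_pct / 100)

# Three nines: ~8.76 hours/year; four nines: ~52.6 minutes/year
print(round(downtime_budget_hours(99.9), 2))        # 8.76
print(round(downtime_budget_hours(99.99) * 60, 1))  # 52.6
```

Run it for any SLA a vendor quotes you, over any period, and you know exactly what they're promising.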

⚠️ Reality Check

Most managed hosting platforms — Render, Railway, Fly.io, even Vercel — offer 99.9% uptime SLAs on their paid plans. If your hosting provider only guarantees three nines, you mathematically cannot deliver four nines to your users regardless of how well you write your code. Your availability ceiling is your infrastructure floor.
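This ceiling compounds across dependencies: if your app needs its host, its database, and its DNS provider all up at once, their availabilities multiply. A rough sketch (the component numbers are illustrative):

```python
def serial_availability(*component_pcts):
    """Approximate end-to-end availability of components that must ALL be up."""
    product = 1.0
    for pct in component_pcts:
        product *= pct / 100
    return product * 100

# Host (99.9%) + managed DB (99.95%) + DNS (99.99%) in series:
print(round(serial_availability(99.9, 99.95, 99.99), 2))  # 99.84, below any single part
```

Every serial dependency drags the composed number below the weakest link, which is why "three nines hosting" rarely means three nines end to end.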

What Each Tier Actually Requires

99% (Two nines): A single server with no redundancy. If it crashes, it crashes. Acceptable for hobby projects and internal tools where no one is paying and downtime is an inconvenience, not a crisis.

99.9% (Three nines): A well-maintained single server with automated restarts, good deployment practices, and a monitoring setup that alerts you quickly. Most managed hosting platforms are here. Good enough for the vast majority of indie SaaS products, especially early-stage ones.

99.99% (Four nines): Multiple redundant servers, automated failover, load balancing, zero-downtime deploys, and a team that can respond to incidents in minutes. Requires real investment in infrastructure design. Appropriate once you have paying customers who depend on your service for their own business operations.

99.999% (Five nines): Geographic redundancy across multiple cloud regions, real-time replication, chaos engineering to test failure scenarios, and often a dedicated reliability engineering team. This is AWS, Google, or Stripe territory. You don't build this yourself unless reliability is literally your product.

What Actually Causes Downtime

Understanding the causes of downtime is more useful than chasing percentages. For apps built by solo developers and small teams using AI tools, the overwhelming majority of downtime comes from a short list of predictable problems.

Bad Deploys

The number one cause of downtime for small apps. You push code with a bug, the server starts crashing, and before you notice, your app has been returning 500 errors for 20 minutes. The fix: zero-downtime deploys (run the new version alongside the old one before cutting over), automated health checks post-deploy, and the ability to roll back instantly. Your AI can set all of this up — but only if you ask for it explicitly.
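The "automated health checks post-deploy" piece can be a small gate script that polls the new version before traffic cuts over. Here's a Python sketch with the fetcher injectable so the logic is testable; the function names are illustrative, not from any particular tool:

```python
import time
import urllib.request

def deploy_is_healthy(url, attempts=3, delay=2.0, fetch=None):
    """Poll a /health URL; return True once it answers HTTP 200."""
    if fetch is None:  # default fetcher; swap out in tests or CI
        def fetch(u):
            with urllib.request.urlopen(u, timeout=5) as resp:
                return resp.status
    for i in range(attempts):
        try:
            if fetch(url) == 200:
                return True
        except Exception:
            pass  # connection refused, timeout, etc., so retry
        if i < attempts - 1:
            time.sleep(delay)
    return False

# In a deploy script: roll back unless the new instance reports healthy
# if not deploy_is_healthy("https://myapp.com/health"): rollback()
```

The same pattern appears inside most deploy tooling; writing it once by hand makes the generated versions much easier to audit.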

Resource Exhaustion

Your server runs out of memory, disk space, or database connections. The app starts thrashing, getting slower and slower until it stops responding entirely. This is surprisingly common on $5–$10/month VPS instances. The fix: monitoring dashboards that alert you before you hit limits, and sensible resource limits on database connection pools.
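You can surface this in your health endpoint before it becomes an outage. A minimal disk-headroom check using only the Python standard library (the 90% threshold is an arbitrary example):

```python
import shutil

def disk_used_pct(path="/"):
    """Percentage of disk space in use at `path`."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total * 100

def disk_ok(path="/", max_used_pct=90):
    """False once usage crosses the alert threshold."""
    return disk_used_pct(path) < max_used_pct

# Wire into /health: report "degraded" when disk_ok() is False
```

The same idea extends to memory and connection-pool usage; the point is to alert on the trend, not the crash.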

Dependency Failures

Your app depends on third-party services: a payment processor, an email provider, an AI API, a cloud storage bucket. When those go down, parts of your app go down too — even if your own server is running perfectly. This is why sophisticated apps implement graceful degradation: if the payment processor is down, show a user-friendly message instead of a generic crash. If the AI API is unavailable, fall back to cached results or a simplified response.
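Graceful degradation can start as a one-function wrapper that swaps in a fallback when the primary call fails. A Python sketch; the API and cache functions are made-up stand-ins:

```python
def with_fallback(primary, fallback):
    """Run primary(); if it raises, run fallback() with the same arguments."""
    def guarded(*args, **kwargs):
        try:
            return primary(*args, **kwargs)
        except Exception:
            return fallback(*args, **kwargs)
    return guarded

def call_ai_api(text):
    # Simulated third-party outage
    raise ConnectionError("AI API is down")

def cached_summary(text):
    return "Summary temporarily unavailable (showing a cached copy)"

get_summary = with_fallback(call_ai_api, cached_summary)
print(get_summary("some text"))  # the cached fallback, not a crash
```

A real implementation would log the failure and distinguish error types, but even this crude version turns a hard outage into a degraded experience.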

Misconfigured Environment Variables

You deploy to production and forget to set DATABASE_URL. Your app boots, crashes immediately when it tries to connect, and the process manager restarts it in an infinite loop. Monitoring catches this — but only if you have monitoring. Without it, you might not notice for hours.
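The cheap fix is to fail fast at startup with one clear error instead of crash-looping. A sketch; the variable names are examples:

```python
import os

REQUIRED_VARS = ["DATABASE_URL", "SECRET_KEY"]  # whatever your app needs

def check_env(required=REQUIRED_VARS, env=os.environ):
    """Raise one clear error at boot instead of crash-looping later."""
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing required env vars: {', '.join(missing)}")

# Call check_env() before the app starts serving traffic
```

A single RuntimeError with the variable names in it is far easier to diagnose from logs than a loop of connection stack traces.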

Traffic Spikes

Your app gets featured somewhere, traffic spikes 10x, and a server sized for normal load can't handle it. This is the happy-path downtime: you went viral and your success killed your site. The fix is autoscaling (cloud platforms can spin up more instances automatically) or at minimum knowing your capacity limits before you need them.

Infrastructure Provider Incidents

Sometimes it's not you — it's them. AWS has outages. Cloudflare has incidents. Your database provider's region goes down. This is the downtime you can't fully prevent, only mitigate with geographic redundancy. For most small apps, accepting this risk is the right call. Multi-region redundancy is genuinely complex and expensive.

What AI Generates for Monitoring

When you deploy an app and ask your AI to "add monitoring" or "set up health checks," here's what it typically generates. Understanding these patterns means you can verify they're correct, customize them intelligently, and debug them when something goes wrong.

Health Check Endpoint

The foundation of all uptime monitoring. A /health or /ping endpoint that an external monitor hits every minute. If it returns a 200 OK, the service is up. If it returns anything else — or times out — the monitor fires an alert.

// Express.js health check endpoint (Node.js)
// Claude generates this when you ask for "basic monitoring"
const express = require('express');
const app = express();
// `db` is assumed to be your database client (e.g. a pg Pool)

app.get('/health', async (req, res) => {
  const status = {
    status: 'ok',
    timestamp: new Date().toISOString(),
    uptime: process.uptime(),
    version: process.env.npm_package_version || '1.0.0'
  };

  // Optional: check database connectivity
  try {
    await db.query('SELECT 1');
    status.database = 'connected';
  } catch (err) {
    status.status = 'degraded';
    status.database = 'error';
    return res.status(503).json(status);
  }

  res.json(status);
});

Notice the database check. A shallow health check that just returns { status: 'ok' } tells you the server process is alive — but not whether it can actually do useful work. A deeper check that queries the database catches the much more common failure mode of "server is running but can't connect to the DB."

Python / FastAPI Version

from fastapi import FastAPI
from fastapi.responses import JSONResponse
from datetime import datetime
import time

app = FastAPI()
START_TIME = time.time()

@app.get("/health")
async def health_check():
    return {
        "status": "ok",
        "timestamp": datetime.utcnow().isoformat(),
        "uptime_seconds": round(time.time() - START_TIME, 2)
    }

@app.get("/health/ready")
async def readiness_check():
    # Deeper check: can we actually serve requests?
    try:
        # Check DB, cache, etc.
        return {"status": "ready"}
    except Exception as e:
        # Flask-style tuple returns don't set the status code in FastAPI;
        # use JSONResponse so monitors see a real 503
        return JSONResponse(status_code=503,
                            content={"status": "not_ready", "error": str(e)})

💡 Two Types of Health Checks

Professional systems distinguish between liveness (is the process alive?) and readiness (is it ready to serve traffic?). A server can be alive but not ready — for example, while it's warming up a cache or waiting for a database migration to finish. Ask your AI for both when deploying to Kubernetes or any container platform.

Process Manager Auto-Restart (PM2)

On a VPS, your AI will often configure PM2 to automatically restart your app if it crashes. This isn't monitoring in the traditional sense — it's a first line of defense that keeps your app alive through transient crashes without paging you.

// ecosystem.config.js — PM2 configuration Claude generates
module.exports = {
  apps: [{
    name: 'my-app',
    script: 'server.js',
    instances: 'max',        // Use all CPU cores
    exec_mode: 'cluster',
    watch: false,
    max_memory_restart: '500M',  // Restart if memory exceeds 500MB
    env_production: {
      NODE_ENV: 'production',
      PORT: 3000
    },
    // Crash-loop protection: a process that dies within min_uptime counts
    // as a failed start, and PM2 gives up after max_restarts rapid failures
    min_uptime: '5s',
    max_restarts: 10
  }]
};

PM2 handles the "server crashed at 3am" scenario automatically — it restarts the process and logs what happened. But it can't help with "server is running but returning errors to every user." That's what external monitoring catches.

Health Checks and Status Pages

An uptime monitor pings your health check endpoint from outside your network, on a schedule. If it detects a failure, it sends you an alert — email, SMS, Slack, PagerDuty, whatever you've configured. The most popular free option is UptimeRobot. Here's what your AI generates when you ask it to set up monitoring:

UptimeRobot Setup (What to Tell Your AI)

Add uptime monitoring for my app at https://myapp.com using UptimeRobot.
Monitor the /health endpoint every 5 minutes.
Alert me at alerts@myapp.com if it goes down.
Also monitor the main page (/) as a secondary check.

Please:
1. Give me the UptimeRobot monitor configuration
2. Update my /health endpoint to return proper HTTP status codes
   (200 for healthy, 503 for degraded)
3. Add a status badge URL I can put in my README
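Under the hood, a monitor like this is just a scheduled check plus bookkeeping: it records up/down results and reports the observed percentage. A sketch of the bookkeeping half:

```python
def observed_uptime_pct(checks):
    """Availability implied by a list of boolean up/down check results."""
    if not checks:
        return None
    return sum(checks) / len(checks) * 100

# 1,000 five-minute checks with a single failure is three nines on the nose
checks = [True] * 999 + [False]
print(round(observed_uptime_pct(checks), 1))  # 99.9
```

This is also why check frequency matters: at 5-minute intervals, an outage shorter than the interval can be missed entirely, flattering your measured number.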

Status Pages

A status page is a public-facing webpage — usually at status.myapp.com — that shows your current and historical uptime. When your app goes down, users can check the status page to understand whether it's a known issue or something on their end. This dramatically reduces support requests and builds trust.

For small apps, you don't need to build a status page from scratch. Your AI can configure free tools:

UptimeRobot Status Page

Free, automatic, generated from your monitors. Lives at a custom subdomain. Shows current status and 90-day uptime history. No code required.

Upptime (GitHub-based)

Open source. Runs as a GitHub Action, stores data in your repo, hosts the status page on GitHub Pages. Completely free. Your AI can set it up with a YAML config file.

Upptime Configuration Example

# .github/workflows/upptime.yml
# Claude generates this when you ask for a free self-hosted status page

name: Upptime
on:
  schedule:
    - cron: '*/5 * * * *'  # Check every 5 minutes
  workflow_dispatch:

jobs:
  uptime:
    runs-on: ubuntu-latest
    steps:
      - uses: upptime/uptime-monitor@master
        env:
          GH_PAT: ${{ secrets.GH_PAT }}

# .upptime.yml — which URLs to monitor
owner: your-github-username
repo: status
sites:
  - name: My App
    url: https://myapp.com/health
  - name: API
    url: https://api.myapp.com/health
  - name: Docs
    url: https://docs.myapp.com

⚠️ Don't Host Your Status Page on the Same Server

If your app and its status page are on the same server, and that server goes down, your status page also goes down — right when users need it most. Use a separate hosted service (UptimeRobot's hosted status pages, GitHub Pages for Upptime, or a separate static hosting service) that's completely independent from your main infrastructure.

SLAs Explained in Plain English

An SLA — Service Level Agreement — is a contract that specifies what uptime a provider promises to deliver, and what happens if they don't. When you sign up for a cloud platform, you're implicitly (or explicitly) agreeing to their SLA. When you sell software to businesses, they may ask for your SLA before signing a contract.

What Cloud Providers Actually Promise

| Provider / Tier | SLA | Downtime Budget / Month | Compensation |
|-----------------|-----|-------------------------|--------------|
| Vercel Pro | 99.99% | ~4 minutes | Service credits |
| Railway Pro | 99.9% | ~44 minutes | Service credits |
| Render Paid | 99.95% | ~22 minutes | Service credits |
| AWS EC2 (single AZ) | 99.5% | ~3.6 hours | Service credits |
| AWS EC2 (multi-AZ) | 99.99% | ~4 minutes | Service credits |
| Neon (database) | 99.95% | ~22 minutes | Service credits |

The compensation column is worth noting. SLA violations typically result in service credits — a percentage of your monthly bill refunded as account credit. They don't give you money back, they don't compensate for downstream losses, and claiming them requires you to file a support ticket with the incident details. It's worth doing, but it's not the primary reason to care about SLAs.

The primary reason is that an SLA tells you the provider's actual engineering commitment. A 99.99% SLA means they've invested in redundancy, failover, and incident response. A 99.5% SLA on a single-AZ EC2 instance means "we'll try our best but one server can go down."

Do You Need Your Own SLA?

If you're building a consumer app or a B2C SaaS, probably not. Users expect reasonable uptime and a clear status page, but they're not signing contracts about it.

If you're selling to businesses (B2B), especially larger ones, expect them to ask. Enterprise buyers often have their own SLA requirements — they need to tell their leadership that the tools they're buying meet certain reliability thresholds. A common ask is 99.9% with credits for violations.

💡 What to Promise

Never promise more uptime than your infrastructure can deliver. If your hosting SLA is 99.9%, you can promise 99.9% to your users — but not 99.99%. Your availability ceiling is determined by the least reliable component in your stack. If any single point can fail and take down your app, you can't honestly promise four nines.

What to Tell Your AI

Your AI handles the implementation of monitoring well, but it needs context to give you the right setup. Here are the prompts that get you from "no monitoring" to a production-ready observability stack.

Basic Health Check (Starting Point)

Add a /health endpoint to my [Express/FastAPI/Rails/etc.] app that:
- Returns 200 with { status: 'ok', timestamp, uptime } when healthy
- Returns 503 with { status: 'error', detail } when unhealthy
- Checks database connectivity as part of the health check
- Does NOT require authentication (monitoring tools need to hit it publicly)

Full Monitoring Setup

I want to add uptime monitoring to my app. Please:

1. Add /health and /health/ready endpoints (liveness vs readiness)
2. Give me a UptimeRobot configuration to monitor every 5 minutes
3. Set up PM2 with auto-restart and memory limits on my VPS
4. Create a GitHub Actions workflow using Upptime for a public status page
   hosted at status.myapp.com via GitHub Pages

My stack: Node.js + Express + PostgreSQL (via Prisma)
My hosting: DigitalOcean VPS running Ubuntu 22.04

Fixing a Specific Downtime Problem

My app went down at 2am. PM2 logs show it restarted 7 times in 10 minutes
before backing off. Here are the last 50 lines of logs:

[paste logs here]

What caused this crash loop? How do I prevent it happening again?
Suggest specific code changes and monitoring improvements.

Adding Alerting

My app is monitored by UptimeRobot. Add Slack alerting so that:
- I get notified in #alerts when the app goes down
- I get a recovery message when it comes back up
- The alert includes the URL that failed and the HTTP status code

Also set up a PagerDuty escalation for incidents lasting more than 10 minutes.

Context Is Everything

The more specific you are about your stack, hosting environment, and what "healthy" means for your app, the better your AI's output. A generic "add monitoring" prompt gets you generic code. A detailed prompt specifying your database, your hosting platform, and your alert preferences gets you production-ready implementation.

FAQ

What does 99.9% uptime actually mean?

99.9% uptime ("three nines") means your service can be down for a maximum of 8.76 hours per year, or about 43.8 minutes per month. It sounds great until you realise that a single bad deploy or database outage can burn through your entire annual budget in one afternoon. Three nines is a reasonable target for most indie SaaS apps — it just means you can't have many incidents.

What is a health check endpoint?

A health check endpoint is a special URL on your server — usually /health or /ping — that returns a simple OK response to show the service is alive. Monitoring tools ping this URL every minute or so. If it stops responding (or returns an error), they send you an alert. Your AI will generate this automatically when you ask it to add monitoring to your app. A good health check also tests your database connection, not just that the server process is running.

What is an SLA?

An SLA (Service Level Agreement) is a contract between a service provider and its customers that guarantees a minimum uptime percentage. If the provider misses that target, customers typically receive service credits — not cash, but account credit. Cloud providers like AWS, Vercel, and Railway all publish SLAs for their infrastructure. Your own app probably doesn't need a formal SLA unless you're selling to enterprise customers, who often require one as a condition of purchase.

What's the difference between uptime and availability?

Uptime and availability are often used interchangeably, but technically availability is the broader concept: the percentage of time a system is operational and genuinely usable. Uptime is the raw measurement of time a system has been running without stopping. A system can technically be "up" (the process is alive) but have degraded performance, return errors to every user, or have a broken database connection — so availability is the more meaningful metric. What your users experience is availability, not uptime.

Does my app need five nines?

Almost certainly not. Five nines is expensive, complex, and reserved for financial systems, emergency services, and large-scale platforms where downtime has direct legal or safety consequences. For most indie projects and small SaaS apps, 99.9% is perfectly acceptable. Focus on good error handling, graceful degradation, and a clear status page rather than chasing extreme uptime numbers. The engineering required for five nines is a full-time job for a team of reliability engineers.

What are the most common causes of downtime for small apps?

For small apps built by solo developers, the most common causes of downtime are: bad deploys (pushing code that crashes the server), running out of memory or disk space, database connection limits being exceeded, third-party API outages, and misconfigured environment variables after a deploy. Most of these are preventable with basic health checks, deployment rollbacks, resource monitoring, and careful staging-before-production deploys. The number one thing you can do is add a health check endpoint and an external monitor that alerts you within minutes of a failure.

What free tools should I start with?

UptimeRobot has a generous free tier (50 monitors, 5-minute check intervals, email alerts). Better Stack (formerly Logtail) has a free plan with 3-minute checks. Freshping offers 50 free monitors with 1-minute intervals. For a public status page, Upptime is fully free and runs as a GitHub Action — your AI can configure it entirely from a YAML file. For most small apps, UptimeRobot free + Upptime for the status page is all you need to start.

What to Learn Next

Uptime and availability connect directly to several other concepts worth understanding as you build more reliable apps: