From Black Box to Glass Box: Monitoring Your NestJS API with OpenTelemetry - Andy Primawan

Tired of your API being a black box? Let's implement robust monitoring for a NestJS app using OpenTelemetry and Prometheus. We'll track RPM, failure rates, and top endpoints to gain crucial insights into your API's health and traffic patterns.


As full-stack engineers, we know the feeling. We push our backend API to production, it passes the health checks, and everything seems fine. But when users report slowness or an endpoint starts failing silently, our beautiful API suddenly feels like an opaque black box. Where do you even begin to look?

This is where observability comes in. Instead of guessing, we can collect data that tells us a story about our application's health. Today, we're going to focus on one of the key pillars of observability: metrics.

Our goal is to instrument a NestJS REST API to understand its performance and usage patterns. We'll track key metrics like:

  • Requests Per Minute (RPM) to see traffic flow.
  • HTTP 4xx and 5xx error rates to catch issues proactively.
  • A list of the most frequently called endpoints to identify hotspots.

We'll use the powerful combination of OpenTelemetry (via the excellent nestjs-otel library), Prometheus for data storage, and Grafana for beautiful, insightful dashboards. By the end of this post, you'll have a clear, actionable path to adding this essential monitoring to your own projects.

The Setup: Project Structure

We'll be working with a standard NestJS project structure. The key files we'll create or modify are:

src/
├── app.controller.ts
├── app.module.ts
├── app.service.ts
├── main.ts
└── tracing.ts        <-- We will create this

Step 1: Install the Dependencies

First things first, let's get the necessary packages. We'll need nestjs-otel, which provides the NestJS integration, plus the OpenTelemetry SDK, exporter, and propagator packages that our tracing setup imports directly:

npm install --save nestjs-otel @opentelemetry/api @opentelemetry/core \
  @opentelemetry/sdk-node @opentelemetry/sdk-trace-base \
  @opentelemetry/auto-instrumentations-node @opentelemetry/context-async-hooks \
  @opentelemetry/exporter-prometheus @opentelemetry/exporter-jaeger \
  @opentelemetry/propagator-jaeger @opentelemetry/propagator-b3

@opentelemetry/api provides the core interfaces and functions for instrumentation, which nestjs-otel builds upon.

Step 2: Configure the Tracer

OpenTelemetry needs to be initialized before your application code runs to ensure all modules are properly instrumented.

Let's create a new file at src/tracing.ts. This file will handle the basic OpenTelemetry SDK configuration.

src/tracing.ts

import {
  CompositePropagator,
  W3CTraceContextPropagator,
  W3CBaggagePropagator,
} from '@opentelemetry/core';
import { BatchSpanProcessor } from '@opentelemetry/sdk-trace-base';
import { JaegerExporter } from '@opentelemetry/exporter-jaeger';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { JaegerPropagator } from '@opentelemetry/propagator-jaeger';
import { B3Propagator } from '@opentelemetry/propagator-b3';
import { PrometheusExporter } from '@opentelemetry/exporter-prometheus';
import { NodeSDK } from '@opentelemetry/sdk-node';
import { AsyncLocalStorageContextManager } from '@opentelemetry/context-async-hooks';

const otelSDK = new NodeSDK({
  metricReader: new PrometheusExporter({
    port: 8081,
  }),
  spanProcessor: new BatchSpanProcessor(new JaegerExporter()),
  contextManager: new AsyncLocalStorageContextManager(),
  textMapPropagator: new CompositePropagator({
    propagators: [
      new JaegerPropagator(),
      new W3CTraceContextPropagator(),
      new W3CBaggagePropagator(),
      new B3Propagator(),
    ],
  }),
  instrumentations: [getNodeAutoInstrumentations()],
});

export default otelSDK;

// You can also use the shutdown method to gracefully shut down the SDK before process shutdown
// or on some operating system signal.
process.on('SIGTERM', () => {
  otelSDK
    .shutdown()
    .then(
      () => console.log('OpenTelemetry SDK shut down successfully'),
      (err) => console.log('Error shutting down OpenTelemetry SDK', err),
    )
    .finally(() => process.exit(0));
});

This setup initializes the SDK and configures it to use the PrometheusExporter, which starts its own small HTTP server (on port 8081 here, separate from the NestJS app's port) exposing a /metrics endpoint that Prometheus can scrape.
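As a quick sanity check once the app is running, you can fetch the exporter's endpoint directly (assuming the default port 8081 from tracing.ts):

```shell
# The PrometheusExporter serves metrics on its own port, not the app's port
curl http://localhost:8081/metrics | head
# You should see Prometheus text-format output: HELP/TYPE lines, counters, histograms
```

If you get a connection error here, the SDK likely wasn't started before the app, or another process is holding port 8081.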

Step 3: Start Tracing in main.ts

Now, let's hook our new tracer into our application's lifecycle. The key is to start it before NestFactory.create(). This allows OpenTelemetry's auto-instrumentation to patch libraries like http and express before NestJS even starts using them.

src/main.ts

import { NestFactory } from '@nestjs/core';
import { ConfigService } from '@nestjs/config';
import otelSDK from './tracing';
import { AppModule } from './app.module';

async function bootstrap() {
  // Start SDK before nestjs factory create
  otelSDK.start();

  const app = await NestFactory.create(AppModule);

  // Using ConfigService to get configuration
  const configService = app.get(ConfigService);
  const port = configService.get<number>('PORT', 3000);

  await app.listen(port);
  console.log(
    `Dojotek AI Chatbot backend is running on: http://localhost:${port}`,
  );
}

bootstrap().catch((error) => {
  console.error('Error starting Dojotek AI Chatbot backend:', error);
  process.exit(1);
});

With this, our application is now fully instrumented! All incoming HTTP requests and many other operations will automatically generate trace and metric data.

Step 4: Add the OpenTelemetry Module to app.module.ts

While auto-instrumentation covers a lot, nestjs-otel provides deeper integration with the NestJS framework itself (like associating traces with controllers and services). Let's add its module to our AppModule.

src/app.module.ts

import { Module } from '@nestjs/common';
import { AppController } from './app.controller';
import { AppService } from './app.service';
import { OpenTelemetryModule } from 'nestjs-otel';

@Module({
  imports: [
    OpenTelemetryModule.forRoot({
      metrics: {
        hostMetrics: true, // Includes Host Metrics
        apiMetrics: {
          // @deprecated - will be removed in nestjs-otel 8.0; prefer the semconv
          // HTTP metrics emitted by OpenTelemetry auto-instrumentation instead
          enable: false, // Disabled here; auto-instrumentation provides our HTTP metrics
          defaultAttributes: {
            // You can set default labels for api metrics
            // custom: 'label',
          },
          ignoreRoutes: ['/favicon.ico'], // You can ignore specific routes (See https://docs.nestjs.com/middleware#excluding-routes for options)
          ignoreUndefinedRoutes: false, // Records metrics for all URLs, even undefined ones
          prefix: 'my_prefix', // Add a custom prefix to all API metrics
        },
      },
    }),
  ],
  controllers: [AppController],
  providers: [AppService],
})
export class AppModule {}

This configuration enriches our metrics with useful Node.js and host-level information, giving us a more complete picture of our application's performance.
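Beyond the built-in metrics, nestjs-otel also lets you record custom business metrics through its injectable MetricService. The sketch below is illustrative only: CheckoutService, orders_total, and the plan attribute are made-up names, not part of the article's repository.

```typescript
import { Injectable } from '@nestjs/common';
import type { Counter } from '@opentelemetry/api';
import { MetricService } from 'nestjs-otel';

@Injectable()
export class CheckoutService {
  private readonly ordersCounter: Counter;

  constructor(private readonly metricService: MetricService) {
    // Custom business metric; it is exported on the same /metrics endpoint
    // as the auto-instrumented HTTP metrics.
    this.ordersCounter = this.metricService.getCounter('orders_total', {
      description: 'Number of orders placed',
    });
  }

  placeOrder(plan: string) {
    // The attribute becomes a Prometheus label, so you can group by it in PromQL
    this.ordersCounter.add(1, { plan });
    // ... actual order logic ...
  }
}
```

Counters like this pair naturally with the same rate()/increase() PromQL patterns shown later in this post.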

What's Next? Visualizing in Grafana

With the code in place, your /metrics endpoint is now serving a stream of OpenTelemetry data in a format Prometheus understands. The next steps are:

  1. Run Prometheus: Configure your prometheus.yml file to scrape http://localhost:8081/metrics — the port we gave the PrometheusExporter in tracing.ts, not the port your NestJS app listens on.
  2. Run Grafana: Add Prometheus as a data source in Grafana.
  3. Build Dashboards: Create panels using PromQL (Prometheus Query Language) to visualize the data.
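For step 1, a minimal prometheus.yml scrape configuration might look like this (the job name and scrape interval are my assumptions, adjust to taste):

```yaml
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'nestjs-app'
    # The PrometheusExporter configured in tracing.ts listens on port 8081
    static_configs:
      - targets: ['localhost:8081']
```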

Here are a few example queries to get you started:

NestJS HTTP Request per Minute PromQL

sum(rate(http_server_duration_count{http_route!=""}[$__rate_interval]))*60 or vector(0)

NestJS HTTP Top Called Routes PromQL

topk(10, sum by (http_method, http_route)(
  increase(http_server_duration_count{http_route!=""}[$__range])
))

NestJS HTTP 5xx Server Error Rate PromQL

(
  100 *
  sum(rate(http_server_duration_count{http_route!="", http_status_code=~"5.."}[$__rate_interval]))
  /
  clamp_min(
    sum(rate(http_server_duration_count{http_route!=""}[$__rate_interval])),
    1
  )
) or on() vector(0)

NestJS HTTP 4xx Client Error Rate PromQL

(
  100 *
  sum(rate(http_server_duration_count{http_route!="",http_status_code=~"4.."}[$__rate_interval]))
  /
  clamp_min(
    sum(rate(http_server_duration_count{http_route!=""}[$__rate_interval])),
    1
  )
) or on() vector(0)
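Both error-rate queries share the same defensive shape: divide error throughput by total throughput, but guard the denominator. Purely as an illustration (nothing you need in the app), here is that arithmetic in plain TypeScript:

```typescript
// Mirrors the PromQL pattern: 100 * errors / clamp_min(total, 1).
// clamp_min keeps the denominator >= 1, so zero traffic yields 0% instead of
// a division by zero / NaN; `or on() vector(0)` plays a similar fallback role
// when the series is absent entirely.
function errorRatePct(errorsPerSec: number, totalPerSec: number): number {
  return (100 * errorsPerSec) / Math.max(totalPerSec, 1);
}

console.log(errorRatePct(2, 100)); // -> 2 (2% of requests failed)
console.log(errorRatePct(0, 0));   // -> 0 (no traffic, no NaN)
```

One trade-off to be aware of: clamping the denominator to 1 understates the percentage when total traffic is below 1 request per second, which is usually acceptable for a dashboard panel.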

NestJS HTTP Top Routes with 5xx Server Error PromQL

topk(10, sum by (http_method, http_route)(
  increase(http_server_duration_count{http_route!="",http_status_code=~"5.."}[$__range])
))

NestJS HTTP Top Routes with 4xx Client Error PromQL

topk(10, sum by (http_method, http_route)(
  increase(http_server_duration_count{http_route!="",http_status_code=~"4.."}[$__range])
))

Here is the Grafana dashboard for the NestJS REST API running on my laptop:

NestJS REST API and Open Telemetry metrics visualized on Grafana - Andy Primawan

See the Full Code

We've covered the core implementation, but seeing it all together in a working repository is even better. You can check out the final source code for these files in my GitHub repository:

Implementing observability isn't a one-and-done task, but by taking these first steps, you transform your NestJS API from a black box into a transparent, monitorable system. Happy coding!

You can also import the Grafana dashboard JSON to visualize your NestJS REST API, available from the following public GitHub gist: