Observability & platform

Most managed services providers run the same five SaaS dashboards your last provider did, with the same blind spots and the same monthly bill that grows with every device you add. We do not. The 4th Octet observability stack is self-hosted, multi-tenant from day one, and built on the same open-source stack that handles metrics for some of the largest infrastructure operators in the world, sized here for SMB and mid-market environments.

The result is a platform you can see, query, and audit. Every dashboard, every metric, every log line, every alert rule is yours to inspect. If you ever want to leave, you take it with you in standard formats.

What you get

A working, tuned, multi-tenant observability platform stood up for your environment during onboarding, then operated by us continuously:

Per-tenant Grafana at grafana.4thoctet.com, scoped to your data only via Cloudflare Access with Entra-backed SSO. Dashboards your operations team actually opens at 8am (cluster health, network I/O, application latency, backup posture, certificate expiry, license drift).
Metrics ingest through Grafana Alloy collectors at your sites, scraping node exporters, SNMP-based device exporters, and any Prometheus endpoints you already publish. Metrics are pushed to a central Mimir tier over mTLS, with each write carrying your tenant's X-Scope-OrgID header.
Log ingest through the same Alloy collectors, capturing syslog, container logs, application logs, and audit trails into Loki. Same multi-tenant boundary; no cross-contamination by design.
Custom dashboards built to match how your team thinks about the environment, not a generic vendor template. The network team gets network views. Security gets detection views. Leadership gets QBR views with the metrics that map to actual business decisions.
Alerting and routing through Alertmanager and n8n workflows that deduplicate, route by severity and on-call schedule, and ship into your ticketing system. We tune alerts during onboarding so the first month is calibration, not pager fatigue.
Configuration backup for every supported network device through Oxidized, with diffs in git so you can see exactly what changed on which switch on which day.
Quarterly business reviews powered by the same data. Real graphs of real trends. No PowerPoint with screenshots; you can drill into the dashboard live during the call.
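
As a rough illustration of the alert handling described in the list above (deduplicate, then route by severity and schedule), here is a minimal Python sketch. It is not the production n8n workflow: the Alert fields, the business-hours window, and the destination names are all assumptions made for the example.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import datetime, time


@dataclass(frozen=True)
class Alert:
    tenant: str      # tenant code (hypothetical field)
    name: str        # alert rule name
    instance: str    # device or host that fired
    severity: str    # "critical", "warning", or "info"


def fingerprint(alert: Alert) -> tuple[str, str, str]:
    """Alerts sharing tenant, rule name, and instance count as duplicates."""
    return (alert.tenant, alert.name, alert.instance)


def deduplicate(alerts: list[Alert]) -> list[Alert]:
    """Keep the first alert per fingerprint, preserving arrival order."""
    seen: set[tuple[str, str, str]] = set()
    unique = []
    for alert in alerts:
        key = fingerprint(alert)
        if key not in seen:
            seen.add(key)
            unique.append(alert)
    return unique


def in_business_hours(now: datetime) -> bool:
    """Assumed schedule: Monday to Friday, 08:00 to 18:00 local time."""
    return now.weekday() < 5 and time(8, 0) <= now.time() < time(18, 0)


def route(alert: Alert, now: datetime) -> str:
    """Pick a destination from severity and time of day (names are hypothetical)."""
    if alert.severity == "critical":
        return "page-on-call"      # criticals escalate at any hour
    if alert.severity == "warning" and in_business_hours(now):
        return "chat-channel"      # warnings reach chat only during the day
    return "ticket-queue"          # everything else waits as a ticket
```

Under these assumed rules, a warning that fires at 3am lands in the ticket queue instead of paging anyone, which is the kind of behavior the calibration period is meant to settle.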

How we engage

Observability is part of the Core managed services package, not a paid add-on. Every managed services client gets the platform at no incremental per-device cost, because nothing in our stack is licensed per device. The Core monthly fee covers metrics, logs, dashboards, alerting, and on-call routing.

Standalone observability engagements (you want the platform but do not want the rest of managed services) are available as a scoped project. We design the architecture for your environment, build it out, hand off operational ownership to your team, and step away. Pricing for that path runs through /quote/ because every environment is different.

The stack, named

Layer	Tool	Why this one
Visualization	Grafana 12.3	Industry standard, org-based multi-tenancy, fully Terraform-provisioned for repeatability across clients
Metrics backend	Grafana Mimir	Horizontally scalable Prometheus-compatible backend with native multi-tenancy and proven scale beyond what an SMB will ever generate
Logs backend	Grafana Loki	LogQL is closer to PromQL than Elasticsearch DSL; storage cost is a fraction of ELK at our typical log volumes
Edge collector	Grafana Alloy	Single binary that replaces Promtail, node_exporter, Vector, and OpenTelemetry collectors; one config to maintain per site
Workflow + automation	n8n	Alert routing, ticket creation, scheduled jobs that need conditional logic, integration with HaloPSA / Teams / Slack
Network flow analysis	sFlow-RT	Flow analysis at line rate; feeds traffic-pattern metrics into Mimir for capacity and anomaly views
Config backup	Oxidized	Git-based device config backup across Cisco, Fortinet, Arista, Aruba, Palo Alto, and Dell
Compute substrate	Talos Linux + Cilium	Immutable Kubernetes nodes, eBPF networking. We run it because we have audited it.

Every layer is open source and self-hosted. We pay zero recurring SaaS license fees for the observability function itself. That is why we can include it in the Core package without breaking margin, and why our pricing for clients with large device counts stays predictable instead of climbing linearly with the inventory.

What it looks like in production

We run this stack today as a live multi-tenant platform across our internal operations, our sister entity Kennedy Consulting, and our managed and project-engagement clients. Tenant onboarding is Terraform-driven: one module block plus terraform apply provisions the Grafana org, the Mimir and Loki datasources with the correct X-Scope-OrgID headers, the standard NOC and infrastructure dashboards, and the alerting baseline. First telemetry is usually flowing within a week of signing the MSA.
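
To make the datasource step concrete, the sketch below builds the kind of Grafana datasource definitions such a module might produce. It is written in Python rather than Terraform and is not our actual module. The function name, naming scheme, and URLs are hypothetical. The httpHeaderName1 and httpHeaderValue1 fields are Grafana's standard mechanism for attaching a custom header to a datasource.

```python
import json


def tenant_datasources(tenant_code: str, mimir_url: str, loki_url: str) -> list[dict]:
    """Build Grafana datasource definitions that pin every query to one tenant."""

    def datasource(name: str, ds_type: str, url: str) -> dict:
        return {
            "name": f"{name}-{tenant_code}",   # hypothetical naming scheme
            "type": ds_type,
            "access": "proxy",                 # Grafana's backend sends the request and adds the header
            "url": url,
            "jsonData": {"httpHeaderName1": "X-Scope-OrgID"},
            "secureJsonData": {"httpHeaderValue1": tenant_code},
        }

    return [
        datasource("mimir", "prometheus", mimir_url),  # Mimir speaks the Prometheus API
        datasource("loki", "loki", loki_url),
    ]


if __name__ == "__main__":
    # Hypothetical tenant code and internal URLs, for illustration only.
    definitions = tenant_datasources(
        "acme",
        "http://mimir.example.internal/prometheus",
        "http://loki.example.internal",
    )
    print(json.dumps(definitions, indent=2))
```

Because the header value lives in the datasource definition, users in a tenant's Grafana org never choose or see the tenant ID; every query they run through that datasource is already scoped.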

The platform is also where we dogfood. The same Grafana org structure, Alloy collector pattern, and alert pipeline we sell to clients are what we use to operate our own infrastructure. If it breaks for us, we feel it before you do.

For an example of the same evidence-capture discipline applied at clinical infrastructure scale, see the active/active DR commissioning we ran for a Chicago-area academic medical center, where every network test ran with concurrent compute, virtualization, and storage observation by design.

If you want to see it in motion before you sign anything, start a conversation and we will walk you through a live tenant view.

Frequently asked

Do you self-host the observability stack or use a SaaS like Datadog?

We self-host. The metrics, logs, and dashboard tier runs in our Kubernetes cluster on Talos with Cilium networking. You get real Grafana access at grafana.4thoctet.com scoped to your tenant via Cloudflare Access. We pay no per-device monitoring license, which is a structural cost advantage we pass through in pricing.

How is tenant isolation enforced if the backend is shared?

Every metric, log, and dashboard query carries an X-Scope-OrgID HTTP header tied to your tenant code. Mimir and Loki refuse to return data outside that scope. Grafana enforces the same boundary at the org level. Customer-facing orgs cannot run cross-tenant queries at all; only our internal NOC org has cross-tenant read access, and that access is logged.

Can we export our dashboards and data if we leave?

Yes. Dashboards export as JSON. Metrics are Prometheus-compatible via the Mimir API. Logs export via LogQL or as raw line files. We hand over your alert rules in Prometheus/Loki rule format. No proprietary lock-in by data type.

What kind of alerting do you build on top of this?

Alerts route through Alertmanager into n8n workflows that handle deduplication, business-hours routing, severity-based escalation, and integration with your ticketing or chat (HaloPSA, Teams, Slack). The first 30 days of onboarding is calibration time so your team is not woken up at 3am by known-noisy probes.
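
For readers who want to see what the tenant-scoped queries from the isolation and export answers look like on the wire, here is a minimal Python sketch using only the standard library. The hostnames and the acme tenant code are placeholders, and the paths assume a default Mimir and Loki HTTP configuration. The requests are built but not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request


def tenant_request(base_url: str, path: str, tenant_code: str, params: dict[str, str]) -> Request:
    """Build a GET request scoped to one tenant via the X-Scope-OrgID header."""
    url = f"{base_url.rstrip('/')}{path}?{urlencode(params)}"
    return Request(url, headers={"X-Scope-OrgID": tenant_code})


# Instant PromQL query against Mimir's Prometheus-compatible API.
metrics_req = tenant_request(
    "https://mimir.example.internal",   # placeholder host
    "/prometheus/api/v1/query",         # default Mimir Prometheus prefix
    "acme",                             # placeholder tenant code
    {"query": "up"},
)

# LogQL range query against Loki.
logs_req = tenant_request(
    "https://loki.example.internal",    # placeholder host
    "/loki/api/v1/query_range",
    "acme",
    {"query": '{job="syslog"}', "limit": "100"},
)

if __name__ == "__main__":
    # Print the URLs instead of sending; use urllib.request.urlopen(req) to execute.
    print(metrics_req.full_url)
    print(logs_req.full_url)
```

urllib normalizes the header's capitalization internally, which is harmless because HTTP header names are case-insensitive. Any Prometheus- or Loki-compatible client that can set a header can run the same queries.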