EU PROD - IoT-Flow incident

Major incident EU PROD ThingPark X IoT-Flow
2025-04-02 15:45 UTC · 1 hour, 6 minutes

Updates

Resolved

The incident is now closed. Here are some details.

Timeline

2:45PM UTC+2
Decoding of IoT-Flow payloads is unavailable.

4:40PM UTC+2
Hubs are stopped so that messages are buffered and no further decoding is lost.
Uplink processing is suspended (uplinks are not lost).

5:30PM UTC+2
Hubs are restarted. Uplink processing is resumed.

6PM UTC+2
End of the outage

Impacts

IoT-Flow was unable to decode payloads between 2:45PM (UTC+2) and 4:40PM (UTC+2), and again between 5:30PM (UTC+2) and 6PM (UTC+2).
Messages were buffered (delayed processing with no loss) between 4:40PM (UTC+2) and 5:30PM (UTC+2).

Root cause

An ingress update on our Kubernetes cluster (the ingress is the component responsible for network access and traffic routing) caused the loss of a piece of configuration that allows the decoder to access the database.
The missing configuration has been regenerated and the ingress restarted to resolve the issue.

Enhancement

The missing configuration is being integrated into the ingress upgrade process so that it is included automatically.

April 2, 2025 · 16:41 UTC
Monitoring

The incident is over, and traffic towards connections has been catching up since 15:30 UTC.

April 2, 2025 · 16:19 UTC
Investigating

Since 12:30 UTC, the payload decoding engine has been experiencing an issue that results in an increased error rate during traffic delivery towards connections using TPX.

Our teams are actively investigating the issue.

April 2, 2025 · 12:30 UTC
