File Transfer Service Performance Degradation

Incident Report for Orderful

Resolved

Problem: Starting at 11 PM PT on June 9th, our File Transfer Service (FTS), which is responsible for communication with Orderful and externally hosted SFTP servers, began experiencing a gradual degradation in its polling functionality. This slowed the intake of incoming files to Orderful and created a backlog that the service could not process promptly, which led to periodic delays in some file deliveries.

Root Cause: The degradation originated from a bug in an FTP library that made the "list files" operation time out on slower servers past a certain threshold. These timeouts blocked polling cycles, which caused tasks to restart. The restarts, together with the continued timeouts, consumed a significant portion of our deployed resource capacity and reduced the time spent polling folders.
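To illustrate the failure mode described above, here is a minimal sketch of one way to keep a slow "list files" call from blocking an entire polling cycle: each listing runs in a worker thread and the poller stops waiting after a deadline. This is not the fix applied by Orderful or by the FTP library; `list_with_deadline`, `poll_cycle`, and the `list_files` callable are hypothetical names used only for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


def list_with_deadline(list_files, server, timeout_s, pool):
    """Run list_files(server), waiting at most timeout_s seconds.

    Returns (files, timed_out). list_files is a hypothetical callable
    standing in for the library's "list files" operation.
    """
    future = pool.submit(list_files, server)
    try:
        return future.result(timeout=timeout_s), False
    except FutureTimeout:
        # The worker thread may still be running; we only stop waiting,
        # so the rest of the polling cycle can continue.
        return [], True


def poll_cycle(servers, list_files, timeout_s=1.0):
    """Poll every server once, recording which listings timed out."""
    results = {}
    # One worker per server so a stuck listing cannot starve the others.
    pool = ThreadPoolExecutor(max_workers=len(servers) or 1)
    try:
        for server in servers:
            results[server] = list_with_deadline(
                list_files, server, timeout_s, pool
            )
    finally:
        pool.shutdown(wait=False)
    return results
```

A timed-out listing still occupies a thread until it returns, so in practice a deadline like this would likely be paired with a socket-level timeout in the client itself.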

Additionally, a gap in our observability prevented the degradation from triggering our existing alerts. The current alerting mechanisms include:

Anomalous EDI job creation rates for native AS2 and FTP.

No FTP servers polled in the past 10 minutes.

Anomaly detected in FTP polling.

No files successfully ingested over the past 15 minutes.

Anomaly detected in the FTP failure rate.

Polling locked queue is backing up.

Elevated estimated wait time for the polled queue.

Since no files failed and the degradation occurred slowly enough to mimic standard cyclical volume changes (evenings, weekends, holidays, etc.), the anomaly alerts in place were not triggered.

Solution: Restarting the service restored full functionality, and scaling up the service allowed it to catch up on delayed files within minutes. Performance was fully restored and additional monitoring was in place by 5:20 PM PT on June 11th.

Preventative Actions:

Prioritized Polling:

As part of the restart, we added parameters that prioritize polling servers whose last polling cycle did not end in a restart. With a fixed amount of resources, this change is expected to limit degradation for servers that are not experiencing timing issues.
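The prioritization described above can be sketched as a simple sort. This is an illustration under stated assumptions, not the service's actual implementation: the `ServerState` fields and the `polling_order` function are hypothetical, and the "least recently polled first" tie-break is an added assumption that the report does not mention.

```python
from dataclasses import dataclass


@dataclass
class ServerState:
    name: str
    last_cycle_ended_in_restart: bool  # hypothetical flag kept per server
    last_polled_at: float              # epoch seconds of the last poll


def polling_order(servers):
    """Return servers in the order they should be polled.

    False sorts before True, so servers whose last cycle completed
    normally come first. Within each group, the least recently polled
    server goes first (an assumed tie-break).
    """
    return sorted(
        servers,
        key=lambda s: (s.last_cycle_ended_in_restart, s.last_polled_at),
    )
```

Servers that previously triggered a restart are still polled, only later in the cycle, so a timeout on one of them costs less polling time for the rest.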

Improved Observability:

We have added a new monitor to record the count of files left in a folder after a polling cycle. This monitor would have shown a steady increase over the incident period and has been configured with a threshold to raise a P1 alarm (monitored 24/7).
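As a rough sketch of how such a monitor could behave, the class below alerts when the number of files left in a folder stays above a fixed threshold for several consecutive polling cycles. Because it compares against an absolute backlog rather than against historical volume, a slow drift would still cross it. The class name, the threshold value, and the consecutive-cycle rule are all assumptions for illustration; the report does not describe the monitor's actual configuration.

```python
from collections import deque


class LeftoverFilesMonitor:
    """Tracks files left in a folder after each polling cycle."""

    def __init__(self, threshold, consecutive_cycles=3):
        self.threshold = threshold
        # Keep only the most recent N observations.
        self.recent = deque(maxlen=consecutive_cycles)

    def record(self, files_left):
        """Record one cycle's leftover count; return True if it should alert."""
        self.recent.append(files_left)
        window_full = len(self.recent) == self.recent.maxlen
        return window_full and all(n > self.threshold for n in self.recent)
```

Requiring several consecutive cycles above the threshold is one way to avoid paging on a single large batch that clears on the next poll.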

Longer-term, we are replacing our existing telemetry system with one that will allow us to add more dimensions for increased granularity of reporting. For example, we will be able to monitor a communication channel by sender, receiver, and transaction type simultaneously. This new system will enable us to detect changes with greater precision, ensuring that degradations are identified as prominently as outages. The estimated completion time for this upgrade is early Q4.
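To make the idea of multi-dimensional telemetry concrete, here is a small sketch of a counter keyed by sender, receiver, and transaction type that can be queried along any combination of those dimensions. It is purely illustrative and says nothing about the telemetry system being adopted; `DimensionalCounter` and the example values are hypothetical.

```python
from collections import Counter


class DimensionalCounter:
    """Counts events keyed by (sender, receiver, transaction_type)."""

    def __init__(self):
        self._counts = Counter()

    def record(self, sender, receiver, transaction_type, n=1):
        self._counts[(sender, receiver, transaction_type)] += n

    def total(self, sender=None, receiver=None, transaction_type=None):
        """Sum counts matching the given dimensions; None matches anything."""
        wanted = (sender, receiver, transaction_type)
        return sum(
            count
            for key, count in self._counts.items()
            if all(w is None or w == k for w, k in zip(wanted, key))
        )
```

With counts stored per combination, a drop for one sender and transaction type is visible even when the aggregate volume across all channels looks normal.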

Conclusion: We apologize for any inconvenience this incident may have caused. Our team is committed to preventing future occurrences and continuously improving our service reliability. If you have any questions or concerns, please do not hesitate to reach out.

Posted Aug 19, 2024 - 21:46 UTC