@cawthorne cawthorne commented Jan 20, 2026

PR Description:

Summary

Improves the Tiingo WebSocket failover mechanism by balancing primary/secondary URL attempts and making URL selection visible in the logs.

Problem

During the Tiingo incident (2026-01-13 03:19-03:32 UTC):

  • The old failover logic used a 5:1 ratio (5 primary attempts for every 1 secondary attempt)
  • 83% of retries went to the primary URL and only 17% to the secondary
  • It took ~12 minutes before the secondary URL was tried
  • Failover logging was at TRACE level, so there was no visibility into URL selection
  • We could not confirm whether failover triggered during the incident (7 unresponsive detections were observed)

Changes

1. Balanced Failover Ratio (3:3 instead of 5:1)

Before:
// 5 primary : 1 secondary (repeating 6-attempt cycles)
const url = cycle !== URL_SELECTION_CYCLE_LENGTH - 1 ? primaryUrl : secondaryUrl
// Pattern: P P P P P S P P P P P S ...
// 83% primary, 17% secondary

After:
// 3 primary : 3 secondary (repeating 6-attempt cycles)
const cycle = zeroIndexedNumAttemptedConnections % URL_SELECTION_CYCLE_LENGTH
const url = cycle < 3 ? primaryUrl : secondaryUrl
// Pattern: P P P S S S P P P S S S ...
// 50% primary, 50% secondary
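
To make the two patterns above easy to check, here is a minimal sketch that replays both selection rules over twelve attempts. It is an illustration, not the adapter's code: the value 6 for URL_SELECTION_CYCLE_LENGTH is inferred from the "6-attempt cycles" comments, and `selectOld`, `selectNew`, and `pattern` are hypothetical helper names.

```typescript
// Assumed value: the PR text only says cycles repeat every 6 attempts.
const URL_SELECTION_CYCLE_LENGTH = 6

type Target = 'P' | 'S' // 'P' = primary URL, 'S' = secondary URL

// Old rule: only the last attempt in each cycle goes to the secondary.
function selectOld(zeroIndexedAttempt: number): Target {
  const cycle = zeroIndexedAttempt % URL_SELECTION_CYCLE_LENGTH
  return cycle !== URL_SELECTION_CYCLE_LENGTH - 1 ? 'P' : 'S'
}

// New rule: the first three attempts in each cycle go to the primary,
// the remaining three go to the secondary.
function selectNew(zeroIndexedAttempt: number): Target {
  const cycle = zeroIndexedAttempt % URL_SELECTION_CYCLE_LENGTH
  return cycle < 3 ? 'P' : 'S'
}

// Renders the selection sequence for the first `attempts` attempts.
function pattern(select: (attempt: number) => Target, attempts: number): string {
  return Array.from({ length: attempts }, (_, i) => select(i)).join(' ')
}

console.log(pattern(selectOld, 12)) // P P P P P S P P P P P S
console.log(pattern(selectNew, 12)) // P P P S S S P P P S S S
```

Under the new rule the secondary is first tried on the 4th attempt instead of the 6th.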

changeset-bot bot commented Jan 20, 2026

⚠️ No Changeset found

Latest commit: 2d9f56e

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.


Comment on lines -153 to -155
// business logic connection attempts (repeats):
// 5x try connecting to primary url
// 1x try connection to secondary url

Let's validate this logic with DQ. Should the secondary be used only in extreme circumstances, or are the two endpoints equivalent?

  urlConfigFunctionParameters.streamHandlerInvocationsWithNoConnection - 1
  const cycle = zeroIndexedNumAttemptedConnections % URL_SELECTION_CYCLE_LENGTH
- const url = cycle !== URL_SELECTION_CYCLE_LENGTH - 1 ? primaryUrl : secondaryUrl
+ const url = cycle < 3 ? primaryUrl : secondaryUrl

A few points:

  • Let's parametrize this functionality so we can control it via config.
    • We could effectively set cycle=1 so that attempts alternate between the two URLs, giving a max downtime of 2m; possibly re-use URL_SELECTION_CYCLE_LENGTH for this.
  • Let's discuss with DOPs to assess the maximum acceptable "downtime" before we fail over to the secondary.
    • Are the primary and secondary endpoints "equal"? This needs DOPs input.
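
One possible shape for the parametrization suggested above, as a rough sketch only. The `FailoverConfig` fields, the `selectUrl` helper, and the placeholder URLs are all hypothetical and do not come from the PR; reading "cycle=1" as one attempt per URL is also an assumption.

```typescript
// Hypothetical config shape; neither field name appears in the PR.
interface FailoverConfig {
  primaryAttempts: number // consecutive attempts on the primary URL per cycle
  secondaryAttempts: number // consecutive attempts on the secondary URL per cycle
}

function selectUrl(
  zeroIndexedNumAttemptedConnections: number,
  config: FailoverConfig,
  primaryUrl: string,
  secondaryUrl: string,
): string {
  const { primaryAttempts, secondaryAttempts } = config
  if (!Number.isInteger(primaryAttempts) || primaryAttempts < 1) {
    throw new RangeError('primaryAttempts must be a positive integer')
  }
  if (!Number.isInteger(secondaryAttempts) || secondaryAttempts < 0) {
    throw new RangeError('secondaryAttempts must be a non-negative integer')
  }
  // The cycle length is derived from the two counts instead of being hardcoded.
  const cycleLength = primaryAttempts + secondaryAttempts
  const cycle = zeroIndexedNumAttemptedConnections % cycleLength
  return cycle < primaryAttempts ? primaryUrl : secondaryUrl
}

// Placeholder URLs for illustration only.
const PRIMARY = 'wss://primary.example'
const SECONDARY = 'wss://secondary.example'

// One attempt per URL: attempts alternate primary, secondary, primary, ...
const alternating = [0, 1, 2, 3].map((n) =>
  selectUrl(n, { primaryAttempts: 1, secondaryAttempts: 1 }, PRIMARY, SECONDARY),
)
console.log(alternating)
```

With `{ primaryAttempts: 3, secondaryAttempts: 3 }` this reproduces the 3:3 pattern from this PR, and `{ primaryAttempts: 5, secondaryAttempts: 1 }` reproduces the old 5:1 behavior, so the ratio agreed with DOPs could be changed without a code change.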


2 participants