Advice for mitigating service disruptions

We were impacted by the outage today (https://trust.okta.com/#incident/a9C2A000000PBikUAG) and I was wondering if you could provide some guidance on how we can mitigate this. How does one create redundant Okta Org Authorization Servers? Do we need to do that? What are your best practices?

–Ray

Apologies for the outage yesterday. This is the first one since I’ve been at Okta that impacted the developer tier’s home, and my own applications were affected as well.

You ask a really good question, but given the root cause of the underlying issue, the only way this particular outage could have been mitigated is by not being in that particular cell.

That said, Okta does give you the ability to run multiple authorization servers in the same organization, and these could be used for failover if one authorization server went down. Their original purpose is to serve different audiences (one audience per authorization server), but you could possibly set up more than one for redundancy.
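As a rough illustration of the failover idea above, here is a minimal Python sketch that picks the first authorization server whose metadata document responds. The issuer URLs, the server IDs, and the function names are hypothetical placeholders, and it assumes each custom authorization server publishes its metadata at /.well-known/oauth-authorization-server under its issuer URL. Two caveats: tokens minted by different authorization servers carry different issuers, so your resource servers would need to trust both, and since both servers live in the same organization this would not help with a cell-wide outage like the one discussed here.

```python
import urllib.error
import urllib.request
from typing import Callable, Iterable, Optional

# Hypothetical issuer URLs: substitute your own org domain and server IDs.
ISSUERS = [
    "https://dev-xxx.oktapreview.com/oauth2/primaryServerId",
    "https://dev-xxx.oktapreview.com/oauth2/secondaryServerId",
]


def metadata_is_reachable(issuer: str, timeout: float = 3.0) -> bool:
    """Return True if the issuer's metadata document answers with HTTP 200."""
    url = issuer.rstrip("/") + "/.well-known/oauth-authorization-server"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Covers HTTP errors (4xx/5xx), DNS failures, and timeouts.
        return False


def pick_issuer(
    issuers: Iterable[str],
    probe: Callable[[str], bool] = metadata_is_reachable,
) -> Optional[str]:
    """Return the first issuer that passes the probe, or None if all fail."""
    for issuer in issuers:
        if probe(issuer):
            return issuer
    return None
```

A client would call pick_issuer(ISSUERS) before starting a login flow and configure itself with whichever issuer comes back. The probe is passed in as a parameter so the selection logic can be exercised without touching the network.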

I think there is an interesting product enhancement in your question: how Okta could allow you to have the same organization in multiple cells so that, if one goes down, we can elegantly funnel traffic to the other cell. I’m going to talk to our architects about this. It definitely seems feasible and would be a value-add for customers needing automatic failover and redundancy.

Let me know if you have any questions - always happy to help.
Tom

Thanks Tom.

I’m not 100% familiar with your architecture yet so the difference between a cell and an authorization server is not clear to me, since the 400 failure happened when I called the authorization server. Nevertheless, I’d be very grateful to hear the outcome of your discussion with the Okta architects.

BTW, pass on my kudos to the team for a great product and to you and @robertjd for how responsive you are on these forums.

–Ray

Oh yea, the terminology is a little confusing. A cell hosts a set of Okta organizations; you can imagine a cell as the set of machines and databases that run Okta. Your organization hosts your tenant, which is a directory, authorization servers, and a set of policies around all of it. This is a simplification, but I think that is all I need to cover. One of the benefits of having organizations in cells is it is very uncommon for service disruptions to affect all customers.

And thanks for your kind words. I’ll pass the feedback on to the team; building the product and supporting people who build on top of Okta is a group effort. We strive to create a team that is motivated by customer success.

Hey Guys,

I’m getting intermittent 404s on my dev-xxx.oktapreview.com/api/v1/authn. Trust says nothing’s going on, but I’m 99% sure there is. It looks like one of the servers behind a load balancer is whacked or something.

Would appreciate confirmation of this to make sure I’m not going nuts.

https://trust.okta.com/

–Ray

@RayRenteria Are you still seeing this? I’ve sent about 50 requests to my own test orgs and haven’t noticed any 404s. It’s a small sample size though.

If you’re still experiencing errors, please email support@okta.com and provide logs (if you have them).
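For anyone who wants to gather evidence of intermittent errors like this before contacting support, here is a minimal Python sketch that sends a batch of requests and tallies the status codes that come back. The function names are hypothetical, and the URL and request body you pass in are up to you; nothing here is an official Okta diagnostic. Keep the attempt count modest so you stay within your org's rate limits.

```python
import urllib.error
import urllib.request
from collections import Counter
from typing import Callable, Optional


def fetch_status(url: str, body: Optional[bytes] = None, timeout: float = 5.0) -> int:
    """Send one request and return the HTTP status code (0 on network failure).

    If body is given, urllib sends a POST with that payload; otherwise a GET.
    """
    headers = {"Accept": "application/json"}
    if body is not None:
        headers["Content-Type"] = "application/json"
    req = urllib.request.Request(url, data=body, headers=headers)
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses land here; the code is what we want to record.
        return err.code
    except (urllib.error.URLError, OSError):
        # DNS failure, refused connection, timeout, and similar.
        return 0


def tally_statuses(send: Callable[[], int], attempts: int = 50) -> Counter:
    """Call send() the given number of times and count each status code seen."""
    return Counter(send() for _ in range(attempts))
```

For example, tally_statuses(lambda: fetch_status(url, body)) returns a Counter mapping each status code to how often it occurred, which gives support something concrete to look at alongside timestamps from your own logs.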

Thanks for the response, @nate.barbettini. Things have settled now. I’ll keep meticulous records next time. My colleague didn’t experience the same issues I did. The errors were definitely being thrown by calls from within the authClient, though: a 404 on the authentication API call, with a CORS anomaly being reported back.

I’ll be sure to share here if it ever happens again.

–Ray
