OV Enterprise HA - AutoScale recycled both Master Nodes at the same time

We did a test with OV Enterprise High Availability.

Approximately 1800 sessions, most of them with 1 publisher and 2 subscribers. Each session lasted about 2.5 hours.

AWS deployment: 2 Master Nodes (C5.2xlarge), 2 Media Nodes (C5.2xlarge).
AWS OpenSearch 7.10, 1 node (M5.xlarge).

Everything was fine for about 1.5 hours, with Master Node CPU around 30%. Then the CPU on both Master Nodes started climbing, reaching about 70% over roughly 20 minutes, at which point the autoscaling group recycled both Master Nodes because they had failed their health checks.

Looking at the OpenSearch cluster health, it appears some garbage collection was happening at the time the Master Node health checks failed.

Is it possible that high latency on ElasticSearch caused the issue?

Do you have any recommendations on the scale of the ElasticSearch implementation to use with OVE?

Guys - I really need some advice here. The same thing happened to me this week, except there were 5 Master Nodes running (C5.xlarge). All was great for 1.5 hours and then the CPU started to creep up. At around 75% CPU, the autoscale group recycled all 5 Master Nodes at the same time. I used a large ElasticSearch cluster and there didn’t seem to be any issues there. We had about 2500 sessions, each with 2 streams (publisher from user A → subscriber to user B); the publisher stream was recorded using individual recording.

It seems that the Master Nodes simply get overwhelmed at some point, fail their health checks, and are terminated and replaced by the autoscaling group. I can add more servers, but I suspect the same thing will happen. I can use bigger servers, I can use more servers, but is there any advice on how to set this up? I need to support 6000 sessions 5 days from now.

Hello @mrussiello

We need the logs of the system at the moment it crashes. If the CPU of all Master Nodes starts to increase at the same time, that makes me think there may be a memory leak or a bottleneck at some common point. Maybe the AWS Redis database also shows some kind of warning or error in its logs.

Before your second message, my best guess was that ElasticSearch was giving errors after receiving and processing the hundreds of thousands of messages generated by the thousands of sessions in your test. After your second scenario, I am not so sure anymore. Either way, ElasticSearch errors should show up in the openvidu-server-pro logs.

So the plan of action could be summarized as collecting as many logs as possible at the moment the cluster starts to get out of control in its CPU usage. That should give us some intel to work with.
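
To avoid missing that window, one option (just a sketch on my side, assuming the AWS SDK for JavaScript v3; the group name and SNS topic ARN are placeholders for your own resources) is a CloudWatch alarm on the Master Node Auto Scaling group CPU, so you get a notification before the health checks start failing:

```typescript
// Sketch: notify when the Master Node Auto Scaling group averages > 60% CPU,
// so logs can be collected before the instances start failing health checks.
// "ov-master-asg" and the SNS topic ARN are placeholders, not real resources.
import { CloudWatchClient, PutMetricAlarmCommand } from "@aws-sdk/client-cloudwatch";

const cw = new CloudWatchClient({ region: "eu-west-1" });

async function createCpuAlarm(): Promise<void> {
  await cw.send(new PutMetricAlarmCommand({
    AlarmName: "ov-master-cpu-high",
    Namespace: "AWS/EC2",
    MetricName: "CPUUtilization",
    Dimensions: [{ Name: "AutoScalingGroupName", Value: "ov-master-asg" }],
    Statistic: "Average",
    Period: 60,                    // evaluate every minute
    EvaluationPeriods: 5,          // sustained for 5 minutes
    Threshold: 60,                 // well below the ~70% you saw before recycling
    ComparisonOperator: "GreaterThanThreshold",
    AlarmActions: ["arn:aws:sns:eu-west-1:123456789012:ov-alerts"],
  }));
}

createCpuAlarm().catch(console.error);
```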

Hello @mrussiello, I’ve created a CloudFormation template that you can use without needing an Elasticsearch deployment, just to rule out a possible error related to Elasticsearch, as @pabloFuente said.

https://s3.eu-west-1.amazonaws.com/aws.openvidu.io/CF-OpenVidu-Enterprise-NO-ELK-2.22.1.yaml

Would you be able to replicate the load test and see what happens?

Thanks guys. I appreciate the support.

We are doing another test this Sunday morning at 1 am GMT. It should be 6000 sessions, each with 2 streams (sometimes 3), and each doing individual recording of one stream. The use case is remote test proctoring with live proctors watching each testing session. This is double the number of sessions from last week (3000) and up from the previous two weeks (2000 and 1000). Each session lasts about 3 hours.

One thing I’ve noticed is that test takers refresh their browsers, causing the session to disconnect and reconnect, sometimes 10 times over the 3 hours. The test proctor also has to do so in that case, so there is a lot of session cycling.

Unfortunately I cannot do an Elasticsearch-free deployment, since these tests are real sessions with real people. But my last ES cluster was 3 M5.xlarge nodes; this week I’ll use 4 M5.2xlarge nodes to ensure there are no issues there.

I will make sure I capture the OV logs at the instant of each crash. Will be in touch.

Guys - reporting back on this.

Yesterday we ran an OV Enterprise deployment with 6000 sessions, each with one publisher and two subscribers (and sometimes 2 publishers), so ~18000 streams, all individually recorded and all lasting about 3 hours. We used C5.2xlarge for all nodes, with 7 Master Nodes (avg CPU 40%) and 10 Media Nodes (avg CPU 35%). The Media Nodes did well; there was only one Media Node crash. However, most of the Master Nodes eventually saw their CPU creep above 70% or so and then crashed (the autoscale group terminated them and started new instances due to failed health checks) after about 2 hours of runtime.

ElasticSearch (OpenSearch) used 4 M5.2xlarge nodes with 100GB each and didn’t seem to have any issues.

I tried to obtain the logs, but the autoscale group kept terminating the nodes before I could copy them. I see now that I can prevent that auto-termination, and will do so next time we run the same scenario (31 July). Sorry I screwed this up this time.
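
For reference, this is roughly what I plan to set up beforehand. It’s only a sketch assuming the AWS SDK for JavaScript v3, with placeholder resource names; since scale-in protection by itself does not stop the group from replacing instances that fail health checks, I also plan to suspend the ReplaceUnhealthy process on the Master Node group while I collect logs:

```typescript
// Sketch: keep crashed Master Nodes around long enough to copy their logs.
// "ov-master-asg" and the instance ID are placeholders for my own resources.
import {
  AutoScalingClient,
  SetInstanceProtectionCommand,
  SuspendProcessesCommand,
} from "@aws-sdk/client-auto-scaling";

const asg = new AutoScalingClient({ region: "eu-west-1" });

async function protectForLogCollection(): Promise<void> {
  // Stop the group from replacing (and terminating) instances that fail health checks.
  await asg.send(new SuspendProcessesCommand({
    AutoScalingGroupName: "ov-master-asg",
    ScalingProcesses: ["ReplaceUnhealthy", "Terminate"],
  }));

  // Additionally protect specific instances from scale-in events.
  await asg.send(new SetInstanceProtectionCommand({
    AutoScalingGroupName: "ov-master-asg",
    InstanceIds: ["i-0123456789abcdef0"],
    ProtectedFromScaleIn: true,
  }));
}

protectForLogCollection().catch(console.error);
```

The processes have to be resumed afterwards (ResumeProcessesCommand), otherwise the group will never replace genuinely dead nodes.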

Questions for now:

  1. Is it better to use two C5.2xlarge instances or one C5.4xlarge instance for the Master Nodes? We are running the same-scale event in two weeks.

  2. Sometimes, when subscribing to an existing session, the videoElementCreated event is received twice for the same stream, and on the second one the video element (event.element) is null. I have also seen this happen in OV CE; rebooting the OV CE server removes the condition. To be clear: we receive one streamCreated event and call ovsession.subscribe() once, but the new subscriber receives two videoElementCreated events, one right after the other, the second with no event.element. We have revised our code to ignore this second videoElementCreated event (see the sketch after this list), but I wanted to let you know about it.
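
For reference, the guard we added looks roughly like this (a sketch using openvidu-browser; the container id, class name, and logging are just illustrative, not part of the bug report):

```typescript
// Sketch: subscribe once per streamCreated and ignore the duplicate
// videoElementCreated event that arrives with a null event.element.
import { OpenVidu, Session, StreamEvent, VideoElementEvent } from "openvidu-browser";

const ov = new OpenVidu();
const ovsession: Session = ov.initSession();

ovsession.on("streamCreated", (event: StreamEvent) => {
  // One subscribe call per stream; "video-container" is our own DOM element id.
  const subscriber = ovsession.subscribe(event.stream, "video-container");

  let videoAttached = false;
  subscriber.on("videoElementCreated", (ev: VideoElementEvent) => {
    // Occasionally a second videoElementCreated fires with ev.element === null.
    if (videoAttached || !ev.element) {
      console.warn("Ignoring duplicate/empty videoElementCreated", ev);
      return;
    }
    videoAttached = true;
    ev.element.classList.add("remote-video"); // our own styling hook
  });
});
```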

Thanks for your help. Mike

Hello @mrussiello, first of all, thank you so much for your time and for helping the community by posting this in the forum with such detailed explanations. I will try to answer your questions:

  1. Given that we don’t know what is happening, maybe a C5.4xlarge is better, but if there is some kind of memory leak or similar, the problem should simply take longer to appear (I’m speculating, because I don’t know what could be happening). So I would give C5.4xlarge a try and see how the system behaves, if you want.

  2. Is this a consistent error? Or does it only happen after the server has been running for some time under load? I’ll tag @pabloFuente in case he knows whether this has happened to us previously or it is a known bug.

One question about your deployment: are you using Mediasoup as the media server?

Thank you so much Mike, let’s see if the situation repeats on 31 July, as this bug looks consistent.

Best Regards,
Carlos

Thank you Carlos!

Yes, the Master Node crashes seem to happen consistently about 2 hours after a node starts and comes under load.

Yes we are using Mediasoup as the media server.

The use case is test takers and test proctors, some in areas with poor Internet.

I’m thinking that the servers just get overwhelmed with requests from all the sessions, plus new session, publish, and subscribe requests for the same sessions, since both test takers and proctors can keep reloading their pages if they have Internet issues.

I’ve been looking through my application for ways to reduce the number of session.fetch() operations and to prevent users from continually reloading their web pages and firing rapidly repeated subscribe or publish requests. This is all ready to go for next Saturday.
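
One of those changes is simply caching the result of fetch() on our backend for a short window, so page reloads cannot turn into a burst of REST calls against the Master Nodes. This is just a sketch with openvidu-node-client; the 5-second window and the helper name are our own:

```typescript
// Sketch: throttle OpenVidu REST fetches so repeated page reloads from the
// same users do not translate into a burst of fetch() calls on the Master Nodes.
import { OpenVidu, Session } from "openvidu-node-client";

const OPENVIDU_URL = process.env.OPENVIDU_URL ?? "https://ov.example.com";
const OPENVIDU_SECRET = process.env.OPENVIDU_SECRET ?? "MY_SECRET";

const openvidu = new OpenVidu(OPENVIDU_URL, OPENVIDU_SECRET);

let lastFetch = 0;
let pendingFetch: Promise<boolean> | null = null;
const FETCH_WINDOW_MS = 5000;

// All request handlers call this instead of openvidu.fetch() directly.
async function throttledFetch(): Promise<Session[]> {
  const now = Date.now();
  if (pendingFetch) {
    await pendingFetch;                 // reuse an in-flight fetch
  } else if (now - lastFetch > FETCH_WINDOW_MS) {
    pendingFetch = openvidu.fetch();    // at most one fetch per window
    try {
      await pendingFetch;
      lastFetch = Date.now();
    } finally {
      pendingFetch = null;
    }
  }
  return openvidu.activeSessions;       // possibly a few seconds stale
}
```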

I will try C5.4xlarge. I will also enable termination protection on the first couple of Master Nodes so that if they crash I’ll have time to obtain the OpenVidu log files.

You also mentioned the Redis log files. Can you help me understand how to obtain the Redis logs?

Thanks again,

Mike

Hi Guys,

OK, we did another big test event on OV Enterprise 2.22.0: 6000 sessions, 18000 streams of 3-hour duration, 1 stream recorded per session.

The first 2 hours all went well. Then at least one Media Node crashed and was terminated by AWS. I believe I had removed scale-in protection, so perhaps I caused the problem, but the Master Nodes did not know the Media Node was gone, and this lasted a long time (about 60 minutes). This caused the Master Nodes to keep selecting the phantom (terminated) Media Node for new sessions because they thought the load on that node was low, so many sessions had no video connection since they were trying to connect to a Media Node that did not exist. You could see it in the Master Node OV logs. The only way we could recover was deleting the AWS stack and re-creating it. I have some log files from Master Nodes operating with this issue if you want to see them.

Before this happened, however, several Master Nodes crashed and new nodes were created. I had enabled termination protection on the initial Master Nodes, but they were still terminated on crash and I could not get their logs. Very frustrating. I used 5 C5.4xlarge Master Nodes, and they all crashed at about the same time. ElasticSearch had 5 M5.2xlarge nodes and was fine. I will keep trying to get you some log files.

One question: Would placing the coturn server on the media nodes help avoid crashes on the master nodes? If yes, is this process stable yet?

Thanks again for your help. In a couple of months we are looking at 25000 simultaneous sessions and 75000 streams of 3-hour duration, with 25000 of them being recorded. I appreciate your help in getting us ready for that!

Mike

Hello Mike,

Answering your last question: yes, I am sure that with such high volumes of network traffic and client connections, hosting the coturn server on the Master Node can overload it. So transferring it to the Media Nodes is, IMHO, not only advisable but mandatory.
You can configure coturn in Media Nodes just by changing a configuration property. Search for OPENVIDU_PRO_COTURN_IN_MEDIA_NODES: OpenVidu configuration - OpenVidu Docs

Set it to true to host coturn in the Media Nodes. This feature is considered experimental for now, but only as a matter of precaution, to ensure a minimum number of hours of testing in real environments. We have not encountered any practical limitations or unexpected bugs so far, and this configuration property will probably become standard in an upcoming release, with true as the default value.

Best regards.

Thank you Pablo. We will definitely do that. I expect we should then plan on more CPU capacity in the Media Nodes. I will let you know when we do another test with that configuration.

Your help has been much appreciated. We’re huge fans of OpenVidu!

Mike