A race condition on 2.24

After installing OpenVidu Pro version 2.24 on AWS (ap-northeast-2) and testing it, the following bug recurred. It was reported as fixed in 2.23, but the same bug still seems to exist.

“OpenVidu Pro/Enterprise: there was a race condition when manually launching multiple Media Nodes at the same time that could end with some of them not being properly added to the cluster. This is now fixed.”

state: FAILED

Thanks.

Hello Jimahn,

  • Are you using OpenVidu PRO Single Master?
  • Did you deploy a new CloudFormation, or did you just follow the upgrade instructions?
  • When you say “state: FAILED”, where is this error appearing?

Hello cruizba
I installed OpenVidu Pro 2.24 from a new AWS CloudFormation deployment, single master, not upgraded.
The error messages appear in the ‘openvidu logs’ output, as in the following image.

Whenever I start the OpenVidu server with new KMS servers, it’s OK.
But I need to keep the previous KMS servers so I can monitor the same KMS IPs.

But this is not an error launching the Media Node; it is an ICE candidate that failed (which does not mean other ICE candidates did not succeed, and it does not mean the media server is incorrectly set up).

Is there any error affecting your working app? Or is it just the message which makes you think errors are happening?

In order to save on AWS charges and keep monitoring the same KMS IPs, I stop the KMS and OpenVidu servers at 12:00 at night. Then every day I start the media EC2 instances first at 6:00 in the morning, and start the OpenVidu EC2 instance 10 minutes later.

When the error message occurs, one’s own video is visible, but the other party’s video is not. After an ‘openvidu restart’, they can see each other’s videos again. These errors occur at unpredictable times.

I suspect this error occurs occasionally because the 3 media servers are started first and the OpenVidu Pro server is started afterwards.

I changed the setting so that when the OpenVidu Pro server starts every morning, new KMS servers are created and connected. However, there are still occasional errors where the other party’s camera cannot be seen.
Your answer is too late.


Hello Jimahn,

Sorry, we have been working hard on new features, the holidays were in the middle, and I missed answering this.

May I ask why you start the media nodes before the master node?

Try this configuration parameter:

OPENVIDU_PRO_CLUSTER_MEDIA_NODES=3

Instead of stopping the media nodes, do this:

  1. Stop OpenVidu EC2 instance
  2. Terminate all media nodes

When you start the server again, with the property OPENVIDU_PRO_CLUSTER_MEDIA_NODES=3, 3 new media nodes should be created.
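For example, something along these lines should do it on AWS (just a sketch; the instance IDs are placeholders, and /opt/openvidu is the default path of a standard deployment):

# In /opt/openvidu/.env on the Master Node, set:
# OPENVIDU_PRO_CLUSTER_MEDIA_NODES=3

# At night: stop the OpenVidu Master instance and terminate the Media Nodes
aws ec2 stop-instances --instance-ids i-MASTER
aws ec2 terminate-instances --instance-ids i-MEDIA1 i-MEDIA2 i-MEDIA3

# In the morning: start only the Master instance; with the property above,
# OpenVidu should launch 3 fresh Media Nodes by itself
aws ec2 start-instances --instance-ids i-MASTER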

Some additional notes: make sure your Media Nodes are created with a public IP (the subnet should be configured to automatically assign a public IP to new instances). AWS is turning off automatic public IP assignment on default subnets.
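For example, with the AWS CLI (the subnet ID here is a placeholder for the subnet where the Media Nodes are launched):

# Enable automatic public IP assignment for new instances in the subnet
aws ec2 modify-subnet-attribute --subnet-id subnet-0123456789abcdef0 --map-public-ip-on-launch

# Check that newly launched Media Nodes actually get a public IP
aws ec2 describe-instances --filters "Name=subnet-id,Values=subnet-0123456789abcdef0" --query "Reservations[].Instances[].[InstanceId,PublicIpAddress]" --output table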

Sorry for any inconvenience.

Regards

Hello cruizba

If you had read my previous post carefully, you would have gone straight to the next step.

I changed the OpenVidu settings so that 3 new KMS servers are created every day after OpenVidu starts.
However, about once every four days, the error where the other party’s camera is not visible still occurred. When I run ‘openvidu restart’, everything comes back to normal.

I have already tested with ‘OPENVIDU_PRO_CLUSTER_MEDIA_NODES=3’.

The OpenVidu instance has a public IP (an EIP), but all KMS nodes have only private IPs. Even with the current settings my service works well, but sometimes the error where the other party’s video cannot be seen still occurs.

If you keep testing the scenario where new media nodes are created after a new OpenVidu instance starts, with a shutdown of more than an hour in between, you will find the same error as we do.
The same error was found in the 2.22 and 2.24 AWS Pro versions.

Looking forward to a quick reply and progress to the next step…

Hello Jimahn,

I am sorry you are having these problems… I think we have some misunderstanding here because of the language gap…

So… If I understand you correctly, your problem is: after some time running OpenVidu, you have an ICE Candidate error.

Why is this related to a race condition? Having private IPs for KMS is not recommended, because all the traffic has to go through the Master Node to reach the Media Nodes, causing a bottleneck; maybe you are overloading the Coturn service on the Master Node.
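If you want to check whether that relay is the bottleneck, a quick way could be to look at the Coturn container on the Master Node while calls are ongoing. This is just a sketch; the container name may differ between deployments, so it simply greps for “coturn”:

# On the Master Node, find the Coturn container and check its CPU / network usage
docker ps --format "{{.Names}}" | grep -i coturn
docker stats --no-stream $(docker ps --format "{{.Names}}" | grep -i coturn)

If the CPU or NET I/O numbers there are very high during the failures, the overload theory becomes more plausible.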

Could you record a video of the entire process to replicate the error?

PS: Anyway… why don’t you use public IPs for the media nodes? With private IPs you may overload the Coturn service on the Master Node, because all the media traffic will be handled at the Master Node and then relayed to the Media Nodes. I think that enabling public IPs in the Media Node subnets will be good for your situation, and maybe the error will stop occurring.

Hi @Jimahn.Park, your logs showed ICE FAILED errors so I’ll try to help @cruizba with some questions to see it from the point of view of the WebRTC connection…

As I understand it, let me summarize, and please confirm if I got it right or if there is something wrong:

  • You shut down OpenVidu and its media nodes every night at 00:00. Currently, this means that the media nodes are terminated.
  • The next morning, at 06:00, OpenVidu is brought up again.
  • Because of OPENVIDU_PRO_CLUSTER_MEDIA_NODES=3, OpenVidu is forced to create 3 new media nodes every morning.
  • The 3 media nodes come up and are attached to OpenVidu successfully. You can see them in OpenVidu Inspector.
  • Then, about once every four days, the camera of some participant cannot be seen by other participants. If you look at the logs, you see the ICE FAILED error.

This looks like a typical WebRTC connection issue, so we need to troubleshoot from the point of view of WebRTC media connectivity and routing.

Some questions:

  • On the days when the ICE FAILED error happens: is it permanent? I mean, does it now happen every time the user tries to reconnect during the day? Or does it sometimes work and sometimes fail with this error (again, during the same day)?
  • Also, when ICE FAILED happens, does it always happen to the same participant? If yes, this issue might be caused by a specific network problem of that user.
  • When you restart OpenVidu, does the error always disappear? Does the participant whose video had failed now work without issue?
  • Once the error disappears, does it never happen again during the rest of the day?

Apart from those questions, I agree that the optimal setup includes having media nodes with public IP addresses. This preserves the peer-to-peer nature of WebRTC, so participants are able to connect directly over UDP to the media nodes; for example, a direct connection between a Firefox/Chrome browser and the media node. Otherwise, with private IPs, the media nodes are not reachable from the Internet, so participants need to connect through the single central proxy/relay server (Coturn), which means all data has to travel through the same point, and this point can get overloaded.


Footnote: the bug fix announced in OpenVidu 2.23:

“OpenVidu Pro/Enterprise: there was a race condition when manually launching multiple Media Nodes at the same time that could end with some of them not being properly added to the cluster. This is now fixed.”

That bug meant that a media node would not be created successfully at all and would not be reachable by OpenVidu: a totally broken cluster. If you can see that the 3 media nodes are correctly created and added to the OpenVidu Master, then that old bug did not happen to you.
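One quick way to double-check this is the Media Node REST API of OpenVidu Pro (a sketch; replace the domain and secret with your own values, and note the exact response fields may vary between versions):

# List the Media Nodes registered in the cluster, with their status and IPs
curl -u OPENVIDUAPP:YOUR_SECRET https://your-openvidu-domain/openvidu/api/media-nodes

If the 3 nodes appear there with a running status every morning, the cluster itself was built correctly.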

Your issue seems to be a different thing: just a WebRTC connectivity problem, a different issue altogether.

Hi j1elo

If the other party is not visible after the 3 media servers are ready, they remain invisible even after calling several times.
Not only me, but also users in other sessions cannot see the other party throughout the day.

After restarting the OpenVidu server, I can see the other party’s face even if I call several times throughout the day, and there are no reports that the other person’s face is not visible. However, when I query “state: FAILED” in Kibana, it still intermittently returns results.
So it is a situation that no longer makes any sense to me.

There are no users around 6:30 in the morning. Since the other party is not visible even when there is no ‘overload’ situation, it does not seem to be a problem with private IPs.

To fundamentally solve this problem, I suggested to the manager that we stop the daily shutdown of the server.
I explained that shutting down and starting the OpenVidu Pro and media servers every day cannot ensure service stability.
So the KMS servers will be operated with a minimum of 1 and a maximum of 3, 24/365.

Thanks

Hello Jimahn,

Could you send a video with all the steps that are necessary to replicate your error? We want to replicate the same scenario to see if we can understand your problem and solve it.