Large scale (250+ people) webinar sessions lead to very degraded quality and crashes?

Hi there, our application uses OpenVidu Enterprise to run large scale webinars. Essentially 1 - 4 publishers streaming to anywhere from 10 people on the low end to hundreds of people on the high end. For the most part, our users typically stream to 10 - 100 people and these webinars tend to do fine.

We’ve had several webinars in the 100 - 200 people range and they do fine as well. 200 - 300 people things seem to start getting a little iffy and more recently, we’ve had a couple of webinars where 2 different users were streaming to 400 - 550ish people and there were many issues.

On the webinars with over 400 subscribers, the video quality was extremely degraded. It was so pixelated and blurry that it was barely watchable.

This is definitely an issue but the bigger problem is that the publisher in both of these webinars had multiple occasions where the video and audio completely cut out. And by “cut out”, I mean that all publisher video and audio completely stopped and the user had to refresh the page in order to start publishing video again.

We’re running OpenVidu Enterprise on AWS with a c5.4xlarge master node and c5.2xlarge media nodes (with autoscaling enabled).

Unfortunately, I was not in the room to inspect the console and see what blew up exactly. I can see in Kibana that we were nowhere near max CPU capacity for either the master/media nodes, I’ve attached an image of what the logs are showing. I also believe I haven’t hit a memory limit here from what I can see in the logs.

So I’m left scratching my head as to what is going on. I’m not sure whether something in our web application is causing the publisher’s browser to crash with that many attendees live at once, or whether OpenVidu is not yet capable of (or still has issues with) publishing to multiple hundreds of subscribers in one session.

Any input here would be much appreciated.

Large scale sessions are on our roadmap, but they are not supported yet. What I would do is divide those webinars into smaller sessions (if it is a 1-to-N session). The downside of splitting a webinar across multiple sessions is that the connection signaling in your webinars would need to be reworked, and session management would be different in your application server.
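To give an idea of what that rework would involve, splitting could look roughly like this on your application server (just a minimal sketch assuming openvidu-node-client 2.x; the shard count, the session naming, and the fact that the publisher’s browser would have to join and publish to every shard are assumptions of the example, not something OpenVidu handles for you):

// Minimal sketch: shard one big 1-to-N webinar into several smaller sessions
// on the application server (assumes openvidu-node-client 2.x; names and the
// shard count are illustrative only).
const { OpenVidu, OpenViduRole } = require('openvidu-node-client');

const OV = new OpenVidu(process.env.OPENVIDU_URL, process.env.OPENVIDU_SECRET);
const SHARDS = 4; // e.g. ~125 subscribers per shard for a 500-person webinar

// Create one OpenVidu session per shard.
async function createWebinarShards(webinarId) {
	const sessions = [];
	for (let i = 0; i < SHARDS; i++) {
		sessions.push(await OV.createSession({ customSessionId: `${webinarId}-shard-${i}` }));
	}
	return sessions;
}

// The publisher needs one token per shard: its browser joins every shard and
// publishes the same MediaStream to each of them.
async function publisherTokens(sessions) {
	const tokens = [];
	for (const session of sessions) {
		const connection = await session.createConnection({ role: OpenViduRole.PUBLISHER });
		tokens.push(connection.token);
	}
	return tokens;
}

// Each subscriber is placed in a single shard, round-robin.
let nextShard = 0;
async function subscriberToken(sessions) {
	const session = sessions[nextShard++ % sessions.length];
	const connection = await session.createConnection({ role: OpenViduRole.SUBSCRIBER });
	return connection.token;
}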

When you talk about “cut out”, I am wondering if the network bandwidth demand on the Media Node was too high. Do you have data regarding bandwidth usage of those large scale sessions? It is not widely known that AWS reports network bandwidth limits “badly”: they publicly state that a c5.2xlarge goes “up to 10 Gbps”, but the baseline is 2.5 Gbps. I am not sure if this is the problem, but I want to mention it.
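As a very rough back-of-the-envelope check (the ~2.5 Mbps per forwarded HD video stream is only an assumption about your publishers’ bitrate, not a measured value): one publisher forwarded to 500 subscribers is about 500 × 2.5 Mbps ≈ 1.25 Gbps of outbound traffic from the Media Node, and a second publisher at the same quality roughly doubles that to ≈ 2.5 Gbps, which is already the c5.2xlarge baseline.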

Regards,
Carlos.

Hi @cruizba, thanks for your response. I am looking forward to “official” support for large scale sessions; I believe @pabloFuente mentioned it could be coming in Q4 2022. Changing the internals of our server to use smaller sessions is probably a last resort for us currently, as that will likely take considerable dev time.

That said, looking at the Kibana graph, it doesn’t seem like the media node was at max capacity so I’m wondering if perhaps this does have something to do with max network bandwidth for my AWS instances like you mentioned. So a few questions:

  1. Would enabling Simulcast potentially help lower bandwidth usage in larger 1:N sessions?

  2. Right now we use a c5.2xlarge for our media nodes. As per AWS documentation, I see some instances provide much higher bandwidth (see screenshot), like the c5n.4xlarge for example. Does OpenVidu support c5n.4xlarge (or smaller) media nodes?

  3. If yes, is it possible to just change the config file of my existing OpenVidu setup to use a different instance type for media nodes or do I need to use CloudFormation to spin up an entirely new master/media node setup?

Thanks!

Hi @pabloFuente @cruizba , just wondering if you guys have any insights regarding my last few questions.

One of our customers ran his webinar to several hundred participants again and I was able to monitor the statistics on AWS during the event this time.

I’ve attached a screenshot of the network out statistics. It looks like for this event (roughly 300 people live) we were seeing around 2 - 3 Gbps of outbound traffic (if I’m reading the chart correctly). So based on this, I’m starting to suspect that the issue we’re running into with larger events is a bandwidth issue like you mentioned.

That said, AWS has instances with higher levels of network performance available (as per the screenshot in my previous post) so it’d be really nice if I could select one of those as the media instances my setup should use. Unfortunately I don’t see that option using the CloudFormation template. Is there a way to do this?

Hello @phil917. Again, thank you so much for your open communication about the Enterprise HA stack; it is helping us a lot to improve multiple things.

For this Q2 2022 we want to simplify the deployment and also make it available on premises, so as not to depend on AWS and its opaque limitations.

Regarding your questions:


1. Would enabling Simulcast potentially help lower bandwidth usage in larger 1:N sessions?

No for publishers, yes for subscribers. With simulcast, the publisher sends multiple layers of video. Simulcast improves the experience of subscribers with a bad internet connection, because each subscriber uses the stream layer that best fits its network conditions; but if both the publisher and the subscriber have good network conditions, the best possible quality is sent. (More info here)

The problem is that, as it is right now, it is not possible to force the lowest-quality layer for subscribers. The layer changes depending on network conditions, so the only possible solution is to limit the max bitrate of the publisher.

mediasoup can limit the bitrate, but it is not configurable via OpenVidu yet. We plan to support it, I think, in version 2.24.0 (I will confirm).

If you don’t want to wait, what you can do to reduce the bandwidth is to reconfigure the bitrate of your publishers using the WebRTC API (RTCRtpEncodingParameters.maxBitrate - Web APIs | MDN). You can work around this limitation with openvidu-browser like this:

session.publish(publisher);
publisher.on('streamCreated', () => {
	// Once the publisher's stream is live, cap the video bitrate on every
	// video RTCRtpSender of the underlying peer connection.
	for (let rtcSender of publisher.stream.getRTCPeerConnection().getSenders()) {
		if (rtcSender.track && rtcSender.track.kind === 'video') {
			const senderVideoParameters = rtcSender.getParameters();
			for (let encoding of senderVideoParameters.encodings) {
				encoding.maxBitrate = 100000; // <---- Max bitrate in bps (here 100 kbps)
			}
			rtcSender.setParameters(senderVideoParameters);
		}
	}
});

You reconfigure the bitrate by setting a maxBitrate value on the encodings of the underlying RTCRtpSender. In the code snippet above, the cap is 100 kbps.
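If you want to verify that the cap is actually being respected, you can sample the publisher’s outbound video statistics with the standard WebRTC getStats() API (just a small sketch; the 5-second sampling window and the console logging are arbitrary choices):

// Sample the publisher's outbound video stats every 5 seconds to confirm the
// cap is respected (standard WebRTC getStats() API; assumes a single outbound
// video RTP stream, i.e. no simulcast).
const pc = publisher.stream.getRTCPeerConnection();
let lastBytesSent = 0;
setInterval(async () => {
	const stats = await pc.getStats();
	stats.forEach(report => {
		if (report.type === 'outbound-rtp' && report.kind === 'video') {
			const kbps = ((report.bytesSent - lastBytesSent) * 8) / 5000; // over the 5 s window
			lastBytesSent = report.bytesSent;
			console.log('Outbound video bitrate ~', Math.round(kbps), 'kbps');
		}
	});
}, 5000);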


2. Right now we use a c5.2xlarge for our media nodes. As per AWS documentation, I see some instances provide much higher bandwidth (see screenshot), like the c5n.4xlarge for example. Does OpenVidu support c5n.4xlarge (or smaller) media nodes?
3. If yes, is it possible to just change the config file of my existing OpenVidu setup to use a different instance type for media nodes or do I need to use CloudFormation to spin up an entirely new master/media node setup?

Of course you can modify instance types. You just need to modify the Launch Configuration of the Media Nodes Autoscaling Group and terminate the current instances so that new ones are deployed with the new configuration:

  1. Go to the Media Node Autoscaling Group Launch Configuration.
  2. Click on Copy Launch Configuration.
  3. Modify the instance type and click on Create Launch Configuration.
  4. You will then have a new Launch Configuration, whose name ends with ...Copy. This Launch Configuration needs to be applied to the Media Nodes Autoscaling Group.
  5. Go to your Media Nodes Autoscaling Group.
  6. Select it and click on Edit.
  7. Select the new Launch Configuration and click on Update.
  8. Terminate the old instances.
  9. Wait until new ones are created automatically.

As you can see, I’ve changed from t2.xlarge to c5.xlarge but the process is the same for any kind of instance.


I hope this helps with your application.

Best regards.

Thank you @cruizba for the very thorough response! I spoke with my team today and we’ve decided we’ll attempt to change our media node type to one with a much higher bandwidth limit and see if that improves our performance with larger scale sessions.

Hopefully this resolves our issues in the meantime between now and whenever “official” support for large scale sessions arrives. We’ll also look into enabling Simulcast as I think there’s probably a net gain to be had there when dealing with hundreds of subscribers watching one publisher.

Thanks!


Hi @cruizba, took us some time but our team is finally getting around to trying to implement the steps you outlined above. Just realized however that it looks like these instructions are for the Enterprise HA setup.

Our business is actually using the “Single Master deployment” of OpenVidu Enterprise. Is there a way to change the media instance type used when OpenVidu spawns new media nodes? Inside the master node .env file, I see at the bottom there is an “AWS_INSTANCE_TYPE” variable. Can I just change this to my desired instance type (c5n.2xlarge)?

Thanks!

Figured it out after doing some searching. For anyone else wondering:

Simply modify the “AWS_INSTANCE_TYPE” variable inside the .env file of the master node to the desired AWS instance you want (In my case I went from c5.2xlarge to c5n.2xlarge).
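In other words, the relevant line in the master node’s .env ends up looking like this (with whatever instance type you want new media nodes to use):

AWS_INSTANCE_TYPE=c5n.2xlarge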

Restart the master instance with ./openvidu restart

Then kill off any currently running media nodes on the old instance type. Any new media nodes spawned after this will use the newly specified instance type.

Yes, your procedure is correct.
I always thought that you were using the HA Enterprise. Sorry about that.

Hi @pabloFuente @cruizba, we recently upgraded our media nodes to c5n.2xlarge, which has much higher bandwidth capacity, hoping that would resolve the issues we’ve been seeing with large scale sessions.

Unfortunately, yesterday we had a user go live to a large group (250ish people at peak attendance) and we experienced a similar issue where the publisher video/audio just randomly cut out around the 1 hour and 40 minute mark.

Looking at the session recording, the publisher was speaking and then the video/audio just cuts out. After a few minutes, the publisher managed to rejoin with video/audio but obviously this is far from ideal. A lot of attendees seemed to drop from the session after this as well, which is also not ideal.

It seems like the previous potential solution (upgrading to a higher-bandwidth instance) didn’t fix the problem for us, unfortunately. Here are some screenshots of relevant data collected for this session (relevant sections outlined in red).


Kibana


AWS instance network out chart


According to AWS documentation, the baseline bandwidth available for c5n.2xlarge should be more than enough for the usage seen during this session.


I also see in the logs for this session that there was some error with the publisher. It’s hard for me to read/interpret, but perhaps you guys can make something of it?

With all that said, this doesn’t seem to be a one-off instance of this happening. I myself have witnessed 3 - 4 large scale sessions where the video/audio cuts out and I’ve heard some similar complaints from other users a few times as well.

As a business, we urgently need to figure out (and hopefully solve or mitigate) this issue. Is this an issue with OpenVidu? Is it something wrong with our implementation or our server setup? Is it possible this is entirely due to the publishers having transient network issues?
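One thing we can try on our side, to see whether these drops line up with client-side network loss, is to log openvidu-browser’s connectivity events on the publisher’s page (just a sketch, assuming the 2.x automatic-reconnection and exception events; it only logs to the console):

// Log openvidu-browser connectivity events on the publisher's page so we can
// correlate the "cut outs" with client-side network problems.
session.on('reconnecting', () => console.warn('Connection lost, trying to reconnect...'));
session.on('reconnected', () => console.warn('Reconnected to the session'));
session.on('sessionDisconnected', event => {
	// 'networkDisconnect' would indicate the client could not recover the connection
	console.warn('Session disconnected, reason:', event.reason);
});
session.on('exception', event => {
	// ICE failures and similar problems surface here
	console.warn('OpenVidu exception:', event.name, event.message);
});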

Any suggestions/solutions would be much appreciated.

Best

Just had another user a few days ago experience a similar issue with around 250 attendees. The audio/video just randomly “dropped” a few times during the 2 hour session.

Not really sure how to handle this.