High CPU usage and bandwidth

Hi, folks!
Since OV 2.20 CE I have had this problem… sometimes my KMS uses 100% of the CPU and my bandwidth goes up to about 1000 Mbps.
Four days ago I updated to OV 2.25 CE and the problem persists.
I use a 1:N topology. Sometimes we have ~600 subscribers, but today this happened when we had about 400 subscribers in one session and 17 in another session.
I don't know if it was a coincidence, but the problem started when the second session started.
In the Kurento logs, a large volume of messages starts around 19:39, roughly when the second session started.
Kurento-logs

I restarted OpenVidu, so the OpenVidu report may not be very useful.
I don't know what could be causing this; for some reason Kurento seems to freak out.

Specs:

  • OV CE 2.25:
    ** setting only the min/max bitrates, in kbps (0, 700, 900, 700 respectively)
    ** adding these volumes to the KMS service so that subscribers can open two streams on different servers in Firefox; all 6 of my servers share this key (see the sketch after this list):

    • /opt/openvidu/kms/WebRtcEndpoint.conf.ini:/etc/kurento/modules/kurento/WebRtcEndpoint.conf.ini
    • /opt/openvidu/kms/dtls.pem:/etc/kurento/modules/kurento/dtls.pem
  • Publisher
    ** Using Chrome and a virtual cam (plugin) on OBS

  • Dedicated Server
    ** AMD Ryzen™ 7 3700X
    ** 64 GB DDR4 ECC
    ** 1 Gbit/s-Port
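
For reference, this is roughly how those extra mounts sit in the KMS service of the docker-compose file (just a sketch of the relevant fragment, assuming the standard OpenVidu CE compose layout with a kms service):

    # docker-compose.yml, "kms" service: extra volume mounts (fragment only)
    kms:
      volumes:
        - /opt/openvidu/kms/WebRtcEndpoint.conf.ini:/etc/kurento/modules/kurento/WebRtcEndpoint.conf.ini
        - /opt/openvidu/kms/dtls.pem:/etc/kurento/modules/kurento/dtls.pem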

Thanks!
Doug

Hi Doug, I had a look at the Kurento log you attached, and it seems to show constant creation of new WebRtcEndpoints.

Every time this line appears: "WebRtcEndpointImpl() No QOS-DSCP value set", it corresponds to a new attempt to establish a WebRTC session (a Subscriber, given your description). As you'll notice, the log is completely swamped by these lines; they appear all the time, and more than four thousand new endpoints are created.
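
If it helps to quantify it, here is a quick sketch of how to count those lines in the attached log (assuming the file is named kurento.log; the per-minute grouping assumes each line starts with an ISO timestamp):

    # total number of WebRtcEndpoint creations logged
    grep -c "No QOS-DSCP value set" kurento.log
    # rough distribution per minute (first 16 chars = date + hh:mm)
    grep "No QOS-DSCP value set" kurento.log | cut -c1-16 | sort | uniq -c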

You’d have to correlate these logs with the OpenVidu logs, because in the Kurento ones we can only see what is happening; we're missing the other side of the coin, why it happens. Maybe OpenVidu is reacting too strongly to a perceived loss of connectivity and is attempting to reconnect all subscribers at once, or some other similar issue.

For now, what we know for sure is that Kurento is being ordered to discard and create thousands of WebRTC endpoints. Let’s dig further from this observation.

Hi @j1elo ! Thanks for your feedback. I will wait for this to happen again and get the full log.
The users' connections stay alive, but with delay and slow-motion video (because of the overload).
Thanks again!
Doug

Hi @j1elo ! The problem happened again today.
OV log
Kurento log
The OpenVidu log starts around 19:49. The problems start around 19:45, as you can see in the Kurento log.

Thanks, Juan!
Doug

Hi, checking the OV logs, in the very first lines there are already thousands of objects created, which makes the log lines extremely long (we’re already discussing this in order to improve the logging and avoid this issue).

For now, what this means is that at the point you captured the OV logs, the damage was already done. We’d like to inspect what happens in the server that makes it create thousands of Kurento WebRtcEndpoints, so for that you’d need to include logs from when these objects are being created.

Another hint I saw is that most lines in the OV logs look like this (once cleaned up):

[ERROR] [AbstractJsonRpcClientWebSocket-reqResEventExec-e2-t66384] org.kurento.client.internal.client.RomClientObjectManager - Trying to propagate a event with type=IceCandidateFound and data={candidate={...}, ..., but that doesn't exist in the client. Objects are=(suppressed)

This “but that doesn't exist in the client” makes me think there is some issue with the media server. Did you notice if, when the problem arises, the media server process is exiting (maybe due to a bug) and being restarted in rapid succession? That would explain the OV server going a bit crazy and continuously trying to re-create the session.
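
If it's useful, a minimal way to check that on a Docker-based deployment would be something like this (assuming the media server container is named kms; adjust to your setup):

    # how many times Docker has restarted the container, and when it last started
    docker inspect -f 'restarts={{.RestartCount}} started={{.State.StartedAt}}' kms
    # watch live for the container dying and coming back
    docker events --filter 'container=kms' --filter 'event=die' --filter 'event=start'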

Hi @j1elo ! I accessed the publisher's machine remotely to check the settings and verified that it did not have the settings I had given to the operator.
He was not using the OBS virtual cam plugin, but the virtual camera built into OBS itself. That virtual cam doesn't respect the resolution set in the front end and uses its own, so it was sending 1920x1080 to OV.
I believe this high resolution is what was leading to the problem.

High resolution → increasing number of subscribers → high CPU → crash (or, as you suggested, the process exiting and restarting).

I'll be monitoring it closely to see how it behaves now. We adjusted the output to 1280x720, and so far today everything is normal.

Really, thanks for your quick feedback, @j1elo

Thanks a lot!
Doug

Hi @j1elo ! The problem persists, but now the server can handle the CPU spikes. It now has 2-5 s spikes of 100% CPU usage. The stream doesn't stop, but shows some lag and dropped frames. The problem starts when the room has more than 600 viewers.

I captured a few lines of the log when the problem started. Kms.log

Is there a possibility that this is happening due to unnecessary reconnection requests?

This is very strange, because I have used OV since 2.14, and with that version I had rooms with up to 1000 subscribers with no problems. I don't remember exactly, but these problems started in version 2.18 or 2.20.

The only difference in settings from 2.14 to 2.25 is the bitrate sent to the client:
2.14: min. 500 max. 700
2.25: min. 800 max. 900

Thanks again!
Doug

Hi @j1elo . Just to close this topic.
I found the problem. In my scenario I was using OPENVIDU_STREAMS_VIDEO_MAX_RECV_BANDWIDTH set to 0. This is not good when you have a lot of viewers. Setting it to the same value as the max sending bitrate solved the problem. Now the CPU usage is stable.
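
In case it helps anyone else, the change in the OpenVidu .env file looks roughly like this (values here are just an example based on the bitrates above; adjust to your own setup):

    # /opt/openvidu/.env
    # was 0 (unconstrained); set it to the same value as the max sending bitrate
    OPENVIDU_STREAMS_VIDEO_MAX_RECV_BANDWIDTH=900
    OPENVIDU_STREAMS_VIDEO_MAX_SEND_BANDWIDTH=900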
Thanks!
Doug
