Intermittent disconnection between OpenVidu Server and Media Node (Kurento)

Hello,

We recently faced an intermittent issue in our deployment and I’d like to better understand the potential root causes.

Setup:

  • 1 VM running OpenVidu Server

  • 2 VMs running Media Node Controller

  • OpenVidu v2 Pro (2.31.0)

  • Everything had been working normally before the incident

Problem observed:

  • Suddenly, one Media Node started rejecting every session created on it, while the other node was still fine.

  • OpenVidu Server wasn’t able to connect to that Media Node, which eventually crashed.

  • The next day, without any intervention or configuration change, the failing node started working again.

Troubleshooting already done:

  • Removed the failing node from the Media Node list and created a new VM for workload continuity.

  • Left the problematic VM untouched to investigate root cause.

  • Checked ELK logs → nothing alarming.

  • Checked CPU usage → high during the disconnection, but this seems more like a consequence (a retry loop between OpenVidu Server and the unreachable node) than the root cause.

  • Verified network connectivity → looks normal.

  • Verified disk usage → looks good (the Media Node auto-cleans its log files).

My questions:

  • What could cause a disconnection between OpenVidu Server and a Media Node that later resolves without any action?

  • What could be potential root causes for a Media Node to suddenly start rejecting sessions and then recover on its own?

  • Are there additional logs or metrics (besides ELK, CPU, disk, network) that you recommend monitoring to detect or explain these transient failures?

Thanks a lot in advance for your help and insights! If you need any details, let me know!

  • The next day, without any intervention or configuration change, the failing node started working again.
  • What could cause a disconnection between OpenVidu Server and a Media Node that later resolves without any action?
  • What could be potential root causes for a Media Node to suddenly start rejecting sessions and then recover on its own?

The fact that the connection came back the next day makes me think of a temporary network error at the time the disconnection happened.
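If you want to confirm or rule out a transient network problem the next time this happens, one option is to run a small TCP probe from the OpenVidu Server VM against each Media Node and keep a timestamped record. The following is only a minimal sketch under assumptions: the host name is hypothetical, and the ports (8888 for Kurento, 3000 for the Media Node controller) are what I would expect by default, so please verify them against your own deployment.

```python
import socket
import time
from datetime import datetime, timezone
from typing import Optional


def probe(host: str, port: int, timeout: float = 3.0):
    """Attempt one TCP connection and return (ok, detail)."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            elapsed_ms = (time.monotonic() - start) * 1000
            return True, f"connected in {elapsed_ms:.1f} ms"
    except OSError as exc:
        # Covers refused connections, timeouts and DNS resolution failures.
        return False, f"{type(exc).__name__}: {exc}"


def format_line(host: str, port: int, ok: bool, detail: str,
                now: Optional[datetime] = None) -> str:
    """Build one log line with a UTC timestamp for later correlation."""
    now = now or datetime.now(timezone.utc)
    status = "OK" if ok else "FAIL"
    return f"{now.isoformat(timespec='seconds')} {host}:{port} {status} {detail}"


def monitor(targets, interval: float = 10.0, rounds: Optional[int] = None,
            log_path: str = "media_node_probe.log") -> None:
    """Probe every (host, port) target each `interval` seconds.

    With rounds=None the loop runs until interrupted.
    """
    done = 0
    while rounds is None or done < rounds:
        with open(log_path, "a", encoding="utf-8") as log:
            for host, port in targets:
                ok, detail = probe(host, port)
                log.write(format_line(host, port, ok, detail) + "\n")
        done += 1
        if rounds is None or done < rounds:
            time.sleep(interval)
```

For example, `monitor([("media-node-1.example.internal", 8888), ("media-node-1.example.internal", 3000)])` would append one line per target every 10 seconds. A run of FAIL lines that starts and ends without any change on your side would point towards the network rather than OpenVidu itself.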

Are there additional logs or metrics (besides ELK, CPU, disk, network) that you recommend monitoring to detect or explain these transient failures?

Yes: the logs from the openvidu-server container on the master node, and the logs from the kms container on the Media Node.

What logs do you have from that specific moment? If this is the first time it has happened to you, it is most likely a temporary network issue.
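To pull the logs for the time window of the incident, something like the sketch below could be run on each VM. It assumes the containers are managed by Docker and are named exactly `openvidu-server` and `kms`; the real names in your deployment may carry a prefix or suffix, so check them with `docker ps` first. The timestamps and output file names are placeholders.

```python
import subprocess


def docker_logs_command(container: str, since: str, until: str) -> list:
    """Build a `docker logs` command limited to a time window.

    `since` and `until` accept timestamps such as 2024-01-01T10:00:00
    or relative values such as 2h.
    """
    return ["docker", "logs", "--timestamps",
            "--since", since, "--until", until, container]


def save_logs(container: str, since: str, until: str, out_path: str) -> int:
    """Write the container's logs for the window to out_path.

    Returns the exit code of the docker command.
    """
    cmd = docker_logs_command(container, since, until)
    with open(out_path, "w", encoding="utf-8") as out:
        # `docker logs` replays the container's stderr on stderr,
        # so both streams are merged into the same file.
        result = subprocess.run(cmd, stdout=out,
                                stderr=subprocess.STDOUT, check=False)
    return result.returncode
```

On the master node you would call, for instance, `save_logs("openvidu-server", "2024-01-01T10:00:00", "2024-01-01T14:00:00", "openvidu-server.log")`, and on the Media Node the same with `"kms"`. Comparing both files around the moment the sessions started being rejected should show which side noticed the problem first.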