Hello,
We recently faced an intermittent issue in our deployment and I’d like to better understand the potential root causes.
Setup:
- 1 VM running OpenVidu Server
- 2 VMs running Media Node Controller
- OpenVidu v2 Pro (2.31.0)
- Everything had been working normally before the incident
Problem observed:
- Suddenly, one Media Node started rejecting every session created on it, while the other node was still fine.
- OpenVidu Server wasn’t able to connect to that Media Node, which eventually crashed.
- The next day, without any intervention or configuration change, the failing node started working again.
Troubleshooting already done:
- Removed the failing node from the Media Node list and created a new VM for workload continuity.
- Left the problematic VM untouched to investigate the root cause.
- Checked ELK logs → nothing alarming.
- Checked CPU usage → high during the disconnection, but this seems more like a consequence (a retry loop between OpenVidu Server and the unreachable node) than the root cause.
- Verified network connectivity → looks normal.
- Verified disk usage → looks good (the Media Node auto-cleans its log files).
My questions:
- What could cause a disconnection between OpenVidu Server and a Media Node that later resolves without any action?
- What could be potential root causes for a Media Node to suddenly start rejecting sessions and then recover on its own?
- Are there additional logs or metrics (besides ELK, CPU, disk, network) that you recommend monitoring to detect or explain these transient failures?
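Since the connectivity check above was a one-off done after the fact, one option for catching the next occurrence is a small timestamped reachability probe running on the OpenVidu Server VM. Below is a minimal sketch of that idea, not something already in place. The IP addresses are hypothetical placeholders, and the ports (3000 for Media Node Controller, 8888 for Kurento) are an assumption that should be checked against the actual deployment. The helper names `probe` and `format_result` are made up for illustration.

```python
import socket
import time
from datetime import datetime, timezone


def probe(host: str, port: int, timeout: float = 3.0) -> tuple[bool, float]:
    """Attempt one TCP connection; return (reachable, elapsed_seconds)."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            ok = True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        ok = False
    return ok, time.monotonic() - start


def format_result(host: str, port: int, ok: bool, elapsed: float) -> str:
    """Build one log line with a UTC timestamp, target, state and latency."""
    ts = datetime.now(timezone.utc).isoformat(timespec="seconds")
    state = "up" if ok else "DOWN"
    return f"{ts} {host}:{port} {state} {elapsed * 1000:.0f}ms"


if __name__ == "__main__":
    MEDIA_NODES = ["10.0.0.11", "10.0.0.12"]  # hypothetical Media Node IPs
    PORTS = [3000, 8888]  # assumed ports; verify for your deployment
    while True:
        for host in MEDIA_NODES:
            for port in PORTS:
                ok, elapsed = probe(host, port)
                print(format_result(host, port, ok, elapsed), flush=True)
        time.sleep(30)
```

Redirecting the output to a file would give a timeline to compare against the ELK logs, showing whether the ports were unreachable at the TCP level or whether the node accepted connections but failed at the application level.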
Thanks a lot in advance for your help and insights! If you need any details, let me know!