Incident Summary
Wednesday 6/22/2023
9:30 PM EDT: Maintenance began on our cluster monitoring and management systems.
11:30 PM EDT: The NOC team was notified by internal monitoring systems that a small number of systems were experiencing internet connectivity issues.
Thursday 6/23/2023
12:12 AM EDT: Incident was opened, and a notice was sent to clients declaring a partial system outage.
1:15 AM EDT: The issue was identified as the errant addition of a DHCP server to a virtual network switch.
1:25 AM EDT: The errant DHCP server, which had disrupted network connectivity for some systems, was removed from the virtual switch. Remediation work began to restore network connectivity and revert the changes made.
1:45 AM EDT: Remediation work concluded, and all impacted systems were reconnected and stable.
Impact
A small number of virtual machines in the cluster were provided erroneous network information by the errant DHCP server, which interrupted their internet connectivity.
Root Cause
A routine change made to our cluster software resulted in the errant addition of a virtual network component (a DHCP server on a virtual network switch). The change has since been tested in a sandboxed alternate environment, and we have not been able to reproduce the issue.
Corrective Action
Information on this issue has been sent to the software vendor for further review. Updates to the cluster monitoring software have been rolled back, and further updates are paused until we receive feedback from the vendor.