Problems on beefgirs server

It's currently up, but it might go down several times over the next few hours or days.

Emergency maintenance: Panelbox s101 – 102
Published on February 13, 2012 at 5:03 pm by admin in: status

Update Feb 14 (19:30 EST)

Urgent maintenance is required on servers s101 and s102. It will interrupt web service for up to 12 hours, starting tonight (February 14th) at 8pm. We will attempt to keep email services accessible during this period. The maintenance must be performed urgently to re-establish replication between the servers and to restore proper service and server performance.

Here are more details on the maintenance.

Context
We are not immune to equipment failure, even though our servers have three levels of replication and backup:

RAID – The disks within a server replicate to each other
DRBD – The content of each server is replicated to another server, and vice versa, which explains the link between s101 and s102
R1Soft – Backup system for the disks’ contents

History of the current problem and solutions
Yesterday, the s101 server encountered problems that led us to diagnose a fault in the first level of replication: a defective RAID card. We then used our second level of replication, redirecting all sites from the s101 server to its replica on the s102 server to avoid a long service interruption. This intervention reduces the performance of s102, because it is now serving twice as many sites.

We then started the RAID card replacement on the s101 server. This operation usually takes several hours and is transparent to our customers, who use the server replica (DRBD) during this period without knowing it. In this case, data corruption on one of the server’s disks in turn corrupted the RAID reconstruction. As a result, instead of rebuilding the RAID in a few hours, we need to resynchronize the DRBD replication from s102 back to s101, which means transferring many gigabytes. This intervention can take up to 12 hours and uses almost all of the server’s resources, making the server nearly unusable because of major slowdowns. If we kept the web service running during the resynchronization, it would take several days.
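To make the trade-off above concrete, here is a rough sketch of the arithmetic: resynchronization time is the amount of data divided by the sustained copy rate. The notice gives neither figure, so the data size and both rates below are hypothetical values chosen only for illustration, and the function name is made up for this example.

```python
def resync_hours(data_gb: float, rate_mb_per_s: float) -> float:
    """Return the hours needed to copy data_gb gigabytes at rate_mb_per_s MB/s."""
    if rate_mb_per_s <= 0:
        raise ValueError("rate_mb_per_s must be positive")
    seconds = data_gb * 1024 / rate_mb_per_s  # 1 GB taken as 1024 MB
    return seconds / 3600


# All three figures are assumptions, not values from the notice.
DATA_GB = 1500        # assumed amount of data to resynchronize
OFFLINE_RATE = 40.0   # assumed MB/s when the web service is stopped
LIVE_RATE = 5.0       # assumed MB/s when the sync competes with live traffic

offline_hours = resync_hours(DATA_GB, OFFLINE_RATE)
live_hours = resync_hours(DATA_GB, LIVE_RATE)

print(f"web service stopped: about {offline_hours:.1f} hours")
print(f"web service running: about {live_hours / 24:.1f} days")
```

With these made-up numbers, the same amount of data takes hours when the sync has the server to itself and days when it must share disk and network with live sites, which is the reasoning behind taking the web service offline.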

We will take this opportunity to upgrade the RAID card to a better model and to add memory to the server in order to improve its performance. This is not something we can easily do while the servers are up and running.

The intervention will start tonight, February 14th, at 8pm. We are preparing the hardware for the upcoming work. We expect it to be finished by 8am Wednesday morning.

==========

Update Feb 14 (14:50 EST)

The servers are stable again but are experiencing some latency. s101 has continuous errors that might require an OS reinstall. Most of the resources are being handled on s102, which causes the slowdowns you might be experiencing. We are looking into different solutions.

=====

Update Feb 14 (13:40 EST)

The s101 server will have to be rebooted after a synchronization problem with s102. This is causing intermittent latency on both servers. We are working to resolve these issues as quickly as possible.

=====

Update Feb 13 (17:45 EST)

The server is back online. It will remain under observation for a couple of hours.

=====

Start time: immediately (Monday February 13th 2012 17:00 EST)
Resolution time: Estimated at 1 hour
Situation: Panelbox server s101 unreachable
Impact: The services provided by the Panelbox server s101 will be unavailable during the maintenance.

We are currently replacing the RAID card in s101.

Thank you for your understanding.

Should you have any inquiries, feel free to contact our support team: http://funio.com/contacts.