Forum Services Down — Major Outage — Actively Troubleshooting
[update5 — Tues 1012 Central time] The developer was not kidding when they estimated the database rebuild could take days rather than hours. The rebuild is now at 7.13 GB and still churning away. The wait continues. Below is a screenshot of the rebuild in progress on the QA server (no functional testing has started yet, as the database rebuild is not yet complete).
[update4 — Mon 1436 Central time] Here is an update from the vendor as we continue to work on rebuilding the forum:
The reconstruction of the latest database from scratch, using the export from the database and re-import into a new, empty database, continues to progress normally.
So far about 6.4 GB have been imported. We are monitoring to see when it’s completed. There is no way to know how much longer this will take. It could finish any moment; it could still be hours from now. That’s because in the export/import process a lot of old garbage gets removed. So the new database is likely to be considerably smaller than 12 GB. But we don’t know how much smaller.
When it finishes, we’d like to monitor it for a while to make sure connectivity is good from various browsers.
Of course, since we don’t have your hundreds of thousands of attachments on our QA server, you won’t be able to confirm attachments until the database is moved back to your server. But they should all be reconnected automatically because the unique IDs of the nodes remain the same.
[update3 — Mon 0850 Central time] We are rebuilding the entire database by exporting it and importing it into a fresh database. This process is still proceeding normally. The export/import procedure removes a lot of garbage that accumulated over time. So far 4.65 GB have been imported. We are of course hoping the entire rebuild is successful and, if so, that it has a positive impact on the network-related issues we have encountered. I hope to have another update this evening (Monday).
[update2 — Sun 1800 Central time] Troubleshooting continues. The issue is that the IP connections for the site are dropping or being blocked, and we are not yet able to identify the source of this problem. It does not appear to be hardware related, however, since we have been testing the site on the developer’s own QA environment and they are experiencing the same issues on their devices as well. Attention has turned to the database, including the multiple backups of the database that we have. We would expect the backups to work fine (as our site worked fine with them), but the same issue with connections dropping also occurs with them. Connections keep dropping or being blocked until the system allows only 1 connection.
The developer is trying to run repairs on our database as they feel that is the root cause of connections dropping. I am still pressing for them to determine what network settings could be impacting our site as well. We continue to troubleshoot and I expect this to continue through tomorrow at a minimum as we work through the results.
My apologies again for this outage. We are all working to find the root cause of the problem so we can restore services.
[update1] We have been working all day with the software vendor and our hosting provider on the major outage we experienced overnight (four separate individuals have been on our case). Unfortunately, what we have encountered is still under investigation. While the good news is that all data is safe and backed up, we are still not sure whether it’s a hardware issue or a code-related or possible corruption issue.
Testing continues as we work to diagnose the issue. I am committed to investing in whatever the best solution is to resolve this, even if it means a full server replacement should it come down to a hardware problem. I will do whatever it takes to solve this problem, and my goal is not only to have the issue resolved but also to find a way to improve performance further (so that it’s worth the collective wait).
This is all going to take some time, so I ask for your continued patience, especially if new hardware is required and needs to be ordered. I can’t project how soon we will be able to restore forum services, other than stating that our goal is to have services restored as soon as we can. We will most likely need a few days to get this all resolved.