9:46 AM Eastern – At this time, we are experiencing an outage of landing page, email, and blog content for a portion of COS-hosted customers.
Affected customers are those whose Hub IDs end in 16, 17, 42, 66, 67, or 92.
HubSpot Engineering is working to resolve the issue now.
[ UPDATE 10:50 AM Eastern ] At this time, we continue to experience an outage of site pages, landing pages, email, and blog post content for a portion of COS-hosted customers.
HubSpot Engineering continues to work to restore service for customers affected by this outage.
[ UPDATE 11:20 AM Eastern ] At this time, affected portals are displaying content as expected, with the exception of a very small number of changes that occurred between 7:00 AM and 9:20 AM Eastern. We are in the process of restoring any content changes that are not reflected, and all content should be current and displaying as expected shortly.
[ UPDATE 1:20 PM Eastern ] At this point, all customer content is displaying correctly, with the exception of a very small number of changes that occurred between 7:00 AM and 9:20 AM Eastern. HubSpot’s Engineering team is in the process of ensuring that restoring these changes will not overwrite other changes, and we expect to have this process completed shortly.
[ UPDATE 6:30 PM Eastern ] At this time, all content for all portals is current, and is being displayed correctly.
We want to provide more information about the COS content delivery outage that affected some customers this morning, including what happened, how we recovered, and what we’re doing to prevent similar issues in the future.
What Happened
At 9:20 AM Eastern this morning, one of our developers began running a maintenance script (a database migration) on our COS databases in support of new features. Soon after the script started, it became apparent that the database change was severely degrading the performance of the COS. The developer cancelled the migration as soon as the impact became obvious.
Unfortunately, cancelling the migration left three of the database copies in a state where customer content was not available. This affected all portals with a Hub ID ending in 16, 17, 42, 66, 67, or 92.
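For anyone who wants to check a portal against the Hub ID endings listed above, the rule can be sketched in a few lines of Python. This is an illustrative snippet only, not HubSpot tooling: the name `is_affected` is hypothetical, and it assumes the Hub ID is a plain non-negative integer.

```python
# Hypothetical helper for illustration; not part of any HubSpot API.
# The suffixes are the two-digit Hub ID endings named in this report.
AFFECTED_SUFFIXES = {"16", "17", "42", "66", "67", "92"}


def is_affected(hub_id: int) -> bool:
    """Return True if the Hub ID ends in one of the affected two-digit suffixes."""
    # Zero-pad so that single-digit IDs still yield a two-character suffix.
    return f"{hub_id:02d}"[-2:] in AFFECTED_SUFFIXES
```

A Hub ID such as 53142 would match (it ends in 42), while 53143 would not.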
At 10:50 AM Eastern, HubSpot’s Engineering team finished restoring all three of these database copies from backups. In two cases, these backups were only a couple of minutes old and had no differences from the state of the data at the point the migration was run. In the third case, the database backup was several hours old, and some content hosted there had been changed in the meantime. This meant that for some portals with Hub IDs ending in 42 or 92, we were potentially showing an older version of content that had been changed since the backup was taken.
After restoring this database, it was necessary to manually reconcile the changes that had been made and to ensure that those edits were reflected on our customers’ sites. HubSpot Engineering completed this process at 6:30 PM Eastern.
What We’re Doing To Address This Issue
We will be examining the processes that we follow when running database scripts to make sure that we are following best practices. We run dozens of these in any given week, in support of hundreds or thousands of changes to our applications. Among all of those changes, it is very rare that we would have any issues as a result. Ensuring that continues to be the case will be the primary focus of our examination of this outage.
We will also be taking this opportunity to make the windows between database backups even smaller than they already are. While there is always the risk that recovery of a database will involve initially losing some changes, this will allow us to make the window of changes that may initially be lost much smaller than it would be otherwise.
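To make the relationship between backup timing and initially lost changes concrete, here is a minimal sketch. It assumes, based on the 7:00 AM to 9:20 AM range mentioned in the updates above, that the older backup was taken at about 7:00 AM; the function name and the calendar date are placeholders chosen for illustration, not details from our systems.

```python
from datetime import datetime, timedelta


def lost_change_window(last_backup: datetime, failure: datetime) -> timedelta:
    """Span of time whose changes are missing right after restoring from last_backup."""
    if failure < last_backup:
        raise ValueError("failure time must not precede the backup time")
    return failure - last_backup


# Placeholder date; only the times of day come from the updates above.
backup_taken = datetime(2000, 1, 1, 7, 0)   # assumed time of the older backup
script_start = datetime(2000, 1, 1, 9, 20)  # when the maintenance script began

print(lost_change_window(backup_taken, script_start))  # prints 2:20:00
```

The worst case for this window is the interval between backups, which is why more frequent backups directly reduce how much has to be reconciled after a restore.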
Finally, we are undertaking a new infrastructure initiative, which we will use wherever possible in the future, to improve how we handle any similar incident. Specifically, we are investigating a new error-handling mechanism for the COS that, in the unlikely event of a similar issue in the future, would dramatically reduce the impact on public content.
We apologize for the impact and inconvenience of today’s events to all affected customers.