8 Mission Critical Software Outages and How to Avoid Them

by Bernadette Catalan, March 5, 2024 (5 min read)

Software runs everything from small, fun apps and games to critical infrastructure and global enterprise tools. Any kind of downtime or disruption can lead to customer support problems, lost productivity, failed financial transactions, and, in the worst cases, real harm to the business. These disruptions, whether planned or unplanned, are often caused by a combination of technical faults: syntax bugs, logic errors, and security vulnerabilities.

While all software developers strive to write excellent code, myriad causes can combine to introduce bugs and problems that result in system downtime. Here are a few high-profile cases where bugs and system problems caused critical software outages.

Advertisers’ campaigns on Meta overspent by thousands

Meta experienced an ad delivery outage that caused campaigns to spend beyond their budget caps. Advertisers were alarmed because the extra spend was not matched by a parallel increase in conversions. Meta acknowledged an internal system issue related to the re-alignment of its ad delivery process following Apple’s iOS 14 update. That update limits data tracking visibility, forcing Meta to change how it uses machine learning to deliver ads to the people who are the best fit for a campaign.

Affected accounts saw CPM increase by as much as 500%, and daily spend skyrocketed past the daily cap. Affected advertisers reached out to their Marketing Pro and filed tickets for refunds.

Users unable to access their Meta accounts

Users were logged out of their Meta accounts earlier today and were unable to log back in. Andy Stone, Meta’s spokesperson, clarified that the disruption was a technical issue that caused users to be locked out of their accounts for several hours.

Customers unable to access Atlassian products

Atlassian customers lost access to Atlassian products for up to 14 days. According to Atlassian’s post-incident review, miscommunication and a lack of system warnings caused the major incident. Atlassian had just finished acquiring and integrating an app for Jira Service Management called “Insight - Asset Management”. After the integration, the legacy app needed to be deleted, and the deletion was executed through an existing script.

According to Atlassian, the API could not detect whether a site ID or an app ID was being passed. As soon as it received a value, it assumed that the associated data could be deleted. Had a cross-checking function validated the IDs, 883 sites would not have been deleted.

The incident affected 775 Atlassian customers, who were locked out for between three and 14 days.
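The cross-check described above can be sketched in a few lines of Python. This is a minimal illustration and not Atlassian's actual API: `TargetKind`, `DeletionRequest`, `validate_deletion`, `delete_targets`, and the in-memory `registry` are all hypothetical names. The idea is that the caller states what kind of object each ID should refer to, and the whole batch is rejected before anything is deleted if a single ID does not match.

```python
from dataclasses import dataclass
from enum import Enum


class TargetKind(Enum):
    APP = "app"
    SITE = "site"


@dataclass(frozen=True)
class DeletionRequest:
    target_id: str
    expected_kind: TargetKind  # what the operator believes the ID refers to


class DeletionRejected(Exception):
    """Raised when an ID is unknown or is not the kind the caller expected."""


def validate_deletion(request, registry):
    """Check one request against the registry (a dict of ID -> TargetKind)."""
    actual_kind = registry.get(request.target_id)
    if actual_kind is None:
        raise DeletionRejected(f"unknown ID: {request.target_id}")
    if actual_kind is not request.expected_kind:
        raise DeletionRejected(
            f"{request.target_id} is a {actual_kind.value}, "
            f"not a {request.expected_kind.value}"
        )


def delete_targets(requests, registry):
    """Validate every request first, then delete; return the deleted IDs."""
    requests = list(requests)
    # Nothing is deleted unless the entire batch passes validation.
    for request in requests:
        validate_deletion(request, registry)
    deleted = []
    for request in requests:
        del registry[request.target_id]
        deleted.append(request.target_id)
    return deleted
```

Validating the full batch before the first deletion is a deliberate choice in this sketch: a script that receives one wrong ID stops with an error instead of deleting part of the list.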

Small business owners unable to process online Square payments

Small businesses were unable to take cashless payments during Square’s day-long outage. Square appeared to lose its web presence entirely after changes to its DNS zone. According to Square, the updates prevented its systems from properly communicating with each other, which caused the disruption. In the incident timeline Square provided, the changes started at 11:04 AM and the incident response began around 2:02 PM. That window was crucial: it was the team’s chance to identify problems with the changes being deployed and prevent the outage.

Small business owners in Portland lost revenue that day. Customers could not complete their online orders, and some changed plans as soon as they saw a “Cash only” note. Businesses tried to fall back on offline payments, but that feature was also affected by the outage.
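One way to shorten the gap between a DNS change and the first response is to run an automated resolution check right after the change. The sketch below is an assumption about how such a check might look and does not describe Square's tooling. The function names `resolves` and `hosts_still_failing` are hypothetical, and the check only confirms that each hostname still resolves, which is the most basic signal.

```python
import socket
import time


def resolves(hostname):
    """Return True if the hostname currently resolves to at least one address."""
    try:
        return len(socket.getaddrinfo(hostname, None)) > 0
    except socket.gaierror:
        return False


def hosts_still_failing(hostnames, check=resolves, attempts=3,
                        delay_seconds=5.0, sleep=time.sleep):
    """Re-check failing hosts up to `attempts` times; return those that never pass."""
    pending = list(hostnames)
    for attempt in range(attempts):
        # Keep only the hosts that failed this round.
        pending = [host for host in pending if not check(host)]
        if not pending:
            break
        if attempt < attempts - 1:
            sleep(delay_seconds)
    return pending
```

A deployment script could call `hosts_still_failing` with the list of critical hostnames immediately after the change and trigger a rollback or page the on-call engineer if the returned list is not empty. The `check` and `sleep` parameters are injectable so the logic can be tested without touching the network.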

Employees unable to access their Notion workspaces

Notion had a similar DNS misconfiguration that caused hours of outage. According to Notion, an unusual DNS issue occurred at the registry operator level. Its engineers responded immediately so that users could regain access to their workspaces as soon as possible.

Users locked out of their Instagram accounts

A software bug locked users out of their Instagram accounts for hours. Instagram said that the cause was a bug and that it had been resolved after 8 hours of outage. A total of 7,000 accounts all over the world were affected by this incident.

Business operations affected by AWS outage

The AWS outage was caused by a bad piece of code that made systems behave strangely and unexpectedly. As a result, businesses could not access the service, so their transactions and other functions could not be performed. Amazon.com had intended to release an automated computer feature, but it ran into an error.

AWS is Amazon’s on-demand cloud computing business, which provides computing power, database storage, and content delivery services.

GitLab.com users disrupted by an outage after data deletion

GitLab.com accidentally deleted data from its primary server. As a result, it lost modifications to database data such as projects, comments, user accounts, issues, and snippets that were made between 17:20 and 00:00 UTC on January 31. According to GitLab.com’s postmortem report, the database setup, a single primary and a single secondary in hot-standby mode, had a single point of failure.

Its recovery procedures were also broken, and data had been removed from both the primary and secondary databases. This major incident affected 5,000 projects, 5,000 comments, and 700 new user accounts.
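Broken recovery procedures often go unnoticed until they are needed, so it can help to check backups automatically. The sketch below is a generic illustration and is not taken from GitLab.com's postmortem: the function name `backup_problems` and its thresholds are assumptions. It only confirms that a backup file exists, is not empty, and is recent. A real recovery test would also restore the backup and inspect the result.

```python
import time
from pathlib import Path


def backup_problems(path, max_age_seconds=24 * 3600, min_bytes=1, now=None):
    """Return a list of human-readable problems found with a backup file."""
    path = Path(path)
    if not path.is_file():
        return [f"{path} does not exist"]

    stat = path.stat()
    current_time = time.time() if now is None else now
    problems = []
    # An empty or tiny file usually means the backup job failed silently.
    if stat.st_size < min_bytes:
        problems.append(f"{path} is only {stat.st_size} bytes")
    # A stale file means the backup job has stopped running.
    if current_time - stat.st_mtime > max_age_seconds:
        problems.append(f"{path} is older than {max_age_seconds} seconds")
    return problems
```

Running a check like this on a schedule, and alerting when the returned list is not empty, turns a silent backup failure into a visible one.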

How to avoid business disruptions and software outages?

Detect errors at the earliest stage of the pipeline

Software is left vulnerable when no error detection mechanism alerts stakeholders to major issues and threats to functionality. Railtown AI is a tool that surfaces errors as soon as they occur. It integrates deeply and seamlessly with the CI/CD environment and is powered by AI and machine learning. The integration includes a dashboard where stakeholders can monitor highlights of each environment, error distribution, and context down to the exact line where an error was seen.

If software outages and bugs are limiting your individual and team productivity, sign up for a 14-day free trial, or chat with us to learn more.
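To make the idea of capturing an error with its exact line concrete, here is a minimal, generic Python sketch. It does not use or represent Railtown AI's API. `ErrorReport`, `build_report`, `run_step`, and the `notify` callback are hypothetical names, and the sketch relies only on the standard `traceback` module.

```python
import traceback
from dataclasses import dataclass


@dataclass
class ErrorReport:
    error_type: str
    message: str
    filename: str
    line_number: int
    function: str


def build_report(exc):
    """Describe an exception using the innermost frame of its traceback."""
    frames = traceback.extract_tb(exc.__traceback__)
    if frames:
        last = frames[-1]  # the frame where the error was actually raised
        location = (last.filename, last.lineno, last.name)
    else:
        location = ("<unknown>", 0, "<unknown>")
    return ErrorReport(type(exc).__name__, str(exc), *location)


def run_step(step, notify):
    """Run one pipeline step; report any failure, then re-raise it."""
    try:
        return step()
    except Exception as exc:
        notify(build_report(exc))
        raise
```

In a pipeline, `notify` could post the report to a chat channel or a dashboard. Re-raising the exception keeps the step failing loudly, so the pipeline stops instead of continuing with a hidden error.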
