Background
Advances in and the proliferation of modern portable computing devices, together with the wide availability of mobile and LAN/WAN/Internet connections, have created an environment in which people connect to their favorite applications and websites on a continuous basis. They want to be “always on”: accessing business data, getting up-to-date information, responding to business and private e-mails, posting files on social networks, and so on. However, despite powerful servers, the wide availability of the Internet, Wi-Fi and mobile connections, and high data transfer rates, end users still encounter messages such as “server is down”, “site is temporarily unavailable”, “your request can’t be completed now, try later”, “DNS failure”, “service unavailable”, “down for maintenance”, “sorry, something went wrong”, “network is unreachable”, “we’ll be back soon, thank you for your patience”, and so on. When users receive these messages while connecting to Facebook, Instagram, or other social networking sites merely to share messages or files, there is no direct financial consequence for them. When the same or similar messages appear while customers try to reach an e-business or e-commerce site, however, they will in most cases prompt customers to switch to another site, vendor, or provider and consequently cause financial losses for the company, vendor, or provider. Not only “out of use” messages but also delays in application servers’ response times may cause customers to switch immediately to competitors, and hence the loss of customers and money.
From a technology point of view, such messages can have several hardware and software origins, such as server hardware glitches, server operating system crashes, application bugs, failures in data communication devices and lines, network disconnections, and poor IT operations. Application servers may also go down or become unreachable for several hours or days due to electricity cuts, power outages, natural disasters, and pandemics.
Technical issues related to IT infrastructure devices may occur within all of the information architecture models in use today, such as the on-premises client-server model and the client-cloud model. In many cases, these problems cause the unavailability of application servers or of whole networks, which simply means the unavailability of information. If an application server goes down or a network is unreachable for some time, the situation is known as “system downtime”; it can be caused by a server hardware glitch, a server operating system crash, a network component failure, and similar issues. These so-called downtime points in both on-premises client-server architecture and client-cloud architecture are considered mission critical for continuous computing and business continuity in the modern e-business world. According to the IT Disaster Recovery Preparedness Council Report (2015), one hour of downtime can cost small companies as much as $8,000, midsize companies up to $74,000, and large enterprises up to $700,000. The Ponemon Institute (2016) reported that the average cost of a data center outage steadily increased from $505,502 in 2010 to $740,357 (a 38% net change). An Information Technology Intelligence Consulting report (ITIC Report, 2016) found that 98% of organizations say a single hour of downtime costs over $100,000, and 81% of respondents indicated an amount of over $300,000. A record one-third (33%) of enterprises report that one hour of downtime costs their firms $1 million to over $5 million.
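To make such per-hour figures concrete, the short sketch below converts an assumed availability level into expected downtime hours per year and multiplies them by an assumed cost per hour of downtime. Both inputs (99.9% availability and $300,000 per hour) are hypothetical, chosen only to mirror the order of magnitude of the estimates cited above; they are not taken from any single report.

```python
# Illustrative only: the availability level and cost-per-hour figures are hypothetical.
HOURS_PER_YEAR = 365 * 24

def annual_downtime_hours(availability: float) -> float:
    """Hours per year a system is expected to be unavailable."""
    return (1.0 - availability) * HOURS_PER_YEAR

def annual_downtime_cost(availability: float, cost_per_hour: float) -> float:
    """Expected yearly cost of unplanned downtime."""
    return annual_downtime_hours(availability) * cost_per_hour

if __name__ == "__main__":
    availability = 0.999            # "three nines": roughly 8.8 hours down per year
    cost_per_hour = 300_000.0       # hypothetical mid-range per-hour cost
    print(f"Expected downtime: {annual_downtime_hours(availability):.1f} h/year")
    print(f"Expected cost:     ${annual_downtime_cost(availability, cost_per_hour):,.0f}/year")
```

Even at 99.9% availability, the expected yearly downtime of roughly nine hours translates into multimillion-dollar annual losses under these assumptions, which is consistent with the enterprise-level figures reported above.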
Emerson Network Power and the Ponemon Institute (2016) revealed that the average cost of data center downtime across industries was approximately $7,900 per minute. Raphael (2013) reported that a 49-minute failure of Amazon’s services on January 31, 2013, resulted in close to $5 million in missed revenue; similar outages hit Dropbox, Facebook, Microsoft, Google Drive, and Twitter in January, February, and March 2013. According to the Aberdeen Report (2014), the average cost of an hour of downtime is $686,250 for large companies, $215,638 for medium companies, and $8,581 for small companies. Gartner (2014) noted that “Based on industry surveys, the number we typically cite is $5,600 p/minute, which extrapolates to well over $300K p/hour”. With regard to network downtime, “the cost of improving availability remains high and downtime is less acceptable, making rightsizing network availability the key goal for enterprise network designers” (Gartner, 2014). The Emerson Report (2014) found that the most frequently cited causes of unplanned outages include IT equipment failure; cybercrime; UPS system failure; water, heat, or CRAC failure; generator failure; weather incursion; and accidental/human error. International Data Corporation (IDC) (2014) noted that IT applications and services have become a critical element in how companies interact with their customers, deliver new products and services, and improve the productivity of their own workforce. The biggest cloud outages in 2014 included those of Amazon Web Services, Verizon Wireless, Dropbox, Adobe, Samsung, Microsoft Lync, and Microsoft Exchange Online (Raphael, 2013). An Avaya report (2014) revealed that 80% of companies lose revenue when the network goes down, with the average company losing $140,003 per incident. The Quorum Report (2013) found that hardware failures are the most common cause of downtime within small and mid-sized businesses, accounting for 55% of incidents, while human error (system and network administrators’ mistakes) was the reason in 22% of disasters.
According to the Veeam Availability Report (2017), 82% of enterprises face a gap between what users expect and what IT can deliver. This “availability gap” is significant: unplanned downtime costs enterprises an average of $21.8 million each per year. The Uptime Institute’s report (2018) revealed that the number of respondents who experienced an IT downtime incident or severe service degradation in the past year (31%) increased over the previous year’s survey (about 25%), and that in the past three years almost half of the 2018 survey respondents had an outage, a higher-than-expected number. The Veeam Data Availability Report (2017) revealed that each downtime incident lasts about 90 minutes, costs on average $150,000 per outage, and that such incidents represent $21.8 million per year in losses. Gartner reported that the average cost of an IT outage is $5,600 per minute; because businesses operate so differently, downtime can cost as little as $140,000 per hour at the low end, $300,000 per hour on average, and as much as $540,000 per hour at the high end (Opiah, 2019). Veeam (2016) reported that
Recent Uptime Institute research (Uptime Institute report, 2019) found that major failures are not only still common but that their consequences are high, and possibly higher than in the past, as a result of our heavy reliance on IT systems in all aspects of life. In 2018 there were major outages of financial systems, day-long outages of 911 emergency call services, aircraft losing services from ground-based IT landing systems, and healthcare systems being lost during critical hours. The Business Continuity Institute (BCI Report, 2018) reported that the uptake of business continuity arrangements has continued on an upward trend: an increasing number of organizations embed business continuity to protect their supply chains, which also has a positive impact on other areas such as insurance and top management commitment.
Application defects, hardware failures, and operating system crashes may take different forms, such as bugs in programs, badly integrated applications, and process/file corruption. Network problems, in addition to hardware glitches in data communication devices, include issues related to Domain Name System (DNS) servers, network configuration files, and network protocols. Human error may also cause data unavailability; it includes the accidental or intentional removal of files, faulty operations, and hazardous activities such as sabotage, strikes, and vandalism. The accidental or intentional removal of system files by a system administrator can shut down a whole server and make applications and data unreachable. Another example is the loss of key IT personnel or the departure of expert staff for various reasons, for example, poor managerial decisions on IT staffing policy.
Adeshiyan et al. (2010) stated that traditional high-availability and disaster recovery solutions require proprietary hardware, complex configurations, application-specific logic, highly skilled personnel, and a rigorous and lengthy testing process. Jarvelainen (2013) proposed a framework for applying business continuity management in the context of business information systems. Zambon et al. (2011) stated that having a reliable information system is crucial to safeguarding enterprise revenues. Martin (2011) cited the results of a study by Emerson Network Power and the Ponemon Institute which revealed that the average data center downtime event costs $505,500, with the average incident lasting 90 minutes. The ITIC Report (2009) revealed that “server hardware and server operating system reliability has improved vastly since the 1980s, 1990s and even in just the last two to three years”, and underscored that common human error poses a bigger threat to server hardware and server operating system reliability than technical glitches. Venkatraman (2013) noted that more than a third of respondents viewed human error as the most likely cause of downtime. Clancy (2013) stated that it takes an average of 30 hours to recover from failures, which can be devastating for a business of any size. Sun et al. (2014) proposed a Markov-based model for evaluating system availability and estimating the availability index (a simplified illustrative sketch is given below). Bhatt et al. (2010) considered IT infrastructure the “enabler of organizational responsiveness and competitive advantage”. Versteeg and Bouwman (2006) defined the main elements of a business architecture as business domains within the new paradigm of relations between business strategy and information technologies. Yoo (2011) stated that the shift to cloud computing also means that application providers will place less emphasis on the operating system running on individual desktops and greater focus on the operating system running on the relevant servers. Duffy et al. (2010) stated that “although the operating system is an integral component of a computer-based information system, for many MIS majors the study of operating systems falls into this ‘dry’ category”. Lawler et al. (2008) explored the risks of IT application downtime and the increasing dependence on critical IT infrastructures and discussed several disaster tolerance techniques. Brende and Markov (2013) considered the most important risks inherent to cloud computing and focused on the risks that are relevant to the IT function being migrated to the cloud. Sandvig (2007) noted that four server-side technologies are needed to support e-business: a web server, a server-side programming technology, a database application, and a server operating system. In summary, it is more than evident that the continuous computing features of modern server operating systems, namely their availability, scalability, and reliability, affect a business directly: more downtime simply means greater financial losses. CIO (2013) reported that “Web-based services can crash and burn just like any other type of technology”.
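As a rough illustration of what such an availability index expresses (a generic two-state up/down model, not a reproduction of the Markov-based model proposed by Sun et al. (2014)), the sketch below computes steady-state availability from assumed mean time between failures (MTBF) and mean time to repair (MTTR); the numeric values are hypothetical.

```python
# Simplified two-state (up/down) availability sketch; the MTBF/MTTR values are hypothetical.

def steady_state_availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability of a two-state up/down model.

    With failure rate lambda = 1/MTBF and repair rate mu = 1/MTTR, the
    stationary probability of the 'up' state is mu / (lambda + mu),
    which simplifies to MTBF / (MTBF + MTTR).
    """
    return mtbf_hours / (mtbf_hours + mttr_hours)

if __name__ == "__main__":
    # Hypothetical server: fails on average every 2,000 hours, repaired in 1.5 hours.
    a = steady_state_availability(mtbf_hours=2000.0, mttr_hours=1.5)
    print(f"Availability index: {a:.5f}")   # about 0.99925, i.e. roughly 6.6 hours down per year
```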
Marshall (2013) related a story about the cloud service provider Nirvanix, which “has told its customers they have two weeks to find another home for their terabytes of data because the company was closing its doors and shutting down its services”. Clancy (2013) stated that “Hardware failure is the biggest culprit, representing about 55 percent of all downtime events at SMBs, while human error accounts for about 22 percent of them”. According to the Information Today Report (2012), network outages (50%) were the leading cause of unplanned downtime within the last year; human error (45%), server failures (45%), and storage failures (42%) followed closely behind. An example of human error is an accidental or intentional operation of removing fi...