Monitoring as a Service (MaaS)

Monitoring as a Service (MaaS) is an all-inclusive managed services package that monitors Cloud services such as Microsoft Azure, Amazon AWS and Rackspace, as well as traditional on-premises IT infrastructure, and sends you proactive alerts based on key metrics before problems begin to impact your infrastructure and your customers. The true benefit lies in our expert interpretation of your data. Your monitoring systems hold little value unless their results are properly applied to your operations. MaaS is a new delivery model suited to organizations looking to adopt a monitoring framework quickly with minimal investment.

Cloud computing and its pay-as-you-go economic model not only enable application developers and application service providers to perform on-demand utility computing, but also push datacenter technologies to become more open and more consumer driven. Typically, in addition to renting virtual server instances and paying for certain middleware services based on their usage, such as load balancing in EC2, Cloud consumers also need to monitor the performance of their applications in response to unexpected peaks of service requests or performance degradation in their multi-tier application frameworks. Similarly, Cloud providers need to monitor the large number of computing nodes in their datacenters in response to virtual machine failures or performance degradation of virtual machines, so that they can ensure the level of service quality demanded by Cloud consumers.

Today’s Cloud datacenters are complex compositions of large-scale servers, virtual machines, physical and virtual networks, middleware, applications, and services. Their growing scale and complexity challenge our ability to closely monitor the state of various entities, and to utilize voluminous monitoring data for better operation. Providing Monitoring-as-a-Service (MaaS)[5] brings a number of benefits to both Cloud providers and consumers.

First, MaaS minimizes the cost of ownership by leveraging state-of-the-art monitoring tools and functionalities. MaaS makes it easier for users to deploy state monitoring at different levels of Cloud services than developing ad-hoc monitoring tools or setting up dedicated monitoring hardware and software.

Second, MaaS enables the pay-as-you-go utility model for state monitoring. This is especially important for users to enjoy full-featured monitoring services based on their monitoring needs and available budget. MaaS also brings Cloud service providers the opportunity to consolidate monitoring demands at different levels (infrastructure, platform, and application) to achieve efficient and scalable monitoring.

Finally, MaaS pushes Cloud service providers to invest in state-of-the-art monitoring technology and deliver continuous improvements in both monitoring service quality and performance. With the consolidated services and monitoring data, Cloud service providers can also develop value-added services that improve Cloud environments and create new revenue sources.

This dissertation research tackles critical challenges in MaaS with a layered approach that systematically addresses monitoring efficiency, scalability, reliability and utility at the monitoring infrastructure level, the monitoring functionality level and the monitoring data utility level. We analyze key limitations of existing techniques, and develop new techniques to offer more effective Cloud monitoring capabilities in this layered design. In addition, we built systems that help Cloud developers and users to access, process and utilize Cloud monitoring data. Specifically, this dissertation makes the following contributions.

1. Technical Challenges

Despite the attractiveness of MaaS, providing monitoring-as-a-service also involves big challenges at different levels.

Cloud-scale monitoring infrastructure

MaaS requires a Cloud-scale monitoring infrastructure with strict performance and scalability requirements. How can we collect a massive set of live information from hundreds of thousands of, even millions of manageable instances in a Cloud datacenter? Due to the on-demand provisioning nature of Cloud, monitoring demands can also change significantly over time. Hence, the monitoring infrastructure should not only achieve high scalability, but also embrace changes in monitoring demands. Furthermore, the monitoring infrastructure must also provide good multi-tenancy support to ensure a massive number of users enjoy Cloud monitoring services at the same time.

Advanced monitoring functionalities

Cloud monitoring needs vary heavily from task to task, and many monitoring tasks require the support of advanced monitoring techniques to achieve communication efficiency, a flexible tradeoff between accuracy and sampling cost, and reliable distributed monitoring. For instance, Cloud service rate limiting requires intensive monitoring of per-user access rates across a large number of distributed servers which may be located on different continents. Such monitoring tasks require highly efficient monitoring-related communication. As another example, some monitoring tasks such as network traffic monitoring incur a high monitoring data collection (sampling) cost. Achieving accurate yet efficient monitoring for these tasks is difficult. Furthermore, failures and malfunctions are the norm rather than the exception in large-scale distributed environments. As a result, monitoring data are almost always error-prone or incomplete. How can we prevent such data from generating misleading monitoring results? And how can we maximize the utility of monitoring data in the presence of possible disruptions at different levels?

Utilization of monitoring data

Cloud datacenter monitoring generates tremendous amounts of data which often see little use beyond simple event detection. For example, Amazon EC2’s monitoring service CloudWatch provides continuous web application performance and resource usage monitoring for simple dynamic server provisioning (autoscaling), which also produces considerable monitoring data. Can we leverage such data to offer intelligent functionalities that further simplify Cloud usage? For instance, performance-driven Cloud application provisioning is difficult due to the large number of candidate provisioning plans (e.g., different types of VMs, different cluster configurations, different hourly renting costs, etc.). Is it possible to utilize Cloud application performance monitoring data to simplify the provisioning planning process, or even to liberate Cloud users from the details of application provisioning while still meeting their performance goals? If it is possible, what techniques should we develop to support such functionalities?

2. Dissertation Scope and Contributions

This dissertation research tackles the above problems with a layered approach that systematically addresses monitoring efficiency, scalability, reliability and utility at the monitoring infrastructure level, the monitoring functionality level and the monitoring data utility level. We analyze key limitations of existing techniques, and develop new techniques to offer more effective Cloud monitoring capabilities in this layered design.

Monitoring Infrastructure

At the monitoring infrastructure level, we propose REMO and Tide which contribute to a Cloud-scale monitoring infrastructure that ensures the efficiency, scalability and multi-tenancy support of Cloud monitoring.

  • Monitoring Topology Planning : Large-scale monitoring can incur significant overhead on distributed nodes participating in the collection and processing of monitoring data. Existing techniques that focus on monitoring task level efficiency often introduce heavily skewed workload distributions on monitoring nodes and cause excessive resource usage on certain nodes. We developed REMO, a resource-aware monitoring system that treats node-level resource constraints (e.g., monitoring-related CPU utilization should be less than 5%) as a first-class factor when scheduling multiple monitoring tasks collectively. REMO optimizes the throughput of the entire monitoring network without causing excessive resource consumption on any participating node, which ensures performance isolation in multi-tenant monitoring environments. It also explores cost sharing opportunities among tasks to optimize monitoring efficiency. We deployed REMO on System S, a large-scale distributed stream processing system built at IBM TJ Watson Lab. Through resource-aware planning, REMO achieves 35%-45% error reduction compared to existing techniques.

  • Self-Scaling Monitoring Infrastructure : From traces collected in production datacenters, we found that monitoring and management workloads in Cloud datacenters tend to be highly volatile due to their on-demand usage model. Such workloads often make the management server a performance bottleneck. To address this problem, we developed Tide, a self-scaling management system which automatically scales its capacity up or down according to the observed workloads. We built the prototype of Tide by modifying VMware’s vSphere management server and leveraging HBase, a Hadoop-based NoSQL store, for scalable state persistence. The experimental results show that Tide provides consistent performance through self-scaling even with extremely volatile management workloads.

Monitoring Functionalities

At the monitoring functionality level, we aim at providing new monitoring techniques to meet the unique and diverse Cloud monitoring needs, and we propose WISE, Volley and CrystalBall, which deliver accurate, cost-effective and reliable monitoring results by employing novel distributed monitoring algorithms for error-prone Cloud environments.

  • Efficient Continuous State Violation Detection : Most existing works on distributed state monitoring employ an instantaneous monitoring model, where the state is evaluated based on the most recently collected results, to simplify algorithm design. Such a model, however, tends to introduce false state alerts due to noise and outliers in monitoring data. To address this issue, we proposed WISE, a window-based state monitoring approach which utilizes temporal windows to capture continuous state violation in a distributed setting. WISE not only delivers the same results as a centralized monitoring system while running as a distributed implementation, but also decouples a global monitoring task into distributed local ones in a way that minimizes the overall communication cost.

  • Violation-Likelihood based Monitoring : Asynchronized monitoring techniques such as periodic sampling often introduce a cost-accuracy dilemma, e.g., frequently polling state information may produce fine-grained monitoring data but may also introduce high sampling cost for tasks such as network monitoring based on deep packet inspection. To address this issue, we proposed Volley, a violation likelihood based approach which dynamically tunes monitoring intensity based on the likelihood of detecting important results. More importantly, it always safeguards a user-specified accuracy goal while minimizing monitoring cost. Volley also coordinates sampling over distributed nodes to maintain the task-level accuracy, and leverages inter-task state correlation to optimize multi-task sampling scheduling. When deployed in a testbed datacenter environment with 800 virtual machines, Volley reduces monitoring overhead by up to 90% with negligible accuracy loss.

  • Fault-Tolerant State Monitoring : While we often assume monitoring results are trustworthy and monitoring services are reliable, such assumptions do not always hold, especially in large-scale distributed environments such as datacenters where transient device and network failures are the norm rather than the exception. As a result, distributed state monitoring approaches that depend on reliable communication may produce inaccurate results in the presence of failures. We developed CrystalBall, a robust distributed state monitoring approach that produces reliable monitoring results by continuously updating the accuracy estimation of the current results based on observed failures. It also adapts to long-term failures by coordinating distributed monitoring tasks to minimize accuracy loss. Experimental results show that CrystalBall consistently improves monitoring accuracy even under severe message loss and delay.

Monitoring Enhanced Cloud Management

At the monitoring data utility level, we study intelligent techniques that utilize monitoring data to offer advanced monitoring management capabilities. As an initial attempt, we propose Prism, which offers an innovative application provisioning functionality based on knowledge learned from cumulative monitoring data. We aim at utilizing multi-tier Cloud application performance data to guide application provisioning. Prism is a prediction-based provisioning framework that simplifies application provisioning by using performance prediction to find a proper provisioning plan for a performance goal in a huge space of candidate plans. As its unique feature, Prism isolates and captures the performance impact of different provisioning options, e.g., virtual machine types and cluster configurations, from performance monitoring data with off-the-shelf machine learning techniques. This technique avoids exploring the huge space of candidate provisioning plans with experiments. As a result, Prism can quickly find the most cost-effective plan with little cost for training performance prediction models.

3. Organization of the Dissertation

This dissertation is organized as a series of chapters each addressing one of the problems described above. Each chapter presents the detail of the problem being addressed, provides basic concepts and then describes the development of a solution followed by the evaluation of the proposed solution. 

REMO distinguishes itself from existing works in several key aspects.

First, it jointly considers inter-task cost sharing opportunities and node-level resource constraints. Furthermore, it explicitly models the per-message processing overhead which can be substantial but is often ignored by previous works.

Second, REMO produces a forest of optimized monitoring trees through iterations of two phases. One phase explores cost-sharing opportunities between tasks, and the other refines the tree with resource-sensitive construction schemes.

Finally, REMO also employs an adaptive algorithm that balances the benefits and costs of overlay adaptation. This is particularly useful for large systems with constantly changing monitoring tasks.
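The interaction between per-message overhead, cost sharing and node-level budgets can be illustrated with a toy cost model. The sketch below is not REMO's actual model: the names CostModel, node_cost and within_budget, the linear cost formula, and the assumption that receiving a message costs the same as sending it are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CostModel:
    # Hypothetical units, e.g. CPU microseconds per collection round.
    per_message: float  # fixed overhead paid for every message handled
    per_value: float    # extra cost for each attribute value in a message


def message_cost(model: CostModel, values_carried: int) -> float:
    """Cost of handling one message that carries `values_carried` values."""
    return model.per_message + model.per_value * values_carried


def node_cost(model: CostModel, child_payloads: list, own_values: int) -> float:
    """Cost at one tree node: receive one message per child, then forward
    a single message upstream with the children's values plus its own."""
    receive = sum(message_cost(model, v) for v in child_payloads)
    forward = message_cost(model, own_values + sum(child_payloads))
    return receive + forward


def within_budget(model, child_payloads, own_values, budget) -> bool:
    """Node-level resource constraint: may this node take these children?"""
    return node_cost(model, child_payloads, own_values) <= budget


model = CostModel(per_message=1.0, per_value=0.1)

# Two tasks each need one value from the same leaf node.
separate = 2 * node_cost(model, [], 1)  # two trees, two messages
shared = node_cost(model, [], 2)        # one tree, one message with both values
print(separate, shared)
```

With a per-message overhead that dominates the per-value cost, the shared tree is clearly cheaper for the leaf, while within_budget shows how a planner might refuse to attach more children to a node that is already close to its limit.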

Enabling self-scaling in management middleware involves two challenges.

  • First, the self-scaling process should take minimum time during workload bursts to avoid task execution delays.
  • Second, it should utilize as few resources as possible to avoid resource contention with application usage.

To meet these two goals, we propose Tide, a self-scaling framework for virtualized datacenter management. Tide is a distributed management server that can dynamically self-provision new management instances to meet the demand of management workloads. Tide achieves responsive and efficient self-scaling through a set of novel techniques, including a fast capacity-provisioning algorithm that supplies just-enough capacity and a workload dispatching scheme that maximizes task execution throughput with optimized task assignment. We evaluate the effectiveness of Tide with both synthetic and real world datacenter management traces. The results indicate that Tide significantly reduces the task execution delay for bursty management workloads. Furthermore, it also minimizes the number of dynamically provisioned management instances by fully utilizing provisioned instances.
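A rough sketch of the "just-enough capacity" idea follows. It is not Tide's provisioning algorithm: the sizing rule, the function names and the parameters (per-instance task rate, target delay) are assumptions introduced only to show how a scaling decision can be derived from observed backlog.

```python
import math


def instances_needed(pending_tasks: int,
                     tasks_per_sec_per_instance: float,
                     target_delay_sec: float) -> int:
    """Smallest number of instances that can drain the backlog within the
    target delay, assuming every instance works at the same fixed rate."""
    if pending_tasks <= 0:
        return 0
    capacity_per_instance = tasks_per_sec_per_instance * target_delay_sec
    return math.ceil(pending_tasks / capacity_per_instance)


def scaling_delta(current_instances: int,
                  pending_tasks: int,
                  tasks_per_sec_per_instance: float,
                  target_delay_sec: float,
                  min_instances: int = 1) -> int:
    """Positive result: instances to add. Negative: instances to release."""
    wanted = max(min_instances,
                 instances_needed(pending_tasks,
                                  tasks_per_sec_per_instance,
                                  target_delay_sec))
    return wanted - current_instances


# A burst of 1000 queued tasks, 5 tasks/s per instance, 20 s delay target.
print(scaling_delta(4, 1000, 5.0, 20.0))  # grow during the burst
print(scaling_delta(4, 0, 5.0, 20.0))     # shrink once the queue is empty
```

Sizing to the backlog rather than to a fixed step keeps both goals in view: the burst is absorbed in one scaling action, and no more instances are requested than the target delay requires.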

A WIndow-based StatE monitoring framework (WISE) for efficiently managing applications in Cloud datacenters.

Window-based state monitoring reports alerts only when state violation is continuous within a specified time window. Our formal analysis and experimental evaluation of WISE both demonstrate that window-based state monitoring is not only more resilient to temporary value bursts and outliers, but also can save considerable communication when implemented in a distributed manner. Experimental results show that WISE reduces communication by 50%-90% compared with instantaneous monitoring approaches and simple alternative schemes.
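The difference between the instantaneous and the window-based model can be seen in a small centralized sketch. It only illustrates the alert semantics; the distributed decomposition that gives WISE its communication savings is not shown, and the function names and sample values are made up for this example.

```python
def instantaneous_alerts(values, threshold):
    """Alert at every sample that exceeds the threshold."""
    return [i for i, v in enumerate(values) if v > threshold]


def window_alerts(values, threshold, window):
    """Alert only when the threshold has been exceeded for `window`
    consecutive samples, i.e. the violation is continuous."""
    alerts = []
    run = 0  # length of the current streak of violating samples
    for i, v in enumerate(values):
        run = run + 1 if v > threshold else 0
        if run >= window:
            alerts.append(i)
    return alerts


# CPU utilization samples with one isolated spike and one sustained burst.
cpu = [10, 95, 12, 91, 93, 96, 20]
print(instantaneous_alerts(cpu, threshold=90))     # [1, 3, 4, 5]
print(window_alerts(cpu, threshold=90, window=3))  # [5]
```

The isolated spike at index 1 triggers an instantaneous alert but is ignored by the window-based detector, which fires only once the violation has lasted three samples.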

A violation likelihood based approach for efficient distributed state monitoring in datacenter environments.

Volley achieves both efficiency and accuracy with a flexible monitoring framework which uses dynamic monitoring intervals determined by the likelihood of detecting state violations. Our approach consists of three techniques.

  • First, we devise efficient node-level adaptation algorithms that minimize monitoring cost with controlled accuracy for both basic and advanced state monitoring models.
  • Second, Volley employs a distributed scheme that coordinates the monitoring on multiple monitoring nodes of the same task for optimal monitoring efficiency.
  • Finally, Volley enables cost reduction with minimum accuracy loss by exploring state correlation at the multi-task level, which is important for addressing workload issues in large-scale datacenters.

We perform extensive experiments to evaluate our approach on a testbed Cloud datacenter environment consisting of 800 VMs. Our results on system, network and application level monitoring show that Volley can considerably reduce monitoring cost and still deliver user-specified monitoring accuracy under various monitoring scenarios.
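One way to picture a likelihood-driven sampling interval is sketched below. This is an illustration under stated assumptions, not Volley's algorithm: it assumes per-step changes of the monitored value are independent, bounds the chance of crossing the threshold with the one-sided Chebyshev (Cantelli) inequality, and ignores violations that could occur and vanish between two samples. All function and parameter names are hypothetical.

```python
import statistics


def violation_bound(current, threshold, deltas, steps):
    """Upper bound on the probability that the value exceeds `threshold`
    after `steps` more steps, given recently observed per-step changes."""
    mean = statistics.fmean(deltas)
    variance = statistics.pvariance(deltas)
    gap = threshold - current - steps * mean  # room left after expected drift
    if gap <= 0:
        return 1.0
    total_variance = steps * variance  # independence assumption
    if total_variance == 0:
        return 0.0
    return total_variance / (total_variance + gap * gap)


def next_interval(current, threshold, deltas, max_error, max_interval):
    """Longest interval (in steps) whose violation bound stays within the
    user-specified error allowance; never shorter than one step."""
    interval = 1
    for steps in range(2, max_interval + 1):
        if violation_bound(current, threshold, deltas, steps) > max_error:
            break
        interval = steps
    return interval


recent_deltas = [1, -1, 1, -1]
# Far from the threshold: sampling can be relaxed.
print(next_interval(50, 90, recent_deltas, max_error=0.01, max_interval=30))
# Close to the threshold: fall back to sampling at every step.
print(next_interval(88, 90, recent_deltas, max_error=0.01, max_interval=30))
```

The interval shrinks automatically as the monitored value approaches the threshold, which is the behavior the dynamic-interval design aims for: sampling effort is spent where a violation is likely.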

Exposing and handling communication dynamics such as message delay and loss in Cloud monitoring environments.

Our approach delivers two distinct features.

  • First, it quantitatively estimates the accuracy of monitoring outputs to capture uncertainties introduced by messaging dynamics. This feature helps users to distinguish trustworthy monitoring results from ones that deviate heavily from the truth, and is important for large-scale distributed monitoring where temporary communication issues are common.
  • Second, our approach adapts to non-transient messaging issues by reconfiguring distributed monitoring algorithms to minimize monitoring errors.

Our experimental results show that, even under severe message loss and delay, our approach consistently improves monitoring accuracy, and when applied to Cloud application auto-scaling, outperforms existing state monitoring techniques in terms of the ability to correctly trigger dynamic provisioning. 
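A simple way to expose the uncertainty caused by lost messages is to report bounds instead of a single value. The sketch below is a deliberately crude stand-in for the accuracy estimation described above, not the actual technique: it assumes a sum aggregate, a known value range per node, and hypothetical function names.

```python
def sum_bounds(reports, value_range, node_count):
    """Bounds on the true global sum when some node reports are missing.

    reports:     dict of node id -> value received in this round
    value_range: (lowest, highest) value any single node can report
    """
    lowest, highest = value_range
    missing = node_count - len(reports)
    known = sum(reports.values())
    return known + missing * lowest, known + missing * highest


def classify(reports, value_range, node_count, threshold):
    """Say whether a threshold violation is certain, ruled out, or unknown."""
    low, high = sum_bounds(reports, value_range, node_count)
    if low > threshold:
        return "violation"
    if high <= threshold:
        return "no violation"
    return "uncertain"


# Three nodes are monitored, but the message from node "c" was lost.
received = {"a": 30, "b": 40}
print(sum_bounds(received, (0, 50), 3))     # (70, 120)
print(classify(received, (0, 50), 3, 100))  # uncertain
print(classify(received, (0, 50), 3, 60))   # violation
```

Reporting "uncertain" rather than silently treating a missing value as zero is what lets a consumer such as an auto-scaler decide whether to act on a result or wait for the next round.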

A provisioning planning method that finds the most cost-effective provisioning plan for a given performance goal by searching the space of candidate plans with performance prediction.

Prism employs a set of novel techniques that can efficiently learn performance traits of applications, virtual machines and clusters from cumulative monitoring data to build models that predict the performance of an arbitrary provisioning plan. It utilizes historical performance monitoring data and data collected from a small set of automatic experiments to build a composite performance prediction model that takes application workloads, types of virtual server instances and cluster configuration as input, and outputs predicted performance.
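The search step can be sketched as follows, assuming a prediction function is already available. Everything concrete here is hypothetical: the Plan fields, the VM types, prices and throughput figures, and the linear toy predictor stand in for a model that would be learned from monitoring data.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass(frozen=True)
class Plan:
    vm_type: str
    cluster_size: int
    hourly_cost: float


def cheapest_plan(plans: Iterable[Plan],
                  predict_throughput: Callable[[Plan], float],
                  goal: float) -> Optional[Plan]:
    """Return the lowest-cost plan whose predicted throughput meets the
    goal, or None when no candidate is predicted to be fast enough."""
    feasible = [p for p in plans if predict_throughput(p) >= goal]
    return min(feasible, key=lambda p: p.hourly_cost, default=None)


# Made-up catalogue: price per hour and requests/s handled by one VM.
PRICE = {"small": 0.10, "large": 0.35}
RATE = {"small": 100.0, "large": 400.0}


def toy_predictor(plan: Plan) -> float:
    """Stand-in for a learned model; assumes perfectly linear scaling."""
    return RATE[plan.vm_type] * plan.cluster_size


candidates = (
    [Plan("small", n, PRICE["small"] * n) for n in range(1, 13)]
    + [Plan("large", n, PRICE["large"] * n) for n in range(1, 5)]
)

best = cheapest_plan(candidates, toy_predictor, goal=1000.0)
print(best)
```

Because each candidate is scored by prediction rather than by deploying and measuring it, the whole space can be scanned cheaply; the quality of the result then depends entirely on how well the prediction model reflects real behavior.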

Benefits of Monitoring as a Service Platform

Scalability and Agility

Monitoring as a Service can support your business growth efficiently. Whether your network infrastructure or cloud service grows organically or through acquisition, "on demand" provisioning enables our team to add new cloud services, servers or network devices instantly. This also means that your team does not have to engineer for peak loads.

Recover faster

Get notified by email, SMS or phone when your website, server or APIs are down, before your customers notice.

Reliability

Our MaaS offers you an outside view of what is going on with your Cloud services, in-house servers and workstations, independent of your own systems. In addition, such a solution is also useful for business continuity and disaster recovery situations, as all asset information is held safely on the server side.

No Infrastructure Costs

Our MaaS offering includes a team of experts monitoring your services 24x7. You therefore do not need to invest in an in-house IT team with that particular technology expertise.

Simplified Management

Our team deals with the ongoing management, maintenance and upgrades of the technology, so you can focus on your core business needs.