Nowadays, scientific research increasingly relies on information technology: large-scale, high-performance computing systems (e.g. clusters, grids and supercomputers) are utilised by communities of researchers to run their applications. Scientific applications are usually computation and data intensive: complex computation tasks take a long time to execute, and the generated datasets are often terabytes or even petabytes in size. Storing valuable generated application datasets saves the cost of regenerating them when they are reused, not to mention the waiting time caused by regeneration. However, the large size of scientific datasets makes their storage a significant challenge.
In recent years, cloud computing has emerged as the latest distributed computing paradigm, providing redundant, inexpensive and scalable resources on demand. It offers researchers a new way to deploy computation and data intensive applications (e.g. scientific applications) without any infrastructure investment. Since theoretically unlimited storage and computation resources can be obtained from commercial cloud service providers, large generated application datasets can be flexibly stored in the cloud, or deleted and regenerated whenever needed. Under the pay-as-you-go model, the total application cost for generated datasets in the cloud depends heavily on the strategy used to store them. For example, storing all the generated application datasets may incur a high storage cost, since some datasets may be seldom used yet large in size; conversely, deleting all the generated datasets and regenerating them every time they are needed may incur a very high computation cost. Hence there is a trade-off between computation and storage in the cloud. To reduce the overall application cost, a good strategy is to find a balance: selectively store some popular datasets and regenerate the rest when needed.
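The trade-off above can be sketched as a simple cost-rate comparison. The following is a minimal illustration, not the thesis's actual model; the function names and prices are hypothetical, chosen only to show the arithmetic of the store-versus-regenerate decision.

```python
# Hypothetical sketch of the computation-storage trade-off under a
# pay-as-you-go model. All names and prices are illustrative.

def monthly_cost_if_stored(size_gb, storage_price_gb_month):
    """Cost rate of keeping a generated dataset in cloud storage."""
    return size_gb * storage_price_gb_month

def monthly_cost_if_deleted(regeneration_cost, uses_per_month):
    """Expected cost rate of deleting the dataset and regenerating on demand."""
    return regeneration_cost * uses_per_month

def should_store(size_gb, storage_price_gb_month,
                 regeneration_cost, uses_per_month):
    """Store the dataset only if storing is no more expensive than
    repeatedly regenerating it."""
    return (monthly_cost_if_stored(size_gb, storage_price_gb_month)
            <= monthly_cost_if_deleted(regeneration_cost, uses_per_month))

# A 1 TB dataset at $0.02/GB/month costs $20.48/month to keep; if it costs
# $5 of computation to regenerate and is used ~10 times a month ($50),
# storing it is the cheaper choice.
print(should_store(1024, 0.02, 5.0, 10))   # True
```

A seldom-used dataset flips the decision: at half a use per month the expected regeneration cost is only $2.50, so deletion wins.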
This thesis focuses on cost-effective storage of scientific application datasets in the cloud, a leading-edge and challenging topic. By investigating the niche issue of the computation and storage trade-off, we 1) propose a new cost model for dataset storage in the cloud; 2) develop novel benchmarking approaches to find the minimum cost of storing the application data; and 3) design innovative runtime storage strategies for storing the application data in the cloud. We start by introducing a motivating example from astrophysics and analysing the problem of the computation and storage trade-off in the cloud. Based on the requirements identified, we propose the novel concept of a Data Dependency Graph (DDG) and an effective cost model for dataset storage in the cloud. The DDG is based on data provenance, which records the generation relationships of all the datasets. With the DDG, we know how to effectively regenerate datasets in the cloud and can further calculate their generation costs. The total application cost for the generated datasets includes both their generation cost and their storage cost.
Based on the cost model, we develop novel algorithms that calculate the minimum cost for storing datasets in the cloud, i.e. the best trade-off between computation and storage. This minimum cost serves as a benchmark for evaluating the cost-effectiveness of different storage strategies in the cloud. We develop different benchmarking approaches, each with polynomial time complexity for a seemingly NP-hard problem: 1) the static on-demand approach suits situations where benchmarking is requested only occasionally; 2) the dynamic on-the-fly approach suits situations where benchmarking is requested more frequently at runtime. We also develop novel cost-effective storage strategies that users can apply at runtime in the cloud.
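The DDG and generation cost described above can be sketched as follows. This is a simplified illustration under assumed semantics, not the thesis's actual data structure or algorithm: each dataset records its direct predecessors and its own computation cost, and regenerating a deleted dataset requires first regenerating every deleted ancestor on its provenance path, with stored ancestors cutting the chain.

```python
# Illustrative sketch of a Data Dependency Graph (DDG). Field names and the
# recursion are assumptions for exposition, not the thesis's implementation.
from dataclasses import dataclass, field

@dataclass
class Dataset:
    name: str
    computation_cost: float      # cost to compute it from its predecessors
    storage_cost_rate: float     # cost per unit time to keep it stored
    stored: bool                 # current storage decision
    predecessors: list = field(default_factory=list)

def generation_cost(ds):
    """Cost to regenerate `ds`: its own computation cost plus that of every
    deleted dataset it depends on; a stored predecessor costs nothing."""
    if ds.stored:
        return 0.0
    return ds.computation_cost + sum(generation_cost(p) for p in ds.predecessors)

# A small provenance chain d1 -> d2 -> d3: d1 is stored, d2 and d3 deleted.
d1 = Dataset("d1", 10.0, 1.0, stored=True)
d2 = Dataset("d2", 5.0, 2.0, stored=False, predecessors=[d1])
d3 = Dataset("d3", 3.0, 0.5, stored=False, predecessors=[d2])
print(generation_cost(d3))   # 8.0: regenerate d2 (5) then d3 (3); d1 is stored
```

Under this reading, the total application cost sums each stored dataset's storage cost rate with each deleted dataset's generation cost weighted by how often it is needed, which is the quantity the benchmarking algorithms minimise.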
Unlike the minimum cost benchmarking approaches, a storage strategy must also account for users' preferences: users may wish to store particular datasets for reasons other than cost, e.g. to guarantee immediate access to them. Based on these considerations, we develop two cost-effective storage strategies for different situations: 1) the cost rate based strategy is highly efficient with fairly reasonable cost-effectiveness, and 2) the local-optimisation based strategy is highly cost-effective with very reasonable time complexity. To the best of our knowledge, this thesis is the first comprehensive and systematic work investigating the computation and storage trade-off in the cloud with the aim of reducing the overall application cost. By proposing innovative concepts, theorems and algorithms, the major contribution of this thesis is that it helps dramatically reduce the cost for both cloud users and service providers of running computation and data intensive scientific applications in the cloud.
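A cost rate based decision with a user-preference override can be sketched as below. This is a hedged illustration of the general idea only; the function names, the pinning flag, and the exact comparison are assumptions, not the strategy as specified in the thesis.

```python
# Illustrative sketch of a cost-rate based storage decision with user
# preferences. Names and the decision rule are assumptions for exposition.

def generation_cost_rate(generation_cost, usage_frequency):
    """Expected regeneration cost per unit time if the dataset is deleted."""
    return generation_cost * usage_frequency

def keep_dataset(storage_cost_rate, generation_cost, usage_frequency,
                 user_pinned=False):
    """Keep a dataset if the user has pinned it for immediate access, or if
    storing it costs no more per unit time than regenerating it on demand."""
    if user_pinned:
        return True
    return storage_cost_rate <= generation_cost_rate(generation_cost,
                                                     usage_frequency)

print(keep_dataset(20.0, 4.0, 10))        # True: 20 <= 4 * 10
print(keep_dataset(20.0, 4.0, 2))         # False: 20 > 4 * 2
print(keep_dataset(20.0, 4.0, 2, True))   # True: pinned despite the cost
```

The per-dataset comparison makes the strategy highly efficient; the local-optimisation based strategy instead trades some of that efficiency for better overall cost-effectiveness by optimising decisions over segments of the DDG rather than one dataset at a time.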
Copyright © 2012 Dong Yuan.
A thesis submitted for the degree of Doctor of Philosophy, Swinburne University of Technology, 2012.