Needless to say, big data is “big” in size, and with that size comes the challenge of storing and processing it. Oftentimes you don’t need to worry about data storage because, in many cases, you are simply handed a dataset on your hard drive or on a server somewhere. If you are involved in the data collection side of big data, however, you will need to consider where and how to store the data. It varies by industry, but when we talk about big data with today’s technology (late 2016), it usually means data over 1 terabyte (TB) in size. On a recent PC (e.g., one with an Intel Core i7), you may still be able to work through a CSV file several hundred gigabytes (GB) in size with the right tools (far beyond what a spreadsheet program like Microsoft Excel can open). Storing and processing data over 1 TB, however, is very challenging in today’s average computing environment. Right now the rough threshold is 1 TB, but a year from now that limit will have shifted. That is why people define big data as a dataset too large to handle with existing technology.
Figure 1. An example of a big data storage center
2. Where to store big data?
Local hard drive
One obvious way to store big data is to keep everything on your local hard drive. At minimum, you will need a fairly large hard drive (internal or external) and a computer fast enough to process the data. Some people will argue that if you can save the data locally on your hard drive, then it is not big data. That is true to a point, but unless you are in an enterprise environment with up-to-date tech support, your first and most likely option is to start with what you already have. For example, you may be handed a 5-terabyte dataset to analyze; your best bet is to buy an external hard drive large enough to store it locally. For the actual analysis, you may need to split the data into multiple pieces so that your analysis software can load each piece into memory and process it. Nevertheless, if the dataset you deal with has a fixed size and is not extremely large (usually less than 1-2 TB), then housing it on your local drive is a reasonable option.
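The split-and-process approach above can be sketched in a few lines of Python. This is a minimal sketch, not a full tool: the chunk size is illustrative, and a real analysis would replace the per-chunk step with its own logic.

```python
import csv

def process_in_chunks(path, chunk_size=100_000):
    """Read a CSV too large for memory in fixed-size chunks.

    Yields (header, rows) pairs so that each chunk can be analyzed,
    or written out to a smaller file, independently of the rest of
    the dataset. The chunk size here is purely illustrative.
    """
    with open(path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)       # keep the column names with every chunk
        chunk = []
        for row in reader:
            chunk.append(row)
            if len(chunk) >= chunk_size:
                yield header, chunk
                chunk = []
        if chunk:                   # emit the final, partial chunk
            yield header, chunk
```

Because the function is a generator, only one chunk ever sits in memory at a time, which is the whole point when the file is larger than RAM.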
Cluster computing system
Another way to store big data is a clustered computing environment. A cluster computing system is a group of computers connected together into a common resource pool. This is where it gets tricky: building a clustered computing environment requires dedicated staff knowledgeable about distributed and parallel systems, along with software and hardware to manage the cluster network, share resources, and schedule tasks on individual nodes. A framework such as Apache Hadoop will need to be installed and configured to operate the cluster. The good thing about this option is that the configuration is flexible and customizable: you can add or remove individual computers in the cluster, and each computer does not necessarily need fast computing power. The downside is that it requires an up-front investment and dedicated staff who know how to set up and maintain such a system.
Cloud server
Last but not least, an option often preferable to setting up your own cluster system is to use a public cloud server. With a cloud server, you can stream data directly over the internet without having to worry about storing it locally on your computer or on a cluster system, and that is the trend in big data analytics. As more and more companies and agencies migrate to cloud-based computing environments, most data collection and storage is being handled over the network. One example is Apple Music or Amazon Music, where you can play thousands of songs on your phone without having to store them locally on your device.
Big data analytics is becoming just like Apple Music or Amazon Music: you stream hundreds of terabytes of data to a cloud server and access them on demand while carrying out an analysis. Companies that offer cloud services charge customers based on usage. For example, Amazon Web Services charges $0.022/GB/month for storing data and $0.05/GB for downloading data, and Amazon EMR, Amazon’s big data platform, charges from $0.011/hour to $0.27/hour depending on computing intensity. Analyzing 2 TB of data for 10 hours per day on such a cloud system could cost $40 to $50 per month, plus some additional cost for downloading the results, for a yearly total of roughly $500 to $600. A cloud system is certainly not cheap, but it is a very cost-effective alternative to building and maintaining your own cluster, and a huge step up from running data-intensive analyses on your local machine.
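As a quick sanity check on those figures, the arithmetic behind the monthly and yearly estimates can be written out explicitly. The prices are the late-2016 numbers quoted above and will certainly have changed since; download charges are left out here, as in the text.

```python
# Back-of-the-envelope cost estimate for the cloud scenario above.
# All rates are the ones quoted in the text (late 2016), not current prices.

STORAGE_PER_GB_MONTH = 0.022   # Amazon S3 storage, $/GB/month (as quoted)
EMR_PER_HOUR = 0.011           # low-end Amazon EMR rate, $/hour (as quoted)

data_gb = 2000                 # 2 TB of data
hours_per_day = 10             # analysis time per day
days_per_month = 30

storage_cost = data_gb * STORAGE_PER_GB_MONTH                 # 44.0 $/month
compute_cost = EMR_PER_HOUR * hours_per_day * days_per_month  #  3.3 $/month
monthly_total = storage_cost + compute_cost                   # 47.3 $/month
yearly_total = monthly_total * 12                             # ~568 $/year
```

Note that at these rates the bill is dominated by storage, not compute, which is why the monthly figure barely moves with the choice of EMR instance at the low end.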
In summary, there are roughly three ways to store big data. The first option, the local hard drive, is probably the most popular method for typical researchers who want to analyze large datasets. The second option, an in-house cluster, is preferred among large organizations that can afford to build one; one obvious advantage is that the data are stored and processed in-house, so there are few confidentiality or privacy issues regarding data storage or sharing. The third option, the public cloud, is probably the most feasible for people who do not need to process confidential data: storing and processing data on a cloud is fast and can be cost-effective. For this reason, I will focus on the third option of using a cloud system.
3. How to process big data? A cloud system approach
The easiest entry into big data analysis is a cloud-based system. Google BigQuery, for example, offers instant access to gigabytes and terabytes of data through a cloud system. Figure 2 shows the basic web interface of Google BigQuery. BigQuery allows users to upload large volumes of data to the cloud and then use SQL (Structured Query Language) to analyze the data remarkably quickly. All of the optimization and indexing is done automatically by Google, and users have an end-to-end platform as soon as they finish uploading their dataset to the cloud.
Figure 2. Google BigQuery Console
SQL is a fairly easy and nearly universal language for database management. As shown in Figure 2, several well-known datasets are already stored on the platform and publicly available for anyone to pull and use. For example, terabytes of NYC taxi trip data have been uploaded to BigQuery and are publicly available through the platform.
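To give a flavor of the kind of query involved, here is a minimal sketch using Python's built-in sqlite3 module on a made-up, miniature taxi-trip table. The table and column names are hypothetical, but the same style of SQL statement, pointed at BigQuery's copy of the real data, would scan terabytes on the cloud side.

```python
import sqlite3

# Hypothetical miniature of a taxi-trip table, used only to illustrate
# the kind of SQL aggregation you would run on BigQuery's public data.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE taxi_trips (
        pickup_hour   INTEGER,
        trip_distance REAL,
        fare_amount   REAL
    )
""")
conn.executemany(
    "INSERT INTO taxi_trips VALUES (?, ?, ?)",
    [(8, 2.1, 9.5), (8, 3.4, 12.0), (17, 1.2, 6.5), (17, 5.0, 18.0)],
)

# Average fare by pickup hour -- a typical exploratory aggregation.
rows = conn.execute("""
    SELECT pickup_hour, AVG(fare_amount) AS avg_fare
    FROM taxi_trips
    GROUP BY pickup_hour
    ORDER BY pickup_hour
""").fetchall()
conn.close()
```

The appeal of BigQuery is precisely that the statement stays this simple while the engine handles the scale.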
Uploading data to BigQuery can be a time-consuming process; an upload can take several days. As shown in Figure 3, users can upload data to Google Cloud Storage through a script, and the Google Cloud SDK offers tools such as gsutil to support this process. If any errors occur along the way, the entire upload may fail, so it is good practice to test-run a small subset of a large dataset through the script before initiating a full upload. The diagram below describes the data uploading and processing steps through Google BigQuery.
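One simple way to do such a test run is to carve off the first few thousand rows of the big file and push that sample through the full pipeline (gsutil upload, then a BigQuery load) before committing to a multi-day transfer. A minimal sketch of the sampling step, with file names and row count purely illustrative:

```python
import csv

def make_sample(src_path, dst_path, n_rows=1000):
    """Write the header plus the first n_rows of src_path to dst_path.

    Pushing a small sample like this through the whole upload pipeline
    first lets schema or encoding errors surface early, instead of
    after days of uploading. Paths and row count are illustrative.
    """
    with open(src_path, newline="") as src, \
         open(dst_path, "w", newline="") as dst:
        reader = csv.reader(src)
        writer = csv.writer(dst)
        writer.writerow(next(reader))   # copy the header row
        for i, row in enumerate(reader):
            if i >= n_rows:
                break
            writer.writerow(row)
```

Once the sample loads cleanly into a BigQuery table, the same script and schema can be reused for the full file with much more confidence.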
Figure 3. End-to-end data pipeline linking users and the end products
The challenge of big data starts with processing large volumes of data. If the data is constantly streaming in from multiple sources, a cloud-based system is the most desirable option; if the data is static and small enough to handle locally, a local database system such as MySQL or PostgreSQL can be used to store and process it. In this post, I have given a simple overview of using Google BigQuery to store data in the cloud; the next post will be more of a step-by-step tutorial on storing and processing big data through Google BigQuery.