Big data is such a buzzword that everyone is talking about it, yet few of us know what it actually means or how we, ordinary citizens, can leverage its power to our advantage. This post is intended to help laypersons, especially those in the government and non-profit sectors with little programming knowledge or experience, unleash the power of big data and incorporate big data analysis into their daily jobs using publicly available open-source tools and services. I will explore this topic in three separate posts: 1) introduction; 2) data storage and processing; and 3) data analysis and visualization.
1. What is big data?
Researchers and professionals in computer science typically describe big data as having three key properties: volume, variety, and velocity (Figure 1). Big data are enormous in size, contain multiple data types, both structured and unstructured, and accumulate quickly, i.e., they are dynamically collected and processed, often in real time. With this formal definition in mind, the way I understand big data is that the data are too big to be handled by conventional software programs, such as MS Excel, Access, or even MySQL, which specializes in handling large data sets. It's like the famous quip about pornography: you know it when you see it. Simply put, if you can't handle the size of the data with the typical software at hand, then it's probably big data.
2. Open data movement and new opportunities
As I discussed previously, big data are not so different from the ordinary data we see every day; they are simply too big to handle and require special techniques to open. Once opened and processed, there is nothing special about big data. The power of big data, however, comes from the fact that one can pull meaningful patterns from billions of records without aggregation. Previously, achieving such granularity in data analysis was very challenging because of the time and resources it takes to analyze a massive body of data. Thanks to recent technological advances, such as Hadoop and its MapReduce programming model, processing and analyzing big data is now much easier and faster.
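To give non-programmers a flavor of what MapReduce actually does, here is a toy, single-machine sketch in Python. It counts words with the same map/shuffle/reduce steps a real Hadoop job would distribute across a cluster; the function names are my own illustration, not Hadoop's API:

```python
from collections import defaultdict

def map_phase(records):
    # Map: emit a (key, 1) pair for every word in every record.
    for record in records:
        for word in record.split():
            yield word, 1

def shuffle(pairs):
    # Shuffle: group all emitted values by their key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: collapse each key's values into a single count.
    return {key: sum(values) for key, values in groups.items()}

records = ["taxi trip data", "taxi fare data"]
counts = reduce_phase(shuffle(map_phase(records)))
print(counts)  # {'taxi': 2, 'trip': 1, 'data': 2, 'fare': 1}
```

The point of the design is that map and reduce each touch only small, independent pieces of the data, which is why the same pattern scales from two sentences to billions of records.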
In recent years, the open-data and open-government movements have swept across the globe. Most big cities around the world now have an open data policy (Figure 2). In the U.S., 46 cities and counties have one. NYC is definitely leading the crowd by engaging the public with new technology and tools that allow easy access to its data. NYC had made 1,350 data sets available to the public as of July 2015, and plans to publish all of its digital public data by 2018. Other cities, such as Los Angeles, Chicago, Boston, and Seattle, are following New York City's suit. Thanks to the widespread adoption of open data policies around the world, it is now possible to access all kinds of government data at your fingertips. The data opened to the public range from a simple tally of federal employment to live traffic camera feeds from every street corner of New York City.
So, what are the implications of open data policy for big data analytics? Well, now that all kinds of government data are freely available and accessible through a web interface, it becomes much easier, and more transparent, to analyze what governments (a.k.a. big brothers) are doing. The data available from open data platforms are mostly micro-data that are spatially and temporally referenced. Such fine-grained data are too big to be handled by traditional desktop software. For example, the NYC TLC (Taxi and Limousine Commission) has made available GPS-derived trip and fare data for all the yellow and green taxi cabs operating in its jurisdiction. The full data set, covering 2009 to 2015, is about 500 GB, and a single month of yellow cab trip data is about 1 GB. Working with a file this big, let alone opening it, would be very challenging with traditional software like Excel or Stata. With big data technology, however, data of this size pose no real limit. In short, big data analytics unleashes the real power of open data policy by giving ordinary citizens the opportunity to access massive government data.
3. Future of big data/open data for public and non-profit sectors
Big data coupled with the open data movement unlocks new opportunities for the public and non-profit sectors. Tools for big data analytics are increasingly available for free, and the barriers to accessing, collecting, and processing big data keep falling. Previously, building a custom big data platform required a special team of computer scientists, database administrators, and computing managers. Thanks to database-as-a-service (DBaaS), or cloud databases, there is no need to deal with the complicated tasks of setting up physical hardware, installing the appropriate software, and tuning the system for optimal performance (Figure 3). Service providers take care of all the back-end maintenance so that users can simply use the database and focus on data analysis and problem-solving.
I envision that big data analytics will be the new normal in the coming era. And the future of big data is much brighter for non-technical people in the government and non-profit sectors because of the wide availability of open-source database tools and platforms. Tools and services for big data analytics will become much easier to learn and use, requiring minimal or no coding skills to run an analysis. At the end of the day, what we really want is the "analysis" part of big data, not the "installation" or "maintenance" part of the database system. Major IT companies, like IBM, are already moving in this direction, and new startups, like Alteryx, offer creative tools for analyzing big data without the need to write code (Figure 4).
4. Open-source tools and services for big data analysis
There are dozens of cloud database services that cost nothing or charge small per-usage fees. For example, Google Cloud offers 1 GB of data storage for free, and Google BigQuery includes 1 TB of query processing per month at no charge. For a simple analysis that does not require publishing results on the web, these cloud services cost a fraction of what developing and maintaining your own database system would. Pricing structures vary by provider, and steep competition among providers makes cloud-based systems very affordable for typical users.
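To make the free tier concrete, here is a hedged Python sketch of what a BigQuery analysis looks like. The table name `my_project.tlc.yellow_trips` and the column `pickup_datetime` are placeholders for taxi data you would load yourself; the query-building helper is my own, while the client calls follow the `google-cloud-bigquery` library:

```python
def monthly_trip_query(table, year):
    """Build a standard SQL string counting taxi trips per month.

    `table` and the `pickup_datetime` column are assumed names for
    data you have loaded into your own BigQuery project.
    """
    return (
        "SELECT EXTRACT(MONTH FROM pickup_datetime) AS month, "
        "COUNT(*) AS trips "
        f"FROM `{table}` "
        f"WHERE EXTRACT(YEAR FROM pickup_datetime) = {year} "
        "GROUP BY month ORDER BY month"
    )

def run(table, year):
    # Requires `pip install google-cloud-bigquery` and Google Cloud
    # credentials; shown for illustration, not executed here.
    from google.cloud import bigquery
    client = bigquery.Client()
    rows = client.query(monthly_trip_query(table, year)).result()
    return [(row.month, row.trips) for row in rows]
```

BigQuery bills by bytes scanned, so a query like this over a full year of taxi data still fits comfortably inside the monthly free quota.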
There are many big data solutions on the market right now. Amazon, Microsoft, IBM, Google, and Oracle are some of the major companies offering them, and smaller companies, like Splunk and Talend, offer innovative solutions of their own. I do not want to exhaust you with all of them, so I will introduce some of the tools I use. I choose my tools based on three things: cost, ease of use, and interoperability. First of all, the cheaper the better, but most of the time you get what you pay for, so I look for the solution that gives the most bang for the buck. Ease of use is also important, as I am not a computer scientist. Interoperability means how flexibly the solution works across different platforms.
The big data solutions of my choice are Google BigQuery, Tableau, and R. These tools may or may not work for you, but they are good enough to get you started. Google offers a variety of tools for big data analysis, and BigQuery provides a very fast and flexible solution at a fraction of the cost of other providers. Tableau is great data analytics software that lets you create beautiful charts, maps, and more. When Tableau first came out, it offered only basic functions, but the software has improved enormously over the past several years. It can connect to various big data servers, including Amazon AWS, Microsoft Azure, and Google Cloud, and, most usefully for me, to Google BigQuery. Tableau Desktop, which offers these connection capabilities, is not free, but you can get a one-year license if you are a student or faculty member. You can also try the free Tableau Public, albeit with limited functionality. Lastly, R is really powerful statistical software that is free and open-source. The coding is simple and elegant, though some people may feel a little intimidated because it requires some programming experience. But if you are willing to learn R, you will be amazed at how much you can do with it. Combining the capabilities of these three tools makes big data analytics much easier and more affordable.
Whether you like it or not, big data will be the new normal in any organization. For-profit companies are already in the game and are actively contributing to what people call the big data revolution. Some academic researchers are leading the trend in big data technology, and many universities now offer data science courses and certificates to foster a next-generation workforce versed in big data analytics. Government agencies and non-profits, on the other hand, are lagging behind in this sea change in data analysis. As with open data initiatives around the globe, people in the government and non-profit sectors will need to be able to deal with an impressive array of big data produced from all sectors of society, and the technology that enables big data analysis is becoming ever more accessible and affordable. The purpose of my posts is to introduce big data to folks in the government and non-profit sectors and to help them get hands-on experience storing, processing, analyzing, and visualizing big data for everyday tasks. Subscribe to my blog for future updates!