Letting the Outside In: the Internet as Big Data’s Biggest Problem
The last decade has seen astonishing advances in the ability of companies to capture, harmonize, process, analyze, draw insight from and act upon a dizzying amount of data. However, for the most part, the raw data that they operate on has been internal in nature: customer behavior, communications, logistics and supply chain, internal business processes, competitor tracking, cost optimization, sales support and so on. Companies tend to ignore a vast and extremely rich source of data ripe for analysis: the Internet at large.
Unfortunately, extracting usable information at scale from the Internet is a hard problem; indeed, it is a Big Data problem that dwarfs most internal Big Data challenges. Analysts currently spend significant time or money (or, usually, both) attacking this problem, using a mixture of manual effort, ad-hoc programming, and tools repurposed from other uses. Successful and cost-effective examples of augmenting internal data with external resources remain a rarity.
That’s changing. Ongoing shifts in the landscape of data on the Internet are dramatically changing the resources available to companies pursuing the promise of Big Data from the web. This talk describes some of these trends.
Specifically, we discuss:
- The evolution of search, from hierarchical to horizontal to vertical
- The rise of user-driven content: implicit, explicit, curatorial and relational
- The open data revolution
We explore how these elements combine to create a new environment for information engineers, knowledge managers and data analysts to leverage. We also discuss how these themes intersect with broader trends such as the consumerization of the enterprise and the API-fication of everything (X-as-a-Service).
Throughout, the technology, business and consumer aspects of these transitions are illustrated using case studies from the industry, most especially from the Toronto-based data platform Quandl (www.quandl.com).
Abraham Thomas is the co-founder and chief data officer of Quandl (www.quandl.com), a search engine / platform for numerical data on the Internet. Quandl has indexed millions of financial and economic datasets from hundreds of sources across the Internet, and makes the data available to users in any form they want (including via API). Quandl also enables sophisticated cloud-based structured data management solutions via its Platform-as-a-Service offering.
Prior to founding Quandl, Abraham spent over a decade in the finance industry, most recently as senior portfolio manager and head of US bond trading for Simplex Asset Management. Simplex is a multi-billion-dollar multi-strategy hedge fund headquartered in Tokyo, with offices around the globe. While at Simplex, Abraham created, designed, implemented and traded one of the industry’s first ‘real-time’ bond arbitrage systems, utilizing a variety of sophisticated data engineering, live macroeconomic analysis and statistical arbitrage techniques.
Abraham has a Bachelor of Technology from the Indian Institute of Technology (IIT), Bombay. He is an expert data user as well as an experienced data engineer. He speaks very occasionally at conferences, on the subject of markets, technology and their intersection; for the most part he prefers to build rather than talk.