Data Engineering From A Data Scientist’s Perspective

We’ve had technical people focused on the ingestion and management of data for decades, but only recently has data engineering become a critical, widespread role. Why is that? This post outlines a somewhat contrarian view of why data engineering has become a critical function and how we might expect the role to evolve over time.

IIA expert Jesse Anderson recently wrote a nice piece discussing why data engineering and data science must be viewed as distinct skill sets and how organizations get into trouble when asking people to work outside of their core skills. That piece, along with some recent client discussions, led me to this post.

Database Administration, ETL, And Such

It wasn’t long ago that the roles focused on enterprise data fell largely into three areas. First, there were those who managed raw data collection into source systems. Such systems often use some sort of custom data format that is far from user friendly. Second, there were those focused on Extract, Transform, and Load (ETL) operations. ETL specialists extract data from a source system, perform whatever changes are needed to make it more user and analytics friendly, and then load that data into a repository intended for reporting and analysis. For many years, these repositories were overwhelmingly relational databases. Third, there were database administrators who managed the relational systems to ensure data was accessed efficiently and that proper security was in place.
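To make the ETL pattern concrete, here is a minimal sketch in Python using only the standard library. Everything in it is an assumption made for illustration: the pipe-delimited export, the column names, the `orders` table, and the rule that drops zero-amount orders are all hypothetical, and an in-memory SQLite database stands in for the relational repository. Real ETL tools do far more than this, but the three steps are the same.

```python
import csv
import io
import sqlite3

# Hypothetical raw export from a source system: cryptic column names,
# padded values, and dates in a compact format that is not user friendly.
RAW_EXPORT = """CUST_ID|ORD_AMT_CENTS|ORD_DT
0007|125000|20240105
0012|  9950|20240106
0007|     0|20240107
"""


def extract(raw_text):
    """Extract: read pipe-delimited rows out of the source export."""
    return list(csv.DictReader(io.StringIO(raw_text), delimiter="|"))


def transform(rows):
    """Transform: convert to analytics-friendly types and formats.

    Dropping zero-amount orders is an illustrative business rule only.
    """
    cleaned = []
    for row in rows:
        cents = int(row["ORD_AMT_CENTS"].strip())
        if cents == 0:
            continue
        d = row["ORD_DT"]
        order_date = f"{d[:4]}-{d[4:6]}-{d[6:]}"  # YYYYMMDD -> YYYY-MM-DD
        cleaned.append((int(row["CUST_ID"]), cents / 100, order_date))
    return cleaned


def load(rows, conn):
    """Load: write the cleaned rows into a relational table for reporting."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(customer_id INTEGER, amount REAL, order_date TEXT)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()


def run_etl(raw_text, conn):
    """Run the three steps in order against one export."""
    load(transform(extract(raw_text)), conn)


# In-memory SQLite stands in for the reporting repository.
conn = sqlite3.connect(":memory:")
run_etl(RAW_EXPORT, conn)
totals = conn.execute(
    "SELECT customer_id, SUM(amount) FROM orders "
    "GROUP BY customer_id ORDER BY customer_id"
).fetchall()
print(totals)
```

Once the data is loaded, an analyst can answer a question such as total spend per customer with a single SQL query, without ever touching the source system’s format.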

Much of the work of these traditional data roles is standardized both in terms of approach and in terms of mature tools that facilitate the work. For example, a relational database handles a lot of configuration work largely on its own. Database administrators don’t have to tell a database which disks to store data on or how to ensure relational integrity. Since relational technology is mature, it makes a lot of technically complex tasks very easy to execute if experts make use of the proper functions and commands. Similarly, ETL tools have adapters to access common source systems, built-in functionality to handle many of the commonly needed transformation operations, and hooks into all the common destination databases and repositories. For years, there existed a relatively small number of mature tools interfacing with a relatively small number of mature data repositories. Life wasn’t too bad and data engineers weren’t yet necessary.

What Is Driving The Need For Data Engineers?

The roles described previously all still exist today in largely their traditional state. However, with the rise of big data, cloud, and tighter integration of analytics with operational systems, these roles are no longer sufficient. Some additional skills are needed today, and data engineers have stepped in to fill the void.

In today’s world, we have a wide range of new data types beyond the traditional data of the past. Images, text, streaming data, and video aren’t friendly for ETL tools or relational databases, and so new tools are needed to help with processing such data and new types of data repositories are needed to efficiently store and analyze the data. However, most of these new tools and repositories are not yet mature and so require a whole lot of detailed coding to make them operate as needed. In the early days of Hadoop, for example, data engineers had to struggle mightily to get the (then) highly immature Hadoop environment to behave properly and perform well.

To make matters worse, organizations are finding that they need to make use of a wide range of these new data management toolsets and repositories. Worse yet, it is often necessary to integrate multiple immature technologies in order to achieve desired analytical goals. Data engineers use their skills and experience to figure out how to do this integration when, in many cases, there are few documented examples. The integration tasks require very detailed and complex technical work not just to get the pieces and parts of the data pipeline to work together, but also to have them work in a reasonably efficient and secure manner. In many ways, companies can end up with Rube Goldberg-style processes that manage to get a task done, but seem to require a lot more energy input and complexity than should be needed.

Adding even more complexity is the growing requirement to support hybrid architectures where data is not only spread across internal systems but spans internal systems and one or more distinct cloud environments. Making a data pipeline work gets even harder in such complex architectures. It is easy for outsiders to say, “Just pull that data together for us, what’s the holdup?” But, alas, while some tasks may be simple to define conceptually, they may not be at all simple to execute. That’s where data engineers need to step in.

A few differences jump out between data engineers and traditional data professionals. First, data engineers need to be much more focused on, and skilled at, creative problem solving. Next, data engineers need to be able to embrace and make use of an ever-wider array of tools and approaches. Last, data engineers need to give a lot more attention to integration and optimization between tools and platforms as opposed to optimizing workloads within a given tool or platform.

What’s The Future?

One safe bet is that the complexity and volume of data, the diversity of the tools and repositories to manage data, and the breadth and sophistication of analytics being applied to the data will all continue to increase. At the same time, much of what data engineers are doing today via brute force will become heavily standardized and require less skill and effort. In time, it won’t take a rare, top-notch data engineer to do a lot of the wrangling that organizations are struggling with today. Hadoop, for example, quickly evolved administration tools, security tools, and an application ecosystem that made it much easier to work with, though it still has gaps. (The discussion as to whether or not Hadoop is dying is for a different day.)

Does this mean that data engineers will go away? Not at all. The role will evolve. As today’s challenges are standardized, data engineers will move on to the next wave of challenges tied to the newest data, toolsets, architectures, and repositories. There will be plenty of data engineering work for years to come.

There is a parallel between data science and data engineering in this sense. Much of what used to consume a data scientist’s time and effort is being automated and standardized. “Citizen data scientists” can now handle a lot directly, while data scientists focus on the harder problems. Might we similarly soon see the evolution of “citizen data engineers” who make use of automated and standardized data engineering tools to handle many of the basics while data engineers handle the new frontiers? I see this as a very likely, if not inevitable, scenario.

Originally published by the International Institute for Analytics

Bill Franks