Prior to my time in Silicon Valley, I held roles as a Business Intelligence Engineer and a Data Engineer in the healthcare and technology sectors. I learned more from just two years of observing big-data practices than from all of my previous experience combined.
1. Compute is cheap compared to the time of Data Engineers
While I came from the land of query optimization, it had never occurred to me how inexpensive compute really is. Yes, some companies run on their own inexpensive internal compute, but this lesson holds true for every company out there. If you only compare the compute cost of a highly optimized query to that of an equivalent but substantially less efficient one, optimization looks great. However, once you factor in the time of the Data Engineer writing the query, which can be days or weeks for larger projects, the extra compute is clearly cheaper than the time spent optimizing. The same lesson applies to storage: instead of spending days writing, optimizing, and debugging a ‘merge’ statement, why not just snapshot the data at every refresh and store the result? Data can accumulate rapidly this way, perhaps 1TB per pipeline per year. The cost of that much storage? Pennies, and it is decreasing with time.
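The snapshot-instead-of-merge idea can be sketched in a few lines. This is a minimal, hypothetical example using SQLite; the table names (orders, orders_history) and columns are my own illustration, not from any specific pipeline.

```python
import sqlite3

# Hypothetical snapshot pattern: instead of an incremental 'merge', copy the
# entire source table at each refresh, stamped with the snapshot date.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, status TEXT)")
conn.execute("CREATE TABLE orders_history (snapshot_date TEXT, id INTEGER, status TEXT)")

def snapshot(conn, as_of):
    # Full copy each refresh -- no merge logic, no update/delete edge cases
    # to debug; the cost is storage, which is cheap.
    conn.execute(
        "INSERT INTO orders_history SELECT ?, id, status FROM orders",
        (as_of,),
    )

conn.execute("INSERT INTO orders VALUES (1, 'open')")
snapshot(conn, "2024-01-01")
conn.execute("UPDATE orders SET status = 'shipped' WHERE id = 1")
snapshot(conn, "2024-01-02")

# Every refresh's full state remains queryable as of its snapshot date.
rows = conn.execute(
    "SELECT snapshot_date, status FROM orders_history ORDER BY snapshot_date"
).fetchall()
```

The trade-off is exactly the one described above: storage grows linearly with each refresh, but engineer time spent on merge logic drops to near zero.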
2. SQL is code, and should be treated as such
While SQL is technically a query language, many companies treat it in much the same way as code: it is committed, reviewed, held to formatting rules, and heavily documented, because complex queries are challenging for other developers to debug. Many large companies have tremendous infrastructure to support their data strategy. Data is baked into the DNA of successful technology companies, and many internal tools have been built over the years to make querying, moving, storing, visualizing, and sharing data easy.
As systems change, so must queries. With SQL committed alongside object-oriented application code, it’s easy for application engineers to notify Data Engineers of any value changes. Additionally, as business context evolves over time, it’s easy to identify where all instances of data and pipelines exist, and ensure that definitions evolve synchronously.
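One way this plays out in practice is keeping SQL as a named, documented constant in the same repository as the application code, with a lightweight test pinning its behavior. This is a hypothetical sketch; the query, table (user_logins), and parameter names are my own illustration.

```python
import sqlite3

# Hypothetical 'SQL as code': the query lives in version control, commented
# and reviewed like any other function, so application engineers can see it
# when they change the schema or the values it depends on.
ACTIVE_USERS_SQL = """
-- Users considered 'active': at least one login on or after the given date.
-- Reviewed alongside the application code that writes user_logins.
SELECT user_id
FROM user_logins
WHERE login_date >= :since
"""

def active_users(conn, since):
    return [row[0] for row in conn.execute(ACTIVE_USERS_SQL, {"since": since})]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE user_logins (user_id INTEGER, login_date TEXT)")
conn.executemany(
    "INSERT INTO user_logins VALUES (?, ?)",
    [(1, "2024-01-05"), (2, "2023-12-01")],
)

# A test like this fails in review if the schema or the business definition
# of 'active' drifts, which is exactly the early warning described above.
result = active_users(conn, "2024-01-01")
```

Because the query is committed next to the code that produces the data, a schema change and the corresponding query change can land in the same review.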