What I learnt today — Lambda and Airflow
Thanks to the great article by Insight Data.
My takeaways:
Lambda architecture enables real-time handling of data while preserving data integrity. It is horizontally scalable and assumes that things will go wrong.
Real-time data is served as is and corrected as required by a batch process that rebuilds the data from the beginning. Typically there are three tables: one that records the real-time data as it arrives, one that stores the results of the last batch process, and one that stores only the delta, i.e. the values that have changed since the last batch run.
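To make the three-table idea concrete, here is a minimal sketch in plain Python (dictionaries standing in for tables; all names are made up for illustration): queries merge the last batch view with the real-time delta, and a batch run rebuilds everything from the raw data and resets the delta.

```python
# Hypothetical stand-ins for the three tables.
batch_view = {"user_a": 100, "user_b": 40}   # results of the last batch run
realtime_delta = {"user_a": 3, "user_c": 7}  # changes since that run


def query(key):
    """Serve a value by merging the batch view with the real-time delta."""
    return batch_view.get(key, 0) + realtime_delta.get(key, 0)


def run_batch(raw_events):
    """Rebuild the batch view from all raw data, then reset the delta."""
    batch_view.clear()
    for key, value in raw_events:
        batch_view[key] = batch_view.get(key, 0) + value
    realtime_delta.clear()
```

The point of the pattern is that `query` stays fast and approximately correct in real time, while `run_batch` periodically restores exact results from the source of truth.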
Airflow is a workflow engine for authoring, scheduling and monitoring batch processes (daily jobs, hourly jobs). It can, however, run jobs at one-minute or five-minute intervals.
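As a sketch of what that looks like, here is a minimal Airflow DAG definition (a configuration-style fragment based on the Airflow 2.x API; the DAG id and command are hypothetical) that runs a batch job every five minutes:

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

# Hypothetical DAG that rebuilds the batch view every five minutes.
with DAG(
    dag_id="batch_rebuild",               # made-up name for illustration
    start_date=datetime(2024, 1, 1),
    schedule_interval=timedelta(minutes=5),
    catchup=False,                        # don't backfill missed intervals
) as dag:
    rebuild = BashOperator(
        task_id="rebuild_batch_view",
        bash_command="echo 'rebuilding batch view'",  # placeholder command
    )
```

Airflow then handles the scheduling, retries and monitoring of each run.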
Now I am curious about:
- How costly (in infrastructure and productivity) is it to run a batch process?
- What is Plan B when the batch process crashes?
Did I get that right? Or is there a simpler way to understand this?