Data Lake Solution Accelerator for Low Latency Data Processing
- The requirement was to develop a data platform that collects, organizes, and processes data, provides insight into business and operational aspects, and enables development of customer value-added, data-driven product features and dashboards.
- Raw data was stored with no oversight of the contents
- The platform needs defined mechanisms to catalog and secure data. Without these elements, data cannot be found or trusted, resulting in a “data swamp”.
- Data is streamed from source systems in real time (batch load scripts are also in place).
- Connectors developed for Oracle DB (PeopleSoft, Banner) and Workday.
- Uses the Oracle DB Streams feature to identify changes from the redo logs and stream them to Kafka.
- Data access is controlled with views set up in Apache Hive, which is connected to the data lake.
- ETL jobs run in a loop, identify changed files (via Hive), and update the Report Mart. Sample ETL scripts and reports were developed for HR Diversity data. PostgreSQL acts as the Report Mart.
- Data changes from source systems are reflected in the reports within two minutes.
- Easier and quicker to populate as no transformation is involved
- Allows importing any amount of data, including data that arrives in real time
- Allows organizations to generate different types of insights including reporting on historical data
- Ability to store all types of structured and unstructured data
- Elimination of data silos
- Democratized access to data via a single, unified view of data
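
The looping ETL step described above (find what changed, then update the Report Mart) can be illustrated with a minimal sketch. This is not the accelerator's actual code: it assumes change detection by a per-row modification timestamp, uses an in-memory SQLite database as a stand-in for the PostgreSQL Report Mart, and uses a plain list of dictionaries in place of rows read through Hive. The table name `diversity_report`, its columns, and the functions `make_mart` and `run_etl_pass` are all hypothetical.

```python
import sqlite3


def make_mart():
    """Create an in-memory stand-in for the Report Mart (assumption: SQLite, not PostgreSQL)."""
    mart = sqlite3.connect(":memory:")
    mart.execute(
        "CREATE TABLE diversity_report ("
        "employee_id INTEGER PRIMARY KEY, department TEXT, modified_at INTEGER)"
    )
    return mart


def run_etl_pass(source_rows, mart, watermark):
    """Copy rows modified after `watermark` into the mart and return the new watermark.

    `source_rows` is a hypothetical stand-in for rows queried through Hive.
    """
    changed = [row for row in source_rows if row["modified_at"] > watermark]
    for row in changed:
        # INSERT OR REPLACE overwrites the existing row that has the same primary key.
        mart.execute(
            "INSERT OR REPLACE INTO diversity_report "
            "(employee_id, department, modified_at) VALUES (?, ?, ?)",
            (row["employee_id"], row["department"], row["modified_at"]),
        )
    mart.commit()
    # If nothing changed, keep the previous watermark.
    return max((row["modified_at"] for row in changed), default=watermark)
```

In a loop, `run_etl_pass` would be called repeatedly with the watermark returned by the previous pass, so each pass touches only rows changed since the last one. The polling interval is a design choice; it would need to be short enough to stay within the two-minute reporting target.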