The Problem
- The requirement was to develop a data platform that collects, organizes, and processes data, provides insight into business and operational aspects, and enables the development of value-added, data-driven product features and dashboards for customers.
- Raw data was stored with no oversight of the contents.
- The platform needs defined mechanisms to catalog and secure data. Without these elements, data cannot be found or trusted, resulting in a “data swamp”.
The Solution
- We developed a data lake solution with the following features:
- Real-time streaming of data from source systems.
- Connectors developed for Oracle DB (PeopleSoft, Banner) and Workday.
- Data access is controlled through views defined in Apache Hive, which is connected to the data lake.
- ETL jobs run in a loop, identify changed files (via Hive), and update the Report Mart. Sample ETL scripts and reports were developed for HR Diversity data. PostgreSQL serves as the Report Mart.
- Data changes from the source systems are reflected in the reports within two minutes.
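
The ETL loop described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual scripts: the in-memory `sqlite3` database stands in for the PostgreSQL Report Mart, the `listing` dictionary (file path to modification time) stands in for whatever the ETL obtains through Hive, and the `hr_diversity` table, its columns, and the 30-second polling interval are all hypothetical. Any interval comfortably below two minutes would be consistent with the latency stated above.

```python
import sqlite3
import time
from typing import Callable, Dict, Iterable, List, Tuple

Row = Tuple[int, str]  # (employee_id, department): hypothetical report columns


def init_mart(conn: sqlite3.Connection) -> None:
    """Create the hypothetical report table if it does not exist yet."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS hr_diversity ("
        "employee_id INTEGER PRIMARY KEY, department TEXT NOT NULL)"
    )


def find_changed(listing: Dict[str, float], seen: Dict[str, float]) -> List[str]:
    """Return paths that are new or modified since the previous pass."""
    return sorted(
        path for path, mtime in listing.items()
        if path not in seen or mtime > seen[path]
    )


def run_once(
    conn: sqlite3.Connection,
    listing: Dict[str, float],
    seen: Dict[str, float],
    read_rows: Callable[[str], Iterable[Row]],
) -> List[str]:
    """Load rows from changed files into the mart and record what was processed."""
    changed = find_changed(listing, seen)
    for path in changed:
        # SQLite upsert; on PostgreSQL the equivalent would be
        # INSERT ... ON CONFLICT (employee_id) DO UPDATE.
        conn.executemany(
            "INSERT OR REPLACE INTO hr_diversity (employee_id, department) "
            "VALUES (?, ?)",
            list(read_rows(path)),
        )
        seen[path] = listing[path]
    conn.commit()
    return changed


def run_forever(
    conn: sqlite3.Connection,
    get_listing: Callable[[], Dict[str, float]],
    read_rows: Callable[[str], Iterable[Row]],
    poll_seconds: float = 30.0,  # assumed interval, kept below two minutes
) -> None:
    """Poll for changed files indefinitely."""
    seen: Dict[str, float] = {}
    while True:
        run_once(conn, get_listing(), seen, read_rows)
        time.sleep(poll_seconds)
```

Tracking the last processed modification time per file means an unchanged file is skipped on later passes, so each pass only touches what actually changed.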
The Result
- Easier and quicker to populate, because no transformation is involved
- Allows any amount of data to be imported as it arrives in real time
- Allows organizations to generate different types of insights including reporting on historical data
- Ability to store all types of structured and unstructured data
- Elimination of data silos
- Democratized access to data via a single, unified view