Howard University - Data Lake Solution Accelerator for Low Latency Data Processing
The requirement was to develop a data platform that collects, organizes, and processes data, provides insight into business and operational activity, and enables the development of customer value-added, data-driven product features and dashboards.
Raw data was stored with no oversight of its contents.
The platform needs defined mechanisms to catalog and secure data. Without these elements, data cannot be found or trusted, resulting in a “data swamp”.
Real-time streaming of data from source systems (batch load scripts are also in place).
Connectors developed for Oracle DB (PeopleSoft, Banner) and Workday.
Uses the Oracle DB Streams feature to identify changes from the redo logs and stream them to Kafka.
Data access is controlled with views set up in Apache Hive, which is connected to the data lake.
ETLs run in a loop, identify changed files (via Hive), and update the Report Mart. Sample ETL scripts and reports were developed for HR diversity data. PostgreSQL acts as the Report Mart.
Data changes from the source systems are reflected in the reports within two minutes.
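The ETL loop described above can be illustrated with a minimal sketch of a watermark-based incremental pass. This is not the accelerator's actual script. The record shape (ChangeRecord), the function names (run_etl_pass, make_in_memory_source), and the use of a timestamp watermark are all assumptions made for illustration. An in-memory list stands in for the Hive view and a dictionary stands in for the PostgreSQL Report Mart.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List


@dataclass(frozen=True)
class ChangeRecord:
    """One changed row as an ETL might read it from a lake view (hypothetical shape)."""
    employee_id: str
    department: str
    changed_at: float  # epoch seconds when the change landed in the lake


def run_etl_pass(
    fetch_changes: Callable[[float], Iterable[ChangeRecord]],
    report_mart: Dict[str, str],
    watermark: float,
) -> float:
    """Apply every change newer than `watermark` and return the new watermark."""
    new_watermark = watermark
    for record in fetch_changes(watermark):
        # Upsert: the latest value for a key replaces the previous one.
        report_mart[record.employee_id] = record.department
        new_watermark = max(new_watermark, record.changed_at)
    return new_watermark


def make_in_memory_source(rows: List[ChangeRecord]) -> Callable[[float], List[ChangeRecord]]:
    """Stand-in for a query against a Hive view filtered on the change timestamp."""
    def fetch(since: float) -> List[ChangeRecord]:
        newer = [r for r in rows if r.changed_at > since]
        return sorted(newer, key=lambda r: r.changed_at)
    return fetch


# Assumed sample data; the dict plays the role of the Report Mart table.
lake_rows = [
    ChangeRecord("E1", "Finance", 100.0),
    ChangeRecord("E2", "Biology", 110.0),
]
mart: Dict[str, str] = {}
fetch = make_in_memory_source(lake_rows)

watermark = run_etl_pass(fetch, mart, watermark=0.0)      # first pass loads both rows
lake_rows.append(ChangeRecord("E1", "Registrar", 120.0))  # a later source change
watermark = run_etl_pass(fetch, mart, watermark)          # second pass applies only the new change
```

In a real deployment the pass would presumably run on a short timer so that changes reach the reports within the two-minute window, the fetch step would query the Hive view, and the upsert would be a PostgreSQL statement such as INSERT ... ON CONFLICT. Those details are not given in the source and are left out of the sketch.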
Easier and quicker to populate, as no transformation is involved
Allows any amount of data to be imported as it arrives in real time
Allows organizations to generate different types of insights, including reporting on historical data
Ability to store all types of structured and unstructured data
Elimination of data silos
Democratized access to data via a single, unified view of data