
Data Processing

Once data is captured through extraction (refer to the Data Extraction Section), it passes through a processing phase before delivery via webhooks or the ROOK API. This phase harmonizes, standardizes, cleans, normalizes, and structures the data to ensure quality, consistency, and usability.


Key Processing Steps

ROOK’s data processing pipeline consists of five core stages:

  1. Data Harmonization
  2. Data Standardization
  3. Data Cleaning
  4. Data Normalization
  5. Data Structuring

These stages enable ROOK to deliver high-quality, consolidated, and structured health data.


1. Data Harmonization

Harmonization ensures consistency across data formats, units, and definitions from diverse sources.

Examples:

  • Converting distance values into a unified unit (e.g., kilometers).
  • Adjusting timestamps to align with the user’s local time zone.

Outcome: Uniform and consistent data representation across all health data sources.
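
As a minimal sketch of the two conversions above (the unit factors, field names, and offsets here are illustrative assumptions, not ROOK’s internal implementation):

```python
from datetime import datetime, timedelta, timezone

def harmonize_distance_km(value: float, unit: str) -> float:
    """Convert a distance reading to the unified unit (kilometers)."""
    factors = {"km": 1.0, "m": 0.001, "mi": 1.609344}
    return value * factors[unit]

def to_local_time(utc_ts: datetime, utc_offset_hours: int) -> datetime:
    """Adjust a UTC timestamp to the user's local time zone."""
    return utc_ts.astimezone(timezone(timedelta(hours=utc_offset_hours)))

# Example: a 3.1-mile run logged at 14:30 UTC by a user at UTC-6.
print(harmonize_distance_km(3.1, "mi"))  # ~4.99 km
print(to_local_time(datetime(2024, 5, 1, 14, 30, tzinfo=timezone.utc), -6))
# 2024-05-01 08:30:00-06:00
```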


2. Data Standardization

Standardization applies recognized industry standards to collected data, ensuring compatibility and reliability across different health data providers.

Examples:

  • Mapping sleep stages from various providers to a common standard.
  • Aligning heart rate intervals for consistent reporting.

Outcome: Standardized data compatible with cross-platform integration.
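
A minimal sketch of such a mapping, assuming hypothetical provider labels and a simple common vocabulary (the actual source labels and target standard ROOK uses may differ):

```python
# Hypothetical provider labels mapped to a common sleep-stage vocabulary.
SLEEP_STAGE_MAP = {
    ("provider_a", "light_sleep"): "light",
    ("provider_a", "deep_sleep"):  "deep",
    ("provider_b", "stage_n3"):    "deep",
    ("provider_b", "stage_rem"):   "rem",
}

def standardize_stage(provider: str, raw_stage: str) -> str:
    """Map a provider-specific sleep stage to the common standard."""
    try:
        return SLEEP_STAGE_MAP[(provider, raw_stage)]
    except KeyError:
        return "unknown"  # unmapped stages are flagged rather than guessed

print(standardize_stage("provider_b", "stage_n3"))  # deep
```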


3. Data Cleaning

Data cleaning eliminates inconsistencies, resolves duplicates, and ensures the reliability of delivered data. This step integrates ROOK’s Duplicity Feature for managing data from multiple sources.

Key Components of Cleaning

  1. Data Prioritization:

    • Data sources are ranked based on their quality, comprehensiveness, and relevance.
    • Data from wearable devices takes precedence over data extracted via the SDK or from mobile health kits.
  2. Rules for Data Prioritization:

    • Non-Null Value Rule: Valid non-null values from lower-priority sources override null values from higher-priority sources.
    • Higher Value Rule: For specific metrics (e.g., steps_number, calories_expenditure_kilocalories), the highest value is retained, regardless of source priority. Both rules are sketched in the example after this list.
  3. Event Processing:

    • Events from multiple sources within a ±10-minute window are merged or prioritized based on source ranking.
    • Redundant events are discarded, ensuring a clean event record.
  4. Summary Processing:

    • Summaries are generated using the highest-priority data source.
    • Updates to summaries incorporate complementary data from secondary sources, ensuring enriched details.
    • Updated summaries are versioned for traceability using the document_version key.
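
Below is a minimal sketch of the Non-Null Value Rule, the Higher Value Rule, and the ±10-minute event merge. The source ranking, field names, and merge details are illustrative assumptions, not ROOK’s actual implementation:

```python
from datetime import timedelta

SOURCE_PRIORITY = {"wearable": 0, "sdk": 1, "health_kit": 2}  # lower = higher priority
HIGHER_VALUE_KEYS = {"steps_number", "calories_expenditure_kilocalories"}

def merge_metric(key, readings):
    """Merge one metric from several sources; readings is a list of
    (source, value) pairs, where value may be None."""
    if key in HIGHER_VALUE_KEYS:
        # Higher Value Rule: keep the maximum regardless of priority.
        return max(v for _, v in readings if v is not None)
    # Non-Null Value Rule: the best-priority source wins, but its nulls
    # never override valid values from lower-priority sources.
    ranked = sorted(readings, key=lambda r: SOURCE_PRIORITY[r[0]])
    return next((v for _, v in ranked if v is not None), None)

def merge_events(events, window=timedelta(minutes=10)):
    """Collapse events whose start times fall within a +/-10-minute
    window, keeping the highest-priority source and dropping the rest."""
    merged = []
    for ev in sorted(events, key=lambda e: e["start"]):
        if merged and ev["start"] - merged[-1]["start"] <= window:
            if SOURCE_PRIORITY[ev["source"]] < SOURCE_PRIORITY[merged[-1]["source"]]:
                merged[-1] = ev  # better-ranked source replaces the kept event
        else:
            merged.append(ev)
    return merged

print(merge_metric("steps_number", [("wearable", 9500), ("health_kit", 10200)]))  # 10200
print(merge_metric("resting_hr_bpm", [("wearable", None), ("sdk", 62)]))          # 62
```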

Update Frequency for Summaries

  • Summaries are retained for 15 minutes before delivery to incorporate delayed updates from connected sources.
  • All data is ultimately reported in UTC for consistency.
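
A minimal sketch of the hold-and-enrich behavior, assuming a simplified summary record (the delivery mechanics and field names here are illustrative assumptions):

```python
from datetime import datetime, timedelta

HOLD_WINDOW = timedelta(minutes=15)

def apply_update(summary: dict, patch: dict, now: datetime) -> dict:
    """Enrich a summary with complementary data from a secondary source
    and bump document_version for traceability."""
    updated = {**summary, **patch, "last_updated_utc": now}
    updated["document_version"] = summary.get("document_version", 0) + 1
    return updated

def ready_for_delivery(summary: dict, now: datetime) -> bool:
    """Hold a summary for 15 minutes after its last update so delayed
    data from other connected sources can still be merged in."""
    return now - summary["last_updated_utc"] >= HOLD_WINDOW
```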

Outcome: Clean, accurate, and enriched data that eliminates redundancies while preserving the best information.


4. Data Normalization

Normalization adjusts data to a uniform scale and format, ensuring comparability across sources.

Examples:

  • Converting calorie data to a standardized unit (e.g., kilocalories).
  • Aligning step counts into consistent time intervals.

Outcome: Normalized data that is actionable and comparable across diverse data sources.
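
A minimal sketch of both normalizations, assuming a 15-minute interval for step alignment (the actual granularity and unit factors are assumptions):

```python
from datetime import datetime

def to_kilocalories(value: float, unit: str) -> float:
    """Normalize calorie readings to kilocalories."""
    factors = {"kcal": 1.0, "cal": 0.001, "kJ": 1 / 4.184}
    return value * factors[unit]

def bucket_steps(samples, minutes: int = 15):
    """Align raw step samples to fixed-width time intervals."""
    buckets = {}
    for ts, steps in samples:
        # Floor each timestamp to the start of its interval.
        floored = ts.replace(minute=(ts.minute // minutes) * minutes,
                             second=0, microsecond=0)
        buckets[floored] = buckets.get(floored, 0) + steps
    return buckets

samples = [(datetime(2024, 5, 1, 9, 4), 120), (datetime(2024, 5, 1, 9, 11), 80)]
print(bucket_steps(samples))  # {datetime(2024, 5, 1, 9, 0): 200}
```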


5. Data Structuring

ROOK organizes all processed data into structured schemas, which unify metrics across various sources. These schemas ensure that clients receive consistent data, regardless of the original source.

  • Schemas: Detailed structures for Physical, Sleep, and Body Health pillars.
  • Key Benefits:
    • Consistent formatting and simplified integration for developers.
    • Predefined keys mapped across data sources for uniform data handling.

Explore schemas in the Data Types Section for further details.
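
As an illustration only, a consumer-side view of a structured summary might look like the sketch below; the field names are assumptions, and the authoritative definitions live in the Data Types Section:

```python
from dataclasses import dataclass
from datetime import date

# Illustrative shape only; the real schemas for the Physical, Sleep,
# and Body Health pillars are documented in the Data Types Section.
@dataclass
class PhysicalSummary:
    user_id: str
    date_utc: date
    steps_number: int
    calories_expenditure_kilocalories: float
    document_version: int  # bumped whenever the summary is enriched
```

Because every source is mapped to the same predefined keys, downstream code handles a single shape regardless of which device or health kit produced the data.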


Why Data Processing Matters

  1. Enhanced Data Quality: Ensures high-quality, actionable data for client applications.
  2. Cross-Source Integration: Harmonizes data from multiple sources into a single, consistent format.
  3. Streamlined Analysis: Structured and cleaned data simplifies downstream usage.
  4. Standardized Reporting: All data is delivered in UTC for global consistency.

For more details on how processed data is delivered, visit the Data Delivery Section.