Data Transformation

Overview

Raw data from a commercial vessel tracking feed arrives in a proprietary binary format with latitude and longitude expressed as signed integers, vessel identifiers in IMO format, and timestamps in UTC epoch milliseconds. Before an analyst can search for a vessel by name, correlate its position with a known port, or calculate whether its route is consistent with declared cargo, that data needs to go through several transformation steps: type conversion, schema mapping, identifier normalisation, geospatial format conversion, and enrichment from a vessel registry. Each step needs to happen reliably, at ingestion speed, without manual intervention.
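
As an illustration of the type-conversion and identifier-normalisation steps for the vessel example, here is a minimal Python sketch. It is not the module's implementation. The coordinate scale factor (degrees multiplied by 10^7), the `VesselPosition` type, and the function names are assumptions made for the example, since the feed's actual encoding is not specified here.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

# Assumption: the feed stores degrees multiplied by 10^7 as signed integers.
# The real scale factor is not stated in this document.
COORD_SCALE = 10_000_000


@dataclass(frozen=True)
class VesselPosition:
    """Hypothetical internal record for one position report."""
    imo: str          # normalised as "IMO" followed by 7 digits
    latitude: float   # decimal degrees
    longitude: float  # decimal degrees
    observed_at: str  # ISO 8601 timestamp in UTC


def normalise_imo(raw: str) -> str:
    """Drop an optional 'IMO' prefix and whitespace, then require 7 digits."""
    digits = raw.upper().replace("IMO", "").strip()
    if len(digits) != 7 or not (digits.isascii() and digits.isdigit()):
        raise ValueError(f"not a 7-digit IMO number: {raw!r}")
    return f"IMO{digits}"


def convert_position(raw_lat: int, raw_lon: int, raw_imo: str, epoch_ms: int) -> VesselPosition:
    """Convert one raw report into typed, range-checked fields."""
    lat = raw_lat / COORD_SCALE
    lon = raw_lon / COORD_SCALE
    if not (-90.0 <= lat <= 90.0 and -180.0 <= lon <= 180.0):
        raise ValueError(f"coordinates out of range: {lat}, {lon}")
    # Epoch milliseconds to a timezone-aware UTC datetime.
    observed = datetime.fromtimestamp(epoch_ms / 1000, tz=timezone.utc)
    return VesselPosition(normalise_imo(raw_imo), lat, lon, observed.isoformat())
```

Rejecting out-of-range coordinates and malformed identifiers at this point, rather than passing them on, keeps bad values out of the later enrichment and correlation steps.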

The Data Transformation module is the processing layer that handles this work. It provides more than 40 transformation types covering schema mapping, data cleansing, normalisation, external enrichment, and calculated field generation. Everything entering the Argus platform passes through the transformation layer before reaching the platform record store. For intelligence agencies, financial crime units, healthcare data controllers, and critical infrastructure operators, this is where raw incoming data becomes something analysts can work with: clean, standardised, enriched, and ready for correlation and analysis.

Key Features

Schema Mapping: Transform raw data structures into standardised internal structures with type-safe conversions, address normalisation, and identifier mapping. Source schemas vary; the internal model is consistent.
Data Cleansing: Apply quality validation rules, handle outliers, remove inconsistencies, and standardise formats so that data meets quality standards before reaching downstream processing.
Data Normalisation: Standardise schemas, resolve entities, and deduplicate records to create consistent, unified data representations across all sources. The same entity described differently by different feeds becomes a single normalised record.
External Data Enrichment: Augment records with data from external sources across multiple domains to add context. A vessel identifier maps to a registry record; an IP address maps to geolocation and ASN data; a company identifier maps to ownership and sanctions data.
Calculated Fields and Metrics: Generate derived values including risk scores, aggregated metrics, and domain-specific calculations. These turn raw measurements into the signals analysts actually use.
Type-Safe Transformations: All transformations enforce type safety to prevent data corruption and ensure reliable processing at every stage. Type errors are caught and reported rather than silently propagating.
Batch Processing: Process large volumes of data with batch-optimised transformation pipelines that scale to meet demand. Batch and streaming modes use the same transformation definitions.
Caching for Performance: Cache frequently used transformation results and reference data to accelerate processing and reduce load on external enrichment sources. Cache invalidation is configurable per source.
Monitoring and Metrics: Track transformation throughput, error rates, and processing times to identify bottlenecks and maintain performance standards as data volumes grow.
API-Accessible Transformations: Access all transformation capabilities programmatically for integration into automated workflows and custom applications.
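
To make the type-safety and chaining behaviour described above concrete, the following Python sketch runs a list of transformation steps over a batch and records type errors instead of letting them propagate silently. It is a sketch under stated assumptions, not the Argus API: `run_pipeline`, `to_float`, `add_risk_score`, and the scoring rule are all hypothetical names and logic introduced for illustration.

```python
from dataclasses import dataclass, field
from typing import Any, Callable

Record = dict[str, Any]
Step = Callable[[Record], Record]


@dataclass
class PipelineResult:
    records: list[Record] = field(default_factory=list)
    # Each error is (record index, step name, message).
    errors: list[tuple[int, str, str]] = field(default_factory=list)


def to_float(field_name: str) -> Step:
    """Build a step that converts one field to float or raises."""
    def step(record: Record) -> Record:
        value = record[field_name]
        # bool is a subclass of int, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float, str)):
            raise TypeError(f"{field_name}: cannot convert {type(value).__name__} to float")
        return {**record, field_name: float(value)}
    step.__name__ = f"to_float[{field_name}]"
    return step


def add_risk_score(record: Record) -> Record:
    # Hypothetical rule: scale the amount into 0..1, capped at 1.0.
    return {**record, "risk_score": min(1.0, record["amount"] / 100.0)}


def run_pipeline(records: list[Record], steps: list[Step]) -> PipelineResult:
    """Apply every step to every record; collect failures per record."""
    result = PipelineResult()
    for index, record in enumerate(records):
        current = "<none>"
        try:
            for step in steps:
                current = step.__name__
                record = step(record)
        except (KeyError, TypeError, ValueError) as exc:
            # Report the failure and skip the record rather than storing bad data.
            result.errors.append((index, current, str(exc)))
            continue
        result.records.append(record)
    return result
```

Because the steps are plain functions over single records, the same list could in principle be applied to a batch, as here, or to records arriving one at a time from a stream.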

Use Cases

Multi-Source Data Standardisation: Normalise data from diverse sources with different schemas, formats, and conventions into a single consistent data model for unified analysis and reporting. Analysts work with one model regardless of how many sources feed it.
Risk Assessment Pipelines: Chain multiple transformations together to cleanse, enrich, and score entities for risk assessment, combining external data sources with calculated metrics. The output feeds directly into case intelligence.
Entity Resolution: Resolve and deduplicate entities across multiple datasets by normalising identifiers, matching on fuzzy criteria, and merging records into authoritative golden records.
Real-Time Data Processing: Apply transformations to streaming data as it arrives via Kafka Streams, ensuring that data is clean, enriched, and analysis-ready before it reaches dashboards or alerting systems.
Domain-Specific Analysis: Apply specialised transformation and enrichment pipelines for financial intelligence (sanctions screening, beneficial ownership), maritime data (vessel tracking, port calls), threat analysis (indicator enrichment), or other domain-specific investigation workflows.
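
The entity resolution use case above can be sketched in a few lines of Python. This is a simplified illustration, not the platform's matching algorithm: it uses the standard library's `difflib.SequenceMatcher` as a stand-in for fuzzy matching, and the field names (`name`, `id`), the 0.85 threshold, and the merge rule (first non-empty value wins) are assumptions chosen for the example.

```python
from difflib import SequenceMatcher
from typing import Any, Optional

Record = dict[str, Optional[Any]]


def normalise_name(name: str) -> str:
    """Lower-case, drop simple punctuation, and collapse whitespace."""
    cleaned = name.lower().replace(".", "").replace(",", "")
    return " ".join(cleaned.split())


def resolve(records: list[Record], threshold: float = 0.85) -> list[Record]:
    """Merge records that share an id or have similar normalised names."""
    golden: list[Record] = []
    keys: list[str] = []  # normalised name for each golden record
    for rec in records:
        key = normalise_name(rec["name"])
        for i, existing in enumerate(golden):
            same_id = rec.get("id") is not None and rec.get("id") == existing.get("id")
            similar = SequenceMatcher(None, key, keys[i]).ratio() >= threshold
            if same_id or similar:
                # Merge rule (assumed): keep existing values, fill only gaps.
                for field_name, value in rec.items():
                    if existing.get(field_name) is None:
                        existing[field_name] = value
                break
        else:
            # No match found: this record starts a new golden record.
            golden.append(dict(rec))
            keys.append(key)
    return golden
```

Comparing every record against every golden record is quadratic, so a production pipeline would normally narrow candidates first, for example by blocking on a normalised identifier.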

Integration

The Data Transformation module sits within the Argus ingestion pipeline, receiving data from the connector layer and delivering processed records to the platform record store. It connects with the data quality validation module for pre-transformation checks and post-transformation verification, and with the entity graph for relationship updates triggered by enrichment results. Transformation logic is accessible via the platform's ETL pipeline builder and programmatically through the API.

Open Standards

OASIS STIX 2.1: Connectors feeding the transformation layer must produce validated STIX 2.1 DTOs as their primary output contract before the structural type-mapping stage consumes them.
NIEM (National Information Exchange Model): NIEM is a named primary output contract for connectors, so NIEM-structured records already conform to the exchange model when they arrive at the transformation layer, before schema mapping occurs.
CAP v1.2 (Common Alerting Protocol): Alert-domain connectors emit CAP v1.2 as their open-standard contract, and the transformation layer maps those validated CAP payloads into the internal Argus schema.
W3C PROV-DM (Provenance Data Model): Each ingestion and transformation event is recorded as a PROV-DM activity with entity and agent relationships, providing an auditable provenance chain for every transformed record.
OpenLineage Specification: START, COMPLETE, and FAIL lineage events are emitted for every normalisation job, tracking dataset inputs and outputs through the transformation pipeline in the OpenLineage event model.
ISO 8601: All timestamp fields are coerced and normalised to ISO 8601 format during type conversion, and job metadata timestamps throughout the pipeline are serialised in this format.
STANAG 4774 (Confidentiality Metadata Labels): A post-transformation classification hook applies STANAG 4774 confidentiality labels to normalised records immediately after transformation completes, before data reaches downstream stores.
RFC 5322 / ITU-T E.164: Entity extraction during transformation validates email addresses against RFC 5322 patterns and normalises telephone numbers to E.164 international format as part of the cleansing pipeline.
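
The START, COMPLETE, and FAIL lineage events listed above could be wrapped around a normalisation job roughly as follows. This Python sketch is illustrative only: the field names mirror the general shape of an OpenLineage run event, but the payload is simplified and not validated against the specification's schema, and the namespace, producer URI, and function names are hypothetical placeholders.

```python
import uuid
from datetime import datetime, timezone
from typing import Any, Callable

# Hypothetical values; the real namespace and producer URI are not given here.
NAMESPACE = "argus.transformation"
PRODUCER = "https://example.invalid/argus-transformation"


def lineage_event(event_type: str, run_id: str, job_name: str,
                  inputs: list[str], outputs: list[str]) -> dict[str, Any]:
    """Build a simplified run event in the OpenLineage style."""
    return {
        "eventType": event_type,  # START, COMPLETE, or FAIL
        "eventTime": datetime.now(timezone.utc).isoformat(),  # ISO 8601, UTC
        "run": {"runId": run_id},
        "job": {"namespace": NAMESPACE, "name": job_name},
        "inputs": [{"namespace": NAMESPACE, "name": n} for n in inputs],
        "outputs": [{"namespace": NAMESPACE, "name": n} for n in outputs],
        "producer": PRODUCER,
    }


def run_with_lineage(job_name: str, job: Callable[[], Any],
                     inputs: list[str], outputs: list[str],
                     emit: Callable[[dict[str, Any]], None]) -> Any:
    """Run a job, emitting START first and then COMPLETE or FAIL."""
    run_id = str(uuid.uuid4())  # one run id shared by all events of this run
    emit(lineage_event("START", run_id, job_name, inputs, outputs))
    try:
        result = job()
    except Exception:
        emit(lineage_event("FAIL", run_id, job_name, inputs, outputs))
        raise  # the failure is recorded, then surfaced to the caller
    emit(lineage_event("COMPLETE", run_id, job_name, inputs, outputs))
    return result
```

Passing `emit` as a parameter keeps the sketch independent of any transport; in practice it might post to a lineage backend, while in a test it can simply append to a list.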

Last Reviewed: 2026-02-05 Last Updated: 2026-04-14