Format Conversion
Format Conversion
Format Conversion is the process of reading data stored in one physical file format and rewriting it in a different physical file format. In the modern data lakehouse ecosystem, this is one of the most computationally expensive and operationally consequential tasks that data pipelines perform. Every major data engineering workflow involves format conversion at least implicitly: raw CSV or JSON events become Parquet analytical tables; Avro-serialized Kafka messages become ORC Hive partitions; Delta Lake snapshots become Iceberg-compatible Parquet files via UniForm metadata generation.
Despite being pervasive, format conversion is frequently misunderstood, poorly optimized, and architected without full awareness of its costs and trade-offs. A data pipeline that converts data through too many intermediate formats, or that performs full-table re-conversions when only schema evolution is needed, can consume an enormous and entirely avoidable fraction of the team’s compute budget.
This guide provides a comprehensive, technical analysis of the major file formats used in lakehouse architectures, the precise mechanics of converting between them, the schema mapping challenges that make conversion error-prone, and the architectural strategies that minimize conversion costs in production deployments.
The Major File Formats: A Technical Comparison
To understand format conversion, you must first understand what is being converted — specifically, how each format structures its data on disk and why that structure produces radically different performance characteristics for different workloads.
Apache Parquet: The Columnar Analytical Standard
Parquet stores data in a columnar orientation. A table with 100 columns is stored such that all values from column 1 are written together, then all values from column 2, and so on. Within a large row group (typically 128MB), each column’s data is stored contiguously.
The implications for analytical queries are profound. If a query touches only 5 of a table’s 100 columns, the Parquet reader skips 95% of the file’s bytes entirely — it reads only the physical byte ranges corresponding to the 5 requested columns. This is IO reduction by physical file layout, not by predicate filtering.
Additionally, Parquet stores per-column statistics (min/max/null count) in the row group footer. These statistics allow query engines to skip entire row groups (and by extension entire files) for queries with selective predicates, before reading a single data byte.
Parquet also applies per-column compression (LZ4, Snappy, Zstandard, Gzip) and encoding (dictionary encoding, run-length encoding, delta encoding, bit packing). Similar values within a column compress far more efficiently than similar values scattered across rows, giving Parquet dramatically better compression ratios than row-based formats for analytical data.
The weakness of Parquet is write performance. Writing Parquet requires buffering an entire row group in memory, sorting the data into columnar order, applying encoding and compression per column, and writing the resulting column chunks. For small, frequent writes (e.g., a streaming pipeline writing 1,000 records every few seconds), the per-row-group overhead makes Parquet significantly less efficient than row-based formats.
Apache ORC: The Optimized Columnar Alternative
ORC (Optimized Row Columnar) is also a columnar format, developed originally for the Apache Hive ecosystem. Its structure is similar to Parquet: data is stored in stripes (analogous to Parquet’s row groups), with per-column data written contiguously within each stripe.
ORC distinguishes itself with a more integrated metadata and indexing story. ORC stores three levels of statistics: file-level (global min/max/count for the entire file), stripe-level (per-stripe min/max/count), and row-group-level (per-8,000-row block within a stripe). This three-tier indexing allows extremely fine-grained predicate pushdown, where query engines can skip not just entire files or stripes but individual row group blocks within a stripe.
ORC also includes native ACID support in its design (predating Iceberg and Delta Lake’s approaches) through Hive’s ORC ACID transaction model, though this model is specific to the Hive ecosystem and is not portable to other query engines.
In practice, ORC provides marginally better compression than Parquet for Hive-native workloads due to its integer encoding strategies, and its indexing is more granular. However, Parquet has broader ecosystem support across non-Hive engines (Spark, Trino, Flink, Snowflake, DuckDB), making it the more commonly chosen format for new lakehouse deployments that need multi-engine compatibility.
Apache Avro: The Row-Based Schema Evolution Standard
Avro stores data in a row-based orientation. All columns for record 1 are stored together, then all columns for record 2, and so on. This makes Avro substantially faster for write operations (no need to buffer and reorder into columnar format) and for operations that process entire records (like streaming deserialization in Kafka consumers).
Avro’s most distinctive strength is its schema evolution story. Avro stores the full schema as a JSON document in the file header (or in a Schema Registry for streaming contexts). When a new reader encounters an Avro file written with an older schema, it uses explicit schema resolution rules to map the old writer’s schema fields to the new reader’s schema fields. Fields present in the writer’s schema but absent in the reader’s schema are ignored. Fields present in the reader’s schema but absent in the writer’s schema are filled with their default values. Field renames are tracked through explicit aliases.
This schema evolution model is critically important for streaming pipelines where the producing application might change its event schema independently of the consuming analytics pipeline. Avro provides a language-independent, format-native mechanism for handling these mismatches gracefully.
The weakness of Avro for analytical workloads is the same as any row-based format: a query that reads only 5 of 100 columns must still read all 100 columns’ bytes for every row, because the bytes are interleaved row by row. Avro is the wrong format for analytical scans.
CSV and JSON: Legacy Input Formats
CSV and JSON are the most common formats for raw data arriving from external systems — web application logs, API exports, database exports, IoT sensor streams. Neither is efficient for analytical use. Both are plain text (no compression by default), row-based, and lack schema metadata embedded in the file format itself.
CSV has no native support for complex types (arrays, structs, maps), no standard for null value representation, and no metadata about data types (everything is a string that the reader must interpret). JSON is more expressive but incurs heavy parsing overhead due to repeated key names on every record.
The standard practice is to treat CSV and JSON as transient ingestion formats that are immediately converted to Parquet or ORC as part of the Bronze layer ingestion pipeline.
The Mechanics of Format Conversion
Format conversion in the lakehouse context is almost always performed by a distributed processing engine — Apache Spark, Apache Flink, Trino, DuckDB, or Apache Arrow-based tools. The conversion process is logically simple but operationally demanding:
Step 1: Read the Source Format
The compute engine reads the source files using a format-specific reader that deserializes the bytes into an in-memory representation (typically Apache Arrow’s columnar in-memory format, which all modern engines support as a universal in-memory intermediate). For CSV and JSON, this involves text parsing. For Parquet and ORC, this involves decompressing column chunks and reconstructing record batches. For Avro, this involves binary deserialization using the schema from the file header.
Step 2: Apply Transformations (Optional)
At this point, the data is in memory as Arrow record batches. The pipeline may apply any number of transformations: filtering rows, renaming columns, casting data types, joining with reference tables, applying business logic. These transformations operate on the in-memory representation, independent of both the source and target file formats.
Step 3: Write the Target Format
The compute engine writes the in-memory record batches to the target format using a format-specific writer. For Parquet, this involves buffering records into row groups, applying per-column encoding and compression, and writing the row group footer statistics. For ORC, this involves constructing stripe indexes and applying ORC-specific encoding. For Avro, this involves binary serialization of each record in row order.
Vectorized Readers and Writers
Modern format conversion achieves high throughput through vectorized (batch) reads and writes rather than per-row processing. Instead of deserializing one record at a time, engines read thousands of records simultaneously as Arrow columnar batches, apply transformations to entire columns at once using SIMD CPU instructions, and write thousands of records at once. This vectorized pipeline can sustain conversion throughput of hundreds of megabytes per second per CPU core for Parquet-to-Parquet operations.
Schema Mapping: The Primary Source of Conversion Failures
Format conversion is conceptually straightforward when the source and target schemas are identical. In real-world pipelines, they never are. Schema mapping — the process of reconciling differences between the source schema and the target schema — is where most conversion failures originate.
Column Name Mismatches
Source systems frequently use naming conventions that are incompatible with the target schema. A Kafka message might use camelCase field names (customerId, orderTimestamp) while the Iceberg target table uses snake_case (customer_id, order_timestamp). The conversion pipeline must explicitly map each source field name to its target field name.
If the field name mapping is not explicitly configured and the conversion engine defaults to a positional mapping (matching fields by their order in the schema rather than by name), any schema evolution in either the source or target will silently corrupt the data by mapping the wrong source values to the wrong target columns.
Data Type Incompatibilities
Every file format and query engine has its own type system with subtly different capabilities and semantics. Common conversion type conflicts include:
Timestamp precision: Avro timestamps are stored in milliseconds since epoch by default. Parquet supports both millisecond and microsecond precision. If a Avro-to-Parquet conversion defaults to milliseconds but the target Iceberg table expects microsecond precision, all timestamps will be stored with false precision (the microsecond digits will always be zero). The inverse — converting microsecond Parquet timestamps to millisecond Avro — silently truncates precision.
Decimal precision and scale: All three columnar formats support arbitrary-precision decimals. However, if the source Parquet file stores a decimal with precision 28, scale 6, and the target ORC table is defined with precision 10, scale 2, the conversion engine must either truncate (silently losing data), raise an error, or widen the target type. None of these outcomes is automatically correct; the correct behavior must be explicitly specified in the pipeline configuration.
Null handling in arrays: Parquet distinguishes between a list that contains a null element and a list that is itself null. Some SQL type systems (including older versions of Hive’s DDL) do not support this distinction. Converting between Parquet and ORC in environments that handle nested null semantics differently can produce incorrect null representations that pass validation but produce wrong results in downstream queries.
Integer widening: Converting a source INT32 Parquet column to a target BIGINT ORC column is safe. Converting in the reverse direction — BIGINT to INT32 — silently truncates any values exceeding 2,147,483,647, producing incorrect data without any error.
Schema Drift Management
Schema drift occurs when the source schema changes over time without a corresponding change to the target schema or the conversion pipeline configuration. A new field is added to the Kafka event schema. The conversion pipeline was not updated. The new field is silently dropped during conversion. The analytics team wonders why the new metric they added to the event is not appearing in the Gold table.
Strategies for managing schema drift include:
Schema Registries: A Schema Registry (like Confluent Schema Registry for Kafka) maintains a versioned history of every schema change, enforces compatibility policies (backward compatible, forward compatible, full compatible), and allows the conversion pipeline to detect when a new schema version is different from the last schema it processed.
Bronze Layer schema-on-read: Storing raw data in the Bronze layer in a permissive format (NDJSON or Avro with a permissive schema) allows new fields to be captured without modification to the ingestion pipeline. Schema enforcement is applied only in the Silver layer transformation pipeline, which can be updated independently of the ingestion cadence.
Table format schema evolution: Rather than failing when a new field appears in the source, the conversion pipeline can automatically issue a schema evolution command to the target Iceberg or Delta table (adding the new column) before writing the new data. This is the recommended pattern for most ELT pipelines that evolve their source schemas iteratively.
The Performance Dimensions of Format Conversion
CPU Cost
Converting between row and columnar formats (Avro to Parquet, CSV to ORC) requires physically reordering the data in memory. This is a CPU-intensive operation: for a table with 100 columns and 1 million rows, converting from row to columnar requires reading 100 million individual column values and redistributing them into 100 separate columnar arrays. Modern vectorized execution engines do this efficiently, but it is never free.
Converting between two columnar formats (Parquet to ORC) is slightly cheaper, because the source data is already in columnar order — the conversion is primarily a re-encoding and re-compression operation rather than a data layout transformation.
IO Cost and the Small File Problem
Format conversion is an IO-amplifying operation: the compute engine reads some input bytes, processes them, and writes output bytes. For a pure column reformatting with no compression ratio difference, the IO cost is 2x the data size (one read, one write). For a conversion that also applies compression (CSV to Parquet), the write IO may be dramatically less than the read IO — but the read IO is still the full raw data size.
The small file problem is particularly acute in streaming conversion pipelines. If a Flink streaming job converts Kafka Avro messages to Parquet every minute, it will write one small Parquet file per minute per partition. After a week, a table might have 7,000+ small Parquet files. Small files have the same metadata overhead as large files (each file requires a Manifest entry in Iceberg), but deliver far less analytical throughput because the per-file setup cost is amortized over far fewer rows. Compaction must be regularly run to merge these small files into appropriately sized ones.
Conversion Caching and Skipping
Modern conversion pipelines can avoid redundant conversions through incremental processing patterns. Instead of re-converting the entire source dataset every time the pipeline runs, the pipeline tracks a watermark (the last successfully processed offset in Kafka, or the last successfully read Iceberg snapshot in a source table), and converts only the data that arrived after the watermark. This incremental pattern reduces conversion cost by orders of magnitude for stable, slowly-growing datasets.
Conversion in the Open Table Format Context
A critically important architectural insight: Open Table Formats do not necessarily require format conversion to achieve cross-format compatibility. As described in the Apache XTable and Delta UniForm articles, metadata translation tools generate new metadata pointing to existing Parquet files in a different format’s structure. The physical Parquet files are never re-encoded. Only the metadata layer changes.
This means that “converting” a Delta Lake table to be readable as Iceberg via UniForm is not a format conversion in the traditional sense — it is a metadata translation. The Parquet files remain byte-for-byte identical. This distinction is important because it means the compute cost of cross-format interoperability through UniForm or XTable is dramatically lower than the compute cost of a true format conversion (re-reading and re-writing all the data bytes).
True format conversion (e.g., reading Avro messages from Kafka and writing Parquet to an Iceberg table) remains a real cost that must be planned for and optimized in the pipeline architecture.
Conclusion
Format Conversion is the fundamental data engineering operation that bridges the semantic gap between how data is produced (streaming row-based messages, raw CSV exports, application log JSON) and how data is most efficiently consumed (columnar Parquet or ORC optimized for analytical engines). Its cost is real — CPU, IO, time, and infrastructure — and must be minimized through intelligent pipeline architecture: ingest raw data in permissive formats, convert to columnar exactly once in the Bronze-to-Silver pipeline, apply schema enforcement at conversion time, manage small files through compaction, and leverage metadata translation tools (UniForm, XTable) for cross-format interoperability to avoid unnecessary full-data reconversions. Engineers who treat format conversion as a minor implementation detail rather than a first-class architectural concern will consistently find their pipeline compute costs and data freshness latencies dominated by avoidable, redundant conversion operations.
Visual Architecture
