Using native and binary formats in ClickHouse
ClickHouse supports multiple binary formats, which offer better performance and space efficiency than text formats. Binary formats are also safe with respect to character encoding, since data is stored in binary form.
We're going to use the some_data table and its data for demonstration; feel free to reproduce that on your own ClickHouse instance.
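A minimal definition consistent with the columns used below (the MergeTree engine and ordering key here are assumptions of this sketch):
CREATE TABLE some_data
(
    path String,
    month Date,
    hits UInt32
)
ENGINE = MergeTree
ORDER BY (path, month)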
Exporting in a Native ClickHouse format
The most efficient format for exporting and importing data between ClickHouse nodes is the Native format. Exporting is done using an INTO OUTFILE clause:
SELECT * FROM some_data
INTO OUTFILE 'data.clickhouse' FORMAT Native
This will create the data.clickhouse file in the Native format.
Importing from a Native format
To import data, we can use the file() function for smaller files or for exploration purposes:
DESCRIBE file('data.clickhouse', Native);
┌─name──┬─type───┬─default_type─┬─default_expression─┬─comment─┬─codec_expression─┬─ttl_expression─┐
│ path  │ String │              │                    │         │                  │                │
│ month │ Date   │              │                    │         │                  │                │
│ hits  │ UInt32 │              │                    │         │                  │                │
└───────┴────────┴──────────────┴────────────────────┴─────────┴──────────────────┴────────────────┘
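For example, we can preview a few rows of the file directly, or load it into a table with an INSERT ... SELECT (a sketch; sometable stands for the target table, as in the import example below):
SELECT * FROM file('data.clickhouse', Native) LIMIT 3

INSERT INTO sometable
SELECT * FROM file('data.clickhouse', Native)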
When using the file() function with ClickHouse Cloud, you will need to run the commands in clickhouse client on the machine where the file resides. Another option is to use clickhouse-local to explore files locally.
In production, we use FROM INFILE to import data:
INSERT INTO sometable
FROM INFILE 'data.clickhouse'
FORMAT Native
Native format compression
We can also enable compression while exporting data to the Native format (as well as to most other formats) using a COMPRESSION clause:
SELECT * FROM some_data
INTO OUTFILE 'data.clickhouse'
COMPRESSION 'lz4'
FORMAT Native
We've used LZ4 compression for the export. We'll have to specify the same compression method while importing the data:
INSERT INTO sometable
FROM INFILE 'data.clickhouse'
COMPRESSION 'lz4'
FORMAT Native
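Other compression methods can be specified in the same way; a sketch using gzip (the output file name here is arbitrary):
SELECT * FROM some_data
INTO OUTFILE 'data.clickhouse.gz'
COMPRESSION 'gzip'
FORMAT Native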
Exporting to RowBinary
Another supported binary format is RowBinary, which allows importing and exporting data as binary-represented rows:
SELECT * FROM some_data
INTO OUTFILE 'data.binary' FORMAT RowBinary
This will generate the data.binary file in a binary row format.
Exploring RowBinary files
Automatic schema inference is not supported for this format, so to explore a file before loading it, we have to define the schema explicitly:
SELECT *
FROM file('data.binary', RowBinary, 'path String, month Date, hits UInt32')
LIMIT 5
┌─path───────────────────────────┬──────month─┬─hits─┐
│ Bangor_City_Forest             │ 2015-07-01 │   34 │
│ Alireza_Afzal                  │ 2017-02-01 │   24 │
│ Akhaura-Laksam-Chittagong_Line │ 2015-09-01 │   30 │
│ 1973_National_500              │ 2017-10-01 │   80 │
│ Attachment                     │ 2017-09-01 │ 1356 │
└────────────────────────────────┴────────────┴──────┘
Consider using RowBinaryWithNames, which adds a header row with the list of columns. RowBinaryWithNamesAndTypes adds a further header row with column types.
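Since RowBinaryWithNamesAndTypes stores names and types in its header, such files can be described without spelling out the schema; a sketch (the file name is arbitrary):
SELECT * FROM some_data
INTO OUTFILE 'data_named.binary'
FORMAT RowBinaryWithNamesAndTypes

DESCRIBE file('data_named.binary', RowBinaryWithNamesAndTypes)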
Importing from RowBinary files
To load data from a RowBinary file, we can use a FROM INFILE clause:
INSERT INTO sometable
FROM INFILE 'data.binary'
FORMAT RowBinary
Importing single binary value using RawBLOB
Suppose we want to read an entire binary file and save it into a field in a table. This is a case where the RawBLOB format can be used. This format can only be used with a single-column table:
CREATE TABLE images(data String) ENGINE = Memory
Let's save an image file to the images table:
cat image.jpg | clickhouse-client -q "INSERT INTO images FORMAT RawBLOB"
We can check the length of the data field, which will be equal to the original file size:
SELECT length(data) FROM images
┌─length(data)─┐
│         6121 │
└──────────────┘
Exporting RawBLOB data
This format can also be used to export data using an INTO OUTFILE clause:
SELECT * FROM images LIMIT 1
INTO OUTFILE 'out.jpg'
FORMAT RawBLOB
Note that we had to use LIMIT 1 because exporting more than a single value will create a corrupted file.
MessagePack
ClickHouse supports importing from and exporting to MessagePack using the MsgPack format. To export to the MessagePack format:
SELECT *
FROM some_data
INTO OUTFILE 'data.msgpk'
FORMAT MsgPack
To import data from a MessagePack file:
INSERT INTO sometable
FROM INFILE 'data.msgpk'
FORMAT MsgPack
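As with RowBinary, exported MsgPack files can be explored with the file() function by supplying the structure explicitly; a sketch, reusing the columns of our table:
SELECT *
FROM file('data.msgpk', MsgPack, 'path String, month Date, hits UInt32')
LIMIT 5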
Protocol Buffers
To work with Protocol Buffers, we first need to define a schema file:
syntax = "proto3";

message MessageType {
  string path = 1;
  uint32 month = 2; // proto3 has no 'date' scalar type; ClickHouse Date maps to an integer
  uint32 hits = 3;
};
The path to this schema file (schema.proto in our case) is set via the format_schema setting for the Protobuf format:
SELECT * FROM some_data
INTO OUTFILE 'proto.bin'
FORMAT Protobuf
SETTINGS format_schema = 'schema:MessageType'
This saves the data to the proto.bin file. ClickHouse also supports importing Protobuf data, including nested messages. Consider using ProtobufSingle to work with a single Protocol Buffer message (length delimiters are omitted in this case).
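Importing uses the same format_schema setting; a sketch, assuming sometable has columns compatible with the message type:
INSERT INTO sometable
FROM INFILE 'proto.bin'
SETTINGS format_schema = 'schema:MessageType'
FORMAT Protobuf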
Cap’n Proto
Another popular binary serialization format supported by ClickHouse is Cap’n Proto. Similarly to the Protobuf format, we have to define a schema file (schema.capnp in our example):
@0xec8ff1a10aa10dbe;

struct PathStats {
  path @0 :Text;
  month @1 :UInt32;
  hits @2 :UInt32;
}
Now we can import and export using CapnProto format and this schema:
SELECT
path,
CAST(month, 'UInt32') AS month,
hits
FROM some_data
INTO OUTFILE 'capnp.bin'
FORMAT CapnProto
SETTINGS format_schema = 'schema:PathStats'
Note that we had to cast the Date column to UInt32 to match the corresponding type in the schema.
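One way to bring the data back is to read the file through the file() function with an explicit structure and cast the integer back to Date; a sketch, assuming the target table keeps the original column types:
INSERT INTO sometable
SELECT path, CAST(month, 'Date') AS month, hits
FROM file('capnp.bin', CapnProto, 'path String, month UInt32, hits UInt32')
SETTINGS format_schema = 'schema:PathStats'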
Other formats
ClickHouse supports many formats, both text and binary, to cover various scenarios and platforms. Explore more formats and ways to work with them in the following articles:
- CSV and TSV formats
- Parquet
- JSON formats
- Regex and templates
- Native and binary formats
- SQL formats
Also, check out clickhouse-local, a portable full-featured tool for working with local and remote files without starting a ClickHouse server.