Dump to multiple jsonlines files

Dump multiple iterables incrementally to the specified jsonlines file paths. Records are written as they are produced, which keeps memory usage low.
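
The idea of an incremental, multi-file dump can be sketched with the standard library alone. The code below is an illustration under stated assumptions, not the library's implementation: dump_fork_sketch is a hypothetical name, it writes uncompressed UTF-8 text only, and it assumes that a path yielded more than once keeps receiving records in the same file.

```python
import json


def dump_fork_sketch(pairs):
    """Write each (path, records) pair incrementally, one JSON object per line.

    Hypothetical stand-in for illustration; not the jsonl library's code.
    """
    handles = {}  # one open file handle per distinct path
    try:
        for path, records in pairs:
            if path not in handles:
                handles[path] = open(path, "w", encoding="utf-8")
            for record in records:  # consume lazily, one record at a time
                handles[path].write(json.dumps(record) + "\n")
    finally:
        # Close every file, even if an iterable raised midway.
        for handle in handles.values():
            handle.close()
```

Because each iterable is consumed one record at a time, generators can be passed in without ever materializing the full data set in memory.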

The files can be compressed using the gzip, bzip2, or xz formats, selected by file extension. If the file extension is not recognized, the data is written to a plain text file.
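
Extension-based selection of a compression format could look roughly like the sketch below. The OPENERS table and the open_for_writing helper are hypothetical names introduced for illustration, and the exact extensions the library recognizes are an assumption here; only the standard gzip, bz2, and lzma modules are used.

```python
import bz2
import gzip
import lzma
import os

# Hypothetical mapping for illustration; the library's actual table may differ.
OPENERS = {".gz": gzip.open, ".bz2": bz2.open, ".xz": lzma.open}


def open_for_writing(path):
    """Open path in text mode, compressing when the extension is recognized."""
    extension = os.path.splitext(path)[1].lower()
    opener = OPENERS.get(extension, open)  # unknown extension: plain text file
    return opener(path, mode="wt", encoding="utf-8")
```

With this sketch, a path such as data.jsonl.gz would be gzip-compressed, while data.jsonl would be written as plain text.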

Example #1

This example uses jsonl.dump_fork to incrementally write fake daily temperature data for several cities to separate JSON Lines files, one file per city. It exports records for the first 10 days of 2023 and 2024. The records are produced by generators, so they are never all held in memory at once.

import datetime
import itertools
import random

import jsonl


def fetch_temperature_by_city():
    """
    Yield (filename, records) pairs, one per year and city, where records is a generator of
    fake daily temperature data for the first days of that year.
    """

    years = [2023, 2024]
    first_days = 10
    cities = ["New York", "Los Angeles", "Chicago"]

    for year, city in itertools.product(years, cities):
        start = datetime.datetime(year, 1, 1)
        dates = (start + datetime.timedelta(days=day) for day in range(first_days))
        daily_temperature = (
            {"date": date.isoformat(), "city": city, "temperature": round(random.uniform(-10, 35), 2)}
            for date in dates
        )
        yield (f"{city}.jsonl", daily_temperature)


# Write the generated data to files in JSON Lines format
jsonl.dump_fork(fetch_temperature_by_city())
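
To inspect the result, the files can be read back with the standard library alone. The sketch below is an illustration, not part of the jsonl API: read_jsonl is a hypothetical helper name, and it assumes an uncompressed UTF-8 file with one JSON object per line.

```python
import json


def read_jsonl(path):
    """Yield one decoded object per non-empty line of a JSON Lines file."""
    with open(path, encoding="utf-8") as lines:
        for line in lines:
            if line.strip():  # skip blank lines
                yield json.loads(line)
```

For instance, iterating over read_jsonl("Chicago.jsonl") after running the example above would yield dictionaries with the date, city, and temperature keys, assuming the file was written to the current directory.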

Example #2

This example demonstrates how to dump data using different JSON libraries. It requires orjson and ujson, which can be installed as follows.

pip install orjson ujson # Ignore this command if these libraries are already installed.

import orjson
import ujson
import jsonl


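def worker():
    # Each item is a (path, iterable) pair; the iterable can be a tuple, an iterator, or a list.
    yield ("num.jsonl", ({"value": 1}, {"value": 2}))
    yield ("foo.jsonl", iter(({"a": "1"}, {"b": 2})))
    # The same path may be yielded more than once.
    yield ("num.jsonl", [{"value": 3}])
    # An empty iterable is passed here as well.
    yield ("foo.jsonl", ())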


# Dump the data using the default json.dumps function.
jsonl.dump_fork(worker())

# Dump the data using the ujson library.
jsonl.dump_fork(worker(), json_dumps=ujson.dumps, ensure_ascii=False)

# Dump the data using the orjson library.
jsonl.dump_fork(worker(), json_dumps=orjson.dumps)
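
One practical difference between these serializers is the return type: json.dumps and ujson.dumps return str, while orjson.dumps returns bytes in a compact form. The sketch below shows one way a writer could accept both. It is an assumption-laden illustration rather than the library's code: to_line is a hypothetical helper, and bytes_dumps is a standard-library stand-in for a bytes-returning serializer so that the block runs without third-party packages.

```python
import json


def to_line(json_dumps, obj, **kwargs):
    """Serialize obj and return a newline-terminated str, whatever json_dumps returns."""
    data = json_dumps(obj, **kwargs)
    if isinstance(data, bytes):  # e.g. the output of a bytes-returning serializer
        data = data.decode("utf-8")
    return data + "\n"


def bytes_dumps(obj):
    # Stand-in for a bytes-returning serializer such as orjson.dumps (hypothetical helper).
    return json.dumps(obj, separators=(",", ":")).encode("utf-8")


print(to_line(json.dumps, {"value": 1}), end="")   # {"value": 1}
print(to_line(bytes_dumps, {"value": 1}), end="")  # {"value":1}
```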