Python: jsonl
2026/1/25, about 1 minute read
1 Read line by line into a list
import json

def read_jsonl(path):
    """Read a JSONL file into a list of parsed records."""
    data = []
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            data.append(json.loads(line))
    return data
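# Usage
records = read_jsonl("data.jsonl")
print(len(records))
print(records[0])

- The result in memory is a list[dict]
- Suitable for small to medium-sized datasets
- Can be indexed, iterated over, and processed freely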
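To make the list[dict] point concrete, here is a small self-contained sketch. The sample rows and the field names `id` and `score` are made up for illustration, and the one-line reader is only a compact equivalent of read_jsonl, not a replacement for it.

```python
import json
import os
import tempfile

# Hypothetical sample data; the field names are made up for illustration.
sample = [{"id": 1, "score": 0.9}, {"id": 2, "score": 0.4}, {"id": 3, "score": 0.7}]

# Write the sample as JSONL: one JSON object per line.
with tempfile.NamedTemporaryFile(
    "w", suffix=".jsonl", delete=False, encoding="utf-8"
) as tmp:
    for row in sample:
        tmp.write(json.dumps(row) + "\n")
    path = tmp.name

# Compact equivalent of read_jsonl: skip blank lines, parse the rest.
with open(path, "r", encoding="utf-8") as f:
    records = [json.loads(line) for line in f if line.strip()]
os.remove(path)  # clean up the temporary file

first = records[0]                                  # random access by index
high = [r for r in records if r["score"] >= 0.7]    # filter with a comprehension
ids = [r["id"] for r in records]                    # pull out a single field
```

Because everything is already in memory, each of these operations is an ordinary list operation; the cost is that the whole file has to fit in RAM.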
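2 Recommended for large files: generator version (low memory use)
If you are working with jsonl files at the million-line scale, as in RAG / PathRAG / KG data pipelines, this version is the better fit 👇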
import json

def iter_jsonl(path):
    """Yield one parsed record per non-empty line."""
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)

# Usage
for record in iter_jsonl("data.jsonl"):
    print(record)

tqdm version
from tqdm import tqdm
import json

def iter_jsonl(path, show_progress=True):
    total = None
    if show_progress:
        # First pass: count lines so tqdm can show a total.
        with open(path, "r", encoding="utf-8") as f:
            total = sum(1 for _ in f)
    with open(path, "r", encoding="utf-8") as f:
        it = f
        if show_progress:
            it = tqdm(f, total=total, desc="Reading JSONL")
        for line in it:
            line = line.strip()
            if line:
                yield json.loads(line)

- Processes one line at a time
- Very low memory footprint
- Well suited to pipeline-style processing, batch writes to a vector store, and graph construction
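As a rough sketch of the batch-write use case, the generator can be grouped into fixed-size batches before each write. The helper `batched`, the sink `write_batch`, and the wrapper `run_pipeline` are hypothetical names introduced here for illustration; `write_batch` only prints and stands in for whatever vector-store or graph client you actually use.

```python
import json
from itertools import islice

def iter_jsonl(path):
    """Yield one parsed record per non-empty line."""
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)

def batched(iterable, size):
    """Yield lists of up to `size` items; the last batch may be shorter."""
    it = iter(iterable)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch

def write_batch(batch):
    # Hypothetical sink: replace with a real vector-store or graph client call.
    print(f"writing {len(batch)} records")

def run_pipeline(path, batch_size=1000):
    """Stream the file in batches and return the number of records written."""
    total = 0
    for batch in batched(iter_jsonl(path), batch_size):
        write_batch(batch)
        total += len(batch)
    return total
```

Only one batch is held in memory at a time, so peak memory depends on `batch_size` rather than on the size of the file.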
3 With error handling (recommended for production use)
import json

def read_jsonl_safe(path):
    data = []
    with open(path, "r", encoding="utf-8") as f:
        for i, line in enumerate(f, 1):  # 1-based line numbers for messages
            line = line.strip()
            if not line:
                continue
            try:
                data.append(json.loads(line))
            except json.JSONDecodeError as e:
                # Report the bad line and keep going.
                print(f"Line {i} JSON parse error: {e}")
    return data

4 Pandas DataFrame
import pandas as pd

df = pd.read_json("data.jsonl", lines=True)
print(df.head())

- A .jsonl file is not one big JSON array, so it cannot be loaded directly with json.load()
- pandas needs lines=True to read it
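The first point is easy to check with the standard library alone. In this small sketch the two-line sample string is made up, and `io.StringIO` stands in for a real file handle.

```python
import io
import json

# Two JSON objects, one per line: valid JSONL, but not a single JSON document.
jsonl_text = '{"id": 1}\n{"id": 2}\n'

try:
    json.load(io.StringIO(jsonl_text))
    whole_file_ok = True
except json.JSONDecodeError:
    # The parser finishes the first object, then finds extra data after it.
    whole_file_ok = False

# Parsing each non-empty line on its own works.
records = [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]
```

Here `whole_file_ok` ends up False while `records` holds both objects, which is why the readers above all parse line by line.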
Changelog
2026/1/30 19:51
