File Reader Tool

The FileReader tool reads files in several formats, with line numbers, pagination, and safety limits. It supports plain-text files, Jupyter notebooks, PDFs, and images.

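Class Overview

  • FileReader - four read methods, one per file format:
    • read() - text files, with line numbers and pagination
    • read_notebook() - Jupyter notebooks (.ipynb)
    • read_pdf() - PDF files (requires an optional dependency)
    • read_image() - image files, returned as multimodal content blocks (an optional dependency is used for compression)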

Usage

Reading Text Files

from toolregistry_hub import FileReader

# Read the first 50 lines, with line numbers
content = FileReader.read("/path/to/file.py", limit=50)
print(content)
# [/path/to/file.py] lines 1-50 of 200 (use offset=51 to read more)
#  1 | import os
#  2 | import sys
#  3 |
#  4 | def main():
# ...

# Continue from the offset suggested in the header
content = FileReader.read("/path/to/file.py", offset=51, limit=25)
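The "use offset=N to read more" hint in the header can drive a simple paging loop. The following is a minimal sketch that assumes the header format shown above; next_offset and iter_chunks are hypothetical helper names (not part of toolregistry_hub), and read is any callable with the same signature as FileReader.read.

```python
import re
from typing import Callable, Iterator, Optional

# Matches the continuation hint in the header line,
# e.g. "(use offset=51 to read more)".
_HINT = re.compile(r"\(use offset=(\d+) to read more\)")


def next_offset(chunk: str) -> Optional[int]:
    """Return the offset suggested by the header line, or None on the last page."""
    header = chunk.split("\n", 1)[0]
    match = _HINT.search(header)
    return int(match.group(1)) if match else None


def iter_chunks(
    read: Callable[..., str], path: str, limit: int = 500
) -> Iterator[str]:
    """Yield successive chunks until the header stops advertising more lines."""
    offset: Optional[int] = 1
    while offset is not None:
        chunk = read(path, offset=offset, limit=limit)
        yield chunk
        offset = next_offset(chunk)
```

With the real class this would be called as iter_chunks(FileReader.read, "/path/to/file.py"). The loop stops as soon as a header carries no hint, so it also terminates on a single-page file.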

Reading Jupyter Notebooks

# Read notebook cells, with type markers and outputs
content = FileReader.read_notebook("analysis.ipynb")
# [Notebook: analysis.ipynb]
#
# --- Cell 1 [markdown] ---
# # Data Analysis
#
# --- Cell 2 [code] ---
# ```python
# import pandas as pd
# df = pd.read_csv("data.csv")
# ```
# Output:
# ...

No external dependencies are required: the reader uses the standard-library json module.
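Since a notebook is plain JSON, a small fixture is enough to try read_notebook() without Jupyter installed. The sketch below writes a two-cell notebook using field names from the nbformat 4 layout; write_minimal_notebook is a hypothetical helper, not part of toolregistry_hub.

```python
import json
from pathlib import Path


def write_minimal_notebook(path: str) -> None:
    """Write a two-cell nbformat-4 notebook for experimenting with a reader."""
    notebook = {
        "cells": [
            {
                "cell_type": "markdown",
                "metadata": {},
                "source": ["# Data Analysis"],
            },
            {
                "cell_type": "code",
                "execution_count": 1,
                "metadata": {},
                # One captured stdout stream, as Jupyter would store it.
                "outputs": [
                    {"output_type": "stream", "name": "stdout", "text": ["hello\n"]}
                ],
                "source": ["print('hello')"],
            },
        ],
        "metadata": {
            "kernelspec": {
                "name": "python3",
                "display_name": "Python 3",
                "language": "python",
            }
        },
        "nbformat": 4,
        "nbformat_minor": 4,
    }
    Path(path).write_text(json.dumps(notebook, indent=1), encoding="utf-8")
```

After write_minimal_notebook("demo.ipynb"), passing "demo.ipynb" to FileReader.read_notebook() should list one markdown cell and one code cell.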

Reading PDFs

# Read all pages (capped at 20 pages)
content = FileReader.read_pdf("document.pdf")

# Read a page range
content = FileReader.read_pdf("document.pdf", pages="5-10")

# Read a single page
content = FileReader.read_pdf("document.pdf", pages="3")
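Because each call returns at most 20 pages, a longer document has to be read in several calls. The sketch below generates the pages strings for that; it assumes page numbers are 1-based and ranges are inclusive, as the examples suggest, and that the caller already knows the total page count. page_ranges is a hypothetical helper name.

```python
from typing import Iterator


def page_ranges(total_pages: int, per_call: int = 20) -> Iterator[str]:
    """Yield "start-end" strings (1-based, inclusive) that cover total_pages."""
    if total_pages < 0 or per_call < 1:
        raise ValueError("total_pages must be >= 0 and per_call must be >= 1")
    for start in range(1, total_pages + 1, per_call):
        end = min(start + per_call - 1, total_pages)
        # A one-page remainder is written as a single page number.
        yield str(start) if start == end else f"{start}-{end}"
```

Each yielded string can be passed as the pages argument, for example FileReader.read_pdf("document.pdf", pages=r) for r in page_ranges(45).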

Requires pypdf or pdfplumber:

pip install toolregistry-hub[reader]

If both are installed, pdfplumber is preferred because it gives better text quality.
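To see which backend would be picked in a given environment, the same preference order can be checked with the standard library. This is a minimal sketch, not the library's own detection logic; available_pdf_backend is a hypothetical helper name.

```python
import importlib.util
from typing import Optional


def available_pdf_backend() -> Optional[str]:
    """Return the first installed backend in preference order, or None."""
    for name in ("pdfplumber", "pypdf"):  # pdfplumber is preferred when present
        if importlib.util.find_spec(name) is not None:
            return name
    return None
```

A None result means neither package is importable, which is the situation in which read_pdf() is documented to raise ImportError.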

Reading Images

# Read an image; returns multimodal content blocks
blocks = FileReader.read_image("screenshot.png")
# [
#   {"type": "text", "text": "[Image: screenshot.png (image/png, 45321 bytes)]"},
#   {"type": "image", "source": {"type": "base64", "media_type": "image/png", "data": "iVBOR..."}}
# ]

# Custom maximum size (default: 5 MB of base64 data)
blocks = FileReader.read_image("large_photo.jpg", max_size=1_000_000)

Supported formats: .png, .jpg, .jpeg, .gif, .webp

If the base64-encoded image exceeds max_size, Pillow is used to compress it with adaptive quality. This requires Pillow:

pip install toolregistry-hub[reader_image]

If Pillow is not installed, the original image is returned and a warning is logged.
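Note that max_size is compared against the base64 text, which is about one third larger than the file on disk. The sketch below predicts whether a file would cross the limit before it is read; base64_length and will_exceed are hypothetical helper names, and the default mirrors the documented 5242880 bytes.

```python
from pathlib import Path


def base64_length(raw_bytes: int) -> int:
    """Length of standard padded base64 text for raw_bytes bytes of input."""
    return 4 * ((raw_bytes + 2) // 3)


def will_exceed(path: str, max_size: int = 5 * 1024 * 1024) -> bool:
    """True if the file's base64 encoding would be longer than max_size."""
    return base64_length(Path(path).stat().st_size) > max_size
```

Under the default limit, files up to 3,932,160 bytes (3.75 MiB) fit without compression.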

Parameters

read()

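| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| path | str | required | Path to the text file |
| offset | int | 1 | Starting line number (1-based) |
| limit | int \| None | None | Maximum number of lines to read (None means 2000) |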

read_notebook()

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| path | str | required | Path to the .ipynb file |

read_pdf()

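| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| path | str | required | Path to the PDF file |
| pages | str \| None | None | Page range (e.g. "1-5" or "3"); None reads all pages up to the cap |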

read_image()

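| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| path | str | required | Path to the image file |
| max_size | int | 5242880 | Maximum size of the base64-encoded data, in bytes (5 MB) |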

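Safety Limits

  • Text files: at most 10 MB
  • Text lines: 2000 lines per read by default
  • PDF pages: at most 20 pages per call
  • Notebook output: at most 10 KB of output per cell
  • Images: at most 5 MB after base64 encoding (larger images are compressed automatically when Pillow is installed)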
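According to the source listing for read() in the API reference, a text file above the size limit does not raise; the method returns a bracketed notice that begins with "[File too large:". Callers that prefer an exception can wrap the call. This is a minimal sketch: FileTooLargeError and read_or_raise are hypothetical names, and read is any callable with the same signature as FileReader.read.

```python
from typing import Callable


class FileTooLargeError(ValueError):
    """Hypothetical exception type; not part of toolregistry_hub."""


def read_or_raise(read: Callable[..., str], path: str, **kwargs) -> str:
    """Call read() and turn the size-limit notice into an exception."""
    result = read(path, **kwargs)
    # Normal output starts with "[<path>] lines ..."; the notice has this prefix.
    if result.startswith("[File too large:"):
        raise FileTooLargeError(result.strip("[]"))
    return result
```

With the real class this would be read_or_raise(FileReader.read, "/path/to/big.log").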

MCP Server Endpoints

POST /tools/reader/read
POST /tools/reader/read_pdf
POST /tools/reader/read_notebook
POST /tools/reader/read_image
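The page does not document the request body for these endpoints. As an illustration only, the sketch below builds (but does not send) a POST request for the read endpoint, assuming the server accepts a JSON object whose keys match the method's parameter names and that it listens at a base URL of your choosing. Both are assumptions, and build_read_request is a hypothetical helper name.

```python
import json
import urllib.request


def build_read_request(
    base_url: str, path: str, offset: int = 1, limit: int = 100
) -> urllib.request.Request:
    """Build a POST request for /tools/reader/read (body schema is assumed)."""
    # Assumption: the JSON keys mirror the read() parameter names.
    body = json.dumps({"path": path, "offset": offset, "limit": limit}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/tools/reader/read",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

The request can be sent with urllib.request.urlopen(request) once the actual schema has been confirmed against the running server.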

API Reference

toolregistry_hub.file_reader.FileReader

Multi-format file reader with line numbers and pagination.

read staticmethod

read(path: str, offset: int = 1, limit: int | None = None) -> str

Read a text file with line numbers.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| path | str | Path to file. | required |
| offset | int | Starting line number (1-indexed). Defaults to 1. | 1 |
| limit | int \| None | Maximum number of lines to read. Defaults to 2000. | None |

Returns:

| Type | Description |
| --- | --- |
| str | File content with line numbers in "N \| content" format. Includes a metadata header with file path, total lines, and the range actually read. |

Raises:

| Type | Description |
| --- | --- |
| FileNotFoundError | If the file does not exist. |
| IsADirectoryError | If the path is a directory. |
| ValueError | If offset is less than 1. |

Source code in toolregistry_hub/file_reader.py
@staticmethod
def read(
    path: str,
    offset: int = 1,
    limit: int | None = None,
) -> str:
    """Read a text file with line numbers.

    Args:
        path: Path to file.
        offset: Starting line number (1-indexed). Defaults to 1.
        limit: Maximum number of lines to read. Defaults to 2000.

    Returns:
        File content with line numbers in ``"N | content"`` format.
        Includes a metadata header with file path, total lines, and
        the range actually read.

    Raises:
        FileNotFoundError: If the file does not exist.
        IsADirectoryError: If the path is a directory.
        ValueError: If offset is less than 1.
    """
    p = Path(path)
    if not p.exists():
        raise FileNotFoundError(f"File not found: {path}")
    if p.is_dir():
        raise IsADirectoryError(f"Path is a directory, not a file: {path}")
    if offset < 1:
        raise ValueError("offset must be >= 1")

    effective_limit = limit if limit is not None else _MAX_LINES_DEFAULT

    # Size guard
    file_size = p.stat().st_size
    if file_size > _MAX_FILE_SIZE_BYTES:
        return (
            f"[File too large: {file_size:,} bytes "
            f"(limit {_MAX_FILE_SIZE_BYTES:,}). "
            f"Use offset/limit to read in segments.]"
        )

    text = p.read_text(encoding="utf-8", errors="replace")
    all_lines = text.splitlines()
    total_lines = len(all_lines)

    start = offset - 1  # convert to 0-indexed
    end = min(start + effective_limit, total_lines)
    selected = all_lines[start:end]

    # Build line-numbered output
    width = len(str(end))
    numbered = [
        f"{i + offset:>{width}} | {line}" for i, line in enumerate(selected)
    ]

    # Metadata header
    range_str = f"{offset}-{start + len(selected)}"
    header = f"[{path}] lines {range_str} of {total_lines}"
    if end < total_lines:
        header += f" (use offset={end + 1} to read more)"

    return header + "\n" + "\n".join(numbered)

read_image staticmethod

read_image(path: str, max_size: int = _MAX_IMAGE_SIZE_BYTES) -> list

Read an image file and return as multimodal content blocks.

Returns a list of content blocks (TextBlock + ImageBlock) that the toolregistry pipeline can expand into format-specific multimodal messages via expand_content_blocks().

If the base64-encoded image exceeds max_size, Pillow is used to downsample it. If Pillow is not installed, the original image is returned with a warning.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| path | str | Path to image file (.png, .jpg, .jpeg, .gif, .webp). | required |
| max_size | int | Maximum base64-encoded size in bytes. Defaults to 5 MB. | _MAX_IMAGE_SIZE_BYTES |

Returns:

| Type | Description |
| --- | --- |
| list | A list of two content blocks: a text block {"type": "text", "text": "[Image: name (mime, size)]"} followed by an image block {"type": "image", "source": {"type": "base64", "media_type": "image/png", "data": "iVBOR..."}}. |

Raises:

| Type | Description |
| --- | --- |
| FileNotFoundError | If the file does not exist. |
| ValueError | If the file extension is not supported. |

Source code in toolregistry_hub/file_reader.py
@staticmethod
def read_image(
    path: str,
    max_size: int = _MAX_IMAGE_SIZE_BYTES,
) -> list:
    """Read an image file and return as multimodal content blocks.

    Returns a list of content blocks (TextBlock + ImageBlock) that the
    toolregistry pipeline can expand into format-specific multimodal
    messages via ``expand_content_blocks()``.

    If the base64-encoded image exceeds ``max_size``, Pillow is used to
    downsample it. If Pillow is not installed, the original image is
    returned with a warning.

    Args:
        path: Path to image file (.png, .jpg, .jpeg, .gif, .webp).
        max_size: Maximum base64-encoded size in bytes. Defaults to 5 MB.

    Returns:
        A list of two content blocks::

            [
                {"type": "text", "text": "[Image: name (mime, size)]"},
                {"type": "image", "source": {
                    "type": "base64",
                    "media_type": "image/png",
                    "data": "iVBOR..."
                }}
            ]

    Raises:
        FileNotFoundError: If the file does not exist.
        ValueError: If the file extension is not supported.
    """
    p = Path(path)
    if not p.exists():
        raise FileNotFoundError(f"File not found: {path}")

    ext = p.suffix.lower()
    if ext not in _SUPPORTED_IMAGE_EXTENSIONS:
        raise ValueError(
            f"Unsupported image format: '{ext}'. "
            f"Supported: {', '.join(sorted(_SUPPORTED_IMAGE_EXTENSIONS))}"
        )

    media_type = _EXTENSION_TO_MIME[ext]
    img_data = p.read_bytes()
    raw_size = len(img_data)

    b64_data = base64.b64encode(img_data).decode("ascii")

    if len(b64_data) > max_size:
        img_data, media_type = FileReader._downsample_image(
            img_data, media_type, max_size
        )
        b64_data = base64.b64encode(img_data).decode("ascii")

    return [
        {
            "type": "text",
            "text": f"[Image: {p.name} ({media_type}, {raw_size} bytes)]",
        },
        {
            "type": "image",
            "source": {
                "type": "base64",
                "media_type": media_type,
                "data": b64_data,
            },
        },
    ]
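On the consuming side, the image block can be turned back into raw bytes, for example to save or inspect what was actually returned after any downsampling. This is a minimal sketch based on the block layout shown above; image_bytes is a hypothetical helper name.

```python
import base64
from typing import Any


def image_bytes(blocks: list[dict[str, Any]]) -> tuple[str, bytes]:
    """Return (media_type, decoded bytes) from the first image block."""
    for block in blocks:
        if block.get("type") == "image":
            source = block["source"]
            return source["media_type"], base64.b64decode(source["data"])
    raise ValueError("no image block found")
```

The media type is read from the block rather than from the file extension because the source listing shows that the downsampling step may return a different media type.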

read_notebook staticmethod

read_notebook(path: str) -> str

Read a Jupyter notebook and return formatted cell contents.

Uses stdlib json only — no external dependencies.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| path | str | Path to .ipynb file. | required |

Returns:

| Type | Description |
| --- | --- |
| str | All cells with type markers (code/markdown) and outputs. |

Raises:

| Type | Description |
| --- | --- |
| FileNotFoundError | If the file does not exist. |
| ValueError | If the file is not a valid notebook. |

Source code in toolregistry_hub/file_reader.py
@staticmethod
def read_notebook(path: str) -> str:
    """Read a Jupyter notebook and return formatted cell contents.

    Uses stdlib ``json`` only — no external dependencies.

    Args:
        path: Path to ``.ipynb`` file.

    Returns:
        All cells with type markers (code/markdown) and outputs.

    Raises:
        FileNotFoundError: If the file does not exist.
        ValueError: If the file is not a valid notebook.
    """
    p = Path(path)
    if not p.exists():
        raise FileNotFoundError(f"File not found: {path}")

    try:
        data = json.loads(p.read_text(encoding="utf-8"))
    except json.JSONDecodeError as e:
        raise ValueError(f"Invalid notebook JSON: {e}") from e

    if "cells" not in data:
        raise ValueError(f"Not a valid Jupyter notebook (no 'cells' key): {path}")

    # Detect language from kernel info
    lang = "python"
    kernel_info = data.get("metadata", {}).get("kernelspec", {})
    if kernel_info.get("language"):
        lang = kernel_info["language"]

    lines: list[str] = []
    lines.append(f"[Notebook: {path}]")

    for i, cell in enumerate(data["cells"]):
        cell_type = cell.get("cell_type", "unknown")
        source = "".join(cell.get("source", []))

        lines.append(f"\n--- Cell {i + 1} [{cell_type}] ---")

        if cell_type == "code":
            lines.append(f"```{lang}")
            lines.append(source)
            lines.append("```")

            # Process outputs
            for output in cell.get("outputs", []):
                output_text = FileReader._extract_notebook_output(output)
                if output_text:
                    lines.append(f"Output:\n{output_text}")
        else:
            lines.append(source)

    return "\n".join(lines)

read_pdf staticmethod

read_pdf(path: str, pages: str | None = None) -> str

Read a PDF file and extract text.

Uses pypdf (zero-dependency, BSD) by default. If pdfplumber is installed, uses it for better text quality.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| path | str | Path to PDF file. | required |
| pages | str \| None | Page range string (e.g. "1-5", "3", "10-20"). Max 20 pages per call. Defaults to all pages (up to cap). | None |

Returns:

| Type | Description |
| --- | --- |
| str | Extracted text content with page markers. |

Raises:

| Type | Description |
| --- | --- |
| FileNotFoundError | If the file does not exist. |
| ImportError | If neither pypdf nor pdfplumber is installed. |
| ValueError | If page range is invalid. |

Source code in toolregistry_hub/file_reader.py
@staticmethod
def read_pdf(
    path: str,
    pages: str | None = None,
) -> str:
    """Read a PDF file and extract text.

    Uses ``pypdf`` (zero-dependency, BSD) by default. If ``pdfplumber``
    is installed, uses it for better text quality.

    Args:
        path: Path to PDF file.
        pages: Page range string (e.g. ``"1-5"``, ``"3"``, ``"10-20"``).
            Max 20 pages per call. Defaults to all pages (up to cap).

    Returns:
        Extracted text content with page markers.

    Raises:
        FileNotFoundError: If the file does not exist.
        ImportError: If neither ``pypdf`` nor ``pdfplumber`` is installed.
        ValueError: If page range is invalid.
    """
    p = Path(path)
    if not p.exists():
        raise FileNotFoundError(f"File not found: {path}")

    start_page, end_page = FileReader._parse_page_range(pages)

    # Try pdfplumber first (better quality), fall back to pypdf
    try:
        return FileReader._read_pdf_pdfplumber(p, start_page, end_page)
    except ImportError:
        pass

    try:
        return FileReader._read_pdf_pypdf(p, start_page, end_page)
    except ImportError:
        raise ImportError(
            "PDF reading requires 'pypdf' or 'pdfplumber'. "
            "Install with: pip install toolregistry-hub[reader]"
        ) from None