Adaptive Unstructured Data Transformation Engine

Revision: 0.3.2

This MCP (Model Context Protocol) server uses a Large Language Model (LLM, specifically GPT-4.1-mini) together with pydantic-ai to turn messy or natural-language text into strongly typed key-value attributes. Output is available as JSON, YAML, or TOML, and the response structure is always well-formed. The pipeline is designed to be resilient to unpredictable source material and to structure as much data as it can, though complete accuracy on pathological inputs is not guaranteed.

🌟 Core Value Proposition

While standard LLM endpoints often require explicit schema definitions for structured output, this engine distinguishes itself through automated discovery and rigorous validation:

🔑💡 Autonomous Attribute Discovery: Its primary strength is the capacity to intelligently identify relevant attribute names and their associated values from opaque text without requiring a pre-supplied dictionary of expected keys. This contrasts sharply with schema-bound methods, making it ideal for exploratory data analysis or unknown data schemas.
🛡️🧱 High Resilience to Data Imperfection: It is built to keep working on colloquialisms, typographical errors, and unstructured narratives, where single-pass structured extraction often fails. The multi-stage processing chain filters noise step by step.
🌍🗣️ Enhanced Multilingual Contextualization: Integrates spaCy's Named Entity Recognition (NER) for Japanese, English, and Chinese texts as a preliminary step. The detected entities enrich the LLM context, which is intended to improve extraction precision in the supported languages.
🔄✍️ Refinement Loop for Type Accuracy: It runs a fixed sequence: initial LLM extraction, LLM-driven type inference, LLM-based type verification, and final rule-based type coercion with an LLM fallback. This staged approach is designed to make data typing context-aware.
✅🛡️ Mandatory Type Safety via Pydantic: The final output is rigorously validated against the internal Pydantic model schema, ensuring downstream systems receive dependable, type-checked data payloads.
📊⚙️ Predictable Response Contracts: The service guarantees the delivery of a structurally sound response envelope, regardless of extraction success level, which is fundamental for reliable automated workflows.

Changelog Summary

v0.3.2

Maintenance: Corrected an operational failure related to FastMCP execution.

v0.3.1

Improvement: Refined the prompt engineering used for type evaluation to enhance correction fidelity.
Documentation: Integrated key differentiators into README.md.

v0.2.0

Fix: Corrected language code handling for zh-cn / zh-tw.

v0.1.0

Initial deployment.

Available Endpoints (Tools)

/extract_json : Retrieves type-validated attributes serialized into JSON format from the provided text.
/extract_yaml : Retrieves type-validated attributes serialized into YAML format from the provided text.
/extract_toml : Retrieves type-validated attributes serialized into TOML format from the provided text.
- Caveat: Refer to the 'TOML Serialization Constraints' section regarding the representation of complex lists/objects.

Operational Notes:
- Language Support: Native processing for Japanese, English, and Chinese (zh-cn / zh-tw). Other languages result in an error state.
- Dependency: Extraction fidelity is coupled with LLM performance and pydantic-ai. 100% extraction is not feasible.
- Latency: Processing time scales with input length. Expect increased latency for larger textual inputs.
- Initialization: First execution triggers necessary spaCy model downloads, leading to increased initial startup time.

Performance Benchmark Sample

Input Token Count	Approx. Character Count	Median Runtime (seconds)	Model Used
200	~400	~15	gpt-4.1-mini

Note: Actual throughput is subject to external API latency, network conditions, and current model utilization. Even short inputs might require 15+ seconds.

Feature Set

Format Agnostic Ingestion: Accepts and attempts parsing of all input text, including corrupted or highly unstructured data.
Multilingual Pre-Analysis: Full support for JP/EN/ZH (Simplified/Traditional) via language detection and subsequent spaCy NER augmentation.
Schema Enforcement: Output types are guaranteed via Pydantic validation.
Serialization Flexibility: Outputs available as JSON, YAML, or TOML.
Failure Safety: Always returns a valid, parsable response wrapper, even if content extraction is minimal or fails.
Quality Extraction: Utilizes GPT-4.1-mini for core extraction/annotation steps, reinforced by Pydantic verification.

Verified Use Cases

Testing encompasses a broad range of inputs:
- Standard, clear key-value entries.
- Text where required data is deeply embedded or obscured by noise.
- Scenarios requiring conversion between different output serialization standards.

Transformation Pipeline Visualization

Below illustrates the sequential data transformation stages implemented in server.py:

```mermaid
flowchart TD
    A["Source Text Input"] --> B["Phase 0: Pre-Processing (Language ID then spaCy NER)"]
    B --> C["Phase 1: Attribute/Value Extraction - LLM Core"]
    C --> D["Phase 2: Inferential Type Tagging - LLM"]
    D --> E["Phase 3: Type Validation & Correction - LLM"]
    E --> F["Phase 4: Canonical Type Coercion - Rules & LLM Backup"]
    F --> G["Phase 5: Final Pydantic Structuring & Validation"]
    G --> H["Formatted Output: JSON/YAML/TOML"]
```

Phase 0: Multilingual Context Augmentation (spaCy)

This system pre-processes input using spaCy after automatically determining the input language. Supported models are ja_core_news_md, en_core_web_sm, and zh_core_web_sm.

Language detection relies on langdetect.
If the language is not JP, EN, or ZH, the system terminates with an error: Unsupported lang detected.
Necessary spaCy models are downloaded and instantiated on-demand; no manual setup is needed.
The resulting entity list is injected into the main LLM prompt as follows:

[Contextual Entities from spaCy NER] This list comprises phrases identified by the language-specific spaCy model applied to the input. These entities might include dates, locations, names, or numbers. This list serves purely as contextual enrichment; the primary LLM will use its judgment across the entire text for final attribute inference.
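As a rough illustration of how such a prompt section could be assembled, here is a minimal sketch. The header and the closing note are taken from the excerpt above; the helper name `build_entity_context`, the de-duplication step, and the bullet layout are assumptions and may differ from what server.py actually does.

```python
from typing import Iterable, List

# Header and note text come from the prompt excerpt above.
NER_HEADER = "[Contextual Entities from spaCy NER]"
NER_NOTE = (
    "This list serves purely as contextual enrichment; the primary LLM will "
    "use its judgment across the entire text for final attribute inference."
)


def build_entity_context(entities: Iterable[str]) -> str:
    """Return a prompt section listing de-duplicated entity phrases.

    Hypothetical helper: the real service may format the list differently.
    """
    seen = set()
    unique: List[str] = []
    for phrase in entities:
        cleaned = phrase.strip()
        # Skip blanks and repeats while preserving first-seen order.
        if cleaned and cleaned not in seen:
            seen.add(cleaned)
            unique.append(cleaned)
    lines = [NER_HEADER, NER_NOTE]
    lines.extend(f"- {phrase}" for phrase in unique)
    return "\n".join(lines)
```

The returned string would then be appended to the extraction prompt before the Phase 1 call.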

Detailed Phase Descriptions

Phase 0: Pre-Processing (Language Detection & NER)

Objective: Automatically determine input language and generate a list of named entities to guide the LLM.
Mechanism: Uses langdetect and loads appropriate spaCy models.
Yield: Entity list appended to the LLM prompt to aid extraction.
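The language-to-model selection can be sketched as follows. The model names and the error text come from this README; the function name `select_spacy_model`, the use of `ValueError`, and the lookup by primary subtag are assumptions made for illustration.

```python
# Model names are the ones listed in the Phase 0 description.
SPACY_MODELS = {
    "ja": "ja_core_news_md",
    "en": "en_core_web_sm",
    "zh": "zh_core_web_sm",
}


def select_spacy_model(lang_code: str) -> str:
    """Map a detected language code (e.g. 'zh-cn') to a spaCy model name.

    Hypothetical helper: the error type used by the service is not documented.
    """
    # 'zh-cn' and 'zh-tw' share one Chinese model, so only the primary
    # subtag before the hyphen is used for the lookup.
    primary = lang_code.strip().lower().split("-")[0]
    try:
        return SPACY_MODELS[primary]
    except KeyError:
        raise ValueError(f"Unsupported lang detected: {lang_code}") from None
```

In a full pipeline the code would typically come from `langdetect.detect(text)` and the returned name would be passed to `spacy.load(...)`; both are left out here so the sketch runs without extra packages.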

Phase 1: Key-Value Extraction (Primary LLM Pass)

Objective: Leverage GPT-4.1-mini to pull out attribute-value pairs.
Details: Prompts are structured to instruct the LLM to aggregate values under a single key into list representations when applicable. Few-shot examples demonstrate this list format.
Result Example: key: collaborator, value: ["Tanaka", "Sato"]
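In the service this aggregation is requested from the LLM through the prompt and few-shot examples. The sketch below only shows the target shape deterministically; the function name `group_values_by_key` and the `project` key are hypothetical.

```python
from typing import Any, Dict, Iterable, List, Tuple


def group_values_by_key(pairs: Iterable[Tuple[str, Any]]) -> Dict[str, Any]:
    """Merge repeated keys into lists; keys seen once keep a scalar value."""
    grouped: Dict[str, List[Any]] = {}
    for key, value in pairs:
        grouped.setdefault(key, []).append(value)
    return {
        key: values[0] if len(values) == 1 else values
        for key, values in grouped.items()
    }


pairs = [
    ("collaborator", "Tanaka"),
    ("collaborator", "Sato"),
    ("project", "X"),  # hypothetical second key, not from the README
]
print(group_values_by_key(pairs))
```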

Phase 2: Type Annotation (LLM Inference)

Objective: Infer the intended Python data type (e.g., integer, string, boolean, list) for every extracted attribute.
Details: The LLM evaluates the nature of the extracted value to assign a corresponding type hint.
Result Example: key: collaborator, value: ["Tanaka", "Sato"] -> list[str]
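To make the label format concrete, here is a small rule-based stand-in that produces labels such as `list[str]` from Python values. It is not how the service infers types (that step is done by the LLM, using context); `type_label` is a hypothetical helper for illustration only.

```python
from typing import Any


def type_label(value: Any) -> str:
    """Return a type-hint style label such as 'int' or 'list[str]'."""
    if value is None:
        return "None"
    if isinstance(value, bool):  # check bool before int: bool subclasses int
        return "bool"
    if isinstance(value, (int, float, str)):
        return type(value).__name__
    if isinstance(value, list):
        # Collect the distinct element labels in a stable order.
        inner = sorted({type_label(item) for item in value})
        if not inner:
            return "list"
        return f"list[{' | '.join(inner)}]"
    return type(value).__name__
```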

Phase 3: Type Verification (LLM Second Pass)

Objective: A dedicated LLM pass to review and correct the type inferences from Phase 2.
Details: GPT-4.1-mini scrutinizes the pair against its context. It rectifies incorrect types (e.g., numeric string designated as 'int' when 'str' is required) or resolves ambiguous typing.
Output: The revised, type-validated attribute list.

Phase 4: Canonical Type Coercion (Static Logic & Fallback)

Objective: Convert each value to the native Python type named by its inferred type (int, float, bool, str, list, None).
Details: Static conversion routines (regex, direct casting) are applied first. If these fail, an LLM fallback handles complex coercions (e.g., normalizing date strings, converting CSV text to Python lists).
Error Handling: Values resistant to conversion are safely cast to None or preserved as str.
Output: The key-value list with standardized Python types.
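A minimal sketch of the static part of this phase, assuming string inputs and a type hint from Phase 3. The regular expressions, the accepted boolean and null spellings, and the name `coerce` are assumptions; the LLM fallback is represented only by a comment, and values the rules cannot convert are preserved as str.

```python
import re
from typing import Any

# Integers with optional thousands separators, e.g. "89,800" or "1234".
_INT_RE = re.compile(r"[+-]?(\d{1,3}(,\d{3})+|\d+)")
_FLOAT_RE = re.compile(r"[+-]?(\d+\.\d*|\.\d+)")
_TRUE = {"true", "yes"}
_FALSE = {"false", "no"}
_NULLS = {"", "null", "none"}


def coerce(value: str, type_hint: str) -> Any:
    """Apply rule-based coercion; return the original string if no rule fits."""
    text = value.strip()
    if type_hint == "None" or text.lower() in _NULLS:
        return None
    if type_hint == "int" and _INT_RE.fullmatch(text):
        return int(text.replace(",", ""))
    if type_hint == "float" and (_FLOAT_RE.fullmatch(text) or _INT_RE.fullmatch(text)):
        return float(text.replace(",", ""))
    if type_hint == "bool":
        lowered = text.lower()
        if lowered in _TRUE:
            return True
        if lowered in _FALSE:
            return False
    # The service would try an LLM fallback here; this sketch keeps the str.
    return value
```

For example, `coerce("89,800", "int")` yields the integer 89800, while `coerce("about ten", "int")` is returned unchanged as a string.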

Phase 5: Final Structuring with Pydantic

Objective: Map the type-normalized data into the final Pydantic structure (e.g., KVOut).
Details: Pydantic models perform the final integrity check, validating against the expected schema for scalar, list, null, or complex types. Errors are logged internally, prioritizing the return of successfully parsed data.
Final Output: The verified data structure serialized into the user-requested format (JSON/YAML/TOML).

This multi-stage design is extensible for future list format enhancements and Pydantic schema evolution.

TOML Serialization Constraints

Simple lists (e.g., tags = ["A", "B"]) map directly to native TOML arrays.
However, lists of objects/dictionaries and deeply nested data are not emitted as native TOML by this service. (TOML itself can express arrays of tables, but the serializer does not produce them.)
Consequently, such complex lists (e.g., [{"id": 1}, {"id": 2}]) are serialized as JSON strings and stored as the TOML value.
JSON and YAML formats retain native representation for nested data.
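The JSON-string fallback described above can be sketched with the standard library alone. This is an illustration under assumptions, not the service's serializer: `toml_value` and `to_toml` are hypothetical names, keys are assumed to be valid bare TOML keys, and anything that is not a scalar or a flat list of scalars is stored as JSON text.

```python
import json
from typing import Any, Dict

_SCALARS = (str, int, float, bool)


def toml_value(value: Any) -> str:
    """Render one value as a TOML right-hand side."""
    if isinstance(value, bool):  # bool first: it is a subclass of int
        return "true" if value else "false"
    if isinstance(value, (int, float)):
        return repr(value)
    if isinstance(value, str):
        # JSON string escapes emitted by json.dumps are valid in TOML basic strings.
        return json.dumps(value, ensure_ascii=False)
    if isinstance(value, list) and all(isinstance(v, _SCALARS) for v in value):
        return "[" + ", ".join(toml_value(v) for v in value) + "]"
    # Complex case: keep the data as JSON text inside a TOML string.
    payload = json.dumps(value, ensure_ascii=False)
    if "'" in payload:
        # Literal strings cannot contain a single quote; use a basic string.
        return json.dumps(payload, ensure_ascii=False)
    return f"'{payload}'"


def to_toml(data: Dict[str, Any]) -> str:
    """Serialize a flat dict to TOML lines (keys assumed to be bare keys)."""
    return "\n".join(f"{key} = {toml_value(value)}" for key, value in data.items())
```

A consumer reading such a file has to call a JSON parser on those string values to recover the nested data.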

Demonstration Input/Output

Input:

Thank you for your order (Order Number: ORD-98765). Product: High-Performance Laptop, Price: 89,800 JPY (tax excluded), Delivery: May 15-17. Shipping address: 1-2-3 Shinjuku, Shinjuku-ku, Tokyo, Apartment 101. Phone: 090-1234-5678. Payment: Credit Card (VISA, last 4 digits: 1234). For changes, contact support@example.com.

Output (JSON):

```json
{
  "order_number": "ORD-98765",
  "product_name": "High-Performance Laptop",
  "price": 89800,
  "price_currency": "JPY",
  "tax_excluded": true,
  "delivery_start_date": "20240515",
  "delivery_end_date": "20240517",
  "shipping_address": "1-2-3 Shinjuku, Shinjuku-ku, Tokyo, Apartment 101",
  "phone_number": "090-1234-5678",
  "payment_method": "Credit Card",
  "card_type": "VISA",
  "card_last4": "1234",
  "customer_support_email": "support@example.com"
}
```

Output (YAML):

```yaml
order_number: ORD-98765
product_name: High-Performance Laptop
price: 89800
price_currency: JPY
tax_excluded: true
delivery_start_date: '20240515'
delivery_end_date: '20240517'
shipping_address: 1-2-3 Shinjuku, Shinjuku-ku, Tokyo, Apartment 101
phone_number: 090-1234-5678
payment_method: Credit Card
card_type: VISA
card_last4: '1234'
customer_support_email: support@example.com
```

Output (TOML, simple structures):

```toml
order_number = "ORD-98765"
product_name = "High-Performance Laptop"
price = 89800
price_currency = "JPY"
tax_excluded = true
delivery_start_date = "20240515"
delivery_end_date = "20240517"
shipping_address = "1-2-3 Shinjuku, Shinjuku-ku, Tokyo, Apartment 101"
phone_number = "090-1234-5678"
payment_method = "Credit Card"
card_type = "VISA"
card_last4 = "1234"
```

Output (TOML, complex structures requiring JSON stringification):

```toml
items = '[{"name": "A", "qty": 2}, {"name": "B", "qty": 5}]'
addresses = '[{"city": "Tokyo", "zip": "160-0022"}, {"city": "Osaka", "zip": "530-0001"}]'
```

Constraint Reminder: Nested data elements are encapsulated as JSON text within TOML values.

Service Methods

1. `extract_json`

Purpose: Ingests raw text and yields extracted key-value attributes validated and formatted as JSON.
Input: input_text (string): The unstructured source material.
Output Contract: { "success": True, "result": ... } or { "success": False, "error": ... }
Result Example: {"success": true, "result": { "setting_a": 42, "setting_b": "text_value" }}

2. `extract_yaml`

Purpose: Ingests raw text and yields extracted key-value attributes validated and formatted as a YAML string.
Input: input_text (string): The unstructured source material.
Output Contract: { "success": True, "result": ... } or { "success": False, "error": ... }
Result Example: {"success": true, "result": "setting_a: 42\nsetting_b: text_value"}

3. `extract_toml`

Purpose: Ingests raw text and yields extracted key-value attributes validated and formatted as a TOML string.
Input: input_text (string): The unstructured source material.
Output Contract: { "success": True, "result": ... } or { "success": False, "error": ... }
Result Example: {"success": true, "result": "setting_a = 42\nsetting_b = \"text_value\""}
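All three methods share the same success/error envelope. A minimal sketch of a wrapper that guarantees this contract is shown below; `run_safely` and the two sample extractors are hypothetical, and the real tools may build the envelope differently.

```python
from typing import Any, Callable, Dict


def run_safely(extract: Callable[[str], Any], input_text: str) -> Dict[str, Any]:
    """Call an extractor and always return the documented envelope."""
    try:
        return {"success": True, "result": extract(input_text)}
    except Exception as exc:  # any failure becomes an error envelope
        return {"success": False, "error": str(exc)}


def fake_extract(text: str) -> Dict[str, Any]:
    # Hypothetical stand-in for the real pipeline.
    return {"setting_a": 42, "setting_b": "text_value"}


def failing_extract(text: str) -> Dict[str, Any]:
    # Hypothetical extractor that mimics the unsupported-language error.
    raise ValueError("Unsupported lang detected")


print(run_safely(fake_extract, "setting_a is 42"))
print(run_safely(failing_extract, "Bonjour"))
```

Callers can therefore branch on the `success` field without wrapping each call in their own exception handling.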

Deployment Instructions

Installation via Smithery

To deploy multilingual-kv-structurer-service to your Claude environment via the Smithery CLI:

```bash
npx -y @smithery/cli install @KunihiroS/kv-extractor-mcp-server --client claude
```

Prerequisites

Python Runtime 3.9 or newer.
Valid OpenAI API credential (configure within settings.json under the env block).

Local Execution

```bash
python server.py
```

Run this command if executing the service outside of a managed environment.

MCP Host Configuration Directives

When bootstrapping this MCP component, logging verbosity and the absolute path for log persistence must be declared via command-line flags.

--log=off : Inhibits all log output generation.
--log=on --logfile=/path/to/your/absolute/log.log : Enables logging and directs all output to the specified, fully qualified file path.
Both flags are mandatory if logging is active. Failure to provide both results in immediate termination with an error message.
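The flag rules above could be enforced with `argparse` roughly as follows. This is a sketch under assumptions: the function name `parse_log_args`, treating `--log` as required, and the exact error messages are not taken from server.py.

```python
import argparse
import os
from typing import List


def parse_log_args(argv: List[str]) -> argparse.Namespace:
    """Parse --log / --logfile and enforce the absolute-path rule."""
    parser = argparse.ArgumentParser(description="logging flags (sketch)")
    # Assumption: --log must always be given explicitly.
    parser.add_argument("--log", choices=["on", "off"], required=True)
    parser.add_argument("--logfile", default=None)
    args = parser.parse_args(argv)
    if args.log == "on":
        if not args.logfile:
            # parser.error prints the message and exits with status 2.
            parser.error("--logfile is required when --log=on")
        if not os.path.isabs(args.logfile):
            parser.error("--logfile must be an absolute path")
    return args
```

With this shape, `--log=on` without `--logfile`, or with a relative path, stops the process at startup instead of failing later when the first log line is written.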

Configuration Example: Logging Deactivated

"multilingual-kv-structurer-service": { "command": "pipx", "args": ["run", "kv-extractor-mcp-server", "--log=off"], "env": { "OPENAI_API_KEY": "{apikey}" } }

Configuration Example: Logging Active (Absolute Path Required)

"multilingual-kv-structurer-service": { "command": "pipx", "args": ["run", "kv-extractor-mcp-server", "--log=on", "--logfile=/workspace/logs/kv-structurer.log"], "env": { "OPENAI_API_KEY": "{apikey}" } }

Critical Logging Policy:
- Log output is strictly confined to the absolute file path when --log=on. Relative paths or missing --logfile trigger initialization failure.
- If logging is suppressed, no output is written.
- Ensure the specified log file location is fully writable by the execution context.
- If startup errors persist, verify that the configuration specifies the very latest component version (replace x.y.z below with the current version number) to bypass potential stale package caches:

"multilingual-kv-structurer-service": { "command": "pipx", "args": ["run", "kv-extractor-mcp-server==x.y.z", "--log=off"], "env": { "OPENAI_API_KEY": "{apikey}" } }

License

Distributed under the GPL-3.0-or-later license.

Custodian

KunihiroS (and contributing developers)

Background Context (From Wikipedia on XMLHttpRequest)

XMLHttpRequest (XHR) is a browser-based API implemented as a JavaScript object. Its methods facilitate sending HTTP requests to a backend server asynchronously, allowing web applications to update content without full page reloads—a core concept of Ajax. Historically, interactions relied solely on standard form submissions or hyperlink navigation, which always resulted in a page refresh.

Historical Genesis

The technique originated with the developers of Microsoft Outlook Web Access for Exchange Server 2000 and first shipped in Internet Explorer 5 (1999). Early implementations used proprietary identifiers such as ActiveXObject("Msxml2.XMLHTTP"). By the time Internet Explorer 7 launched (2006), the standardized XMLHttpRequest identifier had been adopted across the major browser engines, including Mozilla's Gecko (2002), Safari 1.2 (2004), and Opera 8.0 (2005).

Standardization Trajectory

The W3C took ownership, releasing a Working Draft in April 2006. A Level 2 specification followed in February 2008, introducing progress events, cross-site requests, and byte stream handling. Level 2 was merged back into the primary specification in late 2011. Development moved to WHATWG in late 2012, where the specification is maintained as a living document written in Web IDL.

Operational Usage Pattern

Standard XHR interaction involves a sequence of programmatic calls:

Instantiation: Create a new XMLHttpRequest object instance.
Configuration: Invoke the "open" method to define the request method (GET/POST, etc.), target URI, and whether the operation should be synchronous or asynchronous.
Asynchronous Hook: For non-blocking behavior, register an event handler to monitor state transitions.
Transmission: Start the request via the "send" method, optionally passing data payloads.
Response Handling: Monitor state changes within the listener. State 4 signals completion, and the resulting text is typically found in the responseText property.

Beyond these basics, XHR offers granular control: setting custom headers to influence server behavior, uploading data payloads via the send argument, parsing raw responses into native JavaScript objects (like JSON), or processing data chunks incrementally as they arrive. Requests can also be prematurely terminated or set with timeouts.

Security Considerations (Cross-Domain)

Early web architecture strictly limited requests for resources from domains other than that of the originating page. XHR initially inherited this constraint, until extensions such as CORS were developed to manage cross-origin data exchange safely.
