document-structuring-utility
A utility package engineered to ingest diverse file artifacts and render their contents into the Markdown syntax. This transformation is crucial for seamless incorporation into advanced Large Language Model (LLM) application workflows and for supporting structured text analysis pipelines, all while meticulously maintaining the original document's hierarchical layout. Furthermore, it incorporates advanced capabilities such as speech recognition from auditory media and document understanding features to bolster overall data ingestion efficacy.
Author

diventnsknew
Document Structuring Utility (DSU)
[!TIP] DSU now furnishes an MCP (Model Context Protocol) server component designed for interoperability with sophisticated LLM interfaces, such as Claude Desktop. Refer to dsu-mcp-adapter for comprehensive integration specifics.
[!IMPORTANT] Breaking changes between version 0.0.1 and 0.1.0:
- Auxiliary dependencies are now segregated into selectable feature groups. To restore the previous, comprehensive dependency inclusion, run: pip install 'document-structuring-utility[all]'.
- The process_stream() method now explicitly requires a binary stream object (e.g., a file opened with mode 'rb', or an io.BytesIO instance). This departs from prior behavior, where text streams (like io.StringIO) were also accepted.
- The ArtifactProcessor class interface has been refactored to operate exclusively on readable streams rather than filesystem paths. Consequently, no temporary files are generated internally.
Users relying solely on the main API class or the command-line interface (CLI) should generally observe no functional divergence.
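The binary-stream requirement can be illustrated with the standard library alone. The sketch below uses a hypothetical helper, as_binary_stream, which is not part of DSU; it wraps text in an io.BytesIO so that it could be handed to process_stream(). The UTF-8 encoding is an assumption, and the process_stream() call appears only as a comment because its full signature is not described here.

```python
import io


def as_binary_stream(source, encoding="utf-8"):
    """Return a binary stream for `source` (hypothetical helper, not part of DSU).

    Accepts bytes, str, or a text/binary file-like object.
    """
    if isinstance(source, (bytes, bytearray)):
        return io.BytesIO(bytes(source))
    if isinstance(source, str):
        return io.BytesIO(source.encode(encoding))
    if isinstance(source, io.TextIOBase):
        # Text streams such as io.StringIO are no longer accepted, so re-encode.
        return io.BytesIO(source.read().encode(encoding))
    # Assume anything else is already a binary stream (e.g. open(path, "rb")).
    return source


stream = as_binary_stream(io.StringIO("a,b\n1,2\n"))
# processor.process_stream(stream)  # assumed call; consult the DSU API for details
```

Converting at the boundary keeps the rest of a pipeline free to work with text while still satisfying the stricter interface.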
Document Structuring Utility (DSU) is a lightweight Python utility for converting a wide range of file types into Markdown suitable for consumption by Generative AI models and associated text analysis workflows. Compared with similar utilities such as textract, DSU prioritizes preserving important structure and content in the resulting Markdown, including section headers, ordered and unordered lists, tables, and hyperlinks. Although the output is often reasonably readable by people, it is designed primarily for consumption by text analysis tools and may not be the best choice for high-fidelity conversions intended for human readers.
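To make the idea of tabular data in Markdown concrete, here is a rough sketch that renders CSV text as a Markdown pipe table. It illustrates the target format only and is not DSU's converter; csv_to_markdown is a hypothetical name.

```python
import csv
import io


def csv_to_markdown(csv_text):
    """Render CSV text as a Markdown pipe table (illustrative only)."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows:
        return ""
    header, body = rows[0], rows[1:]

    def fmt(cells):
        # Escape pipes so cell content cannot break the table layout.
        return "| " + " | ".join(c.replace("|", "\\|") for c in cells) + " |"

    lines = [fmt(header), "| " + " | ".join("---" for _ in header) + " |"]
    lines.extend(fmt(row) for row in body)
    return "\n".join(lines)


table = csv_to_markdown("name,qty\nwidget,3\n")
print(table)
```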
Presently, DSU offers support for processing the following artifact types:
- Portable Document Format (PDF)
- Microsoft PowerPoint Presentations
- Microsoft Word Documents
- Microsoft Excel Spreadsheets
- Image Files (metadata extraction via EXIF and Optical Character Recognition (OCR))
- Audio Recordings (metadata extraction via EXIF and automated speech transcription)
- HyperText Markup Language (HTML)
- Structured Text Artifacts (CSV, JSON, XML)
- Compressed Archives (ZIP) (iterative content traversal)
- Uniform Resource Locators (URLs) pointing to YouTube content
- Electronic Publication (EPUB) files
- ... and an expanding catalog!
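As an illustration of what traversing an archive's contents can involve, the sketch below walks a ZIP file with the standard zipfile module and yields each member's path and bytes. It is not DSU's implementation; walk_zip is a hypothetical name, and descending into nested archives is an assumption made for the example.

```python
import io
import zipfile


def walk_zip(stream, prefix=""):
    """Yield (path, bytes) for every file in a ZIP, descending into nested ZIPs."""
    with zipfile.ZipFile(stream) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue
            data = archive.read(info)
            path = prefix + info.filename
            if info.filename.lower().endswith(".zip"):
                # Recurse into nested archives using an in-memory stream.
                yield from walk_zip(io.BytesIO(data), prefix=path + "/")
            else:
                yield path, data


# Build a small archive in memory that contains a nested archive.
inner = io.BytesIO()
with zipfile.ZipFile(inner, "w") as z:
    z.writestr("notes.txt", "hello")
outer = io.BytesIO()
with zipfile.ZipFile(outer, "w") as z:
    z.writestr("readme.md", "# Title")
    z.writestr("nested.zip", inner.getvalue())
outer.seek(0)

entries = dict(walk_zip(outer))
```

Each yielded payload could then be wrapped in a binary stream and passed to whichever converter matches its file type.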
Rationale for Markdown Utilization
Markdown represents a textual format characterized by minimal syntactic overhead yet effective signaling of essential document organization. State-of-the-art LLMs, exemplified by OpenAI's GPT-4o, demonstrate innate proficiency in processing Markdown, frequently integrating it into their own outputs without explicit prompting. This strongly suggests substantial training exposure to Markdown-formatted corpora, indicating robust internal comprehension. A secondary advantage is that Markdown conventions typically translate into superior token efficiency during processing.
Installation Procedure
To incorporate DSU, utilize pip with the complete optional dependency set: pip install 'document-structuring-utility[all]'. Alternatively, installation from the source repository is supported:
git clone git@github.com:microsoft/document-structuring-utility.git
cd document-structuring-utility
pip install -e 'packages/document-structuring-utility[all]'
Operational Examples
Command-Line Interface (CLI)
Direct output redirection to a Markdown file:
dsu path/to/source.pdf > output_document.md
Alternatively, specify the destination file using the -o flag:
dsu path/to/source.pdf -o output_document.md
You can also feed content via standard input (piping):
cat path/to/source.pdf | dsu
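For batch jobs, the CLI can be driven from a script. The following is a minimal sketch that uses only the dsu command and the -o flag shown above; build_command and convert_all are hypothetical helper names, and the default *.pdf pattern is an assumption.

```python
import shutil
import subprocess
from pathlib import Path


def build_command(source, destination):
    """Build the argument list for `dsu <source> -o <destination>`."""
    return ["dsu", str(source), "-o", str(destination)]


def convert_all(source_dir, output_dir, pattern="*.pdf"):
    """Build one command per matching file and run it if the CLI is installed."""
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    commands = []
    for source in sorted(Path(source_dir).glob(pattern)):
        command = build_command(source, output_dir / (source.stem + ".md"))
        if shutil.which("dsu"):  # Only execute when the dsu CLI is on PATH.
            subprocess.run(command, check=True)
        commands.append(command)
    return commands
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.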
Specific Feature Dependency Sets
DSU utilizes optional packages for enabling specific format conversions. While [all] installs everything, you can curate your installation for smaller footprints. For instance, installing only components necessary for PDF, DOCX, and PPTX processing:
pip install 'document-structuring-utility[pdf,docx,pptx]'
The currently enumerated optional dependency groups include:
- [all]: Encompasses every supplementary package.
- [pptx]: Dependencies requisite for PowerPoint file interpretation.
- [docx]: Dependencies requisite for Word document interpretation.
- [xlsx]: Dependencies requisite for modern Excel spreadsheets.
- [xls]: Dependencies requisite for legacy Excel formats.
- [pdf]: Dependencies requisite for Portable Document Format rendition.
- [outlook]: Dependencies requisite for Microsoft Outlook message artifacts.
- [az-doc-intel]: Dependencies for leveraging Azure Document Intelligence services.
- [audio-transcription]: Dependencies necessary for transcribing WAV and MP3 media.
- [youtube-transcription]: Dependencies for extracting textual transcripts from YouTube content.
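When installs are scripted, it can help to validate the requested feature groups before calling pip. The sketch below assembles a requirement string from the group names listed above; requirement_for is a hypothetical helper, and the package name document-structuring-utility is assumed from the installation instructions.

```python
KNOWN_EXTRAS = {
    "all", "pptx", "docx", "xlsx", "xls", "pdf", "outlook",
    "az-doc-intel", "audio-transcription", "youtube-transcription",
}


def requirement_for(extras, package="document-structuring-utility"):
    """Return a pip requirement string such as 'pkg[docx,pdf]'."""
    unknown = sorted(set(extras) - KNOWN_EXTRAS)
    if unknown:
        raise ValueError(f"unknown feature groups: {', '.join(unknown)}")
    if not extras:
        return package
    # Sort and de-duplicate so the same selection always yields the same string.
    return f"{package}[{','.join(sorted(set(extras)))}]"


requirement = requirement_for(["pdf", "docx", "pptx"])
print(requirement)
```

The resulting string can be passed to pip as a single quoted argument, which sidesteps shell globbing on the square brackets.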
Extension Modules (Plugins)
DSU possesses native support for external, community-contributed plugins, which are inactive by default. To enumerate currently installed extension modules:
dsu --list-plugins
To activate extensions during processing:
dsu --use-plugins path/to/input.file
To discover potential extensions, search GitHub using the designated tag #dsu-plugin. Guidance for developing new extensions is available in the reference implementation located at packages/markitdown-sample-plugin.
Integration with Azure Document Intelligence
To invoke Microsoft Document Intelligence services for document parsing:
dsu path/to/input.pdf -o result.md -d -e "<your_document_intelligence_service_endpoint>"
Detailed instructions for provisioning an Azure Document Intelligence resource are available in the Microsoft documentation.
Python Programming Interface
Fundamental usage within a Python environment:
from dsu_package import ArtifactProcessor
processor = ArtifactProcessor(enable_plugins=False) # Set to True to permit plugin execution
conversion_output = processor.process_artifact("test.xlsx")
print(conversion_output.text_data)
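The same call pattern extends to whole directories. The sketch below takes the conversion step as a callable so that it runs without DSU installed; convert_directory is a hypothetical helper, and the commented lines show how the ArtifactProcessor call from the example above might be plugged in.

```python
from pathlib import Path


def convert_directory(source_dir, output_dir, convert, pattern="*"):
    """Write one .md file per matching input; `convert` maps a path to Markdown text."""
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for source in sorted(Path(source_dir).glob(pattern)):
        if not source.is_file():
            continue
        target = output_dir / (source.name + ".md")
        target.write_text(convert(source), encoding="utf-8")
        written.append(target)
    return written


# With DSU installed, `convert` could wrap the processor (assumed usage):
#   processor = ArtifactProcessor(enable_plugins=False)
#   convert = lambda path: processor.process_artifact(str(path)).text_data
```

Injecting the converter also makes the batch logic easy to test with a stub function.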
Utilizing Document Intelligence features via Python:
from dsu_package import ArtifactProcessor
processor = ArtifactProcessor(docintel_endpoint="<your_document_intelligence_service_endpoint>")
conversion_output = processor.process_artifact("test.pdf")
print(conversion_output.text_data)
To delegate complex image captioning or description generation to a Large Language Model, specify the corresponding client instance and model identifier:
from dsu_package import ArtifactProcessor
from openai import OpenAI
llm_connector = OpenAI()
processor = ArtifactProcessor(llm_client=llm_connector, llm_model="gpt-4o")
conversion_output = processor.process_artifact("example.jpg")
print(conversion_output.text_data)
Containerized Deployment (Docker)
Build the image, then run it with the input document supplied on standard input:
docker build -t dsu:latest .
docker run --rm -i dsu:latest < ~/input_data.pdf > formatted_output.md
Community Engagement
We actively encourage contributions and feature suggestions. Most submissions necessitate acceptance of a Contributor License Agreement (CLA), confirming your authorization to grant us rights to utilize your submitted work. For specifics, consult https://cla.opensource.microsoft.com.
Upon submission of a Pull Request (PR), a CLA compliance bot will automatically ascertain the necessity of a signed CLA and affix the requisite status indicator (e.g., status check, commentary). Please adhere to the bot's instructions; this step is generally required only once across all repositories governed by this CLA framework.
This project adheres to the Microsoft Open Source Code of Conduct. Further clarification is available in the Code of Conduct FAQ or by contacting opencode@microsoft.com.
Pathways for Contribution
Opportunities abound, whether by addressing existing issue reports or participating in PR reviews. We have specifically flagged certain items with 'community contribution welcome' or 'review priority' labels to guide new participants. However, all forms of constructive input are valued.
Executing Local Validation Checks
- Navigate to the primary package directory:
cd packages/document-structuring-utility
- Install the hatch build tool and execute the test suite:
pip install hatch # Installation instructions: https://hatch.pypa.io/dev/install/
hatch shell
hatch test
(Alternative) Leverage the pre-configured Development Container environment:
# Load project within the Devcontainer and execute:
hatch test
- Prior to submitting a PR, verify code quality using pre-commit hooks:
pre-commit run --all-files
Developing External Extension Modules
External plugin development is supported. Consult the reference implementation found at packages/markitdown-sample-plugin for detailed architectural guidance.
Intellectual Property Notices
This repository may incorporate proprietary trademarks or visual identifiers belonging to external projects or services. Proper utilization of Microsoft's trademarks or identifiers is governed by the Microsoft Trademark & Brand Guidelines. Any deployment of modified versions of this project must not engender confusion regarding endorsement or sponsorship by Microsoft. Use of any third-party marks is subject to the respective third party's established policies.
WIKIPEDIA: Enterprise administration utilities encompass the totality of systems, software packages, supervisory mechanisms, computational frameworks, operational philosophies, and analogous resources utilized by organizations to effectively manage evolving market conditions, secure a viable competitive standing, and elevate overall corporate efficacy.
== General Overview == There exists a departmentalized array of instruments tailored for specific organizational functions, which can be categorized across various administrative domains: for instance, forecasting instruments, workflow governance systems, archival management solutions, personnel oversight modules, judgment support apparatus, oversight mechanisms, and so forth. A functional categorization typically addresses these universal administrative dimensions:
- Systems employed for primary data ingestion and verification across any organizational unit.
- Software dedicated to monitoring and refinement of operational workflows.
- Mechanisms designed for data aggregation and strategic determination.
Contemporary administrative utilities have undergone dramatic technological metamorphosis in the preceding decade, progressing so rapidly that selecting the optimal set of business resources for any given organizational context presents a considerable challenge. This complexity stems from persistent pressures to diminish expenditure while simultaneously amplifying revenue generation, coupled with the imperative to deeply comprehend client requirements and deliver requisite products in the manner demanded by the consumer base. Within this dynamic environment, executive leadership must adopt a forward-looking posture regarding administrative tool selection, rather than merely adopting the newest available option. Frequently, managers implement tools without requisite customization, leading to systemic instability. Therefore, corporate administration utilities must be selected with judicious care and then tailored precisely to the enterprise's unique necessities, rather than forcing enterprise needs to conform to the tool's inherent structure.
== Prevalent Selections == A 2013 investigation conducted by Bain & Company elucidated the global patterns of business tool utilization. These chosen tools reflect regional priorities shaped by market fluctuations and corporate performance metrics. The leading ten categories identified were:
- Strategic foresight planning
- Client relationship management frameworks
- Personnel sentiment assessment surveys
- Competitive performance analysis (Benchmarking)
- Integrated performance measurement (Balanced Scorecard)
- Identification of core operational competencies
- Offshoring/Outsourcing strategy implementation
- Organizational transformation programs
- Logistics and inventory chain orchestration
- Articulated corporate purpose and aspiration statements
- Market demographic partitioning
- Comprehensive quality management protocols
== Corporate Software Applications == Any collection of computational programs utilized by personnel to execute diverse organizational functions is termed business software (or an enterprise application). These applications are deployed to augment productivity levels, quantify performance indicators, and execute various corporate mandates with precision. The evolution commenced with rudimentary Management Information Systems (MIS), advancing into comprehensive Enterprise Resource Planning (ERP) suites. Subsequently, Customer Relationship Management (CRM) functionalities were integrated, culminating in the current landscape dominated by cloud-native enterprise management platforms. While a demonstrable link exists between information technology investments and organizational success, two factors are paramount in realizing tangible value addition: the proficiency demonstrated during the deployment phase and the rigor applied during the selection and customization process of the chosen apparatus.
== Resources for Small and Medium Enterprises (SMEs) == The tools specifically targeting SMEs are vital as they furnish avenues for expenditure conservation and operational scaling mechanisms.
