Python: Split Word Documents

2024-03-27 01:03:24 Written by Koohji

font size decrease font size increase font size
Print
E-mail

Rate this item

(0 votes)

Efficiently managing Word documents often requires the task of splitting them into smaller sections. However, manually performing this task can be time-consuming and labor-intensive. Fortunately, Spire.Doc for Python provides a convenient and efficient way to programmatically segment Word documents, helping users to extract specific parts of a document, split lengthy documents into smaller chunks, and streamline data extraction. This article demonstrates how to use Spire.Doc for Python to split a Word document into multiple documents in Python.

The splitting of a Word document is typically done by page breaks and section breaks due to the dynamic nature of document content. Therefore, this article focuses on the following two parts:

Split a Word Document by Page Breaks with Python
Split a Word Document by Section Breaks with Python

Install Spire.Doc for Python

This scenario requires Spire.Doc for Python and plum-dispatch v1.7.4. They can be easily installed in your Windows through the following pip command.

Package Manager

pip install Spire.Doc

If you are unsure how to install, please refer to: How to Install Spire.Doc for Python on Windows

Split a Word Document by Page Breaks with Python

Page breaks allow for the forced pagination of a document, thereby achieving a fixed division of content. By using page breaks as divisions, we can split a Word document into smaller content-related documents. The detailed steps for splitting a Word document by page breaks are as follows:

Create an instance of Document class and load a Word document using Document.LoadFromFile() method.
Create a new document, add a section to it using Document.AddSection() method.
Iterate through all body child objects in each section in the original document and check if the child object is a paragraph or a table.
If the child object is a table, add it to the section in the new document using Section.Body.ChildObjects.Add() method.
If the child object is a paragraph, add the paragraph object to the section in the new document. Then, iterate through all child objects of the paragraph and check if a child object is a page break.
If the child object in the paragraph is a page break, get its index using Paragraph.ChildObjects.IndexOf() method and remove it from the paragraph by its index.
Save the new document using Document.SaveToFile() method and repeat the above process.

Python

from spire.doc import *
from spire.doc.common import *

inputFile = "Sample.docx"
outputFolder = "output/SplitDocument/"

# Create an instance of Document
original = Document()
# Load a Word document
original.LoadFromFile(inputFile)

# Create a new word document and add a section to it
newWord = Document()
section = newWord.AddSection()
original.CloneDefaultStyleTo(newWord)
original.CloneThemesTo(newWord)
original.CloneCompatibilityTo(newWord)

index = 0
# Iterate through all sections of original document
for m in range(original.Sections.Count):
    sec = original.Sections.get_Item(m)
    # Iterate through all body child objects of each section
    for k in range(sec.Body.ChildObjects.Count):
        obj = sec.Body.ChildObjects.get_Item(k)
        if isinstance(obj, Paragraph):
            para = obj if isinstance(obj, Paragraph) else None
            sec.CloneSectionPropertiesTo(section)
            # Add paragraph object in original section into section of new document
            section.Body.ChildObjects.Add(para.Clone())
            for j in range(para.ChildObjects.Count):
                parobj = para.ChildObjects.get_Item(j)
                if isinstance(parobj, Break) and ( parobj if isinstance(parobj, Break) else None).BreakType == BreakType.PageBreak:
                    # Get the index of page break in paragraph
                    i = para.ChildObjects.IndexOf(parobj)
                    # Remove the page break from its paragraph
                    section.Body.LastParagraph.ChildObjects.RemoveAt(i)
                    # Save the new document
                    resultF = outputFolder
                    resultF += "SplitByPageBreak-{0}.docx".format(index)
                    newWord.SaveToFile(resultF, FileFormat.Docx)
                    index += 1
                    # Create a new document and add a section
                    newWord = Document()
                    section = newWord.AddSection()
                    original.CloneDefaultStyleTo(newWord)
                    original.CloneThemesTo(newWord)
                    original.CloneCompatibilityTo(newWord)
                    sec.CloneSectionPropertiesTo(section)
                    # Add paragraph object in original section into section of new document
                    section.Body.ChildObjects.Add(para.Clone())
                    if section.Paragraphs[0].ChildObjects.Count == 0:
                        # Remove the first blank paragraph
                        section.Body.ChildObjects.RemoveAt(0)
                    else:
                        # Remove the child objects before the page break
                        while i >= 0:
                            section.Paragraphs[0].ChildObjects.RemoveAt(i)
                            i -= 1
        if isinstance(obj, Table):
            # Add table object in original section into section of new document
            section.Body.ChildObjects.Add(obj.Clone())

# Save the document
result = outputFolder+"SplitByPageBreak-{0}.docx".format(index)
newWord.SaveToFile(result, FileFormat.Docx2013)
newWord.Close()

Python: Split Word Documents

Split a Word Document by Section Breaks with Python

Sections divide a Word document into different logical parts and allow for independent formatting for each section. By splitting a Word document into sections, we can obtain multiple documents with relatively independent content and formatting. The detailed steps for splitting a Word document by section breaks are as follows:

Create an instance of Document class and load a Word document using Document.LoadFromFile() method.
Iterate through each section in the document.
Get a section using Document.Sections.get_Item() method.
Create a new Word document and copy the section in the original document to the new document using Document.Sections.Add() method.
Save the new document using Document.SaveToFile() method.

Python

from spire.doc import *
from spire.doc.common import *

# Create an instance of Document class
document = Document()
# Load a Word document
document.LoadFromFile("Sample.docx")

# Iterate through all sections
for i in range(document.Sections.Count):
    section = document.Sections.get_Item(i)
    result = "output/SplitDocument/" + "SplitBySectionBreak_{0}.docx".format(i+1)
    # Create a new Word document
    newWord = Document()
    # Add the section to the new document
    newWord.Sections.Add(section.Clone())
    #Save the new document
    newWord.SaveToFile(result)
    newWord.Close()

Python: Split Word Documents

Apply for a Temporary License

If you'd like to remove the evaluation message from the generated documents, or to get rid of the function limitations, please request a 30-day trial license for yourself.

Additional Info

tutorial_title:

Last modified on Thursday, 25 April 2024 02:07

Read 5093 times

Published in Document Operation

Tagged under

doc Python Document Operation

Social sharing

News Category