
The Open XML SDK and its OpenXmlPowerTools extension counterpart facilitate the creation and processing of XML-based file formats. For more information, refer to Wikipedia.

Most of the time, besides just opening and closing an Open XML file, we also need to load or create a document in memory and manipulate it in some way. OpenXmlMemoryStreamDocument is an in-memory representation of OpenXmlPowerToolsDocument, the base class of WmlDocument. WmlDocument is the “Power Tools” representation and encapsulation of WordprocessingDocument, and it has a handy SaveAs method for flushing the document to disk. OpenXmlMemoryStreamDocument has a static method named CreateWordprocessingDocument for creating a document in memory. We could also create or open an Open XML document from a byte[] or a MemoryStream:

var wordDoc = OpenXmlMemoryStreamDocument.CreateWordprocessingDocument().GetWordprocessingDocument();

or

var wmlDoc = OpenXmlMemoryStreamDocument.CreateWordprocessingDocument().GetModifiedWmlDocument();
wmlDoc.FileName = newFileName;

or

// constructor: WmlDocument(string fileName, byte[] byteArray)
var wmlDoc = new WmlDocument(theNewFileName, byteArray);

or

using (var memStr = new MemoryStream())
{
    // Write the bytes of a valid, bare-bones Open XML Word document to the stream first,
    // and then call
    using (var wordDoc = WordprocessingDocument.Open(memStr, true))
    {
    }
}
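
As a rough sketch of the MemoryStream route using only the Open XML SDK, something like the following should work. The class and method names (InMemoryDocx, CreateMinimal, AppendParagraph) are made up for illustration and belong to neither library, and the sketch assumes the DocumentFormat.OpenXml package is referenced.

```csharp
using System.IO;
using DocumentFormat.OpenXml;
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Wordprocessing;

// Hypothetical helper, not part of the Open XML SDK or OpenXmlPowerTools.
public static class InMemoryDocx
{
    // Builds a bare-bones Word document entirely in memory and returns its bytes.
    public static byte[] CreateMinimal(string text)
    {
        using (var stream = new MemoryStream())
        {
            using (var doc = WordprocessingDocument.Create(stream, WordprocessingDocumentType.Document))
            {
                var mainPart = doc.AddMainDocumentPart();
                mainPart.Document = new Document(new Body(new Paragraph(new Run(new Text(text)))));
                mainPart.Document.Save();
            } // disposing the document flushes the package into the stream

            return stream.ToArray();
        }
    }

    // Reopens existing document bytes for editing and returns the modified bytes.
    public static byte[] AppendParagraph(byte[] docx, string text)
    {
        using (var stream = new MemoryStream())
        {
            // Copy into an expandable stream; new MemoryStream(byte[]) cannot grow.
            stream.Write(docx, 0, docx.Length);

            using (var doc = WordprocessingDocument.Open(stream, true))
            {
                var body = doc.MainDocumentPart.Document.Body;
                body.AppendChild(new Paragraph(new Run(new Text(text))));
                doc.MainDocumentPart.Document.Save();
            }

            return stream.ToArray();
        }
    }
}
```

The resulting bytes can then be handed to the WmlDocument(string, byte[]) constructor mentioned above and written to disk with SaveAs.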

Narrowing the focus to WordprocessingDocument, I noticed that its most important property is MainDocumentPart, of type MainDocumentPart. The way I picture it, MainDocumentPart holds a bunch of parts (we add and remove parts of a document through this property) for everything conceivable in a Word document that is not a simple text run, from headers (HeaderPart) to images (ImagePart), charts, and so on. All of this follows the Open XML package model: different parts (PackagePart) and relationships (PackageRelationship) are defined and tied back to a main part that determines the package type, which in this case is MainDocumentPart, and the whole assembled document is a Package.

The WordprocessingDocument.MainDocumentPart.Document property is the root element of the main document part, and its Body property is the actual container for what we see on screen, laid out as XML. Package, PackagePart, and PackageRelationship are defined in the System.IO.Packaging namespace (WindowsBase.dll on the .NET Framework). The Open XML SDK defines its own OpenXmlPackage and OpenXmlPartContainer. This is only a blurry picture of what WordprocessingDocument is, but it was enough for me to hit the ground running. I recommend that anyone else reading this post lean on the actual documentation on MSDN as well as the Open XML SDK source code on GitHub.

We could, for example, access the document properties that usually appear in the Microsoft Word status bar, such as character count and page count, through the ExtendedFilePropertiesPart property of a WordprocessingDocument instance.
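
To make the parts-and-relationships picture a bit more concrete, here is a small sketch that lists what hangs off MainDocumentPart in an existing file. PartExplorer and Dump are hypothetical names used only for this illustration, and the output format is arbitrary.

```csharp
using System;
using System.Linq;
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Wordprocessing;

// Hypothetical helper for poking around a document's parts.
public static class PartExplorer
{
    public static void Dump(string path)
    {
        using (var doc = WordprocessingDocument.Open(path, false))
        {
            var main = doc.MainDocumentPart;

            // Every part directly related to the main part, with its relationship id.
            foreach (var pair in main.Parts)
            {
                Console.WriteLine("{0} -> {1} ({2})",
                    pair.RelationshipId,
                    pair.OpenXmlPart.Uri,
                    pair.OpenXmlPart.GetType().Name);
            }

            // Typed shortcuts for some common kinds of parts.
            Console.WriteLine("Headers: {0}", main.HeaderParts.Count());
            Console.WriteLine("Images:  {0}", main.ImageParts.Count());

            // The visible content lives under Document.Body.
            var body = main.Document.Body;
            Console.WriteLine("Top-level paragraphs: {0}", body.Elements<Paragraph>().Count());
        }
    }
}
```

Running it against a document with a header and a picture should show a HeaderPart and an ImagePart among the listed relationships, alongside parts such as styles and settings.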

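To read the page count from ExtendedFilePropertiesPart, for example, we could try the following snippet:

using (var document = WordprocessingDocument.Open(FilePath, false))
{
    var propertiesPart = document.ExtendedFilePropertiesPart;
    var pages = propertiesPart.Properties.Pages;
    count = pages.Count();
}

But that does not work. It looks like a bug at first, but Count() here is most likely the LINQ extension enumerating the child elements of the Pages element, which is a text-only leaf, so it does not return the number stored in the file; that number is the element's Text. Even then, the stored value is only what the last application to save the file wrote there, so it can be stale or missing. To count pages from the document content itself, we can take the following approach: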

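// Requires: using System.Linq; using DocumentFormat.OpenXml.Packaging; using DocumentFormat.OpenXml.Wordprocessing;
public Int32 Count()
{
    var count = 1;
    var breakDetected = false;

    using (var document = WordprocessingDocument.Open(FilePath, false))
    {
        var body = document.MainDocumentPart.Document.Body;

        // For each top-level block, record whether it is a paragraph whose first child
        // is a Run containing a Break. Non-paragraph blocks (e.g. tables) never match;
        // the null check keeps them from throwing.
        var blocks = (from ele in body.ChildElements
                      where !(ele is SectionProperties)
                      let pr = ele as Paragraph
                      let containBreak = pr != null && pr.FirstChild is Run && pr.FirstChild.GetFirstChild<Break>() != null
                      select new
                      {
                          Element = pr,
                          ContainBreak = containBreak
                      }).ToList();

        foreach (var block in blocks)
        {
            // A break in the previous block means this block starts a new page.
            if (breakDetected)
            {
                count++;
                breakDetected = false;
            }
            if (block.ContainBreak)
            {
                breakDetected = true;
            }
        }
    }

    return count;
}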

 

This is a rudimentary approach: a Break could also sit inside a table cell, or in any Run other than the first child of a Paragraph, and the check does not tell page breaks apart from plain line breaks, which use the same Break element.
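
One possible refinement, offered as a sketch rather than a definitive fix: read the page count cached in the extended properties through the element's Text, and separately count explicit page breaks anywhere in the body. PageCounting, ReadCachedPageCount, and CountExplicitPageBreaks are hypothetical names. The second method still ignores pages that arise from text simply flowing onto the next page or from section settings, so in general it undercounts.

```csharp
using System.Linq;
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Wordprocessing;

// Hypothetical helpers; neither is part of the Open XML SDK.
public static class PageCounting
{
    // Returns the page count stored by the last application that saved the file,
    // or null when the value is absent or not a number.
    public static int? ReadCachedPageCount(string path)
    {
        using (var doc = WordprocessingDocument.Open(path, false))
        {
            var propertiesPart = doc.ExtendedFilePropertiesPart;
            if (propertiesPart == null || propertiesPart.Properties == null)
            {
                return null;
            }

            var pages = propertiesPart.Properties.Pages;
            int value;
            return pages != null && int.TryParse(pages.Text, out value) ? value : (int?)null;
        }
    }

    // Counts explicit page breaks at any depth: in tables, in later runs, and so on.
    public static int CountExplicitPageBreaks(string path)
    {
        using (var doc = WordprocessingDocument.Open(path, false))
        {
            return doc.MainDocumentPart.Document.Body
                .Descendants<Break>()
                .Count(b => b.Type != null && b.Type.Value == BreakValues.Page);
        }
    }
}
```

A document created purely through the SDK usually has no cached value at all, so ReadCachedPageCount returning null is an expected outcome and not an error.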

When copying part of a document to a new document, you could use DocumentBuilder:

var sources = new List<Source>
{
    // the final argument (true) keeps sections, and with them headers and footers
    new Source(wmlDoc, currentPage.Index, currentPage.Count, true) { DiscardHeadersAndFootersInKeptSections = false, KeepSections = true },
    // whole document instead: new Source(wmlDoc, true) { DiscardHeadersAndFootersInKeptSections = false, KeepSections = true },
};
WmlDocument wmlDc = DocumentBuilder.BuildDocument(sources);

However, when the file gets complex, it becomes best to do the manual modification yourself, mainly because headers and footers, as well as some other parts such as images, do not get copied over. The Open XML org forum has a thread explaining that this is the case.
