John Davidson

php - How to extract bold text from docx



I want to extract bold text from a Word .docx file using PHP. I copy the .docx to a .zip file and extract it, then read word/document.xml. In the XML, the presence of <w:b/> indicates that the text is bold.


sample.docx:



Create zip and extract


<?php
$docname="sample";
echo copy($docname.".docx",$docname.".zip");

$zip = new ZipArchive;
if ($zip->open($docname.".zip") === TRUE) {
    $zip->extractTo($docname."/");
    $zip->close();
} else {
    echo 'failed';
}
?>
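
Since a .docx file is already a zip archive, the copy-and-extract step can probably be skipped. The following is a minimal sketch, not part of the original question, that reads word/document.xml straight out of the archive with ZipArchive::getFromName. The function name readDocumentXml is hypothetical.

```php
<?php
// Hypothetical helper: returns the raw XML of word/document.xml,
// or null if the archive or the entry cannot be read.
function readDocumentXml(string $docxPath): ?string
{
    $zip = new ZipArchive;
    // open() returns true on success and an integer error code on failure.
    if ($zip->open($docxPath) !== true) {
        return null;
    }
    // getFromName() returns the entry contents, or false if the entry is missing.
    $xml = $zip->getFromName('word/document.xml');
    $zip->close();
    return $xml === false ? null : $xml;
}

// Usage, assuming sample.docx sits next to the script as in the question.
$xml = readDocumentXml('sample.docx');
if ($xml === null) {
    echo "failed\n";
}
```

The returned string can be handed to XMLReader::xml() or DOMDocument::loadXML() instead of opening a file from the extracted folder.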

Extract bold words into an array (reference: search-bold)


<?php
//https://www.jackreichert.com/2012/11/how-to-convert-docx-to-html/
$folder="sample";
$xmlFile = $folder."/word/document.xml";
$reader = new XMLReader;
$reader->open($xmlFile);
$bold_words=[];
while($reader->read()){
    if ($reader->nodeType == XMLREADER::ELEMENT && $reader->name === 'w:p'){
        $paragraph = new XMLReader;
        $p = $reader->readOuterXML();
        $paragraph->xml($p);
        while ($paragraph->read()){
            if ($paragraph->nodeType == XMLREADER::ELEMENT && $paragraph->name === 'w:r'){
                $node = trim($paragraph->readInnerXML());
                //strstr() function searches for the first occurrence of a string inside another string
                if(strstr($node,'<w:b/>'))
                {
                    $bold_words[]=$node;
                }
            }
        }
    }
}
echo "<pre>";
var_dump($bold_words);
echo "</pre>";
?>

The result shows:


array(1) {
[0]=>string(364) "Title content"
}

The result should contain 5 bold words, but it contains only one. I have checked document.xml, and <w:b/> appears only once.



How can I list the bold-formatted text in document.xml?



My Answer:
To list the bold-formatted text in document.xml, you can modify the code to check for the <w:b> tag inside the <w:rPr> (run properties) element of each <w:r> (run), instead of searching the raw XML for the literal string <w:b/>. Here is an updated version of the code:

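<?php
$folder = "sample";
$xmlFile = $folder . "/word/document.xml";
$reader = new XMLReader;
$reader->open($xmlFile);
$bold_words = [];

while ($reader->read()) {
    // A w:r (run) is a stretch of text that shares one set of formatting.
    if ($reader->nodeType == XMLReader::ELEMENT && $reader->name === 'w:r') {
        $run = new XMLReader;
        $run->xml($reader->readOuterXML());
        $inRunProps = false;
        $isBold = false;
        $text = '';

        while ($run->read()) {
            if ($run->nodeType != XMLReader::ELEMENT) {
                continue;
            }
            if ($run->depth === 1) {
                // Direct children of the run: w:rPr (properties), w:t (text), ...
                $inRunProps = ($run->name === 'w:rPr');
                if ($run->name === 'w:t') {
                    $text .= $run->readString();
                }
            } elseif ($run->depth === 2 && $inRunProps && $run->name === 'w:b') {
                // <w:b w:val="0"/> (or "false"/"off") switches bold off explicitly.
                $isBold = !in_array($run->getAttribute('w:val'), ['0', 'false', 'off'], true);
            }
        }
        $run->close();

        if ($isBold && $text !== '') {
            $bold_words[] = $text;
        }
    }
}
$reader->close();

echo "<pre>";
var_dump($bold_words);
echo "</pre>";
?>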


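This code collects the text of every run whose <w:rPr> contains a <w:b> element that is not switched off. It only sees direct formatting: bold that comes from a style is defined in word/styles.xml rather than marked with <w:b> in document.xml, so it will not be found this way. Adjust the file paths and folder names to match your setup.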
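
The question notes that <w:b/> appears only once in document.xml, so the other bold words may get their formatting from a style (a heading style or a character style such as "Strong"). That is an assumption, since sample.docx is not shown. Under that assumption, here is a rough sketch using DOMDocument and DOMXPath that also consults word/styles.xml. The function names wordXPath, boldToggle, boldStyleIds and extractBoldText are hypothetical, and the sketch deliberately ignores style inheritance (w:basedOn), document defaults, and Word's toggle rules for combined styles.

```php
<?php
const W_NS = 'http://schemas.openxmlformats.org/wordprocessingml/2006/main';

// Builds an XPath object with the "w" prefix registered.
function wordXPath(string $xml): DOMXPath
{
    $doc = new DOMDocument;
    $doc->loadXML($xml);
    $xp = new DOMXPath($doc);
    $xp->registerNamespace('w', W_NS);
    return $xp;
}

// true/false when $context has w:rPr/w:b (on/off), null when it has none.
function boldToggle(DOMXPath $xp, DOMNode $context): ?bool
{
    $b = $xp->query('w:rPr/w:b', $context)->item(0);
    if ($b === null) {
        return null;
    }
    $val = $b->getAttributeNS(W_NS, 'val'); // '' when the attribute is absent
    return !in_array($val, ['0', 'false', 'off'], true);
}

// IDs of styles whose own run properties switch bold on.
function boldStyleIds(string $stylesXml): array
{
    $xp = wordXPath($stylesXml);
    $ids = [];
    foreach ($xp->query('//w:style') as $style) {
        if (boldToggle($xp, $style) === true) {
            $ids[] = $style->getAttributeNS(W_NS, 'styleId');
        }
    }
    return $ids;
}

// Text of each run that is bold directly or through a listed style.
function extractBoldText(string $documentXml, array $boldStyleIds): array
{
    $xp = wordXPath($documentXml);
    $bold = [];
    foreach ($xp->query('//w:r') as $r) {
        $isBold = boldToggle($xp, $r); // direct formatting wins when present
        if ($isBold === null) {
            $rStyle = $xp->evaluate('string(w:rPr/w:rStyle/@w:val)', $r);
            $pStyle = $xp->evaluate('string(ancestor::w:p[1]/w:pPr/w:pStyle/@w:val)', $r);
            $isBold = in_array($rStyle, $boldStyleIds, true)
                || in_array($pStyle, $boldStyleIds, true);
        }
        $text = '';
        foreach ($xp->query('w:t', $r) as $t) {
            $text .= $t->textContent;
        }
        if ($isBold && $text !== '') {
            $bold[] = $text;
        }
    }
    return $bold;
}

// Usage, assuming the extracted folder layout from the question:
// $bold = extractBoldText(
//     file_get_contents('sample/word/document.xml'),
//     boldStyleIds(file_get_contents('sample/word/styles.xml'))
// );
```

Each array entry is the text of one run, so a bold phrase that Word split into several runs comes back as several entries.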
