I want to extract bold text from a Word .docx file using PHP. I copy the .docx to a .zip file, extract it, and then read document.xml. In the XML, the presence of <w:b/>
shows that the text is bold.
sample.docx:
Create the zip file and extract it:
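<?php
$docname = "sample";
// A .docx file is a zip archive; copy it to a .zip name (echoes 1 on success).
echo copy($docname.".docx", $docname.".zip");
$zip = new ZipArchive;
if ($zip->open($docname.".zip") === TRUE) {
    // Unpack the archive, including word/document.xml, into sample/
    $zip->extractTo($docname."/");
    $zip->close();
} else {
    echo 'failed';
}
?>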
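As a side note on the step above, the copy to a .zip name and the extraction to a folder may not be needed: ZipArchive can open a .docx directly and return a single entry as a string. The following is only a minimal sketch. The helper name readDocumentXml is hypothetical, and the demo builds a tiny stand-in archive (not a valid Word file) so that it runs on its own.

```php
<?php
// Hypothetical helper: return the raw XML of word/document.xml, or null on failure.
function readDocumentXml(string $docxPath): ?string
{
    $zip = new ZipArchive();
    if ($zip->open($docxPath) !== true) {
        return null; // open() returns an error code instead of true on failure
    }
    $xml = $zip->getFromName('word/document.xml');
    $zip->close();
    return $xml === false ? null : $xml;
}

// Demo only: create a stand-in archive with one entry so the sketch is self-contained.
$tmp = tempnam(sys_get_temp_dir(), 'docx');
$zip = new ZipArchive();
$zip->open($tmp, ZipArchive::CREATE | ZipArchive::OVERWRITE);
$zip->addFromString('word/document.xml', '<w:document/>');
$zip->close();

$xml = readDocumentXml($tmp);
unlink($tmp);
var_dump($xml);
```

The returned string can then be passed to XMLReader with $reader->xml($xml) instead of $reader->open($xmlFile).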
Extract bold words to an array (reference: search-bold):
<?php
//https://www.jackreichert.com/2012/11/how-to-convert-docx-to-html/
$folder = "sample";
$xmlFile = $folder."/word/document.xml";
$reader = new XMLReader;
$reader->open($xmlFile);
$bold_words = [];
while ($reader->read()) {
    if ($reader->nodeType == XMLReader::ELEMENT && $reader->name === 'w:p') {
        $paragraph = new XMLReader;
        $p = $reader->readOuterXML();
        $paragraph->xml($p);
        while ($paragraph->read()) {
            if ($paragraph->nodeType == XMLReader::ELEMENT && $paragraph->name === 'w:r') {
                $node = trim($paragraph->readInnerXML());
                //strstr() function searches for the first occurrence of a string inside another string
                if (strstr($node, '<w:b/>')) {
                    $bold_words[] = $node;
                }
            }
        }
    }
}
echo "<pre>";
var_dump($bold_words);
echo "</pre>";
?>
The result shows:
array(1) {
[0]=>string(364) "Title content"
}
There should be 5 bold words in the result, but there is only one. I have checked document.xml, and <w:b/>
appears only once.
How can I list the bold-formatted text in document.xml?
My Answer:
To list the bold-formatted text in document.xml, you can modify the code to check for the presence of the <w:b> element inside the run properties (<w:rPr>) of each run (<w:r>):
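<?php
$folder = "sample";
$xmlFile = $folder."/word/document.xml";
$reader = new XMLReader;
$reader->open($xmlFile);
$bold_words = [];
while ($reader->read()) {
    if ($reader->nodeType == XMLReader::ELEMENT && $reader->name === 'w:r') {
        // Parse each run on its own so both its properties and its text can be inspected.
        $run = new XMLReader;
        $run->xml($reader->readOuterXML());
        $isBold = false;
        $text = '';
        while ($run->read()) {
            if ($run->nodeType != XMLReader::ELEMENT) {
                continue;
            }
            if ($run->name === 'w:b') {
                // <w:b/> or <w:b w:val="true"/> turns bold on; w:val="0", "false" or "off" turns it off.
                $val = $run->getAttribute('w:val');
                $isBold = !in_array($val, ['0', 'false', 'off'], true);
            } elseif ($run->name === 'w:t') {
                // Collect the visible text of the run.
                $text .= $run->readString();
            }
        }
        $run->close();
        if ($isBold && trim($text) !== '') {
            $bold_words[] = trim($text);
        }
    }
}
$reader->close();
echo "<pre>";
var_dump($bold_words);
echo "</pre>";
?>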
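This code checks each run individually and matches the <w:b> element itself rather than the literal string '<w:b/>', so forms such as <w:b w:val="true"/> are also detected. Note that it only sees bold applied directly to a run. If document.xml really contains <w:b/> only once, the other bold words are probably bold through a style (referenced by <w:rStyle> or <w:pStyle> and defined in word/styles.xml), which this code does not resolve. Adjust the file paths and folder names to match your setup.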
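For comparison, the same check for a <w:b> element inside a run's <w:rPr> can be written with DOMDocument and XPath. This is a minimal sketch, not the method from the referenced article. The function name boldRuns and the inline sample XML are hypothetical stand-ins for a real document.xml, and style-based bold is not handled.

```php
<?php
// Hypothetical sample standing in for word/document.xml; real files are much larger.
$xml = <<<'XML'
<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
  <w:body>
    <w:p>
      <w:r><w:rPr><w:b/></w:rPr><w:t>Title content</w:t></w:r>
      <w:r><w:t xml:space="preserve"> plain text </w:t></w:r>
      <w:r><w:rPr><w:b w:val="0"/></w:rPr><w:t>not bold</w:t></w:r>
      <w:r><w:rPr><w:b w:val="true"/><w:i/></w:rPr><w:t>also bold</w:t></w:r>
    </w:p>
  </w:body>
</w:document>
XML;

// Return the text of every run that carries a direct, enabled <w:b> property.
function boldRuns(string $xml): array
{
    $doc = new DOMDocument();
    $doc->loadXML($xml);
    $xpath = new DOMXPath($doc);
    $xpath->registerNamespace('w', 'http://schemas.openxmlformats.org/wordprocessingml/2006/main');

    // Runs whose <w:rPr> has a <w:b> that is not switched off through w:val.
    $query = '//w:r[w:rPr/w:b[not(@w:val="0" or @w:val="false" or @w:val="off")]]';
    $bold = [];
    foreach ($xpath->query($query) as $run) {
        $text = '';
        // A run may hold several <w:t> children; join them.
        foreach ($xpath->query('w:t', $run) as $t) {
            $text .= $t->textContent;
        }
        if (trim($text) !== '') {
            $bold[] = trim($text);
        }
    }
    return $bold;
}

print_r(boldRuns($xml));
```

With the sample above, this prints an array holding "Title content" and "also bold". For a real file, the XML string would come from word/document.xml, for example via file_get_contents($folder."/word/document.xml").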