How to use lxml to find an element by text?

You are very close. Use text()= rather than @text (which refers to an attribute).

    e = root.xpath('.//a[text()="TEXT A"]')

Or, if you know only that the text contains "TEXT A",

    e = root.xpath('.//a[contains(text(),"TEXT A")]')

Or, if you know only that the text starts with "TEXT A",

    e = root.xpath('.//a[starts-with(text(),"TEXT A")]')

See the docs for more on the available …
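To see how the three predicates differ, here is a minimal sketch run against a small made-up document. The sample XML and the variable names are assumptions for illustration, not part of the original question.

```python
from lxml import etree

# Hypothetical sample document with four <a> elements
root = etree.fromstring(
    "<root>"
    "<a>TEXT A</a>"
    "<a>TEXT A plus more</a>"
    "<a>other</a>"
    "<a>prefix TEXT A</a>"
    "</root>"
)

# Exact match on the element's text node
exact = root.xpath('.//a[text()="TEXT A"]')

# Substring match anywhere in the first text node
containing = root.xpath('.//a[contains(text(),"TEXT A")]')

# Prefix match on the first text node
starting = root.xpath('.//a[starts-with(text(),"TEXT A")]')

print(len(exact), len(containing), len(starting))  # prints: 1 3 2
```

Note that contains(text(), ...) and starts-with(text(), ...) look only at the first text node of each element. If an element mixes text and child elements, testing "." instead of text() compares against the full string value.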

Pretty print in lxml is failing when I add tags to a parsed tree

It has to do with how lxml treats whitespace; see the lxml FAQ for details. To fix this, change the loading part of the file to the following:

    parser = etree.XMLParser(remove_blank_text=True)
    root = etree.parse('file.xml', parser).getroot()

I didn’t test it, but it should indent your file just fine with this change.
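A small self-contained sketch of the same fix, parsing from an in-memory string instead of file.xml so it can be run directly. The sample XML and element names are assumptions for illustration.

```python
from lxml import etree

# Hypothetical input that already contains indentation whitespace
xml = b"<root>\n  <child>one</child>\n</root>"

# Drop whitespace-only text nodes while parsing, so the serializer
# is free to re-indent the whole tree later
parser = etree.XMLParser(remove_blank_text=True)
root = etree.fromstring(xml, parser)

# Add a new tag to the parsed tree
etree.SubElement(root, "child").text = "two"

pretty = etree.tostring(root, pretty_print=True).decode()
print(pretty)
```

Without remove_blank_text=True, the whitespace kept from the original file tends to stop the serializer from indenting the newly added elements.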

Using Python Iterparse For Large XML Files

Try Liza Daly’s fast_iter. After processing an element, elem, it calls elem.clear() to remove descendants and also removes preceding siblings.

    def fast_iter(context, func, *args, **kwargs):
        """
        http://lxml.de/parsing.html#modifying-the-tree
        Based on Liza Daly's fast_iter
        http://www.ibm.com/developerworks/xml/library/x-hiperfparse/
        See also http://effbot.org/zone/element-iterparse.htm
        """
        for event, elem in context:
            func(elem, *args, **kwargs)
            # It's safe to call clear() here because no descendants …
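The excerpt above is cut off, so the following is only a rough sketch of the clear-and-prune pattern it describes, not a reconstruction of the original function. The helper name iter_and_free, the sample data, and the tag name "item" are all assumptions for illustration.

```python
import io
from lxml import etree

def iter_and_free(context, func):
    """Hypothetical helper: call func on each element, then free memory."""
    for _event, elem in context:
        func(elem)
        # Drop the element's own children, text and attributes
        elem.clear()
        # Remove already-processed siblings of this element and of its
        # ancestors, so the partially built tree does not keep growing
        for ancestor in elem.xpath("ancestor-or-self::*"):
            while ancestor.getprevious() is not None:
                del ancestor.getparent()[0]

# Small in-memory stand-in for a large XML file
xml = b"<items>" + b"".join(b"<item>%d</item>" % i for i in range(5)) + b"</items>"

# Only 'end' events for <item> elements are reported
context = etree.iterparse(io.BytesIO(xml), events=("end",), tag="item")

seen = []
iter_and_free(context, lambda elem: seen.append(elem.text))
print(seen)  # prints: ['0', '1', '2', '3', '4']
```

The callback must read everything it needs from the element before returning, because the element is emptied right afterwards.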

Remove namespace and prefix from xml in python using lxml

We can get the desired output document in two steps:

1. Remove namespace URIs from element names
2. Remove unused namespace declarations from the XML tree

Example code:

    from lxml import etree

    input_xml = """
    <package xmlns="http://apple.com/itunes/importer">
      <provider>some data</provider>
      <language>en-GB</language>
      <!-- some comment -->
      <?xml-some-processing-instruction ?>
    </package>
    """

    root = etree.fromstring(input_xml)

    # Iterate through all XML elements …
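Since the example above is truncated, here is one possible way to carry out the two steps, offered as a sketch rather than the original answer's code. It assumes a simplified input without the processing instruction, and it uses etree.QName to extract local names and etree.cleanup_namespaces to drop the unused declaration.

```python
from lxml import etree

# Simplified, hypothetical input based on the document above
input_xml = (
    b'<package xmlns="http://apple.com/itunes/importer">'
    b"<provider>some data</provider>"
    b"<language>en-GB</language>"
    b"<!-- some comment -->"
    b"</package>"
)
root = etree.fromstring(input_xml)

# Step 1: replace each '{namespace}name' tag with its local name
for elem in root.iter():
    # Comments and processing instructions have a non-string tag; skip them
    if not isinstance(elem.tag, str):
        continue
    elem.tag = etree.QName(elem).localname

# Step 2: remove namespace declarations that are no longer used
etree.cleanup_namespaces(root)

result = etree.tostring(root).decode()
print(result)
```

Both steps are needed: renaming the tags alone leaves the xmlns declaration on the root element, and cleanup_namespaces alone does nothing while elements still use the namespace.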

Find python lxml version

You can get the version by looking at etree:

    >>> from lxml import etree
    >>> etree.LXML_VERSION
    (3, 0, -198, 0)

Other versions of interest are etree.LIBXML_VERSION, etree.LIBXML_COMPILED_VERSION, etree.LIBXSLT_VERSION and etree.LIBXSLT_COMPILED_VERSION.
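For a quick diagnostic, the version tuples named above can be printed together. This is a minimal sketch; the dotted helper and the dictionary labels are assumptions for illustration, and the actual values depend on the installation.

```python
from lxml import etree

def dotted(version):
    """Hypothetical helper: turn a version tuple into a dotted string."""
    return ".".join(str(part) for part in version)

versions = {
    "lxml": etree.LXML_VERSION,
    "libxml2 (runtime)": etree.LIBXML_VERSION,
    "libxml2 (compiled against)": etree.LIBXML_COMPILED_VERSION,
    "libxslt (runtime)": etree.LIBXSLT_VERSION,
    "libxslt (compiled against)": etree.LIBXSLT_COMPILED_VERSION,
}

for name, version in versions.items():
    print(f"{name}: {dotted(version)}")
```

Comparing the runtime and compiled-against versions can help when tracking down behavior that differs between machines.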

How to get path of an element in lxml?

Use getpath from ElementTree objects.

    from lxml import etree

    root = etree.fromstring('''
    <foo><bar>Data</bar><bar><baz>data</baz>
    <baz>data</baz></bar></foo>
    ''')

    tree = etree.ElementTree(root)
    for e in root.iter():
        print(tree.getpath(e))

Prints

    /foo
    /foo/bar[1]
    /foo/bar[2]
    /foo/bar[2]/baz[1]
    /foo/bar[2]/baz[2]
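The paths returned by getpath are ordinary XPath expressions, so they can be fed back into xpath() to locate the same element again. A short sketch of that round trip, using the same sample document; the variable names are assumptions for illustration.

```python
from lxml import etree

root = etree.fromstring(
    "<foo><bar>Data</bar><bar><baz>data</baz><baz>data</baz></bar></foo>"
)
tree = etree.ElementTree(root)

# Collect the path of every element in document order
paths = [tree.getpath(e) for e in root.iter()]

# Each path should select exactly the element it was generated from
round_trip = [tree.xpath(path)[0] for path in paths]
same = all(found is elem for found, elem in zip(round_trip, root.iter()))

# If only an element is at hand, its tree is available via getroottree()
second_baz = root[1][1]
print(second_baz.getroottree().getpath(second_baz))  # prints: /foo/bar[2]/baz[2]
```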