

Quickstart
==========


Hi, welcome to the quick-start tutorial.

Installing xmldataset
----------------------

Use pip to install xmldataset ::

   pip install xmldataset

Configuring xmldataset
----------------------

To get started, import the xmldataset package::

   import xmldataset

Configuring pretty print
------------------------

For these examples it is worth using Python's pprint module, which makes the returned results easy to read and check.  Configure it as follows:

::

   import pprint

   # Setup Pretty Printing
   ppsetup = pprint.PrettyPrinter(indent=4)
   pp = ppsetup.pprint

Adding XML content
------------------

We need some XML to work with.  All of the examples use the same XML, although the profile changes to demonstrate different behaviour.  Set the XML
up as follows:

::

   xml = """<?xml version="1.0"?>
     <catalog>
        <shop number="1">
           <book id="bk101">
              <author>Gambardella, Matthew</author>
              <title>XML Developer's Guide</title>
              <genre>Computer</genre>
              <price>44.95</price>
              <publish_date>2000-10-01</publish_date>
              <description>An in-depth look at creating applications 
              with XML.</description>
           </book>
           <book id="bk102">
              <author>Ralls, Kim</author>
              <title>Midnight Rain</title>
              <genre>Fantasy</genre>
              <price>5.95</price>
              <publish_date>2000-12-16</publish_date>
              <description>A former architect battles corporate zombies, 
              an evil sorceress, and her own childhood to become queen 
              of the world.</description>
           </book>
        </shop>
        <shop number="2">
           <book id="bk103">
              <author>Corets, Eva</author>
              <title>Maeve Ascendant</title>
              <genre>Fantasy</genre>
              <price>5.95</price>
              <publish_date>2000-11-17</publish_date>
              <description>After the collapse of a nanotechnology 
              society in England, the young survivors lay the 
              foundation for a new society.</description>
           </book>
           <book id="bk104">
              <author>Corets, Eva</author>
              <title>Oberon's Legacy</title>
              <genre>Fantasy</genre>
              <price>5.95</price>
              <publish_date>2001-03-10</publish_date>
              <description>In post-apocalypse England, the mysterious 
              agent known only as Oberon helps to create a new life 
              for the inhabitants of London. Sequel to Maeve 
              Ascendant.</description>
           </book>
        </shop>
     </catalog>"""

Example 1 - Extracting the title and author from the dataset
------------------------------------------------------------

To configure the profile, follow the structure of the XML: we need to step down
through the elements catalog, shop and book before we reach author and title.

We can then specify that author and title are stored in the dataset **title_and_author**:

::

   profile="""
   catalog
       shop
           book
               author = dataset:title_and_author
               title  = dataset:title_and_author"""

Passing the XML and this profile to xmldataset.parse_using_profile and pretty printing the result gives the following output:

::

   {   'title_and_author': [   {   'author': 'Gambardella, Matthew',
                                   'title': "XML Developer's Guide"},
                               {   'author': 'Ralls, Kim',
                                   'title': 'Midnight Rain'},
                               {   'author': 'Corets, Eva',
                                   'title': 'Maeve Ascendant'},
                               {   'author': 'Corets, Eva',
                                   'title': "Oberon's Legacy"}]}
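
As a cross-check, the same records can be built by hand with Python's standard xml.etree.ElementTree module. The sketch below is not how xmldataset works internally; it walks a shortened copy of the catalog (an assumption made to keep the example self-contained) and collects author and title for each book, which is the result the profile above describes:

::

   import xml.etree.ElementTree as ET

   # Shortened copy of the catalog, for illustration only
   sample = """<catalog>
     <shop number="1">
       <book id="bk101">
         <author>Gambardella, Matthew</author>
         <title>XML Developer's Guide</title>
       </book>
     </shop>
     <shop number="2">
       <book id="bk103">
         <author>Corets, Eva</author>
         <title>Maeve Ascendant</title>
       </book>
     </shop>
   </catalog>"""

   root = ET.fromstring(sample)   # root is the catalog element

   # Step down catalog -> shop -> book, then read author and title
   title_and_author = [
       {'author': book.findtext('author'), 'title': book.findtext('title')}
       for book in root.findall('./shop/book')
   ]
   manual = {'title_and_author': title_and_author}

The profile expresses the same walk declaratively, so no traversal code has to be written for each new dataset.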

Example 2 - Working with multiple datasets
------------------------------------------

Let's say we want to expand our collection so that we also capture the
title and genre.  This can be done quite simply by updating the profile
to include an additional dataset for title and a definition for genre:

::

   profile="""
   catalog
       shop
           book
               author = dataset:title_and_author
               title  = dataset:title_and_author dataset:title_and_genre
               genre  = dataset:title_and_genre"""

With this executed, we now have an additional dataset of title_and_genre as
well as the original title_and_author:

::

   {   'title_and_author': [   {   'author': 'Gambardella, Matthew',
                                   'title': "XML Developer's Guide"},
                               {   'author': 'Ralls, Kim',
                                   'title': 'Midnight Rain'},
                               {   'author': 'Corets, Eva',
                                   'title': 'Maeve Ascendant'},
                               {   'author': 'Corets, Eva',
                                   'title': "Oberon's Legacy"}],
       'title_and_genre': [   {   'genre': 'Computer',
                                  'title': "XML Developer's Guide"},
                              {   'genre': 'Fantasy', 'title': 'Midnight Rain'},
                              {   'genre': 'Fantasy', 'title': 'Maeve Ascendant'},
                              {   'genre': 'Fantasy', 'title': "Oberon's Legacy"}]}

Example 3 - Handling XML attributes
-----------------------------------

XML attributes are treated in the profile as sub-level key/value entries of their element. The following example includes the attribute 'id' in the returned datasets. Note how id is indented under book, at the same level as author, title and genre:

::

   profile="""
   catalog
       shop
           book
               id     = dataset:title_and_author dataset:title_and_genre
               author = dataset:title_and_author
               title  = dataset:title_and_author dataset:title_and_genre
               genre  = dataset:title_and_genre"""

With the following output:

::

   {   'title_and_author': [   {   'author': 'Gambardella, Matthew',
                                   'id': 'bk101',
                                   'title': "XML Developer's Guide"},
                               {   'author': 'Ralls, Kim',
                                   'id': 'bk102',
                                   'title': 'Midnight Rain'},
                               {   'author': 'Corets, Eva',
                                   'id': 'bk103',
                                   'title': 'Maeve Ascendant'},
                               {   'author': 'Corets, Eva',
                                   'id': 'bk104',
                                   'title': "Oberon's Legacy"}],
       'title_and_genre': [   {   'genre': 'Computer',
                                  'id': 'bk101',
                                  'title': "XML Developer's Guide"},
                              {   'genre': 'Fantasy',
                                  'id': 'bk102',
                                  'title': 'Midnight Rain'},
                              {   'genre': 'Fantasy',
                                  'id': 'bk103',
                                  'title': 'Maeve Ascendant'},
                              {   'genre': 'Fantasy',
                                  'id': 'bk104',
                                  'title': "Oberon's Legacy"}]}

Example 4 - Using higher level data across datasets
---------------------------------------------------

Note how the 'number' attribute of shop sits at a higher level than the data we have captured so far.  Owing to its hierarchical position, this data may be relevant to the datasets formed below it.

Information that is available at a higher level than that of the specified dataset information can be referenced and included in datasets using a combination of the external_dataset and __EXTERNAL_VALUE__ markers.

The external_dataset marker informs the parser to store the information for later use. It follows the format of external_dataset:<target> where <target> is a reference name that identifies the external store.

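The __EXTERNAL_VALUE__ marker informs the parser to reference a value that is or will be stored externally. It follows the format of __EXTERNAL_VALUE__ = <external_store>:<external_value>:<target_dataset>. An optional fourth field, :<override_name>, sets the name under which the value is stored in the target dataset; Example 6 makes use of it.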

::

   profile="""
   catalog
       shop
           number     = external_dataset:shop_information
           book
               id     = dataset:title_and_author dataset:title_and_genre
               author = dataset:title_and_author
               title  = dataset:title_and_author dataset:title_and_genre
               genre  = dataset:title_and_genre
               __EXTERNAL_VALUE__ = shop_information:number:title_and_author shop_information:number:title_and_genre"""

This produces the following output. Note that number changes from 1 to 2 for books bk103 and bk104 in each dataset:

::

   {   'title_and_author': [   {   'author': 'Gambardella, Matthew',
                                   'id': 'bk101',
                                   'number': '1',
                                   'title': "XML Developer's Guide"},
                               {   'author': 'Ralls, Kim',
                                   'id': 'bk102',
                                   'number': '1',
                                   'title': 'Midnight Rain'},
                               {   'author': 'Corets, Eva',
                                   'id': 'bk103',
                                   'number': '2',
                                   'title': 'Maeve Ascendant'},
                               {   'author': 'Corets, Eva',
                                   'id': 'bk104',
                                   'number': '2',
                                   'title': "Oberon's Legacy"}],
       'title_and_genre': [   {   'genre': 'Computer',
                                  'id': 'bk101',
                                  'number': '1',
                                  'title': "XML Developer's Guide"},
                              {   'genre': 'Fantasy',
                                  'id': 'bk102',
                                  'number': '1',
                                  'title': 'Midnight Rain'},
                              {   'genre': 'Fantasy',
                                  'id': 'bk103',
                                  'number': '2',
                                  'title': 'Maeve Ascendant'},
                              {   'genre': 'Fantasy',
                                  'id': 'bk104',
                                  'number': '2',
                                  'title': "Oberon's Legacy"}]}
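
The effect of carrying a higher-level value down into each record can be reproduced by hand with the standard xml.etree.ElementTree module. This is only a sketch of the resulting data, not of xmldataset's internals, and it uses a shortened copy of the catalog so that it is self-contained:

::

   import xml.etree.ElementTree as ET

   # Shortened copy of the catalog, for illustration only
   sample = """<catalog>
     <shop number="1">
       <book id="bk101"><author>Gambardella, Matthew</author></book>
     </shop>
     <shop number="2">
       <book id="bk103"><author>Corets, Eva</author></book>
       <book id="bk104"><author>Corets, Eva</author></book>
     </shop>
   </catalog>"""

   records = []
   for shop in ET.fromstring(sample).findall('shop'):
       number = shop.get('number')          # value held at the higher (shop) level
       for book in shop.findall('book'):
           records.append({
               'id': book.get('id'),        # attribute of the book itself
               'author': book.findtext('author'),
               'number': number,            # copied down into every book record
           })

In the profile, external_dataset plays the part of the number variable and __EXTERNAL_VALUE__ plays the part of the copy into each record.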

Example 5 - Optional Dataset Parameters: name
---------------------------------------------

Dataset declarations can receive additional parameters as comma-separated additions. In this example the XML element 'genre' is renamed to 'style' during processing using the name parameter.

::

   profile="""
   catalog
       shop
           number     = external_dataset:shop_information
           book
               id     = dataset:title_and_author dataset:title_and_genre
               author = dataset:title_and_author
               title  = dataset:title_and_author dataset:title_and_genre
               genre  = dataset:title_and_genre,name:style
               __EXTERNAL_VALUE__ = shop_information:number:title_and_author shop_information:number:title_and_genre"""

Within the dataset title_and_genre, the key 'genre' is now changed to 'style':

::
   
   {   'title_and_author': [   {   'author': 'Gambardella, Matthew',
                                   'id': 'bk101',
                                   'number': '1',
                                   'title': "XML Developer's Guide"},
                               {   'author': 'Ralls, Kim',
                                   'id': 'bk102',
                                   'number': '1',
                                   'title': 'Midnight Rain'},
                               {   'author': 'Corets, Eva',
                                   'id': 'bk103',
                                   'number': '2',
                                   'title': 'Maeve Ascendant'},
                               {   'author': 'Corets, Eva',
                                   'id': 'bk104',
                                   'number': '2',
                                   'title': "Oberon's Legacy"}],
       'title_and_genre': [   {   'id': 'bk101',
                                  'number': '1',
                                  'style': 'Computer',
                                  'title': "XML Developer's Guide"},
                              {   'id': 'bk102',
                                  'number': '1',
                                  'style': 'Fantasy',
                                  'title': 'Midnight Rain'},
                              {   'id': 'bk103',
                                  'number': '2',
                                  'style': 'Fantasy',
                                  'title': 'Maeve Ascendant'},
                              {   'id': 'bk104',
                                  'number': '2',
                                  'style': 'Fantasy',
                                  'title': "Oberon's Legacy"}]}

Example 6 - Optional Dataset Parameters: prefix
-----------------------------------------------

The prefix declaration adds a prefix to the assignment name. For example, genre with a prefix of ``shop_information_`` becomes ``shop_information_genre``.

For consistency, the external value number in this example uses the optional :<override_name> field of __EXTERNAL_VALUE__ to override its name, so that it also appears as ``shop_information_number``:

::

   profile="""
   catalog
       shop
           number     = external_dataset:shop_information
           book
               id     = dataset:title_and_author,prefix:shop_information_ dataset:title_and_genre,prefix:shop_information_
               author = dataset:title_and_author,prefix:shop_information_
               title  = dataset:title_and_author,prefix:shop_information_ dataset:title_and_genre,prefix:shop_information_
               genre  = dataset:title_and_genre,prefix:shop_information_
               __EXTERNAL_VALUE__ = shop_information:number:title_and_author:shop_information_number shop_information:number:title_and_genre:shop_information_number"""


Resulting in the following:

::
   
   {   'title_and_author': [   {   'shop_information_author': 'Gambardella, Matthew',
                                   'shop_information_id': 'bk101',
                                   'shop_information_number': '1',
                                   'shop_information_title': "XML Developer's Guide"},
                               {   'shop_information_author': 'Ralls, Kim',
                                   'shop_information_id': 'bk102',
                                   'shop_information_number': '1',
                                   'shop_information_title': 'Midnight Rain'},
                               {   'shop_information_author': 'Corets, Eva',
                                   'shop_information_id': 'bk103',
                                   'shop_information_number': '2',
                                   'shop_information_title': 'Maeve Ascendant'},
                               {   'shop_information_author': 'Corets, Eva',
                                   'shop_information_id': 'bk104',
                                   'shop_information_number': '2',
                                   'shop_information_title': "Oberon's Legacy"}],
       'title_and_genre': [   {   'shop_information_genre': 'Computer',
                                  'shop_information_id': 'bk101',
                                  'shop_information_number': '1',
                                  'shop_information_title': "XML Developer's Guide"},
                              {   'shop_information_genre': 'Fantasy',
                                  'shop_information_id': 'bk102',
                                  'shop_information_number': '1',
                                  'shop_information_title': 'Midnight Rain'},
                              {   'shop_information_genre': 'Fantasy',
                                  'shop_information_id': 'bk103',
                                  'shop_information_number': '2',
                                  'shop_information_title': 'Maeve Ascendant'},
                              {   'shop_information_genre': 'Fantasy',
                                  'shop_information_id': 'bk104',
                                  'shop_information_number': '2',
                                  'shop_information_title': "Oberon's Legacy"}]}

Example 7 - Optional Dataset Parameters: process
------------------------------------------------

The process parameter can be used for inline manipulation of data. In this example the author is passed through a simple function that returns an uppercase value.

The parser expects the functions named by process parameters to be passed to parse_using_profile through its process argument, as in this example:

::

   def to_upper(value):
       return value.upper()

   profile="""
   catalog
       shop
           number     = external_dataset:shop_information
           book
               id     = dataset:title_and_author dataset:title_and_genre
               author = dataset:title_and_author,process:to_upper
               title  = dataset:title_and_author dataset:title_and_genre
               genre  = dataset:title_and_genre,name:style
               __EXTERNAL_VALUE__ = shop_information:number:title_and_author shop_information:number:title_and_genre"""

   # Parse with the process function registered, then pretty print the output
   output = xmldataset.parse_using_profile(xml, profile, process={'to_upper': to_upper})
   pp(output)

We've specifically targeted the dataset title_and_author for the author value to be processed through **to_upper**:

::

   {   'title_and_author': [   {   'author': 'GAMBARDELLA, MATTHEW',
                                   'id': 'bk101',
                                   'number': '1',
                                   'title': "XML Developer's Guide"},
                               {   'author': 'RALLS, KIM',
                                   'id': 'bk102',
                                   'number': '1',
                                   'title': 'Midnight Rain'},
                               {   'author': 'CORETS, EVA',
                                   'id': 'bk103',
                                   'number': '2',
                                   'title': 'Maeve Ascendant'},
                               {   'author': 'CORETS, EVA',
                                   'id': 'bk104',
                                   'number': '2',
                                   'title': "Oberon's Legacy"}],
       'title_and_genre': [   {   'id': 'bk101',
                                  'number': '1',
                                  'style': 'Computer',
                                  'title': "XML Developer's Guide"},
                              {   'id': 'bk102',
                                  'number': '1',
                                  'style': 'Fantasy',
                                  'title': 'Midnight Rain'},
                              {   'id': 'bk103',
                                  'number': '2',
                                  'style': 'Fantasy',
                                  'title': 'Maeve Ascendant'},
                              {   'id': 'bk104',
                                  'number': '2',
                                  'style': 'Fantasy',
                                  'title': "Oberon's Legacy"}]}
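
Because process functions are ordinary Python callables collected in a dictionary, several can be registered at once. The sketch below adds a second, hypothetical function, normalise_space, alongside to_upper. It assumes that each function receives the element text as a string and returns the replacement value, as to_upper does above. The functions are exercised directly here, without calling the parser:

::

   def to_upper(value):
       return value.upper()

   def normalise_space(value):
       # Hypothetical helper: collapse line breaks and runs of spaces, as found
       # in the multi-line description elements, into single spaces
       return ' '.join(value.split())

   # Maps the names used after process: in a profile to the functions themselves
   process = {'to_upper': to_upper, 'normalise_space': normalise_space}

   # Text shaped like a description element that spans two lines
   raw = 'An in-depth look at creating applications \n              with XML.'
   cleaned = process['normalise_space'](raw)
   shouted = process['to_upper']('Ralls, Kim')

Testing a function on its own like this is a quick way to check it before referring to it from a profile.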

Example 8 - Hinting for new datasets
------------------------------------

During processing, the parser looks for indicators that it should create a new dataset. For example, when new data would otherwise overwrite existing data, a new dataset is created instead. Unfortunately this may lead to unexpected results when working with poorly structured input, where subsets of information may be missing from the XML structure.

To mitigate this, the hint __NEW_DATASET__ = <dataset> is available to force the creation of a new dataset upon entering a block.

If there are any concerns about the consistency of the XML document then it is recommended that the __NEW_DATASET__ declaration is made within all respective blocks as part of the profile definition.

::

   profile="""
   catalog
       shop
           number     = external_dataset:shop_information
           book
               __NEW_DATASET__ = title_and_author title_and_genre
               id     = dataset:title_and_author dataset:title_and_genre
               author = dataset:title_and_author
               title  = dataset:title_and_author dataset:title_and_genre
               genre  = dataset:title_and_genre,name:style
               __EXTERNAL_VALUE__ = shop_information:number:title_and_author shop_information:number:title_and_genre"""

Output as follows:

::

   {   'title_and_author': [   {   'author': 'Gambardella, Matthew',
                                   'id': 'bk101',
                                   'number': '1',
                                   'title': "XML Developer's Guide"},
                               {   'author': 'Ralls, Kim',
                                   'id': 'bk102',
                                   'number': '1',
                                   'title': 'Midnight Rain'},
                               {   'author': 'Corets, Eva',
                                   'id': 'bk103',
                                   'number': '2',
                                   'title': 'Maeve Ascendant'},
                               {   'author': 'Corets, Eva',
                                   'id': 'bk104',
                                   'number': '2',
                                   'title': "Oberon's Legacy"}],
       'title_and_genre': [   {   'id': 'bk101',
                                  'number': '1',
                                  'style': 'Computer',
                                  'title': "XML Developer's Guide"},
                              {   'id': 'bk102',
                                  'number': '1',
                                  'style': 'Fantasy',
                                  'title': 'Midnight Rain'},
                              {   'id': 'bk103',
                                  'number': '2',
                                  'style': 'Fantasy',
                                  'title': 'Maeve Ascendant'},
                              {   'id': 'bk104',
                                  'number': '2',
                                  'style': 'Fantasy',
                                  'title': "Oberon's Legacy"}]}

Example 9 - Dispatching datasets
--------------------------------

Datasets can be dispatched during processing.  This is especially useful where memory is a concern, as the datasets can be handed off to another function and processed in batches rather than
accumulating in memory until parsing completes.  The __generic__ keyword allows you to target all datasets:

::

   profile="""
   catalog
       shop
           number     = external_dataset:shop_information
           book
               id     = dataset:title_and_author dataset:title_and_genre
               author = dataset:title_and_author
               title  = dataset:title_and_author dataset:title_and_genre
               genre  = dataset:title_and_genre,name:style
               __EXTERNAL_VALUE__ = shop_information:number:title_and_author shop_information:number:title_and_genre"""

   def print_dataset(value):
       pp(value)

   # Dispatch every dataset to print_dataset in batches of 2
   output = xmldataset.parse_using_profile(xml,profile, dispatch = { 
           '__generic__' : { 
                   'counter' : 2, 
                   'coderef' : print_dataset 
           } 
   })


As the counter is set to 2, each dataset is passed to the print_dataset function as soon as it holds 2 entries. Note how each list now holds 2 entries:


::
   
   {   'title_and_author': [   {   'author': 'Gambardella, Matthew',
                                   'id': 'bk101',
                                   'number': '1',
                                   'title': "XML Developer's Guide"},
                               {   'author': 'Ralls, Kim',
                                   'id': 'bk102',
                                   'number': '1',
                                   'title': 'Midnight Rain'}]}
   {   'title_and_genre': [   {   'id': 'bk101',
                                  'number': '1',
                                  'style': 'Computer',
                                  'title': "XML Developer's Guide"},
                              {   'id': 'bk102',
                                  'number': '1',
                                  'style': 'Fantasy',
                                  'title': 'Midnight Rain'}]}
   {   'title_and_genre': [   {   'id': 'bk103',
                                  'number': '2',
                                  'style': 'Fantasy',
                                  'title': 'Maeve Ascendant'},
                              {   'id': 'bk104',
                                  'number': '2',
                                  'style': 'Fantasy',
                                  'title': "Oberon's Legacy"}]}
   {   'title_and_author': [   {   'author': 'Corets, Eva',
                                   'id': 'bk103',
                                   'number': '2',
                                   'title': 'Maeve Ascendant'},
                               {   'author': 'Corets, Eva',
                                   'id': 'bk104',
                                   'number': '2',
                                   'title': "Oberon's Legacy"}]}
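
The function given as coderef can do more than print. The sketch below shows a hypothetical handler, count_records, that keeps a running total per dataset. It assumes, based on the output above, that each call receives a dictionary mapping one dataset name to the list of entries being dispatched. The handler is called by hand here with two batches copied from that output:

::

   from collections import Counter

   totals = Counter()

   def count_records(value):
       # value looks like {'dataset_name': [record, record, ...]}
       for dataset_name, records in value.items():
           totals[dataset_name] += len(records)

   count_records({'title_and_author': [
       {'author': 'Gambardella, Matthew', 'id': 'bk101', 'number': '1',
        'title': "XML Developer's Guide"},
       {'author': 'Ralls, Kim', 'id': 'bk102', 'number': '1',
        'title': 'Midnight Rain'}]})
   count_records({'title_and_genre': [
       {'id': 'bk101', 'number': '1', 'style': 'Computer',
        'title': "XML Developer's Guide"},
       {'id': 'bk102', 'number': '1', 'style': 'Fantasy',
        'title': 'Midnight Rain'}]})

A handler of this shape could equally write each batch to a file or a database and then discard it.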

Example 10 - Dispatching multiple datasets
------------------------------------------

It is possible to dispatch datasets to different methods or to specify dataset specific counters:

::

   profile="""
   catalog
       shop
           number     = external_dataset:shop_information
           book
               id     = dataset:title_and_author dataset:title_and_genre
               author = dataset:title_and_author
               title  = dataset:title_and_author dataset:title_and_genre
               genre  = dataset:title_and_genre,name:style
               __EXTERNAL_VALUE__ = shop_information:number:title_and_author shop_information:number:title_and_genre"""

   def print_dataset(value):
       pp(value)

   # Dispatch each dataset to print_dataset using its own counter
   output = xmldataset.parse_using_profile(xml,profile, dispatch = { 
           'title_and_author' : { 
                   'counter' : 2, 
                   'coderef' : print_dataset 
           }, 
           'title_and_genre' : { 
                   'counter' : 3, 
                   'coderef' : print_dataset 
           } 
   })


As title_and_genre is now dispatched 3 entries at a time, its final dispatch holds only the single remaining entry:

:: 

   {   'title_and_author': [   {   'author': 'Gambardella, Matthew',
                                   'id': 'bk101',
                                   'number': '1',
                                   'title': "XML Developer's Guide"},
                               {   'author': 'Ralls, Kim',
                                   'id': 'bk102',
                                   'number': '1',
                                   'title': 'Midnight Rain'}]}
   {   'title_and_genre': [   {   'id': 'bk101',
                                  'number': '1',
                                  'style': 'Computer',
                                  'title': "XML Developer's Guide"},
                              {   'id': 'bk102',
                                  'number': '1',
                                  'style': 'Fantasy',
                                  'title': 'Midnight Rain'},
                              {   'id': 'bk103',
                                  'number': '2',
                                  'style': 'Fantasy',
                                  'title': 'Maeve Ascendant'}]}
   {   'title_and_genre': [   {   'id': 'bk104',
                                  'number': '2',
                                  'style': 'Fantasy',
                                  'title': "Oberon's Legacy"}]}
   {   'title_and_author': [   {   'author': 'Corets, Eva',
                                   'id': 'bk103',
                                   'number': '2',
                                   'title': 'Maeve Ascendant'},
                               {   'author': 'Corets, Eva',
                                   'id': 'bk104',
                                   'number': '2',
                                   'title': "Oberon's Legacy"}]}
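
The batching seen above can be modelled in a few lines of plain Python. This is a simplified model of the observed behaviour, not xmldataset's implementation, and the function name dispatch_in_batches is hypothetical. Entries are buffered, handed over once counter entries have been collected, and whatever remains is handed over at the end:

::

   def dispatch_in_batches(records, counter, coderef):
       buffer = []
       for record in records:
           buffer.append(record)
           if len(buffer) == counter:
               coderef(buffer)      # a full batch is handed over
               buffer = []
       if buffer:
           coderef(buffer)          # remaining entries are handed over at the end

   batches = []
   dispatch_in_batches(['bk101', 'bk102', 'bk103', 'bk104'], 3, batches.append)

With four entries and a counter of 3, the model produces one batch of three followed by one batch of one, matching the title_and_genre output above.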

Example 11 - Using xmldataset as an input to pandas
---------------------------------------------------

Thanks to keluc for this one: xmldataset works well as an input to pandas through the DataFrame.from_records method.

::

   import pandas as pd

   result = xmldataset.parse_using_profile(xml, profile)
   df = pd.DataFrame.from_records(result['...'])  # '...' stands for a dataset name
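
As a self-contained illustration, the records below are copied from the title_and_author output of Example 1 and loaded without calling the parser. The sketch assumes pandas is installed and imported as pd:

::

   import pandas as pd

   records = [
       {'author': 'Gambardella, Matthew', 'title': "XML Developer's Guide"},
       {'author': 'Ralls, Kim', 'title': 'Midnight Rain'},
   ]

   # Each dictionary becomes a row and each key becomes a column
   df = pd.DataFrame.from_records(records)

Because every dataset is a list of flat dictionaries, no reshaping is needed before building the DataFrame.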