Quickstart
==========

Hi, welcome to the quick-start tutorial.

Installing xmldataset
---------------------

Use pip to install xmldataset:

::

    pip install xmldataset

Configuring xmldataset
----------------------

To get started, import the xmldataset package:

::

    import xmldataset

Configuring pretty print
------------------------

For these examples it is worth using the pretty print module, as it provides a convenient way of checking the returned results. Configure it as follows:

::

    import pprint

    # Setup Pretty Printing
    ppsetup = pprint.PrettyPrinter(indent=4)
    pp = ppsetup.pprint

Adding XML content
------------------

We need some XML to work with. All of the examples use the same XML, although the profile changes to demonstrate different behaviour. Set the XML up as follows:

::

    xml = """
    <catalog>
       <shop number="1">
          <book id="bk101">
             <author>Gambardella, Matthew</author>
             <title>XML Developer's Guide</title>
             <genre>Computer</genre>
             <price>44.95</price>
             <publish_date>2000-10-01</publish_date>
             <description>An in-depth look at creating applications with XML.</description>
          </book>
          <book id="bk102">
             <author>Ralls, Kim</author>
             <title>Midnight Rain</title>
             <genre>Fantasy</genre>
             <price>5.95</price>
             <publish_date>2000-12-16</publish_date>
             <description>A former architect battles corporate zombies, an evil sorceress, and her own childhood to become queen of the world.</description>
          </book>
       </shop>
       <shop number="2">
          <book id="bk103">
             <author>Corets, Eva</author>
             <title>Maeve Ascendant</title>
             <genre>Fantasy</genre>
             <price>5.95</price>
             <publish_date>2000-11-17</publish_date>
             <description>After the collapse of a nanotechnology society in England, the young survivors lay the foundation for a new society.</description>
          </book>
          <book id="bk104">
             <author>Corets, Eva</author>
             <title>Oberon's Legacy</title>
             <genre>Fantasy</genre>
             <price>5.95</price>
             <publish_date>2001-03-10</publish_date>
             <description>In post-apocalypse England, the mysterious agent known only as Oberon helps to create a new life for the inhabitants of London. Sequel to Maeve Ascendant.</description>
          </book>
       </shop>
    </catalog>"""

Example 1 - Extracting the title and author from the dataset
------------------------------------------------------------

Configure the profile. According to the XML, we need to step down past the elements catalog, shop and book before we reach author and title.
We can then specify that author and title are stored in the dataset **title_and_author**:

::

    profile="""
    catalog
        shop
            book
                author = dataset:title_and_author
                title = dataset:title_and_author"""

Running this results in the following output:

::

    {   'title_and_author': [   {   'author': 'Gambardella, Matthew',
                                    'title': "XML Developer's Guide"},
                                {   'author': 'Ralls, Kim',
                                    'title': 'Midnight Rain'},
                                {   'author': 'Corets, Eva',
                                    'title': 'Maeve Ascendant'},
                                {   'author': 'Corets, Eva',
                                    'title': "Oberon's Legacy"}]}

Example 2 - Working with multiple datasets
------------------------------------------

Let's say we want to expand our collection so that we also capture the title and genre. This can be done quite simply by updating the profile to include an additional dataset for title and a definition for genre:

::

    profile="""
    catalog
        shop
            book
                author = dataset:title_and_author
                title = dataset:title_and_author dataset:title_and_genre
                genre = dataset:title_and_genre"""

With this executed, we now have an additional dataset, title_and_genre, as well as the original title_and_author:

::

    {   'title_and_author': [   {   'author': 'Gambardella, Matthew',
                                    'title': "XML Developer's Guide"},
                                {   'author': 'Ralls, Kim',
                                    'title': 'Midnight Rain'},
                                {   'author': 'Corets, Eva',
                                    'title': 'Maeve Ascendant'},
                                {   'author': 'Corets, Eva',
                                    'title': "Oberon's Legacy"}],
        'title_and_genre': [   {   'genre': 'Computer',
                                   'title': "XML Developer's Guide"},
                               {   'genre': 'Fantasy',
                                   'title': 'Midnight Rain'},
                               {   'genre': 'Fantasy',
                                   'title': 'Maeve Ascendant'},
                               {   'genre': 'Fantasy',
                                   'title': "Oberon's Legacy"}]}

Example 3 - Handling XML attributes
-----------------------------------

XML attributes are treated in the profile as a sub-level key/value pair. The following example shows the inclusion of the attribute 'id' in the returned datasets.
Note how id is indented under book and on the same level as author, title, genre etc:

::

    profile="""
    catalog
        shop
            book
                id = dataset:title_and_author dataset:title_and_genre
                author = dataset:title_and_author
                title = dataset:title_and_author dataset:title_and_genre
                genre = dataset:title_and_genre"""

With the following output:

::

    {   'title_and_author': [   {   'author': 'Gambardella, Matthew',
                                    'id': 'bk101',
                                    'title': "XML Developer's Guide"},
                                {   'author': 'Ralls, Kim',
                                    'id': 'bk102',
                                    'title': 'Midnight Rain'},
                                {   'author': 'Corets, Eva',
                                    'id': 'bk103',
                                    'title': 'Maeve Ascendant'},
                                {   'author': 'Corets, Eva',
                                    'id': 'bk104',
                                    'title': "Oberon's Legacy"}],
        'title_and_genre': [   {   'genre': 'Computer',
                                   'id': 'bk101',
                                   'title': "XML Developer's Guide"},
                               {   'genre': 'Fantasy',
                                   'id': 'bk102',
                                   'title': 'Midnight Rain'},
                               {   'genre': 'Fantasy',
                                   'id': 'bk103',
                                   'title': 'Maeve Ascendant'},
                               {   'genre': 'Fantasy',
                                   'id': 'bk104',
                                   'title': "Oberon's Legacy"}]}

Example 4 - Using higher level data across datasets
---------------------------------------------------

Note how the 'number' attribute is at a higher level than the existing data that we have captured. Owing to its hierarchical position, this data may be of relevance to the datasets formed below it.

Information that is available at a higher level than the specified dataset information can be referenced and included in datasets using a combination of the ``external_dataset`` and ``__EXTERNAL_VALUE__`` markers.

The ``external_dataset`` marker informs the parser to store the information for later use. It follows the format ``external_dataset:<name>``, where ``<name>`` is a reference name that identifies the external store.

The ``__EXTERNAL_VALUE__`` marker informs the parser to reference a value that is or will be stored externally.
It follows the format ``__EXTERNAL_VALUE__ = <name>:<key>:<dataset>``, where ``<name>`` identifies the external store, ``<key>`` is the stored value to reference and ``<dataset>`` is the dataset that receives it:

::

    profile="""
    catalog
        shop
            number = external_dataset:shop_information
            book
                id = dataset:title_and_author dataset:title_and_genre
                author = dataset:title_and_author
                title = dataset:title_and_author dataset:title_and_genre
                genre = dataset:title_and_genre
                __EXTERNAL_VALUE__ = shop_information:number:title_and_author shop_information:number:title_and_genre"""

This produces the following. Note the change of number for books 3 and 4 in each dataset:

::

    {   'title_and_author': [   {   'author': 'Gambardella, Matthew',
                                    'id': 'bk101',
                                    'number': '1',
                                    'title': "XML Developer's Guide"},
                                {   'author': 'Ralls, Kim',
                                    'id': 'bk102',
                                    'number': '1',
                                    'title': 'Midnight Rain'},
                                {   'author': 'Corets, Eva',
                                    'id': 'bk103',
                                    'number': '2',
                                    'title': 'Maeve Ascendant'},
                                {   'author': 'Corets, Eva',
                                    'id': 'bk104',
                                    'number': '2',
                                    'title': "Oberon's Legacy"}],
        'title_and_genre': [   {   'genre': 'Computer',
                                   'id': 'bk101',
                                   'number': '1',
                                   'title': "XML Developer's Guide"},
                               {   'genre': 'Fantasy',
                                   'id': 'bk102',
                                   'number': '1',
                                   'title': 'Midnight Rain'},
                               {   'genre': 'Fantasy',
                                   'id': 'bk103',
                                   'number': '2',
                                   'title': 'Maeve Ascendant'},
                               {   'genre': 'Fantasy',
                                   'id': 'bk104',
                                   'number': '2',
                                   'title': "Oberon's Legacy"}]}

Example 5 - Optional Dataset Parameters: name
---------------------------------------------

Dataset declarations can receive additional parameters through comma-separated inclusions. In this example the XML element 'genre' is renamed to 'style' during processing using the name declaration.
::

    profile="""
    catalog
        shop
            number = external_dataset:shop_information
            book
                id = dataset:title_and_author dataset:title_and_genre
                author = dataset:title_and_author
                title = dataset:title_and_author dataset:title_and_genre
                genre = dataset:title_and_genre,name:style
                __EXTERNAL_VALUE__ = shop_information:number:title_and_author shop_information:number:title_and_genre"""

Within the dataset title_and_genre, the key 'genre' is now changed to 'style':

::

    {   'title_and_author': [   {   'author': 'Gambardella, Matthew',
                                    'id': 'bk101',
                                    'number': '1',
                                    'title': "XML Developer's Guide"},
                                {   'author': 'Ralls, Kim',
                                    'id': 'bk102',
                                    'number': '1',
                                    'title': 'Midnight Rain'},
                                {   'author': 'Corets, Eva',
                                    'id': 'bk103',
                                    'number': '2',
                                    'title': 'Maeve Ascendant'},
                                {   'author': 'Corets, Eva',
                                    'id': 'bk104',
                                    'number': '2',
                                    'title': "Oberon's Legacy"}],
        'title_and_genre': [   {   'id': 'bk101',
                                   'number': '1',
                                   'style': 'Computer',
                                   'title': "XML Developer's Guide"},
                               {   'id': 'bk102',
                                   'number': '1',
                                   'style': 'Fantasy',
                                   'title': 'Midnight Rain'},
                               {   'id': 'bk103',
                                   'number': '2',
                                   'style': 'Fantasy',
                                   'title': 'Maeve Ascendant'},
                               {   'id': 'bk104',
                                   'number': '2',
                                   'style': 'Fantasy',
                                   'title': "Oberon's Legacy"}]}

Example 6 - Optional Dataset Parameters: prefix
-----------------------------------------------

The prefix declaration assigns a prefix to the assignment name. For example, genre with a prefix of ``shop_information_`` becomes ``shop_information_genre``.

For consistency, in this example the external value number carries an additional optional fourth field, on top of the format from Example 4, which overrides the external name (here with ``shop_information_number``):

::

    profile="""
    catalog
        shop
            number = external_dataset:shop_information
            book
                id = dataset:title_and_author,prefix:shop_information_ dataset:title_and_genre,prefix:shop_information_
                author = dataset:title_and_author,prefix:shop_information_
                title = dataset:title_and_author,prefix:shop_information_ dataset:title_and_genre,prefix:shop_information_
                genre = dataset:title_and_genre,prefix:shop_information_
                __EXTERNAL_VALUE__ = shop_information:number:title_and_author:shop_information_number shop_information:number:title_and_genre:shop_information_number"""

Resulting in the following:

::

    {   'title_and_author': [   {   'shop_information_author': 'Gambardella, Matthew',
                                    'shop_information_id': 'bk101',
                                    'shop_information_number': '1',
                                    'shop_information_title': "XML Developer's Guide"},
                                {   'shop_information_author': 'Ralls, Kim',
                                    'shop_information_id': 'bk102',
                                    'shop_information_number': '1',
                                    'shop_information_title': 'Midnight Rain'},
                                {   'shop_information_author': 'Corets, Eva',
                                    'shop_information_id': 'bk103',
                                    'shop_information_number': '2',
                                    'shop_information_title': 'Maeve Ascendant'},
                                {   'shop_information_author': 'Corets, Eva',
                                    'shop_information_id': 'bk104',
                                    'shop_information_number': '2',
                                    'shop_information_title': "Oberon's Legacy"}],
        'title_and_genre': [   {   'shop_information_genre': 'Computer',
                                   'shop_information_id': 'bk101',
                                   'shop_information_number': '1',
                                   'shop_information_title': "XML Developer's Guide"},
                               {   'shop_information_genre': 'Fantasy',
                                   'shop_information_id': 'bk102',
                                   'shop_information_number': '1',
                                   'shop_information_title': 'Midnight Rain'},
                               {   'shop_information_genre': 'Fantasy',
                                   'shop_information_id': 'bk103',
                                   'shop_information_number': '2',
                                   'shop_information_title': 'Maeve Ascendant'},
                               {   'shop_information_genre': 'Fantasy',
                                   'shop_information_id': 'bk104',
                                   'shop_information_number': '2',
                                   'shop_information_title': "Oberon's Legacy"}]}

Example 7 - Optional Dataset Parameters: process
------------------------------------------------

The process parameter can be used for inline manipulation of data. In this example the author is passed through a simple function that returns an uppercase value.
The parser expects the functions named by the process parameter to be passed to the parse_using_profile method, as in this example:

::

    def to_upper(value):
        return value.upper()

    profile="""
    catalog
        shop
            number = external_dataset:shop_information
            book
                id = dataset:title_and_author dataset:title_and_genre
                author = dataset:title_and_author,process:to_upper
                title = dataset:title_and_author dataset:title_and_genre
                genre = dataset:title_and_genre,name:style
                __EXTERNAL_VALUE__ = shop_information:number:title_and_author shop_information:number:title_and_genre"""

    # Pretty Print the output
    output = xmldataset.parse_using_profile(xml, profile, process={'to_upper': to_upper})
    pp(output)

We've specifically targeted the dataset title_and_author for the author value to be processed through **to_upper**:

::

    {   'title_and_author': [   {   'author': 'GAMBARDELLA, MATTHEW',
                                    'id': 'bk101',
                                    'number': '1',
                                    'title': "XML Developer's Guide"},
                                {   'author': 'RALLS, KIM',
                                    'id': 'bk102',
                                    'number': '1',
                                    'title': 'Midnight Rain'},
                                {   'author': 'CORETS, EVA',
                                    'id': 'bk103',
                                    'number': '2',
                                    'title': 'Maeve Ascendant'},
                                {   'author': 'CORETS, EVA',
                                    'id': 'bk104',
                                    'number': '2',
                                    'title': "Oberon's Legacy"}],
        'title_and_genre': [   {   'id': 'bk101',
                                   'number': '1',
                                   'style': 'Computer',
                                   'title': "XML Developer's Guide"},
                               {   'id': 'bk102',
                                   'number': '1',
                                   'style': 'Fantasy',
                                   'title': 'Midnight Rain'},
                               {   'id': 'bk103',
                                   'number': '2',
                                   'style': 'Fantasy',
                                   'title': 'Maeve Ascendant'},
                               {   'id': 'bk104',
                                   'number': '2',
                                   'style': 'Fantasy',
                                   'title': "Oberon's Legacy"}]}

Example 8 - Hinting for new datasets
------------------------------------

During processing, the parser looks for indicators that it should create a new dataset. For example, when a value is encountered for a key that already holds data, a new dataset is created rather than the existing data being overwritten. Unfortunately this may lead to unexpected results when working with poorly structured input where subsets of information may be missing from the XML structure.
To mitigate this, the hint ``__NEW_DATASET__ = <dataset names>`` is available to force the creation of a new dataset upon entering a block. If there are any concerns about the consistency of the XML document, it is recommended that the ``__NEW_DATASET__`` declaration is made within all respective blocks as part of the profile definition.

::

    profile="""
    catalog
        shop
            number = external_dataset:shop_information
            book
                __NEW_DATASET__ = title_and_author title_and_genre
                id = dataset:title_and_author dataset:title_and_genre
                author = dataset:title_and_author
                title = dataset:title_and_author dataset:title_and_genre
                genre = dataset:title_and_genre,name:style
                __EXTERNAL_VALUE__ = shop_information:number:title_and_author shop_information:number:title_and_genre"""

Output as follows:

::

    {   'title_and_author': [   {   'author': 'Gambardella, Matthew',
                                    'id': 'bk101',
                                    'number': '1',
                                    'title': "XML Developer's Guide"},
                                {   'author': 'Ralls, Kim',
                                    'id': 'bk102',
                                    'number': '1',
                                    'title': 'Midnight Rain'},
                                {   'author': 'Corets, Eva',
                                    'id': 'bk103',
                                    'number': '2',
                                    'title': 'Maeve Ascendant'},
                                {   'author': 'Corets, Eva',
                                    'id': 'bk104',
                                    'number': '2',
                                    'title': "Oberon's Legacy"}],
        'title_and_genre': [   {   'id': 'bk101',
                                   'number': '1',
                                   'style': 'Computer',
                                   'title': "XML Developer's Guide"},
                               {   'id': 'bk102',
                                   'number': '1',
                                   'style': 'Fantasy',
                                   'title': 'Midnight Rain'},
                               {   'id': 'bk103',
                                   'number': '2',
                                   'style': 'Fantasy',
                                   'title': 'Maeve Ascendant'},
                               {   'id': 'bk104',
                                   'number': '2',
                                   'style': 'Fantasy',
                                   'title': "Oberon's Legacy"}]}

Example 9 - Dispatching datasets
--------------------------------

Datasets can be dispatched during processing. This is especially beneficial where memory is a concern, as the datasets can be handed off to another function and processed as they are produced, rather than accumulating in memory until the parser returns.
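
The batching idea can be pictured with a small stand-alone sketch. This is not xmldataset's implementation: the ``dispatch_in_batches`` helper and its arguments are hypothetical names chosen for illustration, borrowing only the words counter and coderef from the dispatch options. Records are buffered, handed to a callback every ``counter`` records, and whatever remains is flushed at the end, so no more than ``counter`` records are held at once.

```python
def dispatch_in_batches(records, counter, coderef):
    """Hypothetical helper: pass records to coderef in batches of `counter`."""
    buffer = []
    for record in records:
        buffer.append(record)
        if len(buffer) == counter:
            coderef(buffer)   # hand the full batch off
            buffer = []       # start a fresh batch, releasing the old one
    if buffer:
        coderef(buffer)       # flush the final, possibly shorter, batch


received = []
titles = ["XML Developer's Guide", 'Midnight Rain',
          'Maeve Ascendant', "Oberon's Legacy"]

# A generator is used as input so records are produced one at a time.
dispatch_in_batches(({'title': t} for t in titles),
                    counter=3, coderef=received.append)

batch_sizes = [len(batch) for batch in received]
print(batch_sizes)  # [3, 1]
```

With four records and a counter of 3, the callback receives a batch of three followed by a final batch of one.
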
The ``__generic__`` keyword allows you to target all datasets:

::

    profile="""
    catalog
        shop
            number = external_dataset:shop_information
            book
                id = dataset:title_and_author dataset:title_and_genre
                author = dataset:title_and_author
                title = dataset:title_and_author dataset:title_and_genre
                genre = dataset:title_and_genre,name:style
                __EXTERNAL_VALUE__ = shop_information:number:title_and_author shop_information:number:title_and_genre"""

    def print_dataset(value):
        pp(value)

    # Dispatch every dataset to print_dataset, which pretty prints it
    output = xmldataset.parse_using_profile(xml, profile,
                                            dispatch={'__generic__': {'counter': 2,
                                                                      'coderef': print_dataset}})

As the counter is set to 2, each dataset is passed as an object to the print_dataset function once it holds 2 records. Note how each array now holds 2 entries:

::

    {   'title_and_author': [   {   'author': 'Gambardella, Matthew',
                                    'id': 'bk101',
                                    'number': '1',
                                    'title': "XML Developer's Guide"},
                                {   'author': 'Ralls, Kim',
                                    'id': 'bk102',
                                    'number': '1',
                                    'title': 'Midnight Rain'}]}
    {   'title_and_genre': [   {   'id': 'bk101',
                                   'number': '1',
                                   'style': 'Computer',
                                   'title': "XML Developer's Guide"},
                               {   'id': 'bk102',
                                   'number': '1',
                                   'style': 'Fantasy',
                                   'title': 'Midnight Rain'}]}
    {   'title_and_genre': [   {   'id': 'bk103',
                                   'number': '2',
                                   'style': 'Fantasy',
                                   'title': 'Maeve Ascendant'},
                               {   'id': 'bk104',
                                   'number': '2',
                                   'style': 'Fantasy',
                                   'title': "Oberon's Legacy"}]}
    {   'title_and_author': [   {   'author': 'Corets, Eva',
                                    'id': 'bk103',
                                    'number': '2',
                                    'title': 'Maeve Ascendant'},
                                {   'author': 'Corets, Eva',
                                    'id': 'bk104',
                                    'number': '2',
                                    'title': "Oberon's Legacy"}]}

Example 10 - Dispatching multiple datasets
------------------------------------------

It is possible to dispatch datasets to different functions or to specify dataset-specific counters:

::

    profile="""
    catalog
        shop
            number = external_dataset:shop_information
            book
                id = dataset:title_and_author dataset:title_and_genre
                author = dataset:title_and_author
                title = dataset:title_and_author dataset:title_and_genre
                genre = dataset:title_and_genre,name:style
                __EXTERNAL_VALUE__ = shop_information:number:title_and_author shop_information:number:title_and_genre"""

    def print_dataset(value):
        pp(value)

    # Dispatch each dataset to print_dataset with its own counter
    output = xmldataset.parse_using_profile(xml, profile,
                                            dispatch={'title_and_author': {'counter': 2,
                                                                           'coderef': print_dataset},
                                                      'title_and_genre': {'counter': 3,
                                                                          'coderef': print_dataset}})

As title_and_genre is now dispatched 3 records at a time, its final dispatch holds only the single remaining record:

::

    {   'title_and_author': [   {   'author': 'Gambardella, Matthew',
                                    'id': 'bk101',
                                    'number': '1',
                                    'title': "XML Developer's Guide"},
                                {   'author': 'Ralls, Kim',
                                    'id': 'bk102',
                                    'number': '1',
                                    'title': 'Midnight Rain'}]}
    {   'title_and_genre': [   {   'id': 'bk101',
                                   'number': '1',
                                   'style': 'Computer',
                                   'title': "XML Developer's Guide"},
                               {   'id': 'bk102',
                                   'number': '1',
                                   'style': 'Fantasy',
                                   'title': 'Midnight Rain'},
                               {   'id': 'bk103',
                                   'number': '2',
                                   'style': 'Fantasy',
                                   'title': 'Maeve Ascendant'}]}
    {   'title_and_genre': [   {   'id': 'bk104',
                                   'number': '2',
                                   'style': 'Fantasy',
                                   'title': "Oberon's Legacy"}]}
    {   'title_and_author': [   {   'author': 'Corets, Eva',
                                    'id': 'bk103',
                                    'number': '2',
                                    'title': 'Maeve Ascendant'},
                                {   'author': 'Corets, Eva',
                                    'id': 'bk104',
                                    'number': '2',
                                    'title': "Oberon's Legacy"}]}

Example 11 - Using xmldataset as an input to pandas
---------------------------------------------------

Thanks to keluc for this one: xmldataset works well as an input to pandas with the from_records method. Here '...' stands for the name of the dataset to load:

::

    import pandas as pd

    result = xmldataset.parse_using_profile(xml, profile)
    df = pd.DataFrame.from_records(result['...'])
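
To see what from_records does with a dataset, here is a minimal sketch that runs without the XML or a profile. The ``result`` dictionary is a hand-written stand-in copied from the Example 1 output, not something returned by the parser here, and the per-author count is only an illustration of follow-on pandas usage.

```python
import pandas as pd

# Hand-written stand-in for the parser output shown in Example 1.
result = {
    'title_and_author': [
        {'author': 'Gambardella, Matthew', 'title': "XML Developer's Guide"},
        {'author': 'Ralls, Kim', 'title': 'Midnight Rain'},
        {'author': 'Corets, Eva', 'title': 'Maeve Ascendant'},
        {'author': 'Corets, Eva', 'title': "Oberon's Legacy"},
    ]
}

# Each dataset is a list of dictionaries, one per record, so from_records
# yields one row per record and one column per key.
df = pd.DataFrame.from_records(result['title_and_author'])

# Ordinary pandas operations then apply, e.g. counting titles per author.
titles_per_author = df.groupby('author')['title'].count().to_dict()

print(df)
print(titles_per_author)
```

Because every record in a dataset carries the same keys, the resulting frame has no missing columns to reconcile.
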