# Migrating XML files into Drupal
Today we will learn how to migrate content from an **XML file** into Drupal using the [Migrate Plus module](https://www.drupal.org/project/migrate_plus). We will show how to configure the migration to read files from the *local file system* and *remote locations*. We will also talk about the difference between the two data parsers provided by the module. The example includes *node*, *image*, and *paragraph* migrations. Let's get started.

*Note*: Migrate Plus has many more features. For example, it contains source plugins to import from [JSON files](https://understanddrupal.com/articles/migrating-json-files-drupal) and SOAP endpoints. It provides many useful process plugins for DOM manipulation, string replacement, transliteration, etc. The module also lets you define migration plugins as configurations and create groups to share settings. It offers a custom event to modify the source data before processing begins. In today's blog post, we are focusing on importing XML files. Other features will be covered in future entries.

## Getting the code
You can get the full code example at <https://github.com/dinarcon/ud_migrations>. The module to enable is `UD XML source migration`, whose machine name is `ud_migrations_xml_source`. It comes with four migrations: `udm_xml_source_paragraph`, `udm_xml_source_image`, `udm_xml_source_node_local`, and `udm_xml_source_node_remote`.

You can get the Migrate Plus module using [composer](https://getcomposer.org/): `composer require 'drupal/migrate_plus:^5.0'`. This will install the `8.x-5.x` branch, where new development will happen. This branch was created to introduce breaking changes in preparation for Drupal 9. As of this writing, the `8.x-4.x` branch has feature parity with the newer branch. If your Drupal site is not composer-based, you can download the module manually.

## Understanding the example setup

This migration will reuse the same configuration from the [introduction to paragraph migrations](https://understanddrupal.com/articles/introduction-paragraphs-migrations-drupal) example. Refer to that article for details on the configuration: the destinations will be the same content type, paragraph type, and fields. Only the source changes in today's example, as we use it to explain XML migrations. The end result will again be nodes containing an image and a paragraph with information about someone's favorite book. The major difference is that we are going to read from XML. In fact, three of the migrations will read from the same file. The following snippet shows a reduced version of the file to get a sense of its structure:

```xml
<?xml version="1.0" encoding="UTF-8" ?>
<data>
  <udm_people>
    <unique_id>1</unique_id>
    <name>Michele Metts</name>
    <photo_file>P01</photo_file>
    <book_ref>B10</book_ref>
  </udm_people>
  <udm_people>
    ...
  </udm_people>
  <udm_people>
    ...
  </udm_people>
  <udm_book_paragraph>
    <book_id>B10</book_id>
    <book_details>
      <title>The definite guide to Drupal 7</title>
      <author>Benjamin Melançon et al.</author>
    </book_details>
  </udm_book_paragraph>
  <udm_book_paragraph>
    ...
  </udm_book_paragraph>
  <udm_book_paragraph>
    ...
  </udm_book_paragraph>
  <udm_photos>
    <photo_id>P01</photo_id>
    <photo_url>https://udrupal.com/photos/freescholar.jpg</photo_url>
    <photo_dimensions>
      <width>240</width>
      <height>351</height>
    </photo_dimensions>
  </udm_photos>
  <udm_photos>
    ...
  </udm_photos>
  <udm_photos>
    ...
  </udm_photos>
</data>
```

*Note*: You can literally swap migration sources *without changing any other part of the migration*. This is a powerful feature of [ETL frameworks](https://understanddrupal.com/articles/drupal-migrations-understanding-etl-process) like Drupal's Migrate API. Although that would have been possible here, the example includes slight changes to demonstrate various plugin configuration options. Also, some *machine names* had to be changed to avoid conflicts with other examples in the demo repository.

## Migrating nodes from an XML file

In any migration project, understanding the source is very important. For XML migrations, there are two major considerations. First, where in the *XML tree* hierarchy the data that you want to import lies. It can be at the root of the file or several levels deep in the hierarchy. You use an [XPath](https://en.wikipedia.org/wiki/XPath) expression to select a *set of nodes* from the *XML document*. In this article, we use the term `element` when referring to an *XML document node* to distinguish it from a [Drupal node](https://understanddrupal.com/articles/what-difference-between-node-and-content-type-drupal). Second, once you get to the *set of elements* that you want to import, which child elements are going to be made available to the migration. It is possible that each element contains more data than needed. In XML imports, you have to manually include the child elements that will be required for the migration. The following code snippet shows part of the *local* XML file relevant to the *node* migration:

```xml
<?xml version="1.0" encoding="UTF-8" ?>
<data>
  <udm_people>
    <unique_id>1</unique_id>
    <name>Michele Metts</name>
    <photo_file>P01</photo_file>
    <book_ref>B10</book_ref>
  </udm_people>
  <udm_people>
    ...
  </udm_people>
  <udm_people>
    ...
  </udm_people>
</data>
```

The *set of elements* containing node data lies two levels deep in the hierarchy: you start with `data` at the root and then descend one level to `udm_people`. Each `udm_people` element has four children:

- `unique_id` is the *unique identifier* for each element **returned by** the `data/udm_people` hierarchy.
- `name` is the name of a person. This will be used in the node title.
- `photo_file` is the *unique identifier* of an image that was created in a separate migration.
- `book_ref` is the *unique identifier* of a book paragraph that was created in a separate migration.

The following snippet shows the configuration to read a *local* XML file for the *node* migration:

```yaml
source:
  plugin: url
  # This configuration is ignored by the 'xml' data parser plugin.
  # It only has effect when using the 'simple_xml' data parser plugin.
  data_fetcher_plugin: file
  # Set to 'xml' to use XMLReader https://www.php.net/manual/en/book.xmlreader.php
  # Set to 'simple_xml' to use SimpleXML https://www.php.net/manual/en/ref.simplexml.php
  data_parser_plugin: xml
  urls:
    - modules/custom/ud_migrations/ud_migrations_xml_source/sources/udm_data.xml
  # XPath expression. It is common that it starts with a slash (/).
  item_selector: /data/udm_people
  fields:
    - name: src_unique_id
      label: 'Unique ID'
      selector: unique_id
    - name: src_name
      label: 'Name'
      selector: name
    - name: src_photo_file
      label: 'Photo ID'
      selector: photo_file
    - name: src_book_ref
      label: 'Book paragraph ID'
      selector: book_ref
  ids:
    src_unique_id:
      type: integer
```

The name of the plugin is `url`. Because we are reading a local file, the `data_fetcher_plugin` is set to `file` and the `data_parser_plugin` to `xml`. The `urls` configuration contains an array of file paths *relative to the Drupal root*. In the example we are reading from one file only, but you can read from multiple files at once. In that case, it is important that they have a homogeneous structure. The settings that follow will apply equally to all the files listed in `urls`.

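As a minimal sketch of reading several files at once, the `urls` key could list more than one path. The second file name below is hypothetical and is not part of the demo module; both files are assumed to share the same `/data/udm_people` structure.

```yaml
source:
  plugin: url
  data_fetcher_plugin: file
  data_parser_plugin: xml
  urls:
    - modules/custom/ud_migrations/ud_migrations_xml_source/sources/udm_data.xml
    # Hypothetical second file with the same structure as the first one.
    - modules/custom/ud_migrations/ud_migrations_xml_source/sources/udm_data_extra.xml
  # The selector and field definitions apply to every file listed above.
  item_selector: /data/udm_people
```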

*Technical note*: Migrate Plus provides two data parser plugins for XML files. [xml](https://git.drupalcode.org/project/migrate_plus/blob/8.x-5.x/src/Plugin/migrate_plus/data_parser/Xml.php) uses [XMLReader](https://www.php.net/manual/en/book.xmlreader.php) while [simple_xml](https://git.drupalcode.org/project/migrate_plus/blob/8.x-5.x/src/Plugin/migrate_plus/data_parser/SimpleXml.php) uses [SimpleXML](https://www.php.net/manual/en/ref.simplexml.php). The parser to use is configured in the `data_parser_plugin` configuration. Also note that when you use the `xml` parser, the `data_fetcher_plugin` setting is ignored. More details below.

The `item_selector` configuration indicates where in the XML file the *set of elements* to be migrated lies. Its value is an XPath expression used to traverse the file hierarchy. In this case, the value is `/data/udm_people`. Verify that your expression is valid and selects the elements you intend to import. It commonly starts with a *slash* (**/**).

`fields` has to be set to an *array*. Each element represents a field that will be made available to the migration. The following options can be set:

- `name` is required. This is how the field is going to be referenced in the migration. The name itself can be arbitrary. If it contains spaces, you need to put *double quotation marks* (**"**) around it when referring to it in the migration.
- `label` is optional. This is a description used when presenting details about the migration, for example, in the user interface provided by the [Migrate Tools module](https://www.drupal.org/project/migrate_tools). When defined, you **do not use** the *label* to refer to the field. Keep using the *name*.
- `selector` is required. This is another XPath-like string to find the field to import. The value must be relative to the subtree specified by the `item_selector` configuration. In the example, the fields are direct children of the elements to migrate. Therefore, the XPath expression only includes the element name (e.g., `unique_id`). If you had nested elements, you could use a *slash* (**/**) character to go deeper in the hierarchy. This will be demonstrated in the *image* and *paragraph* migrations.

Finally, you specify an `ids` *array* of field *names* that uniquely identify each record. As already stated, the `unique_id` element, exposed to the migration as `src_unique_id`, serves that purpose. The following snippet shows part of the *process*, *destination*, and *dependencies* configuration of the node migration:

```yaml
process:
  field_ud_image/target_id:
    plugin: migration_lookup
    migration: udm_xml_source_image
    source: src_photo_file
destination:
  plugin: 'entity:node'
  default_bundle: ud_paragraphs
migration_dependencies:
  required:
    - udm_xml_source_image
    - udm_xml_source_paragraph
  optional: []
```

The `source` for setting the image reference is `src_photo_file`. Again, this is the `name` of the field, not the `label` nor the `selector`. The configuration of the migration lookup plugin and the dependencies point to two XML migrations that come with this example. One is for migrating *images* and the other for migrating *paragraphs*.

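For context, the rest of the node *process* section could look roughly like the sketch below. It is not copied from the demo module: `field_ud_favorite_book` is an assumed field name, `psf_book_lookup` is a made-up pseudofield, and the `extract` steps assume that a lookup against a paragraph migration returns the paragraph ID and the revision ID, in that order.

```yaml
process:
  # The person's name becomes the node title.
  title: src_name
  # Pseudofield holding the lookup result. Assumption: for paragraphs it is
  # an array with the entity ID first and the revision ID second.
  psf_book_lookup:
    plugin: migration_lookup
    migration: udm_xml_source_paragraph
    source: src_book_ref
  # 'field_ud_favorite_book' is a hypothetical field name.
  field_ud_favorite_book/target_id:
    plugin: extract
    source: '@psf_book_lookup'
    index:
      - 0
  field_ud_favorite_book/target_revision_id:
    plugin: extract
    source: '@psf_book_lookup'
    index:
      - 1
```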
## Migrating paragraphs from an XML file

Let's consider an example where the elements to migrate have many levels of nesting. The following snippets show part of the *local* XML file and *source* plugin configuration for the *paragraph* migration:

```xml
<?xml version="1.0" encoding="UTF-8" ?>
<data>
  <udm_book_paragraph>
    <book_id>B10</book_id>
    <book_details>
      <title>The definite guide to Drupal 7</title>
      <author>Benjamin Melançon et al.</author>
    </book_details>
  </udm_book_paragraph>
  <udm_book_paragraph>
    ...
  </udm_book_paragraph>
  <udm_book_paragraph>
    ...
  </udm_book_paragraph>
</data>
```

```yaml
source:
  plugin: url
  # This configuration is ignored by the 'xml' data parser plugin.
  # It only has effect when using the 'simple_xml' data parser plugin.
  data_fetcher_plugin: file
  # Set to 'xml' to use XMLReader https://www.php.net/manual/en/book.xmlreader.php
  # Set to 'simple_xml' to use SimpleXML https://www.php.net/manual/en/ref.simplexml.php
  data_parser_plugin: xml
  urls:
    - modules/custom/ud_migrations/ud_migrations_xml_source/sources/udm_data.xml
  # XPath expression. It is common that it starts with a slash (/).
  item_selector: /data/udm_book_paragraph
  fields:
    - name: src_book_id
      label: 'Book ID'
      selector: book_id
    - name: src_book_title
      label: 'Title'
      selector: book_details/title
    - name: src_book_author
      label: 'Author'
      selector: book_details/author
  ids:
    src_book_id:
      type: string
```

The `plugin`, `data_fetcher_plugin`, `data_parser_plugin` and `urls` configurations have the same values as in the *node* migration. The `item_selector` and `ids` configurations are slightly different to represent the path to *paragraph* elements and the unique identifier field, respectively.

The interesting part is the value of the `fields` configuration. Taking `data/udm_book_paragraph` as a starting point, the records with *paragraph* data have a *nested structure*. Particularly, the `book_details` element has two children: `title` and `author`. To refer to them, the selectors are `book_details/title` and `book_details/author`, respectively. Note that you can go as many levels deep in the hierarchy as needed to find the value that should be assigned to the field. Every level in the hierarchy is separated by a *slash* (**/**).

In this example, the target is a single paragraph type. But a similar technique can be used to migrate multiple types. One way to structure the XML file is for each record to have two children. `paragraph_id` would contain the *unique identifier* for the record. `paragraph_data` would contain a child element to specify the paragraph type. It would also have an arbitrary number of extra child elements with the data to be migrated. In the *process* section, you would iterate over the children to map the paragraph fields.

The following snippet shows part of the *process* configuration of the *paragraph* migration:

```yaml
process:
  field_ud_book_paragraph_title: src_book_title
  field_ud_book_paragraph_author: src_book_author
```
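For the multi-type variant described above, the *source* and *process* sections might start out like this sketch. The `udm_paragraph` element, the `type` child, and the file path are hypothetical; only `paragraph_id` and `paragraph_data` come from the description above. It also assumes that the value of the `type` child matches the machine name of a paragraph type. Mapping the remaining children is left out.

```yaml
source:
  plugin: url
  data_fetcher_plugin: file
  data_parser_plugin: xml
  urls:
    - modules/custom/my_module/xml_files/example.xml
  # Hypothetical element wrapping each paragraph record.
  item_selector: /data/udm_paragraph
  fields:
    - name: src_paragraph_id
      label: 'Paragraph ID'
      selector: paragraph_id
    # Hypothetical child that names the paragraph type.
    - name: src_paragraph_type
      label: 'Paragraph type'
      selector: paragraph_data/type
  ids:
    src_paragraph_id:
      type: string
process:
  # 'type' is the bundle key of paragraph entities.
  type: src_paragraph_type
```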

## Migrating images from an XML file

Let's consider an example where the elements to migrate have *more data than needed*. The following snippets show part of the *local* XML file and *source* plugin configuration for the *image* migration:

```xml
<?xml version="1.0" encoding="UTF-8" ?>
<data>
  <udm_photos>
    <photo_id>P01</photo_id>
    <photo_url>https://udrupal.com/photos/freescholar.jpg</photo_url>
    <photo_dimensions>
      <width>240</width>
      <height>351</height>
    </photo_dimensions>
  </udm_photos>
  <udm_photos>
    ...
  </udm_photos>
  <udm_photos>
    ...
  </udm_photos>
</data>
```

```yaml
source:
  plugin: url
  # This configuration is ignored by the 'xml' data parser plugin.
  # It only has effect when using the 'simple_xml' data parser plugin.
  data_fetcher_plugin: file
  # Set to 'xml' to use XMLReader https://www.php.net/manual/en/book.xmlreader.php
  # Set to 'simple_xml' to use SimpleXML https://www.php.net/manual/en/ref.simplexml.php
  data_parser_plugin: xml
  urls:
    - modules/custom/ud_migrations/ud_migrations_xml_source/sources/udm_data.xml
  # XPath expression. It is common that it starts with a slash (/).
  item_selector: /data/udm_photos
  fields:
    - name: src_photo_id
      label: 'Photo ID'
      selector: photo_id
    - name: src_photo_url
      label: 'Photo URL'
      selector: photo_url
  ids:
    src_photo_id:
      type: string
```

The `plugin`, `data_fetcher_plugin`, `data_parser_plugin` and `urls` configurations have the same values as in the *node* migration. The `item_selector` and `ids` configurations are slightly different to represent the path to *image* elements and the unique identifier field, respectively.

The interesting part is the value of the `fields` configuration. Taking `data/udm_photos` as a starting point, the elements with *image* data have extra children that are not used in the migration. Particularly, the `photo_dimensions` element has two children representing the width and height of the image. To ignore this subtree, you simply omit it from the `fields` configuration. In case you wanted to use it, the selectors would be `photo_dimensions/width` and `photo_dimensions/height`, respectively.

The following snippet shows part of the *process* configuration of the *image* migration:

```yaml
process:
  psf_destination_filename:
    plugin: callback
    callable: basename
    source: src_photo_url
```
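If the image dimensions were needed, the `fields` list could be extended as in this sketch. The `src_photo_width` and `src_photo_height` names are made up for illustration; the selectors are the ones mentioned above.

```yaml
source:
  item_selector: /data/udm_photos
  fields:
    - name: src_photo_id
      label: 'Photo ID'
      selector: photo_id
    - name: src_photo_url
      label: 'Photo URL'
      selector: photo_url
    # Hypothetical extra fields reading the nested dimensions.
    - name: src_photo_width
      label: 'Photo width'
      selector: photo_dimensions/width
    - name: src_photo_height
      label: 'Photo height'
      selector: photo_dimensions/height
```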

## XML file location

**Important**: What is described in this section **only applies** when you use either (1) the `xml` data parser or (2) the `simple_xml` parser with the `file` data fetcher.

When using the `file` data fetcher plugin, you have four options to indicate the location of the XML files in the `urls` configuration:

- Use a *relative path* from the **Drupal root**. The path *should not start* with a *slash* (**/**). This is the approach used in this demo. For example, `modules/custom/my_module/xml_files/example.xml`.
- Use an *absolute path* pointing to the XML location in the file system. The path *should start* with a *slash* (**/**). For example, `/var/www/drupal/modules/custom/my_module/xml_files/example.xml`.
- Use a *fully-qualified URL* to any [built-in wrapper](https://www.php.net/manual/en/wrappers.php) like `http`, `https`, `ftp`, `ftps`, etc. For example, `https://understanddrupal.com/xml-files/example.xml`.
- Use a [custom stream wrapper](https://api.drupal.org/api/drupal/namespace/Drupal!Core!StreamWrapper/8.8.x).

Being able to use stream wrappers gives you many more options. For instance:

- Files located in the [public](https://api.drupal.org/api/drupal/core%21lib%21Drupal%21Core%21StreamWrapper%21PublicStream.php/class/PublicStream/8.8.x), [private](https://api.drupal.org/api/drupal/core%21lib%21Drupal%21Core%21StreamWrapper%21PrivateStream.php/class/PrivateStream/8.8.x), and [temporary](https://api.drupal.org/api/drupal/core%21lib%21Drupal%21Core%21StreamWrapper%21TemporaryStream.php/class/TemporaryStream/8.8.x) file systems managed by Drupal. This leverages functionality already available in Drupal core. For example: `public://xml_files/example.xml`.
- Files located in profiles, modules, and themes. You can use the [System stream wrapper module](https://www.drupal.org/project/system_stream_wrapper) or [apply](https://www.drupal.org/patch/apply) this [core patch](https://www.drupal.org/project/drupal/issues/1308152) to get this functionality. For example, `module://my_module/xml_files/example.xml`.
- Files located in [Amazon S3](https://aws.amazon.com/s3/). You can use the [S3 File System module](https://www.drupal.org/project/s3fs) along with the [S3FS File Proxy to S3 module](https://www.drupal.org/project/s3fs_file_proxy_to_s3) to get this functionality.

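To make the options concrete, each entry below is one possible value for `urls`, taken from the examples in this section. This is only an illustration: in a real migration you would list just the locations that apply, and every file would need to share the same structure.

```yaml
source:
  plugin: url
  data_fetcher_plugin: file
  data_parser_plugin: simple_xml
  urls:
    # Relative to the Drupal root (no leading slash).
    - modules/custom/my_module/xml_files/example.xml
    # Absolute path in the file system.
    - /var/www/drupal/modules/custom/my_module/xml_files/example.xml
    # Fully-qualified URL using a built-in wrapper.
    - https://understanddrupal.com/xml-files/example.xml
    # Drupal's public file system stream wrapper.
    - public://xml_files/example.xml
```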
## Migrating remote XML files

**Important**: What is described in this section **only applies** when you use the `http` data fetcher plugin.

Migrate Plus provides another data fetcher plugin named `http`. Under the hood, it uses the [Guzzle HTTP Client](https://github.com/guzzle/guzzle) library. You can use it to fetch files using any [protocol supported](https://curl.haxx.se/libcurl/c/CURLOPT_PROTOCOLS.html) by [curl](https://curl.haxx.se/libcurl/) like `http`, `https`, `ftp`, `ftps`, `sftp`, etc. In a future blog post we will explain this data fetcher in more detail. For now, the `udm_xml_source_node_remote` migration demonstrates a basic setup for this plugin. Note that only the `data_fetcher_plugin`, `data_parser_plugin`, and `urls` configurations are different from the local file example. The following snippet shows part of the configuration to read a *remote* XML file for the *node* migration:

```yaml
source:
  plugin: url
  data_fetcher_plugin: http
  # 'simple_xml' is configured to be able to use the 'http' fetcher.
  data_parser_plugin: simple_xml
  urls:
    - https://sendeyo.com/up/d/478f835718
  item_selector: /data/udm_people
  fields: ...
  ids: ...
```

And that is how you can use XML files as the *source* of your migrations. Many more configurations are possible when you use the `simple_xml` parser with the `http` fetcher. For example, you can provide authentication information to get access to protected resources. You can also set custom HTTP headers. Examples will be presented in a future entry.

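As a rough sketch of those extra options, the configuration below adds a custom header and HTTP basic authentication. The URL and credentials are placeholders, and the `headers` and `authentication` keys reflect our understanding of the `http` fetcher; verify them against the version of Migrate Plus you have installed.

```yaml
source:
  plugin: url
  data_fetcher_plugin: http
  data_parser_plugin: simple_xml
  # Placeholder URL for a protected resource.
  urls:
    - https://example.com/protected/people.xml
  # Custom HTTP headers sent with the request.
  headers:
    Accept: 'application/xml'
  # Assumed setup for the 'basic' authentication plugin.
  authentication:
    plugin: basic
    username: placeholder_user
    password: placeholder_password
  item_selector: /data/udm_people
```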
## XMLReader vs SimpleXML in Drupal migrations

As noted in the module's [README file](https://git.drupalcode.org/project/migrate_plus/blob/8.x-5.x/README.txt#L48), the `xml` parser plugin uses the [XMLReader](https://www.php.net/manual/en/book.xmlreader.php) interface to incrementally parse XML files. The reader acts as a cursor going forward on the document stream and stopping at each node on the way. This should be used for XML sources which are potentially very large. On the other hand, the `simple_xml` parser plugin uses the [SimpleXML](https://www.php.net/manual/en/ref.simplexml.php) interface to fully parse XML files. This should be used for XML sources where you need to be able to use complex XPath expressions for your item selectors, or have to access elements outside of the current item element via XPath.

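For instance, a selector with a predicate is the kind of expression better suited to `simple_xml`. The sketch below assumes the local demo file and would only pick the people whose favorite book is `B10`. The predicate is our own illustration, not part of the demo module.

```yaml
source:
  plugin: url
  data_fetcher_plugin: file
  # SimpleXML loads the whole document, so richer XPath can be used.
  data_parser_plugin: simple_xml
  urls:
    - modules/custom/ud_migrations/ud_migrations_xml_source/sources/udm_data.xml
  # Illustrative predicate: only people referencing book B10.
  item_selector: '/data/udm_people[book_ref="B10"]'
```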
What did you learn in today's blog post? Have you migrated from XML files before? If so, what challenges have you found? Did you know that you can read local and remote files? Did you know that the `data_fetcher_plugin` configuration is ignored when using the `xml` data parser? Please share your answers in the comments. Also, I would be grateful if you shared this blog post with others.