12 KiB
Migrating Microsoft Excel and LibreOffice Calc files into Drupal
Today we will learn how to migrate content from Microsoft Excel and LibreOffice Calc files into Drupal using the Migrate Spreadsheet module. We will give instructions on getting the module and its dependencies. Then, we will present how to configure the module for spreadsheets with or without a header row. There are two example migrations: images and paragraphs. Let's get started.
Getting the code
You can get the full code example at https://github.com/dinarcon/ud_migrations The module to enable is UD Google Sheets, Microsoft Excel, and LibreOffice Calc source migration
whose machine name is ud_migrations_sheets_sources
. It comes with four migrations: udm_google_sheets_source_node.yml
, udm_libreoffice_calc_source_paragraph.yml
, udm_microsoft_excel_source_image.yml
, and udm_backup_csv_source_node.yml
. The image migration uses a Microsoft Excel file as source. The paragraph migration uses a LibreOffice Calc file as source. The CSV migration is a backup in case the Google Sheet is not available. To execute the last one you would need the Migrate Source CSV module.
You can get the Migrate Google Sheets module using composer: composer require 'drupal/migrate_spreadsheet:^1.0'. This module depends on the [PHPOffice/PhpSpreadsheet](https://github.com/PHPOffice/PhpSpreadsheet) library **and** many PHP extensions *including*
ext-zip`. Check this page for a full list of dependencies. If any required extension is missing the installation will fail. If your Drupal site is not composer-based, you will not be able to use Migrate Spreadsheet, unless you go around a lot of hoops.
Understanding the example set up
This migration will reuse the same configuration from the introduction to paragraph migrations example. Refer to that article for details on the configuration. The destinations will be the same content type, paragraph type, and fields. The source will be changed in today's example, as we use it to explain Microsoft Excel and LibreOffice Calc migrations. The end result will again be nodes containing an image and a paragraph with information about someone's favorite book. The major difference is that we are going to read from different sources.
Note: You can literally swap migration sources without changing any other part of the migration. This is a powerful feature of ETL frameworks like Drupal's Migrate API. Although possible, the example includes slight changes to demonstrate various plugin configuration options. Also, some machine names had to be changed to avoid conflicts with other examples in the demo repository.
Understanding the source document and plugin configuration
In any migration project, understanding the source is very important. For Microsoft Excel and LibreOffice Calc migrations, the primary thing to consider is whether or not the file contains a row of headers. Also, a workbook (file) might contain several worksheets (tabs). You can only migrate from one worksheet at a time. The example documents have two worksheets: UD Example Sheet
and Do not peek in here
. We are going to be working with the first one.
The spreadsheet
source plugin exposes seven configuration options. The values to use might change depending on the presence of a header row, but all of them apply for both types of document. Here is a summary of the available configurations:
file
is required. It stores the path to the document to process. You can use a relative path from the Drupal root, an absolute path, or stream wrappers.worksheet
is required. It contains the name of the one worksheet to process.header_row
is optional. This number indicates which row contains the headers. Contrary to CSV migrations, the row number is not zero-based. So, set this value to1
if headers are on the first row,2
if they are on the second, and so on.origin
is optional and defaults toA2
. It indicates which non-header cell contains the first value you want to import. It assumes a grid layout and you only need to indicate the position of the top-left cell value.columns
is optional. It is the list of columns you want to make available for the migration. In case of files with a header row, use those header values in this list. Otherwise, use the default title for columns:A
,B
,C
, etc. If this setting is missing, the plugin will return all columns. This is not ideal, especially for very large files containing more columns than needed for the migration.row_index_column
is optional. This is a special column that contains the row number for each record. This can be used as a unique identifier for the records in case your dataset does not provide a suitable value. Exposing this special column in the migration is up to you. If so, you can come up with any name as long as it does not conflict with header row names set in thecolumns
configuration. Important: this is an autogenerated column, not any of the columns that comes with your dataset.keys
is optional and, if not set, it defaults to the value ofrow_index_column
. It contains an array of column names that uniquely identify each record. For files with a header row, you can use the values set in thecolumns
configuration. Otherwise, use default column titles likeA
,B
,C
, etc. In both cases, you can use therow_index_column
column if it was set. Each value in the array will contain database storage details for the column.
Note that nowhere in the plugin configuration you specify the file type. The same setup applies for both Microsoft Excel and LibreOffice Calc files. The library will take care of detecting and validating the proper type.
Migrating spreadsheet files with a header row
This example is for the paragraph migration and uses a LibreOffice Calc file. The following snippets shows the UD Example Sheet
worksheet and the configuration of the source plugin:
book_id, book_title, Book author
B10, The definite guide to Drupal 7, Benjamin Melançon et al.
B20, Understanding Drupal Views, Carlos Dinarte
B30, Understanding Drupal Migrations, Mauricio Dinarte
source:
plugin: spreadsheet
file: modules/custom/ud_migrations/ud_migrations_sheets_sources/sources/udm_book_paragraph.ods
worksheet: 'UD Example Sheet'
header_row: 1
origin: A2
columns:
- book_id
- book_title
- 'Book author'
row_index_column: 'Document Row Index'
keys:
book_id:
type: string
The name of the plugin is spreadsheet
. Then you use the file
configuration to indicate the path to the file. In this case, it is relative to the Drupal root. The UD Example Sheet
is set as the worksheet
to process. Because the first row of the file contains the header rows, then header_row
is set to 1
and origin
to A2
.
Then specify which columns
to make available to the migration. In this case, we listed all of them so this setting could have been left unassigned. It is better to get into the habit of being explicit about what you import. If the file were to change and more columns were added, you would not have to update the file to prevent unneeded data to be fetched. The row_index_column
is not actually used in the migration, but it is set to show all the configuration options in the example. The values will be 1
, 2
, 3
, etc. Finally, the keys
is set the column that serves as unique identifiers for the records.
The rest of the migration is almost identical to the CSV example. Small changes were made to prevent machine name conflicts with other examples in the demo repository. For reference, the following snippet shows the process and destination sections for the LibreOffice Calc paragraph migration.
process:
field_ud_book_paragraph_title: book_title
field_ud_book_paragraph_author: 'Book author'
destination:
plugin: 'entity_reference_revisions:paragraph'
default_bundle: ud_book_paragraph
Migrating spreadsheet files without a header row
Now let's consider an example of a spreadsheet file that does not have a header row. This example is for the image migration and uses a Microsoft Excel file. The following snippets shows the UD Example Sheet
worksheet and the configuration of the source plugin:
P01, https://udrupal.com/photos/freescholar.jpg
P02, https://udrupal.com/photos/mlncn.jpg
P03, https://udrupal.com/photos/sfreudenberg.jpg
source:
plugin: spreadsheet
file: modules/custom/ud_migrations/ud_migrations_sheets_sources/sources/udm_photos.xlsx
worksheet: 'UD Example Sheet'
# The file does not have a headers row.
header_row: null
origin: A1
# If no header row is available, you use the spreadsheet's column names: A, B, C, etc.
# If you do not manually add a list of columns, all columns that contain data in the worksheet would be returned.
# The same names need to used in the process section.
columns:
- A # This column contains the photo ID. Example: 'P01'.
- B # This column contains the photo URL.
row_index_column: null
keys:
A:
type: string
The plugin
, file
, amd worksheet
configurations follow the same pattern as the paragraph migration. The difference for files with no header row is reflected in the other parameters. header_row
is set to null
to indicate the lack of headers and origin
is to A1
. Because there are no column names to use, you have to use the ones provided by the spreadsheet. In this case, we want to use the first two columns: A
and B
. Contrary to CSV migrations, the spreadsheet
plugin does not allow you to define aliases for unnamed columns. That means that you would have to use A
, B
in the process section to refer to this columns.
row_index_column
is set to null
because it will not be used. And finally, in the keys
section, we use the A
column as the primary key. This might seem like an odd choice. Why use that value if you could use the row_index_column
as the unique identifier for each row? If this were an isolated migration, that would be a valid option. But this migration is referenced from the node migration explained in the previous example. The lookup is made based on the values stored in the A
column. If we used the index of the row as the unique identifier, we would have to update the other migration or the lookup would fail. In many cases, that is not feasible nor desirable.
Except for the name of the columns, the rest of the migration is almost identical to the CSV example. Small changes were made to prevent machine name conflicts with other examples in the demo repository. For reference, the following snippet shows part of the process and destination section for the Microsoft Excel image migration.
process:
psf_destination_filename:
plugin: callback
callable: basename
source: B # This is the photo URL column.
destination:
plugin: 'entity:file'
Refer to this entry to know how to run migrations that depend on others. In this case, you can execute them all by running: drush migrate:import --tag='UD Sheets Source'
. And that is how you can use Microsoft Excel and LibreOffice Calc files as the source of your migrations. This example is very interesting because each of the migration uses a different source type. The node migration explained in the previous post uses a Google Sheet. This is a great example of how powerful and flexible the Migrate API is.
What did you learn in today's blog post? Have you migrated from Microsoft Excel and LibreOffice Calc files before? If so, what challenges have you found? Did you know the source plugin configuration is not dependent on the file type? Share your answers in the comments. Also, I would be grateful if you shared this blog post with others.