Have you ever wanted to extract information from a web page to use in a calculation, or as a reference for something else you may be working on? I have used similar functions in the past (pre Power Query days) to grab stock price information or Forex rates for personal finance reasons from structured data websites.
In many cases, the information contained on a webpage is not structured in a table format for easy importing into Microsoft Excel. Thanks to a great post by Ivan Bondarenko Using Power Query for Extracting Non-Tabular Data from Web Pages, we can also use the same approach to get our engineering information.
I will go through an example using the techniques learned from Ivan and others to show how to get data from scratch. This example will go through the steps on how to put together a Periodic Table of Elements from information contained on efunda website. We will step through from starting from the sites below:
To end up with this in Microsoft Excel:
Get the Element Table without the Electron Configuration (Structured Data)
Go to the Element List by Name web page and copy the URL: http://www.efunda.com/materials/elements/element_list.cfm?list_order=name
Then fire up Excel and do the following steps.
- Data tab → Get & Transform Data → From Web → Paste the above URL into the box → OK
In the Navigator window, under the Display Options, some items will be listed under a folder icon of the web page. (In writing this post I did it a couple of time, and the number of sub listings changed, so yours may look a bit different from mine).
Clicking through the listings, you will see that one will display the Element Listing table in the Table View on the right side. Once you find the relevant attribute, click Edit.
Depending on your Excel settings, Power Query may automatically change the column types. For me, it applied type text to all columns except for the Atomic Number where it used the number type. If this didn’t happen for you, apply the data types as noted and do the remaining steps.
- Select the column(s) → Transform → Detect Data Type, or you can manually select from the Data Type List (if not done automatically)
- Rename the query to ElementList
- Go to Home → Close & Load
Now you have the table without the Electron Configuration. Success so far! Next, we will get the Electron Configuration details.
Here is the M code for the ElementList query
let Source = Web.Page(Web.Contents("http://www.efunda.com/materials/elements/element_list.cfm?list_order=name")), Data1 = Source{1}[Data], #"Changed Type" = Table.TransformColumnTypes(Data1,{{"Element Name", type text}, {"Symbol", type text}, {"Atomic Number", Int64.Type}, {"Atomic Weight", type text}}) in #"Changed Type"
Get the Electron Configuration from the Atomic Detail Web Page (Non-Structured Data)
Now we have the element table with most of the data except for the Electron Configuration. Let’s get cracking into getting that info.
Click on an element and go to that web page. For this example, I will select my favourite element: Titanium. (When I started engineering in the mining industry we designed Titanium explosion cladded pressure vessels)
The URL for Ti is http://www.efunda.com/materials/elements/element_info.cfm?Element_ID=Ti
Alt +Tab on your keyboard back to Excel and do the following steps:
- Data tab → Get & Transform Data → From Web → Paste the above URL into the box → OK
- Right click on the folder icon in the Navigator left pane and select Edit.
- In the Power Query Editor click the cog next to the Source step in the Applied Steps in the Query Settings
- In the From Web dialog box → Open file as select Text File → OK
The data will be imported in its HTML markup format, which looks messy. Bear with me, as with only a few steps we will be able to extract the information required from the text.
Use Chrome to Inspect the HTML text
The easiest way to find the desired data is to inspect the web page using Google Chrome. Right click on the web page and click Inspect. Hover the mouse over the code and notice when it highlights the area on the web page where the data is. You’ll probably need to expand the paragraphs until you finally see the info you are looking for. In this example the common HTML text is shown in red:
<td align=”left” class=”smalltext”>Atomic Number</td>
Now copy the text in red and go back to the Query Editor and:
- Select Column1 → Click the Filter Icon → Text Filters → Contains… → paste the red text → OK
Now you have to scan the filtered rows and look for any useful patterns. For this example, the required data is contained in the first six rows.
You can use the Keep Rows or Remove Rows functions in the Home ribbon, but I use an index column and filter it on the desired rows ( Note: an index column will be required later):
- Go to Add Column → Index Column
- Select Index → Filter Icon → Number Filters → Less Than Or Equal To… → 5 → OK
This gives us what we are looking for, but we now need to get rid of the surrounding HTML text. To do this, we can extract the text between some known delimiters.
- Select Column1 (by clicking the header) → Go to Transform → Extract → Text Between Delimiters → Start delimiter type in > → End delimiter type in </td> → OK
Lovely, except that the information we need is stacked vertically over six rows and we need to get this into a single row.
Creating a Modulo Column
To get stacked data into a single row, I will use a Modulo column. I discovered this in Ken, and Miguel excellent book M is for (Data) Monkey The Excel Pro’s Definitive Guide to Power Query. (I originally purchased this book on Kobo, and since I refer to it often, I needed to access it on the web, so I purchased it again for my Amazon Kindle!)
Now back to adding a Modulo column. To add one, you need an index type column which I created above.
- Click the Index Column → Go to the Add Column tab → Click Standard → Click Modulo from the drop-down → Enter 2 in the Value (Our stacked information is row 1 heading, row 2 data, row 3 heading, row 4 data… so the data is grouped in each set of 2 rows)
Pivoting the Data (Twice)
- Click on the Inserted Modulo column → Go to the Transform tab → Pivot Column → Select the column name where the data is located. In our example, it is Column1 → Expand the Advanced options → Aggregate Value Function select Don’t Aggregate → OK
- Select the values column (1) → Transform tab → Fill → Up
- Filter the attribute column (0) by un-checking the (null)
- Click the Index column → Home tab → Remove Columns (or right-click the header and click Remove)
- Select the attribute column (0) → Transform tab → Pivot Column → Select the values column (1) in the Values Column drop-down → Expand Advanced options → Select Don’t Aggregate for the Aggregate Value Function → OK
- Rename the query to AtomicDetails
- Go Home → Close & Load → Close & Load To… → Select Only Create Connection → OK
Going back to the Query Editor our AtomicDetails query will be a single row of information containing some atomic details for one element.
Here is the M code for the AtomicDetails query:
let Source = Table.FromColumns({Lines.FromBinary(Web.Contents("http://www.efunda.com/materials/elements/element_info.cfm?Element_ID=Ti"))}), #"Filtered Rows" = Table.SelectRows(Source, each Text.Contains([Column1], "<td align=""left"" class=""smalltext""")), #"Added Index" = Table.AddIndexColumn(#"Filtered Rows", "Index", 0, 1), #"Filtered Rows1" = Table.SelectRows(#"Added Index", each [Index] <= 5), #"Extracted Text Between Delimiters" = Table.TransformColumns(#"Filtered Rows1", {{"Column1", each Text.BetweenDelimiters(_, ">", "</td>", 0, 0), type text}}), #"Inserted Modulo" = Table.AddColumn(#"Extracted Text Between Delimiters", "Inserted Modulo", each Number.Mod([Index], 2), type number), #"Pivoted Column" = Table.Pivot(Table.TransformColumnTypes(#"Inserted Modulo", {{"Inserted Modulo", type text}}, "en-AU"), List.Distinct(Table.TransformColumnTypes(#"Inserted Modulo", {{"Inserted Modulo", type text}}, "en-AU")[#"Inserted Modulo"]), "Inserted Modulo", "Column1"), #"Filled Up" = Table.FillUp(#"Pivoted Column",{"1"}), #"Filtered Rows2" = Table.SelectRows(#"Filled Up", each ([0] <> null)), #"Removed Columns" = Table.RemoveColumns(#"Filtered Rows2",{"Index"}), #"Pivoted Column1" = Table.Pivot(#"Removed Columns", List.Distinct(#"Removed Columns"[#"0"]), "0", "1") in #"Pivoted Column1"
Add a Parameter for the Atomic Details Web Page
Now that we have the electron configuration for a single element, Titanium, we need to scrape the data for all the elements in the element list. To do this, we first need to use the Parameter functionality in Power Query.
Copy the URL for Titanium: http://www.efunda.com/materials/elements/element_info.cfm?Element_ID=Ti
- Go to Home → Manage Parameters → New Parameter → Name set to URL → Type set to Text → Current Value paste the URL for Titanium → OK
Now we need to put this newly created URL parameter into the Source step of the AtomicDetails query which will replace the hard-coded URL.
- Go to the AtomicDetails query → In the Query Settings → Applied Steps → Source → click the setting cog
- In the Text Window → URL → Click the ABC box → Select Parameter → URL should show in the box and if not click the drop-down and select URL → OK
Turn the Atomic Details Query Into a Function
The AtomicDetails query gives us the details for Titanium. Great, but we need to be able to cycle through the Element List and get the details for all the elements. To do this, we need to convert our Query into a Function.
- Right-click the AtomicDetails Query → Create Function… → Function name (standard practice here from the gurus is to prefix your query name with f or fn) type in fAtomicDetails → OK
Compile the Element Table WITH the Electron Configuration
We are almost there! One last step and I have to say this is pretty cool.
Head back to the ElementList query (without the Electron Configuration), and we will add a couple of columns for the URL of each element and run the fAtomicDetails function for each element in the list.
The URL for each element is always the same except for the element Symbol suffix at the end. For example:
http://www.efunda.com/materials/elements/element_info.cfm?Element_ID=Ti
Hence we need to copy the URL part in red which is the same for each element. We will create a column which combines this text and the element symbol concatenated at the end.
- Click on the ElementList query → Add Column → Custom Column → set New column name to URL → Custom column formula paste the URL in red, and add double quotes before and after → type & → double click Symbol in the Available columns (= “http://www.efunda.com/materials/elements/element_info.cfm?Element_ID=”&[Symbol] → OK
- Go to Add Column → Custom Column → Custom column formula should be = fAtomicDetails([URL]) → OK
- In the newly created Custom Column click the double arrows on the right side → Select Expand → keep the check next to Electron Config. only → uncheck Use original column name as prefix → OK
- Right-click the URL column and remove
- Go Home → Close & Load
BAM. How cool is that? Now you have an Element Table populated with the Electron Configuration which has scrapped over +100 web pages.
Final thoughts
Apologies, I couldn’t find a more exciting example. But once you master these few steps, one day it may come in real handy.
As always, I am just learning this and have probably performed a heap of steps that are not required or could be more efficient and quicker. Please drop a line in the comments, and I’ll be sure to update the post accordingly.
Here is the final M code for the ElementList query.
let Source = Web.Page(Web.Contents("http://www.efunda.com/materials/elements/element_list.cfm?list_order=name")), Data1 = Source{1}[Data], #"Changed Type" = Table.TransformColumnTypes(Data1,{{"Element Name", type text}, {"Symbol", type text}, {"Atomic Number", Int64.Type}, {"Atomic Weight", type text}}), #"Added Custom" = Table.AddColumn(#"Changed Type", "URL", each "http://www.efunda.com/materials/elements/element_info.cfm?Element_ID=" & [Symbol]), #"Added Custom1" = Table.AddColumn(#"Added Custom", "Custom", each fAtomicDetails([URL])), #"Expanded Custom" = Table.ExpandTableColumn(#"Added Custom1", "Custom", {"Electron Config."}, {"Electron Config."}), #"Removed Columns" = Table.RemoveColumns(#"Expanded Custom",{"URL"}) in #"Removed Columns"
Just to Recap
To summarise the above of extracting a list of information on a website and then scraping further details in separate web pages and compiling it into an Excel Table for later analysis or reference.
- Load the base data from the web (Element List)
- Rename the query
- Load the detail data from the internet (Atomic Details)
- Open files as a Text File
- Find the required information using Chrome’s Inspect
- Filtering the data for the necessary information
- Add an Index Column
- Extract Text Between Delimiters
- Create a Modulo Column to get a list of repeating rows of data
- Pivot the data based on the Modulo Column
- Create a Parameter to store the URL of the detail data web pages (Atomic Details)
- Create a Parameter using the URL of a single element
- Replace the hardcoded URL in the details query with the parameter
- Convert the detail data query into a Function (Atomic Details)
- Combine the detail Data (Atomic Details) into the base data (Element List)
- Add a custom column to create the URL for each element
- Add a custom column to get the detail data for each element in the table
- Expand the detail data for each element
- Remove columns not required
- Presto! Base data with all details scrapped from over 100+ web pages