Advanced Column Splitting in Power Query

9 March, 2015 (10:00) | M, Power Query | By: Mark V

A friend of mine left a question on one of my Power Query Decathlon posts asking if it was possible to split a column using a carriage return as a delimiter. After a little poking around myself, I was not finding the answer I sought. So, I reached out for help. Curt Hagenlocher, a developer at Microsoft was kind enough to help me out. As it turns out, it appears that splitting a column based on carriage returns does not work. However, you CAN split columns based on the line feed character.

Note: M is actually informal name for the Power Query Formula Language.

The sample data

First, I created a collection of lines of text separated by using the <Enter> key. I did this in NotePad++, a free but powerful utility. I activated the feature which allows viewing of all non-printing characters. That is shown in Figure 1.

Figure 1

image

In Figure 1, you can clearly see the CR and LF characters after lines 1 and 2.

I then highlighted this block of text and pasted it into each of the cells in Field2 in the table in Excel shown in Figure 2. Not, I had to paste this into each cell using the formula bar, not into the cells directly.

Figure 2

image

The result is a very simple table where all of the values in Field2 contain <CR><LF> characters, hidden by default.

I then added a query with this table as its source by using the From Table button on the Power Query Ribbon. The result is the query shown in Figure 3.

Figure 3

image

This is our starting point for two different solutions.

M-less Solution

The first, and easiest solution starts with a simple Split Column transformation on the Field2 column. Figure 4 shows how to configure the Split Column for this example.

Figure 4

image

We start by creating a basic Split Column making sure of the following:

  1. Select whichever delimiter you want. However, it is best if you can find one that actually does not appear in the data.
  2. Make sure the “At each occurrence of the delimiter” option is selected. This is the default, anyway.
  3. Since there are a total of 3 lines of text in our Field2 domain of values, we need to set the “Number of columns to split into” option. You would think that the option indicating to split on each delimiter would mean this step would not be necessary. Unfortunately, as least as of the March 2015 update of Power Query, it is.

When you click OK, the result is shown in Figure 5.

Figure 5

image

Since there are no commas in this text, the result is a total of three columns, but only one of them has data. However, this is a necessary first step. The use of non-printing characters as delimiters is not supported by the Power Query UI. So, we need to manually edit the M for the Split Column by Delimiter step. Yes, I know I said this was M-less. However, the edit that is required is tiny and you are not really writing any M.

Figure 6 shows the unmodified M code for the Split Column by Delimiter step.

Figure 6

image

Notice the delimiter shown by the arrow. It is the comma since that is what we chose for this step initially. In order to change that to the <LF> character, we replace that comma with the #(lf) value as show in Figure 7.

Figure 7

image

The result of this change is shown in Figure 8.

Figure 8

image

Great, right? Awesome? Well, sort of. While this worked in our very simple situation in which we knew how many line of text were present, this was OK. And if that is your situation, then hazzah! However, reality is typically more complicated than that.

I saved this query as M-less in the Excel file provided with this post.

M-ful Solution

The more elegant and flexible solution requires M. And, I am still very much a student of M, and a beginner at that. The M code featured in this solution was provided by Microsoft’s Curt Hagenlocher, mentioned above, who was tremendously helpful in answering this question and very patient with my M noobness.

With this solution, the first step after the source is to change the data type of Field2 to Text. While I tried it without this step and it worked just fine on my simple sample data, it is not a bad precaution to ensure the Field2 column is text before proceeding. So, I left it in. The result is shown in Figure 9. Note that this is the last step in this solution that can be accomplished in the UI. It is all M from here.

Figure 9

image

There isn’t much to see here. But if you click on the View tab of the ribbon and then click on the Advanced Editor, you will see the M code for the entire query. That is shown in Figure 10.

Figure 10

image

The next step is to paste in the M provided by Curt. That code replaces the bottom section of the query, shown in the rectangle. Once the new code is pasted in, we will highlight a few parts. The final query is shown in Figure 11.

Figure 11

image

The entire query is pasted below to make it easy for you to copy. Note that, following the pattern of how M works, each step in the function, shown highlighted, has a name that is then referenced by subsequent steps.

 

let
    Source = Excel.CurrentWorkbook(){[Name="Table1"]}[Content],
    #"Changed Type" = Table.TransformColumnTypes(Source,{{"Field2", type text}}),
    SplitByDelimiter = (table, column, delimiter) =>
        let
            Count = List.Count(List.Select(Text.ToList(Table.Column(table, column){0}), each _ = delimiter)) + 1,
            Names = List.Transform(List.Numbers(1, Count), each column & "." & Text.From(_)),
            Types = List.Transform(Names, each {_, type text}),
            Split = Table.SplitColumn(table, column, Splitter.SplitTextByDelimiter(delimiter), Names),
            Typed = Table.TransformColumnTypes(Split, Types)
        in
            Typed,

    #"Split Column by Delimiter" = SplitByDelimiter(#"Changed Type","Field2","#(lf)")
in
    #"Split Column by Delimiter"

 

The steps in the function, as clearly as I can understand/explain them, are as follows:

Count = This step takes in the parameters for the table and the column and examines the contents in the column to discover how many times the delimiter passed in the parameter occurs in that column. This allows the function to dynamically create the correct number of columns instead of having to hard code it like we had to do in the M-less Solution.

Name = This step creates column names for as many columns as were determined in the Count step. The name for each column is prefaced with the original column name passed into the function.

Types = This step creates a list object containing all of the column names created by the Name step and appends to each the code necessary to converting them to text. This column will be used in a subsequent step to make it easy to convert all of the columns to text before exiting the function. This is cool. :)

Split = This step uses the Splitter.SplitTextByDelimiter function to perform the splits of the column into the columns provided in the Name step above using the specified delimiter.

Typed = This step uses the list of column names and the strings from the Types step to convert all of the newly created columns to Text. The results of this step are then returned from the function.

That completes the function walkthrough. Pretty neat, if you ask me.

The final step of the query is to invoke this function. This is done via the section of code in the red rectangle in Figure 11. For the table parameter of the function, this step uses the Changed Type query step we performed in the UI. This is a great example of how steps in queries, at least in M, are not limited to referencing only the single step that immediately preceded them.

The final result is shown in Figure 12.

Figure 12

image

This query was saved a M-ful in the working file provided.

NOTE: Neither of these solutions work for the <CR> character. So, replacing the #(lf) above for #(cr) does not produce the desired result. I even tried using #(cr)#(lf) and that does not work either. Just #(lf) worked for me.

There you have it. You can find the working file I used for this post right here.

One of the aspects I like best about blogging is the learning I get out of it. In order to try to explain things clearly in my posts, I need to study them and understand them. Through a previous post, I got a great question I had not thought about. Through the investigation, I found yet another great and helpful person within Microsoft in Curt Hagenlocher. In putting this post together, I obtained a far greater understanding of M and it capabilities than I had before.

Comments

Comment from Mike Diehl (Imaginet Winnipeg)
Time March 10, 2015 at 4:27 pm

Hey Mark, I’ve been thinking that some code like this might also be used for a problem I am having splitting text that is essentially fixed width, but not the same width in each file. I’m working with a text file that has column names, then a row of — —– ———- ——— etc and then rows of text data. I think I’m going to try your technique here but use the position of the spaces in the — —– ——- line to determine the split locations for the lines following. Do you think that will work?

Comment from Mark V
Time March 31, 2015 at 6:49 pm

Sorry for the late response, Mike. Do you mean that the — —- stuff indicates the width of each column in the file and varies each time? Yikes. Is the file content confidential? I would love to get my hands on an example, even with dummy data, to make sure I fully understand what you need. Sounds like a fun puzzle.

Comment from Mike
Time March 31, 2015 at 6:58 pm

Yeah, the position of the space in the — line indicates the length of the field. http://aer.ca/data-and-publications/statistical-reports/st49
The text files are linked on this page. The column widths for well name, contractor name, and licensee address are based on the longest strings within each text file. They don’t change within the file, but each file could be different.

Comment from Curt Hagenlocher
Time April 7, 2015 at 9:45 am

See this code sample for an approach: https://gist.github.com/CurtHagenlocher/10eb34d0fc3a48480fea

Comment from Eleftheria Kiourtzoglou
Time April 17, 2015 at 4:55 am

Hello Mr Vaillancourt,

Nice blog! I am editor at .NET Code Geeks (www.dotnetcodegeeks.com). We have the NCG program (see http://www.dotnetcodegeeks.com/join-us/ncg/), that I think you’d be perfect for.

If you’re interested, send me an email to eleftheria[dot]kiourtzoglou[at]dotnetcodegeeks[dot]com and we can discuss further.

Best regards,
Eleftheria

Comment from Sayth Renshaw
Time June 5, 2016 at 4:44 am

Is there a way to test for types so that if I want to split on a number I can? I have a text column where going right to left and splitting on first occurrence of a number would be awesome.

Comment from Mark V
Time June 9, 2016 at 12:02 pm

Hi, Sayth. I am not sure about that. The best thing I found in poking around a bit was this: http://community.powerbi.com/t5/Desktop/Detect-type-of-column-in-Power-Query-Formula-Language/td-p/20518

Comment from Bob
Time November 8, 2016 at 6:54 pm

Hi,
Thank you for a great post and working code. This worked like a charm.

Cheers
Bob

Comment from Mark V
Time November 9, 2016 at 9:53 am

Glad you found it helpful, Bob! Thanks.

Comment from Bob
Time November 16, 2016 at 3:26 pm

Hi Mark,

When I started to use the parsing query against some real data, it started only returning the 1st parsed value.
My data varied in that some entries had 1 or 2 elements, some entries had 9 or 10.
To get the code to work, I added a helper column with text.length([columnToParse]) and then sorted the helper column from largest to smallest.

Success. All the entries were fully parsed and continued to work like a charm.

Cheers
Bob

Comment from Amit
Time February 7, 2017 at 1:19 am

Thank you Bob. I was literally banging my head as the script was not working. Upon using the helper column with text.length([columnToParse]) as suggested by you, it worked in a flash!
Big thanks to Mark too.

Comment from Aravind
Time February 8, 2017 at 6:24 am

Hi, I’m looking for advanced split function, after split I would like to have one column which contain max value in single step. For example, In a cell if I have 3002010;3001030;3002134, which is delimited by semicolon, I would like to get max value of this cell. As a first step I will split by delimiter ‘semicolon’ (;) then find the maximum and store this max value in one column. Is this possible in power query in single step? can you please suggest what would be the right way to do this?

Comment from Mark V
Time February 14, 2017 at 3:11 pm

Thanks for sharing that, Bob.

Comment from Mark V
Time February 14, 2017 at 3:18 pm

Hi, Aravind. My gut tells me it should be possible. However, my own M-fu is not that advanced. You may want to ask this for Ken Puls (www.excelguru.ca) or Chris Webb (www.technitrain.com). Both Ken and Chris as far better M skills than I do. :)

Comment from David
Time February 21, 2017 at 1:55 pm

@Aravind,

In this type of situation, you need to create a custom function that will dynamically split the column, then unpivot the data (creates a pair of columns for attribute and value), then sort by value, and finally keep only the top (or bottom, depending on the sort) row.

You would invoke this function as a custom column and it would do the previous steps for each row in your table, leaving you with only the max values for each row.

Comment from Aravind
Time February 26, 2017 at 10:37 am

@Mark, @David: Thanks a lot for the help.

Write a comment