Skip to content Skip to sidebar Skip to footer

Snowflake Sql Regex ~ Extracting Multiple Vals

I am trying to identify a value that is nested in a string using Snowflakes regexp_substr() The values that I want to access are in quotes: ... Type: a: - !

Solution 1:

This select statement will give you what you want ... sorta. You should notice that it will look for the a specific occurence of "val" and then give you the next word character after that.

REGEX to my knowledge evaluates to the first occurence of the expression, so once the pattern is found it's done. You may want to look at the Snowflake JavaScript Stored Procedure to see if you can take the example below and iterate through, incrementing the appropriate value to produce the expected output.enter image description here

SELECT REGEXP_SUBSTR('Type: a:- !<string>val: "A" - !<string> val: "B" - !<string> val: "C"','val\\W+(\\w+)', 1, 1, 'e', 1) as A,
       REGEXP_SUBSTR('Type: a:- !<string>val: "A" - !<string> val: "B" - !<string> val: "C"','val\\W+(\\w+)', 1, 2, 'e', 1) as B,
       REGEXP_SUBSTR('Type: a:- !<string>val: "A" - !<string> val: "B" - !<string> val: "C"','val\\W+(\\w+)', 1, 3, 'e', 1) as C;

Solution 2:

You have to extract the values in two stages;

  1. Extract the section of the document below Type: a: containing all the val: "data".
  2. Extract the "data" as an array or use REGEXP_SUBSTR() + index n to extract the nth element
SELECT'Type:\\s+\\w+:((\\s+- !<string>\\s+val:\\s+"[^"]")+)' type_section_rx
  REGEXP_SUBSTR(col, type_section_rx, 1, 1, 'i', 1) vals,
  PARSE_JSON('[0'|| REPLACE(vals, REGEXP_SUBSTR(vals, '[^"]+'), ', ') ||']') raw_array,
  ARRAY_SLICE(raw_array, 1, ARRAY_SIZE(raw_array)) val_array,
  val_array[1] B
FROM INPUT_STRING

The result is an array where you can access the first value with the index [0] etc. The first regexp can be shortened down to a "least effort" 'Type:\\s+\\w+:(([^"]+"[^"]+")+)'.

Solution 3:

One more angle -- Use javascript regex capabilities in a UDF.

For example:

create or replace functionmy_regexp(S text)
  returns array
  language javascript
as
$$
  const re = /(\w+)/greturn [...S.match(re)]
$$
;

Invoked this way:

set S = 'Type:
  a:
    - !<string>
     val: "A"
    - !<string>
     val: "B"
    - !<string>
     val: "C"';select my_regexp($S);

Yields:

[ "Type", "a", "string", "val", "A", "string", "val", "B", "string", "val", "C" ]

Implementing your full regex is a little more work but as you can see, this gets around the single value limitation.

That said, if performance is your priority, I would expect Snowflake native regex support to outperform, even though you specify the regex multiple times, though I haven't tested this.

Post a Comment for "Snowflake Sql Regex ~ Extracting Multiple Vals"