String manipulation in R: conditional capture and conditional append (regex solution?)
I am using hospital data. I want to write a regular expression in R, and I am struggling to do this without any string manipulation outside of a single regex.
The string I want to search is:
"W779 Y767 W835 W848 Y189 Z846 T625 Z843 Z941".
The string represents procedures, which are described in groups.
The groups are written in the generic form:
(Procedure code, joint code, laterality code)
Procedure codes match "[A-Z]\d{3}", the joint code matches "W84\d", and the laterality code matches "Z94\d".
This format can be repeated multiple times.
In some circumstances the code may be written:
(Procedure code1, joint code1), (Procedure code2, joint code2), lateralityALL
This is done when the laterality applies to each group.
I want to capture the codes up to and including the laterality code, if present.
If there is only one laterality code at the end of all the string groups, this should be appended to each group.
# Example data:
string = c("W779 Y767 W835 W848 Y189 Z846 T625 Z843 Z941")
Desired output:
Group 1: "W779 Y767 W835 W848 Y189 Z846 Z941"
Group 2: "T625 Z843 Z941"
df <- data.frame(string = "W779 Y767 W835 W848 Y189 Z846 Z941")
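One way the rule above might be sketched in base R with a single grouping pattern. This is only a sketch under assumptions that the post does not confirm: the desired output keeps W848 inside Group 1, so the sketch treats a "Z84\d" code (not "W84\d") as the end of a group, optionally followed by that group's own "Z94\d" code. The function name split_groups is hypothetical. If "W84\d" should also close a group, the pattern would need "[WZ]84\\d" instead, and the example would then split differently.

```r
# Hypothetical helper: split ONE procedure string into groups.
# Assumption: a group is a run of codes that ends at a Z84x code,
# optionally followed by its own Z94x laterality code.
split_groups <- function(x) {
  group_pattern <- "(?:[A-Z]\\d{3} )*?Z84\\d(?: Z94\\d)?"
  groups <- regmatches(x, gregexpr(group_pattern, x, perl = TRUE))[[1]]

  # A laterality code at the very end of the string is treated as shared.
  trailing <- regmatches(x, regexpr("Z94\\d$", x, perl = TRUE))

  if (length(trailing) == 1) {
    # Append it only to groups that do not already end in a laterality code.
    needs_it <- !grepl("Z94\\d$", groups, perl = TRUE)
    if (any(needs_it)) {
      groups[needs_it] <- paste(groups[needs_it], trailing)
    }
  }
  groups
}

split_groups("W779 Y767 W835 W848 Y189 Z846 T625 Z843 Z941")
# [1] "W779 Y767 W835 W848 Y189 Z846 Z941" "T625 Z843 Z941"
```

The lazy quantifier "*?" stops each group at the first "Z84\d" it meets, which is what keeps the trailing Z941 attached to the last group only until it is appended to the others.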
What I have done:
I have taken an inefficient approach: I identify strings with a shared laterality code and those without. When the laterality is shared, I manually append it to the first group.
library(dplyr)    # mutate(), case_when() with .default (dplyr >= 1.1.0)
library(stringr)  # str_count(), str_extract(), str_detect()

df <- data.frame(string = c("W779 Y767 W835 W848 Y189 Z846 Z941",
                            "Y189 Z846 Z941",
                            "W779 Y767 W835 W848"))

df %>%
  mutate(
    # Count joint codes and laterality codes in each string
    joint_count = str_count(string, "W84[1-9]|Z84[1-9]"),
    laterality_count = str_count(string, "Z94[1-9]"),
    # Trailing laterality code (NA when the string does not end in one)
    laterality = str_extract(string, "Z94[1-9]$"),
    joint_laterality_count = str_count(string, "(W84[1-9]|Z84[1-9]) Z94[1-9]"),
    laterality_end = str_detect(string, "Z94[1-9]$"),
    # Shared laterality: more joint codes than laterality codes, plus a trailing laterality code
    shared_laterality = case_when(joint_count > laterality_count & laterality_end ~ TRUE,
                                  .default = FALSE),
    single_joint_laterality = case_when(joint_count == 1 & joint_laterality_count == 1 & laterality_end ~ TRUE,
                                        .default = FALSE),
    op_group_1 = case_when(
      single_joint_laterality ~ str_extract(string, "^.*(W84[1-9]|Z84[1-9]) Z94[1-9]"),
      # Lazy match up to the first joint code of either type, so it lines up with op_group_2
      shared_laterality ~ paste(str_extract(string, "^.*?(W84[1-9]|Z84[1-9])"), laterality)
    ),
    op_group_2 = case_when(
      shared_laterality ~ str_extract(string, "(?<=W84[1-9]|Z84[1-9]).*")
    )
  )
My data have hundreds of millions of rows, so I want the most efficient approach, and this probably is not it.
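For a large number of rows, one option might be to avoid per-row logic and work on the whole column at once. The base R sketch below is only an illustration under stated assumptions, not a benchmarked solution: it assumes a group ends at a "Z84\d" code optionally followed by a "Z94\d" code (which is what the desired output above implies), and the function name split_groups_long and the output columns row and group are hypothetical. It returns one row per group in long format.

```r
# Hypothetical helper: vectorised over a character vector of procedure strings.
# Assumption: a group ends at a Z84x code, optionally followed by a Z94x code.
split_groups_long <- function(strings) {
  pattern <- "(?:[A-Z]\\d{3} )*?Z84\\d(?: Z94\\d)?"
  matches <- regmatches(strings, gregexpr(pattern, strings, perl = TRUE))

  # Long format: one row per extracted group, keyed by the input position.
  out <- data.frame(
    row = rep(seq_along(strings), lengths(matches)),
    group = as.character(unlist(matches)),
    stringsAsFactors = FALSE
  )

  # Trailing laterality code per input string (last four characters), NA if absent.
  has_trailing <- grepl("Z94\\d$", strings, perl = TRUE)
  trailing <- ifelse(has_trailing,
                     substring(strings, nchar(strings) - 3),
                     NA_character_)

  # Append the shared code to groups that do not already end in a laterality code.
  shared <- trailing[out$row]
  add <- !is.na(shared) & !grepl("Z94\\d$", out$group, perl = TRUE)
  out$group[add] <- paste(out$group[add], shared[add])
  out
}

split_groups_long(c("W779 Y767 W835 W848 Y189 Z846 T625 Z843 Z941",
                    "Y189 Z846 Z941",
                    "W779 Y767 W835 W848"))
```

Strings with no "Z84\d" code, such as the third one, produce no rows here. The same pattern should also work with stringr::str_extract_all(), which may be faster on very large inputs, but that has not been measured here.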
