python – Regex: Get rid of consecutive punctuation
I was trying to clean the words in a list using the following code:
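import re
from pyspark.sql import functions as sf
from pyspark.sql.types import ArrayType, StringType

# Define a function to clean a list of words
def clear_list(words_list):
    # Keep only the words that start with two or more word characters
    regex = re.compile(r'[\w\d]{2,}', re.U)
    filtered = [i for i in words_list if regex.match(i)]
    return filtered

clear_list_udf = sf.udf(clear_list, ArrayType(StringType()))
items = items.withColumn("clear_words", clear_list_udf(sf.col("words")))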
I need only words longer than one letter, with the punctuation removed. But I run into a problem in cases like the following:
What I have:
[“””непутевые, заметки””, с, дмитрием, крыловым”] ->
[заметки””, дмитрием, крыловым”]
What I need:
[“””непутевые, заметки””, с, дмитрием, крыловым”] ->
[непутевые, заметки, дмитрием, крыловым]
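One possible approach, offered as a minimal sketch rather than a definitive fix: `regex.match` only tests whether a word starts with two word characters, so it drops words with leading quotes and never removes trailing ones. The sketch below instead strips every run of punctuation from each word and then filters by length. The name `PUNCT_RE` is introduced here for illustration, and the sample list assumes the column has already been split into the tokens shown above. It runs as plain Python, without Spark.

```python
import re

# Runs of characters that are neither word characters nor whitespace
# (quotes, commas, and other punctuation). In Python 3, \w is
# Unicode-aware for str patterns, so Cyrillic letters are kept.
# Note that \w also matches the underscore, so "_" is not removed.
PUNCT_RE = re.compile(r"[^\w\s]+")


def clear_list(words_list):
    """Strip punctuation from each word, then keep words of 2+ characters."""
    stripped = (PUNCT_RE.sub("", word) for word in words_list)
    return [word for word in stripped if len(word) >= 2]


# Sample tokens taken from the example in the question
words = ["“””непутевые", "заметки””", "с", "дмитрием", "крыловым”"]
print(clear_list(words))
# -> ['непутевые', 'заметки', 'дмитрием', 'крыловым']
```

Because the function keeps the same name and signature, it should be usable with the existing `sf.udf(clear_list, ArrayType(StringType()))` call. If the `words` column can contain nulls, the function would also need a guard for `words_list is None`.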