python – Regex: Get rid of consecutive punctuation

I was trying to clean the words in a list using the following code:

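import re
from pyspark.sql import functions as sf
from pyspark.sql.types import ArrayType, StringType

# Keep only the items that start with at least two word characters
def clear_list(words_list):
    regex = re.compile(r'[\w\d]{2,}', re.U)
    filtered = [i for i in words_list if regex.match(i)]
    return filtered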

clear_list_udf = sf.udf(clear_list, ArrayType(StringType()))

items = items.withColumn("clear_words", clear_list_udf(sf.col("words")))

I need only words longer than one letter, with the punctuation removed. But I have a problem in cases like the following:

What I have:
[“””непутевые, заметки””, с, дмитрием, крыловым”] ->
[заметки””, дмитрием, крыловым”]

What I need:
[“””непутевые, заметки””, с, дмитрием, крыловым”] ->
[непутевые, заметки, дмитрием, крыловым]
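
One possible approach, offered as a sketch rather than a definitive answer: regex.match only checks whether an item starts with word characters, so it can drop or keep items but never changes them. To get the output above, the punctuation has to be stripped first and the length filter applied afterwards. The name EDGE_PUNCT is made up for this illustration, and the sketch assumes that only leading and trailing punctuation should go (an inner hyphen stays, and an underscore counts as a word character for \W).

```python
import re

# Hypothetical helper pattern: runs of non-word characters at the
# start or at the end of a string. Python 3 str patterns are
# Unicode-aware by default, so Cyrillic letters count as word characters.
EDGE_PUNCT = re.compile(r"^\W+|\W+$")

def clear_list(words_list):
    # Strip punctuation from both ends of every item first...
    stripped = [EDGE_PUNCT.sub("", w) for w in words_list]
    # ...then keep only what is still at least two characters long.
    return [w for w in stripped if len(w) >= 2]

words = ["“””непутевые", "заметки””", "с", "дмитрием", "крыловым”"]
result = clear_list(words)
# result == ["непутевые", "заметки", "дмитрием", "крыловым"]
```

Because the function still takes a list of strings and returns a list of strings, it should be usable with the same sf.udf(clear_list, ArrayType(StringType())) wrapper as in the question.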
