Python: any way to perform this "hybrid" split() on multi-lingual (e.g. Chinese & English) strings?
I have several multilingual strings that mix languages which use whitespace as a word separator (English, French, etc.) with languages which do not (Chinese, Japanese, Korean).
Given such a string, I want to split the English / French / etc. part into words, using whitespace as the separator, and to split the Chinese / Japanese / Korean part into individual characters.
And I want to put all of those separated components into one list.
Some examples will probably make this clear:
Case 1: English-only string. This case is easy:
    >>> "I love Python".split()
    ['I', 'love', 'Python']
Case 2: Chinese-only string:
    >>> list(u"我爱蟒蛇")
    [u'\u6211', u'\u7231', u'\u87d2', u'\u86c7']
In this case I can turn the string into a list of Chinese characters. But within the list I am getting Unicode escape representations:
    [u'\u6211', u'\u7231', u'\u87d2', u'\u86c7']
How do I get it to display the actual characters instead of the escapes? Something like this:
    ['我', '爱', '蟒', '蛇']
Case 3: A mix of English & Chinese:
I want to turn an input string such as
    "我爱python"
into a list like this:
    ['我', '爱', 'python']
Is it possible to do something like that?
I thought I'd show the regex approach, too. It doesn't feel right to me, but that is mostly because all of the language-specific i18n oddities I have seen make me worry that a regular expression might not be flexible enough for all of them. You may well not need any of that, though. (In other words: overdesign.)
    # -*- coding: utf-8 -*-
    # Python 2 syntax (ur'' literals, print statement).
    import re

    def group_words(s):
        regex = []

        # Match a whole word:
        regex += [ur'\w+']

        # Match a single CJK character:
        regex += [ur'[\u4e00-\ufaff]']

        # Match one of anything else, except for spaces:
        regex += [ur'[^\s]']

        regex = "|".join(regex)
        r = re.compile(regex)

        return r.findall(s)

    if __name__ == "__main__":
        print group_words(u"Testing English text")
        print group_words(u"我爱蟒蛇")
        print group_words(u"Testing English text我爱蟒蛇")
In practice, you would probably want to compile the regex only once, not on each call. Again, filling in the particulars of the character grouping is up to you.
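For illustration, here is a minimal sketch of the compile-once variant, written for Python 3 rather than the Python 2 used above. The module-level name TOKEN_RE is hypothetical, and the re.ASCII flag is an assumption on my part: Python 3 makes \w Unicode-aware for str patterns, so without the flag the whole-word alternative would swallow runs of CJK characters.

```python
import re

# Compiled once at import time instead of on every call.
# re.ASCII restricts \w to [a-zA-Z0-9_]; without it, Python 3 would let
# \w+ match CJK characters too and return them as one long "word".
TOKEN_RE = re.compile(
    r"\w+"                 # a whole ASCII word
    r"|[\u4e00-\ufaff]"    # a single character in the same range as above
    r"|\S",                # any other single non-space character
    re.ASCII,
)


def group_words(s):
    """Return ASCII words and single CJK characters as one flat list."""
    return TOKEN_RE.findall(s)


if __name__ == "__main__":
    print(group_words("Testing English text"))
    print(group_words("我爱蟒蛇"))
    print(group_words("Testing English text我爱蟒蛇"))
```

One trade-off of this sketch: with re.ASCII, accented letters (as in French words) are not treated as word characters and come back as separate single-character items, so the word alternative would need widening for such text. Also note that in Python 3 a printed list shows the characters themselves rather than \u escapes, provided the output encoding can represent them.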