{"cells":[{"cell_type":"markdown","metadata":{"id":"7if8oWEvRZ2O"},"source":["<a href=\"https://colab.research.google.com/github/smart-stats/ds4bio_book/blob/main/book/tooling_nlp.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>\n","\n","# Text processing\n","\n","## On text encoding\n","\n","Computers represent characters via character encodings. Basically, a\n","mapping from a binary representation to a character symbol\n","value. There's many character encodings and most often our text system\n","autodetects them.\n","\n","Character encodings existed a long time before computers, since people\n","seem to always want to represent letters as numbers. On a computer,\n","the program has to interpret the character code and display it in a\n","way that we want. ASCII represents characters as 7 bits, (unicode)\n","UTF-8, UTF-16 and UTF-32 represent characters as 8, 16, and 32\n","bits. Of course the greater the bit representation, the larger the\n","character set that can be represented. ASCII contains only the usual\n","keyboard characters whereas Unicode contains much more.\n","\n","[Here's](https://docs.python.org/3/howto/unicode.html#the-unicode-type)\n","some info about python unicode. For example, python's default\n","character encoding is UTF-8. So strings automatically allow UTF-8\n","characters"]},{"cell_type":"code","metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"OYEAgHkeRZ2t","executionInfo":{"status":"ok","timestamp":1713579109332,"user_tz":240,"elapsed":80,"user":{"displayName":"Brian Caffo","userId":"07979705296072332292"}},"outputId":"0f116e0a-1cd8-4b7f-ff25-3db0416ebd70"},"source":["print(\"😉\")\n","print(\"\\N{WINKING FACE}\")\n","print(\"\\U0001F609\")"],"execution_count":null,"outputs":[{"output_type":"stream","name":"stdout","text":["😉\n","😉\n","😉\n"]}]},{"cell_type":"markdown","metadata":{"id":"XCk3kyISRZ2w"},"source":["Unicode characters can be variable names, but emojis can't.  (However,\n","see\n","[here](https://betterprogramming.pub/emojis-as-python-variables-sure-why-not-96ce955dada1)\n","where they magically do `import pandas as 🐼`). Greek letters are fair\n","game."]},{"cell_type":"code","metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"pflSFcdDRZ2w","executionInfo":{"status":"ok","timestamp":1713579109333,"user_tz":240,"elapsed":71,"user":{"displayName":"Brian Caffo","userId":"07979705296072332292"}},"outputId":"58442506-181a-4720-e337-d9222e504889"},"source":["import numpy as np\n","\n","x = np.array([1, 5, 8, 10, 2, 4])\n","μ = x.mean()\n","σ = x.std()\n","print( (μ, σ) )"],"execution_count":null,"outputs":[{"output_type":"stream","name":"stdout","text":["(5.0, 3.1622776601683795)\n"]}]},{"cell_type":"markdown","metadata":{"id":"oauma4RARZ23"},"source":["This is annoying for programming. But, useful for labeling plot axes and whatnot.\n","\n","## Regular expressions\n","\n","Regular expressions (regexes) are simply advanced pattern matching\n","tools. Regular expressions represent regular characters, a, b, ..., z,\n","A, B, ..., Z exactly like you'd think. But, some other characters, .,\n","+, and others are metacharacters that are used to help us match\n","things. A backslash, \\, can be used to directly reference a\n","metacharacter, so \"\\\\+\", references the character \"+\". It can also be\n","used to escape a regular character. So \"\\\\d\" references the set of\n","digits rather than the character `d`.\n","\n","Honestly, I have to look up regex definitions everytime I use\n","them. [Here's](https://en.wikipedia.org/wiki/Regular_expression) a\n","table from wikipedia\n","reduced. [Here's](https://docs.python.org/3/library/re.html) the\n","python regex docs.\n","\n","| Metacharacter\t| Description |\n","|---|---|\n","| ^\t    | Matches the starting position within the string.|\n","| .\t    | Matches any single character. |\n","| [ ]\t| Matches a single character that is contained within the brackets. |\n","| [^ ]\t| Matches a single character that is not contained within the brackets.|\n","| $    \t| Matches the ending position of the string or the position just before a string-ending newline.|\n","| ( )   | Defines a marked subexpression.|\n","| \\n    | Matches what the nth marked subexpression matched, where n is a digit from 1 to 9 |\n","| *\t    | Matches the preceding element zero or more times.|\n","| {m,n}\t| Matches the preceding element at least m and not more than n times.|\n","\n","Many search functions in R and python allow for regexes. Especially,\n","the `re` python module. Try these simple examples and look at the\n","methods associated with the output. It contains methods for where the\n","regex occurs in the string, the whole input string itself etc. It\n","returns nil if there's no match."]},{"cell_type":"code","metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"ueKTWoBPRZ23","executionInfo":{"status":"ok","timestamp":1713579109333,"user_tz":240,"elapsed":61,"user":{"displayName":"Brian Caffo","userId":"07979705296072332292"}},"outputId":"8635e1b9-9316-4e50-ef6f-cbe43fb39dfe"},"source":["import re\n","out = re.search('[hcb]a', 'hat')\n","print( (out.group(0), out.string) )\n","out = re.search('[hcb]a', 'phat')\n","print( (out.group(0), out.string) )\n","out = re.search('[hcb]a', 'rat')\n","print(out)"],"execution_count":null,"outputs":[{"output_type":"stream","name":"stdout","text":["('ha', 'hat')\n","('ha', 'phat')\n","None\n"]}]},{"cell_type":"code","metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"v0yZm2qlRZ24","executionInfo":{"status":"ok","timestamp":1713579109334,"user_tz":240,"elapsed":55,"user":{"displayName":"Brian Caffo","userId":"07979705296072332292"}},"outputId":"bf2428fc-c49b-4f83-8208-72dee8606860"},"source":["out = re.findall('[hcb]a', 'hatcathat')\n","print(out)"],"execution_count":null,"outputs":[{"output_type":"stream","name":"stdout","text":["['ha', 'ca', 'ha']\n"]}]},{"cell_type":"markdown","metadata":{"id":"COFRLIz_RZ25"},"source":["We can match any letter with \".\"."]},{"cell_type":"code","metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"Y4ircsuzRZ28","executionInfo":{"status":"ok","timestamp":1713579109334,"user_tz":240,"elapsed":49,"user":{"displayName":"Brian Caffo","userId":"07979705296072332292"}},"outputId":"b840de03-6097-4579-a3fb-cd8d46032fde"},"source":["out = re.search('.a', 'rat')\n","print( (out.group(0), out.string) )\n","out = re.search('.a', 'phat')\n","print( (out.group(0), out.string) )"],"execution_count":null,"outputs":[{"output_type":"stream","name":"stdout","text":["('ra', 'rat')\n","('ha', 'phat')\n"]}]},{"cell_type":"markdown","metadata":{"id":"Dqtmv6qaRZ29"},"source":["We can search for things like any number using a dash."]},{"cell_type":"code","metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"CnXko7KoRZ2-","executionInfo":{"status":"ok","timestamp":1713579109335,"user_tz":240,"elapsed":41,"user":{"displayName":"Brian Caffo","userId":"07979705296072332292"}},"outputId":"a0b755e3-bdf0-437f-f633-63c4f72e1143"},"source":["out = re.search('subject[0-9].txt', 'subject3.txt')\n","print( (out.group(0), out.string) )\n","out = re.search('subject[0-9][0-9][0-9].txt', 'subject015.txt')\n","print( (out.group(0), out.string) )"],"execution_count":null,"outputs":[{"output_type":"stream","name":"stdout","text":["('subject3.txt', 'subject3.txt')\n","('subject015.txt', 'subject015.txt')\n"]}]},{"cell_type":"markdown","metadata":{"id":"mdrUhiYGRZ2_"},"source":["'CHARACTER*' will match any number of the character, inncluding 0,\n","whereas `CHAR+` matches one or more times."]},{"cell_type":"code","metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"P6ODH-ZSRZ2_","executionInfo":{"status":"ok","timestamp":1713579109336,"user_tz":240,"elapsed":35,"user":{"displayName":"Brian Caffo","userId":"07979705296072332292"}},"outputId":"d89d67a7-7c3a-4302-d8f6-461a26a0ee6f"},"source":["out = re.search('subject0*.txt', 'subject.txt')\n","print( (out.group(0), out.string) )\n","out = re.search('subject0*.txt', 'subject000.txt')\n","print( (out.group(0), out.string) )\n","out = re.search('subject0*.txt', 'subject123.txt')\n","print(out)\n","out = re.search('subject0+.txt', 'subject.txt')\n","print(out)\n","out = re.search('subject0+.txt', 'subject000.txt')\n","print( (out.group(0), out.string) )\n","out = re.search('subject0+.txt', 'subject123.txt')\n","print(out)"],"execution_count":null,"outputs":[{"output_type":"stream","name":"stdout","text":["('subject.txt', 'subject.txt')\n","('subject000.txt', 'subject000.txt')\n","None\n","None\n","('subject000.txt', 'subject000.txt')\n","None\n"]}]},{"cell_type":"markdown","metadata":{"id":"mWeLXXgqRZ2_"},"source":["`CHARACTER?` matches one or zero times."]},{"cell_type":"code","metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"DkRWrGrnRZ3A","executionInfo":{"status":"ok","timestamp":1713579109337,"user_tz":240,"elapsed":30,"user":{"displayName":"Brian Caffo","userId":"07979705296072332292"}},"outputId":"68aae250-c4e1-406d-ee54-35db8f8613ba"},"source":["out = re.search('subject0?.txt', 'subject.txt')\n","print( (out.group(0), out.string) )\n","out = re.search('subject0?.txt', 'subject0.txt')\n","print( (out.group(0), out.string) )\n","out = re.search('subject0?.txt', 'subject00.txt')\n","print(out)"],"execution_count":null,"outputs":[{"output_type":"stream","name":"stdout","text":["('subject.txt', 'subject.txt')\n","('subject0.txt', 'subject0.txt')\n","None\n"]}]},{"cell_type":"markdown","metadata":{"id":"m9RrFF3dRZ3B"},"source":["You can string together regexes to obtain more complex searches. For example, the following searches for `subject[GREATER THAN ONE NUMBER].txt`"]},{"cell_type":"code","metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"F1ujsHfGRZ3B","executionInfo":{"status":"ok","timestamp":1713579109337,"user_tz":240,"elapsed":27,"user":{"displayName":"Brian Caffo","userId":"07979705296072332292"}},"outputId":"56f98e68-cc38-4815-fbb6-0500f684f9c4"},"source":["out = re.search('[0-9]+.txt', 'subject501.txt')\n","print( (out.group(0), out.string) )"],"execution_count":null,"outputs":[{"output_type":"stream","name":"stdout","text":["('501.txt', 'subject501.txt')\n"]}]},{"cell_type":"markdown","metadata":{"id":"FQo8oBs4RZ3D"},"source":["Python's `re` library has several other methods than search including:\n","`match`, `findall()`, `finditer()` (see [here]()).\n","\n","That's enough regexes. My typical workflow for regexes is to simply\n","relearn them every time I need them, since I don't use them enough to\n","get terribly fluent.\n","\n","## Natural language tool kit\n","\n","nltk is probably the most widely used natural language toolkit.\n","To install nltk, just use conda or pip. However, I also had to run `nltk.download()` to download the extra stuff required for the package to run. I found [this](https://realpython.com/nltk-nlp-python/) tutorial helpful."]},{"cell_type":"code","metadata":{"id":"vK3yf6g1RZ3E"},"source":["import nltk\n","## Download the language libraries\n","## this will walk you through it. I just did\n","## all-nltk\n","##nltk.download()"],"execution_count":null,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"v2YExSf6RZ3E"},"source":["Tokenizing words and sentences. Consider this history and physical exam note from [here](https://www.med.unc.edu/medclerk/education/grading/history-and-physical-examination-h-p-examples/).\n"]},{"cell_type":"code","metadata":{"id":"WYjDJlDtRZ3F"},"source":["note = \"\"\"HPI is a 76 yo man with h/o HTN, DM, and sleep apnea who presented to the ED complaining of chest pain. He states that the pain began the day before and consisted of a sharp pain that lasted around 30 seconds, followed by a dull pain that would last around 2 minutes. The pain was located over his left chest area somewhat near his shoulder. The onset of pain came while the patient was walking in his home. He did not sit and rest during the pain, but continued to do household chores. Later on in the afternoon he went to the gym where he walked 1 mile on the treadmill, rode the bike for 5 minutes, and swam in the pool. After returning from the gym he did some work out in the yard, cutting back some vines. He did not have any reoccurrences of chest pain while at the gym or later in the evening. The following morning (of his presentation to the ED) he noticed the pain as he was getting out of bed. Once again it was a dull pain, preceded by a short interval of a sharp pain. The patient did experience some tingling in his right arm after the pain ceased. He continued to have several episodes of the pain throughout the morning, so his daughter-in-law decided to take him to the ED around 12:30pm. The painful episodes did not increase in intensity or severity during this time. At the ED the patient was given nitroglycerin, which he claims helped alleviate the pain somewhat. HPI has not experienced any shortness of breath, nausea, or diaphoresis during these episodes of pain. He has never had chest pain in the past. He has been told “years ago” that he has a right bundle branch block and premature heart beats.\"\"\""],"execution_count":null,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"Q1dU27OTRZ3F"},"source":["Let's tokenize this note by sentence and word."]},{"cell_type":"code","metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"pXcXFdrlRZ3F","executionInfo":{"status":"ok","timestamp":1713580793784,"user_tz":240,"elapsed":430,"user":{"displayName":"Brian Caffo","userId":"07979705296072332292"}},"outputId":"99358e12-32ec-4e72-eca7-c9d1df02a27c"},"source":["sentences = nltk.tokenize.sent_tokenize(note)\n","words = nltk.tokenize.word_tokenize(note)\n","for i in range(3):\n","    print(sentences[i])\n","print(words)"],"execution_count":null,"outputs":[{"output_type":"stream","name":"stdout","text":["HPI is a 76 yo man with h/o HTN, DM, and sleep apnea who presented to the ED complaining of chest pain.\n","He states that the pain began the day before and consisted of a sharp pain that lasted around 30 seconds, followed by a dull pain that would last around 2 minutes.\n","The pain was located over his left chest area somewhat near his shoulder.\n","['HPI', 'is', 'a', '76', 'yo', 'man', 'with', 'h/o', 'HTN', ',', 'DM', ',', 'and', 'sleep', 'apnea', 'who', 'presented', 'to', 'the', 'ED', 'complaining', 'of', 'chest', 'pain', '.', 'He', 'states', 'that', 'the', 'pain', 'began', 'the', 'day', 'before', 'and', 'consisted', 'of', 'a', 'sharp', 'pain', 'that', 'lasted', 'around', '30', 'seconds', ',', 'followed', 'by', 'a', 'dull', 'pain', 'that', 'would', 'last', 'around', '2', 'minutes', '.', 'The', 'pain', 'was', 'located', 'over', 'his', 'left', 'chest', 'area', 'somewhat', 'near', 'his', 'shoulder', '.', 'The', 'onset', 'of', 'pain', 'came', 'while', 'the', 'patient', 'was', 'walking', 'in', 'his', 'home', '.', 'He', 'did', 'not', 'sit', 'and', 'rest', 'during', 'the', 'pain', ',', 'but', 'continued', 'to', 'do', 'household', 'chores', '.', 'Later', 'on', 'in', 'the', 'afternoon', 'he', 'went', 'to', 'the', 'gym', 'where', 'he', 'walked', '1', 'mile', 'on', 'the', 'treadmill', ',', 'rode', 'the', 'bike', 'for', '5', 'minutes', ',', 'and', 'swam', 'in', 'the', 'pool', '.', 'After', 'returning', 'from', 'the', 'gym', 'he', 'did', 'some', 'work', 'out', 'in', 'the', 'yard', ',', 'cutting', 'back', 'some', 'vines', '.', 'He', 'did', 'not', 'have', 'any', 'reoccurrences', 'of', 'chest', 'pain', 'while', 'at', 'the', 'gym', 'or', 'later', 'in', 'the', 'evening', '.', 'The', 'following', 'morning', '(', 'of', 'his', 'presentation', 'to', 'the', 'ED', ')', 'he', 'noticed', 'the', 'pain', 'as', 'he', 'was', 'getting', 'out', 'of', 'bed', '.', 'Once', 'again', 'it', 'was', 'a', 'dull', 'pain', ',', 'preceded', 'by', 'a', 'short', 'interval', 'of', 'a', 'sharp', 'pain', '.', 'The', 'patient', 'did', 'experience', 'some', 'tingling', 'in', 'his', 'right', 'arm', 'after', 'the', 'pain', 'ceased', '.', 'He', 'continued', 'to', 'have', 'several', 'episodes', 'of', 'the', 'pain', 'throughout', 'the', 'morning', ',', 'so', 'his', 'daughter-in-law', 'decided', 'to', 'take', 'him', 'to', 'the', 'ED', 'around', '12:30pm', '.', 'The', 'painful', 'episodes', 'did', 'not', 'increase', 'in', 'intensity', 'or', 'severity', 'during', 'this', 'time', '.', 'At', 'the', 'ED', 'the', 'patient', 'was', 'given', 'nitroglycerin', ',', 'which', 'he', 'claims', 'helped', 'alleviate', 'the', 'pain', 'somewhat', '.', 'HPI', 'has', 'not', 'experienced', 'any', 'shortness', 'of', 'breath', ',', 'nausea', ',', 'or', 'diaphoresis', 'during', 'these', 'episodes', 'of', 'pain', '.', 'He', 'has', 'never', 'had', 'chest', 'pain', 'in', 'the', 'past', '.', 'He', 'has', 'been', 'told', '“', 'years', 'ago', '”', 'that', 'he', 'has', 'a', 'right', 'bundle', 'branch', 'block', 'and', 'premature', 'heart', 'beats', '.']\n"]}]},{"cell_type":"markdown","metadata":{"id":"o2XzF4oMRZ3G"},"source":["Let's filter out some common English filler words."]},{"cell_type":"code","metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"Bd-zLJnFRZ3G","executionInfo":{"status":"ok","timestamp":1713580834903,"user_tz":240,"elapsed":157,"user":{"displayName":"Brian Caffo","userId":"07979705296072332292"}},"outputId":"7c22e307-25b5-4327-bcd5-f9a3843ff817"},"source":["filter_words = set(nltk.corpus.stopwords.words(\"english\"))\n","filtered = [w for w in words if w in filter_words]\n","print(filtered[0 : 10])"],"execution_count":null,"outputs":[{"output_type":"stream","name":"stdout","text":["['is', 'a', 'with', 'and', 'who', 'to', 'the', 'of', 'that', 'the']\n"]}]},{"cell_type":"markdown","metadata":{"id":"amW3C943RZ3G"},"source":["We can also reduce words to their stems, i.e. grammatical root form."]},{"cell_type":"code","metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"mN2GhLalRZ3H","executionInfo":{"status":"ok","timestamp":1713580837103,"user_tz":240,"elapsed":221,"user":{"displayName":"Brian Caffo","userId":"07979705296072332292"}},"outputId":"0977b9b0-a134-45f9-a982-80695a183a1d"},"source":["stemmer = nltk.stem.PorterStemmer().stem\n","print(stemmer(\"Diagnosed\"))\n","print(stemmer(\"Diagnosing\"))\n","print(stemmer(\"diagnosed\"))"],"execution_count":null,"outputs":[{"output_type":"stream","name":"stdout","text":["diagnos\n","diagnos\n","diagnos\n"]}]},{"cell_type":"markdown","metadata":{"id":"X07PelJ7RZ3H"},"source":["Lemmatization reduces words to an underlying meaing."]},{"cell_type":"code","metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"8AWK20OSRZ3L","executionInfo":{"status":"ok","timestamp":1713580840859,"user_tz":240,"elapsed":1771,"user":{"displayName":"Brian Caffo","userId":"07979705296072332292"}},"outputId":"cbd57796-8ddb-40da-ec8c-e9534a93de71"},"source":["from nltk.stem import WordNetLemmatizer\n","lemmatize = WordNetLemmatizer().lemmatize\n","## Here v for verb\n","print(lemmatize(\"am\", pos = \"v\"))\n","print(lemmatize(\"are\", pos = \"v\"))\n","print(lemmatize(\"worst\", pos = \"n\"))\n","print(lemmatize(\"worst\", pos = \"a\"))"],"execution_count":null,"outputs":[{"output_type":"stream","name":"stdout","text":["be\n","be\n","worst\n","bad\n"]}]},{"cell_type":"markdown","metadata":{"id":"S69D-iIERZ3L"},"source":["## Sentiment\n","\n","Sentiment is predicting a score about the tone of a text. The compound value is a numeric sentiment score.  "]},{"cell_type":"code","metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"D8iYF1izRZ3M","executionInfo":{"status":"ok","timestamp":1713580844683,"user_tz":240,"elapsed":231,"user":{"displayName":"Brian Caffo","userId":"07979705296072332292"}},"outputId":"c0cd0ebb-9a5b-4f42-ccc2-e91bc1e90245"},"source":["from nltk.sentiment import SentimentIntensityAnalyzer\n","sentiment = SentimentIntensityAnalyzer().polarity_scores\n","print(sentiment(note))\n","print(sentiment(\"Sentiment analysis is great!\"))\n","print(sentiment(\"The world is doomed.\"))"],"execution_count":null,"outputs":[{"output_type":"stream","name":"stdout","text":["{'neg': 0.214, 'neu': 0.786, 'pos': 0.0, 'compound': -0.9971}\n","{'neg': 0.0, 'neu': 0.406, 'pos': 0.594, 'compound': 0.6588}\n","{'neg': 0.583, 'neu': 0.417, 'pos': 0.0, 'compound': -0.6369}\n"]}]}],"metadata":{"kernelspec":{"name":"python3","language":"python","display_name":"Python 3"},"colab":{"provenance":[]}},"nbformat":4,"nbformat_minor":0}