{ "nbformat": 4, "nbformat_minor": 0, "metadata": { "colab": { "name": "Toy_dataset.ipynb", "provenance": [] }, "kernelspec": { "display_name": "Python 3", "name": "python3" }, "language_info": { "name": "python" } }, "cells": [ { "cell_type": "markdown", "source": [ "#Preliminary Code" ], "metadata": { "id": "EyFJUn2r9YYA" } }, { "cell_type": "markdown", "metadata": { "id": "RVbTCZgUZG3K" }, "source": [ "## Mount Google Drive Data" ] }, { "cell_type": "code", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "-cnkjmUFZBIQ", "outputId": "9132cc55-4b86-4df8-c616-552413383320" }, "source": [ "from google.colab import drive\n", "drive.mount('/content/drive')" ], "execution_count": null, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount(\"/content/drive\", force_remount=True).\n" ] } ] }, { "cell_type": "markdown", "metadata": { "id": "ccNtiNZM22CB" }, "source": [ "First, I am going to work with just trying to get a k-mer coutning program working. Thus, I will start small with one sequence of length 100. Then I will move on to multiple sequences of length 100 to replicate using contigs as opposed to an assembled genome. Eventually, once I get the k-mer counting working, I will move on to trying to utilize the k-mer counts in machine learning algorithms." ] }, { "cell_type": "markdown", "source": [ "## Import Packages" ], "metadata": { "id": "u1YMvNSO7fLk" } }, { "cell_type": "code", "source": [ "#Import packages\n", "from datetime import datetime, date\n", "import random\n", "from operator import itemgetter\n", "from sklearn.model_selection import train_test_split, KFold\n", "from sklearn import svm\n", "import sys\n", "import numpy as np" ], "metadata": { "id": "Pu8MbgNtVJW4" }, "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "HoZYY5NP3svM" }, "source": [ "## Create Random Sequence" ] }, { "cell_type": "code", "metadata": { "id": "GRenITuZ2dl_" }, "source": [ "# def generate_rand_sequence(length):\n", "# sequence = \"\" #Initialize sequence\n", "# for i in range(0,length):\n", "# check = random.randint(1,4)\n", "# if check == 1: sequence += \"A\"\n", "# elif check == 2: sequence += \"T\"\n", "# elif check == 3: sequence += \"C\"\n", "# else: sequence += \"G\"\n", "# return(sequence)" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "ARt7tuFd6VVg" }, "source": [ "Saving first sequence created in its own variable so it will not change with each run." ] }, { "cell_type": "code", "metadata": { "id": "noYFJPZD6N7F" }, "source": [ "# c_sequence = \"ACGTAGGCGACCGCCGGCAAAGGTACGGATTCTTTATGACAAGAAAGTATGACGAGTTGAGGACCTGATGGGAGCAAGAGGTCGTTCCTCTGGCGGTTCG\"" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "D2SePRgc6no_" }, "source": [ "# Naive Approach to k-mer Counting\n", "This will count k-mers in the simplest way possible. It will iterate through a list checking to see if a k-mer has been seen before. If it has, it increases the count for that k-mer, if it hasn't, it adds it to the list." 
] }, { "cell_type": "markdown", "source": [ "##Gets the kmer count in an array using the naive approach" ], "metadata": { "id": "LaD9P0FP9Dkr" } }, { "cell_type": "code", "metadata": { "id": "fDPgQlsO75Gz" }, "source": [ "#functions\n", "\n", "##returns the smaller alphabetically of the forward and reverse complement\n", "def get_alph_comp(forward):\n", " rev_comp = \"\"\n", " for n in range(len(forward),0,-1):\n", " if forward[n-1:n]==\"A\": rev_comp += \"T\"\n", " elif forward[n-1:n]==\"T\": rev_comp += \"A\"\n", " elif forward[n-1:n]==\"C\": rev_comp += \"G\"\n", " else: rev_comp += \"C\"\n", " for n in range(0,len(forward)):\n", " if forward[n:n+1] < rev_comp[n:n+1]:\n", " return forward\n", " elif forward[n:n+1] > rev_comp[n:n+1]:\n", " return rev_comp\n", " return rev_comp\n", "\n", "\n", "#Gets k-mer count given a k value using a naive vector indexed search approach\n", "def naive_count(k, seq, kmer_count):\n", " for i in range(0,len(seq)-(k-1)):\n", " check = seq[i:i+k]\n", " check = get_alph_comp(check)\n", " check_val = 0\n", " for j in range(0,len(kmer_count)):\n", " if check == kmer_count[j][0]:\n", " kmer_count[j][1] += 1\n", " check_val=1\n", " break\n", " if check_val==0:\n", " kmer_count.append([check,1])\n", "\n", " return(kmer_count)\n", "\n" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "source": [ "Testing to make sure things are working." ], "metadata": { "id": "iB0tHIE770la" } }, { "cell_type": "code", "metadata": { "id": "2DX7ZbGPANGs" }, "source": [ "# print(naive_count(5,c_sequence))\n", "# print(naive_count(10,c_sequence))\n", "# print(naive_count(1,c_sequence))" ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "sSyk8INTZQ8d" }, "source": [ "# test = generate_rand_sequence(1000)\n", "# naive_count(31,test)" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "3E_PVdrstn-C" }, "source": [ "# Implementing Hash Tables (Dictionaries)\n", "I have done some reading on hash tables and how they work. I am going to now attempt to implement a basic python dictionary to see how it performs with the same data." ] }, { "cell_type": "markdown", "source": [ "##Gets the dictionary of kmers from a sequence" ], "metadata": { "id": "xkvgGWfz8t1G" } }, { "cell_type": "code", "metadata": { "id": "nQx8jX6Duw26" }, "source": [ "#Gets k-mer count given a k value using a dictionary\n", "def dictionary_count(k, seq, kmer_dict):\n", " for i in range(0,len(seq)-(k-1)):\n", " check = seq[i:i+k]\n", " check = get_alph_comp(check)\n", " if check in kmer_dict: kmer_dict[check] += 1\n", " else: kmer_dict[check] = 1\n", " return(kmer_dict)\n" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "source": [ "Comparing how the naive approach and the dictionary approach perform at various differen sequence lengths." 
], "metadata": { "id": "KCEREeJG8D7w" } }, { "cell_type": "code", "metadata": { "id": "bBBvgCQUxJ0q" }, "source": [ "# t_1k = generate_rand_sequence(1000)\n", "# t_10k = generate_rand_sequence(10000)\n", "# t_100k = generate_rand_sequence(100000)\n", "# t_1m = generate_rand_sequence(1000000)\n", "\n", "# n_t_1k = []\n", "# d_t_1k = {}\n", "# n_t_10k = []\n", "# d_t_10k = {}\n", "# n_t_100k = []\n", "# d_t_100k = {}\n", "# d_t_1m = {}\n", "\n", "# print(\"1,000\")\n", "# start = datetime.now().time()\n", "# n_t_1k = naive_count(31, t_1k, n_t_1k)\n", "# end = datetime.now().time()\n", "# print(\"Compute time for Naive: \", datetime.combine(date.min, end)-datetime.combine(date.min, start))\n", "# start_time = datetime.now().time()\n", "# d_t_1k = dictionary_count(31, t_1k, d_t_1k)\n", "# end = datetime.now().time()\n", "# print(\"Compute time for Dictionary: \", datetime.combine(date.min, end)-datetime.combine(date.min, start))\n", "\n", "# print(\"10,000\")\n", "# start = datetime.now().time()\n", "# n_t_10k = naive_count(31, t_10k, n_t_10k)\n", "# end = datetime.now().time()\n", "# print(\"Compute time for Naive: \", datetime.combine(date.min, end)-datetime.combine(date.min, start))\n", "# start = datetime.now().time()\n", "# d_t_10k = dictionary_count(31, t_10k, d_t_10k)\n", "# end = datetime.now().time()\n", "# print(\"Compute time for Dictionary: \", datetime.combine(date.min, end)-datetime.combine(date.min, start))\n", "\n", "\n", "# print(\"100,000\")\n", "# start = datetime.now().time()\n", "# n_t_100k = naive_count(31, t_100k, n_t_100k)\n", "# end = datetime.now().time()\n", "# print(\"Compute time for Naive: \", datetime.combine(date.min, end)-datetime.combine(date.min, start))\n", "# start = datetime.now().time()\n", "# d_t_100k = dictionary_count(31, t_100k, d_t_100k)\n", "# end = datetime.now().time()\n", "# print(\"Compute time for Dictionary: \", datetime.combine(date.min, end)-datetime.combine(date.min, start))\n", "\n", "# print(\"1,000,000\")\n", "# #start = datetime.now().time()\n", "# #n_t_1m = naive_count(31,t_1m)\n", "# #end = datetime.now().time()\n", "# #print(\"Compute time for Naive: \", datetime.combine(date.min, end)-datetime.combine(date.min, start))\n", "# start = datetime.now().time()\n", "# d_t_1m = dictionary_count(31, t_1m, d_t_1m)\n", "# end = datetime.now().time()\n", "# print(\"Compute time for Dictionary: \", datetime.combine(date.min, end)-datetime.combine(date.min, start))\n" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "hLaB8v5vwy0y" }, "source": [ "This next section of code was used to verfiy that both functions were running as intended and providing the same output." ] }, { "cell_type": "code", "metadata": { "id": "6-MQpOMjp3v3" }, "source": [ "# sublist = n_t_1k[85:95]\n", "# print(sublist)" ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "XbHLRisaqOLM" }, "source": [ "# keylist = []\n", "# for item in sublist:\n", "# keylist.append(item[0])\n", "# print(keylist)\n", "# type(keylist[0])\n", "# for x in keylist:\n", "# print(d_t_1k.get(x))" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "yeCHkUJwyr_2" }, "source": [ "# Running Code on Real Data\n", "This next section will focus on trying to get the code to work on a real FASTA file that comes from the dataset I am trying to work with. 
Some small adjustments to the code need to be made so that I can read in sequences from the file one line at a time, so as not to overload the memory." ] }, { "cell_type": "markdown", "source": [ "##Gets the dictionary of kmers from a file" ], "metadata": { "id": "874y_Mg-8ltg" } }, { "cell_type": "code", "metadata": { "id": "RBkZP7jwza-3" }, "source": [ "def kmer_count(k, file):\n", "    kmer_dict = {}\n", "    with open(file) as seq_file:\n", "        remainder = \"\" #Holds the last k-1 bases of the previous line so k-mers spanning line breaks are counted\n", "        #Loop through lines of file\n", "        for line in seq_file:\n", "            #A '>' marks a FASTA header: a new sequence starts, unrelated to the previous one\n", "            if line[0] == '>':\n", "                remainder = \"\"\n", "            #Otherwise the line contains sequence data\n", "            else:\n", "                line = remainder + line.strip() #Strip the newline before counting and prepend the carry-over\n", "                kmer_dict = dictionary_count(k,line,kmer_dict) #Update the dictionary\n", "                remainder = line[-(k-1):] #Carry the last k-1 characters into the next line\n", "    return kmer_dict" ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "TckE61Ex1seZ" }, "source": [ "\n", "# #Skip #40 and #92 as those files are not included for some reason\n", "\n", "# for i in range(36,96):\n", "#     if i == 40 or i == 92:\n", "#         continue\n", "#     filename = \"SRR87375\" + str(i) + \"_scaffolds.fasta\"\n", "#     filepath = '/content/drive/MyDrive/AMR/Data/Scaffold_fasta/' + filename\n", "#     savefile = '/content/drive/MyDrive/AMR/Data/Scaffold_kmers/SRR87375' + str(i) + '_kmers.npy'\n", "#     start = datetime.now().time()\n", "#     kmer_dict = kmer_count(31,filepath)\n", "#     end = datetime.now().time()\n", "#     np.save(savefile,kmer_dict)\n", "#     print(\"Compute time for \",filename ,\": \", datetime.combine(date.min, end)-datetime.combine(date.min, start))\n" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "source": [ "Now that we can count k-mers on an entire set of files, we can begin testing machine learning algorithms on these datasets. One test I would like to run is seeing whether using the count data, as opposed to binary presence/absence, would increase the accuracy of AMR prediction when using the Set Covering Machine (SCM). Additionally, Dr. Drouin talks about using unitigs as opposed to k-mers due to the redundancy of k-mers amongst individual genome sequences.\n", "\n", "Dr. Bailey was wondering whether or not a different k-mer length would matter when dealing with count data as opposed to binary presence/absence data for each k-mer.\n", "\n", "Also, I want to compare a classification tree to the SCM to see the advantages and disadvantages of each. The reason for this is that both models are often seen as interpretable.\n", "\n", "# Converting dictionary files to presence/absence vectors\n", "\n", "I need to create another version of dictionary_count that takes in these numpy files and creates one large dictionary of all the k-mers present in the entire dataset."
], "metadata": { "id": "8RntNlE3XqJI" } }, { "cell_type": "markdown", "source": [ "##Compiles all .npy dictionaries into a single .npy dictionary for the entire dataset" ], "metadata": { "id": "7U8rWiyC-TFy" } }, { "cell_type": "code", "source": [ "# #Gets k-mer count given a k value using a dictionary\n", "# def pres_abs_cumm(pres_abs_dict, kmer_dict):\n", "# for key in kmer_dict:\n", "# if key not in pres_abs_dict: pres_abs_dict[key] = 1\n", "# return(pres_abs_dict)\n", "\n", "# cumm_pres_abs_dict = {}\n", "# for i in range(36,96):\n", "# if i == 40 or i == 92:\n", "# continue\n", "# filepath = '/content/drive/MyDrive/AMR/Data/Scaffold_kmers/SRR87375' + str(i) + '_kmers.npy'\n", "# individual_file_dict = np.load(filepath,allow_pickle='TRUE').item()\n", "# cumm_pres_abs_dict = pres_abs_cumm(cumm_pres_abs_dict, individual_file_dict)\n", "# savefile = '/content/drive/MyDrive/AMR/Data/global_vals/pres_abs__scaff_dict.npy'\n", "# np.save(savefile,cumm_pres_abs_dict)" ], "metadata": { "id": "Lcqip0HN7aS5" }, "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "source": [ "Now I need to convert the pres/abs dictionary into a vector that can has an index notation. I should be able to take advantage of the fact that I have a dictionary for each individual sample. I can use this to look up the value in each sample quickly and then assign it to the vector. I plan on making both a pres/abs vector, as well as a count vector in the hopes that I will be able to mess around with the count vector a little more in the future." ], "metadata": { "id": "5E8m_dxudtIi" } }, { "cell_type": "markdown", "source": [ "##Creates a vector of the kmers from the entire dataset dictionary" ], "metadata": { "id": "K9DLXZLI_ILD" } }, { "cell_type": "code", "source": [ "# filepath = '/content/drive/MyDrive/AMR/Data/global_vals/pres_abs__scaff_dict.npy'\n", "# temp_dict = np.load(filepath,allow_pickle='TRUE').item()\n", "# size = len(temp_dict)\n", "# key_vector = np.empty(size,dtype=object)\n", "# count=0\n", "# for item in temp_dict:\n", "# key_vector[count] = item\n", "# count+= 1\n", "# savepath = '/content/drive/MyDrive/AMR/Data/global_vals/key_vector.npy'\n", "# np.save(savepath, key_vector)" ], "metadata": { "id": "Z805O2b9LHhB" }, "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "source": [ "##Creates the presence/absence vector for each individual dictionary using that dictionary and the key vector" ], "metadata": { "id": "jbQa2NXE_nRg" } }, { "cell_type": "code", "source": [ "\n", "# for i in range(36,96):\n", "# zero_vector = np.zeros(size,dtype=int)\n", "# if i == 40 or i == 92:\n", "# continue\n", "# filename = '/content/drive/MyDrive/AMR/Data/Scaffold_kmers/SRR87375' + str(i) + '_kmers.npy'\n", "# file_dict = np.load(filename,allow_pickle='TRUE').item()\n", "# count = 0\n", "# for j in key_vector:\n", "# if j in file_dict: zero_vector[count]=1 \n", "# count += 1\n", "# savename = '/content/drive/MyDrive/AMR/Data/Scaffold_Presence/SRR87375' + str(i) + '_presence.npy'\n", "# np.save(savename,zero_vector)" ], "metadata": { "id": "0X2XFE69M06o" }, "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "source": [ "##Testing to make sure the presence/absence vectors appear correct" ], "metadata": { "id": "z0PDbtQjAsPj" } }, { "cell_type": "code", "source": [ "# filename2 = '/content/drive/MyDrive/AMR/Data/global_vals/key_vector.npy'\n", "# key_vector = np.load(filename2,allow_pickle='TRUE')" ], "metadata": { "id": "tvtY8LbmjrZq" }, "execution_count": null, 
"outputs": [] }, { "cell_type": "markdown", "source": [ "##Testing cont." ], "metadata": { "id": "LO1RsLv7A0ks" } }, { "cell_type": "code", "source": [ "# filename = '/content/drive/MyDrive/AMR/Data/Scaffold_Presence/SRR8737550_presence.npy'\n", "# file_vector = np.load(filename,allow_pickle='TRUE')\n", "\n", "# print(file_vector[100:1000])" ], "metadata": { "id": "i8v6VOKHiIx1" }, "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "source": [ "#Machine Learning Algorithms" ], "metadata": { "id": "LlrfT-7Aa-Ga" } }, { "cell_type": "markdown", "source": [ "First, need to create a label vector for each resistance type that tells whether the individual sample is resistant or not" ], "metadata": { "id": "Ur1S3jsBrcW_" } }, { "cell_type": "markdown", "source": [ "##Function: Creates label vector" ], "metadata": { "id": "TR0ZrXOTMk6N" } }, { "cell_type": "code", "source": [ "def get_label_vector(resistance, matrix):\n", " column = 0\n", " for i in matrix[0,:]:\n", " if resistance == i:\n", " break\n", " column += 1\n", " if column == matrix[0,:].size:\n", " sys.exit(\"Resistance type not applicable\")\n", " submatrix = matrix[:,[0,column]]\n", " delete_rows = []\n", " for i in range(0,submatrix[:,1].size):\n", " if submatrix[i,1] =='-':\n", " delete_rows.append(i)\n", " submatrix = np.delete(submatrix,delete_rows,0)\n", " return submatrix" ], "metadata": { "id": "eKJqoCWma96U" }, "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "source": [ "##Creating the label vectors\n" ], "metadata": { "id": "GT-ud_bxMt-O" } }, { "cell_type": "code", "source": [ "labels_csv = \"/content/drive/MyDrive/AMR/Data/global_vals/Pseudomonas_Resistance_Labels.csv\"\n", "labels_matrix = np.genfromtxt(labels_csv,delimiter=',',dtype=str)\n", "Tobra = get_label_vector('Tobramycin',labels_matrix)\n", "Cefta = get_label_vector('Ceftazidim',labels_matrix)\n", "Cipro = get_label_vector('Ciprofloxacin',labels_matrix)\n", "Mero = get_label_vector('Meropenem',labels_matrix)\n", "Coli = get_label_vector('Colistin',labels_matrix)" ], "metadata": { "id": "5T_aIG5UD9S2" }, "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "source": [ "##SVM" ], "metadata": { "id": "LCHkXziaPcne" } }, { "cell_type": "code", "source": [ "resist = Tobra\n", "sample_num = resist[1:,0].size\n", "\n", "##Used to obtain number of features\n", "# file_path = \"/content/drive/MyDrive/AMR/Data/Scaffold_Presence/SRR8737536_presence.npy\"\n", "# sample = np.load(file_path,allow_pickle='TRUE')\n", "# sample.size\n", "\n", "feat_num = 28522323 #28.5 million\n", "\n", "X = np.empty(shape=(sample_num,feat_num),dtype=str)\n", "for i in range(1, sample_num+1):\n", " file_path = \"/content/drive/MyDrive/AMR/Data/Scaffold_Presence/\" + resist[i,0] + \"_presence.npy\"\n", " sample = np.load(file_path,allow_pickle='TRUE')\n", " X[i-1,:] = sample\n" ], "metadata": { "id": "sSpgYp0TPeZN" }, "execution_count": null, "outputs": [] }, { "cell_type": "code", "source": [ "X_train, X_test, y_train, y_test = train_test_split(\n", " X, resist[1:,1], test_size=0.2, random_state=0)\n", "kf = KFold(n_splits = 5)\n", "for train_index, test_index in kf.split(X):\n", " print(\"Train: \", train_index,\" Test: \", test_index)\n", "\n", "# >>> X_train.shape, y_train.shape\n", "# ((90, 4), (90,))\n", "# >>> X_test.shape, y_test.shape\n", "# ((60, 4), (60,))\n", "\n", "# >>> clf = svm.SVC(kernel='linear', C=1).fit(X_train, y_train)\n", "# >>> clf.score(X_test, y_test)" ], "metadata": { "id": "IEXYbpuZemQ2" }, "execution_count": null, 
"outputs": [] } ] }