{ "nbformat": 4, "nbformat_minor": 0, "metadata": { "colab": { "name": "Toy_dataset.ipynb", "provenance": [] }, "kernelspec": { "display_name": "Python 3", "name": "python3" }, "language_info": { "name": "python" } }, "cells": [ { "cell_type": "markdown", "source": [ "#Preliminary Code" ], "metadata": { "id": "EyFJUn2r9YYA" } }, { "cell_type": "markdown", "metadata": { "id": "RVbTCZgUZG3K" }, "source": [ "## Mount Google Drive Data" ] }, { "cell_type": "code", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "-cnkjmUFZBIQ", "outputId": "9132cc55-4b86-4df8-c616-552413383320" }, "source": [ "from google.colab import drive\n", "drive.mount('/content/drive')" ], "execution_count": null, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount(\"/content/drive\", force_remount=True).\n" ] } ] }, { "cell_type": "markdown", "metadata": { "id": "ccNtiNZM22CB" }, "source": [ "First, I am going to work with just trying to get a k-mer coutning program working. Thus, I will start small with one sequence of length 100. Then I will move on to multiple sequences of length 100 to replicate using contigs as opposed to an assembled genome. Eventually, once I get the k-mer counting working, I will move on to trying to utilize the k-mer counts in machine learning algorithms." ] }, { "cell_type": "markdown", "source": [ "## Import Packages" ], "metadata": { "id": "u1YMvNSO7fLk" } }, { "cell_type": "code", "source": [ "#Import packages\n", "from datetime import datetime, date\n", "import random\n", "from operator import itemgetter\n", "from sklearn.model_selection import train_test_split, KFold\n", "from sklearn import svm\n", "import sys\n", "import numpy as np" ], "metadata": { "id": "Pu8MbgNtVJW4" }, "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "HoZYY5NP3svM" }, "source": [ "## Create Random Sequence" ] }, { "cell_type": "code", "metadata": { "id": "GRenITuZ2dl_" }, "source": [ "# def generate_rand_sequence(length):\n", "# sequence = \"\" #Initialize sequence\n", "# for i in range(0,length):\n", "# check = random.randint(1,4)\n", "# if check == 1: sequence += \"A\"\n", "# elif check == 2: sequence += \"T\"\n", "# elif check == 3: sequence += \"C\"\n", "# else: sequence += \"G\"\n", "# return(sequence)" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "ARt7tuFd6VVg" }, "source": [ "Saving first sequence created in its own variable so it will not change with each run." ] }, { "cell_type": "code", "metadata": { "id": "noYFJPZD6N7F" }, "source": [ "# c_sequence = \"ACGTAGGCGACCGCCGGCAAAGGTACGGATTCTTTATGACAAGAAAGTATGACGAGTTGAGGACCTGATGGGAGCAAGAGGTCGTTCCTCTGGCGGTTCG\"" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "D2SePRgc6no_" }, "source": [ "# Naive Approach to k-mer Counting\n", "This will count k-mers in the simplest way possible. It will iterate through a list checking to see if a k-mer has been seen before. If it has, it increases the count for that k-mer, if it hasn't, it adds it to the list." 
] }, { "cell_type": "markdown", "source": [ "##Gets the kmer count in an array using the naive approach" ], "metadata": { "id": "LaD9P0FP9Dkr" } }, { "cell_type": "code", "metadata": { "id": "fDPgQlsO75Gz" }, "source": [ "#functions\n", "\n", "##returns the smaller alphabetically of the forward and reverse complement\n", "def get_alph_comp(forward):\n", " rev_comp = \"\"\n", " for n in range(len(forward),0,-1):\n", " if forward[n-1:n]==\"A\": rev_comp += \"T\"\n", " elif forward[n-1:n]==\"T\": rev_comp += \"A\"\n", " elif forward[n-1:n]==\"C\": rev_comp += \"G\"\n", " else: rev_comp += \"C\"\n", " for n in range(0,len(forward)):\n", " if forward[n:n+1] < rev_comp[n:n+1]:\n", " return forward\n", " elif forward[n:n+1] > rev_comp[n:n+1]:\n", " return rev_comp\n", " return rev_comp\n", "\n", "\n", "#Gets k-mer count given a k value using a naive vector indexed search approach\n", "def naive_count(k, seq, kmer_count):\n", " for i in range(0,len(seq)-(k-1)):\n", " check = seq[i:i+k]\n", " check = get_alph_comp(check)\n", " check_val = 0\n", " for j in range(0,len(kmer_count)):\n", " if check == kmer_count[j][0]:\n", " kmer_count[j][1] += 1\n", " check_val=1\n", " break\n", " if check_val==0:\n", " kmer_count.append([check,1])\n", "\n", " return(kmer_count)\n", "\n" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "source": [ "Testing to make sure things are working." ], "metadata": { "id": "iB0tHIE770la" } }, { "cell_type": "code", "metadata": { "id": "2DX7ZbGPANGs" }, "source": [ "# print(naive_count(5,c_sequence))\n", "# print(naive_count(10,c_sequence))\n", "# print(naive_count(1,c_sequence))" ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "sSyk8INTZQ8d" }, "source": [ "# test = generate_rand_sequence(1000)\n", "# naive_count(31,test)" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "3E_PVdrstn-C" }, "source": [ "# Implementing Hash Tables (Dictionaries)\n", "I have done some reading on hash tables and how they work. I am going to now attempt to implement a basic python dictionary to see how it performs with the same data." ] }, { "cell_type": "markdown", "source": [ "##Gets the dictionary of kmers from a sequence" ], "metadata": { "id": "xkvgGWfz8t1G" } }, { "cell_type": "code", "metadata": { "id": "nQx8jX6Duw26" }, "source": [ "#Gets k-mer count given a k value using a dictionary\n", "def dictionary_count(k, seq, kmer_dict):\n", " for i in range(0,len(seq)-(k-1)):\n", " check = seq[i:i+k]\n", " check = get_alph_comp(check)\n", " if check in kmer_dict: kmer_dict[check] += 1\n", " else: kmer_dict[check] = 1\n", " return(kmer_dict)\n" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "source": [ "Comparing how the naive approach and the dictionary approach perform at various differen sequence lengths." 
], "metadata": { "id": "KCEREeJG8D7w" } }, { "cell_type": "code", "metadata": { "id": "bBBvgCQUxJ0q" }, "source": [ "# t_1k = generate_rand_sequence(1000)\n", "# t_10k = generate_rand_sequence(10000)\n", "# t_100k = generate_rand_sequence(100000)\n", "# t_1m = generate_rand_sequence(1000000)\n", "\n", "# n_t_1k = []\n", "# d_t_1k = {}\n", "# n_t_10k = []\n", "# d_t_10k = {}\n", "# n_t_100k = []\n", "# d_t_100k = {}\n", "# d_t_1m = {}\n", "\n", "# print(\"1,000\")\n", "# start = datetime.now().time()\n", "# n_t_1k = naive_count(31, t_1k, n_t_1k)\n", "# end = datetime.now().time()\n", "# print(\"Compute time for Naive: \", datetime.combine(date.min, end)-datetime.combine(date.min, start))\n", "# start_time = datetime.now().time()\n", "# d_t_1k = dictionary_count(31, t_1k, d_t_1k)\n", "# end = datetime.now().time()\n", "# print(\"Compute time for Dictionary: \", datetime.combine(date.min, end)-datetime.combine(date.min, start))\n", "\n", "# print(\"10,000\")\n", "# start = datetime.now().time()\n", "# n_t_10k = naive_count(31, t_10k, n_t_10k)\n", "# end = datetime.now().time()\n", "# print(\"Compute time for Naive: \", datetime.combine(date.min, end)-datetime.combine(date.min, start))\n", "# start = datetime.now().time()\n", "# d_t_10k = dictionary_count(31, t_10k, d_t_10k)\n", "# end = datetime.now().time()\n", "# print(\"Compute time for Dictionary: \", datetime.combine(date.min, end)-datetime.combine(date.min, start))\n", "\n", "\n", "# print(\"100,000\")\n", "# start = datetime.now().time()\n", "# n_t_100k = naive_count(31, t_100k, n_t_100k)\n", "# end = datetime.now().time()\n", "# print(\"Compute time for Naive: \", datetime.combine(date.min, end)-datetime.combine(date.min, start))\n", "# start = datetime.now().time()\n", "# d_t_100k = dictionary_count(31, t_100k, d_t_100k)\n", "# end = datetime.now().time()\n", "# print(\"Compute time for Dictionary: \", datetime.combine(date.min, end)-datetime.combine(date.min, start))\n", "\n", "# print(\"1,000,000\")\n", "# #start = datetime.now().time()\n", "# #n_t_1m = naive_count(31,t_1m)\n", "# #end = datetime.now().time()\n", "# #print(\"Compute time for Naive: \", datetime.combine(date.min, end)-datetime.combine(date.min, start))\n", "# start = datetime.now().time()\n", "# d_t_1m = dictionary_count(31, t_1m, d_t_1m)\n", "# end = datetime.now().time()\n", "# print(\"Compute time for Dictionary: \", datetime.combine(date.min, end)-datetime.combine(date.min, start))\n" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "hLaB8v5vwy0y" }, "source": [ "This next section of code was used to verfiy that both functions were running as intended and providing the same output." ] }, { "cell_type": "code", "metadata": { "id": "6-MQpOMjp3v3" }, "source": [ "# sublist = n_t_1k[85:95]\n", "# print(sublist)" ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "XbHLRisaqOLM" }, "source": [ "# keylist = []\n", "# for item in sublist:\n", "# keylist.append(item[0])\n", "# print(keylist)\n", "# type(keylist[0])\n", "# for x in keylist:\n", "# print(d_t_1k.get(x))" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "yeCHkUJwyr_2" }, "source": [ "# Running Code on Real Data\n", "This next section will focus on trying to get the code to work on a real FASTA file that comes from the dataset I am trying to work with. 
Some small adjustments to the code need to be made so that I can read in sequences from the file one line at a time, so as not to overload the memory." ] }, { "cell_type": "markdown", "source": [ "##Gets the dictionary of kmers from a file" ], "metadata": { "id": "874y_Mg-8ltg" } }, { "cell_type": "code", "metadata": { "id": "RBkZP7jwza-3" }, "source": [ "def kmer_count(k, file):\n", "    kmer_dict = {}\n", "    with open(file) as seq_file:\n", "        remainder = \"\" #Holds the last k-1 bases of the previous line so k-mers spanning line breaks are counted\n", "        #Loop through lines of file\n", "        for line in seq_file:\n", "            #A '>' marks a FASTA header: a new sequence starts, unrelated to the previous one\n", "            if line[0] == '>':\n", "                remainder = \"\"\n", "            #Otherwise the line contains sequence data\n", "            else:\n", "                line = remainder + line.strip() #Strip the newline before counting and prepend the carry-over\n", "                kmer_dict = dictionary_count(k,line,kmer_dict) #Update the dictionary\n", "                remainder = line[-(k-1):] #Carry the last k-1 characters into the next line\n", "    return kmer_dict" ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "TckE61Ex1seZ" }, "source": [ "\n", "# #Skip #40 and #92 as those files are not included for some reason\n", "\n", "# for i in range(36,96):\n", "#     if i == 40 or i == 92:\n", "#         continue\n", "#     filename = \"SRR87375\" + str(i) + \"_scaffolds.fasta\"\n", "#     filepath = '/content/drive/MyDrive/AMR/Data/Scaffold_fasta/' + filename\n", "#     savefile = '/content/drive/MyDrive/AMR/Data/Scaffold_kmers/SRR87375' + str(i) + '_kmers.npy'\n", "#     start = datetime.now().time()\n", "#     kmer_dict = kmer_count(31,filepath)\n", "#     end = datetime.now().time()\n", "#     np.save(savefile,kmer_dict)\n", "#     print(\"Compute time for \",filename ,\": \", datetime.combine(date.min, end)-datetime.combine(date.min, start))\n" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "source": [ "Now that we can count k-mers on an entire set of files, we can begin testing machine learning algorithms on these datasets. One test I would like to run is seeing whether using the count data, as opposed to binary presence/absence, would increase the accuracy of AMR prediction when using the Set Covering Machine (SCM). Additionally, Dr. Drouin talks about using unitigs as opposed to k-mers due to the redundancy of k-mers amongst individual genome sequences.\n", "\n", "Dr. Bailey was wondering whether or not a different k-mer length would matter when dealing with count data as opposed to binary presence/absence data for each k-mer.\n", "\n", "Also, I want to compare a classification tree to the SCM to see the advantages and disadvantages of each. The reason for this is that both models are often seen as interpretable.\n", "\n", "# Converting dictionary files to presence/absence vectors\n", "\n", "I need to create another version of dictionary_count that takes in these numpy files and creates one large dictionary of all the k-mers present in the entire dataset."
], "metadata": { "id": "8RntNlE3XqJI" } }, { "cell_type": "markdown", "source": [ "##Compiles all .npy dictionaries into a single .npy dictionary for the entire dataset" ], "metadata": { "id": "7U8rWiyC-TFy" } }, { "cell_type": "code", "source": [ "# #Gets k-mer count given a k value using a dictionary\n", "# def pres_abs_cumm(pres_abs_dict, kmer_dict):\n", "# for key in kmer_dict:\n", "# if key not in pres_abs_dict: pres_abs_dict[key] = 1\n", "# return(pres_abs_dict)\n", "\n", "# cumm_pres_abs_dict = {}\n", "# for i in range(36,96):\n", "# if i == 40 or i == 92:\n", "# continue\n", "# filepath = '/content/drive/MyDrive/AMR/Data/Scaffold_kmers/SRR87375' + str(i) + '_kmers.npy'\n", "# individual_file_dict = np.load(filepath,allow_pickle='TRUE').item()\n", "# cumm_pres_abs_dict = pres_abs_cumm(cumm_pres_abs_dict, individual_file_dict)\n", "# savefile = '/content/drive/MyDrive/AMR/Data/global_vals/pres_abs__scaff_dict.npy'\n", "# np.save(savefile,cumm_pres_abs_dict)" ], "metadata": { "id": "Lcqip0HN7aS5" }, "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "source": [ "Now I need to convert the pres/abs dictionary into a vector that can has an index notation. I should be able to take advantage of the fact that I have a dictionary for each individual sample. I can use this to look up the value in each sample quickly and then assign it to the vector. I plan on making both a pres/abs vector, as well as a count vector in the hopes that I will be able to mess around with the count vector a little more in the future." ], "metadata": { "id": "5E8m_dxudtIi" } }, { "cell_type": "markdown", "source": [ "##Creates a vector of the kmers from the entire dataset dictionary" ], "metadata": { "id": "K9DLXZLI_ILD" } }, { "cell_type": "code", "source": [ "# filepath = '/content/drive/MyDrive/AMR/Data/global_vals/pres_abs__scaff_dict.npy'\n", "# temp_dict = np.load(filepath,allow_pickle='TRUE').item()\n", "# size = len(temp_dict)\n", "# key_vector = np.empty(size,dtype=object)\n", "# count=0\n", "# for item in temp_dict:\n", "# key_vector[count] = item\n", "# count+= 1\n", "# savepath = '/content/drive/MyDrive/AMR/Data/global_vals/key_vector.npy'\n", "# np.save(savepath, key_vector)" ], "metadata": { "id": "Z805O2b9LHhB" }, "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "source": [ "##Creates the presence/absence vector for each individual dictionary using that dictionary and the key vector" ], "metadata": { "id": "jbQa2NXE_nRg" } }, { "cell_type": "code", "source": [ "\n", "# for i in range(36,96):\n", "# zero_vector = np.zeros(size,dtype=int)\n", "# if i == 40 or i == 92:\n", "# continue\n", "# filename = '/content/drive/MyDrive/AMR/Data/Scaffold_kmers/SRR87375' + str(i) + '_kmers.npy'\n", "# file_dict = np.load(filename,allow_pickle='TRUE').item()\n", "# count = 0\n", "# for j in key_vector:\n", "# if j in file_dict: zero_vector[count]=1 \n", "# count += 1\n", "# savename = '/content/drive/MyDrive/AMR/Data/Scaffold_Presence/SRR87375' + str(i) + '_presence.npy'\n", "# np.save(savename,zero_vector)" ], "metadata": { "id": "0X2XFE69M06o" }, "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "source": [ "##Testing to make sure the presence/absence vectors appear correct" ], "metadata": { "id": "z0PDbtQjAsPj" } }, { "cell_type": "code", "source": [ "# filename2 = '/content/drive/MyDrive/AMR/Data/global_vals/key_vector.npy'\n", "# key_vector = np.load(filename2,allow_pickle='TRUE')" ], "metadata": { "id": "tvtY8LbmjrZq" }, "execution_count": null, 
"outputs": [] }, { "cell_type": "markdown", "source": [ "##Testing cont." ], "metadata": { "id": "LO1RsLv7A0ks" } }, { "cell_type": "code", "source": [ "# filename = '/content/drive/MyDrive/AMR/Data/Scaffold_Presence/SRR8737550_presence.npy'\n", "# file_vector = np.load(filename,allow_pickle='TRUE')\n", "\n", "# print(file_vector[100:1000])" ], "metadata": { "id": "i8v6VOKHiIx1" }, "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "source": [ "#Machine Learning Algorithms" ], "metadata": { "id": "LlrfT-7Aa-Ga" } }, { "cell_type": "markdown", "source": [ "First, need to create a label vector for each resistance type that tells whether the individual sample is resistant or not" ], "metadata": { "id": "Ur1S3jsBrcW_" } }, { "cell_type": "markdown", "source": [ "##Function: Creates label vector" ], "metadata": { "id": "TR0ZrXOTMk6N" } }, { "cell_type": "code", "source": [ "def get_label_vector(resistance, matrix):\n", " column = 0\n", " for i in matrix[0,:]:\n", " if resistance == i:\n", " break\n", " column += 1\n", " if column == matrix[0,:].size:\n", " sys.exit(\"Resistance type not applicable\")\n", " submatrix = matrix[:,[0,column]]\n", " delete_rows = []\n", " for i in range(0,submatrix[:,1].size):\n", " if submatrix[i,1] =='-':\n", " delete_rows.append(i)\n", " submatrix = np.delete(submatrix,delete_rows,0)\n", " return submatrix" ], "metadata": { "id": "eKJqoCWma96U" }, "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "source": [ "##Creating the label vectors\n" ], "metadata": { "id": "GT-ud_bxMt-O" } }, { "cell_type": "code", "source": [ "labels_csv = \"/content/drive/MyDrive/AMR/Data/global_vals/Pseudomonas_Resistance_Labels.csv\"\n", "labels_matrix = np.genfromtxt(labels_csv,delimiter=',',dtype=str)\n", "Tobra = get_label_vector('Tobramycin',labels_matrix)\n", "Cefta = get_label_vector('Ceftazidim',labels_matrix)\n", "Cipro = get_label_vector('Ciprofloxacin',labels_matrix)\n", "Mero = get_label_vector('Meropenem',labels_matrix)\n", "Coli = get_label_vector('Colistin',labels_matrix)" ], "metadata": { "id": "5T_aIG5UD9S2" }, "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "source": [ "##SVM" ], "metadata": { "id": "LCHkXziaPcne" } }, { "cell_type": "code", "source": [ "resist = Tobra\n", "sample_num = resist[1:,0].size\n", "\n", "##Used to obtain number of features\n", "# file_path = \"/content/drive/MyDrive/AMR/Data/Scaffold_Presence/SRR8737536_presence.npy\"\n", "# sample = np.load(file_path,allow_pickle='TRUE')\n", "# sample.size\n", "\n", "feat_num = 28522323 #28.5 million\n", "\n", "X = np.empty(shape=(sample_num,feat_num),dtype=str)\n", "for i in range(1, sample_num+1):\n", " file_path = \"/content/drive/MyDrive/AMR/Data/Scaffold_Presence/\" + resist[i,0] + \"_presence.npy\"\n", " sample = np.load(file_path,allow_pickle='TRUE')\n", " X[i-1,:] = sample\n" ], "metadata": { "id": "sSpgYp0TPeZN" }, "execution_count": null, "outputs": [] }, { "cell_type": "code", "source": [ "X_train, X_test, y_train, y_test = train_test_split(\n", " X, resist[1:,1], test_size=0.2, random_state=0)\n", "kf = KFold(n_splits = 5)\n", "for train_index, test_index in kf.split(X):\n", " print(\"Train: \", train_index,\" Test: \", test_index)\n", "\n", "# >>> X_train.shape, y_train.shape\n", "# ((90, 4), (90,))\n", "# >>> X_test.shape, y_test.shape\n", "# ((60, 4), (60,))\n", "\n", "# >>> clf = svm.SVC(kernel='linear', C=1).fit(X_train, y_train)\n", "# >>> clf.score(X_test, y_test)" ], "metadata": { "id": "IEXYbpuZemQ2" }, "execution_count": null, 
"outputs": [] } ] }