Saturday, December 22, 2018

Building a chatbot using TF-IDF


We want to build a basic chatbot which trains on previous messages and responses. In this tutorial we look at the math used to convert the messages and their associated responses into weights using term frequency and inverse document frequency (tf-idf).

Once we have the appropriate weights for the words present in the messages and responses, we write each message and response as a vector of those weights. We then measure how similar these vectors are using cosine similarity.

We multiply the term frequency by the inverse document frequency to obtain the final weight of each word, which is then used to construct the vector.
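As a rough illustration of that multiplication, here is a minimal pure-Python sketch. It assumes one common textbook variant (tf = word count divided by document length, idf = log(N / document frequency)) and whitespace tokenisation; the function name `tf_idf_vectors` is hypothetical and not taken from the original code. scikit-learn uses a smoothed idf by default, so its numbers will differ.

```python
import math
from collections import Counter


def tf_idf_vectors(documents):
    """Return (vocabulary, one tf-idf weight vector per document).

    Assumed variant: tf = count / document length, idf = log(N / df).
    """
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each word appear?
    doc_freq = Counter(word for tokens in tokenized for word in set(tokens))

    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vector = []
        for word in vocabulary:
            tf = counts[word] / len(tokens) if tokens else 0.0
            idf = math.log(n_docs / doc_freq[word])
            vector.append(tf * idf)  # final weight = tf * idf
        vectors.append(vector)
    return vocabulary, vectors


vocabulary, vectors = tf_idf_vectors(["hello there", "hello world"])
```

Note that a word appearing in every document ("hello" here) gets an idf of log(1) = 0, so it contributes nothing to the vector. That is the intended effect: words common to all messages do not help tell them apart.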

Cosine Similarity:
This is a measure of orientation and not magnitude. We do not consider the magnitude of the vectors because it grows with the length of the query or response, and that tells us nothing about how similar the query is to the messages in our training data.

The angle tells us the direction in which a vector points. If the query contains only 5 words and the message contains 500, but their words carry similar relative weights, the two vectors point in the same direction and are considered similar.

We choose cos(theta) because it is a monotonically decreasing function on [0, pi/2]. tf-idf weights are never negative, so the angle between two vectors lies in that range, and a smaller angle always gives a larger cosine (1 when the vectors point the same way, 0 when they share no words). We use the dot product to calculate cos(theta), i.e. cos(theta) = (A . B) / (|A| |B|), as shown in the figure.
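To make the "orientation, not magnitude" point concrete, here is a small sketch of cosine similarity computed from the dot product. The function name and the example vectors are illustrative only, and returning 0.0 for an all-zero vector is an assumed convention.

```python
import math


def cosine_similarity(a, b):
    """cos(theta) = (a . b) / (|a| * |b|) for two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumed convention: a zero vector is similar to nothing
    return dot / (norm_a * norm_b)


# A short query and a much longer message with the same relative weights.
short_query = [1.0, 2.0, 0.0]
long_message = [100.0, 200.0, 0.0]
same_direction = cosine_similarity(short_query, long_message)

# Two vectors with no words in common are orthogonal.
no_overlap = cosine_similarity([1.0, 0.0], [0.0, 1.0])
```

The long message has 100 times the magnitude of the query, yet the similarity is still 1 (up to floating-point rounding), because only the direction matters.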


In this tutorial we give a walkthrough of the code. The libraries used are scikit-learn and NumPy.

The full code is available on GitHub.
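The following is not the original code, only a minimal sketch of how such a retrieval-style chatbot might be wired together with scikit-learn's `TfidfVectorizer` and `cosine_similarity`. The sample messages, responses and the helper names `best_match` and `reply` are made up for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical training data: each past message has an associated response.
messages = [
    "hello how are you",
    "what is the weather today",
    "tell me a joke",
]
responses = [
    "I am fine, thanks!",
    "It looks sunny outside.",
    "Why did the chicken cross the road?",
]

# Learn the vocabulary and idf weights from the past messages.
vectorizer = TfidfVectorizer()
message_matrix = vectorizer.fit_transform(messages)


def best_match(query):
    """Index of the stored message most similar to the query."""
    query_vector = vectorizer.transform([query])
    scores = cosine_similarity(query_vector, message_matrix)[0]
    return int(scores.argmax())  # NumPy argmax over the similarity scores


def reply(query):
    """Answer with the response attached to the closest past message."""
    return responses[best_match(query)]
```

Calling `reply("how is the weather")` picks the response attached to the weather message, since that message shares the most weighted words with the query.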

Friday, December 21, 2018

(College Education - 1) Data Analytics for Teachers/Students.

Disclaimer: the views presented in this article are personal. 

Initially, I was going to write a rant on how teachers are shit in colleges and continue the age-old blame game. In this game, the teachers think that students are stupid or uninterested and students think that teachers don't know how to teach. It is true (to a certain extent, of course), but the problem is that no one ever addresses it. No one thinks of any innovative methods that can be adopted to address what's wrong. Most people (including me) are absorbed in their own self-interests (both teachers and students), and to some extent rightfully so.

Before I propose the solution, I would like you to go through my line of thought.

Teachers of today feel inclined to play entertainers rather than knowledge imparters. In this information age which we've become privy to, fuelling curiosity is far more important than imparting knowledge. Students need to be introduced to concepts in a way which makes the learning process heuristic, enabling them to relate these concepts to life and the applications around them, and to have a positive impact using that knowledge.

A lot of students tend to blame the syllabus, but I disagree. I think that the syllabus is well defined and in accordance with a given branch of study. The reason most students feel disengaged from the syllabus is that they are unaware of the possibilities that it holds. As students move away from immersive learning and focus only on the parts that are necessary to get them better grades, the whole idea of a model student and a model teacher changes drastically. A model teacher becomes one who makes sure that knowledge (or the method involved in its dissemination) is transmitted to students in a way that helps them remember it for a duration often limited to the exam period. If a teacher can assign tasks to students that lead to good marks, then they are a model teacher (and hence diligent in their duties), and a model student becomes one who duly completes the tasks assigned to them: the students who are regular, sincere and complete everything on time.

Let us consider the problems that arise because of this.
  • Less than 10% of students/teachers fall into the model student - model teacher zone. 
  • Little accountability and few deliverables on the teacher's part.
  • Independent line of thought by the student is not given proper importance. 
  • Fuelling and engaging with the community (online forums) is more important than completing the assigned tasks. 
  • Holistic development is not treated with the same level of sincerity as imparting knowledge.

Solution: Proper Data Analytics for students and teachers. 

1.) Actionable insights for teachers and students.

Teachers often do not have the time for every student, and students struggle needlessly on things that can be quickly understood. By enabling collection of proper data (for both students and teachers), the following actionable insights can be generated.

2.) Regular after-class tests instead of end-semester / mid-semester examination patterns.

In order to create real-time data for analysis and actionable insights for teachers and students, it is important to create data points on a short-term basis. This would also allow machine learning techniques such as reinforcement learning to come into play and interact with students, thus reducing the workload for teachers.
Not just that, more data points would result in more accountability on the teacher's part.

3.) Venn-like diagram for multi-discipline projects and grading on the basis of those projects.

I personally think this would be super cool if implemented. The idea is to use the graphical representation shown below to grade projects.

Here is how it could work.

  • The radius of a circle would be determined by the number of topics covered by the project.
  • The colour of the circle would be determined by how deeply the person understands the topic. A darker shade would imply better understanding.
  • Community comments (feedback) from people who have expertise in that area would also be listed for every project.
  • A deep learning model would measure the employability of these projects, using the above data as input parameters.
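The first two points above can be sketched as a simple mapping from a project to a circle. Everything here is a hypothetical illustration: the names `ProjectCircle` and `to_circle`, the 0 to 10 depth scale and the linear scaling are assumptions, not part of the proposal itself.

```python
from dataclasses import dataclass


@dataclass
class ProjectCircle:
    name: str
    radius: float  # grows with the number of topics covered
    shade: float   # 0.0 = lightest, 1.0 = darkest (deeper understanding)


def to_circle(name, topics_covered, depth_score, max_depth=10, unit_radius=1.0):
    """Map a project to a circle; the scales used are assumptions."""
    if topics_covered < 0:
        raise ValueError("topics_covered must be non-negative")
    if not 0 <= depth_score <= max_depth:
        raise ValueError("depth_score must lie between 0 and max_depth")
    return ProjectCircle(
        name=name,
        radius=unit_radius * topics_covered,
        shade=depth_score / max_depth,
    )


circle = to_circle("robot arm", topics_covered=3, depth_score=5)
```

A plotting library could then draw each `ProjectCircle`, with overlapping circles showing where disciplines meet.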

4.) Rewarding in-depth knowledge and understanding in a unique way.

Instead of assignment submissions (which have been reduced to handwriting practice for the majority of students), the assignments should include engaging with the online community (such as StackExchange/Medium) on different topics of interest. The idea is to reinforce students' interests instead of adding work pressure. By communicating with a community, the students would feel more appreciated for their work than they do now.

5.) Incorporating extra-curricular activities (sports) as an important part of the system.

There is nothing more important than sports. Consistent participation in a sport should carry some weightage in all educational institutes of every field, as it teaches teamwork, risk-taking and communication.

Tuesday, December 4, 2018

Installing Anaconda, Running Jupyter on Google Cloud Remotely

I was just using Google Colab when I realised it cannot really replace a remote server with a GPU. It is super awesome if you are trying to collaborate on a notebook with multiple authors, but it does not really give you the flexibility of a terminal. There is only a certain extent to which "!" can go. Had Google Colab provided a virtual instance, it would have been super.
It is already amazing that they are providing GPUs and TPUs completely free of cost. It is too much to ask for free shell access too, and it would be hard for them to clamp down on activities such as mining or torrenting if they did; people would be making money on hardware meant for educational purposes.

This post is about how to set up NGINX along with Jupyter Notebook.

1.) Let us first install Anaconda by downloading it from here,

Once you have installed Anaconda on your virtual machine, it is time to install nginx and make sure it is running.

2.) Start Jupyter Notebook using the following command and copy the link.

3.) Go to terminal and type the following command.