The system that saw tomorrow — how technology can make an effective change in school education

Digital Aristotle Pvt. Ltd.
12 min read · Oct 7, 2019

There is an old saying that sunlight is the best disinfectant, and one of the biggest issues in any education system is that far too many children fall through the cracks. This can be attributed to factors such as a lack of individual attention and classes so large that teachers find it difficult to notice when a child is slipping. With just a bit of personalised attention, every student could reach their potential.

To achieve this, a great place to start would be to understand what level a student is at and which concepts he or she is struggling with. School, being the motherlode of children’s learning and teaching, is the only place where such rich data is available. Our system is able to provide each student with individualised attention by shining a light on exactly what happens within the school. Since other systems don’t have access to this information, they are forced to start from scratch, which in edtech jargon is known as the “cold-start problem”.

Another aspect to investigate is whether an individual student had an issue with a topic or whether the whole class had difficulty understanding it. This is the “information opacity problem”, i.e. an outside system cannot peer into what happens in a school. To overcome this, most edtech systems need students to spend long hours on the platform before they can decipher whether a lacuna is idiosyncratic or systematic. While that time spent could easily be presented as a metric of the platform’s success, it in fact masks a fatal shortcoming of the system.

Another vital issue that we have identified within current systems is what one could call the “feedback problem”. A senior employee at one of the largest international publishing houses told us about an experience of his own. His daughter was able to score high marks on an online learning platform but was unable to match that performance in a school examination! He explained that since most online systems use an MCQ (Multiple Choice Questions) format, she had figured out how to game the system, so when it came to a summative assessment she could not live up to the expected scores. This leads us to surmise that most edtech products available in the market today are variations of tips and tricks for cracking standardised exams rather than vehicles for true learning.

To recap, the problems that edtech systems face are:

  1. Cold start problem — not knowing the gaps in learning ahead of time and spending valuable student time to figure it out.
  2. Information opacity problem — not being able to decipher, ahead of time, whether a problem reflects a class-wide lacuna or is student-specific.
  3. Feedback problem — not being able to close the learning loop with what happens in school assessments, and thereby validate the efficacy of the learning system.
  4. Focus on tips and tricks — an unhealthy focus on cracking competitive exams as opposed to fostering learning, so that only self-motivated high scorers benefit from the system.
  5. False metric — measuring time spent on the system rather than the efficacy of the platform, which drives ever greater spending on videos and other teaching methods whose efficacy the jury is still out on.

The Solution that saw tomorrow

We set about designing our system like no other edtech company has: we went to the epicenter of student data, the school. We built a system robust enough to be used by schools in tier-3 cities. It is also easy to use and, unlike most current platforms, works for all types of questions (MCQ, true/false, match the columns, summative), which allows students to continue answering questions with pen and paper.

We currently have about 6 million data points of children answering questions, and we track their progress across grades. The dense nature of this data allows for fine-grained insights and differentiation. Think about the amount of useful information gathered when 100 kids answer the same question, each in their own unique way. We are proud that this system is now in the process of getting a global patent.

This lodestone of data, which is growing daily, has given us insights like no other and has allowed us to model student behaviour in ways that have never been seen before. A strange corollary is that, because we train our models on such a large dataset, we can make very effective predictions with very little data when a new student enters our system. Given that most students take school assessments seriously, and that most assessments are interleaved [4], we can get a very high-quality representation of student learning. As a consequence, we are also able to plot the learning journey of children through the grades, which gives us the ability to backtrack and pinpoint the root causes of a child’s performance. We also factored in systemic issues like grade inflation caused by making questions and assessments easy, which current systems do because of their inability to model the difficulty level of questions. Our graph-based system allows us to recommend learning paths that take an incremental approach, so students aren’t shocked into bouncing out of the system.

Key Limitations we wanted to avoid while designing a revision system

The key limitations of most models currently used by edtech companies, IRT (Item Response Theory) and BKT (Bayesian Knowledge Tracing), are as follows.

  • Recency Effects — concepts mastered recently have a greater influence on a student’s ability to perform on a given concept than ones mastered long ago. This effect essentially accounts for the human tendency to forget, whereas BKT models assume that once something is learned, it stays learned.
  • Contextualized Trial Sequence — psychological literature has shown that interleaving questions on different concepts has a strong positive influence on learning and knowledge states. To model learning well, it is important to use the entire sequence of exercises a student receives, in that same order, since this can capture the effect of exercise or concept ordering (looking at how you performed on biology, physics and chemistry together, not just individually).
  • Inter-Skill Similarity — each question answered by a student has one or more associated concepts attached to it. Here there are two issues. If two skills are correlated and related in the graph, one would expect the two skills to be learned together. Further, if two skills are connected in the subject matter experts’ graph in a Parent -> Child relationship, one should reasonably expect that someone with mastery of the Child skill has even greater mastery of the Parent skill. These sorts of complex relationships are hard to model in BKT.
  • Individual Variation in Ability — here we see two effects. The first is that if a student has been an average performer in the past, they are likely to continue being an average performer, whether in a particular subject or overall. The second is that if a student has been unable to answer higher Bloom-level questions in the past, they will likely continue to be unable to do so. Again, since BKT models each skill separately, it does not have the contextualized information needed to estimate average ability or overall ability depth (Bloom’s Taxonomy being a measure of knowledge depth).

The model design — where the rubber hits the road

[Model architecture diagram. Image sources: Penghe Chen, Yu Lu, Vincent W. Zheng, Yang Pian, “Prerequisite-Driven Deep Knowledge Tracing”, 2018 IEEE International Conference on Data Mining, p. 41; http://snap.stanford.edu/graphsage/]

The inputs:

Our model is based on the seminal 2015 research paper on Deep Knowledge Tracing by researchers from Stanford University, Khan Academy and Google [1]. We have extended it to be more relevant to our problem, given the richer dataset that we possess.

In the first instance, we used BERT embeddings from a model fine-tuned on data from nearly 500 school textbooks, 100,000 concept-relevant websites, and about 500,000 question-answer pairs. We used BERT embeddings instead of one-hot vectors of either concepts or questions, since the one-hot vectors would have become too large given that we have close to 500,000 unique questions. We decided to use BERT embeddings for two reasons:

a) They allowed us to compress a k-sparse signal into a d-dimensional dense representation, using the idea of compressed sensing.

b) The BERT encoding of questions was not random and contained fine-grained information about the nature of the question, i.e. whether it was an understand- or remember-level Bloom’s question, etc. This allowed us to address individual variation in ability based on a student’s depth of understanding.
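A minimal sketch of what extracting such an embedding might look like with the Hugging Face transformers library is below; the checkpoint path, tokenizer settings and mean-pooling strategy are illustrative assumptions rather than our exact pipeline.

```python
# Sketch: turning a question's text into a dense, fixed-size embedding with a
# fine-tuned BERT model via the Hugging Face transformers library.
# The checkpoint path and the mean-pooling choice are illustrative assumptions.
import torch
from transformers import AutoTokenizer, AutoModel

CHECKPOINT = "path/to/fine-tuned-bert"  # hypothetical fine-tuned checkpoint
tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
model = AutoModel.from_pretrained(CHECKPOINT)
model.eval()

def embed_question(text: str) -> torch.Tensor:
    """Return one dense vector for a question's text."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state   # (1, seq_len, hidden_dim)
    return hidden.mean(dim=1).squeeze(0)             # mean-pool to (hidden_dim,)

q_vec = embed_question("Why does water boil at a lower temperature at high altitude?")
```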

Thus the input to the model was defined as follows:

A sequence of student interactions, which could come either from offline pen-and-paper exams or from the online revision and remediation system, where X is the question that student i answered at time t.

Each interaction X’ is represented by the question’s d-dimensional embedding concatenated with a zero vector of the same dimension, the embedding being placed before or after the zeros depending on whether the student got the question correct or incorrect. Additionally, we also incorporate the difficulty of the question, obtained empirically or, if the question is appearing for the first time, assigned to it by an SME (subject matter expert).
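As a rough illustration, this is how one such interaction vector could be assembled; the exact layout of the correct/incorrect halves and the way difficulty is appended are assumptions made for the sketch.

```python
import torch

def interaction_vector(q_emb: torch.Tensor, correct: bool, difficulty: float) -> torch.Tensor:
    """Build one interaction input X'.

    The question embedding is concatenated with a zero vector of the same
    dimension; which half carries the embedding encodes correct vs. incorrect.
    The (empirical or SME-assigned) difficulty is appended as one extra feature.
    The exact ordering and scaling here are illustrative assumptions.
    """
    zeros = torch.zeros_like(q_emb)
    pair = torch.cat([q_emb, zeros]) if correct else torch.cat([zeros, q_emb])
    return torch.cat([pair, torch.tensor([difficulty])])

# A student's sequence is simply the interaction vectors stacked in time order:
# X = torch.stack([interaction_vector(e, c, d) for e, c, d in interactions])
```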

This makes it clear that the input to the model is very rich and specific in nature, which allows a differentiation to be made between remember-level and analyse-level questions. This is important because an analyse-level question might be more difficult to answer than a remember-level question, since it may require an understanding of an arcane concept or be associated with a misconception that trips children up.

We also have a very rich and well-documented knowledge graph that has been designed by SMEs and is based on research in the field. All questions in our system are mapped to one or more concepts based on relevance. We pass both the concept(s) that a question is mapped to and the predecessors of those concepts to the model, as sketched below. This helps address the Inter-Skill Similarity issue.
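A small sketch of how the mapped concepts could be expanded with their prerequisites, using a hypothetical networkx graph and made-up concept names:

```python
import networkx as nx

# Hypothetical slice of the SME-designed prerequisite graph
# (edges point from prerequisite concept to dependent concept).
G = nx.DiGraph()
G.add_edges_from([
    ("states_of_matter", "evaporation"),
    ("evaporation", "water_cycle"),
    ("condensation", "water_cycle"),
])

def concepts_for_question(mapped_concepts):
    """Return the mapped concept(s) plus their immediate prerequisites."""
    expanded = set(mapped_concepts)
    for concept in mapped_concepts:
        expanded.update(G.predecessors(concept))
    return expanded

print(concepts_for_question({"water_cycle"}))
# -> {'water_cycle', 'evaporation', 'condensation'}
```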

The Model:

The Encoding Layer:

Knowledge is incremental in nature, so the knowledge state of a child at the current step should be predicted from what the knowledge state was at the previous step; at every step, the knowledge state builds on itself incrementally. As in the Piech et al. (2015) paper, we use an RNN, an LSTM in our case, which addresses the Recency Effects and the Contextualized Trial Sequence. We extended their model to incorporate additional information, such as the rich question embeddings and their difficulty levels, but most importantly we also used the inferred concept embeddings from the model. The output of this layer is a knowledge state, which allows us to do training and inference: given a knowledge state O and a concept C, the interaction between O and C gives us the probability of having ‘mastered’ that particular concept.
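A compact PyTorch sketch of such an encoding layer is below; the layer sizes, the linear projection and the dot-product read-out of mastery are assumptions, since those details are not fixed above.

```python
import torch
import torch.nn as nn

class KnowledgeEncoder(nn.Module):
    """Sketch of the encoding layer: an LSTM over interaction vectors that
    emits a knowledge state at every step. Mastery of a concept is read out
    by interacting the state with that concept's embedding; a simple dot
    product is used here for illustration."""

    def __init__(self, input_dim: int, state_dim: int):
        super().__init__()
        self.lstm = nn.LSTM(input_dim, state_dim, batch_first=True)
        self.proj = nn.Linear(state_dim, state_dim)

    def forward(self, interactions: torch.Tensor) -> torch.Tensor:
        # interactions: (batch, seq_len, input_dim)
        states, _ = self.lstm(interactions)          # (batch, seq_len, state_dim)
        return self.proj(states)                     # knowledge state at each step

    def mastery(self, state: torch.Tensor, concept_emb: torch.Tensor) -> torch.Tensor:
        # Probability of having mastered the concept, given the knowledge state.
        return torch.sigmoid((state * concept_emb).sum(dim=-1))
```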

The concept embedding layer:

The representation of the concept embedding was initially a rudimentary ‘Embedding Matrix’; however, we felt a lot of information was lost, leading to weaker Inter-Skill Similarity modelling. We also wanted to incorporate SME expertise into the model without necessarily allowing it to be the dominant feature, instead letting the model infer these dependencies in an unsupervised way. There has been a lot of progress on Graph Convolutional Networks (GCNs). Since we already had the concept dependency graph (i.e. which concept depends on which other concepts), we used that information to generate each concept’s embedding not only from itself but also from its neighbours. This helps when a concept does not appear often on its own but benefits from the fact that its neighbours have their own data samples; the concepts collectively improve their own and their neighbours’ representations. This technique of utilising the underlying graph structure is termed Graph Convolution in the literature.
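The following PyTorch sketch shows one plausible graph-convolution step of this kind; the mean aggregation and the activation are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class ConceptGraphConv(nn.Module):
    """One graph-convolution step over the concept dependency graph: each
    concept's embedding is refreshed from its own embedding plus the mean of
    its neighbours' embeddings (a GraphSAGE-style aggregation, chosen here
    for illustration)."""

    def __init__(self, dim: int):
        super().__init__()
        self.self_lin = nn.Linear(dim, dim)
        self.neigh_lin = nn.Linear(dim, dim)

    def forward(self, emb: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # emb: (num_concepts, dim); adj: (num_concepts, num_concepts) 0/1 matrix
        deg = adj.sum(dim=1, keepdim=True).clamp(min=1)
        neigh_mean = (adj @ emb) / deg               # average of neighbour embeddings
        return torch.relu(self.self_lin(emb) + self.neigh_lin(neigh_mean))
```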

The Loss function

The main loss we minimize is on predicting how the student will perform on question Q conditioned on his or her knowledge state Y at time t, where C is the embedding of the concept of question Q and F is a function transformation of the difficulty of question Q.

l could be any loss function; we tried cross-entropy loss between our prediction and the target. We settled on MSE because we use data from summative, formative and remediation systems, where partial grades are possible.
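A sketch of what this objective could look like in code, with MSE against a (possibly partial) target score; the way difficulty enters the prediction here is assumed for the example.

```python
import torch
import torch.nn.functional as F

def performance_loss(state, concept_emb, difficulty_feat, target_score):
    """Main objective (sketch): predict performance on a question from the
    knowledge state, the question's concept embedding, and a transform of its
    difficulty, then score with MSE so that partial grades from summative and
    formative work are usable. How difficulty enters the prediction is an
    assumption made for illustration."""
    logits = (state * concept_emb).sum(dim=-1) + difficulty_feat
    prediction = torch.sigmoid(logits)
    return F.mse_loss(prediction, target_score)
```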

Bringing internal consistency

The second requirement of Inter-Skill Similarity is that if two skills S(child) and S(parent) are connected in the subject matter experts’ graph in a Parent -> Child relationship, one should reasonably expect that an individual who has mastery of the Child skill has even greater mastery of the Parent skill [3]. This is one additional regularization that we incorporate into the model via the loss function.
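One simple way such a constraint could be written, as a sketch; the exact form of this regularizer is an assumption.

```python
import torch

def parent_child_penalty(parent_mastery: torch.Tensor,
                         child_mastery: torch.Tensor,
                         margin: float = 0.0) -> torch.Tensor:
    """Penalize states where a Child concept appears better mastered than its
    Parent (prerequisite). A hinge on the difference is one simple way to
    encode this; the exact form of the regularizer is not spelled out above."""
    return torch.relu(child_mastery - parent_mastery + margin).mean()
```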

The other regularization we perform constrains the inherent “model drift” that the DKT model was observed to have. What this means is that if we go through 100 questions and at the 100th question decide to go back to question 1, we should be able to recover what the child’s knowledge state was at that point. When the input order (Qt,0), (Qt+1,0) occurs frequently enough, the model will tend to learn that if a student answers Qt incorrectly, he or she is also likely to answer Qt+1 incorrectly, but not Qt itself, which is not consistent [2]. Thus we add an extra loss term that captures how well the model predicts the student’s performance on the current question. This loss is multiplied by a small weighting term.
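As a sketch, the combined objective could then look like this, with the consistency term weighted down; the weight value is an illustrative assumption.

```python
import torch
import torch.nn.functional as F

def total_loss(next_pred, next_target, current_pred, current_target,
               reg_weight: float = 0.1):
    """Main loss on the next interaction plus a small, weighted term that asks
    the model to stay consistent on the question it has just seen, in the
    spirit of prediction-consistent regularization [2]. The weight value is an
    illustrative assumption."""
    main = F.mse_loss(next_pred, next_target)
    consistency = F.mse_loss(current_pred, current_target)
    return main + reg_weight * consistency
```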

In Conclusion:

The model was trained on the dense in-house data that we have collected over the years for science. When training the model, we only considered students who had more than 25 interactions. After this filtering, the training data consists of about 1.6 million interactions from around 13 thousand students, drawn from a wide variety of schools and grades all across the country. Using this data, the model was trained for 500 epochs, and we used over 1,700 students for validation. When using the binary representation of whether a question was answered correctly, the model obtained an AUC of 0.856 on the unseen validation set. However, as mentioned above, with summative, formative and remediation data providing a continuous representation of the student’s interactions, we used MSE loss, which reduced to 0.256. We trained the system not just to predict whether a student will get a question right, but also to predict the highest level of question difficulty they can answer for a given concept. This allows us to perform an interesting clustering of students for revision and remediation tasks, both online and in a school setting. The teacher is thus able to get a fine-grained picture of the class’s lacunae.

References:

1. Chris Piech, Jonathan Bassen, Jonathan Huang, Surya Ganguli, Mehran Sahami, Leonidas Guibas, Jascha Sohl-Dickstein (Stanford University, Khan Academy, Google). “Deep Knowledge Tracing”. 2015.

2. Chun-Kit Yeung and Dit-Yan Yeung, Hong Kong University of Science and Technology, “Addressing two problems in Deep Knowledge Tracing via Prediction-Consistent Regularization”. 2018.

3. Penghe Chen, Yu Lu, Vincent W. Zheng, Yang Pian. “Prerequisite-Driven Deep Knowledge Tracing”. 2018 IEEE International Conference on Data Mining.

4. Doug Rohrer, Robert F. Dedrick et al. “Interleaved Practice Improves Mathematics Learning”. Journal of Educational Psychology, 2014.

5. Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova. “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding”. May 2019.

Written by: Dev Roy, Kshiteej Kalambarkar, Sushmita Narayana, Sidharth S Rao.
