Data Analysis Techniques Applied to the MathE Database

MathE is an international online platform that aims to provide a resource for in-class support as well as an alternative instrument to teach and study mathematics. This work investigates the students' behavior when answering the training questions available on the platform. The data collected since the platform was launched (3 years ago) is analyzed with data mining and machine learning techniques in order to draw conclusions about the value of the platform, the ways in which the students use it, and the most sought-after mathematical topics, thus deepening the knowledge about the difficulties faced by the users and finding how to make the platform more efficient. First, a general analysis was performed to identify the students' behavior as well as the topics that require reorganization; it was followed by a second iteration, according to the students' country.


Introduction
All actors in the educational process are aware of the need to improve the quality of lectures and to intensify research on innovations that better engage students and lower failure rates in the discipline. Because of the cumulative nature of mathematics, courses that rely on a strong mathematical core present enormous challenges to both professors and students: in mathematics, students learn that, through adequate reasoning and proper assumptions, they can arrive at results that are fully trustworthy and applicable in a wide variety of scientific and real-life contexts. Guiding the students to the appropriate degree of attainment, comprehension, and autonomy has long been one of the significant challenges for professors, including at the higher-education level. Poor performance, especially in introductory courses, is a major concern for college mathematics lecturers [10,11].
Although lecturing, in an exposition-centered approach, has been the traditional way of teaching, there is theoretical evidence of the need for students to be more active in constructing their understanding [2,9,14,19].
Active learning methodologies can be a meaningful contribution to boosting the students' engagement. They are grounded in the constructivist theory of teaching and learning, which holds that humans learn by actively using new information and experiences and that reality is shaped by the experiences of the learner. Some of the main features of the constructivist teaching practice, such as the encouragement of the students' autonomy, initiative, and dialogue among their peers and with the professors, promote a sense of personal agency, since the students have control of their learning and, to some extent, of their assessment [5,7,12,18].
Students retain much more if they are challenged to reflect on and do more than just passively receive information. Active learning interventions can include approaches as diverse as workshops, group problem-solving and team quizzes, worksheets or tutorials completed throughout the class, personal response systems ("clickers") that display a graph of the responses (there are many online applications for this purpose, such as https://www.polleverywhere.com), and moments of individual thinking alternating with small group activities, all subject to immediate feedback. These are powerful techniques for helping students work through, understand, and solve a problem; they are among the evidence-based best practices in active learning and lead to greater learning [6,17]. Cooperative learning is a component of active learning that is worth highlighting: it refers to work developed by the students, organized into teams, in order to produce an outcome of some sort: a laboratory or project report, the design of a product or a process or, within the context of learning mathematics, the solutions to a set of problems. The dynamics of cooperative learning should encourage face-to-face interaction, interdependence, individual accountability, appropriate use of interpersonal skills, and several moments of self-assessment of team performance and dynamics. Extensive research has shown that, compared to traditional pedagogical models, cooperative learning, when it is implemented adequately, leads to greater learning and to the development of communication and teamwork skills such as leadership, project management, and conflict resolution. Furthermore, the characteristics described above are in line with the 4th of the Sustainable Development Goals (SDGs), quality education, which intends to ensure inclusive and equitable quality education for all and promote lifelong learning [8].
The MathE online teaching platform is in line with the described scenario: it provides educational resources for students and lecturers, covering the traditional mathematics contents of higher-education courses. By registering on the platform, the users have access to a wide variety of educational resources such as videos, solved exercises, podcasts, or PDF files, as well as questions that allow the students to undertake self-assessment tests and the professors to perform evaluation. MathE also provides the users with a forum where the students can share questions and challenges, teaching each other and, therefore, being active agents in the construction of new skills and knowledge. Professors and researchers also benefit from their own forum on the platform, where there is in-depth peer-to-peer interaction for the exchange of expertise and knowledge [13].
The platform aims to offer a dynamic and engaging tool to teach and learn mathematics, relying on interactive digital technologies that enable customized study. The goal of this research is to analyze the data collected on the MathE platform over the 3 years that it has been online. The aims are to investigate which topics available on the platform need to be restructured in terms of the difficulty level of their questions, and to analyze the students' performance according to the countries they belong to. This information will be combined with the conclusions of previous works [3,4]: [3] investigated the profiles of different groups of students exclusively in the Linear Algebra topic, and [4] analyzed the optimum way to reorganize the resources available on the platform into different levels of difficulty. The information acquired in this work will complement the conclusions obtained in both papers. It will help the platform developers to trace the future path to provide intelligence for the MathE platform, since it is expected that MathE will shortly be able to make use of intelligent mechanisms, based on optimization algorithms and machine learning, to make autonomous decisions tailored to the needs of each user.
The rest of this paper is organized as follows. In Section 2, the collaborative educational platform MathE is briefly described. Section 3 presents the methodology adopted. Section 4 describes the data analyzed in this paper, collected throughout the time that the platform has been online. The results and the discussion based on the data analysis are presented in Section 5. Finally, the conclusions and the directions for future work are presented in Section 6.

The MathE Platform
The development of Information and Communication Technologies (ICT) has facilitated access to education and made the learning process more accessible and effective. Promoting an e-learning method requires different types of resources, in particular digital and technological resources. MathE is an e-learning platform focused on the mathematical contents of higher education courses. On the platform, any student or professor has free access to a collection of questions, videos, and other pedagogical materials related to mathematics at higher education level. MathE was developed and implemented by a consortium of seven institutional partners from five European countries: Polytechnic Institute of Bragança (Portugal), the Limerick Institute of Technology (Ireland), the University of Genova (Italy), Pixel (Italy), Kaunas University of Technology (Lithuania), Technical University of Iasi (Romania) and EuroED (Romania). Each partner institution has built a solid community of professors in the corresponding country that has been actively collaborating and responding to the project's challenges.
The MathE platform comprises three main sections. The Student's Assessment section is subdivided into two subsections, Self Need Assessment (SNA) and Student Final Assessment (SFA): the students can self-evaluate their knowledge using the SNA subsection, whereas under SFA the professors can organize online tests about selected topics. In the MathE Library, the users can access a collection of videos and additional resources about the topics covered by the platform. Finally, the Community of Practice provides a free forum where users can create and share their experience, knowledge and information: the students are invited to discuss related issues and challenges in the Students' Community, and the professors can build a solid network of learning and teaching practices in the Lecturers' Community.
Moreover, MathE also offers a YouTube channel where all the videos of the platform are available. There are two types of videos, both available on the platform and on the MathE YouTube channel: those selected from the internet by the MathE experts and linked from the MathE platform, and those produced exclusively by the MathE consortium according to the platform's needs.
The MathE platform is currently being used by a significant number of users: 1171 students of 15 nationalities are enrolled (Portuguese, Brazilian, Turkish, Tunisian, Greek, German, Kazakh, Italian, Russian, Lithuanian, Irish, Spanish, Slovenian, Dutch and Romanian). There are also 99 professors from 12 countries and 49 higher education institutions registered. It is important to emphasize that, besides the users signed up in the MathE portal, there are users from countries like India, the Philippines and Egypt, according to the information obtained from the YouTube channel. Fig. 1 illustrates the MathE presence around the world, that is, the countries where MathE has at least one person enrolled, either a professor or a student.
Currently the platform has 1841 questions, covering the fifteen most classical mathematical topics addressed in higher-education courses. The questions available are divided into two levels of difficulty (basic and advanced); this categorization is made by a collaborating professor. Table 1 describes the number of questions available in each topic in each section, SNA and SFA. It is essential to clarify that each time a student selects a topic and a question difficulty level to answer on SNA, a set of seven multiple-choice questions is randomly generated from an assessment platform database. After submitting the test for evaluation, the students immediately receive feedback on their scores, and some suggestions (extra material) are given for the questions answered incorrectly. On the other hand, in the SFA section the professor defines how many and which questions compose the test, and schedules a test on the platform system composed of questions from an exclusive SFA database. In this case, after a student submits the test for evaluation, the professor immediately receives the student's score; the student only has access to their score 24 hours after the end of the test. Additional details about each section of the platform are described in [3,4], and can also be found on its website (mathe.pixel-online.org) or on the MathE YouTube channel (MathE Channel).
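The SNA test generation described above can be illustrated with a minimal Python sketch. The question bank layout, the field names and the function `generate_sna_test` are hypothetical and serve only as an illustration; the sketch also assumes that the seven questions are drawn without repetition, which the platform description does not state explicitly.

```python
import random

def generate_sna_test(question_bank, topic, level, n_questions=7, seed=None):
    """Draw n_questions distinct questions for one topic and difficulty level.

    question_bank is a list of dicts with the (hypothetical) keys
    "id", "topic" and "level".
    """
    rng = random.Random(seed)
    pool = [q for q in question_bank
            if q["topic"] == topic and q["level"] == level]
    if len(pool) < n_questions:
        raise ValueError("not enough questions for this topic and level")
    # Sampling without replacement is an assumption of this sketch.
    return rng.sample(pool, n_questions)

# Hypothetical question bank, for illustration only.
bank = [{"id": i, "topic": "Linear Algebra", "level": "basic"} for i in range(20)]
bank += [{"id": 100 + i, "topic": "Integration", "level": "advanced"} for i in range(10)]

sna_test = generate_sna_test(bank, "Linear Algebra", "basic", seed=0)
print([q["id"] for q in sna_test])
```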

Methodology
The methodology adopted in this paper consists of applying data mining and machine learning strategies to assess the data collected on the MathE platform. Using statistical tools, the data is analyzed with regard to each student's hit probability, that is, the proportion of correctly answered questions, in each topic used. In this way, it is possible to identify the topics that require more attention, according to the students' difficulties in answering the available questions. Thereafter, the data is evaluated according to the student's country of origin, in order to search for different student profiles according to nationality.
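As an illustration of the hit probability, the following Python sketch computes, for each student and topic, the fraction of answered questions that were correct. The record layout `(student, topic, is_correct)` and the sample records are hypothetical; they are not taken from the MathE database.

```python
from collections import defaultdict

def hit_probabilities(answers):
    """Fraction of correct answers per (student, topic) pair.

    answers is an iterable of (student_id, topic, is_correct) records.
    """
    counts = defaultdict(lambda: [0, 0])  # [correct, total]
    for student, topic, is_correct in answers:
        counts[(student, topic)][0] += int(is_correct)
        counts[(student, topic)][1] += 1
    return {key: correct / total for key, (correct, total) in counts.items()}

# Hypothetical answer records, for illustration only.
answers = [
    ("s1", "Linear Algebra", True),
    ("s1", "Linear Algebra", False),
    ("s1", "Linear Algebra", True),
    ("s1", "Linear Algebra", True),
    ("s2", "Linear Algebra", False),
    ("s2", "Integration", True),
]
probs = hit_probabilities(answers)
print(probs[("s1", "Linear Algebra")])  # 3 correct out of 4 answers
```

A value close to 1 corresponds to a student with excellent performance in the topic, and a value close to 0 to a student with poor performance.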
After that, the k-means clustering algorithm is used to identify the similarities and dissimilarities in the students' behavior, per country. Among the unsupervised methods, clustering techniques can be considered the most popular for placing similar elements in the same group and dissimilar elements in different groups [15], an approach that is appropriate for exploring relationships between data and detecting the underlying structures.
The k-means partitioning algorithm is one of the best-known clustering algorithms. It tries to separate the samples into groups of equal variance by minimizing a criterion known as the inertia, or within-cluster sum-of-squares (WSS) [1]. As k-means is not an automatic clustering algorithm, it requires the initial parameter k, which represents the number of clusters. The value of k can be specified by different techniques; in this work the Silhouette method [16], which is a similarity measurement, is adopted. Once this value is established, the k-means algorithm divides a set X of m samples $X_1, X_2, \ldots, X_m$ into k disjoint clusters $C_1, \ldots, C_k$, each described by the mean $\mu_i$ of the samples in the cluster, also called the cluster "centroid". In this way, the k-means algorithm aims to choose centroids that minimize the inertia, or within-cluster sum-of-squares criterion, presented in Equation (1) [1].

$$\sum_{i=1}^{k} \sum_{X_j \in C_i} \lVert X_j - \mu_i \rVert^2 \qquad (1)$$
From these centers, a clustering is defined by grouping the data points according to the center to which each point is assigned. The k-means clustering algorithm and the Silhouette algorithm are available in the MATLAB® library, and these implementations were applied in the research that this work describes.
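The procedure can be sketched in a few lines of standard-library Python. This is not the MATLAB implementation used in this work: it is a minimal illustration that assumes a deterministic farthest-first seeding (a swapped-in choice that keeps the example reproducible), plain Lloyd iterations, and the mean silhouette value to select k. The sample points, read here as (correct answers, incorrect answers) per student, are hypothetical.

```python
import math

def farthest_first_init(points, k):
    """Deterministic seeding (an assumption of this sketch): start at the first
    point, then add the point farthest from its nearest chosen centroid."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(max(points, key=lambda p: min(math.dist(p, c) for c in centroids)))
    return centroids

def assign(points, centroids):
    """Index of the nearest centroid for every point."""
    return [min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
            for p in points]

def kmeans(points, k, n_iter=100):
    """Lloyd iterations; returns (labels, centroids)."""
    centroids = farthest_first_init(points, k)
    for _ in range(n_iter):
        labels = assign(points, centroids)
        updated = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            # Keep the previous centroid if a cluster ends up empty.
            updated.append(tuple(sum(c) / len(members) for c in zip(*members))
                           if members else centroids[j])
        if updated == centroids:
            break
        centroids = updated
    return assign(points, centroids), centroids

def inertia(points, labels, centroids):
    """Within-cluster sum-of-squares, as in Equation (1)."""
    return sum(math.dist(p, centroids[lab]) ** 2 for p, lab in zip(points, labels))

def mean_silhouette(points, labels):
    """Mean silhouette value; singleton clusters score 0 by convention."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        return 0.0
    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)
            continue
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        b = min(sum(math.dist(p, q) for q in other) / len(other)
                for m, other in clusters.items() if m != lab)
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)

def choose_k(points, k_values=range(2, 6)):
    """Number of clusters with the highest mean silhouette value."""
    return max(k_values, key=lambda k: mean_silhouette(points, kmeans(points, k)[0]))

# Hypothetical (correct, incorrect) answer counts per student.
points = [(2, 3), (1, 4), (3, 2), (4, 5),
          (40, 20), (45, 25), (42, 22),
          (15, 45), (12, 50), (18, 48)]
best_k = choose_k(points)
labels, centroids = kmeans(points, best_k)
print(best_k, labels, round(inertia(points, labels, centroids), 2))
```

For production use, a library implementation with several random restarts is generally preferable to a single deterministic seeding, since Lloyd iterations only reach a local minimum of the inertia.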

Description of the MathE Data
MathE is an international platform and, currently, 1171 students of 15 different nationalities and 99 professors and researchers are enrolled. Table 2 describes this information in more detail; it is possible to observe that the countries with the most users under the student profile are Portugal, Lithuania and Italy. The data collected for analysis in this work comprises 6927 answers distributed among the 15 topics of Table 1. These answers were provided by the 284 students who have used the SNA section since the platform's launch, in 2019. It is important to highlight that questions and topics are constantly being added to the platform, so naturally some topics have more questions answered than others. The data is fully characterized by the topic and the two levels of difficulty, basic and advanced (Table 3). The Topics column lists all the MathE topics available on the platform. Moreover, for both levels (middle block and right block), the Std column shows the number of students that answered questions on that topic, and the numbers of correct and incorrect answers are given in the columns CA and IA, respectively. Finally, the TQ columns present the total number of questions answered at each level (by topic), which corresponds to the sum of all correct and incorrect answers of that difficulty level.

Data Analysis
In this section, the analysis of the data previously described is presented, which aims to investigate the students' behavior on the MathE platform since it has been online. First of all, it is essential to clarify that, as mentioned above, there are 1171 students enrolled on the platform, of which 284 use the SNA section; the other students use other resources of the platform, such as videos and/or pedagogical materials, or even the Community of Practice. These 284 students belong to 8 countries: Portugal, Lithuania, Italy, Ireland, Romania, Russia, Spain, and Slovenia. Considering the information these students provided, a global analysis of the data set is done first, followed by a complementary analysis by country.

General Database Analysis
Initially, the global performance of the students on the platform was analyzed, that is, the data of the students who used the platform to answer the questions available in the topics of the Self Need Assessment (SNA) section. From Table 3, it is possible to see that the number of basic questions answered (5710 in total) represents 82% of the total questions answered on the platform, against 1217 advanced questions answered. So, it is clear that the students use the basic questions more than the advanced ones. In terms of the type of answer, in general, the number of incorrect answers is higher than the number of correct ones: 3185 incorrect answers at the basic level (56%) and 628 (51.6%) at the advanced one.
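The percentages quoted above follow directly from the totals reported in the text, as the short Python check below shows; only the variable names are introduced here.

```python
# Totals quoted in the text (from Table 3).
basic_total, advanced_total = 5710, 1217
basic_incorrect, advanced_incorrect = 3185, 628

total_answers = basic_total + advanced_total
share_basic = basic_total / total_answers
basic_error_rate = basic_incorrect / basic_total
advanced_error_rate = advanced_incorrect / advanced_total

print(total_answers)                    # all answers analyzed
print(f"{share_basic:.1%}")             # share of basic answers
print(f"{basic_error_rate:.1%}")        # incorrect answers, basic level
print(f"{advanced_error_rate:.1%}")     # incorrect answers, advanced level
```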
Comparing the data presented in Table 1, which describes the number of questions available, with Table 3, which presents the number of questions answered by the students, it is possible to note that the topics most used by the students are those with the greatest variety of questions available. This can be explained by the fact that the contents of the MathE platform are constantly updated, with additional questions on each topic. So, in Table 3, Linear Algebra is the most used topic, followed by Fundamentals of Mathematics, Differentiation, and Complex Numbers, each with more than 450 answers in total. To investigate the distribution of the hits obtained in each topic, the individual hit probability per topic is calculated for each student. The results of this evaluation are presented graphically for the basic question level (Fig. 2, 275 students), in order to compare the probability of correctly answered questions across the 15 topics available on the MathE platform. As can be seen, in some topics the distribution of the students is almost homogeneous over the interval [0, 1], which means there are students with excellent performance (close to 1), students with poor performance (close to 0), and students with average performance. These characteristics can be found in topics 1, 2, 4, and 9, which are the topics used by the most students, as denoted by the colored points.
Nonetheless, in some topics it is possible to observe gaps in the performance interval. These gaps can be associated with the low number of students that use the topic, but some peculiarities lead to meaningful observations. Topic 5 (Integration) is used by 13 students, and only 2 of them have a performance greater than 0.5, whereas in topic 7 (Complex Numbers), which 29 students use, only 1 student has a performance in the interval ]0.5, 1[. The absence of students in these performance ranges may indicate a lack of easier questions. On the other hand, in topic 8 (Differential Equations), which is used by 10 students, there is no student with a performance in the interval ]0, 0.6[, which indicates that more complex questions are needed at the basic level of this topic. Finally, in topic 15 (Numerical Methods), the majority of the students have a performance in the interval ]0.35, 0.6[, which indicates the need for questions with more variability in terms of difficulty.
Although all these questions belong to the basic level, they are not all equally difficult: some are more basic than others. This variability in difficulty within a given level is expected, and it is important to maintain it, considering that students with different needs are enrolled on the platform. But some topics are not meeting this expectation, which calls for a better distribution of the questions over more levels of difficulty, as already indicated in [4]. Such observations are fundamental for deciding the level of difficulty of the future questions that will be added to the topics: topics 5 and 7 mainly need easier questions, topic 8 requires more complex questions, and topic 15 needs both.
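The gaps discussed above can be located by counting how many students fall into each sub-interval of [0, 1]. The Python sketch below is only an illustration: the function `bin_counts`, the choice of five equal-width bins, and the sample values are assumptions of this example and are not part of the analysis performed in this work.

```python
def bin_counts(probabilities, n_bins=5):
    """Count students per equal-width bin of [0, 1]; the last bin includes 1."""
    counts = [0] * n_bins
    for p in probabilities:
        index = min(int(p * n_bins), n_bins - 1)
        counts[index] += 1
    return counts

def empty_bins(counts):
    """Indices of the bins that contain no student."""
    return [i for i, c in enumerate(counts) if c == 0]

# Hypothetical hit probabilities of 10 students in one topic,
# all above 0.6, similar to the situation described for topic 8.
topic_probs = [0.62, 0.66, 0.70, 0.71, 0.75, 0.83, 0.86, 0.90, 1.0, 1.0]
counts = bin_counts(topic_probs)
gaps = empty_bins(counts)
print(counts, gaps)
```

In this hypothetical case the three lowest bins are empty, which corresponds to the situation in which a topic would benefit from more complex questions at the basic level.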
Given the small number of advanced questions answered, and the few students practicing them, it is not possible to draw consistent conclusions about the topics at the advanced level of difficulty.

Students Assessment per Country
As previously mentioned, the SNA section has students from 8 countries, so in this section the students' performance according to country is surveyed. Table 4 describes the data by country in terms of the number of students per country that answered basic and advanced questions, and also in terms of the type of answer (correct and incorrect) at both difficulty levels (basic and advanced). Finally, the last column presents the sum of all questions answered.
As can be seen, most students using the SNA are from Portugal, Lithuania, and Italy. These three countries have at least one institution on the platform's developer team, which contributes to greater platform dissemination. Table 2 shows that these three countries have the most registered students, professors, and institutions on the platform. Besides, it is worth mentioning that Portuguese students correspond to more than half of the students enrolled on the platform, 646 out of 1171. Concerning the SNA section, Portuguese students are more than 60% of the total, that is, 174 out of 284.
Table 4: Number of students and of answers per country (CA: correct answers; IA: incorrect answers).

Country     Students                    Basic answers    Advanced answers   Total
            Basic   Advanced   Total    CA      IA       CA      IA
Portugal    168     60         174      1529    1879     366     402        4176
Lithuania   53      11         55       486     634      113     148        1381
Italy       35      7          35       333     446      48      28         855
Ireland     12      3          13       100     138      30      19         287
Romania     3       1          3        18      …

Thus, to analyze the students' performance by country, the probability of correct answers by country was obtained and is shown in Table 5. The columns Basic Questions and Advanced Questions correspond to the hit average of the students at the basic and advanced levels, respectively. Furthermore, the last column, All Questions, is the hit probability considering both levels. From Table 5, it can be seen that the hit average for the advanced questions is almost always higher than the probability of correct answers for the basic questions. The opposite was expected: since the students make many mistakes in the basic questions, a lower hit probability was expected for the advanced questions. This observation may indicate that the questions are not adequately organized on the platform, since the basic questions have a degree of difficulty higher than expected and the advanced ones are not as complex as intended. Thus, for the best use of the platform, this issue is one of the urgent points to be reviewed.
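As a rough cross-check, pooled proportions of correct answers can be computed from the complete Table 4 rows. The sketch below assumes that the four answer counts of each row are, in order, basic correct, basic incorrect, advanced correct and advanced incorrect. The pooled proportion is not necessarily the statistic reported in Table 5, which may average the individual hit probabilities of the students, so the values need not coincide.

```python
# (basic CA, basic IA, advanced CA, advanced IA) for the complete Table 4 rows;
# the column order is an assumption of this sketch.
table4 = {
    "Portugal": (1529, 1879, 366, 402),
    "Lithuania": (486, 634, 113, 148),
    "Italy": (333, 446, 48, 28),
    "Ireland": (100, 138, 30, 19),
}

def pooled_rates(basic_ca, basic_ia, adv_ca, adv_ia):
    """Pooled fraction of correct answers per level and overall."""
    total = basic_ca + basic_ia + adv_ca + adv_ia
    return {
        "basic": basic_ca / (basic_ca + basic_ia),
        "advanced": adv_ca / (adv_ca + adv_ia),
        "all": (basic_ca + adv_ca) / total,
        "answers": total,
    }

rates = {country: pooled_rates(*row) for country, row in table4.items()}
for country, r in rates.items():
    print(f"{country:10s} basic={r['basic']:.3f} "
          f"advanced={r['advanced']:.3f} all={r['all']:.3f}")
```

Under these assumptions, the pooled advanced proportion exceeds the basic one for Portugal, Italy and Ireland, while for Lithuania the two are practically equal.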
The OECD Programme for International Student Assessment (PISA) examines what students know about mathematics. According to this ranking, the countries and their mean scores are: Slovenia (509), Ireland (500), Portugal (492), Russia (488), Italy (487), Lithuania and Spain (481) and Romania (430) [11]. Since there is not a significant number of students from all countries, it is not easy to establish a highly reliable comparison. Considering only the countries with more than 30 students (Portugal, Lithuania, and Italy), their PISA averages are close, with Portugal in 28th position, Italy in 31st and Lithuania in 34th [11]. Thus, it was already expected that the students' performance would be similar, as can be seen in the averages of correct answers in Table 5, mainly in the last column, which considers all the questions answered by the students of each country.

Portuguese, Lithuanian and Italian Students Assessment on Basic Questions
As already mentioned, there is a small number of students per country, so it is not feasible to assess the profile of the students from all 8 countries. Therefore, this section only considers data from Portuguese, Lithuanian, and Italian students, as there are more than 30 students in each group. Moreover, the number of advanced answers is small compared to the number of basic answers, so only basic answers are considered in this section. As shown in Fig. 3, students from the three countries predominantly answer topic 1, Linear Algebra. This topic is present in practically all higher education courses, regardless of the country, which may be the main reason for such a significant number of answers.
With respect to the Portuguese students, in Fig. 3a, it is possible to verify that they concentrate on topic 1 (Linear Algebra), which reaches approximately 2000 responses, while all other topics have fewer than 210 responses (with the exception of topic 2, Fundamentals of Mathematics). In addition to the already presented justification that Linear Algebra is present in several courses, the fact that the other topics are less used may be related to the encouragement given by the professors during the classes. Similar behavior is found for the Italian students, in Fig. 3b: while Linear Algebra collects almost 500 responses, the other topics have fewer than 100. On the other hand, for Lithuania, in Fig. 3c, this pattern is less pronounced. Although with a smaller number of answers than in the other countries, topics 2 (Fundamentals of Mathematics), 4 (Differentiation), 6 (Analytic Geometry), 7 (Complex Numbers) and 15 (Numerical Methods) are also significantly explored by the Lithuanian students in relation to the other topics they use.
The data presented in Fig. 3 is interesting and worth exploring in future works. If one can identify the strengths that lead students to prefer Linear Algebra, it will be possible to export these characteristics to the other topics, thus encouraging students to use the platform constantly and intensively in other topics too.
Finally, to identify the similarities and dissimilarities in the students' behavior, a clustering analysis was performed and the results are shown in Fig. 4.
Regarding the Portuguese students, in Fig. 4a, the algorithm grouped the students into 3 clusters. Cluster 1 (red) contains the students who answered fewer questions than those in the other clusters, and it is the cluster with the highest population density. These students answered a maximum of 50 questions, given by the sum of the x and y coordinates. Thus, cluster 1 represents the students who use the platform less, with a mean of 12 answers, and whose performance is average in relation to the others. Cluster 2 (blue), on the other hand, contains the students who answered more basic questions correctly. All students in this cluster answered at least 30 basic questions correctly and more than 18 incorrectly, while the majority answered fewer than 40 incorrectly. Furthermore, the average number of answers is 78, so cluster 2 groups the students who use the platform more often and perform better than the other students. Finally, cluster 3 (green) contains students who also use the platform a lot, with an average of 55 answers, but do not perform well, since they have a high error rate and a low success rate in basic questions.
In the case of the Italian students, in Fig. 4b, there are also 3 clusters, but with different behavior from the Portuguese students. Cluster 1 (red) contains students who answer few questions (a mean of 2 and a maximum of 17 questions), and most of their answers are incorrect. Cluster 2 (blue) contains the students who answered the largest number of questions, with a mean of 66 answers, but the number of incorrect answers is much higher than the number of correct answers. Finally, cluster 3 (green) contains students with average performance: the numbers of correct and incorrect answers are more balanced than in the other clusters, and these students answer an average of 36 questions. For Lithuania, in Fig. 4c, there are 2 clusters. The students are heavily concentrated in cluster 1 (red), responding to few questions (a mean of 5 answers) with more incorrect than correct answers. Cluster 2 (blue) contains the students who answer the most questions (a mean of 133); however, this is a small group composed of 4 students, and although their performance is slightly higher than that of the students in cluster 1, it is still not excellent, considering the number of errors.
Finally, Fig. 4d considers the students from the 3 countries together. Cluster 1 contains the students who answer fewer questions, with a low rate of correct answers; cluster 2 contains the students who answer more questions than those in cluster 1 and fewer than those in cluster 3, with an intermediate performance; and cluster 3, which has a small number of students, contains those who answer the most questions.

Conclusions
The MathE platform is an online educational system that aims to help students who struggle to learn college mathematics, as well as students who want to deepen their knowledge of a multitude of mathematical topics, at their own pace. The platform currently provides a diversified set of questions, videos and pedagogical resources for the higher education level. The questions are randomly generated, independently of the profile of the users (it is only possible to choose the topic, subtopic and level of difficulty), but it is expected that in the near future the platform will be able to make use of intelligent mechanisms, based on optimization algorithms and machine learning, to make autonomous decisions and to direct the questions in a customized way, according to the students' profile and needs.
The research in [3,4] aimed to investigate the difficulties and potentialities of the platform, as well as the characteristics that could be used to make the platform more efficient. The approach presented in this paper sought to evaluate whether the level of difficulty of the questions in the topics available on the platform is adequate, based on the students' hit probability in the SNA section. In addition, it was also evaluated whether the country of origin is a relevant variable in the students' performance. The information collected through this research will serve as a guide for choosing optimal strategies to improve the performance of the platform.
The results obtained in this work, together with those already reported in [3,4], make evident the need to reorganize the questions into more levels of difficulty. The results of this analysis will be fundamental for defining the type of questions that each topic needs. In addition, the assignment of a question to a certain level is currently done by a collaborating professor, so this division is subject to partiality and subjectivity, and may vary from person to person. Thus, finding a way to assign the questions to their respective difficulty levels autonomously, through an intelligent system, is one of the possible ways to improve the organization of questions on the platform. This is also a way to keep students constantly active on the platform: the more engaged the students are in using the platform, the more questions they answer.
Finally, in relation to the analysis by country, from the data analyzed so far it is not possible to conclude whether students from a particular country perform better than those from other countries (due, for example, to the quality of education in the country in question or to other factors). In general, the countries that have more than 30 students enrolled on the platform show very similar outcomes in questions of both levels of difficulty. Thus, with the data currently available, the country of origin does not appear to be a determining variable in the customization of questions for students in a future version of the platform.