This section briefly describes the tasks we have worked on. Each belongs to one of the following groups:
- Routine tasks - the descriptions are meant to show where data mining algorithms can be applied.
- Unique tasks - these require non-standard problem formulations and approaches.
1. Medicine: prediction of complications and relapses
The Heart Center investigated predicting the complications that occur during surgery from pre-operative patient data, using the results of physiological and diagnostic examinations. Of additional interest was the possibility of reducing risk by choosing the surgical method most appropriate for each individual patient.
The next project for the same client was a statistical model of the probability of relapse during the postoperative period (up to 3 years). The analysis used patient data collected before the operation, during the operation, and during the rehabilitation period.
2. The oil and gas industry: predicting the fractional composition
This project aimed to create a device that predicts the fractional composition of the mixture flowing through a pipe in natural gas production, based on short-term calibrations with additional instruments.
Three fractions were considered: the mixture entering the pipe consisted mainly of gas, oil and water. The process has a number of parameters that differ in how difficult (and hence how costly) they are to measure.
What was required:
- Investigate whether a model can be built that predicts the target characteristics (the gas flow rate and the oil and water flow rates) from the observed characteristics, such as pressure differences, pressure, temperature, etc.
- Establish how the prediction error depends on the duration and the start time of the device calibration (training of the predictive model) with the equipment operating in a fixed mode.
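The second requirement can be illustrated with a minimal sketch. It assumes a single observed characteristic and a linear response, and uses synthetic data; the names `pressure_drop`, `gas_flow`, `fit_line` and `calibration_curve` are hypothetical and are not taken from the project, whose real model used several observed characteristics. The sketch calibrates on the first `d` samples and measures the error on later samples.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares fit of y = a * x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, mean_y - a * mean_x


def calibration_curve(xs, ys, durations, holdout):
    """Mean absolute error on the last `holdout` samples after
    calibrating on the first d samples, for each d in `durations`."""
    test_x, test_y = xs[-holdout:], ys[-holdout:]
    errors = {}
    for d in durations:
        a, b = fit_line(xs[:d], ys[:d])
        errors[d] = sum(abs(a * x + b - y) for x, y in zip(test_x, test_y)) / holdout
    return errors


# Synthetic stand-in data (an assumption, not project data): one observed
# characteristic and a noisy linear response.
rng = random.Random(0)
pressure_drop = [rng.uniform(1.0, 5.0) for _ in range(300)]
gas_flow = [2.0 * x + 1.0 + rng.gauss(0.0, 0.3) for x in pressure_drop]

errors = calibration_curve(pressure_drop, gas_flow, durations=[5, 20, 100], holdout=100)
for d, err in errors.items():
    print(f"calibration on {d:3d} samples -> mean abs error {err:.3f}")
```

Plotting the error against the calibration duration gives a learning curve from which an acceptable calibration time can be read off.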
3. Logistics: write-off errors
From time to time, company staff wrote off expended items to different parts of the account. This happens because someone makes a mistake (deliberately or not). Apparently they assume that these write-offs belong to more than one actual expense report. As a result, the error cannot be detected directly.
However, if the company’s write-off practices do not change, such "mistakes" are fairly easy to detect with data analysis methods.
4. Production: workload on the equipment
This is a typical problem for medical institutions as well as for other businesses: the use of expensive equipment must be carefully controlled to minimize conflicts caused by simultaneous demand.
Statistical models are built to predict the periods of high and low demand. Guided by these load predictions, operations research methods make it straightforward to redistribute the planned use of the equipment sensibly.
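As a simple illustration of predicting high and low demand, the sketch below averages a request log by hour of day and flags the hours in which average demand exceeds capacity. The log, the function names and the capacity value are hypothetical; a real model would also account for weekdays, seasonality and trends.

```python
from collections import Counter
from datetime import datetime


def hourly_profile(requests):
    """Average number of requests per hour of day over the logged days."""
    days = {ts.date() for ts in requests}
    counts = Counter(ts.hour for ts in requests)
    return {hour: counts[hour] / len(days) for hour in range(24)}


def peak_hours(profile, capacity):
    """Hours in which average demand exceeds what the equipment can serve."""
    return sorted(hour for hour, load in profile.items() if load > capacity)


# Hypothetical request log covering two days.
log = [
    datetime(2024, 1, 8, 9, 5), datetime(2024, 1, 8, 9, 20),
    datetime(2024, 1, 8, 9, 40), datetime(2024, 1, 8, 14, 0),
    datetime(2024, 1, 9, 9, 10), datetime(2024, 1, 9, 9, 30),
    datetime(2024, 1, 9, 9, 55), datetime(2024, 1, 9, 10, 15),
]
profile = hourly_profile(log)
busy = peak_hours(profile, capacity=1)
print(busy)
```

The flagged hours are the ones a scheduler would try to relieve by moving flexible requests to low-demand periods.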
5. Production: order runtime
Another task that is key to good production planning is estimating the order runtime from the order parameters, information about orders that entered production earlier, and the available resources (people and equipment). Of course, introducing a full-fledged SCM system is intended to resolve this through a detailed understanding of all production processes and conditions. However, the problem can often be solved less expensively and just as effectively by statistical analysis of the relationship between the order duration and other characteristics known about production at the time the order is received, provided such data has been collected regularly.
6. Bioinformatics: prediction of gene expression
Gene expression is predicted from histone modifications and transcription factor profiles. For each gene there are descriptions of the peaks for several transcription factors and histone modifications, as well as the results of gene expression measurements. The tasks were:
- Determine which histone modifications are the most informative.
- Find out which peaks indicate the level of expression.
- Build models that calculate a predicted expression value for a given peak profile.
The problem was complicated by the following factors:
- The data was not presented as a table or a cube; it contained gaps along two dimensions and, in places, a very high level of distortion.
- In the source data, the target parameter values were measured with substantial inaccuracy.
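One simple way to rank features by informativeness when the data contains gaps is to correlate each feature with the target using only the genes where both values are present. The sketch below is a stand-in for illustration, not the method used in the project. The feature names `mark_A` and `mark_B` and all values are invented.

```python
import math


def pearson_with_gaps(xs, ys):
    """Pearson correlation over the positions where both values are present."""
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    n = len(pairs)
    if n < 3:
        return 0.0  # too few complete pairs to say anything
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    syy = sum((y - mean_y) ** 2 for _, y in pairs)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant column carries no information
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    return sxy / math.sqrt(sxx * syy)


def rank_features(features, target):
    """Feature names sorted by absolute correlation with the target."""
    scores = {name: abs(pearson_with_gaps(col, target)) for name, col in features.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Invented peak heights per gene; None marks a gap in the data.
features = {
    "mark_A": [1.0, 2.0, None, 4.0, 5.0],
    "mark_B": [3.0, 1.0, 2.0, None, 2.5],
}
expression = [1.1, 2.0, 2.9, 4.2, None]

ranking = rank_features(features, expression)
print(ranking)
```

Pairwise deletion of this kind is only reasonable when the gaps are not themselves related to the expression level, and with noisy targets the ranking should be checked on resampled data.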
The following projects were carried out in collaboration with the data analysis laboratory of the Sobolev Institute of Mathematics.
1. Criminalistics: spectral analysis
When substances are analyzed, samples are exposed to various influences, and their responses, which depend on the chemical composition of the substance, are recorded. One example is the examination of micro-objects and their aggregates using X-ray spectral microanalysis data. The tested substance is a set of several tens or hundreds of micro-particles. The response of each micro-particle to this kind of exposure is a spectrum of 1,024 lines (channels). The signal amplitude in a spectral channel may vary from zero to several thousand conventional units. The spectrum of the same micro-object may vary depending on the controlled and uncontrolled conditions of the experiment.
At the request of the Crime Laboratory of the Federal Security Service of the Russian Federation, our staff created a software system called Spectran. Using X-ray spectral microanalysis data on micro-objects of homogeneous substances and their mixtures, the system solves the following basic tasks, among others: clustering particles by the similarity of their spectra, selecting a subset of the most informative spectral bands, and identifying whether particles and their mixtures belong to specified classes of substances.
2. Nuclear industry: reading symbolic information from products on the conveyor
The problem arose as part of a program to create a new generation of fuel elements. To control efficiently the parameters and the course of the manufacturing process for fuel elements (cartridges) for nuclear power plant reactors, it was necessary to read and identify the alphanumeric labels on products while they were on the production line.
The task had a number of specific requirements:
- An arbitrary orientation and location of a product during production.
- A variety of geometric distortions in the label image, caused by the marked surface differing in shape from a plane, by non-uniform lighting, and by failures of the equipment used to apply the labels.
- A variety of methods used to collect the information.
- The admissible probability of misreading or misidentifying the code is no more than 10⁻⁶.
3. Medicine: diagnosis of prostate cancer from protein mass spectra
The task was to analyze protein mass spectra obtained with a SELDI-TOF-MS spectrometer in order to diagnose patients. The number of spectral bands is 15,153. Four classes of patients with different PSA levels, which characterize the degree of prostate cancer development, were presented: 63 healthy patients with PSA < 1 ng/mL, 26 patients with PSA between 4 and 10 ng/mL (4 < PSA < 10 ng/mL), 43 patients with PSA > 10 ng/mL, and 190 patients with PSA > 4 ng/mL.
The problem is interesting because the small number of patients did not allow the sample to be split into training and control sets. For this reason, it was decided to exploit the fact that the target characteristic (PSA), which indicates the class a patient belongs to, makes it possible to define a partial order by disease severity.
Result: 24 of the 15,153 spectral bands were selected, which is sufficient to diagnose the patients.
4. Medicine: the recognition of two types of leukemia
The analyzed data was a matrix of gene expression vectors obtained with biochips for patients with two types of leukemia, ALL and AML. The training sample, obtained from bone marrow samples, contained 38 objects (27 ALL and 11 AML). The test sample contained 34 objects (20 ALL and 14 AML) obtained under different experimental conditions: 24 from bone marrow specimens and 10 from blood specimens. The initial number of features (genes) is 7,129. The normalized gene expression levels were measured from the biochip images.
Result: of the initial 7,129 features, 39 were selected, from which 30 variants of decision rules were built, each containing four to six features. 27 of the 30 rules showed a 0% recognition error on the test sample.
5. Sales: targeting
At the international Data Mining Cup 2009, the task was to analyze how many books of each genre had been sold in the different stores of a retail chain during the year. The data formed a very sparse table (84% of the cells were empty). Each cell at the intersection of a row and a column gave the number of books of a given genre (one of 1,856) sold during the year at a particular store (the number varied from 0 to 2,300).
The purpose of the analysis was to determine the required supply volume of books of each genre for every store.
618 teams from 164 organizations in 42 countries registered for the competition. 231 teams solved the problem and submitted their results. 49 teams passed the threshold of acceptable results set by the organizers. The average error per predicted cell ranged across teams from 0.89 to 100.22. Our predictions had an average error of less than 1 book per cell. Our team from Novosibirsk State University took 4th place.
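The average error per cell can be computed as in the sketch below. The source does not state the exact competition metric, so this assumes the mean absolute difference between predicted and actual sales, with empty cells of the sparse table treated as zero. The store and genre identifiers are hypothetical.

```python
def mean_abs_error(predicted, actual, cells):
    """Mean absolute error over `cells`; a missing cell counts as 0 books."""
    total = sum(abs(predicted.get(c, 0) - actual.get(c, 0)) for c in cells)
    return total / len(cells)


# Sparse tables stored as {(store, genre): books_sold}; values are invented.
actual = {("store_1", "genre_1"): 3, ("store_2", "genre_2"): 1}
predicted = {("store_1", "genre_1"): 2, ("store_1", "genre_2"): 1}

cells = [(s, g) for s in ("store_1", "store_2") for g in ("genre_1", "genre_2")]
error = mean_abs_error(predicted, actual, cells)
print(error)
```

Storing only the non-empty cells keeps memory use proportional to the filled part of the table, which matters when most cells are empty.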
6. Medicine: diagnosis of diseases
The study considered whether a portable gas chromatograph (a multi-sensor system for detecting the components of gas mixtures) could be used to diagnose stomach diseases by analyzing the patient's exhaled breath. Preliminary experiments had shown encouraging results. It was necessary to verify that the initial success, obtained on a small number of patients (70 persons), was not due to chance.
The study showed that the earlier results were not conclusive, and that the statistical models built on them were unreliable and did not stand up to scrutiny.
Further investigation showed that the device could not be used to diagnose diseases with the necessary degree of certainty.
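A common way to check whether a result obtained on a small sample could be due to chance is a permutation test. The sketch below is a generic illustration, not the procedure used in the study. It shuffles the diagnosis labels many times and counts how often a random labelling separates the groups at least as well as the real one. The readings and the function name are hypothetical.

```python
import random


def permutation_p_value(values, labels, n_permutations=2000, seed=0):
    """Share of random relabelings whose group-mean difference is at least
    as large as the observed one (with the usual +1 correction)."""
    rng = random.Random(seed)

    def mean_gap(lbls):
        pos = [v for v, l in zip(values, lbls) if l]
        neg = [v for v, l in zip(values, lbls) if not l]
        return abs(sum(pos) / len(pos) - sum(neg) / len(neg))

    observed = mean_gap(labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)  # keeps group sizes, breaks any real link
        if mean_gap(shuffled) >= observed:
            hits += 1
    return (hits + 1) / (n_permutations + 1)


# Invented sensor readings: 10 healthy and 10 sick patients, well separated.
healthy = [1.0 + 0.1 * i for i in range(10)]
sick = [5.0 + 0.1 * i for i in range(10)]
values = healthy + sick
labels = [False] * 10 + [True] * 10

p_value = permutation_p_value(values, labels)
print(p_value)
```

A small p-value means random labelings rarely reproduce the observed separation. A large one suggests the apparent success may be an artifact of the small sample.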
7. Bioinformatics: prediction of biophysical properties from the amino acid composition of proteins
In a study of 17 East Siberian strains of the tick-borne encephalitis virus (TBEV), the relationships between the amino acid sequences of proteins and three biophysical properties (invasiveness, thermal stability, and thermal resistance) were investigated. The data table contained gaps. The task was to find the positions in the data and the mutations that determine the biophysical properties of the strains.
• In the table of amino acids, 138 of the 177 gaps were filled. The expected error was 6.2%.
• Eight sections were found whose amino acid composition predicts the invasiveness value with high reliability (correlation coefficient 0.962). A noticeable dependence on the structure of the strains was also found for thermal stability (14 sections; after excluding four large outliers, the correlation was 0.868) and for thermal resistance (3 sections; after excluding five outliers, the correlation was 0.785).
• A weak positive correlation between invasiveness and thermal resistance and a weak negative correlation between thermal stability and thermal resistance were observed. The combinations of sections that proved informative are of interest for further study as markers of the biophysical properties of TBEV strains.
At the request of the data analysis laboratory of the Sobolev Institute of Mathematics, we implemented the laboratory's original algorithms in software. The program is designed for users who are not experts in data analysis. The interface is simple and intuitive. All algorithm parameters have preset values, which frees the user from having to understand the intricacies of the algorithms. At the same time, an experienced user can change the settings as needed to get better results. Much attention is paid to integration with Microsoft Excel®.
This is the professional version of the FRiS-OTEKS package. It is an environment for specialists in data analysis that can be used to study algorithms, to construct various modifications, to develop new methodologies, and to solve non-standard data analysis problems. An analyst can also use the same environment to assess how well particular data analysis techniques suit a given subject area.
An important part of the package is the script editor. Python, a popular general-purpose programming language, is used as the scripting language. It can be used not only to describe how the available algorithms are executed, but also to create new computational units. Some components of the algorithms can also be replaced.
For more efficient use of computing resources, a store of results is used. Results obtained earlier are not re-evaluated when an experiment is repeated. When an algorithm is modified, only the data that depends on the changed parts is recalculated. This is called lazy evaluation (in effect, caching of intermediate results).
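The idea of the store can be sketched as follows. This is a minimal illustration under our own assumptions, not the package's actual implementation. The class `Store`, the method `run` and the explicit `version` tag that marks a modified step are all hypothetical. Each result is stored under a key derived from the step name, its version and the keys of its inputs, so changing one step invalidates only that step and the steps that depend on it.

```python
import hashlib


class Store:
    """Keeps step results so that unchanged steps are never recomputed."""

    def __init__(self):
        self._results = {}
        self.computed = []  # names of the steps that were actually executed

    def run(self, name, version, func, *inputs):
        """Return (key, value); `inputs` are (key, value) pairs from earlier steps."""
        input_keys = [key for key, _ in inputs]
        raw = repr((name, version, input_keys)).encode()
        key = hashlib.sha256(raw).hexdigest()
        if key not in self._results:
            self._results[key] = func(*(value for _, value in inputs))
            self.computed.append(name)
        return key, self._results[key]


def pipeline(store, scale_func, scale_version):
    """A three-step script: load data, transform it, aggregate it."""
    raw = store.run("load", 1, lambda: [1, 2, 3])
    scaled = store.run("scale", scale_version, scale_func, raw)
    total = store.run("sum", 1, sum, scaled)
    return total[1]


store = Store()
first = pipeline(store, lambda xs: [x * 2 for x in xs], 1)   # runs all steps
again = pipeline(store, lambda xs: [x * 2 for x in xs], 1)   # runs nothing
third = pipeline(store, lambda xs: [x * 3 for x in xs], 2)   # reruns scale and sum
print(first, again, third, store.computed)
```

In this sketch the "load" step runs only once, even though the script is executed three times. A production version would derive the version tag from the step's code and persist the results on disk.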
One of the most important directions in the development of the FRiS-Pro package, which is already under way, is the creation of a cloud service and its integration with the package. The user will be able to upload data to the server, run a computing script there, and get the results as if they had been calculated on the user's own machine.