Meta-Learning on QSAR data
Formal Metadata
Number of Parts | 4 | |
License | CC Attribution 3.0 Germany: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor. | |
Identifiers | 10.5446/16277 (DOI) | |
Production Place | Eindhoven, The Netherlands |
Transcript: English (auto-generated)
00:04
Welcome to the OpenML workshop. It's my pleasure to introduce our first speaker, Ross King from Manchester University. He's done amazing work and has published multiple times in Science magazine. You probably know him from his work on robot scientists: a robot that automatically does experiments, tries to interpret observations and then
00:24
adjusts the hypothesis and then runs new experiments with an amazing system. He's also primarily working on proving P equals NP. So today he's going to talk about meta-learning on QSAR data and how OpenML helps with this research. Okay, well, thank you very much. Is the mic working okay? Can you hear it at
00:45
the back? Okay, good. So thank you for the invitation to Eindhoven. It's the first time I've ever been here. I'm going to talk about meta-learning on QSAR data and this work goes back over 20 years to this project called the Statlog project, which
01:05
was one of the first comprehensive comparisons of machine learning methods. Now I was thinking about it on the way here actually. So one of the conclusions from that study
01:21
was that Bayesian networks don't work. They were by far the worst methods that we tested out 20 years ago. But of course that has really changed. Bayesian networks have been a highly successful method. So I'm not completely sure what that means. I'm
01:42
not sure what it means, but it's interesting I think that when we tested them empirically 20 years ago they didn't work very well at all. Now they work well and lots of people spend a lot of time working on them. Not because of their empirical success at the time, but because of their elegant approach to statistics and machine learning. Okay, meta-learning
02:01
on QSAR data. Okay, so first the motivation. It surprises me and sort of disappoints me that many of the best people in machine learning spend their life trying to get people
02:21
to click on certain online adverts, you know, slightly more efficiently than the rival companies. I don't think that's a good use of their talents. And for the younger people here, I don't think you should spend your life doing such a thing. Life is short and you should try and do something useful, I think. Of course it's better than making
02:41
weapons for the military, advertising is, but it's still not particularly something to be proud of, I think. So, parasitic diseases: it's shocking that the world has still got these diseases. They're major diseases. Malaria kills at
03:01
least a million people a year, perhaps two million. We don't really know, because health records are a bit unclear, especially in India actually: if someone dies in a remote village it's not always clear what they died of. It could very well be malaria. Hundreds of millions of people catch malaria every year. Hundreds of millions
03:23
of people catch schistosomiasis as well. It kills tens of thousands of people; this is a horrible parasitic disease from a worm. Malaria is caused by a single-cell parasite, as is leishmaniasis, which kills tens of thousands of people as well and causes horrible
03:40
disfigurement. And Chagas disease, from South America, kills tens of thousands of people as well, mostly through heart disease complications. So these are major diseases out there, and we still need better treatments for them, better drugs. So millions
04:01
of people die from these diseases, hundreds of millions of people suffer infection, they're so called neglected tropical diseases because the pharmaceutical industry in its wisdom has not spent money on them. They are, I'm sorry, in our society they're driven by profits,
04:21
so they think there's not enough money in these diseases. I think actually their modelling is a bit wrong, so let me try to explain why I think that is. How I think they work is that they look at different disease classes and look to see how many rich people or at least people in the western world have them and then they think if I
04:40
could treat type 2 diabetes that would be worth so many billion dollars a year to me. That's how they go about it, I think. I think the fundamental flaw there is that they don't take into account the a priori probability of succeeding in finding a drug to treat, say, type 2 diabetes, because we don't really understand how that disease works, you know,
05:00
it's something complicated to do with the systemic control of insulin but we really don't understand it. So it's very hard to treat a disease you don't understand with a single drug. That contrasts with these parasitic diseases I talked about earlier: we actually know very well how to treat them. We just kill the parasite, and it's not a particularly difficult thing to do because they're very different from human cells,
05:26
you know, the last common ancestor for most of these was hundreds of millions of years ago, perhaps billions of years. So we actually know how to treat them and the pharmaceutical industry could have treated these diseases very easily if they'd just spent the money, but they haven't. So we need in the university sector to be more efficient than the pharmaceutical
05:45
industry because they spend on average something like a billion dollars to find one drug for one disease. We need to be more efficient. Okay, the problem of finding a drug to treat a disease is called drug design. What we want to do is to find a small molecule, a drug
06:02
which will modulate the biological activity of a larger chemical called a protein which will then affect the whole living system. And that's how we treat diseases. Now we find small molecules that specifically bind with proteins. Okay, that's the name of
06:21
the game. So small molecules: these are example small molecules. Ibuprofen is the classical painkiller. You know, you take a couple of hundred milligrams of that if you have a headache, and it works really well. Here's the, so this is one abstract look, often computer
06:45
scientists think of chemicals as some sort of subtype of graph but they're actually sort of three dimensional structures. Here's a sort of space filling model of ibuprofen. You can see the red is oxygen, this is the aromatic ring in the middle, if you remember
07:02
your chemistry. And actually of course they're not static, they actually move around, vibrate. These are proteins, so a protein is of the order of maybe a thousand times bigger than
07:20
the small molecules. And these molecules are going to bind to it at specific places. Okay, so this is the Diamond synchrotron. A synchrotron is a big X-ray machine that makes high-powered X-rays, which are used to get the structures of proteins. So you crystallise a protein, then fire these X-rays at it and you can work out the structure
07:43
of the protein. What's interesting I think is the size of this: these little dots down there are cars, so this is the size of a large football stadium. So in computer science we're not imaginative. If you think that the physicists and the biologists
08:01
managed to build tools this big, we should think about how much we could do for a billion dollars. And the justification for Diamond was to treat diseases, so it was a QSAR-type justification. Though the physicists want it for their own reasons as well.
08:21
Well, here's our big protein, and this is a small molecule interacting with it. That's sort of typical of the scale of the whole thing. Close-up view. And this one is just to show you the level of complexity of the interactions. This is like a cartoon of that protein I showed you earlier interacting with a small molecule.
08:45
So there's lots of spatial interactions, very specific. In fact, if you think about it, when you put a drug into a person's body it's got to target the right place very specifically. You're only going to add a little bit of drug, and it's got to go
09:03
to that particular target and not interfere with anything else, and that can only occur if it's binding very specifically to it. So the probability of it binding there is millions, billions of times higher than anywhere else, and that's what you try to achieve.
09:27
There's a thing called an assay. An assay is a small biological test which gives you a prediction of how well the actual compound is going to do when you give it to a real human being or whatever. So it's a cheaper test than actually giving it to a living human
09:44
being, and also more ethical, because you can't test millions of drugs on human beings. Okay, so it's a simple test. Normally there are two approaches. One is to use pure protein and measure binding on it. So the protein is then called the target. The problem with doing
10:03
that is that you're never sure that when you put this compound into a human being that it's actually going to reach the target, because so many other things can happen in a complicated living system. The other approach is to use whole human cells, and the problem there is that you're never sure what you're hitting when you put something
10:22
in. Is it just the target or what? And both assays are very expensive, so typically a pharmaceutical company will spend a hundred, a couple of hundred thousand euros designing an assay for one of these trials of compounds. Okay, so QSARs, this is machine learning. So a QSAR is
10:43
a quantitative structure-activity relationship. Essentially it's a function where you input the chemical structure and it outputs a real number saying how good that compound is on the assay. Okay, so it's a function: the input to the function is the structure of a small
11:02
molecule and the output is a real number, which is the predicted activity on the assay. Okay, and you typically learn these QSARs to help you design new compounds. So the name of the game is to design a new compound, not just to make a good QSAR. That's not
11:26
the point of it. The point is that you're making new compounds. Okay, and the particular QSAR problem depends on what is known. So you know the small molecule structure, and
11:41
that's the default case. In some cases you know the structure of the target. In some cases you know what the target is, but you don't know how the small molecule is binding with that target. Sometimes you do know how it's binding, like
12:02
we saw in the previous cartoon. Okay, and these are slightly different problems. In general I'm going to talk about just the case when we know the small molecule structure, and we don't know the actual structure of the protein or the extra information. Okay,
12:20
and then there's the problem of how you represent chemical structure. So you have to have some way of taking a picture of ibuprofen, a three-dimensional shape: you have to somehow encode that into something which a machine learning or statistical program can use. So there are descriptors for a table. You can represent the bulk properties of the
12:42
molecule. So log P is essentially how hydrophobic it is, how oily it is. And that's important because it turns out that a molecule must be neither too oily nor not oily enough if it is to be a successful drug. Empirically we know that's the case.
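As a rough illustration of a QSAR as a function from descriptors to a real number, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the name `toy_qsar`, the coefficients and the descriptor values are invented, and the quadratic term in log P is just one simple way to make the predicted activity peak at an intermediate oiliness. It is not a model from the project.

```python
def toy_qsar(log_p: float, mol_weight: float) -> float:
    """Hypothetical QSAR: maps two bulk descriptors to a predicted activity."""
    # Made-up coefficients. The negative squared term makes the prediction
    # peak at an intermediate log P (here at a / (2 * b) = 3.0).
    a, b, c, d = 1.2, 0.2, -0.001, 4.0
    return a * log_p - b * log_p ** 2 + c * mol_weight + d


# Each molecule is one row of the table: (name, log P, molecular weight).
# These values are illustrative placeholders, not measurements.
molecules = [
    ("too_polar", -2.0, 300.0),
    ("intermediate", 3.0, 300.0),
    ("too_oily", 8.0, 300.0),
]

predictions = {name: toy_qsar(log_p, mw) for name, log_p, mw in molecules}

# The molecule with the highest predicted activity under this toy model.
best = max(predictions, key=predictions.get)
```

A learned QSAR would fit such coefficients (or a quite different model form) to measured assay data rather than fixing them by hand.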
13:02
Actually it's a strange story. So the pharmaceutical industry, their whole business is based around putting drugs into people. So you'd think that they would know how drugs get into cells. So a cell's got a membrane around the outside. And what
13:24
they always used to believe was that the reason you want the molecule not to be too oily but oily enough is that it's going to diffuse through the membrane of the cell into the cell. That was always what they told you. That turns out to be wrong. You'd think that they would have learned that a long time ago because it's core to
13:42
their business. It turns out there are actually these proteins which import molecules into the cell and export them. And any drug has to fit into one of these proteins. And one way the pharmaceutical industry should have realized that what they said was wrong is that if you look at the small molecules in the cell, these compounds which
14:01
are actually there, they also have the same amount of oiliness as drugs. So they would have diffused out. It's obvious that would have happened. But I don't know, it's strange I think. They don't really seem to step back and think about what they're doing. They have these rules that say you should make a molecule of this particular oiliness, and they don't think why. Fingerprints. The standard way to do this in the industry
14:27
now is using these fingerprints, which I think is a particularly ugly thing to do, but this is the standard. So what you do is you have maybe 100 to 1,000 Boolean attributes which say something about the molecule. Each attribute says, for instance, is there an oxygen
14:42
in the molecule? Is there an alcohol group? Is there an alkane group? Is there a benzene group? So they have all these complicated questions and for each one you just get yes or no, and you get this long fingerprint, typically at least 100 long, possibly 1,000, which is standard. And that's what they use. Some work is being done on 3D shape, etc. Okay,
15:11
so that's the background of QSARs and we have this project to work on what we're calling meta-QSAR. So the literature on QSARs is vast. Thousands of papers have
15:23
been published. Every possible machine learning method has been applied to the problem and the result of that is not surprising. For some problems some methods work well and for other problems other methods work well. Probably down to some deep bias in the actual
15:46
learning problems. So what we're trying to do in this project is to do some meta-QSAR learning. We're going to apply lots of statistical methods and machine learning methods to QSAR data and look at how well they do on different
16:03
problems and try to figure out why they do what they do. And hopefully there will be some lessons for the pharmaceutical industry and people designing drugs so that we can treat malaria and things better. So that's the basic idea of the project. Okay, so we have different
16:29
databases for QSAR. We're building these sort of intermediate databases which we're going to use in the learning, and we're at the stage where we're sort of building
16:41
the infrastructure for all this. So I'm going to show you some initial results, but these are very initial and we're just showing that we actually can get everything to work. I'm pretty sure every form of statistical machine learning has been applied to QSARs. How they differ is in the a priori assumptions they make about the learning
17:04
task and they assume that the data is going to be represented in the standard way which is a tuple of attributes. Okay, so one thing that made this possible is that
17:22
when I started working on QSARs, also about 20 years ago, I had to input the data myself by hand. I'd read a scientific paper and I'd have to translate that data into the computer by hand, manually. Now there's this database called ChEMBL. Essentially
17:47
what happened was that a private company manually curated it. There's this journal called the Journal of Medicinal Chemistry, which is the top one in the field of medicinal chemistry, the area of chemistry where you design drugs.
18:05
And actually I'm quite proud I have a paper in it, it's a real chemist type journal. So what they did was this private company manually essentially typed all the data from these papers into a big database. So it's based on around about 60,000 publications
18:25
and they manually took all this data and put it into a big database of databases. So each one of the papers, typically in a journal of medicinal chemistry paper you have a description of maybe 100 compounds, maybe less, what the assay was and how well
18:47
the compound did in the assay. That's what a typical paper looks like. And they may well have applied some sort of regression method to that data. So this company sort of collected all this data and they were going to sell it but they went bankrupt.
19:06
Somehow the business plan didn't work, which is good for us because the Wellcome Trust which is this giant medicinal chemist, sorry, this giant medical charity in Britain, they
19:22
sort of stepped in and bought the database and then put it online, so the EBI now has this database. So it's publicly available to anyone who wants to look at it. Has anyone heard of the Wellcome Trust? It's this, I don't know, giant medical
19:42
charity. They're worth, I don't know, tens of billions of pounds. They've never given me a penny in my life. I've applied, I don't know, at least six or seven times. So every time now I apply I double how much I ask for. Because it's like the St. Petersburg
20:03
paradox, but they're infinitely rich as far as I can tell. But we shall see whether I die before they give me the money. So this is a nice database: it's manually extracted, it's very clean. It's got 60,000 publications, 10,000 targets, where a target is one particular
20:29
type of protein they're trying to design drugs against and 12 million activities, one and a half million distinct compounds. So it's a very nice large database and this
20:41
allows us for the first time to really do meta-QSAR work, and there's lots of data there we can actually work on. Okay, so this would be a typical representation of a molecule. We've got the molecular weight, log P, the hydrophobicity, and here
21:04
we've got the long fingerprints of the Boolean descriptors. And as part of the project we want to find out which ones of these are really important and which ones are not. There are many different varieties of fingerprints you could choose and we want to test out which ones work. That's one of the parts of the project. And we've been
21:25
putting together all this complicated IT infrastructure so we have the basic databases here, we have the selection of algorithms, the machine learning algorithms. Here what
21:41
we're calling the bioactivity database is the database which contains the level one machine learning problems. So these are the QSAR problems. And this one over here is the meta-QSAR database, so that's going to describe the meta-level, where each of the problems is one of the examples. And with OpenML, we're going to export at least the basic
22:15
QSAR databases to OpenML. This is a permanent place to keep them. OK, that's what I said.
22:28
The bioactivity database stores the QSAR dataset information and the datasets, and the meta-QSAR database is the meta-dataset. And at the moment these are in MySQL. So we have ambitions
22:41
to put it into a semantic web RDF format. OK, this is what we've been working on. So we have this metaQSAR R package that implements and runs the QSAR models. So it
23:01
tries to put all this together. It takes the data from the medicinal chemistry databases, ChEMBL, etc. It computes the fingerprints; these are the descriptors of the molecules. It calculates some molecular properties; these are also descriptors of the molecules. Those only have to be done once. And that creates these datasets, these roughly 60,000 datasets.
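The steps just listed (compute a fingerprint for each molecule, then assemble one dataset per problem) might be sketched roughly as follows in plain Python. This is only an illustration under loud assumptions: the questions, the assay identifiers and the activity values are invented, grouping by an assay identifier is a guess at how the datasets are partitioned, and the substring tests on SMILES strings are a crude stand-in for the proper substructure matching a cheminformatics toolkit would do. It is not the metaQSAR package's code.

```python
from collections import defaultdict

# Hypothetical yes/no questions, answered crudely on a SMILES string.
QUESTIONS = [
    ("has_oxygen", lambda s: "O" in s),
    ("has_nitrogen", lambda s: "N" in s or "n" in s),
    ("has_aromatic_atom", lambda s: "c" in s),
    ("has_carboxylic_acid", lambda s: "C(=O)O" in s),
]


def fingerprint(smiles: str) -> tuple:
    """Return a tuple of 0/1 answers, one per question."""
    return tuple(int(ask(smiles)) for _, ask in QUESTIONS)


# Made-up activity records: (assay id, SMILES, activity value).
records = [
    ("assay_1", "CC(C)Cc1ccc(cc1)C(C)C(=O)O", 6.1),  # ibuprofen
    ("assay_1", "CCO", 3.2),                          # ethanol
    ("assay_2", "c1ccncc1", 5.0),                     # pyridine
]

# One dataset per assay: rows of (fingerprint, activity).
datasets = defaultdict(list)
for assay_id, smiles, activity in records:
    datasets[assay_id].append((fingerprint(smiles), activity))
```

Each row is then the "tuple of attributes" that a standard regression method expects, with the activity as the value to predict.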
23:28
And we want to learn QSAR models for all the different datasets using different algorithms, using different sets of fingerprints, etc., and learn what's important. We also want to describe the targets, the proteins. We want to see whether there are different
23:45
classes of target. For instance there's a class called GPCRs. These are probably the most important class of targets. These are a particular type of protein which sit in membranes and receive signals. So your eye is based on GPCRs, your nose is based on them. Lots
24:05
of internal signaling in the body is based on them. And they're one of the most important targets. A couple of years ago a person got a Nobel Prize for, actually two people have got Nobel Prizes for finding the structure of GPCRs. One was for retinal,
24:24
the one in your eye, and something like 15 years later for the one in the brain. So most of your brain signaling is done through GPCRs. So there might be something special about GPCRs which influences the learning. So we want to have a look at that as well.
24:42
So it'll take a long time to put all this together into a system that works. Okay, so I'm going to briefly describe, to show that this actually works. We wanted
25:09
to show that the whole system could work together. So we decided to do an initial hundred dataset problems. We're going to just apply our method.
25:21
One brief question during the talk. I mean you said that basically every method has been applied to this kind of data from machine learning, and I assume from the explanation that it's usually a regression technique, right? Well, mostly yes. How important is pre-processing? Assuming that this is right, what I just said, how
25:42
important is pre-processing for these data sets? Is this something you really need to get right, or do you basically know how to do this, and just push the data into the usual regression technique? What do you mean by pre-processing in this case? I don't know, I mean I don't know your data, but... So the data is taken from these papers, and these are quite clean data because each point
26:04
is an expensive biological experiment. So we're not processing after that. I suppose we could sort of put it all on the same scale or something like that, which may be something to think about, you know, but apart from that
26:23
there is a sort of, they're roughly about the same scale anyway, they're not... The data may be of different reliability depending on how much is known about the assay, but that's quite hard to get out, you know. I don't think they really put too much information in when they extracted all this information out.
26:46
So we're assuming that the data is reasonably good and we're not doing anything with it. Sure. This is your presentation of what the data basically looks like.
27:00
It was mostly just the description of the drug itself, right? So the binary fingerprints and some properties of the chemical. Is also the target always known? In these cases, yes, and that's... I believe so.
27:24
Okay, so actually coming back to the pre-processing, so what we have done is that we have collaborators who are proper medicinal chemists at the University of Dundee, and we've taken their version of the ChEMBL dataset
27:43
in that the ones which they think they have confidence in, so they've gone through it and said that, yeah, we really believe this lot. So it's sort of been cleaned up in that sense in that we haven't just applied everything. We've taken data which our collaborators think is the best data.
28:14
Yeah, so we wanted to just to see whether we can get everything to work. We took 100 datasets, 100 small datasets, that's important when you look at it.
28:24
We just didn't want it to take too long. We used the standard fingerprints and the standard descriptors. We applied sequential forward search feature selection to all datasets.
28:40
We used five-fold cross-validation and root-mean-square error for model performance, just to show that everything could work. We took 18 regression methods from the mlr package.
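The evaluation protocol described here (k-fold cross-validation scored by root-mean-square error) can be sketched as follows. This is a minimal illustration using only the Python standard library, not the mlr-based pipeline used in the talk; the one-descriptor linear model, the toy data and all function names are assumptions made for the example.

```python
import math
import random

def rmse(y_true, y_pred):
    """Root-mean-square error between two equal-length sequences."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def fit_linear_1d(xs, ys):
    """Ordinary least squares for y = a + b*x; returns a prediction function."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx if sxx else 0.0
    a = my - b * mx
    return lambda x: a + b * x

def cross_validated_rmse(xs, ys, fit, k=5, seed=0):
    """Average held-out RMSE over k folds."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    scores = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in idx if i not in held]
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        preds = [model(xs[i]) for i in held_out]
        scores.append(rmse([ys[i] for i in held_out], preds))
    return sum(scores) / k

# Toy data (hypothetical): activity is exactly linear in one descriptor.
xs = [float(i) for i in range(20)]
ys = [2.0 * x + 1.0 for x in xs]
cv_score = cross_validated_rmse(xs, ys, fit_linear_1d)
```

Comparing several regression methods then amounts to calling `cross_validated_rmse` once per method on the same folds and ranking the averages.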
29:05
Yes, so basic standard things. This is a sort of pie chart of which method did best on each of the problems.
29:21
I'm not sure if you can put much weight onto this. Standard linear regression worked really well. That's probably something to do with the size of the datasets, I think. If you've got a really small dataset, it's hard to apply something more sophisticated.
29:40
So these are just showing that we could actually get everything working together. This is the average root-mean-square for the different methods. Which one? This is linear regression again.
30:00
What's RVM? What's RVM? It doesn't seem to do worse here. I don't know. Sorry? Probably. You know better. It's doing very badly here. For whatever reason. I have to admit, I haven't used this very often.
30:23
I don't put any weight on these results. Just showing that we can actually get things to work. The handle doesn't fall off when you try to turn it. This is for one particular dataset, the different methods applied. This is the average for the different datasets of all the methods together.
30:44
So some are harder to break than others. Okay, and for the metaQSAR problem, we need to have some way of describing the data.
31:00
At the moment we've just used some really basic ones about the data, which are completely generic, like dimensionality, instance count, things like that.
31:20
And this is the decision tree you get out of it. So it just shows, if you can see, the first choice is mean standard deviation of numerical attributes. Explicit diversity index. But as I say, this is initial results.
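The generic dataset descriptors mentioned here (instance count, dimensionality, mean standard deviation of the numerical attributes) could be computed along these lines. This is a sketch under assumptions: the function name, the dictionary keys and the use of the population standard deviation are choices made for the example, not details given in the talk.

```python
import statistics

def meta_features(rows):
    """Generic dataset descriptors: instance count, dimensionality,
    and the mean standard deviation of the numerical attributes."""
    n_instances = len(rows)
    n_attributes = len(rows[0])
    columns = list(zip(*rows))
    stdevs = [statistics.pstdev(col) for col in columns]
    return {
        "instances": n_instances,
        "dimensionality": n_attributes,
        "mean_stdev": sum(stdevs) / n_attributes,
    }

# Toy dataset (hypothetical): three instances, two numerical attributes.
toy = [(0.0, 1.0), (2.0, 1.0), (4.0, 1.0)]
features = meta_features(toy)
```

A meta-learner such as the decision tree shown in the talk would take one such feature vector per dataset as input and the best-performing method as the label.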
31:45
Just showing that everything works and it can be done. And hopefully in one year's time it will have been done. Okay, I wanted to say something about relational learning.
32:01
So this is where it's all started. So I have a long history of working on relational learning. So trying to represent molecules, not by this sort of fingerprint approach, which I think is remarkably ugly, but using first order methods, using predicate logic.
32:28
And we've been working on this for a long time. And one of the reasons, I didn't put actually in the grant application, but one of the real reasons for doing this work is I really want to test whether relational methods,
32:41
how well they work against all the best regression methods on a large proper dataset. Because no one's ever really compared things. We have our own evidence ourselves when we're playing around with these things that they work pretty well.
33:02
But we've never had enough data to show that. So the nice thing about relational methods for drug design is that you have a nice representation that's really close to what the chemists use. Okay, so drug design and relational learning, we've been working on this for 20 years.
33:25
We have this really nice representation where you can represent the relational structure of the molecule and sort of map it into the logic.
33:41
On the basic level you can just put in the atoms and bonds and the relationships between them and use that as the representation. But you can also add background knowledge about different structural groups. And there's no need to actually do all this fingerprint stuff.
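As a rough illustration of the atoms-and-bonds representation, the sketch below stores a molecule as two sets of facts and asks a structural question of them directly. It is written in Python rather than predicate logic, and the toy molecule, the identifiers and the pattern being searched for are all hypothetical.

```python
# Atoms as atom_id -> element facts, bonds as (atom_id, atom_id, bond_type) facts.
# Toy fragment (hypothetical identifiers): a carbon attached to a nitrogen with two oxygens.
atoms = {"a1": "c", "a2": "n", "a3": "o", "a4": "o"}
bonds = [("a1", "a2", "single"), ("a2", "a3", "double"), ("a2", "a4", "single")]

def neighbours(atom_id, bonds):
    """Atoms bonded to atom_id, in either direction."""
    out = []
    for x, y, _ in bonds:
        if x == atom_id:
            out.append(y)
        elif y == atom_id:
            out.append(x)
    return out

def has_nitro_like_group(atoms, bonds):
    """True if some nitrogen is bonded to at least two oxygens."""
    for atom_id, element in atoms.items():
        if element == "n":
            oxygens = [n for n in neighbours(atom_id, bonds) if atoms[n] == "o"]
            if len(oxygens) >= 2:
                return True
    return False

found = has_nitro_like_group(atoms, bonds)
```

Background knowledge about structural groups corresponds to adding further named queries of this kind on top of the same atom and bond facts.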
34:01
This is some initial work we did showing that you could actually find certain sub-patterns in bigger molecules. So this pattern serves to discriminate between mutagenic and non-mutagenic compounds. So what I'm wanting to do as part of this metaQSAR is to also compare relational methods
34:25
to see how well they do, whether this radically different representation works. It's very nice as well because you can add the three-dimensional stuff, you can add chemical group information. I told you molecules move.
34:43
They're all constantly vibrating. This is important because if you do the physical chemistry of trying to work it out, you probably won't get one minimal structure, you'll have several minimal structures. And you're not sure which one is the one that's actually physically interacting with the protein necessarily.
35:02
So it's, one of these representations is important, but you're not sure which one. So that's an interesting machine learning problem as well. What's that called technically again? I forgot. Where you have different representations of the same instance.
35:25
Is it a multiple instance problem? Yes. Multiple instance? I think, yeah. No, it's multiple representations of the same thing.
35:42
So this was where the problem started in this drug design one. So you can kind of expect these kinds of features and look at the data in this way and this way and this way. No, no, it's the same, same features, but you're not sure which one of these is the correct one. It sounds like multi instance. It is multi instance, yes. My memory has come back. The features are the same, just the values are different.
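The multi-instance setting discussed here, where a molecule is a bag of conformations and only one of them may be responsible for the activity, can be sketched as below. The scoring function and the one-number description of a conformation are hypothetical; the only point illustrated is the standard assumption that a bag is scored by its best instance.

```python
def bag_score(conformations, instance_score):
    """Multi-instance assumption: the molecule is as active as its best conformation."""
    return max(instance_score(c) for c in conformations)

def toy_score(conformation):
    """Hypothetical instance scorer that prefers values near 3.0."""
    return -abs(conformation - 3.0)

# Three low-energy conformations of one molecule, each reduced to one number.
molecule = [1.0, 2.8, 5.5]
best = bag_score(molecule, toy_score)
```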
36:01
Yes. Like a bunch of, like a chain of keys, you know which one fits? Yeah, so this is, this was the problem, the first one that came out of machine learning was this multiple conformations of drugs. Okay, I wanted to say something about the robot scientists we're working on. So what we're trying to do with robot scientists is automate scientific research.
36:29
You represent the problem. Okay, we want to make a computer robotic system which can, in some sense, do its own research. So we have background knowledge about our problem, normally represented in logic.
36:42
We have some way of forming hypotheses, some novel hypotheses about that background data using abduction or induction. In QSAR we're actually going to use induction. We have some way of forming efficient experiments. We have laboratory automation to do the experiments and we cycle around until there's a final theory or we run out of some resource.
37:08
And our robot scientist Eve is designed to do QSAR learning and early stage drug design. Okay, so the whole thing sort of fits together, we want to do the meta-QSAR for Eve.
37:22
These are the diseases we're looking at. These are the actual parasites we want to find drugs which kill. Plasmodium falciparum, Plasmodium vivax. This is an interesting, this is one we've been working on a lot actually.
37:40
So this species here is the one that kills most people, especially children in Africa, falciparum. Most people in the world get vivax, that's more common in South East Asia, South America. It used to be very common in Britain, it was called the ague. I'm sure it used to be very common here, you know, all this water you've got.
38:01
It used to go all the way up to the Arctic Circle because although there's no mosquitoes in the winter, unlike falciparum it can hide in your body over the winter. So fresh infections were caused in the summer by someone overwintering the parasites.
38:22
These are our targets, dihydrofolate reductase. This is my one favourite target in the world. For some reason this is probably the best target in the world. The first anti-cancer drug was against this enzyme.
38:42
If you have a bladder infection you get an antibiotic which targets this enzyme. If you get malaria you're very likely to get a drug which targets this enzyme. It's the most important choke point in living systems. Okay, to formalise it for the robot scientists we use
39:03
graphs and standard cheminformatics methods for the background knowledge. Eve is using Gaussian process modelling to do the QSAR. And we use active learning to decide on efficient experiments.
39:24
So how the pharmaceutical industry does drug design is that they have an assay. I'll have to explain to you what an assay is, it's some cheap test. And then they have a large compound library. Normally this consists of hundreds of thousands of compounds, maybe millions of compounds.
39:41
And what they do is they test every single compound one after the other against the assay. And once they've done that, which typically, even with high-throughput robotics, will still take them weeks, they then look at the active compounds and double-check them with a more expensive assay to make sure each is not a false positive
40:02
because most of the compounds are going to be inactive. And then they do the QSAR learning and make some new compounds to fit the drugs. What Eve does is try to automate these three steps. So Eve starts with a compound library, starts screening them randomly.
40:21
After it's seen enough hits, it stops random screening, goes back, does a more expensive assay, and then learns a QSAR. And then chooses compounds from its library to test that QSAR using active learning. And the hope was that would be more efficient and cost-effective than this sort of stupid way of brute force testing everything.
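The two screening phases described here can be sketched as a small simulation: random screening until enough hits are seen, then repeatedly learning a model from the results and testing the compounds it ranks highest. Everything concrete in the sketch is an assumption made for illustration: compounds are single numbers, the "QSAR" is just distance to the mean of the known hits, and the confirmation step with the more expensive assay is left out.

```python
import random

def eve_style_screen(library, assay, hits_to_stop, n_guided, batch_size, seed=0):
    """Random screening until enough hits are seen, then model-guided testing."""
    rng = random.Random(seed)
    untested = list(library)
    rng.shuffle(untested)
    results = {}  # compound -> True (active) or False (inactive)

    # Phase 1: random screening until enough hits have been seen.
    while untested and sum(results.values()) < hits_to_stop:
        compound = untested.pop()
        results[compound] = assay(compound)

    # Phase 2: learn a toy model from the hits and test what it ranks highest.
    guided = 0
    while untested and guided < n_guided:
        hits = [c for c, active in results.items() if active]
        if not hits:
            break
        centre = sum(hits) / len(hits)
        untested.sort(key=lambda c: abs(c - centre))  # most promising first
        batch = untested[:min(batch_size, n_guided - guided)]
        untested = untested[len(batch):]
        for compound in batch:
            results[compound] = assay(compound)
        guided += len(batch)
    return results

# Hypothetical library of 100 compounds in which descriptors 40..49 are active.
library = range(100)
is_hit = lambda c: 40 <= c < 50
results = eve_style_screen(library, is_hit, hits_to_stop=2, n_guided=30, batch_size=5)
hits_found = sum(results.values())
```

In this toy setting the guided phase recovers every active compound because the hits are clustered in descriptor space; real compound libraries offer no such guarantee.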
40:48
And the idea is that if you can find most of the hits without going through the whole library, you'll save money and time. Because time is very important. If you actually do find a blockbuster drug, a blockbuster is one where you earn at least a billion dollars a year.
41:04
So saving a couple of weeks time on the patent, what is that? That's quite a lot of money. So that's one twenty-fifth of a billion. What's that? That's quite a lot of money. So you really want to do it quickly if you're the pharmaceutical industry because once you've made your patent, time is rolling.
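The arithmetic behind this estimate, assuming a 52-week year and revenue of exactly one billion dollars a year, works out to roughly the figure quoted:

```latex
\frac{2\ \text{weeks}}{52\ \text{weeks/year}} \times \$1\ \text{billion/year}
  = \frac{1}{26}\ \text{billion dollars}
  \approx \$38\ \text{million}
```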
41:29
So it's possible that it's more efficient to do it this way and that's what we were testing. We use Gaussian process models. The nice thing about them is they're generative, which helps with the active learning.
41:44
We wanted to compare this intelligent strategy of choosing compounds from your library, which you think are going to be hits, and which test the QSARs, against just doing everything, which is begin at the beginning and go until you come to the end and stop.
42:01
Can you add one more sentence on how you use the Gaussian processes in this active learning? Is it basically deciding where to do another experiment, which is unlabeled? Yes, so you want to take a compound from your library, which you don't know yet, which you're going to... Okay, how to do the active learning is still a research question. So you do some kind of optimized entropy for that through the Gaussian process, so you take the next one where you're most unsure about?
42:26
No, because unlike in classical active learning, you don't care how well you predict inactive drugs or ones at the low end. So you don't want to minimize your uncertainty down there. It's at the top end you're interested in.
42:43
Is it something like expected improvement that you do with the process? Yeah, we've tried lots of different things, yes. It's some sort of compromise between exploration and optimizing at the top end, but it's not completely clear what's best. Okay, this is what it's saying here. So you need to balance this exploration and this optimization here.
43:10
The approach we used is where we combined high predicted activity and high variance, so we tried to balance the two things together. So this was work with the University of Leuven.
43:25
Another complication is that you want to do it in batch, which makes the computation much, much harder, because it's easy to optimize one, but then if you want to choose the best 64, something that's really hard.
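One common way to combine predicted activity with uncertainty is an upper-confidence-bound style score, the predicted mean plus a multiple of the predicted standard deviation. The sketch below uses that score and a naive batch rule; it is an assumption for illustration, not necessarily the selection rule used for Eve, and the compound identifiers and numbers are made up.

```python
def acquisition(mean, std, beta=1.0):
    """Score a candidate by predicted activity plus a bonus for uncertainty."""
    return mean + beta * std

def pick_batch(predictions, batch_size, beta=1.0):
    """Naive batch: take the top-scoring candidates independently.
    `predictions` maps compound id -> (predicted mean, predicted std)."""
    ranked = sorted(
        predictions,
        key=lambda c: acquisition(*predictions[c], beta),
        reverse=True,
    )
    return ranked[:batch_size]

# Hypothetical model outputs for four untested compounds.
predictions = {"c1": (0.9, 0.1), "c2": (0.5, 0.8), "c3": (0.2, 0.1), "c4": (0.7, 0.2)}
batch = pick_batch(predictions, batch_size=2)
greedy = pick_batch(predictions, batch_size=2, beta=0.0)
```

Taking the top scores independently ignores the fact that similar compounds give redundant information, which is one reason choosing a good batch of 64 is much harder than choosing one compound.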
43:41
Okay, I'll try to explain these diagrams. So this over here is the compounds, and this is the active learning, so that here we're finding compounds faster than randomly by using the sort of active learning, and we do it to completion.
44:04
And this is the cost here, so stopping about here is the most cost effective thing to do. After here you're starting to lose money relative. And this is some sort of exploration of most of the space, so we had this model of how much everything costs, and by playing around with the different costs you can make different things.
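The idea of a most cost-effective stopping point can be illustrated with a very simple cost model: each test has a fixed cost, each hit has a fixed value, and you stop where the net value peaks. The curve and the two prices below are invented for the example; the model explored in the talk has more parameters.

```python
def best_stopping_point(hits_found_curve, cost_per_test, value_per_hit):
    """Return (number of tests, net value) where net value peaks.
    hits_found_curve[i] is the cumulative number of hits after i + 1 tests."""
    best_n, best_net = 0, 0.0
    for i, hits in enumerate(hits_found_curve):
        n_tests = i + 1
        net = hits * value_per_hit - n_tests * cost_per_test
        if net > best_net:
            best_n, best_net = n_tests, net
    return best_n, best_net

# Hypothetical active-learning curve: hits arrive quickly, then dry up.
curve = [1, 2, 3, 3, 4, 4, 4, 4, 4, 5]
stop_at, net = best_stopping_point(curve, cost_per_test=1.0, value_per_hit=3.0)
```

Sweeping `cost_per_test` and `value_per_hit` over a grid is one way to explore a parameter space of this kind.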
44:27
So how much does it cost you to miss one of the active compounds? How valuable would that be? How much does each compound cost? So we explored the parameter space, and in most of the space it is rational to do something more intelligent than
44:42
just try everything, especially if you can do the assays quickly, and you have a very large library. Okay, so we're using Eve's data for MetaQSAR as well. The advantage here is that we've used the same target
45:01
from different species, which is an unusual thing to do for the pharmaceutical industry, so it allows us to compare different things. Okay, this is Eve's hardware. The most interesting thing, I think, is this acoustic liquid handler.
45:20
So it turns out now that if you want to move small amounts of liquid around, the best way to do that is not to use pipette tips anymore, but to use some sort of sonic system, which sort of makes the liquid vibrate, and little droplets, exactly two and a half nanolitres, fly up and land on the plate where you want it to stick to.
45:44
And if you want ten nanolitres, you say four drops please, and four drops are pinged up. And this is much more accurate and much cheaper than using pipette tips. Okay, so I'll try to show a movie here. Okay, this is what, so Eve is about from here to that pillar and about this wide.
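The droplet arithmetic is simple enough to write down directly. The sketch assumes the 2.5 nanolitre droplet size quoted above and treats any volume that is not a whole number of droplets as an error; the function name is hypothetical.

```python
DROPLET_NL = 2.5  # droplet volume in nanolitres, as quoted in the talk

def droplets_for(volume_nl):
    """Number of droplets needed to dispense volume_nl nanolitres."""
    drops, remainder = divmod(volume_nl, DROPLET_NL)
    if remainder:
        raise ValueError("volume must be a whole number of 2.5 nl droplets")
    return int(drops)

four = droplets_for(10)
```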
46:23
And it's got these two robot arms, these Mitsubishi ones, which, they're a smaller version of the ones that build cars. They're very, very precise and accurate in their movements. Now this is the liquid handler, which does the pinging of the droplets.
47:03
And this is what's called a 384 plate, so there's 384 little, small little vessels. Each one will be one of the experiments, one of the tests, and put different drug into different ones. Okay, this comes from the compound library. Each one of these different wells has a different drug in it, which some chemist has made at some point.
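A standard 384-well plate is laid out as 16 rows (A to P) by 24 columns, so a well can be addressed either by a running index or by a name such as B1. The helper below is a hypothetical sketch of that mapping, not part of Eve's software.

```python
import string

ROWS, COLS = 16, 24  # standard 384-well layout: rows A..P, columns 1..24

def well_name(index):
    """Map a zero-based well index (row-major order) to a name such as 'A1'."""
    if not 0 <= index < ROWS * COLS:
        raise IndexError("a 384-well plate has indices 0..383")
    row, col = divmod(index, COLS)
    return f"{string.ascii_uppercase[row]}{col + 1}"

first, last = well_name(0), well_name(383)
```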
47:26
We only have about 15,000 compounds, whereas the pharmaceutical industry has, I'd say, millions. It's their so-called crown jewels.
48:16
Okay, so I haven't discussed the assay, because one of the most successful design parts is the actual assay.
48:24
We use this clever idea from biology to make assays, which allow you to target particular enzymes, but also do it in a living system, which is more robust than human cells.
48:40
So we use yeast as an assay system.
49:32
I'll stop it there. We deliberately have the robots going slowly, because, especially if you move them really fast, it's scary.
49:47
And they may hit something, and also it's more likely to drop something.
50:00
Okay, this is the, we have found lots of new compounds. We've also been working on repositioning drugs. So the idea of repositioning drugs is that you take a compound, which has been shown to be not too dangerous, because they're using it for some other, for some disease X, and you show that it works against disease Y.
50:24
Now this is work we did on Trypanosoma brucei, which is, this is the organism which causes sleeping sickness in Africa. Okay, this is the most exciting thing, is that we found this compound, which is active against malaria,
50:42
a dihydrofolate reductase inhibitor. And it's, I'm really sure it's safe, because it's in a well-known brand of toothpaste. You get it, and I've seen it in mouthwashes. And toothpaste is not that dangerous to eat, you know, children do it all the time.
51:05
Yes, so it's, I'm quite excited with this. And we're just trying to do, so it works best against this malaria called the vivax one I mentioned. The problem with vivax is that we still don't know how to cultivate it in the lab.
51:26
I said this once at a meeting, and someone said, I did it, in my PhD. But what they meant was, if they had a fresh supply of blood locally, you can keep them going for maybe a few days. We still can't cultivate them really well.
51:40
So if you want vivax, you need to go somewhere where there's malaria, so we have this collaboration in Manaus in the Amazon, where people have it, and what's quite shocking is that when they come in, you can often see from the genetics that they've been infected multiple times. There's multiple strains of vivax in them.
52:02
Okay, I should say something about constructive learning. So the point is that we want to really actually make a new compound, not just test compounds from the library. So it's not active learning, we want to, what compound will optimize this particular assay? And that's still an open research question, you know.
52:21
The number of compounds that have been synthesized ever is a few million. The sort of space of compounds you could synthesize is literally astronomical, you know. There are different estimates, but it's like the number of games of chess, it's a ridiculously large number.
52:42
Yeah, so okay, 10 to the 60 is a reasonable estimate of how many compounds you could synthesize. Yeah, and we've only ever made a couple of million compounds in the whole of human chemistry. And what's really nice now is that they get these chemical synthesis robots,
53:00
so you can actually get a machine that can do a lot of chemistry. They can't do everything yet, but most of chemistry they can probably do. But there's a complicated question in machine learning and optimization, how do you decide which compounds to make?
53:21
Because you have to take into account the synthesis aspects of it. Also, how do you optimize this particular QSAR? To finish off, I'm going to talk about, yeah, robot scientists and automation of science.
53:40
So in chess, there's this analogy between chess and science: in chess there is this continuum from beginners to grandmasters, and I think the same is true for science, from the type of science that Eve can do now, to what I can do, to your Einsteins and your Newtons and things.
54:01
And if you believe that there is this continuum, that there's no step function, then robots I think will get better and better at science. And certainly the hardware is getting better, computing and machine learning are getting better, the robotics is getting better, there's very little now that robots can't do in the lab.
54:24
And I think that the collaboration of human-robot scientists together is better than either one on its own. Just like even now in chess, even if my laptop can beat the world champion, human and computers together play better chess than computers do alone.
54:41
And humans and computers playing science can do better than either alone. Ah yes, so the Nobel Laureate Frank Wilczek is on record saying that in a hundred years' time the best physicist will be a machine, which I like, because it obviously means the best scientist of course.
55:03
Computer scientists and biologists don't really count if you're a physicist. I don't know, we shall see, it's a pretty cool thing. Oh, okay, no conclusions. Okay, ah, I'd like to thank my collaborators in Manchester, and Brunel, and Dundee,
55:26
who are on this MetaQSAR project. Ah, the collaborators in Cambridge who've worked on making the assays for the drug design work. Collaborators in Leuven, who helped with the machine learning. And in Aberystwyth, the robotics. And I'd like to thank you for inviting me here, and listening to my talk. Thank you.