Importing Wikipedia in Plone
Formal Metadata

Title: Importing Wikipedia in Plone
Series Title: Plone Conference 2013
Number of Parts: 39
License: CC Attribution 3.0 Unported: You may use, change and reproduce the work or its content for any legal purpose, and distribute and make it publicly available in unchanged or changed form, provided you credit the author/rights holder in the manner specified by them.
Identifiers: 10.5446/47821 (DOI)
Language: English
Plone Conference 2013, part 10 of 39
Transcript: English (auto-generated)
00:09
So, importing Wikipedia in Plone. There is a demo inside. So, what do you think? The ZODB is good at storing objects, okay?
00:24
Plone contents are objects. We store them in the ZODB, so it does work. Okay, no problem. We all do that all the time. That's Plone. But what if you want to store a lot of records? Non-content-ish records, let's say. Like, I don't know,
00:46
addresses, contacts, polls, results, statistics, subscribers, this kind of stuff. Any business-specific structured data. Tiny data, structured,
01:01
not content. Well, you can store them as content anyway. You can create a content type and store them that way. It will work pretty fine as long as you do not have too much data to store. Like, let's say, yeah,
01:20
100,000 is okay, but it's pretty much the maximum. So, another approach: you can store them in an SQL database. Okay, it just works. I mean, that's a good solution. We can do that pretty easily with Zope. Okay, but
01:41
there are two major problems. First one: you need to manage a secondary system. This means you need to deploy it, you need to back up this system, you need to make sure it's secured. Security is just fine in Zope, but when you start putting data outside Zope, then you have to implement security somehow. So it's all a mess.
02:07
Okay, that's problem one. Problem number two: I hate SQL. So basically I can't. That's the way it is. I just cannot digest it. No way.
02:23
So, how could I do that? How could I store many, many, many records in my ZODB? Because I just love my ZODB, I want to stick to it. Is the ZODB strong enough to manage such an amount of data? Is the ZCatalog strong enough to index the data? Because I probably need to index them to be able to search, filter and so on.
02:46
Well, my grandmother always told me that if you want to become stronger, you need to eat your soup. And that was really good advice. She could have been a good ZODB developer, by the way.
03:03
So, where do we find a good soup for Plone? In a souper. So meet souper: souper.plone and souper are two packages which provide storage and indexing
03:21
in your ZODB for tiny records, but a big amount of tiny records. Basically, what is it? It's just a way to record any picklable data in a persistent structure. It's based on ZODB BTrees and it uses
03:42
node.ext.zodb, which is really nice stuff, and it just uses repoze.catalog to index. It has been created by BlueDynamics; those people rock, they are really good, and it's just fantastic. So let me introduce
04:01
souper. It's quite straightforward to use. You can create a soup; the soup is a container for records. You can create as many soups as you want in your Plone. You create a soup, then you create a record and you set attributes. Okay, no big deal. You can store anything which is picklable into records.
04:26
Then you store it that way: soup.add(record), and you're done. Okay, so not really complex. You can have a record inside a record, no problem. So here my record, an address,
04:42
is going to have secondary attributes. That makes no problem, and you can access your records very easily as well: you get your soup and you get your record by its ID. So nothing difficult here.
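A minimal sketch of the usage just described, following souper's documented API; the soup name 'mysoup' and the attribute names are illustrative, and context would typically be the Plone site:

    from souper.soup import get_soup
    from souper.soup import Record

    # get (or create) the soup named 'mysoup' on the given context
    soup = get_soup('mysoup', context)

    # a record holds arbitrary picklable attributes
    record = Record()
    record.attrs['user'] = 'user1'
    record.attrs['text'] = u'foo bar baz'

    # records can be nested: here an address record inside the record
    address = Record()
    address.attrs['city'] = u'Toulouse'
    record.attrs['address'] = address

    # store it; add() returns the record's id
    record_id = soup.add(record)

    # and fetch it back by its id later
    same_record = soup.get(record_id)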
05:02
Then you can query your data using repoze.catalog. Okay, we cannot see everything on the slide, but anyway, you can write a query using those query keywords, or you can also use the CQE format, which is much easier to read.
05:21
Like "user == 'user1' and 'foo' in text", and you're done; it just returns the records. It can be run in lazy mode, of course, and it's very, very efficient. Of course, to do that you need to define some indexes, different types of indexes.
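A sketch of the query styles mentioned, assuming, per repoze.catalog's documented behavior, that CQE strings are accepted alongside query objects; the index names match the illustrative ones used above:

    from repoze.catalog.query import Contains
    from repoze.catalog.query import Eq

    # query with repoze.catalog query objects...
    for record in soup.query(Eq('user', 'user1') & Contains('text', 'foo')):
        print(record.attrs['text'])

    # ...or with the equivalent CQE string, which is easier to read
    for record in soup.query("user == 'user1' and 'foo' in text"):
        print(record.attrs['text'])

    # lazy mode: the real record is only loaded when the wrapper is
    # called, which keeps huge result sets cheap
    for lazy in soup.lazy(Eq('user', 'user1')):
        record = lazy()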
05:40
It's quite similar to the ZCatalog, of course: you have a text index, you have a field index and so on.
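The indexes are defined in a catalog factory registered as a named utility for the soup; a minimal sketch following souper's documented pattern, with the same illustrative attribute names:

    from repoze.catalog.catalog import Catalog
    from repoze.catalog.indexes.field import CatalogFieldIndex
    from repoze.catalog.indexes.text import CatalogTextIndex
    from souper.interfaces import ICatalogFactory
    from souper.soup import NodeAttributeIndexer
    from zope.interface import implementer

    @implementer(ICatalogFactory)
    class MySoupCatalogFactory(object):
        # registered as a named utility under the soup's name, e.g.
        # provideUtility(MySoupCatalogFactory(), name='mysoup')
        def __call__(self, context=None):
            catalog = Catalog()
            # a field index for exact matches on 'user'...
            catalog[u'user'] = CatalogFieldIndex(NodeAttributeIndexer('user'))
            # ...and a full-text index on 'text'
            catalog[u'text'] = CatalogTextIndex(NodeAttributeIndexer('text'))
            return catalog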
06:02
More about souper: a soup container can be moved into a specific ZODB mount point, okay, there is a tool for that, really handy. And it can be shared across multiple independent Plone instances: the same database, let's call it that way, the same soup container can be put into an .fs file and shared if you need it. And it works on Plone and Pyramid, which is a good point.
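The talk does not show the configuration, but a dedicated ZODB mount point is declared in zope.conf roughly like this; the database name, path and mount path are illustrative:

    <zodb_db soups>
        # a separate FileStorage holding only the soup data
        <filestorage>
            path /path/to/var/filestorage/soups.fs
        </filestorage>
        mount-point /Plone/soups
    </zodb_db>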
06:23
So now, well, as you know, I'm managing Plomino. So I tried to put souper into Plomino, basically, so we can use Plomino to build non-content-oriented applications very easily.
06:42
Dylan J just talked about it: you just design a form and you have a structure, and using the structure you can start creating data. The idea of using souper was to be able to manage a huge amount of data. So, originally in Plomino the records (we name them documents) were just ATFolders.
07:07
Okay, so about 30,000 records were kind of the maximum. To improve that we moved to pure CMF, so just a BTree folder, a CMF BTree folder with CMF objects in there, and we are at about,
07:25
yeah, 100,000, which is kind of okay. But if you want to go higher it can be difficult; you have to optimize stuff. You have to make sure you do not have too many indexes in your ZCatalog, because Plomino uses
07:43
a local catalog for every database, and it can get quite big and it slows down everything. Okay. Now, with souper, we can reach millions. So you have millions of records in your ZODB and it just works out really nicely.
08:08
Yeah, okay. Yeah. Sorry, I missed your question.
08:29
The maximum amount of data you can store? No, it was not memory; memory was okay. No, it was just too slow,
08:41
basically. Yeah, querying: when you are trying to filter data or to extract data for some reason, it was just too slow, not usable. But it was working, okay, just too slow. No conflicts, conflicts are okay; memory is fine. So there's no problem like that. Just not usable, basically, that's it.
09:07
So, typical use case: I needed to have 500,000 addresses for a subpart of France and to be able to query them in full text,
09:22
just like when you type an address on Google Maps, and to display the result on a map. Okay, so let's see the demo. So here is my map. It's ugly, okay,
09:40
it's not the real production thing. I enter any address, the name of a city for instance, and I get the result. So that's my address, and it's full-text indexed. I can enter any address in any street, and
10:06
it's really fast. Okay, half a million addresses; try to do that with Archetypes or Dexterity and you're going to crash your Plone site. Okay, it's really responsive, it works really nicely. So that was my
10:20
initial case, and it works. But at this point I decided to see, well, if I can do that, can I go higher, and what's the limit? What's the limit, basically? So, well, I decided to try to figure out something which is known to be big.
10:43
I picked Wikipedia. Everybody knows that Wikipedia is big, but maybe you don't know how many records are in Wikipedia, how many articles are in Wikipedia. I took the dump from last year, 2012, and it was about five and a half million, okay.
11:02
So I said, why not, let's try. I don't specifically need to import Wikipedia into Plone, but let's try and let's see. First I'm going to show you how it behaves with only half a million records. So,
11:22
here we have a DataTables component, you know, where we can rapidly display long lists of contents. We have 400,000 entries here, and it's full-text indexed, and
11:41
let's see how it behaves; let's search for "John", for instance. Okay, it's instantaneous. Okay, really fast. Pagination is working really nicely as well; there is a batch mode in souper which allows me to do that. It's just perfect.
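How such batching might look server-side, as a sketch; it assumes souper's lazy querying with a sort_index keyword as in soup.query, and a 'title' index, both of which are assumptions here:

    from itertools import islice
    from repoze.catalog.query import Contains

    def get_page(soup, queryobject, start, size):
        # run the query lazily and materialize only one page of records
        lazy_records = soup.lazy(queryobject, sort_index='title')
        return [lazy() for lazy in islice(lazy_records, start, start + size)]

    # e.g. the first 25 hits of a full-text search for 'john'
    page = get_page(soup, Contains('text', 'john'), 0, 25)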
12:02
Now, that was easy. With five million it's not the same story. First, it's going to take more time to load the first page. Okay, here we are. Oh yeah, something I want to mention about Wikipedia: Wikipedia
12:26
can be downloaded easily as XML. It's quite a shitty XML format, really difficult to parse, because you have a few XML attributes and then you have a big piece of
12:40
wiki text with a lot of markers. It's not tag-based at all. It's really difficult to parse. And as I had to extract a lot of information from Wikipedia to build this (my objective was to show the connections between the articles, to see what is linked to what), you have to go into this wiki format and parse everything, and it's really long.
13:04
So, the XML file is about 60 gigabytes, and processing 60 gigabytes with Python is not cool. Okay, it's really painful. You have RAM issues, you have really a lot of issues.
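His actual import code is not shown; a common way to keep memory flat while walking such a dump is to stream it with iterparse and clear each processed element. The namespace URI below is the one used by MediaWiki exports of that era and is an assumption:

    import xml.etree.cElementTree as etree

    NS = '{http://www.mediawiki.org/xml/export-0.8/}'

    def iter_articles(path):
        # stream the 60 GB dump one <page> element at a time
        for event, elem in etree.iterparse(path, events=('end',)):
            if elem.tag == NS + 'page':
                title = elem.findtext(NS + 'title')
                text = elem.findtext(NS + 'revision/' + NS + 'text') or u''
                yield title, text
                elem.clear()  # free the processed subtree

    for title, text in iter_articles('enwiki-pages-articles.xml'):
        pass  # create a soup record, parse out the wiki links, etc.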
13:22
And that's something painful and long. And at the end, what you get is that in the first 10 results you have "Fuck You" twice. I mean, it's not cool. It's not cool. Well, that's the way it is; by the way, they are two songs, two hard rock songs. So, in the first ten results,
13:43
when you sort them alphabetically, okay. Well, so now I have my five million; it was five million six hundred thousand. Let's see how it behaves now. Let's try to find "Plone", for instance. See, a few seconds,
14:05
but working. Let's wait. Okay, seven entries. It's not that quick, okay, but that's a lot of data. So now let's
14:20
check those; there is a Plone band, so yeah. What I decided to do is to show all the articles connected to one article, and to make it a little bit fun I built this rendering. We don't see the edges, but there are edges between the different points. So each point is supposed to be an article. The central article is the Plone band, and it's okay,
14:44
quite nice. It's dynamic. That's D3; yes, maybe you know this library, a JavaScript library, it's just fantastic. So now I can click. So here the thing is, I'm requesting, for each article connected to the Plone band, all
15:00
the articles linked to those articles. At the first step I was just querying once to get all the Plone articles, for instance; that's one big query, because it requests on the entire database. Then I'm doing a lot of different small queries, and it works quite nicely. And when I click on any of them,
15:25
so, let's see this, sorry, it's going to load all the articles connected to this article. Okay, and it appends all the nodes everywhere. So that's quite fun. And you get a very big SVG graphic that way, with a lot of stuff.
15:44
Okay, well, I'm not sure that's really useful, okay, but it's kind of fun. It's kind of fun. So that's it. Behind each click here there are a lot of queries on my souper database, and it just works.
16:05
Okay, five million records, and it's okay, as you can see. Questions? Yes? Sorry? The ZODB file size? The cache?
16:26
Sorry, there is only one object, the soup object. Okay, I made no change. These are the default settings, default settings for everything, no optimization, no
16:43
clustering, nothing, just a basic instance, regular settings for everything. So, let's move back to the presentation. My conclusion: well, the usage performances are
17:01
acceptable. They are very good for, let's say, a million entries, no problem. They are specifically good with tiny records; the Wikipedia records were kind of big compared to addresses, for instance, probably too big. That's why the performance is not that good, but it's kind of usable anyway.
17:21
Okay, and the Plone performances are totally not impacted. So you can put this kind of big database into a Plone site and it will not make any change for the rest of your Plone features. So use it, use souper, it's just fantastic. Really easy to use, really easy to install.
17:41
You should use it for your developments, for your products and so on. A few thoughts: maybe we could build a REST API on top of souper. It could be useful; we could imagine accessing it via JavaScript stuff on the front end to store anything
18:02
into the backend transparently, without bothering Plone. From any package we could just call the soup and put stuff, read stuff, get stuff, etc. via a REST API. That could be cool.
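Such a REST layer does not exist; as a thought experiment, a read-only endpoint could be as small as this hypothetical Plone browser view, where the view name, the soup name and the pass-through of CQE strings are all assumptions:

    import json
    from Products.Five.browser import BrowserView
    from souper.soup import get_soup

    class SoupQueryView(BrowserView):
        # hypothetical endpoint: /plone/@@soup-query?q=user == 'user1'

        def __call__(self):
            soup = get_soup('mysoup', self.context)
            cqe = self.request.form.get('q', '')
            results = [dict(record.attrs.items())
                       for record in soup.query(cqe)]
            self.request.response.setHeader('Content-Type', 'application/json')
            return json.dumps(results)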
18:25
One of my problems during the work on this demo was the import of the Wikipedia content. Massive imports were quite painful. I had to split it into small files, about half a million records each time, because it was eating all my RAM, and so,
18:40
even with intermediary transaction savepoints, this kind of stuff, it was really difficult. So that's something we need to improve, because, for instance regarding the address stuff, that's a typical thing I would have done with Elasticsearch, and with Elasticsearch you can import a lot of data like that really quickly. So that's something that could be improved at that point. I don't know if it's possible or not.
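A minimal sketch of the savepoint approach he describes, assuming rows is any iterable of attribute dicts; the soup name and batch size are illustrative:

    import transaction
    from souper.soup import get_soup
    from souper.soup import Record

    def bulk_import(context, rows, flush_every=10000):
        soup = get_soup('wikipedia', context)
        for count, row in enumerate(rows, 1):
            record = Record()
            for key, value in row.items():
                record.attrs[key] = value
            soup.add(record)
            if count % flush_every == 0:
                # flush modified objects to disk so the ZODB cache
                # can evict them between batches
                transaction.savepoint(optimistic=True)
        transaction.commit()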
19:04
I've looked, I've seen the code; right now I don't see how we could improve it, but there is probably a way. That's something we could maybe discuss with the BlueDynamics people. Well, that's it. So, thank you. Questions? Yes.
19:26
I'll give you the mic. Yeah, maybe you could just defer indexing when you're importing: just import everything and then index at the end, and not index after every insert, right? I'm guessing this thing is indexing after every insert.
19:46
That's something I tried, and that's something I did, by the way. It does help, but it's not the whole thing. Um, is this already integrated in Plomino or not? What do you have to install or do extra?
20:05
It's not already released; the regular Plomino version is still working with CMF, and this version should be, I don't know, I haven't planned a release right now, because I have broken a lot of stuff in Plomino to make it work,
20:22
and my objective is to try to isolate the storage layer in Plomino so we can plug it into souper, or into regular CMF objects, or into SQL or whatever; not SQL, because I hate it, but if someone wants to do it, it will be easy. So that's work
20:40
I want to do before just releasing souper here. My problem compared to the current Plomino features is that I am not able to store files, okay, attach files to documents. So I plan to have a separate BTree folder to store the files related to each document, and this is something I need to do before we can imagine any release. But that's not a big rock;
21:05
I think it's okay, so that's something that will happen probably next year, I mean, for sure. Anybody else? No? Well, thank you.