
Importing Wikipedia in Plone


Formal Metadata

Title
Importing Wikipedia in Plone
Series Title
Number of Parts
39
Author
Contributors
License
CC Attribution 3.0 Unported:
You may use, change, and copy, distribute and make publicly available the work or content in unchanged or changed form for any legal purpose, as long as you credit the author/rights holder in the manner they specify.
Identifiers
Publisher
Publication Year
Language

Content Metadata

Subject Area
Genre
Abstract
Plone Conference 2013 and talks from the 9th Brazilian Python Conference (PythonBrasil[9]) - Brasília / Brazil
Transcript: English (automatically generated)
So, importing Wikipedia in Plone. There is a demo inside. So, what do you think? The ZODB is good for storing objects, okay? Plone contents are objects; we store them in the ZODB. So it does work. Okay, no problem, we all do that all the time. That's Plone. But what if you want to store a lot of records? Non-content-ish records, let's say. Like, I don't know: addresses, contacts, poll results, statistics, mailing subscribers, this kind of stuff. Any business-specific structured data. Tiny data, structured, not content.

Well, you can store them as content anyway. You can create a content type and store them that way. It will work pretty fine as long as you do not have too much data to store. Like, let's say, 100,000 is okay, but it's pretty much a maximum. So, another approach: you can store them in an SQL database. Okay, it just works. I mean, that's a good solution, and we can do that pretty easily with Zope. Okay, but there are two major problems. First one: you need to manage a secondary system. This means you need to deploy it, you need to back up this system, you need to make sure it's secured. Security is just fine in Zope, but when you start putting data outside Zope, then you have to implement security somehow. So it's all a mess. Okay, that's problem number one. Problem number two: I hate SQL. So basically, I can't. That's the way it is. Maybe I just cannot digest it. No way.
So, how could I do that? How could I store many, many, many records in my ZODB? Because I just love my ZODB, I want to stick to it. Is the ZODB strong enough to manage such an amount of data? Is the ZCatalog strong enough to index the data? Because I probably need to index them, to be able to search, filter, and so on.

Well, my grandmother always told me that if you want to become stronger, you need to eat your soup. And that was really good advice; she could have been a good ZODB developer, by the way. So, where do we find a good soup for Plone? In a souper! So meet souper: souper.plone and souper are two packages which provide storage and indexing in your ZODB for tiny records, but a big amount of tiny records. Basically, what is it? It's just a way to record any picklable data in a persistent structure. It's based on ZODB BTrees, and it uses node.ext.zodb, which is really nice stuff, and it just uses repoze.catalog for indexing. It has been created by BlueDynamics; those people rock, they are really good. And it's just fantastic.
So let me introduce souper. It's quite straightforward to use. You can create a soup; the soup is a container for records. You can create as many soups as you want in your Plone site. You create a soup, then you create a record, and you set attributes. Okay, no big deal. You can store anything which is picklable into records. Then you store it that way: soup.add(record), and you're done. Okay, so not really complex. You can have a record inside a record, no problem: here my record has an address which itself has secondary attributes, and it makes no problems. And you can access your records very easily as well: you get your soup and you get your record by its ID. So nothing difficult here.
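To make that concrete, here is a minimal sketch of that workflow following souper's documented API (the soup name 'mysoup' is illustrative, and the context is assumed to be a suitable soup root, which souper.plone wires up for a Plone site):

```python
from souper.soup import get_soup
from souper.soup import Record

def store_example(context):
    # get (or create on first access) a soup: a container for records
    soup = get_soup('mysoup', context)

    # a record stores any picklable values in its attrs mapping
    record = Record()
    record.attrs['user'] = 'user1'
    record.attrs['text'] = u'foo bar'

    # records can nest: an address record inside the main record
    address = Record()
    address.attrs['city'] = u'Nantes'
    record.attrs['address'] = address

    record_id = soup.add(record)   # add returns the record's id
    return soup.get(record_id)     # fetch it back by id
```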
Then you can query your data using repoze.catalog. Okay, we cannot see everything on the slide, but anyway: you can write a query using those query keywords, or you can also use the CQE format, which is much easier to read. So like user == 'user1' and 'foo' in text, and you're done. It just returns the records. It can be run in lazy mode, of course, and it's very, very efficient.
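As a sketch, the two query styles look roughly like this, assuming a soup with 'user' and 'text' indexes (defined as in the next example) and souper's documented lazy helper:

```python
from repoze.catalog.query import Contains
from repoze.catalog.query import Eq
from souper.soup import get_soup

def query_example(context):
    soup = get_soup('mysoup', context)

    # query-object style
    for record in soup.query(Eq('user', 'user1') & Contains('text', 'foo')):
        print(record.attrs['user'])

    # CQE string style: the more readable form mentioned above
    for record in soup.query("user == 'user1' and 'foo' in text"):
        print(record.attrs['user'])

    # lazy mode yields results lazily instead of building a full list
    return soup.lazy(Eq('user', 'user1'))
```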
So of course, to do that, you need to define some indexes, different types of indexes. It's quite similar to the ZCatalog: you have a text index, you have a field index, and so on.
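For reference, a sketch of such index definitions using souper's catalog factory pattern (the index names 'user' and 'text' are illustrative; per souper's documentation the factory is registered as a named utility matching the soup name):

```python
from repoze.catalog.catalog import Catalog
from repoze.catalog.indexes.field import CatalogFieldIndex
from repoze.catalog.indexes.text import CatalogTextIndex
from souper.interfaces import ICatalogFactory
from souper.soup import NodeAttributeIndexer
from zope.interface import implementer

@implementer(ICatalogFactory)
class MySoupCatalogFactory(object):
    """Builds the catalog for 'mysoup'; register as a named utility."""

    def __call__(self, context=None):
        catalog = Catalog()
        # field index for exact matches, text index for full-text search
        catalog[u'user'] = CatalogFieldIndex(NodeAttributeIndexer('user'))
        catalog[u'text'] = CatalogTextIndex(NodeAttributeIndexer('text'))
        return catalog
```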
More about souper: a soup container can be moved into a specific ZODB mount point, okay? There is a tool for that, really handy. And it can be shared across multiple independent Plone instances: the same database, let's call it that way, the same soup container, can be put into a Data.fs file and shared if you need it. And it works on Plone and Pyramid, which is a good point.
So now, well, as you know, I'm maintaining Plomino. So I tried to put souper into Plomino, basically, so we can use Plomino to build non-content-oriented applications very easily. Dylan Jay just talked about it: you just design a form, and you have a structure, and using that structure you can start creating data. And the idea, using souper, is to be able to manage a huge amount of data. So originally in Plomino, the records (we name them documents) were just ATFolders. Okay, so about 30,000 records was kind of the maximum. To improve that, we moved to pure CMF: just a BTree folder, a CMF BTree folder, with CMF objects in there, and we were at about, yeah, 100,000 is kind of okay, but if you want to go higher it can be difficult. You have to optimize stuff. You have to make sure you do not have too many indexes in your ZCatalog, because Plomino uses a local catalog for every database, and it can get quite big, and it slows everything down. Okay. Now, with souper, we can reach millions. So you can have millions of records in your ZODB, and it just works really nicely.
[Audience question, inaudible]

Yeah, okay. Yeah, sorry, I missed your question. The maximum amount of data you can...? No, it was not memory; memory was okay. No, it was just too slow, basically. Yeah, querying, when you are trying to filter data or to extract data for some reason, it was just too slow, not usable. But it was working, okay, just too slow. No conflicts, that's okay; memory is fine, so there was no problem like that. Just not usable, basically. That's it.
So, typical use case: I needed to have 500,000 addresses for a subpart of France, and to be able to query them in full text, just like when you type an address on Google Maps, and to display the result on the map. Okay, so let's see the demo. So here is my map. It's ugly, okay, it's not a real production thing. I enter any address, the city of Nantes if you know it, and I get the result. So that's my address, and it's full-text indexed. I can have any address, any street, and it's really fast. Okay, half a million addresses: try to do that with Archetypes or Dexterity and you're gonna crash your Plone site. Okay, this is really responsive. It works really nicely. So that was my initial use case, and it works. But at this point I decided to see, well, if I can do that, can I go higher? What's the limit, basically? So, well, I decided to try to pick something which is known as being big. I picked Wikipedia.
Everybody knows that Wikipedia is big, but maybe you don't know how many records are in Wikipedia, how many articles are in Wikipedia. I took the dump from last year, 2012, and it was about five and a half million, okay? So I said, why not, let's try. I don't specifically need to import Wikipedia into Plone, but let's try and let's see. First, I'm gonna show you how it behaves with only half a million records. So here we have a DataTables component, you know, where we can rapidly display a long list of contents. We have 400,000 entries here, it's full-text indexed, and it behaves... let's search for 'John', for instance. Okay, it's instantaneous. Okay, really fast. Pagination is working really nicely as well; there is a batch mode in souper which allows me to do that. It's just perfect.
Now, that was easy. With five million it's not the same story. So first, it's gonna take more time to load the first page. Okay, here we are. Oh yeah, something I want to mention about Wikipedia: Wikipedia can be downloaded easily as XML. It is a quite shitty XML format, really difficult to parse, because you have a few XML attributes and then you have a big piece of wiki text with a lot of markers. It's not tag-based at all. It's really difficult to parse. And as I had to extract a lot of information from Wikipedia to build this (my objective was to show the connections between the articles, so to see what is linked to what), you have to go into this wiki format and parse everything, and it's really long. And the XML file is about 60 gigabytes, and processing 60 gigabytes with Python is not cool. Okay, it's really painful. You have RAM issues, you have really a lot of issues. Okay, painful and long. And at the end, what you get is that in the first ten results, when you sort them alphabetically, you have 'Fuck You' twice. I mean, it's not cool. It's not cool. Well, that's the way it is. By the way, they are two songs, two hard rock songs, in the first ten results.
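For what it's worth, the usual way to keep RAM bounded with a dump that size is streaming rather than tree parsing; a minimal sketch with the standard library (the namespace URI varies with the dump version, so treat it as an assumption):

```python
import xml.etree.ElementTree as ET

NS = '{http://www.mediawiki.org/xml/export-0.8/}'  # depends on dump version

def iter_pages(path):
    """Stream <page> elements from a MediaWiki XML dump without loading it."""
    for event, elem in ET.iterparse(path, events=('end',)):
        if elem.tag == NS + 'page':
            title = elem.findtext(NS + 'title')
            text = elem.findtext(NS + 'revision/' + NS + 'text')
            yield title, text
            elem.clear()  # drop the subtree we just consumed to free memory
```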
Well, so now I have my five million; it was five million six hundred thousand. Let's see how it behaves now. Let's try to find 'Plone', for instance. See, a few seconds... but it's working, let's wait. Okay, seven entries. It's not that quick, okay, but that's a lot of data.
So now, let's check the Plone demo. So, yeah, what I decided to do is to show all the articles connected to one article, and to make it a little bit fun, I built this rendering. We don't see the edges here, but there are edges between the different points. So each point is supposed to be an article. This is the central article, the Plone band, and it's okay, quite nice. It's dynamic. So that's D3, d3.js; maybe you know this library, a JavaScript library which is just fantastic. So now I can click. So here the thing is, I'm requesting, for each article connected to the Plone band, all the articles linked to those articles. At the first step I was just querying once to get all the Plone articles, for instance. So that was one query, a big query, because it was requesting over the entire database. Now I'm doing a lot of different small queries, and it works quite nicely. And when I click on any of them, let's see this one... sorry... it's gonna load all the articles connected to this article, okay, and it appends all the nodes everywhere. So that's quite fun, and you get a very big SVG graphic that way, with a lot of stuff. Okay, well, I'm not sure that's really useful, okay, but it's kind of fun. It's kind of fun. So that's it. Behind each click here there are a lot of queries on my souper database, and it just works.
Okay, five millions of records, and it's okay, as you can see.

[Audience question, inaudible]

Yes, sorry? Sorry, the ZODB file is... the cache? Sorry, there is only one object, the soup object. Okay, I made no changes. These are the default settings, default settings for everything: no optimization, no clustering, nothing, just a basic instance, regular settings for everything.

So, let's move back to the presentation. My conclusion: well, usage performances are acceptable. They are very good for, let's say, a million entries, no problem. They are specifically good with tiny records; the Wikipedia records were kind of big compared to addresses, for instance, probably too big. That's why the performance is not that good, but it's kind of usable anyway. Okay, and the Plone performance is totally not impacted. Okay, so you can put this kind of big database into a Plone site and it will not make any change for the rest of your Plone features. So use it, use souper. It's just fantastic, really easy to use, really easy to install. You should use it for your developments, for your products, and so on.
A few thoughts. Maybe we could build a REST API on top of souper. It could be useful: we could imagine accessing it via JavaScript stuff on the front end to store anything into the back end transparently, without bothering Plone. Any package would just call the soup and put stuff, read stuff, get stuff, etc., via the REST API. Could be cool. One of my problems during this work on this demo was the import of the Wikipedia content. Massive imports were quite painful. I had to split it into small files, of about half a million records each time, because it was eating all my RAM. And even with intermediary transactions, savepoints, this kind of stuff, it was really difficult. So that's something we need to improve, because I have real uses for it, for instance regarding the address stuff. That's a typical thing I would have done with Elasticsearch, for instance, and with Elasticsearch you can import a lot of data like that really quickly. So that's something that could be improved.
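A sketch of the chunked import with intermediary commits described here, so the ZODB cache and RAM stay bounded (the soup name 'wikipedia' and the chunk size are illustrative):

```python
import transaction
from souper.soup import get_soup
from souper.soup import Record

def import_rows(rows, context, chunk=10000):
    soup = get_soup('wikipedia', context)
    for count, row in enumerate(rows, 1):
        record = Record()
        for key, value in row.items():
            record.attrs[key] = value
        soup.add(record)
        if count % chunk == 0:
            # an intermediary commit (or transaction.savepoint(True))
            # releases memory held by the current transaction
            transaction.commit()
    transaction.commit()
```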
At that point, I don't know if it's possible or not. I've looked, I've seen the code; right now I don't see how we could improve it, but there are probably ways. That's something we could maybe discuss with the BlueDynamics people. Well, that's it. So, thank you. Questions? Yes.
I'll give you the mic.

[Audience] Yeah, maybe you could just defer indexing when you're importing: just import everything and then index at the end, and not index after every insert, right? I'm guessing it's indexing after every insert.

[Speaker] That's something I tried, and that's something I did, by the way. It does help, but it's not the whole thing.
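For reference, souper does ship bulk reindexing helpers that fit this suggestion (a sketch based on souper's documented interface; whether per-add indexing can be skipped entirely depends on souper internals):

```python
from souper.soup import get_soup

def bulk_reindex(context):
    soup = get_soup('mysoup', context)
    # reindex all records in one pass after a bulk import
    soup.reindex()
    # rebuild() recreates the catalog from the factory and then
    # reindexes everything; useful after changing index definitions
    soup.rebuild()
```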
[Audience] How is this integrated in Plomino? What do you have to install, do you need anything extra?

[Speaker] It's not released yet. The regular Plomino version is still working with CMF, and this version... I don't know, I haven't really planned a release now, because I have broken a lot of stuff in Plomino to make it work. And my objective is to try to isolate the storage layer in Plomino, so we can plug it into souper, or into regular CMF objects, or into SQL or whatever. Not SQL, because I hate it, but if someone wants to do it, it will be easy. So that's work I want to do before just releasing it. Here my problem, compared to the current Plomino features, is that I am not able to store files, okay, attached files in documents. So I plan to have a separate BTree folder to store the files related to each document. This is something I need to do before we can imagine any release, but that's not a big rock, I think. So that's something that will occur probably next year, I mean, for sure. Anybody else? No? Well, thank you.