| Svelte Hacker News

points by keane 5 years ago

Example notation for the project, called AbstractText:

————

Input 1:

Subclassification(Wikipedia, Encyclopedia)

Result 1:

English: Wikipedias are encyclopedias.

German: Wikipedien sind Enzyklopädien.

————

Input 2:

  Article(
   content: [
     Instantiation(
       instance: San Francisco (Q62),
       class: Object_with_modifier_and_of(
         object: center,
         modifier: And_modifier(
           conjuncts: [cultural, commercial, financial]
         ),
         of: Northern California (Q1066807)
       )
     ),
     Ranking(
       subject: San Francisco (Q62),
       rank: 4,
       object: city (Q515),
       by: population (Q1613416),
       local_constraint: California (Q99),
       after: [Los Angeles (Q65), San Diego (Q16552), San Jose (Q16553)]
     )
   ]
 )

Result 2:

English: San Francisco is the cultural, commercial, and financial center of Northern California. It is the fourth-most populous city in California, after Los Angeles, San Diego and San Jose.

German: San Francisco ist das kulturelle, kommerzielle und finanzielle Zentrum Nordkaliforniens. Es ist, nach Los Angeles, San Diego und San Jose, die viertgrößte Stadt in Kalifornien.

————

I didn’t understand quite what the proposal was until I saw these examples from https://meta.wikimedia.org/wiki/Abstract_Wikipedia/Examples
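
To make the notation above concrete, here is a minimal Python sketch of how a constructor such as Subclassification might be rendered per language. It is not the project's actual design: the `Subclassification` dataclass, the `PLURALS` and `PATTERNS` tables, and the `render` and `join_conjuncts` helpers are all hypothetical names made up for illustration. The conjunct joiner mirrors the "cultural, commercial, and financial" / "kulturelle, kommerzielle und finanzielle" phrasing in Result 2.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Subclassification:
    """Hypothetical stand-in for the Subclassification constructor."""
    sub: str     # concept key of the narrower class
    super_: str  # concept key of the broader class


# Hypothetical per-language lexicon: plural surface forms for each concept key.
PLURALS = {
    "en": {"Wikipedia": "Wikipedias", "Encyclopedia": "encyclopedias"},
    "de": {"Wikipedia": "Wikipedien", "Encyclopedia": "Enzyklopädien"},
}

# Hypothetical per-language sentence patterns for this one constructor.
PATTERNS = {
    "en": "{sub} are {sup}.",
    "de": "{sub} sind {sup}.",
}


def render(node: Subclassification, lang: str) -> str:
    """Look up language-specific word forms and fill the language's pattern."""
    lex = PLURALS[lang]
    return PATTERNS[lang].format(sub=lex[node.sub], sup=lex[node.super_])


def join_conjuncts(items: list[str], lang: str) -> str:
    """Join a list of modifiers; this sketch uses a serial comma in English only."""
    conj = {"en": "and", "de": "und"}[lang]
    if len(items) < 3:
        return f" {conj} ".join(items)
    last_sep = ", " if lang == "en" else " "
    return ", ".join(items[:-1]) + f"{last_sep}{conj} " + items[-1]


node = Subclassification("Wikipedia", "Encyclopedia")
print(render(node, "en"))
print(render(node, "de"))
print(join_conjuncts(["cultural", "commercial", "financial"], "en"))
print(join_conjuncts(["kulturelle", "kommerzielle", "finanzielle"], "de"))
```

Even this toy version shows where the work goes: the abstract input stays the same, while every language needs its own lexicon, agreement rules, and list conventions.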

aaron695 5 years ago

This example is quite childlike.

It's what bright children in the 70's learning about computers thought.

50 years later they haven't solved it because it doesn't work that way.

Is there a real example not using proper nouns?

What counts as a city, and what its population is, changes with country, language, and time. Town X with population Y might elsewhere be considered Village X with population Z, because in some countries the population figure includes the rural parts; the population of San Francisco might be counted differently in another country.

The rabbit hole goes on forever, and more importantly it's been tried constantly for over 50 years.

Machine translation, by contrast, is amazing compared to 50 years ago and getting better, and you can see how it could be integrated better with Wikipedia (it's already used), yet it's tossed out in the white paper for no good reason I can see. There are also lots of things like Duolingo-style methods that you could look at.

LudwigNagasena 5 years ago

English and German have very similar vocabulary and syntactic structure. So this example is not very elucidating. Comparing it to Chinese, Turkish or Javanese would probably be better.

tdy721 5 years ago

And worthless for most of us English speakers...
popinman322 5 years ago

This maps rather nicely to something like Grammatical Framework [0]. I wonder whether they'll adopt an existing project for translation; getting things into this graph form seems like the hard part, honestly.
As far as the comparison goes, it should be easy enough to map the trees from the abstract form into language-specific trees. Were you hoping to understand the current limitations? Maybe get a benchmark of the state of things that updates automatically as the project continues?
[0]: https://www.grammaticalframework.org/
numpad0 5 years ago

I think the syntax needs to be derived, not designed. This one here is just English.
perl4ever 5 years ago

>English and German have very similar vocabulary and syntactic structure
Hm, the sentences are structured in a parallel way, but is that really proper German? I don't remember anything from high school German class, but people make jokes about putting the verb way at the end. Or is that an obsolete style?
- prennert 5 years ago
  
  It's actually great German. Syntactically sophisticated. I am surprised by the use of the subsentence (not sure what the proper name for this is) which puts the three larger cities in the middle of the last sentence.
  It would have been possible to place the three larger cities at the end of the sentence similar to the English example. This would have sounded a bit more bot-like, and was somehow what I expected.
  So seeing this particular German example is actually quite a good example showing the power of this approach.

StavrosK 5 years ago

I wonder what happens in more literal languages, where "center" doesn't mean "main area".

willbudd 5 years ago

Hopefully some word sense index is applied (or implied).
- sbergot 5 years ago
  
  This is one big hurdle I think. If one has to refer to the English meaning of words for the whole project to work, then how is this different from just writing the whole thing in English and translating everything from that?
dan-robertson 5 years ago

Also the grammars of English and German are pretty similar. How well would it scan in other languages? Perhaps “well enough” is sufficient.
- shadowgovt 5 years ago
  
  The key idea is that if the semantic description is abstracted enough, a grammar engine can convert the ideas encoded in it into the right structure for the language.
  Not all languages have "X is Y" constructs, but all known human languages have some structure to declare that object X has property Y. Capture the idea "Object X has property Y" in your semantic language, and a grammar engine can wire that down to your target language.
  The largest risk is that the resulting text will be dry as hell, not that it's an impossible task.
  
  popinman322 5 years ago
  
  Being dry doesn't diminish the value of the text, though. Very exciting.
  I'd also be worried about ambiguity; humans can (sometimes) detect when they may be parsed the wrong way in context. I wonder if there will be a way to flag results that don't properly convey the data. How would that be integrated into the generator? (There's probably an answer in the literature.)
  Lots of fun questions to explore.
  
  knolax 5 years ago
  
  The main problem is that language X has an implicit definition of Foo, which is similar but not identical to language Y's definition of Bar. This might work when the languages share common ancestry like German and English, where Foo and Bar are both descended from Baz and have similar meanings, but will not work when you try to translate to language Z, whose speakers have a different word Foobar which has a meaning that encompasses Baz and Qux but excludes Xyzzy, and which has a completely different connotation.
- dmortin 5 years ago
  
  Try Finnish or Hungarian, for example.
  
  Ekaros 5 years ago
  
  Finnish would likely work, though it would require very extensive rules on declensions. Some compound word and list rules are also fun... Finnish is rather liberal in word order, but that's a simple fix.
  What is hard is that the conjuncts do not have unique identifiers in the example. That is an essential thing to have, as there are plenty of synonyms and meanings might change. The same applies to center.
Vinnl 5 years ago

But the word/concept 'Center' does not appear anywhere in the input data, as far as I can see? It just lists a number of things for which SF ranks highly, and whether that means you call it a 'center' is up to the template writer - unless I'm misreading.
- hobofan 5 years ago
  
  Yes it does. Line 6: "object: center"
  
  Vinnl 5 years ago
  
  Well... I retract my previous comment then. Thanks for pointing it out.
  (I blame viewing it on mobile.)
spupy 5 years ago

In Input 2 "center" is a keyword, because the markup is using English for keywords. The example output just happens to be in English as well. I assume it will be mapped to a more appropriate word in another language.
mormegil 5 years ago

Yes, this. Well, in this case, the solution is obvious: you need to have two separate concepts for center. But…
When I first learned about the OmegaWiki project (called WiktionaryZ then, I think), I was thrilled. It tried to represent lexical (Wiktionary) definitions and other language concepts using data. For each sense of each word, a so-called DefinedMeaning was created. In the same sense, Wikidata has its entities. But soon, I learned about a problematic aspect of OmegaWiki’s concept, and the same thing appears on Wikidata: You represent some set of concepts in a single language, then another language comes and needs to split some concepts in two, because your language uses one word for both, but the other differentiates between them. Then, a third language comes and it maps its concepts to your existing set still a bit differently, so you might get four entities for just three languages. Etc.
On Wikidata, more focus is, I guess, on “concrete” entities: people, places, etc., where this does not appear that often. But it contains the abstract entities as well, and the problem appears there all the time. You might try to “fix” the problematic entities by splitting them into more elementary ones, linked using “subclass of” etc.; in some cases it might work quite fine (but losing the interwiki links in the process, which is unfortunate, given those were the original use case of Wikidata), in others, it is basically impossible without a degree in philosophy and deep understanding of ten languages, to be able to correctly distinguish and represent their relations. And imagine somebody trying to _use_ those entities. Like “I would like to say this person was a writer”, but there are seventeen entities with the English label of “writer”, distinguished by some obscure difference used by a group of Sino-Tibetan languages.
And… Wikidata entities represent basically just nouns.
So… I am a bit sceptical.
- Luk3 5 years ago
  
  I believe that Wikidata's Lexeme system is trying to fix that, is it not?

knolax 5 years ago

What a horribly myopic way to organize information. They seem to have unthinkingly copied from vernacular English various loosely defined concepts like "city". What do they mean by San Francisco? The City and County of San Francisco? What about Los Angeles? Is that the entire LA metro or just LA county? Is Santa Monica a part of Los Angeles or a separate settlement? How is the concept of "city", "metro", and "town" going to translate into "市", "Burg", and "Grad"?

jbob2000 5 years ago

This is getting very close to the Universal Language that Umberto Eco describes in his book The Search for the Perfect Language. I wonder what he would think about this if he were alive today...

gorgoiler 5 years ago

The syntax looks well optimized for human editing.

The example seems like it would be machine generated though.

I hope the syntax learns from SQL, and allows for easy generation by either man or machine, preferably a little of both.

MayeulC 5 years ago

The way I'd do it would be to store an intermediate representation, and have multiple front-ends with different syntaxes. Have the editable text be generated from the IR.
This would be a huge plus, as it would not require the editor to know English keywords. Most keywords could be translated into the contributor's native language, lowering the barrier for editing.
It would also allow the syntax to be changed over time, or provide multiple different syntax paradigms, a bit like Wikipedia's code vs visual editors.
Of course, comments are an issue, but hopefully, this is as close to "self-commenting" code as it gets.
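
As a rough illustration of the idea above, here is a minimal Python sketch of one stored IR with per-language front-ends. Everything in it is an assumption made for the example: the `Node` class, the `KEYWORDS` table (including the German keyword labels), and the `emit` and `parse` functions are hypothetical, and the parser only handles flat nodes whose values contain no commas or parentheses.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """Hypothetical IR node: canonical constructor ID plus canonical argument keys."""
    constructor: str
    args: dict


# Hypothetical keyword tables: canonical ID -> label shown to editors.
KEYWORDS = {
    "en": {"Ranking": "Ranking", "subject": "subject", "rank": "rank", "object": "object"},
    "de": {"Ranking": "Rangfolge", "subject": "Subjekt", "rank": "Rang", "object": "Objekt"},
}


def emit(node: Node, lang: str) -> str:
    """Generate editable text from the IR using the editor's keyword labels."""
    kw = KEYWORDS[lang]
    args = ", ".join(f"{kw[key]}: {value}" for key, value in node.args.items())
    return f"{kw[node.constructor]}({args})"


def parse(text: str, lang: str) -> Node:
    """Parse flat 'Name(key: value, ...)' text back into the canonical IR."""
    reverse = {label: canonical for canonical, label in KEYWORDS[lang].items()}
    name, _, rest = text.partition("(")
    body = rest[:-1]  # drop the closing parenthesis
    args = {}
    for pair in body.split(", "):
        key, _, value = pair.partition(": ")
        args[reverse[key]] = value
    return Node(reverse[name], args)


node = Node("Ranking", {"subject": "Q62", "rank": "4", "object": "Q515"})
print(emit(node, "en"))
print(emit(node, "de"))
print(parse(emit(node, "de"), "de") == node)
```

Because only canonical IDs are stored, an edit made through the German front-end and one made through the English front-end end up as the same IR, which is the property that would let the surface syntax change over time.
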
ganafagol 5 years ago

That's the beauty here. It's not the syntax. It's just a syntax to express the abstract thing. Saying this syntax is an issue is like saying "I don't like binary trees because their syntax is so weird". One particular syntax may be weird, but the syntax is only specific to one specific representation. Everybody will be free to choose any representation they like, as long as it can somewhat automatically be translated back into the abstract thing that this project is aiming to produce and maintain.

IAmNotAFix 5 years ago

How does it go beyond the headline and general info?