Dedicated Syntax Rap Robot

In closed-off, inaccessible academic circles, linguist brohs openly hate on computer brohs for creating models of language that are based on “probability” and that don’t take into account the “underlying structure of language”.  The hardcore nativist bros believe that every language has the same underlying structure and that corpus-based language software is useless.

Researchers that use “machine learning algorithms” (a fancy way to say ‘teaching a computer to predict future behavior based on previous behavior’, statistically) to understand something in the world without giving a shit about the ‘meaning of that behavior’ are hated on by the people that study that meaning 4 a living.  When a person comes along and tries to model the behavior of something they do not have a deep understanding of, the community that studies the behavior will get upset.  It is a good and maybe healthy reaction.

The models generated by the math and computer bros have had success in terms of producing viable irl products.  To troll the linguist people, computer guys are now getting kinda good at looking at the “underlying structure” of language (mainly the syntax of sentences) and coupling it with their statistical models to understand language better.

Take for example this simple sentence structure in tree-format:

Instead of deciding statistically that “happy” comes before a word like “linguists”, a model relying on sentence structure allows for “dynamic”-y modelling of language.  A *very* simple model, with the structure above as its only rule set, can still produce some cool things if used correctly.  Rules:

S  -> NP VP

NP -> Adj N

VP -> V N

So, using this very simple rule set, I created a script to generate short rhyming rap bursts following this sentence structure (combining rhyme algorithm stuffs w/ language modelling stuffs).  Here are some examples I generated from the rap corpus + publicly available n-gram models:

  • classic mouth pull the plow //
    fat red round buy the cow //
    late night crowd make the rounds //
    open routes find the lounge //
  • good stream down beat the count //
    sparkling crowd buy the cow //
    heavy clouds produce a growl //
    common grounds give a cow //
  • vocal booth cut the loops //
    next man do divide the schools //
    certain lieu cut the jews //
    expensive route pull the school //
  • major news combine the attributes //
    several pews preserve the rule //
    own lust to get some value //
    long arc through give the proofs //
  • bad tone to send a group //
    your own tissue ensure the use //
    so many crew bring the groups //
    private pools see no two //

All of these follow the general rule S -> NP VP (NP -> Adj N, VP -> V N).  And they rhyme!  Some are nonsensical and this is a long way off from anything that would be *really* interesting but it’s a world I want to learn about and do cool shit in.
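A rough sketch of how a generator like this could work, not the actual script.  The toy lexicon, the hand-written rhyme keys and the names `LEXICON`, `RULES`, `expand` and `generate_burst` are all made up for illustration; the real thing pulled words from a rap corpus and n-gram models.  Since the objects in the examples carry a determiner (“the plow”), the sketch treats the N inside VP as a small object phrase called `Obj`, which is an assumption.

```python
import random

# Toy lexicon (hypothetical). The post drew its words from a rap corpus
# and public n-gram models instead of a hand-written list.
LEXICON = {
    "Adj": ["classic", "heavy", "common", "sparkling"],
    "N": ["mouth", "crowd", "clouds", "grounds"],
    "V": ["pull", "buy", "make", "find"],
    # Object phrases, hand-tagged with a rhyme key so every line in a
    # burst can end on the same sound.
    "Obj": {
        "ow": ["the plow", "the cow", "a growl"],
        "oo": ["the loops", "the rule", "the school"],
    },
}

# S -> NP VP, NP -> Adj N, VP -> V N (with N in VP modelled as Obj).
RULES = {"S": ["NP", "VP"], "NP": ["Adj", "N"], "VP": ["V", "Obj"]}


def expand(symbol, rhyme, rng):
    """Recursively expand a grammar symbol into a list of words."""
    if symbol in RULES:
        words = []
        for child in RULES[symbol]:
            words.extend(expand(child, rhyme, rng))
        return words
    if symbol == "Obj":
        return [rng.choice(LEXICON["Obj"][rhyme])]
    return [rng.choice(LEXICON[symbol])]


def generate_burst(n_lines=4, rhyme="ow", seed=None):
    """Generate n_lines lines that share a sentence structure and a rhyme key."""
    rng = random.Random(seed)
    return [" ".join(expand("S", rhyme, rng)) for _ in range(n_lines)]


if __name__ == "__main__":
    for line in generate_burst(seed=1):
        print(line, "//")
```

The structure does all the work here: every line is Adj N V Obj by construction, and the rhyme is forced by only sampling objects that share a key.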

If you want to fuck around with the thing, visit this link.

top 5 rhyme-iest bars [detected thru ‘complex’ computational methods]

So, as a part of a new project I’m working on, it’s been important to go through lines in rap that have lots of rhymes and don’t use big words. Here are some random bars from random songs that fit these criteria and sound nice.

1. Benzino – Bang Ta Dis

blunts bitches clips guns
bars bricks whips funds

Syllables Per Word: 1.125
Rhyme Density: 0.78

2. JayZ – Parkin Lot Pimpin

big thangs thick chains
aint shit changed get brain in the four dot six range

Syllables Per Word: 1
Rhyme Score: 0.67

3. MF DOOM – Meat Grinder

wild west style fest ya’ll best to lay low
hey bro day glo set the bet pay dough

Syllables Per Word: 1.05
Rhyme Score: 0.85

4. Boo – Boo & Gotti Freestyle [off the Fast/Furious soundtrack]

pop off two clips top off new six
rock frost blue wrist still cop two bricks

Syllables Per Word: 1
Rhyme Score: 0.93

5. Cam’Ron – Bout it Bout it

get your legs snapped arm twist ribs cracked
wig tapped play fair day care kids napped

Syllables Per Word: 1
Rhyme Score: 0.93
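The Syllables Per Word number is just total syllables divided by total words.  Here is a minimal sketch, assuming a crude vowel-group heuristic for counting syllables (the post does not say how syllables were actually counted, and a pronunciation dictionary would be more accurate).  The function names are made up.

```python
import re

VOWEL_GROUP = re.compile(r"[aeiouy]+")


def count_syllables(word):
    """Crude estimate: one syllable per run of vowels."""
    word = word.lower()
    count = len(VOWEL_GROUP.findall(word))
    # Drop a trailing silent 'e' (e.g. "range"), but keep "-le" endings.
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


def syllables_per_word(bar):
    """Average syllable count over all words in a bar (may span lines)."""
    words = re.findall(r"[a-z']+", bar.lower())
    if not words:
        return 0.0
    return sum(count_syllables(w) for w in words) / len(words)


if __name__ == "__main__":
    bar = "blunts bitches clips guns\nbars bricks whips funds"
    print(syllables_per_word(bar))  # 9 syllables / 8 words = 1.125
```

The heuristic happens to reproduce the 1.125 listed for the Benzino bar, but it will overcount words like “changed”, so don’t expect it to match every number in the list.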

Chance The Rapper Dislikes Fox News

Chance The Rapper, a young artist out of Chicago, has released a new mixtape that’s pretty good and “soulful”.  It’s fun to listen to and there are a few really high points that make the listener feel good, I think.  He is a good rapper, in technical terms.  I think he is supposed to be “psychedelic” but listenable. Inoffensive/un-weird enough to listen to around a group of people with mainstream-y tastes but cool enough (for now, at least), to be proud of liking him.

The intro song Good Ass Intro has about 12 instances of Chance pronouncing the rhotic coda consonant {looser, another, brother, mother, glitter, clutter, colour, baller, butter, stainer, studded, mater}.  In other words, pronouncing the ‘r’ in words that end with ‘r’.  Not pronouncing the coda (last consonant) ‘r’ is something common to Non-Rhotic accents.  African American Vernacular English (AAVE) is a non-rhotic accent.

Like other non-rhotic varieties, the rhotic consonant /r/ is usually dropped when not followed by a vowel;

Non-Rhotic accents are generally stigmatized and speakers ‘embarrassed’ of them will converge to Standard English via pronouncing the ‘r’.  Unlike G-Dropping rates, which have lots of good literature, I have no idea what a common ‘R-Drop’ rate for AAVE is or what its relation to socioeconomic conditions is.  There are some unsubstantiated theories about non-rhotic consonants re: class warfare.

Another event that may have influenced Southern dialectical patterns, particularly desegregation, which was accompanied by turmoil in the South from the 1950s through the 1970s. The civil rights struggle seems to have caused both African Americans and southern Whites to stigmatize linguistic variables associated with the other group.

Later on in the mixtape, Chance mentions that he hates Fox News and that Matt Lauer isn’t properly reporting the problems in Chicago.  Which is cool.  And mirrors the sensibilities of the young rap fan with opinions on politics probably.  But to me, even though they are technically sloppy, the other kids out of Chicago do a way better job of capturing the desperation and chaos Matt Lauer isn’t reporting.  Chance’s use of the rhotic consonant probably says a lot about his relationship to the stuff happening in Chicago, I think.  It almost doesn’t matter that he is from there, at least musically.  Seems like he is pretty amorphous geographically.


Can Hipsters ‘make’ a Street Rapper?

Chief Keef is a famous rapper out of Chicago that started off making paranoia-driven gangster raps and quickly transitioned to molly-induced happy party songs after he got paid.  Keef’s early stuff – BANG, Back From the Dead – is especially great as a look into the paranoia that drives divisions between these kids.  These differences being partially geographical and partially ‘merit-based’ {If you ain’t poppin pistols, I ain’t rocking wit ya}.

After a lengthy Gawker profile and two excellent tapes, Keef became thinkpiece fodder.  The central thesis by outraged black people with rap/culture opinions was that “white people shouldn’t/aren’t equipped to discuss violent rap”.

“Brian “B.Dot” Miller, who is black, and an editor at Rap Radar, took Sargent to task directly, tweeting at him to “please stop writing about MY culture,” bemoaning “cultural tourists writing about the music of MY culture” and “outsiders like yourself in hipster media that get a hard-on by overanalyzing black music.”

Whatever.  In the widely circulated New Republic article, a (black) blogger raised similar concerns re: white hipsters writing about Chief Keef:

“Motherfuckers see us as ONE fucking unit and THAT is what we want ‘white bloggers’ to understand. Someone sees Waka and then kills Treyvon … Y’all don’t know that fuckin’ struggle of being judged based on someone else’s actions and you NEVER will … You will never understand. Never feel the pain, shame, guilt … You get to be just you. But in America no matter how hard I try someone is ALWAYS judgin based on my skin and when the Chief Keefs appear, people are thinking OMG look at what years of oppression and demoralization have done to a group. They think: niggers.”

The guy who killed Treyvon[sic] probably hated black people way before Flockaveli dropped.  But, whatever.  Both of these people are implicitly assuming that (white) bloggers *made* Chief Keef.  That the sustained interest in Keef’s work was due to white people that write on the internet.  Without the interest of these *bloggers*, Keef would not matter.  The fear being, I assume, that the *power* held by these White Bloggers could create an ecosystem of similar rappers “without artistic merit” (representing the worst of the worst of black culture) being given a platform.  The truth is that Keef built his fanbase in a wholly organic way, by getting listeners similar to him, and that white bloggers DO NOT matter in terms of creating a “street rapper” like Chief Keef.

I believe that people who comment on YouTube videos, good or bad, provide a solid, measurable way to understand content.  Social media data mining may or may not be a bullshit thing to study but I think if the results (especially ones on either extreme) make sense and pass the ‘eye test’, it’s probably something worth exploring further.  One thing that I think makes sense to look at, especially for videos from ‘street’ rappers, is to see if the people commenting type in some sort of unique, measurable way.

Ebonics is a rule-governed language that can sometimes be studied on paper.  For example, something like G-Dropping can be looked at on text.  Another rule of ‘Ebonics’ has to do with word-initial fricatives.  This is a fancy way to say that words like {This} get pronounced {dis} in spoken language.  And sometimes, because this is an example where the spelling of the word goes with the general rules of sound and stuff, people will actually write {dis}.

Writing a little script to extract comments from a YouTube video, we can find how often users use words like [da, dis, dat] instead of [the, this, that].  It ends up working really well to distinguish artists, stylistically.  The chart below shows rappers that have a high ‘da’ score (street rappers, generally), medium ‘da’ scores (mixed fan bases) and low ‘da’ scores (generally backpackers or ‘barely rap’ bros):

High /da/          Medium /da/      Low /da/
Soulja Slim        GZA              Ras Kass
Guerilla Maab      Esham            Atmosphere
Beanie Sigel       Missy Elliott    Beastie Boys
Eightball & MJG    RZA              Brother Ali
Pastor Troy        Ghostface        Lupe Fiasco

Stylistically, the High /da/ guys seem to be similar and might be classified under the umbrella term ‘street rapper’, although maybe unfairly.  It turns out that as a quick separator of styles this metric works really well.  It relies on a phonological rule of AAVE.  It probably is ‘wrong’ to call the [th] -> [d] phenomenon a “spelling mistake”.  This is a systematic rule of AAVE (just like any other phonological rule in any language) that in this instance happens to show up in text, which makes it data-mineable.
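For the curious, here is one way a /da/ score could be computed from a pile of comments.  This is a sketch under an assumption: the score is taken to be the share of [da, dis, dat] among all occurrences of [da, dis, dat, the, this, that].  The exact formula isn’t spelled out above, and fetching the comments from YouTube is left out entirely; `da_score` and the sample comments are made up.

```python
import re

D_FORMS = {"da", "dis", "dat"}
TH_FORMS = {"the", "this", "that"}


def da_score(comments):
    """Share of d-forms among all d-form and th-form tokens in the comments."""
    d_count = 0
    th_count = 0
    for comment in comments:
        for token in re.findall(r"[a-z]+", comment.lower()):
            if token in D_FORMS:
                d_count += 1
            elif token in TH_FORMS:
                th_count += 1
    total = d_count + th_count
    # No relevant tokens at all: report 0.0 rather than dividing by zero.
    return d_count / total if total else 0.0


if __name__ == "__main__":
    sample = [
        "dis is da truth",
        "I love this song and the beat",
        "that is all",
    ]
    print(da_score(sample))  # 2 d-forms out of 5 tokens = 0.4
```

Normalizing by the th-forms keeps long comment threads from scoring higher just because they contain more words.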

If the ‘hipster media theory’ holds up and Keef’s fan base was cultivated through white people blogosphere link sharing, his initial work should NOT have a High /da/ score.  However, this is not the case.  The BANG mixtape has consistently High /da/ scores which indicates that it was probably kids similar to Keef that listened to him first.  That while the people writing about him online now may be mostly white nerds, the people that fell in love with him initially were black kids like Chief Keef.

The High /da/ guys have a median score of ~0.15 and Low /da/ guys have a median score of ~0.01. Keef’s BANG mixtape has a weighted average /da/ score of ~0.18.  This allows us to classify him as a ‘street rapper’.  His song Setz Up has a /da/ score of 0.13.   Looking through the responses to this song, we find this particular comment below which has an instance of /da/ AND /dat/.  It also specifically explains in detail one of the gang references in the song:

[Screenshot of the YouTube comment]

A song riddled with gang stuffs appealing to kids that are hyper-aware of these references.  The ‘hipster media’ didn’t make these kids care about Keef or understand these references.  Most likely, Keef’s music initially represented a reality to the kids in his city.
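A mixtape-level number like the weighted average /da/ score quoted for BANG can be sketched as below.  What the songs were weighted by isn’t stated; this sketch assumes each song is weighted by how many relevant tokens its comments contain, and the numbers in the example are placeholders, not real data.

```python
def weighted_da_score(per_song):
    """Weighted average of per-song scores.

    per_song is a list of (score, weight) pairs, where weight is assumed
    to be the number of relevant comment tokens for that song.
    """
    total_weight = sum(weight for _, weight in per_song)
    if not total_weight:
        return 0.0
    return sum(score * weight for score, weight in per_song) / total_weight


if __name__ == "__main__":
    # Placeholder numbers for two hypothetical songs.
    print(weighted_da_score([(0.5, 10), (0.25, 30)]))  # 12.5 / 40 = 0.3125
```

Weighting this way stops a song with three comments from counting as much as one with three thousand.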

I think it is important to relate these High /da/ scores to actual lyrical content from songs.  We see that the fans’ response to High /da/ rappers seems to follow a general trend.  High /da/ score rappers are all generally ‘street rappers’.  We need to find a way to link the lyrical content from these songs to the particular responses.  Ideally, we should find that High /da/ scores in YouTube commentary are correlated with some sort of particular word usage in songs.

There have to be certain trends in word usage that can be measured?  For example, I’m sure the word {‘nigger‘} is almost exclusively limited to songs by black guys.  Not sure if there are any exclusive ‘white’ words, since white artists probably don’t own any kind of similar exclusivity to lexical items.

No reason we can’t look at this scientifically.  All you really have to do is get good enough datasets for ‘white’ raps and ‘black’ raps.  Mathematically, of course, {Black} ∩  {White} = ∅ ⇔ One-drop Rule.  So, once we have these two datasets we can run some cool machine learning algorithms to train a computer to identify specific ‘white’ and ‘black’ characteristics.
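Which text classification tool or algorithm was used isn’t specified, so here is a generic stand-in: a tiny multinomial naive Bayes classifier with add-one smoothing that learns word usage from two labelled piles of lyrics.  The class name, the labels and the toy training lines are all hypothetical, and the log scores it produces are not on the same scale as the scores reported in this post.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Minimal multinomial naive Bayes over whitespace-split tokens."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word frequencies
        self.doc_counts = Counter()              # label -> number of documents
        self.vocab = set()

    def train(self, docs):
        """docs is an iterable of (text, label) pairs."""
        for text, label in docs:
            tokens = text.lower().split()
            self.word_counts[label].update(tokens)
            self.doc_counts[label] += 1
            self.vocab.update(tokens)

    def log_scores(self, text):
        """Log prior plus smoothed log likelihood for every known label."""
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocab)
        scores = {}
        for label, n_docs in self.doc_counts.items():
            counts = self.word_counts[label]
            n_words = sum(counts.values())
            score = math.log(n_docs / total_docs)
            for token in text.lower().split():
                # Add-one smoothing so unseen words do not zero out a label.
                score += math.log((counts[token] + 1) / (n_words + vocab_size))
            scores[label] = score
        return scores

    def classify(self, text):
        scores = self.log_scores(text)
        return max(scores, key=scores.get)


if __name__ == "__main__":
    nb = NaiveBayes()
    # Toy training data with placeholder labels.
    nb.train([
        ("block trap block corner", "dataset_a"),
        ("coffee notebook rain coffee", "dataset_b"),
    ])
    print(nb.classify("corner block"))
```

With real data each artist’s songs would be scored one by one and then averaged, which is roughly the shape of the per-artist averages reported further down.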

We know from the earlier chart that Pastor Troy is a High /da/ score guy and that Atmosphere is a Low /da/ score guy.  Ideally, using the text classification tool, Pastor Troy should score as more ‘black’ and less ‘white’ than Atmosphere.  It turns out he does.  Considerably.

[Screenshot of the classifier scores for Pastor Troy and Atmosphere]

With average scores:

Artist         Black    White
Atmosphere      7.62    27.45
Pastor Troy    47.56     1.81

The data supports our intuition with regard to Pastor Troy.  It seems that Pastor Troy, a High /da/ guy, also has High ‘Black’ scores and Low ‘White’ scores.  Does this data extend to other ‘street’ rappers?  If we use an arbitrary cutoff of 0.05 (about 25% of the songs we mined in a 1500+ song dataset) we see that High /da/ scores generally correlate with Low ‘White’ Scores.  That is, how the fans are talking about an artist is directly correlated with the actual lyrical content.  A pretty sweet discovery.

[Chart: /da/ scores vs. ‘White’ scores]

We see that Low ‘White’ scores (0-15) correlate with High /da/ scores (>0.05).  That is, there is a 92% chance that a song with a /da/ score greater than 0.05 will have a White Score less than 15.  Pretty great evidence that the way fan bases discuss a street artist is a predictor of the kind of lyrical content an artist has.  Without even listening to a song, we can know what kind of song we are dealing with just by how the fans are interacting with the work.
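The 92% figure is a conditional share: of the songs with a /da/ score above 0.05, what fraction have a White Score below 15.  A minimal sketch of that calculation, using placeholder pairs rather than the real 1500+ song dataset; the function name is made up.

```python
def share_low_white_given_high_da(songs, da_cutoff=0.05, white_cutoff=15):
    """Fraction of high-/da/ songs whose White Score is under the cutoff.

    songs is a list of (da_score, white_score) pairs. Returns None when
    no song clears the /da/ cutoff.
    """
    high_da = [white for da, white in songs if da > da_cutoff]
    if not high_da:
        return None
    low_white = sum(1 for white in high_da if white < white_cutoff)
    return low_white / len(high_da)


if __name__ == "__main__":
    # Placeholder (da_score, white_score) pairs, not real data.
    songs = [(0.10, 5.0), (0.08, 12.0), (0.20, 30.0), (0.01, 40.0)]
    print(share_low_white_given_high_da(songs))  # 2 of 3 high-/da/ songs
```

Note that this only conditions in one direction: it says nothing about how many Low White Score songs have low /da/ scores.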

The blogosphere simply cannot ‘break’ a street artist.  Any shit-talk to the contrary is without merit.

I am Giving A Talk At Boston College March 14th

The event is scheduled to take place in Gasson Hall Rm. 305 at Boston College on March 14, 2013. It will begin at 6:30 PM.  The event is hosted by the Boston College Economics Association.  If you are in Boston or around Boston please come by and listen to me talk about RapMetrics.  I will talk about past projects and potential new ones.

This is exciting for me and these kinds of things let me know that this project isn’t useless and that it is worth doing.