“Holiday” (it would be so nice)

My wife has leave at the moment, so she suggested that we go on a holiday interstate. I do not have leave at the moment, but I’m the one who does the long-distance driving so I was tangled up in this plan. I often work from home so my boss said it was OK, so next thing you know I’ve driven a thousand kilometres and am staying at a holiday unit at the beach. Here is the view from the balcony.

Of course, I am not on holiday, so I’m busy coding away in between dropping the family at the shops and the train station and so on. I have a couple of pretty big deadlines so I really am busy.

Even worse, the internet here is not as good as at home. For work stuff that’s generally OK, but I’ve done a couple of updates to extstats.drfriendless.com, and it takes quite a while to upload 50 megabytes. Still, I did finish an update to the user page (the one you get to if you click on your name when you are logged in). It still looks pretty bad because I can’t work Angular Material very well yet, but I think it’s more usable. I’ve also been trying to start using some of the configuration you can set on that page, but I haven’t really had the time to make much progress. Worst holiday ever.

On the other hand, I was watching my wife web-surf last night, and discovered that the h-index is a popular statistic amongst academics. An academic’s h-index is the largest number n such that they have n papers which have been cited at least n times each. And of course, being academics, whose performance is measured by their h-index and similarly absurdly trivial metrics, they think way too much about this sort of thing.

For example, they have a g-index. The g-index is the largest number n such that their n most-cited papers have been cited on average at least n times each. We don’t have that metric. NOT YET!

They also have a rational h-index, which is approximately the h-index but with some indication of how close you are to getting to the next number. So we definitely want that! The formula (which took a while to track down) is:

  • say your h-index is h
  • and the minimum number of additional plays you would need to get it to h+1 is n
  • then your rational h-index hr is:

hr = h + 1 - n / (2h + 1)

and of course you keep the fractional part. OMG, I am so excited! I can’t wait to implement it! But right now, I have to have a “holiday”.
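For future me, here’s a sketch of that calculation in TypeScript (hypothetical code, nothing like this is in the site yet):

    // Sketch of the h-index and rational h-index calculations for plays of games.
    // (Hypothetical code, not the actual Extended Stats implementation.)

    // playsPerGame: number of plays recorded for each distinct game.
    function hIndex(playsPerGame: number[]): number {
        const sorted = [...playsPerGame].sort((a, b) => b - a);
        let h = 0;
        while (h < sorted.length && sorted[h] >= h + 1) h++;
        return h;
    }

    // Minimum number of extra plays needed to push the h-index to h+1:
    // we need h+1 games with at least h+1 plays each.
    function playsNeededForNext(playsPerGame: number[], h: number): number {
        const sorted = [...playsPerGame].sort((a, b) => b - a);
        let needed = 0;
        for (let i = 0; i <= h; i++) {
            const plays = sorted[i] || 0;   // a game not yet in the list counts as 0 plays
            needed += Math.max(0, h + 1 - plays);
        }
        return needed;
    }

    // hr = h + 1 - n / (2h + 1)
    function rationalHIndex(playsPerGame: number[]): number {
        const h = hIndex(playsPerGame);
        const n = playsNeededForNext(playsPerGame, h);
        return h + 1 - n / (2 * h + 1);
    }

    // Example: plays of [5, 4, 4, 2, 1] give h = 3, n = 2, so hr = 4 - 2/7, about 3.71.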

Achievement Unlocked!

Over summer, I was working towards a particular goal. I mean the Australian summer, so that was more than 6 months ago now. My plan was to put into the new site a feature that I had mocked up on the old site 2 years before. However even then the old site was becoming difficult to work on.

I actually implemented the feature with the server in Kotlin, and the web page in AngularJS. AngularJS has since died, and I am now using Angular 7 (which is sooooo much nicer!). Kotlin is a language I do love, but the architecture of the server was so 2015, so that code could not continue to live in the serverless world. And then when I did start writing serverless code, TypeScript on Node.js seemed like a better choice.

So anyway, what did I do? Well, it’s just the Rate of Play of Games New to You, for multiple users (which the page could already do), but now it’s EASY TO USE! So people can actually see that the page can do that.

I tried to add a few more people to the list, but I ran into a problem I haven’t encountered before – there was too much data for the Lambda to return! It seems there’s a 6 megabyte limit. After I return that data, it gets compressed to only about 10% of its original size before being sent back to the browser, but the limit applies before compression.

To tell the truth I’m a little nervous about the amount of data I send back, as it costs me money, so maybe it’s worth some work to make that data smaller.
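One idea (just an idea at this stage, not something I’ve implemented) would be to compress the payload inside the Lambda itself, so that the thing counted against the 6 megabyte limit is the compressed version. A rough sketch, assuming API Gateway is set up to pass binary responses through:

    // Sketch: gzip the response body inside the handler so the payload returned by
    // the Lambda (and counted against the ~6 MB limit) is the compressed size.
    // (Hypothetical code; loadLotsOfPlayData is a stand-in for the real query.)
    import { gzipSync } from "zlib";

    async function loadLotsOfPlayData(event: any): Promise<object> {
        return { plays: [] };   // pretend this is the big multi-user result set
    }

    export async function handler(event: any) {
        const payload = JSON.stringify(await loadLotsOfPlayData(event));
        const compressed = gzipSync(payload);
        return {
            statusCode: 200,
            headers: {
                "Content-Type": "application/json",
                "Content-Encoding": "gzip"
            },
            isBase64Encoded: true,
            body: compressed.toString("base64")
        };
    }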

This new feature is on the test system and is coming to the live system within 24 hours.

That’s the Way the Money Goes!

I’ve just figured out a new way to work the AWS CloudWatch graphs, so rather than just graphing the number of Lambda invocations, I can graph the total duration per Lambda. For Lambda, I pay for each invocation, but I also pay for how long each invocation runs, so one that goes long costs as much as several quick ones. So I’m paying for duration as well as for count.
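In case the console graphs mean nothing to you, the number being graphed is just the Sum statistic of the AWS/Lambda Duration metric for each function. Pulled via the API it would look roughly like this (illustrative only – the graph itself came from the CloudWatch console, and the region is a guess):

    // Sketch: total duration of one Lambda over a time range, via the CloudWatch API.
    import { CloudWatch } from "aws-sdk";

    const cloudwatch = new CloudWatch({ region: "ap-southeast-2" });   // region is an assumption

    async function totalDurationMs(functionName: string, start: Date, end: Date): Promise<number> {
        const result = await cloudwatch.getMetricStatistics({
            Namespace: "AWS/Lambda",
            MetricName: "Duration",
            Dimensions: [{ Name: "FunctionName", Value: functionName }],
            StartTime: start,
            EndTime: end,
            Period: 86400,          // one datapoint per day
            Statistics: ["Sum"]     // total milliseconds spent in this Lambda
        }).promise();
        return (result.Datapoints || []).reduce((total, dp) => total + (dp.Sum || 0), 0);
    }

    // e.g. totalDurationMs("inside-dev-processPlaysResult", twoWeeksAgo, now)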

This graph is for the last 2 weeks, and shows that inside-dev-processPlaysResult is taking by far the most time. That’s the one that takes plays scraped from BGG and writes them to the database. I’ll take a look at that code. It is a bit on the complex side, as it’s the bit that infers plays of base games from plays of expansions, but I can usually find something to optimise.

Looking at the same graph for the past 4 weeks, all we can see is the Kaboom! Everything else literally pales into insignificance compared to that bug. Cool!

Sorry, I am having too much fun graphing AWS performance to graph board game stuff :-).

Cleanin’ Out My Closet

Nah, I’m not going to go all Eminem and aggressive and stuff. I’ve literally been cleaning stuff up today. It was a great, productive day. There were a couple of users that I added over a week ago, and before advising them that their pages were ready, I decided to check whether they were, and they weren’t. This is the sort of bug that cannot be tolerated. With 3034 users, stuff’s got to work without me watching it.

So I hunted down what the problem was, and discovered that I’d modified some SQL in a buggy way a couple of weeks ago, and then swallowed the error so that I never noticed. So I fixed the SQL and the users started being created.

But they still weren’t coming through properly, so I investigated further. There were half a dozen or so users who had deleted themselves from BGG, so I was unable to process them. Yet I kept trying to, every minute. So I deleted them. And then there was one user whose BGG collection is so big that BGG just tells me it’s too big. I’m not sure what to do about that.

You will notice in the graph below of Lambda invocations that there was a solid orange band at the bottom. That was just doing broken things over and over. Oh, and by the way, I pay for the height of this graph – Lambda invocations cost some tiny amount of money. The right-hand end of the graph shows how much the orange band decreased after I fixed that stuff up. It will cost me a bit this month (like, a dollar), but next month it should be better.

These sorts of problems can’t be allowed to persist, so I wrote some code to send errors to the database. When errors happen somewhere in the hundreds of thousands of Lambda invocations per month, I don’t necessarily notice them; if I write them to the database I can at least find them. With any luck I will find the next similar problem faster.
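The error-recording code is nothing fancy; it amounts to something like this (the table and column names here are made up for illustration):

    // Sketch of recording Lambda errors in the database so they can be found later.
    import * as mysql from "promise-mysql";

    async function recordError(source: string, err: Error): Promise<void> {
        const connection = await mysql.createConnection({
            host: process.env.DB_HOST,
            user: process.env.DB_USER,
            password: process.env.DB_PASSWORD,
            database: process.env.DB_NAME
        });
        try {
            // The "errors" table and its columns are hypothetical names.
            await connection.query(
                "insert into errors (source, message, stack, created) values (?, ?, ?, now())",
                [source, err.message, (err.stack || "").substring(0, 2000)]
            );
        } finally {
            await connection.end();
        }
    }

    // Usage inside a handler:
    // try { ... } catch (err) { await recordError("processPlaysResult", err); throw err; }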

So then after cleaning that up, the new users started working properly, and I had a clear conscience. I then started emailing people who have been waiting to be added to the site to tell them that there was a new site and they were added. I emailed 280 people, some of whom had been waiting for 18 months. I hope they still play board games.

Anyway, whether or not they still remember who I am, it was nice to get 280 messages out of my inbox, and to have that weight of guilt lifted from my shoulders after such a long time. On the other hand, I’ve increased my potential active users by 280, and that might reveal some other problems. I don’t expect it will be too much, as the architecture I’ve chosen is nothing if not scalable, but you never know. The database is a non-scalable weak link, but I think the impact of users is trivial compared to the impact of the downloader.

And then, because I’m hyperactive or something (not to mention that the weather outside was a bit yukky, so I wasn’t tempted to do anything else), I updated my spreadsheet of ongoing costs. It was 3 months behind.

May 2019 shows a jump in Lambda costs, due to the Kaboom I blogged about previously. It was only $6, but it was an architectural problem that was going to stick around and cost more each month until I dealt with it. That’s why things like that get my attention sooner than actual useful features, and why they get blogged about.

The kaboom happens about every 35 days, which puts the next one in the first few days of July. Due to the dithering I put in, and the fix for the huge database index bug, I don’t expect a big kaboom, just more of a tremor. And due to the continued effect of the dithering, each one should get smaller every 35 days.

The next thing I’m hoping to work on is the update schedule page. It’s not a headline feature, more of a necessary evil. Also, just today I logged a play from October last year, so I need it myself. And of course so does everyone from time to time.

Work also continues on the login stuff that I mentioned in the blog post about sticking the cookie. Now that the cookie is working, I need all the bits of code to use it properly, or they won’t be able to access user data. And then I want to write more code which reads and writes user-specific data so I can realise some benefits from all of that mucking around.

Auth0 tells me I have 138 users with accounts, which I think is pretty wonderful since having an account is of little use. But it’s supposed to be a feature, so let me make it that way!

So You Can Take That Cookie…

I’ve been working on the login button for a few days. This is not because I want to, but because I discovered that the way I was handling login was regarded as bad practice. When a user logs in, Auth0 sends me a thing called a JWT (JSON Web Token), which is effectively information about who that user is and what privileges they get. So I was getting that and storing it in browser local storage where other parts of the site could retrieve it later.

It turns out that’s bad, because third party code that I use on the site might look into the browser local storage and get the JWT out, and send it off somewhere else for Nefarious Purposes (TM). Well, we don’t want nefarious porpoises around here. So the better way to do it is for me to send the JWT to my server, and for the server to set a cookie reminding me of who you are. That sounds easy enough.

But oh goodness me, the drama! Because my site is extstats.drfriendless.com, and my server is api.drfriendless.com, which are different, they don’t trust each other unless I do all sorts of “yeah, it’s OK, they’re my friend” stuff in the code. That’s called CORS, and although it’s not so complicated, it’s just too boring to remember.

And you can’t do CORS and cookie stuff with the API Gateway Lambda integration (well, not very easily using the Serverless Framework); you have to use the lambda-proxy integration instead. Which is OK, but it means everything in the code has to be much more explicit. So I did all that.

And then it still didn’t work. I could see the Set-Cookie header coming back from my server, but Chrome denied it existed. Firefox said it existed, but ignored it. So I poked around for a bit longer, and found out that if you set an expiry time on a cookie, Chrome throws it away. Why? I have no idea. It just does. So I have to set the maximum age for the cookie instead.
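Put together, the response from the login Lambda ends up looking roughly like this (hypothetical names and cookie contents, not the exact code):

    // Sketch of a lambda-proxy response with explicit CORS headers and a Set-Cookie
    // that uses Max-Age rather than Expires (Chrome discarded the cookie when I set
    // an expiry time).
    import { APIGatewayProxyEvent, APIGatewayProxyResult } from "aws-lambda";

    // Hypothetical stand-in for validating the JWT and finding out who it belongs to.
    async function lookupUser(jwt: string): Promise<string> {
        return "someuser";
    }

    export async function login(event: APIGatewayProxyEvent): Promise<APIGatewayProxyResult> {
        const jwt = event.body || "";               // the JWT posted from the browser
        const userId = await lookupUser(jwt);
        return {
            statusCode: 200,
            headers: {
                // CORS: the site and the API are on different subdomains, so tell the
                // browser they're friends, and that credentials (cookies) are allowed.
                "Access-Control-Allow-Origin": "https://extstats.drfriendless.com",
                "Access-Control-Allow-Credentials": "true",
                "Set-Cookie": `userid=${userId}; Domain=drfriendless.com; Max-Age=2592000; Secure; HttpOnly`
            },
            body: JSON.stringify({ loggedIn: true })
        };
    }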

And then finally I got the cookie set. And by then I had kinda forgotten what I was trying to achieve. Like a chump!

So I think now the cookie is working as intended, but I have to change the code on the pages to use it properly. At the moment the user page (the one you get to if you click on your user name under the Logout button) is broken, and is awaiting the CDN’s pleasure to be fixed.

Overall I quite like this solution. I feel I have more control over where data is going, and I understand how it works. It has just been pretty painful to get to this point!

Fiddle Faddle!

I’ve been quiet for a couple of weeks, but I’ve been persistently working on the Plays by Month page. I had recently added a couple of tables to the Plays page that should have been on Plays By Month, so I moved them across. Of course it wasn’t quite a perfect match, and it turned out to be a lot more fiddly than I anticipated. There are so many numbers!

For example, on that page the “plays for a month” could mean the plays in the month, the cumulative plays forever until that month, or the cumulative plays from the start of the year until that month. There’s meant to be synergy between the tables, but it turns out there’s just complexity and confusion. Anyway, it’s done now, and I can get onto some more interesting problems.
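To give a flavour of the confusion, the data for one month ends up needing something like this shape (a hypothetical interface, not the real one):

    // Three different meanings of "plays for a month" on the Plays by Month page.
    interface MonthOfPlays {
        month: string;            // e.g. "2019-05"
        plays: number;            // plays logged in that month
        cumulativePlays: number;  // all plays ever, up to and including that month
        playsThisYear: number;    // plays from the start of that year up to that month
    }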

Those tables from the Plays page will eventually be removed and replaced with other things that use the data that the page has.

I Figured Out the Kaboom!

On the weekend when I noticed that the database was worn out, I updated some code to try to make database operations for saving game data less expensive. When I turned the downloader back on, the situation was just as bad as before, which was much worse than it was last month. I didn’t really know what had changed.

I noticed in the Lambda logs that a lot of updates of game data were timing out after 30 seconds, which I thought was odd. There might be a couple of dozen SQL statements involved, which should not take that long. So I added some logging to the appropriate Lambda to see which bits might be taking a long time.
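The logging was nothing fancy; the idea is just to wrap each step and record how long it took, something like this (a sketch, with a made-up function name in the example):

    // Wrap a step of the update and log its duration.
    async function timed<T>(label: string, work: () => Promise<T>): Promise<T> {
        const start = Date.now();
        try {
            return await work();
        } finally {
            console.log(`${label} took ${Date.now() - start}ms`);
        }
    }

    // e.g. await timed("calculate ranking score", () => calculateRankingScore(connection, gameId));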

The answer was that the statement where I calculate a game’s score for the rankings table was taking maybe 8 seconds. Actually I think it takes longer if there are more things hitting the same table (the one which records geeks’ ratings of games), so 30 seconds is not unbelievable. But it is bad.

I checked out the indexes on that table, and realised that I had fairly recently removed one – the index that lets me quickly find the ratings for a game. So it seems that the lack of that index was slowing down the calculation of rankings, and hence causing updates to games to wreck the database.

So I fixed that index, but there was a reason I took it off. MySQL InnoDB (not sure what that even means) tables have a problem where if you do lots of inserts into the same table you can get deadlocks between the updates to the table and the updates to the indexes. I figured I didn’t need the index on games so much, so I took it off to fix some deadlocks I was seeing. Silly me! Now I suppose the deadlocks will come back at some point.

Next time though, I hope to remember how important that index is. I’ll just rewrite the code that was deadlocking to do smaller transactions and retry if it fails.
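The retry part is the easy bit; something like this wrapper would do (hypothetical code, and 1213 is MySQL’s error number for a deadlock):

    // Run a piece of work in its own (smaller) transaction, retrying if it deadlocks.
    import * as mysql from "promise-mysql";

    async function withDeadlockRetry<T>(connection: mysql.Connection,
                                        work: () => Promise<T>,
                                        attempts: number = 3): Promise<T> {
        for (let attempt = 1; ; attempt++) {
            try {
                await connection.beginTransaction();
                const result = await work();
                await connection.commit();
                return result;
            } catch (err: any) {
                await connection.rollback();
                if (err.errno === 1213 && attempt < attempts) continue;   // deadlock - try again
                throw err;
            }
        }
    }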

POGO Bounces Back!

One of my favourite features on the old site is the Plays of Games Owned Graph. I’ve been intending to put it on the new site for a long time, but have finally achieved it! As far as I can recall it has all the same features as the old one, including the click-through to BGG.

This graph is now on the Owned page. It’s available now on test.drfriendless.com, and tomorrow on extstats.drfriendless.com.

Kaboom!

Hmm, something bad happened yesterday. Database CPU went very high, database capacity was exceeded, and all of the burst capacity got used up. That means the database is worn out for a few hours until it recovers. I wonder what went wrong? I suspect one of the Lambdas went crazy and ran too many times, but I don’t have a good idea why that would happen. For the moment I have turned off the downloader so it will stop hassling the database.

These graphs show database performance. The top-right one is probably a cause – lots of incoming connections – and the bottom right one is a consequence. In particular the blue line diving into the ground is a bad thing.

Looking at the Lambda invocations, it seems about every 35 days there’s a spike, and yesterday’s spike was the biggest ever.

Taking a closer look at the spike, we can see that it was the oranges and the greens wot dunnit. Greens are downloading data about a game from BGG, and orange is storing that game in the database.

I just checked the code, and each game is updated every 839 hours (there’s a boring reason for that). So, that would be what’s causing the problem – every 35 days, I go to update 66000 games, which causes 66000 Lambdas to download data from BGG (sorry Aldie) and then 66000 Lambdas try to update the database. It seems I need some more dithering.
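By dithering I mean spreading the updates out: instead of scheduling every game’s next update exactly 839 hours away, add a random offset so they don’t all land on the same day every 35 days. A sketch (the amount of dither is a made-up number):

    const BASE_UPDATE_HOURS = 839;
    const DITHER_HOURS = 72;   // assumption: smear each update over roughly +/- 3 days

    function nextUpdateTime(lastUpdated: Date): Date {
        const base = BASE_UPDATE_HOURS * 3600 * 1000;
        const dither = (Math.random() * 2 - 1) * DITHER_HOURS * 3600 * 1000;
        return new Date(lastUpdated.getTime() + base + dither);
    }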

A Massive Overreaction!

I blogged a couple of weeks ago about my unsatisfying experience with React. I then got to pondering what I might do with React – I liked the navigation bar that I did, and so would like to continue using React for little bits of the page outside of the main applications, like the login button.

One of my goals with the system design has been to allow other developers to write pages, by which I mean the data presentation components. Angular and React usually call those things Single Page Applications (SPAs), by which they mean you don’t keep loading new HTML pages as you click around and do stuff. What they don’t typically mention is that because SPAs don’t play well with each other they tend to be You Must Write The Whole Page Using This And No Other JavaScript Applications.

If you try to put two Angular applications on the same page, they both try to install something called zone.js, which can only be installed once. So don’t put two Angular apps on the same page.

If you try to put two React apps on the same page, they both include React libraries. And if the two apps use different versions of the React libraries, then they interfere in unpredictable ways.

The way I discovered this was by rewriting the login button in React and putting it on the same page as the navigation bar. Due to quirks of fate, each used a different version of the React library, and it didn’t work. I consulted the React guys on Reddit, and they suggested I was doing it wrong, and that I should just write the whole page in React. I didn’t want to do that, because what if some other developer wants to write a data presentation component in React? Then they would need to match the React version of the hosting page. I am an extremely stubborn person when it comes to implementing a plan, so that was not going to happen.

I continued thinking about this, and about how the navigation bar didn’t really need React at all, and I had the idea of server-side rendering. SSR is when you run the JavaScript to generate HTML before sending the page out – so you send more HTML and less JavaScript. And there’s a technology called GatsbyJS which is designed specially for writing server-rendered sites in React, so I decided to give it a go. (I also tried one called NextJS but I did not like where that was going.)

Previously, the HTML pages, e.g. index.html, rankings.html, were written using a technology called Mustache, which is just a template language. If I wanted the navigation bar I would just put in {{> navbar}}. See how those curly brackets look like mustaches? That’s the joke.

So to convert to Gatsby I pretty much had to convert my HTML to React JSX, which is basically JavaScript code which looks like XML. That wasn’t so hard. But then Gatsby gets miffed if it doesn’t own the whole world, and if you want to refer to things which are outside the Gatsby world you have to use a feature called dangerouslySetInnerHTML. Being the daredevil and mule-headed SOB that I am, I did that, and pretty much got the site being generated from Gatsby.
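For anyone wondering what that looks like, it’s roughly this (a hypothetical component, not the real site code):

    // A Gatsby/React component wrapping HTML that lives outside the React world.
    import * as React from "react";

    // Hypothetical: plain HTML for something non-React, e.g. the login button.
    const loginButtonHtml = `<div id="login-button"><button>Login</button></div>`;

    export function LoginButtonHost(): JSX.Element {
        return <div dangerouslySetInnerHTML={{ __html: loginButtonHtml }} />;
    }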

There was a hiccup when I generated the Gatsby site – remember the point of server-side rendering is to do the JavaScript on the server, not in the browser – and Gatsby stuffed a whole bunch of its own JavaScript into the page to preload pages it thought the user might go to next. I was pretty annoyed by that – if I want JavaScript in my pages, I’ll put it there! Luckily Gatsby has decent facilities for hacking the result, so I figured out how to tell it to throw away all of that JavaScript I didn’t ask for. I do resent always ending up working on the most advanced topics on my first day with a new technology.

And that was when my IDE (my smart code editor) stopped coping. I had most of Extended Stats in one GitHub repository. So I would open the project in the IDE (I use WebStorm for this) and it would try to find all my code and figure out what bits referred to what other bits, which is very handy when you want to know whether something is used or not. However with 3 separate CloudFormation stacks for Lambdas, Gatsby, and a dozen Angular applications, it would get confused sorting all that out. As far as I could tell it would take an hour to reindex the code, and during that time it wouldn’t allow me to paste more code in. That was unacceptable, so I decided to move the client module out into another repository and project.

Great, except that something had sabotaged the Gatsby project so that GitHub ignored it. GitHub is the cloud site where I store my code, and if my code’s not there it’s only a hard drive crash away from ceasing to exist altogether. So I had to convince GitHub to store the Gatsby code. I never did figure out what was going on, but I did copy the code to a different place and pretended it was new, and that worked.

And then after that worked, I could get the login button working, and then I could build a replacement site using Gatsby. The login button is tricky, because it does require JavaScript to work, and I can’t write that JavaScript in Angular or React; it has to be what they call VanillaJS. But that’s OK, I had a few versions of that code lying around already, so I just copied it into React’s dangerouslySetInnerHTML drama. And now it all works, mostly!

I deployed that version this morning, and it just made the daily sync with the CDN, so the Gatsby version of the site went live a short time ago. I just noticed that the user page is calling itself the Selector Test page, which I will have to fix. But overall I’m happy with this experiment. I still have to delete a few things that have become obsolete, like older versions of the login button and the nav bar, and I will fix up the CSS so the pages don’t look so cluttered and jumbled, but I feel this solution is better than the Mustache one. And I guess I can now really claim to have some experience with React.