Saturday, March 4, 2017

SSSD: {DBus,Socket}-activated responders (2nd try!)

Second time's the charm! :-)


Since the first post about this topic, some improvements have been made in order to fix a bug found and reported by a Debian user (thanks, Stric!).

The fix is part of the SSSD 1.15.1 release and, together with it, some other robustness improvements have been made! Let's go through the changes ...


Avoid starting the responders before SSSD is up!


I've found out that the NSS responder was being started before SSSD itself, which is quite problematic during the boot process, as libc does initgroups() for pretty much any account and, to be precise, checks all NSS modules while doing so.

By calling sss_nss, the NSS responder is triggered and tries to talk to the data providers (which are not up yet, as SSSD itself is not up yet ...), causing the boot process to hang until libc gives up (causing a timeout in services like systemd-logind and all the services depending on it).

The fix for this issue looks like:
@@ -1,6 +1,7 @@
  [Unit]
  Description=SSSD @responder@ Service responder socket
  Documentation=man:sssd.conf(5)
+ After=sssd.service
  BindsTo=sssd.service
 
  [Socket]

And, as I've been told by systemd developers that "BindsTo=" must always come together with "After=" (although it is not documented yet ...), this fix has been applied to all the responders' unit files.


Avoid starting the responders' sockets before SSSD is up!


We really want (at least for now) to have the responders' sockets completely tied to the SSSD service. We want the responders to be socket-activated only after SSSD is up, and the section right above this one explains why we want this kind of control.

In order to achieve this, some changes were needed in the sockets' units, as systemd automatically adds "Before=sockets.target" to any socket unit by default (and sockets.target is started in a really early phase of the boot process).

So there I went again, talking to systemd developers about the best approach to avoid starting the responders' sockets before SSSD is up, and the patch that came out of the discussion looks like:

@@ -3,6 +3,8 @@
  Documentation=man:sssd.conf(5)
  After=sssd.service
  BindsTo=sssd.service
+ DefaultDependencies=no
+ Conflicts=shutdown.target

  [Socket]
  ListenStream=@pipepath@/@responder@

By doing this change the sockets are no longer started before sockets.target, but just after the SSSD service is started. The downside of this approach is that we have to deal with conflicts on our own, and that is the reason "Conflicts=shutdown.target" has been added.
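
If you want to double-check the resulting ordering, systemd can print the computed dependencies for you. For example (using sssd-nss.socket here, but any of the responders' sockets works):

# systemctl show sssd-nss.socket -p After -p Conflicts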


Be more robust against misconfigurations!


Now that we have two completely different ways to manage the services, we really have to be robust in order to avoid admins mixing them up.

So far we have been flexible enough to allow admins to have some of the services started up by the monitor, while leaving other services to systemd. And that's okay! The problem starts when the monitor has been told to start a responder (by having the responder listed in the services line of sssd.conf) and this very same responder is supposed to be socket-activated (the admin did systemctl enable sssd-@responder@.socket).

In the situation described above we could end up with two services running for the very same responder. The best way we found to fix this issue is adding a simple program that checks whether a socket-activated responder is also mentioned in the sssd.conf services line. In case it's mentioned there, the socket is simply not started and the whole responsibility is left to the monitor. Otherwise, we take advantage of the systemd machinery!

The change to the sockets' unit files looks like:
@@ -7,6 +7,7 @@
  Conflicts=shutdown.target

  [Socket]
+ ExecStartPre=@libexecdir@/sssd/sssd_check_socket_activated_responders -r @responder@
  ListenStream=@pipepath@/@responder@
  SocketUser=@SSSD_USER@
  SocketGroup=@SSSD_USER@
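
Conceptually, the check itself is quite simple. The shell sketch below is just an illustration of the logic, not the actual sssd_check_socket_activated_responders implementation: exit non-zero when the responder is listed in the services line, so the failing "ExecStartPre=" aborts the socket start-up.

    #!/bin/sh
    # Illustration only: refuse the socket start-up (exit non-zero) when the
    # responder passed as an argument is already managed by the monitor.
    responder="$2"   # assuming an invocation like: ... -r <responder>
    if grep -E "^[[:space:]]*services[[:space:]]*=.*${responder}" \
            /etc/sssd/sssd.conf >/dev/null 2>&1; then
        exit 1   # listed in sssd.conf: the monitor will start it
    fi
    exit 0       # not listed: safe to socket-activate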


Also, I've decided to be a little bit stricter on our side and refuse manual start-up of the responders' services as well, and the change for this looks like:
@@ -3,6 +3,7 @@
  Documentation=man:sssd.conf(5)
  After=sssd.service
  BindsTo=sssd.service
+ RefuseManualStart=true

  [Install]
  Also=sssd-@responder@.socket
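
With this in place, manually starting a responder's service is rejected by systemd with an error along these lines (the exact wording may vary between systemd versions):

# systemctl start sssd-nss.service
Failed to start sssd-nss.service: Operation refused, unit sssd-nss.service may be requested by dependency only.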


And how can I start using the socket-activated services?


As, by default, we still use the monitor to manage the services, a small configuration change is needed.

See the example below explaining how to enable the PAM and AutoFS services to be socket-activated.

Considering your /etc/sssd/sssd.conf has something like:

[sssd]
services = nss, pam, autofs
...

Enable PAM and AutoFS responders' sockets:
# systemctl enable sssd-pam.socket
# systemctl enable sssd-autofs.socket

Remove both PAM and AutoFS responders from the services' line, like:

[sssd]
services = nss
...

Restart the SSSD service:
# systemctl restart sssd.service

And you're ready to go!
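
To confirm everything is wired up as expected, you can check that the sockets are active while the responders' services themselves stay inactive until the first request comes in. For example (using sssd-pam):

# systemctl is-active sssd-pam.socket
active
# systemctl is-active sssd-pam.service
inactive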


Is there any known issue that I should be aware of?


Yes, there is! You should avoid having the PAC responder, needed by IPA domains, socket-activated for now. The reason is that, due to an ugly hack in the SSSD code, this responder is added to the services list any time an IPA domain is detected.

Because of this, the service is always started by the monitor, and there is nothing that could be done in our socket units to detect this situation and avoid starting up the PAC socket.

A possible way to fix this issue is patching ipa-client-install to either explicitly add the PAC responder to the services list (in case the admin wants to keep using the monitor) or to enable the PAC responder's socket (in case the admin wants to take advantage of socket activation).

Once that's done on the IPA side, we will be able to drop the code that automatically enables the PAC responder from SSSD. However, doing it right now would break backwards compatibility!
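
In the meantime, if you have already enabled the PAC socket on an IPA-enrolled machine, the safest route is simply to disable (and stop) it again:

# systemctl disable --now sssd-pac.socket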


Where can I find more info about SSSD?


More information about SSSD can be found on the project page: https://pagure.io/SSSD/sssd/

If you want to report a bug to us, please follow this web page and file an issue in the SSSD pagure instance.

Please keep in mind that we're currently in the middle of a migration from FedoraHosted to Pagure, and it will take a while to have everything in place again.

Even so, you can find more info about SSSD's internals here.

In case you want to contribute to the project, please read this web page and feel free to approach us at #sssd on freenode (irc://irc.freenode.net/sssd).

Tuesday, January 31, 2017

SSSD: {DBus,Socket}-activated responders!

Since its 1.15.0 release, SSSD takes advantage of the systemd machinery and introduces a new way to deal with the responders.

Previously, in order to have a responder initialized, the admin had to add the specific responder to the "services" line in the sssd.conf file, which makes sense for the responders that are often used, but not for those rarely used (such as the infopipe and PAC responders).

This old way is still preserved (at least for now), and the new release is fully backwards-compatible with the old config file.

With this new release, however, adding responders to the "services" line isn't needed anymore, as the admin can easily enable any of the responders' sockets and those will be {dbus,socket}-activated on demand, staying up while they are still being used. In case a responder becomes idle, it will automatically shut itself down after a configurable amount of time.

The sockets we've created are: sssd-autofs.socket, sssd-nss.socket, sssd-pac.socket, sssd-pam.socket (and sssd-pam-priv.socket, but you don't have to worry about this one), sssd-ssh.socket and sssd-sudo.socket. As an example, considering the admins want to enable the sockets for both the NSS and PAM responders, they should do: `systemctl enable sssd-pam.socket sssd-nss.socket` and voilà!

In some cases the admins may also want to set the "responder_idle_timeout" option, added for each of the responders, in order to tweak how long the responder keeps running once it becomes idle. Setting this option to 0 (zero) disables the responder_idle_timeout. For more details, please check the sssd.conf man page.
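
As an illustration, tweaking the idle timeout of the NSS responder would be done in its own section of sssd.conf (the value is in seconds; 300 here is just an arbitrary example):

    [nss]
    responder_idle_timeout = 300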

For this release we've taken a more conservative path and are leaving it up to the admins to enable the services they want, in case they would like to try using the {dbus,socket}-activated responders.

It's also important to note that, until the SELinux policies are updated in your distro, you may need to run SELinux in permissive mode in order to test/use the {dbus,socket}-activated responders. A bug for this has already been filed for Fedora and it will hopefully be fixed before the new package is included in the distro.
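
If you hit AVC denials while testing, temporarily switching SELinux to permissive mode is as simple as (remember to switch back with `setenforce 1` once the policies are fixed):

# setenforce 0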

And the changes in the code were (a high-level explanation) ...

Before this work the monitor was the piece of code responsible for handling the responders listed in the services line of the sssd.conf file. And by handling I mean:

• Gets the list of services to be started (and, consequently, the total number of services);
• For each service:
  • Gets the service configuration;
  • Starts the service;
  • Adds the service to the services' list;
  • Once the service is up, a dbus message is sent to the monitor, which ...
    • Sets up the sbus* connection to communicate with the service;
    • Marks the service as started;

Now, the monitor does (considering an empty services line):

• Once the service is up, a dbus message is sent to the monitor;
  • The number of services is increased;
  • Gets the service configuration;
  • Adds the service to the services' list;
  • Sets up the sbus connection to communicate with the service;
  • Sets up a destructor on the sbus connection in order to properly shut down the service when this connection is closed;
  • Marks the service as started;

By looking at those two different processes done by the monitor, some of you may have noticed an extra step needed when the service has been {dbus,socket}-activated that wasn't needed at all before. Yep, "Sets up a destructor on the sbus connection in order to properly shut down the service when this connection is closed" is a completely new thing, as previously the services were only shut down when SSSD was shut down, and now the services are shut down when they become idle.

So, what's basically done now is:
 - Once there's no communication with the service, its (sbus) connection with the monitor is closed;
 - Closing the (sbus) connection triggers the following actions:
    - The number of services is decreased;
    - The connection destructor is unset (otherwise it would be called again after the service has been freed);
    - The service is shut down.

*sbus: SSSD uses the dbus protocol over a private socket to handle its internal communication, so the services do not talk over the system bus.
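
If you're curious, you can actually see these sockets on a running system under SSSD's pipes directory:

# ls -l /var/lib/sss/pipes/ /var/lib/sss/pipes/private/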

And what do the unit files look like?

SSSD has 7 services: autofs, ifp, nss, pac, pam, ssh and sudo. Four of those 7 share pretty much these unit files:

AutoFS, PAC, SSH and Sudo unit files:


sssd-$responder.service:

    [Unit]
    Description=SSSD $responder Service responder
    Documentation=man:sssd.conf(5)
    After=sssd.service
    BindsTo=sssd.service

    [Install]
    Also=sssd-$responder.socket

    [Service]
    ExecStartPre=-/bin/chown $sssd_user:$sssd_user /var/log/sssd/sssd_$responder.log
    ExecStart=/usr/libexec/sssd/sssd_$responder --debug-to-files --socket-activated
    Restart=on-failure
    User=$sssd_user
    Group=$sssd_user
    PermissionsStartOnly=true

sssd-$responder.socket:

    [Unit]
    Description=SSSD $responder Service responder socket
    Documentation=man:sssd.conf(5)
    BindsTo=sssd.service
    
    [Socket]
    ListenStream=/var/lib/sss/pipes/$responder
    SocketUser=$sssd_user
    SocketGroup=$sssd_user
    
    [Install]
    WantedBy=sssd.service
    


And what about the different ones? We will get to them ... and also explain why they are different.

The infopipe (ifp) unit file:

As the infopipe won't be socket-activated, it doesn't have a respective .socket unit.
Also, differently from the other responders, the infopipe responder can currently only be run as root.
In the end, its .service unit looks like:

sssd-ifp.service:
    [Unit]
    Description=SSSD IFP Service responder
    Documentation=man:sssd-ifp(5)
    After=sssd.service
    BindsTo=sssd.service
    
    [Service]
    Type=dbus
    BusName=org.freedesktop.sssd.infopipe
    ExecStart=/usr/libexec/sssd/sssd_ifp --uid 0 --gid 0 --debug-to-files --dbus-activated
    Restart=on-failure

The PAM unit files:

The main difference between the PAM responder and the others is that PAM has two sockets that can end up socket-activating its service. Also, these sockets have special permissions.
In the end, its unit files look like:

sssd-pam.service:
    [Unit]
    Description=SSSD PAM Service responder
    Documentation=man:sssd.conf(5)
    After=sssd.service
    BindsTo=sssd.service
    
    [Install]
    Also=sssd-pam.socket sssd-pam-priv.socket
    
    [Service]
    ExecStartPre=-/bin/chown $sssd_user:$sssd_user @logpath@/sssd_pam.log
    ExecStart=@libexecdir@/sssd/sssd_pam --debug-to-files --socket-activated
    Restart=on-failure
    User=$sssd_user
    Group=$sssd_user
    PermissionsStartOnly=true
    

sssd-pam.socket:
    [Unit]
    Description=SSSD PAM Service responder socket
    Documentation=man:sssd.conf(5)
    BindsTo=sssd.service
    BindsTo=sssd-pam-priv.socket
    
    [Socket]
    ListenStream=@pipepath@/pam
    SocketUser=root
    SocketGroup=root
    
    [Install]
    WantedBy=sssd.service
    

sssd-pam-priv.socket:
    [Unit]
    Description=SSSD PAM Service responder private socket
    Documentation=man:sssd.conf(5)
    BindsTo=sssd.service
    BindsTo=sssd-pam.socket
    
    [Socket]
    Service=sssd-pam.service
    ListenStream=@pipepath@/private/pam
    SocketUser=root
    SocketGroup=root
    SocketMode=0600
    
    [Install]
    WantedBy=sssd.service
    

The NSS unit files:

The NSS responder was the trickiest one to get working properly, mainly because, when socket-activated, it has to run as root.
The reason behind this is that systemd calls getpwnam() and getgrnam() when "User="/"Group=" are set to something other than root. By doing this, libc ends up querying for $sssd_user, trying to talk to the NSS responder, which is not up yet, and then the clients would end up hanging for a few minutes (due to our default_client_timeout), which is something we really want to avoid.

In the end, its unit files look like:

sssd-nss.service:

    [Unit]
    Description=SSSD NSS Service responder
    Documentation=man:sssd.conf(5)
    After=sssd.service
    BindsTo=sssd.service
    
    [Install]
    Also=sssd-nss.socket
    
    [Service]
    ExecStartPre=-/bin/chown root:root @logpath@/sssd_nss.log
    ExecStart=@libexecdir@/sssd/sssd_nss --debug-to-files --socket-activated
    Restart=on-failure
    

sssd-nss.socket:
    [Unit]
    Description=SSSD NSS Service responder socket
    Documentation=man:sssd.conf(5)
    BindsTo=sssd.service
    
    [Socket]
    ListenStream=@pipepath@/nss
    SocketUser=$sssd_user
    SocketGroup=$sssd_user

    [Install]
    WantedBy=sssd.service

All the services' units have a "BindsTo=sssd.service" in order to ensure that the service will be stopped when sssd.service is stopped, so in case SSSD is shut down/restarted those actions will be propagated to the responders as well.

Similarly to "BindsTo=sssd.service", there's a "WantedBy=sssd.service" in every socket unit, and it's there to ensure that, once the socket is enabled, it will be automatically started when SSSD is started.
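
Both relationships can be inspected at runtime with systemctl, for example (using the NSS units here):

# systemctl show sssd-nss.service -p BindsTo -p After
# systemctl show sssd-nss.socket -p WantedBy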

And that's pretty much all the changes covered by this work.

I really have to say a big thank you to ...

• Lukas Nykryn and Michal Sekletar, who patiently reviewed the unit files we're using and gave me a lot of good tips while doing this work;
• Sumit Bose, who helped me to find the issue with the NSS responder when trying to run it as a non-privileged user;
• Jakub Hrozek, Lukas Slebodnik and Pavel Brezina, for reviewing and helping me find bugs, crashes and regressions that fortunately were avoided.

And what's next?

There's already a patch making the {dbus,socket}-activated responders automatically enabled when SSSD starts, which changes our approach: instead of having to explicitly enable the sockets in order to take advantage of this work, the admin will have to explicitly disable (actually, mask) the sockets of the responders that shouldn't be {dbus,socket}-activated.

Also, a bigger task for the future is to have the providers socket-activated as well, but that's material for a different blog post. ;-)

Nice, nice. But I'm having issues with what you've described!

In case that happens to you, please keep in mind that the preferred way to diagnose any issues would be:

• Inspecting sssd.conf in order to check which responders are explicitly activated in the services line;
• `systemctl status sssd.service`;
• `systemctl status sssd-$responder.service` (for the {dbus,socket}-activated ones);
• `journalctl -u sssd.service`;
• `journalctl -u sssd-$responder.service` (for the {dbus,socket}-activated ones);
• `journalctl -br`;
• Checking the SSSD debug logs in order to see whether SSSD's sockets were communicated with.